## Regular Expressions

Defining REs in Python is straightforward:

In [None]:
import re

pattern = re.compile('[bcr]at')

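We can then apply the compiled pattern to a string with `match()` or `search()`. `match()` only succeeds if the pattern matches at the beginning of the string, while `search()` scans the whole string for the first match. Both return a match object (or `None` if nothing matches), which offers methods such as `group()` and `span()` to access the result.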

In [None]:
matches = re.match(pattern, 'batter')
searches = re.search(pattern, 'batter')
matches.group(), searches.span()
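
As a quick check of how the two functions differ, the sketch below runs both on a few made-up strings and guards against the `None` that either one returns when nothing matches. The example strings and the `results` dictionary are illustrative only.

```python
import re

pattern = re.compile('[bcr]at')

results = {}
for text in ['batter', 'combat', 'dog']:
    m = pattern.match(text)    # only succeeds at position 0
    s = pattern.search(text)   # finds the first match anywhere
    # Check for None before calling .group() or .span()
    results[text] = (m.group() if m else None,
                     s.span() if s else None)

print(results)
```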

We can also use `sub()` to replace every part of a string that matches the pattern:

In [None]:
re.sub(pattern, 'ANIMAL', 'A cat sat on a mat with a bat and a rat.')
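
Two variations on replacement may be useful here: the replacement string can refer back to a captured group with `\1`, and `subn()` additionally reports how many replacements were made. The following is a small sketch on the same sentence; the word-boundary pattern and the angle-bracket tagging are choices made for illustration.

```python
import re

sentence = 'A cat sat on a mat with a bat and a rat.'

# \b restricts matches to whole words; the group captures the first letter
pattern = re.compile(r'\b([bcr])at\b')

# \1 in the replacement refers back to the captured letter
tagged = pattern.sub(r'<\1at>', sentence)

# subn() returns the new string together with the number of replacements
replaced, n = pattern.subn('ANIMAL', sentence)

print(tagged)
print(replaced, n)
```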

## Exercise

Write a RegEx that replaces all user names in the tweets with the token "@USER".

In [None]:
# Use a context manager so the file is closed after reading
with open('../data/tweets_en.txt', encoding='utf8') as f:
    tweets = [line.strip() for line in f]
# your code here


## Exercise

Write a RegEx that finds all hashtags containing the word `good`.

In [None]:
# your code here

## TF-IDF

Let's extract the most important words from Moby Dick.

In [None]:
import pandas as pd
# Use a context manager so the file is closed after reading
with open('../data/moby_dick.txt', encoding='utf8') as f:
    documents = [line.strip() for line in f]
print(documents[1])

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(analyzer='word', min_df=0.001, max_df=0.75, stop_words='english', sublinear_tf=True)

X = tfidf_vectorizer.fit_transform(documents)
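
To make the weighting less of a black box, here is a minimal pure-Python sketch of the formula that scikit-learn documents for these settings: smoothed idf, `ln((1 + n) / (1 + df)) + 1`, sublinear tf, `1 + ln(tf)`, and L2 normalisation of each document vector. The toy corpus and the helper name `tfidf` are made up for illustration, and the sketch ignores `min_df`, `max_df`, and stop-word filtering.

```python
import math

# Toy corpus (made up for illustration), already tokenised
docs = [['whale', 'sea', 'whale'],
        ['sea', 'ship'],
        ['whale', 'ship', 'captain']]

n_docs = len(docs)
vocab = sorted({w for d in docs for w in d})

# Document frequency: number of documents that contain each word
df = {w: sum(w in d for d in docs) for w in vocab}

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf(doc):
    # Sublinear tf: 1 + ln(count) for every word that occurs in the document
    weights = {w: (1 + math.log(doc.count(w))) * idf[w] for w in set(doc)}
    # L2-normalise so that every document vector has length 1
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

vectors = [tfidf(d) for d in docs]
print(vectors[2])
```

Rare words such as `captain` receive a higher idf than words that appear in most documents, which is what pushes them up the ranking.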

Now, let's get the same information as raw counts:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', min_df=0.001, max_df=0.75, stop_words='english')

X2 = vectorizer.fit_transform(documents)

In [None]:
X.shape, X2.shape

In [None]:
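# Convert the sparse matrices to dense arrays
word_counts = X2.toarray()
word_tfidf = X.toarray()
# Zero out low TF-IDF values so that only strong per-document scores are summed
word_tfidf[word_tfidf < 0.2] = 0

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(data={'word': vectorizer.get_feature_names_out(),
                        'tf': word_counts.sum(axis=0),
                        'idf': tfidf_vectorizer.idf_,
                        'tfidf': word_tfidf.sum(axis=0)
                       })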

In [None]:
# Sort in descending order so the most important words appear first
df = df.sort_values(['tfidf', 'tf', 'idf'], ascending=False)
df

## Exercise
Extract **only** the bigrams (no unigrams) from Moby Dick and find the top 10.

In [None]:
# your code here

## Exercise
Extract **only** the bigrams (no unigrams) from the Tweets and find the top 10.

In [None]:
# your code here

## PMI
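Extracting PMI from text is relatively straightforward, and `nltk` offers some functions to do so flexibly. Note that the cell below scores bigrams with `mi_like`, a variant of mutual information; use `bgm.pmi` instead for standard PMI.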

In [None]:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords

stopwords_ = set(stopwords.words('english'))

# Lowercase before the stopword check, since the stopword list is all lowercase
words = [word.lower() for document in documents for word in document.split()
         if len(word) > 2
         and word.lower() not in stopwords_]
finder = BigramCollocationFinder.from_words(words)
bgm = BigramAssocMeasures()
score = bgm.mi_like
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}
collocations
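
For intuition about what such a score measures, standard PMI can be computed by hand as `log2(P(x, y) / (P(x) * P(y)))`. The sketch below does this on a made-up token list, estimating all probabilities by dividing counts by the number of tokens; this is one common estimate and not necessarily identical to what `nltk` does internally.

```python
import math
from collections import Counter

# Made-up token sequence for illustration
tokens = 'moby dick saw the white whale and the white whale saw moby dick'.split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(x, y):
    # Compare the observed bigram probability with the one expected
    # if x and y occurred independently of each other
    p_xy = bigrams[(x, y)] / n
    p_x = unigrams[x] / n
    p_y = unigrams[y] / n
    return math.log2(p_xy / (p_x * p_y))

scores = {'_'.join(bigram): pmi(*bigram) for bigram in bigrams}
for bigram, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(bigram, round(score, 2))
```

Pairs that always occur together, such as `moby_dick`, score higher than pairs whose words also appear in other contexts, such as `saw_the`.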

## Exercise

Extract the top 10 collocations for the Twitter data. You need to preprocess the data first!

In [None]:
# your code here

## Entropy
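
As a reminder, the entropy of a distribution is `-sum(p * log2(p))` over all outcomes, measured in bits. The sketch below applies this to two made-up coin distributions given as counts; the function name `entropy` and the coin data are illustrative assumptions.

```python
import math

def entropy(counts):
    # Turn raw counts into probabilities, skipping zero counts
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    # Shannon entropy in bits
    return -sum(p * math.log2(p) for p in probs)

fair = entropy({'heads': 50, 'tails': 50})
biased = entropy({'heads': 90, 'tails': 10})
print(fair, biased)
```

A uniform distribution has the highest entropy for a given number of outcomes, so the biased coin comes out lower than the fair one.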

## Exercise

Compute the entropy over the vowel distribution below.

In [None]:
import numpy as np

vowels = {'a':74114, 'e':114114, 'i':61038, 'o':67362, 'u':26137}

In [None]:
# your code here