## Regular Expressions

Defining REs in Python is straightforward:

In [2]:
import re

pattern = re.compile('[bcr]at')

We can then apply the pattern to a string with `search()` or `match()`.

`search()` will return a result if the pattern occurs **anywhere** in the input string.

`match()` will only return a result if the pattern matches at the **beginning** of the input string. To require that the pattern matches the **complete** string, use `fullmatch()`.

In [3]:
word = 'the batter won the game'
matches = re.match(pattern, word) # no result (matches is None): the string does not start with the pattern
searches = re.search(pattern, word) # finds a substring
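To make the difference between the three matching functions concrete, here is a minimal sketch on a few made-up strings (the example strings and variable names are illustrative assumptions, not part of the data used in this notebook):

```python
import re

pattern = re.compile(r'[bcr]at')

# match() only tries the pattern at position 0 of the string
at_start = pattern.match('cat food')       # succeeds: the string starts with 'cat'
not_at_start = pattern.match('the cat')    # None: the string starts with 'the'

# search() scans the string and returns the first occurrence
anywhere = pattern.search('the cat')       # succeeds: 'cat' is found later in the string

# fullmatch() requires the entire string to be covered by the pattern
whole = pattern.fullmatch('rat')           # succeeds: nothing is left over
not_whole = pattern.fullmatch('rats')      # None: the trailing 's' is not covered

print(at_start, not_at_start, anywhere, whole, not_whole, sep='\n')
```

Calling the methods on the compiled pattern (`pattern.search(text)`) is equivalent to the module-level form `re.search(pattern, text)` used in this notebook.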

Both return a match object (or `None` if nothing matched), which has several methods for accessing the result:
- `span()` returns a tuple with the start and end index of the matching substring
- `group()` returns the matched substring

In [4]:
span = searches.span()
word[span[0]:span[1]], span

('bat', (4, 7))

In [5]:
searches.group()

'bat'

If the pattern contains several RE groups (enclosed in parentheses `()`), we can access all of them at once as a tuple via `groups()`:

In [15]:
word = 'preconstitutionalism'
affixes = re.compile('(...).+(...)')
re.search(affixes, word).groups()

('pre', 'ism')
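Individual groups can also be retrieved by number or by name. The following is a small sketch on the same word; the named-group variant `named_affixes` and the group names `prefix` and `suffix` are assumptions introduced for illustration:

```python
import re

word = 'preconstitutionalism'

# Same pattern as above, plus a hypothetical variant with named groups
affixes = re.compile(r'(...).+(...)')
named_affixes = re.compile(r'(?P<prefix>...).+(?P<suffix>...)')

result = affixes.search(word)
all_groups = result.groups()   # tuple with every captured group
first = result.group(1)        # first captured group (first three characters)
last = result.group(2)         # second captured group (last three characters)
whole = result.group(0)        # group 0 is always the entire match

# groupdict() maps group names to the captured substrings
named = named_affixes.search(word).groupdict()

print(all_groups, first, last, named)
```

Because `.+` is greedy, it first consumes the rest of the word and then backtracks just far enough to leave three characters for the final group.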

For the email address finder, we can use a more advanced pattern and test it:

In [7]:
# raw string (r'...') so that backslashes reach the regex engine unchanged
email = re.compile(r'^[A-Za-z0-9][A-Za-z0-9\.-]*@[A-Za-z0-9][A-Za-z0-9\.-]+\.[A-Za-z0-9\.-][A-Za-z0-9\.-][A-Za-z0-9\.-]?$')
for address in ['me.@unibocconi.it', '@web.de', '.@gmx.com', 'not working@aol.com']:
    print(re.match(email, address))
bocconi_address = re.match(email, 'me.@unibocconi.it')

<_sre.SRE_Match object; span=(0, 17), match='me.@unibocconi.it'>
None
None
None


We can also use `sub()` to replace every part of a string that matches the pattern:

In [11]:
numbers = re.compile('[0-9]')
re.sub(numbers, '0', 'Back in the 90s, when I was a 12-year-old, a CD cost just 15,99EUR!')

'Back in the 00s, when I was a 00-year-old, a CD cost just 00,00EUR!'
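`sub()` is more flexible than a one-to-one character replacement. As a hedged sketch (the shortened example sentence and the variable names are assumptions for illustration), a quantifier collapses each run of digits into a single token, a function can compute the replacement from each match, and `count` limits the number of replacements:

```python
import re

text = 'Back in the 90s, a CD cost just 15,99EUR!'

# '+' matches a whole run of digits, so each number is replaced once
digit_runs = re.compile(r'[0-9]+')
collapsed = digit_runs.sub('0', text)

# The replacement may be a function that receives each match object;
# here every digit run becomes '#' repeated to the same length
masked = digit_runs.sub(lambda m: '#' * len(m.group()), text)

# count=1 replaces only the first match and leaves the rest untouched
first_only = digit_runs.sub('0', text, count=1)

print(collapsed)
print(masked)
print(first_only)
```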

## Exercise

Write a RegEx that finds all user names in the tweets and replaces them with the token "@USER".

In [61]:
tweets = [line.strip() for line in open('../data/tweets_en.txt', encoding='utf8')]
# your code here
s = 'hey, @User123, how are you?'
user_pattern = re.compile(r'@[^ \W]+')  # '@' followed by one or more word characters
cleaned_tweets = [re.sub(user_pattern, '@USER', tweet) for tweet in tweets]
cleaned_tweets[0], tweets[0], re.sub(user_pattern, '@USER', s)

('@USER I think a lot of people just enjoy being a pain in the ass on there',
 '@cosmetic_candy I think a lot of people just enjoy being a pain in the ass on there',
 'hey, @USER, how are you?')

## Exercise

Write a RegEx to search for all hashtags containing the word `good` in them.

In [None]:
# your code here

## TF-IDF

Let's extract the most important words from Moby Dick, treating each line of the file as one document:

In [62]:
import pandas as pd
documents = [line.strip() for line in open('../data/moby_dick.txt', encoding='utf8')]
print(documents[1])

Call me Ishmael .


In [64]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(analyzer='word', min_df=0.001, max_df=0.75, stop_words='english', sublinear_tf=True)

X = tfidf_vectorizer.fit_transform(documents)
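To see what the weights in `X` mean, the following pure-Python sketch recomputes TF-IDF on a toy corpus. It assumes the formulas documented for scikit-learn's defaults (smoothed idf, L2-normalised rows) together with the `sublinear_tf=True` setting used above; it is not the library's implementation and it skips stop-word removal and the `min_df`/`max_df` filters. The corpus and all names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
toy_docs = [
    'call me ishmael',
    'the whale the whale',
    'call the whale',
]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
       for term, df in doc_freq.items()}

def tfidf_row(tokens):
    """Sublinear tf (1 + ln(count)) times idf, then L2-normalised."""
    counts = Counter(tokens)
    weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(toy_docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

Terms that occur in fewer documents (here `ishmael` and `me`) receive a higher idf than terms that occur in most documents (`call`, `the`, `whale`).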

Now, let's get the same information as raw counts:

In [63]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', min_df=0.001, max_df=0.75, stop_words='english')

X2 = vectorizer.fit_transform(documents)

In [66]:
X.shape, X2.shape

((9768, 1850), (9768, 1850))

In [71]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(data={'word': vectorizer.get_feature_names_out(), 
                        'tf': X2.sum(axis=0).A1, 
                        'idf': tfidf_vectorizer.idf_,
                        'tfidf': X.sum(axis=0).A1
                       })

In [70]:
df = df.sort_values(['tfidf', 'tf', 'idf'])
df

| | word | tf | idf | tfidf |
|---|---|---|---|---|
| 1071 | nations | 10 | 7.789074 | 2.818093 |
| 1602 | surprise | 10 | 7.789074 | 2.934600 |
| 1735 | valiant | 10 | 7.789074 | 3.017954 |
| 1423 | shortly | 10 | 7.789074 | 3.032615 |
| 554 | fleet | 11 | 7.702063 | 3.049731 |
| 407 | downward | 10 | 7.789074 | 3.111894 |
| 1192 | pitched | 10 | 7.789074 | 3.124130 |
| 283 | concluding | 11 | 7.702063 | 3.142589 |
| 1318 | retained | 10 | 7.789074 | 3.149837 |
| 1654 | thither | 10 | 7.789074 | 3.212621 |


## Exercise
Extract **only** the bigrams (no unigrams) from Moby Dick and find the top 10.

In [None]:
# your code here

## Exercise
Extract **only** the bigrams (no unigrams) from the Tweets and find the top 10.

In [None]:
# your code here

## PMI
Extracting PMI (pointwise mutual information) from text is relatively straightforward, and `nltk` offers some functions to do so flexibly.

In [81]:
import nltk
nltk.download('all')

True

In [79]:
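from collections import Counter

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords

stopwords_ = set(stopwords.words('english'))

# Lowercase all tokens, dropping short tokens and stop words.
# Note: the stop-word test sees the original casing, so capitalised stop words such as 'The' are kept.
words = [word.lower() for document in documents for word in document.split() 
         if len(word) > 2 
         and word not in stopwords_]
finder = BigramCollocationFinder.from_words(words)
bgm = BigramAssocMeasures()
# mi_like is a mutual-information variant without the logarithm
# (bigram count cubed over the product of the two word counts); bgm.pmi gives plain PMI
score = bgm.mi_like
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}
Counter(collocations).most_common(10)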

[('moby_dick', 83.0),
 ('sperm_whale', 20.002847184002935),
 ('mrs_hussey', 10.5625),
 ('mast_heads', 4.391152941176471),
 ('sag_harbor', 4.0),
 ('vinegar_cruet', 4.0),
 ('try_works', 3.7944046844502277),
 ('dough_boy', 3.7067873303167422),
 ('white_whale', 3.698807453416149),
 ('caw_caw', 3.4722222222222223)]
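For comparison, plain PMI can be computed by hand from unigram and bigram counts. The sketch below uses the textbook definition, the base-2 logarithm of P(w1, w2) / (P(w1) P(w2)), on a made-up token sequence; it is an illustration under these assumptions and not `nltk`'s exact implementation, which may normalise the counts slightly differently.

```python
import math
from collections import Counter

# Toy token sequence (hypothetical, for illustration only)
tokens = 'moby dick saw the whale and moby dick saw the ship'.split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_unigrams = len(tokens)
n_bigrams = len(tokens) - 1

def pmi(w1, w2):
    """log2 of P(w1, w2) / (P(w1) * P(w2)), estimated from raw counts."""
    p_joint = bigram_counts[(w1, w2)] / n_bigrams
    p_w1 = unigram_counts[w1] / n_unigrams
    p_w2 = unigram_counts[w2] / n_unigrams
    return math.log2(p_joint / (p_w1 * p_w2))

# Score every observed bigram and list them from highest to lowest PMI
scores = {bigram: pmi(*bigram) for bigram in bigram_counts}
for bigram, value in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print('_'.join(bigram), round(value, 3))
```

In this toy example the bigram `whale_and`, which occurs once and consists of two words that each occur once, scores higher than the repeated `moby_dick`. Plain PMI favours rare pairs in this way, which is one reason to prefer a frequency-sensitive score such as `mi_like` or to filter out low-frequency bigrams first.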

## Exercise

Extract the top 10 collocations for the Twitter data. You need to preprocess the data first!

In [None]:
# your code here

## Entropy

## Exercise

Compute the entropy over the vowel distribution below.

In [None]:
import numpy as np

vowels = {'a':74114, 'e':114114, 'i':61038, 'o':67362, 'u':26137}

In [None]:
# your code here