# A quick overview of preprocessing

These materials are mostly borrowed from Lane et al. (2019).

As usual, let us first import all the dependencies.

In [68]:
import re

## Tokenisation

In [76]:
txt = "Thomas Jefferson started building Monticello at the age of 26."

A simple "tokeniser" that keeps only runs of alphabetic characters. Note that the number 26 and the final full stop are dropped.


In [70]:
tokens = re.findall(r'[A-Za-z]+', txt)
print(tokens)

['Thomas', 'Jefferson', 'started', 'building', 'Monticello', 'at', 'the', 'age', 'of']


Python provides a "similar" tool for tokenising, `str.split()`, but in general it is not enough: punctuation stays attached to the neighbouring token.


In [71]:
tokens = txt.split()
print(tokens)

['Thomas', 'Jefferson', 'started', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26.']


We can design a better regular expression that also splits on punctuation.


In [77]:
tokens = re.split(r'([-\s.,;!?])+', txt)
print(tokens)

['Thomas', ' ', 'Jefferson', ' ', 'started', ' ', 'building', ' ', 'Monticello', ' ', 'at', ' ', 'the', ' ', 'age', ' ', 'of', ' ', '26', '.', '']
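
The split above still returns the separators themselves (the single spaces) and a trailing empty string. A minimal sketch of one way to clean this up, assuming we want to keep punctuation marks as tokens but drop whitespace and empty strings:

```python
import re

txt = "Thomas Jefferson started building Monticello at the age of 26."

# Same pattern as above; the capturing group makes re.split() return the separators too
pattern = re.compile(r'([-\s.,;!?])+')

# Keep every piece that is neither empty nor pure whitespace
tokens = [tok for tok in pattern.split(txt) if tok.strip()]
print(tokens)
```

The filter relies on `tok.strip()` being an empty (falsy) string for both `''` and `' '`.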


The community has created multiple libraries for pre-processing, which include options for tokenisation. One of the most popular ones is [NLTK](http://www.nltk.org). 

Before using it, you should install it. If using pip, you should do: 

\$ pip install --user -U nltk

\$ pip install --user -U numpy


And now we can import and use one of its tokenisers.

In [78]:
from nltk.tokenize import TreebankWordTokenizer  # import one of the many tokenisers available
tokenizer = TreebankWordTokenizer()              # instantiate it
tokens = tokenizer.tokenize(txt)
print(tokens)

['Thomas', 'Jefferson', 'started', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '.']


Now, see the difference between tokenising with split() and with NLTK's treebank tokeniser on a different sentence.

In [79]:
sentence = "Monticello wasn't designated as UNESCO World Heritage Site until 1987."
tokens_split = sentence.split()
tokens_tree = tokenizer.tokenize(sentence)

print("OUTPUT USING split()\t\t", tokens_split)
print("OUTPUT USING TreebankWordTokenizer\t", tokens_tree)

OUTPUT USING split()		 ['Monticello', "wasn't", 'designated', 'as', 'UNESCO', 'World', 'Heritage', 'Site', 'until', '1987.']
OUTPUT USING TreebankWordTokenizer	 ['Monticello', 'was', "n't", 'designated', 'as', 'UNESCO', 'World', 'Heritage', 'Site', 'until', '1987', '.']


## Normalisation

### Casefolding

In [80]:
sentence = sentence.lower()
print(sentence)

monticello wasn't designated as unesco world heritage site until 1987.
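
For plain English text `lower()` is usually enough. Python also offers `str.casefold()`, a more aggressive variant intended for caseless comparison. The sketch below contrasts the two; the German word is only an illustrative example and is not part of the original materials.

```python
word = "Straße"

print(word.lower())     # lowercases, but keeps the sharp s
print(word.casefold())  # additionally maps the sharp s to "ss"

# Caseless comparison of two spellings of the same word
same = word.casefold() == "STRASSE".casefold()
print(same)
```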


## Stemming

Once again, we can use a regular expression to do a crude form of stemming: strip a trailing "s", unless the word ends in "ss".

In [81]:
def stem(phrase):
    # Drop one trailing "s" (words ending in "ss" are kept whole) and any leftover apostrophe
    return ' '.join(
        re.findall(r'^(.*ss|.*?)(s)?$', word)[0][0].strip("'")
        for word in phrase.lower().split()
    )

In [82]:
print("'houses' \t\t->", stem('houses'))
print("'Doctor House's calls' \t->", stem("Doctor House's calls"))
print("'stress' \t\t->", stem("stress"))

'houses' 		-> house
'Doctor House's calls' 	-> doctor house call
'stress' 		-> stress


But we would need to include many more expressions to deal with all cases and exceptions.
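
To make the limitation concrete, here is a small sketch that runs the same rule on a few words it does not handle well. The test words are chosen for illustration only.

```python
import re

def stem(phrase):
    # Same rule as above: drop one trailing "s" unless the word ends in "ss"
    return ' '.join(
        re.findall(r'^(.*ss|.*?)(s)?$', word)[0][0].strip("'")
        for word in phrase.lower().split()
    )

# "buses" loses only the final "s", "was" and "this" are wrongly truncated,
# and "running" is left untouched because the rule knows nothing about "-ing"
for word in ["buses", "was", "this", "running"]:
    print(word, "->", stem(word))
```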

Instead, once again we can rely on a library. Let's consider the **Porter stemmer**, available in NLTK.

In [83]:
from nltk.stem.porter import PorterStemmer  # import the stemmer
stemmer = PorterStemmer()                   # instantiate the stemmer

# Notice that we are "tokenising" and stemming in one line
x = ' '.join([stemmer.stem(w).strip("'") for w in "dish washer's washed dishes".split()])
print(x.split())

['dish', 'washer', 'wash', 'dish']


## Lemmatisation

This is a more complex process than stemming, so let us go straight to a library.
In this particular case we are going to use NLTK's WordNet lemmatiser. If this is the first time you use it (or you are in an ephemeral environment!), you should download the WordNet data as follows:

In [84]:
import nltk 
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/albarron/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [85]:
from nltk.stem import WordNetLemmatizer  # import the lemmatiser
lemmatizer = WordNetLemmatizer()         # instantiate it

print("'better' alone \t->", lemmatizer.lemmatize("better"))
print("'better' including its part of speech (adj) \t->", lemmatizer.lemmatize("better", pos="a"))

'better' alone 	-> better
'better' including its part of speech (adj) 	-> good
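
The first call returns 'better' unchanged because `lemmatize()` treats the word as a noun unless told otherwise (`pos` defaults to `"n"`). The toy lookup below is only a sketch of why the part of speech matters. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical and far simpler than what the WordNet lemmatiser does.

```python
# Hypothetical hand-made table: (word, part of speech) -> lemma
TOY_LEMMAS = {
    ("better", "a"): "good",   # adjective: "a better plan"
    ("better", "r"): "well",   # adverb: "she sings better"
    ("meeting", "v"): "meet",  # verb: "we are meeting today"
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the (word, pos) pair is unknown
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("better"))
print(toy_lemmatize("better", pos="a"))
print(toy_lemmatize("meeting", pos="v"))
```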


## A quick overview of representations

### Bag of Words (BoW)

First, let us see a simple construction using a dictionary. Every token that occurs gets the value 1, however many times it appears (a binary bag of words).

In [86]:
sentence = """Thomas Jefferson began building Monticello at the age of 26. Thomas"""

sentence_bow = {}
for token in sentence.split():
    sentence_bow[token] = 1
sorted(sentence_bow.items())


[('26.', 1),
 ('Jefferson', 1),
 ('Monticello', 1),
 ('Thomas', 1),
 ('age', 1),
 ('at', 1),
 ('began', 1),
 ('building', 1),
 ('of', 1),
 ('the', 1)]
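
In the output above 'Thomas' gets the value 1 even though it occurs twice. If counts are wanted instead of presence flags, one possible sketch uses `collections.Counter` from the standard library (the variable name `bow_counts` is ours):

```python
from collections import Counter

sentence = "Thomas Jefferson began building Monticello at the age of 26. Thomas"

# Counter maps each token to the number of times it occurs
bow_counts = Counter(sentence.split())

print(bow_counts["Thomas"])
print(bow_counts.most_common(3))
```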

Another option is to use **pandas**.

In [12]:
import pandas as pd

# Loading the corpus
sentences = """Thomas Jefferson began building Monticello at the age of 26.\n"""
sentences += """Construction was done mostly by local masons and carpenters.\n"""
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += """Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""

# Loading the tokens into a dictionary (notice that we assume that each line is a document)
corpus = {}
for i, sent in enumerate(sentences.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in
         sent.split())

# Loading the dictionary contents into a pandas DataFrame.
# Note the .T, which transposes the matrix for visualisation purposes.
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T


df[df.columns[:10]]


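| | 1770. | 26. | Construction | He | Jefferson | Jefferson's | Monticello | Pavilion | South | Thomas |
|---|---|---|---|---|---|---|---|---|---|---|
| sent0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| sent1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| sent3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |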
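
One thing such binary rows allow is a rough measure of how much two documents share: the dot product of two rows equals the number of tokens they have in common. The sketch below computes the same quantity with plain Python sets instead of pandas; the helper `overlap` is a hypothetical name introduced here.

```python
sentences = [
    "Thomas Jefferson began building Monticello at the age of 26.",
    "Construction was done mostly by local masons and carpenters.",
    "He moved into the South Pavilion in 1770.",
    "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession.",
]

# Binary bag of words per document, stored as a set of tokens
bows = [set(sent.split()) for sent in sentences]

def overlap(i, j):
    # Number of shared tokens; equal to the dot product of the two binary rows
    return len(bows[i] & bows[j])

print(overlap(0, 3), sorted(bows[0] & bows[3]))
print(overlap(0, 1), sorted(bows[0] & bows[1]))
```

Because tokenisation here is a plain `split()`, 'Jefferson' and "Jefferson's" count as different tokens, which is one more reason to tokenise and normalise properly first.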


### One-hot vectors

This is our input sentence (and its vocabulary).

In [None]:
import numpy as np
sentence = "Thomas Jefferson began building Monticello at the age of 26."
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))
print(vocab)

And now we produce the one-hot representation: one row per token in the sentence, one column per word in the vocabulary.

In [None]:
num_tokens = len(token_sequence)
vocab_size = len(vocab)
onehot_vectors = np.zeros((num_tokens, vocab_size), int)  # create the |tokens| x |vocabulary| matrix of zeros
for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1  # set the dimension that corresponds to this word to 1

print("Vocabulary:\t", vocab)
print("Sentence:\t", token_sequence)
onehot_vectors
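
A property worth checking is that the one-hot encoding loses no information: each row contains exactly one 1, and reading off its position recovers the original token. The sketch below shows this with plain lists rather than NumPy; `index_of`, `onehot_rows` and `decoded` are names introduced here for illustration.

```python
sentence = "Thomas Jefferson began building Monticello at the age of 26."
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

# Map each vocabulary word to its column index
index_of = {word: i for i, word in enumerate(vocab)}

# One row per token, with a single 1 in the column of that token
onehot_rows = [
    [1 if j == index_of[word] else 0 for j in range(len(vocab))]
    for word in token_sequence
]

# Decode: the position of the 1 in each row identifies the word
decoded = [vocab[row.index(1)] for row in onehot_rows]
print(decoded == token_sequence)
```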

Let us bring pandas into the game, labelling each column with its vocabulary word.

In [None]:
pd.DataFrame(onehot_vectors, columns=vocab)