<a href="https://colab.research.google.com/github/erinmcmahon26/NLP-Learning/blob/main/Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Separating tokens

The simplest way to tokenize is to split on whitespace. Calling .split() with no arguments splits a string on any run of whitespace. There are two equivalent ways to call it:

In [1]:
sentence = "The dog is 12 years old and is named Spot."

In [2]:
sentence.split()

['The', 'dog', 'is', '12', 'years', 'old', 'and', 'is', 'named', 'Spot.']

In [3]:
str.split(sentence)

['The', 'dog', 'is', '12', 'years', 'old', 'and', 'is', 'named', 'Spot.']

The problem is that whitespace splitting leaves the period attached to "Spot". Ideally, "Spot" and "." would be separate tokens. We will deal with this later in the notebook.

Below we create a numerical vector representation for each word, called a one-hot vector. We will use the numpy library to do this.

In [4]:
import numpy as np

In [5]:
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

print(token_sequence)
print(len(token_sequence))
print(vocab)
print(len(vocab))

['The', 'dog', 'is', '12', 'years', 'old', 'and', 'is', 'named', 'Spot.']
10
['12', 'Spot.', 'The', 'and', 'dog', 'is', 'named', 'old', 'years']
9


The vocab list contains each unique token in the string once, which is why it is one shorter than token_sequence: "is" appears twice in the sentence but only once in vocab. The tokens are also sorted lexicographically, so digits come before letters and uppercase letters come before lowercase ones.
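
To make the sort order concrete, here is a small sketch (the word list is only an illustration) showing that Python compares strings by Unicode code point, and how a key function gives a case-insensitive order instead:

```python
words = ['dog', 'The', '12', 'Spot.', 'and']

# Default sort compares code points: digits < uppercase < lowercase
default_order = sorted(words)

# Code points of one digit, one uppercase and one lowercase character
code_points = [ord(c) for c in '1Sa']

# Sorting on the lowercased form ignores capitalization
case_insensitive = sorted(words, key=str.lower)

print(default_order)
print(code_points)
print(case_insensitive)
```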

In [6]:
','.join(vocab)

'12,Spot.,The,and,dog,is,named,old,years'

In [7]:
num_tokens = len(token_sequence)
vocab_amount = len(vocab)

In [8]:
onehot_vectors = np.zeros((num_tokens, vocab_amount), int)

In [9]:
for i, word in enumerate(token_sequence):
  onehot_vectors[i, vocab.index(word)] = 1

In [10]:
' '.join(vocab)

'12 Spot. The and dog is named old years'

In [11]:
onehot_vectors

array([[0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0]])

Each row is the one-hot vector for a single token in the sentence, and each column corresponds to one word in vocab. The 1 in a row marks which vocabulary word occurs at that position in the sentence (document). Each row is a binary vector that can be used to represent a word in an NLP pipeline. The table is easier to read as a pandas DataFrame with the vocabulary as column labels:
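
One way to check this reading of the table is to decode the vectors back into words. The sketch below is an illustration only: it rebuilds the one-hot rows with plain Python lists instead of numpy so that it runs without extra packages, and the names onehot_rows and decoded are not part of the notebook.

```python
sentence = "The dog is 12 years old and is named Spot."
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

# One row per token, one column per vocabulary word
onehot_rows = [[1 if word == v else 0 for v in vocab] for word in token_sequence]

# The position of the single 1 in each row is the index of the word in vocab
decoded = [vocab[row.index(1)] for row in onehot_rows]

print(decoded == token_sequence)
```

Because every row holds exactly one 1, no information about the original sentence is lost, including word order.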

In [12]:
import pandas as pd

In [13]:
pd.DataFrame(onehot_vectors, columns = vocab)

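|   | 12 | Spot. | The | and | dog | is | named | old | years |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 9 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |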


This becomes impractical with large documents, because each token needs a vector as long as the vocabulary. Some form of dimension reduction is needed for this to work at scale. One option is a bag-of-words vector (a frequency vector), which records how often each word occurs but not the order of the words. Another is to OR the one-hot vectors together into a binary bag-of-words, which records only whether each word occurs. Below is what the sentence above looks like as a binary bag-of-words.

In [14]:
bow = {}
for token in sentence.split():
  bow[token] = 1
sorted(bow.items())

[('12', 1),
 ('Spot.', 1),
 ('The', 1),
 ('and', 1),
 ('dog', 1),
 ('is', 1),
 ('named', 1),
 ('old', 1),
 ('years', 1)]

This dictionary stores only the words that are present (those that would be marked with a 1), so it needs far less space than the full table of one-hot vectors.
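
For comparison, a frequency bag-of-words keeps the counts instead of a flat 1. A minimal sketch using collections.Counter from the standard library (the variable names are illustrative):

```python
from collections import Counter

sentence = "The dog is 12 years old and is named Spot."

# Frequency bag-of-words: token -> number of occurrences
bow_counts = Counter(sentence.split())

# Binary bag-of-words: the same keys, but every value is 1
binary_bow = {token: 1 for token in bow_counts}

print(bow_counts['is'])
print(sorted(binary_bow.items()))
```

The only difference for this sentence is the entry for "is", which is counted twice in the frequency version.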

In [15]:
# use pandas series for bow
df = pd.DataFrame(pd.Series(dict([(token, 1) for token in sentence.split()])), columns=['sent']).T
df

|   | The | dog | is | 12 | years | old | and | named | Spot. |
|---|---|---|---|---|---|---|---|---|---|
| sent | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |


In [16]:
# what happens if we add another sentence?
sentences = "The dog is 12 years old and is named Spot.\n"
sentences += "He likes to go on long walks and take very long naps afterwards."

In [17]:
# corpus is a collection of text/audio organized into a dataset
corpus = {}
for i, sent in enumerate(sentences.split('\n')):
  corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int) # .T afterwards will flip df so sent are rows and the terms are columns
df.head(10)  # first 10 tokens (rows); the sentences are the columns here

|   | sent0 | sent1 |
|---|---|---|
| The | 1 | 0 |
| dog | 1 | 0 |
| is | 1 | 0 |
| 12 | 1 | 0 |
| years | 1 | 0 |
| old | 1 | 0 |
| and | 1 | 1 |
| named | 1 | 0 |
| Spot. | 1 | 0 |
| He | 0 | 1 |


Here we split on \n, but .splitlines() would also work. The output shows the first 10 tokens.

We can check for overlap between sentences by counting the tokens they share, using a dot product. The bag-of-words overlap between two vectors (sentences, in this case) gives a rough measure of how similar their wording is, which can also hint at how similar they are in meaning.

In [18]:
df.sent0.dot(df.sent1)

1

This tells us that exactly one word is used in both sent0 and sent1. This overlap of words is a measure of their similarity.
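
The dot product gives the size of the overlap but not the word itself. As a rough sketch without pandas, the same count can be computed from binary vectors over the combined vocabulary, and a set intersection shows which token is shared (the variable names here are illustrative):

```python
sent0 = "The dog is 12 years old and is named Spot."
sent1 = "He likes to go on long walks and take very long naps afterwards."

bow0 = set(sent0.split())
bow1 = set(sent1.split())

# Binary vectors over the combined vocabulary of both sentences
vocab = sorted(bow0 | bow1)
vec0 = [int(tok in bow0) for tok in vocab]
vec1 = [int(tok in bow1) for tok in vocab]

# The dot product counts the positions where both vectors hold a 1
overlap = sum(a * b for a, b in zip(vec0, vec1))

# The set intersection names the shared tokens
shared = bow0 & bow1

print(overlap, shared)
```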

### Separating tokens on more than just whitespace

In [19]:
import re
tokens = re.split(r'[-\s.,;!?]+', sentence)
tokens

['The', 'dog', 'is', '12', 'years', 'old', 'and', 'is', 'named', 'Spot', '']

Here we use the re library to define what the tokens are split on. The r prefix marks a raw string, so backslashes reach the regular expression engine unchanged. The pattern splits the sentence on any run of one or more whitespace characters or listed punctuation marks. Regular expressions are not discussed further in this notebook, but tutorials are easy to find online, and Appendix B in the book covers them. Here we mainly wanted to take the period off of "Spot." Because the sentence ends with a delimiter, the split leaves an empty string as the last token, so we have to filter it out before analysis.

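NOTE: The third-party regex package is a largely backward-compatible alternative to re with extra features. It was once expected to eventually replace re, but re is still the standard library module.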

In [20]:
# for loop
tokens_0 = []
for x in tokens:
  if x and x not in '- \t\n.,;!?':  # skip empty strings and lone delimiters
    tokens_0.append(x)
tokens_0

['The', 'dog', 'is', '12', 'years', 'old', 'and', 'is', 'named', 'Spot']

In [21]:
# list comprehension
tokens_2 = [x for x in tokens if x and x not in '- \t\n.,;!?'] 
tokens_2

['The', 'dog', 'is', '12', 'years', 'old', 'and', 'is', 'named', 'Spot']

In [22]:
# using filter with a lambda
tokens_1 = list(filter(lambda x: x and x not in '- \t\n.,;!?', tokens))
tokens_1

['The', 'dog', 'is', '12', 'years', 'old', 'and', 'is', 'named', 'Spot']
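
An alternative that avoids the empty token altogether is to match the tokens rather than split on the delimiters. The sketch below is one possible approach, not the notebook's method: re.findall with the pattern \w+|[^\w\s] returns each word and each punctuation mark as its own token, which also gives the separate "Spot" and "." tokens wanted earlier.

```python
import re

sentence = "The dog is 12 years old and is named Spot."

# Either a run of word characters, or a single non-word, non-space character
tokens_with_punct = re.findall(r"\w+|[^\w\s]", sentence)

# Words only: punctuation is simply never matched
words_only = re.findall(r"\w+", sentence)

print(tokens_with_punct)
print(words_only)
```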

In [23]:
#!pip install regex

### Tokenizer Libraries

Many libraries provide tokenizers. Here we look at NLTK.

In [24]:
import nltk
from nltk.tokenize import RegexpTokenizer

In [25]:
sentence2 = "I don't know how to dance."

In [26]:
tokenizer = RegexpTokenizer(r'\w+|\$[0-9.]+|\S+')  # words, dollar amounts, or any other run of non-whitespace
tokenizer.tokenize(sentence2)

['I', 'don', "'t", 'know', 'how', 'to', 'dance', '.']

In [27]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer2 = TreebankWordTokenizer()
tokenizer2.tokenize(sentence2)

['I', 'do', "n't", 'know', 'how', 'to', 'dance', '.']

These tokenizers do very similar things, but TreebankWordTokenizer separates contractions differently: it splits "don't" into "do" and "n't", recognizing that the contraction is made of two words, whereas RegexpTokenizer splits it into "don" and "'t", as if "don" were the stem.

### N-grams

An n-gram is a sequence of n adjacent tokens, where n is the number of words you want to keep together. A 2-gram is a pair of adjacent words. We use n-grams when a group of words carries meaning that the individual words do not.

In [28]:
# here we want to keep ice cream together
from nltk.util import ngrams
sentence3 = "We all scream for ice cream"
tokens3 = re.split(r'[-\s.,;!?]+',sentence3)
tokens3

['We', 'all', 'scream', 'for', 'ice', 'cream']

In [29]:
two_grams = list(ngrams(tokens3,2))
two_grams

[('We', 'all'),
 ('all', 'scream'),
 ('scream', 'for'),
 ('for', 'ice'),
 ('ice', 'cream')]

In [30]:
[" ".join(x) for x in two_grams]

['We all', 'all scream', 'scream for', 'for ice', 'ice cream']

Not all n-grams are useful, and many can be counterproductive. Here we really only want the 2-gram "ice cream" and nothing else, so we need to filter out the unnecessary n-grams.
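
One simple way to do that filtering is to keep only the 2-grams that appear in a list of phrases we care about. In the sketch below, known_phrases is a hypothetical hand-written set, and the 2-grams are built with zip instead of nltk.util.ngrams so the example runs on its own:

```python
tokens = "We all scream for ice cream".split()

# Pair each token with the one that follows it
bigrams = list(zip(tokens, tokens[1:]))

# Hypothetical whitelist of 2-grams worth keeping as single units
known_phrases = {('ice', 'cream')}

kept = [' '.join(bg) for bg in bigrams if bg in known_phrases]

print(bigrams)
print(kept)
```

In practice the whitelist would more likely come from corpus statistics, such as how often a pair occurs together, than from a hand-written set.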

A stop word filter excludes words that occur very frequently but carry little information. Examples include: a, an, the, this, etc. Removing these words can also remove meaning, though, which means we have to use higher-order n-grams to preserve it (and those bring issues of their own).

In [31]:
# stop word filter example
# or use nltk.download('stopwords') --> complete list of "canonical" stop words

stop_words = ['a', 'an', 'the', 'on', 'of', 'off', 'this', 'is']
stop_tokens = ['the', 'house', 'is', 'on', 'fire']
tokens_without_stopwords = [x for x in stop_tokens if x not in stop_words]
print(tokens_without_stopwords)

['house', 'fire']
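
The loss of meaning is easy to see with two sentences that differ only in a short function word. This is an illustrative sketch: the sentences are made up for the example, and the stop word list is the one above with 'to' and 'as' added as an assumption.

```python
stop_words = {'a', 'an', 'the', 'on', 'of', 'off', 'this', 'is', 'to', 'as'}

def remove_stop_words(text):
    """Lowercase, split on whitespace and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

def bigrams(text):
    """2-grams of the unfiltered, lowercased tokens."""
    toks = text.lower().split()
    return list(zip(toks, toks[1:]))

a = "Mark reported to the CEO"
b = "Susan reported as the CEO"

filtered_a = remove_stop_words(a)
filtered_b = remove_stop_words(b)

# After filtering, both sentences end in the same two tokens
print(filtered_a, filtered_b)

# The unfiltered 2-grams still tell the two relationships apart
print(('reported', 'to') in bigrams(a), ('reported', 'as') in bigrams(b))
```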


### Normalizing Vocabulary 

Vocabulary normalization is a vocabulary reduction technique that combines similar tokens into a single normalized form.

#### Case Folding - consolidate words with different capitalization

In [32]:
cap_words = ['Hello', 'Cat', 'Ball']
case_folding = [x.lower() for x in cap_words]
case_folding

['hello', 'cat', 'ball']

This can be helpful, or it can take away important meaning. One approach is to lowercase only the first word of each sentence, which preserves the meaning carried by capitalization later in the sentence. Case folding is often skipped because of the information it loses.
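
A minimal sketch of lowercasing only the first word, assuming whitespace tokenization (fold_first_word is a hypothetical helper, not a library function):

```python
def fold_first_word(sentence):
    """Lowercase only the first token and leave the rest untouched."""
    tokens = sentence.split()
    if tokens:
        tokens[0] = tokens[0].lower()
    return tokens

print(fold_first_word("The dog is named Spot"))
print(fold_first_word("Spot is a dog"))
```

The second call shows the weakness of this heuristic: a proper noun at the start of a sentence is lowercased as well.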

#### Stemming - identifying a common stem among various forms of a word

This can be a great dimension reduction technique and can improve recall, but it has drawbacks too, as it can hurt precision.

In [33]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
' '.join([stemmer.stem(w).strip("'") for w in "dish washer's washed dishes".split()])

'dish washer wash dish'
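
To get a feel for what a stemmer does, here is a deliberately crude sketch that only strips a trailing "s" (while leaving words ending in "ss" alone). It is not the Porter algorithm, and crude_stem is a hypothetical name used only for illustration.

```python
import re

def crude_stem(phrase):
    """Lowercase each word and drop one trailing 's' unless the word ends in 'ss'."""
    stems = []
    for word in phrase.lower().split():
        # Group 1 is either the whole word ending in "ss" or the word minus a final "s"
        stem = re.match(r'^(.*ss|.*?)(s)?$', word).group(1)
        stems.append(stem.strip("'"))
    return ' '.join(stems)

print(crude_stem("dish washer's washed dishes"))
print(crude_stem("glass houses"))
```

On the phrase used above this gives 'dish washer washed dishe', which shows why the Porter stemmer needs many more rules to reach 'dish washer wash dish'.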

#### Lemmatization - normalization to the semantic root of a word

This method can be more accurate than stemming or case folding because it takes a word's meaning into account. It makes sure words with similar meanings are consolidated together.

In [34]:
import nltk
nltk.download('wordnet')  # WordNet data is required by WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("better")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


'better'

In [38]:
# pos="a" means adjective, pos="n" means noun
print(lemmatizer.lemmatize("better", pos = "a"))
print(lemmatizer.lemmatize("better", pos = "n"))

good
better


The first call returned "better" unchanged because no pos was given, so the lemmatizer assumed the word was a noun (the default is pos="n").
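
The effect of the part-of-speech argument can be imitated with a toy lookup table. This sketch does not use WordNet; LEMMAS and toy_lemmatize are hypothetical names, and the table holds only a few hand-written entries to show why the same word needs a POS tag to be resolved:

```python
# Hypothetical, hand-written lemma table keyed on (word, part of speech)
LEMMAS = {
    ('better', 'a'): 'good',
    ('dogs', 'n'): 'dog',
    ('running', 'v'): 'run',
}

def toy_lemmatize(word, pos='n'):
    """Return the lemma for (word, pos), or the word itself if there is no entry."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize('better'))           # looked up as a noun, no entry found
print(toy_lemmatize('better', pos='a'))  # looked up as an adjective
```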

NOTE: Try to avoid using stemming/lemmatization for large English datasets. If there is limited data then these could be potentially useful. 