# Lab 01 - Tokenisation and basic feature vectors

This lab introduces some basic text processing and gives a first impression of the challenges in translating text into meaningful feature vector representations.

So let's first set some text...

In [1]:
sentence = """ Welcome to the Natural Language Processing lab!
               We'll learn many things in this no 1 lab, so we will take it easy.
               Natural Language Processing is fun."""

Let's also install some libraries we'll be using:

In [2]:
%pip install numpy pandas scikit-learn nltk

Note: you may need to restart the kernel to use updated packages.


## Using native Python functions for tokenisation

Let's try to split the text on whitespace using the built-in string method `split()`.

In [3]:
sentence.split()

['Welcome',
 'to',
 'the',
 'Natural',
 'Language',
 'Processing',
 'lab!',
 "We'll",
 'learn',
 'many',
 'things',
 'in',
 'this',
 'no',
 '1',
 'lab,',
 'so',
 'we',
 'will',
 'take',
 'it',
 'easy.',
 'Natural',
 'Language',
 'Processing',
 'is',
 'fun.']

You will observe that some words are not cleanly separated from punctuation: tokens such as `lab!`, `lab,` and `easy.` still carry a trailing punctuation mark.
So we need to find a way to remove those characters... but, before we do that, let's see how we can create a quick feature vector first!

In [4]:
tokens = sentence.split()  # splitting based on spaces
vocab = sorted(set(tokens))  # sorting and removing duplicates by using set()
vocab  # just printing the vocab so we can look at it

['1',
 'Language',
 'Natural',
 'Processing',
 "We'll",
 'Welcome',
 'easy.',
 'fun.',
 'in',
 'is',
 'it',
 'lab!',
 'lab,',
 'learn',
 'many',
 'no',
 'so',
 'take',
 'the',
 'things',
 'this',
 'to',
 'we',
 'will']

We can see that the sorted list has the numbers first, followed by capitalised and then lower-case words (each group sorted alphabetically). We also see that repeated words appear only once in our vocabulary list. Let's compare the size of the two lists.

In [5]:
tokens_len = len(tokens)
vocab_len = len(vocab)

print(f"Tokens: {tokens_len}")
print(f"Vocab: {vocab_len}")

Tokens: 27
Vocab: 24
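
The ordering noted above follows from Python comparing strings by Unicode code point, where digits come before upper-case letters and upper-case letters come before lower-case ones. A small sketch of this, using a made-up `words` list rather than the lab's vocabulary:

```python
# Illustrative word list (an assumption, not the lab's vocabulary).
words = ["we", "Welcome", "1", "lab!"]

# The default sort compares code points: digits < upper case < lower case.
default_order = sorted(words)
print(default_order)

# Code points of one digit, one upper-case and one lower-case character.
code_points = [ord(c) for c in "1Ww"]
print(code_points)

# A case-insensitive ordering, in case that is preferred.
folded_order = sorted(words, key=str.lower)
print(folded_order)
```

Sorting with `key=str.lower` only changes the order; it does not merge `We` and `we` into one vocabulary entry.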


Let's try and print the matrix of tokens against vocabulary. We will use the numpy lib for that.

In [6]:
import numpy as np

matrix = np.zeros((tokens_len, vocab_len), int)
for i, token in enumerate(tokens):
    matrix[i, vocab.index(token)] = 1

matrix

array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0],
       [0, 0, 0, 0, 0

It's not easy to see, but the second, third and fourth columns have the value of 1 in two rows, whereas the rest only take the value of 1 in a single row. To make it a little more readable, we could use Pandas and DataFrame! Both Pandas and NumPy are very useful libs that we will use many times.

In [7]:
import pandas as pd

pd.DataFrame(matrix, columns=vocab)

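|   | 1 | Language | Natural | Processing | We'll | Welcome | easy. | fun. | in | is | ... | many | no | so | take | the | things | this | to | we | will |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |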


Now this is much clearer, and if we wanted we could carry on making it look nicer.
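
The earlier observation that three columns hold a 1 in two rows can also be checked programmatically. The sketch below rebuilds the same one-hot matrix and sums each column; the names `column_counts` and `repeated` are introduced here for illustration only.

```python
import numpy as np

sentence = """ Welcome to the Natural Language Processing lab!
               We'll learn many things in this no 1 lab, so we will take it easy.
               Natural Language Processing is fun."""

tokens = sentence.split()
vocab = sorted(set(tokens))

# One-hot matrix: one row per token, one column per vocabulary entry.
matrix = np.zeros((len(tokens), len(vocab)), dtype=int)
for i, token in enumerate(tokens):
    matrix[i, vocab.index(token)] = 1

# Summing down each column counts how often each vocabulary entry occurs.
column_counts = matrix.sum(axis=0)
repeated = [word for word, n in zip(vocab, column_counts) if n > 1]
print(repeated)
```

Note that `lab!` and `lab,` are still counted as different entries here, because the punctuation has not been removed yet.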

Let's now carry on building the bag of words (BoW)

In [9]:
bow = {token: 1 for token in tokens}  # setting this up as a dictionary
sorted(bow.items())  # let's print it

[('1', 1),
 ('Language', 1),
 ('Natural', 1),
 ('Processing', 1),
 ("We'll", 1),
 ('Welcome', 1),
 ('easy.', 1),
 ('fun.', 1),
 ('in', 1),
 ('is', 1),
 ('it', 1),
 ('lab!', 1),
 ('lab,', 1),
 ('learn', 1),
 ('many', 1),
 ('no', 1),
 ('so', 1),
 ('take', 1),
 ('the', 1),
 ('things', 1),
 ('this', 1),
 ('to', 1),
 ('we', 1),
 ('will', 1)]

Since bow is a dictionary, we see that there are no duplicate words.

Pandas also offers a dictionary-like structure called `Series`, a labelled one-dimensional array.

In [10]:
df = pd.DataFrame(pd.Series(dict([(token, 1) for token in tokens])), columns=["sent"]).T
df

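|   | Welcome | to | the | Natural | Language | Processing | lab! | We'll | learn | many | ... | 1 | lab, | so | we | will | take | it | easy. | is | fun. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sent | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |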


In [11]:
corpus = {}
for i, sent in enumerate(sentence.split('\n')):
    corpus[f"sent{i}"] = dict((tok, 1) for tok in sent.split())

df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
df

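|   | Welcome | to | the | Natural | Language | Processing | lab! | We'll | learn | many | ... | 1 | lab, | so | we | will | take | it | easy. | is | fun. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sent0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| sent2 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |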


We have now built feature vectors for the three sentences we originally had. Next, let's calculate some dot products.

In [12]:
df = df.T
print(f"Dot product of sent0 and sent1: {df.sent0.dot(df.sent1)} and dot product of sent0 and sent2: {df.sent0.dot(df.sent2)}")

Dot product of sent0 and sent1: 0 and dot product of sent0 and sent2: 3


As we see from the results, the higher the dot product, the more similar the vectors are. Only the first and last sentences have words in common, so their dot product comes back as 3, whereas the first and second sentences share no tokens and come back as 0.
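
For binary vectors like these, the dot product is simply the number of tokens the two sentences share. A minimal sketch that checks this with set intersections instead of vectors (the names `sents`, `shared_0_1` and `shared_0_2` are introduced here for illustration):

```python
sentence = """ Welcome to the Natural Language Processing lab!
               We'll learn many things in this no 1 lab, so we will take it easy.
               Natural Language Processing is fun."""

# One set of whitespace-separated tokens per line of the text.
sents = [set(line.split()) for line in sentence.split("\n")]

# Tokens that two sentences have in common.
shared_0_1 = sents[0] & sents[1]
shared_0_2 = sents[0] & sents[2]

print(len(shared_0_1), sorted(shared_0_1))
print(len(shared_0_2), sorted(shared_0_2))
```

The first pair shares nothing because `lab!` and `lab,` differ only in their punctuation, which is one reason to clean the tokens before building vectors.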

We can improve our vocabulary now if we were to remove all other punctuation. Let's do that with regular expressions.

In [13]:
import re

tokens = re.split(r"[-\s.,;!?]+", sentence)
tokens

['',
 'Welcome',
 'to',
 'the',
 'Natural',
 'Language',
 'Processing',
 'lab',
 "We'll",
 'learn',
 'many',
 'things',
 'in',
 'this',
 'no',
 '1',
 'lab',
 'so',
 'we',
 'will',
 'take',
 'it',
 'easy',
 'Natural',
 'Language',
 'Processing',
 'is',
 'fun',
 '']
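
The output above starts and ends with an empty string, because the text begins with whitespace and ends with a full stop, so `re.split` finds a delimiter at both edges. A small sketch of one way to drop those empty strings:

```python
import re

sentence = """ Welcome to the Natural Language Processing lab!
               We'll learn many things in this no 1 lab, so we will take it easy.
               Natural Language Processing is fun."""

# Split on runs of delimiters, then keep only the non-empty pieces.
tokens = [tok for tok in re.split(r"[-\s.,;!?]+", sentence) if tok]
print(tokens)
```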

## NLTK

Although this seems to be great... you might still have issues with different characters that are not anticipated. So we usually use an existing NLP related tokeniser to do this job. Let's try the NLTK lib.

NLTK also supports regular expressions:

In [14]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+|\$[0-9.]+|\S+")  # \$ matches a literal dollar sign (e.g. prices)
tokenizer.tokenize(sentence)

['Welcome',
 'to',
 'the',
 'Natural',
 'Language',
 'Processing',
 'lab',
 '!',
 'We',
 "'ll",
 'learn',
 'many',
 'things',
 'in',
 'this',
 'no',
 '1',
 'lab',
 ',',
 'so',
 'we',
 'will',
 'take',
 'it',
 'easy',
 '.',
 'Natural',
 'Language',
 'Processing',
 'is',
 'fun',
 '.']

But there are other, more specialised tokenisers:

In [15]:
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentence)

['Welcome',
 'to',
 'the',
 'Natural',
 'Language',
 'Processing',
 'lab',
 '!',
 'We',
 "'ll",
 'learn',
 'many',
 'things',
 'in',
 'this',
 'no',
 '1',
 'lab',
 ',',
 'so',
 'we',
 'will',
 'take',
 'it',
 'easy.',
 'Natural',
 'Language',
 'Processing',
 'is',
 'fun',
 '.']

For now let's use the regular expression special word pattern `\w`, so we can control what the tokeniser does.

In [16]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(sentence)
print(tokens)



['Welcome', 'to', 'the', 'Natural', 'Language', 'Processing', 'lab', 'We', 'll', 'learn', 'many', 'things', 'in', 'this', 'no', '1', 'lab', 'so', 'we', 'will', 'take', 'it', 'easy', 'Natural', 'Language', 'Processing', 'is', 'fun']
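
Notice that `\w+` splits "We'll" into `We` and `ll`, because the apostrophe is not a word character. If contractions should stay together, one possible pattern is sketched below with the standard library's `re.findall` rather than NLTK. The pattern and the sample text are assumptions for illustration, not part of the lab.

```python
import re

# Illustrative sample text.
text = "We'll learn many things, so we will take it easy."

# A word, optionally followed by apostrophe + word parts (e.g. We'll, don't).
pattern = r"\w+(?:'\w+)*"

tokens = re.findall(pattern, text)
print(tokens)
```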


At this point you could try out tokenisers from other libraries and see if there are any differences.

We will now calculate the 2-grams:

In [17]:
from nltk.util import ngrams

list(ngrams(tokens, 2))

[('Welcome', 'to'),
 ('to', 'the'),
 ('the', 'Natural'),
 ('Natural', 'Language'),
 ('Language', 'Processing'),
 ('Processing', 'lab'),
 ('lab', 'We'),
 ('We', 'll'),
 ('ll', 'learn'),
 ('learn', 'many'),
 ('many', 'things'),
 ('things', 'in'),
 ('in', 'this'),
 ('this', 'no'),
 ('no', '1'),
 ('1', 'lab'),
 ('lab', 'so'),
 ('so', 'we'),
 ('we', 'will'),
 ('will', 'take'),
 ('take', 'it'),
 ('it', 'easy'),
 ('easy', 'Natural'),
 ('Natural', 'Language'),
 ('Language', 'Processing'),
 ('Processing', 'is'),
 ('is', 'fun')]

and 3-grams:

In [18]:
list(ngrams(tokens, 3))

[('Welcome', 'to', 'the'),
 ('to', 'the', 'Natural'),
 ('the', 'Natural', 'Language'),
 ('Natural', 'Language', 'Processing'),
 ('Language', 'Processing', 'lab'),
 ('Processing', 'lab', 'We'),
 ('lab', 'We', 'll'),
 ('We', 'll', 'learn'),
 ('ll', 'learn', 'many'),
 ('learn', 'many', 'things'),
 ('many', 'things', 'in'),
 ('things', 'in', 'this'),
 ('in', 'this', 'no'),
 ('this', 'no', '1'),
 ('no', '1', 'lab'),
 ('1', 'lab', 'so'),
 ('lab', 'so', 'we'),
 ('so', 'we', 'will'),
 ('we', 'will', 'take'),
 ('will', 'take', 'it'),
 ('take', 'it', 'easy'),
 ('it', 'easy', 'Natural'),
 ('easy', 'Natural', 'Language'),
 ('Natural', 'Language', 'Processing'),
 ('Language', 'Processing', 'is'),
 ('Processing', 'is', 'fun')]

If we want to include the n-grams as strings rather than tuples, then we need to convert them:

In [21]:
bigrams = [" ".join(x) for x in list(ngrams(tokens, 2))]
print(f"Bigrams: {bigrams}\n")

trigrams = [" ".join(x) for x in list(ngrams(tokens, 3))]
print(f"Trigrams: {trigrams}")

Bigrams: ['Welcome to', 'to the', 'the Natural', 'Natural Language', 'Language Processing', 'Processing lab', 'lab We', 'We ll', 'll learn', 'learn many', 'many things', 'things in', 'in this', 'this no', 'no 1', '1 lab', 'lab so', 'so we', 'we will', 'will take', 'take it', 'it easy', 'easy Natural', 'Natural Language', 'Language Processing', 'Processing is', 'is fun']

Trigrams: ['Welcome to the', 'to the Natural', 'the Natural Language', 'Natural Language Processing', 'Language Processing lab', 'Processing lab We', 'lab We ll', 'We ll learn', 'll learn many', 'learn many things', 'many things in', 'things in this', 'in this no', 'this no 1', 'no 1 lab', '1 lab so', 'lab so we', 'so we will', 'we will take', 'will take it', 'take it easy', 'it easy Natural', 'easy Natural Language', 'Natural Language Processing', 'Language Processing is', 'Processing is fun']
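
To see what an n-gram is without any library, the sliding window can be sketched in plain Python. The `simple_ngrams` helper below is a hypothetical function written for illustration; it is not how NLTK implements `ngrams`, although it should give the same tuples for a list of tokens.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip over n shifted copies of the list; zip stops at the shortest one.
    return list(zip(*(tokens[i:] for i in range(n))))


sample = ["Natural", "Language", "Processing", "is", "fun"]

bigrams = simple_ngrams(sample, 2)
trigrams = simple_ngrams(sample, 3)

print(bigrams)
print(trigrams)
```

A list of `k` tokens yields `k - n + 1` n-grams, which is why the 28 tokens above give 27 bigrams and 26 trigrams.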


Another important step we looked at in the lectures is stop word removal. Let's try to use the nltk stop word list to do that.

First, let's download the list.

In [22]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/studio-lab-
[nltk_data]     user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

and now check it:

In [23]:
stop_words = nltk.corpus.stopwords.words('english')
print(f"number of stopwords: {len(stop_words)}")
print(stop_words)

number of stopwords: 179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own'

Other libs have different stopwords. Let's see a much larger set from sklearn.

In [24]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words

print(f"number of stopwords: {len(sklearn_stop_words)}")
print(sklearn_stop_words)

number of stopwords: 318
frozenset({'without', 'mine', 'another', 'almost', 'further', 'thereupon', 'it', 'third', 'until', 'when', 'for', 'only', 'every', 'nevertheless', 'somehow', 'neither', 'formerly', 'find', 'if', 'all', 'get', 'mostly', 'while', 'thus', 'wherein', 'three', 'around', 'ourselves', 'should', 'except', 'have', 'seem', 'anyone', 'thick', 'take', 'con', 'again', 'fire', 'seeming', 'couldnt', 'a', 'on', 'whereafter', 'this', 'will', 'whereby', 'few', 'against', 'must', 'before', 'mill', 'etc', 'thence', 'our', 'interest', 'thru', 'within', 'done', 'herself', 'hereupon', 'yet', 'un', 'been', 'was', 'nothing', 'still', 'empty', 'anyway', 'two', 'anyhow', 'sometime', 'together', 'up', 'whenever', 'between', 'becoming', 'me', 'everywhere', 'themselves', 'namely', 'since', 'fill', 'more', 'beside', 'four', 'ie', 'us', 'wherever', 'several', 'of', 'itself', 'latterly', 'please', 'off', 'show', 'any', 'first', 'give', 'also', 'well', 'whereupon', 'part', 'might', 'down', 'nor

Strangely enough, although there are more stop words in sklearn, you will find that nltk has words that are not contained in sklearn. So you might want to join the two lists.
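
Joining the two lists is a set union. The sketch below uses two tiny hand-written samples as stand-ins for the full lists, so that it runs without any downloads. With the real data you would pass `stop_words` and `sklearn_stop_words` instead.

```python
# Small stand-ins for the two stop word lists (assumed samples, not the full lists).
nltk_sample = ["it", "this", "further", "she's", "that'll"]
sklearn_sample = frozenset({"it", "this", "further", "thereupon", "nevertheless"})

# Union: every word that appears in at least one list.
combined = set(nltk_sample) | set(sklearn_sample)

# Difference: words present in the first list but missing from the second.
only_in_first = set(nltk_sample) - set(sklearn_sample)

print(len(combined))
print(sorted(only_in_first))
```

Using a `set` for the combined list also makes the later `x not in stop_words` checks faster than scanning a list.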

For normalising the text you could do something as simple as making sure all words are lower case.

In [25]:
norm_tokens = [x.lower() for x in tokens]
print(norm_tokens)

['welcome', 'to', 'the', 'natural', 'language', 'processing', 'lab', 'we', 'll', 'learn', 'many', 'things', 'in', 'this', 'no', '1', 'lab', 'so', 'we', 'will', 'take', 'it', 'easy', 'natural', 'language', 'processing', 'is', 'fun']


For stemming the words we could use nltk again.

In [26]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
stem_tokens = [stemmer.stem(x) for x in norm_tokens]
print(stem_tokens)

['welcom', 'to', 'the', 'natur', 'languag', 'process', 'lab', 'we', 'll', 'learn', 'mani', 'thing', 'in', 'thi', 'no', '1', 'lab', 'so', 'we', 'will', 'take', 'it', 'easi', 'natur', 'languag', 'process', 'is', 'fun']


For lemmatising, nltk also does the trick.

In [27]:
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")
nltk.download("omw-1.4")

lemmatizer = WordNetLemmatizer()
stem_tokens = [lemmatizer.lemmatize(x) for x in norm_tokens]
print(stem_tokens)

[nltk_data] Downloading package wordnet to /home/studio-lab-
[nltk_data]     user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/studio-lab-
[nltk_data]     user/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


['welcome', 'to', 'the', 'natural', 'language', 'processing', 'lab', 'we', 'll', 'learn', 'many', 'thing', 'in', 'this', 'no', '1', 'lab', 'so', 'we', 'will', 'take', 'it', 'easy', 'natural', 'language', 'processing', 'is', 'fun']


With this example sentence, there are no issues with the lemmatisation... but let's look at the following example:

In [28]:
print(lemmatizer.lemmatize("better"))
print(lemmatizer.lemmatize("better", "a"))  # declaring the POS as adjective

better
good


If we don't provide the part of speech (POS), `lemmatize()` treats every word as a noun by default, so an adjective such as "better" is left unchanged. So let's try to fix that.

In [29]:
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map the POS tag to the first character lemmatize() accepts."""

    try:  # download nltk's POS tagger if it doesn't exist
        nltk.data.find("taggers/averaged_perceptron_tagger")
    except LookupError:
        nltk.download("averaged_perceptron_tagger")
    tag = nltk.pos_tag([word])[0][1][0].upper()  # use nltk's POS tagger on the word

    # now we need to convert from nltk to wordnet POS notations (for compatibility reasons)
    tag_dict = {
        "J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV
    }

    return tag_dict.get(tag, wordnet.NOUN)  # return and default to noun if not found

In [30]:

stem_tokens = [lemmatizer.lemmatize(x, pos=get_wordnet_pos(x)) for x in norm_tokens]
print(stem_tokens)

['welcome', 'to', 'the', 'natural', 'language', 'processing', 'lab', 'we', 'll', 'learn', 'many', 'thing', 'in', 'this', 'no', '1', 'lab', 'so', 'we', 'will', 'take', 'it', 'easy', 'natural', 'language', 'processing', 'be', 'fun']


Now that the tokens are normalised, repeated words are counted together, so we get higher counts in our bag of words.

In [31]:
from collections import Counter

bow = Counter(stem_tokens)
bow

Counter({'welcome': 1,
         'to': 1,
         'the': 1,
         'natural': 2,
         'language': 2,
         'processing': 2,
         'lab': 2,
         'we': 2,
         'll': 1,
         'learn': 1,
         'many': 1,
         'thing': 1,
         'in': 1,
         'this': 1,
         'no': 1,
         '1': 1,
         'so': 1,
         'will': 1,
         'take': 1,
         'it': 1,
         'easy': 1,
         'be': 1,
         'fun': 1})

Now let's check the most frequent 6 words.

In [32]:
bow.most_common(6)

[('natural', 2),
 ('language', 2),
 ('processing', 2),
 ('lab', 2),
 ('we', 2),
 ('welcome', 1)]

Now let's remove the stop words and check the count again.

In [33]:
no_stop_tokens = [x for x in stem_tokens if x not in stop_words]
count = Counter(no_stop_tokens)
count

Counter({'welcome': 1,
         'natural': 2,
         'language': 2,
         'processing': 2,
         'lab': 2,
         'learn': 1,
         'many': 1,
         'thing': 1,
         '1': 1,
         'take': 1,
         'easy': 1,
         'fun': 1})

Finally... let's make our feature vector using the frequency ratio (term count / total number of terms in the doc):

In [34]:
document_vector = []
doc_length = len(no_stop_tokens)
for key, value in count.most_common():
    document_vector.append(value / doc_length)

print(document_vector)

[0.125, 0.125, 0.125, 0.125, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625]
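
The list above no longer shows which term each value belongs to. A possible variation, sketched here, keeps the term next to its frequency in a dictionary. The token list is copied from the stop word filtering result above, and `term_freq` is a name introduced for illustration.

```python
from collections import Counter

# Tokens left after lemmatising and removing stop words (from the output above).
no_stop_tokens = [
    "welcome", "natural", "language", "processing", "lab", "learn", "many",
    "thing", "1", "lab", "take", "easy", "natural", "language", "processing", "fun",
]

count = Counter(no_stop_tokens)
doc_length = len(no_stop_tokens)

# Term frequency: occurrences of the term divided by the number of terms.
term_freq = {term: n / doc_length for term, n in count.most_common()}

for term, freq in term_freq.items():
    print(f"{term}: {freq}")
```

Because every token contributes to exactly one term, the frequencies add up to 1.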


We have explored many options already, and we will continue with more advanced feature vectors in the next lab, plus some visualisations in charts. Until then, please try different experiments on your own:
1. See if you can change the text and have more sentences with different topics (so you can compare the feature vectors later).
2. Try to use different libraries for tokenising, POS tagging, stemming and lemmatising.
3. Try to use distance metrics to compare vectors, such as Euclidean distance.
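
As a starting point for the third experiment, here is a minimal sketch of Euclidean distance between the binary sentence vectors, assuming the same whitespace tokenisation used at the start of the lab. The helper `euclidean` is a hypothetical name; it simply wraps `np.linalg.norm`.

```python
import numpy as np

sentence = """ Welcome to the Natural Language Processing lab!
               We'll learn many things in this no 1 lab, so we will take it easy.
               Natural Language Processing is fun."""

docs = [line.split() for line in sentence.split("\n")]
vocab = sorted({tok for doc in docs for tok in doc})

# One binary vector per sentence: 1 if the vocabulary word occurs in it.
vectors = np.array([[1 if word in doc else 0 for word in vocab] for doc in docs])


def euclidean(a, b):
    """Euclidean (L2) distance between two vectors."""
    return float(np.linalg.norm(a - b))


d01 = euclidean(vectors[0], vectors[1])
d02 = euclidean(vectors[0], vectors[2])
print(d01, d02)
```

Unlike the dot product, a smaller distance means more similar vectors, so the first and last sentences should come out closer than the first and second.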