# Lesson I - Text representation

In this lesson we will see in some detail how we can best represent text in our application. Let's start by importing the modules we will be using:

In [None]:
import string
from collections import Counter
from pprint import pprint
import gzip
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

We choose a well-known nursery rhyme, which has the added distinction of being the first words Thomas Edison recorded on his phonograph, as the short snippet of text that we will use in our examples:

In [None]:
text = """Mary had a little lamb, little lamb,
    little lamb. Mary had a little lamb
    whose fleece was white as snow.
    And everywhere that Mary went
    Mary went, Mary went. Everywhere
    that Mary went,
    The lamb was sure to go"""

## Tokenization

The first step in any analysis is to tokenize the text. What this means is that we will extract all the individual words in the text. For the sake of simplicity, we will assume that our text is well formed and that our words are delimited either by white space or punctuation characters.

In [None]:
def extract_words(text):
    temp = text.split() # Split the text on whitespace
    text_words = []

    for word in temp:
        # Remove any punctuation characters present in the beginning of the word
        # (the "word and" guard avoids an IndexError once the word is empty)
        while word and word[0] in string.punctuation:
            word = word[1:]

        # Remove any punctuation characters present in the end of the word
        while word and word[-1] in string.punctuation:
            word = word[:-1]

        # Append this word into our list of words, skipping tokens
        # that consisted entirely of punctuation.
        if word:
            text_words.append(word.lower())

    return text_words

After this step we now have our text represented as a list of individual, lowercase words:

In [None]:
text_words = extract_words(text)
print(text_words)

['mary', 'had', 'a', 'little', 'lamb', 'little', 'lamb', 'little', 'lamb', 'mary', 'had', 'a', 'little', 'lamb', 'whose', 'fleece', 'was', 'white', 'as', 'snow', 'and', 'everywhere', 'that', 'mary', 'went', 'mary', 'went', 'mary', 'went', 'everywhere', 'that', 'mary', 'went', 'the', 'lamb', 'was', 'sure', 'to', 'go']
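
The same assumption (words are delimited by white space or punctuation) can also be expressed with a regular expression. The following is only a sketch of an alternative, not a replacement for `extract_words`: the name `extract_words_regex` is hypothetical, and unlike the loop above it also splits at punctuation *inside* a word, such as apostrophes and hyphens.

```python
import re

def extract_words_regex(text):
    # Lowercase the text, then keep every maximal run of letters or digits.
    # Anything else (white space, punctuation) acts as a delimiter.
    return re.findall(r"[a-z0-9]+", text.lower())

sample = "Mary had a little lamb, little lamb."
print(extract_words_regex(sample))
# ['mary', 'had', 'a', 'little', 'lamb', 'little', 'lamb']
```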


This is a wasteful way to represent text, since every repeated word is stored again as a full string. We can be much more efficient by representing each word by a number:

In [None]:
word_dict = {}
word_list = []
vocabulary_size = 0
text_tokens = []

for word in text_words:
    # If we are seeing this word for the first time, create an id for it and add it to our word dictionary
    if word not in word_dict:
        word_dict[word] = vocabulary_size
        word_list.append(word)
        vocabulary_size += 1

    # add the token corresponding to the current word to the tokenized text.
    text_tokens.append(word_dict[word])

When we were tokenizing our text, we also generated a dictionary **word_dict** that maps words to integers and a **word_list** that maps each integer to the corresponding word.

In [None]:
print("Word list:", word_list, "\n\n Word dictionary:")
pprint(word_dict)

Word list: ['mary', 'had', 'a', 'little', 'lamb', 'whose', 'fleece', 'was', 'white', 'as', 'snow', 'and', 'everywhere', 'that', 'went', 'the', 'sure', 'to', 'go'] 

 Word dictionary:
{'a': 2,
 'and': 11,
 'as': 9,
 'everywhere': 12,
 'fleece': 6,
 'go': 18,
 'had': 1,
 'lamb': 4,
 'little': 3,
 'mary': 0,
 'snow': 10,
 'sure': 16,
 'that': 13,
 'the': 15,
 'to': 17,
 'was': 7,
 'went': 14,
 'white': 8,
 'whose': 5}


These two data structures already proved their usefulness when we converted our text to a list of tokens.

In [None]:
print(text_tokens)

[0, 1, 2, 3, 4, 3, 4, 3, 4, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0, 14, 0, 14, 0, 14, 12, 13, 0, 14, 15, 4, 7, 16, 17, 18]
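
The conversion also works in the other direction: the dictionary encodes words as ids, and the list decodes ids back into words. A minimal round-trip sketch on a shortened version of the rhyme follows; the names `demo_dict`, `demo_list`, and `demo_words` are illustrative and chosen so that they do not overwrite the structures built above.

```python
demo_words = ["mary", "had", "a", "little", "lamb", "little", "lamb"]

demo_dict = {}   # word -> integer id
demo_list = []   # integer id -> word

for word in demo_words:
    if word not in demo_dict:
        demo_dict[word] = len(demo_list)
        demo_list.append(word)

tokens = [demo_dict[word] for word in demo_words]    # encode
decoded = [demo_list[token] for token in tokens]     # decode

print(tokens)                 # [0, 1, 2, 3, 4, 3, 4]
print(decoded == demo_words)  # True
```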


Unfortunately, while this representation is convenient for memory reasons, it has some severe limitations. Perhaps the most important is that computers naturally assume that numbers can be operated on mathematically (by addition, subtraction, etc.) in a way that doesn't match our understanding of words.
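
A small illustration of the problem, using the ids that the first five words of the rhyme receive (the names `first_words` and `ids` are illustrative only): adding the ids of two words yields the id of a third, unrelated word.

```python
first_words = ["mary", "had", "a", "little", "lamb"]
ids = {word: i for i, word in enumerate(first_words)}

total = ids["had"] + ids["a"]   # 1 + 2 = 3
print(first_words[total])       # little
```

Nothing about the language suggests that "had" plus "a" should equal "little"; the result is purely an accident of the order in which the words were first seen.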

## One-hot encoding

One typical way of overcoming this difficulty is to represent each word by a one-hot encoded vector where every element is zero except the one corresponding to a specific word.

In [None]:
def one_hot(word, word_dict):
    """
        Generate a one-hot encoded vector corresponding to *word*
    """

    vector = np.zeros(len(word_dict))
    vector[word_dict[word]] = 1

    return vector

So, for example, the word "fleece" would be represented by:

In [None]:
fleece_hot = one_hot("fleece", word_dict)
print(fleece_hot)

[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


This vector has every element set to zero, except element 6, since:

In [None]:
print(word_dict["fleece"])
fleece_hot[6] == 1

6


True
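
One consequence of this encoding is that no word is numerically "closer" to any other: the dot product of the vectors of two different words is always zero, and every pair of distinct words is the same distance apart. The sketch below checks this on a hypothetical three-word vocabulary (`small_vocab` and `one_hot_demo` are illustrative names, not part of the lesson's code).

```python
import numpy as np

small_vocab = {"mary": 0, "lamb": 1, "fleece": 2}  # hypothetical vocabulary

def one_hot_demo(word, vocab):
    # Same construction as one_hot above, repeated to keep the example self-contained
    vector = np.zeros(len(vocab))
    vector[vocab[word]] = 1
    return vector

mary = one_hot_demo("mary", small_vocab)
lamb = one_hot_demo("lamb", small_vocab)
fleece = one_hot_demo("fleece", small_vocab)

print(mary @ lamb)                    # 0.0: different words are orthogonal
print(mary @ mary)                    # 1.0: each vector has unit length
print(np.linalg.norm(mary - lamb))    # sqrt(2), about 1.414
print(np.linalg.norm(mary - fleece))  # the same distance again
```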

## Bag of words

We can now use the one-hot encoded vector for each word to produce a vector representation of our original text, by simply adding up all the one-hot encoded vectors:

In [None]:
text_vector1 = np.zeros(vocabulary_size)

for word in text_words:
    hot_word = one_hot(word, word_dict)
    text_vector1 += hot_word

print(text_vector1)

[6. 2. 2. 4. 5. 1. 1. 2. 1. 1. 1. 1. 2. 2. 4. 1. 1. 1. 1.]


In practice, we can also easily skip the encoding step at the word level by using the *word_dict* defined above:

In [None]:
text_vector = np.zeros(vocabulary_size)

for word in text_words:
    text_vector[word_dict[word]] += 1

print(text_vector)

[6. 2. 2. 4. 5. 1. 1. 2. 1. 1. 1. 1. 2. 2. 4. 1. 1. 1. 1.]


Naturally, this approach is completely equivalent to the previous one and has the added advantage of being more efficient in terms of both speed and memory requirements.

This is known as the __bag of words__ representation of the text. Note that this vector simply contains the number of times each word appears in our document, so we can easily tell that the word *mary* appears exactly 6 times in our little nursery rhyme.

In [None]:
text_vector[word_dict["mary"]]

6.0
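
When the text is already available as a list of integer tokens, the counting loop can be replaced by a single NumPy call. This is a sketch on a shortened token list (`demo_tokens` and `demo_vocab_size` are illustrative names); note that `np.bincount` returns integer counts rather than the floats produced by `np.zeros`.

```python
import numpy as np

demo_tokens = [0, 1, 2, 3, 4, 3, 4]   # "mary had a little lamb little lamb"
demo_vocab_size = 5

# Count how many times each id occurs; minlength guarantees one slot per word
bag = np.bincount(demo_tokens, minlength=demo_vocab_size)
print(bag)   # [1 1 1 2 2]
```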

A more Pythonic (and efficient) way of producing the same result is to use the __Counter__ class from the standard library's collections module:

In [None]:
word_counts = Counter(text_words)
pprint(word_counts)

Counter({'mary': 6,
         'lamb': 5,
         'little': 4,
         'went': 4,
         'had': 2,
         'a': 2,
         'was': 2,
         'everywhere': 2,
         'that': 2,
         'whose': 1,
         'fleece': 1,
         'white': 1,
         'as': 1,
         'snow': 1,
         'and': 1,
         'the': 1,
         'sure': 1,
         'to': 1,
         'go': 1})


From this we can easily generate the __text_vector__ and __word_dict__ data structures:

In [None]:
items = list(word_counts.items())

# Extract word dictionary and vector representation
word_dict2 = {word: i for i, (word, count) in enumerate(items)}
text_vector2 = [count for word, count in items]

And let's take a look at them:

In [None]:
print("Text vector:", text_vector2, "\n\nWord dictionary:")
pprint(word_dict2)


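The results using this approach may differ slightly from the previous ones, because the words can be mapped to different integer ids (in Python 3.7 and later, where dictionaries preserve insertion order, the ids happen to coincide). Either way, the corresponding values are the same: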

In [None]:
for word in word_dict.keys():
    if text_vector[word_dict[word]] != text_vector2[word_dict2[word]]:
        print("Error!")

As expected, there are no differences!
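
Another way to compare two bag-of-words representations regardless of how the ids were assigned is to convert each one back into a word-to-count mapping and compare those. A minimal sketch, where `to_counts` and the two small vocabularies are hypothetical:

```python
def to_counts(vector, vocab):
    """Turn a bag-of-words vector back into a {word: count} mapping."""
    return {word: int(vector[idx]) for word, idx in vocab.items()}

# The same three-word text, encoded under two different id assignments
vocab_a = {"mary": 0, "lamb": 1, "little": 2}
vector_a = [1, 2, 2]

vocab_b = {"little": 0, "mary": 1, "lamb": 2}
vector_b = [2, 1, 2]

print(to_counts(vector_a, vocab_a) == to_counts(vector_b, vocab_b))  # True
```

Dictionary equality ignores insertion order, so the comparison depends only on the counts attached to each word.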