# Natural Language Processing

We will see in some detail how we can best represent text in our application. Let's start by importing the modules we will be using:

In [1]:
import string
from collections import Counter
from pprint import pprint
import gzip
import matplotlib.pyplot as plt 
import numpy as np
import watermark
%matplotlib inline
plt.style.use('./d4sci.mplstyle')

We choose a well-known nursery rhyme, which has the added distinction of reportedly being the first thing Thomas Edison recorded on his phonograph, as the short snippet of text that we will use in our examples:

In [2]:
text = """Mary had a little lamb, little lamb,
    little lamb. Mary had a little lamb
    whose fleece was white as snow.
    And everywhere that Mary went
    Mary went, Mary went. Everywhere
    that Mary went,
    The lamb was sure to go"""

## Tokenization

The first step in any analysis is to tokenize the text. What this means is that we will extract all the individual words in the text. For the sake of simplicity, we will assume that our text is well formed and that our words are delimited either by white space or punctuation characters.

In [3]:
# Write a function to tokenize the given text.
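
One possible way to do this, as a minimal sketch: the function name `extract_words` is illustrative, and the approach assumes, as stated above, that words are delimited only by white space or punctuation (so a contraction such as "don't" would be split in two).

```python
import string

def extract_words(text):
    """Split text into lowercase words, treating punctuation as a delimiter."""
    # Map every punctuation character to a space, then split on white space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

sample = "Mary had a little lamb, little lamb."
tokens = extract_words(sample)
print(tokens)
```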

After this step, our text is represented as a list of individual lowercase words:

In [4]:
# Represent the text in the above manner

We can be much more efficient by representing each word by a number:

In [5]:
# Write a function that builds a dictionary mapping each unique word
# to an integer id, and print it.
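
A minimal sketch of such a mapping, assuming ids are assigned in order of first appearance (any other consistent numbering would work equally well; `build_word_dict` is an illustrative name):

```python
def build_word_dict(words):
    """Assign consecutive integer ids to words in order of first appearance."""
    word_dict = {}
    for word in words:
        if word not in word_dict:
            word_dict[word] = len(word_dict)
    return word_dict

words = ["mary", "had", "a", "little", "lamb", "little", "lamb"]
word_dict = build_word_dict(words)
print(word_dict)

# The text can now be stored as a sequence of integers
encoded = [word_dict[word] for word in words]
print(encoded)
```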

Unfortunately, while this representation is memory-efficient, it has some severe limitations. Perhaps the most important is that computers naturally assume that numbers can be operated on mathematically (by addition, subtraction, etc.) in a way that doesn't match our understanding of words.

## One-hot encoding

One typical way of overcoming this difficulty is to represent each word by a one-hot encoded vector where every element is zero except the one corresponding to a specific word.

In [6]:
# Write a function to one-hot encode a word and print its value.
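
As a rough sketch, a one-hot vector can be built from a word-to-id dictionary like this. The dictionary below is a hypothetical five-word vocabulary, and plain lists are used to keep the example dependency-free; a NumPy array would work just as well.

```python
def one_hot(word, word_dict):
    """Return a vector of zeros with a single 1 at the position of word."""
    vector = [0] * len(word_dict)
    vector[word_dict[word]] = 1
    return vector

# Hypothetical vocabulary: word -> integer id
word_dict = {"mary": 0, "had": 1, "a": 2, "little": 3, "lamb": 4}
vec = one_hot("little", word_dict)
print(vec)
```

Note that the length of each vector equals the vocabulary size, so this representation grows quickly for realistic corpora.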

## Bag of words

We can now use the one-hot encoded vector for each word to produce a vector representation of our original text, by simply adding up all the one-hot encoded vectors:

In [7]:
# Write a function to create and print a bag of words representation of the given text.
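
The summation described above could look like the following sketch, again using a hypothetical five-word vocabulary and illustrative function names:

```python
def one_hot(word, word_dict):
    """Return a vector of zeros with a single 1 at the position of word."""
    vector = [0] * len(word_dict)
    vector[word_dict[word]] = 1
    return vector

def bag_of_words(words, word_dict):
    """Add up the one-hot vectors of all words, element by element."""
    bow = [0] * len(word_dict)
    for word in words:
        for i, value in enumerate(one_hot(word, word_dict)):
            bow[i] += value
    return bow

word_dict = {"mary": 0, "had": 1, "a": 2, "little": 3, "lamb": 4}
words = ["mary", "had", "a", "little", "lamb", "little", "lamb"]
bow = bag_of_words(words, word_dict)
print(bow)
```

Each element of the result is simply the number of times the corresponding word occurs, and the order of the words in the text is lost.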

A more Pythonic (and efficient) way of producing the same result is to use the __Counter__ class from the standard collections module:

In [8]:
# Print all the words in the text

In [9]:
# Use Counter to represent the text.
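
For illustration, here is how `Counter` behaves on a short word list (the list itself is just an example):

```python
from collections import Counter

words = ["mary", "had", "a", "little", "lamb", "little", "lamb"]
word_counts = Counter(words)
print(word_counts)

# Words that never occur have a count of zero instead of raising KeyError
print(word_counts["snow"])
```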

In [10]:
# Create and print a word dictionary using the above result.

You may notice that the results of this approach differ slightly from the previous ones: the words are mapped to different integer ids, but the corresponding counts are the same:

In [11]:
# Check that both approaches give the same results.
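
One way to sketch such a check is to compare the two representations word by word rather than position by position. The two id assignments below (first appearance versus alphabetical order) are assumptions chosen only to make the ids differ.

```python
from collections import Counter

words = ["mary", "had", "a", "little", "lamb", "little", "lamb"]

# Approach 1: ids by first appearance, counts stored in a list
word_dict = {}
for word in words:
    word_dict.setdefault(word, len(word_dict))
bow = [0] * len(word_dict)
for word in words:
    bow[word_dict[word]] += 1

# Approach 2: counts from Counter, ids assigned alphabetically
counts = Counter(words)
alt_dict = {word: i for i, word in enumerate(sorted(counts))}

# The ids differ, but each word has the same count in both approaches
same_counts = all(bow[word_dict[word]] == counts[word] for word in counts)
print(word_dict != alt_dict, same_counts)
```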

## Term Frequency

The bag of words vector representation introduced above relies simply on the frequency of occurrence of each word. Following a long tradition of giving fancy names to simple ideas, this is known as __Term Frequency__.

Intuitively, we expect that the frequency with which a given word is mentioned should correspond to the relevance of that word for the piece of text we are considering. For example, **Mary** is a pretty important word in our little nursery rhyme and indeed it is the one that occurs the most often:

In [12]:
# Print the words in the sorted order of their frequency
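
A short sketch of this ranking using `Counter.most_common`. For brevity, the tokenization here uses a regular expression that keeps only runs of letters, which is a simplification of the tokenizer described earlier.

```python
import re
from collections import Counter

text = """Mary had a little lamb, little lamb,
    little lamb. Mary had a little lamb
    whose fleece was white as snow.
    And everywhere that Mary went
    Mary went, Mary went. Everywhere
    that Mary went,
    The lamb was sure to go"""

# Simplified tokenization: lowercase runs of letters
words = re.findall(r"[a-z]+", text.lower())
counts = Counter(words)

# most_common() returns (word, count) pairs sorted by decreasing count
for word, count in counts.most_common(5):
    print(word, count)
```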

However, it's hard to draw conclusions from such a small piece of text. Let us consider a significantly larger piece of text.

In [13]:
# Import and extract words from text8 file

Now let's take a look at the first 50 words in this large corpus:

In [14]:
# Print first 50 words

And the top 10 most common words:

In [15]:
# Use counter and print top 10 most common words with their count.

We can observe that the most common words are not particularly meaningful. Indeed, this is a common occurrence in Natural Language Processing. The most frequent words are typically auxiliaries required by grammatical rules.
On the other hand, there is also a large number of words that occur very infrequently, as can easily be seen by glancing at the word frequency distribution.

In [16]:
# Create a plot for word frequency distribution for the above data.

## Stopwords

One common technique to simplify NLP tasks is to remove what are known as stopwords: words that are very frequent but not meaningful. If we simply remove the 100 most common words, we significantly reduce the amount of data we have to consider while losing little information.

In [17]:
# Clean the text and remove stopwords.
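
The removal step described above can be sketched as follows. The toy sentence and the cutoff of 2 are stand-ins for the large corpus and the cutoff of 100, and `remove_most_common` is an illustrative name.

```python
from collections import Counter

def remove_most_common(words, n):
    """Drop the n most frequent words, treating them as stopwords."""
    stopwords = {word for word, _ in Counter(words).most_common(n)}
    return [word for word in words if word not in stopwords]

words = "the cat and the dog and the bird saw the mat".split()
cleaned = remove_most_common(words, 2)
print(cleaned)
print(len(words), "->", len(cleaned))
```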

In [18]:
# Print first 50 words after cleaning the text.

We see that the dataset size was significantly reduced.

In practice, we don't simply remove the most common words in our corpus but rather use a manually curated list of stopwords. Lists for dozens of languages and applications can easily be found online.

## Term Frequency/Inverse Document Frequency

One way of determining the relative importance of a word is to see how often it appears across multiple documents. Words that are relevant to a specific topic are more likely to appear in documents about that topic and much less in documents about other topics. On the other hand, less meaningful words (like **the**) will be common across documents about any subject.

To measure the document frequency of a word we will need to have multiple documents. For the sake of simplicity, let us treat each sentence of our nursery rhyme as an individual document:

In [19]:
# Print the text.

In [20]:
# Reform the data to treat each sentence as a document.
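
A minimal sketch of this step, assuming that sentences end only at full stops (true for this rhyme, but not for text in general) and using a simplified regular-expression tokenizer:

```python
import re

text = """Mary had a little lamb, little lamb,
    little lamb. Mary had a little lamb
    whose fleece was white as snow.
    And everywhere that Mary went
    Mary went, Mary went. Everywhere
    that Mary went,
    The lamb was sure to go"""

# Each sentence becomes one document, stored as a list of lowercase words
corpus = [
    re.findall(r"[a-z]+", sentence.lower())
    for sentence in text.split(".")
    if sentence.strip()
]

for document in corpus:
    print(document)
```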

In [21]:
# Print the corpus

Let us now calculate the number of documents in which each word appears:

In [22]:
# Compute and print the number of documents in which each word appears.

In [23]:
# Print the Inverse Document Frequency to gather more insights
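
Both quantities can be sketched in a few lines. The corpus below is the rhyme already split into four tokenized sentences, and the inverse document frequency is assumed to follow the common definition log(N / df), where N is the number of documents; other variants (for example, with smoothing) exist.

```python
import math
from collections import Counter

corpus = [
    ["mary", "had", "a", "little", "lamb", "little", "lamb", "little", "lamb"],
    ["mary", "had", "a", "little", "lamb", "whose", "fleece", "was", "white", "as", "snow"],
    ["and", "everywhere", "that", "mary", "went", "mary", "went", "mary", "went"],
    ["everywhere", "that", "mary", "went", "the", "lamb", "was", "sure", "to", "go"],
]

# Document frequency: count each word at most once per document
df = Counter()
for document in corpus:
    df.update(set(document))

# Inverse document frequency, assuming idf = log(N / df)
n_docs = len(corpus)
idf = {word: math.log(n_docs / count) for word, count in df.items()}

print(df["mary"], idf["mary"])
print(df["snow"], idf["snow"])
```

A word that appears in every document, such as "mary" here, gets an inverse document frequency of zero, while a word confined to a single document gets the largest value.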

When we multiply the term frequency of each word by its inverse document frequency, we get a good measure of how relevant a word is to understanding the meaning of a specific document.

In [24]:
# Print the vector representation of the words after doing the above steps.
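
A sketch of the full computation, under the same assumptions as before: raw counts as term frequency and log(N / df) as inverse document frequency. The function name `tf_idf` and the dictionary-per-document output format are illustrative choices.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: tf * idf} dictionary per document."""
    n_docs = len(corpus)
    df = Counter(word for document in corpus for word in set(document))
    idf = {word: math.log(n_docs / count) for word, count in df.items()}
    return [
        {word: tf * idf[word] for word, tf in Counter(document).items()}
        for document in corpus
    ]

corpus = [
    ["mary", "had", "a", "little", "lamb", "little", "lamb", "little", "lamb"],
    ["mary", "had", "a", "little", "lamb", "whose", "fleece", "was", "white", "as", "snow"],
    ["and", "everywhere", "that", "mary", "went", "mary", "went", "mary", "went"],
    ["everywhere", "that", "mary", "went", "the", "lamb", "was", "sure", "to", "go"],
]

scores = tf_idf(corpus)
for document_scores in scores:
    print(document_scores)
```

Although "mary" occurs three times in the third sentence, its score there is zero because it appears in every document and so does nothing to distinguish one from another.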

## Porter Stemmer

There is still, however, one issue with our approach to representing text. Since we treat each word as a unique token and completely independently from all others, for large documents we will end up with many variations of the same word such as verb conjugations, the corresponding adverbs and nouns, etc. 

One way around this difficulty is to use a stemming algorithm to reduce words to their root (or stem) version.

The stem of a word is what is left of that word after a specific ending has been removed. A function to do this is easy to implement:

In [25]:
# Write a function to do this.
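
A minimal sketch of such a function; the name `stem_of` and the choice of returning `None` when the ending does not match are assumptions made for illustration.

```python
def stem_of(word, suffix):
    """Return what is left of word after removing suffix, or None if it doesn't match."""
    if not word.endswith(suffix):
        return None
    # Slice by length so that an empty suffix leaves the word unchanged
    return word[:len(word) - len(suffix)]

print(stem_of("caresses", "sses"))
print(stem_of("ponies", "ies"))
print(stem_of("cat", "ing"))
```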

In [26]:
# Write a function to clean the text based on the ending of words, vowels etc.
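
The Porter algorithm decides whether an ending may be removed by looking at the pattern of vowels and consonants in the stem. The sketch below shows two helpers following Porter's definitions: a letter is a consonant unless it is a, e, i, o, u, or a "y" preceded by a consonant, and the measure m counts the vowel-to-consonant transitions in a stem. The function names are illustrative and may not match the ones intended for this exercise.

```python
def is_consonant(word, i):
    """True if the letter at position i acts as a consonant."""
    letter = word[i]
    if letter in "aeiou":
        return False
    if letter == "y":
        # 'y' is a consonant at the start of a word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the vowel-consonant sequences in stem (Porter's m)."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m

for word in ["tree", "trouble", "troubles"]:
    print(word, measure(word))
```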

Finally, we define a function to apply a specific rule to a word or stem:

In [27]:
# Apply these functions to the text.
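
Porter's rules have the form "(condition) S1 -> S2": if a word ends in S1 and the stem before S1 satisfies the condition, S1 is replaced by S2. A hedged sketch of applying one such rule follows. Here `apply_rule` and `contains_vowel` are illustrative names, and `contains_vowel` is a simplification that ignores the special role of "y".

```python
def apply_rule(word, suffix, replacement, condition=None):
    """Replace suffix with replacement if the word matches and the stem passes condition."""
    if not word.endswith(suffix):
        return word
    stem = word[:len(word) - len(suffix)]
    if condition is not None and not condition(stem):
        return word
    return stem + replacement

def contains_vowel(stem):
    # Simplification: treats only a, e, i, o, u as vowels
    return any(letter in "aeiou" for letter in stem)

print(apply_rule("caresses", "sses", "ss"))
print(apply_rule("ponies", "ies", "i"))
print(apply_rule("motoring", "ing", "", contains_vowel))
print(apply_rule("sing", "ing", "", contains_vowel))
```

The last example is left unchanged because the stem "s" contains no vowel, which is how the condition prevents short words from being over-stemmed.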

In [28]:
# Experiment with this on various words.