# Lab 02 - Feature Vectors and Language Models
In this lab we will continue working with words as features, but the focus will be to build language models.

First, let's make sure the libraries we'll be using are installed and initialized:

In [None]:
%pip install numpy pandas nltk matplotlib
%matplotlib inline


## Words as Feature Vectors
First, let's look at some text collections directly available in [NLTK](https://www.nltk.org).

To explore the corpora available in NLTK, we first download them and then list the installed resources. The documentation on NLTK's [website](https://www.nltk.org/book/ch02.html) gives more details on each corpus.

There are many interesting collections, such as the Gutenberg collection of books, the Brown collection of news, novels and other stories, and the USA presidential inaugural speeches.

Let's start by downloading the resources. NLTK's `download()` command will launch a GUI, or a CLI to let you select the data you want to install.

**Use it to download the `inaugural` Corpus.**

In [None]:
import nltk

# Opens the interactive downloader; to skip it, call nltk.download("inaugural") directly
nltk.download()

Now we should be able to list the downloaded resources as well.

In [None]:
import os

print(os.listdir(nltk.data.find("corpora")))

So to start with, let's experiment with the USA Presidential inaugural speeches.

In [None]:
from nltk.corpus import inaugural
print(inaugural.fileids())

Let's print Trump's speech!

In [None]:
doc = "2017-Trump.txt"
print(inaugural.raw(doc))

NLTK corpora give us all the words and sentences, as well as other statistics, out of the box.

In [None]:
print(inaugural.words(doc))

In [None]:
nltk.download("punkt")
nltk.download("punkt_tab")  # resource name used by newer NLTK releases
print(inaugural.sents(doc))

## Challenge 01
Given this incomplete function, write the necessary `TODO X` code to let the function return the total number of words and the total number of distinct words, for a given document name in that corpus.

In [None]:
def calculate_inaugural_stats(doc):
    # TODO 1 - Get the pre-tokenised list of words from the inaugural corpus
    doc_words = ...

    # TODO 2 - Calculate the total number of words
    num_words = ...

    # TODO 3 - Calculate the total number of distinct words (vocabulary)
    vocab = ...

    return num_words, vocab

Now let's test it!

In [None]:
def print_inaugural_stats(speech_name):
    tokens, vocab = calculate_inaugural_stats(speech_name)
    print(f"Num words in {speech_name}: {tokens}")
    print(f"Vocabulary size: {vocab}")

print_inaugural_stats("2017-Trump.txt")

Let's compare Trump's speech against Obama's inaugural speech...

In [None]:
print_inaugural_stats("2009-Obama.txt")

## Challenge 02
Complete the missing code (`TODO X`) in the function to calculate the average word length (i.e. number of characters per word) over the distinct words (word types) of a given document.

In [None]:
def calculate_inaugural_word_stats(doc):
    doc_words = inaugural.words(doc)

    # TODO 1 - Construct a list that contains the word lengths for each DISTINCT word in the document
    vocab_lengths = ...

    # TODO 2 - Find the average word type length
    avg_vocab_length = ...

    return avg_vocab_length

Let's try it!

In [None]:
speech_name = "2017-Trump.txt"
avg_length = calculate_inaugural_word_stats(speech_name)
print(f"Average word length for {speech_name}: {avg_length:.2f} characters long")

Now it will be interesting to look at the word distributions and see how the two Presidents compare. NLTK again has a nice class for that, `FreqDist`!

In [None]:
from nltk import FreqDist

obama_words = inaugural.words("2009-Obama.txt")
trump_words = inaugural.words("2017-Trump.txt")

# Construct a frequency distribution over the lowercased words in the document
fd_obama = FreqDist(w.lower() for w in obama_words)
# Find the top 50 most frequently used words in the speech
print("\nOBAMA\n", fd_obama.most_common(50))

# Construct a frequency distribution over the lowercased words in the document
fd_trump = FreqDist(w.lower() for w in trump_words)
# Find the top 50 most frequently used words in the speech
print("\nTRUMP\n", fd_trump.most_common(50))

As you might have expected... popular words in Trump's speech are: `will`, `america`, `american`, `people`, `country`, `again`... :-)

Now let's plot the distributions!

In [None]:
fd_obama.plot(50)
fd_trump.plot(50)

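Those distributions are typical rather than unusual: most documents and corpora follow a very similar long-tailed curve, in which a few words are very frequent and most words are rare (the pattern described by Zipf's law).

Let's now compare some word frequencies between the two Presidents.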

In [None]:
print(f"Obama -> peace: {fd_obama['peace']} - america: {fd_obama['america']}")
print(f"Trump -> peace: {fd_trump['peace']} - america: {fd_trump['america']}")

## Challenge 03
Let's build a similar function that finds the `k` most frequent words in a document, this time using the `Counter` class that we used in the previous lab. Complete the `TODO X` sections.

In [None]:
from collections import Counter

def get_top_freq(doc, k=50):
    doc_words = inaugural.words(doc)
    
    # TODO 1 - Construct a frequency distribution over the words in the document, ensuring all words are lowercase
    fd_doc_words = ...
    
    # TODO 2 - Find the top k most frequently used words in the document
    top_words = ...

    return top_words

Now let's test it!

In [None]:
print(f"Top 50 words for Trump's 2017 speech:\n{get_top_freq('2017-Trump.txt')}")

## Challenge 04
Now let's try to build a TFIDF feature vector!

The first thing is to calculate the Term Frequency (TF).

There are different ways to calculate the Term Frequency. Try to implement the formula $tf_{t,d}=\frac{count(t,d)}{count(d)}$ that we used before in the lecture. Complete the `TODO X` section.
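
As a quick sanity check of the formula, consider a hypothetical six-token document, `the cat sat on the mat` (a toy example, not taken from the corpus). The token `the` occurs twice and `cat` once, so:

$$
tf_{\text{the},d} = \frac{2}{6} \approx 0.333, \qquad tf_{\text{cat},d} = \frac{1}{6} \approx 0.167
$$

Summed over every distinct token in a document, term frequencies defined this way add up to 1.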

In [None]:
def calculate_tf(token_count, bow):
    tf = {}
    num_bow = len(bow)

    for token, count in token_count.items():
        # TODO - Calculate the term frequency using the formula:
        # "count of term in the document" / "total number of words in the document"
        tf[token] = ...

    return tf

Let's try it!

In [None]:
tokens_01 = inaugural.words("2017-Trump.txt")
tokens_02 = inaugural.words("2009-Obama.txt")
vocab = set(tokens_01).union(set(tokens_02))

def _get_tf(tokens, vocab):
    token_count = dict.fromkeys(vocab, 0)
    for token in tokens:
        token_count[token] += 1
    return calculate_tf(token_count, tokens)

tf_01 = _get_tf(tokens_01, vocab)
tf_02 = _get_tf(tokens_02, vocab)


## Challenge 05
Now that we have our term frequency, let's calculate the Inverse Document Frequency (IDF) for a list of documents.

We are going to use the original formula here: $\log{\frac{N}{n_t}}$, where $N$ is the number of documents and $n_t$ is the number of documents that contain the term $t$. Complete the `TODO X` sections.
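
To get a feel for the values this formula produces, here is a minimal sketch that evaluates it by hand for the two-document case used in this lab. It assumes the natural logarithm (as `math.log` computes) and the variable names are illustrative only.

```python
import math

N = 2  # number of documents, e.g. two speeches

# A term that occurs in both documents: log(2 / 2)
idf_shared = math.log(N / 2)

# A term that occurs in only one document: log(2 / 1)
idf_unique = math.log(N / 1)

print(f"IDF of a term found in both documents: {idf_shared:.3f}")
print(f"IDF of a term found in one document:   {idf_unique:.3f}")
```

With only two documents the IDF can take just these two values, so it acts as a switch that separates shared terms from document-specific ones.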

In [None]:
import math

def calculate_idf(docs):
    N = len(docs)
    
    # TODO 1 - Initialise a new dictionary with the keys from the documents and the values set to 0
    idf = ...
    for doc in docs:
        for word, val in doc.items():
            if val > 0:
                # TODO 2 - Increase the idf dictionary counter by one
                ...
    
    for word, val in idf.items():
        idf[word] = math.log(N / float(val))

    return idf

Now let's collect all the IDFs.

In [None]:
tfs = [tf_01, tf_02]
idfs = calculate_idf(tfs)

## Challenge 06
And finally... the TFIDF calculation.

Calculate the TFIDF for all the documents. Complete the `TODO X` section.
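
For reference, the usual definition multiplies the two quantities computed so far. As an illustration with made-up numbers (not taken from the speeches, and assuming the natural logarithm), suppose a term has $tf_{t,d} = 0.01$ and occurs in only one of $N = 2$ documents:

$$
tfidf_{t,d} = tf_{t,d} \times \log{\frac{N}{n_t}} = 0.01 \times \log{\frac{2}{1}} \approx 0.0069
$$

A term that occurs in both documents has $\log{\frac{2}{2}} = 0$, so its TFIDF is 0 regardless of its term frequency.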

In [None]:

def calculate_tfidf(tfs, idfs):
    tfidf = {}

    for word, val in tfs.items():
        # TODO - Calculate and store the tfidf
        ...

    return tfidf

Let's test the TFIDFs!

In [None]:
tfidf_01 = calculate_tfidf(tf_01, idfs)
tfidf_02 = calculate_tfidf(tf_02, idfs)
print(f"Trump: {tfidf_01}")
print(f"\nObama: {tfidf_02}")

## Challenge 07
Let's try it and then visualise it as a DataFrame!

Try to fit the two dictionaries into a single DataFrame so that we can visualise it better. Complete the `TODO X` section.

In [None]:
import pandas as pd

# TODO
df = ...
df

Now we have TFIDF feature vectors!

## Language Model Experiments
Let's play with some popular n-gram language models. We are going to use NLTK again for that and try to predict the sequence of words.

First, let's import some libraries.

In [None]:
from nltk.corpus import gutenberg
from nltk.lm import WittenBellInterpolated, MLE, Laplace
from nltk.util import ngrams, pad_sequence, everygrams

Next we will build a function that fits an n-gram language model using Maximum Likelihood Estimation (MLE).

In [None]:
def build_mle_estimator(doc_name, n):
    # Construct a list of lowercase words from the document
    words = [w.lower() for w in inaugural.words(doc_name)]
    
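    # Generate all n-grams up to length n (named so it does not shadow the imported `ngrams`)
    train_ngrams = list(everygrams(words, max_len=n))

    # Fit an n-gram language model by maximum likelihood estimation
    lm = MLE(n)
    lm.fit([train_ngrams], vocabulary_text=words)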
    print(lm.vocab)
    
    return lm

Build the estimator first.

In [None]:
lm = build_mle_estimator("2017-Trump.txt", 2)

Now let's try it!

In [None]:
def print_lm_scores(lm):
    print(f"Probability of 'first' followed by 'america': {lm.score(word='america', context=['first']):.5f}")
    print(f"Probability of 'america' followed by 'first': {lm.score(word='first', context=['america']):.5f}")

    print(f"Probability of 'you' followed by 'thank': {lm.score(word='thank', context=['you']):.5f}")
    print(f"Probability of 'thank' followed by 'you': {lm.score(word='you', context=['thank']):.5f}")

print_lm_scores(lm)
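
To make these scores less of a black box, the sketch below computes bigram MLE probabilities by hand with the standard library only. It is a simplified illustration on a made-up token sequence, not NLTK's implementation; the helper name `mle_bigram_prob` is hypothetical.

```python
from collections import Counter

# Toy token sequence (made up for illustration, not taken from the corpus)
tokens = "we will make america strong again we will make america safe again".split()

# Count adjacent word pairs, and how often each word appears as a context
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)
context_counts = Counter(prev for prev, _ in bigrams)

def mle_bigram_prob(word, context):
    """Estimate P(word | context) as count(context, word) / count(context)."""
    if context_counts[context] == 0:
        return 0.0
    return bigram_counts[(context, word)] / context_counts[context]

print(mle_bigram_prob("will", "we"))         # 'we' is always followed by 'will'
print(mle_bigram_prob("strong", "america"))  # 'america' is followed by 'strong' half the time
print(mle_bigram_prob("again", "america"))   # a bigram that never occurs
```

Note how any bigram that never appears in the training text receives a probability of exactly zero under MLE.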

## Challenge 08
Try an add-one Laplace smoothing model instead.
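
As background, add-one smoothing adds 1 to every bigram count and adds the vocabulary size $V$ to the context count, so unseen bigrams no longer get a probability of zero. The sketch below shows the arithmetic on a made-up token sequence using the standard library only. The helper name `laplace_bigram_prob` is hypothetical, and NLTK's own model may give different numbers because of how it builds its vocabulary (for instance its handling of unknown words).

```python
from collections import Counter

# Toy token sequence (made up for illustration, not taken from the corpus)
tokens = "we will make america strong again we will make america safe again".split()

bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)
context_counts = Counter(prev for prev, _ in bigrams)
vocab = set(tokens)
vocab_size = len(vocab)  # V, the number of distinct words

def laplace_bigram_prob(word, context):
    """Estimate P(word | context) as (count(context, word) + 1) / (count(context) + V)."""
    return (bigram_counts[(context, word)] + 1) / (context_counts[context] + vocab_size)

print(f"{laplace_bigram_prob('strong', 'america'):.3f}")  # seen bigram
print(f"{laplace_bigram_prob('again', 'america'):.3f}")   # unseen bigram, now non-zero

# The smoothed probabilities for a given context still sum to 1 over the vocabulary
print(sum(laplace_bigram_prob(w, "america") for w in vocab))
```

Smoothing takes a little probability mass away from the bigrams that were observed and spreads it over those that were not.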

In [None]:
def build_laplace_estimator(doc_name, n):
    # TODO - Implement a function similar to `build_mle_estimator` that instead uses an add-one Laplace smoothing model.
    # Hint: you might want to check the NLTK documentation (https://www.nltk.org/api/nltk.lm.html) on that!
    ...

    return lm

Build this estimator as well.

In [None]:
lm2 = build_laplace_estimator('2017-Trump.txt', 2)

Now test it and observe any differences in the results.

In [None]:
print_lm_scores(lm2)