# Solo work for week 3

## Word meaning in the seasons corpus

In class, we built a termxdocument matrix of the seasons data. Each *column* of that matrix corresponded to a document - it was the document vector for that document. In essence, we were understanding each column as the *meaning* of the corresponding document.

Now get ready to have your mind blown: Just as we understood each column in the termxdocument matrix as the meaning of a document, we can understand each row in the matrix as the meaning of the corresponding *term*. The idea here is that the meaning of a term can be understood as the union of all of the contexts in which that word appears.
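
To make the row/column idea concrete, here is a tiny sketch with made-up counts. The names `toy_terms`, `toy_docs`, and `toy_td` are hypothetical and have nothing to do with the seasons data.

```python
import numpy as np

# Hypothetical toy data: 3 terms (rows) counted in 4 documents (columns).
toy_terms = ["sun", "earth", "tilt"]
toy_docs = ["doc_a", "doc_b", "doc_c", "doc_d"]
toy_td = np.array([
    [2, 0, 1, 3],   # counts of "sun" in each document
    [1, 1, 0, 2],   # counts of "earth" in each document
    [0, 4, 0, 1],   # counts of "tilt" in each document
])

# A column is a document vector: one entry per term.
doc_b_vector = toy_td[:, toy_docs.index("doc_b")]

# A row is a term vector: one entry per document.
tilt_vector = toy_td[toy_terms.index("tilt"), :]

print(doc_b_vector)  # length = number of terms
print(tilt_vector)   # length = number of documents
```

Reading across the "tilt" row tells you which documents the word shows up in, and how often. That list of contexts is what we are treating as the word's meaning.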

You are going to explore this idea a bit. I actually haven't tried this myself, so I'm not sure anything interesting will fall out. Maybe that's a poor way to design an instructional activity. I think it's more fun this way.

### To make your lives easier, I have copied over the first parts of notebook 10.

**Load the corpus**

In [None]:
from seasons_module import load_seasons_corpus
seasons_corpus = load_seasons_corpus()

**Compile the vocabulary in the usual way.**

In [None]:
set_vocab = set([])
for fname in seasons_corpus.keys():
    set_vocab = set_vocab.union(set(seasons_corpus[fname][0]))
# read the stop list, closing the file when done
with open("lists/seasons_stop_list.txt") as f:
    stop_list = set(f.read().split("\n"))
# keep every vocabulary word that is not on the stop list
pruned_vocab = set(w for w in set_vocab if w not in stop_list)

**Compute the corpus and document frequency for each term.**

In [None]:
import nltk
word_fdist = nltk.FreqDist()  # the corpus frequencies
doc_fdist = nltk.FreqDist()  # the document frequencies
for word in pruned_vocab:
    word_fdist[word] = 0
    doc_fdist[word] = 0
    for name in seasons_corpus.keys():
        if word in seasons_corpus[name][0]:
            doc_fdist[word] += 1
            word_fdist[word] += seasons_corpus[name][0].count(word)

**Create a very small vocabulary**

Just 10 words, to make it more simple to understand what's going on.

In [None]:
small_vocab = [w[0] for w in word_fdist.most_common(10)]
print(small_vocab)

**Compute the weighted document vector for each document**

Same as before, but now using our smaller vocabulary.

In [None]:
def tf(tf, df, cf, N):
    return tf

def logtf(tf, df, cf, N):
    if tf == 0:
        result = 0
    else:
        result = (1 + np.log(tf))
    return result

def onehot(tf, df, cf, N):
    if tf == 0:
        return 0
    else:
        return 1

def tfidf(tf, df, cf, N):
    if tf == 0:
        result = 0
    else:
        result = (1 + np.log(tf)) * np.log(N  / df)
    return result

def compute_vector(words, vocab, df, N, weight_function):
    new_vector = []
    for w in vocab:
        tf = words.count(w)
        new_vector.append(weight_function(tf, df[w], 0, N))
    return norm_vec(np.array(new_vector))
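
The cells below use plain `tf`, but it may help to see what the tf-idf weighting does with a couple of made-up numbers. This is only a sketch: `tfidf_weight` is a hypothetical restatement of the `tfidf` formula above (without the unused `cf` argument), so that the cell runs on its own.

```python
import numpy as np

def tfidf_weight(term_count, doc_count, n_docs):
    """Same formula as `tfidf` above: (1 + log tf) * log(N / df)."""
    if term_count == 0:
        return 0.0
    return (1 + np.log(term_count)) * np.log(n_docs / doc_count)

# Made-up numbers: a term that occurs 3 times in one document.
rare = tfidf_weight(3, doc_count=2, n_docs=10)     # term appears in 2 of 10 documents
common = tfidf_weight(3, doc_count=10, n_docs=10)  # term appears in every document

print(rare, common)
```

The term that appears in every document gets a weight of zero, because log(10 / 10) is zero. The same count for a rarer term gets a positive weight.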

In [None]:
import numpy as np
def norm_vec(v):
    return v / np.linalg.norm(v)
np.set_printoptions(precision=3)

In [None]:
# compute the document vector for each document
doc_vectors = {}
N = len(seasons_corpus.keys())
wf = tf
for fname in seasons_corpus.keys():
    doc_vectors[fname] = compute_vector(seasons_corpus[fname][0], small_vocab, doc_fdist, N, wf)

In [None]:
print(len(doc_vectors), len(doc_vectors['angelapre']))

**Create a termxdocument matrix**

This is a matrix where every row corresponds to a word in the vocabulary, and every column corresponds to a document.

Another way to say this: Each column in the matrix is the document vector for a document.

In [None]:
td_matrix = np.zeros([len(small_vocab), len(doc_vectors)])
i = 0
name_index = {}
name_list = []
for name in doc_vectors.keys():
    td_matrix[:, i] = doc_vectors[name]
    name_index[name] = i
    name_list += [name]
    i = i + 1

### Here is where things get new

The line below is what we executed to compute a matrix reflecting the documentxdocument similarity. You want to modify this line so that it instead computes a matrix showing the termxterm similarity. Let's call it `tt_matrix`.

In [None]:
dd_matrix = np.dot(td_matrix.transpose(), td_matrix)

Once you have the termxterm similarity, here are some additional things to do. Go as far as you like in these tasks. If you get bored of this, you can try another set of tasks I'll put below.

1. Look at `tt_matrix`. Are the term similarities what you expected?
2. It might help to make a heatmap of the sort we made when looking at word co-occurrences. Try that.
3. Above we used only the 10 most common terms. Expand this a bit to include some more terms. You might want to think a bit about what terms would be interesting to compare, and make sure that you have included those terms.

In [None]:
tt_matrix = np.dot(td_matrix, td_matrix.transpose())
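
If it isn't obvious why moving the transpose switches from documents to terms, checking the shapes on a small example can help. This is a sketch on a made-up matrix; `toy_td`, `toy_dd`, and `toy_tt` are hypothetical names.

```python
import numpy as np

# Hypothetical toy matrix: 3 terms (rows) by 4 documents (columns).
toy_td = np.array([
    [2.0, 0.0, 1.0, 3.0],
    [1.0, 1.0, 0.0, 2.0],
    [0.0, 4.0, 0.0, 1.0],
])

# Transpose on the left: entry (i, j) is the dot product of column i and column j.
toy_dd = np.dot(toy_td.transpose(), toy_td)

# Transpose on the right: entry (i, j) is the dot product of row i and row j.
toy_tt = np.dot(toy_td, toy_td.transpose())

print(toy_dd.shape)  # one row and one column per document
print(toy_tt.shape)  # one row and one column per term
```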

In [None]:
def round_matrix(the_matrix, prec = 2):
    # round a copy, so the matrix passed in keeps its full precision
    # (rounding in place would change what the heatmap below shows)
    return np.round(the_matrix, prec)
# import only the names we use, so sympy does not overwrite other variables
from sympy import Matrix, init_printing
init_printing()

In [None]:
Matrix(round_matrix(tt_matrix))

In [None]:
import matplotlib
import matplotlib.pyplot as plt

def matrix_heatmap(mtx, name_list):
    fig=plt.figure(figsize=(10, 10), dpi= 80, facecolor='w', edgecolor='k')
    n = len(name_list)
    x_tick_marks = np.arange(n)
    y_tick_marks = np.arange(n)
    plt.xticks(x_tick_marks, name_list, fontsize=8, rotation=90)
    plt.yticks(y_tick_marks, name_list, fontsize=8)
    plt.tick_params("x", top=True, labeltop=True, bottom=False, labelbottom=False)
    plt.imshow(mtx, norm=matplotlib.colors.LogNorm(), interpolation='nearest', cmap='YlOrBr')

In [None]:
matrix_heatmap(tt_matrix, small_vocab)

### Create a termxdocument matrix without norming the document vectors

In [None]:
def compute_vector_n(words, vocab, df, N, weight_function, norm_vector=True):
    new_vector = []
    for w in vocab:
        tf = words.count(w)
        new_vector.append(weight_function(tf, df[w], 0, N))
    if norm_vector:
        return norm_vec(np.array(new_vector))
    else:
        return np.array(new_vector)

In [None]:
# compute the document vector for each document
doc_vectors = {}
N = len(seasons_corpus.keys())
wf = tf
for fname in seasons_corpus.keys():
    doc_vectors[fname] = compute_vector_n(seasons_corpus[fname][0], small_vocab, doc_fdist, N, wf, False)
td_matrix = np.zeros([len(small_vocab), len(doc_vectors)])
i = 0
name_index = {}
name_list = []
for name in doc_vectors.keys():
    td_matrix[:, i] = doc_vectors[name]
    name_index[name] = i
    name_list += [name]
    i = i + 1

#### Normalize the word vectors (the rows)

In [None]:
for r in range(len(small_vocab)):
    td_matrix[r, :] = norm_vec(td_matrix[r, :])

In [None]:
tt_matrix = np.dot(td_matrix, td_matrix.transpose())
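
Why norm the rows? Once every term vector has length 1, the dot product of two rows is their cosine similarity, so the numbers stop depending on how frequent each word is. Here is a small sketch with made-up rows (`toy_td`, `toy_normed`, and `toy_tt` are hypothetical names):

```python
import numpy as np

toy_td = np.array([
    [2.0, 0.0, 1.0, 3.0],
    [4.0, 0.0, 2.0, 6.0],   # same direction as row 0, but twice as long
    [0.0, 5.0, 0.0, 0.0],   # shares no documents with row 0
])

# Divide each row by its own length, which is what the loop with norm_vec does.
row_lengths = np.linalg.norm(toy_td, axis=1, keepdims=True)
toy_normed = toy_td / row_lengths

toy_tt = np.dot(toy_normed, toy_normed.transpose())
print(np.round(toy_tt, 3))
```

Rows 0 and 1 come out with similarity 1 even though one word is twice as frequent as the other, and rows 0 and 2 come out with similarity 0.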

In [None]:
matrix_heatmap(tt_matrix, small_vocab)

### Let's try a somewhat larger vocabulary
You can set the size of the vocabulary in the next line

In [None]:
vocab_size = 25

In [None]:
medium_vocab = [w[0] for w in word_fdist.most_common(vocab_size)]
doc_vectors = {}
N = len(seasons_corpus.keys())
wf = tf
for fname in seasons_corpus.keys():
    doc_vectors[fname] = compute_vector_n(seasons_corpus[fname][0], medium_vocab, doc_fdist, N, wf, False)
td_matrix = np.zeros([len(medium_vocab), len(doc_vectors)])
i = 0
name_index = {}
name_list = []
for name in doc_vectors.keys():
    td_matrix[:, i] = doc_vectors[name]
    name_index[name] = i
    name_list += [name]
    i = i + 1
for r in range(len(medium_vocab)):
    td_matrix[r, :] = norm_vec(td_matrix[r, :])
tt_matrix = np.dot(td_matrix, td_matrix.transpose())

In [None]:
matrix_heatmap(tt_matrix, medium_vocab)

And if you're really ambitious:

4. In this exercise, we are understanding a term's meaning as the collection of contexts in which it appears. So far we have been defining a context simply as the entire document in which the word appears. But there are many other ways we could define a context. The task here is to try some other contexts. One possibility is to look at the collection of utterances in which the word appears. In that case, we'd be treating each utterance as a document when we make our termxdocument matrix. To facilitate this a bit, I made a function that loads the corpus for you, but gives each transcript as a list of tokenized utterances. It's invoked in the next cell.
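
To picture what "each utterance is a document" means before loading the real data, here is a sketch using a few made-up utterances. The names `toy_utterances`, `toy_vocab`, and `toy_tu` are hypothetical, and the real loader's output may be organized differently.

```python
import numpy as np

# Made-up tokenized utterances standing in for one transcript.
toy_utterances = [
    ["the", "sun", "is", "closer", "in", "summer"],
    ["the", "earth", "tilts"],
    ["so", "the", "sun", "hits", "the", "earth", "more", "directly"],
]
toy_vocab = ["sun", "earth", "summer"]

# One row per vocabulary term, one column per utterance.
toy_tu = np.zeros([len(toy_vocab), len(toy_utterances)])
for col, utterance in enumerate(toy_utterances):
    for row, term in enumerate(toy_vocab):
        toy_tu[row, col] = utterance.count(term)

print(toy_tu)
```

With contexts this short, most entries are zero, so two words only look similar when they turn up in the very same utterances.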

In [None]:
from seasons_module import load_seasons_corpus_as_utterances
seasons_corpus_with_utterances = load_seasons_corpus_as_utterances()

In [None]:
all_utterances = []
for entry in seasons_corpus_with_utterances.values():
    all_utterances += entry[0]

In [None]:
vocab_size = 25
medium_vocab = [w[0] for w in word_fdist.most_common(vocab_size)]
doc_vectors = {}
N = len(seasons_corpus.keys())
wf = tf
utterance_vector_list = []
for utterance in all_utterances:
    utterance_vector_list.append(compute_vector_n(utterance, medium_vocab, doc_fdist, N, wf, False))
td_matrix = np.zeros([len(medium_vocab), len(utterance_vector_list)])
i = 0
name_index = {}
name_list = []
for utterance_vector in utterance_vector_list:
    td_matrix[:, i] = utterance_vector
    # utterances have no names, so label each one by its position
    name = "utterance_" + str(i)
    name_index[name] = i
    name_list += [name]
    i = i + 1
for r in range(len(medium_vocab)):
    td_matrix[r, :] = norm_vec(td_matrix[r, :])
tt_matrix = np.dot(td_matrix, td_matrix.transpose())

In [None]:
matrix_heatmap(tt_matrix, medium_vocab)

## Paragraph-paragraph similarity as a measure of text difficulty

Here's something completely different you can try, if you're bored of the seasons corpus.

Our task here is to develop a measure of how difficult a text is by looking at the inter-paragraph cohesion. 
The idea is that we'll:

* construct a vector for every sentence or paragraph in a text
* find the similarity between each sequential pair of paragraphs
* and, finally, find the average of these similarities.

If the average of the similarities is high (i.e., closer to one) then the cohesion is high.
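
The three steps above can be sketched on made-up vectors before touching a real book. Everything here is hypothetical (`unit`, `toy_paragraph_vectors`, and so on); the real version follows in the cells below.

```python
import numpy as np

def unit(v):
    """Scale v to length 1; leave an all-zero vector as it is."""
    length = np.linalg.norm(v)
    return v if length == 0 else v / length

# Made-up count vectors for four consecutive paragraphs over a 3-word vocabulary.
toy_paragraph_vectors = [
    unit(np.array([2.0, 1.0, 0.0])),
    unit(np.array([1.0, 1.0, 0.0])),
    unit(np.array([0.0, 0.0, 3.0])),
    unit(np.array([0.0, 1.0, 1.0])),
]

# Similarity of each paragraph with the one that follows it.
toy_similarities = [
    float(np.dot(a, b))
    for a, b in zip(toy_paragraph_vectors, toy_paragraph_vectors[1:])
]

# The cohesion score is the average of those similarities.
toy_cohesion = np.mean(toy_similarities)
print(toy_similarities, toy_cohesion)
```

Four paragraphs give three successive pairs. The second pair shares no vocabulary at all, so it contributes a similarity of zero and pulls the average down.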

I'm thinking that you'd pick a couple of texts from the Gutenberg corpus and compare their cohesion.

The first step will be to read in one of these texts. For example:

In [None]:
import nltk
mycorpus = nltk.corpus.PlaintextCorpusReader("corpora/gutenberg", 'melville-moby_dick.txt')

Once you've got this, you can access the book as paragraphs or sentences. For example, this gives you the paragraphs:

In [None]:
mycorpus.paras()

Notice that `mycorpus.paras()` gives you a list, where each item in the list is a paragraph.
But each paragraph is itself a list of sentences.

Probably a first step then is to combine the sentences in each paragraph into one long list of words.
I guess I'll do that for you:

In [None]:
paragraphs = []
for para in mycorpus.paras():
    flat_para = []
    for sent in para:
        flat_para += sent
    paragraphs.append(flat_para)

Now `paragraphs` is a long list of paragraphs.
The next step is to convert each of these paragraphs into a vector

First, I'll get a stop list of all of the things we want to ignore.

In [None]:
# read the stop list, closing the file when done
with open("lists/stop-words_english_5_en.txt") as f:
    stop_list = f.read().split("\n")
stop_list += list('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~’')
stop_list += list("abcdefghijklmnopqrstuvwxyz0123456789")
stop_list = set(stop_list)

Next get the word and document frequencies, ignoring words on the stop list

In [None]:
import nltk
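word_fdist = nltk.FreqDist()  # the corpus frequencies
doc_fdist = nltk.FreqDist()  # the document frequencies (here, a document is a paragraph)
for paragraph in paragraphs:
    seen_in_paragraph = set()
    for wordraw in paragraph:
        word = wordraw.lower()
        if word not in stop_list:
            word_fdist[word] += 1
            # count each paragraph at most once per word
            if word not in seen_in_paragraph:
                doc_fdist[word] += 1
                seen_in_paragraph.add(word)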

For the vocabulary, we'll take the 500 most common terms.

In [None]:
vocab_size = 500
vocab = [w[0] for w in word_fdist.most_common(vocab_size)]

Now compute the document vector for each document.

First, I'm creating a new version of `norm_vec` that watches out for zeros.

In [None]:
def norm_vec(v):
    if np.linalg.norm(v) == 0:
        return np.zeros(len(v))
    else:
        return v / np.linalg.norm(v)
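
Here is a quick sketch of why the guard matters. A paragraph that contains none of the vocabulary words has an all-zero vector, and dividing it by its length gives `nan`, which then spoils any average it is part of. `guarded_unit` is a hypothetical stand-in for the guarded `norm_vec`, so that this cell runs on its own.

```python
import numpy as np

def guarded_unit(v):
    """Hypothetical stand-in for the guarded norm_vec above."""
    length = np.linalg.norm(v)
    if length == 0:
        return np.zeros(len(v))
    return v / length

# A paragraph with no vocabulary words has an all-zero count vector.
empty_paragraph = np.zeros(3)
other_paragraph = np.array([1.0, 2.0, 0.0])

# Without the guard, 0 / 0 gives nan (NumPy's warning is silenced for the demo).
with np.errstate(invalid="ignore", divide="ignore"):
    unguarded = empty_paragraph / np.linalg.norm(empty_paragraph)

unguarded_similarity = np.dot(unguarded, other_paragraph)
guarded_similarity = np.dot(guarded_unit(empty_paragraph), guarded_unit(other_paragraph))
print(unguarded_similarity, guarded_similarity)
```

With the guard, an empty paragraph simply counts as having zero similarity to its neighbours.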

In [None]:
paragraph_vectors = []
N = len(paragraphs)
wf = tf
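for paragraph in paragraphs:
    # the vocabulary is lower-cased, so lower-case the paragraph before counting
    lowered = [w.lower() for w in paragraph]
    paragraph_vectors.append(compute_vector(lowered, vocab, doc_fdist, N, wf))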

Next find the dot product between each successive pair of paragraphs.
Then find their average.

That's your measure of coherence.

In [None]:
dot_prods = []
for r in range(len(paragraph_vectors) - 1):
    dot_prods.append(np.dot(paragraph_vectors[r], paragraph_vectors[r + 1]))

In [None]:
np.mean(dot_prods)

### Now let's make the whole process into a function and do it for a different book

In [None]:
def compute_coherence(book_title):
    mycorpus = nltk.corpus.PlaintextCorpusReader("corpora/gutenberg", book_title)
    paragraphs = []
    for para in mycorpus.paras():
        flat_para = []
        for sent in para:
            flat_para += sent
        paragraphs.append(flat_para)
    # read the stop list, closing the file when done
    with open("lists/stop-words_english_5_en.txt") as f:
        stop_list = f.read().split("\n")
    stop_list += list('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~’')
    stop_list += list("abcdefghijklmnopqrstuvwxyz0123456789")
    stop_list = set(stop_list)
    word_fdist = nltk.FreqDist()  # the corpus frequencies
    doc_fdist = nltk.FreqDist()  # the document frequencies (here, a document is a paragraph)
    for paragraph in paragraphs:
        seen_in_paragraph = set()
        for wordraw in paragraph:
            word = wordraw.lower()
            if word not in stop_list:
                word_fdist[word] += 1
                # count each paragraph at most once per word
                if word not in seen_in_paragraph:
                    doc_fdist[word] += 1
                    seen_in_paragraph.add(word)
    vocab_size = 100
    vocab = [w[0] for w in word_fdist.most_common(vocab_size)]
    # compute the document vector for each document
    paragraph_vectors = []
    N = len(paragraphs)
    wf = tf
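    for paragraph in paragraphs:
        # the vocabulary is lower-cased, so lower-case the paragraph before counting
        lowered = [w.lower() for w in paragraph]
        paragraph_vectors.append(compute_vector(lowered, vocab, doc_fdist, N, wf))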
    dot_prods = []
    for r in range(len(paragraph_vectors) - 1):
        dot_prods.append(np.dot(paragraph_vectors[r], paragraph_vectors[r + 1]))
    print(np.mean(dot_prods))

In [None]:
compute_coherence("milton-paradise.txt")

In [None]:
compute_coherence("burgess-busterbrown.txt")

In [None]:
compute_coherence("melville-moby_dick.txt")