# Introduction: Working up to Word Vectors

Today we are going to review a bit and build up more slowly to *word vectors*.

This is important because it will put us in a position to understand some of the more recent developments.

## Some preliminaries

First, some imports:

In [None]:
import numpy as np
import pandas as pd
import nltk
import matplotlib.pyplot as plt

A few useful functions for dealing with vectors and matrices:

In [None]:
def norm_vec(v):
    # Scale v to unit length; a zero vector is returned unchanged.
    mag = np.linalg.norm(v)
    if mag == 0:
        return v
    return v / mag

from sklearn.preprocessing import normalize

def normalize_rows(x):
    return normalize(x, axis=1)

def normalize_columns(x):
    return normalize(x, axis=0)
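
As a quick check on what these helpers do, here is a minimal sketch that uses only NumPy. The toy matrix `m` and the helper `unit_rows` are made up for illustration; `unit_rows` is assumed to give the same result as `normalize_rows` (which uses scikit-learn's default L2 norm) as long as no row is all zeros.

```python
import numpy as np

# Toy matrix, invented for this illustration.
m = np.array([[3.0, 4.0],
              [0.0, 2.0]])

def unit_rows(x):
    # Divide each row by its Euclidean length (assumes no all-zero rows).
    lengths = np.linalg.norm(x, axis=1, keepdims=True)
    return x / lengths

unit = unit_rows(m)
print(unit)                          # rows [0.6, 0.8] and [0.0, 1.0]
print(np.linalg.norm(unit, axis=1))  # each row now has length 1
```

Normalizing the columns works the same way, with the lengths taken down each column (`axis=0`) instead of across each row.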

Finally, some functions for displaying things nicely:

In [None]:
def check_float(potential_float):
    # True if the value can be converted to a float (e.g. 3 or "2.5").
    try:
        float(potential_float)
        return True
    except (ValueError, TypeError):
        return False

def round_if_float(v, prec=3):
    if check_float(v):
        return round(float(v), prec)
    return v

from IPython.display import display, HTML
def list_table(the_list, color_nums=False):
    html = ["<table style= 'border: 1px solid black; display:inline-block'>"]
    for row in the_list:
        html.append("<tr>")
        for col in row:
            if color_nums and check_float(col) and float(col) != 0:
                html.append("<td align='left' style='border: .5px solid gray; color: {1}; font-weight: bold'>{0}</td>".format(round_if_float(col), color_nums))
            else:
                html.append("<td align='left' style='border: .5px solid gray;'>{0}</td>".format(round_if_float(col)))
        html.append("</tr>")
    html.append("</table>")
    return display(HTML(''.join(html)))

def show_labeled_table(mat, col_names=None, row_names=None, nrows=10, ncols=10, color_nums="red"):
    sml = mat[:nrows, :ncols]
    if col_names is not None:
        sml = np.vstack([col_names[:ncols], sml])
    if row_names is not None:
        rnames = [[p] for p in row_names[:nrows]]
        if col_names is not None:
            new_col = np.array([["_"]] + rnames)
        else:
            new_col = np.array(rnames)
        sml = np.hstack((new_col, sml))
    return list_table(sml, color_nums)

# Document vectors, once more

To keep things simple, we are going to make a small number of short documents.

In [1]:
raw_transcript_docs = {
    "d1": "That's because of the sun is in the center and the Earth moves around the sun and the Earth is like at one point in the winter", 
    "d2": "it's like farther away from the sun and towards the summer it's closer it's near, towards the sun.",
    "d3": "The sun's in the middle  and the Earth kind of orbits around it.",
    "d4": "And like say at one - it's probably more of an ovally type thing  In the winter, er probably this will be winter since it's further away",
    "d5": "that's winter would be like, the Earth orbits around the sun .  Like summer is the closest to the sun", 
    "d6": "Spring is kind of a little further away, and then like Fall  is further away then spring but not as far as winter, and then winter is the furthest.",
    "d7": "the sun doesn't, like the flashlight and the bulb, it hits summer, the lines like fade in , they get there closer, like quicker",
    "d8": "And by the time they get there [winter], it fades and it's a lot colder for winter"
}

transcript_doc_names = list(raw_transcript_docs.keys())

We'll load my tokenizer and stop list.

In [None]:
from seasons_module import seasons_tokenize

# Read the stop list (one word per line); the file is closed automatically.
with open("lists/seasons_stop_list.txt") as f:
    stop_list = set(f.read().split("\n"))

We tokenize these short documents.

In [None]:
tokenized_transcript_docs = [seasons_tokenize(doc) for doc in raw_transcript_docs.values()]

Next we make a vocabulary from the 10 most common words in these short documents that are not on the stop list.

In [None]:
fdist = nltk.FreqDist()
for doc in tokenized_transcript_docs:
    fdist.update(doc)

In [None]:
vocab = []
for tup in fdist.most_common(100):
    if tup[0] not in stop_list:
        vocab.append(tup[0])
vocab = vocab[:10]

In [None]:
print(vocab)
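
To see the vocabulary step in isolation, here is a minimal sketch on made-up data. It uses `collections.Counter` in place of `nltk.FreqDist` (which is a `Counter` subclass, so `update` and `most_common` behave the same way). The names `toy_docs`, `toy_stop_list`, and `toy_vocab` are invented for this illustration.

```python
from collections import Counter

# Toy tokenized documents and stop list, invented for this illustration.
toy_docs = [["the", "sun", "is", "in", "the", "center"],
            ["the", "earth", "orbits", "the", "sun"],
            ["the", "earth", "is", "far", "from", "the", "sun"]]
toy_stop_list = {"the", "is", "in", "from"}

# Count every token across all of the documents.
counts = Counter()
for doc in toy_docs:
    counts.update(doc)

# Walk the words from most to least frequent, skip stop words,
# and keep the first two that survive.
toy_vocab = [w for w, _ in counts.most_common() if w not in toy_stop_list][:2]
print(toy_vocab)  # ['sun', 'earth']
```

Note that "the" is by far the most frequent token, which is exactly why the stop list is applied before the cut-off.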

Finally, we compute the document vectors. Each entry counts how many times the corresponding vocabulary word occurs in the document.

In [None]:
def compute_doc_vector(tdoc, vocab):
    return np.array([tdoc.count(w) for w in vocab])


transcript_doc_vectors = []
for tdoc in tokenized_transcript_docs:
    transcript_doc_vectors.append(compute_doc_vector(tdoc, vocab))

Here's what one of these document vectors looks like:

In [None]:
transcript_doc_vectors[0]
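
Since the real output depends on the tokenizer and stop list, here is a small worked example on made-up input. `compute_doc_vector` is repeated from the cell above so the sketch runs on its own; `toy_vocab` and `toy_doc` are invented for this illustration.

```python
import numpy as np

def compute_doc_vector(tdoc, vocab):
    # Same definition as above: one count per vocabulary word.
    return np.array([tdoc.count(w) for w in vocab])

# Toy inputs, invented for this illustration.
toy_vocab = ["sun", "earth", "winter"]
toy_doc = ["the", "earth", "moves", "around", "the", "sun",
           "and", "the", "sun", "is", "hot"]

toy_vector = compute_doc_vector(toy_doc, toy_vocab)
print(toy_vector)  # [2 1 0]
```

The vector has one slot per vocabulary word, in vocabulary order: "sun" occurs twice, "earth" once, and "winter" not at all. Words outside the vocabulary are simply ignored.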

It will often be convenient to take a bunch of these document vectors and put them into a big table.

Here's what that looks like:

In [None]:
transcript_dt_matrix = np.array(transcript_doc_vectors)

show_labeled_table(transcript_dt_matrix, vocab, transcript_doc_names)

In this case, each of the rows corresponds to one of our documents, and each column corresponds to one of the terms in our vocabulary.

We'll call this a document-by-term matrix.

To compare two documents, we pick out the corresponding rows, normalize them, and take their dot product. The result is the cosine of the angle between the two document vectors.

In [None]:
def compare_doc_vectors(docA, docB, dnames, dt_matrix):
    v1 = dt_matrix[dnames.index(docA)]
    v2 = dt_matrix[dnames.index(docB)]
    return np.dot(norm_vec(v1), norm_vec(v2))

In [None]:
compare_doc_vectors("d1", "d3", transcript_doc_names, transcript_dt_matrix)
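
To get a feel for the numbers this comparison produces, here is a minimal sketch on made-up count vectors. The vectors `a`, `b`, `c` and the helper `cosine` are invented for this illustration; `cosine` does the same normalize-then-dot-product computation, but assumes neither vector is all zeros.

```python
import numpy as np

# Toy count vectors, invented for this illustration.
a = np.array([2, 1, 0])
b = np.array([4, 2, 0])   # same direction as a, twice as long
c = np.array([0, 0, 3])   # shares no terms with a

def cosine(u, v):
    # Dot product of the two vectors after scaling each to length 1
    # (assumes neither vector is all zeros).
    return np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v))

print(cosine(a, b))  # close to 1: same mix of terms
print(cosine(a, c))  # 0.0: no terms in common
```

Because the vectors are normalized first, a longer document that uses the same terms in the same proportions counts as just as similar as a shorter one. With count vectors, which are never negative, the score always falls between 0 and 1.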

Sometimes we'll want to flip, or "transpose," our document-by-term matrix so that the terms are the rows and the documents are the columns. This is a term-by-document matrix.

In [None]:
transcript_td_matrix = transcript_dt_matrix.transpose()

show_labeled_table(transcript_td_matrix, transcript_doc_names, vocab)
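
Here is a minimal sketch of what transposing does, using a made-up matrix (`toy_dt` and `toy_td` are invented for this illustration). Each row of the term-by-document matrix describes one term by its counts across the documents.

```python
import numpy as np

# Toy document-by-term matrix, invented for this illustration:
# 3 documents (rows) by 2 terms (columns).
toy_dt = np.array([[2, 0],
                   [1, 1],
                   [0, 3]])

toy_td = toy_dt.transpose()

print(toy_dt.shape)  # (3, 2)
print(toy_td.shape)  # (2, 3)
print(toy_td[0])     # [2 1 0], the first term's counts across the documents
```

No counts change; the first row of `toy_td` is just the first column of `toy_dt`.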