Lecture 4: vector similarity
===============

9/18/2023, CS 4/6120 Natural Language Processing, Muzny


Task 1: Calculating tf-idf
----

To calculate tf-idf, we'll need to first construct a term-document matrix from our data.
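
For reference, the weighting computed by the `tf` and `idf` functions in this notebook can be written out as below. The notation is an assumption made here for illustration: $\mathrm{count}(t, d)$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\mathrm{df}_t$ is the number of documents that contain $t$.

$$
\mathrm{tf}_{t,d} = \log_{10}\bigl(\mathrm{count}(t, d) + 1\bigr), \qquad
\mathrm{idf}_t = \log_{10}\frac{N}{\mathrm{df}_t}, \qquad
\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t
$$

For example, a term that occurs 9 times in a book and appears in 1 of 3 books gets $\mathrm{tf} = \log_{10} 10 = 1$ and $\mathrm{idf} = \log_{10} 3 \approx 0.477$, so its tf-idf is about 0.477.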

In [9]:
# for tokenization, not necessary
# (comment out if you don't have nltk installed yet)
import nltk
nltk.download('punkt')
# helpful for calculating log10 and because
# numpy arrays make certain manipulations
# (like getting a column of numbers)
# easier
import numpy as np

# useful for counting :)
from collections import Counter

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/davebudhram/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [10]:
def tf(term: str, document: list) -> float:
    """
    Calculate term frequency
    Parameters:
    term - string
    document - list of strings (tokenized document)
    Return:
    float term frequency
    """
    return np.log10(document.count(term) + 1)

def idf(N: int, df_t: int) -> float:
    """
    Calculate inverse document frequency given the 
    number of documents and the number of documents the term appears in.
    Parameters:
    N - int (number of documents)
    df_t - int (number of documents the term appears in)
    Return:
    float inverse document frequency
    """
    return np.log10(N / df_t)


def term_in_documents(term: str, documents: list) -> int:
    """
    Calculate the number of documents in a list of documents that a target
    term appears in.
    Parameters:
    term - str
    documents - list of list of str (list of tokenized documents)
    Return:
    int number of documents the term appears in
    """
    total = 0 
    for document in documents:
        if term in document:
            total +=1
    return total


# load in the data
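def load_tokens(filename):
    # read the whole file and lowercase it before tokenizing;
    # the with-block closes the file even if reading fails
    with open(filename, 'r') as f:
        contents = f.read().lower()
    # if you don't have nltk installed, use another tokenization
    # strategy here like str.split()
    return nltk.word_tokenize(contents)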

mobydick = load_tokens('./mobydick.txt')
shakes = load_tokens('./shakesdown.txt')
pandp = load_tokens('./prideandprejudice.txt')
books = [mobydick, shakes, pandp]

In [11]:
# we won't create the entire term-document matrix, we'll just do it for a few key terms that we
# care about for the sake of time
# TODO: pick 3 - 5 words
words = ['the', 'girl', 'people']

In [None]:
# create the term-document matrix
matrix = []

# iterate through your chosen words
for word in words:
    # count how many times it occurs in each book
    # this is just for debugging for you
    counts = [str(b.count(word)) for b in books]
    print(word, ":\t", "\t".join(counts))
    
    
    # calculate the term frequency for each book
    # for this word
    # tf_words should be the same length as counts
    # tf_words = YOUR CODE HERE
    tf_words = []
    for book in books:
        tf_words.append(tf(word, book))
    
    # calculate idf for this term
    # this will be a single scalar
    N = len(books)
    df_t = term_in_documents(word, books)
    idf_t = idf(N, df_t)
    
    # multiply tf with idf for each book/each
    # term frequency you calculated
    tfidf_words = [tf_t * idf_t for tf_t in tf_words]
    
    # add the tfidf numbers to your matrix
    matrix.append(tfidf_words)
    
    # uncomment to see visually the different components (helpful for debugging)
#     print(word, " tf:\t", "\t".join([str(x) for x in tf_words]))
#     print(word, "idf:\t", idf_t)
#     print(word, " tf-idf:\t", "\t".join([str(x) for x in tfidf_words]))
    
# if you'd like to, uncomment the following code to make the matrix into a numpy array
# matrix = np.array(matrix)

1. What should the dimensions of your matrix be? __number of words (3 for the list above) by number of books (3)__
2. What happens if you attempt to calculate tf-idf of a term that exists in *none* of your books? __`idf` raises a `ZeroDivisionError` because `df_t` is 0, unless you adjust the code__
3. What happens if you attempt to calculate tf-idf of a term that exists in *all* of your books? __it's zero, because idf is log10(N / N) = 0__
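
The answers above can be checked on a toy corpus. The sketch below repeats the notebook's `tf`, `idf`, and `term_in_documents` in compact form so that it runs on its own; `toy_docs`, `toy_words`, and `toy_matrix` are hypothetical names and data made up for illustration, not part of the lab.

```python
import numpy as np

def tf(term, document):
    return np.log10(document.count(term) + 1)

def idf(N, df_t):
    return np.log10(N / df_t)

def term_in_documents(term, documents):
    return sum(1 for d in documents if term in d)

# hypothetical toy corpus: three tiny tokenized "books"
toy_docs = [
    "the whale swam".split(),
    "the girl saw the whale".split(),
    "the people spoke".split(),
]
toy_words = ["the", "whale", "girl"]

toy_matrix = []
for word in toy_words:
    df_t = term_in_documents(word, toy_docs)
    idf_t = idf(len(toy_docs), df_t)
    toy_matrix.append([tf(word, d) * idf_t for d in toy_docs])
toy_matrix = np.array(toy_matrix)

# rows are words, columns are documents
print(toy_matrix.shape)
# "the" appears in every document, so its whole row is zero
print(toy_matrix[0])

# "ship" appears in no document, so df_t is 0 and N / df_t fails
try:
    idf(len(toy_docs), term_in_documents("ship", toy_docs))
except ZeroDivisionError as err:
    print("ZeroDivisionError:", err)
```

One common adjustment for the last case is to skip terms with `df_t == 0` or to add 1 to the denominator; which one is appropriate depends on what the matrix is used for.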

In [None]:
# check the dimensions of your matrix
# number of rows should match number of words
# number of cols should match number of documents (books)


In [None]:
def cosine_sim(v1: list, v2: list) -> float:
    """
    Calculate the cosine similarity between two vectors of the same length.
    Parameters:
    v1 - list (of numbers)
    v2 - list (of numbers)
    Return:
    float cosine similarity
    """
    # np.dot() gives the numerator and np.linalg.norm() gives each vector's length
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# calculate the similarity between two *word* vectors
# we'll just do word vectors because unless matrix is a numpy array
# it is (more) difficult to get column vectors
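
As a rough illustration of the row versus column distinction, the sketch below compares two word vectors (rows) and two book vectors (columns) of a small numpy array. The values in `toy` are hypothetical, not computed from the books, and `safe_cosine_sim` is an assumed helper that is not part of the lab: it returns 0.0 when either vector has zero length, which happens for any word that appears in every book, because its tf-idf row is all zeros.

```python
import numpy as np

def cosine_sim(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def safe_cosine_sim(v1, v2):
    # assumed convention: similarity with a zero vector is 0.0
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return 0.0 if denom == 0 else float(np.dot(v1, v2) / denom)

# hypothetical tf-idf values: rows are words, columns are books
toy = np.array([
    [0.0, 0.0, 0.0],   # a word found in every book (idf is 0)
    [0.2, 0.1, 0.0],
    [0.1, 0.3, 0.4],
])

# word vs. word: compare two rows
row_sim = cosine_sim(toy[1], toy[2])

# book vs. book: compare two columns with numpy slicing
col_sim = cosine_sim(toy[:, 0], toy[:, 1])

# the plain version would divide 0.0 by 0.0 here and give nan
zero_sim = safe_cosine_sim(toy[0], toy[1])

print(row_sim, col_sim, zero_sim)
```

With a plain list of lists, a column can still be pulled out with `[row[0] for row in matrix]`, but the `toy[:, 0]` slice is shorter once the matrix is a numpy array.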

In [None]:
# if you finish, which book is closest to moby dick?
# you'll need a *column* vector here (instead of a row vector)


In [None]:
# if you finish that too: remake your matrix with all vocabulary terms, then check again which book is closest to moby dick
# create the term-document matrix
# you'll want to use counters for each book's vocabulary for the sake of efficiency


# you may want to re-define new counter versions of your tf, idf, term_in_documents 
# functions so that they work with counters
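
One possible shape for the Counter-based versions is sketched below on toy data. The names `tf_counter`, `df_counter`, `toy_books`, `vocab`, and `full_matrix` are hypothetical, and this is only one way to organize the work. The idea is that each book is counted once, after which every lookup is a dictionary access instead of a scan over the whole token list.

```python
from collections import Counter
import numpy as np

def tf_counter(term, counts):
    # a Counter returns 0 for terms it has never seen
    return np.log10(counts[term] + 1)

def df_counter(term, counters):
    # number of books whose Counter contains the term
    return sum(1 for c in counters if term in c)

# hypothetical toy corpus standing in for the three books
toy_books = [
    "the whale swam away".split(),
    "the girl saw the whale".split(),
    "the girl spoke".split(),
]
counters = [Counter(book) for book in toy_books]

# every term that appears in at least one book, in a fixed order
vocab = sorted(set().union(*counters))

N = len(counters)
rows = []
for term in vocab:
    # df is at least 1 here because vocab is built from the books
    idf_t = np.log10(N / df_counter(term, counters))
    rows.append([tf_counter(term, c) * idf_t for c in counters])
full_matrix = np.array(rows)

# one row per vocabulary term, one column per book
print(full_matrix.shape)
```

Keeping `vocab` sorted fixes the row order, so `vocab.index(term)` recovers the row for any term.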


In [None]:
# verify the shape of your matrix

In [None]:
# which book is actually closest to moby dick?

Task 2: install `nltk` (this is the same task 2 from lecture 3)
-----

If you finish the first task, work on making sure that you have `nltk` downloaded and accessible to your jupyter notebooks. While you will not be allowed to use `nltk` for *most* of your homework, we will use it frequently in class to demonstrate tools. 

[`nltk`](https://www.nltk.org/) (natural language toolkit) is a python package that comes with many useful implementations of NLP tools and datasets.

From the command line, using pip: `pip3 install nltk`

[installing nltk](https://www.nltk.org/install.html)

In [None]:
import nltk

# the stemmer we'll use
from nltk.stem.porter import PorterStemmer

# also grab a lemmatizer
from nltk.stem import WordNetLemmatizer

# for the tokenizer that we're going to use
# won't cause an error if you've already downloaded it
nltk.download('punkt')
# for the lemmatizer
nltk.download('wordnet')

In [None]:
example = "N.K. Jemisin is a science fiction author."
words = nltk.word_tokenize(example)

# not perfect, but much better
print(words)

In [None]:
# using the nltk tokenizer, tokenize Moby Dick

4. How many tokens do you have now? How big is the vocabulary? Are these numbers larger or smaller than using `str.split()` to tokenize? __YOUR ANSWER HERE__

In [None]:
porter_stemmer = PorterStemmer()

# example stemming at the individual token level
for w in moby_nltk_tokens[:5]:
    print(porter_stemmer.stem(w))

5. How big is the vocabulary of stems? __YOUR ANSWER HERE__
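
A minimal sketch of how a stem vocabulary could be measured, using a hypothetical toy sentence in place of Moby Dick: stem every token, collect the results in a set, and compare the set's size with the size of the raw token vocabulary. The names `toy_tokens`, `token_vocab`, and `stem_vocab` are made up for illustration.

```python
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

# hypothetical toy tokens; in the lab these would be the Moby Dick tokens
toy_tokens = "the whales were swimming and the whale swam".split()

# distinct surface forms
token_vocab = set(toy_tokens)

# distinct stems: several surface forms can collapse onto one stem
stem_vocab = {porter_stemmer.stem(w) for w in toy_tokens}

print(len(token_vocab), len(stem_vocab))
```

Because stemming maps each word type to a single stem, the stem vocabulary can never be larger than the token vocabulary.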

In [None]:
# how many lemmas in the vocabulary?
wordnet_lemmatizer = WordNetLemmatizer()
        
# the lemmatizer works using the method .lemmatize(word)

6. How big is the vocabulary of lemmas? __YOUR ANSWER HERE__

In [None]:
# code

7. Make a graph of Heaps' law with separate series for tokens, stems, and lemmas. Do they follow the same patterns? __YOUR ANSWER HERE__
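
The data behind such a graph is the vocabulary size as a function of the number of tokens read so far. The sketch below computes those points for a hypothetical toy sequence; `vocab_growth`, `step`, and `toy_tokens` are names assumed here for illustration. The same function could be run on the token, stem, and lemma sequences in turn to get the three series, which could then be drawn with a plotting library such as matplotlib.

```python
def vocab_growth(tokens, step=1):
    """Return (tokens_seen, vocabulary_size) pairs, sampled every `step` tokens."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        # always record the final point so the curve reaches the end
        if i % step == 0 or i == len(tokens):
            points.append((i, len(seen)))
    return points

# hypothetical toy sequence; in the lab this would be a full book
toy_tokens = "the whale saw the girl and the girl saw the whale".split()

points = vocab_growth(toy_tokens, step=2)
print(points)
# → [(2, 2), (4, 3), (6, 5), (8, 5), (10, 5), (11, 5)]
```

Heaps' law predicts that vocabulary size grows roughly as a power of the number of tokens, so on a real book the curve should keep rising, but more and more slowly.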