# Lab 7

This lab introduces "word embeddings" (mappings of words into a vector space), beginning with the word2vec algorithm, which started a revolution in NLP in 2013. That revolution is still ongoing: it has led to "contextualized word embeddings" and "large language models" (such as BERT and GPT-3), which even outperform humans at certain tasks. Conceptually, BERT is built on the same idea as word2vec, i.e. that words can be placed in a vector space, but it is more "dynamic" than word2vec because it can capture context information. We will cover this in next week's lab.


So far we have focused on bag-of-words approaches, i.e. representations of **documents** as vectors of **word frequencies** (if we want a TFIDF matrix, we apply TFIDF weighting). The problem with bag-of-words approaches is that they do not capture any information about the similarity or meaning of words. 

Recall the example Document-Term Matrices that we used [here](https://i.stack.imgur.com/hMe5D.png)
or
[here](https://i.stack.imgur.com/Aj2A7.png). The bag of words representation has no information of context - the columns are even alphabetic - hence "bag of words."


## Conceptual point - words and documents can be thought of as vectors in the DTM
As should be familiar by now, in a Document-Term Matrix (DTM) the rows represent documents and the columns represent words. 

Note that the words (columns) are **sparse vectors** - meaning, that for most words, the column is going to be full of 0s - reflecting the fact that not every word is used in every document. Also note that the DTM lists words alphabetically, reflecting the fact that order is completely disregarded - and thus, context is ignored entirely. 


## Words as vectors 
If we take **word1** as a vector, we might get **[0,0,0,1]**, and if we represent another word, **word2**, as a vector, we might get **[0,1,0,0]**. These two vectors are orthogonal, and the cosine similarity between two orthogonal vectors is 0 - they point in completely different directions. Thus, even if we wanted to measure "similarity" between two words in a DTM, we wouldn't get much information. We have more luck using cosine similarity with documents. 
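
The zero similarity can be checked directly. This is a minimal sketch with a hand-written `cosine` helper (a hypothetical name, not part of the lab code) applied to the two example vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

word1 = [0, 0, 0, 1]   # word1 appears only in the fourth document
word2 = [0, 1, 0, 0]   # word2 appears only in the second document

print(cosine(word1, word2))  # 0.0: the vectors share no non-zero position
print(cosine(word1, word1))  # 1.0: a vector compared with itself
```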

## Documents as vectors 
Document comparison using cosine similarity is another fruitful task. 


## Word2vec and [distributional hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics)

As we will see in a bit, word2vec allows us to represent words as __dense__ vectors. Each word is __embedded__ in a vector space of a fixed dimension (usually 300) where __similar words__ are located close together. This allows us to calculate similarity between words - thus gaining insight into their semantic content. The fact that each word is embedded as a vector in a vector space is why each word represented by this method is called a __word embedding__.




In [None]:
import os
import sys
sys.path.append('..')

import lzma
import json
import pandas as pd
import numpy as np

from config import settings_base as settings
from config import utils

## Getting the data

In [None]:
compressed_file = utils.get_cases_from_bulk(jurisdiction="Delaware", data_format="json")

In [None]:
cases = []
print("File path:", compressed_file)
with lzma.open(compressed_file) as infile:
    for line in infile:
        record = json.loads(str(line, 'utf-8'))
        cases.append(record)

print("Case count: %s" % len(cases))

In [None]:
df = pd.DataFrame(cases)
df.head()

In [None]:
opinion_data = []
for case in cases:
    for opinion in case["casebody"]["data"]["opinions"]:
        temp = {}
        keys = list(case.keys())
        keys.remove('casebody')
        for key in keys:         
            temp[key] = case[key]
        keys = list(opinion.keys())
        for key in keys:         
            temp[key] = opinion[key]
        opinion_data.append(temp)
        
df = pd.DataFrame(opinion_data)
df["citations"] = df["citations"].apply(lambda x:x[0]['cite'])
df["court"] = df["court"].apply(lambda x:x['name'])
df["decision_date"] = df["decision_date"].apply(lambda x:int(x[:4]))

#df = df[df['court'] =='Delaware Court of Chancery'] ## if we want to just focus on one court

df["text"] = df["text"].str.lower()
df = df.drop(["docket_number", "first_page", 
                                "last_page", "name",
                                "reporter", "volume", "jurisdiction"], axis=1)
df = df[["name_abbreviation", "decision_date", "court", "author", "type", "text"]]

In [None]:
df.head()

In [None]:
sample_df = df.sample(500, 
                      replace=False , 
                      random_state=1)

In [None]:
len(sample_df)

## Part 1 - Cosine similarity for document similarity

According to Wikipedia, "Cosine similarity is a measure of similarity between two sequences of numbers." 

[In essence](https://storage.googleapis.com/lds-media/images/cosine-similarity-vectors.original.jpg) - if there is a small angle between two vectors (which is just a sequence of numbers) - that means they are similar. 

Don't underestimate the usefulness of cosine similarity measures. Imagine you have to find the most similar case to another case in a corpus. How would you go about it? Cosine similarity can help here.

The idea is very simple:

* Cosine(10 degrees) gives you a __"cosine similarity"__ of about 0.98. This can be loosely interpreted as the vectors being "98% similar". 

* Cosine(90 degrees) gives you a __"cosine similarity"__ of 0. This can be loosely interpreted as the vectors being "0% similar".
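
The link between the angle and the similarity score can be sketched in a few lines. The vectors `a` and `b` below are made up so that they sit exactly 10 degrees apart; they are illustrative only.

```python
import math
import numpy as np

# Cosine of a few angles: the wider the angle, the lower the similarity
for degrees in (0, 10, 45, 90):
    print(degrees, round(math.cos(math.radians(degrees)), 2))

# The same number falls out of the vectors themselves:
# b is a unit vector rotated 10 degrees away from a
a = np.array([1.0, 0.0])
b = np.array([math.cos(math.radians(10)), math.sin(math.radians(10))])
cos_ab = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(round(cos_ab, 2))
```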


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
tf_vectorizer = CountVectorizer(min_df=0.1,
                         max_df=.9,  
                         max_features=1000,
                         stop_words='english',
                         ngram_range=(1,1))

In [None]:
X_tf = tf_vectorizer.fit_transform(sample_df['text'])

tf = pd.DataFrame(data = X_tf.toarray(), 
                  columns = tf_vectorizer.get_feature_names_out())  # get_feature_names() was removed in recent scikit-learn releases

tf.head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=0.1,
                                   max_df=.9,  
                                   max_features=1000,
                                   stop_words='english',
                                   ngram_range=(1,1))

In [None]:
X_tfidf = tfidf_vectorizer.fit_transform(sample_df['text'])

tf_idf = pd.DataFrame(data = X_tfidf.toarray(), 
                      columns = tfidf_vectorizer.get_feature_names_out())  # get_feature_names() was removed in recent scikit-learn releases

tf_idf.head()

In [None]:
# the cosine similarity measures similarity between rows of a matrix - making it into a Square matrix.
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
cos_sim_tf = cosine_similarity(X_tf)
cos_sim_tfidf = cosine_similarity(X_tfidf)

### CountVectorizer cosine similarity matrix

In [None]:
cv_cos_sim = pd.DataFrame(data = cos_sim_tf, 
                          columns = sample_df['name_abbreviation'],
                          index = sample_df['name_abbreviation'])

cv_cos_sim.head()

We can sort the column values to get the "top similar" cases.
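
A small helper can make this lookup reusable. The sketch below uses a hypothetical `top_similar` function and a made-up 3x3 similarity matrix; neither is part of the lab code, and the helper assumes that the case names in the index are unique.

```python
import pandas as pd

def top_similar(sim_matrix, case_name, n=5):
    """Return the n cases most similar to case_name, excluding the case itself.

    Hypothetical helper: assumes sim_matrix is a square DataFrame whose
    index and columns hold the same, unique case names.
    """
    scores = sim_matrix[case_name].drop(labels=case_name)
    return scores.sort_values(ascending=False).head(n)

# Made-up similarity scores for three invented cases
names = ["Case A", "Case B", "Case C"]
toy_sim = pd.DataFrame([[1.0, 0.2, 0.7],
                        [0.2, 1.0, 0.4],
                        [0.7, 0.4, 1.0]],
                       index=names, columns=names)

print(top_similar(toy_sim, "Case A", n=2))
```

Dropping the query case first avoids the trivial result that every case is most similar to itself.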

In [None]:
cv_cos_sim.sort_values(by='Silvers v. Jones', 
                          ascending=False)

### TFIDF cosine similarity matrix

In [None]:
tfidf_cos_sim = pd.DataFrame(data = cos_sim_tfidf, 
                             columns = sample_df['name_abbreviation'],
                             index = sample_df['name_abbreviation'])

In [None]:
tfidf_cos_sim.head()

In [None]:
tfidf_cos_sim.sort_values(by='Silvers v. Jones', 
                          ascending=False)

Notice, there's a bit of a difference between TF and TFIDF similarity scores. 

For instance, 3rd most similar to __Silvers v. Jones__ in TFIDF matrix is __Hayes v. Hayes__ and not __National Building, Loan & Provident Ass'n v. Alfree.__

Let's actually read those "similar cases" to Silvers v Jones - are they actually similar? 

### Finding similar case

In [None]:
def get_index(case):
    return sample_df.name_abbreviation[sample_df.name_abbreviation == case].index.tolist()[0]
    

In [None]:
case_of_interest = 'Vredenburgh v. Jones'

get_index(case_of_interest)

In [None]:
print(sample_df['text'][get_index(case_of_interest)])

In [None]:
## Silvers v. Jones is 4528
sample_df['text'][4528]

They both seem to be talking about legacies, estates and other things related to inheritance laws.

Seems like cosine similarity is working.

__NOTE:__  Recall that we could also represent documents as "topics" rather than words - this can also be used for cosine similarity purposes.

---------------

## Part 2 - word2vec: some theory
Now that we know what cosine similarity is, we can move on to Word2Vec. 

The theory behind the word2vec algorithm relies on two theoretical foundations - 

* **[Distributional Semantics](https://en.wikipedia.org/wiki/Distributional_semantics),** - ie that "meaning" of words is known by the context in which the word is used, and 

* **[Language Modeling](https://thegradient.pub/content/images/2019/10/lm-1.png)** - a task in NLP where you predict the next word in a sequence based on probabilities

###  1) Distributional Semantics 

The word2vec algorithm is based on a linguistic theory called "distributional semantics". 

This theory is summarized by the linguist Firth's famous statement: **You shall know a word by the company it keeps**. This should not be surprising to anyone who has encountered a strange word in a book - we tend to re-read the sentence and look at the words around the one we don't know. __Thus, we get an idea of what the word means by looking at the other words around it.__ 

* For example - we all know Lewis Carroll's famous poem [Jabberwocky](https://www.poetryfoundation.org/poems/42916/jabberwocky)

Distributional semantics assumes a **distributional hypothesis**. In simple terms, distributional hypothesis argues that the usage of words is **a distribution.** Not only that, but the distribution is constrained/changes depending on various contexts. What this means is that there's only a __limited number of words that can occur in a given context.__

#### Consider the following examples (recall that we did something similar in kindergarten - this is how fundamental this stuff is):

I like to think of this as a fill-in-the-blank exercise: 
* 1) **You are a ___________** 

(how many words can fit in the blank here? ie what is the **"distribution"**, probability wise, what words can fit here?) 

* 2) **You are a very very unlikable _____** 

(could we say that fewer words can fit in the blank than before?) 

* 3) **I am typing on a _____** 

(how many words can fit in the blank space? - probably only a couple - a typewriter/computer/my phone). The distribution is more constrained. It definitely has to be a noun.)

* 4) **I like drinking ____** 

(how many words can fit in the blank space - probably a billion) 

* 5) **I like drinking  ____, ice cold** 

(fewer words can fit here - for example, we can't talk about coffee any more, unless you drink ice cold coffee) 

* Thus, words that co-occur in the same context have similar meanings/functions/usages (you can't drink a "table" for example - your language is constrained and the context determines the word that you will use)
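
The idea that a context constrains a distribution over words can be sketched by simple counting. The mini corpus and the `fill_in_the_blank` helper below are made up for illustration; real language models estimate these probabilities from far more text and with far better methods.

```python
from collections import Counter

# Made-up mini corpus (illustrative only)
toy_corpus = [
    "i like drinking coffee",
    "i like drinking tea",
    "i like drinking coffee",
    "i like drinking water ice cold",
    "i am typing on a computer",
    "i am typing on a typewriter",
]

def fill_in_the_blank(corpus, context):
    """Estimate P(word | context) by counting which word follows the context."""
    context_tokens = context.split()
    n = len(context_tokens)
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(len(tokens) - n):
            if tokens[i:i + n] == context_tokens:
                counts[tokens[i + n]] += 1
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(fill_in_the_blank(toy_corpus, "i like drinking"))
print(fill_in_the_blank(toy_corpus, "typing on a"))
```

Each context gets its own distribution: drinks after "i like drinking", writing devices after "typing on a", and a word like "table" gets no probability in either.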

#### This "fill in the blank exercise" above, when done by computers is called a ["language modeling task."](https://miro.medium.com/max/1400/1*_MrDp6w3Xc-yLuCTbco0xw.png)

-------------------------
###  2) Language modeling with word2vec

This "fill in the blanks" exercise that we did in kindergarten is, unironically, the way language modeling works. 


There are two model architectures for word2vec:

* The first architecture is called __"continuous bag of words"__ (CBOW), which predicts the **current word based on the context** (ie a word is blanked out and the algorithm looks at the data to see which words fit best, i.e. have the highest probability), 

* The second architecture, __"Skip-gram"__ (often abbreviated SGNS when it is trained with negative sampling), predicts the __surrounding words given the current word__ (the name comes from "skip" + "n-gram"). 

In both CBOW and SGNS you set the window size (the context size around a target word) and the algorithm does this for all the words. 
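
The difference between the two architectures is easiest to see in the training examples they generate. The following is a simplified sketch with a hypothetical `training_pairs` helper and a made-up sentence; real implementations add refinements (for example, subsampling of very frequent words) that are left out here.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in a sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

sentence = "the court denied the motion".split()

# CBOW: the context words together predict the target word
for context, target in training_pairs(sentence, window=2):
    print("CBOW     ", context, "->", target)

# Skip-gram: the target word predicts each context word separately
for context, target in training_pairs(sentence, window=2):
    for context_word in context:
        print("Skip-gram", target, "->", context_word)
```

A larger `window` gives each word more context words, which tends to capture broader, more topical similarity.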

See page 5 of the famous word2vec paper [here](https://arxiv.org/pdf/1301.3781.pdf)


Recall that, theoretically speaking, when we learn word vector representations from context information, we are doing much **the same thing as a concordance,** only on a much larger scale. That's pretty cool! - one can say that word2vec is fundamentally based on concordances.

The math is not important here. Conceptually, the word2vec algorithm works via a prediction task: the input is the context words (each represented as a sparse one-hot vector, similar in spirit to the sparse vectors in the DTM), and the output is the word that you already know belongs there. The prediction error is used to update the values in the __"hidden layer"__ - which subsequently becomes your "word embedding" - ie a dense representation of a word (rather than a sparse one). The vectors __keep updating__ until the algorithm's predictions for the output word match what the context words suggest (in the case of CBOW). 
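
For those who want to see the update loop anyway, here is a minimal NumPy sketch of CBOW on a four-word vocabulary. Everything in it is an assumption for illustration: the vocabulary, the dimension, the learning rate and the `cbow_step` helper are made up, and it uses a full softmax for readability, whereas word2vec implementations rely on cheaper approximations such as negative sampling or hierarchical softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "court", "denied", "motion"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 3   # vocabulary size, embedding dimension

W_in = rng.normal(scale=0.1, size=(V, D))    # "hidden layer": one dense row per word
W_out = rng.normal(scale=0.1, size=(D, V))   # output weights

def cbow_step(context, target, lr=0.5):
    """One gradient step of CBOW; returns the loss measured before the update."""
    ctx_ids = [word_to_id[w] for w in context]
    tgt_id = word_to_id[target]

    h = W_in[ctx_ids].mean(axis=0)           # average the context embeddings
    scores = h @ W_out                       # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                     # softmax over the vocabulary
    loss = -np.log(probs[tgt_id])            # cross-entropy for the true word

    grad_scores = probs.copy()
    grad_scores[tgt_id] -= 1.0               # gradient of the loss w.r.t. the scores
    grad_h = W_out @ grad_scores             # computed before W_out is changed
    W_out[:] -= lr * np.outer(h, grad_scores)
    # Each context occurrence receives an equal share of the gradient
    np.subtract.at(W_in, ctx_ids, lr * grad_h / len(ctx_ids))
    return float(loss)

losses = [cbow_step(["the", "court", "the", "motion"], "denied") for _ in range(200)]
print("first loss: %.3f, last loss: %.3f" % (losses[0], losses[-1]))

# After training, the rows of W_in are the learned word embeddings
print(W_in[word_to_id["court"]])
```

The loss shrinks as the vectors keep updating, which is the "keep updating until the predictions match" idea in miniature.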



When trained on very large corpora (like all of English Wikipedia), word2vec can solve analogies surprisingly well. For example, the vector closest to the result of the operation ['king' - 'man' + 'woman' is 'queen'](https://static.packt-cdn.com/products/9781787287600/graphics/d4b8d439-e136-44f7-895d-71de1d84342c.png).
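
The arithmetic behind such an analogy can be sketched with hand-made vectors. The two-dimensional `toy_vectors` below are invented so that the first axis loosely stands for "royalty" and the second for "gender"; trained embeddings are not this tidy, and the `analogy` helper is hypothetical.

```python
import numpy as np

# Hand-made 2-d vectors (illustrative only)
toy_vectors = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([-0.7, 0.1]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative), skipping the query words."""
    query = sum(toy_vectors[w] for w in positive) - sum(toy_vectors[w] for w in negative)
    candidates = [w for w in toy_vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(query, toy_vectors[w]))

# 'king' - 'man' + 'woman'
print(analogy(positive=["king", "woman"], negative=["man"]))
```

Excluding the query words from the candidates matters: otherwise 'king' itself would often be returned as the nearest vector.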

[Word2vec is very powerful!](https://www.distilled.net/uploads/word2vec_chart.jpg) - this is precisely the "analogical reasoning" that we humans are so good at. And remember - __analogy is one of the key tasks of a judge__

## Part 3 - Word2Vec with gensim library

For this part, we'll be using the [Gensim library](https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html#sphx-glr-auto-examples-core-run-corpora-and-vector-spaces-py) to make our very own word2vec model

Note that word2vec requires sentences as inputs.






In [None]:
#!pip install gensim

In [None]:
import string
import nltk
from nltk import sent_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')  # tokenizer models used by sent_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
stoplist = set(stopwords.words('english'))

def normalize_text(doc):
    doc = doc.replace('\r', ' ').replace('\n', ' ')
    lower = doc.lower() # all lower case
    nopunc = lower.translate(str.maketrans('', '', string.punctuation)) # remove punctuation using translate
    words = nopunc.split() # split into tokens
    nostop = [w for w in words if w not in stoplist] # remove stopwords
    no_numbers = [w if not w.isdigit() else '#' for w in nostop] # normalize numbers
    return no_numbers

def get_sentences(doc):
    sent = []
    for raw_text in sent_tokenize(doc):
        normalized = normalize_text(raw_text)
        sent.append(normalized)
    return sent

In [None]:
sentences = []
for doc in sample_df['text']:
    sentences += get_sentences(doc)

In [None]:
sentences[:5]

In [None]:
# train the model
from gensim.models import Word2Vec

w2v_model = Word2Vec(sentences,  # list of tokenized sentences
               workers = 4, # Number of threads to run in parallel
               vector_size=100,  # Word vector dimensionality     
               min_count = 2, # Minimum word count  
               window = 10 # Context window size      
               )

In [None]:
words = list(w2v_model.wv.index_to_key)
words[:10]

In [None]:
## how many words in vocab
print(len(words))


In [None]:
## Print actual values of word embedding - this is the hidden layer aka the word embedding we "learned"

print(w2v_model.wv['judge']) # vector for "judge"

In [None]:
print(w2v_model.wv.get_vector('law'))

In [None]:
## Cosine similarity between two vectors
print(w2v_model.wv.similarity('crime', 'law'))

In [None]:
## Most similar words
w2v_model.wv.similar_by_word('crime')

In [None]:
## We can even see which words are not fitting in a given pattern
w2v_model.wv.doesnt_match("he committed a crime horse with a weapon".split())

In [None]:
## vector arithmetic - we can add or subtract vectors to get a new "vector" (that might not correspond to any word)

vector = w2v_model.wv.get_vector('corporation') - w2v_model.wv.get_vector('criminal') 
w2v_model.wv.similar_by_vector(vector)



In [None]:
vector = w2v_model.wv.get_vector('crime') + w2v_model.wv.get_vector('judge')  ## impeachment?
w2v_model.wv.similar_by_vector(vector)

In [None]:
w2v_model.wv.most_similar(positive=['law', 'court'], negative = ['man'])

## Visualizing word2vec word embeddings
Once we have our word embedding model, we can visualize it using standard techniques such as PCA and t-SNE. 

The problem is that we're reducing from 100 dimensions to 2, so a lot of information is lost. Note that PCA is deterministic: the same data always gives the same projection. t-SNE is stochastic, so its layout can change from run to run unless you fix `random_state`. 
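
The difference in reproducibility can be checked on stand-in data. This sketch uses random vectors instead of real embeddings (an assumption made purely to keep it self-contained) and sets `init="random"` explicitly so that the effect of the seed is visible.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Random stand-in for word vectors: 60 "words" in 100 dimensions (illustrative only)
rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(60, 100))

# PCA: repeated runs on the same data give the same 2-d coordinates
pca_a = PCA(n_components=2).fit_transform(fake_vectors)
pca_b = PCA(n_components=2).fit_transform(fake_vectors)
print("PCA runs identical:", np.allclose(pca_a, pca_b))

# t-SNE with random initialisation: a different seed gives a different layout
tsne_a = TSNE(n_components=2, perplexity=15, init="random",
              random_state=0).fit_transform(fake_vectors)
tsne_b = TSNE(n_components=2, perplexity=15, init="random",
              random_state=1).fit_transform(fake_vectors)
print("t-SNE runs identical:", np.allclose(tsne_a, tsne_b))
```

Because of this, distances and positions in a t-SNE plot should be read as a rough picture of neighbourhoods, not as exact coordinates.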





In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import seaborn as sns
import matplotlib.pyplot as plt

# This code is adapted from https://github.com/drelhaj/NLP_ML_Visualization_Tutorial/blob/master/6_Word_embeddings_Tutorial.ipynb
def tsnescatterplot(model, word, list_names):
    """ Plot in seaborn the results from the t-SNE dimensionality reduction algorithm of the vectors of a query word,
    its list of most similar words, and a list of words.
    """
    arrays = np.empty((0, model.wv.vector_size), dtype='f')  # width follows the model's vector size
    word_labels = [word]
    color_list  = ['red']
    # adds the vector of the query word
    arrays = np.append(arrays, model.wv.__getitem__([word]), axis=0)
    
    # gets list of most similar words
    close_words = model.wv.most_similar([word])
    
    # adds the vector for each of the closest words to the array
    for wrd_score in close_words:
        wrd_vector = model.wv.__getitem__([wrd_score[0]])
        word_labels.append(wrd_score[0])
        color_list.append('blue')
        arrays = np.append(arrays, wrd_vector, axis=0)
    
    # adds the vector for each of the words from list_names to the array
    for wrd in list_names:
        wrd_vector = model.wv.__getitem__([wrd])
        word_labels.append(wrd)
        color_list.append('green')
        arrays = np.append(arrays, wrd_vector, axis=0)
        
    # Reduces the dimensionality from 100 to 20 dimensions with PCA
    reduc = PCA(n_components=20).fit_transform(arrays)
    
    # Finds t-SNE coordinates for 2 dimensions
    np.set_printoptions(suppress=True)
    
    Y = TSNE(n_components=2, random_state=0, perplexity=15).fit_transform(reduc)
    
    # Sets everything up to plot
    df = pd.DataFrame({'x': [x for x in Y[:, 0]],
                       'y': [y for y in Y[:, 1]],
                       'words': word_labels,
                       'color': color_list})
    
    fig = plt.subplots()
    #fig.set_size_inches(9, 9)
    
    # Basic plot
    p1 = sns.regplot(data=df,
                     x="x",
                     y="y",
                     fit_reg=False,
                     marker="o",
                     scatter_kws={'s': 40,
                                  'facecolors': df['color']
                                 }
                    )
    
    # Adds annotations one by one with a loop
    for line in range(0, df.shape[0]):
         p1.text(df["x"][line],
                 df['y'][line],
                 '  ' + df["words"][line].title(),
                 horizontalalignment='left',
                 verticalalignment='bottom', size='medium',
                 color=df['color'][line],
                 weight='normal'
                ).set_size(15)

    
    plt.xlim(Y[:, 0].min()-50, Y[:, 0].max()+50)
    plt.ylim(Y[:, 1].min()-50, Y[:, 1].max()+50)
            
    plt.title('t-SNE visualization for {}'.format(word.title()))
    

In [None]:
word = 'crime'
tsnescatterplot(w2v_model, word,
                [t[0] for t in w2v_model.wv.most_similar(positive=[word], 
                                                         topn=20)][10:])

### PCA plot of all words

In [None]:
pca_df = pd.DataFrame(w2v_model.wv[w2v_model.wv.index_to_key], 
                      index=w2v_model.wv.index_to_key)

In [None]:
pca_df.head()

In [None]:
pca_df_reduced = pca_df.sample(100)

In [None]:
pca_df_reduced.head()

Full word dataset

In [None]:
pca = PCA(n_components=2)
components = pca.fit_transform(pca_df)
plot = plt.scatter(components[:,0], components[:,1])
plt.show()

### Use reduced word dataset

In [None]:
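import plotly.express as px

# Fit PCA on the 100-word sample so the points line up with the labels below
components = PCA(n_components=2).fit_transform(pca_df_reduced)

fig = px.scatter_matrix(
    components,
    dimensions=range(2),
    color=pca_df_reduced.index
)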
fig.update_traces(diagonal_visible=False)
fig.show()

### Save word2vec Model

In [None]:
# w2v_model.save('w2v_model_vectors.pkl')

## Part 4 - [Doc2Vec](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html#sphx-glr-auto-examples-tutorials-run-doc2vec-lee-py)
Doc2Vec works like word2vec, but it also learns an extra vector [for each document](https://miro.medium.com/max/640/0*x-gtU4UlO8FAsRvL.), trained alongside the word vectors. 





In [None]:
from nltk import word_tokenize

docs = []
for i, row in sample_df.iterrows():
    docs += [word_tokenize(row['text'])]


In [None]:
docs[0][:20]

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

doc_iterator = [TaggedDocument(doc, [i]) for i, doc in enumerate(docs)]

d2v_model = Doc2Vec(doc_iterator, # list of tokenized documents
                   workers = 4, # Number of threads to run in parallel
                   vector_size = 100,  # Word vector dimensionality     
                   min_count = 2, # Minimum word count  
                   window = 10 # Context window size      
                   #max_vocab_size =  10000
                  )

In [None]:
# d2v_model.save('d2v-vectors.pkl')

In [None]:
# matrix of all document vectors:
doc2vec_matrix = d2v_model.dv.vectors
doc2vec_matrix.shape

In [None]:
d2v_matrix = pd.DataFrame(data = doc2vec_matrix, 
                          index = sample_df['name_abbreviation'])

In [None]:
d2v_matrix.head()

In [None]:
# to find the vector of a document which is NOT in the training data
# (infer_vector expects a list of tokens, so we split each sentence into words)
a = d2v_model.infer_vector('the murder was committed by the defendant'.split())

b = d2v_model.infer_vector('the criminal assaulted the victim'.split())

c = d2v_model.infer_vector('the corporation is not able to pay its taxes'.split())

In [None]:

print(cosine_similarity(np.expand_dims(a, axis=0), 
                        np.expand_dims(b, axis=0)))
print(cosine_similarity(np.expand_dims(a, axis=0), 
                        np.expand_dims(c, axis=0)))

In [None]:
# get all pair-wise document similarities
pairwise_sims = cosine_similarity(doc2vec_matrix)
pairwise_sims.shape

In [None]:
d2v_similarity_matrix = pd.DataFrame(data = pairwise_sims, 
                                  columns = sample_df['name_abbreviation'],
                                  index = sample_df['name_abbreviation'])
d2v_similarity_matrix

In [None]:
d2v_similarity_matrix.sort_values(by='Silvers v. Jones', 
                          ascending=False)

In [None]:
case_of_interest = 'Getchell v. Rust' ## 2nd best match

get_index(case_of_interest)

In [None]:
print(sample_df['text'][get_index(case_of_interest)])

In [None]:
sample_df['text'][4528]

### We can also cluster documents

See previous lab for different clustering methods and approaches.

In [None]:
# Document clusters
from sklearn.cluster import KMeans

# create 4 clusters of similar documents
num_clusters = 4
kmw = KMeans(n_clusters=num_clusters)
kmw.fit(doc2vec_matrix)

In [None]:
# Documents from an example cluster
for i, doc in enumerate(docs):
    if kmw.labels_[i] == 0:  # labels run from 0 to num_clusters - 1
        print(' '.join(doc[:9]))
    if i == 1000:
        break

In [None]:
#%% PCA Viz
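import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# project the document vectors onto their first two principal components
Xpca = PCA(n_components=2).fit_transform(doc2vec_matrix)
#plt.scatter(Xpca[:,0],Xpca[:,1], alpha=.1)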

cdict = {1: 'red', 2: 'blue', 3: 'green'}
fig, ax = plt.subplots()
#for g, label in cdict.items():
for g in np.unique(kmw.labels_):
    ix = np.where(kmw.labels_ == g)
    #ix = np.where(kmw == g)
    #    ax.scatter(scatter_x[ix], scatter_y[ix], c = cdict[g], label = g, s = 100)
    if g in cdict:
        # use color from cdict
        color = cdict[g]
        ax.scatter(Xpca[:,0][ix], Xpca[:,1][ix], c = color, label = g, s = 100, alpha=1)
    else:
        if g < 10:
            color = "black"
            ax.scatter(Xpca[:,0][ix], Xpca[:,1][ix], c = color, label = g, s = 100, alpha=1)
    

        
ax.legend()
plt.show()

In [None]:
d2v_matrix_reduced = d2v_matrix.sample(200)

In [None]:
d2v_matrix_reduced

In [None]:
import plotly.express as px

pca = PCA(n_components=2)
components = pca.fit_transform(d2v_matrix_reduced)

fig = px.scatter_matrix(
    components,
    dimensions=range(2),
    color=d2v_matrix_reduced.index
)
fig.update_traces(diagonal_visible=False)
fig.show()