In [1]:
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
import pandas as pd
import numpy as np
import scipy.spatial.distance  # makes scipy.spatial.distance.cosine available
from collections import Counter
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors



## Distributional Semantics: Representing the Meaning of Words with Numbers



So far we haven't talked about representing the **meaning** of language with statistical methods.

We *have* done machine learning. N-gram models learn probabilistic rules for generating sentences based on statistical patterns inferred from the data. But these models might seem shallow in some way, because they focus on form.


How might we go about learning representations of the meaning of words in our corpus?

In [2]:
corpus = """By this liberty they entered into a very laudable emulation to do all of them \
what they saw did please one. If any of the gallants or ladies should say, Let us drink, \
they would all drink.  If any one of them said, Let us play, they all played.  If one said, \
Let us go a-walking into the fields they went all."""

In **distributional semantics**, the basic intuition is that similar words appear in similar contexts.

Great. How do we *measure* the similarity of contexts? 

It might help at this point to distinguish two kinds of meaning relations between words.

**Syntagmatic** relationships between words arise because of a sequential ordering. Words in a sentence form a syntagm. 
* think of the word *syntax*, which links words with different functions in a sentence

This contrasts with **paradigmatic** relationships. Words in a **paradigm** are related because they may be substituted for one another in certain contexts. 
* think of a verb tense paradigm or a noun declension when learning a new language
* these words can go in the same "slot" in the sentence

### Question: syntagmatic or paradigmatic?

flower - leaf

of - the

ice - cream

word - sick

Let's take a look at the guts of the n-gram model from yesterday.

In [3]:
def ngrams(tokenized_corpus, n):
    # get the ngrams for a corpus
    n_grams = []
    for i in range(n-1, len(tokenized_corpus)): 
        n_grams.append(tuple(tokenized_corpus[i-(n-1):i+1]))
    return n_grams

def ngram_tokenize(text):
    # tokenize a text for an ngram model:
    # split punctuation off the preceding word, split on whitespace, lowercase
    tokenized_corpus = re.sub(r'(\w)([.,?!;:])', r'\1 \2', text) 
    tokenized_corpus = tokenized_corpus.split()
    tokenized_corpus = [word.lower() for word in tokenized_corpus]
    return tokenized_corpus

# print the first 10 bigrams
print(ngrams(ngram_tokenize(corpus), 2)[:10])

[('by', 'this'), ('this', 'liberty'), ('liberty', 'they'), ('they', 'entered'), ('entered', 'into'), ('into', 'a'), ('a', 'very'), ('very', 'laudable'), ('laudable', 'emulation'), ('emulation', 'to')]


In [4]:
# bigram conditional frequency distribution

cfd = nltk.ConditionalFreqDist(ngrams(ngram_tokenize(corpus), 2))
cfd['all']

FreqDist({'of': 1, 'drink': 1, 'played': 1, '.': 1})
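To make the data structure less mysterious, here is a minimal sketch of a bigram conditional frequency distribution built from plain Python collections. It is not how NLTK implements `ConditionalFreqDist`; the names `bigram_cfd`, `toy_tokens`, and `toy_cfd` are made up for this illustration.

```python
from collections import Counter, defaultdict

def bigram_cfd(tokens):
    """Map each word to a Counter of the words that directly follow it."""
    cfd = defaultdict(Counter)
    # pair every token with its right-hand neighbor
    for left, right in zip(tokens, tokens[1:]):
        cfd[left][right] += 1
    return cfd

# hypothetical toy corpus, already tokenized
toy_tokens = "let us drink , let us play , let us go".split()
toy_cfd = bigram_cfd(toy_tokens)

print(toy_cfd["us"])    # the words seen directly after "us"
print(toy_cfd["let"])   # the words seen directly after "let"
```

Each condition (the left word) owns its own counter of outcomes (the right words), which is the same shape as the `cfd` above.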

### Question: What kinds of relationships between words do N-gram models learn? Syntagmatic or paradigmatic? 


#### You can write your notes here

####














####
####
####

####
####
####

####
####
####

**Hint**: the numbers encode a relationship between two words. Which words? How are those words related?


#### Answer:

N-gram models explicitly encode syntagmatic (sequence-oriented) relationships. The probabilities represent relationships between two words in a sequence. If the frequency is very high, then the two words occur together very often, but they might not have similar meanings (e.g. `let us`).

But they do capture a kind of paradigmatic relationship as well, if not numerically. 

All of the words that end up in the frequency distribution for a single word have something in common: they occur in the same "slot" in the sentence. Often this leads to them having semantic commonalities.

We can see this in the frequency distribution for 'us':

In [5]:
cfd['us']

FreqDist({'drink': 1, 'play': 1, 'go': 1})

The words in the frequency distribution for *us* form a meaningful paradigm! They are all active verbs.


But in the n-gram model there's still no numerical representation of similarity between two arbitrary words.

### Perhaps we could imagine one.



If two words occur in the same frequency distribution, they are similar. If they don't, they are dissimilar. 

Maybe if they occur in many of the same frequency distributions, they are more similar.
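One way to turn this idea into a number, sketched under the assumption that "context" means the word directly to the left: collect the set of left neighbors for every word, then count how many neighbors two words have in common. The helper names `left_contexts` and `overlap` and the toy sentence are invented for this example.

```python
from collections import defaultdict

def left_contexts(tokens):
    """For each word, collect the set of words that directly precede it."""
    preceding = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        preceding[right].add(left)
    return preceding

def overlap(word_a, word_b, preceding):
    """Number of left-hand contexts that two words have in common."""
    return len(preceding[word_a] & preceding[word_b])

# hypothetical toy corpus, already tokenized
toy_tokens = "let us drink , they all drink . let us play , they all play".split()
preceding = left_contexts(toy_tokens)

print(overlap("drink", "play", preceding))  # both follow "us" and "all"
print(overlap("drink", "let", preceding))   # no shared left neighbor
```

This is a crude score: it ignores how often each context occurs, which is exactly what the count vectors later in the notebook keep track of.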

In [6]:
# illustrate this 
print("go contexts: ", cfd['go'].items()) 
print("play contexts: ", cfd['play'].items())

go contexts:  dict_items([('a-walking', 1)])
play contexts:  dict_items([(',', 1)])


Our corpus is very small. *go* and *play* each occur in only one context (after *us*), but *drink* and *played* share another (after *all*). If we had lemmatized our corpus (collapsed different forms of the same word), then the lemmas DRINK and PLAY would share two contexts.

Words that don't share contexts are dissimilar:

In [7]:
# illustrate this 
print("gallants contexts: ", cfd['gallants'].items()) 
print("all contexts: ", cfd['all'].items())

gallants contexts:  dict_items([('or', 1)])
all contexts:  dict_items([('of', 1), ('drink', 1), ('played', 1), ('.', 1)])


So our whole idea of "semantic similarity" is based on shared context. As such, it's totally dependent on our definition of context. What counts as a context? What counts as the same context? Our answers to these questions can change everything.

## Visualizing co-occurrence

Perhaps this is more intuitive as a table. We make a row in the table for each word type, and a column for each context. Then we can compare one word's row to the row of another word. 

Here is a function to convert our cfd to a large Pandas dataframe. Pandas is a super useful package for working with and organizing data, but we are just using it here for visualization purposes. You can ignore this function.

In [8]:
"""
turn a cfd into a big dataframe with rows and columns labeled for easy visualization
"""

def cfd_to_dataframe(cfd):
    
    # We put our rows and columns in order.
    # For now, our rows are the same as our columns:
    # All target words are also context words, and vice versa.
    # But that doesn't have to be the case.
    targetlist = sorted(cfd.conditions())
    contextlist = sorted(list(set(c for t in cfd.conditions() 
                                  for c in cfd[t].keys())))

    # make a numpy matrix out of our sparse dictionary entries
    rows = [ ]
    for t in targetlist:
        # for context words c for which we don't have an entry,
        # the ConditionalFreqDist returns zero
        rows.append( [cfd[t][c] for c in contextlist])

    count_matrix = np.array(rows)

    # add a space so that pandas doesn't get confused that our rows and columns have the same name
    columns = [w + " " for w in contextlist]
    
    # make the pandas dataframe
    df = pd.DataFrame(data=count_matrix, index=targetlist, columns=columns)
    df.index.name = 'target'
    
    return df

Running the function on our n-gram cfd gives another view on the frequency distribution, with all the zeros filled in.

In [9]:
cfd_to_dataframe(cfd)

Unnamed: 0_level_0,",",.,a,a-walking,all,any,did,do,drink,emulation,...,the,them,they,this,to,us,very,went,what,would
target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
",",0,0,0,0,0,0,0,0,0,0,...,0,0,2,0,0,0,0,0,0,0
.,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
a-walking,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
all,0,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
any,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
by,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
did,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
do,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
drink,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


That's all it took!

This is a co-occurrence based distributional semantic model! 

The cool thing about representing words like this is that we can think of the words and their relationships to each other spatially.



Each row is a **vector** representing that word. Each column is a dimension. 

I think of vectors as points in space. A point in 3-D Euclidean space has 3 coordinates:

    (x,y,z)
    
A point in N-dimensional space has N coordinates. You can read a word vector as a long list of coordinates, with each value giving the location of the point along that dimension. N-dimensional space is much harder to visualize, but it works exactly the same way as 2 and 3-D space in terms of calculating distances and so on. 

    (x, y, z, ....) 



**Prompt**: How many dimensions does our Theleme-space have?

Words that have similar meanings occur in similar contexts.
Words in similar contexts are near together in semantic space.

Ergo?

Words that are similar in meaning are near together in semantic space!

## Cosine Similarity

Most of the time, people talk about similarity in semantic space in terms of cosine, which measures the angle between two vectors. 

<img src="../img/cosinesimilarity.png" alt="Alternative text" width="500px" />

If we measured the absolute distance (Euclidean distance), we would end up with differences where we don't want them. Vectors for very common words would be very long and thus very far apart from vectors for infrequent words.



<img src="../img/sadapple.png" alt="Alternative text" width="600px"/>

**Question:** Why would frequent words have longer vectors? (By length I mean how far the point is from the origin.)

Here is the formula for cosine similarity:

$\text{Cosine Similarity} = \cos(\theta) = \frac{(A \cdot B)}{ (||A|| \cdot ||B||)} = \frac{\sum_{i=1}^{n} A_i \cdot B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}}$

The numerator is the dot product (a linear algebra operation that measures similarity), and the bottom is a normalization term that gets rid of differences due to magnitude (aka frequency).

We will share another notebook that introduces linear algebra and goes through the derivation of cosine. 

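The cosine of the angle between two vectors varies between -1 and 1. For raw count vectors, which have no negative entries, it stays between 0 and 1. As the angle between two vectors gets smaller, the cosine gets closer to one. Similar words have a high cosine.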
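A small sketch of the formula in plain Python, to show why cosine is preferred over Euclidean distance for count vectors. The three vectors are made up: `frequent` has the same proportions as `rare` but ten times the counts, and `other` shares no dimensions with either. The function names are illustrative only.

```python
import math

def dot(a, b):
    """Dot product: multiply matching coordinates and add them up."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rare = [1, 2, 0]
frequent = [10, 20, 0]   # same direction as `rare`, much longer
other = [0, 0, 3]        # no shared dimensions with `rare`

print(cosine_similarity(rare, frequent))    # very close to 1: same direction
print(euclidean_distance(rare, frequent))   # large: the lengths differ a lot
print(cosine_similarity(rare, other))       # 0: nothing in common
```

The first two lines of output make the point: the two vectors describe the same distribution over contexts, and only cosine treats them as (nearly) identical.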



# Rewind: How did we make this giant graph?

We have something very like the co-occurrence model already with the n-gram frequency distribution, just represented in a different way. 

The goal is to build a giant table, where each row contains a numeric representation of a word. 
There will be one row and one column for each word **type** in our corpus. 

Each row is a **vector**: a numerical representation of the meaning of the word in that row.

### What counts as context? Differences between N-gram modeling and distributional semantics

The idea of context is different in these paradigms. What's our idea of context in an n-gram model?

What is the idea of context we want for thinking about word meaning?

We want to look at words on both sides of our target word.

## Preprocessing: Tokenization, Lemmatization and Normalization

There are also different preprocessing concerns. We might want to remove stop words and limit low frequency words, which we wouldn't want to do with n-gram models.


**Tokenization:** Separate into contexts, words.

**(Normalization):** Multiple spellings, capitalization.

**(Lemmatization):** Different forms of a word collapse into one.

**Stop-word removal:** Discarding common words.

**Frequency Limits:** Discard very infrequent words.


We'll do all of these things in one go, using the NLTK lemmatizer.
    

In [10]:
def preprocess(corpus):
    
    # discard punctuation
    corpus = re.sub(r'[\.,]', '', corpus)

    # tokenization
    words = nltk.word_tokenize(corpus)
    
    # normalization
    words = [word.lower() for word in words]
    
    # lemmatization
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    
    # stemming - not necessary for small corpus
    # stemmer = PorterStemmer()
    # words = [stemmer.stem(word) for word in words]

    
    # stop-word-removal
    stopwords = set('for a of the and to in'.split())
    words = list(filter(lambda x: x not in stopwords, words))
    
    # a heftier stop list (unnecessary for small example) but you might want it at some point
    # nltk_stop_words = nltk.corpus.stopwords.words('english')
    
    # discard low-frequency words
    # don't need this for small corpus
    # wordcounts = Counter(words)
    # frequency_threshold = 20
    # words = [w for w in words if wordcounts[w] >= frequency_threshold]
    
    return words

In [11]:
preprocess(corpus)[:10]

['by',
 'this',
 'liberty',
 'they',
 'entered',
 'into',
 'very',
 'laudable',
 'emulation',
 'do']
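The frequency limit is commented out in `preprocess` because the corpus is tiny. As a rough sketch of what that step does on its own, the hypothetical helper below keeps every token whose word type occurs at least `min_count` times; the threshold of 2 and the toy word list are only for illustration.

```python
from collections import Counter

def drop_rare_words(words, min_count=2):
    """Keep only tokens whose type occurs at least min_count times."""
    counts = Counter(words)
    # filter the token list (not the set of types), so word order is preserved
    return [w for w in words if counts[w] >= min_count]

# hypothetical, already preprocessed toy tokens
toy_words = "let u drink they would all drink let u play".split()
print(drop_rare_words(toy_words, min_count=2))
```

Filtering tokens rather than types matters here: the context window later needs the words in their original order.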

## Counting contexts

Yesterday we used a left-only version of context because we were generating text from left to right.
Distributional models typically use a context window on both sides of the target word.

In [12]:
def context_counts(words, window_size=1):
    
    context_counts = nltk.ConditionalFreqDist()
    
    # iterate over the whole corpus, but
    # start a little to the right and end a little to the left
    # so we don't have to worry about (literal) edge cases

    for i in range(window_size+1, len(words)-window_size): 
        
        # i is the index of the target word. 
        target = words[i]
        
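        # we want to get the words to the left within the window 
        # and then do the same with the words to the right within the window
        # note: as written, the left slice ends at i-1, so the word directly
        # to the left of the target is skipped; words[i-window_size:i] would include it
        left_context_words = words[i-1-window_size:i-1]
        right_context_words = words[i+1:i + 1 + window_size ]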
        
        # and add them to the context count
        for context in (left_context_words + right_context_words):
            context_counts[target][context] += 1
        
        #break
        
    return context_counts

It may be helpful to compare this function with the one that builds the frequency distribution in our ngram model
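For comparison, here is a minimal sketch of one alternative way to write a symmetric window with plain Python collections. It is not the function used in the rest of this notebook: unlike the function above, it includes the words directly next to the target and handles the corpus edges by clipping the window instead of skipping the first and last positions. The name `window_counts` and the toy word list are assumptions for this example.

```python
from collections import Counter, defaultdict

def window_counts(words, window_size=1):
    """Count, for every target word, the words within window_size on either side."""
    counts = defaultdict(Counter)
    for i, target in enumerate(words):
        # clip the left edge at 0 so a negative index never wraps around
        left = words[max(0, i - window_size):i]
        # slicing past the end of a list is safe, so the right edge clips itself
        right = words[i + 1:i + 1 + window_size]
        for context in left + right:
            counts[target][context] += 1
    return counts

# hypothetical, already preprocessed toy tokens
toy_words = "let u play they all played".split()
toy_counts = window_counts(toy_words, window_size=2)

print(toy_counts["play"])  # two words on each side
print(toy_counts["let"])   # at the left edge: right-hand context only
```

Which variant is "right" depends on the definition of context you want; the counts, and therefore the similarities, change with it.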

In [13]:
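# note: this rebinds the name context_counts from the function to its result,
# so re-run the cell that defines the function before running this cell again
context_counts = context_counts(preprocess(corpus), window_size=2)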

Now we have a data structure containing context counts for each word. Our ConditionalFreqDist has our words as our conditions, and contexts as its outcomes.

For example, the word "play" occurs with four different context words. 

In [14]:
context_counts["play"].items()

dict_items([('said', 1), ('let', 1), ('they', 1), ('all', 1)])

We use the function defined above to visualize the cfd as a table.

In [15]:
cfd_to_dataframe(context_counts)

| target | a-walking | all | any | by | did | do | drink | emulation | entered | field | ... | say | should | them | they | this | u | very | went | what | would |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a-walking | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| all | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| any | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| did | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| do | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| drink | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 |
| emulation | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| entered | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| field | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| gallant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


# Comparing words

The most exciting thing about representing words as points in space is that it gives us access to geometric notions like distance to compare signs.

Look again at the table. Words that occur in the same contexts have positive values along the same dimensions. This results in them being near to one another in space. I'll say that again: words that share contexts end up close together.

"play" shares contexts with "drink", making them similar:

In [16]:
context_counts["play"].items()

dict_items([('said', 1), ('let', 1), ('they', 1), ('all', 1)])

In [17]:
context_counts["drink"]

FreqDist({'they': 2, 'would': 2, 'say': 1, 'let': 1, 'if': 1, 'any': 1})

"play" doesn't share any contexts with "gallant", making them *maximally dissimilar*:

In [18]:
context_counts["gallant"]

FreqDist({'one': 1, 'if': 1, 'or': 1, 'lady': 1})

The conditional frequency distribution for "go" is one way to represent the sparse row vector for the word "go". It has positive values for "said", "let", "a-walking", and "into", and zero values everywhere else.

## Comparing Theleme Vectors

Our cfd vector representations are sparse. To do linear algebra operations on them (multiplication and comparison), we need to represent each one as a whole vector. We use NumPy.

Here's a function to make a dictionary of NumPy vectors out of our cfd. It's just a utility.

In [19]:
def cfd_to_vectors(cfd):
    # We put our rows and columns in order.
    # For now, our rows are the same as our columns:
    # All target words are also context words, and vice versa.
    # But that doesn't have to be the case.
    targetlist = sorted(cfd.conditions())
    contextlist = sorted(list(set(c for t in cfd.conditions() 
                                  for c in cfd[t].keys())))

    # make a numpy matrix out of our sparse dictionary entries
    rows = {}
    
    for t in targetlist:
        # for context words c for which we don't have an entry,
        # the ConditionalFreqDist returns zero
        rows[t] = ( np.array( [cfd[t][c] for c in contextlist] ))

    #count_matrix = np.array(rows)
    return rows

We can use it to convert our sparse representations to numpy vectors

In [20]:
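# note: this uses the bigram cfd from the start of the notebook;
# pass context_counts instead to get vectors for the windowed counts
vectors = cfd_to_vectors(cfd)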

In [21]:
vectors['go']

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

We can now compare two words using cosine.

SciPy has a function for cosine, with one catch: it computes 1 - cosine, as a distance. Here's how to get back to cosine similarity:

In [22]:
def cosine_sim(vec1, vec2):
    """
    vec1 and vec2 are Numpy vectors
    """
    return 1 - scipy.spatial.distance.cosine(vec1, vec2)

### Question: how do we calculate the cosine similarity of 'drink' and 'all' ?

In [23]:
# answer here


### What about the similarity of 'drink' and 'play'?

In [24]:
# answer here

## From word co-occurrence counts to association weights

As we discussed in class, raw frequency counts may not be what we want -- we don't need to know that all words co-occur a lot with "the" and "a". Even if we ditch stopwords, the frequency bias in the data may not be what we want: Do we need to know that all words co-occur a lot with "said"? 

Several methods have been developed for going from counts to association weights, including tf-idf and pointwise mutual information. Here, we demonstrate how to compute pointwise mutual information, defined as

$PMI(a, b) = \log \frac{P(a, b)}{P(a)P(b)}$ 

In the numerator, we have the joint probability of a *and* b. The formula compares this to the denominator, which has the product of the probability of a and the probability of b: If a and b were completely independent, had zero association, we would expect them to co-occur only by chance, that is, we would expect $P(a, b) = P(a)P(b)$. If $P(a, b)$ is larger than $P(a)P(b)$, then a and b are positively associated -- they co-occur more often than you would expect just from chance encounters. If $P(a, b)$ is smaller than $P(a)P(b)$, then a and b are negatively associated -- they really don't want to go together. 

In practice, we are often not interested in negative associations, and only use positive ones. Then we get PPMI:


$PPMI(a, b) = \left\{\begin{array}{ll}PMI(a, b) & \text{if } PMI(a, b) > 0\\
0 & \text{else}
\end{array}\right.$
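As a small worked example (the probabilities are made up for illustration, and the base-2 logarithm is an assumption, since the formula above does not fix a base): suppose $P(a) = 0.2$ and $P(b) = 0.25$, so that chance alone would give $P(a)P(b) = 0.05$.

$P(a, b) = 0.1: \quad PMI(a, b) = \log_2 \frac{0.1}{0.05} = \log_2 2 = 1, \quad PPMI(a, b) = 1$

$P(a, b) = 0.025: \quad PMI(a, b) = \log_2 \frac{0.025}{0.05} = \log_2 0.5 = -1, \quad PPMI(a, b) = 0$

In the first case the pair co-occurs twice as often as chance predicts; in the second, half as often, and PPMI clips the negative value to zero.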



## Homework: Transform the co-occurrence count matrix into an association matrix using PPMI

The easiest way to do this is to store our word vectors as a large matrix (a 2 dimensional array), and perform operations on this. Here is a function that converts a cfd into a numpy matrix. It also returns two dictionaries that help us get the labels of the rows and columns. 

In [25]:
def cfd_to_matrix_and_label_lookups(cfd):
    targetlist = sorted(cfd.conditions())
    contextlist = sorted(list(set(c for t in cfd.conditions() 
                                  for c in cfd[t].keys())))
    rows = [ ]
    for t in targetlist:
        # for context words c for which we don't have an entry,
        # the ConditionalFreqDist returns zero
        rows.append( [cfd[t][c] for c in contextlist])

    count_matrix = np.array(rows)

    # and now we make a dictionary to look up rows by target word
    target_dict = { }
    for index, target in enumerate(targetlist):
        target_dict[target] = index

    # and now we make a dictionary to look up columns by context word
    context_dict = { }
    for index, context in enumerate(contextlist):
        context_dict[context] = index
    
    return count_matrix, target_dict, context_dict

In [26]:
count_matrix, target_dict, context_dict = cfd_to_matrix_and_label_lookups(cfd)

Here is the information you'll need:

Formula for pointwise mutual information (PMI)

$PMI(t, c) = \log \frac{P(t, c)}{P(t) P(c)}$

Definitions of different sums
```
   #(t, c): the co-occurrence count of t with c
   #(_, _): the sum of counts in the whole table, across all targets
   #(t, _): the sum of counts in the row of target t
   #(_, c): the sum of counts in the column of context item c
```

Formula expressions in terms of these sums
```
    P(t, c) = #(t, c) / #(_, _)
    P(t) = #(t, _) / #(_, _)
    P(c) = #(_, c) / #(_, _)
```
and finally
```
PPMI(t, c) = { PMI(t, c) if PMI(t, c) >= 0
               0, else
```
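To see how the sums fit together before working on the full matrix, here is a sketch that computes PPMI for a single cell of a tiny, made-up count table stored as nested dictionaries. It is deliberately not the matrix version the homework asks for. The table, the function name `ppmi`, and the choice of base-2 logarithm are all assumptions for this example.

```python
import math

# hypothetical count table: target word -> {context word: co-occurrence count}
toy_counts = {
    "drink":   {"let": 3, "lady": 1},
    "gallant": {"let": 1, "lady": 3},
}

def ppmi(target, context, counts):
    """PPMI for one (target, context) cell, using the sums defined above."""
    total = sum(sum(row.values()) for row in counts.values())             # #(_, _)
    target_total = sum(counts[target].values())                           # #(t, _)
    context_total = sum(row.get(context, 0) for row in counts.values())   # #(_, c)
    joint = counts[target].get(context, 0)                                # #(t, c)
    if joint == 0:
        return 0.0  # PMI is undefined for a zero count; PPMI treats it as 0
    p_tc = joint / total
    p_t = target_total / total
    p_c = context_total / total
    pmi = math.log2(p_tc / (p_t * p_c))
    return max(pmi, 0.0)  # clip negative associations to zero

print(ppmi("gallant", "lady", toy_counts))  # positive: more often than chance
print(ppmi("drink", "lady", toy_counts))    # PMI is negative here, so PPMI is 0
```

Recomputing the totals for every cell is wasteful; on a real matrix you would compute the row sums, column sums, and overall sum once.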


In [27]:
# target count #(t, _):
print("target count for one", count_matrix[ target_dict["one"]].sum())

# overall count #(_, _):
print("overall count", count_matrix.sum())

# context item count #(_, c):
print("context item count for drink", count_matrix[ :, context_dict["drink"]].sum())

target count for one 3
overall count 70
context item count for drink 2


Your code should take `count_matrix` and transform it into a matrix with PPMI weighted values.

In [28]:
# Start with the big matrix we made and  transform it to a matrix with PPMI weighted values

#
# do stuff here
#

## Dimensionality Reduction

Having a large number of dimensions in the feature space can mean that the volume of that space is very large, and in turn, the points that we have in that space (rows of data) often represent a small and non-representative sample.

This can dramatically impact the performance of machine learning algorithms fit on data with many input features, generally referred to as the “curse of dimensionality.”

Our data are very sparse. The vectors are mostly zeros with a few nonzero values. SVD is the technique of choice for sparse data.

Principal Components Analysis (PCA) is super common as well.
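As a rough sketch of the SVD route, the block below reduces a small made-up count matrix with `np.linalg.svd` and keeps the first `k` dimensions. It works on a dense array for simplicity (a large sparse matrix would call for a sparse solver), and the matrix and the helper name `svd_reduce` are invented for this example.

```python
import numpy as np

# hypothetical count matrix: 3 target words (rows) x 4 context words (columns)
toy_matrix = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
])

def svd_reduce(matrix, k):
    """Project the rows of `matrix` onto its top k singular directions."""
    # matrix = u @ diag(s) @ vt, with singular values sorted from largest to smallest
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # scale each kept column of u by its singular value to get the reduced rows
    return u[:, :k] * s[:k]

reduced = svd_reduce(toy_matrix, k=2)
print(reduced.shape)  # one row per target word, k columns
```

Unlike PCA, this plain SVD does not center the data first, so the two methods generally give different coordinates.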


In [29]:
# dimensionality reduction
# Principal component analysis

pcaobj = PCA()
pca_matrix = pcaobj.fit_transform(count_matrix)
# and let's actually reduce dimensionality
keep_this_many_dimensions = 10
pca_matrix = pca_matrix[:, :keep_this_many_dimensions]

Here is a utility function to turn a matrix into a dataframe. Use this if you like.

In [30]:
def matrix_to_dataframe(count_matrix, target_dict, context_dict):
    
    # recover the row labels, ordered by their row index in the matrix
    targetlist = sorted(target_dict, key=target_dict.get)

    # use the context words as column labels when the matrix still has one
    # column per context word; a dimensionality-reduced matrix does not,
    # so fall back to numbered dimensions in that case
    if count_matrix.shape[1] == len(context_dict):
        # add a space so that pandas doesn't get confused that our rows and columns have the same name
        columns = [c + " " for c in sorted(context_dict, key=context_dict.get)]
    else:
        columns = ["dim " + str(i) for i in range(count_matrix.shape[1])]
    
    # make the pandas dataframe
    df = pd.DataFrame(data=count_matrix, index=targetlist, columns=columns)
    df.index.name = 'target'
    
    return df

When we compute similarity again, the cosine value is different! Absolute similarity values can vary widely between the original space and the dimensionality-reduced space, but they will probably still predict the same nearest neighbors.

In [31]:
# and computing similarity again

cosine_sim( pca_matrix[target_dict['drink']], pca_matrix[target_dict['play']])

0.7953539509042691

## An aside: Nearest Neighbors

We want to know about a word's nearest neighbors. Computing this by hand is a major pain: You would have to compute all pairwise cosines, and then rummage through them to find the maximum. In our tiny corpus, this is feasible, but not in a large corpus. Fortunately, scikit-learn has a class, NearestNeighbors, that can do the work for us. One catch: it works with distances rather than similarities, so we ask for cosine distance and convert back to similarity afterwards. 

First option: We give it the cosine distance function that we used above. 

In [32]:

# we make a nearest-neighbors object and tell it we'll always want the 3 nearest neighbors at a time
nearest_neighbors_obj = NearestNeighbors(n_neighbors=3, metric = scipy.spatial.distance.cosine)

# we then allow it to compute an internal datastructure from our data
nearest_neighbors_obj.fit(pca_matrix)

In [33]:
cosine_distances, target_indices = nearest_neighbors_obj.kneighbors([pca_matrix[target_dict["us"]]])

`cosine_distances` and `target_indices` are both two-dimensional arrays. Let's extract lists of values.

In [34]:
## this block prints out the nearest neighbors of the word we select in the previous block

target_list = sorted(target_dict.keys())

cosine_distances = cosine_distances[0].tolist()
target_indices = target_indices[0].tolist()

for cosinedist, targetindex in zip(cosine_distances, target_indices):
    print("Neighbor is", target_list[targetindex], "with similarity", 1 - cosinedist)

Neighbor is us with similarity 1.0
Neighbor is all with similarity 0.316358691133761
Neighbor is did with similarity 0.01736892319791339


## Exercise: What are the 4 closest neighbors of drink?

In [35]:
# Do your work here



## Neural networks are actually just the same thing

<img src="../img/implicit.png" alt="Alternative text" />

So: Why not just do this?

## Creative Writing with the vectorized word

This lesson has been about taking relationships of form and turning them into relations of meaning using vector spaces (they are called semantic spaces for a reason).

In this project, Allison Parrish turns the concept on its head by using it to organize form rather than meaning. 

https://www.youtube.com/watch?v=L3D0JEA1Jdc&t=2055s

## Question: How else could we imagine context???

potential projects???

comparing semantic similarity.