**Disclaimer:**
This is a direct port of https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469 or https://www.kaggle.com/code/john77eipe/understanding-word-vectors


In [None]:
!python -m spacy download en_vectors_web_lg
!python -m spacy link en_vectors_web_lg en_vectors_web_lg

In [None]:
from __future__ import unicode_literals
import spacy
nlp = spacy.load('en_vectors_web_lg')

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Understanding word vectors

... for, like, actual poets. By [Allison Parrish](http://www.decontextualize.com/)

In this tutorial, I'm going to show you how word vectors work. This tutorial assumes a good amount of Python knowledge, but even if you're not a Python expert, you should be able to follow along and make small changes to the examples without too much trouble.

This is a "[Jupyter Notebook](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/)," which consists of text and "cells" of code. After you've loaded the notebook, you can execute the code in a cell by highlighting it and hitting Ctrl+Enter. In general, you need to execute the cells from top to bottom, but you can usually run a cell more than once without messing anything up. Experiment!

If things start acting strange, you can interrupt the Python process by selecting "Kernel > Interrupt"—this tells Python to stop doing whatever it was doing. Select "Kernel > Restart" to clear all of your variables and start from scratch.

## Why word vectors?

Poetry is, at its core, the art of identifying and manipulating linguistic similarity. I have discovered a truly marvelous proof of this, which this notebook is too narrow to contain. (By which I mean: I will elaborate on this some other time.)

## Animal similarity and simple linear algebra

We'll begin by considering a small subset of English: words for animals. Our task is to be able to write computer programs to find similarities among these words and the creatures they designate. To do this, we might start by making a spreadsheet of some animals and their characteristics. For example:

![Animal spreadsheet](http://static.decontextualize.com/snaps/animal-spreadsheet.png)

This spreadsheet associates a handful of animals with two numbers: their cuteness and their size, both in a range from zero to one hundred. (The values themselves are simply based on my own judgment. Your taste in cuteness and evaluation of size may differ significantly from mine. As with all data, these data are simply a mirror reflection of the person who collected them.)

These values give us everything we need to make determinations about which animals are similar (at least, similar in the properties that we've included in the data). Try to answer the following question: Which animal is most similar to a capybara? You could go through the values one by one and do the math to make that evaluation, but visualizing the data as points in 2-dimensional space makes finding the answer very intuitive:

![Animal space](http://static.decontextualize.com/snaps/animal-space.png)

The plot shows us that the closest animal to the capybara is the panda bear (again, in terms of its subjective size and cuteness). One way of calculating how "far apart" two points are is to find their *Euclidean distance*. (This is simply the length of the line that connects the two points.) For points in two dimensions, Euclidean distance can be calculated with the following Python function:

In [None]:
import math
def distance2d(x1, y1, x2, y2):
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)

(The `**` operator raises the value on its left to the power on its right.)

So, the distance between "capybara" (70, 30) and "panda" (75, 40):

In [None]:
distance2d(70, 30, 75, 40) # panda and capybara

... is less than the distance between "tarantula" and "elephant":

In [None]:
distance2d(8, 3, 65, 90) # tarantula and elephant

Modeling animals in this way has a few other interesting properties. For example, you can pick an arbitrary point in "animal space" and then find the animal closest to that point. If you imagine an animal of size 25 and cuteness 30, you can easily look at the space to find the animal that most closely fits that description: the chicken.

Reasoning visually, you can also answer questions like: what's halfway between a chicken and an elephant? Simply draw a line from "elephant" to "chicken," mark off the midpoint and find the closest animal. (According to our chart, halfway between an elephant and a chicken is a horse.)
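The "halfway" question can also be answered with arithmetic rather than by eye. The cell below is a minimal sketch: it only knows the four animals whose coordinates appear in the `distance2d` calls above, and the names `midpoint2d` and `nearest_animal` are my own, not part of the original tutorial.

```python
import math

# Coordinates taken from the distance2d cells above.
animals = {
    "capybara": (70, 30),
    "panda": (75, 40),
    "tarantula": (8, 3),
    "elephant": (65, 90),
}

def midpoint2d(p, q):
    """Point halfway between two 2D points."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

def nearest_animal(point, space):
    """Name of the animal whose coordinates are closest to point."""
    return min(space, key=lambda name: math.dist(point, space[name]))

halfway = midpoint2d(animals["tarantula"], animals["elephant"])
print(halfway, nearest_animal(halfway, animals))
```

With only four animals in the dictionary the answer is not very meaningful, but the same two functions work unchanged on the full spreadsheet.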

You can also ask: what's the *difference* between a hamster and a tarantula? According to our plot, it's about seventy-five units of cute (and a few units of size).

The relationship of "difference" is an interesting one, because it allows us to reason about *analogous* relationships. In the chart below, I've drawn an arrow from "tarantula" to "hamster" (in blue):

![Animal analogy](http://static.decontextualize.com/snaps/animal-space-analogy.png)

You can understand this arrow as being the *relationship* between a tarantula and a hamster, in terms of their size and cuteness (i.e., hamsters and tarantulas are about the same size, but hamsters are much cuter). In the same diagram, I've also transposed this same arrow (this time in red) so that its origin point is "chicken." The arrow ends closest to "kitten." What we've discovered is that the animal that is about the same size as a chicken but much cuter is... a kitten. To put it in terms of an analogy:

    Tarantulas are to hamsters as chickens are to kittens.
    
A sequence of numbers used to identify a point is called a *vector*, and the kind of math we've been doing so far is called *linear algebra.* (Linear algebra is surprisingly useful across many domains: It's the same kind of math you might do to, e.g., simulate the velocity and acceleration of a sprite in a video game.)

A set of vectors that are all part of the same data set is often called a *vector space*. The vector space of animals in this section has two *dimensions*, by which I mean that each vector in the space has two numbers associated with it (i.e., two columns in the spreadsheet). The fact that this space has two dimensions just happens to make it easy to *visualize* the space by drawing a 2D plot. But most vector spaces you'll work with will have more than two dimensions—sometimes many hundreds. In those cases, it's more difficult to visualize the "space," but the math works pretty much the same.

## Language with vectors: colors

So far, so good. We have a system in place—albeit highly subjective—for talking about animals and the words used to name them. I want to talk about another vector space that has to do with language: the vector space of colors.

Colors are often represented in computers as vectors with three dimensions: red, green, and blue. Just as with the animals in the previous section, we can use these vectors to answer questions like: which colors are similar? What's the most likely color name for an arbitrarily chosen set of values for red, green and blue? Given the names of two colors, what's the name of those colors' "average"?

We'll be working with this [color data](https://github.com/dariusk/corpora/blob/master/data/colors/xkcd.json) from the [xkcd color survey](https://blog.xkcd.com/2010/05/03/color-survey-results/). The data relates a color name to the RGB value associated with that color. [Here's a page that shows what the colors look like](https://xkcd.com/color/rgb/). Download the color data and put it in the same directory as this notebook.

A few notes before we proceed:

* The linear algebra functions implemented below (`addv`, `meanv`, etc.) are slow, potentially inaccurate, and shouldn't be used for "real" code—I wrote them so beginner programmers can understand how these kinds of functions work behind the scenes. Use [numpy](http://www.numpy.org/) for fast and accurate math in Python.
* If you're interested in perceptually accurate color math in Python, consider using the [colormath library](http://python-colormath.readthedocs.io/en/latest/).

Now, import the `json` library and load the color data:

In [None]:
import json

In [None]:
# open the file in a with-block so it is closed once the data is loaded
with open("/kaggle/input/color-data-xkcd/xkcd.json") as f:
    color_data = json.load(f)

The following function converts colors from hex format (`#1a2b3c`) to a tuple of integers:

In [None]:
def hex_to_int(s):
    s = s.lstrip("#")
    return int(s[:2], 16), int(s[2:4], 16), int(s[4:6], 16)

And the following cell creates a dictionary and populates it with mappings from color names to RGB vectors for each color in the data:

In [None]:
colors = dict()
for item in color_data['colors']:
    colors[item["color"]] = hex_to_int(item["hex"])

Testing it out:

In [None]:
colors['olive']

In [None]:
colors['red']

In [None]:
colors['black']

### Vector math

Before we keep going, we'll need some functions for performing basic vector "arithmetic." These functions will work with vectors in spaces of any number of dimensions.

The first function returns the Euclidean distance between two points:

In [None]:
import math
def distance(coord1, coord2):
    # note, this is VERY SLOW, don't use for actual code
    return math.sqrt(sum([(i - j)**2 for i, j in zip(coord1, coord2)]))
distance([10, 1], [5, 2])

The `subtractv` function subtracts one vector from another:

In [None]:
def subtractv(coord1, coord2):
    return [c1 - c2 for c1, c2 in zip(coord1, coord2)]
subtractv([10, 1], [5, 2])

The `addv` function adds two vectors together:

In [None]:
def addv(coord1, coord2):
    return [c1 + c2 for c1, c2 in zip(coord1, coord2)]
addv([10, 1], [5, 2])

And the `meanv` function takes a list of vectors and finds their mean or average:

In [None]:
def meanv(coords):
    # assumes every item in coords has same length as item 0
    sumv = [0] * len(coords[0])
    for item in coords:
        for i in range(len(item)):
            sumv[i] += item[i]
    mean = [0] * len(sumv)
    for i in range(len(sumv)):
        mean[i] = float(sumv[i]) / len(coords)
    return mean
meanv([[0, 1], [2, 2], [4, 3]])

Just as a test, the following cell shows that the distance from "red" to "green" is greater than the distance from "red" to "pink":

In [None]:
distance(colors['red'], colors['green']) > distance(colors['red'], colors['pink'])
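For comparison, here is roughly what the same four operations look like with numpy, which is what you'd reach for in "real" code. This is only a sketch using the same small test vectors as the cells above; it is not a drop-in replacement for the functions used in the rest of this notebook.

```python
import numpy as np

a = np.array([10, 1])
b = np.array([5, 2])

print(np.linalg.norm(a - b))   # Euclidean distance, like distance(a, b)
print(a - b)                   # like subtractv(a, b)
print(a + b)                   # like addv(a, b)

coords = np.array([[0, 1], [2, 2], [4, 3]])
print(coords.mean(axis=0))     # like meanv(coords)
```

Because numpy arrays support `+` and `-` directly, most of the helper functions disappear entirely.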

### Finding the closest item

Just as we wanted to find the animal that most closely matched an arbitrary point in cuteness/size space, we'll want to find the closest color name to an arbitrary point in RGB space. The easiest way to find the closest item to an arbitrary vector is simply to find the distance between the target vector and each item in the space, in turn, then sort the list from closest to farthest. The `closest()` function below does just that. By default, it returns a list of the ten closest items to the given vector.

> Note: Calculating "closest neighbors" like this is fine for the examples in this notebook, but unmanageably slow for vector spaces of any appreciable size. As your vector space grows, you'll want to move to a faster solution, like SciPy's [kdtree](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.html) or [Annoy](https://pypi.python.org/pypi/annoy).

In [None]:
def closest(space, coord, n=10):
    closest = []
    for key in sorted(space.keys(),
                        key=lambda x: distance(coord, space[x]))[:n]:
        closest.append(key)
    return closest

Testing it out, we can find the ten colors closest to "red":

In [None]:
closest(colors, colors['red'])

... or the ten colors closest to (150, 60, 150):

In [None]:
closest(colors, [150, 60, 150])
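As the note above suggests, a tree-based index is one way to speed up nearest-neighbor lookups. The following is a hedged sketch using SciPy's `KDTree`; the palette is a handful of textbook RGB values I made up for illustration (not the xkcd data), and `closest_kdtree` is a hypothetical helper name.

```python
import numpy as np
from scipy.spatial import KDTree

# Toy palette with textbook RGB values (NOT the xkcd survey data).
palette = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "purple": (128, 0, 128),
}
names = list(palette)
tree = KDTree(np.array([palette[name] for name in names]))

def closest_kdtree(coord, n=3):
    """Names of the n palette entries nearest to coord."""
    _, indices = tree.query(coord, k=n)
    # query returns a bare index when k=1, so normalize to an array
    return [names[i] for i in np.atleast_1d(indices)]

print(closest_kdtree([150, 60, 150]))
```

The tree is built once and then reused for every query, which is where the savings come from compared with sorting the whole space each time.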

### Color magic

The magical part of representing words as vectors is that the vector operations we defined earlier appear to operate on language the same way they operate on numbers. For example, if we find the word closest to the vector resulting from subtracting "red" from "purple," we get a series of "blue" colors:

In [None]:
closest(colors, subtractv(colors['purple'], colors['red']))

This matches our intuition about RGB colors, which is that purple is a combination of red and blue. Take away the red, and blue is all you have left.

You can do something similar with addition. What's blue plus green?

In [None]:
closest(colors, addv(colors['blue'], colors['green']))

That's right, it's something like turquoise or cyan! What if we find the average of black and white? Predictably, we get gray:

In [None]:
# the average of black and white: medium grey
closest(colors, meanv([colors['black'], colors['white']]))

Just as with the tarantula/hamster example from the previous section, we can use color vectors to reason about relationships between colors. In the cell below, finding the difference between "pink" and "red" then adding it to "blue" seems to give us a list of colors that are to blue what pink is to red (i.e., a slightly lighter, less saturated shade):

In [None]:
# an analogy: pink is to red as X is to blue
pink_to_red = subtractv(colors['pink'], colors['red'])
closest(colors, addv(pink_to_red, colors['blue']))

Another example of color analogies: Navy is to blue as true green/dark grass green is to green:

In [None]:
# another example:
navy_to_blue = subtractv(colors['navy'], colors['blue'])
closest(colors, addv(navy_to_blue, colors['green']))

The examples above are fairly simple from a mathematical perspective but nevertheless *feel* magical: they're demonstrating that it's possible to use math to reason about how people use language.

### Interlude: A Love Poem That Loses Its Way

In [None]:
import random
red = colors['red']
blue = colors['blue']
for i in range(14):
    rednames = closest(colors, red)
    bluenames = closest(colors, blue)
    print ("Roses are " + rednames[0] + ", violets are " + bluenames[0])
    red = colors[random.choice(rednames[1:])]
    blue = colors[random.choice(bluenames[1:])]

### Doing bad digital humanities with color vectors

With the tools above in hand, we can start using our vectorized knowledge of language toward academic ends. In the following example, I'm going to calculate the average color of Bram Stoker's *Dracula*.

(Before you proceed, make sure to [download the text file from Project Gutenberg](http://www.gutenberg.org/cache/epub/345/pg345.txt) and place it in the same directory as this notebook.)

We'll use [spaCy](https://spacy.io/) to parse the text. It was already loaded into the `nlp` variable at the top of this notebook.

To calculate the average color, we'll follow these steps:

1. Parse the text into words
2. Check every word to see if it names a color in our vector space. If it does, add it to a list of vectors.
3. Find the average of that list of vectors.
4. Find the color(s) closest to that average vector.

The following cell performs steps 1-3:

In [None]:
doc = nlp(open("/kaggle/input/dracula-story-pg345/pg345.txt").read())
# use word.lower_ to normalize case
drac_colors = [colors[word.lower_] for word in doc if word.lower_ in colors]
avg_color = meanv(drac_colors)
print (avg_color)

Now, we'll pass the averaged color vector to the `closest()` function, yielding... well, it's just a brown mush, which is kinda what you'd expect from adding a bunch of colors together willy-nilly.

In [None]:
closest(colors, avg_color)

Exercise for the reader: Use the vector arithmetic functions to rewrite a text, making it...

* more blue (i.e., add `colors['blue']` to each occurrence of a color word); or
* lighter (i.e., add `colors['white']` to each occurrence of a color word); or
* darker (i.e., attenuate each color. You might need to write a vector multiplication function to do this one right.)
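If you want a starting point for the "darker" variant, a vector multiplication function might look something like the sketch below. The names `scalev` and `darken` are my own suggestions, and rounding down to integers is just one possible choice.

```python
def scalev(coord, factor):
    """Multiply every component of a vector by a single number."""
    return [c * factor for c in coord]

def darken(rgb, factor=0.5):
    """Scale an RGB vector toward black; factor should be between 0 and 1."""
    return [int(c) for c in scalev(rgb, factor)]

print(darken([200, 100, 50]))
```

Passing the result to `closest()` would then give you the name of the darkened color.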

## Distributional semantics

In the previous section, the examples are interesting because of a simple fact: colors that we think of as similar are "closer" to each other in RGB vector space. In our color vector space, or in our animal cuteness/size space, you can think of the words identified by vectors close to each other as being *synonyms*, in a sense: they sort of "mean" the same thing. They're also, for many purposes, *functionally identical*. Think of this in terms of writing, say, a search engine. If someone searches for "mauve trousers," then it's probably also okay to show them results for, say,

In [None]:
for cname in closest(colors, colors['mauve']):
    print (cname + " trousers")

That's all well and good for color words, which intuitively seem to exist in a multidimensional continuum of perception, and for our animal space, where we've written out the vectors ahead of time. But what about... arbitrary words? Is it possible to create a vector space for all English words that has this same "closer in space is closer in meaning" property?

To answer that, we have to back up a bit and ask the question: what does *meaning* mean? No one really knows, but one theory popular among computational linguists, computer scientists and other people who make search engines is the [Distributional Hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics), which states that:

    Linguistic items with similar distributions have similar meanings.
    
What's meant by "similar distributions" is *similar contexts*. Take for example the following sentences:

    It was really cold yesterday.
    It will be really warm today, though.
    It'll be really hot tomorrow!
    Will it be really cool Tuesday?
    
According to the Distributional Hypothesis, the words `cold`, `warm`, `hot` and `cool` must be related in some way (i.e., be close in meaning) because they occur in a similar context, i.e., between the word "really" and a word indicating a particular day. (Likewise, the words `yesterday`, `today`, `tomorrow` and `Tuesday` must be related, since they occur in the context of a word indicating a temperature.)

In other words, according to the Distributional Hypothesis, a word's meaning is just a big list of all the contexts it occurs in. Two words are closer in meaning if they share contexts.
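To make "a big list of contexts" concrete, here is a small sketch that records, for every word in the four example sentences, the words immediately before and after it. The regular-expression tokenizer and the one-word window are simplifying assumptions of mine; real systems use larger windows and much more text.

```python
import re
from collections import defaultdict

sentences = [
    "It was really cold yesterday.",
    "It will be really warm today, though.",
    "It'll be really hot tomorrow!",
    "Will it be really cool Tuesday?",
]

# Map each word to the set of (previous word, next word) pairs it occurs between.
contexts = defaultdict(set)
for sentence in sentences:
    words = re.findall(r"[a-z']+", sentence.lower())
    for i, word in enumerate(words):
        before = words[i - 1] if i > 0 else None
        after = words[i + 1] if i + 1 < len(words) else None
        contexts[word].add((before, after))

# Words that appear directly after "really"
after_really = sorted(w for w, ctx in contexts.items()
                      if any(before == "really" for before, _ in ctx))
print(after_really)
```

The temperature words end up grouped together purely because of where they occur, without the program knowing anything about temperature.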

## Text Processing

In natural language processing, word embeddings are constructed using a variety of word vector methods. Three terms come up often:

* NLP (natural language processing): converting text into structured data.
* NLU (natural language understanding): working out what the user input means once it is structured data.
* NLG (natural language generation): converting structured data back into text, using insights from NLU.

NLU and NLG are subsets of NLP.

## Different types of Word Embeddings

The different types of word embeddings can be broadly classified into two categories:

1. Frequency based Embedding
1. Prediction based Embedding

### 2.1 Frequency based Embedding

There are generally three types of vectors that we encounter under this category.

1. Count Vector
1. TF-IDF Vector
1. Co-Occurrence Vector

Let's define the documents and the common cleanup and preprocessing first.

In [None]:
# let's define documents with 1 or 2 sentences each.

documents =[
            "I was hungry ,so i ate a cake. Now I like cakes more.",
            "I ate chicken but I'm still hungry so I bought a tasty cake and gave it to my hungry sister",
            "My sister was hungry so she ate the tasty cake."
]

#### Constructing Vocabulary or Tokens

This pre-processing stage is common to every algorithm. It consists of four steps:

1. Normalize
1. Tokenize
1. Stopwords removal
1. Stem/Lemmatize

We do all of this with a library such as NLTK.

In [None]:
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')  # needed by WordNetLemmatizer below

In [None]:
# normalization
processed_documents = []
for document in documents:
    processed_text = document.lower()
    processed_text = re.sub('[^a-zA-Z]', ' ', processed_text )
    processed_documents.append(processed_text)

processed_documents

In [None]:
# tokenization
# note that we didn't use sentence tokenization because we treat each document as a whole
all_words = [nltk.word_tokenize(sent) for sent in processed_documents]
print(all_words)

In [None]:
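# stopwords removal

from nltk.corpus import stopwords
# build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
for i in range(len(all_words)):
    all_words[i] = [w for w in all_words[i] if w not in stop_words]
all_words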

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
final_docs = []
for words in all_words:
    doc = [lemmatizer.lemmatize(word) for word in words]
    final_docs.append(doc)
final_docs

In [None]:
# constructing vocab
vocab = []
for words in final_docs:
    for word in words:
        if not (word in vocab):
            vocab.append(word)

In [None]:
vocab

### Count Vectorization
Consider a corpus C of D documents {d1, d2, ..., dD} and N unique tokens (the vocabulary) extracted from C. The N tokens form our dictionary, and the count vector matrix M has size D X N. Row i of M contains the frequency of each token in document d(i).

In [None]:
# Document Matrix
dm=[]
def initTemparray(length_vocab):
    # a zero-filled list with one slot per vocabulary word
    return [0] * length_vocab

for doc in final_docs:
    temparray = initTemparray(len(vocab))
    for word in doc:
        if word in vocab:
            temparray[vocab.index(word)] += 1
    dm.append(temparray)


In [None]:
dm
# dm means document matrix and dv means document vector
# Here, the rows correspond to the documents in the corpus and the columns correspond to the tokens in the dictionary.

In [None]:
# Transpose the document matrix: each row of vocab_wise holds one word's counts
# across the documents (in reverse document order, because of the [::-1])

vocab_wise = list(zip(*dm[::-1]))

In [None]:
vocab_wise

In [None]:
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D
import random

# Since we can plot only 3 dimensions at a time, we take only the first 3 rows, representing hungry, ate, cake.
x_vals, y_vals, z_vals = list(vocab_wise[0]), list(vocab_wise[1]), list(vocab_wise[2])


In [None]:
import plotly.express as px
# each point is one document; dm[::-1] above reversed the document order
fig = px.scatter_3d(x=x_vals, y=y_vals, z=z_vals, text=["doc 3", "doc 2", "doc 1"])

fig.update_layout(scene = dict(
                    xaxis_title='hungry',
                    yaxis_title='ate',
                    zaxis_title='cake'),
                    width=1000,
                    margin=dict(r=20, b=5, l=5, t=5))

fig.show()

There can be quite a few variations in how the matrix M above is prepared. They generally concern:

- The way the dictionary is prepared. In real-world applications we might have a corpus containing millions of documents, from which we could extract hundreds of millions of unique words. A matrix built as above would then be very sparse and inefficient for any computation. An alternative to using every unique word as a dictionary element is to pick, say, the top 10,000 words by frequency and build the dictionary from those.
- The way the count is taken for each word. The entry in the count matrix M can be either the frequency (the number of times the word appears in the document) or the presence (whether the word appears in the document at all). Frequency is generally preferred.
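Both variations fit in a few lines. The sketch below uses a toy tokenized corpus that I wrote for illustration (it is not the notebook's `final_docs`), keeps only the `top_k` most frequent words as the dictionary, and builds a frequency matrix and a presence matrix side by side.

```python
from collections import Counter

# Toy tokenized documents (hypothetical, not the notebook's final_docs).
docs = [
    ["hungry", "ate", "cake", "like", "cake"],
    ["ate", "chicken", "hungry", "cake"],
    ["sister", "hungry", "ate", "cake"],
]

# Keep only the top_k most frequent words as the dictionary.
top_k = 3
totals = Counter(word for doc in docs for word in doc)
small_vocab = [word for word, _ in totals.most_common(top_k)]

# Frequency entries vs. presence (0/1) entries for the same documents.
freq_matrix = [[doc.count(word) for word in small_vocab] for doc in docs]
presence_matrix = [[int(count > 0) for count in row] for row in freq_matrix]

print(small_vocab)
print(freq_matrix)
print(presence_matrix)
```

Notice that the presence matrix loses the fact that the first document mentions cake twice, which is one reason frequency is usually the preferred entry.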

### Manual verification of Document Matrix

In [None]:
print(dm[0])
print(final_docs[0])
print(vocab)

In [None]:
print(dm[1])
print(final_docs[1])
print(vocab)

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

### Co-Occurrence Matrix

Let’s say there are V unique words in the corpus, so the vocabulary size is V. The columns of the co-occurrence matrix form the context words. The different variations of the co-occurrence matrix are:

- A co-occurrence matrix of size V X V. Even for a modest corpus, V gets very large and difficult to handle, so this architecture is rarely preferred in practice.
- A co-occurrence matrix of size V X N, where the N context words are a subset of the vocabulary obtained by removing irrelevant words such as stopwords. This is still very large and presents computational difficulties.

Remember, though, that this co-occurrence matrix is not the word vector representation that is generally used. Instead, the co-occurrence matrix is decomposed into factors using techniques like PCA or SVD, and a combination of these factors forms the word vector representation.
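To give a rough idea of what "decomposed into factors" means, here is a sketch that applies numpy's singular value decomposition to a tiny co-occurrence matrix. The matrix values are hypothetical numbers chosen for illustration, and keeping `k = 2` factors is an arbitrary choice.

```python
import numpy as np

# Hypothetical 4-word co-occurrence matrix (rows and columns in the same word order).
words = ["hungry", "ate", "cake", "sister"]
cooc = np.array([
    [3, 3, 4, 1],
    [3, 3, 4, 1],
    [4, 4, 6, 1],
    [1, 1, 1, 1],
], dtype=float)

# Singular value decomposition: cooc == U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(cooc)

# Keep the k strongest factors; each row is now a k-dimensional word vector.
k = 2
word_vectors = U[:, :k] * S[:k]

for word, vector in zip(words, word_vectors):
    print(word, vector)
```

In this toy matrix "hungry" and "ate" have identical rows, so they end up with identical reduced vectors: words with the same contexts land on the same point.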

We will be creating a co-occurrence matrix of size V X V.

In [None]:
# Co-occurrence matrix
# For each word (say, hungry), find the documents that contain it and add up their document vectors
cmatrix=[]
for current_index, word in enumerate(vocab):
    temparray = initTemparray(len(vocab))
    print(f'Constructing for word: {word}')
    for idx, dv in enumerate(dm):
        print('-'*45)
        print(f'{idx+1} row in dm')
        print(dv)
        # count every document that contains the word, however many times it appears
        if dv[current_index] > 0:
            for i in range(len(vocab)):
                temparray[i]=temparray[i]+ dv[i]

    cmatrix.append(temparray)
    print(f'Final corpus map for {word}: {temparray}')


In [None]:
cmatrix

In [None]:
print(vocab)

In [None]:
final_docs

In [None]:
import copy
# normalized corpusmap
normCorpusMap = copy.deepcopy(cmatrix)
for j in range(len(vocab)):
    for i in range(len(vocab)):
        normCorpusMap[j][i] = cmatrix[j][i] - max(cmatrix[j])+1

In [None]:
normCorpusMap

In [None]:
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D
import random

# Since we can plot only 3 dimensions at a time, we take only the first 3 rows, representing hungry, ate, cake.
x_vals, y_vals, z_vals = normCorpusMap[0], normCorpusMap[1], normCorpusMap[2]


In [None]:
import plotly.express as px
fig = px.scatter_3d(x=x_vals, y=y_vals, z=z_vals, text=vocab)

fig.update_layout(scene = dict(
                    xaxis_title='hungry',
                    yaxis_title='ate',
                    zaxis_title='cake'),
                    width=1000,
                    margin=dict(r=20, b=10, l=10, t=10))

fig.show()

### TF-IDF

A common issue we have with text analysis is that some words are much more frequent than others and aren’t useful for classification. For example, words like “the”, “is”, and “a” are common English words that don’t convey much meaning. Which words are uninformative differs from task to task, depending on the domain of the text documents. If we are working with movie reviews, the word “movie” will be frequent but not useful. If we were working with email data, on the other hand, the word “movie” may not be frequent and would be useful.

The simplest way to account for these overrepresented words is to divide word count by the proportion of text documents each word appeared in. For example, the document:

“I loved this movie! It was great, great, great.”

…contains the word “loved” and “movie” once each. Now, let’s suppose that we look at all the other documents and find that, in total, “loved” appears in 1% of text documents and “movie” appears in 33%. We could now weight our scores as

`“loved” = times it appears in text / proportion of texts it appears in = 1 / 1%`
`“movie” = times it appears in text / proportion of texts it appears in = 1 / 33%`

Before applying weights, both “loved” and “movie” had a score of 1 (since each word appeared in the sentence once). After we apply weights, “loved” has a score of 100 and “movie” has a score of about 3. The score for “loved” is much higher relative to “movie”, indicating that we care about the word “loved” much more than “movie”.

In fact, our score for “loved” is now 33 times larger than our score for “movie”. While we suspect that “movie” should be less important than “loved” for predicting whether a review is positive or negative, this relative difference might be too big. Very rare words (misspelled words, perhaps) will receive too much relative weight in our current weighting scheme.

We need to strike a balance: downweight very frequent words without overweighting rare ones. This is what term frequency–inverse document frequency (tf-idf) weighting does for us. In the simple weighting scheme, we used the formula:

**times a word appears in text * (1 / proportion of texts it appears in)**

tf-idf weighting alters this formula slightly by taking the log of the second term:

**times a word appears in text * log(1 / proportion of texts it appears in)**

By taking the log, we ensure that our weight changes slowly in relation to how frequently a word appears in all our documents. This means that while common words are downweighted, they aren’t downweighted too much. (There’s also a connection to information theory.)
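Plugging the numbers from the movie-review example into both formulas shows how much the log tames the difference. This is a small worked sketch; the choice of the natural logarithm is an assumption on my part (the ratio between the two words comes out the same for any base).

```python
import math

# Numbers from the example above: each word appears once in the review;
# "loved" is in 1% of documents and "movie" is in 33%.
count = 1
proportion = {"loved": 0.01, "movie": 0.33}

for word, p in proportion.items():
    simple = count * (1 / p)
    tfidf = count * math.log(1 / p)   # natural log is an assumption here
    print(f"{word}: simple={simple:.2f}, tf-idf={tfidf:.2f}")

ratio_simple = (1 / 0.01) / (1 / 0.33)
ratio_tfidf = math.log(1 / 0.01) / math.log(1 / 0.33)
print(f"loved/movie ratio: simple={ratio_simple:.1f}, tf-idf={ratio_tfidf:.1f}")
```

Under the simple scheme "loved" outweighs "movie" by a factor of 33; with the log, the factor drops to a little over 4.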

For Python code and detailed example:
https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089

### 2.2 Prediction based Embedding

### Word vectors in spaCy

Okay, let's have some fun with real word vectors. We're going to use the GloVe vectors that come with spaCy to creatively analyze and manipulate the text of Bram Stoker's *Dracula*. The `spacy` library is already imported, and the model is loaded as `nlp`, from the top of this notebook.

#### GloVe vectors

You don't have to create your own word vectors from scratch! Many researchers have made downloadable databases of pre-trained vectors. One such project is Stanford's [Global Vectors for Word Representation (GloVe)](https://nlp.stanford.edu/projects/glove/). These 300-dimensional vectors are included with spaCy, and they're the vectors we'll be using for the rest of this tutorial.

In [None]:
#sample test
tokens = nlp("dog cat banana afskfsd")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

In [None]:
doc = nlp(open("/kaggle/input/dracula-story-pg345/pg345.txt").read())
type(doc)

And the cell below creates a list of unique words (or tokens) in the text, as a list of strings.

In [None]:
# all of the words in the text file
tokens = list(set([w.text for w in doc if w.is_alpha]))

In [None]:
len(tokens)

In [None]:
print(tokens[:10])

In [None]:
nlp.vocab

You can see the vector of any word in spaCy's vocabulary using the `vocab` attribute, like so:

In [None]:
nlp.vocab['cheese'].vector

For the sake of convenience, the following function gets the vector of a given string from spaCy's vocabulary:

In [None]:
def vec(s):
    return nlp.vocab[s].vector

### Cosine similarity and finding closest neighbors

The cell below defines a function `cosine()`, which returns the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) of two vectors. Cosine similarity is another way of determining how similar two vectors are, which is more suited to high-dimensional spaces. [See the Encyclopedia of Distances for more information and even more ways of determining vector similarity.](http://www.uco.es/users/ma1fegan/Comunes/asignaturas/vision/Encyclopedia-of-distances-2009.pdf)

(You'll need to install `numpy` to get this to work. If you haven't already: `pip install numpy`. Use `sudo` if you need to and make sure you've upgraded to the most recent version of `pip` with `sudo pip install --upgrade pip`.)

In [None]:
from numpy import dot
from numpy.linalg import norm

# cosine similarity
def cosine(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0

The following cell shows that the cosine similarity between `dog` and `puppy` is larger than the similarity between `trousers` and `octopus`, thereby demonstrating that the vectors are working how we expect them to:

In [None]:
cosine(vec('dog'), vec('puppy')) > cosine(vec('trousers'), vec('octopus'))
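One way to see how cosine similarity differs from Euclidean distance is to compare a vector with a stretched copy of itself. The sketch below uses small made-up vectors rather than real word vectors, and `cosine_similarity` is a stand-alone stand-in for the `cosine()` function above (without the zero-vector check).

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Cosine of the angle between v1 and v2 (both assumed non-zero)."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

short = np.array([1.0, 2.0, 3.0])
long = short * 10                      # same direction, ten times the length
other = np.array([3.0, -1.0, 0.5])    # a different direction

print(cosine_similarity(short, long))    # close to 1: same direction
print(np.linalg.norm(short - long))      # yet far apart in Euclidean terms
print(cosine_similarity(short, other))   # noticeably lower
```

Cosine similarity only cares about the direction of the vectors, not their length.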

The following cell defines a function that iterates through a list of tokens and returns the `n` tokens (ten by default) whose vectors are most similar to a given vector.

In [None]:
def spacy_closest(token_list, vec_to_check, n=10):
    return sorted(token_list,
                  key=lambda x: cosine(vec_to_check, vec(x)),
                  reverse=True)[:n]

Using this function, we can get a list of synonyms, or words closest in meaning (or distribution, depending on how you look at it), to any arbitrary word in spaCy's vocabulary. In the following example, we're finding the words in *Dracula* closest to "basketball":

In [None]:
# what's the closest equivalent of basketball?
spacy_closest(tokens, vec("basketball"))
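If you'd like to see the ranking idea without loading spaCy at all, here's a small sketch on a toy vocabulary. The 2-D vectors are made up for illustration, and `toy_vectors`, `toy_cosine` and `toy_closest` are hypothetical names that mirror `vec()`, `cosine()` and `spacy_closest()`.

```python
import math

# made-up 2-D "word vectors", for illustration only
toy_vectors = {
    "kitten": [0.9, 0.1],
    "cat": [0.8, 0.2],
    "dog": [0.6, 0.5],
    "car": [0.1, 0.9],
}

def toy_cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norms if norms > 0 else 0.0

def toy_closest(words, vec_to_check, n=2):
    # sort every word by similarity to the target vector, best first
    return sorted(words,
                  key=lambda w: toy_cosine(vec_to_check, toy_vectors[w]),
                  reverse=True)[:n]

nearest = toy_closest(list(toy_vectors), toy_vectors["cat"], n=3)
print(nearest)
```

The query word itself ranks first, since nothing is more similar to a vector than that same vector. The real function behaves the same way whenever the word you're asking about is also in the token list.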

### Fun with spaCy, Dracula, and vector arithmetic

Now we can start doing vector arithmetic and finding the closest words to the resulting vectors. For example, what word is closest to the halfway point between day and night?

In [None]:
# halfway between day and night
spacy_closest(tokens, meanv([vec("day"), vec("night")]))

Variations of `night` and `day` are still closest, but after that we get words like `evening` and `morning`, which are indeed halfway between day and night!

Here are the closest words in _Dracula_ to "wine":

In [None]:
spacy_closest(tokens, vec("wine"))

If you subtract "alcohol" from "wine" and find the closest words to the resulting vector, you're left with simply a lovely dinner:

In [None]:
spacy_closest(tokens, subtractv(vec("wine"), vec("alcohol")))

The closest words to "water":

In [None]:
spacy_closest(tokens, vec("water"))

But if you add "frozen" to "water," you get "ice":

In [None]:
spacy_closest(tokens, addv(vec("water"), vec("frozen")))

If you take the difference of "blue" and "sky" and add it to "grass," you get the analogous word ("green"):

In [None]:
# analogy: blue is to sky as X is to grass
blue_to_sky = subtractv(vec("blue"), vec("sky"))
spacy_closest(tokens, addv(blue_to_sky, vec("grass")))
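The arithmetic behind an analogy like this is easier to see on tiny vectors. Here's a sketch with hand-picked 3-D vectors, chosen so that the analogy works out exactly (real word vectors only get you approximately there). All names and numbers are assumptions for illustration; `toy_add` and `toy_subtract` stand in for `addv` and `subtractv`.

```python
import math

# hand-picked 3-D vectors: (blueness, greenness, "is a place or thing")
toy_vectors = {
    "blue": [1, 0, 0],
    "green": [0, 1, 0],
    "sky": [1, 0, 1],
    "grass": [0, 1, 1],
    "stone": [0, 0, 1],
}

def toy_subtract(v1, v2):
    return [a - b for a, b in zip(v1, v2)]

def toy_add(v1, v2):
    return [a + b for a, b in zip(v1, v2)]

def toy_cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norms if norms > 0 else 0.0

# "blue" minus "sky" removes the thing-ness and the blueness of the sky...
blue_minus_sky = toy_subtract(toy_vectors["blue"], toy_vectors["sky"])
# ...and adding "grass" puts greenness back in
target = toy_add(blue_minus_sky, toy_vectors["grass"])
best = max(toy_vectors, key=lambda w: toy_cosine(target, toy_vectors[w]))
print(target, best)
```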

## Sentence similarity

To get the vector for a sentence, we simply average its component vectors, like so:

In [None]:
def meanv(coords):
    # assumes every item in coords has same length as item 0
    sumv = [0] * len(coords[0])
    for item in coords:
        for i in range(len(item)):
            sumv[i] += item[i]
    mean = [0] * len(sumv)
    for i in range(len(sumv)):
        mean[i] = float(sumv[i]) / len(coords)
    return mean
meanv([[0, 1], [2, 2], [4, 3]])
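If you already have numpy around, `np.mean` with `axis=0` does the same element-by-element averaging in one call. This is just a sketch of an equivalent calculation on the same example data; the variable names are my own.

```python
import numpy as np

coords = [[0, 1], [2, 2], [4, 3]]
# axis=0 averages down the rows, giving one value per dimension
mean_np = np.mean(coords, axis=0)
print(mean_np)
```

The result is a numpy array rather than a plain list, but it holds the same numbers that `meanv` returns for this input.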

In [None]:
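def sentvec(s):
    # average the vectors of every token in the string;
    # the vocabulary is indexed by the token's text, not by the token object
    sent = nlp(s)
    return meanv([nlp.vocab[word.text].vector for word in sent])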

In [None]:
nlp.vocab["my"].vector

Let's find the sentence in our text file that is closest in "meaning" to an arbitrary input sentence. First, we'll get the list of sentences:

In [None]:
from __future__ import unicode_literals, print_function
from spacy.lang.en import English # updated

# a separate, lightweight pipeline used only for splitting sentences
# (spaCy 2.x API); it gets its own name so that `nlp`, which holds the
# word vectors, is not overwritten

sentence_nlp = English()
sentence_nlp.add_pipe(sentence_nlp.create_pipe('sentencizer')) # updated
with open("/kaggle/input/dracula-story-pg345/pg345.txt") as f:
    doc = sentence_nlp(f.read())
sentences = [sent.text.strip() for sent in doc.sents]

In [None]:
len(sentences)

The following function takes a list of sentences from a spaCy parse and compares them to an input sentence, sorting them by cosine similarity.

In [None]:
def spacy_closest_sent(space, input_str, n=10):
    # `space` is a list of sentence strings; each one is re-parsed with `nlp`
    # so that its tokens carry word vectors, and blank strings are skipped
    def mean_vector(text):
        return np.mean([w.vector for w in nlp(text)], axis=0)
    input_vec = mean_vector(input_str)
    return sorted([s for s in space if s],
                  key=lambda x: cosine(mean_vector(x), input_vec),
                  reverse=True)[:n]

Here are the sentences in *Dracula* closest in meaning to "My favorite food is strawberry ice cream." (Extra linebreaks are present because we didn't strip them out when we originally read in the source text.)

In [None]:
for sent in spacy_closest_sent(sentences, "My favorite food is strawberry ice cream."):
    print(sent)
    print("---")

## Further resources

* [Word2vec](https://en.wikipedia.org/wiki/Word2vec) is another procedure for producing word vectors which uses a predictive approach rather than a context-counting approach. [This paper](http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf) compares and contrasts the two approaches. (Spoiler: it's kind of a wash.)
* If you want to train your own word vectors on a particular corpus, the popular Python library [gensim](https://radimrehurek.com/gensim/) has an implementation of Word2Vec that is relatively easy to use. [There's a good tutorial here.](https://rare-technologies.com/word2vec-tutorial/)
* When you're working with vector spaces with high dimensionality and millions of vectors, iterating through your entire space calculating cosine similarities can be a drag. I use [Annoy](https://pypi.python.org/pypi/annoy) to make these calculations faster, and you should consider using it too.
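Before reaching for an approximate index, it's worth knowing that plain numpy can already compute every cosine similarity in one shot, which is usually much faster than a Python loop. To be clear, this is not Annoy: it's a sketch of exact search, and `closest_rows` and the toy matrix are hypothetical names and data of my own.

```python
import numpy as np

def closest_rows(matrix, query, n=10):
    # hypothetical helper: exact cosine search over all rows at once
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    sims = np.zeros(len(matrix))
    nonzero = norms > 0  # rows with a zero norm keep a similarity of 0
    sims[nonzero] = (matrix[nonzero] @ query) / norms[nonzero]
    # indices of the n most similar rows, best first
    return np.argsort(-sims)[:n]

toy_matrix = [[1, 0], [0.9, 0.1], [0, 1], [0, 0]]
top = closest_rows(toy_matrix, [1, 0], n=2)
print(top)
```

The function returns row indices, so you'd keep a parallel list of words to turn them back into tokens.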