# Understanding How Word Vectors (Word Embeddings) Work

The original article is [here](https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469) by [Allison Parrish](http://www.decontextualize.com/).

Thanks to her for explaining so clearly, in a brief, high-level overview, how word vectors are actually formed.


## Why word vectors?

https://medium.com/analytics-vidhya/word-embedding-why-and-what-dcf5c42724d2

## All the examples here are the same as those in Allison Parrish's notebook

## Animal similarity and simple linear algebra

Consider a small subset of English words for animals.

**Task:** Write a computer program to find similarities among these words and the creatures they designate. To perform this task, create a spreadsheet of some animals and their characteristics.

For example:
![Animal spreadsheet](http://static.decontextualize.com/snaps/animal-spreadsheet.png)

This spreadsheet associates a handful of animals with two numbers indicating their cuteness and their size, both in a range from zero to one hundred. (There is no restriction on the values assigned to these columns; you can change them and experiment.)

These values are all we need to determine which animals are similar (at least, similar in the properties that are included in the data).

**Question: Which animal is most similar to a capybara (the largest rodent)?**

**Answer:** We could go through the values one by one and do the math, but visualizing the data as points in 2-dimensional space makes finding the answer easy and intuitive:

![Animal space](http://static.decontextualize.com/snaps/animal-space.png)

The plot shows us that the closest animal to the capybara is the panda bear (only in terms of subjective size and cuteness, the two features we chose to consider).

Another way of calculating how "far apart" two points are is to find their **Euclidean distance**. (It is simply the length of the line that connects the two points.)

In [103]:
# Euclidean distance between two data points in two dimensions can be calculated with the following Python function
import math
def eucledian_distance_in_2d(x1, y1, x2, y2):
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)

(The `**` operator raises the value on its left to the power on its right.)

So, the distance between "capybara" (70, 30) and "panda" (75, 40):

In [104]:
eucledian_distance_in_2d(70, 30, 75, 40) # panda and capybara

11.180339887498949

... is less than the distance between "tarantula" and "elephant":

In [105]:
eucledian_distance_in_2d(8, 3, 65, 90) # tarantula and elephant

104.0096149401583

Modeling animals in this way has a few other interesting properties, which can help us answer a few more questions.

**For example:** Pick an arbitrary point in "animal space" and then find the animal closest to that point. If you imagine an animal of size 25 and cuteness 30, you can look at the space to find the animal that most closely fits that description: based on the data, it is the chicken.

Other questions, like **What is *halfway between* a chicken and an elephant?**, can also be answered by drawing a line from "elephant" to "chicken", marking off the midpoint and finding the closest animal. (According to the chart, halfway between an elephant and a chicken is a horse.)

Likewise, a question like **What is the *difference between* a hamster and a tarantula?** can be answered from the plot and data: it is about seventy-five units of cuteness (and a few units of size).

The relationship of "difference" is an interesting one, because it allows us to reason about *analogous* relationships. In the chart below, there is an arrow from "tarantula" to "hamster" (in blue):

![Animal analogy](http://static.decontextualize.com/snaps/animal-space-analogy.png)

This arrow can be interpreted as being the *relationship* between a tarantula and a hamster in terms of their size and cuteness (i.e., hamsters and tarantulas are about the same size, but hamsters are much cuter). 

In the same figure, this same arrow is also transposed (this time in red) so that its origin point is "chicken". The arrow ends closest to "kitten".

What we have discovered is that the animal that is about the same size as a chicken but much cuter is... a kitten. To put it in terms of an analogy:

    Tarantulas are to hamsters as chickens are to kittens.
    
A sequence of numbers used to identify a point is called a *vector* and the kind of math we have been doing so far is called *linear algebra*. 
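
The nearest-point, midpoint and analogy questions discussed above can be sketched with plain tuples. This is only an illustration: the coordinates for capybara, panda, tarantula and elephant are the ones quoted in the text, while the values for kitten, hamster, horse and chicken are assumptions made up for this sketch, and the helper names (`nearest_animal`, `midpoint`, `analogy`) are hypothetical.

```python
import math

# (cuteness, size) pairs. Only capybara, panda, tarantula and elephant use
# coordinates quoted in the text; the remaining values are assumptions.
animals = {
    "capybara": (70, 30),
    "panda": (75, 40),
    "tarantula": (8, 3),
    "elephant": (65, 90),
    "kitten": (95, 15),   # assumed
    "hamster": (80, 8),   # assumed
    "horse": (50, 50),    # assumed
    "chicken": (25, 15),  # assumed
}

def distance(a, b):
    # Euclidean distance between two vectors of equal length
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_animal(point):
    # name of the animal whose vector lies closest to the given point
    return min(animals, key=lambda name: distance(animals[name], point))

def midpoint(a, b):
    # the point halfway between a and b
    return tuple((x + y) / 2 for x, y in zip(a, b))

def analogy(a, b, c):
    # move the "arrow" that goes from a to b so that it starts at c: c + (b - a)
    return tuple(z + (y - x) for x, y, z in zip(a, b, c))

print(nearest_animal((30, 25)))  # cuteness 30, size 25
print(nearest_animal(midpoint(animals["chicken"], animals["elephant"])))
print(nearest_animal(analogy(animals["tarantula"], animals["hamster"], animals["chicken"])))
```

With these assumed values the three calls print `chicken`, `horse` and `kitten`, which is the same reasoning the plots express visually.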

(Linear algebra is surprisingly useful across many domains: It's the same kind of math you might do to, e.g., simulate the velocity and acceleration of a sprite in a video game).

A set of vectors that are all part of the same data set is often called a *vector space*.

Here, the vector space of animals has *two dimensions*, meaning that each vector in this space has two numbers associated with it (i.e., two columns in the spreadsheet).

The fact that this space has two dimensions just happens to make it easy to *visualize* the space by drawing a 2D plot.

But most vector spaces you will work with will have more than two dimensions, sometimes many hundreds.

In those cases, it is more difficult to visualize the "space", but the math works pretty much the same.

## Language with vectors: colors

Let's look at another vector space, one that deals with language: the vector space of colors.

Colors are often represented in computers as vectors with three dimensions: red, green, and blue (RGB).

As in the previous section, these vectors can be used to answer questions like **Which colors are similar?** and **What is the most likely color name for an arbitrarily chosen set of values for red, green and blue?**

Or, given the names of two colors, **What is the name of those colors' "average"?**


We will work with [color data](https://github.com/dariusk/corpora/blob/master/data/colors/xkcd.json) from the [xkcd color survey](https://blog.xkcd.com/2010/05/03/color-survey-results/). This data relates a color name to the RGB value associated with that color. [Here's a page that shows what the colors look like](https://xkcd.com/color/rgb/). Download the data file and place it in the same directory as this notebook.


Other notes before proceeding further:

* The linear algebra functions implemented below (`addition_vector`, `mean_vector`, etc.) are slow and potentially inaccurate, and should not be used for "real" code. They are deliberately basic (written with simple for loops) to show how such functions work.
* For fast and accurate math in Python, use [numpy](http://www.numpy.org/).
* If you're interested in perceptually accurate color math in Python, consider using the [colormath library](http://python-colormath.readthedocs.io/en/latest/).

Now, import the `json` library and load the color data:

In [106]:
import json
# read the file inside a with-block so that it is closed again afterwards
with open("xkcd.json") as color_file:
    color_data = json.load(color_file)

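Each entry pairs a color name with a hex code such as `#1a2b3c`; the `hex_to_int` function defined after this output converts those codes to a tuple of integers. The loaded data looks like this:

In [107]:
# this is what the color data looks like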
color_data

{'description': 'The 954 most common RGB monitor colors, as defined by several hundred thousand participants in the xkcd color name survey.',
 'colors': [{'color': 'cloudy blue', 'hex': '#acc2d9'},
  {'color': 'dark pastel green', 'hex': '#56ae57'},
  {'color': 'dust', 'hex': '#b2996e'},
  {'color': 'electric lime', 'hex': '#a8ff04'},
  {'color': 'fresh green', 'hex': '#69d84f'},
  {'color': 'light eggplant', 'hex': '#894585'},
  {'color': 'nasty green', 'hex': '#70b23f'},
  {'color': 'really light blue', 'hex': '#d4ffff'},
  {'color': 'tea', 'hex': '#65ab7c'},
  {'color': 'warm purple', 'hex': '#952e8f'},
  {'color': 'yellowish tan', 'hex': '#fcfc81'},
  {'color': 'cement', 'hex': '#a5a391'},
  {'color': 'dark grass green', 'hex': '#388004'},
  {'color': 'dusty teal', 'hex': '#4c9085'},
  {'color': 'grey teal', 'hex': '#5e9b8a'},
  {'color': 'macaroni and cheese', 'hex': '#efb435'},
  {'color': 'pinkish tan', 'hex': '#d99b82'},
  {'color': 'spruce', 'hex': '#0a5f38'},
  {'color': 'str

In [108]:
def hex_to_int(hex_code):
    # convert a hex color code such as "#1a2b3c" into an (R, G, B) tuple of integers
    hex_code = hex_code.lstrip("#")
    return int(hex_code[:2], 16), int(hex_code[2:4], 16), int(hex_code[4:6], 16)

And the following cell creates a dictionary and populates it with mappings from color names to RGB vectors for each color in the data:

In [109]:
colors = dict()
for item in color_data['colors']:
    colors[item["color"]] = hex_to_int(item["hex"])

In [110]:
# after converting the color codes to RGB, the colors dictionary looks like this
colors

{'cloudy blue': (172, 194, 217),
 'dark pastel green': (86, 174, 87),
 'dust': (178, 153, 110),
 'electric lime': (168, 255, 4),
 'fresh green': (105, 216, 79),
 'light eggplant': (137, 69, 133),
 'nasty green': (112, 178, 63),
 'really light blue': (212, 255, 255),
 'tea': (101, 171, 124),
 'warm purple': (149, 46, 143),
 'yellowish tan': (252, 252, 129),
 'cement': (165, 163, 145),
 'dark grass green': (56, 128, 4),
 'dusty teal': (76, 144, 133),
 'grey teal': (94, 155, 138),
 'macaroni and cheese': (239, 180, 53),
 'pinkish tan': (217, 155, 130),
 'spruce': (10, 95, 56),
 'strong blue': (12, 6, 247),
 'toxic green': (97, 222, 42),
 'windows blue': (55, 120, 191),
 'blue blue': (34, 66, 199),
 'blue with a hint of purple': (83, 60, 198),
 'booger': (155, 181, 60),
 'bright sea green': (5, 255, 166),
 'dark green blue': (31, 99, 87),
 'deep turquoise': (1, 115, 116),
 'green teal': (12, 181, 119),
 'strong pink': (255, 7, 137),
 'bland': (175, 168, 139),
 'deep aqua': (8, 120, 127),
 

Testing whether the code works:

In [111]:
colors['blue']

(3, 67, 223)

In [112]:
colors['red']

(229, 0, 0)

In [113]:
colors['green']

(21, 176, 26)

### Vector math

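These functions are required for performing basic vector "arithmetic". The element-wise helpers (`subtract_vector`, `addition_vector` and `mean_vector`) work with vectors of any number of dimensions.

The `math_calculate_euclidean_distance` function returns the Euclidean distance between two points. Note that, as written, it compares only the first two components of each vector, so for RGB colors the blue channel does not contribute to the distances (or to the nearest-color results) shown in the rest of this section:

In [116]:
import math
def math_calculate_euclidean_distance(coordinate_vec_1,coordinate_vec_2):
    # only components 0 and 1 are used; any further components are ignored
    return math.sqrt(math.pow((coordinate_vec_2[0] - coordinate_vec_1[0]), 2) + math.pow((coordinate_vec_2[1] - coordinate_vec_1[1]), 2))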

In [117]:
math_calculate_euclidean_distance([10, 1], [5, 2])

5.0990195135927845
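
The distance function above reads only indices 0 and 1 of each vector. A version that sums over every component could look like the sketch below; `euclidean_distance_nd` is a hypothetical name, not part of the original notebook, and using it in place of the function above would change the color results printed later in this document.

```python
import math

def euclidean_distance_nd(vec_1, vec_2):
    # hypothetical helper: square and sum the differences of *all* components
    if len(vec_1) != len(vec_2):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_1, vec_2)))

# for 2D input this agrees with the two-component formula
print(euclidean_distance_nd([10, 1], [5, 2]))

# for RGB input the third (blue) component is now included as well
print(euclidean_distance_nd((229, 0, 0), (21, 176, 26)))
```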

The `subtract_vector` function subtracts one vector from another:

In [118]:
def subtract_vector(coordinate_vec_1,coordinate_vec_2):
    return [coordinate_1 - coordinate_2 for coordinate_1, coordinate_2 in zip(coordinate_vec_1,coordinate_vec_2)]


In [119]:
subtract_vector([10, 1], [5, 2])

[5, -1]

The `addition_vector` function adds two vectors together:

In [120]:
def addition_vector(coordinate_vec_1,coordinate_vec_2):
    return [coordinate_1 + coordinate_2 for coordinate_1, coordinate_2 in zip(coordinate_vec_1,coordinate_vec_2)]

In [121]:
addition_vector([10, 1], [5, 2])

[15, 3]

The `mean_vector` function takes a list of vectors and finds their mean or average:

In [122]:
def mean_vector(coordinates):
    # assumes every item in coordinates has same length or size as item 0
    sum_vectors = [0] * len(coordinates[0])
    for item in coordinates:
        for each in range(len(item)):
            sum_vectors[each] += item[each]
    mean = [0] * len(sum_vectors)
    for each_sum_vec in range(len(sum_vectors)):
        mean[each_sum_vec] = float(sum_vectors[each_sum_vec]) / len(coordinates)
    return mean

In [123]:
mean_vector([[0, 1], [2, 2], [4, 3]])

[2.0, 2.0]

Another test: the following cells check that the distance from "red" to "green" is greater than the distance from "red" to "pink":

In [124]:
math_calculate_euclidean_distance(colors['red'], colors['green'])

272.47018185482244

In [125]:
math_calculate_euclidean_distance(colors['red'], colors['pink'])

131.59407281484982

In [126]:
math_calculate_euclidean_distance(colors['red'], colors['green']) > math_calculate_euclidean_distance(colors['red'], colors['pink'])

True

So, the test passes.

### Finding the closest item
#### Task: find the closest color name to an arbitrary point in RGB space

At the start of this notebook, we found the animal that most closely matched an arbitrary point in cuteness/size space. In the same way, we can find the closest color name to an arbitrary point in RGB space.

The easiest way to find the closest item to an arbitrary vector is simply to compute the distance between the target vector and each item in the space in turn, then sort the list from closest to farthest.

The `closest()` function below does exactly that and returns a list of the ten closest items to the given vector. (This is similar in spirit to k-nearest neighbors (KNN), which finds the closest points and then bases a prediction on them.)

> Note: Calculating "closest neighbors" in this way is fine for the examples in this notebook, but it is slow for larger vector spaces. For larger spaces, use SciPy's [kdtree](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.html) or [Annoy](https://pypi.python.org/pypi/annoy), which give faster solutions.

In [127]:
def closest(vector_space, coordinates, num_of_items_to_return=10):
    # sort the keys by their distance to the target vector, nearest first
    ranked = sorted(vector_space.keys(),
                    key=lambda x: math_calculate_euclidean_distance(coordinates, vector_space[x]))
    return ranked[:num_of_items_to_return]
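
Sorting the whole vector space is more work than needed when only a handful of neighbors are wanted. As a sketch, the standard library's `heapq.nsmallest` picks the nearest items without a full sort. The name `closest_heap` and the four-color sample dictionary are assumptions for illustration (the RGB values are taken from the data shown earlier), and this sketch measures distance over all three components rather than only the first two.

```python
import heapq
import math

def closest_heap(vector_space, coordinates, num_of_items_to_return=10):
    # hypothetical variant of closest(): keep only the n nearest keys
    def distance_to_target(name):
        vec = vector_space[name]
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, coordinates)))
    return heapq.nsmallest(num_of_items_to_return, vector_space, key=distance_to_target)

# a tiny sample of the color dictionary, so the sketch runs on its own
sample_colors = {
    "red": (229, 0, 0),
    "green": (21, 176, 26),
    "blue": (3, 67, 223),
    "cloudy blue": (172, 194, 217),
}

print(closest_heap(sample_colors, (0, 0, 255), num_of_items_to_return=2))
```

On this sample, the two names returned for pure blue `(0, 0, 255)` are `blue` and `cloudy blue`, in that order.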

Testing it out.

Test 1: Find the ten colors closest to "red":


In [128]:
closest(colors, colors['red'])

['red',
 'dark hot pink',
 'cerise',
 'fuchsia',
 'hot magenta',
 'pink red',
 'pinkish red',
 'cherry red',
 'cherry',
 'fire engine red']

These are the ten colors closest to red.

Test 2: Find the ten colors closest to (150, 60, 150):

In [129]:
closest(colors, [150, 60, 150])

['purply',
 'brownish red',
 'medium purple',
 'russet',
 'brick',
 'auburn',
 'burnt umber',
 'warm purple',
 'ugly purple',
 'rust brown']

### Color magic

The magical part of representing words as vectors is that the vector operations we defined earlier appear to operate on language the same way they operate on numbers.

For example, if we find the words closest to the vector resulting from subtracting "red" from "purple", we get a series of "blue" colors:

In [130]:
closest(colors, subtract_vector(colors['purple'], colors['red']))

['navy blue',
 'dark forest green',
 'navy',
 'rich blue',
 'true blue',
 'dark navy',
 'dark blue',
 'dark navy blue',
 'very dark blue',
 'marine blue']

**This matches our intuition about RGB colors, which is that purple is a combination of red and blue: if you take away the red, all you have left is the blue.**

You can do something similar with addition.

**What is Blue plus Green?**

In [131]:
closest(colors, addition_vector(colors['blue'], colors['green']))

['spearmint',
 'highlighter green',
 'wintergreen',
 'aqua',
 'electric green',
 'minty green',
 'bright turquoise',
 'bright aqua',
 'bright light blue',
 'neon green']

That's right, it's something like turquoise or cyan! 

**What if we find the average of black and white?**

Predictably, we get gray:

In [132]:
# the average of black and white: medium grey
closest(colors, mean_vector([colors['black'], colors['white']]))

['medium grey',
 'drab',
 'shit green',
 'brownish grey',
 'swamp green',
 'steel',
 'brown grey',
 'lavender blue',
 'ugly brown',
 'periwinkle']

As with the tarantula/hamster example from the previous section, we can use color vectors to look for analogous relationships between colors.

***In the cell below, we find the difference between "pink" and "red" and then add it to "blue". The idea is to get a list of colors that are to blue what pink is to red (i.e., a slightly lighter, less saturated shade):***

In [133]:
# an analogy: pink is to red as X is to blue
pink_to_red = subtract_vector(colors['pink'], colors['red'])
closest(colors, addition_vector(pink_to_red, colors['blue']))

['algae green',
 'dark mint green',
 'greenblue',
 'tealish',
 'topaz',
 'seaweed',
 'dark seafoam',
 'green',
 'greenish teal',
 'green teal']

**Another example of color analogies:** 
Navy is to blue as true green/dark grass green is to green

In [134]:
# another example: 
navy_to_blue = subtract_vector(colors['navy'], colors['blue'])
closest(colors, addition_vector(navy_to_blue, colors['green']))

['blue green',
 'dark sea green',
 'water blue',
 'nice blue',
 'dark cyan',
 'greenish blue',
 'deep sky blue',
 'deep aqua',
 'jungle green',
 'cerulean']

The examples above are fairly simple from a mathematical perspective but nevertheless *feel* magical.

***We have seen that it is possible to use math to reason about how people use language***

### Interlude: A Love Poem That Loses Its Way

In [135]:
import random
red = colors['red']
blue = colors['blue']
for each in range(14):
    rednames = closest(colors, red)
    bluenames = closest(colors, blue)
    print(f"Roses are {rednames[0]}, violets are {bluenames[0]}")
    red = colors[random.choice(rednames[1:])]
    blue = colors[random.choice(bluenames[1:])]

Roses are red, violets are blue
Roses are cherry red, violets are forest green
Roses are neon pink, violets are bottle green
Roses are fire engine red, violets are british racing green
Roses are electric pink, violets are bottle green
Roses are strong pink, violets are darkgreen
Roses are neon pink, violets are bottle green
Roses are hot pink, violets are evergreen
Roses are electric pink, violets are bottle green
Roses are strong pink, violets are british racing green
Roses are neon pink, violets are forest green
Roses are bright red, violets are bottle green
Roses are strong pink, violets are blue
Roses are neon pink, violets are forest green


### Doing bad digital humanities with color vectors

With the above tools in hand, we can start using our vectorized knowledge of language toward academic ends. In the following example, we will calculate the average color of Bram Stoker's *Dracula*.

(To proceed, [download the text file from Project Gutenberg](http://www.gutenberg.org/cache/epub/345/pg345.txt) and place it in the same directory as this notebook under the name `dracula.txt`, which the code below expects.)


First, load [spaCy](https://spacy.io/):

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. This library is published under the MIT license and currently offers statistical neural network models for English, German, Spanish, Portuguese, French, Italian, Dutch and multi-language NER, as well as tokenization for various other languages.

The names `en`, `en_core_web_sm`, `en_core_web_md` and `en_core_web_lg` all refer to English language models. Later in this notebook, you will see what these models do.

In [147]:
import spacy
nlp = spacy.load('en_core_web_lg')

To calculate the average color, we'll follow these steps:

1. Parse the text into words
2. Check every word to see if it names a color in our vector space. If it does, add it to a list of vectors.
3. Find the average of that list of vectors.
4. Find the color(s) closest to that average vector.

The following cell performs steps 1-3:

In [142]:
document = nlp(open("dracula.txt").read())

In [143]:
# lowercase every token so that color names match regardless of capitalization
dracula_colors = [colors[word.lower_] for word in document if word.lower_ in colors]

In [144]:
average_color = mean_vector(dracula_colors)

In [145]:
print(average_color)

[147.44839067702551, 113.65371809100999, 100.13540510543841]


Now, we'll pass the averaged color vector to the `closest()` function, yielding... well, it's just a **brown mush**, which is kind of what you would expect from adding a bunch of colors together.

In [146]:
closest(colors, average_color)

['faded purple',
 'deep lilac',
 'poo',
 'puke brown',
 'reddish grey',
 'hazel',
 'dark lilac',
 'brownish',
 'dirt',
 'baby poop']

****************************************************************************************************************************************************

**On the other hand,**

When we average the colors of Charlotte Perkins Gilman's classic *The Yellow Wallpaper*, this is what we get. 

In [40]:
document = nlp(open("yellow_newspaper.txt").read())

In [41]:
newspaper_colors = [colors[word.lower_] for word in document if word.lower_ in colors]

In [42]:
average_color = mean_vector(newspaper_colors)

In [43]:
closest(colors, average_color)

['dirty yellow',
 'vomit yellow',
 'silver',
 'ugly yellow',
 'light periwinkle',
 'puke yellow',
 'mustard yellow',
 'greenish beige',
 'greenish tan',
 'olive yellow']

These results definitely reflect the content of the story. Most of the closest colors obtained are variants or shades of yellow, so maybe we are onto something.

**Task : Use the vector arithmetic functions to rewrite a text, making it...**

* more blue (i.e., add `colors['blue']` to each occurrence of a color word); or
* more light (i.e., add `colors['white']` to each occurrence of a color word); or
* darker (i.e., attenuate each color. You might need to write a vector multiplication function to do this one right.)

The `multiplication_vector` function multiplies two vectors element-wise:

In [44]:
def multiplication_vector(coordinate_vec_1,coordinate_vec_2):
    return [coordinate_1 * coordinate_2 for coordinate_1, coordinate_2 in zip(coordinate_vec_1,coordinate_vec_2)]


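**More blue.** The task calls for adding `colors['blue']` to each occurrence of a color word. Note that the cell below instead multiplies each color vector element-wise by `colors['blue']`, which pushes the components far outside the 0 to 255 range: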

In [45]:
newspaper_colors_with_blue = [multiplication_vector(colors[word.lower_], colors['blue']) for word in document if word.lower_ in colors]

In [46]:
average_color_mixed = mean_vector(newspaper_colors_with_blue)

In [47]:
closest(colors, average_color_mixed)

['creme',
 'butter',
 'banana',
 'ivory',
 'eggshell',
 'off white',
 'cream',
 'white',
 'pale yellow',
 'yellow']

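**Lighter.** The task calls for adding `colors['white']` to each occurrence of a color word. The cell below again uses element-wise multiplication, so the components overshoot the 0 to 255 range and the nearest named colors are the same as in the previous cell: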

In [48]:
newspaper_colors_with_light_color = [multiplication_vector(colors[word.lower_], colors['white']) for word in document if word.lower_ in colors]

In [49]:
average_color_mixed_light = mean_vector(newspaper_colors_with_light_color)

In [50]:
closest(colors, average_color_mixed_light)

['creme',
 'butter',
 'banana',
 'ivory',
 'eggshell',
 'off white',
 'cream',
 'white',
 'pale yellow',
 'yellow']
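
The task statement describes "more blue" as an addition and "darker" as an attenuation. A minimal sketch of those two operations follows. It assumes that results should be clipped to the 0 to 255 RGB range (the notebook does not say so), and `add_clamped` and `scale_vector` are hypothetical helper names.

```python
def add_clamped(vec_1, vec_2, low=0, high=255):
    # element-wise addition, with each component clipped to the RGB range
    return [max(low, min(high, a + b)) for a, b in zip(vec_1, vec_2)]

def scale_vector(vec, factor):
    # multiply every component by the same number; a factor below 1 darkens
    return [component * factor for component in vec]

blue = (3, 67, 223)              # colors['blue'] from the data above
yellowish_tan = (252, 252, 129)  # colors['yellowish tan'] from the data above

print(add_clamped(yellowish_tan, blue))  # "more blue"
print(scale_vector(yellowish_tan, 0.5))  # "darker"
```

Because addition saturates quickly near the top of the range, adding a fraction of the target color (for example `scale_vector(blue, 0.25)`) may give more visible differences between colors.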

## Distributional semantics

The examples in the previous sections are interesting because of a simple fact: colors that we think of as similar are "closer" to each other in RGB vector space.

In our color vector space, or in our animal cuteness/size space, you can think of the words identified by vectors close to each other as being *synonyms* in a sense: they sort of mean the same thing. They are also, for many purposes, *functionally identical*.

Let's assume we are building a search engine. If someone searches for "mauve trousers", then they should also be shown results for trousers in similar colors, like:

In [51]:
for color_name in closest(colors, colors['mauve']):
    print(f"{color_name} trousers")

mauve trousers
caramel trousers
pinkish brown trousers
leather trousers
clay brown trousers
light urple trousers
soft purple trousers
golden brown trousers
bronze trousers
clay trousers


### Let's move a step ahead

That's all well and good for color words, which intuitively seem to exist in a multidimensional continuum of perception, and for our animal space, where we have written out the vectors ahead of time. **But what about... arbitrary words?** ***Is it possible to create a vector space for all English words that has this same "closer in space is closer in meaning" property?***

To answer this, we have to take a step back and ask the question:

**What does *meaning* mean?** No one really knows, but one theory popular among computational linguists, computer scientists and other people who make search engines is the [Distributional Hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics), which states that:

    Linguistic items with similar distributions have similar meanings.
    
***What's meant by "similar distributions" is "similar contexts"***. 

For example, consider the following sentences:

    It was really cold yesterday.
    It will be really warm today, though.
    It'll be really hot tomorrow!
    Will it be really cool Tuesday?
    
According to the Distributional Hypothesis, the words `cold`, `warm`, `hot` and `cool` must be related in some way (i.e., be close in meaning) because they occur in a similar context, i.e., between the word "really" and a word indicating a particular day. (Likewise, the words `yesterday`, `today`, `tomorrow` and `Tuesday` must be related, since they occur in the context of a word indicating a temperature.)

In other words, according to the Distributional Hypothesis, **a word's meaning is just a big list of all the contexts it occurs in**. Two words are closer in meaning if they share contexts.

### Word vectors by counting contexts

So, how do we turn this insight from the **Distributional Hypothesis** into a system for creating general-purpose vectors that capture the meaning of words?

Maybe you can see where this is going. What if we made a *really big* spreadsheet that had ***one column for every context for every word in a given source text***? Let's use a small source text to begin with, such as the following sentence (taken from Dickens):

    It was the best of times, it was the worst of times.

Such a spreadsheet might look something like this:

![dickens contexts](http://static.decontextualize.com/snaps/best-of-times.png)

The spreadsheet has one column for every possible context, and one row for every word. The value in each cell is the number of times the word occurs in the given context. The numbers in a word's row constitute that word's vector, e.g., the vector for the word `of` is

    [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
    
It is a vector with ten elements because there are ten possible contexts, which makes this a ten-dimensional space! A ten-dimensional space is very hard to visualize.

Still, performing vector arithmetic on vectors with ten dimensions is just as easy as doing it on vectors with two or three dimensions. We can use the same distance formula that we defined earlier to get useful information about which vectors in this space are similar to each other.

In particular, the vectors for `best` and `worst` are actually the same (a distance of zero), since they occur only in the same context (`the ___ of`):

    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
    
Of course, the conventional way of thinking about **"best" and "worst"** is that they are ***"antonyms" not "synonyms"***.

But, they are also clearly two words of the same kind, with related meanings (through opposition), a fact that is captured by this distributional model.
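
The counting scheme described in this section can be sketched in a few lines. This is an illustration under stated assumptions: a context is the pair (previous word, next word), the `<START>` and `<END>` padding tokens are made up for this sketch, and the columns are ordered alphabetically, so individual vectors may list their numbers in a different order than the spreadsheet image.

```python
from collections import Counter

text = "It was the best of times, it was the worst of times."
tokens = [word.strip(".,").lower() for word in text.split()]

# pad the token list so that the first and last words also have a context
padded = ["<START>"] + tokens + ["<END>"]

# count how often each word occurs in each (previous word, next word) context
counts = {}
for i in range(1, len(padded) - 1):
    context = (padded[i - 1], padded[i + 1])
    counts.setdefault(padded[i], Counter())[context] += 1

# one column per distinct context, one row (vector) per word
contexts = sorted({c for counter in counts.values() for c in counter})
vectors = {word: [counter[c] for c in contexts] for word, counter in counts.items()}

print(len(contexts))                        # number of dimensions
print(vectors["of"])                        # two contexts, one occurrence each
print(vectors["best"] == vectors["worst"])  # identical context counts
```

Run on the Dickens sentence, this yields ten distinct contexts, and the vectors for `best` and `worst` compare equal, matching the observation above that their distance is zero.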


### Contexts and dimensionality

In a corpus of any reasonable size, there will be **many thousands if not many millions of possible contexts**. It is difficult enough working with a vector space of ten dimensions, let alone a vector space of a million dimensions! It turns out, though, that many of the dimensions end up being superfluous and can either be eliminated or combined with other dimensions without significantly affecting the predictive power of the resulting vectors.

The process of getting rid of superfluous dimensions in a vector space is called [dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction) and most implementations of count-based word vectors make use of dimensionality reduction so that the resulting vector space has a reasonable number of dimensions (say, 100 to 300, depending on the corpus and application).

**The question of how to identify a "context" is itself very difficult to answer**. 

In the example above, we have said that a "context" is just the word that precedes and the word that follows. Depending on your implementation of this procedure, though, you might want a context with a bigger "window" (e.g., two words before and after) or a non-contiguous window (skip a word before and after the given word).

You might exclude certain "function" words like "the" and "of" when determining a word's context, or you might [lemmatize](https://en.wikipedia.org/wiki/Lemmatisation) the words before you begin your analysis, so that two occurrences with different "forms" of the same word count as the same context. These are all open research questions, and different implementations of procedures for creating count-based word vectors make different decisions on them.
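
To make these design choices concrete, the sketch below counts contexts with a configurable window size and an optional set of excluded function words. `context_counts`, its parameters and the `<PAD>` token are hypothetical names chosen for this illustration; real implementations differ in many details.

```python
from collections import Counter

def context_counts(tokens, window=1, stopwords=frozenset()):
    # drop excluded "function" words first, then pad so that words at the
    # edges of the text still have a full-width context
    kept = [t for t in tokens if t not in stopwords]
    padded = ["<PAD>"] * window + kept + ["<PAD>"] * window
    counts = {}
    for i in range(window, len(padded) - window):
        before = tuple(padded[i - window:i])
        after = tuple(padded[i + 1:i + 1 + window])
        counts.setdefault(padded[i], Counter())[(before, after)] += 1
    return counts

tokens = "it was the best of times it was the worst of times".split()

narrow = context_counts(tokens, window=1)
wide = context_counts(tokens, window=2)
no_function_words = context_counts(tokens, window=1, stopwords={"the", "of"})

print(narrow["best"])             # one word on each side
print(wide["best"])               # two words on each side
print(no_function_words["best"])  # "the" and "of" removed before counting
```

Whichever variant is used, each word still ends up with a table of context counts that can be turned into a vector in the same way as before.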


### GloVe vectors

But you don't have to create your own word vectors from scratch! Many researchers have made downloadable databases of pre-trained vectors. One such project is Stanford's [Global Vectors for Word Representation (GloVe)](https://nlp.stanford.edu/projects/glove/). These 300-dimensional vectors are included with spaCy, and these are the vectors we'll be using for the rest of this tutorial.

## Word vectors in spaCy

Okay, let's have some fun with real word vectors. We're going to use the GloVe vectors that come with spaCy to creatively analyze and manipulate the text of *Dracula*.

The following pretrained statistical models are available for English:

`en_core_web_sm`
English multi-task CNN trained on OntoNotes. Assigns context-specific token vectors, POS tags, dependency parse and named entities.

`en_core_web_md`
English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.


`en_core_web_lg`
English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.



In [None]:
#import spacy
from __future__ import unicode_literals

The language model was already loaded into `nlp` above. We can check the length of a word vector:

In [53]:
# check the length of a word vector
len(nlp.vocab['cheese'].vector)

300

In [56]:
# look at what the word vector looks like for one word
nlp.vocab['testing'].vector

array([ 5.0906e-02,  3.3306e-02,  3.6719e-01,  1.4088e-01, -3.0364e-01,
       -2.9541e-01,  3.6935e-01, -2.8460e-01,  1.6999e-01,  1.6063e+00,
        2.1006e-01, -1.6693e-01,  3.2297e-01,  1.3770e-01, -8.3215e-02,
       -2.3111e-01,  3.8714e-01,  2.3780e+00, -4.3649e-01,  1.4544e-01,
        5.7012e-01,  2.6686e-01,  8.2137e-02, -4.2225e-02, -7.9745e-02,
        2.5825e-01,  2.3424e-01, -2.6287e-01,  5.8605e-01, -1.0229e-01,
       -4.6239e-02, -1.3312e-01,  1.0148e+00,  1.0173e-01,  4.7532e-02,
        3.8132e-01, -2.1046e-01,  3.5580e-01, -3.8104e-02,  2.1731e-01,
        3.4521e-01, -1.1286e-01,  5.8788e-03,  2.3315e-01, -2.0763e-02,
        4.4258e-02, -2.1529e-01,  3.7189e-01,  5.7120e-02, -9.6403e-02,
        1.5698e-01, -3.3283e-01, -3.0891e-01, -2.9329e-01, -2.3612e-01,
       -2.0338e-01,  1.9757e-01, -1.6525e-01, -1.6665e-03,  3.3229e-01,
        4.3400e-01,  5.4746e-02,  1.0761e-02,  4.7274e-01,  6.6353e-01,
        3.1850e-01,  3.1877e-01, -3.7499e-01, -2.0415e-01,  4.30

Next, parse the input text with the loaded model:

In [57]:
document = nlp(open("dracula.txt").read())

Task: create a list of the unique words (tokens) in the text, as a list of strings.

In [58]:
tokens = list(set([word.text for word in document if word.is_alpha]))

You can see the vector of any token in the parsed document through its `vector` attribute, like so:

In [59]:
print(f"Actual Word -- {document[1]}\n")
print(f"Vector for this word : \n\n {document[1].vector}")

Actual Word -- Project

Vector for this word : 

 [ 8.5476e-02 -5.6849e-01  2.0372e-01 -2.2120e-01 -3.0963e-01  3.4719e-01
 -2.6118e-01 -5.0149e-02 -1.5221e-01  2.2953e+00 -9.2400e-01 -5.4083e-02
 -3.2626e-02  2.7534e-01  4.2714e-01  4.6146e-01  1.9751e-01  1.6747e+00
  2.8576e-03 -7.8526e-02  5.6920e-01  1.9254e-01 -1.7107e-01  1.1677e-01
  2.7321e-01  4.6804e-01  4.7820e-01  3.5063e-01  1.7262e-01 -4.4578e-02
  1.6245e-01 -7.8839e-01  4.4394e-02  7.1313e-02 -4.0642e-01  2.7660e-02
  7.9243e-02 -1.3488e-01 -3.4834e-03 -8.1534e-02  2.1869e-01  6.8787e-01
  1.1829e-01 -3.8520e-01 -7.3794e-01 -1.7782e-01  1.7454e-02 -1.2388e-01
 -2.5552e-01  1.1953e-01  4.8163e-01 -2.8986e-01 -7.9492e-02 -7.7109e-02
 -5.5383e-02  2.7323e-01 -7.7889e-02 -1.2901e-01  3.9796e-01 -1.6551e-01
 -3.0875e-01 -2.8209e-01 -1.3139e-01  5.0038e-01  5.9612e-01 -8.4827e-01
 -5.4351e-01  1.3997e-01  2.5255e-01  2.0069e-01  6.2017e-02 -3.6504e-01
  2.1824e-01  9.6262e-02  5.9249e-02  1.1723e-01 -4.1822e-01  1.0989e-01
 

For convenience, the following function looks up the vector of a given string in spaCy's vocabulary:

In [64]:
def string_to_vector(given_string):
    return nlp.vocab[given_string].vector


### Cosine similarity and finding closest neighbors

The cell below defines a function `cosine_similarity()`, which returns the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) of two vectors. Cosine similarity compares the directions of two vectors (the cosine of the angle between them) rather than the distance between their endpoints, which makes it better suited to high-dimensional spaces. [See the Encyclopedia of Distances for more information and even more ways of determining vector similarity.](http://www.uco.es/users/ma1fegan/Comunes/asignaturas/vision/Encyclopedia-of-distances-2009.pdf)



In [61]:
import numpy as np
from numpy import dot           # dot product of two arrays
from numpy.linalg import norm   # vector or matrix norm; for a 1-D array the default is the Euclidean length

# cosine similarity: dot product divided by the product of the two vector lengths
def cosine_similarity(vector_1, vector_2):
    # a zero-length vector has no direction, so avoid dividing by zero
    if norm(vector_1) > 0 and norm(vector_2) > 0:
        return dot(vector_1, vector_2) / (norm(vector_1) * norm(vector_2))
    else:
        return 0.0

The following cell shows that the cosine similarity between `dog` and `puppy` is larger than the similarity between `trousers` and `octopus`, thereby demonstrating that the vectors are working how we expect them to:

In [65]:
cosine_similarity(string_to_vector('dog'), string_to_vector('puppy'))

0.85852146

In [66]:
cosine_similarity(string_to_vector('trousers'),string_to_vector('octopus'))

0.1566927

So the comparison holds:

In [67]:
cosine_similarity(string_to_vector('dog'), string_to_vector('puppy')) > cosine_similarity(string_to_vector('trousers'),string_to_vector('octopus'))

True
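
To make the contrast with Euclidean distance concrete, here is a minimal sketch using made-up two-dimensional vectors (the values and the names `a`, `b`, `c` are illustrative assumptions, not spaCy vectors). Stretching a vector moves it far away in Euclidean terms, but its cosine similarity to the original stays at 1 because only the direction matters:

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vector_1, vector_2):
    # same definition as above: dot product over the product of the lengths
    if norm(vector_1) > 0 and norm(vector_2) > 0:
        return np.dot(vector_1, vector_2) / (norm(vector_1) * norm(vector_2))
    return 0.0

# toy vectors, invented for illustration only
a = np.array([1.0, 2.0])
b = 10 * a                   # same direction as a, ten times longer
c = np.array([2.0, -1.0])    # at a right angle to a

euclidean_ab = norm(a - b)             # large: the two points are far apart
cosine_ab = cosine_similarity(a, b)    # 1.0 (up to floating-point rounding)
cosine_ac = cosine_similarity(a, c)    # 0.0: perpendicular vectors

print(euclidean_ab, cosine_ab, cosine_ac)
```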

The following function iterates through a list of tokens and returns the tokens whose vectors are most similar to a given vector (the ten most similar by default).

In [75]:
def spacy_closest(token_list, vector_to_check, number_of_words_to_out=10):
    # rank every token by cosine similarity to the target vector, most similar first
    return sorted(
        token_list,
        key=lambda x: cosine_similarity(vector_to_check, string_to_vector(x)),
        reverse=True,
    )[:number_of_words_to_out]

Using this function, we can get a list of synonyms, or words closest in meaning (or distribution, depending on how you look at it), to any arbitrary word in spaCy's vocabulary. In the following example, we're finding the words in *Dracula* closest to "basketball":

In [76]:
# what's the closest equivalent of basketball?
spacy_closest(tokens, string_to_vector("basketball"))

['tennis',
 'coach',
 'game',
 'teams',
 'junior',
 'Junior',
 'Team',
 'school',
 'boys',
 'leagues']
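
The same ranking idea can be tried without spaCy. The sketch below uses a tiny hand-made vocabulary (`toy_vectors`, invented for illustration) and the standard library's `heapq.nlargest`, which picks the top few items without sorting the whole list. `closest_words` is a hypothetical stand-in for `spacy_closest`, not part of spaCy:

```python
import heapq
import numpy as np
from numpy.linalg import norm

def cosine_similarity(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return np.dot(v1, v2) / (norm(v1) * norm(v2))
    return 0.0

# invented 2-D "word vectors": axis 0 ~ sportiness, axis 1 ~ spookiness
toy_vectors = {
    "tennis":  np.array([0.9, 0.1]),
    "coach":   np.array([0.8, 0.2]),
    "castle":  np.array([0.1, 0.8]),
    "vampire": np.array([0.0, 1.0]),
}

def closest_words(vectors, target, n=2):
    # return the n words whose vectors point most nearly in the target's direction
    return heapq.nlargest(
        n, vectors, key=lambda w: cosine_similarity(vectors[w], target)
    )

# assumed vector for a query word that is not in the toy vocabulary
basketball = np.array([1.0, 0.0])
nearest = closest_words(toy_vectors, basketball)
print(nearest)
```

For a vocabulary the size of a novel the difference from a full `sorted()` call is small, so this is a matter of taste rather than a required change.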

### Fun with spaCy, Dracula, and vector arithmetic

Now we can start doing vector arithmetic and finding the closest words to the resulting vectors. For example, what word is closest to the halfway point between day and night?

In [77]:
# halfway between day and night
spacy_closest(tokens, mean_vector([string_to_vector("day"), string_to_vector("night")]))

['night',
 'Day',
 'day',
 'evening',
 'Evening',
 'morning',
 'Morning',
 'afternoon',
 'nights',
 'Nights']

Variations of `night` and `day` are still closest, but after that we get words like `evening` and `morning`, which are indeed halfway between day and night!

Here are the closest words in _Dracula_ to "wine":

In [78]:
spacy_closest(tokens, string_to_vector("wine"))

['wine',
 'beer',
 'bottle',
 'Drink',
 'drink',
 'cellar',
 'fruit',
 'bottles',
 'brandy',
 'taste']

If you subtract "alcohol" from "wine" and find the closest words to the resulting vector, you're left with simply a lovely dinner:

In [79]:
spacy_closest(tokens, subtract_vector(string_to_vector("wine"), string_to_vector("alcohol")))

['wine',
 'cellar',
 'exquisite',
 'fabulous',
 'splendid',
 'magnificent',
 'delightful',
 'dinner',
 'Dinner',
 'sparkling']

The closest words to "water":

In [80]:
spacy_closest(tokens, string_to_vector("water"))

['water',
 'waters',
 'salt',
 'Salt',
 'pond',
 'dry',
 'liquid',
 'ocean',
 'boiling',
 'heat']

But if you add "frozen" to "water," words like "cold" and "ice" move up to the top of the list, right after "water" itself:

In [82]:
spacy_closest(tokens, addition_vector(string_to_vector("water"),string_to_vector("frozen")))

['water',
 'cold',
 'ice',
 'salt',
 'Salt',
 'dry',
 'fresh',
 'liquid',
 'boiling',
 'milk']

You can even do analogies! For example, the words most similar to "grass":

In [83]:
spacy_closest(tokens, string_to_vector("grass"))

['grass',
 'lawn',
 'trees',
 'greens',
 'grassy',
 'garden',
 'GARDEN',
 'sand',
 'foliage',
 'tree']

If you take the difference of "blue" and "sky" and add it to "grass", the analogous word ("green") comes out right after "grass" itself:

In [84]:
# analogy: blue is to sky as X is to grass
blue_to_sky = subtract_vector(string_to_vector("blue"), string_to_vector("sky"))
spacy_closest(tokens, addition_vector(blue_to_sky,string_to_vector("grass")))

['grass',
 'green',
 'GREEN',
 'Green',
 'yellow',
 'Red',
 'red',
 'purple',
 'lawn',
 'pink']
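
The analogy step is just a subtraction and an addition followed by a nearest-neighbor search. Here is a minimal sketch with invented four-dimensional toy vectors; `toy_vectors` and `analogy` are hypothetical names, not part of spaCy. It also skips the input words when ranking, which is an assumption of this sketch rather than something the cells above do. The query word often ranks first, as `grass` does above, so filtering it out makes the interesting answer easier to spot:

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return np.dot(v1, v2) / (norm(v1) * norm(v2))
    return 0.0

# invented toy vectors; the axes are [green-ness, blue-ness, red-ness, is-an-object]
toy_vectors = {
    "blue":  np.array([0.0, 1.0, 0.0, 0.0]),
    "sky":   np.array([0.0, 1.0, 0.0, 1.0]),
    "grass": np.array([1.0, 0.0, 0.0, 1.0]),
    "green": np.array([1.0, 0.0, 0.0, 0.0]),
    "red":   np.array([0.0, 0.0, 1.0, 0.0]),
    "lawn":  np.array([0.9, 0.0, 0.0, 1.0]),
}

def analogy(vectors, a, b, c, n=1):
    # "a is to b as ? is to c": take the offset from b to a and apply it to c
    target = vectors[a] - vectors[b] + vectors[c]
    # leave the three input words out of the candidate list
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(
        candidates,
        key=lambda w: cosine_similarity(vectors[w], target),
        reverse=True,
    )
    return ranked[:n]

result = analogy(toy_vectors, "blue", "sky", "grass")
print(result)
```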

## Sentence similarity

To get the vector for a sentence, we simply average its component vectors, like so:

In [85]:
def sentence_vector(sentence):
    sentence = nlp(sentence)
    return mean_vector([word.vector for word in sentence])
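
Averaging discards word order, which is worth keeping in mind when reading the results further down. A minimal sketch with invented toy vectors (no spaCy involved; `toy_vectors` and `toy_sentence_vector` are hypothetical names) shows two sentences with opposite meanings receiving the same vector:

```python
import numpy as np

# invented one-hot toy vectors, one axis per word
toy_vectors = {
    "dog":   np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 0.0, 1.0]),
}

def toy_sentence_vector(sentence):
    # average the vectors of the whitespace-separated words
    return np.mean([toy_vectors[word] for word in sentence.split()], axis=0)

v1 = toy_sentence_vector("dog bites man")
v2 = toy_sentence_vector("man bites dog")

# both sentences average to the same point in the toy space
same = bool(np.allclose(v1, v2))
print(v1, v2, same)
```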

Let's find the sentence in our text file that is closest in "meaning" to an arbitrary input sentence. First, we'll get the list of sentences:

In [87]:
sentences = list(document.sents)

In [88]:
sentences

[The Project Gutenberg EBook of Dracula, by Bram Stoker
 ,
 This eBook is for the use of anyone anywhere at no cost and with
 almost no restrictions whatsoever.  ,
 You may copy it, give it away or
 re-use it under the terms of the Project,
 Gutenberg License included
 with this eBook or online at www.gutenberg.org/license
 
 ,
 Title: Dracula
 ,
 Author: Bram Stoker
 ,
 Release Date: August 16, 2013,
 [EBook #345]
 
 Language: English
 
 ,
 *,
 ** START OF THIS PROJECT,
 GUTENBERG EBOOK DRACULA,
 *,
 *,
 *
 
 
 
 ,
 Produced by Chuck Greif and the Online Distributed
 Proofreading Team at http://www.pgdp.net,
 (This file was
 produced from images generously made available by The
 Internet Archive)
 
 
 
 
 
 
 
                                 ,
 DRACULA
 
 
 
 
 
                                 DRACULA
 
                                   ,
 _by_
 
                               ,
 Bram Stoker
 
                         ,
 [Illustration: colophon]
 
                                 N

The following function takes a list of sentences from a spaCy parse and compares them to an input sentence, sorting them by cosine similarity.

In [97]:
def spacy_closest_sentence(space, input_string, number_of_sentences_out=10):
    input_vector = sentence_vector(input_string)
    # rank each sentence by the cosine similarity of its averaged word vectors to the input
    return sorted(
        space,
        key=lambda x: cosine_similarity(np.mean([word.vector for word in x], axis=0), input_vector),
        reverse=True,
    )[:number_of_sentences_out]

Here are the sentences in *Dracula* closest in meaning to "My favorite food is strawberry ice cream." (Extra linebreaks are present because we didn't strip them out when we originally read in the source text.)

In [98]:
for each_sentence in spacy_closest_sentence(sentences, "My favorite food is strawberry ice cream"):
    print(each_sentence.text)
    print("---")

My own heart grew cold as ice,

---
This, with some cheese
and a salad and a bottle of old Tokay, of which I had two glasses, was
my supper.
---
I dined on what they
called "robber steak"--bits of bacon, onion, and beef, seasoned with red
pepper, and strung on sticks and roasted over the fire, in the simple
style of the London cat's meat!
---
a chicken done up some way with red pepper, which was
very good but thirsty.
---
I got a cup of tea at the Aërated Bread Company
and came down to Purfleet by the next train.


---
There is not even a toilet glass on my
table, and I had to get the little shaving glass from my bag before I
could either shave or brush my hair.
---
There was everywhere a bewildering mass of fruit blossom--apple,
plum, pear, cherry; and as we drove by I could see the green grass under
the trees spangled with the fallen petals.
---
We get hot soup, or coffee, or tea; and
off we go.
---
I
saw it drip with the fresh blood!"
---
We had a capital "severe tea" at Robin Hood'