#### Initialize
*Here, we will talk about word vectors, also called word embeddings. For this we will be working with the larger, medium-sized English model, 'en_core_web_md'.*

*Let's import spaCy and download the model:*

In [6]:
import spacy
#!python -m spacy download en_core_web_md

*Load the model and make a doc object*

In [7]:
# Load the model
nlp = spacy.load("en_core_web_md")
with open("dataset/nlp_wiki.txt", "r") as f:
    text = f.read()

# Make a doc object
doc = nlp(text)
sentence1 = list(doc.sents)[0]

In [8]:
print(sentence1)

One of the first things required for natural language processing (NLP) tasks is a corpus.


#### Word Vectors
*Word vectors are numerical representations of words in a multidimensional space, stored as matrices. The purpose of a word vector is to convert a word into numbers so that a computer system can work with it.*

*The simplest idea is to take every word in a corpus and assign it a single unique number. These numbers are then stored in a dictionary. This is known as a bag of words.*

*This approach represents words numerically. It lets a computer handle words as numbers; however, it doesn't let the computer understand their meaning.*

*For example, let's take two sentences:
A) Nee loves to eat chocolate and B) Nee likes to eat chocolate. These sentences are represented numerically as A) 1 2 3 4 5 and B) 1 6 3 4 5.*
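
*The numbering used for sentences A and B can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper names `build_vocab` and `encode` are made up for illustration, and IDs are simply handed out in order of first appearance, starting at 1.*

```python
def build_vocab(sentences):
    """Assign each new word the next free integer ID, starting at 1."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(sentence, vocab):
    """Replace every word in the sentence with its integer ID."""
    return [vocab[word] for word in sentence.split()]


sentence_a = "Nee loves to eat chocolate"
sentence_b = "Nee likes to eat chocolate"

vocab = build_vocab([sentence_a, sentence_b])
encoded_a = encode(sentence_a, vocab)
encoded_b = encode(sentence_b, vocab)

print(encoded_a)
print(encoded_b)
```

*Note that "likes" receives the ID 6 only because it is the sixth new word seen. The number carries no information about what the word means.*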

*Here, as we can see, both sentences are nearly identical. The only difference is the degree to which Nee appreciates eating chocolate. If we examine the numerical representations, we see that they are quite close, but their semantic meaning is impossible to know for certain. How similar is 2 to 6? The number 6 could just as well represent "hates" as it represents "likes" here.*

*This is where word vectors come in.*

##### Why use Word Vectors?
*Word vectors take this one-dimensional bag-of-words representation and give the words multidimensional meaning by representing them in a higher-dimensional space. [This is achieved via Python libraries such as Gensim.]*

*The goal of word vectors is to achieve a numerical understanding of language. Let's consider the example above: how do we get a computer to understand that 2 and 6 are synonyms or mean something similar?*

*One possible way would be to give the computer a synonym dictionary. It can look up the synonyms and know what the words mean. Let's explore that option and see why it falls short.*

*Python has a library named "PyDictionary". Let's try it:*

In [15]:
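from PyDictionary import PyDictionary

# Create a dictionary client (instantiate the class rather than aliasing it)
dictionary = PyDictionary()
sentence = "Nee loves to eat Chocolate"

words = sentence.split()
for word in words:
    # synonym() returns None when a word has no entry,
    # so slicing the result fails for a word such as "Nee"
    syns = dictionary.synonym(word)
    print(f"{word}:{syns[0:5]}\n")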

Nee has no Synonyms in the API


TypeError: 'NoneType' object is not subscriptable

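*A lookup like this has some potential, but it is not entirely reliable: as the error above shows, a single word with no entry breaks the loop. Maintaining and searching such a list would also be computationally expensive. For both of these reasons, word vectors are preferred.*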
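
*The `TypeError` comes from slicing a `None` result. A minimal sketch of a guarded lookup is shown here. It assumes a small hand-written table, `SYNONYMS`, in place of the real PyDictionary service, and `lookup_synonyms` is a hypothetical helper, not part of any library.*

```python
# Hypothetical in-memory synonym table standing in for a real dictionary API.
SYNONYMS = {
    "loves": ["adores", "cherishes", "likes"],
    "eat": ["consume", "devour"],
}


def lookup_synonyms(word, limit=5):
    """Return up to `limit` synonyms, or an empty list for unknown words."""
    syns = SYNONYMS.get(word.lower())
    if syns is None:
        return []
    return syns[:limit]


for word in "Nee loves to eat chocolate".split():
    print(f"{word}: {lookup_synonyms(word)}")
```

*The guard keeps the loop from crashing, but it does not solve the underlying problem: names and any word missing from the table still come back empty.*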

##### What Does a Word Vector Look Like?

*Word vectors have a preset number of dimensions. Models take into account word frequency across a corpus, along with the appearance of other words in similar contexts. This allows the computer to determine numerically how similar words are in the way they are used. It then needs to represent these relationships numerically.*

*To represent these relationships more concisely, models flatten a matrix (or matrix of matrices) of floats. The number of dimensions is the number of floats in the flattened matrix.*
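
*To make "similarity as a number" concrete, here is a sketch that compares invented three-dimensional vectors using cosine similarity. The values in `toy_vectors` are made up for illustration and do not come from any trained model. Real models use far more dimensions and will not necessarily separate opposites this cleanly.*

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "loves": [0.9, 0.8, 0.1],
    "likes": [0.8, 0.9, 0.2],
    "hates": [-0.9, -0.7, 0.1],
}

sim_loves_likes = cosine_similarity(toy_vectors["loves"], toy_vectors["likes"])
sim_loves_hates = cosine_similarity(toy_vectors["loves"], toy_vectors["hates"])

print(f"loves vs likes: {sim_loves_likes:.3f}")
print(f"loves vs hates: {sim_loves_hates:.3f}")
```

*Unlike the IDs 2 and 6, these numbers can be compared: vectors pointing in a similar direction score close to 1, and vectors pointing in opposite directions score below 0.*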

In [17]:
sentence1[0].vector

array([ 5.4071e-02,  1.1110e-01, -1.4557e-01, -2.4294e-02,  3.8110e-01,
       -1.4389e-01, -1.7998e-01, -3.1079e-01, -7.9690e-03,  2.6538e+00,
       -1.2772e-01,  2.3885e-02,  7.1284e-02, -1.4264e-01,  1.0939e-01,
       -1.0667e-01, -3.8178e-02,  1.1853e+00,  5.2559e-02, -1.7181e-01,
       -1.8629e-01, -1.6533e-02, -8.4008e-02,  1.4542e-01,  1.6059e-01,
       -6.9163e-02, -7.6812e-02, -2.0658e-01,  1.6025e-01,  2.1405e-01,
        5.9209e-02,  4.7891e-01,  8.3374e-02,  1.9994e-01,  9.6225e-02,
       -1.0033e-01,  4.2577e-02, -9.3587e-02, -1.3389e-01, -3.2704e-01,
        2.3650e-02,  3.4064e-01, -7.5976e-02, -1.0150e-01,  1.2431e-01,
       -5.5954e-02, -2.5284e-01, -1.8520e-02,  4.6912e-02, -8.4774e-02,
       -1.5884e-01,  2.3751e-01,  7.6109e-02,  7.1753e-02,  3.1405e-02,
        3.2656e-02,  1.1271e-01,  2.7839e-01,  3.5233e-01, -1.0844e-01,
        7.0183e-02, -4.6891e-02, -1.8825e-01,  4.0518e-01,  2.1180e-01,
       -1.4376e-01, -4.8075e-03,  1.3877e-01, -9.1521e-02,  2.51

*Once word vectors are trained, we can match similar words quickly and reliably. Let's explore some vectors from our model. Specifically, let's try to find the words most closely related to the word "language".*

In [25]:
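import numpy as np

my_word = "language"
# Look up the word's vector and query the vector table for its 10 nearest neighbours.
# most_similar() returns a (keys, best_rows, scores) tuple, one row per query vector.
query = np.asarray([nlp.vocab.vectors[nlp.vocab.strings[my_word]]])
most_similar_words = nlp.vocab.vectors.most_similar(query, n=10)

# Convert the hash keys of the first (and only) query back into strings
words = [nlp.vocab.strings[w] for w in most_similar_words[0][0]]
scores = most_similar_words[2]  # similarity scores, not distances
print(words)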

['Langauge', 'language', 'MULTILINGUAL', 'FLASHCARDS', 'UTTERANCE', 'GRAMMER', 'grammar', 'Typological', 'PATOIS', 'LOCALIZATION']
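
*Conceptually, a nearest-neighbour query scores the query vector against every vector in the table and keeps the top results. The sketch below shows that idea on a tiny made-up table. It is not spaCy's implementation, and `toy_table` and the `most_similar` helper here are hypothetical.*

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, table, n=3):
    """Rank every word in the table by similarity to the query word."""
    query_vec = table[query]
    scored = [(word, cosine_similarity(query_vec, vec)) for word, vec in table.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Made-up vectors, for illustration only.
toy_table = {
    "language": [0.9, 0.1, 0.3],
    "grammar": [0.8, 0.2, 0.4],
    "dialect": [0.7, 0.0, 0.5],
    "pizza": [0.0, 0.9, 0.1],
}

neighbors = most_similar("language", toy_table, n=3)
for word, score in neighbors:
    print(f"{word}: {score:.3f}")
```

*As in the spaCy output above, the query word itself appears in the results, because a vector is always most similar to itself.*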


#### Doc Similarity
*In spaCy we can do the same thing at the document level. Through word vectors we can calculate the similarity between two documents.*

In [26]:
nlp = spacy.load("en_core_web_md")

In [32]:
doc1 = nlp("I like spicy fries and Tea.")
doc2 = nlp("Fast food tastes good.")

# Similarity of two documents
print(doc1,"<=>",doc2,"\nSimilarity:", doc1.similarity(doc2))

I like spicy fries and Tea. <=> Fast food tastes good. 
Similarity: 0.8149152232462087
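
*By default, spaCy's `Doc.similarity` compares the averaged word vectors of the two documents using cosine similarity. The sketch below illustrates that averaging step with made-up three-dimensional vectors. The values in `toy_vectors` and the helper `average_vector` are assumptions for illustration, so the result will not match the score above.*

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Made-up word vectors, for illustration only.
toy_vectors = {
    "spicy": [0.2, 0.9, 0.1],
    "fries": [0.1, 0.8, 0.3],
    "fast": [0.3, 0.6, 0.2],
    "food": [0.2, 0.9, 0.2],
}

doc_a = ["spicy", "fries"]
doc_b = ["fast", "food"]

vec_a = average_vector([toy_vectors[word] for word in doc_a])
vec_b = average_vector([toy_vectors[word] for word in doc_b])
doc_similarity = cosine_similarity(vec_a, vec_b)

print(f"Similarity: {doc_similarity:.3f}")
```

*Because every word contributes equally to the average, word order is ignored and longer documents tend to drift toward a generic "average" vector.*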


#### Word Similarity
*We can also calculate the similarity between individual words or spans within a document.*

In [36]:
# doc1 is "I like spicy fries and Tea."
spicy_fries = doc1[2:4]  # Span: "spicy fries"
tea = doc1[5]            # Token: "Tea"

# Compare the span with the token, not the two documents again
print(spicy_fries, "<=>", tea)
print("Similarity:", spicy_fries.similarity(tea))
