<a href="https://colab.research.google.com/github/astrovishalthakur/MachineLearning/blob/main/NaturalLanguageProcessing/Spacy/WordVectorsAndSpacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import spacy
! python -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4 MB)
[K     |████████████████████████████████| 96.4 MB 33.1 MB/s 
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_md')


In [2]:
# reading data
url = "https://raw.githubusercontent.com/astrovishalthakur/freecodecamp_spacy/main/data/wiki_us.txt"

In [3]:
from urllib.request import urlopen

In [4]:
 # the lib that handles the url stuff
data = urlopen(url)

In [5]:
text = data.read()

In [6]:
# text = str(text) 
# this doesn't work. Since it can't specify decoding.
text = text.decode("utf-8")

In [7]:
nlp = spacy.load("en_core_web_sm")

In [8]:
doc = nlp(text)

In [9]:
sentence1 = list(doc.sents)[0]

# Word Vectors

Word vectors, or word embeddings, are numerical representations of words in multidimensional space through matrices. The purpose of the word vector is to get a computer system to understand a word. Computers cannot understand text efficiently. They can, however, process numbers quickly and well. For this reason, it is important to convert a word into a number.

Initial methods for creating word vectors in a pipeline take all words in a corpus and convert them into a single, unique number. These words are then stored in a dictionary that would look like this: {“the”: 1, “a”, 2} etc. This is known as a bag of words. This approach to representing words numerically, however, only allow a computer to understand words numerically to identify unique words. It does not, however, allow a computer to understand meaning.

Imagine this scenario:

Tom loves to eat chocolate.

Tom likes to eat chocolate.

These sentences represented as a numerical array (list) would look like this:

1, 2, 3, 4, 5

1, 6, 3, 4, 5

As we can see, as humans both sentences are nearly identical. The only difference is the degree to which Tom appreciates eating chocolate. If we examine the numbers, however, these two sentences seem quite close, but their semantical meaning is impossible to know for certain. How similar is 2 to 6? The number 6 could represent “hates” as much as it could represent “likes”. This is where word vectors come in.

Word vectors take these one dimensional bag of words and gives them multidimensional meaning by representing them in higher dimensional space, noted above. This is achieved through machine learning and can be easily achieved via Python libraries, such as Gensim, which we will explore more closely in the next notebook.

# Why use Word Vectors?

The goal of word vectors is to achieve numerical understanding of language so that a computer can perform more complex tasks on that corpus. Let’s consider the example above. How do we get a computer to understand 2 and 6 are synonyms or mean something similar? One option you might be thinking is to simply give the computer a synonym dictionary. It can look up synonyms and then know what words mean. This approach, on the surface, makes perfect sense, but let’s explore that option and see why it cannot possibly work.

For the example below, we will be using the Python library PyDictionary which allows us to look up definitions and synonyms of words.

In [10]:
! pip install PyDictionary



In [11]:
from PyDictionary import PyDictionary

In [12]:
dictionary = PyDictionary()
text = "Tom loves to eat chocolate"

words = text.split()
for word in words:
  syns = dictionary.synonym(word)
  print(f"{word} : {syns[0:5]}\n")

Tom has no Synonyms in the API


TypeError: ignored

Even with the simple sentence, the results are comically bad. Why? The reason is because synonym substitution, a common method of data augmentation, does not take into account syntactical differences of synonyms. I do not believe anyone would think “Felis domesticus”, the Latin name of the common house cat, would be an adequite substitution for the name Tom. Nor is “garbage down” a really proper synonym for eat.

Perhaps, then we could use synonyms to find words that have cross-terms, or terms that appear in both synonym sets.

In [13]:
dictionary=PyDictionary()

words  = ["like", "love"]
for word in words:
    syns = dictionary.synonym(word)
    print (f"{word}: {syns[0:5]}\n")

like has no Synonyms in the API


TypeError: ignored

This, as we can see, has some potential to work, but again it is not entirely reliable and to work with such a list would be computationally expensive. For both of these reasons, word vectors are prefered. The reason? Because they are formed by the computer on corpora for a specific task. Further, they are numerical in nature (not a dictionary of words), meaning the computer can process them more quickly.

# What do Word Vectors Look Like?

Word vectors have a preset number of dimensions. These dimensions are honed via machine learned. Models take into account word frequency alongside words across a corpus and the appearance of other words in similar contexts. This allows for the the computer to determine the syntactical similarity of words numerically. It then needs to represent these relationships numerically. It does this through the vector, or a matrix of matrices. To represent these more concisely, models flatten a matrix to a float (decimal number). The number of dimensions represent the number of floats in the matrix.

Let’s take a look at the first word in our sentence. Specifically, let’s look at its vector.

In [23]:
sentence1

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.

In [14]:
sentence1[0].vector

array([-3.6614687 , -0.04620421,  3.0726507 , -0.23650384,  2.4024577 ,
        1.6393081 , -0.23510802, -0.882233  ,  0.2080667 , -0.37595803,
       -1.324495  ,  2.1997802 , -2.2784414 ,  3.673751  , -3.7754827 ,
       -0.457344  , -0.6581736 , -1.8261079 , -2.471551  , -2.0428147 ,
       -0.3225528 ,  0.5658449 , -2.8750336 , -3.3324761 , -0.68903667,
       -1.1300318 , -0.4956967 , -2.1609244 ,  0.6490332 ,  0.20021117,
        0.39715707,  1.309732  , -3.2883654 , -0.11644733,  0.11724067,
       -1.2879778 , -0.27009898,  1.7993681 , -0.46875978,  0.55478245,
        1.8216534 , -2.9869418 ,  0.674435  ,  1.4011077 ,  0.8784035 ,
       -0.32177418, -1.5453427 ,  0.4830721 ,  0.8402412 ,  1.9110055 ,
        0.4099064 , -1.1826029 ,  1.4667608 ,  0.34309214, -4.052678  ,
        3.892485  ,  3.46963   ,  2.0397763 ,  0.2942291 , -2.2132115 ,
        1.233685  ,  1.7040749 ,  2.2687542 ,  1.1289511 ,  0.36300308,
        3.584022  ,  3.8411102 , -1.637488  , -6.0296807 ,  0.06

# Why use Word Vectors?

Once a word vector model is trained, we can do similarity matches very quickly and very reliably. Let’s explore some vectors from our medium sized model. Let’s specifically try and find the words most closely related to the word dog.

In [15]:
import numpy as np

In [17]:
! python -m spacy download en_core_web_lg

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9 MB)
[K     |████████████████████████████████| 827.9 MB 1.6 MB/s 
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_lg')


In [21]:
import en_core_web_lg
nlp2 = en_core_web_lg.load()

In [22]:
your_word = "dog"
nlp2 = spacy.load("en_core_web_lg")

ms = nlp2.vocab.vectors.most_similar(
    np.asarray([nlp2.vocab.vectors[nlp2.vocab.strings[your_word]]]), n=10)
words = [nlp2.vocab.strings[w] for w in ms[0][0]]

distances = ms[2]
print(words)

['dog', 'doG', 'Dog', 'DoG', 'DOG', 'DOGS', 'dogs', 'Dogs', 'PUPPY', 'Puppy']


# Doc Similarity

In spaCy we can do this same thing at the document level. Through word vectors we can calculate the similarity between two documents. Let’s look at the example from spaCy’s documentation.

In [24]:
doc1 = nlp2("I like salty fries and hamburgers.")
doc2 = nlp2("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.7687607012190486


# Word Similarity

We can also calculate the similarity between two given words.

In [25]:
# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))

salty fries <-> hamburgers 0.6949788


# Conclusion
As we have seen in this notebook, spaCy is made up of a series of complex P