# NLP Week 4: Distributed Representations part 1

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Understand-some-key-terms:" data-toc-modified-id="Understand-some-key-terms:-1">Understand some key terms:</a></span></li><li><span><a href="#Word-Embeddings" data-toc-modified-id="Word-Embeddings-2">Word Embeddings</a></span></li><li><span><a href="#Pre-trained-word-embeddings-using-gensim" data-toc-modified-id="Pre-trained-word-embeddings-using-gensim-3">Pre-trained word embeddings using gensim</a></span></li><li><span><a href="#Getting-the-embedding-representation-for-full-text" data-toc-modified-id="Getting-the-embedding-representation-for-full-text-4">Getting the embedding representation for full text</a></span></li></ul></div>

## Understand some key terms:


- <b> Distributional similarity </b>

    - This is the idea that the meaning of a word can be understood from the context in which the word appears. 
    - This is also known as connotation: meaning is defined by context. 
    - This is opposed to denotation: the literal meaning of any word. 
    - For example: “NLP rocks.” The literal meaning of the word “rocks” is “stones,” but from the context, it’s used to refer to something good and fashionable.

- <b> Distributional hypothesis </b>
    - In linguistics, this hypothesizes that words that occur in similar contexts have similar meanings. 
    - For example, the English words “dog” and “cat” occur in similar contexts.
    - In a vector space model (VSM), the meaning of a word is represented by a vector. Thus, if two words often occur in similar contexts, their representation vectors should also be close to each other.

- <b>Distributional representation </b>
Mathematically, distributional representation schemes use high-dimensional vectors to represent words. These vectors are obtained from a co-occurrence matrix that captures co-occurrence of word and context. The dimension of this matrix is equal to the size of the vocabulary of the corpus. The four schemes that we’ve seen so far—one-hot, bag of words, bag of n-grams, and TF-IDF—all fall under the umbrella of distributional representation.

- <b> Distributed representation </b>
This is a related concept that is also based on the distributional hypothesis. As noted above, the vectors in a distributional representation are very high dimensional and sparse, which makes them computationally inefficient and hampers learning. To alleviate this, distributed representation schemes significantly compress the dimensionality, producing vectors that are compact (i.e., low dimensional) and dense (i.e., hardly any zeros). The resulting vector space is known as a distributed representation. All the subsequent schemes we’ll discuss in this chapter are examples of distributed representation.

- <b> Embedding </b>
For the set of words in a corpus, an embedding is a mapping from the vector space of the distributional representation to the vector space of the distributed representation.


- <b> Vector semantics </b>
This refers to the set of NLP methods that aim to learn word representations based on the distributional properties of words in a large corpus.
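
To make the distributional hypothesis and the distributional representation concrete, here is a minimal sketch that builds co-occurrence count vectors from a toy corpus. The corpus, the window size of 2, and the helper names `cooccurrence_vectors` and `cosine` are assumptions chosen for illustration; they do not come from any library.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, made up for illustration.
corpus = [
    "the dog chased the ball",
    "the cat chased the ball",
    "the dog ate food",
    "the cat ate food",
    "stocks fell sharply today",
]

def cooccurrence_vectors(sentences, window=2):
    """Count, for every word, how often each other word appears within `window` tokens."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = cooccurrence_vectors(corpus)
vocab = sorted(vectors)

# One row of the co-occurrence matrix: one column per vocabulary word,
# with most entries equal to zero (high dimensional and sparse).
dog_row = [vectors["dog"][w] for w in vocab]
print(len(vocab), dog_row)

# "dog" and "cat" share their contexts; "dog" and "stocks" share none.
print(cosine(vectors["dog"], vectors["cat"]))
print(cosine(vectors["dog"], vectors["stocks"]))
```

In this toy setting, "dog" and "cat" end up with near-identical rows because they appear in the same contexts, while "dog" and "stocks" have no context words in common. The row length grows with the vocabulary, which is the inefficiency that distributed representations aim to remove.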


## Word Embeddings


Let’s consider some examples. If we’re given the word “USA,” distributionally similar words could be other countries (e.g., Canada, Germany, India, etc.) or cities in the USA.
In 2013, a seminal work by Mikolov et al. showed that their neural network–based word representation model known as “Word2vec,” based on “distributional similarity,” can capture word analogy relationships such as:
King – Man + Woman ≈ Queen

While learning such semantically rich relationships, Word2vec ensures that the learned word representations are low dimensional (vectors of 50–500 dimensions, instead of several thousand, as with the representations studied earlier in this chapter) and dense (that is, most values in these vectors are non-zero). Such representations make ML tasks more tractable and efficient.
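
The King – Man + Woman ≈ Queen analogy can be sketched as plain vector arithmetic followed by a nearest-neighbor search. The 3-dimensional vectors below are hand-picked toy values, not learned embeddings, and the `analogy` helper is a hypothetical name used only for this illustration.

```python
import math

# Hand-picked 3-d toy vectors (NOT learned embeddings); the axes loosely
# stand for "royalty", "male", and "female".
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

With these toy values the nearest remaining word is "queen". Real embeddings have hundreds of dimensions with no such readable axes, so the relationship holds only approximately.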

To “derive” the meaning of a word, Word2vec relies on distributional similarity and the distributional hypothesis. That is, it derives the meaning of a word from its context: the words that appear in its neighborhood in the text. So, if two different words (often) occur in similar contexts, it’s highly likely that their meanings are also similar.
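
The notion of a word's "neighborhood" can be made concrete with a sliding context window. The sketch below lists (center, context) pairs of the kind used as training examples in the skip-gram variant of Word2vec; it only extracts the pairs and does not train a model. The function name `context_pairs` and the window size are assumptions made for illustration.

```python
def context_pairs(tokens, window=2):
    """Yield (center, context) pairs for every token and its neighbors within `window`."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = "the quick brown fox jumps".split()
pairs = list(context_pairs(sentence, window=1))
print(pairs)
```

Two words that keep producing the same context words across a large corpus are pushed toward similar vectors during training, which is how the distributional hypothesis enters the model.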


Word2vec operationalizes this by projecting the meaning of the words in a vector
space where words with similar meanings will tend to cluster together, and words
with very different meanings are far from one another.

## Pre-trained word embeddings using gensim

Pre-trained word embeddings can be thought of as a large collection of key-value pairs, where keys are
the words in the vocabulary and values are their corresponding word vectors. 
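
As a rough illustration of the key-value view, the sketch below parses a few lines in the word2vec text layout (a header with the vocabulary size and dimension, then one word and its vector per line) into a plain dictionary. The sample values are made up, and `read_text_vectors` is a hypothetical helper, not a gensim function.

```python
import io

# Made-up sample in the word2vec text layout: a "vocab_size dimension" header,
# then one word per line followed by its vector components.
sample = """3 4
happy 0.1 0.3 -0.2 0.5
glad 0.2 0.25 -0.1 0.4
egypt -0.4 0.1 0.7 0.0
"""

def read_text_vectors(handle):
    """Parse word2vec-style text into a {word: [floats]} dictionary."""
    vocab_size, dim = map(int, handle.readline().split())
    table = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        table[parts[0]] = [float(x) for x in parts[1:]]
    if len(table) != vocab_size:
        raise ValueError("header vocabulary size does not match the number of rows")
    return table

embeddings = read_text_vectors(io.StringIO(sample))
print(sorted(embeddings))    # the keys are words
print(embeddings["happy"])   # the value is that word's vector
```

In practice a library loader does this work; gensim's `KeyedVectors.load_word2vec_format` reads both the text and the binary variants of this format.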


Some of the most popular pre-trained embeddings are Word2vec by Google, GloVe by Stanford, and fastText by Facebook, to name a few. Further, they’re available in various dimensions, such as d = 25, 50, 100, 200, 300, 600.

In [1]:
from gensim.models import KeyedVectors
import time

# Path to the downloaded Google News Word2vec file
pretrainedpath = "GoogleNews-vectors-negative300.bin"

# Start the timer
start_time = time.time()

# Load the pre-trained vectors (binary word2vec format)
w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True)

# Report the total time taken to load the model
print("%0.2f seconds taken to load model" % (time.time() - start_time))

# Number of words in the vocabulary
print("The len of the vocab is: ", len(w2v_model.key_to_index))
print(type(w2v_model))
print(type(w2v_model.key_to_index))

18.32 seconds taken to load model
The len of the vocab is:  3000000
<class 'gensim.models.keyedvectors.KeyedVectors'>
<class 'dict'>


In [2]:
# word example 1

word = 'happy'

# print similar words to a given word  
print("Similar words to the word", word, "are", w2v_model.most_similar(word))

# print the vector size of a word 
print(len(w2v_model[word]))

# print the vector of the word 
print(w2v_model[word])

Similar words to the word happy are [('glad', 0.7408890724182129), ('pleased', 0.6632170677185059), ('ecstatic', 0.6626911163330078), ('overjoyed', 0.6599287390708923), ('thrilled', 0.6514049768447876), ('satisfied', 0.6437950134277344), ('proud', 0.636042058467865), ('delighted', 0.627237856388092), ('disappointed', 0.6269949674606323), ('excited', 0.6247665882110596)]
300
[-5.18798828e-04  1.60156250e-01  1.60980225e-03  2.53906250e-02
  9.91210938e-02 -8.59375000e-02  3.24218750e-01 -2.17285156e-02
  1.34765625e-01  1.10351562e-01 -1.04980469e-01 -2.90527344e-02
 -2.38037109e-02 -4.02832031e-02 -3.68652344e-02  2.32421875e-01
  3.20312500e-01  1.01074219e-01  5.83496094e-02 -2.91824341e-04
 -3.29589844e-02  2.11914062e-01  4.32128906e-02 -8.59375000e-02
  2.81250000e-01 -1.78222656e-02  3.79943848e-03 -1.71875000e-01
  2.06054688e-01 -1.85546875e-01  3.73535156e-02 -1.21459961e-02
  2.04101562e-01 -3.80859375e-02  3.61328125e-02 -8.15429688e-02
  8.44726562e-02  9.37500000e-02  1.44

In [3]:
# word example 2
word = 'egypt'

# print similar words to a given word  
print("Similar words to the word", word, "are", w2v_model.most_similar(word))

# print the vector size of a word 
print(len(w2v_model[word]))

# print the vector of the word 
print(w2v_model[word])

Similar words to the word egypt are [('syria', 0.6730753183364868), ('saudi_arabia', 0.6684698462486267), ('bahrain', 0.6469383239746094), ('ethiopia', 0.6380573511123657), ('#_Jan##', 0.6275482773780823), ('syrian', 0.6245205998420715), ('yemen', 0.6230114698410034), ('saudi', 0.6178266406059265), ('lebanon', 0.6116478443145752), ('arabia', 0.6031927466392517)]
300
[ 3.12500000e-02  1.12792969e-01  1.14746094e-01  3.55468750e-01
 -2.24609375e-01 -5.73730469e-02 -6.34765625e-02 -6.10351562e-02
  3.93066406e-02  1.01562500e-01 -1.67968750e-01 -3.63281250e-01
 -2.15820312e-01 -6.00585938e-02 -3.80859375e-02  8.64257812e-02
 -7.32421875e-02  1.08398438e-01  9.17968750e-02  3.85742188e-02
  1.25976562e-01 -2.22656250e-01  3.53515625e-01 -4.19921875e-02
 -8.10546875e-02  1.39648438e-01 -1.47460938e-01  2.41210938e-01
  2.96875000e-01 -1.66992188e-01 -2.20947266e-02 -7.27539062e-02
 -2.36328125e-01 -1.25885010e-03 -2.94921875e-01  2.08984375e-01
 -2.13867188e-01  1.09375000e-01 -1.00585938e-

In [4]:
# What if the word is not in the vocab?
w2v_model['practicalnlp']

KeyError: "Key 'practicalnlp' not present"

In [2]:
try:
    embedding = w2v_model['practicalnlp']
except KeyError:
    print("The word 'practicalnlp' is not in the vocabulary.")


The word 'practicalnlp' is not in the vocabulary.
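
When many words are looked up in a loop, it can be convenient to wrap the out-of-vocabulary check in a helper. The sketch below returns an all-zero vector for unknown words, which is one common fallback and an assumption here, not something the notebook prescribes. A plain dictionary stands in for the loaded model so the example runs without the pre-trained file; gensim 4.x `KeyedVectors` objects also support the `word in model` membership test used by the hypothetical `lookup` helper.

```python
def lookup(vectors, word, dim):
    """Return (vector, found). Unknown words get an all-zero vector of length `dim`."""
    if word in vectors:
        return vectors[word], True
    return [0.0] * dim, False

# Toy stand-in for the loaded model (made-up values).
toy_vectors = {"happy": [0.1, 0.3, -0.2], "glad": [0.2, 0.25, -0.1]}

for word in ["happy", "practicalnlp"]:
    vector, found = lookup(toy_vectors, word, dim=3)
    print(word, found, vector)
```

Returning the `found` flag alongside the vector lets the caller decide whether to keep, skip, or count the unknown words instead of silently mixing zero vectors into later computations.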


In [3]:
import gensim
print(gensim.__version__)

4.3.3


In [8]:
!python -m spacy download en_core_web_md


Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)

## Getting the embedding representation for full text

In [9]:
import spacy

# Load spaCy's pre-trained en_core_web_md model, a medium-sized English model
nlp = spacy.load('en_core_web_md')

# Process a sentence using the model
mydoc = nlp("Canada is a large country")

# Get the vector for an individual word
# print(mydoc[0].vector)

# Averaged vector for the entire sentence
print(mydoc.vector)

[-2.12132597e+00  3.35791826e+00 -1.37670004e+00  2.12385988e+00
  6.28810024e+00  3.22182178e-01  1.18766809e+00  4.87165976e+00
  2.24417591e+00  7.14037895e-01  1.03926411e+01  8.83959949e-01
 -1.73903596e+00  5.41560054e-01 -1.55289978e-01  5.18263149e+00
  1.30475593e+00  4.21266031e+00 -5.92720024e-02 -1.28370404e+00
  2.54464006e+00  1.31399959e-01 -4.84842014e+00  1.84918189e+00
 -6.28175914e-01 -1.20439982e+00 -1.89999998e+00 -4.88359404e+00
 -1.59767210e+00 -2.89982986e+00  2.57135957e-01  2.57717991e+00
 -2.17529225e+00 -2.77516985e+00 -2.83998394e+00  8.96261990e-01
  3.73915970e-01  4.36887592e-01  2.06502008e+00 -2.08246017e+00
 -7.68391967e-01  1.87826610e+00  1.21900201e+00  4.61789995e-01
 -2.57270002e+00  2.26117969e+00  2.93105793e+00 -1.84933782e+00
 -5.98986030e-01  1.39556003e+00 -1.71248794e+00  4.13538039e-01
  2.05463791e+00 -4.33485985e+00 -3.63799959e-01 -1.03273201e+00
  2.23117399e+00 -5.93478978e-01 -7.95660019e-01  3.38980108e-01
  2.17601585e+00 -9.78588

In [10]:
import spacy

# Load the pre-trained model
model = spacy.load('en_core_web_md')
print(type(model))

# Define the target sentence
target_sentence = "Egypt is in Africa"
target_doc = model(target_sentence)
print(target_doc.vector)

# Define a list of candidate sentences
candidate_sentences = ["Australia is a continent", 
                       "Japan is in Asia", 
                       "cat loves dogs", 
                       "Canada is a cold country"]

# Compute the similarity between the target sentence and each candidate sentence
for candidate_sentence in candidate_sentences:
    candidate_doc = model(candidate_sentence)
    similarity = target_doc.similarity(candidate_doc)
    print(f"Similarity between '{target_sentence}' and '{candidate_sentence}': {similarity:.2f}")


<class 'spacy.lang.en.English'>
[-1.7132499e+00  2.1462150e+00  3.3822498e-01  2.0327001e+00
  5.3187499e+00  1.8000075e+00 -2.0949826e+00  2.2298698e+00
  2.9685848e+00  8.7786734e-01  8.4832745e+00 -1.1373749e+00
 -8.3087504e-01  2.6968000e+00  8.5587001e-01  3.7318499e+00
 -4.0461001e+00  2.3709500e+00  1.3235799e+00 -2.2868550e+00
  1.8101726e+00 -7.2573203e-01 -6.5917253e+00  3.0008998e+00
 -2.0096049e+00 -2.7405250e-01 -2.8172500e+00 -1.7489295e+00
 -1.4400675e+00  2.9781849e+00  1.3113000e+00  1.4226499e+00
 -3.1205325e+00 -2.5082126e+00 -6.3439498e+00  1.7815149e+00
 -2.7906446e+00  1.6620250e+00 -2.6334248e+00 -1.9140749e+00
 -1.8885126e+00  2.8848727e+00  2.4469256e+00 -5.5959249e-01
 -1.6943649e+00  4.5974746e+00  2.8030751e+00 -2.6699748e+00
  1.4377401e+00 -1.5139849e+00 -4.1388702e+00 -5.8904994e-01
  1.4429899e+00 -5.8204503e+00  6.7606497e-01  2.9370501e+00
 -2.0968509e-01 -1.2333748e+00 -2.1224422e+00  3.7628624e+00
 -3.1993122e+00 -6.3285499e+00  1.3901999e+00 -2.3185
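
The two spaCy calls used above can be approximated by hand: by default the document vector is the average of its token vectors, and `similarity` is the cosine similarity between two such vectors. The sketch below reproduces that idea with hand-made 3-dimensional toy vectors. The values, the whitespace tokenization, and the helper names `sentence_vector` and `cosine` are assumptions for illustration, so the scores will not match spaCy's output.

```python
import math

# Hand-made 3-d toy vectors (hypothetical values, not spaCy's).
toy_vectors = {
    "egypt":  [0.9, 0.1, 0.0],
    "japan":  [0.8, 0.2, 0.1],
    "africa": [0.7, 0.3, 0.0],
    "asia":   [0.7, 0.2, 0.1],
    "cat":    [0.0, 0.1, 0.9],
    "dogs":   [0.1, 0.0, 0.8],
    "loves":  [0.1, 0.8, 0.3],
    "is":     [0.2, 0.5, 0.2],
    "in":     [0.2, 0.4, 0.2],
}

def sentence_vector(text, vectors):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in text.lower().split() if t in vectors]
    if not known:
        raise ValueError("no token of the sentence has a vector")
    return [sum(column) / len(known) for column in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

target = sentence_vector("Egypt is in Africa", toy_vectors)
scores = {
    candidate: cosine(target, sentence_vector(candidate, toy_vectors))
    for candidate in ["Japan is in Asia", "cat loves dogs"]
}
for candidate, score in scores.items():
    print(f"{candidate!r}: {score:.2f}")
```

Because averaging ignores word order and gives frequent words such as "is" and "in" the same weight as content words, sentences with a shared structure tend to score high even when their topics differ.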