### Word embeddings
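**Embedding:**
For the set of words in a corpus, an embedding is a mapping from the vector space of a distributional representation (high-dimensional, sparse vectors built from the contexts in which words occur) to the vector space of a distributed representation (compact, dense, low-dimensional vectors).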

Words that appear in similar contexts are distributionally similar. For example, given the word “USA,” distributionally similar words could be other countries (e.g., Canada, Germany, India) or cities in the USA.

Seminal work by *Mikolov et al.* showed that their neural network–based word representation model known as “Word2vec,” based on “distributional similarity,” can capture word analogy relationships. **The Word2vec model is in many ways the dawn of modern-day NLP.**
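
To make the idea of a word analogy concrete, here is a minimal sketch of the vector arithmetic behind it. The three-dimensional vectors and the helper names `cosine` and `analogy` are hypothetical and hand-picked for illustration; they are not taken from any trained model.

```python
import math

# Hypothetical 3-dimensional toy vectors, hand-picked for illustration only.
# Real Word2vec vectors are learned from a corpus and have hundreds of dimensions.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# "man is to king as woman is to ?"
result = analogy("man", "king", "woman", toy_vectors)
print(result)
```

With a real model loaded through gensim, the same kind of query can be expressed as `most_similar(positive=['king', 'woman'], negative=['man'])`; whether a given analogy holds depends on the corpus the vectors were trained on.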

Conceptually, Word2vec takes a large corpus of text as input and “learns” to represent the words in a common vector space based on the contexts in which they appear in the corpus. Given a word w and the words appearing in its context C, how do we find the vector that best represents the meaning of the word? For every word w in the corpus, we start with a vector v_w initialized with random values. The Word2vec model then refines the values in v_w by predicting v_w from the vectors for the words in the context C. It does this using a two-layer neural network.
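
The training idea described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual Word2vec implementation: it predicts a center word from the average of its context vectors using a full softmax, whereas real implementations rely on optimizations such as negative sampling. The vocabulary, the embedding size, the learning rate, and the names `predict` and `train_step` are all assumptions made for this example.

```python
import math
import random

random.seed(0)

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
dim = 4  # toy embedding size (hypothetical; real models often use 100-300)

# Layer 1: one input vector per word, initialized with small random values.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
# Layer 2: one output vector per word, used to score candidate center words.
W_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

def predict(context_ids):
    """Average the context vectors and return (hidden, softmax probabilities)."""
    hidden = [sum(W_in[i][d] for i in context_ids) / len(context_ids)
              for d in range(dim)]
    scores = [sum(h * o for h, o in zip(hidden, out_vec)) for out_vec in W_out]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return hidden, [e / total for e in exps]

def train_step(context_ids, center_id, lr=0.1):
    """One gradient-descent step on the cross-entropy loss.

    Returns the loss measured before the update.
    """
    hidden, probs = predict(context_ids)
    loss = -math.log(probs[center_id])
    # Gradient of the loss with respect to each score: prob - 1[target].
    errors = [p - (1.0 if j == center_id else 0.0) for j, p in enumerate(probs)]
    # Gradient with respect to the hidden (averaged context) vector,
    # computed before W_out is modified.
    grad_hidden = [sum(errors[j] * W_out[j][d] for j in range(len(vocab)))
                   for d in range(dim)]
    for j in range(len(vocab)):
        for d in range(dim):
            W_out[j][d] -= lr * errors[j] * hidden[d]
    for i in context_ids:
        for d in range(dim):
            W_in[i][d] -= lr * grad_hidden[d] / len(context_ids)
    return loss

# Repeatedly predict "cat" from the context words "the" and "sat".
context = [word_to_id[w] for w in ("the", "sat")]
center = word_to_id["cat"]
losses = [train_step(context, center) for _ in range(50)]
print(f"loss before: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

After training on a large corpus rather than a single example, the rows of the first layer (`W_in` here) are what one would keep as the word vectors.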

### Pre-trained word embeddings

Thankfully, for many scenarios, it’s not necessary to train your own embeddings, and using pre-trained word embeddings often suffices. 

The most popular pre-trained embeddings are Word2vec by Google, GloVe by Stanford, and fastText by Facebook.

The following code loads a pre-trained Word2vec embedding model and finds the words that are semantically most similar to the word “beautiful”:

In [None]:
import os
import time #Used below to time how long the model takes to load
import wget
import gzip
import shutil
import warnings #Used to suppress the various types of warnings generated
warnings.filterwarnings("ignore")
import psutil #This module helps in retrieving information on running processes and system resource utilization
from psutil import virtual_memory
from gensim.models import Word2Vec, KeyedVectors

gn_vec_path = "GoogleNews-vectors-negative300.bin"
if not os.path.exists("GoogleNews-vectors-negative300.bin"):
    if not os.path.exists("../Ch2/GoogleNews-vectors-negative300.bin"):
        #Downloading the required model
        if not os.path.exists("../Ch2/GoogleNews-vectors-negative300.bin.gz"):
            if not os.path.exists("GoogleNews-vectors-negative300.bin.gz"):
                wget.download("https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz")
            gn_vec_zip_path = "GoogleNews-vectors-negative300.bin.gz"
        else:
            gn_vec_zip_path = "../Ch2/GoogleNews-vectors-negative300.bin.gz"
        #Extracting the required model
        with gzip.open(gn_vec_zip_path, 'rb') as f_in:
            with open(gn_vec_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
    else:
        gn_vec_path = "../Ch2/" + gn_vec_path

print(f"Model at {gn_vec_path}")

process = psutil.Process(os.getpid())
mem = virtual_memory()

In [None]:
pretrainedpath = gn_vec_path

#Load W2V model. This will take some time, but it is a one time effort! 
pre = process.memory_info().rss
print("Memory used in GB before Loading the Model: %0.2f"%float(pre/(10**9))) #Check memory usage before loading the model
print('-'*10)

start_time = time.time() #Start the timer
ttl = mem.total #Total memory available

w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True) #load the model
print("%0.2f seconds taken to load"%float(time.time() - start_time)) #Calculate the total time elapsed since starting the timer
print('-'*10)

print('Finished loading Word2Vec')
print('-'*10)

post = process.memory_info().rss
print("Memory used in GB after Loading the Model: {:.2f}".format(float(post/(10**9)))) #Calculate the memory used after loading the model
print('-'*10)

print("Percentage increase in memory usage: {:.2f}% ".format(float(((post - pre)/pre)*100))) #Percentage increase in memory after loading the model
print('-'*10)

print("Number of words in vocabulary: ",len(w2v_model.vocab)) #Number of words in the vocabulary (with gensim >= 4.0, use len(w2v_model.key_to_index) instead)

In [None]:
#Let us examine the model by finding the words most similar to a given word!
w2v_model.most_similar('beautiful')

In [None]:
#Let us try with another word! 
w2v_model.most_similar('toronto')

In [None]:
#What is the vector representation for a word? 
w2v_model['computer']

In [None]:
#What if I am looking for a word that is not in this vocabulary?
w2v_model['practicalnlp']

Two things to note while using pre-trained models:
- Tokens/words are always lowercased.
- If a word is not in the vocabulary, the model throws an exception (a `KeyError`), so it is always a good idea to encapsulate such lookups in try/except blocks.
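
A minimal sketch of such a guarded lookup is shown below. The helper name `get_vector` and the `default` parameter are hypothetical, and a plain dictionary stands in for the loaded model so that the example runs without downloading anything; it raises `KeyError` on a missing key in the same way that indexing the model with an unknown word does.

```python
def get_vector(model, word, default=None):
    """Return the vector for `word`, or `default` if it is out of vocabulary."""
    try:
        return model[word]
    except KeyError:
        # Raised when the word is not in the model's vocabulary.
        return default

# Stand-in for the loaded model (hypothetical toy data, not real embeddings).
toy_model = {"computer": [0.1, 0.2, 0.3]}

known = get_vector(toy_model, "computer")
unknown = get_vector(toy_model, "practicalnlp")
print(known)
print(unknown)
```

The same helper can be called as `get_vector(w2v_model, 'practicalnlp')` on the model loaded earlier. An alternative is to test membership first with `'practicalnlp' in w2v_model` and skip the lookup when it is false.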

#### Getting embedding representation for full text
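One way to do this is with spaCy, where the `vector` attribute of a processed text gives the average of the vectors of its tokens.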

In [None]:
import spacy
# ! python -m spacy download en_core_web_md # get the spaCy model

nlp = spacy.load('en_core_web_md')
mydoc = nlp("Canada is a large country")
print(mydoc.vector) #Averaged vector for the entire sentence
#What happens when I give a sentence with strange words (and stop words), and try to get its word vector in spaCy?
temp = nlp('practicalnlp is a newword')
temp[0].vector
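
The averaging that produces a sentence vector can also be written out by hand, which makes the handling of unknown words explicit. The sketch below is an illustration with hypothetical two-dimensional toy vectors and a hypothetical helper `average_vector`; unlike spaCy, it skips out-of-vocabulary tokens instead of including them in the average, and it falls back to an all-zero vector when no token is known.

```python
def average_vector(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

# Hypothetical toy vectors, hand-picked for illustration only.
toy_vectors = {
    "canada": [1.0, 0.0],
    "large": [0.0, 1.0],
    "country": [1.0, 1.0],
}

# "is" and "a" have no vector here, so only three tokens contribute.
sentence = "canada is a large country".split()
sent_vec = average_vector(sentence, toy_vectors, dim=2)
print(sent_vec)

# A text made up only of unknown words falls back to the zero vector.
unknown_vec = average_vector(["practicalnlp"], toy_vectors, dim=2)
print(unknown_vec)
```

In spaCy itself, `token.has_vector` and `token.is_oov` can be used to check whether a token such as `temp[0]` actually has a vector behind it.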