# word2vec
In this notebook, we will:
> 1. explore different pretrained word2vec models offered by Gensim ("**Gen**erate **Sim**ilar") python library


> 2. create a function that accepts a string, and returns a vector representation of that string

https://intellica-ai.medium.com/comparison-of-different-word-embeddings-on-text-similarity-a-use-case-in-nlp-e83e08469c1c 

**Installing Dependencies**

In [37]:
import gensim.downloader as api # using gensim api to download models and datasets
import numpy as np # numerical python to work with word embeddings

# For preprocessing text
import re 
!pip install unidecode
from unidecode import unidecode

# Natural language toolkit functions and datasets
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer # WordNet is 
from nltk.stem.snowball import SnowballStemmer
nltk.download('punkt') 
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

from pprint import pprint # to print dictionaries nicely

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


#### **1. Exploring models**

Gensim comes with several already pre-trained models, in their [Gensim-data repository](https://github.com/RaRe-Technologies/gensim-data) on Github. Let's load in in and take a look at a few. 
We will be looking to load or train a model with a:
* small distance window
* high dimensionality

In [2]:
%%capture 
# Let's look at the models and datasets available for use (comment out the capture to see JSON)

# ------- 'corpora' are first half (key in JSON)
# -------- 'models' are second half (key in JSON) 

info = api.info()
pprint(info)

**Here are two interesting models:**

1. *word2vec-google-news-300:*

* 'description': Google News (about 100 billion words)
 * readmore: https://code.google.com/archive/p/word2vec/ 

2. *glove-wiki-gigaword-300* 
* Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)
 * readmore:https://nlp.stanford.edu/projects/glove/ 

**glove-wiki-gigaword-50** (word2vec model)

In [3]:
# Loading the model in
model_glove_50 = api.load("glove-wiki-gigaword-50")



In [4]:
# Grabbing the dictionary (key-vector mapping class) of word embeddings (semantic vectors)
glove_vectors = model_glove_50.wv
print(type(glove_vectors))

<class 'gensim.models.keyedvectors.Word2VecKeyedVectors'>


  


In [12]:
# Grabbing at most similar semantic vectors to freedom
test_word = "freedom"
freedom = glove_vectors.most_similar(test_word)

# Priniting word embedding for a word
glove_vectors[test_word]


array([-0.75489  ,  0.27814  , -0.11302  , -0.37057  ,  0.57873  ,
       -0.14529  ,  0.43383  , -0.45607  ,  0.7262   ,  0.31185  ,
        0.24079  , -0.11883  , -0.20353  , -0.29989  , -0.27301  ,
       -0.23686  ,  0.51582  , -0.47091  ,  0.31237  ,  0.0070554,
       -0.22833  ,  0.89127  , -0.32475  , -0.22581  ,  0.43705  ,
       -1.6628   , -0.64576  , -1.0098   ,  0.37792  ,  0.33504  ,
        2.6654   ,  0.58524  , -1.3425   , -0.40824  , -1.4958   ,
       -0.64544  ,  0.071664 , -0.80439  , -0.76056  ,  0.36512  ,
        0.32903  , -0.25687  ,  0.13765  ,  0.39533  , -0.68773  ,
       -0.043908 , -0.95513  , -0.47569  , -0.33671  , -0.44242  ],
      dtype=float32)

In [6]:
# Performing vector addition to explore relationships

def magnitude(a):
  return np.linalg.norm(a)

# Reference 
ref = glove_vectors['icecream'] - glove_vectors['vanilla']  - (glove_vectors['pizza'] - glove_vectors['sauce'])

# KING is to MAN as QUEEN is to WOMAN
test1 = glove_vectors['king'] - glove_vectors['man']  - (glove_vectors['queen'] - glove_vectors['woman'])

# ACTOR is to MAN as ACTRESS is to WOMAN
test2 = glove_vectors['actor'] - glove_vectors['man']  - (glove_vectors['actress'] - glove_vectors['woman'])


print("Reference magnitude:", magnitude(ref))
print("Test 1 magnitude:", magnitude(test1))
print("Test 2 magnitude:", magnitude(test2))

Reference magnitude: 6.263511
Test 1 magnitude: 2.8391206
Test 2 magnitude: 2.0526736


#### **2. Defining function**

In this section, we will write a function that:
1. preprocesses a string (removing stopwords, punctuation, and uppercase letters)
2. associates a vector with the preprocessed strings

###### **Preprocessing**

In [38]:
example_string = "नमस्कार THERE, this is 1 notebook aimed at implementing part of a doc2vec function!!!"

In [39]:
# preprocessing the string

print("Example string 0.0:", example_string)

# making words lowercase
example_string = example_string.lower()
print("Example string 1.0:", example_string)

# removing all digits, punctuation, special characters
example_string = re.sub('[^a-zA-Z]', ' ', example_string) 
examle_string = unidecode(example_string)
print("Example string 2.0:", example_string)

# Removing whitespace
example_string = re.sub(r'\s+', ' ', example_string)
print("Example string 3.0:", example_string) 

Example string 0.0: नमस्कार THERE, this is 1 notebook aimed at implementing part of a doc2vec function!!!
Example string 1.0: नमस्कार there, this is 1 notebook aimed at implementing part of a doc2vec function!!!
Example string 2.0:         there  this is   notebook aimed at implementing part of a doc vec function   
Example string 3.0:  there this is notebook aimed at implementing part of a doc vec function 


That's string is looking much better! Three more things we need to do in this preprocessing stage is:
> 1. Splitting sentence into word
> 2. Stemming and lemmatizing each word
> 3. Remove stopwords

In [40]:
# Tokenizing the sentence
words = word_tokenize(example_string)

# Removing all stopwords, stemming, and lemmatizing
lemmatizer = WordNetLemmatizer()
ss = SnowballStemmer('english')

example_sentence = [lemmatizer.lemmatize(ss.stem(w)) for w in words if w not in stopwords.words('english')]

In [44]:
example_sentence

['notebook', 'aim', 'implement', 'part', 'doc', 'vec', 'function']

###### **Sent2Vec Function**

In [33]:
# Creating Euclidean Distance based function

def sentence_vector(sentence):
  """Add up all the semantic vectors for each sentence to get a sentence vector."""
  s_vector = np.array(50*[0])

  for word in sentence:
    try:
      word_vector = glove_vectors[word]
      s_vector = word_vector
    except:
      pass

  return s_vector

In [34]:
sentence_vector(["man", "sad", "at", "bar", "had", "cups", "whiskey"])

array([-1.4145   , -0.056566 , -0.73252  , -0.60502  , -0.018337 ,
        0.52306  , -0.49091  ,  0.49644  ,  0.2371   ,  1.4823   ,
        0.0066497,  0.43641  ,  0.53346  , -0.52744  ,  0.26829  ,
       -0.1701   ,  0.35027  ,  0.55006  , -0.59988  , -1.1519   ,
        1.0168   , -0.51179  ,  0.73453  ,  0.56626  , -0.71344  ,
       -1.0084   , -0.58305  ,  0.68873  ,  1.4397   ,  0.0021525,
        0.31099  , -0.65845  , -0.53219  ,  1.4891   ,  1.1693   ,
       -0.50109  , -0.60882  , -0.32086  ,  0.37159  ,  0.37465  ,
        0.54943  ,  0.39997  , -0.41693  ,  0.026556 ,  0.20353  ,
        0.23431  , -0.36537  , -0.70014  , -0.050047 , -0.95843  ],
      dtype=float32)

In [35]:
sentence_vector(["guy", "mad", "icecream", "eating", "resteraunt", "wine"])

array([-0.1145  ,  0.75404 , -1.6432  , -0.61038 ,  0.60352 , -0.56396 ,
       -1.0069  , -0.44103 ,  0.61256 ,  1.1812  ,  0.18128 ,  0.30032 ,
        1.1817  , -0.62548 ,  1.2156  , -0.30738 ,  0.54095 ,  0.53758 ,
       -0.026086, -1.7387  ,  0.46533 , -0.62835 ,  0.50936 ,  1.1192  ,
       -0.74747 , -0.57528 , -0.9203  ,  0.98612 ,  0.29107 ,  0.60208 ,
        1.9703  , -0.27461 , -0.34921 ,  0.44141 ,  0.64402 , -0.32353 ,
       -1.4541  ,  1.1472  ,  0.86875 , -0.074512,  0.85632 ,  0.59341 ,
        0.4655  , -0.0387  ,  0.26463 ,  0.94151 , -0.27335 , -0.085403,
        0.12693 , -0.23861 ], dtype=float32)