# Word Embeddings

The aim of this notebook is to show how to train word embeddings on our own dataset. We walk through the training process for three models: word2vec, GloVe and fastText. For this task we will use the STS Benchmark dataset.

# Table of Contents

* Data Loading and Preprocessing
* Word2Vec
* fastText
* GloVe
* Concluding Remarks

In [5]:
import gensim
import sys
import os

import numpy as np
import nltk
import pandas as pd 

from nltk.corpus import brown
from gensim.models import Word2Vec
from gensim.models.fasttext import FastText

In [6]:
# Set the path for where your repo is located
path = '/Users/cblanesg/cam.blanes Dropbox/Camila Blanes/deep_dive/ghost_recon/Embeddings/'
NLP_REPO_PATH = os.path.join(path)

# Set the path for where your datasets are located
BASE_DATA_PATH = os.path.join(NLP_REPO_PATH, "sts-train.csv")

# Preprocessing

The steps for preprocessing the text are the following:
1. Stemming
2. Remove punctuation
3. Convert everything to lowercase
4. Remove stopwords
5. **(Optional)** Remove words that appear in fewer than 1% or in more than 99% of the documents in the corpus (when choosing these thresholds, consider the diversity of the vocabulary, the average document length and the size of the corpus).
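
The steps above can be sketched in plain Python. This is only an illustration under several assumptions: `STOPWORDS` is a deliberately tiny, hypothetical list, `naive_stem` is a crude stand-in for a real stemmer such as NLTK's `PorterStemmer`, and the steps are applied in a convenient order (lowercase and punctuation first, stemming last) rather than the order listed.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption for illustration).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "on", "to"}

def naive_stem(token):
    """Crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, strip punctuation, drop stopwords, then stem."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [naive_stem(tok) for tok in text.split() if tok not in STOPWORDS]

def filter_by_doc_frequency(docs, min_df=0.01, max_df=0.99):
    """Keep tokens whose share of documents lies within [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    keep = {tok for tok, df in doc_freq.items() if min_df <= df / n_docs <= max_df}
    return [[tok for tok in doc if tok in keep] for doc in docs]

docs = [preprocess(s) for s in [
    "The jury praised the election.",
    "Voters are voting in the primary election!",
    "An investigation of the jury is ongoing.",
]]
print(docs)
print(filter_by_doc_frequency(docs, min_df=0.5, max_df=1.0))
```

On a corpus this small the 1% and 99% thresholds would remove nothing, which is why the usage line passes a larger `min_df`.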

# Word2Vec

Word2vec is a predictive model for learning word embeddings from text. Word embeddings are learned such that words that share common contexts in the corpus will be close together in the vector space. There are two different model architectures that can be used to produce word2vec embeddings: continuous bag-of-words (CBOW) or continuous skip-gram. The former uses a window of surrounding words (the "context") to predict the current word and the latter uses the current word to predict the surrounding context words. 
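
To make the difference between the two architectures concrete, the sketch below builds the training examples each one would see for a short sentence. It is a simplification: the function names are hypothetical, and the real word2vec implementation also subsamples frequent words and varies the effective window size, neither of which is shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: the current word predicts each surrounding context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context words jointly predict the current word."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = ["the", "jury", "said", "friday"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```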

In [18]:
# Train the Word2Vec embeddings on the tokenized sentences of the Brown corpus from NLTK.
sentences = brown.sents()
print(sentences[:3])

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ['The', 'September-October', 'term', 'jury', 'had', 'been', 'charged', 'by', 'Fulton', 'Superior', 'Court', 'Judge', 'Durwood', 'Pye', 'to', 'investigate', 'reports', 'of', 'possible', '``', 'irregularities', "''", 'in', 'the', 'hard-fought', 'primary', 'which', 'was', 'won', 'by', 'Mayor-nominate', 'Ivan', 'Allen', 'Jr.', '.']]


The `Word2Vec` class accepts, among others, the following parameters:
1. size: length of the word embedding/vector (defaults to 100; renamed `vector_size` in gensim 4.0 and later)
2. window: maximum distance between the current word and the word being predicted (defaults to 5)
3. min_count: ignores all words that have a frequency lower than this value (defaults to 5)
4. workers: number of worker threads used to train the model (defaults to 3)
5. sg: training algorithm; 1 for skip-gram and 0 for CBOW (defaults to 0)

In [19]:
EMB_DIM = 100

w2v = Word2Vec(sentences, 
               size = EMB_DIM, 
               window = 5,
               min_count = 5,
               workers = 3,
               sg = 0)

Now that the model is trained we can:
* Query for the word embeddings of a given word
* Inspect the model vocabulary
* Save the word embeddings

In [20]:
# Query for the word embeddings of a given word
print("embedding for election:",  w2v.wv["election"])

embedding for election: [ 0.11844847 -0.11637932 -0.3091827  -0.33256492 -0.09560913  0.01062229
 -0.17380676 -0.02245147  0.09547108  0.04264915 -0.13025706 -0.3440462
  0.03201969  0.14489679  0.00758509 -0.2599973  -0.30909446  0.1263442
 -0.28031054  0.39954695 -0.15825106 -0.01902786 -0.02355967  0.06687792
  0.28980207 -0.07969534  0.03013778 -0.09673845 -0.11158438 -0.31007853
 -0.2609249  -0.08558761 -0.07878723 -0.09899204 -0.19056423  0.02196357
 -0.19999109 -0.09447739 -0.28249282 -0.14981844 -0.1258716  -0.40801698
  0.20556849  0.11825003 -0.16992597  0.2365957  -0.44568986 -0.26205978
  0.09935443 -0.01176151 -0.19523796  0.3195377   0.34194356  0.07939881
  0.27538422 -0.29285547  0.03440086  0.20882723  0.08378489 -0.00295072
  0.06728559 -0.04086252  0.132572    0.12622212 -0.00713637  0.04562614
 -0.15339862  0.10641158  0.16521294 -0.00504828  0.38302675 -0.09802999
 -0.23418489 -0.07284083  0.19840676  0.06116144 -0.17556016  0.16772768
  0.12868421  0.01283603 -0.3

In [21]:
# Inspect the model vocabulary
print("\nFirst 20 vocabulary words:", list(w2v.wv.vocab)[:20])


First 20 vocabulary words: ['The', 'Fulton', 'County', 'Grand', 'said', 'Friday', 'an', 'investigation', 'of', 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities']


In [23]:
# Check the words most similar to a given word (query through the "wv" attribute)
print("Most similar to Saturday:\n", w2v.wv.similar_by_word("Saturday"))

Most similar to Saturday:
 [('Monday', 0.9576911330223083), ('Sunday', 0.9552295207977295), ('Pennsylvania', 0.953289270401001), ('eighth', 0.9518365859985352), ('sixteenth', 0.9496550559997559), ('Easter', 0.9486106634140015), ('Silence', 0.9478686451911926), ('20th', 0.9454261660575867), ('ending', 0.9454016089439392), ('fourth', 0.9434164762496948)]
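
The scores returned by `similar_by_word` are cosine similarities between embedding vectors. The sketch below reproduces that ranking in pure Python; the three-dimensional `toy_vectors` are made up for illustration and are not taken from the trained model, and `most_similar` here is a hypothetical helper rather than the gensim method.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors (assumption), not real embeddings.
toy_vectors = {
    "saturday": [1.0, 0.9, 0.0],
    "sunday": [0.9, 1.0, 0.1],
    "election": [0.0, 0.1, 1.0],
}
print(most_similar("saturday", toy_vectors))
```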


  


# fastText

In [16]:
# Train the FastText model
# (gensim 4.0 and later rename `size` to `vector_size` and `iter` to `epochs`)
fastText_model = FastText(size=100, window=5, min_count=5, sentences=sentences, iter=5)
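
fastText differs from word2vec in that each word is represented through its character n-grams, which lets the model compose vectors for words it never saw during training. A minimal sketch of the n-gram extraction step follows, assuming the `<` and `>` word-boundary markers and the 3 to 6 character range described in the fastText paper; `char_ngrams` is a hypothetical helper and not part of gensim.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i : i + n])
    return ngrams

print(char_ngrams("where", min_n=3, max_n=3))
```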

In [25]:
# 1. Let's see the word embedding for "election" by accessing the "wv" attribute and passing in "election" as the key.
print("Embedding for election:", fastText_model.wv["election"])

# 2. Inspect the model vocabulary by accessing keys of the "wv.vocab" attribute. We'll print the first 20 words.
print("\nFirst 20 vocabulary words:", list(fastText_model.wv.vocab)[:20])

Embedding for election: [-3.0171871e-04  8.1126946e-01 -6.0227215e-01  5.8451658e-01
  1.9119148e-01 -2.6985628e-03 -5.5760586e-01 -8.4915543e-01
 -1.2416487e-02 -4.1302424e-02  6.3710523e-01  4.5659903e-01
  1.0093259e-01  1.2377757e+00  7.6508266e-01  9.5967108e-01
 -5.6427377e-01 -1.3122000e-01 -4.6205842e-01 -1.0577116e+00
  3.2587388e-01  4.1558594e-01  3.2371905e-01  2.6745009e-01
 -5.0153751e-02 -5.7547307e-01 -6.6508061e-01 -4.6496564e-01
 -6.7831701e-01  2.6945293e-01 -5.6355432e-02 -8.7843937e-01
  5.4984015e-01 -4.6731395e-01 -5.0314542e-02 -1.3001447e+00
 -8.4473205e-01 -6.3425086e-02 -6.3968778e-01  4.0601847e-01
 -9.1658586e-01  8.0501206e-02 -2.4565169e-01  4.2483443e-01
 -8.8506520e-01  1.8807220e-01 -4.1304886e-01  1.1536719e+00
  1.4344645e-01 -1.8072079e-01 -8.5766412e-02  9.3326962e-01
 -4.4527104e-01 -1.9733871e+00  4.1668475e-01 -8.4575778e-01
  6.8148452e-01  2.8996772e-01 -1.9695307e-01 -8.5368168e-01
  1.4542532e+00  5.3738928e-01 -5.9601694e-01 -5.8495277e-01


# GloVe

GloVe is trained on global word-word co-occurrence statistics, with the objective of learning word embeddings such that the dot product of two words' embeddings equals the logarithm of the words' probability of co-occurrence.
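
As a rough illustration of what GloVe optimizes, the sketch below counts co-occurrences and evaluates the weighted least-squares loss for a single word pair. It follows the published formulation, in which the dot product plus two bias terms is fit to the logarithm of the co-occurrence count, with `x_max = 100` and `alpha = 0.75` as in the paper. The function names are hypothetical, and no actual training loop is shown.

```python
import math
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Symmetric co-occurrence counts, each pair weighted by 1/distance."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function that damps the influence of rare pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the model score and log(x_ij)."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

counts = cooccurrence_counts([["a", "b", "c"]], window=2)
print(counts)
print(glove_pair_loss([0.1, 0.2], [0.3, 0.4], 0.0, 0.0, counts[("a", "b")]))
```

The full objective sums `glove_pair_loss` over every pair with a non-zero count and minimizes it with respect to the vectors and biases.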