# Text Representation with Word Embeddings

### Exploring Word Embeddings with New Deep Learning Models

We will explore more sophisticated models that capture semantic information and give us features in the form of vector representations of words, popularly known as embeddings.

Here we will explore the following feature engineering technique:

- Word2Vec

Predictive methods such as neural network based language models try to predict a word from its neighboring words by looking at word sequences in the corpus. In the process, they learn distributed representations, which give us dense word embeddings. We will focus on these predictive methods in this article.

## Prepare a Sample Corpus

Let’s now take a sample corpus of documents on which we will run most of our analyses in this article. A corpus is typically a collection of text documents usually belonging to one or more subjects or domains.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# visualize embeddings
from sklearn.decomposition import PCA

pd.options.display.max_colwidth = 200

corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The birds love to eat and sleep and chirp',
          'The cats love to eat and sleep',
          'The sky is very blue and the sky is very beautiful today',
          'The dog loves to sleep and eat'
]


corpus_df = pd.DataFrame({'Document': corpus})
corpus_df

Let's go ahead and pre-process our text data now.

## Simple Text Pre-processing

Since the focus of this unit is on feature engineering, we will build a simple text pre-processor that removes special characters, extra whitespace, digits and stopwords, and lowercases the text corpus.

In [None]:
import nltk
import re

# get list of common english stopwords
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
# download tokenizer data used by word_tokenize to break sentences into words
# (recent NLTK releases may also require nltk.download('punkt_tab'))
nltk.download('punkt')

# clean the text (heavy cleaning is often unnecessary for deep learning models)
def normalize_document(doc): # example input: 'The sky is blue and beautiful.'
    # remove special characters and digits, then lower case and strip surrounding whitespace
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(corpus)
norm_corpus

In [None]:
import nltk

tokenized_corpus = [nltk.word_tokenize(doc) for doc in norm_corpus]

In [None]:
tokenized_corpus

## The Word2Vec Model

Word2Vec was created at Google in 2013. It is a predictive, neural network based model that computes high quality, distributed, continuous dense vector representations of words which capture contextual and semantic similarity. Essentially, these are unsupervised models that take in massive textual corpora, build a vocabulary of the words they contain and generate a dense embedding for each word in that vocabulary.

Usually you can specify the size of the word embedding vectors, and the total number of vectors is essentially the size of the vocabulary. This makes the dimensionality of this dense vector space much lower than that of the high-dimensional sparse vector space built using traditional Bag of Words models.

There are two different model architectures which Word2Vec can use to create these word embedding representations:

- The Continuous Bag of Words (CBOW) Model
- The Skip-gram Model

## The Continuous Bag of Words (CBOW) Model

The CBOW model architecture tries to predict the current target word (the center word) based on the source context words (surrounding words).

Consider a simple sentence, ___“the quick brown fox jumps over the lazy dog”___. It can be broken into pairs of __(context_window, target_word)__. With a context window of size 2 (one word on each side of the target), we get examples like __([quick, fox], brown)__, __([the, brown], quick)__, __([the, dog], lazy)__ and so on.

Thus the model tries to predict the __`target_word`__ based on the __`context_window`__ words.

![](https://i.imgur.com/ATyNx6u.png)
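
To make the pairing concrete, here is a minimal sketch that builds __(context_window, target_word)__ pairs from the example sentence. The helper name `cbow_pairs` and its `half_window` parameter are hypothetical and only used for illustration; `half_window=1` means one word on each side, which corresponds to the context window of size 2 used in the examples above. Words at the sentence boundaries simply get a shorter context here, which is one possible design choice.

```python
def cbow_pairs(tokens, half_window=1):
    """Build (context_words, target_word) pairs for a CBOW-style task.

    half_window is the number of words taken on each side of the target
    (a hypothetical parameter for this illustration).
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        pairs.append((left + right, target))
    return pairs


sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = cbow_pairs(sentence, half_window=1)

# Show each training example as: context words -> word to predict
for context, target in pairs:
    print(context, "->", target)
```

In an actual Word2Vec implementation these pairs are not materialized as Python lists; the sketch only shows what the model is asked to predict.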


## The Skip-gram Model

The Skip-gram model architecture usually tries to achieve the reverse of what the CBOW model does. It tries to predict the source context words (surrounding words) given a target word (the center word).

Consider our simple sentence from earlier, ___“the quick brown fox jumps over the lazy dog”___. With the CBOW model, we get pairs of __(context_window, target_word)__. With a context window of size 2, we have examples like __([quick, fox], brown)__, __([the, brown], quick)__, __([the, dog], lazy)__ and so on.

Now considering that the skip-gram model’s aim is to predict the context from the target word, the model typically inverts the contexts and targets, and tries to predict each context word from its target word. Hence the task becomes to predict the context __[quick, fox]__ given target word __‘brown’__ or __[the, brown]__ given target word __‘quick’__ and so on.

Thus the model tries to predict the __`context_window`__ words based on the __`target_word`__.

![](https://i.imgur.com/95f3eVF.png)
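
The inversion can be sketched in a few lines. The helper `skipgram_pairs` and its `half_window` parameter are hypothetical names chosen for this illustration; each target word is paired with every context word around it, so one CBOW example such as __([quick, fox], brown)__ becomes two skip-gram examples.

```python
def skipgram_pairs(tokens, half_window=1):
    """Build (target_word, context_word) pairs for a skip-gram-style task.

    half_window is the number of words considered on each side of the
    target (a hypothetical parameter for this illustration).
    """
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - half_window)
        end = min(len(tokens), i + half_window + 1)
        for j in range(start, end):
            if j != i:  # skip the target word itself
                pairs.append((target, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, half_window=1)

# Show each training example as: given word -> context word to predict
for target, context_word in pairs:
    print(target, "->", context_word)
```

This is only a sketch of how the training examples are formed, not of how any particular library generates or samples them.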

Further details can be found in [Text Analytics with Python](https://github.com/dipanjanS/text-analytics-with-python/tree/master/New-Second-Edition/Ch04%20-%20Feature%20Engineering%20for%20Text%20Representation)

## Robust Word2Vec Model with Gensim

The __`gensim`__ framework, created by Radim Řehůřek, provides a robust, efficient and scalable implementation of the Word2Vec model. We will use it on our sample toy corpus. In our workflow, we will tokenize our normalized corpus and then focus on the following five parameters of the Word2Vec model to build it.

- __`vector_size`:__ The word embedding dimensionality (this parameter was named `size` in gensim versions before 4.0)
- __`window`:__ The context window size
- __`min_count`:__ The minimum word count
- __`sample`:__ The downsample setting for frequent words
- __`sg`:__ The training algorithm: 1 for skip-gram, otherwise CBOW

We will build a simple Word2Vec model on the corpus and visualize the embeddings.
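
To give some intuition for the `sample` setting, the sketch below uses the subsampling rule described in the original Word2Vec paper, where a word with relative frequency `f` is discarded with probability `1 - sqrt(t / f)` for a threshold `t`. The function name `discard_probability`, the example tokens and the threshold value are assumptions made for illustration, and library implementations may use a slightly different variant of this formula.

```python
import math
from collections import Counter


def discard_probability(word_freq, threshold=1e-3):
    """Probability of dropping one occurrence of a word during training.

    Uses the rule from the original Word2Vec paper, 1 - sqrt(t / f),
    clipped at 0 so that rare words are always kept. The threshold is
    an illustrative value, not a recommendation.
    """
    return max(0.0, 1.0 - math.sqrt(threshold / word_freq))


tokens = "the sky is blue and the sky is very blue".split()
counts = Counter(tokens)
total = sum(counts.values())

# Frequent words get a higher chance of being dropped than rare ones
for word, count in counts.most_common():
    freq = count / total
    print(word, round(freq, 2), round(discard_probability(freq), 3))
```

On a corpus this small every word is relatively frequent, so the effect is exaggerated; on large corpora the rule mainly thins out very common words.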

In [None]:
import gensim
gensim.__version__
#4.0.0+

In [None]:
tokenized_corpus

__Replace `<REPLACE WITH CODE HERE>` sections with your own code__

In [None]:
from gensim.models import word2vec


# Set values for various parameters
feature_size = 20    # Word vector dimensionality: every word maps to a vector of 20 floats
window_context = 5  # Context window size (looking at surrounding words)
min_word_count = 1   # Minimum word count
sg = 1               # skip-gram model if sg = 1 and CBOW if sg = 0

w2v_model = word2vec.Word2Vec(<REPLACE WITH CODE HERE>)
w2v_model

## Exploring the Trained Word Embeddings

__Replace `<REPLACE WITH CODE HERE>` sections with your own code__

In [None]:
# embedding for the word sky
w2v_model.wv['sky']

In [None]:
# embedding for the word cats
<REPLACE WITH CODE HERE>

In [None]:
# embedding for the word dog
<REPLACE WITH CODE HERE>

In [None]:
w2v_model.wv['sky'].shape

In [None]:
w2v_model.wv.index_to_key

In [None]:
words = w2v_model.wv.index_to_key
wvs = w2v_model.wv[words]

pca = PCA(n_components=2, random_state=42)
np.set_printoptions(suppress=True)
pcs = pca.fit_transform(wvs)
labels = words

plt.figure(figsize=(10, 7))
plt.scatter(pcs[:, 1], pcs[:, 0], c='orange', edgecolors='r')
for label, x, y in zip(labels, pcs[:, 1], pcs[:, 0]):
    plt.annotate(label, xy=(x+0.005, y+0.005), xytext=(0, 0), textcoords='offset points')

In [None]:
w2v_model.wv['sky'], w2v_model.wv['sky'].shape

In [None]:
vec_df = pd.DataFrame(wvs, index=words)
vec_df

### Looking at term semantic similarity

__Replace `<REPLACE WITH CODE HERE>` sections with your own code__

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = <REPLACE WITH CODE HERE>
similarity_df = pd.DataFrame(similarity_matrix, index=words, columns=words)
similarity_df

In [None]:
w2v_model.wv.most_similar('sky')

In [None]:
w2v_model.wv.most_similar('dog')
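
`most_similar` ranks words by the cosine similarity between their vectors. As a minimal sketch of that measure, the example below computes it by hand on a few made-up 2-dimensional vectors. The values in `toy_vectors` and the helper name `cosine` are assumptions for illustration only and are not output from the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration; real embeddings come from w2v_model.wv
toy_vectors = {
    'sky': [1.0, 0.0],
    'blue': [1.0, 1.0],
    'dog': [0.0, 1.0],
}

# Rank the other words by similarity to 'sky', highest first
query = 'sky'
ranked = sorted(
    (word for word in toy_vectors if word != query),
    key=lambda word: cosine(toy_vectors[query], toy_vectors[word]),
    reverse=True,
)
print(ranked)
```

Because cosine similarity depends only on the angle between vectors, two words can be close neighbors even when their vectors have very different lengths.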