# Word Embeddings

Word embeddings are a featurised representation of each word in our vocabulary. One of their main advantages is that they allow a model to generalise across words. With one-hot encoding, each word is treated in isolation, with no information on how words relate to one another, for example boy vs. girl, apple vs. orange, or old vs. young. In addition, one-hot vectors quickly become impractical as the vocabulary grows. Say we have a vocabulary of 1 million words and convert each word into a one-hot vector. Imagine the memory cost of loading a passage 1 billion words long, as well as the computation cost of a softmax over 1 million classes. As such, we require an alternative representation.
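
To make the limitation concrete, here is a minimal sketch in plain Python. The four-word vocabulary, the helper names `one_hot` and `dot`, and the 32-bit storage assumption are all illustrative and not part of the original notebook. It shows that any two distinct one-hot vectors have a dot product of zero, so the encoding carries no notion of similarity, and that a dense one-hot vector needs one entry per vocabulary word.

```python
# Toy vocabulary; the words are illustrative, not drawn from the corpus.
vocab = ["apple", "boy", "girl", "orange"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a list with a 1 at the word's vocabulary index and 0 elsewhere."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Distinct words never share a non-zero position, so "boy" looks
# no closer to "girl" than it does to "orange".
boy_girl = dot(one_hot("boy"), one_hot("girl"))
boy_orange = dot(one_hot("boy"), one_hot("orange"))
print(boy_girl, boy_orange)

# A dense one-hot vector stores one entry per vocabulary word.
vocab_size = 1_000_000   # illustrative vocabulary size
bytes_per_entry = 4      # assumption: 32-bit floats
bytes_per_word = vocab_size * bytes_per_entry
print(bytes_per_word)
```

In practice, libraries usually store word indices rather than dense one-hot vectors, but the softmax over the full vocabulary remains expensive either way.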

Word embeddings attempt to solve these issues by mapping words to a smaller set of features, which allows the network to generalise better, especially to words it has not seen in a particular context during training. They aim to capitalise on the hidden semantic relationships between words to reduce the dimensionality of the representation. For example, with a vocabulary of 10,000 words, each word goes from a 10,000-dimensional one-hot vector to a featurised representation of, let's say, 300 features. With so few dimensions, the model is forced to pack the main differentiating characteristics of each word into this 300-feature set.
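
The mapping can be pictured as a lookup into an embedding matrix with one row per word. The sketch below uses toy sizes (10 words and 3 features, standing in for the 10,000 words and 300 features mentioned above) and random values in place of learned ones. The names `embeddings` and `lookup_by_matmul` are hypothetical. It shows that multiplying a one-hot vector by the matrix simply selects one row.

```python
import random

random.seed(0)

vocab_size = 10      # toy stand-in for a 10,000-word vocabulary
embedding_dim = 3    # toy stand-in for 300 features

# One row of features per word. Random here; in a real model these
# values would be learned during training.
embeddings = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def one_hot(index, size):
    """Return a one-hot list of the given size."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def lookup_by_matmul(index):
    """Multiply a one-hot row vector by the embedding matrix."""
    hot = one_hot(index, vocab_size)
    return [
        sum(hot[i] * embeddings[i][j] for i in range(vocab_size))
        for j in range(embedding_dim)
    ]


word_index = 4
dense_vector = lookup_by_matmul(word_index)

# The product selects a single row, so a direct row lookup is equivalent.
print(dense_vector == embeddings[word_index])
print(len(dense_vector))
```

This is why embedding layers are typically implemented as a table lookup rather than as an actual matrix multiplication.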

In this series, we will focus on a few methods for training word embeddings, namely a Neural Language Model, the Skip-Gram (Word2Vec) model and the GloVe model.

# Preparing the data
We will be using a dataset available on NLTK known as the Brown Corpus.

Reference: https://www.nltk.org/book/ch02.html

In [1]:
import nltk
# Uncomment on the first run to download the Brown Corpus
# nltk.download('brown')

In [27]:
# load dataset
sentences = nltk.corpus.brown.sents(categories='adventure')

In [28]:
print(f"Total number of sentences: {len(sentences)}")

Total number of sentences: 4637


In [29]:
print(sentences[:3])

[['Dan', 'Morgan', 'told', 'himself', 'he', 'would', 'forget', 'Ann', 'Turner', '.'], ['He', 'was', 'well', 'rid', 'of', 'her', '.'], ['He', 'certainly', "didn't", 'want', 'a', 'wife', 'who', 'was', 'fickle', 'as', 'Ann', '.']]


In [30]:
# length of the sentences
count_sent = [len(sent) for sent in sentences]
print(f"Maximum sentence length: {max(count_sent)}")
print(f"Mean sentence length: {sum(count_sent) / len(count_sent)}")
print(f"Minimum sentence length: {min(count_sent)}")

Maximum sentence length: 144
Mean sentence length: 14.95406512831572
Minimum sentence length: 1


The text from NLTK has already been split into sentences and tokenised. This is great, as we can skip many preprocessing steps and quickly dive into modelling.

# Neural Language Model

In [32]:
# collect the unique tokens in the corpus (the vocabulary)
{w for sent in sentences for w in sent}

{'guilty',
 'altitude',
 'fatigues',
 "Brandon's",
 'garbed',
 'ships',
 'dismissed',
 'purpose',
 'scrubbed',
 'barefoot',
 'straps',
 'tipple',
 'patched',
 'butts',
 'politely',
 'examined',
 'scabbard',
 'scalp',
 "one's",
 'dripping',
 'clad',
 "Ain't",
 'talking',
 'pall',
 'Baptist',
 'haughty',
 'sprung',
 'Khasi',
 'careless',
 'blur',
 'blob',
 'saddled',
 'ate',
 'stale',
 'shin',
 'breathing',
 "C'mon",
 'business',
 'mischievous',
 'slip',
 'bankruptcy',
 'gasping',
 'November',
 'tipped',
 'same',
 'rightful',
 'fervor',
 'Paris',
 'terms',
 'Running',
 'unmistakable',
 'getting',
 'creeping',
 "Burnsides'",
 'crook',
 'lurch',
 'activities',
 'under',
 'of',
 'dining',
 'trouble',
 '2',
 'Sweat',
 'Others',
 'dropping',
 'distract',
 'wishful',
 'realization',
 'forefinger',
 'ruddiness',
 'grunted',
 'disappointment',
 'homely',
 'phone',
 'medical',
 'reviving',
 "Jed's",
 'signs',
 'secrets',
 'hugging',
 'whistle',
 'moral',
 'steer',
 'hickory',
 'superb',
 'bright'
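
A Python set has no stable ordering, so a model usually needs a fixed mapping from each word to an integer index. The sketch below is one possible way to build it, not necessarily the approach used later in this series. The names `word_to_index`, `index_to_word` and `encoded` are assumptions, and the two sentences are copied from the sample output above so that the block runs without downloading the corpus.

```python
# First two sentences of the adventure category, copied from the output above.
sample_sentences = [
    ['Dan', 'Morgan', 'told', 'himself', 'he', 'would', 'forget', 'Ann', 'Turner', '.'],
    ['He', 'was', 'well', 'rid', 'of', 'her', '.'],
]

# Sorting makes the index assignment reproducible across runs.
vocab = sorted({w for sent in sample_sentences for w in sent})
word_to_index = {word: i for i, word in enumerate(vocab)}
index_to_word = {i: word for word, i in word_to_index.items()}

# Replace every token with its integer index.
encoded = [[word_to_index[w] for w in sent] for sent in sample_sentences]

print(len(vocab))
print(encoded[1])
```

Note that 'He' and 'he' receive different indices here. Whether to lowercase tokens first is a design choice this sketch leaves open.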

# Cosine Similarity