In [52]:
# importing
import pandas as pd
import numpy as np
import bz2
import nltk
import torch
import torchtext
import gensim

# Document Feature Extraction, Text Processing, and Word Embedding
## Requirements
- Introduce scikit-learn, nltk, and other text processing libraries
- Explain basic feature extraction from text data
    - Number of words
    - Number of characters
    - Average word length
    - Number of stopwords
    - Other feature extraction techniques that may be relevant
- Explain advanced text processing
    - N-grams
    - Term Frequency (TF)
    - Inverse Document Frequency (IDF)
    - Term Frequency-Inverse Document Frequency (TF-IDF)
    - "Bag of Words" document representation
    - Word embedding
    - Other text processing methods that may be important
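
The basic features listed above (word count, character count, average word length, stopword count) can be sketched with the standard library alone. The stopword set below is a tiny hypothetical list chosen for illustration, and the helper name basic_features is likewise an assumption; a real pipeline would more likely use a fuller stopword list such as the one shipped with nltk.

```python
import re

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "i", "is", "it", "of", "the", "this", "to"}

def basic_features(text):
    """Return simple surface-level features of a piece of text."""
    words = text.split()  # whitespace-delimited words
    # Strip punctuation and lowercase before checking the stopword list.
    cleaned = [re.sub(r"\W+", "", w).lower() for w in words]
    return {
        "num_words": len(words),
        "num_chars": len(text),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "num_stopwords": sum(1 for w in cleaned if w in STOPWORDS),
    }

features = basic_features("This game is tons of fun, even for a 42 year old.")
print(features)
```

Note that the average word length here still counts attached punctuation (for example, "fun," has length 4); whether to strip it first is a design choice.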

# Document Feature Extraction
## TF/IDF
Term Frequency (TF) and Inverse Document Frequency (IDF) are statistics that indicate which terms are most relevant to a given document in a corpus. TF measures how often a given word occurs in a document. This measurement is useful because words that occur often in a document are typically relevant to its meaning. TF can be calculated in the following way:

$TF = OC / T$

$OC$ is the number of occurrences of that particular word in the document and $T$ is the total number of words in the document. Suppose we have access to all of the articles on Wikipedia. Our corpus, in this context, is all of these Wikipedia articles combined, and each article is a document. Document Frequency (DF) is a measure of how frequently a given word appears across the entire corpus. The IDF is the inverse of the DF:

$IDF = \frac{1}{DF}$

Using the TF and the IDF we can measure how important and unique a given word is for a particular document. Because word frequencies are heavily skewed (a few words are extremely common while most are rare), the raw IDF spans several orders of magnitude, so we take its logarithm to obtain a measure for the value of a word in a given document:

Word Value (WV) $= TF \cdot \log(IDF)$

This approach treats each document in a corpus as a "bag of words": only the counts of the words matter, not their order. It therefore does not take into account the meaning of sentences or the contexts in which words are used, which limits the applications of this method.
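
The formulas above can be traced on a toy corpus. This is a minimal sketch, not a production implementation: the three documents are hypothetical, and DF is assumed to be the fraction of documents that contain the word, which is one common way to make "how frequently a word appears in the corpus" concrete.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents, already lowercased and tokenized).
corpus = [
    "serena williams won the match".split(),
    "the match was played on the grass court".split(),
    "the court ruled on the case".split(),
]

def term_frequency(word, document):
    """TF = OC / T: occurrences of the word divided by the document length."""
    return Counter(document)[word] / len(document)

def inverse_document_frequency(word, corpus):
    """IDF = 1 / DF, with DF assumed to be the fraction of documents containing the word."""
    containing = sum(1 for document in corpus if word in document)
    return len(corpus) / containing

def word_value(word, document, corpus):
    """WV = TF * log(IDF); the document is assumed to belong to the corpus."""
    tf = term_frequency(word, document)
    if tf == 0:
        return 0.0  # a word that is absent from the document has no value for it
    return tf * math.log(inverse_document_frequency(word, corpus))

for word in ["serena", "match", "the"]:
    print(word, round(word_value(word, corpus[0], corpus), 4))
```

A word that appears in every document, such as "the", has an IDF of 1 and therefore a WV of 0, while a word unique to one document receives the highest value.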

### TF/IDF Application
Suppose we wish to search our Wikipedia corpus for the documents most relevant to Serena Williams. We could compute the WV for "Serena Williams" across all of the documents. Then, we could return the set of documents that had the highest "Serena Williams" WVs. This would be a reasonably effective way to query a corpus for relevant documents.
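
A query of this kind can be sketched as follows. The corpus is hypothetical, and two choices here are assumptions rather than part of the method described above: the query is split into individual words, and a document's score is the sum of the WVs of those words.

```python
import math

def word_value(word, document, corpus):
    """WV = TF * log(1 / DF), with DF as the fraction of documents containing the word."""
    tf = document.count(word) / len(document)
    df = sum(1 for d in corpus if word in d) / len(corpus)
    return tf * math.log(1 / df) if tf > 0 else 0.0

def rank_documents(query, corpus):
    """Return document indices ordered from most to least relevant to the query."""
    scores = [sum(word_value(w, d, corpus) for w in query) for d in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

# Hypothetical, pre-tokenized documents.
corpus = [
    "serena williams won the final".split(),
    "venus williams and serena williams played doubles".split(),
    "the final was postponed because of rain".split(),
]

ranking = rank_documents(["serena", "williams"], corpus)
print(ranking)
```

The second document ranks first because the query words make up a larger share of its text, and the third document ranks last because it contains neither word.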

# N-Gram Models
N-Gram models are widely used in text analysis. An N-Gram model estimates the probability of a word occurring based on the N-1 words that precede it. These models have many use cases. They can be used to determine which word belongs in a particular sentence for the purpose of text generation. They can be used to detect spelling errors in sentences. They can also be used as language models within speech recognition systems.

We will illustrate a simple example of how an N-Gram model can be used to detect a spelling error. Suppose we are given the following set of sentences:


In [6]:
sentence = "Today I went to the store. Yesterday it took me 10 minutes to drive there. Today, it took me 15 minutes to get there and 15 minuets to get back."

"minutes" was spelled as "minuets" in the third sentence. If we implement a simple bi-gram (2-gram) model on this corpus, we can compute the probability of each word appearing as a function of the word appearing immediately before that word. The following snippet tokenizes the above sentence and computes the bi-gram frequencies:

In [7]:
tree_bank_tokenizer = nltk.tokenize.TreebankWordTokenizer()
wnlemmatizer = nltk.stem.WordNetLemmatizer()
words = tree_bank_tokenizer.tokenize(sentence)
words = [wnlemmatizer.lemmatize(word) for word in words]
n_grams_series = pd.Series(nltk.ngrams(words, 2))
print(n_grams_series.value_counts())

(to, get)              2
(minute, to)           2
(took, me)             2
(it, took)             2
(drive, there.)        1
(there., Today)        1
(back, .)              1
(there, and)           1
(I, went)              1
(and, 15)              1
(Today, I)             1
(Today, ,)             1
(Yesterday, it)        1
(went, to)             1
(store., Yesterday)    1
(me, 10)               1
(15, minute)           1
(minuet, to)           1
(get, back)            1
(get, there)           1
(the, store.)          1
(10, minute)           1
(,, it)                1
(15, minuet)           1
(to, the)              1
(me, 15)               1
(to, drive)            1
dtype: int64


From the output above, we can see that the bi-gram ("minute", "to") occurs twice. Excluding the occurrence that follows "minuets", the word "to" appears 3 times in the corpus: twice after "minute" and once after "went". Under this bi-gram model, there is therefore an estimated probability of 2/3 that "minuets" should be "minutes" and a probability of 1/3 that it should be "went".
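
The 2/3 and 1/3 estimates can be reproduced with a short standard-library sketch. It uses a simplified regex tokenizer instead of the tokenizer and lemmatizer from the cell above (an assumption made to keep the example self-contained), so the candidate appears as "minutes" rather than the lemma "minute". The function name preceding_word_probabilities is hypothetical.

```python
import re
from collections import Counter
from fractions import Fraction

sentence = ("Today I went to the store. Yesterday it took me 10 minutes to drive there. "
            "Today, it took me 15 minutes to get there and 15 minuets to get back.")

# Simplified tokenization: keep runs of word characters and drop punctuation.
words = re.findall(r"\w+", sentence)
bigrams = Counter(zip(words, words[1:]))

def preceding_word_probabilities(next_word, exclude):
    """Estimate P(previous word | next word), ignoring the suspect word."""
    counts = {prev: n for (prev, nxt), n in bigrams.items()
              if nxt == next_word and prev != exclude}
    total = sum(counts.values())
    return {prev: Fraction(n, total) for prev, n in counts.items()}

probabilities = preceding_word_probabilities("to", exclude="minuets")
print(probabilities)
```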

In this simple example a bi-gram model was able to flag a likely spelling mistake. By using a large corpus we can achieve very effective N-gram models. Further, by increasing the model's "N" we can sometimes increase the effectiveness of the model. Increasing "N" can also, however, harm the effectiveness of the model by using irrelevant words to determine the context of a given word.

# Text Processing for Complex Models
## Why is Text Processing Important?
Natural language processing is a branch of artificial intelligence in which unstructured text data is broken down into information that can be analyzed by models. Especially in the age of social media, humans communicate online through unstructured text all the time, whether through Twitter, private messaging platforms like Facebook Messenger, or product review sites like Yelp. The ability to transform these unstructured blobs of text into forms that machines can understand opens the door to a wide variety of applications.

These applications primarily fall under text classification and generative models. Text classification models try to uncover the intent, logic, and/or meaning behind blobs of text. This model type can be implemented in all sorts of contexts, for example as a system that automatically sorts through text reviews to characterize the performance of a fast-food chain.

Processed text can also be used as an input to generative models. These generative models learn how to communicate by training on large blobs of processed text. These models can be used to generate speeches or converse with customers, for example.

## Tokenization
When analyzing a corpus, whether it is a string of characters, a chain of words, or a set of reviews, we start with raw text. This text may contain spelling errors, may or may not obey the grammatical rules of the English language, and may contain punctuation. For our analysis, we will consider Amazon review data. The following code snippets load the Amazon review dataset and display a handful of example reviews:

In [8]:
# loading data
# available: https://www.kaggle.com/bittlingmayer/amazonreviews
train_data_file = bz2.BZ2File("amazon-review-data/train.ft.txt.bz2")
test_data_file = bz2.BZ2File("amazon-review-data/test.ft.txt.bz2")
train_lines = train_data_file.readlines()
test_lines = test_data_file.readlines()
del train_data_file, test_data_file
train_lines = [x.decode('utf-8') for x in train_lines]
test_lines = [x.decode('utf-8') for x in test_lines]

In [9]:
# dataset properties
train_lines_len = len(train_lines)
test_lines_len = len(test_lines)
print("Train Lines Length =", train_lines_len)
print("Test Lines Length =", test_lines_len)

total_length = train_lines_len + test_lines_len

print("Length of Total Dataset =", total_length)
print("Percentage of Training Data = {}%".format(round(train_lines_len / total_length * 100, 3)))
print("Percentage of Testing Data = {}%".format(round(test_lines_len / total_length * 100, 3)))



Train Lines Length = 3600000
Test Lines Length = 400000
Length of Total Dataset = 4000000
Percentage of Training Data = 90.0%
Percentage of Testing Data = 10.0%


In [10]:
# amazon review examples
np.random.seed(20)
random_indices = [np.random.randint(100) for x in range(3)]
for index in random_indices:
    print("Amazon Example Review at Index {}".format(index))
    print(train_lines[index][11:])

Amazon Example Review at Index 99
Caution!: These tracks are not the "original" versions but are re-recorded versions. So, whether the tracks are "remastered" or not is irrelevant.

Amazon Example Review at Index 90
No instructions included - do not trust seller: Promised with this item are "Complete Instructions" and the additional pledge that "Sweet Graces will email you with the Supply List and Instruction sheets on purchase - so you can be ready ahead of time!" I received none of this - only a plastic figurine and bracelet. To boot, Amazon claims they can do nothing to help me contact the seller. All I got was a phone number for the manufacturer. Let's hope that yields some results. Meanwhile, I'm wishing I had listened to previous feedback about this unreliable seller :/

Amazon Example Review at Index 15
Don't try to fool us with fake reviews.: It's glaringly obvious that all of the glowing reviews have been written by the same person, perhaps the author herself. They all have th

As shown above, these text reviews contain a few sentences strung together to form a review. These sentences contain punctuation, capital letters, slang, and spelling mistakes. Our goal when processing these blobs of text is to take these unstructured sentences and transform them into a set of tokens that a model can understand. Currently, the model cannot understand the meaning of these sentences.

There are several natural language processing libraries in Python that allow us to create tokens from these unstructured sentences. The way in which we break these sentences apart will determine the effectiveness of the model.

To create tokens, we must first consider the way in which we understand the English language. A sentence is made up of words. These words are separated by spaces. We can tokenize a sentence by splitting up the sentence using white-space as a delimiter. Consider the following text entry:

In [11]:
text_entry_example = train_lines[24][11:]
print(text_entry_example)

i liked this album more then i thought i would: I heard a song or two and thought same o same o,but when i listened to songs like "blue angel","lanna" and 'mama" the hair just rose off my neck.Roy is trully an amazing singer with a talent you don't find much now days.



### WhitespaceTokenizer
Using the ```nltk.tokenize.WhitespaceTokenizer```, we can separate the above review using white-space as the delimiter:

In [12]:
white_space_tokenizer = nltk.tokenize.WhitespaceTokenizer()
print(white_space_tokenizer.tokenize(text_entry_example))

['i', 'liked', 'this', 'album', 'more', 'then', 'i', 'thought', 'i', 'would:', 'I', 'heard', 'a', 'song', 'or', 'two', 'and', 'thought', 'same', 'o', 'same', 'o,but', 'when', 'i', 'listened', 'to', 'songs', 'like', '"blue', 'angel","lanna"', 'and', '\'mama"', 'the', 'hair', 'just', 'rose', 'off', 'my', 'neck.Roy', 'is', 'trully', 'an', 'amazing', 'singer', 'with', 'a', 'talent', 'you', "don't", 'find', 'much', 'now', 'days.']


The problem with this tokenization method is that tokens like 'would:', 'neck.Roy', '"blue', 'angel","lanna"', and 'days.' don't have much meaning. A model would treat 'would:' and 'would' as different tokens, and 'neck.Roy' as unrelated to 'neck' and 'Roy'. We need to somehow take into account the punctuation present in each review. Further, we need to factor in capital letters, plural and singular forms of words, and spelling mistakes. Consider another example:

In [13]:
text_entry_example = train_lines[15][11:]
print(text_entry_example)

Don't try to fool us with fake reviews.: It's glaringly obvious that all of the glowing reviews have been written by the same person, perhaps the author herself. They all have the same misspellings and poor sentence structure that is featured in the book. Who made Veronica Haddon think she is an author?



### WordPunctTokenizer
To take into account the punctuation present in a review, we will try the ```nltk.tokenize.WordPunctTokenizer```:

In [14]:
word_punct_tokenizer = nltk.tokenize.WordPunctTokenizer()
print(word_punct_tokenizer.tokenize(text_entry_example))

['Don', "'", 't', 'try', 'to', 'fool', 'us', 'with', 'fake', 'reviews', '.:', 'It', "'", 's', 'glaringly', 'obvious', 'that', 'all', 'of', 'the', 'glowing', 'reviews', 'have', 'been', 'written', 'by', 'the', 'same', 'person', ',', 'perhaps', 'the', 'author', 'herself', '.', 'They', 'all', 'have', 'the', 'same', 'misspellings', 'and', 'poor', 'sentence', 'structure', 'that', 'is', 'featured', 'in', 'the', 'book', '.', 'Who', 'made', 'Veronica', 'Haddon', 'think', 'she', 'is', 'an', 'author', '?']


From the above output we can see that the contractions "Don't" and "It's" are broken into tokens like "t", "s", and "'". These fragments carry little meaning on their own.
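
The difference between the two tokenizers so far can be reproduced with the standard library. This is a sketch under one assumption: the pattern below alternates runs of word characters with runs of punctuation, which, to my understanding, is the same idea the WordPunctTokenizer is built on. The example text is a shortened version of the review above.

```python
import re

text = "Don't try to fool us with fake reviews.: It's glaringly obvious."

# Whitespace tokenization: punctuation stays glued to the neighbouring word.
whitespace_tokens = text.split()

# Word/punctuation tokenization: runs of word characters and runs of
# punctuation become separate tokens, which also splits contractions.
word_punct_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens)
print(word_punct_tokens)
```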

### TreebankWordTokenizer
To solve this problem we can use a more advanced tokenizer, the ```nltk.tokenize.TreebankWordTokenizer```, which splits contractions into more meaningful tokens:

In [15]:
tree_bank_tokenizer = nltk.tokenize.TreebankWordTokenizer()
text_entry_tb_output = tree_bank_tokenizer.tokenize(text_entry_example)
print(text_entry_tb_output)

['Do', "n't", 'try', 'to', 'fool', 'us', 'with', 'fake', 'reviews.', ':', 'It', "'s", 'glaringly', 'obvious', 'that', 'all', 'of', 'the', 'glowing', 'reviews', 'have', 'been', 'written', 'by', 'the', 'same', 'person', ',', 'perhaps', 'the', 'author', 'herself.', 'They', 'all', 'have', 'the', 'same', 'misspellings', 'and', 'poor', 'sentence', 'structure', 'that', 'is', 'featured', 'in', 'the', 'book.', 'Who', 'made', 'Veronica', 'Haddon', 'think', 'she', 'is', 'an', 'author', '?']


"Do" + "n't" conveys the meaning of "Don't" better than "Don", "'", "t". Of the three tokenizers, the ```TreebankWordTokenizer``` produces the most meaningful tokens for this review. Note, however, that tokens such as 'reviews.', 'herself.', and 'book.' still carry a trailing period: this tokenizer expects its input to be split into sentences first, and it only separates a period at the very end of the text.

## Token Normalization
Now that we have split our data into tokens, we must further parse these tokens. It may be the case that we want the same token for different forms of a given word. For example, we may want both "pen" and "pens" to be represented by the "pen" token. Moreover, we may want "person", "people", and "persons" to all be represented by "person". There are two ways in which we can collapse these word forms into a single token: stemming and lemmatization.

### Stemming
Stemming is the process of removing and/or replacing the suffixes of words to obtain the root meaning of the word. This normalization method simply cuts off the suffixes of various words to obtain simplified and understandable tokens. Let's apply the ```nltk.stem.PorterStemmer``` normalization method to a list of abnormal and plural words:

In [16]:
pstemmer = nltk.stem.PorterStemmer()
plural_words = ["persons", "Feet", "apples", "Trying", "fries", "geese", "women"]
for word in plural_words:
    print("Original Word = {}, Stemmed Word = {}\n".format(word, pstemmer.stem(word)))

Original Word = persons, Stemmed Word = person

Original Word = Feet, Stemmed Word = feet

Original Word = apples, Stemmed Word = appl

Original Word = Trying, Stemmed Word = tri

Original Word = fries, Stemmed Word = fri

Original Word = geese, Stemmed Word = gees

Original Word = women, Stemmed Word = women



From the output above, we can see that the stemmer reduces words like persons, apples, and fries to a common stem, even though stems such as "appl" and "fri" are not dictionary words. Notice that the stemmer also lowercases each word. The stemmer fails, however, to map the irregular plurals women, geese, and feet to their singular counterparts. The following normalization method addresses these situations.
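
To see why suffix stripping cannot handle irregular plurals, consider a sketch of a single rule set modeled on step 1a of Porter's algorithm, which deals with plural suffixes. This is not the full stemmer: the function name strip_plural_suffix is hypothetical, and later steps of the real algorithm make further changes (which is why the PorterStemmer above returns "appl" rather than "apple").

```python
def strip_plural_suffix(word):
    """Apply plural-suffix rules modeled on step 1a of Porter's algorithm."""
    word = word.lower()
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # fries -> fri
    if word.endswith("ss"):
        return word        # grass -> grass
    if word.endswith("s"):
        return word[:-1]   # persons -> person
    return word            # no plural suffix to strip

for word in ["persons", "Feet", "apples", "fries", "geese", "women"]:
    print(word, "->", strip_plural_suffix(word))
```

Because "geese", "women", and "feet" do not end in any of these suffixes, no rule fires and the words pass through unchanged apart from lowercasing.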

### Lemmatization
A lemmatizer looks up each token in a lexical database (here, WordNet) and returns its dictionary form, or lemma. This normalization technique can properly address irregular plural words. The following is an implementation of the ```nltk.stem.WordNetLemmatizer```:

In [17]:
nltk.download('wordnet')  # the lemmatizer needs the WordNet corpus
wnlemmatizer = nltk.stem.WordNetLemmatizer()
for word in plural_words:
    print("Original Word = {}, Lemmatized Word = {}\n".format(word, wnlemmatizer.lemmatize(word)))

Original Word = persons, Lemmatized Word = person

Original Word = Feet, Lemmatized Word = Feet

Original Word = apples, Lemmatized Word = apple

Original Word = Trying, Lemmatized Word = Trying

Original Word = fries, Lemmatized Word = fry

Original Word = geese, Lemmatized Word = goose

Original Word = women, Lemmatized Word = woman

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\purpl\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


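Two behaviors are visible in the output: "Feet" and "Trying" are returned unchanged, because the lemmatizer is case-sensitive and treats every token as a noun unless a part of speech is supplied. Another drawback of lemmatization is that, in niche circumstances, words with different meanings are reduced to the same lemma. It is important to consider the context in which these normalization methods are used so that we can make an informed decision as to which one will work better.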
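
The lookup idea behind lemmatization can be illustrated with a toy table. The dictionary below is a hypothetical, hand-written stand-in for a lexical database such as WordNet, and lookup_lemma is an illustrative helper, not an nltk function. Lowercasing before the lookup is one simple way to avoid case mismatches such as "Feet".

```python
# Hypothetical lemma table; a real lexical database holds many thousands of entries.
LEMMA_TABLE = {"feet": "foot", "geese": "goose", "women": "woman", "fries": "fry"}

def lookup_lemma(token):
    """Lowercase the token, then return its lemma if the table has one."""
    normalized = token.lower()
    return LEMMA_TABLE.get(normalized, normalized)

for token in ["Feet", "geese", "women", "table"]:
    print(token, "->", lookup_lemma(token))
```

Unlike suffix stripping, a lookup can map irregular forms to their lemmas, but only for words the table actually contains.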

# Token Representations
Now that we have normalized the tokens appropriately, we must represent these tokens in a way that a machine can understand. Models like neural networks and SVMs accept real-valued vector inputs. We need to transform these string tokens into real-valued vectors that are related to the string tokens in some meaningful way.

Given a corpus, the vocabulary of the corpus is all of the unique words extracted from the corpus through tokenization. We need to represent every word in the vocabulary with some real-numbered vector. 

There are a few ways that we can do this. We can manually create representations for tokens by one-hot encoding each token uniquely or by manually training a set of custom word embeddings. Alternatively, we can use pre-trained word embeddings like Word2Vec embeddings and GloVe embeddings.

Consider the following tokenized and lemmatized Amazon review example:

In [31]:
np.random.seed(8)
sentence = train_lines[np.random.randint(100)][11:]
print("Original Sentence:\n{}".format(sentence))

tree_bank_tokenizer = nltk.tokenize.TreebankWordTokenizer()
wnlemmatizer = nltk.stem.WordNetLemmatizer()
words = tree_bank_tokenizer.tokenize(sentence)
words = [wnlemmatizer.lemmatize(word) for word in words]
print("\nTokenized & Lemmatized Sentence:\n{}".format(words))

vocabulary = set(words)
print("\nVocabulary:\n{}".format(vocabulary))

Original Sentence:
Even Mommy has fun with this one!: My four year old daughter loves everything Barbie and loves the Rapunzel movie. This game is tons of fun, even for a 42 year old. We love playing it together. We love decorating all the rooms and finding the gems. What even better is, she can play it alone and I get some me time!


Tokenized & Lemmatized Sentence:
['Even', 'Mommy', 'ha', 'fun', 'with', 'this', 'one', '!', ':', 'My', 'four', 'year', 'old', 'daughter', 'love', 'everything', 'Barbie', 'and', 'love', 'the', 'Rapunzel', 'movie.', 'This', 'game', 'is', 'ton', 'of', 'fun', ',', 'even', 'for', 'a', '42', 'year', 'old.', 'We', 'love', 'playing', 'it', 'together.', 'We', 'love', 'decorating', 'all', 'the', 'room', 'and', 'finding', 'the', 'gems.', 'What', 'even', 'better', 'is', ',', 'she', 'can', 'play', 'it', 'alone', 'and', 'I', 'get', 'some', 'me', 'time', '!']

Vocabulary:
{'this', 'a', 'some', 'We', 'old', '42', 'playing', 'time', 'year', 'ha', 'fun', 'the', ',', 'can',

Next, we demonstrate ways to encode these tokens so that a model can make use of them.

## One-Hot Encoding
To represent the vocabulary of the above sentence using a one-hot encoding representation we simply define one-hot vectors for each of the unique words in the corpus (the tokenized sentence). 

There are a few problems with this method. Firstly, the dimension of each vector equals the number of words in the vocabulary, so the vectors become very large and sparse for realistic corpora, which slows down downstream models. Secondly, the relationships between the words in the vocabulary are not captured by this representation: every pair of distinct one-hot vectors is orthogonal, so Queen = \[0, 0, 0, 1\] and King = \[0, 1, 0, 0\] are no more similar to each other than any other pair of words.
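
A minimal sketch of one-hot encoding makes both problems concrete. The four-word vocabulary is hypothetical, and sorting the vocabulary is simply one way to fix a reproducible index for each word.

```python
def one_hot_encode(vocabulary):
    """Map each word to a vector with a single 1 at the word's index."""
    ordered = sorted(vocabulary)  # fix an order so the encoding is reproducible
    size = len(ordered)
    return {word: [1 if i == index else 0 for i in range(size)]
            for index, word in enumerate(ordered)}

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

encoding = one_hot_encode({"king", "queen", "man", "woman"})
print(encoding["king"], encoding["queen"])
print(dot(encoding["king"], encoding["queen"]))  # related words
print(dot(encoding["king"], encoding["man"]))    # any other pair
```

Each vector is as long as the vocabulary, and the dot product between any two different words is zero, so the encoding carries no notion of similarity.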


## GloVe Embeddings
Instead of using one-hot encodings we can use pre-trained "GloVe" embeddings. Pre-trained GloVe embeddings map each word in our vocabulary to a dense real-valued vector; the vectors are published in several dimensions (50 is used below). Words with similar meanings tend to have similar vectors. We can determine how similar two vectors are by computing their cosine similarity or Euclidean distance. We can illustrate the power of GloVe embeddings using the tokens from the previous example:

In [37]:
# loading GloVe vectors
glove = torchtext.vocab.GloVe(name="6B", dim=50)

.vector_cache\glove.6B.zip: 862MB [08:31, 1.69MB/s]
 99%|█████████▉| 396587/400000 [00:11<00:00, 34150.93it/s]


In [49]:
print("Example of a GloVe Embedding:\n", glove["daughter"])

print("Cosine Similarity Between Time and Room:\n{}".format(torch.cosine_similarity(glove["time"].unsqueeze(0), glove["room"].unsqueeze(0))))
print("\nCosine Similarity Between Year and Old:\n{}".format(torch.cosine_similarity(glove["year"].unsqueeze(0), glove["old"].unsqueeze(0))))

Example of a GloVe Embedding:
 tensor([ 0.3765,  1.2426, -0.3974, -0.5318,  1.1870,  1.5091, -0.8417,  0.6788,
        -0.2581, -0.4798,  0.1782,  0.7467, -0.1347, -0.9236,  0.9562,  0.2057,
        -1.2239, -0.0550,  0.5618,  0.7808, -0.0441,  1.5692, -0.0668,  0.2514,
         1.0403, -2.1412, -0.3199, -0.7717, -0.0292,  0.0471,  1.4145, -0.2327,
        -0.3443,  0.2270,  0.8857, -0.2018, -0.1517,  0.3621,  0.6495, -0.6872,
        -0.0682,  0.5360, -0.1529, -0.9016,  0.3896, -0.5230, -0.3219, -2.4262,
         0.3005,  0.3389])
Cosine Similarity Between Time and Room:
tensor([0.6698])

Cosine Similarity Between Year and Old:
tensor([0.5512])


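From the above code snippet we can see an example of a GloVe embedding for the word "daughter". The cosine similarities show how close two words are in the embedding space, where a higher value means more similar. Here, "time" and "room" (about 0.67) are rated as more similar than "year" and "old" (about 0.55), even though "year" and "old" appear side by side in the review.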
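
The two similarity measures mentioned earlier can be written out by hand to show how they differ. This is a plain-Python sketch; the 2-dimensional vectors are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between u and v (0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical vectors: b points the same way as a, c is perpendicular to a.
a, b, c = [1.0, 0.0], [2.0, 0.0], [0.0, 1.0]

print(cosine_similarity(a, b), euclidean_distance(a, b))
print(cosine_similarity(a, c), euclidean_distance(a, c))
```

Cosine similarity depends only on direction, so a and b are maximally similar despite their different lengths, whereas Euclidean distance also reflects the difference in magnitude.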

## Word2Vec
In certain contexts it may be worthwhile to train your own word embeddings. Pre-trained GloVe vectors reflect how words are used in the corpus they were trained on, which may differ from how they are used in your domain. You can use the Word2Vec implementation in the ```gensim``` library to train word embeddings on your own corpus. Consider the following:

In [88]:
from gensim.models import Word2Vec
word2vec = Word2Vec(sentences=[words], window=5, min_count=1, workers=4)

In [94]:
# `wv.vocab` exists in gensim < 4.0; gensim >= 4.0 exposes `wv.key_to_index` instead
word2vec_vocab = word2vec.wv.vocab
print("Word2Vec Vocab:\n\n{}".format(word2vec_vocab))

Word2Vec Vocab:

{'Even': <gensim.models.keyedvectors.Vocab object at 0x0000025923353988>, 'Mommy': <gensim.models.keyedvectors.Vocab object at 0x0000025923353FC8>, 'ha': <gensim.models.keyedvectors.Vocab object at 0x0000025923353E08>, 'fun': <gensim.models.keyedvectors.Vocab object at 0x00000259235BAC48>, 'with': <gensim.models.keyedvectors.Vocab object at 0x000002592339AB48>, 'this': <gensim.models.keyedvectors.Vocab object at 0x000002592339AB08>, 'one': <gensim.models.keyedvectors.Vocab object at 0x000002592339A908>, '!': <gensim.models.keyedvectors.Vocab object at 0x000002592339A448>, ':': <gensim.models.keyedvectors.Vocab object at 0x000002592339A3C8>, 'My': <gensim.models.keyedvectors.Vocab object at 0x000002592339A4C8>, 'four': <gensim.models.keyedvectors.Vocab object at 0x000002592339A288>, 'year': <gensim.models.keyedvectors.Vocab object at 0x000002592339A588>, 'old': <gensim.models.keyedvectors.Vocab object at 0x000002592339A488>, 'daughter': <gensim.models.keyedvectors.Vocab

In [92]:
print("Word2Vec Representation of 'daughter':\n\n{}".format(word2vec.wv["daughter"]))

Word2Vec Representation of 'daughter':

[-1.2444714e-03  1.4156338e-03 -1.4909385e-03 -1.7089532e-03
  3.2886292e-03 -8.2796864e-04 -2.0006793e-03  1.7300280e-03
 -3.9787362e-03 -3.7500761e-03 -4.5552701e-03 -4.2422013e-03
  2.8239305e-05 -7.8168540e-04 -2.2519373e-03  2.2041863e-03
  4.6283044e-03  4.8661926e-03  1.6543671e-03 -9.9729467e-04
  2.2404236e-03  3.9539691e-03 -1.3324495e-03  2.2458609e-03
 -4.5952527e-03 -1.6946433e-03 -3.3995342e-03 -4.3935794e-03
 -2.9841559e-03  2.8422219e-03  1.3971644e-03  4.8590996e-03
 -4.5874314e-03 -4.5272587e-03 -2.0506987e-03  1.4308434e-03
 -1.1574064e-03 -4.6182331e-03 -4.4134702e-03  1.9977742e-03
  8.7627425e-04 -3.8978001e-03  4.0437295e-03 -3.4957377e-03
 -3.5399715e-03  4.7269608e-03 -3.5634337e-03  4.9276939e-03
  3.2719807e-03  9.2952105e-04  3.2853750e-03  1.4551891e-03
 -3.1109690e-04  2.1551419e-03  2.0533291e-04 -3.0298058e-03
  3.4850249e-03 -4.1296319e-03 -3.8312501e-03  3.1538843e-03
 -1.5336482e-03  4.5721438e-03 -2.8174086e-03

The above snippet trains a Word2Vec model on a corpus consisting of a single sentence and then retrieves the Word2Vec representation of "daughter". When trained on enough text, Word2Vec embeddings, like GloVe vectors, place similar tokens close together.

A major drawback of Word2Vec is that it requires a large corpus to be trained on. If a small corpus is provided, as in the single-sentence example above, the similarity between the words in the corpus will not be captured accurately, and any model built on these embeddings is likely to perform poorly.
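
One way to see why corpus size matters: both Word2Vec training variants (CBOW and skip-gram) learn from words that co-occur inside a sliding window, which is what the window argument controls. The sketch below only enumerates those co-occurrences; it is not gensim's training procedure, and the function name context_pairs is hypothetical.

```python
def context_pairs(tokens, window):
    """List (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["We", "love", "playing", "it", "together"]
pairs = context_pairs(tokens, window=1)
print(len(pairs))
print(pairs)
```

A five-token sentence yields only a handful of pairs, and each word is seen in at most two contexts, which is far too little evidence to position it meaningfully relative to other words.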