### Word Embedding Approaches

Computers operate on numbers, so words must be converted to a numeric form before a model can process them.
Word embedding refers to such numeric representations of words. This section covers three approaches:


    1. Bag of Words
    2. TF-IDF Scheme
    3. Word2Vec


#### 1. Bag of Words

Suppose you have a corpus with three sentences.

    S1 = I love rain
    S2 = rain rain go away
    S3 = I am away


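First, create a dictionary of the unique words in the corpus. The corpus above contains the following unique words:

    [I, love, rain, go, away, am]

Then represent each sentence as a vector that counts how many times each dictionary word occurs in it:

    S1 = [1, 1, 1, 0, 0, 0]
    S2 = [0, 0, 2, 1, 1, 0]
    S3 = [1, 0, 0, 0, 1, 1]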
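The construction above can be reproduced with a few lines of plain Python. This is only a minimal sketch: it splits on whitespace and keeps the original casing, and the helper name `bag_of_words` is made up for illustration.

```python
from collections import Counter

corpus = ["I love rain", "rain rain go away", "I am away"]

# Build the vocabulary in order of first appearance.
vocabulary = []
for sentence in corpus:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)


def bag_of_words(sentence, vocabulary):
    """Return how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]


vectors = [bag_of_words(sentence, vocabulary) for sentence in corpus]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each printed vector has one entry per vocabulary word, so the vector length grows with the vocabulary.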

Pros:
    1. You do not need a very large corpus to get usable results.
       The basic bag of words model above was built from just three sentences.

    2. Computationally, a bag of words model is not very complex.

Cons:
    1. Every vector is as long as the vocabulary, and most of its entries are zero (a sparse matrix). For a large vocabulary the feature dimension becomes huge and the vectors consume a lot of memory.

    2. It does not maintain any context information.
       It ignores the order in which the words appear in a sentence.
       For instance, it treats the sentences "Bottle is in the car" and
       "Car is in the bottle" as identical, although they mean totally different things.

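The loss of word order is easy to check. The sketch below assumes whitespace tokenization and lowercasing (so that "Bottle" and "bottle" count as the same word). The helper name `word_counts` is hypothetical.

```python
from collections import Counter


def word_counts(sentence):
    # Lowercase so that "Bottle" and "bottle" are counted as the same word.
    return Counter(sentence.lower().split())


first = word_counts("Bottle is in the car")
second = word_counts("Car is in the bottle")

# Both sentences contain exactly the same words, so their counts are equal.
print(first == second)
```

Because a bag of words vector is built only from these counts, the two sentences end up with the same vector.
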
#### 2. TF-IDF Scheme
Term frequency (TF) measures how often a word appears in a document, normalized by the length of the document, and can be calculated as:

    TF(word) = (Number of occurrences of the word in the document) / (Total number of words in the document)

Inverse document frequency (IDF) is the log of the total number of documents divided by the number of documents that contain the word, and can be calculated as:

    IDF(word) = log((Total number of documents) / (Number of documents containing the word))

The TF-IDF score of a word in a document is the product of the two values.
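As a worked example, the two formulas can be applied to the three-sentence corpus from the bag of words section, treating each sentence as a document. This sketch assumes the natural logarithm and the usual definition of the TF-IDF score as TF multiplied by IDF. The function names are made up for illustration, and libraries such as scikit-learn use smoothed variants that give different numbers.

```python
import math

corpus = ["I love rain", "rain rain go away", "I am away"]
documents = [sentence.split() for sentence in corpus]


def term_frequency(word, document):
    # Occurrences of the word divided by the total words in the document.
    return document.count(word) / len(document)


def inverse_document_frequency(word, documents):
    # Assumes the word occurs in at least one document (no smoothing).
    containing = sum(1 for document in documents if word in document)
    return math.log(len(documents) / containing)


def tf_idf(word, document, documents):
    return term_frequency(word, document) * inverse_document_frequency(word, documents)


# "rain" appears 2 times among the 4 words of S2 and in 2 of the 3 documents.
rain_in_s2 = tf_idf("rain", documents[1], documents)
print(round(rain_in_s2, 4))
```

Here TF is 2/4 = 0.5 and IDF is log(3/2), so the score is 0.5 * log(1.5).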

Pros and Cons of TF-IDF

Although TF-IDF improves on the simple bag of words approach and often yields better results for common NLP tasks, the overall pros and cons remain the same. We still have to create a huge sparse matrix, and computing the weights takes more computation than simple bag of words counting.

#### 3. Word2Vec

Production Word2Vec models are trained on very large corpora. For instance, Google's pretrained Word2Vec model was trained on roughly 100 billion words from Google News and contains vectors for 3 million words and phrases. For the sake of simplicity, however, we will create a Word2Vec model from a single Wikipedia article. Our model will not be as good as Google's, but it is good enough to show how a Word2Vec model can be implemented using the Gensim library.

Depending on the way the embeddings are learned, Word2Vec is classified into two approaches:

    1. Continuous Bag-of-Words (CBOW)
    2. Skip-gram model
    
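Continuous Bag-of-Words and Skip-gram are inverses of each other: CBOW predicts the focus word from its surrounding context words, while Skip-gram predicts the context words from the focus word.

For example, consider the sentence: “I have failed at times but I never stopped trying”. Let’s say we want to learn the embedding of the word “failed”, so the focus word is “failed”. With a context window of two words on each side: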


    Continuous Bag-of-Words: Input = [ I, have, at, times ],  Output = failed
    Skip-gram: Input = failed, Output = [I, have, at, times ]
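The inputs and outputs listed above can be generated mechanically. The following is a rough sketch that assumes whitespace tokenization and a window of two words on each side. `context_window` is a hypothetical helper, not part of Gensim.

```python
def context_window(tokens, index, window=2):
    """Return up to `window` words on each side of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


tokens = "I have failed at times but I never stopped trying".split()
focus_index = tokens.index("failed")
context = context_window(tokens, focus_index)

# CBOW: the context words are the input, the focus word is the output.
cbow_example = (context, tokens[focus_index])

# Skip-gram: the focus word is the input, each context word is an output.
skip_gram_examples = [(tokens[focus_index], word) for word in context]

print(cbow_example)
print(skip_gram_examples)
```

Sliding the focus index over every position in the sentence would yield the full set of training examples for either approach.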

Pros:
    1. The feature dimension is small.

    2. Context information is kept.




Packages
1. Beautiful Soup (bs4) - a very useful Python library for web scraping
2. urllib - to open the URL
3. re - regular expressions for preprocessing
4. nltk - to split the text into tokens and remove stop words


In [1]:
import bs4 as bs
import urllib.request
import re
import nltk

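# Download the raw HTML of the Wikipedia article
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
article = scraped_data.read()

# Parse the HTML (the 'lxml' parser requires the lxml package)
parsed_article = bs.BeautifulSoup(article, 'lxml')

# Collect the text of every paragraph (<p>) element
paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:
    article_text += p.text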

Preprocessing

In [2]:
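# Cleaning the text
# Split into sentences first: once the punctuation has been removed,
# nltk.sent_tokenize can no longer find any sentence boundaries.
nltk.download('punkt')      # tokenizer data; newer NLTK releases may ask for 'punkt_tab'
nltk.download('stopwords')  # stop word lists
all_sentences = nltk.sent_tokenize(article_text)

# Lowercase, keep letters only and collapse runs of whitespace
all_sentences = [re.sub('[^a-zA-Z]', ' ', sent.lower()) for sent in all_sentences]
all_sentences = [re.sub(r'\s+', ' ', sent) for sent in all_sentences]

# Preparing the dataset
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]

# Removing Stop Words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # build the set once, not per sentence
for i in range(len(all_words)):
    all_words[i] = [w for w in all_words[i] if w not in stop_words]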
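To see what the two regular-expression substitutions in the cleaning step do, they can be tried on a short string. The sample sentence below is made up for illustration, and the final `strip()` is an addition that is not part of the cell above.

```python
import re

sample = "Hello, World! It's 2024 [1]."

# Replace every character that is not a letter with a space.
cleaned = re.sub(r'[^a-zA-Z]', ' ', sample.lower())

# Collapse runs of whitespace into a single space and trim the ends.
cleaned = re.sub(r'\s+', ' ', cleaned).strip()

print(cleaned)
```

Digits, brackets and punctuation all disappear. One side effect is that the apostrophe is removed too, so "it's" becomes the two tokens "it" and "s".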

Creating Word2Vec Model

In [4]:
!pip install gensim
from gensim.models import Word2Vec
# min_count (int): ignores all words whose total frequency is lower than this.
# window: maximum distance between the current word and a predicted word within
# a sentence. The Gensim default is 5, i.e. up to five words on each side.
# The number of negative samples (negative, default 5) is another factor
# of the training process.
word2vec = Word2Vec(all_words, min_count=2)

In [6]:
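# In Gensim 3.x the vocabulary is the dict wv.vocab;
# Gensim 4.x replaced it with wv.key_to_index.
vocabulary = word2vec.wv.vocab
print(vocabulary)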

{'move': <gensim.models.keyedvectors.Vocab object at 0x7fe4a4176278>, 'concurrent': <gensim.models.keyedvectors.Vocab object at 0x7fe473e6cb00>, 'hard': <gensim.models.keyedvectors.Vocab object at 0x7fe473e7d0f0>, 'cover': <gensim.models.keyedvectors.Vocab object at 0x7fe4a4104048>, 'interaction': <gensim.models.keyedvectors.Vocab object at 0x7fe47b081cf8>, 'research': <gensim.models.keyedvectors.Vocab object at 0x7fe473e7d8d0>, 'come': <gensim.models.keyedvectors.Vocab object at 0x7fe474ee7eb8>, 'black': <gensim.models.keyedvectors.Vocab object at 0x7fe473e6cb70>, 'towards': <gensim.models.keyedvectors.Vocab object at 0x7fe474ee7ef0>, 'machines': <gensim.models.keyedvectors.Vocab object at 0x7fe474779390>, 'combination': <gensim.models.keyedvectors.Vocab object at 0x7fe4743203c8>, 'fast': <gensim.models.keyedvectors.Vocab object at 0x7fe473e545f8>, 'issue': <gensim.models.keyedvectors.Vocab object at 0x7fe473e54630>, 'operating': <gensim.models.keyedvectors.Vocab object at 0x7fe473e54

Model Analysis

Finding Vectors for a Word

In [7]:
v1 = word2vec.wv['artificial']

The vector v1 contains the vector representation of the word "artificial". By default, Gensim's Word2Vec creates 100-dimensional vectors. This is much smaller than what bag of words would produce: with the bag of words approach, each vector would have length 1206, since the article contains 1206 unique words with a minimum frequency of 2. If the minimum frequency is set to 1, the bag of words vectors grow even longer. The length of a Word2Vec vector, on the other hand, does not depend on the size of the vocabulary.

Finding Similar Words

In [11]:
sim_words = word2vec.wv.most_similar('intelligence')

In [12]:
print(sim_words)

[('consider', 0.45905211567878723), ('point', 0.44149050116539), ('also', 0.42999494075775146), ('understanding', 0.38995373249053955), ('systems', 0.3861212134361267), ('risk', 0.37375712394714355), ('like', 0.35617560148239136), ('human', 0.35311782360076904), ('computers', 0.3515392541885376), ('life', 0.35037556290626526)]
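The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that measure, here it is in pure Python on toy two-dimensional vectors. These are not vectors from the trained model, and `cosine_similarity` is a hypothetical helper rather than a Gensim function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors pointing in the same direction score close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

Read this way, the scores of roughly 0.35 to 0.46 above indicate only moderate similarity, which is plausible for a model trained on a single article.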


Conclusion:

Word2Vec keeps context information, which makes it look better than bag of words and TF-IDF.

The size of the model depends on the number of unique words that the model learns and on the number of dimensions chosen for those words, not on the total amount of training data.
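A back-of-the-envelope calculation illustrates this. It uses the 1206-word vocabulary and 100 dimensions mentioned earlier, and assumes each value is stored as a 32-bit float. The figure covers only the word vectors themselves, not any additional weights that the training algorithm keeps.

```python
vocabulary_size = 1206   # unique words with a minimum frequency of 2
vector_size = 100        # dimensions per word vector
bytes_per_value = 4      # assumption: 32-bit floats

# One vector of `vector_size` values per vocabulary word.
embedding_values = vocabulary_size * vector_size
embedding_bytes = embedding_values * bytes_per_value

print(embedding_values, "values,", embedding_bytes, "bytes")
```

Training on more text with the same vocabulary and dimensionality would leave this number unchanged.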