<a href="https://colab.research.google.com/github/dinuka-rp/L6-AI/blob/main/Prasan_Yapa/Day2-NLP3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Embedding

Vector Representation using Mathematical approaches.

This hands-on session will walk you through the important concepts of working with the
Word2Vec word embedding technique used for creating word vectors with Python's Gensim
library.

In real-life applications, Word2Vec models are trained on very large corpora containing
billions of words. For instance, Google's pretrained Word2Vec model was trained on roughly
100 billion words from Google News and contains vectors for 3 million words and phrases.
However, for the sake of simplicity, we will create a Word2Vec model using a single Wikipedia
article. Our model will not be as good as Google's, but it is good enough to show how a
Word2Vec model can be implemented using the Gensim library.

* Beautiful Soup library - a very useful Python utility for web scraping.
* lxml library - a parser for XML and HTML, used here as the parser behind Beautiful Soup.

In [1]:
!pip install beautifulsoup4
!pip install lxml



## Scraping

The article we are going to scrape is the Wikipedia article on Artificial Intelligence. Let's write
a Python script to scrape the article from Wikipedia.

In [4]:
import bs4 as bs
import urllib.request
import re
import nltk

scrapped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
article = scrapped_data.read()

parsed_article = bs.BeautifulSoup(article, 'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:
  article_text += p.text

In the script above, we first download the Wikipedia article using the `urlopen` function of the
`urllib.request` module. We then read the response content and parse it with a `BeautifulSoup`
object, using lxml as the parser. Wikipedia stores the text content of the article inside `p` tags, so we use the `find_all` method of the `BeautifulSoup` object to fetch all the paragraph tags of the article.

Finally, we concatenate the text of all the paragraphs and store the result in the `article_text` variable for later use.
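
To make the paragraph extraction step concrete without a network call, here is a minimal sketch that pulls the text out of `p` tags in a small inline HTML string. It deliberately uses the standard library's `html.parser` instead of Beautiful Soup, so it is a stand-in for `find_all('p')`, not the method used in this notebook. The `ParagraphExtractor` class and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Hypothetical helper: collects the text inside every <p> element."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph
        if self.in_paragraph:
            self.paragraphs[-1] += data


# Made-up page: only the two <p> elements should be kept
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>Navigation</div>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)

extractor = ParagraphExtractor()
extractor.feed(sample_html)
sample_text = " ".join(extractor.paragraphs)
print(sample_text)
```

The heading and the `div` are skipped, which is the same effect we rely on when keeping only the `p` tags of the Wikipedia page.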

## Preprocessing

At this point, we have downloaded the article. The next step is to preprocess the content for
the Word2Vec model. The following script preprocesses the text.

In [8]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [9]:
from nltk.corpus import stopwords

# Split into sentences first: sent_tokenize relies on punctuation,
# which the cleaning step below strips out.
all_sentences = nltk.sent_tokenize(article_text.lower())

# Keep letters only, then collapse repeated whitespace
all_sentences = [re.sub(r'\s+', ' ', re.sub('[^a-z]', ' ', sent)) for sent in all_sentences]

# Tokenize each sentence into words
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]

# Removing stop words (build the set once instead of on every lookup)
stop_words = set(stopwords.words('english'))
all_words = [[w for w in sent if w not in stop_words] for sent in all_words]

In the script above, we convert all the text to lowercase and remove all digits, special
characters, and extra spaces.

After preprocessing, only the words remain.
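
As a quick illustration of the cleaning step, the sketch below applies the same two regular expressions to a made-up sentence. The sample string and the final `.strip()` call are assumptions added for the demonstration.

```python
import re

# Made-up sample containing digits, punctuation, and mixed case
sample = "In 2023, AI systems (e.g. GPT-4) improved!"

lowered = sample.lower()
# Replace every character that is not a letter with a space
letters_only = re.sub(r'[^a-z]', ' ', lowered)
# Collapse runs of whitespace into a single space
cleaned = re.sub(r'\s+', ' ', letters_only).strip()

print(cleaned)  # in ai systems e g gpt improved
```

Note that punctuation disappears along with the digits, so abbreviations such as "e.g." turn into the single-letter tokens "e" and "g".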

Gensim's Word2Vec is trained on a collection of tokenized sentences. First, we need to convert
our article into sentences. We use the `nltk.sent_tokenize` utility to convert our article into
sentences. To convert sentences into words, we use the `nltk.word_tokenize` utility. As a last
preprocessing step, we remove all the stop words from the text.

After the script completes its execution, the `all_words` object contains a list of tokenized
sentences, each of which is itself a list of words. We will use this list to create our Word2Vec
model with the Gensim library.
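
The shape of the resulting data can be sketched on a two-sentence example. The regular expressions below are naive stand-ins for `nltk.sent_tokenize` and `nltk.word_tokenize`, and the stop list is a tiny illustrative set, not NLTK's English stop word list.

```python
import re

text = "AI is a field of study. It develops intelligent agents."
stop_words = {"is", "a", "of", "it"}  # illustrative only, not NLTK's list

# Naive sentence split: break after ., ! or ? followed by whitespace
sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.lower()) if s]

# Naive word tokenization: keep runs of letters
tokenized = [re.findall(r'[a-z]+', s) for s in sentences]

# Remove stop words sentence by sentence
filtered = [[w for w in sent if w not in stop_words] for sent in tokenized]

print(filtered)  # [['ai', 'field', 'study'], ['develops', 'intelligent', 'agents']]
```

The result is a list of lists, one inner list per sentence, which is the input format the Word2Vec class expects.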

## Creating Word2Vec Model

With Gensim, it is straightforward to create a Word2Vec model. The tokenized sentences are
passed to the Word2Vec class of the gensim.models package. We also specify a value for
the min_count parameter: a value of 2 tells Word2Vec to include only those words that
appear at least twice in the corpus. The following script creates a Word2Vec model using
the Wikipedia article we scraped.

In [10]:
from gensim.models import Word2Vec

word2vec = Word2Vec(all_words, min_count=2)
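
The effect of `min_count` on the vocabulary can be sketched in plain Python with `collections.Counter`. This only mimics the frequency filter; it is not how Gensim builds its vocabulary internally, and the toy corpus is made up.

```python
from collections import Counter

# Made-up corpus in the same list-of-token-lists format as all_words
corpus = [
    ["artificial", "intelligence", "machines"],
    ["intelligence", "humans", "machines"],
    ["agents", "intelligence"],
]
min_count = 2

# Count every token across all sentences
counts = Counter(word for sentence in corpus for word in sentence)

# Keep only the words that occur at least min_count times
kept = sorted(word for word, count in counts.items() if count >= min_count)

print(kept)  # ['intelligence', 'machines']
```

Words that appear only once ("artificial", "humans", "agents") are dropped, so they would have no vector in the trained model.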

To see the dictionary of unique words that occur at least twice in the corpus, execute the
following script.

In [25]:
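# Gensim 3.x API; `wv.vocab` was removed in Gensim 4.0, where
# word2vec.wv.key_to_index is the replacement.
vocabulary = word2vec.wv.vocab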

print(vocabulary)
print(vocabulary.keys())
# print(vocabulary.items())

{'artificial': <gensim.models.keyedvectors.Vocab object at 0x7f57d5d6b110>, 'intelligence': <gensim.models.keyedvectors.Vocab object at 0x7f57d5d6bdd0>, 'ai': <gensim.models.keyedvectors.Vocab object at 0x7f57d5d6e490>, 'machines': <gensim.models.keyedvectors.Vocab object at 0x7f57d1fe9d50>, 'natural': <gensim.models.keyedvectors.Vocab object at 0x7f57d1d57cd0>, 'displayed': <gensim.models.keyedvectors.Vocab object at 0x7f57d1d01550>, 'humans': <gensim.models.keyedvectors.Vocab object at 0x7f57d1d01510>, 'leading': <gensim.models.keyedvectors.Vocab object at 0x7f57d1d015d0>, 'define': <gensim.models.keyedvectors.Vocab object at 0x7f57d1d57c90>, 'field': <gensim.models.keyedvectors.Vocab object at 0x7f57d1d64490>, 'study': <gensim.models.keyedvectors.Vocab object at 0x7f57d1d01610>, 'intelligent': <gensim.models.keyedvectors.Vocab object at 0x7f57d1d01650>, 'agents': <gensim.models.keyedvectors.Vocab object at 0x7f57d1d01690>, 'system': <gensim.models.keyedvectors.Vocab object at 0x7f57

When the above script is executed, you will see a list of all the unique words occurring at least
twice.

## Model Analysis

### Finding Vectors for a Word
We know that the Word2Vec model converts words to their corresponding vectors. Let's see
how we can view the vector representation of a particular word.

In [15]:
v1 = word2vec.wv['artificial']

# print(v1)

The vector v1 contains the vector representation for the word “artificial”. By default,
Gensim's Word2Vec creates a hundred-dimensional vector. This is much smaller than what the
bag of words approach would produce: if we used bag of words to embed the article, each
vector would have length 1206, since there are 1206 unique words with a minimum frequency
of 2. If the minimum frequency of occurrence is set to 1, the bag of words vector grows even
larger. On the other hand, the size of the vectors generated through Word2Vec is not affected
by the size of the vocabulary.
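
The size argument can be checked with a toy example. The vocabulary and document below are made up; the point is only that a bag of words vector has one entry per vocabulary word, so it grows whenever the vocabulary grows.

```python
# Made-up vocabulary and document
vocabulary = ["agents", "ai", "intelligence", "machines", "study"]
document = ["ai", "machines", "ai", "study"]

# Bag of words: one count per vocabulary word
bow_vector = [document.count(word) for word in vocabulary]
print(bow_vector, len(bow_vector))  # [0, 2, 0, 1, 1] 5

# Adding words to the vocabulary makes every bag of words vector longer
bigger_vocabulary = vocabulary + ["humans", "research"]
bigger_bow_vector = [document.count(word) for word in bigger_vocabulary]
print(len(bigger_bow_vector))  # 7
```

A Word2Vec vector, by contrast, keeps the dimensionality chosen at training time (100 by default), set through the `vector_size` parameter in Gensim 4.x or `size` in Gensim 3.x.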

### Finding Similar Words

Earlier we said that contextual information of the words is not lost using the Word2Vec
approach. We can check this by finding the words most similar to the word “intelligence”.

In [17]:
sim_words = word2vec.wv.most_similar('intelligence')

print(sim_words)

[('research', 0.3998758792877197), ('incorporated', 0.3184848725795746), ('called', 0.31742894649505615), ('automation', 0.3104512095451355), ('st', 0.30977052450180054), ('series', 0.3056904375553131), ('classification', 0.3011520504951477), ('potential', 0.2922999858856201), ('ability', 0.2864953577518463), ('could', 0.28556424379348755)]


From the output, you can see the words most similar to “intelligence” along with their
similarity scores. In this run, “research” is the closest word (about 0.40), followed by words
such as “automation”, “classification”, and “ability”. The scores are low, and the list also
contains less obviously related tokens such as “st” and “could”, which is expected for a model
trained on a single Wikipedia article. Results also vary from run to run because training is
randomized. Even so, the model picks up some meaningful associations from very little data.
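
The similarity score reported by `most_similar` is the cosine similarity between word vectors. The sketch below computes it by hand on made-up three-dimensional vectors; the numbers and the `rank_similar` helper are hypothetical and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(word, vectors):
    """Hypothetical helper: rank all other words by similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up toy vectors, for illustration only
toy_vectors = {
    "intelligence": [1.0, 1.0, 0.0],
    "research": [1.0, 0.8, 0.1],
    "automation": [0.2, 1.0, 0.9],
    "series": [0.0, 0.1, 1.0],
}

ranked = rank_similar("intelligence", toy_vectors)
for name, score in ranked:
    print(name, round(score, 3))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0, which is why the low scores in the output above indicate weak associations.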