# Word2Vec embedding technique

Word2Vec is one of the most popular techniques for learning word embeddings with a shallow neural network. It creates a vector representation of each word in a given corpus. If we plot these vectors, words that appear in similar contexts, and therefore tend to have similar meanings, lie closer to each other.

### Importing the libraries:

We start by importing the libraries needed to train the Word2Vec model. We mainly rely on nltk for tokenization and stopword removal and on gensim for the model itself; requests and BeautifulSoup are used to fetch and parse the page.

In [4]:
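import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from gensim.models import Word2Vec

nltk.download('punkt')  # tokenizer models used by sent_tokenize / word_tokenize
# nltk.download('stopwords')  # uncomment on first run if the stopword corpus is missing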

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gaura\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Defining the functions:

The following functions fetch a Wikipedia page and extract its title and paragraph text.

In [5]:
def initResp(url):
    response = requests.get(url)
    return response

In [6]:
def titleExtractor(soup):
    title = soup.find('title')
    return title.string

def contextExtractor(soup):
    context = " "
    for i in soup.select('p'):
        context = context + i.getText()
    return context

The following function removes the stopwords from a piece of text.

In [7]:
def stopWordDel(text):
    stop_words = set(stopwords.words('english'))
    filteredSent = " "
    pageWord = text.split()
    for r in pageWord:
        if r not in stop_words:
            filteredSent = filteredSent + " " + r
    return filteredSent
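
To see the effect of this filtering without downloading the NLTK corpus, here is a minimal sketch that applies the same logic to a tiny, made-up stopword set. `TOY_STOP_WORDS` and `remove_stop_words` are illustrative names, not part of the notebook, and the real NLTK list is much longer.

```python
# Hypothetical miniature stopword set, for illustration only
TOY_STOP_WORDS = {"the", "is", "a", "of", "in", "and", "to"}

def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Return text without the whitespace-separated tokens found in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

print(remove_stop_words("the destruction of a group is a crime"))
# destruction group crime
```

The comparison is case-sensitive, so "The" would not be removed here. That is why the text is lowercased before it is filtered.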

The extractText function combines the functions above to extract the text from a Wikipedia page. Before returning, it lowercases the text, removes the stopwords, keeps only runs of letters, digits, periods and hyphens (all other special characters are dropped), splits the text into sentences, and tokenizes each sentence into words.

In [11]:
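def extractText(url):
    sentTemp = " "
    response = initResp(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    title = titleExtractor(soup)  # page title; not used further below
    context = contextExtractor(soup)
    # Lowercase first so the stopword check matches NLTK's lowercase list
    res = stopWordDel(context.lower())
    for k in res.split("\n"):
        # Keep only runs of letters, digits, periods and hyphens
        final = " ".join(re.findall(r"[a-zA-Z0-9.-]+", k))
        sentTemp = sentTemp + " " + final
    # Split into sentences, then split each sentence into word tokens
    sentenceList = nltk.tokenize.sent_tokenize(sentTemp)
    sentenceList = [nltk.word_tokenize(sent) for sent in sentenceList]
    return sentenceList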
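
The character filtering step can be tried in isolation. The sketch below reuses the regular expression from extractText on a made-up sample string; `TOKEN_PATTERN`, `clean_line` and the sample text are assumptions introduced only for this illustration.

```python
import re

# Same pattern as in extractText: runs of letters, digits, '.' and '-'
TOKEN_PATTERN = re.compile(r"[a-zA-Z0-9.-]+")

def clean_line(line):
    """Keep only the character runs matched by TOKEN_PATTERN, joined by single spaces."""
    return " ".join(TOKEN_PATTERN.findall(line))

sample = "resolution 1593 (2005), adopted by the u.n. security council!"
print(clean_line(sample))
# resolution 1593 2005 adopted by the u.n. security council
```

Digits pass through this filter, which is how a token such as '1593' can end up in the model's vocabulary. Periods are kept so that the sentence tokenizer still has sentence boundaries to work with.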

### Defining the word2vec computation function:

The URL of the required Wikipedia page is passed to the function, which forwards it to extractText to obtain a list of tokenized sentences. computeWordVec then trains a Word2Vec model on those sentences and returns it.

In [12]:
def computeWordVec(url):
    sentenceList = extractText(url)
    # min_count=1 keeps every word, even those that occur only once
    model = Word2Vec(sentenceList, min_count=1)
    return model

### End Results:

The URL of the Wikipedia article on genocide is passed as an argument to the computeWordVec function. The returned model is stored in wordvecModel.

In [17]:
wordvecModel = computeWordVec('https://en.wikipedia.org/wiki/Genocide')

Let's look at the words in the model's vocabulary. By default, gensim lists them from most to least frequent.

In [18]:
wordList = list(wordvecModel.wv.index_to_key)
print(wordList)

['.', 'genocide', 'group', 'international', 'crimes', 'groups', 'states', 'united', 'also', 'convention', 'part', 'political', 'destruction', 'act', 'found', 'war', 'acts', 'court', 'intent', 'crime', 'law', 'national', 'security', 'committed', 'violence', 'criminal', 'mass', 'killings', 'defined', 'people', 's', 'council', 'definition', 'humanity', 'nations', 'term', 'courts', 'may', 'cppcg', 'state', 'destroy', 'prohibited', 'perpetrator', 'genocides', 'government', 'icty', 'members', 'two', 'whole', 'legal', 'world', 'meaning', 'human', 'guilty', 'physical', 'darfur', 'former', 'time', 'the', 'including', 'lemkin', 'ethnic', 'sexual', 'racial', 'killing', 'icc', 'noted', 'tribunal', 'first', 'nazi', 'would', 'resolution', 'armenian', 'ictr', 'religious', 'un', 'mental', 'substantial', 'rape', 'targeted', 'history', 'yugoslavia', 'chamber', 'perpetrators', 'adopted', 'include', 'protected', 'social', 'however', 'word', 'murder', 'serious', 'children', 'bosnian', 'could', 'genocidal',

The vector of any individual word can also be looked up (100 dimensions with gensim's default settings). Here is the vector for the word genocide:

In [19]:
vector = wordvecModel.wv['genocide']
print(vector)

[-0.00922914  0.00783873  0.00693014  0.00673459  0.00865875 -0.01228166
  0.00232302  0.01494115 -0.00455364 -0.00733094 -0.00148998 -0.01292785
 -0.00578915  0.00813004  0.00401548  0.00491152  0.00865322  0.00411797
 -0.00339695 -0.00874535  0.00507737 -0.00398073  0.01011117 -0.01319152
  0.00595191  0.00287416 -0.00848876  0.00098298 -0.00330236  0.0088836
  0.01414453 -0.00515339 -0.00018507 -0.00844608  0.00384222  0.00743113
  0.00654742  0.0020772   0.00895091  0.00191753  0.00746454 -0.01101854
 -0.00936984 -0.00033092 -0.00033301  0.00601003  0.00324676 -0.0026705
  0.00246428  0.0045684   0.00891352 -0.01223331 -0.00302037  0.00404008
 -0.00206461  0.01197167  0.01209839  0.00606244 -0.00343676  0.00810687
 -0.00885306  0.00440942 -0.00588911 -0.00562938 -0.00120342  0.00899931
  0.00931194 -0.00458581  0.00400091  0.01172251 -0.00727017 -0.01018899
  0.00890896  0.00548624  0.00222728 -0.00566634 -0.00755401 -0.00297641
  0.00075287 -0.00225332 -0.00982231  0.00402269  0.0

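We can also ask the model which words are most similar to a given word; most_similar ranks the vocabulary by cosine similarity to that word's vector. The quality of these neighbours generally improves with the amount of training text. With a single article as the corpus, the scores are low and the neighbours are noisy.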

In [21]:
similar = wordvecModel.wv.most_similar('genocide')
print(similar)

[('narrow', 0.35746774077415466), ('subsequently', 0.34189215302467346), ('croats', 0.3365311324596405), ('intent', 0.31915900111198425), ('meaning', 0.30031225085258484), ('1593', 0.29207584261894226), ('necessary', 0.28890135884284973), ('intimate', 0.28564026951789856), ('killings', 0.28326547145843506), ('court', 0.2792111933231354)]
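
To make the ranking above more concrete, here is a minimal pure-Python sketch of a cosine-similarity nearest-neighbour lookup. It illustrates the idea and is not gensim's implementation; `cosine_similarity`, `most_similar` and the two-dimensional `toy_vectors` are made up for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-dimensional vectors, chosen by hand for illustration
toy_vectors = {
    "genocide": [1.0, 0.0],
    "killings": [0.8, 0.6],
    "court": [0.0, 1.0],
    "narrow": [-1.0, 0.0],
}

print(most_similar("genocide", toy_vectors, topn=3))
```

A score near 1 means two vectors point in almost the same direction, a score near 0 means they are unrelated, and a negative score means they point in opposite directions. Seen this way, top scores of around 0.3 in the output above indicate only weak similarity.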
