## Word Embeddings in Python with Gensim

In this notebook, you will practice training and loading word embedding models for natural language processing applications in Python using Gensim. It covers:



1. How to train your own word2vec word embedding model on text data.
2. How to visualize a trained word embedding model using Principal Component Analysis.
3. How to load pre-trained word2vec word embedding models.

### Run the two commands below to install gensim and the wikipedia package

In [None]:
#pip install --upgrade gensim --user

In [None]:
#!pip install wikipedia --user

### Import gensim

In [1]:
import gensim

### Obtain Text

#### Import search and page functions from wikipedia module

`search(keyword)`: takes a keyword and returns the titles of the top 10 articles matching it.

`page(title)`: takes an article title and returns a page object whose `content` attribute holds the article text.

In [2]:
## Usage: 

from wikipedia import search, page
titles = search("Machine Learning")
wikipage = page(titles[0])
#print (wikipage.content)

### Print the top 10 titles for the keyword `Machine Learning`

In [3]:
print(len(titles))

10


In [4]:
# search() returned only the top 10 titles, so print them all

for title in titles:
  print(title)

Machine learning
Active learning (machine learning)
Boosting (machine learning)
Deep learning
List of datasets for machine-learning research
Outline of machine learning
Learning
Support-vector machine
Automated machine learning
Feature (machine learning)


### Get the page for each of the 10 titles obtained above.

In [5]:
# Collect the page objects in a plain list
content = []

In [6]:
for title in titles:
  content.append(page(title))

print(len(content))

10


### Create a list named `documents` that holds the tokenized content of each of the 10 pages.

In [7]:
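import re

def clean_str(text):
  """
  Clean a string before vectorization: drop URL lines, keep letters only, lowercase.
  """
  # Anything that is not a string (e.g. None) yields an empty result
  if not isinstance(text, str):
    return ""
  # Remove lines that start with a URL
  text = re.sub(r'^https?://.*[\r\n]*', '', text, flags=re.MULTILINE)
  # Replace every non-letter character with a space
  text = re.sub(r"[^A-Za-z]", " ", text)
  # split() drops empty tokens, so no length filter is needed
  words = text.lower().split()
  return " ".join(words)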
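
To make the effect of this cleaning step concrete, here is a minimal, standalone sketch of the same letters-only idea applied to one sentence. The helper name `clean_text` and the sample sentence are hypothetical and exist only for this illustration.

```python
import re

def clean_text(text):
    """Lowercase the text and keep only alphabetic tokens (illustrative helper)."""
    # Replace every non-letter character (digits, punctuation, newlines) with a space.
    letters_only = re.sub(r"[^A-Za-z]", " ", text)
    # split() with no argument collapses runs of whitespace, so no empty tokens remain.
    return letters_only.lower().split()

# Hypothetical sample sentence, not taken from the Wikipedia pages.
sample = "Machine learning (ML) is the study of algorithms, e.g. SVMs in 2019."
tokens = clean_text(sample)
print(tokens)
```

Note the side effects: digits such as `2019` disappear entirely, and abbreviations such as `e.g.` are split into single-letter tokens. Whether that is acceptable depends on the corpus.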

In [8]:
# Tokenize the cleaned text of each page
documents = []

for doc in content:
    documents.append(clean_str(doc.content).split())

print(len(documents))
print(documents[0])

10
['machine', 'learning', 'ml', 'is', 'the', 'scientific', 'study', 'of', 'algorithms', 'and', 'statistical', 'models', 'that', 'computer', 'systems', 'use', 'to', 'effectively', 'perform', 'a', 'specific', 'task', 'without', 'using', 'explicit', 'instructions', 'relying', 'on', 'patterns', 'and', 'inference', 'instead', 'it', 'is', 'seen', 'as', 'a', 'subset', 'of', 'artificial', 'intelligence', 'machine', 'learning', 'algorithms', 'build', 'a', 'mathematical', 'model', 'based', 'on', 'sample', 'data', 'known', 'as', 'training', 'data', 'in', 'order', 'to', 'make', 'predictions', 'or', 'decisions', 'without', 'being', 'explicitly', 'programmed', 'to', 'perform', 'the', 'task', 'machine', 'learning', 'algorithms', 'are', 'used', 'in', 'a', 'wide', 'variety', 'of', 'applications', 'such', 'as', 'email', 'filtering', 'and', 'computer', 'vision', 'where', 'it', 'is', 'infeasible', 'to', 'develop', 'an', 'algorithm', 'of', 'specific', 'instructions', 'for', 'performing', 'the', 'task', 'm

### Build the gensim word2vec model, keeping every word with frequency >= 1 and using an embedding size of 50

In [9]:
# gensim < 4.0 API; in gensim >= 4.0 the `size` argument is named `vector_size`
model = gensim.models.Word2Vec(documents, min_count=1, workers=4, size=50, window=5)

  "C extension not loaded, training will be slow. "

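The `min_count` and `window` arguments can be illustrated with a small pure-Python sketch: words rarer than `min_count` are dropped, and each remaining word is paired with its neighbours inside the window. This is a simplified picture, not gensim's implementation (gensim also shrinks the window randomly and can subsample frequent words). The function names `build_vocab` and `context_pairs` and the toy sentences are hypothetical.

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Keep only the words that occur at least min_count times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

def context_pairs(sentence, vocab, window=5):
    """Return (center, context) pairs for words at most `window` positions apart."""
    # Out-of-vocabulary words are removed before the window is applied.
    kept = [w for w in sentence if w in vocab]
    pairs = []
    for i, center in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, kept[j]))
    return pairs

# Hypothetical toy corpus of two tokenized sentences.
toy = [["machine", "learning", "is", "fun"],
       ["deep", "learning", "is", "powerful"]]
vocab = build_vocab(toy, min_count=2)
pairs = context_pairs(toy[0], vocab, window=1)
print(sorted(vocab))
print(pairs)
```

With `min_count=1`, as used in this notebook, no word is dropped, so even words seen once get a vector, although such vectors are trained on very little evidence.
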

### Exploring the model

#### Check how many words are in the model

In [10]:
print("Shape of the embedding matrix (words, dimensions): ", model.wv.vectors.shape)

Shape of the embedding matrix (words, dimensions):  (4849, 50)


In [11]:
# gensim < 4.0 API; in gensim >= 4.0 use model.wv.key_to_index instead
print("Total vocabulary in the model: ", model.wv.vocab)



### Get the embedding for the word `machine`

In [12]:
model.wv['machine']

array([ 0.03054719,  0.00745628, -0.16210437, -0.26187214,  1.0208286 ,
       -0.12640299,  0.9254385 , -0.6895489 , -0.9769933 , -0.10553064,
       -0.11949165,  1.0883739 , -0.21057874,  0.9209669 ,  0.701346  ,
       -0.45119306,  0.5639114 , -0.64307714, -0.79091674, -1.1858433 ,
       -0.4852672 , -1.2006255 ,  0.11507691,  0.56411093, -0.01334467,
        0.2532606 , -0.63461953, -0.60056573,  0.15344314, -1.1139011 ,
        0.572951  ,  0.1732514 , -0.32083797, -0.5782262 ,  0.0997124 ,
       -0.23864335,  0.72331667, -0.33741567, -0.54402375,  0.18651907,
       -0.3843853 ,  0.32202938,  0.57170856,  0.39336702, -0.03007461,
        0.8790341 , -0.21767017,  0.33785585,  1.7959013 , -0.9788905 ],
      dtype=float32)

### Find the most similar words to `learning`

In [13]:
model.wv.most_similar('learning')

[('machine', 0.9999376535415649),
 ('of', 0.99989914894104),
 ('algorithms', 0.9998930096626282),
 ('a', 0.9998759031295776),
 ('based', 0.9998682141304016),
 ('boosting', 0.9998646378517151),
 ('to', 0.9998524188995361),
 ('has', 0.9998488426208496),
 ('deep', 0.9998481869697571),
 ('it', 0.9998480081558228)]
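
The scores returned by `most_similar` are cosine similarities between word vectors. The following sketch shows that ranking on hand-made 2-dimensional vectors. The functions `cosine` and `most_similar` here are illustrative stand-ins written in plain Python, and the toy vectors are invented for the example, not taken from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

# Hypothetical 2-dimensional vectors, chosen by hand for illustration.
toy_vectors = {
    "learning": [1.0, 0.0],
    "machine": [0.9, 0.1],
    "ball": [0.0, 1.0],
    "deep": [0.5, 0.5],
}
ranked = most_similar("learning", toy_vectors)
print(ranked)
```

In the output above, nearly every word, including function words such as `of` and `a`, scores close to 1.0. That pattern likely reflects how little text the model was trained on rather than genuine semantic closeness.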

### Find the word that does not belong among `machine, svm, ball, learning`

In [14]:
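# Split on ", " so each token is a bare word; a plain split() would keep the
# trailing commas ('machine,'), and those tokens are not in the vocabulary.
model.wv.doesnt_match("machine, svm, ball, learning".split(", "))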
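
The idea behind an odd-one-out query can be sketched as follows: normalize each word vector, average them, and return the word least similar to that average. This is a rough illustration under stated assumptions, not gensim's code. The names `unit` and `odd_one_out` and the toy vectors are hypothetical, and skipping words that have no vector is a choice made for this sketch.

```python
import math

def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def odd_one_out(words, vectors):
    """Return the word least similar to the mean of the known words."""
    # Assumption of this sketch: silently skip words that have no vector.
    known = [w for w in words if w in vectors]
    if not known:
        raise ValueError("none of the words has a vector")
    units = {w: unit(vectors[w]) for w in known}
    dim = len(units[known[0]])
    mean = unit([sum(units[w][i] for w in known) / len(known) for i in range(dim)])
    # The odd one out has the smallest dot product with the mean direction.
    return min(known, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical 2-dimensional vectors, chosen by hand for illustration.
toy_vectors = {
    "machine": [0.9, 0.1],
    "svm": [0.8, 0.2],
    "learning": [1.0, 0.0],
    "ball": [0.0, 1.0],
}
clean_result = odd_one_out("machine svm ball learning".split(), toy_vectors)
# Splitting a comma-separated string on whitespace leaves tokens like 'machine,'
# that have no vector, so only 'learning' is compared here.
comma_result = odd_one_out("machine, svm, ball, learning".split(), toy_vectors)
print(clean_result, comma_result)
```

The second call shows why tokenization matters: with trailing commas, three of the four tokens are unknown, and the only remaining word is returned by default.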

### Save the model with name `word2vec-wiki-10`

In [15]:
model.save('word2vec-wiki-10')

### Load the model `word2vec-wiki-10`

In [16]:
model = gensim.models.Word2Vec.load('word2vec-wiki-10')

In [17]:
print(model)

Word2Vec(vocab=4849, size=50, alpha=0.025)
