## Coding Practice #0517

### 1. Word embedding (Word2Vec):

In [1]:
# Install once if necessary.
#!pip install gensim
#!pip install nltk
#!pip install beautifulsoup4 lxml

In [2]:
import re
import os
import nltk
import urllib.request
import bs4 as bs
import warnings
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from nltk.corpus import stopwords
warnings.filterwarnings('ignore')
# nltk.download()

#### 1.1. Download the text data:

In [3]:
# Connect to the source.
source = urllib.request.urlopen('https://en.wikipedia.org/wiki/Machine_learning').read()

In [4]:
# Beautiful soup object.
soup = bs.BeautifulSoup(source,'lxml')

In [5]:
# Build a long string. 
my_text = ""
for paragraph in soup.find_all('p'):
    my_text += paragraph.text
print(my_text)

Machine learning (ML) is a field devoted to understanding and building methods that let machines "learn" – that is, methods that leverage data to improve computer performance on some set of tasks.[1]
Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so.[2] Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, agriculture, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.[3][4]
A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers, but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focu

#### 1.2. Preprocessing of the text data:

In [6]:
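my_text = my_text.lower()                        # Lower-case everything.
my_text = re.sub(r'\[[0-9]*\]',' ', my_text)     # Remove citation markers such as [1].
my_text = re.sub(r'\W',' ', my_text)             # Replace non-word characters (punctuation, symbols).
my_text = re.sub(r'\d+',' ',my_text)             # Remove numbers.
my_text = re.sub(r'\s+',' ',my_text)             # Collapse repeated whitespace.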
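
To see what each substitution in the preprocessing cell does, the same steps can be applied to a short sample string. This is a minimal sketch: `clean_text` and `sample` are illustrative names that do not appear in the notebook.

```python
import re

def clean_text(text):
    """Apply the notebook's substitutions in the same order."""
    text = text.lower()
    text = re.sub(r'\[[0-9]*\]', ' ', text)   # Citation markers such as [1] or [23].
    text = re.sub(r'\W', ' ', text)           # Anything that is not a word character.
    text = re.sub(r'\d+', ' ', text)          # Remaining runs of digits.
    text = re.sub(r'\s+', ' ', text)          # Collapse repeated whitespace.
    return text

sample = 'Machine learning (ML) builds models.[2] It dates to 1959!'
print(repr(clean_text(sample)))
```

Note that the sentence-ending punctuation is removed as well, so the cleaned text no longer carries sentence boundaries.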

#### 1.3. Tokenization:

In [7]:
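# Punctuation was removed above, so sent_tokenize() is expected to return the text as one sentence.
my_sentences = nltk.sent_tokenize(my_text)
my_words_0 = []
for a_sentence in my_sentences:
    my_words_0 += nltk.word_tokenize(a_sentence)
# Drop very short tokens and English stop words.
stop_words = set(stopwords.words('english'))    # Build the set once rather than once per word.
my_words_0 = [a_word for a_word in my_words_0 if len(a_word) > 2]
my_words_0 = [a_word for a_word in my_words_0 if a_word not in stop_words]
my_words_0 = [my_words_0]    # Word2Vec expects a list of token lists (one per sentence).
len(my_words_0[0])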

4444
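
Because the preprocessing step strips all punctuation, the tokenizer should find no sentence boundaries, and the whole corpus reaches Word2Vec as one long sentence. If per-sentence context is preferred, one option is to split into sentences before removing punctuation. The sketch below uses only the standard library. The regex splitter and the tiny `STOP_WORDS` set are assumptions standing in for `nltk.sent_tokenize` and `stopwords.words('english')`, and `tokenize_by_sentence` is a hypothetical helper.

```python
import re

# Stand-in for stopwords.words('english'); deliberately tiny, for illustration only.
STOP_WORDS = {'the', 'and', 'are', 'that', 'for', 'with'}

def tokenize_by_sentence(raw_text):
    """Return a list of token lists, one per sentence."""
    # Naive split after '.', '!' or '?' followed by whitespace (not a full sentence tokenizer).
    sentences = re.split(r'(?<=[.!?])\s+', raw_text.strip())
    result = []
    for sentence in sentences:
        words = re.findall(r'[a-z]+', sentence.lower())
        words = [w for w in words if len(w) > 2 and w not in STOP_WORDS]
        if words:
            result.append(words)
    return result

raw = 'Machine learning algorithms build a model. Models are trained with data!'
print(tokenize_by_sentence(raw))
```

The result is already in the list-of-lists shape that `Word2Vec` takes, so no extra wrapping in `[...]` would be needed.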

#### 1.4. Train the Word2Vec model:

In [8]:
my_model = Word2Vec(my_words_0, vector_size=100, min_count=1)    # 100-dimensional vectors; keep every word.
my_words = my_model.wv.key_to_index                              # Vocabulary: word -> index.
len(my_words)

1698
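
The vocabulary (1698) is smaller than the token count (4444) because repeated words share one vector. With `min_count=1` every distinct token should be kept, and a higher threshold drops rare words. The sketch below imitates that filtering with `collections.Counter` on a made-up token list. `vocabulary_for` is a hypothetical helper, not a gensim function.

```python
from collections import Counter

def vocabulary_for(tokens, min_count=1):
    """Return the distinct tokens that occur at least min_count times."""
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n >= min_count}

tokens = ['machine', 'learning', 'data', 'machine', 'learning', 'machine', 'model']
print(len(tokens))                                  # Total number of tokens.
print(sorted(vocabulary_for(tokens, min_count=1)))  # Every distinct token.
print(sorted(vocabulary_for(tokens, min_count=2)))  # Only tokens seen at least twice.
```

On a corpus this small, words that occur only once are trained on very few examples, which may be one reason the nearest neighbours in the following cells look noisy.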

#### 1.5. Embedding vectors:

In [9]:
# View the dense vector corresponding to 'machine'.
my_vector = my_model.wv['machine']
print("Length = " + str(my_vector.shape[0]))
print("-"*100)
print(my_vector)

Length = 100
----------------------------------------------------------------------------------------------------
[-0.01193737  0.00695289  0.00687346  0.0064067   0.00826405 -0.0136679
  0.00260534  0.01585962 -0.00664186 -0.00750788 -0.00280072 -0.0130693
 -0.0077711   0.00945379  0.0034014   0.00531803  0.00827738  0.00525588
 -0.00367877 -0.0093133   0.00589177 -0.00297566  0.010447   -0.01263215
  0.00608743  0.00325039 -0.00955169  0.00201003 -0.00664782  0.00766373
  0.01501857 -0.00454328  0.00153883 -0.00791845  0.00311298  0.00743465
  0.00611165  0.00082042  0.00622707  0.00217468  0.00834349 -0.01057488
 -0.01126791 -0.00020317  0.00090558  0.00501645  0.00267319 -0.00226942
  0.00321811  0.00476713  0.01007617 -0.01260098 -0.00339154  0.00462564
 -0.00364092  0.01169893  0.01204277  0.00616132 -0.00524212  0.00926016
 -0.00812335  0.00421281 -0.00602852 -0.00484694 -0.00199716  0.00937636
  0.00940321 -0.00343339  0.00271193  0.01349932 -0.00686997 -0.00703544
  0.0096089 

#### 1.6. Most similar words:

In [10]:
my_model.wv.most_similar('learning')

[('without', 0.46288537979125977),
 ('classification', 0.43579626083374023),
 ('systems', 0.4219920337200165),
 ('models', 0.40113741159439087),
 ('data', 0.39267095923423767),
 ('machine', 0.38844671845436096),
 ('develop', 0.3796117603778839),
 ('process', 0.3735598623752594),
 ('result', 0.37050962448120117),
 ('genetic', 0.3642229735851288)]

In [11]:
my_model.wv.most_similar('artificial')

[('data', 0.40522292256355286),
 ('networks', 0.372943252325058),
 ('assumption', 0.34035494923591614),
 ('formulated', 0.3374149203300476),
 ('known', 0.3142104744911194),
 ('cumulative', 0.30836009979248047),
 ('discovering', 0.30635979771614075),
 ('learning', 0.30420035123825073),
 ('understanding', 0.2999662756919861),
 ('related', 0.29209962487220764)]
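
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal NumPy sketch of that measure follows. The three vectors are toy values made up for illustration, and `cosine_similarity` is a local helper, not part of gensim.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors, made up for illustration.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])    # Same direction as a.
c = np.array([-2.0, 1.0, 0.0])   # Orthogonal to a.

print(cosine_similarity(a, b))   # Close to 1: same direction.
print(cosine_similarity(a, c))   # Close to 0: unrelated directions.
```

For two words in the trained model, `my_model.wv.similarity('machine', 'learning')` should return the same kind of score.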

In [12]:
# Operation:
# machine - learning + human = ???
my_model.wv.most_similar(positive=['machine','human'], negative= ['learning'])

[('rather', 0.3507980704307556),
 ('tom', 0.33385056257247925),
 ('intimate', 0.31403648853302),
 ('marketing', 0.30135488510131836),
 ('representations', 0.29245656728744507),
 ('derived', 0.28428319096565247),
 ('edge', 0.2803153991699219),
 ('work', 0.2743244767189026),
 ('rare', 0.2686609923839569),
 ('basic', 0.2530176043510437)]
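
Roughly, `most_similar(positive=..., negative=...)` adds the normalized `positive` vectors, subtracts the `negative` ones, and ranks the remaining words by cosine similarity to the result. The sketch below reproduces that idea on toy 2-dimensional embeddings. The vectors are invented so that the analogy works, and `analogy` and `unit` are hypothetical helpers rather than gensim functions.

```python
import numpy as np

# Toy 2-dimensional embeddings, made up for illustration.
toy_vectors = {
    'king':  np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.2, 1.0]),
    'woman': np.array([0.2, -1.0]),
    'apple': np.array([-1.0, 0.1]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def analogy(positive, negative, vectors, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    scores = {
        word: float(np.dot(query, unit(vec)))
        for word, vec in vectors.items()
        if word not in positive and word not in negative   # Skip the input words.
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]

# queen - woman + man = ???
print(analogy(positive=['queen', 'man'], negative=['woman'], vectors=toy_vectors))
```

With these toy vectors `'king'` ranks first. The model trained on a single Wikipedia article has far too little data for such regularities to emerge, which is consistent with the unconvincing result of the cell above.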

### 2. Using a pre-trained model from Google:

Download ["GoogleNews-vectors-negative300.bin"](https://docs.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download) and uncompress it. <br>
**Caution**: the file is about 1.6 GB compressed and about 3.5 GB uncompressed.

In [13]:
# Go to the directory where the downloaded file is located. 
# os.chdir(r'~~')                # Please, replace the path with your own.   

In [14]:
# Load the file.
filename = "GoogleNews-vectors-negative300.bin"
a_model = KeyedVectors.load_word2vec_format(filename, binary=True)
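
Loading the full file needs several gigabytes of memory. If that is a problem, `load_word2vec_format` accepts a `limit` argument that reads only the first vectors in the file. The sketch below wraps the call in a file check. `load_google_vectors` is a hypothetical helper, and the value 200000 is an arbitrary assumption, not a recommendation from the original notebook.

```python
import os

def load_google_vectors(path, limit=200000):
    """Load at most `limit` vectors from a word2vec binary file, or return None if it is missing."""
    if not os.path.exists(path):
        print("File not found: " + path)
        return None
    # Imported here so that the file check above runs even if gensim is not installed yet.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)

small_model = load_google_vectors("GoogleNews-vectors-negative300.bin")
```

Words beyond the limit raise a `KeyError` on lookup, so the cut-off trades vocabulary coverage for memory.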

In [15]:
# The words most similar to 'king' and 'kings' combined (the input vectors are averaged).
a_model.most_similar(['king','kings'])

[('princes', 0.6491754651069641),
 ('queen', 0.6316933631896973),
 ('monarch', 0.5960378646850586),
 ('queens', 0.5806739926338196),
 ('monarchs', 0.573945939540863),
 ('prince', 0.5600895881652832),
 ('ruler', 0.5597525238990784),
 ('sultan', 0.556817352771759),
 ('kings_princes', 0.5520499348640442),
 ('emperors', 0.5400724411010742)]

In [16]:
# Operation: queen(queens) - woman(women) + man(men) = ???
a_model.most_similar(positive=['queen','queens','man','men'], negative= ['woman','women'])

[('kings', 0.6578558683395386),
 ('king', 0.6328856945037842),
 ('princes', 0.5353103280067444),
 ('Senti_pocket', 0.512298047542572),
 ('princesses', 0.4917480945587158),
 ('jesters', 0.48247596621513367),
 ('queens_princes', 0.4717158377170563),
 ('princess', 0.46906542778015137),
 ('monarchs', 0.4661174416542053),
 ('prince', 0.4655950963497162)]