### Coding Challenge: Natural Language Processing

In this Coding Challenge, you will cover **Word2vec**, a popular algorithm for building vector representations of words (i.e. word embeddings). The concept behind Word2vec is straightforward: the meaning of a word is assumed to be inferable from the *context it appears in*, or *the company it keeps*. This is similar to saying: “tell me about your friends, and I will tell you who you are”.

If two words have very similar neighbors (that is, the contexts in which they are used are similar), then the words themselves are most likely similar in meaning.
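As a rough illustration of "similar neighbors", the sketch below counts which words appear near each word in a few toy sentences. The sentences, the `context_counts` helper, and the window of 2 are assumptions made for this example; Word2vec itself learns dense vectors rather than storing raw counts.

```python
from collections import Counter, defaultdict

def context_counts(sentences, window=2):
    """Count, for every word, the words that occur within `window` positions of it."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][tokens[j]] += 1
    return contexts

# Toy sentences (illustrative only, not part of the challenge data)
toy = ["the cat drinks milk", "the dog drinks milk", "the car needs fuel"]
contexts = context_counts(toy)

# Words that share many neighbors are candidates for having similar meanings
shared_cat_dog = set(contexts["cat"]) & set(contexts["dog"])
shared_cat_car = set(contexts["cat"]) & set(contexts["car"])
print(sorted(shared_cat_dog))  # ['drinks', 'milk', 'the']
print(sorted(shared_cat_car))  # ['the']
```

Here "cat" and "dog" share three neighbors while "cat" and "car" share only "the", which is the kind of signal the algorithm exploits.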

In this Coding Challenge, you will go through the process of training a Word2vec model with a sample set of documents and then examine certain attributes of the model. After that, you will train a Word2vec model with a large corpus of text and then ascertain the similarity among words in the corpus.


In [1]:
# https://radimrehurek.com/gensim/install.html
!pip install --upgrade gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/d1/dd/112bd4258cee11e0baaaba064060eb156475a42362e59e3ff28e7ca2d29d/gensim-3.8.1-cp36-cp36m-manylinux1_x86_64.whl (24.2MB)
     |████████████████████████████████| 24.2MB 1.3MB/s
Installing collected packages: gensim
  Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-3.8.1


In [2]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_esp.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipp

True

**Step #1:** Tokenize the sample set of documents



In [0]:
# Step 1
import gensim

raw_content = [('The dog ran up the steps and entered the owner\'s room to '
                'check if the owner was in the room.'),
               ('My name is Thomson Comer, commander of the Machine Learning '
                'program at Lambda school.'),
               ('I am creating the curriculum for the Machine Learning program '
                'and will be teaching the full-time Machine Learning program.'),
               ('Machine Learning is one of my favorite subjects.'),
               ('I am excited about taking the Machine Learning class at the '
                'Lambda school starting in April.'),
               ('When does the Machine Learning program kick-off at Lambda '
                'school?'),
               ('The batter hit the ball out off AT&T park into the pacific '
                'ocean.'),
               'The pitcher threw the ball into the dug-out.']

In [0]:
import re

# Lowercase and collapse whitespace runs (str.replace does not interpret
# regular expressions, so re.sub is used), then drop apostrophes and hyphens
simple_content = [re.sub(r'\s+', ' ', s.lower()) for s in raw_content]
simple_content = [s.replace("'", '').replace('-', '') for s in simple_content]

In [0]:
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
lemmatizer = nltk.stem.WordNetLemmatizer()
stop_words = nltk.corpus.stopwords.words('english')

tokenized_content = [tokenizer.tokenize(s) for s in simple_content]
tokenized_content = [[lemmatizer.lemmatize(t) for t in tokenlist if \
                      t not in stop_words] for tokenlist in tokenized_content]

In [40]:
tokenized_content[0]

['dog', 'ran', 'step', 'entered', 'owner', 'room', 'check', 'owner', 'room']
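To see why apostrophes and hyphens are removed before tokenizing, the sketch below applies the same `\w+` pattern with the standard-library `re` module. Using `re.findall` as a stand-in for `RegexpTokenizer(r'\w+')` is an assumption made to keep the example dependency-free; the sentence is the last one from `raw_content`, lowercased.

```python
import re

text = "the pitcher threw the ball into the dug-out."

# With the hyphen left in place, \w+ splits the compound into two tokens
with_hyphen = re.findall(r'\w+', text)

# Removing the hyphen first keeps the compound as a single token
without_hyphen = re.findall(r'\w+', text.replace('-', ''))

print(with_hyphen[-2:])     # ['dug', 'out']
print(without_hyphen[-1:])  # ['dugout']
```

This matches the `dugout`, `fulltime`, and `kickoff` entries that appear in the vocabulary later on.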

**Step #2:** Train the Word2vec model with the tokenized content; the size of the word vectors is 50, and a word must show up at least once in the raw content to be included

In [0]:
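# Step 2
# Parameter names follow gensim 3.x (installed above); gensim 4.x renamed
# `size` to `vector_size` and `iter` to `epochs`.
model = gensim.models.Word2Vec(tokenized_content,
                               size=50,      # dimensionality of each word vector
                               window=10,    # max distance between a word and its context
                               min_count=1,  # keep every word that appears at least once
                               workers=10,   # number of worker threads
                               iter=10)      # number of passes over the corpus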

**Step #3:** Output the number of words as well as the list of words in the model's vocabulary

In [53]:
# Step 3
len(model.wv.vocab)

38

In [54]:
for key in model.wv.vocab.keys():
  print(key)

dog
ran
step
entered
owner
room
check
name
thomson
comer
commander
machine
learning
program
lambda
school
creating
curriculum
teaching
fulltime
one
favorite
subject
excited
taking
class
starting
april
kickoff
batter
hit
ball
park
pacific
ocean
pitcher
threw
dugout


**Step #4:** Output the word vector for each of the following tokens: **a)** curriculum, **b)** ocean, and **c)** pitcher

In [55]:
# Step 4
model.wv['curriculum']

array([ 0.0078973 ,  0.00900684, -0.00148584, -0.00964239,  0.00205151,
        0.00615953, -0.00172741, -0.00412862, -0.00307044, -0.00361187,
        0.00161317, -0.00070418,  0.00229991, -0.00167504,  0.00807413,
       -0.00904608, -0.00427291,  0.0086932 ,  0.00106635, -0.00428049,
       -0.0047743 , -0.00514426,  0.00827205,  0.00878079, -0.00558444,
       -0.00423884,  0.00248194,  0.00764772, -0.00977805,  0.0065634 ,
        0.00692894, -0.00653495,  0.001719  ,  0.00543719, -0.00163612,
        0.00565811, -0.00258197, -0.00619078,  0.00199268,  0.00101762,
        0.00792677, -0.00434621,  0.00810428, -0.00185776, -0.00555382,
       -0.00183626,  0.00511515, -0.00263665,  0.00847675,  0.00133379],
      dtype=float32)

In [57]:
model.wv['ocean']

array([ 0.00922351,  0.00657452, -0.00377176,  0.00106541, -0.0083246 ,
       -0.00507166,  0.00896357,  0.00470665, -0.0019717 ,  0.00184349,
        0.00068058, -0.00321811, -0.00614252, -0.00031749, -0.00151758,
       -0.0024629 , -0.00344092, -0.00024497,  0.00463977,  0.00440908,
       -0.00127755, -0.00998037, -0.00791304, -0.0017962 , -0.00366164,
       -0.00394903, -0.00662783, -0.00541074,  0.00599303, -0.00153011,
       -0.00405876,  0.00664217,  0.00574318,  0.00257567, -0.00303558,
        0.00853761, -0.00551205, -0.00138665,  0.00916395, -0.00356858,
       -0.00658254,  0.00440606, -0.0006951 , -0.00529999,  0.00625745,
       -0.0078224 , -0.00920512, -0.00369702, -0.00414902, -0.00067215],
      dtype=float32)

In [58]:
model.wv['pitcher']

array([ 0.00303153,  0.0085448 ,  0.00118706, -0.00557101, -0.00078175,
        0.00610707,  0.00010288,  0.00682049, -0.00532219,  0.00613966,
       -0.00198091,  0.00525637,  0.00083343,  0.00510037,  0.00417584,
        0.00510645,  0.00760938, -0.00514985,  0.0056145 ,  0.00110501,
       -0.00507975, -0.0053293 , -0.00990335, -0.00927391, -0.00679142,
        0.002471  ,  0.00588116,  0.00451168,  0.00798106, -0.00736037,
       -0.00797739,  0.00615937,  0.004915  , -0.00532823, -0.00862616,
       -0.00556596, -0.00777058,  0.00679734,  0.00011425, -0.00884646,
        0.00526021,  0.00024323, -0.00376152,  0.00411266,  0.0090164 ,
        0.0018343 ,  0.00252077,  0.00663791,  0.00014646, -0.00459627],
      dtype=float32)

**Step #5:** Now train the model on more data: a larger corpus, namely the 20 newsgroups text dataset. Fetch the data from the training subset

*Reference*: http://scikit-learn.org/stable/datasets/index.html

In [0]:
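# Step 5
# Import the loader explicitly; `import sklearn` alone may not load the
# `sklearn.datasets` submodule
from sklearn.datasets import fetch_20newsgroups

# Strip headers, footers, and quoted replies so only the post bodies remain
data = fetch_20newsgroups(subset='train',
                          remove=('headers', 'footers', 'quotes'))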

**Step #6:** Output the metadata for the data that is fetched

In [74]:
# Step 6
data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [75]:
print(data['DESCR'])

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [0]:
posts = data['data']

**Step #7:** Output the number of posts in each category

In [0]:
# Step 7
import pandas as pd

df = pd.DataFrame({'text': data['data'], 
                   'category': data['target']})

In [88]:
df.head()

|   | text | category |
|---|------|----------|
| 0 | I was wondering if anyone out there could enli... | 7 |
| 1 | A fair number of brave souls who upgraded thei... | 4 |
| 2 | well folks, my mac plus finally gave up the gh... | 4 |
| 3 | \nDo you have Weitek's address/phone number? ... | 1 |
| 4 | From article <C5owCB.n3p@world.std.com>, by to... | 14 |


In [89]:
data['target_names']

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [93]:
df.category.value_counts().sort_index()

0     480
1     584
2     591
3     590
4     578
5     593
6     585
7     594
8     598
9     597
10    600
11    595
12    591
13    594
14    593
15    599
16    546
17    564
18    465
19    377
Name: category, dtype: int64
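The counts above are keyed by integer label. A minimal sketch of turning such labels into readable category names follows; the `targets` list is made up for the example, and only the first three entries of `data['target_names']` are reproduced.

```python
from collections import Counter

# First three names from data['target_names'] shown above
target_names = ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc']

# Hypothetical integer labels standing in for data['target']
targets = [0, 2, 1, 1, 0, 1]

# Each label is an index into target_names
counts = Counter(target_names[t] for t in targets)
for name, n in sorted(counts.items()):
    print(name, n)
```

With the real DataFrame, the same lookup could be written as `df['category'].map(lambda i: data['target_names'][i]).value_counts()`.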

**Step #8**: Tokenize the body of text for each post

In [0]:
# Step 8
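# Lowercase, collapse whitespace runs, then strip punctuation.
# regex=True is passed explicitly because newer pandas versions treat the
# pattern as a literal string by default.
df['text'] = df['text'].str.lower().str.replace(r'\s+', ' ', regex=True)
df['text'] = df['text'].str.replace(r'[^\w\s]+', '', regex=True)
docs = [s.replace('\n', ' ') for s in df['text']]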

In [0]:
tokens = [tokenizer.tokenize(doc) for doc in docs]
tokens = [[lemmatizer.lemmatize(t) for t in tokenlist if \
                      t not in stop_words] for tokenlist in tokens]

In [109]:
tokens[0]

['wondering',
 'anyone',
 'could',
 'enlighten',
 'car',
 'saw',
 'day',
 '2door',
 'sport',
 'car',
 'looked',
 'late',
 '60',
 'early',
 '70',
 'called',
 'bricklin',
 'door',
 'really',
 'small',
 'addition',
 'front',
 'bumper',
 'separate',
 'rest',
 'body',
 'know',
 'anyone',
 'tellme',
 'model',
 'name',
 'engine',
 'spec',
 'year',
 'production',
 'car',
 'made',
 'history',
 'whatever',
 'info',
 'funky',
 'looking',
 'car',
 'please',
 'email']

**Step #9**: Train the Word2vec model - words should show up at least 3 times in the corpus of text
and the size of each word vector is 200 (i.e. dimension = 200)

*Reference*: Scroll down to the section "A closer look at the parameter settings" to review the parameters that can be set

In [0]:
# Step 9
model = gensim.models.Word2Vec(tokens,
                               size=200,
                               window=10,
                               min_count=3,
                               workers=4,
                               iter=10)
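The effect of `min_count` can be sketched without gensim: count every token across the corpus and keep only those that reach the threshold. The `build_vocab` helper and the toy token lists are assumptions for illustration; gensim performs this pruning internally while building its vocabulary.

```python
from collections import Counter

def build_vocab(token_lists, min_count):
    """Return the words that occur at least `min_count` times across all documents."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {word for word, n in counts.items() if n >= min_count}

# Toy token lists (illustrative only)
toy_tokens = [
    ['car', 'engine', 'car', 'door'],
    ['car', 'engine', 'bricklin'],
    ['engine', 'door', 'funky'],
]

vocab = build_vocab(toy_tokens, min_count=3)
print(sorted(vocab))  # ['car', 'engine']
```

Rare words such as `bricklin` are dropped, which is why the vocabulary in the next step is much smaller than the number of distinct tokens in the posts.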

**Step #10**:  List the number of words in the model's vocabulary

In [115]:
# Step 10
len(model.wv.vocab)

26481

**Step #11:** Examine word similarity to the word "Christ" (find other words most similar to it)

In [116]:
# Step 11
model.wv.most_similar(positive='christ')

[('jesus', 0.9507708549499512),
 ('resurrection', 0.9398853778839111),
 ('holy', 0.9357596635818481),
 ('heaven', 0.933609127998352),
 ('spirit', 0.9299333095550537),
 ('salvation', 0.9290993213653564),
 ('lord', 0.9289507269859314),
 ('grace', 0.9260685443878174),
 ('sin', 0.9168611168861389),
 ('disciple', 0.9151330590248108)]
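The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces the idea by brute force on made-up 3-dimensional vectors; the `cosine` and `most_similar` helpers and the numbers in `toy_embeddings` are hypothetical and not taken from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, embeddings, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = embeddings[word]
    scores = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors for illustration
toy_embeddings = {
    'christ': [0.9, 0.1, 0.0],
    'jesus':  [0.8, 0.2, 0.1],
    'engine': [0.0, 0.2, 0.9],
}

ranking = most_similar('christ', toy_embeddings)
for other, score in ranking:
    print(other, round(score, 3))
```

A score near 1 means the two vectors point in almost the same direction; a score near 0 means they are close to orthogonal.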

**Step #12**: Examine document similarity with Doc2vec, using any body of text of your choice

*Reference*: https://radimrehurek.com/gensim/models/doc2vec.html

In [0]:
# Step 12

In [0]:
# Examine the first document in the list above to gauge the similarity


**Stretch Goal:**

Download the pre-trained word vectors from Google. Access the pre-trained vectors via the following link: https://code.google.com/archive/p/word2vec

Load the pre-trained word vectors and train the **Word2vec** model

Examine the first 100 keys or words of the vocabulary

Output the vector representation for a select set of words - the words can be of your choice

Examine the similarity between words - the words can be of your choice

For example: 

model.similarity('house', 'bungalow')

model.similarity('house', 'umbrella')
