In [1]:
# 3:16:44

## word2vec
- converting words (strings of chars) to a **numeric vector representation of a word**
- example: making vector representations out of words from Wikipedia

- **word2vec** is a shallow, two-layer neural network 
    - accepts a text corpus as an input
    - returns a set of vectors (also known as embeddings)
    - each vector is a numeric representation of a given word
- skipgram method
    - defining a window of words around the focus word (**context**)
    - the model goes through the corpus and checks which words fall within the defined window (e.g. one, two or three words on either side of the focus word) to help it learn the meaning of a word
    - this allows the model to learn which words are similar (e.g. color names, which tend to appear in the same contexts)
- any n-dimensional vector can be plotted in an n-dimensional space
    - this provides insight into the similarity of words (semantic relationships such as synonyms)
    - **cosine similarity**
        - two vectors are passed to a function 
        - the cosine of the angle between the two vectors is calculated
        - it returns a number between -1 and 1
            - if the angle is close to 0°, the similarity score (cos(angle)) will be very close to 1
            - if the angle is close to 90°, the score will be close to 0 (no similarity)
            - if the angle is close to 180°, the score will be close to -1 (the vectors point in opposite directions)
    - two vectors can be subtracted from each other to look for word analogies ("king is to man what queen is to woman")
        - queen ≈ king - man + woman 
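
The cosine score and the analogy arithmetic can be sketched with NumPy on made-up vectors. The four 2-dimensional vectors below are invented for illustration (real embeddings have 100 or more dimensions), and `cosine_similarity` is a helper name chosen here, not a gensim function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-d "embeddings" (assumed values): axis 0 ~ royalty, axis 1 ~ gender
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

same_direction = cosine_similarity([1, 2], [2, 4])   # angle of 0 degrees
right_angle = cosine_similarity([1, 0], [0, 1])      # angle of 90 degrees
opposite = cosine_similarity([1, 2], [-1, -2])       # angle of 180 degrees

# king - man + woman: pick the toy word whose vector is closest to the result
target = toy["king"] - toy["man"] + toy["woman"]
best = max(toy, key=lambda w: cosine_similarity(toy[w], target))
```

With these toy values the three scores come out at the ends and the middle of the [-1, 1] range, and `best` is the word whose vector equals `target` exactly. Real embeddings only get approximately close.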
        
--------------------------------------

#### using word2vec
1. using pretrained embeddings
    - a model that has been trained on some extremely large corpus of text
    - generic word vectors out of the box
    - more stable
    
2. train embeddings on our own data
    - the result would be embeddings that are more tailored to a specific problem
    - real-life: words are used differently depending on the channel, style, context etc.
    - the downside is a more complicated training process, and the vectors may be worse if they are not trained on a large enough corpus

In [2]:
# !pip install gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/44/52/f1417772965652d4ca6f901515debcd9d6c5430969e8c02ee7737e6de61c/gensim-4.0.1-cp37-cp37m-manylinux1_x86_64.whl (23.9MB)
Collecting smart-open>=1.8.1 (from gensim)
  Downloading https://files.pythonhosted.org/packages/e9/90/6ca525991e281ecdf204c5c1de854da6334068e44121c384b68c6a838e14/smart_open-5.1.0-py3-none-any.whl (57kB)
Installing collected packages: smart-open, gensim
Successfully installed gensim-4.0.1 smart-open-5.1.0


#### Working with gensim
1. loading pretrained word vectors
2. exploring word vectors
3. finding similar words based on pretrained word vectors

In [4]:
import gensim.downloader as api
wiki_embeddings = api.load('glove-wiki-gigaword-100')



In [5]:
wiki_embeddings['king']

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

In [6]:
wiki_embeddings.most_similar('king')

[('prince', 0.7682329416275024),
 ('queen', 0.7507690787315369),
 ('son', 0.7020888328552246),
 ('brother', 0.6985775828361511),
 ('monarch', 0.6977890729904175),
 ('throne', 0.691999077796936),
 ('kingdom', 0.6811409592628479),
 ('father', 0.6802029013633728),
 ('emperor', 0.6712858080863953),
 ('ii', 0.6676074266433716)]
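
gensim's `most_similar` also accepts `positive` and `negative` word lists, which is one way to run the "king - man + woman" arithmetic on pretrained vectors. A minimal sketch, assuming `vectors` is a `KeyedVectors` object such as `wiki_embeddings`. `solve_analogy` is an illustrative helper name, and the call is left commented out, so no output is shown.

```python
def solve_analogy(vectors, a, b, c, topn=3):
    """Return candidates d such that a : b is like c : d.

    `vectors` is assumed to be a gensim KeyedVectors object; the words in
    `positive` are added and the words in `negative` are subtracted.
    """
    return vectors.most_similar(positive=[b, c], negative=[a], topn=topn)

# Usage with the pretrained vectors loaded earlier (not run here):
# solve_analogy(wiki_embeddings, "man", "king", "woman")
```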

#### Training a model with gensim
1. reading in the data
2. cleaning up


In [7]:
import numpy as np
# np.random.bit_generator = np.random._bit_generator

import gensim

import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

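# the string 'delimiter' is not expected to occur in the file, so each line is read as ONE column named 'v1\tv2'
# multi-character separators need the python engine; naming it avoids the ParserWarning fallback
data = pd.read_csv('dataset/SMSSpamCollection.csv', sep='delimiter', engine='python')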
messages = pd.DataFrame(columns=['label','text'])
messages[['label','text']] = data['v1\tv2'].str.split('\t', expand=True)

messages.head()



Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [8]:
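# gensim.utils.simple_preprocess 
# lowercases the text, keeps only alphabetic tokens (punctuation and digits are dropped)
# and discards tokens shorter than 2 or longer than 15 characters; stopwords are NOT removed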
messages['text_clean'] = messages['text'].apply(lambda x: gensim.utils.simple_preprocess(x))
messages.head()

Unnamed: 0,label,text,text_clean
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...","[go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th..."
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,..."
3,ham,U dun say so early hor... U c already then say...,"[dun, say, so, early, hor, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, don, think, he, goes, to, usf, he, lives, around, here, though]"
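
To make the cleaning step concrete, here is a rough approximation of what `simple_preprocess` does with its default settings. This is a sketch written for these notes, not gensim's actual implementation: `approx_preprocess` is a hypothetical name, and edge cases (accents, tokens with underscores) may differ from the real function.

```python
import re

def approx_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (not its real code)."""
    # runs of letters only: digits, punctuation and underscores split tokens
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # drop very short and very long tokens, keep everything else (stopwords too)
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = approx_preprocess("Ok lar... Joking wif u oni...")
```

On this message the sketch keeps the same five tokens as the `text_clean` column shown above; the single-letter word "u" is dropped by the length filter.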


In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    messages['text_clean'],
    messages['label'],
    test_size=0.2
)

In [11]:
# Word2Vec is given:
    # 1. the training data (an iterable of token lists)
    # 2. vector_size - the length of each word vector
    # 3. window - the number of words that are looked at
                    # before and after a single word 
                    # to define a CONTEXT for that word
    # 4. min_count - the number of times a word MUST appear in the corpus in order to get a word vector 
# note: gensim trains CBOW by default (sg=0); pass sg=1 to use the skip-gram method

w2v_model = gensim.models.Word2Vec(X_train, vector_size=100, window=5, min_count=2)
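
What `window` means under the skip-gram method can be illustrated with a small pure-Python sketch that lists the (focus word, context word) pairs a window produces. This is a simplification for intuition only: `skipgram_pairs` is a hypothetical helper, and real implementations such as gensim add refinements on top of this basic idea.

```python
def skipgram_pairs(tokens, window=2):
    """List (focus, context) pairs for every word within `window` positions."""
    pairs = []
    for i, focus in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((focus, tokens[j]))
    return pairs

pairs = skipgram_pairs(["free", "entry", "to", "win"], window=1)
```

With `window=1`, the words at the edges of the sentence get one context word each and the inner words get two, so a larger window means more (and more distant) context per word.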


In [12]:
# now we can explore the word vector by calling the wv attribute (word vector) on the model
w2v_model.wv['king']

array([-0.01859508,  0.09334327,  0.01866102,  0.03645543,  0.00165043,
       -0.12536196,  0.03824956,  0.19862303, -0.08388173, -0.07034004,
       -0.03974972, -0.11045031, -0.01419707,  0.06661732,  0.03849647,
       -0.05567477,  0.02262736, -0.09225537, -0.02150563, -0.20847806,
        0.02826986,  0.01709961,  0.0369816 , -0.04529424, -0.01692055,
       -0.01505673, -0.08408951, -0.05140496, -0.06082587,  0.03639682,
        0.06615051,  0.02167609,  0.03725342, -0.06047661, -0.05844461,
        0.0932475 ,  0.03039333, -0.07540444, -0.03886182, -0.15025271,
       -0.01436044, -0.06315015, -0.03963646,  0.02130873,  0.05740609,
       -0.03826748, -0.05030645, -0.02441807,  0.04785962,  0.08740065,
        0.0267141 , -0.07274487, -0.00861943,  0.00321344, -0.02199449,
        0.04259879,  0.07240497,  0.00882292, -0.08044538,  0.04243927,
        0.01797462,  0.05290907,  0.00950083, -0.03138465, -0.12503333,
        0.08406508,  0.0222037 ,  0.0651852 , -0.11138732,  0.11

In [13]:
w2v_model.wv.most_similar('king')

### the results are poor because the training corpus is small: the top neighbours all score about 0.997, which suggests the vectors barely differ from one another ###
# to use word embeddings well, we need to understand the problem at hand a little better


[('job', 0.9969331622123718),
 ('crazy', 0.9968197345733643),
 ('liao', 0.9967961311340332),
 ('make', 0.9967643618583679),
 ('keep', 0.9967561364173889),
 ('didn', 0.9967349767684937),
 ('of', 0.9967313408851624),
 ('will', 0.996717631816864),
 ('but', 0.9967151284217834),
 ('great', 0.9967098832130432)]

In [15]:
# checking all of the words that our model has created ("learned") a vector for
w2v_model.wv.index_to_key

['to',
 'you',
 'the',
 'and',
 'in',
 'is',
 'me',
 'my',
 'it',
 'for',
 'your',
 'of',
 'call',
 'that',
 'have',
 'on',
 'are',
 'now',
 'can',
 'but',
 'not',
 'so',
 'we',
 'get',
 'at',
 'do',
 'or',
 'be',
 'just',
 'ur',
 'with',
 'no',
 'if',
 'will',
 'this',
 'gt',
 'lt',
 'up',
 'how',
 'free',
 'when',
 'ok',
 'what',
 'go',
 'from',
 'out',
 'll',
 'all',
 'then',
 'know',
 'got',
 'like',
 'am',
 'come',
 'good',
 'day',
 'there',
 'he',
 'time',
 'its',
 'love',
 'only',
 'was',
 'send',
 'want',
 'text',
 'txt',
 'going',
 'as',
 'by',
 'home',
 'lor',
 'don',
 'one',
 'dont',
 'stop',
 'need',
 'about',
 'see',
 'our',
 'da',
 'sorry',
 'still',
 'reply',
 'mobile',
 'she',
 'hi',
 'back',
 'today',
 'tell',
 'later',
 'did',
 'been',
 'pls',
 'take',
 'they',
 'think',
 'new',
 'please',
 'any',
 'well',
 'an',
 'week',
 'oh',
 'phone',
 'dear',
 'some',
 'night',
 'her',
 'hope',
 'him',
 'claim',
 'who',
 'where',
 'here',
 'much',
 'has',
 'hey',
 're',
 'more',


In [17]:
### generate sentence-level vectors from the word vectors of each word in the sentence
# we loop through each text message (ls) within the test set
# next we iterate through each word (represented by 'i') in that text message
# each word is looked up in the wv attribute of the model 
    # ---> to get its word vector
# a word is kept only if the model LEARNED A VECTOR for it (if i in w2v_model.wv.key_to_index)
# each message becomes a 2D array (words x 100), or an empty array if none of its words is known
    # those arrays are wrapped again in an outside array
    # ---> dtype=object is needed because the messages contain different numbers of words
    
w2v_vect = np.array([np.array([w2v_model.wv[i] for i in ls if i in w2v_model.wv.key_to_index]) for ls in X_test], dtype=object)




In [None]:
print(type(w2v_vect))
print(w2v_vect.ndim)
print(w2v_vect.size)
print(w2v_vect)

In [18]:
# notice that the number of word vectors can be smaller than the number of words in the sentence
for i,v in enumerate(w2v_vect):
    print(len(X_test.iloc[i]), len(v))

# each line printed by the loop has 2 numbers
    # - the number of words in the sentence from the test data
    # - the number of word vectors found for that exact sentence 
        # (each vector has length 100, the vector_size we set while creating the model)

        
# we will get an error if we pass this to a model 
    # because a model expects the same number of features for every example
    

27 8
9 9
3 2
7 7
23 23
21 17
7 6
25 24
18 18
4 4
5 5
5 5
13 11
18 17
29 29
25 22
9 9
24 21
7 7
21 21
16 16
21 19
21 18
21 20
16 16
5 5
8 7
6 6
15 15
22 21
18 14
9 9
7 7
7 6
6 6
7 5
13 2
18 16
11 8
15 14
23 22
36 22
2 2
14 10
26 21
26 23
4 4
8 8
4 4
10 8
5 5
22 19
4 4
4 4
27 25
19 17
25 21
12 12
5 5
7 6
11 11
7 6
29 29
11 11
24 21
8 8
10 7
5 5
27 27
7 5
22 20
19 19
10 10
7 7
13 13
7 6
23 22
29 25
7 5
25 25
17 16
8 8
6 6
3 3
4 4
8 7
23 23
17 13
22 22
16 14
12 12
7 7
9 9
51 45
17 9
15 12
5 4
5 5
27 22
17 16
34 33
8 7
17 16
9 9
46 40
11 9
4 2
14 14
19 19
6 5
26 23
14 13
25 25
18 14
23 23
22 16
4 2
25 20
18 17
18 17
12 9
10 9
13 13
9 9
2 2
9 8
26 18
13 12
20 16
6 5
20 17
10 9
10 10
9 9
24 23
21 17
10 10
19 18
3 3
17 17
8 5
50 43
11 11
8 7
18 18
9 8
24 20
12 10
5 5
17 14
7 7
6 6
25 24
23 22
31 29
5 5
15 14
8 8
13 9
7 6
4 4
21 20
8 8
5 5
12 9
6 6
7 6
5 4
5 4
23 23
18 18
5 4
7 7
26 20
9 9
5 4
13 13
12 11
13 13
21 20
5 5
28 26
27 25
22 21
24 23
24 22
7 7
23 18
4 4
31 26
6 4
29 29
5 5
14 13
6 4


In [19]:
# element-wise average
# each sentence is reduced to ONE vector by averaging its word vectors

# for each array in the w2v_vect array
    # make sure that at least one word has a word vector
    # take that array of word vectors
    # calculate the element-wise average of those word vectors
    # and append it to the w2v_vect_avg LIST of our final vectors
    
    # in case there is no word vector for a certain sentence
        # this means that we have no understanding of that text message
        # create an array with len==100 that is full of zeros
        # and append it to w2v_vect_avg

w2v_vect_avg = []

for vect in w2v_vect:
    if len(vect) != 0:
        w2v_vect_avg.append(vect.mean(axis=0))
    else:
        w2v_vect_avg.append(np.zeros(100))
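
The same averaging step can be written as a small reusable function. The sketch below uses a made-up 3-dimensional lookup dictionary in place of `w2v_model.wv`, so it runs without a trained model. `average_word_vectors` and the toy values are assumptions for illustration.

```python
import numpy as np

def average_word_vectors(tokens, lookup, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [lookup[t] for t in tokens if t in lookup]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 3-d lookup standing in for w2v_model.wv
lookup = {
    "free": np.array([1.0, 0.0, 2.0]),
    "win":  np.array([3.0, 2.0, 0.0]),
}

sentences = [["free", "win", "tkts"], ["oni"], ["win"]]
# one fixed-length row per sentence, regardless of sentence length
features = np.vstack([average_word_vectors(s, lookup, dim=3) for s in sentences])
```

Every row of `features` has the same length, which is what a downstream model needs. The second sentence contains no known words, so its row is all zeros.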

In [20]:
# now check if the lengths match 
for i,v in enumerate(w2v_vect_avg):
    print(len(X_test.iloc[i]), len(v))

27 100
9 100
3 100
7 100
23 100
21 100
7 100
25 100
18 100
4 100
5 100
5 100
13 100
18 100
29 100
25 100
9 100
24 100
7 100
21 100
16 100
21 100
21 100
21 100
16 100
5 100
8 100
6 100
15 100
22 100
18 100
9 100
7 100
7 100
6 100
7 100
13 100
18 100
11 100
15 100
23 100
36 100
2 100
14 100
26 100
26 100
4 100
8 100
4 100
10 100
5 100
22 100
4 100
4 100
27 100
19 100
25 100
12 100
5 100
7 100
11 100
7 100
29 100
11 100
24 100
8 100
10 100
5 100
27 100
7 100
22 100
19 100
10 100
7 100
13 100
7 100
23 100
29 100
7 100
25 100
17 100
8 100
6 100
3 100
4 100
8 100
23 100
17 100
22 100
16 100
12 100
7 100
9 100
51 100
17 100
15 100
5 100
5 100
27 100
17 100
34 100
8 100
17 100
9 100
46 100
11 100
4 100
14 100
19 100
6 100
26 100
14 100
25 100
18 100
23 100
22 100
4 100
25 100
18 100
18 100
12 100
10 100
13 100
9 100
2 100
9 100
26 100
13 100
20 100
6 100
20 100
10 100
10 100
9 100
24 100
21 100
10 100
19 100
3 100
17 100
8 100
50 100
11 100
8 100
18 100
9 100
24 100
12 100
5 100
17 100
7 100
6

- now each sentence is represented by a single vector 
    - (the element-wise average of the individual word vectors for the words in that sentence)
    - with its length fixed at 100

---------------------------------------------------

## doc2vec 
- creates a vector on a sentence/paragraph/document level
- a two-layer neural network (the same as word2vec)
- skips the consolidation step we had to do previously (creating vectors for each individual word and then averaging the values of those vectors to represent a single sentence)
    - averaging a group of word vectors into a single vector results in **information loss**
    - doc2vec represents the sentence/paragraph/document in a more sophisticated way
- usage:
    1. using pretrained embeddings
    2. training models with our own data

In [21]:
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

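# the string 'delimiter' is not expected to occur in the file, so each line is read as ONE column named 'v1\tv2'
# multi-character separators need the python engine; naming it avoids the ParserWarning fallback
data = pd.read_csv('dataset/SMSSpamCollection.csv', sep='delimiter', engine='python')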
messages = pd.DataFrame(columns=['label','text'])
messages[['label','text']] = data['v1\tv2'].str.split('\t', expand=True)

# print(messages.head())

messages['text_clean'] = messages['text'].apply(lambda x: gensim.utils.simple_preprocess(x))

print(messages.head())



X_train, X_test, y_train, y_test = train_test_split(
    messages['text_clean'],
    messages['label'],
    test_size=0.2
)

  label  \
0   ham   
1   ham   
2  spam   
3   ham   
4   ham   

                                                                                                  text  \
0  Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...   
1                                                                        Ok lar... Joking wif u oni...   
2  Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...   
3                                                    U dun say so early hor... U c already then say...   
4                                        Nah I don't think he goes to usf, he lives around here though   

                                                                                            text_clean  
0  [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th...  
1                                                                          [ok, lar, j



In [22]:
### doc2vec needs a tag for each individual sentence: TaggedDocument pairs a list of words with a list of tags (here the position in X_train)

tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i,v in enumerate(X_train)]

In [23]:
tagged_docs[16]

TaggedDocument(words=['baaaaaaaabe', 'wake', 'up', 'miss', 'you', 'crave', 'you', 'need', 'you'], tags=[16])
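
The structure of a tagged document can be shown without gensim. gensim's `TaggedDocument` is a named tuple with the fields `words` and `tags`; the sketch below uses a stand-in named tuple (`TaggedDoc`, a name made up here) and a two-message toy corpus to mirror the list comprehension above.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument
# (assumption: used only so the example runs without gensim installed)
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

corpus = [["ok", "lar", "joking"], ["free", "entry", "to", "win"]]
# the tag is the position of the message in the corpus, wrapped in a list
tagged = [TaggedDoc(words, [i]) for i, words in enumerate(corpus)]
```

Each entry keeps its token list under `words` and its identifier under `tags`, which is how the model can associate a learned vector with a specific training document.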

In [24]:
### Training a doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                 vector_size=100,
                                 window=5,
                                 min_count=2
                                 )

In [25]:
# infer_vector expects a LIST of tokens (strings); passing a single string raises a TypeError
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).

In [26]:
d2v_model.infer_vector(['text', 'data', 'from', 'NLP', 'doc2vec', 'model'])

array([-0.03054821,  0.02921768,  0.01270788, -0.01839049,  0.00136889,
       -0.02633071, -0.00040266,  0.03723228, -0.02643488, -0.01421277,
       -0.01325785, -0.04739045,  0.00525372,  0.0096801 ,  0.01145184,
       -0.03060447,  0.00671459, -0.03182997, -0.00150104, -0.02075062,
        0.01844559,  0.02599015,  0.00666935,  0.00831925, -0.00484738,
        0.01182042, -0.02803058, -0.00906078, -0.01258356, -0.00196204,
        0.00688754,  0.00173456, -0.0072806 ,  0.00806671,  0.00209293,
        0.01978914,  0.00376926, -0.01965827, -0.01158088, -0.02770146,
       -0.01198745, -0.00287248, -0.01213399, -0.0063272 , -0.00219679,
       -0.00966262, -0.00305656, -0.01299466,  0.00498746,  0.01906915,
        0.00010572, -0.01325119,  0.00541018,  0.00163815, -0.00471573,
        0.01113642, -0.00780264, -0.01267774, -0.00590003,  0.01157917,
        0.00668021, -0.0027678 ,  0.00272791,  0.01076281, -0.02000392,
        0.01582583,  0.00762441,  0.01619483, -0.01946804,  0.02

- the length of each vector is 100 (the vector_size parameter we passed when instantiating the Doc2Vec model) 

-------------------------------------------------

### Preparing the vectors for a machine learning model 

- first we iterate through the lists of words in X_test
- then we pass each list of words (see above) to the 'infer_vector' method
- the resulting vectors are stored in a LIST called 'vectors'
    - we do not have to store them in an array as with the word2vec model,
    - because every inferred vector already has the same length, so there is no need to average word vectors to fix the lengths
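
If a 2D feature matrix is preferred over a list, the inferred vectors can be stacked with NumPy. The sketch below is an assumption-laden illustration: `build_feature_matrix` is a hypothetical helper, and `fake_infer` is a stand-in for `d2v_model.infer_vector` that only imitates its fixed-length output so the example runs without a trained model.

```python
import numpy as np

def build_feature_matrix(infer, docs):
    """Stack one inferred vector per document into an (n_docs, dim) array."""
    return np.vstack([infer(words) for words in docs])

def fake_infer(words, dim=4):
    """Hypothetical stand-in for d2v_model.infer_vector: fixed-length output."""
    rng = np.random.default_rng(len(words))  # deterministic per document length
    return rng.normal(size=dim)

docs = [["ok", "lar"], ["free", "entry", "to", "win"], ["nah"]]
X = build_feature_matrix(fake_infer, docs)
```

With the real model, the call would presumably be `build_feature_matrix(d2v_model.infer_vector, X_test)`, giving one row of length 100 per test message.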

In [27]:
vectors = [[d2v_model.infer_vector(words)] for words in X_test]

In [28]:
vectors[128]

[array([ 3.12278070e-03, -2.71983701e-03, -6.66441920e-05,  9.29011870e-03,
         3.76638700e-03, -3.03561171e-03, -7.88119447e-04,  1.33309467e-02,
        -5.65573899e-03,  1.09439285e-03, -1.40107248e-03, -2.95319944e-03,
        -3.83217121e-03, -7.17220036e-03,  4.76343837e-03, -4.07574931e-03,
         8.24493822e-04, -2.76558776e-03,  7.14637991e-03, -1.88247638e-03,
        -2.24457844e-03, -5.99887874e-03,  6.67733373e-03, -8.94644111e-03,
        -7.92922336e-04,  4.63969400e-03,  3.61392274e-03, -1.18698152e-02,
         2.28531798e-03, -6.52342150e-03,  7.64028495e-03, -1.12058129e-03,
         4.54420876e-03, -6.33778237e-03, -1.11630158e-02,  4.39576805e-03,
         4.75074304e-03,  1.97859993e-03, -9.95512400e-03, -6.53957250e-04,
         7.03437952e-03,  2.56255339e-03,  5.18462434e-03,  7.96295062e-04,
         4.72550234e-03, -3.16102104e-03, -1.22094229e-02,  1.94759166e-03,
         2.51203915e-03, -3.48096876e-03,  1.04194134e-02, -6.45849854e-03,
         1.3