In [1]:
# 3:16:44

## word2vec
- converting words (strings of chars) to a **numeric vector representation of a word**
- example: making vector representations out of words from Wikipedia

- **word2vec** is a shallow, two-layer neural network 
    - accepts a text corpus as an input
    - returns a set of vectors (also known as embeddings)
    - each vector is a numeric representation of a given word
- skip-gram method
    - defines a window of words around the focus word (the **context**)
    - the model goes through the corpus and checks which words fall within that window (e.g. one, two or three words on either side) to help it learn the meaning of the focus word
    - this allows the model to learn that words used in similar contexts are similar (e.g. colors)
- any n-dimensional vector can be plotted as a point in an n-dimensional space
    - this provides insight into the similarity of words (semantic relationships, such as synonyms)
    - **cosine similarity**
        - two vectors are passed to a function 
        - the cosine is calculated from the angle between the two vectors
        - it returns a number between -1 and 1
            - if the angle is close to 0°, the similarity score (cos(angle)) will be very close to 1
            - if the angle is close to 90°, the score will be close to 0 (no similarity)
            - if the angle is close to 180°, the score will be close to -1 (the vectors point in opposite directions)
    - vectors can be added to and subtracted from each other to look for word analogies ("king is to man what queen is to woman")
        - queen = king - man + woman 
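
The cosine similarity scores and the analogy arithmetic described above can be sketched with NumPy. The 2-dimensional vectors below are made up for illustration (they are not real embeddings), and `cosine_similarity` is a helper written for this note, not a gensim function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy direction vectors (assumed values, chosen so the angles are obvious)
east = np.array([1.0, 0.0])
north = np.array([0.0, 1.0])
west = np.array([-1.0, 0.0])

print(cosine_similarity(east, east))   # angle of 0 degrees
print(cosine_similarity(east, north))  # angle of 90 degrees
print(cosine_similarity(east, west))   # angle of 180 degrees

# toy "embeddings" with two hand-picked dimensions: (royalty, gender)
king = np.array([1.0, 1.0])
man = np.array([0.0, 1.0])
woman = np.array([0.0, -1.0])
queen = np.array([1.0, -1.0])

# king - man + woman lands on queen in this hand-built example
print(np.allclose(king - man + woman, queen))
```

With real embeddings the analogy only holds approximately, so in practice the result is compared to all word vectors and the nearest ones are returned.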
        
--------------------------------------

#### using word2vec
1. using pretrained embeddings
    - a model that has been trained on some extremely large corpus of text
    - generic word vectors out of the box
    - more stable
    
2. train embeddings on our own data
    - the result would be embeddings that are more tailored to a specific problem
    - real-life: words are used differently depending on the channel, style, context etc.
    - the downside is a more complicated training process, and the vectors may not be as good if they are not trained on a large enough corpus

In [2]:
!pip install gensim



#### Working with gensim
1. loading pretrained word vectors
2. exploring word vectors
3. finding similar words based on pretrained word vectors

In [3]:
import gensim.downloader as api
wiki_embeddings = api.load('glove-wiki-gigaword-100')



In [4]:
wiki_embeddings['king']

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

In [6]:
wiki_embeddings.most_similar('king')

[('prince', 0.7682329416275024),
 ('queen', 0.7507690787315369),
 ('son', 0.7020887732505798),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175),
 ('throne', 0.691999077796936),
 ('kingdom', 0.6811410188674927),
 ('father', 0.680202841758728),
 ('emperor', 0.6712858080863953),
 ('ii', 0.6676074266433716)]

#### Training a model with gensim
1. reading in the data
2. cleaning up


In [7]:
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

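# 'delimiter' is a placeholder separator that is not expected to occur in the file, so each line is read as one column
# engine='python' is set explicitly because multi-character separators are not supported by the default C engine
data = pd.read_csv('dataset/SMSSpamCollection.csv', sep='delimiter', engine='python')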
messages = pd.DataFrame(columns=['label','text'])
messages[['label','text']] = data['v1\tv2'].str.split('\t', expand=True)

messages.head()



Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [9]:
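# gensim.utils.simple_preprocess 
# lowercases the text, strips punctuation and digits, and drops very short or very long tokens
# (it does NOT remove stopwords: 'until', 'only', 'in' and 'to' are still in the output below)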
messages['text_clean'] = messages['text'].apply(lambda x: gensim.utils.simple_preprocess(x))
messages.head()

Unnamed: 0,label,text,text_clean
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...","[go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th..."
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,..."
3,ham,U dun say so early hor... U c already then say...,"[dun, say, so, early, hor, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, don, think, he, goes, to, usf, he, lives, around, here, though]"
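
To make the cleaning step less of a black box, here is a rough pure-Python approximation of what `simple_preprocess` appears to do to these messages. `approx_simple_preprocess` is a hypothetical helper written for this note; the length limits of 2 and 15 are assumed to match gensim's defaults, and edge cases (accents, underscores) may differ from the real function.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (hypothetical helper)."""
    # lowercase, then keep runs of letters only;
    # digits and punctuation act as separators
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # drop tokens that are too short or too long
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(approx_simple_preprocess("Ok lar... Joking wif u oni..."))
print(approx_simple_preprocess("Nah I don't think he goes to usf, he lives around here though"))
```

Single-letter tokens such as "u", "I" and the "t" from "don't" fall below the minimum length, which is why they are missing from the `text_clean` column.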


In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    messages['text_clean'],
    messages['label'],
    test_size=0.2
)

In [11]:
# to train the model we pass:
    # 1. the training data,
    # 2. size - the length of each word vector (renamed to vector_size in gensim 4.0 and later)
    # 3. window - defines the number of words that are looked at 
                    # before and after a single word 
                    # to define a CONTEXT for that word
    # 4. min_count - the number of times a word MUST appear in the corpus in order to get a word vector 

w2v_model = gensim.models.Word2Vec(X_train, size=100, window=5, min_count=2)
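
The effect of the `window` parameter can be sketched without gensim. The function below lists the (focus, context) pairs that a window of a given size produces for one tokenized message. `context_pairs` is a hypothetical helper, and this is a simplification: real implementations may shrink the window randomly during training. Note also that gensim's `Word2Vec` trains with CBOW unless `sg=1` is passed; the window defines the context in the same way for both modes.

```python
def context_pairs(tokens, window):
    """Return (focus, context) pairs for every word within `window` positions."""
    pairs = []
    for i, focus in enumerate(tokens):
        lo = max(0, i - window)                 # leftmost position in the window
        hi = min(len(tokens), i + window + 1)   # one past the rightmost position
        for j in range(lo, hi):
            if j != i:                          # the focus word is not its own context
                pairs.append((focus, tokens[j]))
    return pairs

sentence = ['ok', 'lar', 'joking', 'wif', 'oni']
pairs = context_pairs(sentence, window=1)
for focus, context in pairs:
    print(focus, '->', context)
```

A larger window produces more pairs per word and tends to capture broader, more topical relationships.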


In [12]:
# now we can explore the word vector by calling the wv attribute (word vector) on the model
w2v_model.wv['king']

array([-1.05539067e-02,  4.64578532e-02,  1.58831719e-02, -1.51924435e-02,
       -1.13291442e-01, -1.97012946e-02,  1.01023382e-02, -1.54630365e-02,
       -1.00295611e-01, -1.76381953e-02,  4.56879064e-02,  5.90193607e-02,
       -1.76282059e-02,  5.12666032e-02, -1.76997688e-02, -6.50342479e-02,
       -8.95903707e-02, -5.60897496e-03, -7.59827346e-02,  1.54406400e-02,
       -9.32148248e-02, -7.01542292e-03,  4.20686416e-02, -2.78477930e-02,
        7.73381768e-03, -9.56976637e-02,  5.03551811e-02, -2.49051861e-02,
       -8.37319065e-03,  6.21041171e-02,  7.45661780e-02, -1.35712817e-01,
       -5.44772595e-02,  7.02554360e-02, -4.79186699e-02,  5.52556142e-02,
        1.14355095e-01,  4.17503230e-02, -3.59951966e-02, -6.00103056e-03,
        2.77257916e-02,  1.89548545e-02,  8.28571990e-03, -1.81145072e-02,
        8.20372999e-03,  5.79352602e-02, -2.57818811e-02, -1.20909847e-01,
        3.94678079e-02,  9.33493939e-05, -1.24044605e-02,  4.10076082e-02,
       -3.17845643e-02, -

In [14]:
w2v_model.wv.most_similar('king')

### the results are not good because the training corpus is small ###
# the top matches are unrelated words, all with scores around 0.999
# in order to use word embeddings, we need to understand the problem at hand a little bit better


[('were', 0.9993231892585754),
 ('dont', 0.9993202090263367),
 ('answer', 0.9993064403533936),
 ('receive', 0.9992914199829102),
 ('get', 0.9992875456809998),
 ('gud', 0.9992870092391968),
 ('who', 0.9992806911468506),
 ('try', 0.9992780089378357),
 ('my', 0.9992777109146118),
 ('we', 0.9992774724960327)]
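
One way to investigate results like these is to count how often the query word occurs in the training data: a word seen only a handful of times gives the model very little context to learn from. The sketch below uses a made-up toy corpus in place of `X_train` and also shows which words would survive a `min_count` threshold.

```python
from collections import Counter

# toy tokenized corpus (made up; stands in for X_train)
corpus = [
    ['free', 'entry', 'to', 'win'],
    ['call', 'to', 'win', 'free', 'prize'],
    ['the', 'king', 'is', 'free'],
]

# count every token across all documents
counts = Counter(token for doc in corpus for token in doc)
min_count = 2

# words that occur at least min_count times would get a vector
vocab = {word for word, n in counts.items() if n >= min_count}

print(counts['king'])   # how often the query word was seen
print(sorted(vocab))    # words frequent enough to get a vector
```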

In [15]:
# checking all of the words that our model has created ("learned") a vector for
# (in gensim 4.0 and later this attribute is named index_to_key)
w2v_model.wv.index2word

['you',
 'to',
 'the',
 'and',
 'in',
 'is',
 'me',
 'my',
 'it',
 'for',
 'your',
 'call',
 'of',
 'that',
 'have',
 'on',
 'are',
 'now',
 'can',
 'so',
 'but',
 'not',
 'do',
 'or',
 'we',
 'will',
 'if',
 'get',
 'at',
 'be',
 'just',
 'no',
 'ur',
 'with',
 'this',
 'up',
 'how',
 'what',
 'gt',
 'lt',
 'when',
 'ok',
 'from',
 'free',
 'all',
 'out',
 'go',
 'll',
 'know',
 'day',
 'good',
 'like',
 'then',
 'come',
 'got',
 'am',
 'there',
 'its',
 'time',
 'he',
 'was',
 'love',
 'only',
 'want',
 'send',
 'as',
 'txt',
 'text',
 'one',
 'need',
 'by',
 'going',
 'she',
 'don',
 'about',
 'back',
 'lor',
 'today',
 'stop',
 'da',
 'home',
 'sorry',
 'see',
 'hi',
 'still',
 'tell',
 'our',
 'dont',
 'reply',
 'mobile',
 'take',
 'later',
 'been',
 'any',
 'pls',
 'did',
 'think',
 'please',
 'new',
 'some',
 'week',
 'her',
 'they',
 'phone',
 'dear',
 'here',
 'where',
 'night',
 'oh',
 're',
 'well',
 'hope',
 'who',
 'has',
 'much',
 'hey',
 'more',
 'claim',
 'happy',
 'gre

In [16]:
### generate aggregated sentence vectors based on the word vectors for each word in the sentence
# we loop through each text message (ls) in the test set
# next we iterate through each word (represented by 'i') in that text message
# each word is looked up in the wv attribute of the model 
    # ---> to get its word vector
# the only condition is that the word must be in the model's vocabulary (if i in w2v_model.wv.index2word)
    # words that appeared fewer than min_count times in the training data have no vector and are skipped
# the list returned by the inner list comprehension is wrapped in a 2D array (one row per word)
    # those arrays are wrapped again in an outer array
    # ---> an array of arrays with different numbers of rows (one inner array per message)
    
w2v_vect = np.array([np.array([w2v_model.wv[i] for i in ls if i in w2v_model.wv.index2word]) for ls in X_test])
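
The same lookup can be sketched on toy data to show why the result is ragged. The dictionary below is a made-up stand-in for `w2v_model.wv`, with 3-dimensional vectors instead of 100. The per-message arrays are kept in a plain Python list here, because recent NumPy versions reject ragged input to `np.array` unless `dtype=object` is given.

```python
import numpy as np

# toy 3-dimensional "word vectors" (made up; stands in for w2v_model.wv)
toy_vectors = {
    'free': np.array([0.1, 0.2, 0.3]),
    'call': np.array([0.0, 0.5, 0.1]),
    'now':  np.array([0.4, 0.1, 0.0]),
}

# 'xyz' is not in the vocabulary, so it is skipped
messages = [['free', 'call', 'now'], ['call', 'xyz'], ['xyz']]

per_message = [
    np.array([toy_vectors[w] for w in msg if w in toy_vectors])
    for msg in messages
]

# number of words in the message vs. shape of its array of word vectors
for msg, vecs in zip(messages, per_message):
    print(len(msg), vecs.shape)
```

Looking words up in a dictionary (or a set) is also much faster than the `in w2v_model.wv.index2word` test, which scans a list for every word.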




In [20]:
print(type(w2v_vect))
print(w2v_vect.ndim)
print(w2v_vect.size)
print(w2v_vect)

<class 'numpy.ndarray'>
1
1115
[array([[-0.00123451,  0.01461569,  0.01020308, ...,  0.01940734,
         0.01816209, -0.06401429],
       [-0.05081618,  0.24623837,  0.07502738, ...,  0.21390168,
         0.1759904 , -0.8384252 ],
       [-0.06182301,  0.25474226,  0.07726886, ...,  0.21941733,
         0.18692842, -0.88128257],
       ...,
       [-0.06097978,  0.2838241 ,  0.07886321, ...,  0.24613678,
         0.20931083, -0.9697612 ],
       [-0.06732642,  0.26313898,  0.07885629, ...,  0.23650844,
         0.20482902, -0.9468216 ],
       [-0.00241862,  0.03177935,  0.00645023, ...,  0.02196146,
         0.02108919, -0.1051791 ]], dtype=float32)
 array([[-4.37482074e-02,  1.82962731e-01,  5.07822633e-02,
        -6.22133426e-02, -4.64364141e-01, -9.17480364e-02,
         6.02119714e-02, -6.58415109e-02, -4.34282899e-01,
        -8.30904096e-02,  2.08749160e-01,  2.36286834e-01,
        -9.13146585e-02,  2.33189702e-01, -5.68525754e-02,
        -2.77045429e-01, -3.67519468e-01, -3

In [22]:
# notice that the number of words in a sentence can differ from the number of word vectors found for it
for i,v in enumerate(w2v_vect):
    print(len(X_test.iloc[i]), len(v))

# each printed line has 2 numbers
    # - the number of words in the sentence from the test data
    # - the number of word vectors found for that exact sentence 
        # each vector has a length of 100 because that is the size we set while creating the model

        
# we will get an error if we pass this to a machine learning model 
    # because every sample must have the same number of features, and here the number of vectors varies per sentence
    

17 17
6 4
15 14
22 20
7 7
23 22
20 19
8 6
8 7
7 7
17 16
5 5
10 10
5 4
16 16
9 9
9 8
6 5
15 14
7 5
2 2
7 6
11 9
20 20
18 17
14 10
9 9
5 4
4 4
5 5
21 18
16 12
20 19
8 7
8 8
13 13
13 13
26 16
4 4
9 9
20 19
6 5
22 22
27 24
5 5
26 25
31 31
6 6
24 21
20 17
30 24
20 20
11 10
25 24
7 5
7 6
5 3
30 25
5 3
16 15
10 10
19 17
8 6
7 6
9 9
8 8
19 17
15 15
25 20
24 20
13 10
17 13
13 12
17 17
19 19
6 6
12 10
4 3
31 31
7 4
22 20
13 12
15 14
7 7
5 5
13 13
27 22
12 12
5 5
9 8
17 17
5 5
5 5
4 4
22 22
13 13
5 5
14 13
8 7
6 6
8 8
1 1
5 5
18 17
13 12
2 2
11 7
3 3
5 5
18 17
10 10
27 26
7 7
10 10
3 3
4 3
20 20
4 4
12 8
4 4
1 1
7 6
4 3
22 21
29 28
5 5
5 5
11 9
46 43
12 11
21 19
11 10
1 1
5 4
24 22
11 10
4 1
6 6
16 13
6 6
3 3
8 7
8 7
16 14
25 25
6 5
9 7
25 25
22 21
6 6
14 13
7 6
19 19
26 23
20 19
52 46
5 5
12 10
5 5
22 22
6 6
22 19
22 22
17 17
22 22
5 3
27 26
9 8
11 11
22 20
9 8
36 31
6 5
6 6
6 5
23 22
15 14
4 4
22 15
19 19
24 16
2 2
6 5
36 32
25 24
7 7
25 23
9 9
7 6
18 16
7 5
8 7
5 5
24 22
6 6
7 7
4 4
13 13
6 6


In [23]:
# element-wise average
# each sentence is reduced to a single vector by averaging its word vectors

# for each array in the w2v_vect array
    # make sure that at least one word has a word vector
    # take that array of word vectors
    # calculate the element-wise average of those word vectors
    # and append it to the w2v_vect_avg LIST of our final vectors
    
    # in case there are no word vectors for a certain sentence
        # this means that we have no understanding of that text message
        # create an array with len==100 that is full of zeros
        # and append it to w2v_vect_avg
w2v_vect_avg = []

for vect in w2v_vect:
    if len(vect) != 0:
        w2v_vect_avg.append(vect.mean(axis=0))
    else:
        w2v_vect_avg.append(np.zeros(100))
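
The averaging step can be checked on a small example where the expected values are easy to verify by hand. This is a sketch with made-up 2-dimensional vectors; `average_vectors` is a hypothetical helper that mirrors the loop above and additionally stacks the result into a 2D feature matrix.

```python
import numpy as np

def average_vectors(per_message, dim):
    """Average each message's word vectors; use zeros when a message has none."""
    rows = []
    for vecs in per_message:
        if len(vecs) != 0:
            rows.append(vecs.mean(axis=0))   # element-wise average over the words
        else:
            rows.append(np.zeros(dim))       # no known words in this message
    return np.vstack(rows)                   # one row per message

per_message = [
    np.array([[1.0, 2.0], [3.0, 4.0]]),   # two word vectors
    np.array([[5.0, 6.0]]),               # one word vector
    np.array([]),                         # no known words
]

features = average_vectors(per_message, dim=2)
print(features.shape)
print(features)
```

Every message now maps to a row of the same length, which is the shape most machine learning models expect.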

In [25]:
# now check if the lengths match 
for i,v in enumerate(w2v_vect_avg):
    print(len(X_test.iloc[i]), len(v))

17 100
6 100
15 100
22 100
7 100
23 100
20 100
8 100
8 100
7 100
17 100
5 100
10 100
5 100
16 100
9 100
9 100
6 100
15 100
7 100
2 100
7 100
11 100
20 100
18 100
14 100
9 100
5 100
4 100
5 100
21 100
16 100
20 100
8 100
8 100
13 100
13 100
26 100
4 100
9 100
20 100
6 100
22 100
27 100
5 100
26 100
31 100
6 100
24 100
20 100
30 100
20 100
11 100
25 100
7 100
7 100
5 100
30 100
5 100
16 100
10 100
19 100
8 100
7 100
9 100
8 100
19 100
15 100
25 100
24 100
13 100
17 100
13 100
17 100
19 100
6 100
12 100
4 100
31 100
7 100
22 100
13 100
15 100
7 100
5 100
13 100
27 100
12 100
5 100
9 100
17 100
5 100
5 100
4 100
22 100
13 100
5 100
14 100
8 100
6 100
8 100
1 100
5 100
18 100
13 100
2 100
11 100
3 100
5 100
18 100
10 100
27 100
7 100
10 100
3 100
4 100
20 100
4 100
12 100
4 100
1 100
7 100
4 100
22 100
29 100
5 100
5 100
11 100
46 100
12 100
21 100
11 100
1 100
5 100
24 100
11 100
4 100
6 100
16 100
6 100
3 100
8 100
8 100
16 100
25 100
6 100
9 100
25 100
22 100
6 100
14 100
7 100
19 100
26

- now each sentence is represented by a single vector 
    - its values are the element-wise average of the word vectors for the words in that sentence
    - its length is 100

---------------------------------------------------

## doc2vec 
- creates a vector on a sentence/paragraph/document level
- a two-layer neural network (the same as word2vec)
- skips the consolidation step we had to do previously (creating vectors for each individual word and then averaging those vectors to represent a single sentence)
    - averaging a group of word vectors into a single vector results in **information loss**
    - doc2vec represents the sentence/paragraph/document in a more sophisticated way
- usage:
    1. using pretrained embeddings
    2. training models with our own data
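
One concrete form of the information loss mentioned above is word order: an average does not depend on the order of its inputs. The sketch below uses made-up 3-dimensional vectors to show that two sentences with opposite meanings get the same averaged vector.

```python
import numpy as np

# toy word vectors (made up for illustration)
toy_vectors = {
    'dog':   np.array([1.0, 0.0, 0.0]),
    'bites': np.array([0.0, 1.0, 0.0]),
    'man':   np.array([0.0, 0.0, 1.0]),
}

def sentence_average(tokens):
    """Element-wise average of the word vectors in a sentence."""
    return np.mean([toy_vectors[t] for t in tokens], axis=0)

a = sentence_average(['dog', 'bites', 'man'])
b = sentence_average(['man', 'bites', 'dog'])

# the two sentences mean different things but are indistinguishable after averaging
print(np.allclose(a, b))
```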

In [None]:
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

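# 'delimiter' is a placeholder separator that is not expected to occur in the file, so each line is read as one column
# engine='python' is set explicitly because multi-character separators are not supported by the default C engine
data = pd.read_csv('dataset/SMSSpamCollection.csv', sep='delimiter', engine='python')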
messages = pd.DataFrame(columns=['label','text'])
messages[['label','text']] = data['v1\tv2'].str.split('\t', expand=True)

# print(messages.head())

messages['text_clean'] = messages['text'].apply(lambda x: gensim.utils.simple_preprocess(x))

print(messages.head())



X_train, X_test, y_train, y_test = train_test_split(
    messages['text_clean'],
    messages['label'],
    test_size=0.2
)

In [29]:
### doc2vec expects each document to be wrapped in a TaggedDocument, 
### which pairs the list of words with a 'tags' list (here, the index of the sentence)

tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i,v in enumerate(X_train)]

In [31]:
tagged_docs[16]

TaggedDocument(words=['studying', 'in', 'sch', 'or', 'going', 'home', 'anyway', 'll', 'going', 'sch', 'later'], tags=[16])

In [32]:
### Training a doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                 vector_size=100,
                                 window=5,
                                 min_count=2
                                 )

In [34]:
# infer_vector expects a list of strings (tokens); passing a single string raises a TypeError
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).
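
A small guard can turn raw text into the list of tokens that `infer_vector` expects. `as_tokens` below is a hypothetical helper, not part of gensim, and the whitespace split is a naive assumption; in this notebook the tokens would normally come from `simple_preprocess` so that they match the training data.

```python
def as_tokens(doc):
    """Return a list of string tokens (hypothetical helper, not part of gensim)."""
    if isinstance(doc, str):
        # a bare string would otherwise be iterated character by character
        return doc.lower().split()
    return list(doc)

print(as_tokens('text data from nlp'))
print(as_tokens(['text', 'data']))
```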

In [35]:
d2v_model.infer_vector(['text', 'data', 'from', 'NLP', 'doc2vec', 'model'])

array([-0.00141423,  0.0062188 ,  0.00308266,  0.00276783, -0.00383873,
        0.00274167,  0.00401915, -0.00190763, -0.00608317, -0.00218049,
        0.00775998,  0.00343144, -0.00045128,  0.00121614, -0.00130062,
       -0.00188033, -0.01035386, -0.00589644, -0.00275047,  0.0015035 ,
       -0.00376079, -0.00450834, -0.00075878, -0.00467609,  0.00075025,
       -0.00409629,  0.00276235,  0.00145573,  0.00304923, -0.00255933,
        0.00616254, -0.00497567, -0.00277544,  0.00393416, -0.00028814,
       -0.00058989,  0.00654795, -0.00163432,  0.00279466,  0.00311303,
        0.00337751,  0.00144811,  0.00300472, -0.00054887,  0.00136764,
        0.00908081, -0.00327512, -0.00262275,  0.0019084 ,  0.00550525,
       -0.00374384,  0.00442728,  0.00263951, -0.0033929 , -0.01222997,
        0.00108002,  0.00553672,  0.00742503, -0.00937249,  0.00564457,
       -0.00440077, -0.00125692,  0.00596297, -0.00120858, -0.00029843,
       -0.00034503, -0.01143269, -0.00682692,  0.00635015,  0.00

- the length of each vector is 100 (that is the vector_size parameter we passed when instantiating the Doc2Vec model) 

In [37]:
### now the vectors should be prepared for a machine learning model
# first we iterate through the lists of words in X_test
# then we pass each list of words (see above) to the 'infer_vector' method
# the resulting vectors are stored in a LIST called 'vectors' (each entry is a one-element list holding one vector)
    # we do not have to store them in an array like in the word2vec model,
    # because we do not have to calculate element-wise averages to fix the lengths of the vectors
    
vectors = [[d2v_model.infer_vector(words)] for words in X_test]
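
Because each entry of `vectors` is a one-element list, converting it to an array gives a 3D shape, while most models expect a 2D matrix with one row per document. The sketch below shows the reshape on random stand-in data (4 documents instead of the real test set); the random values are placeholders, not inferred vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for [[d2v_model.infer_vector(words)] for words in X_test]:
# 4 documents, each a one-element list holding a 100-dimensional vector
vectors = [[rng.normal(size=100)] for _ in range(4)]

stacked = np.array(vectors)                     # shape (4, 1, 100)
features = stacked.reshape(len(vectors), -1)    # shape (4, 100)

print(stacked.shape)
print(features.shape)
```

Dropping the inner brackets in the list comprehension would produce the 2D shape directly.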

In [38]:
vectors[128]

[array([-0.00480069,  0.01078585,  0.0060761 , -0.00832097, -0.02767451,
         0.00116301,  0.00923245, -0.00837996, -0.02789045,  0.00531719,
         0.00519943,  0.01568944, -0.00407802,  0.00488987,  0.00735666,
        -0.01594962, -0.02276433, -0.01023729, -0.01090279,  0.00674385,
        -0.02333922, -0.0038877 ,  0.02124634, -0.00444617, -0.00241868,
        -0.02987038,  0.00992619, -0.0067247 ,  0.00316342,  0.00367629,
         0.02253664, -0.02524848, -0.01193742,  0.02027621, -0.01742133,
         0.01005882,  0.02816904,  0.00826702, -0.00741691, -0.00880202,
         0.00470332,  0.00971489,  0.01717932, -0.01005577, -0.00371203,
         0.02099204, -0.01884709, -0.02261822,  0.00820917,  0.00743707,
        -0.00432223,  0.00670638, -0.00100442, -0.02463825, -0.04387959,
        -0.00503191,  0.00306877,  0.01743518, -0.03731865,  0.02951704,
        -0.0084368 , -0.00396591,  0.01556264, -0.00605551,  0.0124385 ,
         0.00183623, -0.03845953, -0.01010414,  0.0