First we will implement word vectors. We will use both a pretrained word-vector model and a locally trained one, and then compare the results.

In [1]:
import gensim.downloader as api

In [2]:
wiki_embedding = api.load('glove-wiki-gigaword-100')

In [3]:
#Let's check how well the pretrained embeddings work
wiki_embedding.most_similar('long')

[('short', 0.8219379186630249),
 ('longer', 0.7952098250389099),
 ('once', 0.7616981267929077),
 ('still', 0.759249746799469),
 ('so', 0.7570570707321167),
 ('end', 0.7567450404167175),
 ('even', 0.7533906698226929),
 ('though', 0.7517069578170776),
 ('well', 0.7480059862136841),
 ('time', 0.7464157938957214)]

Most of the returned words are related to the query word: 'short' and 'longer' are close in meaning, while words such as 'once' and 'still' tend to appear in similar contexts.
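The score next to each word is a cosine similarity between the two word vectors. As a minimal sketch of what that number measures, the snippet below computes it by hand. The three-dimensional vectors are made up for illustration and are not taken from the GloVe model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, for illustration only
long_vec = [1.0, 2.0, 0.0]
longer_vec = [2.0, 4.0, 0.0]     # points in the same direction as long_vec
unrelated_vec = [0.0, 0.0, 3.0]  # orthogonal to long_vec

print(cosine_similarity(long_vec, longer_vec))     # close to 1
print(cosine_similarity(long_vec, unrelated_vec))  # 0.0
```

Vectors that point in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.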

Now we will create our own vectorization model using gensim.

In [4]:
import pandas as pd
import numpy as np
import gensim
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

In [5]:
message = pd.read_csv('spam.csv', encoding ='latin-1')
message.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives around here though",,,


In [6]:
#Our dataset contains columns that hold only NaN values, so we need to get rid of them
message.drop(labels=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis = 1, inplace=True)

#Also let's rename the remaining columns so they are easier to understand
message.columns= ['label', 'email_text']
message.head()

Unnamed: 0,label,email_text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [7]:
#We could tokenize and clean the text manually, but here we take a shortcut with gensim's simple_preprocess
message['cleaned_text'] = message['email_text'].apply(gensim.utils.simple_preprocess)
message.head()

Unnamed: 0,label,email_text,cleaned_text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...","[go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th..."
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,..."
3,ham,U dun say so early hor... U c already then say...,"[dun, say, so, early, hor, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, don, think, he, goes, to, usf, he, lives, around, here, though]"
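To make the cleaning step less of a black box, here is a rough pure-Python approximation of what `simple_preprocess` appears to do with its default settings: lowercase the text, keep runs of non-digit word characters, and drop tokens shorter than 2 or longer than 15 characters. The function name `approx_simple_preprocess` is hypothetical and this is a sketch for illustration, not gensim's actual implementation.

```python
import re

# Runs of word characters that are not digits (an approximation of gensim's tokenizer)
_ALPHABETIC = re.compile(r"(?:(?!\d)\w)+")

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and drop very short or very long tokens."""
    tokens = (m.group() for m in _ALPHABETIC.finditer(text.lower()))
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

print(approx_simple_preprocess("Ok lar... Joking wif u oni..."))
print(approx_simple_preprocess("U dun say so early hor... U c already then say..."))
```

This explains why single letters such as "u" and "c" disappear from the cleaned column, and why "21st" is reduced to "st".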


In [8]:
#Now split our dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(message['cleaned_text'], message['label'],
                                                   test_size = 0.20)

In [9]:
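#Now we can train the word2vec model on the training messages
#Note: size is the gensim 3.x parameter name; gensim 4.x renamed it to vector_size
w2v = gensim.models.Word2Vec(X_train, size = 100, window = 3, min_count = 2)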

In [10]:
w2v.wv.most_similar('long')

[('he', 0.9997958540916443),
 ('off', 0.9997925758361816),
 ('just', 0.9997900128364563),
 ('its', 0.9997889399528503),
 ('here', 0.9997878670692444),
 ('not', 0.9997830390930176),
 ('today', 0.9997804164886475),
 ('well', 0.9997801780700684),
 ('in', 0.999778687953949),
 ('but', 0.9997777938842773)]

The returned words are not close in meaning to the query word, and every similarity is almost exactly 1, which suggests the model has barely learned to tell words apart. To get better results, the model needs to be trained on a much larger dataset.

Now let's see how we can build an ML model on top of the word2vec output.

In [11]:
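#Look up the vector of every test-set word that exists in the trained vocabulary
#(index2word is the gensim 3.x name; gensim 4.x calls it index_to_key)
vocab = set(w2v.wv.index2word)
w2v_vect = [np.array([w2v.wv[i] for i in ls if i in vocab]) for ls in X_test]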

In [12]:
#Most ML algorithms cannot handle a variable number of input features, and the test messages
#contain different numbers of words. Moreover, not every word has a vector: some words in the
#test set never appeared in the training set and so were never vectorized
w2v_vect_avg = []

for vect in w2v_vect:
    if len(vect)!=0:
        w2v_vect_avg.append(vect.mean(axis=0))
    else:
        w2v_vect_avg.append(np.zeros(100))
        
#Here we summed the values column-wise and then took the average.
#So a 10-word message goes from an array of shape (10, 100) to a single vector of 100 elements
#Note that in numpy, axis = 0 aggregates down the rows and returns one value per column; pandas follows the same convention
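The averaging step is easier to follow on a small array. The sketch below uses a hypothetical helper, `average_word_vectors`, and made-up 4-dimensional vectors rather than the real 100-dimensional ones; it only illustrates how the shape changes and how an empty message falls back to zeros.

```python
import numpy as np

def average_word_vectors(word_vectors, dim):
    """Collapse an (n_words, dim) array into one (dim,) vector; zeros if empty."""
    word_vectors = np.asarray(word_vectors, dtype=float)
    if word_vectors.size == 0:
        # No known words in the message: fall back to an all-zero vector
        return np.zeros(dim)
    return word_vectors.mean(axis=0)

# Hypothetical message with 3 known words, each a 4-dimensional vector
toy = np.array([[1.0, 2.0, 3.0, 4.0],
                [3.0, 2.0, 1.0, 0.0],
                [2.0, 2.0, 2.0, 2.0]])

avg = average_word_vectors(toy, dim=4)
empty = average_word_vectors([], dim=4)

print(toy.shape, avg.shape)
print(avg)
print(empty)
```

Every message ends up as a vector of the same length, whatever its word count, which is what a downstream classifier needs.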

By averaging the vectors we lose important information, such as word order. To avoid this, we can use sentence-level vectorization instead of word-level.

As with word2vec, pretrained doc2vec models exist, but there are far fewer of them. Here we will train a custom doc2vec model.

In [14]:
#First we need to tag the documents
#For simplicity we will use the indexes as tags

tagged_doc = [gensim.models.doc2vec.TaggedDocument(v,[i]) for i, v in enumerate(X_train)]
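A tagged document is simply a pair of a token list and a list of tags. The sketch below mimics that layout with a hypothetical stand-in, `TaggedDoc`, built from `collections.namedtuple` so it runs without gensim; the toy messages are made up.

```python
from collections import namedtuple

# Hypothetical stand-in that mirrors the (words, tags) layout of a tagged document
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

toy_messages = [["ok", "lar", "joking"], ["free", "entry", "win"]]

# enumerate yields positions 0, 1, 2, ... which become the tags
toy_tagged = [TaggedDoc(words, [i]) for i, words in enumerate(toy_messages)]

for doc in toy_tagged:
    print(doc.tags, doc.words)
```

Because `enumerate` counts positions, the tags are positions within `X_train` after the shuffle, not the row labels of the original DataFrame.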

In [15]:
#Let's train our model
#It is similar to w2v, but here we pass the list of tagged documents, and size is replaced by vector_size
d2v = gensim.models.Doc2Vec(tagged_doc, vector_size = 100,
                           window = 3, min_count = 2)

In [16]:
#To get a vector from the doc2vec model we pass a list of tokens to infer_vector
#(in w2v we looked up a single word with w2v.wv['any_word'])
d2v.infer_vector(['it', 'is', 'raining'])

array([-8.47342366e-04,  1.08249690e-02,  7.10447261e-04,  1.45424148e-02,
       -2.09554210e-02, -3.43396375e-03,  3.48304934e-03,  8.67612020e-04,
        3.76065099e-03,  1.89243872e-02,  8.29740521e-03,  5.35160489e-03,
       -1.60630494e-02, -1.69633850e-02,  4.13551723e-04,  9.96800140e-04,
       -2.16301880e-03, -6.61700638e-03,  1.46541058e-03,  1.84242819e-02,
       -1.90337782e-03,  4.36272565e-03,  9.96280555e-03,  1.20894169e-03,
        7.59214256e-03, -4.60318197e-03, -4.94375871e-03,  4.74603893e-03,
       -1.04742832e-02, -5.39830793e-03,  1.01084569e-02, -1.27405871e-03,
       -1.61650195e-03, -7.56048551e-03,  2.44679209e-03, -9.18435544e-05,
       -2.20834203e-02,  4.01854003e-03, -1.78542882e-02,  5.82707068e-03,
       -1.60865462e-03, -1.11642096e-03, -1.13240108e-02,  6.38588890e-03,
        4.55374038e-03, -1.58460217e-03, -1.25673162e-02, -1.07251685e-02,
        6.58075931e-03, -5.39635657e-04, -9.59773362e-03, -1.48652866e-02,
        9.92938294e-04, -

In [17]:
#Just like w2v, we can look up the most similar words
#Calling most_similar on the model itself is deprecated, so we go through d2v.wv
d2v.wv.most_similar(positive=['it', 'is', 'raining'])

[('going', 0.9996617436408997),
 ('if', 0.9996592998504639),
 ('can', 0.9996568560600281),
 ('one', 0.99965500831604),
 ('now', 0.9996535778045654),
 ('about', 0.9996533393859863),
 ('but', 0.9996504187583923),
 ('he', 0.9996503591537476),
 ('go', 0.9996497631072998),
 ('of', 0.9996475577354431)]

In [18]:
#Unlike the w2v approach, we do not have to average word vectors before feeding them into the ML model

d2v_vectors = [d2v.infer_vector(i) for i in X_test]
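Most classifiers expect a two-dimensional feature matrix rather than a list of separate arrays. As a small sketch, assuming each inferred vector is a one-dimensional NumPy array of equal length, the list can be stacked row by row. The three vectors below are made-up stand-ins, not output of the model.

```python
import numpy as np

# Hypothetical stand-ins for three inferred document vectors of length 4
toy_doc_vectors = [np.array([0.1, 0.2, 0.3, 0.4]),
                   np.array([0.0, 0.1, 0.0, 0.1]),
                   np.array([0.5, 0.5, 0.5, 0.5])]

# One row per document, one column per vector dimension
feature_matrix = np.vstack(toy_doc_vectors)

print(feature_matrix.shape)
```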

In [19]:
d2v_vectors[0]

array([-0.00069643,  0.02257366,  0.00070908,  0.0153123 , -0.0320973 ,
       -0.01000468,  0.00416091, -0.00227025,  0.00497963,  0.02605775,
        0.00804959,  0.01122246, -0.0256967 , -0.02589604,  0.0008933 ,
       -0.00051676, -0.01105438, -0.01878889,  0.00686748,  0.02525457,
        0.00111113,  0.00329549,  0.01572517,  0.00241998,  0.00296828,
       -0.01601721, -0.01581485,  0.00788614, -0.01593765, -0.00413635,
        0.02332717, -0.00236793,  0.00206217, -0.02156455, -0.00311842,
        0.01127873, -0.04008573,  0.01191013, -0.03910189,  0.0102937 ,
        0.00076298, -0.01141857, -0.02278094,  0.00951322,  0.01202042,
       -0.01115231, -0.02073468, -0.00959048,  0.01501272, -0.00896481,
       -0.01959412, -0.01697352,  0.00913172, -0.00905805, -0.00558412,
        0.01323293,  0.00836749,  0.00334338,  0.00923952, -0.00659455,
       -0.00039557,  0.02162655,  0.03645353, -0.01557582,  0.01276152,
        0.01231494, -0.00451587,  0.01463225,  0.0012144 ,  0.02