Document Similarity without Word Embeddings

Approach:
1. I am using a spam email dataset and will implement the similarity feature on a subset of it: the first 50 emails. For one chosen email, I will identify the 3 most similar emails in that subset.

2. Text representation (vectorization) will be done using TF-IDF, which represents each complete document (email) as a single vector of term weights.

3. Cosine similarity will be used to measure how similar the email vectors in the dataset are to each other.
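
To make steps 2 and 3 concrete, here is a minimal standard-library sketch of the idea on three toy emails. It is not the scikit-learn implementation used in this notebook: the function names `tfidf_vectors` and `cosine` and the toy emails are made up for illustration, and tokens are split on whitespace only. The weighting follows the smoothed formula scikit-learn documents for its defaults (raw term count times `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalised TF-IDF vectors), one vector per document."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # smoothed inverse document frequency (assumed to mirror smooth_idf=True)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# hypothetical toy corpus, not taken from emails.csv
toy_emails = [
    "cheap software cds",
    "cheap software download",
    "stock trading tips",
]
vocab, vectors = tfidf_vectors(toy_emails)
sim_01 = cosine(vectors[0], vectors[1])  # two shared terms
sim_02 = cosine(vectors[0], vectors[2])  # no shared terms
print(round(sim_01, 3), round(sim_02, 3))
```

Emails that share no terms get a similarity of exactly 0, which is why zeros show up in the similarity matrix later on.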

In [2]:
## import all the libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
import re

In [3]:
## load the corpus and check the basics
df = pd.read_csv('emails.csv')
df.head(10)

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
5,"Subject: great nnews hello , welcome to medzo...",1
6,Subject: here ' s a hot play in motion homela...,1
7,Subject: save your money buy getting this thin...,1
8,Subject: undeliverable : home based business f...,1
9,Subject: save your money buy getting this thin...,1


In [4]:
##check the size of dataframe
df.shape

(5728, 2)

In [5]:
##drop extra columns as we will not be classifying the emails. 
df.drop(columns = 'spam', inplace = True)


In [6]:
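## take a subset of the original dataframe to be used for Doc Similarity
## note: df[:50] is a slice of df, so the column assignments in later cells trigger a SettingWithCopyWarning;
## df[:50].copy() would avoid that
docsim = df[:50]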
docsim.head(10)

Unnamed: 0,text
0,Subject: naturally irresistible your corporate...
1,Subject: the stock trading gunslinger fanny i...
2,Subject: unbelievable new homes made easy im ...
3,Subject: 4 color printing special request add...
4,"Subject: do not have money , get software cds ..."
5,"Subject: great nnews hello , welcome to medzo..."
6,Subject: here ' s a hot play in motion homela...
7,Subject: save your money buy getting this thin...
8,Subject: undeliverable : home based business f...
9,Subject: save your money buy getting this thin...


In [7]:
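## create a function to clean the text
def clean_text(text):
  ## lowercase first so the "subject:" prefix matches regardless of case
  text = text.lower()
  text = text.replace("subject:", '')
  ## keep only whitespace, letters and digits (raw string so that \s is read as a regex escape)
  text = re.sub(r'[^\sa-zA-Z0-9]', '', text)
  return text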
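
A quick check of what the cleaning function does to one subject line. The sample string below is made up for illustration. Note that deleting punctuation leaves doubled spaces behind; this does not matter for the later `split()`-based steps, but the sketch shows one way to collapse them.

```python
import re

def clean_text(text):
    # same logic as the notebook cell: lowercase, drop the prefix, strip punctuation
    text = text.lower()
    text = text.replace("subject:", "")
    return re.sub(r"[^\sa-zA-Z0-9]", "", text)

sample = "Subject: do not have money , get software cds !"  # hypothetical sample
cleaned = clean_text(sample)
# split/join collapses the doubled spaces left where punctuation was removed
normalised = " ".join(cleaned.split())
print(repr(cleaned))
print(normalised)
```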

In [8]:
## clean up the data 
docsim['Cleanphase1'] = docsim['text'].apply(clean_text)
#docsim['Cleanphase1'] = docsim['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in ("Subject:")]))
#docsim['Cleanphase1'].replace(regex=True, inplace=True, to_replace=r'[^\sa-zA-Z0-9]', value=r'')
##df['tweet'] = df['tweet'].str.replace('[^\w\s#@/:%.,_-]', '', flags=re.UNICODE) -- remove emoji
docsim.head(5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,text,Cleanphase1
0,Subject: naturally irresistible your corporate...,naturally irresistible your corporate identit...
1,Subject: the stock trading gunslinger fanny i...,the stock trading gunslinger fanny is merril...
2,Subject: unbelievable new homes made easy im ...,unbelievable new homes made easy im wanting ...
3,Subject: 4 color printing special request add...,4 color printing special request additional ...
4,"Subject: do not have money , get software cds ...",do not have money get software cds from here...


In [9]:
## remove stop words
## keep the result in a new column so it can be compared with the previous cleaning step
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
docsim['CleanedText'] = docsim['Cleanphase1'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.




In [10]:
docsim.head(5)

Unnamed: 0,text,Cleanphase1,CleanedText
0,Subject: naturally irresistible your corporate...,naturally irresistible your corporate identit...,naturally irresistible corporate identity lt r...
1,Subject: the stock trading gunslinger fanny i...,the stock trading gunslinger fanny is merril...,stock trading gunslinger fanny merrill muzo co...
2,Subject: unbelievable new homes made easy im ...,unbelievable new homes made easy im wanting ...,unbelievable new homes made easy im wanting sh...
3,Subject: 4 color printing special request add...,4 color printing special request additional ...,4 color printing special request additional in...
4,"Subject: do not have money , get software cds ...",do not have money get software cds from here...,money get software cds software compatibility ...
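
The stop-word filter above can be sketched without NLTK as follows. The small hand-written `STOP_WORDS` set and the helper name `remove_stop_words` are assumptions for illustration only; the notebook itself uses NLTK's English list. Using a set rather than a list keeps each membership test cheap.

```python
# A tiny hand-written stop list standing in for NLTK's English list (assumption).
STOP_WORDS = {"do", "not", "have", "from", "the", "is", "your", "a"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated word that appears in stop_words."""
    return " ".join(w for w in text.split() if w not in stop_words)

filtered = remove_stop_words("do not have money get software cds from here")
print(filtered)  # → money get software cds here
```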


In [11]:
## spelling correction feature from TextBlob
## corrects some words but also changes many valid words into different ones, e.g. stationery becomes stationary
from textblob import TextBlob 
from textblob import Word

uniqueWords = list(set(' '.join(docsim['CleanedText']).lower().split(' ')))
count = len(uniqueWords)
print(count)

2781


In [12]:
docsim['spell_checked'] = docsim['CleanedText'].apply(lambda x: str(TextBlob(x).correct()))




In [13]:
## drop additional columns, move on to document similarity logic
docsim.drop(columns = ['text','Cleanphase1','CleanedText'], inplace = True)
docsim.head(5)



Unnamed: 0,spell_checked
0,naturally irresistible corporate identity it r...
1,stock trading gunslinger fanny merrily muco co...
2,unbelievable new homes made easy in wanting sh...
3,4 color printing special request additional in...
4,money get software cos software incompatibilit...


In [14]:
## map each cleaned email text to its row position, so an email can be looked up by its text later
#print(docsim.index) ## -- RangeIndex(start=0, stop=50, step=1)
indices = pd.Series(docsim.index, index=docsim['spell_checked'])
print(indices)

spell_checked
naturally irresistible corporate identity it really hard recollect company market full suggestions information isoverwhelminq good catch log stylish stationary outstanding webster make task much easier promise having ordered into company automatically become world leader suite clear without good products effective business organization practicable aim that nowadays market promise marketing efforts become much effective list clear benefits creativeness hand made original logs specially done reflect distinctive company image convenience log stationary provided formats easy use content management system letsyou change webster content even structure promptness see log drafts within three business days affordability marketing break make gaps budget 100 satisfaction guaranteed provide unlimited amount changes extra fees surethat love result collaboration look portfolio interested                                                                                                    
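
One caveat with a text-keyed lookup like `indices`: if two emails had identical cleaned text, looking up that text would return more than one position (a pandas Series rather than a single integer). A minimal standard-library sketch of the same reverse lookup, with made-up texts, shows the situation:

```python
from collections import defaultdict

# hypothetical cleaned texts; positions 0 and 2 are identical on purpose
texts = ["cheap software cds", "stock trading tips", "cheap software cds"]

positions = defaultdict(list)
for pos, text in enumerate(texts):
    positions[text].append(pos)

# a duplicate text maps to several positions; picking the first is one simple policy
first_match = positions["cheap software cds"][0]
print(dict(positions))
```

Looking emails up by row position instead of by text sidesteps the problem entirely.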

In [15]:
## create a Series variable to send to vectorizer
Emails = docsim['spell_checked']
print(type(Emails))

<class 'pandas.core.series.Series'>


In [16]:
## create a tfidfvectorizer
vec = TfidfVectorizer()
vec_matrix = vec.fit_transform(Emails)
print(vec_matrix.shape)

(50, 2477)


In [17]:
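## some checks on the object created
## note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
vec.get_feature_names()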



['00',
 '000',
 '0000',
 '01',
 '0100',
 '03',
 '0308',
 '0700',
 '08',
 '09',
 '0908',
 '09360462',
 '10',
 '100',
 '1000',
 '10006',
 '10024',
 '105',
 '1071',
 '109',
 '11',
 '1148',
 '119',
 '12',
 '120',
 '126432211',
 '127',
 '13',
 '14',
 '15',
 '150',
 '15177',
 '156489',
 '16',
 '1618',
 '162',
 '169',
 '17',
 '170',
 '19',
 '1933',
 '1934',
 '195546',
 '1999',
 '20',
 '200',
 '2000',
 '2001',
 '2002',
 '2003',
 '20032',
 '2004',
 '20046',
 '2005',
 '20058',
 '205',
 '20549',
 '206',
 '21',
 '214',
 '220',
 '220056020109',
 '229',
 '23',
 '24',
 '24815823',
 '25',
 '26',
 '27',
 '279',
 '29',
 '30',
 '300',
 '301',
 '31',
 '32',
 '32430',
 '32746',
 '335',
 '338',
 '34',
 '349',
 '35',
 '36',
 '360',
 '38',
 '39',
 '40',
 '400',
 '42',
 '429',
 '443',
 '450',
 '454',
 '46',
 '4636',
 '4650',
 '46679',
 '49',
 '499',
 '50',
 '500',
 '50346',
 '50352',
 '50355',
 '51',
 '5110',
 '529',
 '55',
 '56',
 '57',
 '58',
 '59',
 '599',
 '60',
 '62',
 '626',
 '63',
 '64',
 '658',
 '66',


In [18]:
print(vec_matrix)

  (0, 1223)	0.08116863471662554
  (0, 1721)	0.08775460428354849
  (0, 1365)	0.0653002623927596
  (0, 507)	0.08775460428354849
  (0, 1921)	0.08775460428354849
  (0, 1373)	0.08775460428354849
  (0, 2179)	0.09703700704049151
  (0, 918)	0.08775460428354849
  (0, 884)	0.08116863471662554
  (0, 468)	0.08775460428354849
  (0, 243)	0.08116863471662554
  (0, 2329)	0.08116863471662554
  (0, 1809)	0.0653002623927596
  (0, 1070)	0.06835722621590971
  (0, 1977)	0.08775460428354849
  (0, 13)	0.07188623195968254
  (0, 408)	0.08775460428354849
  (0, 1014)	0.09703700704049151
  (0, 393)	0.09703700704049151
  (0, 202)	0.09703700704049151
  (0, 652)	0.07606015984544003
  (0, 2248)	0.08116863471662554
  (0, 2444)	0.05418542415487301
  (0, 746)	0.09703700704049151
  (0, 2001)	0.05800983041086657
  :	:
  (49, 243)	0.01833529282148194
  (49, 1809)	0.014750764706968919
  (49, 13)	0.008119239145855557
  (49, 2444)	0.01836002826715589
  (49, 2001)	0.006551959576633651
  (49, 835)	0.020395206599990327
  (49, 218

In [19]:
## check for similarity between the emails: each row of the TF-IDF matrix is one email, hence we pass the matrix
## to the cosine similarity function twice to get a matrix of pairwise similarity scores

cosine_sim = cosine_similarity(vec_matrix, vec_matrix)

In [20]:
print(cosine_sim)

[[1.         0.         0.0403383  ... 0.01179104 0.00561157 0.06632741]
 [0.         1.         0.         ... 0.         0.         0.00457892]
 [0.0403383  0.         1.         ... 0.00985157 0.03029082 0.051471  ]
 ...
 [0.01179104 0.         0.00985157 ... 1.         0.0031358  0.03271839]
 [0.00561157 0.         0.03029082 ... 0.0031358  1.         0.00354768]
 [0.06632741 0.00457892 0.051471   ... 0.03271839 0.00354768 1.        ]]
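
The structure of this matrix (symmetric, with ones on the diagonal) follows directly from the definition of cosine similarity. A small pure-Python sketch on made-up vectors, with the hypothetical helper `pairwise_cosine`, illustrates it:

```python
import math

def pairwise_cosine(rows):
    """Return the full matrix of cosine similarities between all pairs of rows."""
    norms = [math.sqrt(sum(x * x for x in r)) for r in rows]
    n = len(rows)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            denom = norms[i] * norms[j]
            value = dot / denom if denom else 0.0
            # cosine similarity is symmetric, so fill both halves at once
            sim[i][j] = sim[j][i] = value
    return sim

# hypothetical vectors: rows 0 and 1 point the same way, row 2 shares no dimension with them
toy_vectors = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
]
sim = pairwise_cosine(toy_vectors)
for row in sim:
    print([round(v, 3) for v in row])
```

Because the values are computed in floating point, a diagonal entry can come out a hair below or above 1 instead of exactly 1.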


In [21]:
## lowest value in the matrix
## map(min, cosine_sim) takes the minimum of each row; the outer min takes the smallest of those
print (min(map(min, cosine_sim)))

0.0


In [22]:
## print each row of scores as one space-separated line
for line in cosine_sim:
    print ('  '.join(map(str, line)))

0.9999999999999999  0.0  0.04033830489511899  0.008946909166754684  0.0  0.0  0.0999483742050053  0.022927604726205864  0.023581648415026153  0.022927604726205864  0.014187745914102051  0.036381221199833835  0.0358259223562692  0.08120027188798062  0.007249401469010135  0.058915152278251365  0.020500150410885872  0.012773925214033515  0.015319643990623239  0.0619072685051091  0.02469421261016071  0.03398803197097409  0.2795502360825406  0.0  0.006923703019952953  0.04890003114048208  0.0709440540383943  0.0  0.0  0.022927604726205864  0.0  0.044179441810758974  0.04271868631133792  0.014575110220699426  0.016278386029055326  0.024188222622722733  0.005831098918721803  0.16347130804672566  0.015857700141999075  0.030086809782506524  0.02538905750994986  0.0  0.10392049287879172  0.009882045525961999  0.042231078795736274  0.056522015454830064  0.0  0.01179103643704334  0.005611567350196463  0.06632740909036412
0.0  1.0000000000000004  0.0  0.0  0.0  0.0  0.013668276714564461  0.02377424

In [23]:
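## function to get the top 3 similar emails for a particular email in the dataframe.
## as of now, only the emails in the same dataframe are considered.
def getsimilaremail(testemail, cosine_sim, indices):
    ## row position of the query email; duplicate texts return several positions, so keep the first
    idx = indices[testemail]
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    ## pair each position with its score, leaving out the query email itself
    sim_score = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]
    sim_score = sorted(sim_score, key=lambda x: x[1], reverse=True)
    e_score = sim_score[:3]
    sim_email = [i[0] for i in e_score]
    return docsim['spell_checked'].iloc[sim_email]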
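
The ranking step can also be written without sorting the whole row. The sketch below is an alternative, not the function used in this notebook: `top_k_similar` and the toy scores are made up, and `heapq.nlargest` picks the k best candidates after the query position is excluded explicitly.

```python
import heapq

def top_k_similar(sim_row, query_idx, k=3):
    """Return the positions of the k highest scores in sim_row, excluding query_idx."""
    candidates = ((score, i) for i, score in enumerate(sim_row) if i != query_idx)
    best = heapq.nlargest(k, candidates)
    return [i for score, i in best]

# hypothetical similarity row for the email at position 2
toy_row = [0.10, 0.40, 1.00, 0.05, 0.30]
print(top_k_similar(toy_row, query_idx=2))  # → [1, 4, 0]
```

Excluding the query by position, rather than assuming it sorts first, keeps the result correct even when another email has the same score as the query itself.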


In [24]:
temail = docsim['spell_checked'].iloc[4]
print(temail)

money get software cos software incompatibility great grow old along best yet tragedies finish death remedies ended marriage


In [25]:
## step through the function's logic by hand to check it
idx = indices[temail]
print(idx)
sim_scores = list(enumerate(cosine_sim[idx]))
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
e_score = sim_scores[1:4]
sim_email = [i[0] for i in e_score]
print(sim_email)

4
[17, 16, 11]


In [26]:
## retrieve the top 3 most similar emails for any one email in the dataframe

print(getsimilaremail(temail, cosine_sim, indices))

17    localised software languages available hello w...
16    software guaranteed 100 legal name brand softw...
11    save money buy getting thing tried calls yet c...
Name: spell_checked, dtype: object
