# Cosine Similarity for Data Quality

A large majority of a data scientist's time is spent on activities other than actual modeling. One issue that can consume a lot of that time is data quality. In large corporations, the same data is often stored in a variety of places. Some of these sources are easier to access or much faster than others.

Perhaps the data comes from an external vendor, and you've been asked to confirm that the internal copy is the same. Whatever the situation, you have been asked to compare two data sources of text documents and confirm that they hold the same data. This walkthrough also assumes a 1-to-1 relationship between the text/document identifiers of the two data sources. To confirm the match at scale, we can compute the cosine similarity between each pair of corresponding documents.

A score above 0.80 indicates the documents are very similar or identical. The documents are unlikely to be a 100% match (a score of 1) because the data sources may encode the text differently. For instance, one source could use a different encoding, so that decoding introduces special characters or additional information into the text.
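As a hypothetical illustration of how a decoding mismatch can introduce stray characters, the sketch below encodes a string as UTF-8 and then decodes it as Latin-1. The sample text and the choice of encodings are assumptions made for this example, not something taken from a real data source.

```python
# Hypothetical sample text containing non-ASCII characters
original = "Café résumé"

# Encode as UTF-8, then decode with the wrong codec (Latin-1).
# Each accented letter becomes two unrelated characters.
garbled = original.encode("utf-8").decode("latin-1")

print(original)
print(garbled)
print(original == garbled)  # False: the two copies no longer match exactly
```

An exact string comparison would reject this pair even though both copies came from the same text, which is the kind of case a similarity score is meant to tolerate.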

The purpose of this notebook is to take a toy example and show how you can use machine learning for data quality in a use case that you will likely face in the world of NLP.


In [1]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd

## Example 1

In [2]:
text_one = ['My name is Bodie']

text_two = ['\n\n text_body : My name is Bodie']


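# Fit TF-IDF on both texts so they share one vocabulary
X = TfidfVectorizer().fit_transform([text_one[0],text_two[0]])
array = X.toarray()
# Pairwise similarity matrix; entry [0][1] compares text one with text two
value = cosine_similarity(array)
cos = value[0][1]
print(cos)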

0.8181802073667197


In the example above, the two texts are clearly the same but are stored differently. As a result of the different storage formats, the cosine similarity score is about 0.82.
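To see where that score comes from, the arithmetic can be reproduced by hand. This is a minimal sketch that assumes scikit-learn's default `TfidfVectorizer` settings (lowercasing, tokens of two or more word characters, smoothed IDF, raw term counts). Under those assumptions the two texts share four tokens (`my`, `name`, `is`, `bodie`) and the second text adds one extra token, `text_body`.

```python
import math

n_docs = 2


def smoothed_idf(doc_freq):
    # Smoothed IDF under scikit-learn's defaults: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


idf_shared = smoothed_idf(2)  # tokens that appear in both texts
idf_extra = smoothed_idf(1)   # 'text_body' appears only in the second text

# Each token occurs once, so its TF-IDF weight equals its IDF
vec_one = [idf_shared] * 4 + [0.0]
vec_two = [idf_shared] * 4 + [idf_extra]

dot = sum(a * b for a, b in zip(vec_one, vec_two))
norm_one = math.sqrt(sum(a * a for a in vec_one))
norm_two = math.sqrt(sum(b * b for b in vec_two))
manual_cos = dot / (norm_one * norm_two)

print(round(manual_cos, 4))  # 0.8182
```

The single extra token is enough to pull the score down from 1, because with only five distinct tokens each one carries a lot of weight. Longer documents should be less sensitive to one stray token.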

## Example 2

In the previous example, we compared just two pieces of text. Now let's compare a few more so we get an idea of what this looks like at scale.

In [3]:
headlines = ['Life Insurance - Why Pay More?',
'FORTUNE 500 COMPANY HIRING, AT HOME REPS.',
 'Re: PROTECT YOUR COMPUTER AGAINST HARMFUL VIRUSES! 21198']

new_headlines =['message : \n\n\n\n\n\\ Life Insurance - Why Pay More?',
'FORTUNE 500 COMPANY HIRING, AT HOME REPS.',
 'My name is Bodie']


Note: The headlines come from a dataset I used in grad school. I believe the original data source is Kaggle.

In [4]:
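# Pair each original headline with its counterpart from the second source
combine_list = list(zip(headlines,new_headlines))


Cosine_Score = []

for text in combine_list:
    # Fit a fresh TF-IDF vocabulary on each pair of texts
    X = TfidfVectorizer().fit_transform([text[0],text[1]])
    array = X.toarray()
    # Entry [0][1] of the similarity matrix compares the two texts
    value = cosine_similarity(array)
    cos = value[0][1]
    Cosine_Score.append(cos)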

A simple for loop computes the cosine score for each pair and appends it to an initially empty list.
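The same loop could also be wrapped in a small reusable function. The sketch below is one possible way to do that; `score_pairs` and `example_pairs` are hypothetical names introduced here, and the helper fits a fresh `TfidfVectorizer` per pair, as the loop does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def score_pairs(pairs):
    """Return one cosine similarity score per (original, new) text pair.

    Hypothetical helper: a fresh TfidfVectorizer is fit on each pair,
    mirroring the loop in the notebook.
    """
    scores = []
    for original, new in pairs:
        tfidf = TfidfVectorizer().fit_transform([original, new])
        # cosine_similarity returns a 2x2 matrix; [0, 1] compares the two texts
        scores.append(float(cosine_similarity(tfidf)[0, 1]))
    return scores


example_pairs = [
    ("FORTUNE 500 COMPANY HIRING, AT HOME REPS.",
     "FORTUNE 500 COMPANY HIRING, AT HOME REPS."),
    ("Re: PROTECT YOUR COMPUTER AGAINST HARMFUL VIRUSES! 21198",
     "My name is Bodie"),
]
example_scores = score_pairs(example_pairs)
print(example_scores)
```

One limitation of this sketch is that it does not handle pairs in which neither text contains a usable token (for example, two empty strings). In that case `TfidfVectorizer` raises a `ValueError` for an empty vocabulary.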

In [5]:
df= pd.DataFrame(combine_list,columns=['Original_Headline','new_headlines'])
df['Cosine_Score'] = Cosine_Score

Now that we have our scores, let's roll them up into a DataFrame.

In [6]:
df.head()

|   | Original_Headline | new_headlines | Cosine_Score |
|---|---|---|---|
| 0 | Life Insurance - Why Pay More? | message : \n\n\n\n\n\ Life Insurance - Why Pay... | 0.846647 |
| 1 | FORTUNE 500 COMPANY HIRING, AT HOME REPS. | FORTUNE 500 COMPANY HIRING, AT HOME REPS. | 1.0 |
| 2 | Re: PROTECT YOUR COMPUTER AGAINST HARMFUL VIRU... | My name is Bodie | 0.0 |


### Text One

In [7]:
print(headlines[0])
print("----------------------")
print(new_headlines[0])

Life Insurance - Why Pay More?
----------------------
message : 




\ Life Insurance - Why Pay More?


We see the first row has a cosine similarity score of about 0.85. The high score indicates that these documents are likely a match. When we review the text itself, we see that 'message' is introduced into new_headlines, which pulls the score below 1.

### Text Two

In [8]:
print(headlines[1])
print("----------------------")
print(new_headlines[1])

FORTUNE 500 COMPANY HIRING, AT HOME REPS.
----------------------
FORTUNE 500 COMPANY HIRING, AT HOME REPS.


The second text has a cosine similarity score of 1. This is great: it indicates that the two texts have identical TF-IDF representations, and the print statements above confirm they are exact matches.


### Text Three

Finally, we have a cosine similarity score of 0. A score of 0 indicates that these documents share no terms and are not similar at all. This would prompt further investigation into these texts. Let's print them and take a look!

In [9]:
print(headlines[2])
print("----------------------")
print(new_headlines[2])

Re: PROTECT YOUR COMPUTER AGAINST HARMFUL VIRUSES! 21198
----------------------
My name is Bodie


Clearly, these are two different texts. The cosine similarity score caught this and flagged it for us. The same check can therefore be applied at scale to verify thousands of documents.
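At scale, nobody will print every pair, so one option is to filter the scores and review only the rows that fall below a cutoff. The sketch below rebuilds the three scores from this walkthrough in a small DataFrame. The 0.80 cutoff and the names `scores`, `THRESHOLD`, and `needs_review` are assumptions made for illustration.

```python
import pandas as pd

# Scores copied from the walkthrough; the document IDs are just row positions
scores = pd.DataFrame({
    "Original_Headline": [
        "Life Insurance - Why Pay More?",
        "FORTUNE 500 COMPANY HIRING, AT HOME REPS.",
        "Re: PROTECT YOUR COMPUTER AGAINST HARMFUL VIRUSES! 21198",
    ],
    "Cosine_Score": [0.846647, 1.0, 0.0],
})

# Assumed rule-of-thumb cutoff; tune it for your own data
THRESHOLD = 0.80

# Keep only the pairs that look like mismatches and need a manual look
needs_review = scores[scores["Cosine_Score"] < THRESHOLD]
print(needs_review)
```

Only the third row is flagged here, which matches what the manual inspection found.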

# Final Thoughts

My hope is that this notebook was useful. I've often seen other tutorials teach cosine similarity as a foundation for NLP work, but I haven't seen it used as a data quality check in any use case. I was excited to use machine learning in a creative way.