### Cosine similarity project

### Text preprocessing

Before we vectorize two sentences or compare their similarity, it is crucial to have the right data, and by that I mean your data should be clean.
Otherwise, whatever model you create will be trained on noise and will most likely give you bad results. The cleaning steps used here are:




- Decontraction

The English language has many contractions. For instance:

you've -> you have

he's -> he is

These can cause headaches in natural language processing, because a contraction and its expanded form end up as different tokens.

- Removing special characters

Special characters are noise in your data and irrelevant for modeling, so we simply get rid of them.

- Removing HTTP URL links

(Not strictly necessary here, but it is good practice to remove this kind of junk from your text.)


- Removing stopwords

Stopwords are common English words that do not add much meaning to a sentence, for example "the", "he", and "have". They can usually be ignored without sacrificing the meaning of the sentence.
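To make the effect of each step visible, here is a minimal sketch that applies the four steps to one sample string. The function name `demo_clean`, the sample sentence, and the tiny `DEMO_STOPWORDS` set are assumptions made for illustration only; the notebook's own code uses the full nltk stopword list and more contraction patterns. URLs are removed before special characters, since stripping the punctuation first would leave the URL's pieces behind as ordinary words.

```python
import re

# Tiny stand-in stopword set, assumed for illustration only;
# the notebook itself loads the full English list from nltk.
DEMO_STOPWORDS = {"the", "he", "have", "is", "a", "you", "at"}

def demo_clean(text):
    """Apply the four cleaning steps in order and return the cleaned string."""
    text = text.lower()
    # 1. Decontraction (only two patterns shown here)
    text = re.sub(r"'ve\b", " have", text)
    text = re.sub(r"'s\b", " is", text)
    # 2. Remove http/https URLs
    text = re.sub(r"https?://\S+", " ", text)
    # 3. Replace runs of special characters with a single space
    text = re.sub(r"[^a-z0-9]+", " ", text)
    # 4. Drop stopwords
    return " ".join(w for w in text.split() if w not in DEMO_STOPWORDS)

print(demo_clean("He's read the docs at https://example.com, you've seen them!"))
# read docs seen them
```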




For all of the operations mentioned above, I've used Python's built-in **re** (regex) module and the **nltk** library to clean the text input.

In [15]:
from nltk.corpus import stopwords
import re


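# load stopwords from the nltk library (run nltk.download("stopwords") once beforehand);
# a set makes the membership test in preprocess_txt fast
stpwrds = set(stopwords.words("english"))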


# https://stackoverflow.com/a/47091490/4084039
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"didn\'t", "did not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    #phrase = re.sub(r'^https?:\/\/.*[\r\n]*', '', phrase, flags=re.MULTILINE)
    
    return phrase





def preprocess_txt(raw_text):
    """Clean every sentence in raw_text (an iterable of strings); return a list of cleaned strings."""
    preprocessed_text = []
    for sentence in raw_text:
        # remove email addresses (any whitespace-delimited token containing '@')
        sentence = ' '.join(e for e in sentence.split() if '@' not in e)

        sent = decontracted(sentence)  # expand contractions

        # remove URLs such as http://... or https://...
        sent = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', sent)

        # remove tags and bracketed text: <...> and [...]
        sent = re.sub(r'<[^>]*>', '', sent)
        sent = re.sub(r'\[[^\]]*\]', '', sent)

        # replace literal escape sequences (a backslash followed by r, " or n) with a space
        sent = sent.replace('\\r', ' ')
        sent = sent.replace('\\"', ' ')
        sent = sent.replace('\\n', ' ')

        sent = re.sub('[^A-Za-z0-9]+', ' ', sent)  # replace runs of special characters with a space

        #sent = re.sub(r"\d+", " ", sent)  #remove digits

        # remove stopwords; the nltk list is lowercase, so lowercase the input before calling this function
        sent = ' '.join(e for e in sent.split() if e not in stpwrds)

        preprocessed_text.append(sent.strip())

    return preprocessed_text

### Vectorizing the text using One hot encoding

- After cleaning your data, create an array that has all the unique words in your text corpus.
- If the number of unique words is 5, each sentence becomes a 5-dimensional vector.
- Creating the one-hot vector: take the words in your input and check whether each word is present in the unique-words list. If it is, put 1 at that word's position (index); every other position stays 0.

Note that a sentence vector built this way combines the one-hot vectors of its words, so it can contain several 1s.
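The steps above can be sketched in a few lines. The helper names `build_vocabulary` and `to_binary_vector` are hypothetical, and the two input strings are assumed to be already cleaned sentences; this is an illustration of the idea rather than the notebook's exact code.

```python
def build_vocabulary(sentences):
    """Return a sorted list of the unique words across all sentences."""
    return sorted({word for s in sentences for word in s.split()})

def to_binary_vector(sentence, vocabulary):
    """Return a list with 1 where the vocabulary word occurs in the sentence, else 0."""
    words = set(sentence.split())
    return [1 if word in words else 0 for word in vocabulary]

# Assumed, already-cleaned example sentences
cleaned = ["like paris", "like paris people france"]
vocab = build_vocabulary(cleaned)

print(vocab)                                # ['france', 'like', 'paris', 'people']
print(to_binary_vector(cleaned[0], vocab))  # [0, 1, 1, 0]
print(to_binary_vector(cleaned[1], vocab))  # [1, 1, 1, 1]
```

Sorting the vocabulary is a design choice made here so that the vector positions are the same on every run; any fixed order works as long as both sentences use the same one.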



### Finding cosine similarity

- Once you have vectorized your text using the steps mentioned above, apply the cosine similarity formula to the two vectors.
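For reference, the standard cosine similarity formula for two vectors is written out below. The symbols $A$, $B$, and $n$ are notation introduced here: $A$ and $B$ stand for the two sentence vectors and $n$ for the number of unique words.

```latex
\cos(\theta)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

With 0/1 vectors, the numerator is simply the number of words the two sentences share, and each norm is the square root of the number of distinct words in that sentence.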

In [16]:

from numpy import dot
from numpy.linalg import norm

def one_hot_vector(sentence_clean, unique_words):
    """Return a 0/1 list marking which of unique_words occur in sentence_clean."""
    one_hot_sentence = [0] * len(unique_words)
    for word in sentence_clean.split():
        if word in unique_words:
            ix = unique_words.index(word)
            one_hot_sentence[ix] = 1
    return one_hot_sentence

def cosine_similarity_sentences(sentence1, sentence2):
    # lowercase first, because the nltk stopword list is lowercase
    sentence1_clean = preprocess_txt([sentence1.lower()])[0]
    sentence2_clean = preprocess_txt([sentence2.lower()])[0]
    # vocabulary: the unique words of both cleaned sentences, sorted for a stable order
    unique_words = sorted(set((sentence1_clean + ' ' + sentence2_clean).split()))
    sentence1_vector = one_hot_vector(sentence1_clean, unique_words)
    sentence2_vector = one_hot_vector(sentence2_clean, unique_words)

    norm_product = norm(sentence1_vector) * norm(sentence2_vector)
    if norm_product == 0:
        # a sentence with no words left after cleaning gives a zero vector;
        # return 0.0 instead of dividing by zero
        return 0.0
    similarity = dot(sentence1_vector, sentence2_vector) / norm_product
    return similarity

In [17]:
s1 = "I like Paris"
s2 = "I like Paris and the people in France"

cosine_similarity_sentences(s1,s2)

0.7071067811865475

In [18]:
s1 = "I love France"
s2 = "I hate France"

cosine_similarity_sentences(s1,s2)

0.4999999999999999
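Both results can be checked by hand. The sketch below recomputes them in plain Python from the 0/1 vectors, assuming the cleaned sentences are "like paris" / "like paris people france" and "love france" / "hate france" (that is, "I", "and", "the", and "in" were removed as stopwords). The function name `cosine` and the vocabulary orders in the comments are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero numeric sequences."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_product / (norm_u * norm_v)

# Assumed vocabulary order: [like, paris, people, france]
paris_short = [1, 1, 0, 0]
paris_long = [1, 1, 1, 1]

# Assumed vocabulary order: [love, hate, france]
love = [1, 0, 1]
hate = [0, 1, 1]

print(round(cosine(paris_short, paris_long), 4))  # 0.7071
print(round(cosine(love, hate), 4))               # 0.5
```

The second pair still scores 0.5 even though the sentences express opposite feelings, because these vectors only record which words are present, not what they mean.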