# Testing TF-IDF for semantic similarity

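This notebook tests how well TF-IDF vectors, compared by cosine similarity, reproduce the human similarity scores of the STS benchmark. We are using the scikit-learn [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).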
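
As a quick illustration of the idea, the sketch below vectorizes three toy sentences with word-level TF-IDF and compares them by cosine similarity. The sentences and the names `toy_sentences` and `toy_vectors` are made up for this example and are not part of the benchmark; the vectorizer uses scikit-learn's default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up sentences: the first two are identical, the third shares no words
toy_sentences = [
    "a plane is taking off",
    "a plane is taking off",
    "three men play chess",
]

# Fit on the toy corpus and turn each sentence into a TF-IDF vector
toy_vectors = TfidfVectorizer().fit_transform(toy_sentences)

same = cosine_similarity(toy_vectors[0], toy_vectors[1])[0][0]
different = cosine_similarity(toy_vectors[0], toy_vectors[2])[0][0]

print(f"identical sentences: {same:.2f}")
print(f"no shared words:     {different:.2f}")
```

Identical sentences score 1 and sentences without a shared token score 0, so real sentence pairs land somewhere in between.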

In [1]:
import pandas as pd
import numpy as np
import nltk

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## 1. Load the STS benchmark data and skip lines that contain errors

In [4]:
train_df = pd.read_table(
    'Dataset/sts-train.csv',
    on_bad_lines='skip',
    skip_blank_lines=True,
    usecols=[4, 5, 6],
    names=["score", "s1", "s2"])
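
Skipping bad lines silently drops sentence pairs. One possible cause of unparseable lines in tab-separated text is an unbalanced double quote, which the default quoting rules treat as the start of a quoted field. Whether that is what happens in `sts-train.csv` is an assumption that is not verified here. The sketch below uses a made-up in-memory sample (`sample`, `sample_df`) to show how `quoting=csv.QUOTE_NONE` makes pandas read quote characters as ordinary text.

```python
import csv
import io

import pandas as pd

# Toy stand-in for the benchmark file: seven tab-separated fields per line,
# with an unbalanced double quote in the second row (made-up data).
sample = (
    "g\tf\ty\t1\t5.0\tA plane is taking off.\tAn air plane is taking off.\n"
    "g\tf\ty\t2\t2.0\t\"Quoted start\tSecond sentence.\n"
    "g\tf\ty\t3\t1.0\tThird pair.\tAnother sentence.\n"
)

sample_df = pd.read_table(
    io.StringIO(sample),
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    usecols=[4, 5, 6],
    names=["score", "s1", "s2"])

print(len(sample_df))
print(sample_df.loc[1, "s1"])
```

If the row count of the real file changes with this option, quoting is a likely cause of the skipped lines; if it does not, the cause lies elsewhere.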


## 2. A quick look at the dataset we are using

In [5]:
train_df.head()

Unnamed: 0,score,s1,s2
0,5.0,A plane is taking off.,An air plane is taking off.
1,3.8,A man is playing a large flute.,A man is playing a flute.
2,3.8,A man is spreading shreded cheese on a pizza.,A man is spreading shredded cheese on an uncoo...
3,2.6,Three men are playing chess.,Two men are playing chess.
4,4.25,A man is playing the cello.,A man seated is playing the cello.


In [6]:
train_df.tail()

Unnamed: 0,score,s1,s2
5706,0.0,Severe Gales As Storm Clodagh Hits Britain,Merkel pledges NATO solidarity with Latvia
5707,0.0,Dozens of Egyptians hostages taken by Libyan t...,Egyptian boat crash death toll rises as more b...
5708,0.0,President heading to Bahrain,President Xi: China to continue help to fight ...
5709,0.0,"China, India vow to further bilateral ties",China Scrambles to Reassure Jittery Stock Traders
5710,0.0,Putin spokesman: Doping charges appear unfounded,The Latest on Severe Weather: 1 Dead in Texas ...


## 3. Comparing two sentence pairs with TF-IDF as an example

In [7]:
# Select the sentences by row label and column name
s1 = train_df.loc[0, 's1']
s2 = train_df.loc[0, 's2']
s3 = train_df.loc[45, 's1']
s4 = train_df.loc[45, 's2']

print(f's1 = {s1}')
print(f's2 = {s2}')
print('\n')
print(f's3 = {s3}')
print(f's4 = {s4}')

s1 = A plane is taking off.
s2 = An air plane is taking off.


s3 = A man is playing the piano.
s4 = A woman is playing the violin.


### 3.1 Fit the training data with TfidfVectorizer and create vectors for the sentence pairs

In [58]:
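# Collect every sentence from both columns of the training set
sentences = []

for row in train_df.itertuples(index=False):
    sentences.extend((str(row.s1), str(row.s2)))

# Character n-grams of 3 to 5 characters, taken inside word boundaries
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5))
vectorizer.fit(sentences)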

sentence_vectors = vectorizer.transform([s1, s2, s3, s4])

s1_vec = sentence_vectors[0]
s2_vec = sentence_vectors[1]
s3_vec = sentence_vectors[2]
s4_vec = sentence_vectors[3]

# Cosine similarity of TF-IDF vectors lies between 0 and 1;
# multiply by 5 to compare it with the 0-5 human score
sim_12 = cosine_similarity(s1_vec, s2_vec)[0][0]
print(f's1 vs s2 = {sim_12}')
print(f"Human score = {train_df.loc[0, 'score']}")
print(f'TF-IDF score = {round(sim_12 * 5, 1)}')

sim_34 = cosine_similarity(s3_vec, s4_vec)[0][0]
print(f's3 vs s4 = {sim_34}')
print(f"Human score = {train_df.loc[45, 'score']}")
print(f'TF-IDF score = {round(sim_34 * 5, 1)}')

# Cross-pair comparisons of unrelated sentences
print(f's1 vs s3 = {cosine_similarity(s1_vec, s3_vec)[0][0]}')
print(f's1 vs s4 = {cosine_similarity(s1_vec, s4_vec)[0][0]}')


s1 vs s2 = 0.8888452529233067
Human score = 5.0
TF-IDF score = 4.4
s3 vs s4 = 0.4142671985203453
Human score = 1.0
TF-IDF score = 2.1
s1 vs s3 = 0.10401728563003457
s1 vs s4 = 0.08905307776527066
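
To see what the `char_wb` analyzer with `ngram_range=(3, 5)` actually feeds into TF-IDF, the sketch below prints the character n-grams produced for a single word. `demo_vectorizer`, `analyze` and `grams` are names introduced only for this illustration; `build_analyzer()` does not require a fitted vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same analyzer settings as above, on a separate, unfitted vectorizer
demo_vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5))
analyze = demo_vectorizer.build_analyzer()

# Each word is padded with one space on both sides before n-grams are cut
grams = analyze("plane")
print(grams)
```

Because the features are word fragments rather than whole words, sentences that share function words and verb phrases overlap heavily. This is consistent with s3 and s4 scoring well above their human score: both contain "is playing the".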


## 4. Getting the human score and the TF-IDF scores and comparing them

### 4.1 Load the data and preprocess it

In [10]:
dev_df = pd.read_table(
    'Dataset/sts-dev.csv',
    on_bad_lines='skip',
    skip_blank_lines=True,
    usecols=[4, 5, 6],
    names=["score", "s1", "s2"])

# Keeps only runs of word characters, which removes punctuation from the sentences
tokenizer = nltk.RegexpTokenizer(r"\w+")

# Some sentences are read as float instead of str (most likely missing values
# parsed as NaN), so cast both columns to str
dev_df['s1'] = dev_df['s1'].astype(str)
dev_df['s2'] = dev_df['s2'].astype(str)

# Tokenize, rejoin the tokens with single spaces, and lowercase
dev_df['s1'] = dev_df['s1'].apply(lambda s: ' '.join(tokenizer.tokenize(s)).lower())
dev_df['s2'] = dev_df['s2'].apply(lambda s: ' '.join(tokenizer.tokenize(s)).lower())

In [11]:
dev_df.head()

Unnamed: 0,score,s1,s2
0,5.0,a man with a hard hat is dancing,a man wearing a hard hat is dancing
1,4.75,a young child is riding a horse,a child is riding a horse
2,5.0,a man is feeding a mouse to a snake,the man is feeding a mouse to the snake
3,2.4,a woman is playing the guitar,a man is playing guitar
4,2.75,a woman is playing the flute,a man is playing a flute
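
The preprocessing step can be checked on individual sentences. The sketch below is a standalone stand-in that uses `re.findall` with the same `\w+` pattern instead of the NLTK tokenizer; the helper name `normalize` is hypothetical and is not used elsewhere in the notebook.

```python
import re

def normalize(sentence):
    """Keep runs of word characters, join them with spaces, and lowercase."""
    return ' '.join(re.findall(r"\w+", sentence)).lower()

print(normalize("A man with a hard hat is dancing."))
print(normalize("One woman is measuring another woman's ankle."))
```

Note that the apostrophe is dropped as punctuation, so "woman's" becomes the two tokens "woman" and "s".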


### 4.2 Get the scores and normalize them

In [55]:
# Rescale the human scores from the 0-5 range to 0-1
score_human = [score / 5 for score in dev_df['score']]

In [56]:
score_machine = []

for row in dev_df.itertuples(index=False):
    sentence_vectors = vectorizer.transform([str(row[1]), str(row[2])])
    s1_vec = sentence_vectors[0]
    s2_vec = sentence_vectors[1]
    score = cosine_similarity(s1_vec,s2_vec)[0][0]
    score_machine.append(score)

In [57]:
from scipy.stats import pearsonr

# Pearson correlation between TF-IDF and human scores, reported as a percentage
result, _ = pearsonr(score_machine, score_human)
print(f'Pearsonr: {result * 100:.1f}')

Pearsonr: 74.2
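
Pearson's r measures linear association and does not change when one variable is multiplied by a positive constant, so dividing the human scores by 5 affects readability only, not the result. The sketch below checks this on made-up numbers (`human_raw` and `machine` are illustrative values, not benchmark data) and recomputes r from its definition.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up scores for five sentence pairs
human_raw = np.array([5.0, 4.75, 2.4, 0.0, 3.8])     # 0-5 scale
machine = np.array([0.89, 0.80, 0.55, 0.10, 0.60])   # 0-1 scale

r_raw, _ = pearsonr(machine, human_raw)
r_scaled, _ = pearsonr(machine, human_raw / 5)

# Definition: covariance of the centered values divided by the
# product of their standard deviations
x = machine - machine.mean()
y = human_raw - human_raw.mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(r_raw, r_scaled, r_manual)
```

All three values agree up to floating-point error.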


### 4.3 Compare human and TF-IDF scores on the test set

In [30]:
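# Handler for malformed lines: keep only the first seven fields
def on_bad_line(values):
    return values[:7]

# A callable for on_bad_lines is only supported by the Python engine
test_df = pd.read_table(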
    'Dataset/sts-test.csv',
    on_bad_lines=on_bad_line,
    skip_blank_lines=True,
    engine='python',
    usecols=[4, 5, 6],
    names=["score", "s1", "s2"])

# Some sentences are read as float instead of str (most likely missing values
# parsed as NaN), so cast both columns to str
test_df['s1'] = test_df['s1'].astype(str)
test_df['s2'] = test_df['s2'].astype(str)

# Tokenize, rejoin the tokens with single spaces, and lowercase
test_df['s1'] = test_df['s1'].apply(lambda s: ' '.join(tokenizer.tokenize(s)).lower())
test_df['s2'] = test_df['s2'].apply(lambda s: ' '.join(tokenizer.tokenize(s)).lower())

In [21]:
dev_df

Unnamed: 0,score,s1,s2
0,5.00,a man with a hard hat is dancing,a man wearing a hard hat is dancing
1,4.75,a young child is riding a horse,a child is riding a horse
2,5.00,a man is feeding a mouse to a snake,the man is feeding a mouse to the snake
3,2.40,a woman is playing the guitar,a man is playing guitar
4,2.75,a woman is playing the flute,a man is playing a flute
...,...,...,...
1465,2.00,scientists prove there is water on mars,has nasa discovered water on mars
1466,0.00,pranab stresses need to strive for peace by na...,wto india regrets action of developed nations
1467,2.00,volkswagen skids into red in wake of pollution...,volkswagen s gesture of goodwill to diesel owners
1468,0.00,obama is right africa deserves better leadership,obama waiting for midterm to name attorney gen...


In [31]:
test_df

Unnamed: 0,score,s1,s2
0,2.50,a girl is styling her hair,a girl is brushing her hair
1,3.60,a group of men play soccer on the beach,a group of boys are playing soccer on the beach
2,5.00,one woman is measuring another woman s ankle,a woman measures another woman s ankle
3,4.20,a man is cutting up a cucumber,a man is slicing a cucumber
4,1.50,a man is playing a harp,a man is playing a keyboard
...,...,...,...
1091,4.00,so in his state of the union address in januar...,in his jan 28 state of the union message bush ...
1092,4.00,the other 24 members are split between represe...,of the 24 directors who are not exchange execu...
1093,2.75,the episcopal diocese of central florida becam...,the episcopal diocese of central florida voted...
1094,2.25,mcgill also detailed the hole that had been cu...,mcgill also said a dark glove was stuffed into...


In [47]:
# Rescale the human scores from the 0-5 range to 0-1
score_human = [score / 5 for score in test_df['score']]

In [48]:
score_machine = []

for row in test_df.itertuples(index=False):
    sentence_vectors = vectorizer.transform([str(row[1]), str(row[2])])
    s1_vec = sentence_vectors[0]
    s2_vec = sentence_vectors[1]
    score = cosine_similarity(s1_vec,s2_vec)[0][0]
    score_machine.append(score)

In [49]:
from scipy.stats import pearsonr

# Pearson correlation between TF-IDF and human scores, reported as a percentage
result, _ = pearsonr(score_machine, score_human)
print(f'Pearsonr: {result * 100:.1f}')

Pearsonr: 70.8
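
The scoring loops above call `transform` and `cosine_similarity` once per sentence pair. A possible alternative, sketched below on made-up sentences, transforms each column in one call and takes row-wise dot products. This relies on `TfidfVectorizer` L2-normalizing its output by default (`norm='l2'`), so the dot product of two rows equals their cosine similarity. The names `left`, `right` and `toy_vectorizer` are introduced for this example only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up sentence pairs: left[i] is compared with right[i]
left = ["a girl is styling her hair",
        "a man is playing a harp",
        "a man is cutting up a cucumber"]
right = ["a girl is brushing her hair",
         "a man is playing a keyboard",
         "a man is slicing a cucumber"]

toy_vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5))
toy_vectorizer.fit(left + right)

# One transform call per column instead of one per sentence pair
left_vecs = toy_vectorizer.transform(left)
right_vecs = toy_vectorizer.transform(right)

# Row-wise dot product of L2-normalized vectors equals cosine similarity
batch_scores = np.asarray(left_vecs.multiply(right_vecs).sum(axis=1)).ravel()

# Reference: the per-pair approach used in the loops above
loop_scores = [cosine_similarity(left_vecs[i], right_vecs[i])[0][0]
               for i in range(len(left))]

print(np.allclose(batch_scores, loop_scores))
```

Both approaches give the same scores; the batched version only avoids the per-row overhead.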
