# Sentence similarity

In this notebook, we explore sentence similarity using four approaches: TF-IDF, Doc2Vec, InferSent, and Sentence-BERT.

In [7]:
import nltk
import torch
import numpy as np
import pandas as pd
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from models.models import InferSent
from gensim.models.doc2vec import Doc2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
nltk.download('punkt')

# Number of most similar news items to return for each query
K = 5

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/diogo.ferreira/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The dataset contains the first 1,000 news items from https://www.kaggle.com/rmisra/news-category-dataset. We use only the "short_description" field. Given a search query, the goal is to retrieve the 5 most similar news descriptions with each approach.

In [8]:
dataset = pd.read_json('dataset/News_Category_Dataset_v2_1000.json', lines=True)

In [9]:
news_description = dataset["short_description"]
print(news_description)

0      She left her husband. He killed their children...
1                               Of course it has a song.
2      The actor and his longtime girlfriend Anna Ebe...
3      The actor gives Dems an ass-kicking for not fi...
4      The "Dietland" actress said using the bags is ...
                             ...                        
995    In one instance, the politician reportedly thr...
996    Federal immigration officials say that in sanc...
997    "‘Pulling out’ is a bit of a misnomer," the Fo...
998                                Orange you intrigued?
999    Sometimes we can’t help but wonder if Mother N...
Name: short_description, Length: 1000, dtype: object


The first step is to create a numerical representation of each news description.

## TF-IDF
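The first approach uses TF-IDF to represent each description. Each description is lowercased and lemmatized with WordNet before the TF-IDF matrix is built. The lemmatizer requires the NLTK `wordnet` corpus, which can be fetched with `nltk.download('wordnet')` if it is not already installed.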

In [11]:
lemmatizer = WordNetLemmatizer()
tfidf_vectorizer = TfidfVectorizer()

In [12]:
descriptions_lemmatized = [" ".join([lemmatizer.lemmatize(token.lower()) for token in word_tokenize(description)]) for description in news_description.values ]

In [13]:
descriptions_representation_tfidf = tfidf_vectorizer.fit_transform(descriptions_lemmatized)
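
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed IDF formula and L2 normalization that `TfidfVectorizer` applies by default, but it uses plain whitespace tokenization, so its numbers will not match scikit-learn exactly. The function `tfidf_vectors` and the corpus `toy_tfidf_corpus` are hypothetical names introduced only for this illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for a list of strings."""
    # Simplification: lowercase and split on whitespace only.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        # L2-normalize so that every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocabulary, vectors


toy_tfidf_corpus = [
    "the senate vote is today",
    "the movie is out today",
    "the senate passed the bill",
]
toy_vocabulary, toy_vectors = tfidf_vectors(toy_tfidf_corpus)
first_document = dict(zip(toy_vocabulary, toy_vectors[0]))
# "vote" appears in one document only, so it outweighs the ubiquitous "the".
print(first_document["vote"] > first_document["the"])
```

Terms that appear in few descriptions receive larger weights, which is why a query sharing a rare word with a description tends to rank that description highly under TF-IDF.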

## Doc2Vec
The second approach creates description representations with Doc2Vec, as implemented in Gensim. The pre-trained model was downloaded from https://github.com/jhlau/doc2vec. Make sure you download one of the pre-trained Doc2Vec models and save it in the models folder.

In [14]:
file = "models/doc2vec.bin"
doc2vec_model = Doc2Vec.load(file)

In [15]:
start_alpha = 0.01
infer_epoch = 1000
# Lowercase and tokenize each description before inferring its vector.
documents = [nltk.word_tokenize(description.lower()) for description in news_description]
# Note: `steps` is the Gensim 3.x argument name; Gensim 4 renamed it to `epochs`.
embeddings_doc2vec = [doc2vec_model.infer_vector(document, alpha=start_alpha, steps=infer_epoch)
                      for document in documents]

## InferSent
The third approach creates description embeddings with InferSent. The pre-trained model comes from https://github.com/facebookresearch/InferSent. Download the word embeddings from https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip and the InferSent model from https://dl.fbaipublicfiles.com/infersent/infersent2.pkl. Save both in the models folder, unzipping the word embeddings so that `crawl-300d-2M.vec` is available.

In [16]:
V = 2
MODEL_PATH = 'models/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))
W2V_PATH = 'models/crawl-300d-2M.vec'
infersent.set_w2v_path(W2V_PATH)

In [17]:
infersent.build_vocab(news_description.values, tokenize=True)

Found 4415(/4467) words with w2v vectors
Vocab size : 4415


In [18]:
embeddings_infersent = infersent.encode(news_description.values, tokenize=True)

## Sentence-BERT
The last approach creates description embeddings with the `distilbert-base-nli-stsb-mean-tokens` model from the Sentence Transformers library.

In [19]:
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

In [20]:
embeddings_distilbert = model.encode(news_description.values)
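
The InferSent configuration above sets `'pool_type': 'max'`, while the Sentence Transformers model name ends in `mean-tokens`. Both refer to how a variable number of per-token vectors is collapsed into one fixed-size sentence vector. The sketch below illustrates the two pooling strategies on made-up numbers. It is a simplification: the real models pool contextual hidden states, and Sentence Transformers also masks out padding tokens. The helper names and `toy_token_vectors` are hypothetical.

```python
def max_pool(token_vectors):
    """Element-wise maximum over all token vectors (InferSent-style pooling)."""
    return [max(dimension) for dimension in zip(*token_vectors)]


def mean_pool(token_vectors):
    """Element-wise average over all token vectors ("mean-tokens" pooling)."""
    n_tokens = len(token_vectors)
    return [sum(dimension) / n_tokens for dimension in zip(*token_vectors)]


# Three tokens, each represented by a made-up 3-dimensional vector.
toy_token_vectors = [
    [0.2, -1.0, 0.5],
    [0.8, 0.0, -0.5],
    [-0.4, 1.0, 0.3],
]

# Both results have 3 dimensions regardless of the number of tokens.
print(max_pool(toy_token_vectors))
print(mean_pool(toy_token_vectors))
```

Either way, sentences of different lengths end up as vectors of the same dimension, which is what makes them directly comparable.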

Now that the description representations are computed, we need to compute the representation of the input query and its similarity to every description. The function below computes the cosine similarity between the query vector and all description vectors and returns the indexes of the K most similar descriptions.

In [21]:
def find_similar(vector_representation, all_representations, k=1):
    # Similarity between the single query vector (row 0) and every description.
    similarities = cosine_similarity(vector_representation, all_representations)[0]
    # The query is not part of the corpus, so there is no self-match to mask out;
    # zeroing the "diagonal" of this 1 x N matrix would wrongly hide description 0.
    # Return the indexes of the k highest similarities, most similar first.
    return np.argsort(similarities)[::-1][:k]
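
For intuition, cosine similarity and top-k retrieval can be sketched without any library. This is an illustrative reimplementation on made-up 2-dimensional vectors, not the code the notebook runs. The names `cosine`, `top_k_indexes`, `toy_vectors_2d` and `toy_query` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Convention chosen here: a zero vector is similar to nothing.
    return dot / (norm_u * norm_v)


def top_k_indexes(query, corpus, k):
    """Indexes of the k corpus vectors most similar to the query, best first."""
    scores = [cosine(query, vector) for vector in corpus]
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


toy_vectors_2d = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.0]

# The query points the same way as vector 0 and almost the same way as vector 1.
print(top_k_indexes(toy_query, toy_vectors_2d, k=2))  # -> [0, 1]
```

Because cosine similarity depends only on the angle between vectors, not their length, long and short descriptions are compared on equal footing.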
    

Let's retrieve the most similar news for several queries and compare the results of the different algorithms.

In [23]:
descriptions = ["Democrats win republicans in election.",
                "Another whistleblower complaint in EU.",
                "It was released the best movie of the year.",
                "Stand-up comedy solo in London this year.",
                "Artificial intelligence will take over our jobs.",
                "The police caught the murderer and he is now in custody, waiting for the court decision.",
                "They won the cup in their home stadium."]

for description in descriptions:
    print("Description: {}".format(description))
    print()
    
    tf_idf_similar_indexes = find_similar(tfidf_vectorizer.transform([" ".join([lemmatizer.lemmatize(token.lower()) for token in word_tokenize(description)])]), descriptions_representation_tfidf, K)
    print("5 most similar descriptions using TF-IDF")
    for index in tf_idf_similar_indexes:
        print(news_description[index])
    print()
    
    doc2vec_similar_indexes = find_similar([doc2vec_model.infer_vector([token for token in nltk.word_tokenize(description.lower())], alpha=start_alpha, steps=infer_epoch)], embeddings_doc2vec, K)
    print("5 most similar descriptions using Doc2Vec")
    for index in doc2vec_similar_indexes:
        print(news_description[index])
    print()
    
    infersent_similar_indexes = find_similar(infersent.encode([description], tokenize=True), embeddings_infersent, K)
    print("5 most similar descriptions using Infersent")
    for index in infersent_similar_indexes:
        print(news_description[index])
    print()
    
    distilbert_similar_indexes = find_similar(model.encode([description]), embeddings_distilbert, K)
    print("5 most similar descriptions using Sentence-Bert")
    for index in distilbert_similar_indexes:
        print(news_description[index])
    print()

Description: Democrats win republicans in election.

5 most similar descriptions using TF-IDF
Democrat Richard Cordray will face Republican Mike DeWine in November.
Ohio state Sen. Troy Balderson now will face a Democrat in an Aug. 7 special election.
Republican Morrisey will face Sen. Joe Manchin, a conservative Democrat who has voted for the president's agenda 61 percent of the time.
Haspel looks all but assured to win confirmation in a vote before the full Senate.
"I win either way," second-place finisher Caleb Lee Hutchinson said.

5 most similar descriptions using Doc2Vec
*Sends giftbasket to Marvel*
Democrats are targeting the seat, and a former Marine is their candidate.
He defeated two congressmen and will challenge Democratic Sen. Joe Donnelly in November.
Vote counting will begin Saturday.
Paulette Jordan won the Democratic primary in the Idaho governor's race.

5 most similar descriptions using Infersent
Democrat Richard Cordray will face Republican Mike DeWine in November.


5 most similar descriptions using Doc2Vec
An inmate’s sex assigned at birth will now be used to make the initial decision as to where transgender prisoners are housed.
*Sends giftbasket to Marvel*
It's over.
The officers who took Robinson into custody have been reassigned pending an investigation.
The TV reboot's hero now lives on the razor's edge.

5 most similar descriptions using Infersent
The officers who took Robinson into custody have been reassigned pending an investigation.
A male student, believed to be the suspect, has been detained, according to police.
A police sergeant on the scene denied even knowing what Airbnb was.
Trump has asked the Justice Department to look into whether an FBI informant infiltrated his campaign for political purposes.

5 most similar descriptions using Sentence-Bert
*Patiently waits for first photo*
The 26-year-old's plea comes amid rioting at the prison where he has been held without trial since 2016.
They still deserve detention.
But don't count o