# Text Similarity Problem

Problem Type: Unsupervised, Text Similarity, Information Retrieval Problem  
Solution Options: Lexical and Semantic  
Analysis Options: Cosine Similarity and Soft Cosine Measure (SCM)  

#### Sources  
- Activity: https://towardsdatascience.com/how-to-rank-text-content-by-semantic-similarity-4d2419a84c32
- Cosine Similarity: https://realpython.com/build-recommendation-engine-collaborative-filtering/#how-to-find-similar-users-on-the-basis-of-ratings
- Soft Cosine Measure: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb

#### Main Reference 
Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from Your Data by Dipanjan Sarkar

#### tfidf.rank_documents
 
- Used the tfidf.rank_documents(search_terms: str, documents: list) function to score documents by how many terms they share with the search terms
- No documentation for this function could be found, so the cells below reproduce the scoring directly with scikit-learn
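
Since the function is undocumented, here is a minimal sketch of what a helper with that signature might look like. The name `rank_documents` and its return format are assumptions made for illustration, not a confirmed API; the body uses only scikit-learn's `TfidfVectorizer` and `linear_kernel`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def rank_documents(search_terms: str, documents: list) -> list:
    """Hypothetical helper: one TF-IDF cosine score per document, in input order."""
    # Fit on the query plus the documents so they share one vocabulary
    vectors = TfidfVectorizer().fit_transform([search_terms] + documents)
    # Rows are L2-normalized by default, so the dot product is the cosine
    scores = linear_kernel(vectors[0:1], vectors[1:]).flatten()
    return [float(score) for score in scores]


docs = ['cars drive on the road',
        'tomatoes are actually fruit',
        'I would like to eat more vegetables']
scores = rank_documents('You are trying to eat healthier', docs)

# Print the documents from most to least similar
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f'{score:0.3f}\t{doc}')
```

Returning the scores in input order keeps the helper easy to combine with other metadata; sorting is left to the caller.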

# Steps Involved:

1. Text Normalization
    - Tokenization
    - Lemmatization
2. Information Retrieval
    - Define your search term
3. Feature Engineering
    - TF-IDF (WordNet Lemmatizer vs Gensim)
4. Similarity Measure
    - Cosine Similarity vs Soft Cosine Measure (GloVe, Word2Vec)
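
To make the TF-IDF step concrete, the sketch below rebuilds the weights by hand and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per row); other libraries, including gensim, use different weighting defaults.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['cars drive on the road',
        'tomatoes are actually fruit',
        'I would like to eat more vegetables']

# Raw term counts: one row per document, one column per vocabulary term
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit Euclidean length
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library result
sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```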

# Get input from user

In [37]:
# Read the search terms from the user with input()
  
search_terms = input("Type in your sentence: ")
search_terms

Type in your sentence: You are trying to eat healthier


'You are trying to eat healthier'

# A - Lexical Similarity

## Simplest solution with no tokenizer/lemmatization 

using tfidf.rank_documents(search_terms: str, documents: list)

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


documents = ['cars drive on the road', 'tomatoes are actually fruit', 'I would like to eat more vegetables']

doc_vectors = TfidfVectorizer().fit_transform([search_terms] + documents)

cosine_similarities = linear_kernel(doc_vectors[0:1], doc_vectors).flatten()
document_scores = [item.item() for item in cosine_similarities[1:]]
document_scores

[0.0, 0.14808960350408654, 0.2461539234244219]
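
The cell above calls `linear_kernel` (a plain dot product) yet names the result `cosine_similarities`. That works because `TfidfVectorizer` scales every row to unit L2 length by default. The sketch below checks both facts on the same example texts; it assumes the default `norm='l2'` setting has not been changed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

texts = ['You are trying to eat healthier',
         'cars drive on the road',
         'tomatoes are actually fruit',
         'I would like to eat more vegetables']
vectors = TfidfVectorizer().fit_transform(texts)

# Each row has unit L2 norm under the default norm='l2'
row_norms = np.sqrt(np.asarray(vectors.multiply(vectors).sum(axis=1))).ravel()
print(np.allclose(row_norms, 1.0))

# With unit-length rows, the plain dot product equals the cosine
dot_scores = linear_kernel(vectors[0:1], vectors[1:]).flatten()
cos_scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
print(np.allclose(dot_scores, cos_scores))
```

If the vectorizer is created with `norm=None`, the two functions no longer agree and `cosine_similarity` should be used instead.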

## Lemmatization

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords

In [40]:
stop_words = set(stopwords.words('english')) 

# Interface lemma tokenizer from nltk with sklearn
class LemmaTokenizer:
    ignore_tokens = [',', '.', ';', ':', '"', '``', "''", '`']
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc) if t not in self.ignore_tokens]

# Lemmatize the stop words
tokenizer=LemmaTokenizer()
token_stop = tokenizer(' '.join(stop_words))

#search_terms = 'red tomato'
documents = ['cars drive on the road', 'tomatoes are actually fruit','I would like to eat more vegetables']

# Create TF-idf model
vectorizer = TfidfVectorizer(stop_words=token_stop, 
                              tokenizer=tokenizer)
doc_vectors = vectorizer.fit_transform([search_terms] + documents)

# Calculate similarity
cosine_similarities = linear_kernel(doc_vectors[0:1], doc_vectors).flatten()
document_scores = [item.item() for item in cosine_similarities[1:]]
document_scores

[0.0, 0.0, 0.2371049775288231]
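
The tomato document drops from about 0.148 to 0.0 here because its only overlap with the query was the stop word "are". The sketch below reproduces that effect without NLTK; as an assumption for illustration, scikit-learn's built-in English stop-word list stands in for the NLTK list, and no lemmatizer is used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

query = 'You are trying to eat healthier'
docs = ['cars drive on the road',
        'tomatoes are actually fruit',
        'I would like to eat more vegetables']


def score(vectorizer):
    """Score every document against the query with the given vectorizer."""
    vectors = vectorizer.fit_transform([query] + docs)
    return linear_kernel(vectors[0:1], vectors[1:]).flatten().tolist()


# Default settings: 'are' links the query to document 2, 'to' and 'eat' to document 3
with_stop_words = score(TfidfVectorizer())
# Assumption: scikit-learn's built-in list stands in for the NLTK stop words
without_stop_words = score(TfidfVectorizer(stop_words='english'))

print(with_stop_words)
print(without_stop_words)
```

Only the third document keeps a nonzero score after filtering, since "eat" is the one content word it shares with the query.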

## Summary: Lexical Similarity (TF-IDF)

Pros
- It’s fast and works well when documents are large and/or have lots of overlap.
- It looks for exact matches, so at the very least you should use a lemmatizer to take care of the plurals.


Limitations
- When comparing short documents with limited term variety, such as search queries, there is a risk that you will miss semantic relationships where there isn’t an exact word match.
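
Both points can be seen on a small example. In the sketch below, `TOY_LEMMAS` is a made-up lookup table standing in for a real lemmatizer; it is an assumption for illustration, not part of NLTK or scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

docs = ['cars drive on the road', 'tomatoes are actually fruit']

# Toy lookup table standing in for a real lemmatizer (illustrative only)
TOY_LEMMAS = {'tomatoes': 'tomato', 'cars': 'car'}


def toy_lemma_tokenizer(text):
    """Split on whitespace and map known plurals to their singular form."""
    return [TOY_LEMMAS.get(token, token) for token in text.lower().split()]


def score(query, vectorizer):
    """Score every document against the query with the given vectorizer."""
    vectors = vectorizer.fit_transform([query] + docs)
    return linear_kernel(vectors[0:1], vectors[1:]).flatten().tolist()


# 'tomato' and 'tomatoes' are different terms to a plain TF-IDF model
plain_scores = score('red tomato', TfidfVectorizer())

# Mapping plurals to a common form restores the match
lemma_vectorizer = TfidfVectorizer(tokenizer=toy_lemma_tokenizer, token_pattern=None)
lemma_scores = score('red tomato', lemma_vectorizer)

# Normalization does not help with synonyms: 'automobile' shares no term with 'car'
synonym_vectorizer = TfidfVectorizer(tokenizer=toy_lemma_tokenizer, token_pattern=None)
synonym_scores = score('automobile', synonym_vectorizer)

print(plain_scores, lemma_scores, synonym_scores)
```

The synonym case is the gap that the semantic approach in the next section is meant to close.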

# B - Semantic Similarity

## Preprocessing with gensim

In [41]:
from re import sub
from gensim.utils import simple_preprocess

query_string = 'fruit and vegetables'
documents = ['cars drive on the road', 'tomatoes are actually fruit']

# Named custom_stopwords so it does not shadow nltk.corpus.stopwords imported in an earlier cell
custom_stopwords = ['the', 'and', 'are', 'a']

# From: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb
def preprocess(doc):
    # Tokenize, clean up input document string
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    doc = sub(r'<[^<>]+(>|$)', " ", doc)
    doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
    doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in custom_stopwords]

# Preprocess the documents, including the query string
corpus = [preprocess(document) for document in documents]
query = preprocess(query_string)

In [24]:
# Load the model: this is a big file, can take a while to download and open
import gensim.downloader as api
glove = api.load("glove-wiki-gigaword-50")  



In [42]:
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models import WordEmbeddingSimilarityIndex  # gensim 3.x path; gensim 4.x exposes this class from gensim.similarities
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import SoftCosineSimilarity

  
similarity_index = WordEmbeddingSimilarityIndex(glove)

# Build the term dictionary, TF-idf model
dictionary = Dictionary(corpus+[query])
tfidf = TfidfModel(dictionary=dictionary)

# Create the term similarity matrix.  
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

In [44]:
import numpy as np
# Compute Soft Cosine Measure between the query and the documents.
# From: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb
query_tf = tfidf[dictionary.doc2bow(query)]

index = SoftCosineSimilarity(
            tfidf[[dictionary.doc2bow(document) for document in corpus]],
            similarity_matrix)

doc_similarity_scores = index[query_tf]

# Output the sorted similarity scores and documents
sorted_indexes = np.argsort(doc_similarity_scores)[::-1]
for idx in sorted_indexes:
    print(f'{idx} \t {doc_similarity_scores[idx]:0.3f} \t {documents[idx]}')


1 	 0.688 	 tomatoes are actually fruit
0 	 0.000 	 cars drive on the road
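
To see why the soft cosine measure can score documents that share no exact terms, here is a minimal NumPy sketch of the formula `x^T S y / (sqrt(x^T S x) * sqrt(y^T S y))`. The vocabulary, the bag-of-words vectors, and the values in the term-similarity matrix `S` are all made up for illustration. gensim builds `S` from word embeddings and works on sparse TF-IDF vectors, so its scores will differ.

```python
import numpy as np

# Toy vocabulary and bag-of-words vectors
vocab = ['fruit', 'vegetables', 'tomatoes', 'cars']
query_vec = np.array([1.0, 1.0, 0.0, 0.0])   # 'fruit vegetables'
doc_tomato = np.array([0.0, 0.0, 1.0, 0.0])  # 'tomatoes'
doc_cars = np.array([0.0, 0.0, 0.0, 1.0])    # 'cars'

# Hypothetical term-similarity matrix: symmetric, 1 on the diagonal, made-up values
S = np.array([
    [1.0, 0.5, 0.6, 0.0],
    [0.5, 1.0, 0.7, 0.0],
    [0.6, 0.7, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])


def soft_cosine(x, y, sim):
    """Soft cosine: x^T S y / (sqrt(x^T S x) * sqrt(y^T S y))."""
    return float((x @ sim @ y) / np.sqrt((x @ sim @ x) * (y @ sim @ y)))


# With the identity matrix the measure reduces to the ordinary cosine
hard_score = soft_cosine(query_vec, doc_tomato, np.eye(len(vocab)))
# Off-diagonal similarities let related terms contribute
soft_score = soft_cosine(query_vec, doc_tomato, S)
cars_score = soft_cosine(query_vec, doc_cars, S)

print(hard_score, soft_score, cars_score)
```

The ordinary cosine between the query and the tomato document is zero because they share no term, while the soft version is positive because `S` links "tomatoes" to "fruit" and "vegetables". The cars document stays at zero since `S` relates "cars" to nothing in the query.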
