<a href="https://colab.research.google.com/github/bandiajay/Generative-AI/blob/main/05_Text_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style="text-align: center;"> Text Similarity </h1>

<b> Objective: </b>
<p> This worksheet demonstrates how to measure the similarity between sentences using <i> cosine similarity</i>. It introduces the concept and techniques of text similarity in natural language processing. Through guided activities and practical examples, learners see how to measure and analyze the similarity between texts, and how to apply these techniques in applications such as document clustering, plagiarism detection, and content recommendation systems. </p>

<b> Requirements: </b>
<ol>
<li> <i> Transformers </i> - A versatile library from Hugging Face providing state-of-the-art pre-trained models for natural language processing tasks. </li>
<li> <i> sentence-transformers </i> - A Python framework for state-of-the-art sentence, text, and image embeddings. </li>
<li> <i> Torch (PyTorch) </i> - An open-source machine learning library known for its flexibility and speed, widely used for developing and training deep learning models. </li>
</ol>

<b> Steps: </b>
<ol>
    <li> Install the <code>transformers</code>, <code>sentence-transformers</code>, and <code>torch</code> packages.</li>
     <li> Write source code </li>
        <p> 2.1 Import <code>transformers</code>, <code>sentence-transformers</code>, <code>torch</code> modules <br>
            2.2 Get the API key from Hugging Face. <br>
            2.3 Get the Deep Learning model. <br>
            2.4 Load the Deep Learning model. <br>
            2.5 Load the Tokenizer from the model. <br>
            2.6 Enter the input Sentences. <br>
            2.7 Convert those Sentences to a machine-understandable format (<b>Tokens</b>) using the <i> Tokenizer</i>. <br>
            2.8 Apply the Deep Learning model on the tokens <br>
            2.9 Get the Embeddings. <br>
            2.10 Calculate Cosine Similarity. <br>
        </p>
    <li> Test the source code on multiple sentences</li>
</ol>

<h3> Step 1: Install transformers, sentence-transformers, torch packages </h3>

**Note:** if the below command fails, execute
`python -m pip install transformers`

In [None]:
pip install transformers



**Note:** if the below command fails, execute
`python -m pip install sentence-transformers`

In [None]:
pip install sentence-transformers


**Note:** if the below command fails, execute
`python -m pip install torch`

In [None]:
pip install torch
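
After the three install commands, it can help to confirm that the packages are actually visible to the running kernel. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical and not part of any of these libraries. Note that the pip name `sentence-transformers` is imported as `sentence_transformers`.

```python
from importlib.util import find_spec

# pip name -> import name (they differ for sentence-transformers)
REQUIRED = {
    "transformers": "transformers",
    "sentence-transformers": "sentence_transformers",
    "torch": "torch",
}

def check_packages(required=REQUIRED):
    """Return {pip_name: True/False} depending on whether the module can be located.

    Hypothetical helper for this worksheet; it only looks the module up
    and does not import it.
    """
    return {pip_name: find_spec(module) is not None
            for pip_name, module in required.items()}

if __name__ == "__main__":
    for pip_name, installed in check_packages().items():
        print(f"{pip_name}: {'installed' if installed else 'missing'}")
```

If a package is reported as missing, re-run the matching `python -m pip install ...` command from the notes above and restart the kernel.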



<h3> Step 2 : Write source code </h3>

<h4> Step 2.1 : Import <code>transformers</code>, <code>sentence-transformers</code>, <code>torch</code> modules </h4>

**AutoModel** and **AutoTokenizer** are part of the `transformers` library developed by Hugging Face.



*   **AutoModel** : automatically downloads and loads a pre-trained model given the model's identifier.
*   **AutoTokenizer** : automatically downloads and loads the matching tokenizer, which converts text into a machine-readable format, *tokens*.



**Scikit-Learn** : an open-source Python library that provides simple and efficient tools for data mining and data analysis, built on NumPy, SciPy, and matplotlib. It is used here only for its `cosine_similarity` function.

**Cosine Similarity** : measures the cosine of the angle between two vectors. It ranges from -1 (opposite directions) to 1 (same direction) and is commonly used to determine how similar two documents are in content.
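
To make the definition concrete: for two vectors $u$ and $v$, the cosine similarity is $\cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. The sketch below computes it by hand on small made-up vectors, so the formula can be checked without any model. The function name `cosine` is only illustrative; the worksheet itself uses scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors chosen for illustration only
print(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction: very close to 1
print(cosine([1.0, 0.0], [0.0, 1.0]))            # perpendicular: 0
print(cosine([1.0, 0.0], [-1.0, 0.0]))           # opposite direction: -1
```

Because only the angle matters, scaling a vector (as in the first example) does not change the score.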

In [None]:
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity
import os

<h4>  Step 2.2 : Get the API key from Hugging Face </h4>

**Note** : Create an account at [Hugging Face](https://huggingface.co/) and generate an API key (access token) in your account settings. Treat the key like a password.

In [None]:
os.environ["api_key"] = "<your-hugging-face-token>"  # placeholder: paste your own token; never publish a real one
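
Writing a token directly into a notebook cell makes it easy to leak when the notebook is shared. One alternative, sketched below, is to set the token outside the notebook and only read it here. The helper `get_hf_token` is hypothetical; the default name `HF_TOKEN` is taken from the warning that Hugging Face prints in Step 2.4, and whether your environment defines it is an assumption.

```python
import os

def get_hf_token(var_name="HF_TOKEN"):
    """Return the token stored in the given environment variable, or None if unset or empty.

    Hypothetical helper for this worksheet, not a Hugging Face API.
    """
    token = os.environ.get(var_name)
    return token if token else None

token = get_hf_token()
print("Token found" if token else "No token set; public models can still be downloaded")
```

As the Step 2.4 output notes, authentication is optional for public models such as the one used in this worksheet.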

<h4> Step 2.3 : Get the model </h4>



*  **sentence-transformers** : The project that the model is part of.
*  **paraphrase-MiniLM-L6-v2** : The deep learning model. Its name has four parts:
      * First part - the primary task or training objective of the model. Here `paraphrase` indicates tasks related to paraphrasing.
      * Second part - the base architecture of the model. Here it is `MiniLM`, a smaller but effective language model in the style of *BERT*.
      * Third part - the number of layers in its transformer architecture. Here `L6` means there are 6 layers.
      * Fourth part - the version of the model. Here `v2` indicates the second version.
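
The naming scheme described above can be illustrated by splitting the identifier into its parts. This is only a sketch for this particular name: the function `split_model_id` is hypothetical, and other models on the Hub do not necessarily follow the same four-part pattern.

```python
def split_model_id(model_id):
    """Split an id of the form 'project/task-Architecture-L<layers>-v<version>'.

    Illustrative only; assumes exactly the pattern used by
    'sentence-transformers/paraphrase-MiniLM-L6-v2'.
    """
    project, name = model_id.split("/", 1)
    task, architecture, layers, version = name.split("-")
    return {
        "project": project,
        "task": task,
        "architecture": architecture,
        "num_layers": int(layers.lstrip("L")),
        "version": version,
    }

parts = split_model_id("sentence-transformers/paraphrase-MiniLM-L6-v2")
for key, value in parts.items():
    print(f"{key}: {value}")
```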

**Note**: You can find more models at [Hugging Face](https://huggingface.co/).


In [None]:
model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"

<h4> Step 2.4 : Load the Deep Learning model </h4>

In [None]:
model = AutoModel.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

<h4> Step 2.5 : Load the Tokenizer from the deep learning model </h4>

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

<h4> Step 2.6: Enter the input sentences </h4>

In [None]:
sentence1 = "The sun is shining brightly."
sentence2 = "The weather is sunny today."

<h4> Step 2.7 : Convert those Sentences to a machine-understandable format (Tokens) using the Tokenizer </h4>

`tokenizer` is a callable object.


*   First argument - the input, here a list containing the two sentences.
*   return_tensors - the format of the returned tensors. Here `pt` indicates PyTorch.
*   truncation - ensures that each tokenized sentence does not exceed the maximum length that the model can handle. Here it is `True`.
*   padding - pads the shorter sentences so that all sentences in the batch have the same length. Here it is `True`.
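
To see what truncation and padding do, the following toy sketch applies both to plain lists of token ids. The ids, the `pad_id` of 0, and the function `pad_and_truncate` are made up for illustration; a real tokenizer also handles special tokens when truncating, which this sketch does not.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy imitation of truncation=True and padding=True on lists of token ids."""
    # Truncation: cut every sequence down to at most max_length ids
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend shorter sequences to the longest one in the batch
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # Attention mask: 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]  # made-up token ids
out = pad_and_truncate(batch, max_length=6)
print(out["input_ids"])
print(out["attention_mask"])
```

The attention mask is what lets later steps tell real tokens apart from padding.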


In [None]:
tokens = tokenizer([sentence1, sentence2], return_tensors="pt", truncation=True, padding=True)

<h4> Step 2.8 : Apply the Deep Learning model on the tokens </h4>

In [None]:
outputs = model(**tokens)

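<h4> Step 2.9 : Get the Sentence Embeddings </h4>

The model returns one vector per token in `last_hidden_state`. Averaging these vectors over the sequence dimension (mean pooling) gives a single embedding per sentence.
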
In [None]:
embeddings = outputs.last_hidden_state.mean(dim=1).detach().cpu().numpy()
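
Note that `mean(dim=1)` averages over every position in the sequence, including padding positions when the sentences in a batch have different lengths. This is likely one reason why the same sentence pair can receive a different score when it is embedded together with longer sentences. A common alternative is to weight the average by the attention mask. The sketch below shows the idea with NumPy on made-up numbers; `mean_pool` is a hypothetical helper, not part of `transformers`.

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average token vectors per sentence, ignoring padded positions.

    hidden: array of shape (batch, seq_len, dim)
    mask:   array of shape (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

# Toy data: the last position of the first sentence is padding
hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

print(hidden.mean(axis=1))      # plain mean: the padded position affects the first row
print(mean_pool(hidden, mask))  # masked mean: the padded position is ignored
```

With the tensors from this worksheet, the same idea would use `outputs.last_hidden_state` as `hidden` and `tokens["attention_mask"]` as `mask`, after converting both to NumPy arrays or rewriting the function with the equivalent PyTorch operations.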

<h4> Step 2.10 : Calculate the similarity score using cosine similarity </h4>

In [None]:
similarity_score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]

Round the score to 2 decimal places.

In [None]:
rounded_similarity_score = round(similarity_score, 2)

Print the output.

In [None]:
print(f"Similarity Score: {rounded_similarity_score:.2f}")

Similarity Score: 0.66


<h3> Step 3 : Test the source code on multiple sentences </h3>

In [None]:
from itertools import combinations
model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"
#model_name = "BAAI/bge-large-en-v1.5"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define sentences
sentences = [
    "The sun is shining brightly.",
    "The weather is sunny today.",
    "I love to go to class.",
    "Today We have an in class Activity.",
    "I want to eat Dosa for breakfast."

]

# Tokenize and obtain embeddings for all sentences
tokens = tokenizer(sentences, return_tensors="pt", truncation=True, padding=True)
outputs = model(**tokens)
embeddings = outputs.last_hidden_state.mean(dim=1).detach().cpu().numpy()

# Calculate cosine similarity between all pairs of sentences
similarity_scores = []
for pair in combinations(range(len(sentences)), 2):
    similarity_score = cosine_similarity([embeddings[pair[0]]], [embeddings[pair[1]]])[0][0]
    similarity_scores.append(similarity_score)

# Round the similarity scores to 2 decimals
rounded_similarity_scores = [round(score, 2) for score in similarity_scores]

# Print similarity scores
for i, pair in enumerate(combinations(sentences, 2)):
    print(f"Similarity Score between '{pair[0]}' and '{pair[1]}': {rounded_similarity_scores[i]:.2f}")

Similarity Score between 'The sun is shining brightly.' and 'The weather is sunny today.': 0.72
Similarity Score between 'The sun is shining brightly.' and 'I love to go to class.': 0.12
Similarity Score between 'The sun is shining brightly.' and 'Today We have an in class Activity.': 0.15
Similarity Score between 'The sun is shining brightly.' and 'I want to eat Dosa for breakfast.': 0.06
Similarity Score between 'The weather is sunny today.' and 'I love to go to class.': 0.17
Similarity Score between 'The weather is sunny today.' and 'Today We have an in class Activity.': 0.26
Similarity Score between 'The weather is sunny today.' and 'I want to eat Dosa for breakfast.': 0.08
Similarity Score between 'I love to go to class.' and 'Today We have an in class Activity.': 0.47
Similarity Score between 'I love to go to class.' and 'I want to eat Dosa for breakfast.': 0.23
Similarity Score between 'Today We have an in class Activity.' and 'I want to eat Dosa for breakfast.': 0.14
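
With many sentences it is often more convenient to compute all pairwise scores at once and sort them, rather than reading through the printed list. The sketch below does this with NumPy on small made-up vectors; the labels, the vectors, and the helpers `pairwise_cosine` and `rank_pairs` are illustrative assumptions, and in the worksheet the `embeddings` array and `sentences` list would take their place.

```python
from itertools import combinations
import numpy as np

def pairwise_cosine(embeddings):
    """Return the (n, n) matrix of cosine similarities between the rows."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

def rank_pairs(labels, embeddings):
    """List every unordered pair with its score, most similar pair first."""
    sim = pairwise_cosine(np.asarray(embeddings, dtype=float))
    pairs = [(labels[i], labels[j], float(sim[i, j]))
             for i, j in combinations(range(len(labels)), 2)]
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Toy 2-dimensional "embeddings", for illustration only
labels = ["A", "B", "C"]
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

for first, second, score in rank_pairs(labels, toy):
    print(f"{first} vs {second}: {score:.2f}")
```

Normalizing each row to unit length first means the matrix product directly yields the cosine of every pair, so no explicit loop over pairs is needed for the computation itself.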
