<a href="https://colab.research.google.com/github/dennismathewjose/Applied-NLP-Learning-Materials/blob/main/NLP_Homework4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 1: Sentence-BERT-Based Similarity Analysis

In [None]:
#importing the libraries
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import random
import re
import kagglehub
import torch

In [None]:
# Downloading the dataset from kaggle
path = kagglehub.dataset_download("quora/question-pairs-dataset")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/question-pairs-dataset


In [None]:
# Load the CSV file
file_path = f"{path}/questions.csv"
df = pd.read_csv(file_path)

# Display basic info
df.head()


Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404351 entries, 0 to 404350
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            404351 non-null  int64 
 1   qid1          404351 non-null  int64 
 2   qid2          404351 non-null  int64 
 3   question1     404350 non-null  object
 4   question2     404349 non-null  object
 5   is_duplicate  404351 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 18.5+ MB


## Cleaning and preprocessing

In [None]:
df = df.dropna(subset=["question1", "question2"])

In [None]:
def preporcessing(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove special characters
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# Apply preprocessing
df["question1"] = df["question1"].apply(preporcessing)
df["question2"] = df["question2"].apply(preporcessing)

In [None]:
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,what is the step by step guide to invest in sh...,what is the step by step guide to invest in sh...,0
1,1,3,4,what is the story of kohinoor kohinoor diamond,what would happen if the indian government sto...,0
2,2,5,6,how can i increase the speed of my internet co...,how can internet speed be increased by hacking...,0
3,3,7,8,why am i mentally very lonely how can i solve it,find the remainder when math2324math is divide...,0
4,4,9,10,which one dissolve in water quikly sugar salt ...,which fish would survive in salt water,0


In [None]:
#combine all the questions
all_questions = pd.Series(pd.concat([df['question1'], df['question2']]))

In [None]:
all_questions.head()

Unnamed: 0,0
0,what is the step by step guide to invest in sh...
1,what is the story of kohinoor kohinoor diamond
2,how can i increase the speed of my internet co...
3,why am i mentally very lonely how can i solve it
4,which one dissolve in water quikly sugar salt ...


## Selecting the model for embedding

In [None]:
#using the model "distilbert-base-nli-stsb-quora-ranking"
# Load model and move to GPU
model = SentenceTransformer('distilbert-base-nli-stsb-quora-ranking')
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.58k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/557 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/265M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/507 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [None]:
# Encode all unique questions
print("Generating embeddings on GPU... (this might take a while)")
question_embeddings = model.encode(
    all_questions.tolist(),
    show_progress_bar=True,
    convert_to_tensor=True,
    device=device
)

Generating embeddings on GPU... (this might take a while)


Batches:   0%|          | 0/16749 [00:00<?, ?it/s]

In [None]:
# Mapping question to embedding
embedding_dict = dict(zip(all_questions, question_embeddings))

# Select 5 random unique questions as queries
sample_queries = all_questions.sample(5, random_state=42)

In [None]:
print("Running on device:", device)


Running on device: cuda


In [None]:
# Initialize results and counter
from torch.nn.functional import cosine_similarity

results = []
correct_count = 0
for query in sample_queries:
    query_embedding = embedding_dict[query].unsqueeze(0)

    # Compute cosine similarity with all questions
    sims = cosine_similarity(query_embedding, question_embeddings, dim=1)  # sims is a 1D tensor

    # Get top 10 indices excluding the query itself
    top_indices = torch.argsort(sims, descending=True)
    top_indices = [i.item() for i in top_indices if all_questions.iloc[i.item()] != query][:10]

    top_questions = all_questions.iloc[top_indices].values
    top_1 = top_questions[0]

    # Check if the top match is a true duplicate in the original dataset
    dup1 = df[(df['question1'] == query) & (df['question2'] == top_1) & (df['is_duplicate'] == 1)]
    dup2 = df[(df['question2'] == query) & (df['question1'] == top_1) & (df['is_duplicate'] == 1)]

    is_duplicate = not dup1.empty or not dup2.empty
    results.append((query, top_questions.tolist(), is_duplicate))

    if is_duplicate:
        correct_count += 1

In [None]:
# Compute average Precision@1
p_at_1 = correct_count / len(sample_queries)

In [None]:
print("\n--- Results ---\n")
for i, (query, top_qs, is_dup) in enumerate(results, 1):
    print(f"Query {i}: {query}\n")
    print("Top 10 Most Similar Questions:")
    for rank, sim_q in enumerate(top_qs, 1):
        print(f"  {rank}. {sim_q}")

    print(f"\nIs Duplicate at Rank 1: {is_dup}")
    print("-" * 50)



--- Results ---

Query 1: how many people landed on the moon

Top 10 Most Similar Questions:
  1. how many people have walked on the moon
  2. what percentage of people believe man has landed on moon
  3. did we ever land on the moon
  4. how many missions were there to the moon
  5. did the us really land on the moon if yes what proof do we have except photos and all
  6. how many men have walked on the moon
  7. what are some of the best evidence that we really did land on the moon
  8. did humans really land on moon if yes then what about conspiracy theory
  9. how many humans have gone to outer space
  10. has life been discovered on the moon

Is Duplicate at Rank 1: False
--------------------------------------------------
Query 2: which book is most worth reading for english learners

Top 10 Most Similar Questions:
  1. which is the best book for learning english
  2. what are the best english books
  3. which books are a must read for an english literature student
  4. what is t

# Part 2: Part 2: Fine-Tuning a Hugging Face Model for Sentiment Analysis

## Data Preparation and loading

In [4]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.2/491.2 kB[0m [31m20.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.12.0-py3-none-any.

In [5]:
import pandas as pd
import torch
import numpy as np
from datasets import load_dataset

from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

In [7]:
dataset = load_dataset("imdb")

In [11]:
train_dataset = dataset["train"]
test_dataset = dataset["test"]

### Why did I choose IMDB Reviews?

- It is a standard benchmark dataset for binary sentiment analysis.

- Contains 50,000 reviews labeled as positive or negative, making it ideal for supervised classification tasks.

- Already pre-split into train and test, reducing data preparation effort.

## Model Setup

In [9]:
# Load tokenizer and model
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Model Choice: distilbert-base-uncased
- It is a lighter and faster version of BERT, with ~40% fewer parameters but retains ~97% of performance.

- Pretrained on a large English corpus, making it effective for NLP tasks like sentiment classification.

- "Uncased" version ignores case distinctions, which is fine for sentiment analysis, where "GREAT" and "great" can be treated the same.

In [12]:
# Tokenize the dataset
def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True)

train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# Set format for PyTorch
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

### Tokenization with DistilBertTokenizerFast
- Hugging Face models require input in the form of input_ids and attention_mask, which are generated using a compatible tokenizer.

- The “Fast” version uses Hugging Face’s Rust-based tokenizer backend for high-speed batch tokenization.

## Training Procedure

In [13]:
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score

# Define accuracy function
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    return {'accuracy': accuracy_score(labels, preds)}

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

# Start training
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mdennisparambanattu[0m ([33mdennisparambanattu-northeastern-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2354,0.231241,0.91048
2,0.1769,0.236308,0.93084
3,0.1078,0.280905,0.93144


TrainOutput(global_step=4689, training_loss=0.16890614851196964, metrics={'train_runtime': 4579.7018, 'train_samples_per_second': 16.377, 'train_steps_per_second': 1.024, 'total_flos': 9935054899200000.0, 'train_loss': 0.16890614851196964, 'epoch': 3.0})

## Evaluation

In [14]:
# Evaluate on test set
eval_results = trainer.evaluate()
print("Final Evaluation:", eval_results)


Final Evaluation: {'eval_loss': 0.2809053957462311, 'eval_accuracy': 0.93144, 'eval_runtime': 361.0469, 'eval_samples_per_second': 69.243, 'eval_steps_per_second': 4.329, 'epoch': 3.0}


## Analysis

- Slow training: Running full IMDb (25K reviews) takes time. Solution: subset for quick iterations.

- Overfitting: Initial runs with fewer epochs gave better generalization. Used dropout and weight decay.

### Possible Enhancements:
- Hyperparameter tuning using optuna or grid search.

- Data augmentation using back-translation, synonym replacement, or dropout noise.

- Use a larger model like bert-base-uncased if resources allow.

