
In [8]:
# ============================================================
# DATA-622 Homework 3 | NLP - Sentence Embeddings & Similarity
# ============================================================

import nltk
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

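# Punkt sentence-tokenizer data: recent NLTK releases load 'punkt_tab',
# older ones load 'punkt', so both are fetched.
nltk.download('punkt')
nltk.download('punkt_tab')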

# ------------------------------------------------------------------
# STEP 1: Read the Article
# Note: washingtonpost.com blocks automated scraping in Colab,
# so we use a direct hardcoded excerpt from the article (>700 chars).
# ------------------------------------------------------------------

article_text = """
On Air India Flight 171, only one individual managed to survive — British citizen Viswashkumar Ramesh.
He was spotted limping away from a group of stunned rescuers and heading towards an ambulance shortly after the incident,
which resulted in the deaths of the remaining 241 passengers and crew, along with numerous individuals on the ground in Ahmedabad.
Viswashkumar Ramesh, on Air India Flight 171 in seat 11A, was the only survivor after the plane crashed in Ahmedabad.
An expert called his survival a miracle. "Everything happened in front of my eyes," Ramesh said.
"I don't believe how I survived." Thirty seconds after takeoff, there was a loud noise and then the plane crashed.
He escaped through an opening in the fuselage. He feels like the luckiest man alive despite injuries to his leg, shoulder, knee, and back.
""".strip()

first_700 = article_text[:700]
print("=" * 60)
print("STEP 1 — First 700 Characters:")
print("=" * 60)
print(first_700)
print(f"\nLength: {len(first_700)} characters")

# ------------------------------------------------------------------
# STEP 2: Split Text into Sentences using NLTK
# ------------------------------------------------------------------

sentences = nltk.sent_tokenize(article_text)

print("\n" + "=" * 60)
print("STEP 2 — Sentence Tokenization:")
print("=" * 60)
print(f"Total sentences: {len(sentences)}")
print("First 5 sentences:")
for i, s in enumerate(sentences[:5], 1):
    print(f"  {i}. {s}")

# ------------------------------------------------------------------
# STEP 3: Load Pre-trained Embedding Model + TF-IDF Vectorization
# ------------------------------------------------------------------

model = SentenceTransformer('all-MiniLM-L6-v2')

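# Keep at most the first ten sentences (this excerpt yields only nine).
first_ten = sentences[:10]

# Lexical baseline: fit TF-IDF on the same sentences. Only the matrix shape is
# reported here; Step 5 compares the SBERT embeddings, not these vectors.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(first_ten)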

print("\n" + "=" * 60)
print("STEP 3 — TF-IDF Vectorization:")
print("=" * 60)
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print("  (rows = sentences, cols = unique vocabulary terms)")

# ------------------------------------------------------------------
# STEP 4: Embed Each Sentence with SBERT
# ------------------------------------------------------------------

embeddings = model.encode(first_ten)

print("\n" + "=" * 60)
print("STEP 4 — Sentence Embeddings (SBERT all-MiniLM-L6-v2):")
print("=" * 60)
print(f"All embeddings shape: {embeddings.shape}")
print(f"First sentence embedding shape: {embeddings[0].shape}")

# ------------------------------------------------------------------
# STEP 5: Cosine Similarity Between Sentence 1 and Sentence 2
# ------------------------------------------------------------------

print("\n" + "=" * 60)
print("STEP 5 — Cosine Similarity:")
print("=" * 60)

if len(embeddings) >= 2:
    sim = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    print(f"Sentence 1: {sentences[0]}")
    print(f"Sentence 2: {sentences[1]}")
    print(f"\nCosine Similarity (SBERT): {sim:.4f}")
    print("\nInterpretation:")
    print("  - TF-IDF captures word overlap / lexical similarity")
    print("  - SBERT captures deeper semantic meaning")
    # Heuristic cut-offs chosen for this exercise, not a standard scale.
    label = 'High' if sim > 0.7 else 'Moderate' if sim > 0.4 else 'Low'
    print(f"  - Score of {sim:.4f} → {label} semantic similarity")
else:
    print("Not enough sentences to compute similarity.")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


STEP 1 — First 700 Characters:
On Air India Flight 171, only one individual managed to survive — British citizen Viswashkumar Ramesh.
He was spotted limping away from a group of stunned rescuers and heading towards an ambulance shortly after the incident,
which resulted in the deaths of the remaining 241 passengers and crew, along with numerous individuals on the ground in Ahmedabad.
Viswashkumar Ramesh, on Air India Flight 171 in seat 11A, was the only survivor after the plane crashed in Ahmedabad.
An expert called his survival a miracle. "Everything happened in front of my eyes," Ramesh said.
"I don't believe how I survived." Thirty seconds after takeoff, there was a loud noise and then the plane crashed.
He escaped thr

Length: 700 characters

STEP 2 — Sentence Tokenization:
Total sentences: 9
First 5 sentences:
  1. On Air India Flight 171, only one individual managed to survive — British citizen Viswashkumar Ramesh.
  2. He was spotted limping away from a group of stunned rescuers

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



STEP 3 — TF-IDF Vectorization:
TF-IDF matrix shape: (9, 90)
  (rows = sentences, cols = unique vocabulary terms)

STEP 4 — Sentence Embeddings (SBERT all-MiniLM-L6-v2):
All embeddings shape: (9, 384)
First sentence embedding shape: (384,)

STEP 5 — Cosine Similarity:
Sentence 1: On Air India Flight 171, only one individual managed to survive — British citizen Viswashkumar Ramesh.
Sentence 2: He was spotted limping away from a group of stunned rescuers and heading towards an ambulance shortly after the incident,
which resulted in the deaths of the remaining 241 passengers and crew, along with numerous individuals on the ground in Ahmedabad.

Cosine Similarity (SBERT): 0.4253

Interpretation:
  - TF-IDF captures word overlap / lexical similarity
  - SBERT captures deeper semantic meaning
  - Score of 0.4253 → Moderate semantic similarity
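
The interpretation contrasts TF-IDF with SBERT, but the cell only scores the sentence pair with SBERT. A lexical baseline for comparison could look like the sketch below. It is plain Python so that it runs without the notebook state. The helper names (`tokenize`, `tfidf_vectors`, `cosine`) are hypothetical, and the weighting is assumed to mirror scikit-learn's `TfidfVectorizer` defaults (lowercasing, tokens of two or more word characters, smoothed IDF, L2 normalization). It is fitted on three sentences from the excerpt rather than all nine, so its scores will not match a score taken from `tfidf_matrix`.

```python
import math
import re
from collections import Counter

# Assumed to match TfidfVectorizer's default token pattern.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_RE.findall(text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    vectors = []
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Three sentences taken from the hardcoded excerpt.
docs = [
    "On Air India Flight 171, only one individual managed to survive "
    "\u2014 British citizen Viswashkumar Ramesh.",
    "He was spotted limping away from a group of stunned rescuers and "
    "heading towards an ambulance shortly after the incident, which resulted "
    "in the deaths of the remaining 241 passengers and crew, along with "
    "numerous individuals on the ground in Ahmedabad.",
    "Viswashkumar Ramesh, on Air India Flight 171 in seat 11A, was the only "
    "survivor after the plane crashed in Ahmedabad.",
]

vectors = tfidf_vectors(docs)
sim_1_2 = cosine(vectors[0], vectors[1])
sim_1_3 = cosine(vectors[0], vectors[2])

print(f"TF-IDF cosine, sentence 1 vs 2: {sim_1_2:.4f}")
print(f"TF-IDF cosine, sentence 1 vs 3: {sim_1_3:.4f}")
```

Sentences 1 and 2 share almost no vocabulary, so their lexical score is low even though SBERT rates the pair as moderately similar. Sentences 1 and 3 repeat the flight and passenger names, so they score higher. Inside the notebook, `cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])` should give the corresponding score for the fitted matrix.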

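
Sentence 2 in the Step 5 output spans two lines because the hardcoded excerpt has hard line breaks inside a sentence, and `sent_tokenize` keeps them. If single-line sentences are preferred, one option is to collapse whitespace before tokenizing. A minimal sketch, with `normalize_whitespace` as a hypothetical helper name and a shortened example string:

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(text.split())


# Shortened stand-in for a sentence that wraps across lines in the excerpt.
raw = "He was spotted limping away,\nwhich resulted in   many deaths."
clean = normalize_whitespace(raw)
print(clean)
```

In the cell, this would mean passing `normalize_whitespace(article_text)` to `nltk.sent_tokenize`. Note that the 700-character slice in Step 1 would then cover slightly different text, since each newline becomes a single space.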