# Assignment: Linguistic Pre-processing and Text Representation

## Instructions
- Answer all questions with detailed explanations
- Include code examples where applicable
- Provide reasoning for your design choices
- Each question requires a comprehensive answer demonstrating understanding of concepts

---

## Question 1: Multi-level Linguistic Analysis

Consider the sentence: "The company's CEO didn't respond to our meeting invitation."

Analyze this sentence from four different linguistic perspectives:
- **Syntax**: Identify the grammatical structure and phrase composition
- **Semantics**: Explain the meaning and relationships between words
- **Morphology**: Break down word formations and their components
- **Pragmatics**: Discuss the contextual interpretation and implied meaning

**Hint**: Consider how each level provides different insights. For morphology, examine words like "didn't" and "invitation". For pragmatics, think about what this might imply in a business context.

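- **Syntax**: The sentence is a negative declarative clause. The subject is the noun phrase "The company's CEO" (possessive "The company's" plus the head noun "CEO"). The predicate is the verb phrase "didn't respond to our meeting invitation": the auxiliary "did" carries tense and negation, "respond" is the main verb, and "to our meeting invitation" is a prepositional phrase.
- **Semantics**: The CEO (the agent) did not reply to an invitation to a meeting. The possessive "our" identifies the speaker's group as the sender, and the past tense places the non-response before the time of speaking.
- **Morphology**: "didn't" = "did" (past tense of "do") + "n't" (contracted "not"); "company's" = "company" + possessive "'s"; "invitation" = "invite" + "-ation" (noun-forming suffix); "meeting" = "meet" + "-ing".
- **Pragmatics**: In a business context the statement implies more than it says. It suggests unresponsiveness or possible disinterest from the CEO, and it can work as a complaint or as a justification for following up.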


---

## Question 2: Pre-processing Pipeline Design

You are building a sentiment analysis system for customer reviews from an e-commerce platform. The reviews contain:
- Informal language and slang ("gonna", "wanna", "u")
- Emojis and special characters
- Product codes and prices
- Misspellings and typos

Design a comprehensive text pre-processing pipeline. For each step (tokenization, normalization, stop-word removal, stemming/lemmatization), explain:
1. Why you would include or exclude it
2. What specific considerations apply to this use case
3. The order of operations and why it matters

**Hint**: Consider whether stemming or lemmatization is more appropriate for sentiment analysis. Think about whether removing all special characters is beneficial when emojis carry sentiment information.


**1. Tokenization**

- Include: Yes. It splits the text into meaningful units (words, emojis, punctuation).
- Considerations: Use an emoji-aware tokenizer (such as NLTK's `TweetTokenizer` or spaCy), since emojis carry sentiment and must survive as tokens.
- Order: First. Every later step (normalization, spelling correction, filtering) operates on tokens.

**2. Normalization**

- Include: Yes, to standardize the text for consistent analysis.
- Considerations:
  - Convert to lowercase ("Great" → "great").
  - Expand slang ("gonna" → "going to", "u" → "you").
  - Remove product codes and prices; they rarely carry sentiment on their own.
  - Correct common misspellings with a spell-checker (such as TextBlob or symspellpy).
- Order: After tokenization, because normalization acts on individual tokens.

**3. Stop-word Removal**

- Include: Yes, but selectively.
- Considerations: Remove neutral function words (like "the", "is"), but keep negations ("not", "no") because they flip sentiment.
- Order: After normalization, so the filter sees clean, consistent tokens.

**4. Stemming / Lemmatization**

- Choice: Lemmatization rather than stemming.
- Why: Lemmatization returns dictionary words ("better" → "good" when tagged as an adjective), which keeps the distinctions that sentiment analysis depends on. Stemming can produce non-words ("loving" → "love" is fine, but "happiness" → "happi" is not).
- Order: After stop-word removal, so no work is spent on tokens that are discarded anyway.

**5. Optional: Handle Emojis and Punctuation**

- Include: Yes. Convert emojis to sentiment words (😊 → "happy", 😡 → "angry").
- Considerations: Do not remove all special characters. Emojis and exclamation marks (!) often signal emotional intensity.
- Order: During or right after normalization, and before any step that strips special characters.
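
The ordering described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production pipeline: the lookup tables (`SLANG`, `EMOJI`, `STOP_WORDS`, `NEGATIONS`), the product-code format, and the function name `preprocess` are all hypothetical, and the lemmatization step is left out because it needs a lexicon from a library such as NLTK or spaCy.

```python
import re

# Hypothetical lookup tables for illustration; a real system would use larger resources.
SLANG = {"gonna": "going to", "wanna": "want to", "u": "you"}
EMOJI = {"😊": "happy", "😡": "angry"}
STOP_WORDS = {"the", "is", "a", "an", "to", "it", "this", "for", "and", "of"}
NEGATIONS = {"not", "no", "never"}  # always kept, even if a stop-word list contains them

PRICE = r"\$\d+(?:\.\d+)?"   # prices such as $19.99
CODE = r"[A-Z]{2,}-\d+"      # product codes such as AB-1234 (assumed format)
TOKEN_RE = re.compile("|".join([
    PRICE,
    CODE,
    r"[A-Za-z]+(?:'[a-z]+)?",     # words, including contractions like don't
    r"[\U0001F300-\U0001FAFF]",   # a range covering many common emojis
    r"[!?]",                      # punctuation that signals intensity
]))
DROP_RE = re.compile(f"(?:{PRICE})|(?:{CODE})")


def preprocess(text):
    """Tokenize, normalize, map emojis to words, and filter stop words."""
    cleaned = []
    for token in TOKEN_RE.findall(text):       # 1. tokenization
        if DROP_RE.fullmatch(token):           # 2. drop prices and product codes
            continue
        token = EMOJI.get(token, token.lower())  # 2. lowercase, emoji -> word
        for piece in SLANG.get(token, token).split():  # 2. slang expansion
            if piece in STOP_WORDS and piece not in NEGATIONS:
                continue                       # 3. selective stop-word removal
            cleaned.append(piece)
    return cleaned


print(preprocess("U gonna love this!! 😊 AB-1234 for $19.99 is not bad"))
```

Keeping "!" and the negation "not" as tokens is the point of the design: both would be lost by a pipeline that strips all punctuation and applies a standard stop-word list.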

---

## Question 3: Stemming vs Lemmatization Trade-offs

Consider these sentences:
1. "The meeting was well organized and the organizers did a great job."
2. "She is better at organizing than her predecessor was."

Apply both stemming (Porter Stemmer) and lemmatization to these sentences. Then:
- Compare the outputs and explain the differences
- Discuss scenarios where stemming would be preferred over lemmatization and vice versa
- Analyze the impact on: search engines, text classification, and information retrieval systems

**Hint**: Consider computational cost, accuracy, and preservation of meaning. Words like "better", "organizing", and "was" behave differently under stemming vs lemmatization.


**Sentence 1**: "The meeting was well organized and the organizers did a great job."

- Stemming (Porter): "the meet wa well organ and the organ did a great job"
- Lemmatization (POS-aware): "the meeting be well organize and the organizer do a great job"

**Sentence 2**: "She is better at organizing than her predecessor was."

- Stemming (Porter): "she is better at organ than her predecessor wa"
- Lemmatization (POS-aware): "she be good at organize than her predecessor be"

The lemmatized output depends on the POS tags: "meeting" stays "meeting" when tagged as a noun, and "better" becomes "good" only when tagged as an adjective.

**Differences between stemming and lemmatization**

- Stemming strips endings with fixed rules and may produce non-words (like "wa") or collapse distinct words onto one stem ("organized" and "organizers" both become "organ").
- Lemmatization uses a dictionary and grammatical information to return valid base forms (like "organize" or "be").
- Lemmatization handles irregular forms ("better" → "good", "was" → "be"); stemming leaves "better" unchanged and turns "was" into "wa".
- Stemming is much faster but less accurate; lemmatization is slower and needs POS information, but its output is more meaningful.

**When to use each**

- Search engines and information retrieval: stemming is often preferred. It is cheap on large collections, and conflating variants ("organize", "organizing", "organizers") raises recall, at some cost in precision.
- Text classification and sentiment analysis: lemmatization is preferred when meaning matters, because the features remain real words and irregular forms are merged correctly.
- In short, stemming optimizes for speed and lemmatization for linguistic accuracy.
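
The contrast between rule-based stripping and dictionary lookup can be illustrated with a toy example. The sketch below is deliberately not the Porter algorithm and not a real lemmatizer: `SUFFIXES`, `LEMMAS`, `toy_stem`, and `toy_lemmatize` are hypothetical names with hand-picked entries, chosen only to show why one approach yields non-words and the other needs a lexicon and POS tags.

```python
# Toy suffix rules (NOT the Porter algorithm): strip the first matching ending.
SUFFIXES = ("izers", "izing", "ized", "ing", "s")


def toy_stem(word):
    """Strip a known suffix if at least two characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


# Toy lemma dictionary keyed by (word, coarse POS tag).
# A real lemmatizer uses a full lexicon such as WordNet.
LEMMAS = {
    ("was", "VERB"): "be",
    ("is", "VERB"): "be",
    ("did", "VERB"): "do",
    ("better", "ADJ"): "good",
    ("organized", "VERB"): "organize",
    ("organizing", "VERB"): "organize",
    ("organizers", "NOUN"): "organizer",
}


def toy_lemmatize(word, pos):
    """Look the word up with its POS tag; unknown words are returned unchanged."""
    return LEMMAS.get((word, pos), word)


for word, pos in [("was", "VERB"), ("organizers", "NOUN"), ("better", "ADJ"), ("meeting", "NOUN")]:
    print(f"{word:12} stem={toy_stem(word):8} lemma={toy_lemmatize(word, pos)}")
```

The stemmer needs no tags and no dictionary, which is where its speed comes from; the lemmatizer gets "better" right only because the (word, tag) pair is in its table.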

---

## Question 4: POS Tagging for Ambiguity Resolution

Examine these ambiguous sentences:
1. "The duck is ready to eat."
2. "They can fish."
3. "Time flies like an arrow."

Explain:
- How POS tagging helps resolve these ambiguities
- The difference between rule-based and probabilistic POS tagging approaches
- Which approach would perform better for each sentence and why
- Limitations of both approaches

**Hint**: Consider how context and word order influence tagging. Think about the Hidden Markov Model approach for probabilistic tagging vs pattern-matching rules.


**1. How POS Tagging Resolves Ambiguities**

POS tagging assigns a part of speech (noun, verb, adjective, and so on) to each word. Ambiguity arises when a word can take several tags depending on context. For example, "duck" can be a noun (the bird) or a verb (to lower your head), and "fish" can be a noun or a verb. Choosing the correct tag selects one reading, which parsing, semantic analysis, and other NLP tasks rely on. POS tagging can only separate readings that differ in their tags; readings that share the same tag sequence need syntactic or semantic analysis.

**2. Rule-based vs Probabilistic POS Tagging**

Rule-based POS tagging:
- Uses hand-crafted rules based on word patterns, suffixes, capitalization, and neighboring words.
- Example: if a word follows "the", it is likely a noun (or an adjective modifying one).
- Pros: Transparent and interpretable.
- Cons: Hard to cover all cases; fails on unseen patterns and idiomatic expressions.

Probabilistic POS tagging:
- Uses statistical models (like Hidden Markov Models) trained on large annotated corpora.
- Picks the tag sequence with the highest probability, combining how likely each word is under a tag with how likely each tag is to follow the previous one.
- Pros: Handles context better and assigns tags to unfamiliar words from the surrounding tag sequence.
- Cons: Requires annotated corpora; makes mistakes when the context is unusual.

**3. Performance for Each Sentence**

- Sentence 1: "The duck is ready to eat."
  - Ambiguity: "duck" can be a noun or a verb in general, but after "The" both approaches tag it as a noun. The remaining ambiguity (is the duck about to eat, or about to be eaten?) is structural and semantic: both readings share the same tag sequence.
  - Best approach: Neither tagger resolves this on its own. A probabilistic tagger is sufficient for the tags, but the choice of reading needs parsing and context.
- Sentence 2: "They can fish."
  - Ambiguity: "can" = modal or main verb (to put into cans); "fish" = verb or noun. The two readings are "They [modal] [verb]" (ability) and "They [verb] [noun]" (they put fish into cans).
  - Best approach: Probabilistic. "can" is a modal far more often than a main verb, and pronoun → modal → verb is a frequent sequence, so the ability reading wins. A rule-based tagger needs an explicit rule for this case and may misinterpret it.
- Sentence 3: "Time flies like an arrow."
  - Ambiguity: "Time" = noun or verb; "flies" = verb or plural noun; "like" = preposition or verb.
  - Best approach: Probabilistic. Sequence probabilities favor "Time [noun] flies [verb] like [prep] an arrow". A rule-based tagger may fail because of the idiomatic structure.

**4. Limitations**

Rule-based:
- Cannot easily handle unusual or idiomatic sentences.
- Requires extensive manual rule writing and maintenance.

Probabilistic:
- Depends on the quality and size of the training corpus.
- May misclassify rare or creative uses of language, because it always prefers the most likely reading.
- Models probabilities only; it has no understanding of meaning.
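
To make the HMM approach concrete, here is a minimal Viterbi sketch for "They can fish". All probabilities in `START`, `TRANS`, and `EMIT` are invented for illustration (they are assumptions, not estimates from any corpus), and the tag set is reduced to four hypothetical tags. The code shows how transition and emission probabilities combine to select one tag sequence.

```python
TAGS = ["PRON", "MODAL", "VERB", "NOUN"]

# Hand-set, hypothetical probabilities for illustration only.
START = {"PRON": 0.6, "MODAL": 0.05, "VERB": 0.1, "NOUN": 0.25}
TRANS = {
    "PRON": {"PRON": 0.05, "MODAL": 0.4, "VERB": 0.5, "NOUN": 0.05},
    "MODAL": {"PRON": 0.05, "MODAL": 0.05, "VERB": 0.8, "NOUN": 0.1},
    "VERB": {"PRON": 0.2, "MODAL": 0.1, "VERB": 0.1, "NOUN": 0.6},
    "NOUN": {"PRON": 0.1, "MODAL": 0.2, "VERB": 0.4, "NOUN": 0.3},
}
EMIT = {
    "PRON": {"they": 0.2},
    "MODAL": {"can": 0.3},
    "VERB": {"can": 0.005, "fish": 0.01},
    "NOUN": {"can": 0.005, "fish": 0.02},
}


def viterbi(words):
    """Return the most probable tag sequence for the given words."""
    # Each column maps tag -> (probability of the best path ending in tag, previous tag).
    columns = [{t: (START[t] * EMIT[t].get(words[0], 0.0), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            prob, prev = max((columns[-1][p][0] * TRANS[p][tag], p) for p in TAGS)
            column[tag] = (prob * EMIT[tag].get(word, 0.0), prev)
        columns.append(column)

    # Backtrack from the best final tag.
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


print(viterbi(["they", "can", "fish"]))
```

With these numbers the modal reading wins because "can" is far more likely to be emitted by the modal tag than by the verb tag. Different probabilities, or a wider context about a cannery, could favor the other reading.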

---

## Question 5: Named Entity Recognition System Design

You need to build an NER system for extracting information from medical reports. The text contains:
- Disease names ("Type 2 Diabetes", "COVID-19")
- Medication names ("Metformin", "Ibuprofen 200mg")
- Dosages and measurements
- Doctor and patient names
- Hospital names and dates

Compare dictionary-based and CRF-based NER methods for this application:
- Advantages and disadvantages of each approach
- How would you handle new drug names not in the dictionary?
- What features would you use in a CRF model?
- How would you combine both approaches for optimal results?

**Hint**: Consider that medical terminology is specialized but relatively standardized. Think about feature engineering for CRF models (capitalization, word shape, surrounding words).


**1. Dictionary-based NER**

Advantages:
- Simple and fast to implement.
- Works well for standardized medical terms (common diseases, drug names).
- High precision when the dictionary contains the term.

Disadvantages:
- Cannot detect new or misspelled entities.
- Sensitive to surface variations (e.g., "Metformin 500mg" vs "Metformin").
- Uses no context, so related terms are easily confused (e.g., "diabetes" vs "prediabetes").

Handling new drug names:
- Update the dictionary regularly from drug databases (RxNorm, DrugBank).
- Use fuzzy matching to capture typos and minor variations.
- Combine with pattern-based rules for dosages (e.g., the regular expression `\d+mg`).

**2. CRF-based NER**

Advantages:
- Context-aware: it considers surrounding words when labeling each token.
- Can recognize unseen entities if their patterns and features resemble the training data.
- Labels multi-token entities as a sequence, which helps with complex mentions like "COVID-19 vaccine 2nd dose".

Disadvantages:
- Requires labeled training data, which is costly for medical text.
- Slower to train and to run than a dictionary lookup.
- Performance depends on feature engineering.

Features for the CRF model:
- Lexical features: the word itself, its lowercase form, prefixes and suffixes.
- Orthographic features: capitalization, digits, special characters (e.g., the hyphen in "COVID-19").
- Context features: neighboring words (a window of ±2 words).
- Part-of-speech tags: verbs, nouns, numbers.
- Word shape: patterns like "Xx", "XX", and "dd" for numeric dosages.
- Dictionary features: a flag marking whether the word appears in a medical dictionary.

**3. Combining Both Approaches**

Strategy:
- Use dictionary lookup first to capture known, standardized entities (high precision).
- Use CRF-based tagging for:
  - Unseen drug names or variations
  - Entities that need context to disambiguate (e.g., "Type 2 Diabetes" vs the ordinary word "type")

Post-processing:
- Resolve conflicts when the CRF and the dictionary disagree.
- Normalize entities (e.g., map "Metformin 500mg" to the drug "Metformin" and keep "500mg" as its dosage).

Benefits of the combination:
- High precision from the dictionary-based component.
- High recall from the CRF, which handles unseen and context-dependent entities.
- Fuzzy matching and dosage patterns cover misspellings and measurements that neither component handles well alone.
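
The feature list above can be sketched as a per-token feature function. This is an illustrative sketch, not a complete NER system: `DRUG_LEXICON` is a hypothetical two-entry stand-in for a real resource, the unit list in `DOSAGE_RE` is an assumption, and the context window is reduced to ±1 to keep the example short. Feature dictionaries of this shape are the kind of input that CRF toolkits typically accept.

```python
import re

# Hypothetical mini-lexicon; a real system would load terms from a drug database.
DRUG_LEXICON = {"metformin", "ibuprofen"}

# Dosage pattern with an assumed list of units.
DOSAGE_RE = re.compile(r"^\d+(?:\.\d+)?(?:mg|ml|mcg|g)$", re.IGNORECASE)


def word_shape(token):
    """Map uppercase letters to X, lowercase letters to x, and digits to d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"\d", "d", shape)


def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    token = tokens[i]
    return {
        "lower": token.lower(),                      # lexical
        "suffix3": token[-3:].lower(),               # lexical
        "is_title": token.istitle(),                 # orthographic
        "has_digit": any(ch.isdigit() for ch in token),
        "has_hyphen": "-" in token,
        "shape": word_shape(token),                  # word shape
        "in_drug_lexicon": token.lower() in DRUG_LEXICON,  # dictionary flag
        "is_dosage": bool(DOSAGE_RE.match(token)),   # pattern rule
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


tokens = ["Patient", "takes", "Metformin", "500mg", "daily"]
for i, token in enumerate(tokens):
    feats = token_features(tokens, i)
    print(token, feats["shape"], feats["in_drug_lexicon"], feats["is_dosage"])
```

Feeding the dictionary flag and the dosage pattern into the CRF as features is one way to combine both approaches: the model can learn how much to trust the dictionary in each context instead of applying it as a hard rule.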

---

## Question 6: N-gram Language Models and Perplexity

Given a small corpus:
```
"I love machine learning"
"I love deep learning"
"Machine learning is fascinating"
"Deep learning is powerful"
```

a) Build a bigram language model and calculate probabilities for:
   - "I love natural learning"
   - "Machine learning is powerful"

b) Explain the zero-probability problem and demonstrate:
   - How Laplace smoothing addresses it
   - The concept of backoff strategies
   - How to calculate and interpret perplexity

c) Discuss why lower perplexity indicates a better language model.

**Hint**: For unseen bigrams like "natural learning", consider what probability would be assigned without smoothing. Calculate perplexity as a measure of how "surprised" the model is.


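**a) Bigram Model Calculation**

A bigram model estimates P(word₂ | word₁) = count(word₁ word₂) / count(word₁). The figures below assume the corpus is lowercased and each sentence is wrapped in boundary markers `<s>` and `</s>` (the question does not specify either). For example, P(love | i) = 2/2 = 1 and P(learning | machine) = 2/2 = 1.

- "I love natural learning": the probability is 0, because "natural" never occurs in the corpus, so count(love natural) = 0 and the whole product becomes 0.
- "Machine learning is powerful": P(machine | `<s>`) · P(learning | machine) · P(is | learning) · P(powerful | is) · P(`</s>` | powerful) = 1/4 · 1 · 2/4 · 1/2 · 1 = 1/16 = 0.0625.

**b) Zero-Probability Problem and Solutions**

The zero-probability problem occurs when a single unseen bigram (like "love natural" or "natural learning") makes the probability of the entire sentence zero.

- Laplace smoothing adds 1 to every bigram count: P(word₂ | word₁) = (count(word₁ word₂) + 1) / (count(word₁) + V), where V is the vocabulary size. No bigram has zero probability afterwards. A word that is missing from the vocabulary altogether, such as "natural", also needs an unknown-word token.
- Backoff models fall back to a lower-order n-gram (for a bigram model, the unigram probability) when the higher-order n-gram was not observed, which makes the model more robust.
- Perplexity is the inverse probability of the test text, normalized by the number of predicted tokens N: PP = P(w₁ … w_N)^(-1/N). For "Machine learning is powerful" with N = 5 (four words plus `</s>`), the unsmoothed model gives PP = (1/16)^(-1/5) ≈ 1.74.

**c) Perplexity and Model Quality**

Perplexity measures how well a model predicts held-out test data. Lower perplexity means the model assigned higher probability to the sentences that actually occurred, so it is less "surprised" by real text. It can be read as the average number of equally likely next words the model is choosing between. Perplexities are comparable only between models evaluated on the same test set with the same vocabulary.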
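
The calculations above can be reproduced with a few lines of plain Python. The sketch makes several assumptions that the question leaves open: text is lowercased, sentences are wrapped in `<s>` and `</s>`, and unseen words are mapped to a hypothetical `<unk>` token that is counted in the vocabulary size. The function names (`prob`, `sentence_prob`, `perplexity`) are illustrative.

```python
import math
from collections import Counter

corpus = [
    "I love machine learning",
    "I love deep learning",
    "Machine learning is fascinating",
    "Deep learning is powerful",
]


def tokenize(sentence):
    """Lowercase, split on whitespace, and add sentence-boundary markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]


context_counts = Counter()  # how often each word occurs as the first word of a bigram
bigram_counts = Counter()
for sentence in corpus:
    tokens = tokenize(sentence)
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

# Predictable vocabulary: the corpus words, </s>, and <unk> for unseen words (an assumption).
VOCAB = {w for s in corpus for w in s.lower().split()} | {"</s>", "<unk>"}


def prob(prev, word, k=0.0):
    """P(word | prev) with add-k smoothing: k=0 is unsmoothed, k=1 is Laplace."""
    prev = prev if prev in VOCAB or prev == "<s>" else "<unk>"
    word = word if word in VOCAB else "<unk>"
    denominator = context_counts[prev] + k * len(VOCAB)
    return (bigram_counts[(prev, word)] + k) / denominator if denominator else 0.0


def sentence_prob(sentence, k=0.0):
    tokens = tokenize(sentence)
    return math.prod(prob(p, w, k) for p, w in zip(tokens, tokens[1:]))


def perplexity(sentence, k=1.0):
    """Inverse probability normalized by the number of predicted tokens."""
    n = len(tokenize(sentence)) - 1
    return sentence_prob(sentence, k) ** (-1 / n)


for s in ["I love natural learning", "Machine learning is powerful"]:
    print(s, "| unsmoothed:", sentence_prob(s), "| Laplace:", sentence_prob(s, k=1),
          "| perplexity (Laplace):", round(perplexity(s), 2))
```

Under Laplace smoothing the sentence with the unseen word still receives a nonzero probability, and its perplexity comes out higher than that of the sentence built from seen bigrams, which matches the interpretation of perplexity as surprise.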

---

## Question 7: Bag-of-Words vs TF-IDF Analysis

Consider three documents:
- Doc1: "Machine learning is a subset of artificial intelligence"
- Doc2: "Deep learning is a subset of machine learning"
- Doc3: "Artificial intelligence and machine learning are transforming industries"

a) Construct the BoW representation and TF-IDF vectors for all documents

b) Calculate cosine similarity between documents using both representations

c) Explain:
   - Why the similarity scores differ between BoW and TF-IDF
   - Which representation better captures document similarity for:
     - Information retrieval
     - Document clustering
     - Topic modeling
   - Limitations of both approaches

**Hint**: Consider how TF-IDF downweights common terms like "is" and "a". Think about what information is lost (word order, context, semantics).

```python
# -------------------------------
# Q7: Bag-of-Words vs TF-IDF Analysis
# -------------------------------

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Sample documents
docs = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Artificial intelligence and machine learning are transforming industries"
]
doc_labels = ["Doc1", "Doc2", "Doc3"]

# --- (a) BoW and TF-IDF Representations ---

# Bag-of-Words: raw term counts. The default tokenizer lowercases the text
# and drops single-character tokens such as "a".
bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(docs)
bow_df = pd.DataFrame(bow_matrix.toarray(), index=doc_labels,
                      columns=bow_vectorizer.get_feature_names_out())

# TF-IDF: counts reweighted by inverse document frequency, then L2-normalized
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), index=doc_labels,
                        columns=tfidf_vectorizer.get_feature_names_out())

# Display both representations
print("=== Bag-of-Words Representation ===")
print(bow_df)
print("\n=== TF-IDF Representation ===")
print(tfidf_df.round(4))

# --- (b) Cosine Similarity Calculations ---

# Pairwise cosine similarity for BoW
bow_cosine_sim = cosine_similarity(bow_matrix)
print("\n=== Cosine Similarity (BoW) ===")
print(pd.DataFrame(bow_cosine_sim, index=doc_labels, columns=doc_labels).round(3))

# Pairwise cosine similarity for TF-IDF
tfidf_cosine_sim = cosine_similarity(tfidf_matrix)
print("\n=== Cosine Similarity (TF-IDF) ===")
print(pd.DataFrame(tfidf_cosine_sim, index=doc_labels, columns=doc_labels).round(3))

# --- (c) Observations ---
print("\n--- Observations ---")
print("- BoW gives higher similarity scores for every pair here, because terms shared by all documents ('machine', 'learning') count at full weight.")
print("- TF-IDF downweights terms that occur in many documents and emphasizes distinctive ones ('deep', 'transforming'), so the shared terms contribute less.")
print("- TF-IDF is usually the better choice for information retrieval and clustering; count-based topic models such as LDA are normally trained on raw BoW counts.")
print("- Both representations ignore word order and context, and treat synonyms as unrelated terms.")
```
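
To show the arithmetic behind the BoW cosine scores, the similarity can also be computed by hand without scikit-learn. In this sketch the helper names `bow` and `cosine` are illustrative, and the regular expression is chosen to mirror `CountVectorizer`'s default token pattern (lowercased tokens of two or more word characters).

```python
import math
import re
from collections import Counter

docs = {
    "Doc1": "Machine learning is a subset of artificial intelligence",
    "Doc2": "Deep learning is a subset of machine learning",
    "Doc3": "Artificial intelligence and machine learning are transforming industries",
}


def bow(text):
    """Term counts over lowercased tokens of two or more word characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = {name: bow(text) for name, text in docs.items()}
for x, y in [("Doc1", "Doc2"), ("Doc1", "Doc3"), ("Doc2", "Doc3")]:
    print(x, y, round(cosine(vectors[x], vectors[y]), 3))
```

For Doc1 and Doc2 the dot product is 6 (five shared terms, with "learning" counted twice in Doc2), the norms are √7 and 3, and the similarity is 6 / (√7 · 3) ≈ 0.756.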


---

## Question 8: Word2Vec Architectures Deep Dive

Explain the Word2Vec model by addressing:

a) **CBOW (Continuous Bag of Words)**:
   - Architecture and training objective
   - How context words predict the target word
   - Best use cases

b) **Skip-gram**:
   - Architecture and training objective
   - How target word predicts context words
   - Best use cases

c) For the sentence "The quick brown fox jumps over the lazy dog" (window size = 2):
   - Show training examples for both CBOW and Skip-gram when target word is "fox"
   - Explain which architecture works better for:
     - Small datasets
     - Rare words
     - Frequent words

**Hint**: CBOW is faster and works well with frequent words, while Skip-gram is better for rare words and smaller datasets. Consider the number of training instances generated.


**a) CBOW (Continuous Bag of Words)**

- Architecture: A shallow network that combines the vectors of the surrounding context words within a window and uses the result to predict the target word. The order of the context words is ignored.
- Objective: Maximize P(target | context).
- Use case: Fast and efficient on large datasets; works well for frequent words.

**b) Skip-gram**

- Architecture: Uses the target word's vector to predict each of its context words separately.
- Objective: Maximize P(context | target).
- Use case: Better for small datasets and for learning representations of rare words.

**c) Example**: "The quick brown fox jumps over the lazy dog" (window = 2, target = "fox")

- CBOW training example (one per target position): context = ["quick", "brown", "jumps", "over"] → target = "fox".
- Skip-gram training pairs (one per context word): ("fox" → "quick"), ("fox" → "brown"), ("fox" → "jumps"), ("fox" → "over").

Performance:
- Small datasets: Skip-gram performs better, because it generates more training instances from the same text.
- Rare words: Skip-gram learns them more effectively, since each occurrence of a rare word yields several separate updates.
- Frequent words: CBOW trains faster and performs better.
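
The difference in the number of training instances can be checked with a short script. This sketch only generates the training examples, not the embeddings; the function name `training_examples` is illustrative, and the sentence is lowercased and split on whitespace as a simplifying assumption.

```python
def training_examples(tokens, window=2):
    """Return (CBOW examples, skip-gram pairs) for a tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))                  # many context words -> one target
        skipgram.extend((target, c) for c in context)   # one target -> each context word
    return cbow, skipgram


tokens = "the quick brown fox jumps over the lazy dog".split()
cbow, skipgram = training_examples(tokens, window=2)

print("CBOW for 'fox':     ", [ex for ex in cbow if ex[1] == "fox"])
print("Skip-gram for 'fox':", [ex for ex in skipgram if ex[0] == "fox"])
print("Totals:", len(cbow), "CBOW examples,", len(skipgram), "skip-gram pairs")
```

For this nine-word sentence CBOW produces 9 examples while skip-gram produces 30 pairs, which is one reason skip-gram is slower to train but extracts more signal from small corpora.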





---

## Question 9: GloVe vs FastText Comparison

Compare and contrast GloVe and FastText embedding techniques:

a) **Training methodology**:
   - How does GloVe use global co-occurrence statistics?
   - How does FastText incorporate subword information?

b) **Handling Out-of-Vocabulary (OOV) words**:
   - Given the trained words: "playing", "player", "played"
   - How would each model handle the unseen word "gameplay"?
   - Which model is more suitable for morphologically rich languages (e.g., German, Turkish)?

c) **Practical considerations**:
   - Training time and computational requirements
   - Model size and memory footprint
   - Performance on rare and misspelled words

**Hint**: FastText breaks words into character n-grams (e.g., "playing" → "<pl", "pla", "lay", "ayi", "yin", "ing", "ng>"). GloVe uses matrix factorization on co-occurrence counts.


**a) Training Methodology**

- GloVe: Uses global co-occurrence statistics. It builds a word-context co-occurrence matrix over the whole corpus and learns embeddings by factorizing it, so that words with similar co-occurrence patterns end up close together in vector space.
- FastText: Extends Word2Vec by representing each word as a bag of character n-grams. It learns subword embeddings that capture morphology, and a word's vector is composed from the vectors of its n-grams.

**b) Handling OOV (Out-of-Vocabulary) Words**

Example: trained on "playing", "player", "played".

- GloVe: Cannot represent the unseen word "gameplay". It has one vector per vocabulary word, so the word is truly OOV.
- FastText: Builds a vector from the word's n-grams ("<ga", "gam", "ame", "mep", "epl", "pla", "lay", "ay>"). The n-grams "pla" and "lay" were learned from the training words, so the result lands near the "play" family; n-grams never seen in training contribute nothing useful.
- Morphologically rich languages: FastText is more suitable, since prefixes, suffixes, and inflections share n-grams across word forms.

**c) Practical Considerations**

- Training time: GloVe trains on aggregated co-occurrence counts, so training is fast once the matrix is built, although building the matrix is memory-intensive. FastText is slower than plain Word2Vec because each update also touches the word's n-gram vectors.
- Model size: FastText models are larger since they store n-gram embeddings; GloVe is more compact.
- Rare/misspelled words: FastText handles them better by composing subword features. GloVe has no vector for unseen forms and only weak vectors for rare ones.
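
The subword overlap for "gameplay" can be verified directly. The sketch below only extracts character n-grams and is not the FastText training procedure; `char_ngrams` is an illustrative helper, and it uses 3-grams only to match the hint, whereas FastText typically combines several n-gram lengths.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Character n-grams of a word wrapped in boundary markers < and >."""
    marked = f"<{word}>"
    return {
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }


trained_words = ["playing", "player", "played"]
known_ngrams = set().union(*(char_ngrams(w) for w in trained_words))

oov_ngrams = char_ngrams("gameplay")
shared = sorted(oov_ngrams & known_ngrams)

print("n-grams of 'gameplay':", sorted(oov_ngrams))
print("seen during training: ", shared)
```

Only "lay" and "pla" overlap at length 3, so the composed vector for "gameplay" rests on a small part of the word. Including longer n-grams (such as "play") would strengthen the link to the trained words.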


---

## Question 10: Classical vs Distributed Representations - Application Perspective

You are tasked with building three different NLP applications:

1. **Legal document search engine** (searching through contracts and legal texts)
2. **Chatbot intent classification** (understanding user queries)
3. **Academic paper recommendation system** (suggesting related research papers)

For each application:

a) Decide whether to use classical representations (BoW/TF-IDF) or distributed representations (Word2Vec/GloVe/FastText)

b) Justify your choice by considering:
   - Semantic similarity requirements
   - Vocabulary size and domain specificity
   - Training data availability
   - Computational constraints
   - Interpretability needs

c) Discuss hybrid approaches: Could combining both representation types improve performance? How?

**Hint**: Legal documents might require exact term matching, while chatbots benefit from semantic understanding. Consider that classical methods are sparse and interpretable, while distributed representations are dense and capture semantic relationships.


**1. Legal Document Search Engine**

a) Representation: Classical (TF-IDF)

b) Justification: Legal search relies on exact term matching and interpretability. TF-IDF highlights distinctive legal terms, handles domain-specific vocabulary without pretraining, and needs no labeled training data. It is computationally efficient, and a match is easy to explain to users because the matching terms can be shown.

c) Hybrid approach: Combining TF-IDF with Word2Vec embeddings can retrieve documents that are semantically related even when they use different legal terms (e.g., "agreement" vs "contract").

**2. Chatbot Intent Classification**

a) Representation: Distributed (Word2Vec or FastText)

b) Justification: Chatbots must recognize semantic similarity across diverse phrasing (e.g., "book flight" ≈ "reserve a ticket"). Distributed embeddings capture meaning beyond surface forms, which improves generalization and intent recognition accuracy. FastText also copes with the typos that are common in user queries.

c) Hybrid approach: Combining embeddings with TF-IDF features adds exact keyword signals, which can improve predictions for rare intents or domain-specific terms.

**3. Academic Paper Recommendation System**

a) Representation: Distributed (Doc2Vec or sentence embeddings)

b) Justification: Recommendations require understanding semantic relationships between papers (e.g., "neural networks" vs "deep learning"). Distributed embeddings place related topics close together, enabling similarity search across large vocabularies.

c) Hybrid approach: A hybrid model combining TF-IDF (for keyword precision) and embeddings (for topic similarity) can improve recommendation quality by covering both exact-term relevance and broader topical coverage.
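
One simple way to realize the hybrid idea is a weighted sum of a lexical similarity and an embedding similarity. The sketch below is an assumption about how such a combination could look, not a prescribed method: the function `hybrid_score`, the weight `alpha`, and all vectors are hypothetical and hand-made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_score(tfidf_q, tfidf_d, emb_q, emb_d, alpha=0.5):
    """alpha weights the TF-IDF similarity; (1 - alpha) weights the embedding similarity."""
    return alpha * cosine(tfidf_q, tfidf_d) + (1 - alpha) * cosine(emb_q, emb_d)


# Hypothetical vectors: the query says "agreement", the document says "contract".
# TF-IDF dimensions are [agreement, contract], so the two share no terms.
tfidf_query, tfidf_doc = [1.0, 0.0], [0.0, 1.0]
# Invented embeddings that place the two words close together.
emb_query, emb_doc = [0.9, 0.1], [0.8, 0.2]

print("lexical only:", hybrid_score(tfidf_query, tfidf_doc, emb_query, emb_doc, alpha=1.0))
print("hybrid:      ", round(hybrid_score(tfidf_query, tfidf_doc, emb_query, emb_doc), 3))
```

A higher `alpha` suits the legal search engine, where exact terms matter most, while a lower `alpha` suits the chatbot and the recommender, where paraphrases should still match.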

---

## Submission Guidelines

- Complete all questions in this notebook
- Include code implementations where applicable (using NLTK, spaCy, scikit-learn, or gensim)
- Provide clear explanations and reasoning
- Add visualizations if they help explain your answers
- Ensure your code is properly commented