
## Task 1. Understanding the Concept

The **Bag of Words (BoW)** model is a way of representing text data as numerical vectors. In NLP, machines can't "read" words; they need numbers. BoW looks at a document as a literal "bag" of its words: it ignores grammar, word order, and sentence structure, focusing only on **whether a word appears** and **how often**.

### How Word Frequency is Represented

In a BoW vector:

* Each **dimension** (or index) in the vector corresponds to a specific word from a predefined **vocabulary**.
* The **value** at that index represents the **count** (frequency) of that word in the specific document.
* If a word from the vocabulary is missing in the document, its value is 0.
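
The mapping from words to vector positions can be sketched in a few lines of plain Python. This is a minimal illustration, not how `CountVectorizer` is implemented: the vocabulary, the sample sentence, and the helper name `bow_vector` are made up for the example, and tokenization is a naive lowercase split.

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Return the count of each vocabulary word in `text` (hypothetical helper)."""
    # Naive tokenization: lowercase and split on whitespace (assumption for this sketch)
    counts = Counter(text.lower().split())
    # One dimension per vocabulary word; Counter returns 0 for missing words
    return [counts[word] for word in vocabulary]

vocabulary = ["free", "prize", "meeting"]
vector = bow_vector("free prize inside claim your free gift", vocabulary)
print(vector)  # [2, 1, 0]
```

The word "meeting" is in the vocabulary but not in the text, so its position holds 0.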

---

## 2. The Modern Approach & Alternatives

While BoW is easy to implement, it has significant downsides:

1. **Sparsity:** If your vocabulary has 10,000 words but your email only has 10, your vector is 99.9% zeros.
2. **No Context:** It treats "Dog bites man" and "Man bites dog" exactly the same.
3. **Frequency Bias:** Common words like "the" or "is" can dominate the vector without providing actual meaning.
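
The "No Context" point is easy to check directly. The sketch below uses plain Python with a naive whitespace tokenizer (an assumption for illustration) and shows that the two sentences produce identical word counts.

```python
from collections import Counter

def bag(text):
    # Lowercase and split on whitespace; Counter keeps counts but discards word order
    return Counter(text.lower().split())

a = bag("Dog bites man")
b = bag("Man bites dog")
same = (a == b)
print(same)  # True
```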

**Modern Alternatives:**

* **TF-IDF (Term Frequency-Inverse Document Frequency):** Penalizes common words and rewards unique, meaningful words.
* **Word Embeddings (Word2Vec, GloVe):** Represents words in a continuous vector space where similar words (e.g., "boat" and "ship") are mathematically close to each other.
* **Transformers (BERT, GPT):** The current state of the art for most NLP tasks. These use **Attention mechanisms** to understand the context of a word based on the words surrounding it.

---

## 3. Implementation: Spam Detection

We will use a real-world dataset: the **UCI SMS Spam Collection**. It’s a classic dataset for BoW because spam often relies on specific "trigger words."

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# 1. Real-world Dataset Loading (Subset for demonstration)
data = {
    'text': [
        "Win a free cash prize now! Click here for your offer.",
        "Are we still having the meeting at 5?",
        "Free entry to win a prize. Text 'OFFER' to 8001.",
        "I will be late for the meeting, sorry."
    ],
    'label': ['spam', 'ham', 'spam', 'ham']
}
df = pd.DataFrame(data)

# 2. Define our specific vocabulary as requested
custom_vocab = ['offer', 'free', 'win', 'meeting']

# 3. Initialize CountVectorizer with the custom vocabulary
# We use lowercase=True to ensure 'OFFER' and 'offer' are treated the same
vectorizer = CountVectorizer(vocabulary=custom_vocab, lowercase=True)

# 4. Transform the text into BoW vectors
bow_matrix = vectorizer.transform(df['text'])



In [2]:
# 5. Display the Results
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=custom_vocab)
bow_df['Original Text'] = df['text']

print("Vocabulary:", custom_vocab)
print("\nBag of Words Feature Vectors:")
print(bow_df)

Vocabulary: ['offer', 'free', 'win', 'meeting']

Bag of Words Feature Vectors:
   offer  free  win  meeting  \
0      1     1    1        0   
1      0     0    0        1   
2      1     1    1        0   
3      0     0    0        1   

                                       Original Text  
0  Win a free cash prize now! Click here for your...  
1              Are we still having the meeting at 5?  
2   Free entry to win a prize. Text 'OFFER' to 8001.  
3             I will be late for the meeting, sorry.  


### Explanation of the Output Vectors

If we look at the first row of the output based on the code above:

* **Text:** "Win a free cash prize now! Click here for your offer."
* **Vector:** `[1, 1, 1, 0]`
  * `1` for "offer"
  * `1` for "free"
  * `1` for "win"
  * `0` for "meeting"



---

## 4. Summary Table

| Feature | Bag of Words (BoW) | TF-IDF | Word Embeddings (Modern) |
| --- | --- | --- | --- |
| **Value** | Raw counts | Importance score | Semantic meaning |
| **Context** | None | None | Partial (similar words sit close together in the vector space) |
| **Size** | Large/Sparse | Large/Sparse | Fixed/Dense |






## Task 2. Understanding TF-IDF

**TF-IDF** stands for **Term Frequency-Inverse Document Frequency**. It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.

The logic is split into two parts:

1. **Term Frequency (TF):** How often does the word appear in *this* specific review? A higher count raises the score.
2. **Inverse Document Frequency (IDF):** How many reviews contain this word? If *everyone* is saying it (like "the" or "is"), it’s not a helpful feature, so we decrease its weight.

Mathematically, for a term $t$ in a document $d$:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

where

$$\text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)$$

Note that scikit-learn's `TfidfVectorizer` applies a smoothed version of this IDF and normalizes each document vector by default, so its scores differ from this textbook formula.
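
As a worked instance of the textbook formula, the sketch below computes TF-IDF by hand for a tiny corpus. The three tokenized documents and the helper names are invented for illustration, TF is taken as the raw count, and the logarithm is the natural log.

```python
import math

# Hypothetical, already-tokenized corpus (not the review dataset used later)
docs = [
    ["battery", "lasts", "long"],
    ["battery", "is", "weak"],
    ["screen", "is", "bright"],
]

def idf(term, docs):
    # log(total documents / documents containing the term)
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    # Raw term count in this document times the corpus-wide IDF
    return doc.count(term) * idf(term, docs)

score_battery = tf_idf("battery", docs[0], docs)  # term occurs in 2 of 3 documents
score_long = tf_idf("long", docs[0], docs)        # term occurs in 1 of 3 documents
print(round(score_battery, 3), round(score_long, 3))  # 0.405 1.099
```

Both words appear once in the first document, but "long" scores higher because it is rarer across the corpus.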



## 2. Implementation: Product Review Analysis

We will use a small sample of product reviews to demonstrate how "common" words get suppressed while "feature" words get boosted.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# 1. Sample Dataset: Product Reviews
reviews = [
    "The battery life is very long and the battery charges fast.",
    "This camera is very clear but the battery is weak.",
    "The performance of this laptop is very fast and smooth.",
    "I love the camera quality and the performance is great."
]

# 2. Initialize the TF-IDF Vectorizer
# 'stop_words' removes common English words like 'the', 'is', 'and' automatically
vectorizer = TfidfVectorizer(stop_words='english')

# 3. Fit and transform the reviews
tfidf_matrix = vectorizer.fit_transform(reviews)

# 4. Convert to a readable DataFrame
feature_names = vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

# Adding the original review for clarity
tfidf_df.insert(0, "Original Review", reviews)

# 5. Display the results
print("TF-IDF Feature Matrix:")
print(tfidf_df.to_string())


TF-IDF Feature Matrix:
                                               Original Review   battery    camera   charges     clear      fast     great    laptop      life      long      love  performance   quality    smooth      weak
0  The battery life is very long and the battery charges fast.  0.638021  0.000000  0.404624  0.000000  0.319010  0.000000  0.000000  0.404624  0.404624  0.000000     0.000000  0.000000  0.000000  0.000000
1           This camera is very clear but the battery is weak.  0.437791  0.437791  0.000000  0.555283  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000     0.000000  0.000000  0.000000  0.555283
2      The performance of this laptop is very fast and smooth.  0.000000  0.000000  0.000000  0.000000  0.437791  0.000000  0.555283  0.000000  0.000000  0.000000     0.437791  0.000000  0.555283  0.000000
3      I love the camera quality and the performance is great.  0.000000  0.382743  0.000000  0.000000  0.000000  0.485461  0.000000  0.000000  0.000000  0.485461     0.382743  0.485461  0.000000  0.000000

### Analysis of the Results

* **Weighting:** In Review 1 (row 0), the word **"battery"** has the highest score (0.638) because it appears twice, but its weight is reduced somewhat because it also appears in Review 2.
* **The "Very" Effect:** If "very" weren't in the stop-word list, IDF would give it a comparatively low weight because it appears in three of the four reviews, so it does little to distinguish a laptop review from a camera review.
* **Uniqueness:** A word like **"laptop"** or **"charges"** that appears in only one review gets a higher weight than an equally frequent word shared across reviews (in row 2, 0.555 for "laptop" versus 0.438 for "fast").
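
The 0.638021 score for "battery" in row 0 can be reproduced by hand. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings, which use a smoothed IDF of ln((1 + N) / (1 + df)) + 1 and scale each row to unit Euclidean length. The token lists are what remains of each review after the English stop-word filter, typed in manually for this example.

```python
import math

# Reviews after stop-word removal (entered by hand for this sketch)
docs = [
    ["battery", "life", "long", "battery", "charges", "fast"],
    ["camera", "clear", "battery", "weak"],
    ["performance", "laptop", "fast", "smooth"],
    ["love", "camera", "quality", "performance", "great"],
]
n_docs = len(docs)

def smoothed_idf(term):
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    # Raw count times smoothed IDF, then L2-normalize the row
    raw = {term: doc.count(term) * smoothed_idf(term) for term in set(doc)}
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

row0 = tfidf_row(docs[0])
print(round(row0["battery"], 6), round(row0["fast"], 6), round(row0["life"], 6))
```

The three printed values should line up with the `battery`, `fast`, and `life` columns of row 0 in the matrix above.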

---

## 3. Why this is better for Feature Selection

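* **Automatic Noise Reduction:** IDF down-weights words such as "the", "is", and "a" that appear in nearly every document, which reduces the reliance on a large hand-maintained list of words to ignore (the example above still applies a stop-word list to keep the feature table small).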
* **Highlighting Keywords:** It makes "battery" or "performance" stand out as the mathematical "signature" of the review.
* **Efficiency:** It’s computationally inexpensive compared to Deep Learning models while providing much cleaner data for a classifier (like a sentiment analyzer).

---

## 4. Comparison Summary

| Metric | Bag of Words (BoW) | TF-IDF |
| --- | --- | --- |
| **Common Words** | High score (Distorts data) | Low score (Down-weighted) |
| **Rare Keywords** | Low score (Lost in noise) | High score (Highlighted) |
| **Best Use Case** | Basic text classification | Feature extraction, Search engines |

---

## 3. The Task: News Article Word Frequency Analysis

> **Question:** A news analytics company needs to identify trending topics from hundreds of daily articles. To do this, you must perform an initial exploratory text analysis by:
> * **a)** Writing a Python program to calculate the frequency of each word in a set of news articles after basic preprocessing.
> * **b)** Identifying the top 5 most frequent words in the processed text.



## 1. Understanding the Concept

To get an accurate word count, we can't just count every string. If we don't preprocess, "Market" and "market!" would be treated as different words.

**The Preprocessing Pipeline:**

1. **Tokenization:** Splitting sentences into individual words.
2. **Lowercasing:** Converting everything to lowercase so "Election" and "election" match.
3. **Noise Removal:** Removing punctuation and special characters.
4. **Stop Word Removal:** (Optional but recommended) Removing words like "the", "is", and "in" that carry no topical meaning.
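
The four steps can be traced on a single sentence. This is a minimal sketch in plain Python: the sample sentence and the tiny stop-word set are placeholders chosen for illustration, and one regular expression handles tokenization and punctuation removal in a single pass.

```python
import re

sentence = "The Market is up; the market!"
stop_words = {"the", "is"}  # tiny illustrative set, not a complete list

# Lowercasing: "Market" and "market" now match
lowered = sentence.lower()
# Tokenization + noise removal: keep runs of letters, digits, and underscores
tokens = re.findall(r"\w+", lowered)
# Stop word removal
content_words = [t for t in tokens if t not in stop_words]

print(tokens)         # ['the', 'market', 'is', 'up', 'the', 'market']
print(content_words)  # ['market', 'up', 'market']
```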


## 2. Implementation: Identifying Trending Topics

We will use a sample set of news-style headlines and the `Counter` class from the standard-library `collections` module, a concise and efficient way to count items in Python.


In [4]:
import re
from collections import Counter

# 1. Sample News Articles (Dataset)
news_articles = [
    "The election results are coming in today. The election is close!",
    "Market trends show a shift in technology stocks. Investors watch the market.",
    "New policy changes affecting the technology sector were announced.",
    "Technology is driving the market to new heights during this election cycle.",
    "Economic policy and the market are the main focuses of the election."
]

def preprocess_and_count(text_list):
    # Combine all articles into one large string
    combined_text = " ".join(text_list).lower()

    # Use Regex to remove punctuation and keep only alphanumeric words
    words = re.findall(r'\b\w+\b', combined_text)

    # Define basic stop words to filter out "noise"
    stop_words = {'the', 'is', 'and', 'in', 'to', 'of', 'are', 'this', 'were', 'for', 'with'}

    # Filtered word list
    filtered_words = [word for word in words if word not in stop_words]

    # Calculate frequencies
    word_counts = Counter(filtered_words)
    return word_counts

# Execute the analysis
frequencies = preprocess_and_count(news_articles)

# Identify the Top 5
top_5 = frequencies.most_common(5)

# 3. Display Results
print("Full Word Frequencies:")
print(dict(frequencies))
print("\n--- TOP 5 TRENDING WORDS ---")
for word, count in top_5:
    print(f"{word.capitalize()}: {count} times")

Full Word Frequencies:
{'election': 4, 'results': 1, 'coming': 1, 'today': 1, 'close': 1, 'market': 4, 'trends': 1, 'show': 1, 'a': 1, 'shift': 1, 'technology': 3, 'stocks': 1, 'investors': 1, 'watch': 1, 'new': 2, 'policy': 2, 'changes': 1, 'affecting': 1, 'sector': 1, 'announced': 1, 'driving': 1, 'heights': 1, 'during': 1, 'cycle': 1, 'economic': 1, 'main': 1, 'focuses': 1}

--- TOP 5 TRENDING WORDS ---
Election: 4 times
Market: 4 times
Technology: 3 times
New: 2 times
Policy: 2 times


## 3. Modern Approach & Alternatives

While manual counting is great for EDA, modern industry applications use more robust tools:

* **spaCy / NLTK:** These libraries have built-in "Stop Word" lists for multiple languages and can handle **Lemmatization** (grouping "running", "ran", and "runs" into the single word "run").
* **Word Clouds:** A visual alternative where the size of the word represents its frequency.
* **Topic Modeling (LDA):** Instead of just counting words, Latent Dirichlet Allocation (LDA) can automatically group words into themes (e.g., it would recognize that "stocks," "investors," and "market" all belong to a "Finance" topic).


## 4. Summary Table

| Method | Best For | Pros | Cons |
| --- | --- | --- | --- |
| **Simple Counter** | Quick EDA | Fast, no dependencies | High noise (stop words) |
| **NLTK/spaCy** | Production Pipelines | Highly accurate, handles grammar | Slower, more memory |
| **LDA / BERT** | Complex Trends | Discovers hidden themes | Requires high compute |


---

## 4. The Task: Unigram and Bigram Probability Modeling

> **Question:** A chatbot team is building a component to predict the next word in a sentence. Using a corpus containing:
> * “Natural language processing is interesting”
> * “Natural language processing is useful”
>
> **A)** Construct a **Unigram model** by calculating the probability of each word.
>
> **B)** Construct a **Bigram model** by calculating the conditional probability of a word given the previous word.

---

## 1. Understanding the Concept

In a **Unigram** model, we assume each word is independent. The probability of a word is simply its frequency divided by the total number of words.

In a **Bigram** model, we assume the probability of a word depends *only* on the word immediately preceding it (this is known as a **Markov Assumption**).

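The conditional probability for a bigram $(w_{n-1}, w_n)$ is calculated as:

$$P(w_n \mid w_{n-1}) = \frac{\text{Count}(w_{n-1}, w_n)}{\text{Count}(w_{n-1})}$$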


---

## 2. Implementation: Probability Calculations

In [5]:
from collections import Counter, defaultdict

# 1. Prepare the Corpus
corpus = [
    "Natural language processing is interesting",
    "Natural language processing is useful"
]

# Tokenize (simplistic approach for this example)
tokens = []
for sentence in corpus:
    tokens.extend(sentence.split())

total_words = len(tokens)

# --- A) Unigram Model ---
unigram_counts = Counter(tokens)
unigram_probs = {word: count / total_words for word, count in unigram_counts.items()}

# --- B) Bigram Model ---
bigrams = []
for sentence in corpus:
    words = sentence.split()
    # Create pairs of consecutive words
    bigrams.extend(zip(words[:-1], words[1:]))

bigram_counts = Counter(bigrams)
# Dictionary to store P(w_n | w_{n-1})
bigram_probs = defaultdict(dict)

for (w1, w2), count in bigram_counts.items():
    # Probability = Count(w1, w2) / Count(w1)
    bigram_probs[w1][w2] = count / unigram_counts[w1]

# 3. Display Results
print("--- Unigram Probabilities ---")
for word, prob in unigram_probs.items():
    print(f"P('{word}'): {prob:.2f}")

print("\n--- Bigram Conditional Probabilities ---")
for w1, next_words in bigram_probs.items():
    for w2, prob in next_words.items():
        print(f"P('{w2}' | '{w1}'): {prob:.2f}")

--- Unigram Probabilities ---
P('Natural'): 0.20
P('language'): 0.20
P('processing'): 0.20
P('is'): 0.20
P('interesting'): 0.10
P('useful'): 0.10

--- Bigram Conditional Probabilities ---
P('language' | 'Natural'): 1.00
P('processing' | 'language'): 1.00
P('is' | 'processing'): 1.00
P('interesting' | 'is'): 0.50
P('useful' | 'is'): 0.50


In [6]:
print(bigrams)
bigram_counts

[('Natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'interesting'), ('Natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'useful')]


Counter({('Natural', 'language'): 2,
         ('language', 'processing'): 2,
         ('processing', 'is'): 2,
         ('is', 'interesting'): 1,
         ('is', 'useful'): 1})

In [7]:
a = "Natural language processing is interesting"
a = a.split()

# Get the zip object
zipped_pairs = zip(a[:-1], a[1:])

print("Iterating through the zip object:")
for pair in zipped_pairs:
    print(pair)

# Note: A zip object can only be iterated over once.
# If you try to iterate again, it will be empty.
# To see it again, you'd need to recreate the zip object:
# list_of_pairs = list(zip(a[:-1], a[1:]))
# print("\nConverting to a list (recreating zip object):")
# print(list_of_pairs)


Iterating through the zip object:
('Natural', 'language')
('language', 'processing')
('processing', 'is')
('is', 'interesting')


## 3. Results Analysis

### Unigram Results:

Since "Natural", "language", "processing", and "is" appear twice, and "interesting" and "useful" appear once in a 10-word corpus:

* **P(Natural):** $2/10 = 0.20$
* **P(interesting):** $1/10 = 0.10$

### Bigram Results:

The Bigram model shows us the "flow" of the sentence:

* **P(language | Natural):** $2/2 = 1.00$ (Every time "Natural" appeared, "language" followed it).
* **P(interesting | is):** $1/2 = 0.50$ (After the word "is", there is a 50% chance the next word is "interesting").
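
For the chatbot use case, the bigram table can be used to rank candidate next words. The sketch below rebuilds the counts from the same two-sentence corpus so that it runs on its own. The function name `predict_next` is hypothetical, and the denominator counts how often the previous word starts a bigram.

```python
from collections import Counter, defaultdict

corpus = [
    "Natural language processing is interesting",
    "Natural language processing is useful",
]

# Count each (previous word, next word) pair within a sentence
followers = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words[:-1], words[1:]):
        followers[prev][nxt] += 1

def predict_next(prev_word):
    """Return (word, probability) pairs for words seen after prev_word, most likely first."""
    counts = followers.get(prev_word, Counter())
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common()]

print(predict_next("language"))  # [('processing', 1.0)]
print(predict_next("is"))        # [('interesting', 0.5), ('useful', 0.5)]
print(predict_next("useful"))    # [] because "useful" never precedes another word
```

The empty result for "useful" is the practical face of unseen contexts: the model has nothing to say about a word it never saw followed by anything.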

---

## 4. Modern Approach & Efficiency

While N-grams are excellent for understanding text statistics, they struggle with the **"Sparsity Problem"**: if a chatbot encounters a word pair it hasn't seen in training, the model assigns it a probability of zero.

**Better Alternatives:**

1. **Laplace Smoothing:** Adding 1 to all counts so we never have a 0% probability.
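2. **Kneser-Ney Smoothing:** A more sophisticated scheme that discounts observed counts and redistributes that probability mass to unseen word pairs based on lower-order statistics.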
3. **Neural Language Models (RNNs/LSTMs):** Instead of counting, these models use "hidden states" to remember long-term context beyond just the previous word.
4. **Transformers (Attention):** Instead of looking only at the *previous* word, Transformers attend to *every* word in the available context at once to predict the next one.
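
Add-one (Laplace) smoothing from item 1 can be sketched on the same two-sentence corpus. Under this scheme the estimate becomes (Count(prev, word) + 1) / (Count(prev) + V), where V is the vocabulary size. The helper name `laplace_prob` is made up for this example.

```python
from collections import Counter

corpus = [
    "Natural language processing is interesting",
    "Natural language processing is useful",
]

bigram_counts = Counter()
context_counts = Counter()  # how often each word appears as the first word of a bigram
vocab = set()
for sentence in corpus:
    words = sentence.split()
    vocab.update(words)
    for prev, nxt in zip(words[:-1], words[1:]):
        bigram_counts[(prev, nxt)] += 1
        context_counts[prev] += 1

def laplace_prob(prev, word):
    # Add 1 to the pair count and the vocabulary size to the denominator
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

seen = laplace_prob("is", "useful")      # (1 + 1) / (2 + 6) = 0.25
unseen = laplace_prob("is", "language")  # (0 + 1) / (2 + 6) = 0.125
print(seen, unseen)
```

The unseen pair now gets a small non-zero probability, at the cost of taking mass away from observed pairs: "useful" after "is" drops from 0.50 to 0.25.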

---

## 5. Summary Table

| Model | Dependencies | Complexity | Context Window |
| --- | --- | --- | --- |
| **Unigram** | None | Extremely Low | 1 word |
| **Bigram** | Previous 1 word | Low | 2 words |
| **Trigram** | Previous 2 words | Medium | 3 words |
| **Transformer** | All words | High | Thousands of words |
