In [3]:
# Step 1: Define a list of stopwords (commonly used words that don’t carry much meaning)
stopwords = ["to", "is", "a"]

# Step 2: Define a list of special characters to be filtered out
special_char = [",", ":", " ", ";", ".", "?"]

# Step 3: Write two sentences as input to the model (corpus of text)
string1 = "Welcome to Great Learning , Now start learning"
string2 = "Learning is a good practice"

# Step 4: Convert both sentences to lowercase to ensure consistency
string1 = string1.lower()
string2 = string2.lower()

# Step 5: Split (tokenize) the lowercase sentences into individual words
tokens1 = string1.split()  # e.g., ['welcome', 'to', 'great', 'learning', ',', 'now', 'start', 'learning']
tokens2 = string2.split()  # e.g., ['learning', 'is', 'a', 'good', 'practice']

# Step 6: Print tokenized word lists for both sentences (for understanding)
print("Tokens 1:", tokens1)
print("Tokens 2:", tokens2)

# Step 7: Combine both token lists and extract a unique vocabulary
def unique(sequence):
    """
    Returns a list of unique elements from the sequence, preserving the order.
    """
    seen = set()
    # How the condition below works:
    #   x in seen   -> True for a duplicate, so x is skipped.
    #   seen.add(x) -> runs only for a new x; it records x and returns None,
    #                  which is falsy, so `not (...)` is True and x is kept.
    return [x for x in sequence if not (x in seen or seen.add(x))]

vocab = unique(tokens1 + tokens2)
print("Full Vocabulary (before filtering):", vocab)

# Step 8: Filter the vocabulary to exclude stopwords and special characters
filtered_vocab = []
for w in vocab:
    if w not in stopwords and w not in special_char:
        filtered_vocab.append(w)
print("Filtered Vocabulary (final BoW vocab):", filtered_vocab)

# Step 9: Define the function to convert a list of tokens into a Bag of Words vector
def vectorize(tokens):
    """
    Converts a list of tokens into a BoW vector using filtered_vocab.
    Each position in the vector corresponds to the frequency of a word from filtered_vocab.
    """
    vector = []
    for w in filtered_vocab:
        vector.append(tokens.count(w)) # Counts how many times the word w appears in the list tokens.
    return vector

# Step 10: Convert each sentence into a Bag of Words vector
vector1 = vectorize(tokens1)
print("Vector 1 (BoW for Sentence 1):", vector1)

vector2 = vectorize(tokens2)
print("Vector 2 (BoW for Sentence 2):", vector2)


Tokens 1: ['welcome', 'to', 'great', 'learning', ',', 'now', 'start', 'learning']
Tokens 2: ['learning', 'is', 'a', 'good', 'practice']
Full Vocabulary (before filtering): ['welcome', 'to', 'great', 'learning', ',', 'now', 'start', 'is', 'a', 'good', 'practice']
Filtered Vocabulary (final BoW vocab): ['welcome', 'great', 'learning', 'now', 'start', 'good', 'practice']
Vector 1 (BoW for Sentence 1): [1, 1, 2, 1, 1, 0, 0]
Vector 2 (BoW for Sentence 2): [0, 0, 1, 0, 0, 1, 1]
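
The same steps can be written more compactly. The sketch below is one possible alternative, not a replacement for the cell above: it assumes `collections.Counter` for counting and `dict.fromkeys` for order-preserving de-duplication, and it passes the vocabulary to `vectorize` explicitly instead of reading a global. The name `token_lists` is introduced here for illustration only.

```python
from collections import Counter

stopwords = {"to", "is", "a"}
special_char = {",", ":", ";", ".", "?"}

sentences = [
    "Welcome to Great Learning , Now start learning",
    "Learning is a good practice",
]
# Lowercase and split on whitespace, as in Steps 4 and 5
token_lists = [s.lower().split() for s in sentences]

# dict.fromkeys keeps the first occurrence of each token, in order
all_tokens = (t for tokens in token_lists for t in tokens)
vocab = [
    w for w in dict.fromkeys(all_tokens)
    if w not in stopwords and w not in special_char
]

def vectorize(tokens, vocab):
    """Count each vocabulary word once per document using a Counter."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vectors = [vectorize(tokens, vocab) for tokens in token_lists]
print(vocab)
print(vectors)  # [[1, 1, 2, 1, 1, 0, 0], [0, 0, 1, 0, 0, 1, 1]]
```

Counting once with `Counter` avoids calling `tokens.count(w)` for every vocabulary word, which starts to matter when documents get long.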


# 📦 Bag of Words (BoW) vs 🔍 TF-IDF

| Feature           | BoW (Bag of Words)                         | TF-IDF (Term Frequency-Inverse Document Frequency)           |
|-------------------|---------------------------------------------|---------------------------------------------------------------|
| What it captures  | Frequency of each word                     | Frequency adjusted by how rare the word is across all docs    |
| Weights           | Raw count of words                         | Weighted score = TF × IDF                                     |
| Common words      | Appear with high frequency and high score  | Down-weighted if they appear in many documents               |
| Meaningful ranks  | No                                         | Yes: words frequent in a document but rare in the corpus score higher |
| Normalization     | None by default                            | IDF reweights terms; some libraries (e.g., scikit-learn) also L2-normalize each vector by default |
| Use-cases         | Simple tasks, prototyping, spam filters    | NLP tasks, document similarity, search engines, classifiers   |
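
The "TF × IDF" weight in the table can be written out as follows. This is one common textbook variant, shown as an assumption: libraries often add smoothing or use a different logarithm base, so exact values differ between implementations. Here $t$ is a term, $d$ a document, $N$ the number of documents in the corpus, $\mathrm{tf}(t, d)$ the count of $t$ in $d$, and $\mathrm{df}(t)$ the number of documents that contain $t$.

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
$$

A term that occurs in every document has $\mathrm{df}(t) = N$, so $\mathrm{idf}(t) = \log 1 = 0$ and its weight vanishes regardless of how often it appears.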


# 🧠 Example

Sentence: "AI is the future and the future is AI"
Vocabulary: ['AI', 'is', 'the', 'future', 'and']

BoW Vector: [2, 2, 2, 2, 1]
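TF-IDF Vector (illustrative values only; real weights depend on the other documents in the corpus): [0.9, 0.5, 0.1, 0.9, 0.2]

➡ BoW weights words purely by count, so 'is' and 'the' score as high as 'AI' and 'future'.
➡ TF-IDF boosts distinctive terms like 'AI' and reduces the weight of words that are common across documents, like 'is' and 'the'.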
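
To see where such numbers could come from, here is a minimal sketch that computes both vectors for the sentence. Two things are assumptions and not part of the example above: the two extra sentences in `corpus` are hypothetical (IDF needs more than one document), and the IDF variant is the unsmoothed `ln(N / df)`. The resulting weights therefore differ from the illustrative vector, but the ranking pattern is the same.

```python
import math
from collections import Counter

# Only the first sentence comes from the example; the other two are hypothetical.
corpus = [
    "AI is the future and the future is AI",
    "the weather is nice",
    "the train is late",
]
docs = [doc.split() for doc in corpus]
vocab = ["AI", "is", "the", "future", "and"]

# BoW: raw counts of each vocabulary word in the first sentence
counts = Counter(docs[0])
bow = [counts[w] for w in vocab]

# Assumed IDF variant: ln(N / df), df = number of documents containing the word
n_docs = len(docs)
df = {w: sum(1 for d in docs if w in d) for w in vocab}
idf = {w: math.log(n_docs / df[w]) for w in vocab}
tfidf = [round(counts[w] * idf[w], 3) for w in vocab]

print(bow)    # [2, 2, 2, 2, 1]
print(tfidf)  # [2.197, 0.0, 0.0, 2.197, 1.099]
```

Because 'is' and 'the' occur in all three documents, their IDF is zero under this variant, while 'AI' and 'future' keep a high weight.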


# ✅ When to Use What?

🔹 Use BoW when:
- You want a simple, fast, count-based model
- The data is small
- Word frequency itself is useful (e.g., spam detection)

🔸 Use TF-IDF when:
- You need meaningful word importance
- You want to reduce noise from common words
- You are working on search, text classification, or document similarity
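
For the document-similarity use case, either kind of vector can be compared with cosine similarity. The sketch below is an illustration under assumptions: it reuses the two BoW vectors printed earlier in this notebook, and `cosine_similarity` is a helper written here, not a library call.

```python
import math

# BoW vectors for the two sentences from the notebook output
vector1 = [1, 1, 2, 1, 1, 0, 0]
vector2 = [0, 0, 1, 0, 0, 1, 1]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector shares nothing with any document
    return dot / (norm_u * norm_v)

similarity = cosine_similarity(vector1, vector2)
print(round(similarity, 3))  # 0.408
```

The only shared vocabulary word is 'learning', so the similarity is low. With TF-IDF vectors the same function applies unchanged; only the input weights differ.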


# 🧠 Are BoW and TF-IDF Language Models?

❌ No, BoW and TF-IDF are not language models.

Reason:
- They don’t consider word order
- They don’t learn a predictive model from data (the vocabulary and IDF weights are simple corpus statistics)
- They don’t calculate probability of word sequences
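
The first point is easy to check. In this small sketch, `bag_of_words` is a helper defined here for illustration and the two sentences are made up; they contain the same words in a different order and end up with identical bags.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase, split on whitespace, and count tokens; order is discarded."""
    return Counter(sentence.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

print(a == b)  # True
```

The two sentences mean different things, yet a count-based representation cannot tell them apart.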

✅ Language Models:
- Learn the probability of word sequences
- Capture context and word order (to varying degrees)
- Examples: n-gram models, RNNs, LSTMs, Transformers (BERT, GPT)
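
As a contrast with BoW, here is a minimal sketch of the simplest language model in that list, a bigram model. Several things are assumptions made for illustration: the training corpus is just the two lowercase sentences from the notebook, probabilities are unsmoothed maximum-likelihood estimates, `<s>` is a start-of-sentence marker introduced here, and the function names are hypothetical.

```python
from collections import Counter

corpus = [
    "welcome to great learning , now start learning",
    "learning is a good practice",
]

history_counts = Counter()  # how often each token is followed by another token
bigram_counts = Counter()   # how often each (previous, next) pair occurs
for sentence in corpus:
    tokens = ["<s>"] + sentence.split()
    history_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Estimate P(word | prev) from counts; 0.0 if prev was never a history."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]

def sequence_prob(sentence):
    """Probability of a whole sentence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(sequence_prob("learning is a good practice"))  # 0.25
print(sequence_prob("practice good a is learning"))  # 0.0
```

Both inputs have the same bag of words, but the model assigns them different probabilities because it is estimated from word order. Practical n-gram models add smoothing so that unseen pairs do not force the whole product to zero.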

| Technique   | Language Model? | Captures Order? | Learns from Data? |
|-------------|------------------|------------------|--------------------|
| BoW         | ❌ No             | ❌ No             | ❌ No              |
| TF-IDF      | ❌ No             | ❌ No             | ❌ No              |
| BERT/GPT    | ✅ Yes            | ✅ Yes            | ✅ Yes             |
