### 🔹 What is an N-gram Model?

An **N-gram model** is a type of **language model** that predicts the next word in a sequence based on the previous \( n - 1 \) words.

Examples:
- **Unigram**: 1 word at a time → "I", "like", "pizza"
- **Bigram**: 2 words → "I like", "like pizza"
- **Trigram**: 3 words → "I like pizza"

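It approximates the probability of a word given its full history by conditioning only on the previous \( n - 1 \) words:
\[
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})
\]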

✅ So yes — **N-gram is a basic statistical language model**.
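
To make the idea concrete, here is a minimal sketch of a bigram model (n = 2) that estimates P(word | previous word) from raw counts. The toy sentences and the `bigram_prob` helper are assumptions made up for this illustration, and no smoothing is applied, so unseen pairs simply get probability 0.

```python
from collections import Counter

# Toy training sentences, assumed for illustration only.
sentences = [
    "i like pizza",
    "i like pasta",
    "you like pizza",
]

bigram_counts = Counter()   # how often (previous word, word) occurs
history_counts = Counter()  # how often a word occurs as the previous word

for sentence in sentences:
    tokens = sentence.split()
    for prev_word, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev_word, word)] += 1
        history_counts[prev_word] += 1


def bigram_prob(prev_word, word):
    """Count-based estimate of P(word | prev_word), without smoothing."""
    if history_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / history_counts[prev_word]


# "like" is followed by "pizza" in 2 of its 3 occurrences.
print(bigram_prob("like", "pizza"))
print(bigram_prob("i", "like"))
```

A real model would add smoothing so that unseen word pairs do not get a probability of exactly zero.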


### 🔹 Is it better than Bag of Words (BoW) and TF-IDF?

It depends on the **task** you're working on. Here's a comparison:

| Feature                | BoW                | TF-IDF             | N-gram Model         |
|------------------------|--------------------|---------------------|----------------------|
| Captures word order    | ❌ No              | ❌ No               | ✅ Yes (context aware) |
| Handles word importance| ❌ No              | ✅ Yes              | ❌ No                 |
| Used for               | Text classification| Same as BoW         | Language modeling, prediction |
| Complexity             | Low                | Low                 | Higher (as n increases) |
| Sparsity               | Sparse             | Sparse              | More sparse           |

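- 🧠 Use **BoW/TF-IDF** when classifying or clustering text.
- 🧠 Use **N-gram models** when generating or predicting text.
- 🧠 The two can be combined: n-grams can serve as the terms of a BoW or TF-IDF matrix, which is what the trigram walkthrough below does.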
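
The "captures word order" row can be checked with a small sketch. The two sentences and the `ngram_counts` helper are assumptions for illustration: single-word counts cannot tell the sentences apart, while bigram counts can.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive words in a whitespace-split text."""
    tokens = text.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = "dog bites man"
b = "man bites dog"

# Single words (plain BoW): both sentences contain the same three words once.
same_unigrams = ngram_counts(a, 1) == ngram_counts(b, 1)

# Word pairs: "dog bites" / "bites man" versus "man bites" / "bites dog".
same_bigrams = ngram_counts(a, 2) == ngram_counts(b, 2)

print(same_unigrams)  # True
print(same_bigrams)   # False
```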


### 🔹 Is N-gram a Language Model?

✅ **Yes!**

An N-gram model is a **statistical language model**. It calculates the probability of word sequences.

It is used for:
- ✍️ Predictive typing
- 🗣️ Speech recognition
- 🌐 Machine translation (older models)
- 🤖 Text generation

Modern NLP mostly relies on neural language models (like GPT), but N-gram models were the foundation they built on.


## 📘 Step-by-Step Guide: Creating a Document-Term Matrix using Trigrams

We are constructing a Document-Term Matrix (DTM) from a small corpus using word trigrams with specific preprocessing conditions:

### 📝 Given Pre-Conditions:
- **N-gram type = word** (word-level rather than character-level n-grams)
- **Min length of N = 3**
- **Max length of N = 3** → We're using **trigrams only**
- **Stop Words Removal = True**
- **Max Features = 4** → We only keep the **top 4 most frequent trigrams**
- **Term Weighting = Term Frequency** → Values in the matrix represent how often a trigram appears in a document.

---

### 🔹 Step 1: Define the Corpus

We start with the following four short documents:

1. `"a nice car"`  
2. `"a good car"`  
3. `"a beautiful car a nice car"`  
4. `"a nice black car car"`

---

### 🔹 Step 2: Preprocessing (Stopword Removal + Tokenization)

We tokenize each document into individual words and remove stop words such as `"a"` and `"the"`.

After preprocessing:

1. `"nice car"`  
2. `"good car"`  
3. `"beautiful car nice car"`  
4. `"nice black car car"`
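
This preprocessing step can be sketched in plain Python. The three-word `STOP_WORDS` set and the `preprocess` helper are assumptions for illustration; real stop word lists are much longer, and a library tokenizer may split text differently.

```python
import re

# Tiny stop list assumed for this sketch; real lists contain many more words.
STOP_WORDS = {"a", "an", "the"}


def preprocess(document):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", document.lower())
    return [token for token in tokens if token not in STOP_WORDS]


corpus = [
    "a nice car",
    "a good car",
    "a beautiful car a nice car",
    "a nice black car car",
]

cleaned = [preprocess(document) for document in corpus]
for tokens in cleaned:
    print(tokens)
```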

---

### 🔹 Step 3: Generate Trigrams (N-grams with n=3)

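From the preprocessed tokens, we generate **word-level trigrams** (3-word sequences). Because stop words are removed before the trigrams are built, a trigram can join words that were separated by a stop word in the original text: `"beautiful car nice"` spans the second `"a"` of Document 3.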

Examples:

- From `"beautiful car nice car"` →  
  Trigrams = `["beautiful car nice", "car nice car"]`

- From `"nice black car car"` →  
  Trigrams = `["nice black car", "black car car"]`
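
Generating the trigrams is a sliding window over the token list. The `word_ngrams` helper below is a hypothetical name used for this sketch; it also shows why two-word documents produce no trigrams at all.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


doc3 = ["beautiful", "car", "nice", "car"]
doc4 = ["nice", "black", "car", "car"]
doc1 = ["nice", "car"]

print(word_ngrams(doc3, 3))  # ['beautiful car nice', 'car nice car']
print(word_ngrams(doc4, 3))  # ['nice black car', 'black car car']
print(word_ngrams(doc1, 3))  # [] because the document has fewer than 3 tokens
```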

---

### 🔹 Step 4: Create Vocabulary (Top 4 Features)

From all trigrams, we count frequencies and keep only the **top 4** most frequent ones. This corpus yields exactly four distinct trigrams, each occurring once, so all of them are kept:

✅ Selected Trigrams (Vocabulary / Columns):

1. `beautiful car nice`  
2. `black car car`  
3. `car nice car`  
4. `nice black car`

These form the **columns** of our document-term matrix.

---

### 🔹 Step 5: Construct the Document-Term Matrix (DTM)

We now check each document and count how many times each of the selected trigrams appears.

📌 Format:  
- **Rows** = Documents  
- **Columns** = Trigrams (in order above)  
- **Values** = Frequency of each trigram in the document

| Document                | beautiful car nice | black car car | car nice car | nice black car |
|-------------------------|--------------------|----------------|---------------|----------------|
| Doc 1: "a nice car"      | 0                  | 0              | 0             | 0              |
| Doc 2: "a good car"      | 0                  | 0              | 0             | 0              |
| Doc 3: "a beautiful car a nice car" | 1         | 0              | 1             | 0              |
| Doc 4: "a nice black car car"       | 0         | 1              | 0             | 1              |

### 🧠 How This Matrix Was Built:

- **Doc 1 & Doc 2**: Too short to produce any trigrams after stop word removal.  
- **Doc 3**:
  - `"beautiful car nice"` → appears once ✅  
  - `"car nice car"` → appears once ✅  
- **Doc 4**:
  - `"nice black car"` → appears once ✅  
  - `"black car car"` → appears once ✅

This results in the frequency-based document-term matrix shown above.
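
The counting described above can also be sketched by hand, without any library. This is an illustration starting from the already cleaned token lists, not how scikit-learn builds the matrix internally; the names `cleaned`, `trigrams`, `vocabulary`, and `matrix` are assumptions for this sketch.

```python
from collections import Counter

# Token lists after stop word removal (Step 2).
cleaned = [
    ["nice", "car"],
    ["good", "car"],
    ["beautiful", "car", "nice", "car"],
    ["nice", "black", "car", "car"],
]


def trigrams(tokens):
    """Return every run of 3 consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


# One Counter per document: trigram -> number of occurrences in that document.
per_doc_counts = [Counter(trigrams(tokens)) for tokens in cleaned]

# Columns: every trigram seen in the corpus, sorted alphabetically.
# Only four exist here, so the "top 4" limit removes nothing.
vocabulary = sorted(set().union(*per_doc_counts))

# Rows: one list of counts per document (missing trigrams count as 0).
matrix = [[counts[term] for term in vocabulary] for counts in per_doc_counts]

print(vocabulary)
for row in matrix:
    print(row)
```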


```python
# Step 1: Import CountVectorizer, which converts a collection of text
# documents into a matrix of token counts, and pandas for display
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Step 2: Define the corpus (your collection of documents)
corpus = [
    "a nice car",                        # Document 1
    "a good car",                        # Document 2
    "a beautiful car a nice car",        # Document 3
    "a nice black car car"               # Document 4
]

# Step 3: Initialize CountVectorizer with the given settings
vectorizer = CountVectorizer(
    stop_words='english',      # Remove common English stop words like 'a'
    ngram_range=(3, 3),        # Use only trigrams (3-word phrases)
    max_features=4             # Keep only the top 4 most frequent trigrams
)

# Step 4: Transform the corpus into a document-term matrix
X = vectorizer.fit_transform(corpus)

# Step 5: Convert to a readable DataFrame using pandas
df = pd.DataFrame(
    data=X.toarray(),
    index=[f"Document {i+1}" for i in range(len(corpus))],  # Document labels
    columns=vectorizer.get_feature_names_out()               # Trigram features
)

# Step 6: Display the document-term matrix as a table
print("📄 Document-Term Matrix:")
print(df)
```

Output:

```
📄 Document-Term Matrix:
            beautiful car nice  black car car  car nice car  nice black car
Document 1                   0              0             0               0
Document 2                   0              0             0               0
Document 3                   1              0             1               0
Document 4                   0              1             0               1
```


## 🔧 Setup: What the Code is Supposed to Do

We are using:
- Word-based n-grams: unigrams (1 word), bigrams (2 words), trigrams (3 words)
- Removing English stop words
- Keeping only the top 10 most frequent n-grams
- Using term frequency (i.e., simple counts)

### 📚 Step 1: Define the Corpus

```python
corpus = [
    "a nice car",                           # Document 1
    "a good car",                           # Document 2
    "a beautiful car a nice car",           # Document 3
    "a nice black car car"                  # Document 4
]
```

### 🧹 Step 2: Remove Stop Words

`"a"` is treated as a stop word and removed.
After stop word removal:

| Document    | Cleaned Version          |
|-------------|--------------------------|
| Document 1  | nice car                 |
| Document 2  | good car                 |
| Document 3  | beautiful car nice car    |
| Document 4  | nice black car car        |

### 🔍 Step 3: Extract N-grams (1 ≤ N ≤ 3)

Now we extract every n-gram occurrence from each document:

**Document 1:** "nice car"
- Unigrams: nice, car
- Bigrams: nice car
- Trigrams: (none)

**Document 2:** "good car"
- Unigrams: good, car
- Bigrams: good car
- Trigrams: (none)

**Document 3:** "beautiful car nice car"
- Unigrams: beautiful, car, nice, car
- Bigrams: beautiful car, car nice, nice car
- Trigrams: beautiful car nice, car nice car

**Document 4:** "nice black car car"
- Unigrams: nice, black, car, car
- Bigrams: nice black, black car, car car
- Trigrams: nice black car, black car car

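### 📊 Step 4: Count Frequencies of All N-grams

Count every occurrence across all documents (an n-gram that appears twice in one document counts twice):

| N-gram             | Frequency |
|--------------------|-----------|
| car                | 6         |
| nice               | 3         |
| nice car           | 2         |
| beautiful          | 1         |
| beautiful car      | 1         |
| beautiful car nice | 1         |
| black              | 1         |
| black car          | 1         |
| black car car      | 1         |
| car car            | 1         |
| car nice           | 1         |
| car nice car       | 1         |
| good               | 1         |
| good car           | 1         |
| nice black         | 1         |
| nice black car     | 1         |

Only three n-grams occur more than once, so the remaining seven slots are filled from the thirteen n-grams tied at frequency 1. In the CountVectorizer output at the end of this section, the ties went to the alphabetically earliest n-grams, which is why `good` and `good car` do not make the cut.

Top 10 selected n-grams:
['car', 'nice', 'nice car', 'beautiful', 'beautiful car',
 'beautiful car nice', 'black', 'black car', 'black car car', 'car car']

### 🧮 Step 5: Document-Term Matrix (DTM)

Final 4×10 matrix (documents × top n-grams):

|            | car | nice | nice car | beautiful | beautiful car | beautiful car nice | black | black car | black car car | car car |
|------------|-----|------|----------|-----------|---------------|--------------------|-------|-----------|---------------|---------|
| Document 1 | 1   | 1    | 1        | 0         | 0             | 0                  | 0     | 0         | 0             | 0       |
| Document 2 | 1   | 0    | 0        | 0         | 0             | 0                  | 0     | 0         | 0             | 0       |
| Document 3 | 2   | 1    | 1        | 1         | 1             | 1                  | 0     | 0         | 0             | 0       |
| Document 4 | 2   | 1    | 0        | 0         | 0             | 0                  | 1     | 1         | 1             | 1       |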
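
The frequency count and top-10 selection can be sketched in plain Python. The alphabetical tie-break is an assumption: it reproduces the ten columns in the CountVectorizer output printed at the end of this section, but it is not presented here as documented library behavior. The names `cleaned`, `word_ngrams`, and `top_10` are made up for this sketch.

```python
from collections import Counter

# Token lists after stop word removal (Step 2).
cleaned = [
    ["nice", "car"],
    ["good", "car"],
    ["beautiful", "car", "nice", "car"],
    ["nice", "black", "car", "car"],
]


def word_ngrams(tokens, n_min, n_max):
    """Return all n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


# Total occurrences of each n-gram across the whole corpus.
corpus_counts = Counter()
for tokens in cleaned:
    corpus_counts.update(word_ngrams(tokens, 1, 3))

# Rank by descending frequency; break ties alphabetically (assumption).
ranked = sorted(corpus_counts.items(), key=lambda item: (-item[1], item[0]))

# Keep the first 10 and list them alphabetically, like the printed columns.
top_10 = sorted(term for term, _ in ranked[:10])

print(corpus_counts["car"], corpus_counts["nice"], corpus_counts["nice car"])
print(top_10)
```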

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample text corpus
corpus = [
    "a nice car",
    "a good car",
    "a beautiful car a nice car",
    "a nice black car car"
]

# Initialize CountVectorizer with specified settings
vectorizer = CountVectorizer(
    stop_words='english',       # Remove stopwords
    ngram_range=(1, 3),         # Include unigrams, bigrams, and trigrams
    max_features=10             # Keep top 10 most frequent features
)

# Transform the corpus into a document-term matrix
X = vectorizer.fit_transform(corpus)

# Convert the matrix to a pandas DataFrame
df = pd.DataFrame(
    data=X.toarray(),
    index=[f"Document {i+1}" for i in range(len(corpus))],
    columns=vectorizer.get_feature_names_out()
)

# Display the document-term matrix
print("📄 Document-Term Matrix:")
print(df)
```

Output:

```
📄 Document-Term Matrix:
            beautiful  beautiful car  beautiful car nice  black  black car  \
Document 1          0              0                   0      0          0
Document 2          0              0                   0      0          0
Document 3          1              1                   1      0          0
Document 4          0              0                   0      1          1

            black car car  car  car car  nice  nice car
Document 1              0    1        0     1         1
Document 2              0    1        0     0         0
Document 3              0    2        0     1         1
Document 4              1    2        1     1         0
```
