# Natural Language Processing - N-grams with NLTK and Scikit-learn

In this notebook, we will learn how to build and use **N-grams** with **NLTK** and **Scikit-learn**, using a spam classification problem as the running example.

We will:
- Create a Bag of Words representation
- Understand vocabulary extraction
- Use `ngram_range` in `CountVectorizer`
- Experiment with unigram, bigram, trigram combinations
- Observe how the feature space changes


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Sample dataset (SMS spam messages)
corpus = [
    'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005.',
    'U dun say so early hor... U c already then say...',
    'Nah I don’t think he goes to usf, he lives around here though',
    'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward!',
    'Had your mobile 11 months or more? You are eligible to update to the latest colour mobiles with camera for free!',
    'SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575.',
    'I’m gonna be home soon and i don’t want to talk about this stuff anymore tonight.',
    'URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot!',
    'I’ve been searching for the right words to thank you for this breather. I promise i won’t take your help for granted again.',
    'As per your request, your subscription has been renewed.'
]

## Bag of Words Model
We'll use `CountVectorizer` to convert the text documents to a matrix of token counts.


In [None]:
# Bag of Words limited to the 100 most frequent words in the corpus
cv = CountVectorizer(max_features=100)
X = cv.fit_transform(corpus).toarray()

print("Vocabulary (Top 100 Words):")
print(cv.vocabulary_)
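
Note that `cv.vocabulary_` maps each term to its column index in `X`, not to a count. The sketch below illustrates this on a tiny made-up corpus (`toy_corpus` and the variable names are assumptions for illustration, not part of the notebook's data). It also shows how `max_features` keeps only the terms with the highest total count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus (made up for this sketch)
toy_corpus = [
    "win cash now",
    "win a free prize now",
    "see you at home",
]

# No cap: every token of 2+ characters becomes a feature
cv_all = CountVectorizer()
X_all = cv_all.fit_transform(toy_corpus).toarray()

# Cap: keep only the 2 terms with the highest total count
cv_top = CountVectorizer(max_features=2)
X_top = cv_top.fit_transform(toy_corpus).toarray()

print(X_all.shape)          # (documents, vocabulary size)
print(cv_top.vocabulary_)   # term -> column index

# Look up the column for "win" and read its per-document counts
col = cv_all.vocabulary_["win"]
print(X_all[:, col])
```

The single-character token "a" does not appear in the vocabulary because the default tokenizer only keeps tokens of two or more characters.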

## Exploring N-grams with `ngram_range`
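
Before turning to `CountVectorizer`, it can help to see what an N-gram is on its own. The following is a minimal sketch using NLTK's `nltk.util.ngrams`; the example sentence and the plain whitespace split are assumptions made to keep the sketch free of NLTK data downloads, and a real pipeline would likely use a proper tokenizer.

```python
from nltk.util import ngrams

# Plain whitespace split; a simplification for this sketch
tokens = "free entry to win cash".split()

# ngrams() yields tuples of n consecutive tokens
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print(unigrams)
print(bigrams)
print(trigrams)
```

A sequence of 5 tokens yields 5 unigrams, 4 bigrams, and 3 trigrams, since each N-gram is a sliding window over consecutive tokens.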

### Unigram (1,1)
Only single words are considered as features.

In [None]:
# Unigram only
cv = CountVectorizer(max_features=100, ngram_range=(1,1))
X = cv.fit_transform(corpus).toarray()
print("Unigram Vocabulary:")
print(cv.vocabulary_)

### Unigram + Bigram (1,2)
Both single words and two-word sequences are considered as features.

In [None]:
# Unigram and Bigram
cv = CountVectorizer(max_features=200, ngram_range=(1,2))
X = cv.fit_transform(corpus).toarray()
print("Unigram + Bigram Vocabulary:")
print(cv.vocabulary_)

### Bigram only (2,2)
Only two-word sequences are considered as features.

In [None]:
# Bigram only
cv = CountVectorizer(max_features=200, ngram_range=(2,2))
X = cv.fit_transform(corpus).toarray()
print("Bigram Vocabulary:")
print(cv.vocabulary_)

### Trigram only (3,3)
Only three-word sequences are considered as features.

In [None]:
# Trigram only
cv = CountVectorizer(max_features=200, ngram_range=(3,3))
X = cv.fit_transform(corpus).toarray()
print("Trigram Vocabulary:")
print(cv.vocabulary_)
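
To observe how the feature space changes across settings, one option is to compare the width of the resulting matrix for each `ngram_range`. This is a minimal sketch on made-up messages without a `max_features` cap (`docs` and `sizes` are illustrative names). It reads the shape directly from the sparse matrix, so `.toarray()` is not needed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages for illustration
docs = [
    "you have won a free prize",
    "call now to claim your free prize",
    "are you free to talk now",
]

sizes = {}
for ngram_range in [(1, 1), (2, 2), (3, 3), (1, 2), (1, 3)]:
    cv_tmp = CountVectorizer(ngram_range=ngram_range)
    X_tmp = cv_tmp.fit_transform(docs)  # sparse matrix
    sizes[ngram_range] = X_tmp.shape[1]
    print(ngram_range, X_tmp.shape)
```

The number of rows stays fixed at the number of documents, while the number of columns for a combined range such as `(1,3)` equals the sum of the unigram, bigram, and trigram feature counts.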

## Summary
- Start with `ngram_range=(1,1)` (unigrams only).
- If the classifier's accuracy is low, try `(1,2)` or `(1,3)`.
- You can also test `(2,3)` to use bigrams and trigrams only.
- Adjust `max_features` to control the vocabulary size: it keeps only the N most frequent terms in the corpus.
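
The workflow above can be sketched as a small comparison loop. Everything here is an assumption for illustration: the toy `messages` and `labels` are made up, and `MultinomialNB` with 2-fold cross-validation is just one reasonable choice of classifier and evaluation, not something prescribed by this notebook. On a dataset this small the scores are noisy and should not be read as meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy messages and labels (1 = spam, 0 = ham), for illustration only
messages = [
    "win a free prize now", "free cash prize claim now",
    "urgent you have won free cash", "claim your free membership now",
    "are you coming home tonight", "see you at lunch tomorrow",
    "thanks for your help today", "call me when you get home",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

results = {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    # The vectorizer is refit inside each fold, so no test text leaks into the vocabulary
    model = make_pipeline(CountVectorizer(ngram_range=ngram_range), MultinomialNB())
    scores = cross_val_score(model, messages, labels, cv=2)
    results[ngram_range] = scores.mean()
    print(ngram_range, round(results[ngram_range], 3))
```

Wrapping the vectorizer and classifier in a pipeline keeps the comparison fair, since each `ngram_range` is evaluated with the same splits and the same model.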

Next step: Explore **TF-IDF** transformation.
