## N-Grams in NLP

### Context
An N-Gram is a contiguous sequence of N words (N = 1 gives unigrams, N = 2 bigrams, N = 3 trigrams). N-Gram features extend the Bag of Words (BoW) representation used in Natural Language Processing (NLP): where BoW counts single words and disregards their order, N-Grams preserve local context by counting sequences of N consecutive words.

#### Key Points:
- **Purpose**: Represents text data by capturing word sequences of length N.
- **Usage**:
  - Helps in text classification, speech recognition, and machine translation.
  - Addresses BoW limitations by preserving some context.
- **How It Works**:
  - A vocabulary of N-word sequences (N-Grams) is created.
  - Each document is represented as a vector based on the frequency of these N-Grams.
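
The two steps under "How It Works" can be sketched in plain Python. This is a minimal illustration, not how any particular library implements it: the helper names `extract_ngrams`, `build_vocabulary`, and `vectorize` are hypothetical, tokenization is a simple lowercase whitespace split, and the second sentence is made up for contrast.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return every contiguous n-token sequence as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(documents, n):
    """Map each distinct n-gram in the corpus to a column index (sorted for determinism)."""
    ngrams = {g for doc in documents for g in extract_ngrams(doc.lower().split(), n)}
    return {g: idx for idx, g in enumerate(sorted(ngrams))}


def vectorize(document, vocabulary, n):
    """Count how often each vocabulary n-gram occurs in a single document."""
    counts = Counter(extract_ngrams(document.lower().split(), n))
    return [counts.get(g, 0) for g in vocabulary]


# Hypothetical two-sentence corpus; the second sentence is invented for contrast.
docs = ["the movie was not bad", "the movie was bad"]
vocab = build_vocabulary(docs, n=2)
print(vocab)
for doc in docs:
    print(doc, "->", vectorize(doc, vocab, n=2))
```

With N = 2, the two sentences get different vectors because "was not" and "not bad" appear only in the first, while "was bad" appears only in the second.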

### Example: Sentiment Classification using N-Grams

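Let's use a toy dataset to show how N-Gram features are built and fed to a classifier. With only 20 sentences and a 4-sentence test split, the accuracy reported below is illustrative rather than a reliable measure of how much N-Grams help over BoW.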

In [10]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

# Expanded dataset for better bigram learning
data = {
    "text": [
        "I love this movie", 
        "This movie is terrible", 
        "I enjoyed the film a lot", 
        "The film was not good", 
        "Amazing storyline and great acting", 
        "I do not like this film",
        "The movie was fantastic", 
        "Absolutely horrible experience",
        "I really liked this movie", 
        "I hated this film",
        "The acting was brilliant", 
        "Worst movie I have ever seen", 
        "The movie was not bad", 
        "I do not hate this movie",
        "Fantastic direction and cinematography",
        "The worst film I watched this year",
        "I would highly recommend this movie",
        "This was a complete waste of time",
        "Such a delightful movie experience",
        "I would never watch this film again",
    ],
    "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0]  
}

df = pd.DataFrame(data)

# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(df["text"], df["label"], test_size=0.2, random_state=42)

# Vectorizing text using unigram + bigram counts (BoW extended with bigrams)
# - `ngram_range=(1, 2)` means we include both unigrams (single words) and bigrams (two consecutive words)
# - Unigrams help capture individual word importance
# - Bigrams help capture short phrases and context
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Training a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_vectorized, y_train)

# Evaluating the model
y_pred = classifier.predict(X_test_vectorized)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Predicting for new sentences
new_sentences = [
    "I do not like this movie",
    "This movie is fantastic"
]

# Transform and classify
new_sentences_vectorized = vectorizer.transform(new_sentences)
new_predictions = classifier.predict(new_sentences_vectorized)

# Display predictions
for sentence, prediction in zip(new_sentences, new_predictions):
    label = "Positive" if prediction == 1 else "Negative"
    print(f"'{sentence}' -> {label}")

Model Accuracy: 0.5

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.33      0.50         3
           1       0.33      1.00      0.50         1

    accuracy                           0.50         4
   macro avg       0.67      0.67      0.50         4
weighted avg       0.83      0.50      0.50         4

'I do not like this movie' -> Negative
'This movie is fantastic' -> Positive
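
To see what the bigram features add for negation, the sketch below compares the features of "The film was not good" (from the dataset) with a made-up counterpart, "The film was good". It only inspects features and does not retrain the classifier. The helper name `differing_features` is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The first sentence is from the toy dataset; the second is invented for contrast.
negated = "The film was not good"
plain = "The film was good"


def differing_features(ngram_range):
    """Return the features unique to each sentence under the given n-gram range."""
    analyzer = CountVectorizer(ngram_range=ngram_range).build_analyzer()
    a, b = set(analyzer(negated)), set(analyzer(plain))
    return sorted(a - b), sorted(b - a)


for ngram_range in [(1, 1), (1, 2)]:
    only_negated, only_plain = differing_features(ngram_range)
    print(ngram_range, "only in negated:", only_negated, "| only in plain:", only_plain)
```

With unigrams alone, the sentences differ only by the token "not". Adding bigrams contributes the phrase-level features "was not" and "not good" on one side and "was good" on the other, which gives a classifier more direct evidence of negation.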


In [11]:
vectorizer.vocabulary_

{'really': 59,
 'liked': 45,
 'this': 72,
 'movie': 48,
 'really liked': 60,
 'liked this': 46,
 'this movie': 74,
 'do': 17,
 'not': 54,
 'like': 43,
 'film': 26,
 'do not': 18,
 'not like': 58,
 'like this': 44,
 'this film': 73,
 'worst': 81,
 'have': 37,
 'ever': 21,
 'seen': 63,
 'worst movie': 82,
 'movie have': 50,
 'have ever': 38,
 'ever seen': 22,
 'the': 68,
 'was': 75,
 'good': 30,
 'the film': 70,
 'film was': 29,
 'was not': 78,
 'not good': 56,
 'such': 66,
 'delightful': 13,
 'experience': 23,
 'such delightful': 67,
 'delightful movie': 14,
 'movie experience': 49,
 'would': 83,
 'highly': 39,
 'recommend': 61,
 'would highly': 84,
 'highly recommend': 40,
 'recommend this': 62,
 'hate': 33,
 'not hate': 57,
 'hate this': 34,
 'enjoyed': 19,
 'lot': 47,
 'enjoyed the': 20,
 'film lot': 28,
 'hated': 35,
 'hated this': 36,
 'never': 52,
 'watch': 79,
 'again': 4,
 'would never': 85,
 'never watch': 53,
 'watch this': 80,
 'film again': 27,
 'amazing': 5,
 'storyline': 6
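
The vocabulary above contains bigrams such as 'movie have' and 'film lot' that never appear verbatim in the data. This is consistent with the default `token_pattern` of `CountVectorizer`, which keeps only tokens of two or more word characters, so "I" and "a" are dropped before bigrams are formed. The sketch below shows the effect on one training sentence. The alternative pattern is an assumption chosen for illustration, not a setting used in the example above.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "Worst movie I have ever seen"

# Default settings: lowercase the text and keep tokens with 2+ word characters.
default_analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
default_features = default_analyzer(sentence)
print(default_features)

# Assumed alternative: a pattern that also keeps single-character tokens like "I".
keep_short_analyzer = CountVectorizer(
    ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b"
).build_analyzer()
keep_short_features = keep_short_analyzer(sentence)
print(keep_short_features)
```

Whether to keep single-character tokens is a design choice: dropping them shrinks the vocabulary, but it also creates bigrams from words that were not adjacent in the original sentence.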

### Advantages
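- Captures local word order that unigram BoW discards.
- Distinguishes phrases such as "not good" from "good", which share the unigram "good".
- Reuses the same count-based pipeline as BoW, so it is simple to add (for example, via `ngram_range` in `CountVectorizer`).

### Disadvantages
- Increases dimensionality and sparsity with higher N values, since most N-Grams occur in very few documents.
- Memory and compute costs grow with the vocabulary, which can be expensive for large datasets.
- Still does not capture deep semantic meaning: synonyms map to unrelated features, and dependencies longer than N words are missed.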
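
The growth in dimensionality can be checked on a handful of sentences from the dataset. The sketch below counts the distinct features produced when all N-Grams up to a maximum N are included. It uses a simple lowercase whitespace split, so the counts will differ from those of `CountVectorizer`, and the helper name `count_features` is hypothetical.

```python
corpus = [
    "I love this movie",
    "This movie is terrible",
    "The film was not good",
    "I do not like this film",
    "The movie was not bad",
]


def count_features(documents, max_n):
    """Count distinct n-grams of length 1..max_n across all documents."""
    features = set()
    for doc in documents:
        tokens = doc.lower().split()
        for n in range(1, max_n + 1):
            # Tuples of different lengths never collide, so one set is enough.
            features.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(features)


for max_n in (1, 2, 3):
    print(f"N-Grams up to N={max_n}: {count_features(corpus, max_n)} features")
```

Even on five short sentences the feature count roughly triples from unigrams to unigrams plus bigrams plus trigrams, while the number of documents stays the same, so each added feature is observed less often.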

### Conclusion
N-Grams offer a middle ground between BoW and deep learning-based NLP models by preserving some word order information. Bigrams and trigrams capture local context, while more advanced approaches such as word embeddings and transformer models capture word meaning and longer-range context.