
**STEP 2**

Import required libraries for:

• text preprocessing

• tokenization

• counting N-grams

• probability calculations

In [6]:
# Install python-docx (used in STEP 3 to read the .docx corpus)
!pip install python-docx

# Text preprocessing
import re
import string

# Tokenization
import nltk
from nltk.tokenize import word_tokenize

# N-gram generation and counting
from collections import Counter
from nltk.util import ngrams

# Probability calculations
import math




**STEP 3**

Load dataset

• load or copy text corpus

• clean unnecessary lines

• display sample text

• explain dataset in 5–6 lines

In [5]:
from docx import Document
import re

# Load the Word document
try:
    doc = Document("data_set-8.docx")

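    # Extract text from all paragraphs, separated by spaces
    text = " ".join(para.text for para in doc.paragraphs)

    # Convert to lowercase
    text = text.lower()

    # Remove numbers and special characters.
    # Note: this also removes sentence-ending punctuation, so sentence
    # boundaries can no longer be recovered after this step.
    text = re.sub(r'[^a-z\s]', '', text)

    # Collapse runs of whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()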

    print("Sample Text:")
    print(text[:500])   # Display first 500 characters
except Exception as e:
    print(f"Error loading document: {e}")

Sample Text:
natural language processing is a field of artificial intelligence it helps computers understand human language students use nlp techniques for text analysis language models predict the next word in a sentence n gram models are widely used in nlp probability plays an important role in language modeling text preprocessing improves the quality of data tokenization breaks text into meaningful units nlp is used in chatbots and search engines learning nlp is important for modern applications


This dataset is a manually created text corpus of short, simple English sentences.
It focuses on concepts related to natural language processing and general computing.
The corpus is stored as a Word document (data_set-8.docx), and its paragraphs are extracted with python-docx.
It contains few extraneous symbols and little formatting, which keeps preprocessing simple.
The dataset is suitable for practicing tokenization, N-gram generation, and probability calculations.
It helps demonstrate the fundamental concepts of language modeling in NLP.

**STEP 4**

Preprocess Text
Write functions to:

• convert to lowercase

• remove punctuation and numbers

• tokenize words

• optionally remove stopwords

• add start/end tokens for sentences (e.g., <s>, </s>)

Students must briefly explain each step.

In [8]:
# STEP 4 — Preprocess Text
import nltk
from nltk.corpus import stopwords

# Download necessary NLTK resources
nltk.download('punkt_tab')
nltk.download('stopwords')

# Convert text to lowercase
def to_lowercase(text):
    return text.lower()

# Remove punctuation and numbers
def remove_punctuation_numbers(text):
    text = re.sub(r'[^a-z\s]', '', text)
    return text

# Tokenize words
def tokenize_text(text):
    return word_tokenize(text)

# Remove stopwords (optional)
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    return [word for word in tokens if word not in stop_words]

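# Add start and end tokens.
# Note: this wraps whatever token list it receives in a single <s> ... </s>
# pair. Applied to the whole corpus below, it marks the corpus boundaries,
# not individual sentence boundaries.
def add_start_end_tokens(tokens):
    return ['<s>'] + tokens + ['</s>']


# Apply preprocessing steps
# Note: 'text' comes from the previous cell, where it was already lowercased
# and stripped of punctuation, so the first two calls leave it unchanged.
processed_text = to_lowercase(text)
processed_text = remove_punctuation_numbers(processed_text)
tokens = tokenize_text(processed_text)
tokens = remove_stopwords(tokens)   # Optional
tokens = add_start_end_tokens(tokens)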

# Display sample output
print("Sample Preprocessed Tokens:")
print(tokens[:30])

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

Sample Preprocessed Tokens:
['<s>', 'natural', 'language', 'processing', 'field', 'artificial', 'intelligence', 'helps', 'computers', 'understand', 'human', 'language', 'students', 'use', 'nlp', 'techniques', 'text', 'analysis', 'language', 'models', 'predict', 'next', 'word', 'sentence', 'n', 'gram', 'models', 'widely', 'used', 'nlp']
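
STEP 4 asks for start/end tokens around each sentence, whereas the cell above wraps the entire corpus in one pair. A minimal sketch of a per-sentence variant follows. It assumes the raw text still contains sentence-ending punctuation. The helper name tokens_with_sentence_boundaries, the regex-based sentence split, and the use of str.split instead of word_tokenize are illustrative choices, not part of the notebook.

```python
import re

def tokens_with_sentence_boundaries(raw_text):
    """Split raw text into sentences, clean each one, and wrap it in <s> ... </s>."""
    tokens = []
    # Split on runs of sentence-ending punctuation (a rough heuristic).
    for sentence in re.split(r'[.!?]+', raw_text):
        # Same cleaning as the notebook: lowercase, keep only letters and spaces.
        cleaned = re.sub(r'[^a-z\s]', '', sentence.lower())
        words = cleaned.split()
        if words:  # skip empty fragments left by the split
            tokens.extend(['<s>'] + words + ['</s>'])
    return tokens

# Hypothetical two-sentence input, for illustration only
sample = "NLP is used in chatbots. Tokenization breaks text into units!"
boundary_tokens = tokens_with_sentence_boundaries(sample)
print(boundary_tokens)
```

With boundaries marked per sentence, bigrams such as (</s>, <s>) separate sentences instead of linking the last word of one to the first word of the next.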


**STEP 5**

Build N-Gram Models

In [9]:
# STEP 5 — Build N-Gram Models

from collections import Counter
from nltk.util import ngrams
import pandas as pd


# ---------- UNIGRAM MODEL ----------

# Count unigrams
unigram_counts = Counter(tokens)
total_tokens = sum(unigram_counts.values())

# Calculate unigram probabilities
unigram_prob = {
    word: count / total_tokens
    for word, count in unigram_counts.items()
}

# Create unigram table
unigram_table = pd.DataFrame({
    'Word': unigram_counts.keys(),
    'Count': unigram_counts.values(),
    'Probability': [unigram_prob[word] for word in unigram_counts.keys()]
})

print("Unigram Model")
print(unigram_table.head())


# ---------- BIGRAM MODEL ----------

# Generate and count bigrams
bigrams = list(ngrams(tokens, 2))
bigram_counts = Counter(bigrams)

# Calculate bigram conditional probabilities
# P(w2 | w1) = count(w1, w2) / count(w1)
bigram_prob = {
    bigram: count / unigram_counts[bigram[0]]
    for bigram, count in bigram_counts.items()
}

# Create bigram table
bigram_table = pd.DataFrame({
    'Bigram': bigram_counts.keys(),
    'Count': bigram_counts.values(),
    'Conditional Probability': [bigram_prob[bg] for bg in bigram_counts.keys()]
})

print("\nBigram Model")
print(bigram_table.head())


# ---------- TRIGRAM MODEL ----------

# Generate and count trigrams
trigrams = list(ngrams(tokens, 3))
trigram_counts = Counter(trigrams)

# Calculate trigram conditional probabilities
# P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
trigram_prob = {
    trigram: count / bigram_counts[(trigram[0], trigram[1])]
    for trigram, count in trigram_counts.items()
}

# Create trigram table
trigram_table = pd.DataFrame({
    'Trigram': trigram_counts.keys(),
    'Count': trigram_counts.values(),
    'Conditional Probability': [trigram_prob[tg] for tg in trigram_counts.keys()]
})

print("\nTrigram Model")
print(trigram_table.head())


Unigram Model
         Word  Count  Probability
0         <s>      1     0.017544
1     natural      1     0.017544
2    language      4     0.070175
3  processing      1     0.017544
4       field      1     0.017544

Bigram Model
                   Bigram  Count  Conditional Probability
0          (<s>, natural)      1                     1.00
1     (natural, language)      1                     1.00
2  (language, processing)      1                     0.25
3     (processing, field)      1                     1.00
4     (field, artificial)      1                     1.00

Trigram Model
                             Trigram  Count  Conditional Probability
0           (<s>, natural, language)      1                      1.0
1    (natural, language, processing)      1                      1.0
2      (language, processing, field)      1                      1.0
3    (processing, field, artificial)      1                      1.0
4  (field, artificial, intelligence)      1                      1.0

**STEP 6**

Apply Smoothing

• Add-one (Laplace) smoothing

Explain in 3–4 lines:
Why smoothing is needed and what problem it solves.
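
For reference, the unsmoothed and add-one estimates of a bigram probability can be written as below. The notation is chosen here for illustration: C(·) is a count in the training tokens and V is the vocabulary size.

```latex
P_{\text{MLE}}(w_2 \mid w_1) = \frac{C(w_1, w_2)}{C(w_1)}
\qquad
P_{\text{Laplace}}(w_2 \mid w_1) = \frac{C(w_1, w_2) + 1}{C(w_1) + V}
```

When C(w_1, w_2) = 0 the first estimate is zero, while the second remains positive.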

In [10]:
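# Vocabulary size V: number of distinct token types seen in training
# (this count includes the boundary markers <s> and </s>)
vocab_size = len(unigram_counts)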


# ---------- SMOOTHED BIGRAM PROBABILITY ----------

def laplace_bigram_probability(w1, w2):
    """
    P(w2 | w1) with Add-One smoothing
    """
    bigram = (w1, w2)
    return (bigram_counts.get(bigram, 0) + 1) / (unigram_counts.get(w1, 0) + vocab_size)


# Example: smoothed bigram probabilities
print("Smoothed Bigram Probability Example:")
print("P(processing | language) =",
      laplace_bigram_probability('language', 'processing'))


# ---------- SMOOTHED TRIGRAM PROBABILITY ----------

def laplace_trigram_probability(w1, w2, w3):
    """
    P(w3 | w1, w2) with Add-One smoothing
    """
    trigram = (w1, w2, w3)
    bigram = (w1, w2)
    return (trigram_counts.get(trigram, 0) + 1) / (bigram_counts.get(bigram, 0) + vocab_size)


# Example: smoothed trigram probabilities
print("\nSmoothed Trigram Probability Example:")
print("P(is | language, processing) =",
      laplace_trigram_probability('language', 'processing', 'is'))


Smoothed Bigram Probability Example:
P(processing | language) = 0.04

Smoothed Trigram Probability Example:
P(is | language, processing) = 0.02127659574468085
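
To make the arithmetic behind these values concrete, the sketch below applies add-one smoothing to a toy token list and checks that the smoothed probabilities of all possible next words still sum to 1. The toy data and the function name add_one_bigram_prob are assumptions for illustration; they are not the notebook's corpus or code.

```python
from collections import Counter

def add_one_bigram_prob(w1, w2, unigram_counts, bigram_counts):
    """P(w2 | w1) with add-one smoothing over the observed vocabulary."""
    vocab_size = len(unigram_counts)
    return (bigram_counts.get((w1, w2), 0) + 1) / (unigram_counts.get(w1, 0) + vocab_size)

# Toy corpus (hypothetical, for illustration only)
toy_tokens = ['<s>', 'language', 'models', 'predict', 'language', '</s>']
toy_unigrams = Counter(toy_tokens)
toy_bigrams = Counter(zip(toy_tokens, toy_tokens[1:]))

# A bigram that occurs once, and one that never occurs
seen = add_one_bigram_prob('language', 'models', toy_unigrams, toy_bigrams)
unseen = add_one_bigram_prob('language', 'predict', toy_unigrams, toy_bigrams)

# Smoothed probabilities over every possible next word
total = sum(add_one_bigram_prob('language', w, toy_unigrams, toy_bigrams)
            for w in toy_unigrams)

print(seen, unseen, total)
```

The same arithmetic matches the notebook's output: with C(language, processing) = 1, C(language) = 4, and a vocabulary size of 46 (inferred from the printed values), (1 + 1) / (4 + 46) = 0.04.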


**STEP 7**

Sentence Probability Calculation


• choose at least 5 sentences

• compute probability using:

  ◦ Unigram model

  ◦ Bigram model

  ◦ Trigram model



In [11]:
# STEP 7 — Sentence Probability Calculation

import math

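# Function to preprocess a sentence.
# Note: unlike the training pipeline in STEP 4, stopwords are NOT removed here,
# so words such as "is" and "in" reach the models as unseen tokens.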
def preprocess_sentence(sentence):
    sentence = sentence.lower()
    sentence = re.sub(r'[^a-z\s]', '', sentence)
    words = word_tokenize(sentence)
    words = ['<s>'] + words + ['</s>']
    return words


# ---------- UNIGRAM SENTENCE PROBABILITY ----------

def sentence_probability_unigram(sentence):
    words = preprocess_sentence(sentence)
    prob = 1.0

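    for word in words:
        # Fallback for unseen words: 1/V. Because V is smaller than the total
        # token count, this exceeds the probability of a word seen once.
        prob *= unigram_prob.get(word, 1 / len(unigram_counts))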

    return prob


# ---------- BIGRAM SENTENCE PROBABILITY (Laplace Smoothed) ----------

def sentence_probability_bigram(sentence):
    words = preprocess_sentence(sentence)
    prob = 1.0

    for i in range(len(words) - 1):
        prob *= laplace_bigram_probability(words[i], words[i+1])

    return prob


# ---------- TRIGRAM SENTENCE PROBABILITY (Laplace Smoothed) ----------

def sentence_probability_trigram(sentence):
    words = preprocess_sentence(sentence)
    prob = 1.0

    for i in range(len(words) - 2):
        prob *= laplace_trigram_probability(words[i], words[i+1], words[i+2])

    return prob


# Choose at least 5 sentences
sentences = [
    "Natural language processing is important",
    "Language models predict words",
    "NLP is used in chatbots",
    "Students learn artificial intelligence",
    "Computers understand human language"
]


# Compute probabilities
for sentence in sentences:
    print("\nSentence:", sentence)
    print("Unigram Probability:", sentence_probability_unigram(sentence))
    print("Bigram Probability:", sentence_probability_bigram(sentence))
    print("Trigram Probability:", sentence_probability_trigram(sentence))



Sentence: Natural language processing is important
Unigram Probability: 5.070876356831149e-12
Bigram Probability: 6.979548666089596e-10
Trigram Probability: 1.8207518259364167e-08

Sentence: Language models predict words
Unigram Probability: 2.8903995233937555e-10
Bigram Probability: 1.6401939365310554e-08
Trigram Probability: 4.2787667909505786e-07

Sentence: NLP is used in chatbots
Unigram Probability: 6.283477224769034e-12
Bigram Probability: 8.914097481147038e-11
Trigram Probability: 4.855241555647359e-09

Sentence: Students learn artificial intelligence
Unigram Probability: 3.6129994042421943e-11
Bigram Probability: 8.910062126922891e-09
Trigram Probability: 2.1858917301595349e-07

Sentence: Computers understand human language
Unigram Probability: 1.1663015620711644e-10
Bigram Probability: 3.278902862707624e-08
Trigram Probability: 8.375458399307515e-07


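The unigram model scores a sentence using only individual word frequencies and ignores word order.
The bigram model conditions each word on the previous word, and the trigram model conditions on the two previous words, making it more context-aware.
Each sentence probability is a product of per-token factors, so longer sentences and unseen word combinations yield lower values.
The three models multiply different numbers of factors (n, n-1, and n-2 for a sentence of n tokens including <s> and </s>), so the raw probabilities above are not directly comparable across models.
A lower probability means the sentence is less likely according to the trained language model.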
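
Products of many small probabilities can underflow on longer texts, so sentence scores are often accumulated as sums of logarithms instead. A minimal sketch, assuming the per-token probabilities are already available as a list; the function name and the example values are hypothetical.

```python
import math

def log_sentence_probability(factor_probs):
    """Sum of natural-log probabilities; equals the log of the product of the factors."""
    return sum(math.log(p) for p in factor_probs)

# Hypothetical per-token probabilities for one sentence
factors = [0.02, 0.04, 0.02, 0.04]
log_prob = log_sentence_probability(factors)

print(log_prob)            # log P(sentence)
print(math.exp(log_prob))  # converted back to a plain probability
```

Comparing log-probabilities gives the same ranking as comparing the probabilities themselves, since the logarithm is monotonic.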

**STEP 8**

Perplexity Calculation

• compute perplexity for test sentences

• compare perplexity across models


In [12]:
# STEP 8 — Perplexity Calculation

import math

# Function to compute perplexity
def calculate_perplexity(sentence, model_type="unigram"):

    words = preprocess_sentence(sentence)
    N = len(words)

    if model_type == "unigram":
        prob = sentence_probability_unigram(sentence)

    elif model_type == "bigram":
        prob = sentence_probability_bigram(sentence)

    elif model_type == "trigram":
        prob = sentence_probability_trigram(sentence)

    else:
        return None

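    # A zero probability would make pow() raise ZeroDivisionError
    if prob == 0:
        return float("inf")

    # Perplexity formula: PP = P(sentence) ** (-1/N)
    # Note: N counts every token, including <s> and </s>, for all three models,
    # although the bigram and trigram products contain only N-1 and N-2 factors.
    perplexity = pow(prob, -1/N)
    return perplexity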


# Test sentences (at least 5)
test_sentences = [
    "Natural language processing is important",
    "Language models predict words",
    "NLP is used in chatbots",
    "Students learn artificial intelligence",
    "Computers understand human language"
]


# Compute perplexity for each model
for sentence in test_sentences:
    print("\nSentence:", sentence)

    uni_pp = calculate_perplexity(sentence, "unigram")
    bi_pp = calculate_perplexity(sentence, "bigram")
    tri_pp = calculate_perplexity(sentence, "trigram")

    print("Unigram Perplexity:", uni_pp)
    print("Bigram Perplexity:", bi_pp)
    print("Trigram Perplexity:", tri_pp)



Sentence: Natural language processing is important
Unigram Perplexity: 41.0732972673736
Bigram Perplexity: 20.32472441878892
Trigram Perplexity: 12.754941283097695

Sentence: Language models predict words
Unigram Perplexity: 38.890215872352385
Bigram Perplexity: 19.83889442691526
Trigram Perplexity: 11.51985193867672

Sentence: NLP is used in chatbots
Unigram Perplexity: 39.83429510112762
Bigram Perplexity: 27.27113614925815
Trigram Perplexity: 15.405796807050338

Sentence: Students learn artificial intelligence
Unigram Perplexity: 54.99907073029815
Bigram Perplexity: 21.962741228969545
Trigram Perplexity: 12.884331557608252

Sentence: Computers understand human language
Unigram Perplexity: 45.24092998109368
Bigram Perplexity: 17.675779484371482
Trigram Perplexity: 10.29987377270416


**STEP 9**

Comparison and Analysis

• Which model gave lowest perplexity?

• Did trigrams always perform best?

• What happens when unseen words appear?

• How did smoothing affect results?

Write comparison in 8–10 sentences.

In our experiments, the trigram model produced the lowest reported perplexity for all five test sentences, followed by the bigram model and then the unigram model. Trigrams condition on more context (the two previous words), which can sharpen probability estimates when enough training data is available. However, two factors limit what this ranking shows. First, perplexity was computed with the same N (all tokens, including <s> and </s>) for every model, while the bigram and trigram products contain only N-1 and N-2 factors, so the higher-order models are scored more leniently. Second, the test sentences are built largely from words and phrases in the small training corpus, so they do not measure performance on unseen text. With so little data, most three-word combinations are rare or absent, and the trigram model should not be expected to perform best in general.

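When unseen words or unseen N-grams appear in a test sentence, an unsmoothed model assigns them zero probability, which makes the whole sentence probability zero and the perplexity infinite. This zero-probability problem arises whenever test text contains word combinations that are absent from the training corpus. Here it affects even common words: stopwords were removed from the training tokens but not from the test sentences, so "is" and "in" are unseen words. The unigram model handles unseen words with a fallback probability of 1/V, while the bigram and trigram models rely on smoothing.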
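
A quick way to see which test words a model has never encountered is to compare the test tokens against the training vocabulary. The sketch below uses a small hypothetical vocabulary; in the notebook, the keys of unigram_counts would play that role. The function name is also an assumption.

```python
def out_of_vocabulary(test_tokens, vocabulary):
    """Return the test tokens that do not occur in the training vocabulary, in order."""
    return [tok for tok in test_tokens if tok not in vocabulary]

# Hypothetical training vocabulary (stopwords already removed) and a test sentence
train_vocab = {'<s>', '</s>', 'nlp', 'used', 'chatbots', 'search', 'engines'}
test_tokens = ['<s>', 'nlp', 'is', 'used', 'in', 'chatbots', '</s>']

oov = out_of_vocabulary(test_tokens, train_vocab)
print(oov)
```

Running such a check before scoring makes it easier to tell whether a low probability reflects the sentence itself or a mismatch between training and test preprocessing.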

Applying Add-One (Laplace) smoothing addresses this by assigning a small non-zero probability to every unseen N-gram, so the models avoid infinite perplexity values. The cost is that probability mass is taken from the N-grams that were seen, and on a corpus this small the shift is large: P(processing | language) falls from 0.25 unsmoothed to 0.04 smoothed. Smoothing therefore makes N-gram language models usable on new text, although add-one is a coarse choice when the vocabulary is large relative to the counts.