<a href="https://colab.research.google.com/github/abdulsamadkhan/Tutorial/blob/main/n_gram.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unigram, Bigram, Trigram and N-Gram
The Python code below is a small text-processing and N-gram modeling example. It takes a sample text, builds unigram, bigram, and trigram models, and prints the probability distribution for each. Each part of the code is explained here:

1. **Importing Libraries**:
   - `import re`: imports Python's regular expression module, used here to remove punctuation from the text.
   - `from collections import defaultdict`: imports `defaultdict` from the `collections` module. It is a dictionary that supplies a default value (here zero) for missing keys, so an entry can be incremented without first checking that the key exists.

2. **Preprocessing Function (`preprocess_text`)**:
   - This function takes a text string as input and performs the following steps:
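     - Removes punctuation with `re.sub`, deleting every character that is neither a word character (`\w`) nor whitespace (`\s`).
     - Converts the text to lowercase.
     - Tokenizes the text into individual words by splitting on whitespace (`str.split()` with no arguments, which treats any run of spaces, tabs, or newlines as one separator).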
   - The function returns a list of words.

3. **Unigram Model Function (`calculate_unigram_model`)**:
   - This function takes a list of words as input.
   - It creates a dictionary (`unigram_model`) to store unigram (single-word) probabilities.
   - For each word in the input, it adds `1 / total_words` to that word's entry, so a word that occurs k times ends up with the value k divided by the total number of words in the text.
   - The function returns the unigram model, where each word is associated with its probability of occurrence.

4. **Bigram Model Function (`calculate_bigram_model`)**:
   - This function takes a list of words as input.
   - It creates a dictionary (`bigram_model`) to store bigram (two-word) probabilities.
   - It iterates through the list of words and creates bigrams (pairs of consecutive words).
   - For each bigram, it increments the count of that bigram in the dictionary.
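   - After counting all bigrams, it normalizes the counts by dividing by the total number of bigrams (`len(words) - 1`).
   - The function returns the bigram model, where each bigram is associated with its relative frequency among all bigrams in the text. This is a joint frequency of the pair, not the conditional probability of the second word given the first.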

5. **Trigram Model Function (`calculate_trigram_model`)**:
   - This function is similar to the bigram model function but operates on trigrams (three-word sequences) instead.
   - It creates a dictionary (`trigram_model`) to store trigram probabilities.
   - It iterates through the list of words and creates trigrams (triplets of consecutive words).
   - For each trigram, it increments the count of that trigram in the dictionary.
   - After counting all trigrams, it normalizes the counts by dividing by the total number of trigrams (`len(words) - 2`).
   - The function returns the trigram model, where each trigram is associated with its relative frequency among all trigrams in the text.

6. **Sample Text**:
   - A sample sentence is assigned to `text`; replace it with any text you want to analyze.

7. **Text Preprocessing and Model Calculation**:
   - The sample text is preprocessed using the `preprocess_text` function to obtain a list of words.
   - Unigram, bigram, and trigram models are calculated using their respective functions, and the results are stored in the `unigram_model`, `bigram_model`, and `trigram_model` dictionaries.

8. **Printing the Models**:
   - The code then prints the probability distributions for each model:
     - For the unigram model, it prints each word and its probability.
     - For the bigram model, it prints each bigram (pair of words) and its probability.
     - For the trigram model, it prints each trigram (triplet of words) and its probability.
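
The three normalizations described above can be written compactly as formulas. The notation is introduced here for illustration only: $c(\cdot)$ is the number of times a word or word sequence occurs, and $N$ is the number of tokens left after preprocessing.

$$
\begin{aligned}
P(w_i) &= \frac{c(w_i)}{N} \\
P(w_i, w_{i+1}) &= \frac{c(w_i, w_{i+1})}{N - 1} \\
P(w_i, w_{i+1}, w_{i+2}) &= \frac{c(w_i, w_{i+1}, w_{i+2})}{N - 2}
\end{aligned}
$$

The sample sentence used in the code yields $N = 13$ tokens. A word that occurs once therefore gets $1/13 \approx 0.0769$, the word "a" (which occurs twice) gets $2/13 \approx 0.1538$, each of the 12 bigrams gets $1/12 \approx 0.0833$, and each of the 11 trigrams gets $1/11 \approx 0.0909$. These are relative frequencies of whole sequences, not conditional probabilities such as $P(w_{i+1} \mid w_i)$.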

The output lists the probability distribution for each N-gram model, showing how often each word, word pair, and word triple occurs relative to all sequences of the same length in the sample text.
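
The preprocessing step described above can be tried in isolation to see what the regular expression keeps and drops. The following is a minimal sketch: `preprocess_text` follows the description in step 2, while the input string is made up for illustration. It shows two side effects worth knowing about: apostrophes and hyphens are deleted without leaving a space, and underscores survive because `\w` matches them.

```python
import re

def preprocess_text(text):
    # Delete every character that is neither a word character nor whitespace,
    # then lowercase the result
    text = re.sub(r'[^\w\s]', '', text).lower()
    # split() with no argument splits on any run of whitespace
    return text.split()

# Hypothetical input chosen to exercise the edge cases
tokens = preprocess_text("Don't  stop:\tsnake_case, e-mail!")
print(tokens)  # ['dont', 'stop', 'snake_case', 'email']
```

If contractions or hyphenated words matter for your text, a different tokenization rule would be needed; the rule above simply merges their parts into one token.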

```python
import re
from collections import defaultdict

def preprocess_text(text):
    # Remove punctuation and convert to lowercase
    text = re.sub(r'[^\w\s]', '', text).lower()
    # Tokenize the text into words (split on any whitespace)
    words = text.split()
    return words

def calculate_unigram_model(words):
    # Float default, because each occurrence adds a fractional share
    unigram_model = defaultdict(float)
    total_words = len(words)
    for word in words:
        unigram_model[word] += 1 / total_words
    return unigram_model

def calculate_bigram_model(words):
    bigram_model = defaultdict(float)
    # Count every pair of consecutive words
    for i in range(len(words) - 1):
        bigram = (words[i], words[i + 1])
        bigram_model[bigram] += 1
    # Normalize counts by the total number of bigrams
    total_bigrams = len(words) - 1
    for bigram, count in bigram_model.items():
        bigram_model[bigram] = count / total_bigrams
    return bigram_model

def calculate_trigram_model(words):
    trigram_model = defaultdict(float)
    # Count every triple of consecutive words
    for i in range(len(words) - 2):
        trigram = (words[i], words[i + 1], words[i + 2])
        trigram_model[trigram] += 1
    # Normalize counts by the total number of trigrams
    total_trigrams = len(words) - 2
    for trigram, count in trigram_model.items():
        trigram_model[trigram] = count / total_trigrams
    return trigram_model

# Sample text
text = "This is a simple example of a unigram, bigram, and trigram model calculation."

# Preprocess the text
words = preprocess_text(text)

# Calculate the models
unigram_model = calculate_unigram_model(words)
bigram_model = calculate_bigram_model(words)
trigram_model = calculate_trigram_model(words)

# Print the models
print("Unigram Model:")
for word, prob in unigram_model.items():
    print(f"{word}: {prob:.4f}")

print("\nBigram Model:")
for bigram, prob in bigram_model.items():
    print(f"{bigram[0]} {bigram[1]}: {prob:.4f}")

print("\nTrigram Model:")
for trigram, prob in trigram_model.items():
    print(f"{trigram[0]} {trigram[1]} {trigram[2]}: {prob:.4f}")
```

```text
Unigram Model:
this: 0.0769
is: 0.0769
a: 0.1538
simple: 0.0769
example: 0.0769
of: 0.0769
unigram: 0.0769
bigram: 0.0769
and: 0.0769
trigram: 0.0769
model: 0.0769
calculation: 0.0769

Bigram Model:
this is: 0.0833
is a: 0.0833
a simple: 0.0833
simple example: 0.0833
example of: 0.0833
of a: 0.0833
a unigram: 0.0833
unigram bigram: 0.0833
bigram and: 0.0833
and trigram: 0.0833
trigram model: 0.0833
model calculation: 0.0833

Trigram Model:
this is a: 0.0909
is a simple: 0.0909
a simple example: 0.0909
simple example of: 0.0909
example of a: 0.0909
of a unigram: 0.0909
a unigram bigram: 0.0909
unigram bigram and: 0.0909
bigram and trigram: 0.0909
and trigram model: 0.0909
trigram model calculation: 0.0909
```
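
The bigram and trigram functions differ only in how many consecutive words they group, so the same idea extends to any N. The sketch below is one possible generalization, not part of the original notebook: the function name `calculate_ngram_model` and its `n` parameter are assumptions made for illustration, and it uses `collections.Counter` in place of `defaultdict`. Note that it keys every model by tuple, so unigrams appear as one-element tuples such as `('a',)` rather than plain strings.

```python
from collections import Counter

def calculate_ngram_model(words, n):
    """Relative frequency of each n-gram among all n-grams in `words`.

    Hypothetical helper; generalizes the bigram and trigram functions.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Zipping n staggered slices yields every run of n consecutive words
    ngrams = list(zip(*(words[i:] for i in range(n))))
    if not ngrams:
        # Fewer than n words: there is nothing to count
        return {}
    counts = Counter(ngrams)
    total = len(ngrams)
    return {ngram: count / total for ngram, count in counts.items()}

# The sample sentence after preprocessing
words = ("this is a simple example of a unigram bigram "
         "and trigram model calculation").split()

for n in (1, 2, 3, 4):
    model = calculate_ngram_model(words, n)
    print(f"n={n}: {len(model)} distinct n-grams")
```

For `n` equal to 1, 2, and 3 this yields the same values as the three dedicated functions, and it returns an empty dictionary when the text has fewer than `n` words.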
