### N-Gram Models

N-gram models are a language modeling technique in natural language processing that works with sequences of N consecutive words (or other symbols). These models capture the sequential dependencies between words and are used to predict the next word. The key simplification is that the probability of a word is computed from only the preceding N-1 words.

![n-gram-models](../images/3/3-n-gram-models.png)

Here, N refers to the length of the word sequence. For example:

- 1-gram (Unigram): Each word is considered independently.
- 2-gram (Bigram): The probability of a word depends only on the previous word.
- 3-gram (Trigram): The probability of a word depends on the previous two words.

N-gram models typically learn by counting the frequency of each word sequence and then calculating probabilities based on these frequencies.
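
The counting procedure described above corresponds to the usual maximum-likelihood estimate. As a sketch, writing $C(\cdot)$ for the number of times a sequence occurs in the training corpus (this notation is an assumption introduced here, not taken from the text above):

$$
P(w_i \mid w_{i-N+1}, \dots, w_{i-1}) = \frac{C(w_{i-N+1}, \dots, w_{i-1}, w_i)}{C(w_{i-N+1}, \dots, w_{i-1})}
$$

For a bigram model (N = 2) this reduces to $P(w_i \mid w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1})$.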

---

#### Use Cases of N-gram Models

- Language Modeling: N-gram models estimate how likely a word sequence is in a language, which lets them complete a sentence or predict the next word. The generated text tends to be locally fluent, although nothing in the model enforces grammatical rules.
- Spell Checking: Used to detect and correct spelling mistakes, for example by ranking candidate corrections for a misspelled word according to how well each one fits the surrounding words.
- Machine Translation: In statistical machine translation, an N-gram language model of the target language scores candidate translations so that fluent word orders are preferred.
- Text Classification: N-grams can be used as features in text classification tasks, such as spam email detection or sentiment analysis.
- Speech Recognition: N-gram models help choose the most likely word sequence among similar-sounding candidates when transcribing speech into text.
- Information Retrieval (Search Engines): N-gram models can be used to suggest likely completions of a search query or to match queries against documents.
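
To make the text classification use case concrete, here is a minimal sketch of turning a text into bag-of-N-gram counts that a classifier could consume as features. The function name `ngram_features` and the sample sentence are hypothetical, and plain whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


def ngram_features(text, n=2):
    """Return a Counter of word n-grams for a whitespace-tokenized text.

    Hypothetical helper for illustration; a real pipeline would use a
    proper tokenizer and usually a vectorizer from an ML library.
    """
    words = text.lower().split()
    # zip over n shifted copies of the word list to form sliding windows
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


features = ngram_features("win a free prize now win a free trip", n=2)
print(features.most_common(3))
```

Repeated phrases such as "win a" and "a free" receive higher counts, which is the kind of signal a spam classifier might pick up on.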

#### Advantages

- Simplicity: N-gram models have a very simple structure and are easy to set up. They learn word dependencies by considering only a fixed number of previous words.
- Efficient Computation: Training and prediction are fast because these models only consider fixed-length word sequences and need to remember only the last few words.
- Effective Language Modeling: Low-order N-grams (unigrams and bigrams) can work well even on smaller datasets, since they capture local sequential relationships with relatively few parameters.
- Easy Implementation: N-gram counting needs little more than tokenization and a frequency table, which makes it a common baseline across many NLP tasks.

#### Disadvantages

- Dependency Issues: N-gram models only consider the previous N-1 words, which ignores dependencies over longer distances. This can be problematic, especially in long sentences or texts with more complex linguistic structures.
- Data Sparsity: As N grows (e.g., trigrams or four-grams), the number of possible word sequences grows exponentially with the vocabulary size, so most of them never appear in the training data. This makes rare or unseen word sequences difficult to predict.
- Large Data Requirements: Reliable probability estimates for larger N values require much more data. For example, trigram or four-gram models need very large datasets, which increases computational costs.
- Complexity: Higher-order N-gram models have to store many more distinct sequences, so they consume more memory and take longer to process.
- General Performance Limitations: N-gram models may struggle to capture deep context and semantic relationships in language. This is why more complex models (e.g., deep learning-based models) often provide better results.
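
The data sparsity point can be illustrated with a rough sketch that compares the number of distinct N-grams observed in a corpus with the number that are theoretically possible (vocabulary size to the power N). It reuses the toy sentences from the notebook cells in this section; the helper `observed_ngrams` is a hypothetical name, and whitespace splitting is assumed instead of NLTK tokenization.

```python
corpus = [
    "I love you",
    "I love apple",
    "I love programming",
    "You love me",
    "She loves apple",
    "They love you",
    "I love you and you love me",
]

sentences = [sentence.lower().split() for sentence in corpus]
vocab = {word for words in sentences for word in words}


def observed_ngrams(sentences, n):
    """Return the set of distinct n-grams that occur within the sentences."""
    return {
        tuple(words[i:i + n])
        for words in sentences
        for i in range(len(words) - n + 1)
    }


for n in (1, 2, 3):
    seen = len(observed_ngrams(sentences, n))
    possible = len(vocab) ** n
    print(f"n={n}: {seen} observed of {possible} possible ({seen / possible:.1%})")
```

On this corpus the vocabulary has 10 words, so 11 of 100 possible bigrams and only 9 of 1000 possible trigrams are observed. The covered fraction shrinks quickly as N increases.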


---


In [1]:
from collections import Counter

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

In [2]:
# Sample data
corpus = [
    "I love you",
    "I love apple",
    "I love programming",
    "You love me",
    "She loves apple",
    "They love you",
    "I love you and you love me",
]

In [3]:
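# Tokenize each sentence into lowercase words
# (word_tokenize requires NLTK's Punkt tokenizer data to be downloaded)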
tokens = [word_tokenize(sentence.lower()) for sentence in corpus]
print(tokens)

[['i', 'love', 'you'], ['i', 'love', 'apple'], ['i', 'love', 'programming'], ['you', 'love', 'me'], ['she', 'loves', 'apple'], ['they', 'love', 'you'], ['i', 'love', 'you', 'and', 'you', 'love', 'me']]


In [4]:
# n-gram -> n=2
bigrams = []
for token_list in tokens:
    bigrams.extend(ngrams(token_list, 2))

print(bigrams)

[('i', 'love'), ('love', 'you'), ('i', 'love'), ('love', 'apple'), ('i', 'love'), ('love', 'programming'), ('you', 'love'), ('love', 'me'), ('she', 'loves'), ('loves', 'apple'), ('they', 'love'), ('love', 'you'), ('i', 'love'), ('love', 'you'), ('you', 'and'), ('and', 'you'), ('you', 'love'), ('love', 'me')]


In [5]:
# Bigram Frequency Counter
bigrams_freq = Counter(bigrams)

print(bigrams_freq)

Counter({('i', 'love'): 4, ('love', 'you'): 3, ('you', 'love'): 2, ('love', 'me'): 2, ('love', 'apple'): 1, ('love', 'programming'): 1, ('she', 'loves'): 1, ('loves', 'apple'): 1, ('they', 'love'): 1, ('you', 'and'): 1, ('and', 'you'): 1})


In [6]:
# n-gram -> n=3
trigrams = []
for token_list in tokens:
    trigrams.extend(ngrams(token_list, 3))

print(trigrams)

[('i', 'love', 'you'), ('i', 'love', 'apple'), ('i', 'love', 'programming'), ('you', 'love', 'me'), ('she', 'loves', 'apple'), ('they', 'love', 'you'), ('i', 'love', 'you'), ('love', 'you', 'and'), ('you', 'and', 'you'), ('and', 'you', 'love'), ('you', 'love', 'me')]


In [7]:
# Trigram Frequency Counter
trigrams_freq = Counter(trigrams)

print(trigrams_freq)

Counter({('i', 'love', 'you'): 2, ('you', 'love', 'me'): 2, ('i', 'love', 'apple'): 1, ('i', 'love', 'programming'): 1, ('she', 'loves', 'apple'): 1, ('they', 'love', 'you'): 1, ('love', 'you', 'and'): 1, ('you', 'and', 'you'): 1, ('and', 'you', 'love'): 1})


In [8]:
# Estimate the probability that the bigram "i love" is followed by "you" or "apple":
# count of the trigram divided by the count of its two-word context
i_love = ("i", "love")

prob_you = trigrams_freq[i_love + ("you",)] / bigrams_freq[i_love]
prob_apple = trigrams_freq[i_love + ("apple",)] / bigrams_freq[i_love]

In [9]:
print("Probability of you:", prob_you)
print("Probability of apple:", prob_apple)

Probability of you: 0.5
Probability of apple: 0.25
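
The calculation above can be generalized to any context. The following is a minimal sketch, not a definitive implementation: `train_ngram_model` and `next_word_probs` are hypothetical helper names, and sentences are split on whitespace so the block runs without NLTK.

```python
from collections import Counter, defaultdict


def train_ngram_model(sentences, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for words in sentences:
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            model[context][words[i + n - 1]] += 1
    return model


def next_word_probs(model, context):
    """Return {word: probability} for the words seen after the given context."""
    counts = model.get(tuple(context), Counter())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


corpus = [
    "I love you",
    "I love apple",
    "I love programming",
    "You love me",
    "She loves apple",
    "They love you",
    "I love you and you love me",
]
sentences = [sentence.lower().split() for sentence in corpus]

model = train_ngram_model(sentences, n=3)
probs = next_word_probs(model, ("i", "love"))
print(probs)
print("Most likely next word:", max(probs, key=probs.get))
```

For the context `("i", "love")` this reproduces the values computed above (0.5 for "you", 0.25 for "apple"). One design difference is worth noting: this sketch normalizes over the continuations actually observed after a context, so the probabilities always sum to 1, whereas dividing by the raw bigram count also counts occurrences of the context at the end of a sentence.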
