# N-grams


- An N-gram in Natural Language Processing (NLP) is a contiguous sequence of n items from a given text or speech.
- These items can be characters, words, or even phonemes depending on the context.
- N-grams are used to analyze the structure of a text and, in language models, to estimate how likely a sequence is to occur.
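
The definition above can be sketched in a few lines of plain Python. The helper name `make_ngrams` is hypothetical, chosen for illustration only; the same function works for word-level and character-level items because it only assumes a sliceable sequence.

```python
def make_ngrams(items, n):
    """Return all contiguous n-item windows of `items` as tuples."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Zip n staggered views of the sequence; zip stops at the shortest view,
    # so only complete windows are produced.
    return list(zip(*(items[i:] for i in range(n))))


words = "the quick brown fox".split()
print(make_ngrams(words, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]

chars = list("fox")
print(make_ngrams(chars, 2))
# [('f', 'o'), ('o', 'x')]
```

If `n` is larger than the sequence, the result is simply an empty list.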

## Types of N-grams
- **Unigram (1-gram):** A single item from the sequence (e.g., a word or a character). Example: "Hello", "NLP", "World".
- **Bigram (2-gram):** A sequence of two contiguous items. Example: "Hello NLP", "NLP World".
- **Trigram (3-gram):** A sequence of three contiguous items. Example: "Hello NLP World".
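
As a small illustration of the three orders listed above, the sketch below applies a sliding window to the example tokens "Hello", "NLP", "World". The function name `windows` is an assumption made for this example. A sequence of length L yields L - n + 1 n-grams, which is why the three-token example has only one trigram.

```python
tokens = ["Hello", "NLP", "World"]


def windows(seq, n):
    """Slide a window of width n over seq and collect each window as a tuple."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


by_order = {n: windows(tokens, n) for n in (1, 2, 3)}
for n, grams in by_order.items():
    print(n, grams)
# 1 [('Hello',), ('NLP',), ('World',)]
# 2 [('Hello', 'NLP'), ('NLP', 'World')]
# 3 [('Hello', 'NLP', 'World')]
```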

### Applications of N-grams
- **Text Prediction:** N-grams are used in predictive text systems, where the model predicts the next word in a sequence based on the previous words, as in mobile keyboard suggestions.

- **Language Modeling:** N-grams help in building language models that estimate the probability of a sequence of words. These models are useful in tasks like speech recognition, machine translation, and text generation.

- **Text Classification:** In sentiment analysis or spam detection, N-grams can be used as features to classify text into different categories.

- **Machine Translation:** N-gram statistics capture local word-order context, which helps a translation system choose more fluent word sequences.

- **Spell Checking:** N-grams can be used to identify and correct spelling errors by comparing sequences in a text against a known corpus.
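
To make the text prediction and language modeling items concrete, here is a minimal sketch of a bigram model built from raw counts. The toy corpus and the names `followers`, `next_word_probs`, and `predict_next` are all assumptions made for illustration. The probabilities are plain relative frequencies (count of the bigram divided by the count of all bigrams starting with the same word), with no smoothing.

```python
from collections import Counter, defaultdict

# Toy corpus, already tokenized by whitespace (illustrative data only).
corpus = "i like tea . i like coffee . i drink tea .".split()

# For each word, count which words follow it.
followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1


def next_word_probs(prev):
    """Relative-frequency estimate of P(next word | prev); empty if prev is unseen."""
    counts = followers.get(prev, Counter())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


def predict_next(prev):
    """Return the most frequent follower of prev, or None if prev was never seen."""
    counts = followers.get(prev)
    return counts.most_common(1)[0][0] if counts else None


print(predict_next("i"))     # like
print(next_word_probs("i"))  # 'like' gets 2/3 of the mass, 'drink' gets 1/3
```

A real system would add smoothing or backoff so that unseen word pairs do not receive zero probability.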

### Advantages of N-grams
- **Simplicity:** N-grams are straightforward to implement and understand.
 
- **Efficiency:** They capture the local context of words or characters in a sequence using nothing more than counting.

### Limitations of N-grams

- **Context Limitation:** N-grams consider only a fixed window of n items, so dependencies that span more than n items are missed.

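- **Data Sparsity:** Higher-order N-grams (such as trigrams or four-grams) need much more data to be effective, because the number of possible sequences grows exponentially with n: a vocabulary of V items allows V^n distinct n-grams, most of which never appear in a given corpus.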
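
The sparsity problem can be made concrete by comparing how many distinct n-grams a short text contains with how many its vocabulary could form. This is a rough sketch: the helper name `coverage` is hypothetical, and the tokens come from a plain whitespace split of a lowercased sentence rather than from a proper tokenizer.

```python
tokens = (
    "n-grams are used to analyze the structure of a text and to predict "
    "the likelihood of a sequence occurring in language models"
).split()

vocab_size = len(set(tokens))


def coverage(seq, n):
    """Return (distinct n-grams observed, n-grams possible for this vocabulary)."""
    observed = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
    return len(observed), len(set(seq)) ** n


for n in (1, 2, 3):
    observed, possible = coverage(tokens, n)
    print(f"n={n}: {observed} observed of {possible} possible")
# n=1: 18 observed of 18 possible
# n=2: 20 observed of 324 possible
# n=3: 20 observed of 5832 possible
```

The observed count stays roughly flat while the space of possible sequences multiplies by the vocabulary size at each step.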

# Implementation

Sample output for a short sentence ("I love natural language processing."):

Unigrams: [('I',), ('love',), ('natural',), ('language',), ('processing',), ('.',)]
Unigram Frequency: Counter({('I',): 1, ('love',): 1, ('natural',): 1, ('language',): 1, ('processing',): 1, ('.',): 1})
Bigrams: [('I', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing'), ('processing', '.')]
Bigram Frequency: Counter({('I', 'love'): 1, ('love', 'natural'): 1, ('natural', 'language'): 1, ('language', 'processing'): 1, ('processing', '.'): 1})
Trigrams: [('I', 'love', 'natural'), ('love', 'natural', 'language'), ('natural', 'language', 'processing'), ('language', 'processing', '.')]
Trigram Frequency: Counter({('I', 'love', 'natural'): 1, ('love', 'natural', 'language'): 1, ('natural', 'language', 'processing'): 1, ('language', 'processing', '.'): 1})


In [2]:
import nltk
from nltk import ngrams
from collections import Counter

# Download the Punkt tokenizer data used by nltk.word_tokenize
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
# Sample text
text = "N-grams are used to analyze the structure of a text and to predict the likelihood of a sequence occurring in language models."

In [6]:

# Tokenize the text into words
tokens = nltk.word_tokenize(text)
print(tokens)


['N-grams', 'are', 'used', 'to', 'analyze', 'the', 'structure', 'of', 'a', 'text', 'and', 'to', 'predict', 'the', 'likelihood', 'of', 'a', 'sequence', 'occurring', 'in', 'language', 'models', '.']


In [7]:
# Generate unigrams, bigrams, and trigrams
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

In [8]:
# Count the frequency of each N-gram
unigram_freq = Counter(unigrams)
bigram_freq = Counter(bigrams)
trigram_freq = Counter(trigrams)

In [10]:
# Print the results
print("Unigrams:", unigrams)


Unigrams: [('N-grams',), ('are',), ('used',), ('to',), ('analyze',), ('the',), ('structure',), ('of',), ('a',), ('text',), ('and',), ('to',), ('predict',), ('the',), ('likelihood',), ('of',), ('a',), ('sequence',), ('occurring',), ('in',), ('language',), ('models',), ('.',)]


In [11]:
print("Unigram Frequency:", unigram_freq)


Unigram Frequency: Counter({('to',): 2, ('the',): 2, ('of',): 2, ('a',): 2, ('N-grams',): 1, ('are',): 1, ('used',): 1, ('analyze',): 1, ('structure',): 1, ('text',): 1, ('and',): 1, ('predict',): 1, ('likelihood',): 1, ('sequence',): 1, ('occurring',): 1, ('in',): 1, ('language',): 1, ('models',): 1, ('.',): 1})


In [12]:
print("Bigrams:", bigrams)


Bigrams: [('N-grams', 'are'), ('are', 'used'), ('used', 'to'), ('to', 'analyze'), ('analyze', 'the'), ('the', 'structure'), ('structure', 'of'), ('of', 'a'), ('a', 'text'), ('text', 'and'), ('and', 'to'), ('to', 'predict'), ('predict', 'the'), ('the', 'likelihood'), ('likelihood', 'of'), ('of', 'a'), ('a', 'sequence'), ('sequence', 'occurring'), ('occurring', 'in'), ('in', 'language'), ('language', 'models'), ('models', '.')]


In [13]:
print("Bigram Frequency:", bigram_freq)


Bigram Frequency: Counter({('of', 'a'): 2, ('N-grams', 'are'): 1, ('are', 'used'): 1, ('used', 'to'): 1, ('to', 'analyze'): 1, ('analyze', 'the'): 1, ('the', 'structure'): 1, ('structure', 'of'): 1, ('a', 'text'): 1, ('text', 'and'): 1, ('and', 'to'): 1, ('to', 'predict'): 1, ('predict', 'the'): 1, ('the', 'likelihood'): 1, ('likelihood', 'of'): 1, ('a', 'sequence'): 1, ('sequence', 'occurring'): 1, ('occurring', 'in'): 1, ('in', 'language'): 1, ('language', 'models'): 1, ('models', '.'): 1})


In [14]:
print("Trigrams:", trigrams)


Trigrams: [('N-grams', 'are', 'used'), ('are', 'used', 'to'), ('used', 'to', 'analyze'), ('to', 'analyze', 'the'), ('analyze', 'the', 'structure'), ('the', 'structure', 'of'), ('structure', 'of', 'a'), ('of', 'a', 'text'), ('a', 'text', 'and'), ('text', 'and', 'to'), ('and', 'to', 'predict'), ('to', 'predict', 'the'), ('predict', 'the', 'likelihood'), ('the', 'likelihood', 'of'), ('likelihood', 'of', 'a'), ('of', 'a', 'sequence'), ('a', 'sequence', 'occurring'), ('sequence', 'occurring', 'in'), ('occurring', 'in', 'language'), ('in', 'language', 'models'), ('language', 'models', '.')]


In [15]:
print("Trigram Frequency:", trigram_freq)

Trigram Frequency: Counter({('N-grams', 'are', 'used'): 1, ('are', 'used', 'to'): 1, ('used', 'to', 'analyze'): 1, ('to', 'analyze', 'the'): 1, ('analyze', 'the', 'structure'): 1, ('the', 'structure', 'of'): 1, ('structure', 'of', 'a'): 1, ('of', 'a', 'text'): 1, ('a', 'text', 'and'): 1, ('text', 'and', 'to'): 1, ('and', 'to', 'predict'): 1, ('to', 'predict', 'the'): 1, ('predict', 'the', 'likelihood'): 1, ('the', 'likelihood', 'of'): 1, ('likelihood', 'of', 'a'): 1, ('of', 'a', 'sequence'): 1, ('a', 'sequence', 'occurring'): 1, ('sequence', 'occurring', 'in'): 1, ('occurring', 'in', 'language'): 1, ('in', 'language', 'models'): 1, ('language', 'models', '.'): 1})
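
Printing a full `Counter` gets unwieldy as the text grows, so it is often more useful to look only at the most frequent n-grams. The sketch below uses `Counter.most_common` for that. To stay runnable without NLTK data it swaps in a simple lowercase whitespace split in place of `nltk.word_tokenize` (an assumption of this example), so its tokens differ slightly from the cells above: the final period is dropped and "N-grams" is lowercased.

```python
from collections import Counter

text = ("N-grams are used to analyze the structure of a text and to predict "
        "the likelihood of a sequence occurring in language models.")

# Simplified tokenizer (assumption): lowercase, drop the final period, split on spaces.
tokens = text.lower().rstrip(".").split()

# Pair each token with its successor to form bigrams, then count them.
bigram_freq = Counter(zip(tokens, tokens[1:]))

# The single most frequent bigram and its count.
top = bigram_freq.most_common(1)
print(top)  # [(('of', 'a'), 2)]

# Only the bigrams that occur more than once, as readable strings.
repeated = [" ".join(gram) for gram, count in bigram_freq.items() if count > 1]
print(repeated)  # ['of a']
```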
