### N-gram

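An N-gram is a contiguous sequence of N items, usually words or characters, taken from a text. N-gram models are statistical language models used in Natural Language Processing (NLP): they count how often each sequence occurs and use those counts to estimate how likely a word is given the N-1 words that precede it.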

- Unigram (1-gram): a single word, e.g. "sample".
- Bigram (2-gram): two consecutive words, e.g. "sample text".
- Trigram (3-gram): three consecutive words, e.g. "this sample text".
- N-gram: the general case, any sequence of N consecutive words or characters.
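The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the helper name `ngrams` and the example sentence are assumptions made for this sketch, and tokenization is a plain whitespace split.

```python
def ngrams(items, n):
    """Return every run of n consecutive items as a tuple.

    `items` can be a list of word tokens or a string (for character n-grams).
    Hypothetical helper written for illustration only.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L contains L - n + 1 windows of size n.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Naive whitespace tokenization; real pipelines usually normalize text first.
tokens = "this sample text demonstrates natural language processing".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# The same helper works on characters because strings are sequences too.
char_trigrams = ngrams("text", 3)

print(bigrams[:2])
print(char_trigrams)
```

A sequence shorter than N simply yields no N-grams, which is why short documents contribute few or no higher-order features.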

!["n-gram"](../images/2/2-n-gram.png)


---


In [1]:
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "This is an sample text",
    "This sample text demonstrates the natural language processing",
]

# One CountVectorizer per N-gram size: ngram_range=(n, n) keeps only N-grams of exactly n words
vectorizer_unigram = CountVectorizer(ngram_range=(1, 1))
vectorizer_bigram = CountVectorizer(ngram_range=(2, 2))
vectorizer_trigram = CountVectorizer(ngram_range=(3, 3))

In [2]:
# Unigram
X_unigram = vectorizer_unigram.fit_transform(documents)
unigram_features = vectorizer_unigram.get_feature_names_out()
print("unigram_features:", unigram_features)

unigram_features: ['an' 'demonstrates' 'is' 'language' 'natural' 'processing' 'sample'
 'text' 'the' 'this']


In [3]:
# Bigram
X_bigram = vectorizer_bigram.fit_transform(documents)
bigram_features = vectorizer_bigram.get_feature_names_out()
print("bigram_features:", bigram_features)

bigram_features: ['an sample' 'demonstrates the' 'is an' 'language processing'
 'natural language' 'sample text' 'text demonstrates' 'the natural'
 'this is' 'this sample']


In [4]:
# Trigram
X_trigram = vectorizer_trigram.fit_transform(documents)
trigram_features = vectorizer_trigram.get_feature_names_out()
print("trigram_features:", trigram_features)

trigram_features: ['an sample text' 'demonstrates the natural' 'is an sample'
 'natural language processing' 'sample text demonstrates'
 'text demonstrates the' 'the natural language' 'this is an'
 'this sample text']
