# Introduction to N-Grams

N-grams are contiguous sequences of n items, such as words or characters, extracted from text data. They help address the challenge of
capturing linguistic relationships and context: because the items in an n-gram are adjacent, a model sees how elements combine instead of
treating each one in isolation. This is particularly useful for sentiment analysis, where capturing phrases such as “not good” is crucial for
understanding negation. N-grams can also enhance text classification by accounting for the co-occurrence of words, and improve the accuracy of
machine translation by modeling word sequences.
The common types of n-grams are shown in the table below:

![image.png](attachment:fead75e5-cde5-4d9d-b1fd-c2e103ff8227.png)
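
To make the definition concrete, here is a minimal sketch that extracts word-level n-grams with plain Python. The helper name `extract_ngrams`, the whitespace tokenization, and the sample sentence are assumptions made for illustration; the rest of this notebook uses `CountVectorizer` instead.

```python
def extract_ngrams(text, n):
    """Return the word-level n-grams of `text` as a list of strings."""
    # Naive lowercase + whitespace tokenization, for illustration only.
    tokens = text.lower().split()
    # Slide a window of width n over the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The movie was not good"  # assumed sample sentence

unigrams = extract_ngrams(sentence, 1)
bigrams = extract_ngrams(sentence, 2)
trigrams = extract_ngrams(sentence, 3)

print(unigrams)  # ['the', 'movie', 'was', 'not', 'good']
print(bigrams)   # ['the movie', 'movie was', 'was not', 'not good']
print(trigrams)  # ['the movie was', 'movie was not', 'was not good']
```

The unigrams contain "not" and "good" as unrelated features, while the bigram list keeps the negation together as "not good".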


In [1]:
! pip install nltk



# 1. Unigrams
Example: How to extract unigrams from text data

In [2]:
# a. import necessary libraries

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import string

In [9]:
# b. read the necessary dataset

df = pd.read_csv('C:/Users/ariji/OneDrive/Desktop/Data/movie_review.csv')
df.head()

Unnamed: 0,review_id,review_text,rating,author_name
0,1,I have to go to the store.,5,john smith
1,2,This product is amazing.,4,jane doe
2,3,This is the best movie I have ever seen.,5,alex johnson
3,4,The customer support was terrible.,2,emily thompson
4,5,The food was delicious.,4,michael brown


In [6]:
# c. define text preprocessing function ( punctuation removal )

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    return text
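
One side effect of this function is worth knowing before reading the token lists: deleting punctuation also fuses hyphenated words and possessives, which is probably why tokens such as 'userfriendly' and 'books ending' appear in the output. The sketch below reproduces that behavior on two assumed sentences (the original review texts are not shown here) and compares it with a hypothetical variant, `punctuation_to_space`, that replaces punctuation with a space instead.

```python
import string


def remove_punctuation(text):
    # Same approach as the cell above: delete every punctuation character.
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)


def punctuation_to_space(text):
    # Hypothetical variant: map each punctuation character to a space,
    # then collapse the repeated whitespace this creates.
    translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(text.translate(translator).split())


# Assumed example sentences, not taken from the dataset.
print(remove_punctuation("The interface is user-friendly."))
# The interface is userfriendly
print(punctuation_to_space("The interface is user-friendly."))
# The interface is user friendly
print(remove_punctuation("The book's ending left me disappointed."))
# The books ending left me disappointed
```

Neither variant is strictly better: replacing with a space keeps "user" and "friendly" as separate tokens, but it also turns "book's" into "book s".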

In [10]:
# d. apply the function and update the review_text column

df['review_text'] = df['review_text'].apply(remove_punctuation)
df.head()

Unnamed: 0,review_id,review_text,rating,author_name
0,1,I have to go to the store,5,john smith
1,2,This product is amazing,4,jane doe
2,3,This is the best movie I have ever seen,5,alex johnson
3,4,The customer support was terrible,2,emily thompson
4,5,The food was delicious,4,michael brown


In [11]:
# e. build the unigram vocabulary (CountVectorizer lowercases the text and tokenizes it with word_tokenize)

vectorizer = CountVectorizer(tokenizer=word_tokenize)
X = vectorizer.fit_transform(df['review_text'])




In [15]:
# f. get the list of individual word tokens

unigrams = vectorizer.get_feature_names_out()
print(len(unigrams))

for unigram in unigrams:
    print(f"'{unigram}'")

111
'a'
'absolutely'
'actor'
'amazing'
'and'
'app'
'art'
'at'
'beautiful'
'best'
'blast'
'book'
'books'
'boring'
'causing'
'challenging'
'concert'
'crashing'
'customer'
'debate'
'delayed'
'delicious'
'description'
'disappointed'
'disappointing'
'during'
'educational'
'ending'
'ever'
'exam'
'exceptional'
'exhibit'
'exhibition'
'experience'
'flight'
'food'
'for'
'go'
'great'
'had'
'has'
'have'
'heated'
'hiking'
'horrible'
'hotel'
'hour'
'hours'
'i'
'impressive'
'in'
'inaccurate'
'incredible'
'inspiring'
'interface'
'is'
'issues'
'keeps'
'last'
'left'
'lights'
'long'
'magical'
'malfunctioning'
'math'
'me'
'movie'
'museum'
'my'
'new'
'night'
'ocean'
'of'
'offers'
'on'
'orchestra'
'party'
'performance'
'phone'
'political'
'product'
'restaurant'
'room'
'rush'
'seen'
'service'
'shopping'
'software'
'spelling'
'store'
'stunning'
'support'
'terrible'
'the'
'this'
'to'
'today'
'too'
'town'
'traffic'
'trail'
'unbearable'
'update'
'userfriendly'
'very'
'view'
'views'
'was'
'weather'
'were'
'wonder

# 2. Bigrams
Example: How to extract bigrams from text data

In [21]:
# a. import necessary libraries

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import string

In [22]:
# b. read the necessary dataset

df = pd.read_csv('C:/Users/ariji/OneDrive/Desktop/Data/movie_review.csv')
df.head()

Unnamed: 0,review_id,review_text,rating,author_name
0,1,I have to go to the store.,5,john smith
1,2,This product is amazing.,4,jane doe
2,3,This is the best movie I have ever seen.,5,alex johnson
3,4,The customer support was terrible.,2,emily thompson
4,5,The food was delicious.,4,michael brown


In [23]:
# c. define text preprocessing function ( punctuation removal )

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    return text

In [24]:
# d. apply the function and update the review_text column

df['review_text'] = df['review_text'].apply(remove_punctuation)
df.head()

Unnamed: 0,review_id,review_text,rating,author_name
0,1,I have to go to the store,5,john smith
1,2,This product is amazing,4,jane doe
2,3,This is the best movie I have ever seen,5,alex johnson
3,4,The customer support was terrible,2,emily thompson
4,5,The food was delicious,4,michael brown


In [25]:
# e. build the bigram vocabulary (ngram_range=(2, 2) keeps only two-word sequences)

vectorizer = CountVectorizer(tokenizer=word_tokenize, ngram_range=(2, 2))
X = vectorizer.fit_transform(df['review_text'])




In [28]:
# f. get the list of bigram word tokens

bigrams = vectorizer.get_feature_names_out()
print(len(bigrams))
for bigram in bigrams:
    print(f"'{bigram}'")

149
'a blast'
'a stunning'
'a wonderful'
'absolutely beautiful'
'actor was'
'and boring'
'app keeps'
'art exhibition'
'at the'
'at this'
'best movie'
'book is'
'books ending'
'causing issues'
'concert last'
'crashing on'
'customer service'
'customer support'
'debate was'
'delayed for'
'description is'
'during rush'
'ending left'
'ever seen'
'exam was'
'exhibit was'
'exhibition was'
'experience at'
'experience was'
'flight was'
'food was'
'for hours'
'go to'
'great food'
'had a'
'has great'
'have ever'
'have to'
'hiking trail'
'hotel room'
'hotel was'
'hour is'
'i had'
'i have'
'in this'
'in town'
'interface is'
'is absolutely'
'is amazing'
'is causing'
'is horrible'
'is inaccurate'
'is the'
'is unbearable'
'is userfriendly'
'keeps crashing'
'last night'
'left me'
'lights were'
'long and'
'malfunctioning today'
'math exam'
'me disappointed'
'movie i'
'movie was'
'museum exhibit'
'my phone'
'new restaurant'
'new software'
'night was'
'ocean view'
'of the'
'offers stunning'
'on my'
'orche

# 3. Trigrams
Example: How to extract trigrams from text data

In [29]:
# a. import necessary libraries

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import string

In [30]:
# b. read the necessary dataset

df = pd.read_csv('C:/Users/ariji/OneDrive/Desktop/Data/movie_review.csv')
df.head()

Unnamed: 0,review_id,review_text,rating,author_name
0,1,I have to go to the store.,5,john smith
1,2,This product is amazing.,4,jane doe
2,3,This is the best movie I have ever seen.,5,alex johnson
3,4,The customer support was terrible.,2,emily thompson
4,5,The food was delicious.,4,michael brown


In [31]:
# c. define text preprocessing function ( punctuation removal )

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    return text

In [32]:
# d. apply the function and update the review_text column

df['review_text'] = df['review_text'].apply(remove_punctuation)
df.head()

Unnamed: 0,review_id,review_text,rating,author_name
0,1,I have to go to the store,5,john smith
1,2,This product is amazing,4,jane doe
2,3,This is the best movie I have ever seen,5,alex johnson
3,4,The customer support was terrible,2,emily thompson
4,5,The food was delicious,4,michael brown


In [33]:
# e. build the trigram vocabulary (ngram_range=(3, 3) keeps only three-word sequences)

vectorizer = CountVectorizer(tokenizer=word_tokenize, ngram_range=(3, 3))
X = vectorizer.fit_transform(df['review_text'])




In [34]:
# f. get the list of trigram tokens

trigrams = vectorizer.get_feature_names_out()
print(len(trigrams))
for trigram in trigrams:
    print(f"'{trigram}'")

125
'a stunning ocean'
'a wonderful experience'
'actor was impressive'
'app keeps crashing'
'art exhibition was'
'at the hotel'
'at this hotel'
'best movie i'
'book is horrible'
'books ending left'
'concert last night'
'crashing on my'
'customer service at'
'customer support was'
'debate was heated'
'delayed for hours'
'description is inaccurate'
'during rush hour'
'ending left me'
'exam was challenging'
'exhibit was educational'
'exhibition was inspiring'
'experience at this'
'experience was very'
'flight was delayed'
'food was delicious'
'go to the'
'had a stunning'
'had a wonderful'
'has great food'
'have ever seen'
'have to go'
'hiking trail offers'
'hotel room had'
'hotel was exceptional'
'hour is unbearable'
'i had a'
'i have ever'
'i have to'
'in this book'
'in town has'
'interface is userfriendly'
'is absolutely beautiful'
'is causing issues'
'is the best'
'keeps crashing on'
'last night was'
'left me disappointed'
'lights were malfunctioning'
'long and boring'
'math exam was'


# 4. Quadgrams
Example: How to extract quadgrams from text data

In [35]:
# a. import necessary libraries

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import string

In [36]:
# b. read the necessary dataset

df = pd.read_csv('C:/Users/ariji/OneDrive/Desktop/Data/movie_review.csv')
df.head()

Unnamed: 0,review_id,review_text,rating,author_name
0,1,I have to go to the store.,5,john smith
1,2,This product is amazing.,4,jane doe
2,3,This is the best movie I have ever seen.,5,alex johnson
3,4,The customer support was terrible.,2,emily thompson
4,5,The food was delicious.,4,michael brown


In [37]:
# c. define text preprocessing function ( punctuation removal )

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    return text

In [38]:
# d. apply the function and update the review_text column

df['review_text'] = df['review_text'].apply(remove_punctuation)
df.head()

Unnamed: 0,review_id,review_text,rating,author_name
0,1,I have to go to the store,5,john smith
1,2,This product is amazing,4,jane doe
2,3,This is the best movie I have ever seen,5,alex johnson
3,4,The customer support was terrible,2,emily thompson
4,5,The food was delicious,4,michael brown


In [39]:
# e. build the quadgram vocabulary (ngram_range=(4, 4) keeps only four-word sequences)

vectorizer = CountVectorizer(tokenizer=word_tokenize, ngram_range=(4, 4))
X = vectorizer.fit_transform(df['review_text'])



In [40]:
# f. get the list of quadgram tokens

quadgrams = vectorizer.get_feature_names_out()
print(len(quadgrams))
for quadgram in quadgrams:
    print(f"'{quadgram}'")

95
'a stunning ocean view'
'a wonderful experience at'
'app keeps crashing on'
'art exhibition was inspiring'
'at the hotel was'
'best movie i have'
'books ending left me'
'concert last night was'
'crashing on my phone'
'customer service at the'
'customer support was terrible'
'during rush hour is'
'ending left me disappointed'
'experience at this hotel'
'experience was very disappointing'
'flight was delayed for'
'go to the store'
'had a stunning ocean'
'had a wonderful experience'
'have to go to'
'hiking trail offers stunning'
'hotel room had a'
'i had a wonderful'
'i have ever seen'
'i have to go'
'in this book is'
'in town has great'
'is the best movie'
'keeps crashing on my'
'last night was incredible'
'lights were malfunctioning today'
'math exam was challenging'
'movie i have ever'
'movie was too long'
'museum exhibit was educational'
'new restaurant in town'
'new software update is'
'of the actor was'
'orchestra performance was magical'
'party was a blast'
'performance of the a

# Limitations

While using n-grams offers benefits, there are also limitations:

As the length of n-grams increases, the number of possible combinations grows exponentially: a vocabulary of V words allows up to V^n distinct
word n-grams. This leads to high-dimensional, sparse feature spaces and increased memory and computational requirements. However, we can
mitigate this limitation by using feature selection techniques that retain only the most informative n-grams.
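
As a rough illustration of feature selection, the sketch below prunes bigrams by raw frequency, which is only a crude proxy for "most informative". The three documents, the helper `count_ngrams`, and the `min_count` threshold are assumptions made for this example. `CountVectorizer` offers related options (`min_df`, `max_features`), although `min_df` counts documents, not total occurrences.

```python
from collections import Counter


def count_ngrams(docs, n):
    """Count how often each word n-gram occurs across all documents."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenization
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


# Assumed toy corpus, not the movie_review.csv dataset.
docs = [
    "the food was delicious",
    "the food was terrible",
    "the service was terrible",
]

bigram_counts = count_ngrams(docs, 2)
min_count = 2  # hypothetical threshold: drop bigrams seen fewer than 2 times
kept = sorted(g for g, c in bigram_counts.items() if c >= min_count)

print(len(bigram_counts))  # 6 distinct bigrams before pruning
print(kept)                # ['food was', 'the food', 'was terrible']
```

Frequency pruning halves the feature space here, but it also drops "was delicious", which carries sentiment. A supervised criterion such as a chi-squared test against the labels would be a more faithful notion of "informative".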

While n-grams are useful for capturing local patterns of language, they often fail to capture broader contextual information. For instance, consider
the trigram “not good enough.” On its own, this trigram suggests a negative sentiment. However, without considering the surrounding context,
it’s challenging to determine the sentiment accurately. It could come from a sentence like “The product was not good enough, but the customer
service was excellent.” In this case, the overall sentiment is mixed, arguably even positive, so the trigram alone can lead to a misinterpretation.
To address this limitation, we can use more advanced language models and techniques like word embeddings, which capture richer semantic
relationships and contextual information.
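
The misinterpretation can be reproduced with a deliberately naive sketch: a classifier that flags a sentence as negative whenever it contains a trigram from a small lexicon. The lexicon `NEGATIVE_TRIGRAMS`, the helper `word_ngrams`, and the labeling rule are all hypothetical and exist only to show the failure mode.

```python
NEGATIVE_TRIGRAMS = {"not good enough"}  # hypothetical toy lexicon


def word_ngrams(text, n):
    """Lowercase, strip basic punctuation from each word, return word n-grams."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The product was not good enough, but the customer service was excellent."

# Look only at local trigrams and ignore the rest of the sentence.
hits = [g for g in word_ngrams(sentence, 3) if g in NEGATIVE_TRIGRAMS]
naive_label = "negative" if hits else "neutral"

print(hits)         # ['not good enough']
print(naive_label)  # negative
```

The rule fires on the first clause and never sees that the second clause reverses the overall impression.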

Overfitting is a concern when using n-grams, especially in smaller datasets, because long n-grams are rare and models can end up memorizing
phrases from the training data and then perform poorly on unseen text. Regularization (for example, L1 or L2 penalties, or dropout in neural
models), pruning rare n-grams, or reducing n-gram sizes can mitigate overfitting issues.
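
One way to see why longer n-grams generalize poorly is to measure how many n-grams in held-out text never appeared in the training text. The sketch below does this for a tiny assumed corpus. The helper names and the sentences are illustrative, not taken from the dataset.

```python
def ngram_set(docs, n):
    """Return the set of distinct word n-grams found in a list of documents."""
    grams = set()
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenization
        grams.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def unseen_rate(train_docs, test_docs, n):
    """Fraction of distinct test n-grams that never occur in the training docs."""
    seen = ngram_set(train_docs, n)
    test_grams = ngram_set(test_docs, n)
    if not test_grams:
        return 0.0
    return len(test_grams - seen) / len(test_grams)


# Assumed toy split.
train_docs = [
    "the food was delicious",
    "the hotel was exceptional",
    "the movie was too long",
]
test_docs = ["the hotel food was delicious"]

rates = [unseen_rate(train_docs, test_docs, n) for n in (1, 2, 3)]
for n, rate in zip((1, 2, 3), rates):
    print(n, round(rate, 2))
```

In this toy split every test word was seen during training, yet a quarter of the test bigrams and two thirds of the test trigrams are new, so features built on them carry no learned weight.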

N-grams can be inflexible when dealing with languages that exhibit word order variations or when handling noisy text data because they rely heavily
on fixed sequences. This limitation can be addressed by incorporating more flexible models like recurrent neural networks (RNNs) or transformers, 
which are better at capturing complex language structures.
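
The sensitivity to word order can be sketched by comparing two sentences that use exactly the same words in a different order. The sentences, the helper `ngram_set`, and the use of Jaccard similarity as the overlap measure are assumptions chosen for illustration.

```python
def ngram_set(text, n):
    """Return the set of distinct word n-grams in a single sentence."""
    tokens = text.lower().split()  # naive whitespace tokenization
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)


# Assumed example: same words, different order.
s1 = "the service was surprisingly good"
s2 = "surprisingly the service was good"

unigram_overlap = jaccard(ngram_set(s1, 1), ngram_set(s2, 1))
bigram_overlap = jaccard(ngram_set(s1, 2), ngram_set(s2, 2))

print(unigram_overlap)           # 1.0
print(round(bigram_overlap, 2))  # 0.33
```

The two sentences are identical as bags of words, but they share only two of six distinct bigrams, so a bigram-based model treats them as largely different inputs.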
