<h1 style="color:DodgerBlue; text-align:center; font-weight:bold; font-size:50px; background-color:lightblue; padding:20px 20px">N-GRAMS</h1>


N-grams are contiguous sequences of n items from a given sample of text or speech. They are used in various natural language processing (NLP) tasks to capture the context of words by considering their neighbors.

### Advantages of N-grams

- **Context Capture:** N-grams capture the context of words by considering their neighboring words.
- **Flexible Representation:** Allows for varying levels of granularity (unigrams, bigrams, trigrams, etc.).

### Disadvantages of N-grams

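- **Data Sparsity:** As n grows, most higher-order n-grams (e.g., trigrams) occur only once or not at all in the training text, so their counts become unreliable.
- **Increased Complexity:** Every distinct n-gram becomes its own feature column, so the feature space generally grows as n increases or as several n-gram sizes are combined.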
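
The sparsity point can be checked on a small scale. The sketch below is only an illustration: it uses the same four sentences that appear later in this notebook, a simple regex tokenizer rather than NLTK, and a helper named `ngram_counts` that is made up for this example. It counts how many distinct n-grams exist for each n and how many of them occur only once.

```python
import re
from collections import Counter

# The four sentences used later in this notebook.
corpus = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
    "The cat and the dog are friends.",
    "Birds can fly high in the sky.",
]

def ngram_counts(sentences, n):
    """Count lowercased word n-grams; n-grams never cross a sentence boundary."""
    counts = Counter()
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

for n in (1, 2, 3):
    counts = ngram_counts(corpus, n)
    singletons = sum(1 for c in counts.values() if c == 1)
    print(f"n={n}: {len(counts)} distinct n-grams, {singletons} seen only once")
```

On this tiny corpus, 14 of the 17 unigrams, 17 of the 19 bigrams, and all 18 trigrams occur exactly once, so the share of n-grams with only a single observation rises with n.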

### Example

Sentence: “The cat sat on the mat.”

- **1-grams (unigrams):** The, cat, sat, on, the, mat
- **2-grams (bigrams):** The cat, cat sat, sat on, on the, the mat
- **3-grams (trigrams):** The cat sat, cat sat on, sat on the, on the mat
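
The three lists above all follow one rule: slide a window of n tokens across the sentence, one position at a time. A minimal sketch of that rule follows; the helper name `make_ngrams` is an assumption made for illustration and is not part of any library.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens, each joined by a space."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sentence with k tokens yields k - n + 1 n-grams (none if k < n).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

example_tokens = ["The", "cat", "sat", "on", "the", "mat"]
for n in (1, 2, 3):
    print(n, make_ngrams(example_tokens, n))
```

With six tokens this gives six unigrams, five bigrams, and four trigrams, matching the lists above.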

In [1]:
# import requirements

from string import punctuation                 # a string containing the ASCII punctuation characters
from nltk.tokenize import word_tokenize        # splits a sentence into word and punctuation tokens

In [2]:
# take a list of sentences
sentences = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
    "The cat and the dog are friends.",
    "Birds can fly high in the sky."
]

sentences

['The cat sat on the mat.',
 'The dog barked at the cat.',
 'The cat and the dog are friends.',
 'Birds can fly high in the sky.']

In [3]:
# tokenize each sentence, then drop the punctuation tokens

sentence_tokens = [[word for word in word_tokenize(sent) if word not in punctuation] for sent in sentences]
sentence_tokens

[['The', 'cat', 'sat', 'on', 'the', 'mat'],
 ['The', 'dog', 'barked', 'at', 'the', 'cat'],
 ['The', 'cat', 'and', 'the', 'dog', 'are', 'friends'],
 ['Birds', 'can', 'fly', 'high', 'in', 'the', 'sky']]
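
One detail of the filter in the cell above is worth knowing: `punctuation` is a single string, so `word not in punctuation` is a substring test. It removes single marks such as `.`, but multi-character tokens such as `...` or `--` are not substrings of that string and slip through. The sketch below shows this on a hand-written token list (hypothetical, not actual tokenizer output) and compares it with one possible alternative that keeps a token only if it contains a letter or digit.

```python
from string import punctuation

# Hypothetical tokens of the kind a tokenizer may emit for quoted or elided text.
tokens = ["``", "Wait", "...", "''", "she", "said", "--", "twice", "."]

# Filter used in the cell above: a substring test against the punctuation string.
substring_filter = [t for t in tokens if t not in punctuation]

# Alternative: keep a token only if it contains at least one letter or digit.
alnum_filter = [t for t in tokens if any(ch.isalnum() for ch in t)]

print(substring_filter)
print(alnum_filter)
```

For the four sentences in this notebook both filters give the same result, because the only punctuation token is the final period.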

<h2 style="color:SlateBlue ; font-size:35px ; text-align:center">Manually Create N-Gram Tokens</h2>

In [4]:
# Manually create bi-gram tokens
print("Bi-Gram tokens are : ",'\n\n')

for tokens in sentence_tokens:                           # iterate over the tokenized sentences
    bi_grams = []                                        # empty list to collect this sentence's bi-grams
    for i in range(len(tokens) - 1):                     # i runs from 0 to n-2, where n is the number of tokens
        bi_token = ' '.join([tokens[i] , tokens[i+1]])   # join two consecutive words into one bi-gram
        bi_grams.append(bi_token)                        # add the bi-gram to the list

    print(f"Sentence : {' '.join(tokens)}")
    print(bi_grams , '\n')

Bi-Gram tokens are :  


Sentence : The cat sat on the mat
['The cat', 'cat sat', 'sat on', 'on the', 'the mat'] 

Sentence : The dog barked at the cat
['The dog', 'dog barked', 'barked at', 'at the', 'the cat'] 

Sentence : The cat and the dog are friends
['The cat', 'cat and', 'and the', 'the dog', 'dog are', 'are friends'] 

Sentence : Birds can fly high in the sky
['Birds can', 'can fly', 'fly high', 'high in', 'in the', 'the sky'] 



In [5]:
# Manually create tri-gram tokens
print("Tri-Gram tokens are : ",'\n\n')

for tokens in sentence_tokens:                                           # iterate over the tokenized sentences
    tri_grams = []                                                       # empty list to collect this sentence's tri-grams
    for i in range(len(tokens) - 2):                                     # i runs from 0 to n-3, where n is the number of tokens
        tri_token = ' '.join([tokens[i] , tokens[i+1] , tokens[i+2]])    # join three consecutive words into one tri-gram
        tri_grams.append(tri_token)                                      # add the tri-gram to the list

    print(f"Sentence : {' '.join(tokens)}")
    print(tri_grams , '\n')

Tri-Gram tokens are :  


Sentence : The cat sat on the mat
['The cat sat', 'cat sat on', 'sat on the', 'on the mat'] 

Sentence : The dog barked at the cat
['The dog barked', 'dog barked at', 'barked at the', 'at the cat'] 

Sentence : The cat and the dog are friends
['The cat and', 'cat and the', 'and the dog', 'the dog are', 'dog are friends'] 

Sentence : Birds can fly high in the sky
['Birds can fly', 'can fly high', 'fly high in', 'high in the', 'in the sky'] 



<h2 style="color:SlateBlue ; font-size:35px ; text-align:center">Create N-Gram Tokens Using CountVectorizer</h2>

In [6]:
# create bi-gram tokens using CountVectorizer (ngram_range=(2,2) keeps bi-grams only)

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2,2))

bi_gram = cv.fit_transform(sentences)

In [7]:
bigram_matrix = bi_gram.toarray()
bigram_matrix

array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1]])

In [8]:
# view the bi-gram count matrix as a DataFrame (rows: sentences, columns: bi-grams)
import pandas as pd

pd.DataFrame(data=bigram_matrix , index=sentences , columns=cv.get_feature_names_out())

sentence,and the,are friends,at the,barked at,birds can,can fly,cat and,cat sat,dog are,dog barked,fly high,high in,in the,on the,sat on,the cat,the dog,the mat,the sky
The cat sat on the mat.,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,1,0,1,0
The dog barked at the cat.,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0
The cat and the dog are friends.,1,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,0
Birds can fly high in the sky.,0,0,0,0,1,1,0,0,0,0,1,1,1,0,0,0,0,0,1
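
To see where the numbers in this table come from, the counting can be reproduced by hand. The sketch below is not scikit-learn's implementation. It assumes the two default behaviours of `CountVectorizer` that matter here, namely lowercasing the text and keeping only tokens of two or more word characters, and then builds an alphabetically sorted bi-gram vocabulary and a count matrix with plain Python.

```python
import re

sentences = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
    "The cat and the dog are friends.",
    "Birds can fly high in the sky.",
]

# Assumed to mirror CountVectorizer's defaults: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def bigrams(sentence):
    """Lowercase the sentence, tokenize it, and pair each word with the next."""
    words = TOKEN_PATTERN.findall(sentence.lower())
    return [" ".join(pair) for pair in zip(words, words[1:])]

# Vocabulary: every distinct bi-gram in the corpus, sorted alphabetically.
vocabulary = sorted({bg for s in sentences for bg in bigrams(s)})
column = {bg: j for j, bg in enumerate(vocabulary)}

# One row per sentence, one column per bi-gram, each cell an occurrence count.
matrix = [[0] * len(vocabulary) for _ in sentences]
for i, s in enumerate(sentences):
    for bg in bigrams(s):
        matrix[i][column[bg]] += 1

print(vocabulary)
for row in matrix:
    print(row)
```

Because of the lowercasing, "The cat" and "the cat" land in the same column, which is why the column names in the table are all lowercase while the manual bi-grams earlier keep their capital letters.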


In [9]:
# create tri-gram tokens using CountVectorizer (ngram_range=(3,3) keeps tri-grams only)

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(3,3))

tri_gram = cv.fit_transform(sentences)
tri_gram_matrix = tri_gram.toarray()
tri_gram_matrix

array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]])

In [10]:
pd.DataFrame(data=tri_gram_matrix , index=sentences , columns=cv.get_feature_names_out())

sentence,and the dog,at the cat,barked at the,birds can fly,can fly high,cat and the,cat sat on,dog are friends,dog barked at,fly high in,high in the,in the sky,on the mat,sat on the,the cat and,the cat sat,the dog are,the dog barked
The cat sat on the mat.,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,1,0,0
The dog barked at the cat.,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1
The cat and the dog are friends.,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,1,0
Birds can fly high in the sky.,0,0,0,1,1,0,0,0,0,1,1,1,0,0,0,0,0,0


<h2 style="color:SlateBlue ; font-size:35px ; text-align:center">Create Uni-Gram to Tri-Gram Tokens Using CountVectorizer</h2>

In [11]:
# create uni-gram, bi-gram and tri-gram tokens together using CountVectorizer (ngram_range=(1,3))

from sklearn.feature_extraction.text import CountVectorizer
n_cv = CountVectorizer(ngram_range=(1,3))

n_gram = n_cv.fit_transform(sentences)
n_gram_matrix = n_gram.toarray()
n_gram_matrix

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        0, 2, 1, 0, 1, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 2, 1, 0, 0, 1, 0, 1, 0, 0],
       [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 2, 1, 1, 0, 1, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        1, 1, 0, 0, 0, 0, 0, 0, 0, 1]])

In [12]:
pd.set_option('display.max_columns', None)
pd.DataFrame(data=n_gram_matrix , index=sentences , columns=n_cv.get_feature_names_out())

sentence,and,and the,and the dog,are,are friends,at,at the,at the cat,barked,barked at,barked at the,birds,birds can,birds can fly,can,can fly,can fly high,cat,cat and,cat and the,cat sat,cat sat on,dog,dog are,dog are friends,dog barked,dog barked at,fly,fly high,fly high in,friends,high,high in,high in the,in,in the,in the sky,mat,on,on the,on the mat,sat,sat on,sat on the,sky,the,the cat,the cat and,the cat sat,the dog,the dog are,the dog barked,the mat,the sky
The cat sat on the mat.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,2,1,0,1,0,0,0,1,0
The dog barked at the cat.,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0,0,1,0,1,0,0
The cat and the dog are friends.,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,1,0,1,1,0,0,0
Birds can fly high in the sky.,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1


                 _________________________________________ End _________________________________________