# Uni-gram, Bi-gram, Tri-gram, n-gram

## Uni-gram

- Here, every distinct word in the corpus is a dimension.
- For example, consider the following document corpus:
  1. This car drives good and is expensive.
  2. This car is very expensive and drives good.
- The uni-grams for the above document corpus are:<br>
  `['This', 'car', 'drives', 'good', 'and', 'is', 'expensive', 'very']`
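
To make "every word is a dimension" concrete, the sketch below builds the uni-gram vocabulary for the two example sentences and turns each sentence into a vector of word counts. The `tokenize` helper and its rules (split on whitespace, drop the trailing period) are assumptions made for illustration, not part of any library.

```python
from collections import Counter

corpus = [
    "This car drives good and is expensive.",
    "This car is very expensive and drives good.",
]

def tokenize(text):
    # Assumption: split on whitespace and drop the trailing period.
    return text.rstrip(".").split()

# Collect the unique words in order of first appearance.
vocabulary = []
for doc in corpus:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

# One count per vocabulary word, i.e. one value per dimension.
vectors = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each sentence becomes a vector with eight entries, one per vocabulary word. The two vectors differ only in the `very` dimension.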

## Bi-gram

- Here, every distinct pair of consecutive words is a dimension.
- For the document corpus discussed above, the bi-grams are:<br>
  `['This car', 'car drives', 'drives good', 'good and', 'and is', 'is expensive', 'car is', 'is very', 'very expensive', 'expensive and', 'and drives']`

## Tri-gram

- Here, every distinct triplet of consecutive words is a dimension.
- For the document corpus discussed above, the tri-grams are:<br>
  `['This car drives', 'car drives good', 'drives good and', 'good and is', 'and is expensive', 'This car is', 'car is very', 'is very expensive', 'very expensive and', 'expensive and drives', 'and drives good']`

## n-gram

- Here, every group of n consecutive words is a dimension.
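
The general case can be sketched as a sliding window of width n over the words of a sentence. The function name `ngrams` and the tokenization (whitespace split, trailing period dropped) are assumptions for illustration.

```python
def ngrams(text, n):
    """Return the n-grams of `text` as space-joined strings, in order."""
    # Assumption: whitespace tokens, with the trailing period dropped.
    words = text.rstrip(".").split()
    # Slide a window of n words across the sentence.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "This car drives good and is expensive."
for n in (1, 2, 3, 4):
    print(n, ngrams(sentence, n))
```

A sentence of m words yields m - n + 1 n-grams when m is at least n, and none otherwise.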

> **Note:** Bi-grams and tri-grams are useful as Bag of Words features because they capture short word sequences.<br>
> **Note:** Uni-grams discard the sequence information, while bi-grams, tri-grams, ..., n-grams retain some of it.
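
The point about sequence information can be checked with two sentences that use the same words in a different order. This is a minimal sketch; the sentences and the helper name `ngram_set` are made up for illustration.

```python
def ngram_set(text, n):
    # Assumption: lowercase the text and split on whitespace.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

a = "the dog bites the man"
b = "the man bites the dog"

# Same words, so the uni-gram sets are identical.
print(ngram_set(a, 1) == ngram_set(b, 1))

# Different word order, so the bi-gram sets differ.
print(ngram_set(a, 2) == ngram_set(b, 2))
print(ngram_set(a, 2) - ngram_set(b, 2))
```

With uni-grams the two sentences are indistinguishable. With bi-grams, `('dog', 'bites')` appears only in the first sentence and `('man', 'bites')` only in the second.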

## Implementing n-grams with scikit-learn

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# ngram_range=(min_n, max_n) sets which n-gram sizes to extract; (2, 2) means bi-grams only.
cv = CountVectorizer(ngram_range=(2, 2))

corpus = ['This car drives good and is expensive',
          'This car is better than the other car and is less expensive and drives good']

# fit() learns the vocabulary: each unique bi-gram in the corpus mapped to its column index.
cv.fit(corpus)
print('Vocabulary of unique bi-grams in the corpus:', cv.vocabulary_)

Vocabulary of unique bi-grams in the corpus: {'this car': 16, 'car drives': 4, 'drives good': 6, 'good and': 8, 'and is': 1, 'is expensive': 10, 'car is': 5, 'is better': 9, 'better than': 2, 'than the': 14, 'the other': 15, 'other car': 13, 'car and': 3, 'is less': 11, 'less expensive': 12, 'expensive and': 7, 'and drives': 0}


In [3]:
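# Feature names in column order; get_feature_names_out() is the current API (scikit-learn >= 1.0).
print(cv.get_feature_names_out().tolist())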

['and drives', 'and is', 'better than', 'car and', 'car drives', 'car is', 'drives good', 'expensive and', 'good and', 'is better', 'is expensive', 'is less', 'less expensive', 'other car', 'than the', 'the other', 'this car']


In [4]:
# transform() counts each vocabulary bi-gram per document and returns a sparse matrix.
X = cv.transform(corpus)
print(X.toarray())

[[0 1 0 0 1 0 1 0 1 0 1 0 0 0 0 0 1]
 [1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1]]


In [5]:
# Label each column of the count matrix with its bi-gram.
df = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
df

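|   | and drives | and is | better than | car and | car drives | car is | drives good | expensive and | good and | is better | is expensive | is less | less expensive | other car | than the | the other | this car |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |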
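
To see roughly what `fit` and `transform` compute for bi-grams, the same counts can be reproduced in plain Python. This is a sketch, not scikit-learn's implementation: it assumes lowercasing and tokens of two or more word characters (which mirrors `CountVectorizer`'s documented defaults), and the helper name `bigrams` is hypothetical.

```python
import re
from collections import Counter

corpus = [
    "This car drives good and is expensive",
    "This car is better than the other car and is less expensive and drives good",
]

def bigrams(text):
    # Assumption: lowercase, keep tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Pair each token with its successor.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# "fit": sorted unique bi-grams, each mapped to a column index.
features = sorted({bg for doc in corpus for bg in bigrams(doc)})
vocabulary = {bg: i for i, bg in enumerate(features)}

# "transform": one row of bi-gram counts per document.
matrix = []
for doc in corpus:
    counts = Counter(bigrams(doc))
    matrix.append([counts[bg] for bg in features])

print(features)
for row in matrix:
    print(row)
```

The rows match the count matrix shown above: the first document contains 6 of the 17 bi-grams and the second contains 14.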
