# Bag of N-Grams

In natural language processing (NLP), the Bag of N-Grams model represents text in a structured form that machine learning algorithms can use. An N-gram is a contiguous sequence of N items from a given text or speech sample. The items may be words, syllables, or characters. To generate a feature set for text analysis, the model builds a "bag", that is, an unordered collection of these N-grams.


* Unigram (1-gram): "I love NLP" -> ["I", "love", "NLP"]

* Bigram (2-gram): "I love NLP" -> ["I love", "love NLP"]

* Trigram (3-gram): "I love NLP" -> ["I love NLP"]

## Comparison with Bag of Words Model

The Bag of Words (BoW) model and the Bag of N-Grams model are both core NLP techniques, but they differ in several important ways.

1. Word Order: The BoW Model treats each word as an independent feature and disregards the word order within the text. By taking word sequences into account, however, the Bag of N-Grams Model manages to preserve a portion of the word order information.

2. Context Capture: Because BoW only takes individual words into account, it cannot capture context well. The Bag of N-Grams Model is more useful for tasks where word order matters because it incorporates word sequences, which capture local context.

3. Dimensionality: The BoW Model usually yields a lower-dimensional feature space than the Bag of N-Grams Model, and the gap widens for higher values of N. The larger feature space can lead to sparsity problems in the N-gram feature matrix.
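The word-order point above can be illustrated with a minimal sketch. It assumes simple whitespace tokenization, and the helper name `ngram_counts` is made up for this example. Two sentences with the same words in a different order get identical unigram counts but different bigram counts.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    # zip over n shifted copies of the token list to form the n-grams
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

a = "dog bites man"
b = "man bites dog"

# Bag of Words (n=1): the counts are identical, so word order is lost
print(ngram_counts(a, 1) == ngram_counts(b, 1))

# Bag of bigrams (n=2): the two sentences produce different features
print(ngram_counts(a, 2))
print(ngram_counts(b, 2))
```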

## Understanding N-Grams

N-grams are contiguous groups of n items from a given text or speech sample. They are widely used in NLP, for example in text analysis, language modeling, and machine learning applications. N-grams are named according to the value of n.

1. Unigrams

    A unigram (n = 1) is the most basic type of n-gram: each unigram is a single word in the document. Unigrams are helpful for simple text analysis tasks, but they lack the context that word combinations give.

    For example, in the sentence "The cat sat on the mat," the unigrams are:

    * "The"
    * "cat"
    * "sat"
    * "on"
    * "the"
    * "mat"

2. Bigrams

    Bigrams consist of two adjacent words (n = 2). Because they look at word pairs, they capture part of the context. Bigrams are useful for identifying relationships between words because they carry more contextual information than unigrams.

    Using the same sentence, the bigrams are:

    * "The cat"
    * "cat sat"
    * "sat on"
    * "on the"
    * "the mat"

3. Trigrams

    A trigram is a sequence of three consecutive words (n = 3). Because trigrams cover three-word sequences, they provide more context than bigrams. They are useful for more in-depth text analysis since they can capture patterns at the phrase level.

    From the example sentence, the trigrams are:

    * "The cat sat"
    * "cat sat on"
    * "sat on the"
    * "on the mat"

4. Higher-order N-Grams

    Higher-order n-grams (n > 3) extend this idea to four or more words. They can capture more intricate language patterns, but they need more data and more processing. They are most useful for specific NLP tasks where capturing longer context is essential.

    For instance, the 4-grams for our sentence would be:

    * "The cat sat on"
    * "cat sat on the"
    * "sat on the mat"
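The number of items in each list above follows a simple pattern. As a short worked check, a window of size n fits into a sentence of L tokens at L - n + 1 positions, which for the six-token example sentence gives:

```latex
\#\text{n-grams}(L, n) = \max(0,\; L - n + 1), \qquad
L = 6:\quad n = 1 \to 6,\quad n = 2 \to 5,\quad n = 3 \to 4,\quad n = 4 \to 3
```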

## Generating N-Grams from Text

Generating n-grams is the next step after preprocessing the text. N-grams are consecutive groups of n textual elements (words, letters, etc.).

* Sliding Window Approach

    The sliding window method moves a window of size n over the text one token at a time and records the contents of the window at each position. With a window size of 2 (bigrams), for example, the window records each pair of consecutive words.

    Example:

    Text: ["natural", "language", "processing", "fascinating"]
    
    Bigrams: [("natural", "language"), ("language", "processing"), ("processing", "fascinating")]

    For trigrams (n=3), the window captures triplets of consecutive words. 
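The sliding window described above can be sketched in a few lines of plain Python. The function name `sliding_ngrams` is an assumption made for this illustration; it works on any sequence, so the same window can be applied to a list of words or to the characters of a string.

```python
def sliding_ngrams(items, n):
    """Return all n-grams of a sequence as tuples, using a sliding window."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # The window starts at positions 0 .. len(items) - n
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

tokens = ["natural", "language", "processing", "fascinating"]

print(sliding_ngrams(tokens, 2))  # word bigrams
print(sliding_ngrams(tokens, 3))  # word trigrams
print(sliding_ngrams("text", 2))  # character bigrams
```

If the sequence is shorter than n, the range is empty and the function returns an empty list.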

## Handling Boundaries in Text

Text boundaries must be handled carefully when creating n-grams, particularly for texts that are split into sentences or documents.

* Sentence Boundaries: Make sure n-grams don't cross from one sentence into the next. It is best to process each sentence on its own.

    Example:

    Text: "Natural Language Processing is fascinating. It has many applications."

    Sentence 1 Bigrams: [("natural", "language"), ("language", "processing"), ("processing", "is"), ("is", "fascinating")]

    Sentence 2 Bigrams: [("it", "has"), ("has", "many"), ("many", "applications")]
    
* Document Boundaries: If the text is spread across several documents, make sure that n-grams are created independently within each document.
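A minimal sketch of sentence-level handling is shown below. It assumes a naive splitter that breaks on `.`, `!`, or `?` and a simple regular-expression tokenizer; both are simplifications chosen for illustration, and `sentence_ngrams` is a hypothetical helper name.

```python
import re

def sentence_ngrams(text, n):
    """Generate word n-grams per sentence so that no n-gram spans a boundary."""
    # Naive sentence splitter (an assumption): break on '.', '!' or '?'
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    result = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z0-9']+", sentence)
        result.append([tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)])
    return result

text = "Natural Language Processing is fascinating. It has many applications."
for number, grams in enumerate(sentence_ngrams(text, 2), start=1):
    print(f"Sentence {number} bigrams:", grams)
```

Because each sentence is windowed separately, the pair ("fascinating", "it") is never produced. The same idea applies to documents: call the function once per document rather than on the concatenated text.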

## Vector Representation of Text

Frequency Counts:

A frequency-count representation converts text into a numerical vector in which each element is the number of times a particular N-gram appears in the text. This is a simple way of quantifying textual data. The steps are:

1. Tokenization: Tokenize the text by dividing it into discrete words or characters.

2. Produce N-Grams: Construct N-word sequences. For instance, the sentence "The cat sat" produces "The cat" and "cat sat" for bigrams (N=2).

3. Count Frequencies: Determine how many times each N-Gram appears in the text.

Example:

Sentences: "I Love Natural Language Processing.", "It has many applications in various domains."

After preprocessing (lowercasing, stop-word removal, and lemmatization), the text becomes:

"love natural language processing", "many application various domain"

The bi-grams are:

"love natural" "natural language" "language processing" "many application" "application various" "various domain"

The feature matrix is:

<table>
  <tr>
    <th rowspan="2">Sentences</th>
    <th colspan="6">Features</th>
  </tr>
  <tr>
    <th>love natural</th>
    <th>natural language</th>
    <th>language processing</th>
    <th>many application</th>
    <th>application various</th>
    <th>various domain</th>
  </tr>
  <tr>
    <td>love natural language processing</td>
    <td>1</td>
    <td>1</td>
    <td>1</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
  </tr>
  <tr>
    <td>many application various domain</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>1</td>
    <td>1</td>
    <td>1</td>
  </tr>
</table>

The feature vectors for each of the sentences are:
* "I Love Natural Language Processing." -> [1,1,1,0,0,0]

* "It has many applications in various domains." -> [0,0,0,1,1,1]
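The feature matrix above can be reproduced with a short sketch that uses only the standard library. It starts from the already preprocessed sentences and builds the vocabulary in order of first appearance, which is an assumption made to match the column order of the table; the helper name `bigrams` is made up for this example.

```python
from collections import Counter

def bigrams(text):
    """Return the word bigrams of a whitespace-tokenized text as strings."""
    tokens = text.split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

# Preprocessed sentences from the example above
docs = ["love natural language processing", "many application various domain"]

# Vocabulary: every distinct bigram, in order of first appearance
vocab = []
for doc in docs:
    for gram in bigrams(doc):
        if gram not in vocab:
            vocab.append(gram)

# One count vector per document, with one entry per vocabulary bigram
matrix = []
for doc in docs:
    counts = Counter(bigrams(doc))
    matrix.append([counts[gram] for gram in vocab])

print(vocab)
print(matrix)  # [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]
```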


## N-Grams with NLTK and SpaCy

In [20]:
# Importing libraries
import spacy
from nltk.util import ngrams

# Load spacy's english model
nlp = spacy.load("en_core_web_sm")

In [48]:
def generate_n_grams(text, n):
    # Lowercase and split the text into sentences
    text = text.lower()
    sentences = text.split(".")
    
    n_gram_list = []
    
    # Compute n-grams for each sentence to preserve text boundaries
    # and add them to n_gram_list
    for sentence in sentences:
        doc = nlp(sentence)
    
        # Keep only alphabetic tokens (this also drops punctuation and digits)
        tokens = [token for token in doc if token.is_alpha]

        # Generate N-Grams
        n_grams = ngrams(tokens, n)

        n_gram_list.extend(n_grams)

    return n_gram_list


sample_text = "I love to play cricket. I also like to watch football tho."

# Generate and print n-grams
print("N=1 (unigram): ", generate_n_grams(sample_text, 1))
print("N=2 (bigram):", generate_n_grams(sample_text, 2))
print("N=3 (trigram):", generate_n_grams(sample_text, 3))

N=1 (unigram):  [(i,), (love,), (to,), (play,), (cricket,), (i,), (also,), (like,), (to,), (watch,), (football,), (tho,)]
N=2 (bigram): [(i, love), (love, to), (to, play), (play, cricket), (i, also), (also, like), (like, to), (to, watch), (watch, football), (football, tho)]
N=3 (trigram): [(i, love, to), (love, to, play), (to, play, cricket), (i, also, like), (also, like, to), (like, to, watch), (to, watch, football), (watch, football, tho)]


## Sentiment Analysis with N-Grams

In [115]:
# Importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

In [63]:
# Load and inspect the data
data = pd.read_csv("./restaurant_reviews.csv")
data.head()

Unnamed: 0,Restaurant,Reviewer,Review,Rating,Metadata,Time,Pictures,7514
0,Beyond Flavours,Rusha Chakraborty,"The ambience was good, food was quite good . h...",5,"1 Review , 2 Followers",5/25/2019 15:54,0,2447.0
1,Beyond Flavours,Anusha Tirumalaneedi,Ambience is too good for a pleasant evening. S...,5,"3 Reviews , 2 Followers",5/25/2019 14:20,0,
2,Beyond Flavours,Ashok Shekhawat,A must try.. great food great ambience. Thnx f...,5,"2 Reviews , 3 Followers",5/24/2019 22:54,0,
3,Beyond Flavours,Swapnil Sarkar,Soumen das and Arun was a great guy. Only beca...,5,"1 Review , 1 Follower",5/24/2019 22:11,0,
4,Beyond Flavours,Dileep,Food is good.we ordered Kodi drumsticks and ba...,5,"3 Reviews , 2 Followers",5/24/2019 21:37,0,


In [72]:
# Select the required columns
review_rating_data = data[["Review", "Rating"]]
review_rating_data.head()

Unnamed: 0,Review,Rating
0,"The ambience was good, food was quite good . h...",5
1,Ambience is too good for a pleasant evening. S...,5
2,A must try.. great food great ambience. Thnx f...,5
3,Soumen das and Arun was a great guy. Only beca...,5
4,Food is good.we ordered Kodi drumsticks and ba...,5


In [73]:
# Check for null values
print(review_rating_data.isna().sum())

# Drop rows with null values
review_rating_data = review_rating_data.dropna()

print(review_rating_data.isna().sum())

Review    45
Rating    38
dtype: int64
Review    0
Rating    0
dtype: int64


In [86]:
# Check the class distribution
review_rating_data["Rating"].value_counts()

Rating
5       3826
4       2373
1       1735
3       1192
2        684
4.5       69
3.5       47
2.5       19
1.5        9
Like       1
Name: count, dtype: int64

In [None]:
# Unique ratings in the data
review_rating_data["Rating"].unique()

array(['5', '4', '1', '3', '2', '3.5', '4.5', '2.5', '1.5', 'Like'],
      dtype=object)

In [119]:
# Extract rows with integer ratings between 1 and 5.
# .copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning
# when columns are assigned below.
review_rating_1_to_5_data = review_rating_data[review_rating_data["Rating"].isin(["1", "2", "3", "4", "5"])].copy()

# Convert the ratings from strings to integers
review_rating_1_to_5_data["Rating"] = review_rating_1_to_5_data["Rating"].astype(int)

review_rating_1_to_5_data.head()

Unnamed: 0,Review,Rating
0,"The ambience was good, food was quite good . h...",5
1,Ambience is too good for a pleasant evening. S...,5
2,A must try.. great food great ambience. Thnx f...,5
3,Soumen das and Arun was a great guy. Only beca...,5
4,Food is good.we ordered Kodi drumsticks and ba...,5


In [None]:
# Function to preprocess the Reviews
def preprocess(text):
    # Process the text
    doc = nlp(text)
    
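    # Lowercase each alphabetic token; punctuation and digits are dropped.
    # Note: `token.lemma_ and token.lower_` evaluates to token.lower_, so the output
    # below is lowercased but not lemmatized. Use token.lemma_.lower() to lemmatize too.
    processed_tokens = [token.lower_ for token in doc if token.is_alpha]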
    # Join the token to form a string
    return " ".join(processed_tokens)

# Apply preprocessing function to the dataframe
review_rating_1_to_5_data["clean_text"] = review_rating_1_to_5_data["Review"].apply(lambda x: preprocess(x))

review_rating_1_to_5_data.head()


Unnamed: 0,Review,Rating,clean_text
0,"The ambience was good, food was quite good . h...",5,the ambience was good food was quite good had ...
1,Ambience is too good for a pleasant evening. S...,5,ambience is too good for a pleasant evening se...
2,A must try.. great food great ambience. Thnx f...,5,a must try great food great ambience thnx for ...
3,Soumen das and Arun was a great guy. Only beca...,5,soumen das and arun was a great guy only becau...
4,Food is good.we ordered Kodi drumsticks and ba...,5,food is ordered kodi drumsticks and basket mut...


In [124]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    review_rating_1_to_5_data["clean_text"],
    review_rating_1_to_5_data["Rating"],
    test_size=0.2,
    shuffle=True,
    random_state=42
    )

In [126]:
# Shapes
print("X_train shape: ", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)

X_train shape:  (7848,)
X_test shape:  (1962,)
y_train shape:  (7848,)
y_test shape:  (1962,)


**The ngram_range parameter in CountVectorizer**

ngram_range -> tuple (min_n, max_n), default=(1, 1)

The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted.
All values of n such that min_n <= n <= max_n will be used.
For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

Source: [Scikit-learn Docs](<https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>)
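To make the (min_n, max_n) semantics concrete, here is a pure-Python sketch of the selection that `ngram_range` describes. It is not scikit-learn's implementation: the function `extract_ngrams` and its whitespace tokenization are assumptions for illustration only.

```python
def extract_ngrams(tokens, ngram_range=(1, 1)):
    """Collect word n-grams for every n with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

tokens = "the food was great".split()

print(extract_ngrams(tokens, (1, 1)))  # unigrams only
print(extract_ngrams(tokens, (1, 2)))  # unigrams and bigrams
print(extract_ngrams(tokens, (2, 2)))  # bigrams only
```

A range such as (1, 2) keeps the single-word features and adds the bigrams on top, so the feature space is always at least as large as with (1, 1).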

In [147]:
# Extract unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1,2))
X_train_cv = vectorizer.fit_transform(X_train.values)
X_test_cv = vectorizer.transform(X_test.values)

In [157]:
# The vocabulary is a combination of unigrams and bigrams
list(vectorizer.vocabulary_.keys())[15:30]

['recommend',
 'flavours',
 'are',
 'ambience',
 'and',
 'service',
 'must',
 'visit',
 'ordered veg',
 'veg pasta',
 'pasta lasagne',
 'lasagne butter',
 'butter chicken',
 'chicken biryani',
 'biryani colsaw']

In [151]:
# Train and test the model
model = RandomForestClassifier()

model.fit(X_train_cv, y_train)

y_pred = model.predict(X_test_cv)

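# Note: classification_report expects (y_true, y_pred). With the arguments in this
# order, the precision and recall columns in the report below are swapped and the
# support column counts predicted labels. Use classification_report(y_test, y_pred)
# for the conventional layout.
print(classification_report(y_pred, y_test))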

              precision    recall  f1-score   support

           1       0.82      0.70      0.76       416
           2       0.06      0.53      0.10        15
           3       0.07      0.34      0.12        47
           4       0.32      0.44      0.37       348
           5       0.92      0.62      0.74      1136

    accuracy                           0.60      1962
   macro avg       0.44      0.53      0.42      1962
weighted avg       0.76      0.60      0.66      1962



## Advantages of the Bag of N-Grams Model
1. Capturing Local Context

    One of the main benefits of the Bag of N-Grams model is its ability to capture the local context inside a text. In contrast to the Bag of Words model, which handles words separately, the Bag of N-Grams model takes word sequences into account. This makes the relationships between neighboring words visible to the model. In the phrase "The quick brown fox", for instance, a bigram model keeps "quick brown" and "brown fox" as units, preserving context that a unigram model would lose. This is especially helpful for tasks like sentiment analysis, where word combinations and order can drastically change the meaning of a statement.

2. Better Performance in Specific Tasks Compared to Unigrams

    The Bag of N-Grams model often performs better than the simpler Bag of Words model in NLP tasks where context and word order play a significant role. For instance, bigrams and trigrams may offer more discriminative features than individual words in text classification applications like spam detection or sentiment analysis. By taking neighboring word pairs or triplets into account, the model can identify phrases and expressions that are characteristic of particular categories or sentiments, which can improve the accuracy and robustness of the analysis.

3. Flexibility in Choosing N

    Another important benefit is the freedom to choose the value of N. Different values can be tried depending on the nature of the text and the task at hand. For instance, bigrams (N = 2) are frequently adequate for capturing local context in sentiment analysis, whereas trigrams (N = 3) may be better suited to tasks that need more context, such as named entity recognition or language modeling. This flexibility lets practitioners and researchers test various N values and balance model complexity against computational efficiency for their particular application.

## Limitations of the Bag of N-Grams Model

1. Curse of Dimensionality

    The Bag of N-Grams model frequently suffers from the curse of dimensionality: the number of possible features (N-grams) grows exponentially as N increases. For instance, a corpus with a vocabulary of 10,000 distinct words can produce up to 10,000^2 (100 million) bigrams and 10,000^3 (1 trillion) trigrams. This sharp rise in the number of possible N-grams causes several problems.

    * Computational Complexity: Processing and analyzing huge datasets effectively is challenging since handling such a large feature set demands a substantial amount of memory and processing resources.

    * Overfitting: When a model has too many features, it may overfit the training set and capture noise rather than general patterns. As a result, the model generalizes poorly to unseen data.

2. Data Sparsity Issues

    Data sparsity is a further significant obstacle in the Bag of N-Grams Model. As the number of possible N-grams grows, many of them appear rarely or not at all in the text corpus.

    * Sparse Feature Matrices: The generated feature matrices consist mostly of zeros because each document contains only a small fraction of all N-grams. Storing and processing such matrices in dense form is expensive.

    * Inefficient Feature Utilisation: Many of the N-grams that are produced may not provide the model with useful information, which can waste resources and may reduce model performance.

3. Lack of Semantic Understanding

    * Ignoring Context: The model treats each N-gram as a stand-alone unit and ignores the wider context in which it appears. For example, "good" and "not good" are treated as unrelated features, and a negation that lies more than N words away from the word it modifies is missed entirely.

    * Word Sense Ambiguity: Words with several meanings (polysemous words) are difficult for the model. For instance, although "river bank" and "bank account" have different meanings, the unigram "bank" is the same feature in both.

    * Inability to Capture Long-Distance Dependencies: The model cannot capture dependencies between words that are far apart in a text, even though these can be important for understanding the overall meaning of complex sentences.

4. Scalability Concerns

    When using the Bag of N-Grams Model on large-scale text corpora, scalability is a major challenge. The limitations listed above become more noticeable as the dataset gets larger.

    * Resource Intensiveness: Processing big datasets with a lot of N-Gram features takes a lot of time and computing power. Scaling the model for large data applications is difficult as a result.

    * Model Maintenance: Models trained on large, changing datasets can be difficult to keep up to date. Retraining is often required when the underlying data distribution changes, and this requires a lot of resources.
  
    * Real-Time Processing: The processing cost of the Bag of N-Grams Model can make real-time text analysis difficult, which limits its use in time-sensitive applications like real-time sentiment analysis or spam detection.
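The dimensionality and sparsity limitations above can be illustrated with a small sketch. The first loop reproduces the upper-bound arithmetic for a 10,000-word vocabulary. The second loop uses a three-sentence toy corpus, made up for this example, to show how the number of distinct N-grams and the share of zero cells in the document-by-feature matrix change with N. The helper name `ngram_set` is hypothetical.

```python
def ngram_set(tokens, n):
    """Distinct n-grams (as tuples) in one tokenized document."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Upper bound: a vocabulary of V words allows at most V ** n distinct n-grams
V = 10_000
for n in (1, 2, 3):
    print(f"n={n}: at most {V ** n:,} possible n-grams")

# Toy corpus, made up for illustration
docs = [
    "the food was great and the service was great",
    "the food was cold and the service was slow",
    "great ambience but the food was average",
]
tokenized = [doc.split() for doc in docs]

stats = []  # (n, number of features, fraction of zero cells)
for n in (1, 2, 3):
    per_doc = [ngram_set(tokens, n) for tokens in tokenized]
    vocab = set().union(*per_doc)
    # A cell is non-zero when the document contains that n-gram
    nonzero = sum(len(grams) for grams in per_doc)
    sparsity = 1 - nonzero / (len(vocab) * len(docs))
    stats.append((n, len(vocab), sparsity))
    print(f"n={n}: {len(vocab)} features, {sparsity:.0%} of matrix cells are zero")
```

On real corpora the observed number of N-grams stays far below the V ** n bound, because it cannot exceed the number of tokens in the corpus, but it still grows quickly enough that sparse matrix formats are normally needed.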

## Sources
1. YouTube: Codebasics
2. Javatpoint
3. Restaurant Reviews Dataset: [Kaggle](https://www.kaggle.com/datasets/joebeachcapital/restaurant-reviews)