# **Fake Review Detection using N-grams**

In [1]:
#Loading Required Libraries

import pandas as pd
import spacy as sp
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report

**Dataset Description:**
The dataset is a collection of reviews, meticulously curated to enable the detection of fake and authentic reviews. It contains a total of 40,432 reviews, evenly distributed between two primary categories:

- **Computer-Generated (CG):** These reviews are generated by computer algorithms, simulating the content typically found in fake or spammy reviews.

- **Original Reviews (OR):** This category comprises authentic human-created reviews, representing genuine user sentiments and experiences.

Each review is associated with a label indicating its category, and the text content provides the basis for classification and analysis. The dataset is valuable for training and evaluating machine learning models that aim to identify fake and authentic reviews, making it a useful resource for research and applications in natural language processing and online content moderation.

In [2]:
#Loading the Dataset

Dataset = pd.read_csv("/content/drive/MyDrive/Machine_Learning/NLP/Datasets/fake_reviews_dataset.csv")
Dataset.head()

Unnamed: 0,category,rating,label,text_
0,Home_and_Kitchen_5,5.0,CG,"Love this! Well made, sturdy, and very comfor..."
1,Home_and_Kitchen_5,5.0,CG,"love it, a great upgrade from the original. I..."
2,Home_and_Kitchen_5,5.0,CG,This pillow saved my back. I love the look and...
3,Home_and_Kitchen_5,1.0,CG,"Missing information on how to use it, but it i..."
4,Home_and_Kitchen_5,5.0,CG,Very nice set. Good quality. We have had the s...


In [3]:
Dataset.shape

(40432, 4)

In [4]:
Dataset.label.value_counts()  # The two classes are perfectly balanced, so there is no class imbalance in our data.

CG    20216
OR    20216
Name: label, dtype: int64

**Preprocessing**

In [5]:
Dataset['Label_Num'] = Dataset.label.map(
    {
        "CG" : 0,
        "OR" : 1
    }
)
print(Dataset.Label_Num.unique())
Dataset.head()

[0 1]


Unnamed: 0,category,rating,label,text_,Label_Num
0,Home_and_Kitchen_5,5.0,CG,"Love this! Well made, sturdy, and very comfor...",0
1,Home_and_Kitchen_5,5.0,CG,"love it, a great upgrade from the original. I...",0
2,Home_and_Kitchen_5,5.0,CG,This pillow saved my back. I love the look and...,0
3,Home_and_Kitchen_5,1.0,CG,"Missing information on how to use it, but it i...",0
4,Home_and_Kitchen_5,5.0,CG,Very nice set. Good quality. We have had the s...,0


In [6]:
Dataset.category.value_counts()

Kindle_Store_5                  4730
Books_5                         4370
Pet_Supplies_5                  4254
Home_and_Kitchen_5              4056
Electronics_5                   3988
Sports_and_Outdoors_5           3946
Tools_and_Home_Improvement_5    3858
Clothing_Shoes_and_Jewelry_5    3848
Toys_and_Games_5                3794
Movies_and_TV_5                 3588
Name: category, dtype: int64

In [7]:
#preprocess() removes stop words and punctuation, then reduces each remaining token to its base form by lemmatization.
nlp = sp.load("en_core_web_sm")

def preprocess(text):
  doc = nlp(text)
  not_a_stop_word = []

  for token in doc:
    if token.is_stop or token.is_punct:
      continue
    not_a_stop_word.append(token.lemma_)
  return " ".join(not_a_stop_word)

In [8]:
Dataset['Preprocessed_Text'] = Dataset.text_.apply(preprocess)
Dataset.head()

Unnamed: 0,category,rating,label,text_,Label_Num,Preprocessed_Text
0,Home_and_Kitchen_5,5.0,CG,"Love this! Well made, sturdy, and very comfor...",0,love sturdy comfortable love it!Very pretty
1,Home_and_Kitchen_5,5.0,CG,"love it, a great upgrade from the original. I...",0,love great upgrade original couple year
2,Home_and_Kitchen_5,5.0,CG,This pillow saved my back. I love the look and...,0,pillow save love look feel pillow
3,Home_and_Kitchen_5,1.0,CG,"Missing information on how to use it, but it i...",0,miss information use great product price
4,Home_and_Kitchen_5,5.0,CG,Very nice set. Good quality. We have had the s...,0,nice set good quality set month


**Modelling without Preprocessing the Text**

Using the sklearn Pipeline module, we build classification pipelines that label each review as computer-generated (CG) or original (OR).

We'll use the following classification algorithms in the pipelines:
* k-NN
* Naive Bayes
* Logistic Regression
* Random Forest


In [9]:
#Splitting the Dataset
X_train, X_test, y_train, y_test = train_test_split(Dataset.text_, Dataset.Label_Num, test_size = 0.2, stratify = Dataset.Label_Num)

In [10]:
print("Size of the Training Data - ", X_train.shape)
print("Size of the Testing Data - ", X_test.shape)

Size of the Training Data -  (32345,)
Size of the Testing Data -  (8087,)


In [11]:
print(y_train.value_counts())
print(y_test.value_counts())

0    16173
1    16172
Name: Label_Num, dtype: int64
1    4044
0    4043
Name: Label_Num, dtype: int64


**1.1) k-NN with n_neighbors of 10 and metric as 'euclidean' distance.**

**Using only Uni-gram**

In [12]:
from sklearn.neighbors import KNeighborsClassifier

#Pipeline
knn = Pipeline([
    ("bow(n_gram = (1, 1))", CountVectorizer(ngram_range = (1, 1))),# (1, 1) is actually the default one, it is known as the BoW - bag of words.
    ("knn", KNeighborsClassifier(n_neighbors = 10, metric = 'euclidean'))
])

#Fitting the model
knn.fit(X_train, y_train)

#Making Predictions
y_pred = knn.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.58      0.93      0.71      4043
           1       0.82      0.32      0.46      4044

    accuracy                           0.62      8087
   macro avg       0.70      0.62      0.58      8087
weighted avg       0.70      0.62      0.58      8087

Accuracy Score - 62.37%
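
The per-class figures in the report above are derived from counts of correct and incorrect predictions. As a minimal sketch, using hypothetical counts rather than this notebook's actual confusion matrix, precision, recall and F1 for one class could be computed like this (the helper name `precision_recall_f1` is an assumption for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for one class from raw prediction counts."""
    # Precision: share of predictions for this class that were correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of true members of this class that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Read this way, the report above says that reviews predicted as class 1 (OR) are usually correct (precision 0.82), but most OR reviews are missed (recall 0.32).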


**Using Uni-gram, Bi-gram**

In [13]:
#Pipeline
knn = Pipeline([
    ("bow(n_gram = (1, 2))", CountVectorizer(ngram_range = (1, 2))),# (1, 2) means it uses both uni-grams and bi-grams for generating vectors.
    ("knn", KNeighborsClassifier(n_neighbors = 10, metric = 'euclidean'))
])

#Fitting the model
knn.fit(X_train, y_train)

#Making Predictions
y_pred = knn.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.60      0.88      0.72      4043
           1       0.78      0.41      0.54      4044

    accuracy                           0.65      8087
   macro avg       0.69      0.65      0.63      8087
weighted avg       0.69      0.65      0.63      8087

Accuracy Score - 64.77%


**Using Uni-gram, Bi-gram, Tri-gram**

In [14]:
#Pipeline
knn = Pipeline([
    ("bow(n_gram = (1, 3))", CountVectorizer(ngram_range = (1, 3))),# (1, 3) means it uses uni-grams, bi-grams and tri-grams for generating vectors.
    ("knn", KNeighborsClassifier(n_neighbors = 10, metric = 'euclidean'))
])

#Fitting the model
knn.fit(X_train, y_train)

#Making Predictions
y_pred = knn.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.64      0.82      0.72      4043
           1       0.75      0.53      0.62      4044

    accuracy                           0.68      8087
   macro avg       0.69      0.68      0.67      8087
weighted avg       0.69      0.68      0.67      8087

Accuracy Score - 67.66%


### **Observations:**

*With ngram_range (1, 1), (1, 2) and (1, 3), the k-NN model (n_neighbors = 10, metric = 'euclidean') improves a little at each step: accuracy rises from 62.37% to 64.77% to 67.66%, and recall for class 1 from 0.32 to 0.41 to 0.53. The gain is not uniform, though: recall for class 0 falls from 0.93 to 0.82 and precision for class 1 from 0.82 to 0.75. This trend holds for this dataset and is not guaranteed for others. Machine learning is a trial-and-error process, so each range has to be tested on the data at hand.*
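
To make the `ngram_range` settings concrete, here is a simplified sketch of the n-grams each range produces. It is a plain-Python approximation, not `CountVectorizer`'s implementation; the helper name `word_ngrams` is an assumption for illustration. Like the vectorizer's defaults, it lowercases the text and keeps only tokens of two or more word characters.

```python
import re


def word_ngrams(text, ngram_range):
    """Return the word n-grams of `text` for every n in the inclusive range."""
    min_n, max_n = ngram_range
    # Approximates CountVectorizer's defaults: lowercase, tokens of 2+ word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


sample = "This pillow saved my back"
for ngram_range in [(1, 1), (1, 2), (3, 3)]:
    print(ngram_range, word_ngrams(sample, ngram_range))
```

Each wider range adds new features on top of the uni-grams, so the vocabulary, and with it the dimensionality of the vectors, grows quickly.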

**Using only Tri-grams**

In [15]:
#Pipeline
knn = Pipeline([
    ("bow(n_gram = (3, 3))", CountVectorizer(ngram_range = (3, 3))), #(3, 3) means it uses only the tri-gram.
    ("knn", KNeighborsClassifier(n_neighbors = 10, metric = 'euclidean'))
])

#Fitting the model
knn.fit(X_train, y_train)

#Making Predictions
y_pred = knn.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.97      0.50      0.66      4043
           1       0.66      0.98      0.79      4044

    accuracy                           0.74      8087
   macro avg       0.82      0.74      0.73      8087
weighted avg       0.82      0.74      0.73      8087

Accuracy Score - 74.29%


### **Observations:**

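*Using only tri-grams gives the best k-NN result so far (n_neighbors = 10, metric = 'euclidean'): accuracy reaches 74.29%, up from 67.66% with the (1, 3) range. The errors shift, however: recall is 0.98 for class 1 but only 0.50 for class 0.*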

**1.2) k-NN with n_neighbors of 10 and metric as 'cosine' distance.**

**Using Uni-gram, Bi-gram, Tri-gram**

In [16]:
#Pipeline
knn = Pipeline([
    ("bow(n_grams = (1, 3))", CountVectorizer(ngram_range = (1, 3))),
    ("knn_cosine", KNeighborsClassifier(n_neighbors = 10, metric = 'cosine'))
])

#Fitting the model
knn.fit(X_train, y_train)

#Making Predictions
y_pred = knn.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.57      0.98      0.72      4043
           1       0.94      0.26      0.40      4044

    accuracy                           0.62      8087
   macro avg       0.75      0.62      0.56      8087
weighted avg       0.75      0.62      0.56      8087

Accuracy Score - 61.98%


**Using only Tri-gram**

In [17]:
#Pipeline
knn = Pipeline([
    ("bow(n_grams = (3, 3))", CountVectorizer(ngram_range = (3, 3))),
    ("knn_cosine", KNeighborsClassifier(n_neighbors = 10, metric = 'cosine'))
])

#Fitting the model
knn.fit(X_train, y_train)

#Making Predictions
y_pred = knn.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.58      0.98      0.73      4043
           1       0.94      0.27      0.42      4044

    accuracy                           0.63      8087
   macro avg       0.76      0.63      0.58      8087
weighted avg       0.76      0.63      0.58      8087

Accuracy Score - 62.84%


### **Observations:**

* *With uni-, bi- and tri-grams, the k-NN model (n_neighbors = 10, metric = 'cosine') reaches only 61.98% accuracy. Precision for class 1 is high (0.94), but its recall is very low (0.26), so most original reviews are labelled as computer-generated.*
* *With only tri-grams, the same model is still weak, although accuracy is about one percentage point higher (62.84%) than with ngram_range = (1, 3).*
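
The two metrics respond differently to review length. The following is a minimal sketch, with hypothetical count vectors that are not taken from the dataset, of how Euclidean distance grows with raw counts while cosine distance depends only on the direction of the vectors:

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine similarity of two non-zero count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical bag-of-words counts over a 3-term vocabulary.
short_review = [1, 2, 0]
long_review = [3, 6, 0]    # same word proportions, three times longer
other_review = [0, 1, 2]   # different wording, same length as short_review

print("euclidean short/long :", euclidean_distance(short_review, long_review))
print("euclidean short/other:", euclidean_distance(short_review, other_review))
print("cosine short/long    :", cosine_distance(short_review, long_review))
print("cosine short/other   :", cosine_distance(short_review, other_review))
```

Under Euclidean distance the short review is nearer to the differently worded one, while under cosine distance it is nearer to the longer review with the same word proportions. Which behaviour helps depends on the data; in the runs above, the cosine metric did not improve accuracy over the Euclidean one.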

**2. Multinomial Naive Bayes**

**Using Uni-gram, Bi-gram, Tri-gram**

In [18]:
from sklearn.naive_bayes import MultinomialNB

#Pipeline
mnb = Pipeline([
    ("bow(n_grams = (1, 3))", CountVectorizer(ngram_range = (1, 3))),
    ("mnb", MultinomialNB())
])

#Fitting the model
mnb.fit(X_train, y_train)

#Making Predictions
y_pred = mnb.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.81      0.97      0.88      4043
           1       0.97      0.77      0.86      4044

    accuracy                           0.87      8087
   macro avg       0.89      0.87      0.87      8087
weighted avg       0.89      0.87      0.87      8087

Accuracy Score - 87.24%


**Using only Tri-gram**

In [19]:
from sklearn.naive_bayes import MultinomialNB

#Pipeline
mnb = Pipeline([
    ("bow(n_grams = (3, 3))", CountVectorizer(ngram_range = (3, 3))),
    ("mnb", MultinomialNB())
])

#Fitting the model
mnb.fit(X_train, y_train)

#Making Predictions
y_pred = mnb.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.66      0.98      0.79      4043
           1       0.97      0.49      0.65      4044

    accuracy                           0.74      8087
   macro avg       0.81      0.74      0.72      8087
weighted avg       0.81      0.74      0.72      8087

Accuracy Score - 73.72%


### **Observations:**

* *For Multinomial Naive Bayes, the (1, 3) range gives clearly better results than tri-grams alone: 87.24% versus 73.72% accuracy, mainly because recall for class 1 is 0.77 instead of 0.49.*

**3. Logistic Regression**

**Using Uni-gram, Bi-gram, Tri-gram**

In [20]:
from sklearn.linear_model import LogisticRegression

#Pipeline
lgr = Pipeline([
        ("bow(n_grams = (1, 3))", CountVectorizer(ngram_range = (1, 3))),
        ("lgr", LogisticRegression(max_iter=1000))
])

#Fitting the model
lgr.fit(X_train, y_train)

#Making Predictions
y_pred = lgr.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.95      0.93      0.94      4043
           1       0.94      0.95      0.94      4044

    accuracy                           0.94      8087
   macro avg       0.94      0.94      0.94      8087
weighted avg       0.94      0.94      0.94      8087

Accuracy Score - 94.20%


**Using only Tri-gram**

In [21]:
#Pipeline
lgr = Pipeline([
        ("bow(n_grams = (3, 3))", CountVectorizer(ngram_range = (3, 3))),
        ("lgr", LogisticRegression(max_iter=1000))
])

#Fitting the model
lgr.fit(X_train, y_train)

#Making Predictions
y_pred = lgr.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.94      0.89      0.91      4043
           1       0.89      0.94      0.92      4044

    accuracy                           0.91      8087
   macro avg       0.92      0.91      0.91      8087
weighted avg       0.92      0.91      0.91      8087

Accuracy Score - 91.42%


### **Observations:**

* *Logistic Regression gives the best results so far with both settings: 94.20% accuracy with uni-, bi- and tri-grams and 91.42% with tri-grams only.*

**4. Random Forest**

**Using Uni-gram, Bi-gram, Tri-gram**

In [22]:
from sklearn.ensemble import RandomForestClassifier

#Pipeline
RF = Pipeline([
    ("bow(n_grams = (1, 3))", CountVectorizer(ngram_range = (1, 3))),
    ("RF", RandomForestClassifier())
])

#Fitting the model
RF.fit(X_train, y_train)

#Making Predictions
y_pred = RF.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.94      0.89      0.91      4043
           1       0.89      0.94      0.92      4044

    accuracy                           0.92      8087
   macro avg       0.92      0.92      0.91      8087
weighted avg       0.92      0.92      0.91      8087

Accuracy Score - 91.50%


**Using only Tri-gram**

In [23]:
#Pipeline
RF = Pipeline([
    ("bow(n_grams = (3, 3))", CountVectorizer(ngram_range = (3, 3))),
    ("RF", RandomForestClassifier())
])

#Fitting the model
RF.fit(X_train, y_train)

#Making Predictions
y_pred = RF.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.72      0.93      0.81      4043
           1       0.90      0.64      0.75      4044

    accuracy                           0.79      8087
   macro avg       0.81      0.79      0.78      8087
weighted avg       0.81      0.79      0.78      8087

Accuracy Score - 78.56%


### **Observations:**

*For Random Forest, the (1, 3) range gives the better precision and recall: 91.50% accuracy versus 78.56% with tri-grams only.*

**Modelling with Preprocessed Text**

In [24]:
#Splitting the Dataset
X_train, X_test, y_train, y_test = train_test_split(Dataset.Preprocessed_Text, Dataset.Label_Num, test_size = 0.2, stratify = Dataset.Label_Num)

**1. k-NN**

**Using Uni-gram, Bi-gram, Tri-gram**

In [25]:
#Pipeline
knn = Pipeline([
    ("bow(n_gram = (1, 3))", CountVectorizer(ngram_range = (1, 3))),# (1, 3) means it uses uni-grams, bi-grams and tri-grams for generating vectors.
    ("knn", KNeighborsClassifier(n_neighbors = 10, metric = 'euclidean'))
])

#Fitting the model
knn.fit(X_train, y_train)

#Making Predictions
y_pred = knn.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.52      0.93      0.67      4044
           1       0.68      0.14      0.24      4043

    accuracy                           0.54      8087
   macro avg       0.60      0.54      0.45      8087
weighted avg       0.60      0.54      0.45      8087

Accuracy Score - 53.77%


**Using only Tri-gram**

In [26]:
#Pipeline
knn = Pipeline([
    ("bow(n_gram = (3, 3))", CountVectorizer(ngram_range = (3, 3))), #(3, 3) means it uses only the tri-gram.
    ("knn", KNeighborsClassifier(n_neighbors = 10, metric = 'euclidean'))
])

#Fitting the model
knn.fit(X_train, y_train)

#Making Predictions
y_pred = knn.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.76      0.47      0.58      4044
           1       0.62      0.85      0.72      4043

    accuracy                           0.66      8087
   macro avg       0.69      0.66      0.65      8087
weighted avg       0.69      0.66      0.65      8087

Accuracy Score - 66.18%


### **Observations:**

*With preprocessed text, the k-NN model performs poorly for both settings: 53.77% accuracy with uni-, bi- and tri-grams and 66.18% with tri-grams only.*

**2. Multinomial Naive Bayes**

**Using Uni-gram, Bi-gram, Tri-gram**

In [27]:
#Pipeline
mnb = Pipeline([
    ("bow(n_grams = (1, 3))", CountVectorizer(ngram_range = (1, 3))),
    ("mnb", MultinomialNB())
])

#Fitting the model
mnb.fit(X_train, y_train)

#Making Predictions
y_pred = mnb.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.79      0.94      0.86      4044
           1       0.92      0.75      0.83      4043

    accuracy                           0.85      8087
   macro avg       0.86      0.85      0.84      8087
weighted avg       0.86      0.85      0.84      8087

Accuracy Score - 84.51%


**Using Only Tri-gram**

In [28]:
#Pipeline
mnb = Pipeline([
    ("bow(n_grams = (3, 3))", CountVectorizer(ngram_range = (3, 3))),
    ("mnb", MultinomialNB())
])

#Fitting the model
mnb.fit(X_train, y_train)

#Making Predictions
y_pred = mnb.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.71      0.87      0.78      4044
           1       0.83      0.65      0.73      4043

    accuracy                           0.76      8087
   macro avg       0.77      0.76      0.75      8087
weighted avg       0.77      0.76      0.75      8087

Accuracy Score - 75.57%


### **Observations:**

*For Multinomial Naive Bayes on preprocessed text, the (1, 3) range again gives the better results: 84.51% accuracy versus 75.57% with tri-grams only.*

**3. Logistic Regression**

**Using Uni-gram, Bi-gram, Tri-gram**

In [29]:
#Pipeline
lgr = Pipeline([
        ("bow(n_grams = (1, 3))", CountVectorizer(ngram_range = (1, 3))),
        ("lgr", LogisticRegression(max_iter=1000))
])

#Fitting the model
lgr.fit(X_train, y_train)

#Making Predictions
y_pred = lgr.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.90      0.88      0.89      4044
           1       0.88      0.90      0.89      4043

    accuracy                           0.89      8087
   macro avg       0.89      0.89      0.89      8087
weighted avg       0.89      0.89      0.89      8087

Accuracy Score - 88.69%


**Using only Tri-gram**

In [30]:
#Pipeline
lgr = Pipeline([
        ("bow(n_grams = (3, 3))", CountVectorizer(ngram_range = (3, 3))),
        ("lgr", LogisticRegression(max_iter=1000))
])

#Fitting the model
lgr.fit(X_train, y_train)

#Making Predictions
y_pred = lgr.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.90      0.71      0.79      4044
           1       0.76      0.93      0.83      4043

    accuracy                           0.82      8087
   macro avg       0.83      0.82      0.81      8087
weighted avg       0.83      0.82      0.81      8087

Accuracy Score - 81.56%


### **Observations:**

*On preprocessed text, Logistic Regression gives the best results among the four models for both settings: 88.69% accuracy with uni-, bi- and tri-grams and 81.56% with tri-grams only.*

**4. Random Forest**

**Using Uni-gram, Bi-gram, Tri-gram**

In [31]:
#Pipeline
RF = Pipeline([
    ("bow(n_grams = (1, 3))", CountVectorizer(ngram_range = (1, 3))),
    ("RF", RandomForestClassifier())
])

#Fitting the model
RF.fit(X_train, y_train)

#Making Predictions
y_pred = RF.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.85      0.88      0.86      4044
           1       0.87      0.85      0.86      4043

    accuracy                           0.86      8087
   macro avg       0.86      0.86      0.86      8087
weighted avg       0.86      0.86      0.86      8087

Accuracy Score - 86.09%


**Using only Tri-gram**

In [32]:
#Pipeline
RF = Pipeline([
    ("bow(n_grams = (3, 3))", CountVectorizer(ngram_range = (3, 3))),
    ("RF", RandomForestClassifier())
])

#Fitting the model
RF.fit(X_train, y_train)

#Making Predictions
y_pred = RF.predict(X_test)

#Model Evaluation
print(classification_report(y_test, y_pred))
print("Accuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.92      0.69      0.79      4044
           1       0.75      0.94      0.83      4043

    accuracy                           0.81      8087
   macro avg       0.83      0.81      0.81      8087
weighted avg       0.83      0.81      0.81      8087

Accuracy Score - 81.24%


### **Observations:**

*Random Forest with uni-, bi- and tri-grams (86.09% accuracy) does better than with tri-grams only (81.24%), although the tri-gram model also performs reasonably well.*

**Conclusion:**

* Our k-NN model gave much lower scores than the other models, although its performance improved slightly as the n-gram range widened.
The likely reason is dimensionality: converting the text into bag-of-n-grams vectors produces very high-dimensional, sparse vectors, and k-NN tends to perform poorly on high-dimensional data.

* Among all the models, **Logistic Regression** gave the best results (94.20% accuracy on the raw text). Logistic regression is well suited to binary classification.

* The raw text gave better results than the preprocessed text. This does not mean that stop words should never be removed: sometimes removing them improves the results and sometimes it does not, because stop words can play a crucial role in a particular text.

  Ex:
   * "It is not a good movie",

   * "It is a good movie".

   After removing stop words, both sentences reduce to the same phrase, "good movie", even though their meanings are opposite.

   In cases like this, stop words play a crucial role.

* Testing several n-gram ranges shows which one gives better results. No single range always performs best; each range suits a particular type of text, and we cannot know which until we test. So we have to find out which range is most suitable for our model.
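
The stop-word example in the conclusion above can be checked with a few lines. This sketch uses a small hypothetical stop-word list rather than spaCy's full `STOP_WORDS` set, and it skips lemmatization:

```python
# Hypothetical, hand-picked stop-word list; spaCy's STOP_WORDS set is much larger.
TOY_STOP_WORDS = {"it", "is", "a", "not"}


def remove_stop_words(sentence):
    """Drop every token found in TOY_STOP_WORDS and re-join the rest."""
    kept = [word for word in sentence.lower().split() if word not in TOY_STOP_WORDS]
    return " ".join(kept)


negative = remove_stop_words("It is not a good movie")
positive = remove_stop_words("It is a good movie")
print(negative)
print(positive)
print("identical after stop-word removal:", negative == positive)
```

Once the negation is dropped, a bag-of-words model receives exactly the same input for both sentences and cannot tell them apart.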


**Machine Learning is like a trial and error scientific method, where we keep trying all the possible algorithms we have and select the one which gives good results and satisfies the requirements like latency, interpretability, etc.**
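
One way to organize this trial-and-error search is to loop over n-gram ranges and classifiers instead of writing one cell per combination. The following is a minimal sketch, not the notebook's code: the helper `compare_models`, the step names, and the toy reviews are hypothetical and only illustrate the call.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def compare_models(X_train, y_train, X_test, y_test, ngram_ranges, classifiers):
    """Fit one pipeline per (classifier, n-gram range) pair and return test accuracies."""
    scores = {}
    for ngram_range in ngram_ranges:
        for name, make_classifier in classifiers.items():
            pipeline = Pipeline([
                ("bow", CountVectorizer(ngram_range=ngram_range)),
                ("clf", make_classifier()),  # fresh estimator for every run
            ])
            pipeline.fit(X_train, y_train)
            # Pipeline.score returns the accuracy of the final classifier.
            scores[(name, ngram_range)] = pipeline.score(X_test, y_test)
    return scores


# Hypothetical toy data, only to show the call; not taken from the dataset.
toy_train = [
    "great product love it", "love it great product great",
    "great great love love product", "product great love it it",
    "the zipper broke after two weeks", "battery lasted about six hours",
    "arrived late but works fine", "my dog chewed through it quickly",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_test = ["love great product", "the strap broke after a month"]
toy_test_labels = [0, 1]

classifiers = {
    "mnb": MultinomialNB,
    "lgr": lambda: LogisticRegression(max_iter=1000),
}
scores = compare_models(toy_train, toy_labels, toy_test, toy_test_labels,
                        ngram_ranges=[(1, 3), (3, 3)], classifiers=classifiers)
for (name, ngram_range), accuracy in sorted(scores.items()):
    print(f"{name} {ngram_range}: {accuracy:.2f}")
```

With the notebook's data, `X_train`, `y_train`, `X_test` and `y_test` from `train_test_split` would replace the toy lists, and the k-NN and Random Forest classifiers could be added to the `classifiers` dictionary in the same way.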