### Fake Review Detection using NLP and Classification Algorithms

I often find myself trusting online reviews too much and then feeling confused after buying the product or service. This has become more frequent than ever. This work uses data from [Kaggle](https://www.kaggle.com/datasets/mexwell/fake-reviews-dataset) to create classification models that could help me predict the authenticity of a review I find online.

#### Importing the necessary libraries

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

#### Reading the data
The data contains a 'text_' column, which holds the review text, and a 'label' column, which indicates whether the review is 'CG' (computer generated) or 'OR' (original).

In [16]:
data = pd.read_csv("data/fake_reviews_dataset.csv")
data.head()

Unnamed: 0,category,rating,label,text_
0,Home_and_Kitchen_5,5.0,CG,"Love this! Well made, sturdy, and very comfor..."
1,Home_and_Kitchen_5,5.0,CG,"love it, a great upgrade from the original. I..."
2,Home_and_Kitchen_5,5.0,CG,This pillow saved my back. I love the look and...
3,Home_and_Kitchen_5,1.0,CG,"Missing information on how to use it, but it i..."
4,Home_and_Kitchen_5,5.0,CG,Very nice set. Good quality. We have had the s...


In [17]:
data['label'].value_counts()

label
CG    20216
OR    20216
Name: count, dtype: int64

In [4]:
# Encode the target: 1 = computer generated ('CG'), 0 = original ('OR')
data['label'] = (data['label'] == 'CG').astype(int)
data.head()

Unnamed: 0,category,rating,label,text_
0,Home_and_Kitchen_5,5.0,1,"Love this! Well made, sturdy, and very comfor..."
1,Home_and_Kitchen_5,5.0,1,"love it, a great upgrade from the original. I..."
2,Home_and_Kitchen_5,5.0,1,This pillow saved my back. I love the look and...
3,Home_and_Kitchen_5,1.0,1,"Missing information on how to use it, but it i..."
4,Home_and_Kitchen_5,5.0,1,Very nice set. Good quality. We have had the s...


#### Preparing the features and labels 

In [5]:
X = data['text_']
y = data['label']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
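
The split above draws a different random 80/20 partition on every run, so the scores reported further down will shift slightly when the notebook is re-executed. A minimal sketch of a repeatable, class-balanced split is shown here, assuming a small invented frame (`toy`) in place of the real data; the seed value 42 is an arbitrary choice and `stratify` is an addition that the original cell does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented stand-in for the review data: 20 rows, perfectly balanced labels.
toy = pd.DataFrame({
    "text_": [f"review {i}" for i in range(20)],
    "label": [0, 1] * 10,
})

# random_state fixes the shuffle; stratify keeps the label ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text_"],
    toy["label"],
    test_size=0.2,
    random_state=42,
    stratify=toy["label"],
)

print(len(X_tr), len(X_te))
print(y_te.value_counts().to_dict())
```

Because the dataset here is already balanced (20216 reviews per class), stratification mostly guards against an unlucky draw rather than fixing a real imbalance.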

#### Vectorizing the text input

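Text must be converted to numeric features before a classifier can use it. TF-IDF (term frequency-inverse document frequency) is one way to do this meaningfully: each word is weighted by how often it appears in a review, discounted by how common it is across all reviews. The vectorizer below drops English stop words, keeps the 5000 most frequent terms, and is fitted on the training set only before being applied to the test set.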

In [7]:
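# Drop English stop words and keep the 5000 most frequent terms
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
# Learn the vocabulary and IDF weights from the training reviews only
X_train_vec = vectorizer.fit_transform(X_train)
# Reuse the fitted vocabulary on the test reviews
X_test_vec = vectorizer.transform(X_test)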

#### Fitting Models and Comparing Results

In [8]:
model_lr = LogisticRegression(random_state=42)
model_lr.fit(X_train_vec,y_train)
y_pred = model_lr.predict(X_test_vec)
report = classification_report(y_test,y_pred)
print(f"Classification Report - Logistic Regression: \n {report}")

Classification Report - Logistic Regression: 
               precision    recall  f1-score   support

           0       0.86      0.87      0.86      4005
           1       0.87      0.85      0.86      4082

    accuracy                           0.86      8087
   macro avg       0.86      0.86      0.86      8087
weighted avg       0.86      0.86      0.86      8087



In [9]:
model_rf = RandomForestClassifier(n_estimators=100, random_state=42)
model_rf.fit(X_train_vec,y_train)
y_pred = model_rf.predict(X_test_vec)
accuracy = accuracy_score(y_test,y_pred)
report = classification_report(y_test,y_pred)

print(f"Accuracy is {accuracy:.2f}")
print(f"Classification Report - RandomForest Classifier: \n {report}")

Accuracy is 0.84
Classification Report - RandomForest Classifier: 
               precision    recall  f1-score   support

           0       0.85      0.83      0.84      4005
           1       0.84      0.86      0.85      4082

    accuracy                           0.84      8087
   macro avg       0.84      0.84      0.84      8087
weighted avg       0.84      0.84      0.84      8087



In [11]:
model_nb = GaussianNB()
# GaussianNB does not accept sparse input, so the TF-IDF matrices are converted to dense arrays
model_nb.fit(X_train_vec.toarray(), y_train)
y_pred = model_nb.predict(X_test_vec.toarray())

report = classification_report(y_test,y_pred)


print(f"Classification Report - GaussianNB Classifier: \n {report}")

Classification Report - GaussianNB Classifier: 
               precision    recall  f1-score   support

           0       0.83      0.60      0.69      4005
           1       0.69      0.88      0.77      4082

    accuracy                           0.74      8087
   macro avg       0.76      0.74      0.73      8087
weighted avg       0.76      0.74      0.74      8087
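
The dense conversion above can use a lot of memory at this vocabulary size. As a possible alternative, not what this notebook ran, `MultinomialNB` from scikit-learn accepts the sparse TF-IDF matrix directly. The sketch below uses a few invented snippets and labels; whether it would score better or worse than GaussianNB on the real data was not measured here.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented snippets and labels (1 = computer generated, 0 = original).
toy_texts = [
    "love this product great quality",
    "arrived late and the box was dented",
    "love it great great product",
    "works fine after two months of use",
]
toy_labels = [1, 0, 1, 0]

toy_vec = TfidfVectorizer()
X_sparse = toy_vec.fit_transform(toy_texts)
print(issparse(X_sparse))  # the matrix stays sparse, no .toarray() call

model_mnb = MultinomialNB()
model_mnb.fit(X_sparse, toy_labels)

pred = model_mnb.predict(toy_vec.transform(["great product love it"]))
print(pred)
```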



In [12]:
model_ab = AdaBoostClassifier(n_estimators=100, random_state=42)
model_ab.fit(X_train_vec,y_train)
y_pred = model_ab.predict(X_test_vec)

report = classification_report(y_test,y_pred)

print(f"Classification Report - AdaBoost Classifier: \n {report}")

Classification Report - AdaBoost Classifier: 
               precision    recall  f1-score   support

           0       0.65      0.83      0.73      4005
           1       0.77      0.56      0.65      4082

    accuracy                           0.69      8087
   macro avg       0.71      0.69      0.69      8087
weighted avg       0.71      0.69      0.69      8087



In [14]:
model_gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_gb.fit(X_train_vec,y_train)
y_pred = model_gb.predict(X_test_vec)

report = classification_report(y_test,y_pred)


print(f"Classification Report - Gradient Boost Classifier: \n {report}")

Classification Report - Gradient Boost Classifier: 
               precision    recall  f1-score   support

           0       0.73      0.85      0.79      4005
           1       0.83      0.69      0.75      4082

    accuracy                           0.77      8087
   macro avg       0.78      0.77      0.77      8087
weighted avg       0.78      0.77      0.77      8087
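
Since each model above is fitted and reported in its own cell, comparing them means scrolling between reports. One way to collect the headline numbers in a single table is sketched below. `compare_models` is a hypothetical helper, not part of scikit-learn, and the corpus is invented so the block runs on its own; the scores it prints say nothing about the real dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def compare_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and return one row of test metrics per model."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        rows.append({
            "model": name,
            "accuracy": accuracy_score(y_test, y_pred),
            # F1 for the computer-generated class (label 1)
            "f1_cg": f1_score(y_test, y_pred, pos_label=1, zero_division=0),
        })
    return pd.DataFrame(rows).sort_values("accuracy", ascending=False, ignore_index=True)


# Invented corpus; labels follow the notebook's encoding (1 = CG, 0 = OR).
toy_texts = [
    "love this great product", "box arrived dented but it works",
    "great quality love it", "took three weeks to ship",
    "very nice great value", "the zipper broke on day two",
    "love the look great feel", "fits my old laptop fine",
    "great product very nice", "instructions were missing a page",
    "nice quality great price", "returned it after a month",
]
toy_labels = [1, 0] * 6

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=4, random_state=42, stratify=toy_labels
)
toy_vec = TfidfVectorizer()
X_tr_vec = toy_vec.fit_transform(X_tr)
X_te_vec = toy_vec.transform(X_te)

summary = compare_models(
    {
        "LogisticRegression": LogisticRegression(random_state=42),
        "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    },
    X_tr_vec, X_te_vec, y_tr, y_te,
)
print(summary)
```

In the notebook itself, the same helper could be called with `X_train_vec`, `X_test_vec`, `y_train` and `y_test` and a dictionary of the five models fitted above (passing dense arrays for GaussianNB).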

