# Review Credibility Prediction – Training Notebook

This notebook trains **Logistic Regression** and **Linear SVM** models using **TF-IDF** for fake/credible review classification.

Key goals:
- No data leakage: the TF-IDF vocabulary is fitted on the training split only
- No overfitting or underfitting
- Target accuracy: **80–90%**

Dataset: https://www.kaggle.com/datasets/mexwell/fake-reviews-dataset

In [10]:

# ================================
# 1. Import Required Libraries
# ================================
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

import joblib


In [11]:

# ================================
# 2. Load Dataset
# ================================
# Dataset path (DO NOT change folder structure)
df = pd.read_csv("../data/fake_reviews.csv")

# Keep only required columns (copy so the assignment below does not hit a view)
df = df[['text_', 'label']].copy()

# Encode labels: CG (computer-generated, fake) = 0, OR (original, credible) = 1
df['label'] = df['label'].map({'CG': 0, 'OR': 1})

df.head()


Unnamed: 0,text_,label
0,"Love this! Well made, sturdy, and very comfor...",0
1,"love it, a great upgrade from the original. I...",0
2,This pillow saved my back. I love the look and...,0
3,"Missing information on how to use it, but it i...",0
4,Very nice set. Good quality. We have had the s...,0
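
`Series.map` with a dict returns `NaN` for any value that is not a key, so a stray label would silently become a missing target. The sketch below shows one way to check for that. The toy frame and its deliberately malformed `"cg"` value are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real CSV (made-up rows).
toy = pd.DataFrame({
    "text_": ["great product", "works fine", "bad item", "ok"],
    "label": ["CG", "OR", "OR", "cg"],  # last value is deliberately malformed
})

mapping = {"CG": 0, "OR": 1}
encoded = toy["label"].map(mapping)

# Values missing from the dict come back as NaN, so count them.
n_unmapped = int(encoded.isna().sum())
print("Unmapped labels:", n_unmapped)

# Keep only rows that mapped cleanly, then store the label as an integer.
clean = toy.loc[encoded.notna()].assign(label=encoded.dropna().astype(int))
print(clean["label"].value_counts().to_dict())
```

In the notebook the same check would be `df['label'].isna().sum()` right after the mapping step.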


In [12]:

# ================================
# 3. Train-Test Split (Prevent Data Leakage)
# ================================
X = df['text_']
y = df['label']

# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

len(X_train), len(X_test)


(32345, 8087)
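
The `stratify=y` argument is what keeps the class ratio the same in both splits. A minimal sketch on synthetic labels, assuming a perfectly balanced toy target (the arrays `X_demo` and `y_demo` are placeholders, not the review data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, balanced labels: 500 of each class (made-up data).
y_demo = np.array([0] * 500 + [1] * 500)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification the share of class 1 should match across splits.
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print("train share of class 1:", train_pos_rate)
print("test share of class 1:", test_pos_rate)
```

On the real data the equivalent check is `y_train.mean()` against `y_test.mean()`.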

In [13]:

# ================================
# 4. TF-IDF Feature Extraction
# (Fit ONLY on training data)
# ================================
tfidf = TfidfVectorizer(
    max_features=10000,
    min_df=5,
    max_df=0.9,
    stop_words='english'
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

X_train_tfidf.shape


(32345, 10000)
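
Fitting the vectorizer on `X_train` only is the leakage guard here. One way to make that guard structural, rather than something to remember, is to wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that cross-validation refits TF-IDF inside every fold. This is a sketch of an alternative, not what the notebook does. The corpus is made up, and the notebook's `min_df`, `max_df`, and `stop_words` settings are dropped because the toy corpus is too small for them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: 4 examples per class.
texts = [
    "great quality and fast shipping",
    "love it works perfectly",
    "solid build would buy again",
    "arrived on time and fits well",
    "amazing amazing amazing product",
    "best purchase best value best ever",
    "perfect perfect perfect item",
    "wonderful wonderful product love love",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline fits TF-IDF on the training part of each fold only.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, texts, labels, cv=2)
print("fold accuracies:", scores)
```

A fitted pipeline can also be saved as a single object, which avoids shipping the vectorizer and model separately.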

In [14]:

# ================================
# 5. Train Logistic Regression
# ================================
logistic_model = LogisticRegression(max_iter=1000)

logistic_model.fit(X_train_tfidf, y_train)

# Predictions
y_pred_log = logistic_model.predict(X_test_tfidf)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))


Logistic Regression Accuracy: 0.8653394336589588
              precision    recall  f1-score   support

           0       0.88      0.85      0.86      4044
           1       0.86      0.88      0.87      4043

    accuracy                           0.87      8087
   macro avg       0.87      0.87      0.87      8087
weighted avg       0.87      0.87      0.87      8087



In [15]:
# Report Logistic Regression accuracy as a percentage
# (accuracy_score is already imported in section 1)
log_accuracy = accuracy_score(y_test, y_pred_log)
print("Logistic Regression Accuracy:", round(log_accuracy * 100, 2), "%")


Logistic Regression Accuracy: 86.53 %
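
Test accuracy alone cannot show overfitting; that needs a comparison with training accuracy. A minimal sketch of such a check follows. The helper name `accuracy_gap` and the toy texts are assumptions for illustration, and what counts as an acceptable gap is a judgment call the notebook does not specify.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def accuracy_gap(model, X_tr, y_tr, X_te, y_te):
    """Return (train accuracy, test accuracy, train minus test)."""
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    return train_acc, test_acc, train_acc - test_acc


# Made-up data, only to make the sketch runnable.
train_texts = [
    "great quality and fast shipping",
    "love it works perfectly",
    "solid build would buy again",
    "amazing amazing amazing product",
    "best purchase best value best ever",
    "perfect perfect perfect item",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["fast shipping and solid quality", "amazing amazing best ever"]
test_labels = [1, 0]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)  # fit on training texts only
X_te = vec.transform(test_texts)

clf = LogisticRegression(max_iter=1000).fit(X_tr, train_labels)
train_acc, test_acc, gap = accuracy_gap(clf, X_tr, train_labels, X_te, test_labels)
print(f"train={train_acc:.2f} test={test_acc:.2f} gap={gap:.2f}")
```

In the notebook the call would be `accuracy_gap(logistic_model, X_train_tfidf, y_train, X_test_tfidf, y_test)`. A large positive gap points to overfitting, while low accuracy on both splits points to underfitting.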


In [16]:

# ================================
# 6. Train Linear SVM
# ================================
svm_model = LinearSVC()

svm_model.fit(X_train_tfidf, y_train)

# Predictions
y_pred_svm = svm_model.predict(X_test_tfidf)

print("Linear SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))


Linear SVM Accuracy: 0.8705329541239025
              precision    recall  f1-score   support

           0       0.87      0.88      0.87      4044
           1       0.88      0.86      0.87      4043

    accuracy                           0.87      8087
   macro avg       0.87      0.87      0.87      8087
weighted avg       0.87      0.87      0.87      8087





In [17]:
svm_accuracy = accuracy_score(y_test, y_pred_svm)
print("Linear SVM Accuracy:", round(svm_accuracy * 100, 2), "%")


Linear SVM Accuracy: 87.05 %
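
The two reports show almost the same accuracy but different per-class recall (0.85 for class 0 with Logistic Regression, 0.88 with Linear SVM). A confusion matrix makes that trade-off explicit. The arrays below are hand-written placeholders, not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hand-written placeholder labels and predictions.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()  # treating class 1 as the positive class

print(cm)
print("class 0 recall:", tn / (tn + fp))
print("class 1 recall:", tp / (tp + fn))
```

In the notebook, pass `y_test` with `y_pred_log` or `y_pred_svm` instead of the placeholder lists.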


In [18]:

# ================================
# 7. Save Models and Vectorizer
# ================================
joblib.dump(tfidf, "../models/tfidf.pkl")
joblib.dump(logistic_model, "../models/logistic_model.pkl")
joblib.dump(svm_model, "../models/svm_model.pkl")

print("Models and TF-IDF vectorizer saved successfully.")


Models and TF-IDF vectorizer saved successfully.
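
At inference time the saved vectorizer and model have to be loaded together, and new text must go through `transform`, never `fit_transform`. A minimal round-trip sketch, assuming a temporary directory in place of `../models/` and a hypothetical helper `predict_review`. The training texts are made up.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up training data, only to produce something to save.
texts = [
    "great quality and fast shipping",
    "love it works perfectly",
    "solid build would buy again",
    "amazing amazing amazing product",
    "best purchase best value best ever",
    "perfect perfect perfect item",
]
labels = [1, 1, 1, 0, 0, 0]

vec = TfidfVectorizer().fit(texts)
clf = LinearSVC().fit(vec.transform(texts), labels)

# Save and reload through a temporary directory (stand-in for ../models/).
with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, "tfidf.pkl")
    clf_path = os.path.join(tmp, "svm_model.pkl")
    joblib.dump(vec, vec_path)
    joblib.dump(clf, clf_path)
    loaded_vec = joblib.load(vec_path)
    loaded_clf = joblib.load(clf_path)


def predict_review(text):
    """Return the predicted label (0 or 1) for a single review string."""
    features = loaded_vec.transform([text])  # transform only, no refitting
    return int(loaded_clf.predict(features)[0])


print(predict_review("fast shipping and solid quality"))
```

Pickled scikit-learn objects are best loaded with the same library version that saved them.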



## ✅ Training Completed

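- Both models trained on the same TF-IDF features and the same stratified 80/20 split
- No data leakage: the vectorizer was fitted on the training split only
- Regularization applied through scikit-learn's defaults (neither model sets `C` explicitly)
- Test accuracy: 86.53% (Logistic Regression) and 87.05% (Linear SVM), inside the 80–90% target
- Models saved for Gradio deployment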
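
Neither `LogisticRegression(max_iter=1000)` nor `LinearSVC()` passes a regularization argument, so the regularization comes from scikit-learn's default settings. The sketch below reads those defaults back and shows how the strength could be set explicitly. The value `C=0.1` is an arbitrary example, not a tuned setting.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Read back the parameters the notebook's constructor calls end up with.
log_params = LogisticRegression(max_iter=1000).get_params()
svm_params = LinearSVC().get_params()

print("LogisticRegression C:", log_params["C"])
print("LinearSVC C:", svm_params["C"], "| penalty:", svm_params["penalty"])

# C is the inverse of regularization strength: smaller C regularizes more.
stronger_log = LogisticRegression(C=0.1, max_iter=1000)
stronger_svm = LinearSVC(C=0.1)
```

If the train/test accuracy gap turns out to be large, lowering `C` is one of the first knobs to try.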
