This notebook evaluates a range of machine learning models for duplicate question detection on the Quora dataset using both traditional and modern feature representations.
Models, including logistic regression, random forest, XGBoost, and a soft-voting ensemble, are trained on TF-IDF features, BERT-based similarity features, and their combinations.
Each model’s performance is assessed using F1-score, log loss, and a confusion matrix so that results are directly comparable. Feature importance analyses are performed to interpret the contribution of different features, specifically to compare the predictive value of classic (TF-IDF) and semantic (BERT) features.
The results guide the choice of model and feature set for duplicate detection.

# Imports

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack
import joblib

import sys
sys.path.append('..')

from src.features.similarity import TextEmbedder, TextSimilarity
from src.models.trainers import ClassicMLTrainer 

# Load processed features

In [3]:
# Load saved dataframes with precomputed features
df_train = pd.read_csv("../data/processed/quora_train_with_bert_sim.csv")
df_test = pd.read_csv("../data/processed/quora_test_with_bert_sim.csv")

df_train.dropna(inplace=True)
df_test.dropna(inplace=True)
df_train.shape, df_test.shape


((322840, 9), (80843, 9))

In [None]:
tfidf_vectorizer = joblib.load("../src/models/tfidf_vectorizer.joblib")

# Transform cleaned text columns to TF-IDF feature matrices
X_train_tfidf = tfidf_vectorizer.transform(df_train["cleaned_text"])
X_test_tfidf = tfidf_vectorizer.transform(df_test["cleaned_text"])
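
The vectorizer loaded above was fitted in an earlier step that this notebook does not show. As a rough illustration of the usual pattern (fit on training text only, then transform both splits), here is a minimal sketch on toy strings. The `max_features=5` value and the variable names are assumptions made for this example; the real settings live in the saved `tfidf_vectorizer.joblib`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the "cleaned_text" column (illustrative only)
train_texts = ["how do i learn python", "what is the best way to learn python"]
test_texts = ["how can i learn java"]

# max_features=5 is an arbitrary value chosen for this sketch
toy_vectorizer = TfidfVectorizer(max_features=5)

# Fit on training text only, then reuse the fitted vocabulary for the test text
X_toy_train = toy_vectorizer.fit_transform(train_texts)
X_toy_test = toy_vectorizer.transform(test_texts)

print(X_toy_train.shape, X_toy_test.shape)
```

Both matrices end up with the same number of columns because the test split is transformed with the vocabulary learned from the training split.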

# Prepare BERT cosine similarity feature 

In [5]:
# Select with double brackets so each result stays a dense 2-D array, as hstack requires (a 1-D array would not work)
X_train_cos = df_train[['bert_cosine_similarity']].values  # shape: (n_samples, 1)
X_test_cos = df_test[['bert_cosine_similarity']].values    # shape: (n_samples, 1)
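
The `bert_cosine_similarity` column is precomputed, and how it was produced is not shown in this notebook. For reference, the standard definition of cosine similarity between two embedding vectors can be sketched as follows. The helper name and the toy 3-dimensional vectors are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two 1-D embedding vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # Guard against zero vectors, which have no defined direction
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

# Toy "embeddings"; real sentence embeddings have hundreds of dimensions
q1 = np.array([1.0, 2.0, 0.0])
q2 = np.array([2.0, 4.0, 0.0])   # same direction as q1
q3 = np.array([0.0, 0.0, 5.0])   # orthogonal to q1

sim_same = cosine_similarity(q1, q2)
sim_orth = cosine_similarity(q1, q3)
print(sim_same, sim_orth)
```

Vectors pointing the same way score close to 1 and orthogonal vectors score 0, which is why the value works as a single semantic-similarity feature per question pair.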

# Combine features for ML models

In [6]:
# Concatenate TF-IDF sparse matrix and cosine similarity dense column into a single feature matrix
X_train_combined = hstack([X_train_tfidf, X_train_cos])
X_test_combined = hstack([X_test_tfidf, X_test_cos])

# Target variable
y_train = df_train["is_duplicate"].values
y_test = df_test["is_duplicate"].values
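
To make the concatenation step concrete, here is a small sketch with toy matrices. It shows that the dense column is appended as the last column of the combined matrix, and that `format="csr"` can be passed to get a row-sliceable result (by default `hstack` returns COO format). All values are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy sparse block standing in for the TF-IDF matrix (3 samples, 4 features)
tfidf_block = csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.1],
    [0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.2],
]))

# Dense column standing in for the similarity feature; must be 2-D: (n_samples, 1)
sim_column = np.array([[0.9], [0.2], [0.6]])

# Request CSR output explicitly so rows can be sliced efficiently
combined = hstack([tfidf_block, sim_column], format="csr")

print(combined.shape)
print(combined[:, -1].toarray().ravel())  # the appended similarity column
```

With 4 TF-IDF columns, the extra feature lands at column index 4, that is, at the index equal to the number of TF-IDF features.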

# Train classical ML models

In [26]:
# Initialize the trainer
trainer = ClassicMLTrainer(n_jobs=-1)

# Logistic Regression: only TF-IDF features
model_tfidf = trainer.train_logreg_tfidf(X_train_tfidf, y_train)
trainer.evaluate(model_tfidf, X_test_tfidf, y_test, model_name="LogReg", feature_set="TF-IDF")
trainer.feature_importance(model_tfidf, tfidf_vectorizer)

# Random Forest: TF-IDF only
model_rf_tfidf = trainer.train_rf_tfidf(X_train_tfidf, y_train)
trainer.evaluate(model_rf_tfidf, X_test_tfidf, y_test, model_name="RandomForest", feature_set="TF-IDF")
trainer.feature_importance(model_rf_tfidf, tfidf_vectorizer)


# Random Forest: TF-IDF + BERT
model_rf = trainer.train_rf_combined(X_train_combined, y_train)
trainer.evaluate(model_rf, X_test_combined, y_test, model_name="RandomForest", feature_set="TF-IDF+BERT")
trainer.feature_importance(model_rf, tfidf_vectorizer, feature_names_extra=["bert_cosine_similarity"])


# Logistic Regression: TF-IDF + BERT cosine similarity (combined features)
model_combined = trainer.train_logreg_combined(X_train_combined, y_train)
trainer.evaluate(model_combined, X_test_combined, y_test, model_name="LogReg", feature_set="TF-IDF+BERT")
trainer.feature_importance(model_combined, tfidf_vectorizer, feature_names_extra=["bert_cosine_similarity"])

# XGBoost: TF-IDF + BERT cosine similarity
# Ratio of negative to positive training examples, used to upweight the minority (duplicate) class
scale_pos_weight = np.bincount(y_train)[0] / np.bincount(y_train)[1]
model_xgb = trainer.train_xgb_combined(X_train_combined, y_train, scale_pos_weight)
trainer.evaluate(model_xgb, X_test_combined, y_test, model_name="XGBoost", feature_set="TF-IDF+BERT")
trainer.feature_importance(model_xgb, None)


LogReg (TF-IDF)
F1-score: 0.6614817091915155
Log loss: 0.5354751673520033
Confusion matrix:
 [[37301 13694]
 [ 8330 21518]]
Classification report:
               precision    recall  f1-score   support

           0       0.82      0.73      0.77     50995
           1       0.61      0.72      0.66     29848

    accuracy                           0.73     80843
   macro avg       0.71      0.73      0.72     80843
weighted avg       0.74      0.73      0.73     80843

com: -5.7108
quickbook: 5.4210
sahara: 4.8227
grad: -4.7978
employe: -4.7780
abdomin: -4.3213
edmonton: 4.2891
pride: -4.0329
taffi: 3.9205
curb: 3.8340

RandomForest (TF-IDF)
F1-score: 0.7202475749239902
Log loss: 0.457589788942944
Confusion matrix:
 [[45486  5509]
 [ 9949 19899]]
Classification report:
               precision    recall  f1-score   support

           0       0.82      0.89      0.85     50995
           1       0.78      0.67      0.72     29848

    accuracy                           0.81     80843
   macro avg       0.80      0.78      0.79     80843
weighted avg       0.81      0.81      0.81     80843

Feature 708: 0.0143
Feature 5690: 0.0087
Feature 2886: 0.0064
Feature 7772: 0.0058
Feature 4081: 0.0057
Feature 3481: 0.0054
Feature 4092: 0.0052
Feature 7532: 0.0051
Feature 4011: 0.0046
Feature 2944: 0.0042

RandomForest (TF-IDF+BERT)
F1-score: 0.7408400515109285
Log loss: 0.4157966333337342
Confusion matrix:
 [[45154  5841]
 [ 8850 20998]]
Classification report:
               precision    recall  f1-score   support

           0       0.84      0.89      0.86     50995
           1       0.78      0.70      0.74     29848

    accuracy                           0.82     80843
   macro avg       0.81      0.79      0.80     80843
weighted avg       0.82      0.82      0.82     80843

Feature 8000: 0.1297
Feature 708: 0.0112
Feature 5690: 0.0071
Feature 2886: 0.0053
Feature 7772: 0.0052
Feature 3481: 0.0047
Feature 4081: 0.0046
Feature 7532: 0.0043
Feature 4092: 0.0042
Feature 3447: 0.0041

LogReg (TF-IDF+BERT)
F1-score: 0.6992122188916838
Log loss: 0.4943549745710451
Confusion matrix:
 [[37664 13331]
 [ 6638 23210]]
Classification report:
               precision    recall  f1-score   support

           0       0.85      0.74      0.79     50995
           1       0.64      0.78      0.70     29848

    accuracy                           0.75     80843
   macro avg       0.74      0.76      0.74     80843
weighted avg       0.77      0.75      0.76     80843

bert_cosine_similarity: 24.0564
employe: -6.4466
quickbook: 5.9698
grad: -5.6324
com: -5.5943
visitor: -4.9239
sahara: 4.7797
what: 4.5168
infinit: 4.1871
edmonton: 4.1270

XGBoost (TF-IDF+BERT)
F1-score: 0.6924198250728864
Log loss: 0.5046280051450243
Confusion matrix:
 [[35096 15899]
 [ 5623 24225]]
Classification report:
               precision    recall  f1-score   support

           0       0.86      0.69      0.77     50995
           1       0.60      0.81      0.69     29848

    accuracy                           0.73     80843
   macro avg       0.73      0.75      0.73     80843
weighted avg       0.77      0.73      0.74     80843

Feature 7480: 0.0082
Feature 1599: 0.0072
Feature 4250: 0.0063
Feature 1026: 0.0058
Feature 4859: 0.0055
Feature 1396: 0.0052
Feature 8000: 0.0048
Feature 2295: 0.0047
Feature 4526: 0.0039
Feature 5464: 0.0036

# Ensemble model

In [9]:
# TF-IDF and BERT cosine similarity:
# Prepare models fitted on combined features
lr = trainer.train_logreg_combined(X_train_combined, y_train)
rf = trainer.train_rf_combined(X_train_combined, y_train)
xgb = trainer.train_xgb_combined(X_train_combined, y_train, scale_pos_weight)

estimators = [
    ('lr', lr),
    ('rf', rf),
    ('xgb', xgb)
]

# Train the ensemble
ensemble_model = trainer.train_ensemble(estimators, X_train_combined, y_train)
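
The internals of `trainer.train_ensemble` are not shown here. As a sketch of what soft voting means in general, the example below uses scikit-learn's `VotingClassifier` on synthetic data and checks that its predicted probabilities equal the unweighted mean of its members' probabilities. The data, the two member models, and their hyperparameters are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used only to make the sketch runnable
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    voting="soft",  # average predict_proba outputs instead of counting votes
)
voter.fit(X, y)  # VotingClassifier fits clones of the estimators it is given

# Soft voting: unweighted mean of the members' class probabilities
member_probas = [est.predict_proba(X) for est in voter.estimators_]
manual_average = np.mean(member_probas, axis=0)

matches = np.allclose(voter.predict_proba(X), manual_average)
print(matches)
```

Since `VotingClassifier.fit` refits clones of its members, passing already-trained estimators to it does not reuse their fitted state; whether `train_ensemble` behaves this way depends on its implementation.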

In [None]:
# Save all models
trainer.models["logreg_tfidf"] = model_tfidf
trainer.models["logreg_combined"] = model_combined
trainer.models["xgb_combined"] = model_xgb
trainer.models["rf_combined"] = model_rf  
trainer.models["ensemble_lr"] = lr
trainer.models["ensemble_rf"] = rf
trainer.models["ensemble_xgb"] = xgb
trainer.models["ensemble_model"] = ensemble_model

trainer.save_model("logreg_tfidf", "../src/models/logreg_tfidf.joblib")
trainer.save_model("logreg_combined", "../src/models/logreg_combined.joblib")
trainer.save_model("xgb_combined", "../src/models/xgb_combined.joblib")
trainer.save_model("rf_combined", "../src/models/rf_combined.joblib") 
trainer.save_model("ensemble_lr", "../src/models/ensemble_lr.joblib")
trainer.save_model("ensemble_rf", "../src/models/ensemble_rf.joblib")
trainer.save_model("ensemble_xgb", "../src/models/ensemble_xgb.joblib")
trainer.save_model("ensemble_model", "../src/models/ensemble_model.joblib")
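
`ClassicMLTrainer.save_model` is project-specific and its implementation is not shown. The `.joblib` extensions suggest joblib serialization, so here is a minimal round-trip sketch under that assumption. The toy model and the file name are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny toy model, used only to have something to serialize
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# A restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

A check like this is a cheap way to confirm that a saved model can be loaded again in the same environment before relying on the files downstream.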



# Summary table of results 

In [28]:
if hasattr(trainer, "eval_results"):
    trainer.eval_results.clear()
else:
    trainer.eval_results = []
    
# Evaluate all models, in the same order as the notes added in the next cell
trainer.evaluate(model_tfidf, X_test_tfidf, y_test, model_name="LogReg", feature_set="TF-IDF")
trainer.evaluate(model_rf_tfidf, X_test_tfidf, y_test, model_name="RandomForest", feature_set="TF-IDF")
trainer.evaluate(model_rf, X_test_combined, y_test, model_name="RandomForest", feature_set="TF-IDF+BERT")
trainer.evaluate(model_combined, X_test_combined, y_test, model_name="LogReg", feature_set="TF-IDF+BERT")
trainer.evaluate(model_xgb, X_test_combined, y_test, model_name="XGBoost", feature_set="TF-IDF+BERT")
trainer.evaluate(ensemble_model, X_test_combined, y_test, model_name="Ensemble", feature_set="TF-IDF+BERT")

# Get results
results_df = trainer.summary().drop_duplicates()


In [34]:
results_df["Notes"] = [
    "Baseline Logistic Regression, not promising", 
    "Baseline RandomForest, promising", 
    "Random Forest with BERT cosine similarity – the best overall", 
    "LogReg with BERT cosine similarity – improved but not better than RF",
    "XGBoost performs well, especially on positive class (higher recall)", 
    "Ensemble is strong, but does not outperform Random Forest with BERT"
]

print(results_df)

          Model     Features        F1   LogLoss  \
0        LogReg       TF-IDF  0.661482  0.535475   
1  RandomForest       TF-IDF  0.720248  0.457590   
2  RandomForest  TF-IDF+BERT  0.740840  0.415797   
3        LogReg  TF-IDF+BERT  0.699212  0.494355   
4       XGBoost  TF-IDF+BERT  0.692420  0.504628   
9      Ensemble  TF-IDF+BERT  0.735595  0.441208   

                                               Notes  
0        Baseline Logistic Regression, not promising  
1                   Baseline RandomForest, promising  
2  Random Forest with BERT cosine similarity – th...  
3  LogReg with BERT cosine similarity – improved ...  
4  XGBoost performs well, especially on positive ...  
9  Ensemble is strong, but does not outperform Ra...  


**Summary:**  
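For both model families tested with and without it, adding the BERT cosine similarity feature improves F1 and log loss: logistic regression goes from F1 0.661 to 0.699, and Random Forest from 0.720 to 0.741. The feature does not outweigh model choice, however: Random Forest on TF-IDF alone (F1 0.720) still beats logistic regression and XGBoost on the combined features. Random Forest with TF-IDF+BERT gives the best F1 (0.741) and the lowest log loss (0.416). XGBoost reaches the highest class 1 recall (0.81 versus 0.70 for Random Forest with BERT) at the cost of precision (0.60 versus 0.78). The ensemble is competitive (F1 0.736, log loss 0.441) but does not surpass Random Forest with BERT.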
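
As a sanity check on the comparison, the F1 scores can be recomputed by hand from the confusion matrices printed earlier, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. The helper below is a small sketch written for this purpose and is not part of the project code.

```python
def f1_from_confusion(tn: int, fp: int, fn: int, tp: int) -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Confusion matrices reported above for the TF-IDF+BERT feature set
rf_bert = f1_from_confusion(tn=45154, fp=5841, fn=8850, tp=20998)
xgb_bert = f1_from_confusion(tn=35096, fp=15899, fn=5623, tp=24225)

print(round(rf_bert, 4), round(xgb_bert, 4))
```

The recomputed values agree with the reported F1 scores for Random Forest and XGBoost. XGBoost's extra true positives are offset by a much larger number of false positives, which is why its higher recall does not translate into a higher F1.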


In [35]:
# Save summary
results_df.to_csv("../reports/results.csv", index=False)