This notebook addresses class imbalance in the Quora duplicate question detection task by augmenting the training set with approximately 2,700 additional duplicate question pairs generated by a language model. The synthetic pairs were filtered by BERT-based semantic similarity to keep labels consistent and the pairs diverse. The augmented dataset is vectorized with TF-IDF, and a BERT cosine similarity score is computed for each question pair. Random Forest is used as the primary model because it performed robustly on the base dataset.
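
The filtering code itself is not part of this notebook, so the following is only a minimal sketch of how a similarity-band filter could look. The function names, the `embed` callable, and the `low`/`high` thresholds are all assumptions made for illustration: a lower bound guards label consistency (the pair should really be a paraphrase), and an upper bound guards diversity (the pair should not be a near-copy).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_synthetic_pairs(pairs, embed, low=0.80, high=0.98):
    """Keep (q1, q2) pairs whose embedding similarity lies in [low, high].

    `embed` maps a question to its vector. The thresholds are hypothetical
    placeholders, not the values used to build the augmented dataset.
    """
    kept = []
    for q1, q2 in pairs:
        sim = cosine_similarity(embed(q1), embed(q2))
        if low <= sim <= high:
            kept.append((q1, q2, sim))
    return kept


# Toy 2-D "embeddings" standing in for real sentence vectors.
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.3],   # close to "a": plausible paraphrase
    "c": [0.0, 1.0],   # orthogonal to "a": unrelated question
    "a2": [1.0, 0.0],  # identical to "a": no added diversity
}
kept = filter_synthetic_pairs(
    [("a", "b"), ("a", "c"), ("a", "a2")], toy_vectors.get
)
```

With these toy vectors only the `("a", "b")` pair survives: the unrelated pair falls below `low` and the verbatim copy exceeds `high`.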

In [None]:
import re
import sys

import ftfy
import joblib
import numpy as np
import pandas as pd
from scipy.sparse import hstack

sys.path.append('..')  # make the project-level src package importable

from src.features.similarity import TextEmbedder, compute_and_save_bert_features
from src.models.trainers import ClassicMLTrainer

# BERT embeddings and cosine similarity

In [3]:
df_full = pd.read_csv("../data/processed/full_train_augmented.csv", index_col=False)

embedder = TextEmbedder(model_name="bert-base-uncased")
compute_and_save_bert_features(df_full, embedder, "../data/processed/aug_quora_train", batch_size=64)

Number of unique questions: 451315


BERT embeddings: 100%|██████████| 7052/7052 [42:31<00:00,  2.76it/s]


Saved to ../data/processed/aug_quora_train_embeddings.npy, shape: (451315, 768)
Saved ../data/processed/aug_quora_train_with_bert_sim.csv and .npy embeddings.
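
The progress bar total is consistent with the reported counts: 451,315 unique questions at `batch_size=64` need 7,052 batches. The internals of `compute_and_save_bert_features` are not shown here, so the sketch below only assumes that it deduplicates questions before embedding them in batches; the helper names are hypothetical.

```python
import math


def unique_questions(pairs):
    """Return the distinct questions from (q1, q2) pairs in first-seen order."""
    seen = {}
    for q1, q2 in pairs:
        seen.setdefault(q1, len(seen))
        seen.setdefault(q2, len(seen))
    return list(seen)


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Batch count for the run above: the last batch is only partially filled.
n_batches = math.ceil(451315 / 64)

# Small illustration of deduplicate-then-batch.
toy_pairs = [("q1", "q2"), ("q2", "q3"), ("q1", "q3")]
toy_unique = unique_questions(toy_pairs)
toy_batches = list(batches(toy_unique, 2))
```

Embedding each distinct question once and looking the vectors up per pair avoids re-encoding questions that appear in several pairs.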


# Prepare features for ML models

In [11]:
# Load the test set (precomputed BERT cosine similarity; cleaned text is rebuilt below)
test_df = pd.read_csv("../data/processed/quora_test_with_bert_sim.csv")

# Load the fitted TF-IDF vectorizer
tfidf_vectorizer = joblib.load("../src/models/tfidf_vectorizer.joblib")

def preprocess_text(text):
    """Fix broken Unicode, lowercase, and collapse runs of whitespace."""
    text = str(text)
    text = ftfy.fix_text(text)
    text = text.strip().lower()
    text = re.sub(r"\s+", " ", text)
    return text

def prepare_pairs(df):
    """Drop rows with a missing question, then build the cleaned pair text."""
    df = df.dropna(subset=["question1", "question2"]).reset_index(drop=True)
    df["question1"] = df["question1"].astype(str)
    df["question2"] = df["question2"].astype(str)
    df["cleaned_text"] = (df["question1"] + " [SEP] " + df["question2"]).apply(preprocess_text)
    return df

# Apply the same steps, in the same order, to the train and test sets
df_full = prepare_pairs(df_full)
test_df = prepare_pairs(test_df)

# Transform cleaned text columns to TF-IDF feature matrices
X_train_tfidf = tfidf_vectorizer.transform(df_full["cleaned_text"])
X_test_tfidf = tfidf_vectorizer.transform(test_df["cleaned_text"])

# BERT cosine similarity as a single dense column
X_train_cos = df_full[["bert_cosine_similarity"]].values
X_test_cos = test_df[["bert_cosine_similarity"]].values

# Combine TF-IDF and BERT cosine similarity into the final feature set
X_train_combined = hstack([X_train_tfidf, X_train_cos])
X_test_combined = hstack([X_test_tfidf, X_test_cos])

# Target values
y_train = df_full["is_duplicate"].values
y_test = test_df["is_duplicate"].values
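
To make the text normalization concrete, here is a minimal standard-library sketch of what happens to one question pair. It deliberately omits the `ftfy.fix_text` step, and the `normalize` and `build_pair_text` names are hypothetical stand-ins for the notebook's `preprocess_text` pipeline.

```python
import re


def normalize(text):
    """Strip, lowercase, and collapse whitespace (the ftfy step is omitted here)."""
    text = str(text).strip().lower()
    return re.sub(r"\s+", " ", text)


def build_pair_text(question1, question2, sep=" [SEP] "):
    """Join the two questions with a separator, then normalize the result."""
    return normalize(question1 + sep + question2)


example = build_pair_text(
    "  How do I learn\tPython? ",
    "What is the best way\nto learn Python?",
)
print(example)
```

Because lowercasing runs after concatenation, the separator ends up in the text as `[sep]` rather than `[SEP]`. That is harmless as long as the TF-IDF vectorizer was fitted on text produced by the same steps, which this notebook assumes when it loads the saved vectorizer.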


# Train classical ML models

In [12]:
trainer = ClassicMLTrainer(n_jobs=-1)

# Random Forest: TF-IDF only
model_rf_tfidf = trainer.train_rf_tfidf(X_train_tfidf, y_train)
trainer.evaluate(model_rf_tfidf, X_test_tfidf, y_test, model_name="RandomForest", feature_set="TF-IDF")
trainer.feature_importance(model_rf_tfidf, tfidf_vectorizer)


# Random Forest: TF-IDF + BERT
model_rf = trainer.train_rf_combined(X_train_combined, y_train)
trainer.evaluate(model_rf, X_test_combined, y_test, model_name="RandomForest", feature_set="TF-IDF+BERT")
trainer.feature_importance(model_rf, tfidf_vectorizer, feature_names_extra=["bert_cosine_similarity"])


RandomForest (TF-IDF)
F1-score: 0.6777985279578161
Log loss: 0.5274705664285126
Confusion matrix:
 [[44750  6255]
 [11343 18510]]
Classification report:
               precision    recall  f1-score   support

           0       0.80      0.88      0.84     51005
           1       0.75      0.62      0.68     29853

    accuracy                           0.78     80858
   macro avg       0.77      0.75      0.76     80858
weighted avg       0.78      0.78      0.78     80858

Feature 7812: 0.0335
Feature 3451: 0.0247
Feature 1042: 0.0208
Feature 708: 0.0175
Feature 5690: 0.0110
Feature 652: 0.0084
Feature 2886: 0.0080
Feature 3481: 0.0073
Feature 3389: 0.0073
Feature 7850: 0.0068

RandomForest (TF-IDF+BERT)
F1-score: 0.7045192835757437
Log loss: 0.4963457345884709
Confusion matrix:
 [[44295  6710]
 [ 9969 19884]]
Classification report:
               precision    recall  f1-score   support

           0       0.82      0.87      0.84     51005
           1       0.75      0.67      0.
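
The positive-class metrics can be re-derived from the confusion matrices printed above, which is a useful cross-check given that the second report is cut off. The sketch assumes the scikit-learn layout (rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`); `binary_metrics` is a hypothetical helper, not part of `ClassicMLTrainer`.

```python
def binary_metrics(confusion):
    """Return (precision, recall, f1) for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Confusion matrices copied from the evaluation output above.
tfidf_only = binary_metrics([[44750, 6255], [11343, 18510]])
combined = binary_metrics([[44295, 6710], [9969, 19884]])

for name, (precision, recall, f1) in [("TF-IDF", tfidf_only), ("TF-IDF+BERT", combined)]:
    print(f"{name}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Both F1 values agree with the reported scores (0.6778 and 0.7045). The gain from adding the BERT cosine feature comes from recall on the duplicate class, which rises from about 0.62 to about 0.67, while precision stays near 0.75.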

# Save trained models

In [15]:
# Save all models
trainer.models["rf_combined"] = model_rf  
trainer.models["rf_tfidf"] = model_rf_tfidf 

trainer.save_model("rf_combined", "../src/models/rf_combined.joblib") 
trainer.save_model("rf_tfidf", "../src/models/rf_tfidf.joblib")


# Summary table of results 

In [None]:
if hasattr(trainer, "eval_results"):
    trainer.eval_results.clear()
else:
    trainer.eval_results = []
    
# Evaluate all models
trainer.evaluate(model_rf_tfidf, X_test_tfidf, y_test, model_name="RandomForest", feature_set="TF-IDF")
trainer.evaluate(model_rf, X_test_combined, y_test, model_name="RandomForest", feature_set="TF-IDF+BERT")

# Get results
results_df_augm = trainer.summary().drop_duplicates()

In [21]:
results_df_augm["Notes"] = [
    "Baseline RandomForest gets worse",
    "Random Forest with BERT cosine similarity – gets worse",
]

print(results_df_augm)

          Model     Features        F1   LogLoss  \
0  RandomForest       TF-IDF  0.677799  0.527471   
1  RandomForest  TF-IDF+BERT  0.704519  0.496346   

                                               Notes  
0                   Baseline RandomForest gets worse  
1  Random Forest with BERT cosine similarity – ge...  


In [22]:
# Save summary
results_df_augm.to_csv("../reports/results_augm.csv", index=False)

Data augmentation did not deliver the intended improvement: overall model performance declined. Trained on the augmented dataset, the Random Forest reached an F1-score of 0.678 (log loss 0.527) with TF-IDF features and 0.705 (log loss 0.496) with TF-IDF plus BERT cosine similarity, in both cases a lower F1-score and a higher log loss than the corresponding model trained on the original dataset. A likely cause is the relatively low semantic fidelity of some augmented duplicate pairs, which added label noise instead of improving class balance. The experiment suggests that augmentation can address class imbalance only when the synthetic examples are of high quality.