### E2E ML Pipeline

This notebook details the feature engineering pipeline as well as the model exploration and evaluation process. At the end, we fully train the Bi-directional LSTM and DistilBERT models.

Library imports

In [None]:
import os
import re
import warnings
import joblib

import numpy as np
import pandas as pd

import torch
from torch.utils.data import Dataset, DataLoader

import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, Bidirectional, Concatenate, TextVectorization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint

from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, AdamW

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import (
    precision_score, 
    recall_score, 
    f1_score, 
    roc_auc_score,
    classification_report,
    confusion_matrix
)
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from xgboost import XGBClassifier

from imblearn.over_sampling import SVMSMOTE

from sentence_transformers import SentenceTransformer

# suppress warnings
warnings.filterwarnings('ignore')

Loading of the dataset

In [None]:
df = pd.read_csv('../data/fake_job_postings.csv')

df.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


### Feature Engineering Pipeline

In this section we create reusable functions that can be fitted into a pipeline. We start by creating features from the numeric and categorical columns of the dataset.
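As a minimal sketch of what "reusable functions that fit into a pipeline" can look like, the snippet below wraps a plain DataFrame function in scikit-learn's `FunctionTransformer` and chains it with a scaler. The helper `add_presence_flags` and the toy data are hypothetical and are not part of this notebook's pipeline.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def add_presence_flags(df):
    """Hypothetical helper: one 0/1 column per input column, 1 where a value is present."""
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        out[f"has_{col}"] = df[col].notna().astype(int)
    return out


# toy data, loosely modelled on two columns of the job postings dataset
toy = pd.DataFrame({
    "salary_range": ["10-20", None, "30-40", None],
    "benefits": [None, "Full Benefits Offered", "Gym", "Lunch"],
})

flag_pipeline = Pipeline([
    # stateless step: calls the function on whatever DataFrame it receives
    ("flags", FunctionTransformer(add_presence_flags)),
    # stateful step: mean and variance are learned in fit and reused in transform
    ("scale", StandardScaler()),
])

scaled = flag_pipeline.fit_transform(toy)
print(scaled.shape)
```

Because the function itself holds no state, the same pipeline object can be fitted on a training fold and then applied unchanged to a validation fold.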

Functions to extract features from the textual columns of the dataset

In [None]:
SCAMMY_PHRASES = [
    'urgent', 'guaranteed', 'money', 'cash', 'quick cash', 'investment', 'upfront fee', 'wire transfer',
    'limited time', 'winner', 'prize', 'bonus', 'earn', 'easy money', 'no experience', 'click here',
    'apply fast', 'instant', 'payable immediately', 'work from home', 'sms us', 'act now', 'free',
    'risk free', 'no risk', 'make money fast', 'earn extra income', 'financial freedom', 'be your own boss',
    'unlimited earning', 'passive income', 'get paid', 'weekly pay', 'daily pay', 'commission',
    'multi level marketing', 'mlm', 'pyramid', 'recruitment', 'sign up fee', 'registration fee',
    'processing fee', 'training fee', 'starter kit', 'pay to apply', 'credit card required',
    'bank account', 'personal information', 'social security', 'copy and paste', 'no selling',
    'no interview', 'hired immediately', 'start today', 'start immediately', 'too good to be true'
]

UNIVERSAL_CURRENCIES = [
    '$', '€', '£', '¥', '₹', '₽', '₩', '₪', '₱', '₫', '฿', '₡', '₦', '₨', '₴', '₵', '₸', '₺', '₼', '₾',
    'USD', 'EUR', 'GBP', 'JPY', 'CNY', 'INR', 'RUB', 'KRW', 'AUD', 'CAD', 'CHF', 'HKD', 'SGD', 'SEK',
    'NOK', 'DKK', 'PLN', 'THB', 'MXN', 'BRL', 'ZAR', 'TRY', 'IDR', 'MYR', 'PHP', 'VND', 'AED', 'SAR',
    'ILS', 'EGP', 'NGN', 'PKR', 'BDT', 'UAH', 'CZK', 'HUF', 'RON', 'NZD', 'CLP', 'ARS', 'COP', 'PEN'
]

SENTENCE_TRANSFORMER_MODEL = None

def similarity_function(column_a, column_b):
    combined = pd.concat([column_a, column_b], axis=0)
    embeddings = embed_texts(combined.tolist())
    first = embeddings[: len(column_a)]
    second = embeddings[len(column_a) :]
    similarities = rowwise_cosine(first, second)
    return pd.Series(similarities, index=column_a.index)

def embed_texts(texts):
    # lazily load the sentence transformer once and reuse it across calls
    global SENTENCE_TRANSFORMER_MODEL
    if SENTENCE_TRANSFORMER_MODEL is None:
        # pick the best available device: Apple MPS, then CUDA, otherwise CPU
        # (torch is already imported at the top of the notebook)
        if torch.backends.mps.is_available():
            device = 'mps'
        elif torch.cuda.is_available():
            device = 'cuda'
        else:
            device = 'cpu'

        SENTENCE_TRANSFORMER_MODEL = SentenceTransformer('all-MiniLM-L6-v2', device=device)
        print(f"[INFO] Using {device} for sentence transformer")
    return np.asarray(SENTENCE_TRANSFORMER_MODEL.encode(list(texts), show_progress_bar=True, batch_size=128, convert_to_numpy=True))

def rowwise_cosine(first, second):
    numerators = np.sum(first * second, axis=1)
    denom = np.linalg.norm(first, axis=1) * np.linalg.norm(second, axis=1)
    denom = np.where(denom == 0, 1e-12, denom)
    return numerators / denom

def create_textual_features(df):
    # create output dataframe
    out = pd.DataFrame(index=df.index)
    # get salary_range
    salary_range = df['salary_range']
    # create binary column for salary_range
    out['has_salary_range'] = salary_range.notna().astype(int)

    # for has required experience
    required_experience = df['required_experience']
    out['has_required_experience'] = required_experience.notna().astype(int)

    # for has employment type
    employment_type = df['employment_type']
    out['has_employment_type'] = employment_type.notna().astype(int)

    # for has required education
    required_education = df['required_education']
    out['has_required_education'] = required_education.notna().astype(int)

    # for has company profile
    company_profile = df['company_profile']
    out['has_company_profile'] = company_profile.notna().astype(int)
    
    # calculate lengths for description, company_profile, requirements, and benefits
    out['description_length'] = df['description'].fillna('').str.len()
    out['company_profile_length'] = df['company_profile'].fillna('').str.len()
    out['requirements_length'] = df['requirements'].fillna('').str.len()
    out['benefits_length'] = df['benefits'].fillna('').str.len()
    
    # engineered features
    # legitimacy index through weighted sum 
    # description 0.3, requirements 0.4, logo 0.1, profile 0.2
    has_company_logo = df['has_company_logo']
    has_company_profile = out['has_company_profile']
    has_requirements = df['requirements'].notna().astype(int)
    has_description = df['description'].notna().astype(int)
    out['legitimacy_index'] = 0.3 * has_description + 0.4 * has_requirements + 0.1 * has_company_logo + 0.2 * has_company_profile

    # cosine similarities
    profile_desc_sim = similarity_function(df['company_profile'].fillna(''), df['description'].fillna(''))
    # do normalisation to output 0 to 1
    out['profile_desc_similarity'] = ((profile_desc_sim + 1) / 2).clip(0.0, 1.0)

    descr_req_sim = similarity_function(df['description'].fillna(''), df['requirements'].fillna(''))
    out['desc_req_similarity'] = ((descr_req_sim + 1) / 2).clip(0.0, 1.0)

    # check whether description column contains '@' or 5+ consecutive digits
    out['desc_contact_flag'] = df['description'].fillna('').apply(lambda x: float(bool('@' in x or re.search(r'\d{5,}', x) is not None)))
    
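    # count how many distinct SCAMMY_PHRASES occur in the description
    # (case-insensitive substring match, so 'earn' also matches 'learn')
    out['scammy_word_count'] = df['description'].fillna('').str.lower().apply(lambda x: sum(1 for phrase in SCAMMY_PHRASES if phrase in x))

    # check if title contains a currency symbol or code
    # (case-sensitive substring match, so codes such as 'PHP' or 'CAD' also match titles like 'PHP Developer')
    out['has_currency_title'] = df['title'].fillna('').apply(lambda x: float(any(currency in x for currency in UNIVERSAL_CURRENCIES)))

    # check if description contains a currency symbol or code (same substring caveat as above)
    out['has_currency_description'] = df['description'].fillna('').apply(lambda x: float(any(currency in x for currency in UNIVERSAL_CURRENCIES)))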

    return out
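To make the similarity features concrete, here is a small worked example of the row-wise cosine and the `(cos + 1) / 2` rescaling used above. `rowwise_cosine` is copied from the cell so the snippet runs on its own; the 2-dimensional vectors are made-up stand-ins for sentence embeddings.

```python
import numpy as np


def rowwise_cosine(first, second):
    # same logic as the notebook's helper: cosine between matching rows
    numerators = np.sum(first * second, axis=1)
    denom = np.linalg.norm(first, axis=1) * np.linalg.norm(second, axis=1)
    denom = np.where(denom == 0, 1e-12, denom)  # avoid division by zero
    return numerators / denom


# toy "embeddings": same direction, orthogonal, opposite, and a zero vector
first = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
second = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0], [1.0, 1.0]])

cos = rowwise_cosine(first, second)          # values in [-1, 1]
scaled = np.clip((cos + 1) / 2, 0.0, 1.0)    # values in [0, 1]

print(cos)     # [ 1.  0. -1.  0.]
print(scaled)  # [1.  0.5 0.  0.5]
```

Note that an all-zero vector yields a cosine of 0 and therefore a rescaled value of 0.5, the same as two orthogonal vectors.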


The pipeline that integrates all the different functions

In [None]:
WORD_NGRAM_RANGE = (1, 2)
MIN_DF = 3
WORD_MAX_FEATURES = 5000

def transform_text_word_tfidf(ngram_range=WORD_NGRAM_RANGE, min_df=MIN_DF, max_features=WORD_MAX_FEATURES):
    return Pipeline([
        # combine text fields
        ('combine_text', FunctionTransformer(combine_text_fields)),
        # then tf-idf, using the function arguments so callers can override the defaults
        ('tfidf', TfidfVectorizer(
            ngram_range=ngram_range,
            min_df=min_df,
            max_features=max_features,
            sublinear_tf=True,
            lowercase=True,
            stop_words='english'
        )) # don't scale to retain sparsity and semantic meaning
    ])

# add numeric features from categorical columns
def transform_numeric_features():
    return Pipeline([
        # run add_numeric_features function
        ('add_numeric', FunctionTransformer(add_numeric_features)),
        # ensure output is a numpy array
        ('to_matrix', FunctionTransformer(lambda df: df.values if isinstance(df, pd.DataFrame) else df)),
        # then standard scale
        ('scale', StandardScaler(with_mean=True)) # binary features benefit from centering
    ])

# add features from categorical columns and then one hot encode
def transform_categorical_features():
    return Pipeline([
        # use wrapper to avoid data leakage (fits major_countries once on training data)
        ('add_categorical', CategoricalFeaturesWrapper()),
        # then one hot encode
        ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False)) # no need to scale since they are already on consistent scale of 0/1
    ])

# add features from textual columns
def transform_textual_features():
    return Pipeline([
        # run create_textual_features function
        ('create_textual', FunctionTransformer(create_textual_features)),
        # ensure output is a numpy array
        ('to_matrix', FunctionTransformer(lambda df: df.values if isinstance(df, pd.DataFrame) else df)),
        # then standard scale
        ('scale', StandardScaler(with_mean=True)) # mixed feature types benefit from centering
    ])

# union pipeline
def build_complete_pipeline():
    return FeatureUnion([
        ('numeric', transform_numeric_features()),
        ('categorical', transform_categorical_features()),
        ('textual', transform_textual_features()),
        ('text_word_tfidf', transform_text_word_tfidf())
    ])
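The sketch below illustrates what `FeatureUnion` does with its branches: each one is fitted on the same input and the outputs are concatenated column-wise, with the result kept sparse when any branch is sparse. The two branches and the toy data are hypothetical and much simpler than the four branches built above.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

toy = pd.DataFrame({"description": [
    "earn cash fast",
    "senior data engineer",
    "earn money from home",
]})


def text_length(df):
    # hypothetical dense branch: a single column with the description length
    return df["description"].str.len().to_numpy().reshape(-1, 1)


def select_description(df):
    # hand the raw text column to the vectorizer
    return df["description"]


union = FeatureUnion([
    ("length", FunctionTransformer(text_length)),
    ("tfidf", Pipeline([
        ("select", FunctionTransformer(select_description)),
        ("tfidf", TfidfVectorizer()),
    ])),
])

features = union.fit_transform(toy)

# total width = 1 length column + one column per TF-IDF vocabulary term
fitted_tfidf = union.transformer_list[1][1].named_steps["tfidf"]
vocab_size = len(fitted_tfidf.vocabulary_)
print(features.shape, vocab_size, sparse.issparse(features))
```

Since the width of some branches depends on what is seen during `fit` (vocabulary terms, one-hot categories), the total number of columns can differ between training folds, which is one plausible reason for the slightly different feature matrix widths in the logs further down.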


### Model Evaluation - Stratified K-Fold Cross Validation (CV)

In this section, we experiment with traditional ML models (Logistic Regression as the baseline, Gaussian Naive Bayes, Random Forest, XGBoost) and deep learning models (Bi-directional LSTM, DistilBERT), using cross-validation to estimate how each model is likely to perform on unseen data. SVMSMOTE is applied to the training fold only, to oversample the minority (fraudulent) class.

All the models go through the same pipeline and are evaluated over 5 folds.

_Note: deep learning models go through a separate process, as outlined in the report._
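The following sketch shows the fold logic in isolation: stratified splitting keeps the class ratio in every validation fold, and oversampling touches the training indices only. Plain random duplication of minority rows is used here as a stand-in for SVMSMOTE so that the snippet needs only NumPy and scikit-learn; the data are synthetic.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
y = np.array([1] * 10 + [0] * 90)   # 10% minority class (synthetic)
X = rng.normal(size=(100, 3))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_ratios, val_ratios = [], []

for fold, (train_idx, val_idx) in enumerate(cv.split(X, y), 1):
    y_train = y[train_idx]

    # stand-in for SVMSMOTE: duplicate minority rows until the classes are balanced
    minority_idx = train_idx[y_train == 1]
    n_extra = int((y_train == 0).sum() - (y_train == 1).sum())
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    resampled_idx = np.concatenate([train_idx, extra_idx])

    # the validation fold is never resampled and never overlaps the training rows
    assert not set(resampled_idx) & set(val_idx)

    train_ratios.append(y[resampled_idx].mean())
    val_ratios.append(y[val_idx].mean())
    print(f"fold {fold}: train positive rate={train_ratios[-1]:.2f}, "
          f"val positive rate={val_ratios[-1]:.2f}")
```

Keeping the validation fold at its natural class ratio means the reported precision and recall reflect the imbalance a deployed model would face.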

In [None]:
# prepare X and y from the full dataset
X = df.drop('fraudulent', axis=1)
y = df['fraudulent']

_Logistic Regression_

In [None]:
def cv_lr_oof_metrics(X, y, n_splits=5):
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    oof_lr = np.zeros(len(y), dtype=float)

    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y), 1):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        # build features per fold so that every transformer is fitted on the training fold only
        features = build_complete_pipeline()
        X_train = features.fit_transform(X_train)
        X_val = features.transform(X_val)

        # Oversample the minority class on the TRAIN data
        smote = SVMSMOTE(random_state=42)
        X_res, y_res = smote.fit_resample(X_train, y_train)

        # ----- Logistic Regression -----
        lr = LogisticRegression(
            max_iter=1000,
            n_jobs=-1
        )
        lr.fit(X_res, y_res)

        # Out-of-fold predictions (probabilities for class 1)
        oof_lr[val_idx] = lr.predict_proba(X_val)[:, 1]

        # Hard predictions for metrics
        y_pred = lr.predict(X_val)

        fold_precision = precision_score(y_val, y_pred)
        fold_recall = recall_score(y_val, y_pred)
        fold_f1 = f1_score(y_val, y_pred)
        fold_auc = roc_auc_score(y_val, oof_lr[val_idx])

        print(
            f"[LR] Fold {fold}: "
            f"Precision={fold_precision:.5f}, "
            f"Recall={fold_recall:.5f}, "
            f"F1={fold_f1:.5f}, "
            f"AUC={fold_auc:.5f}"
        )

    # Overall metrics across all folds
    y_pred_overall = (oof_lr >= 0.5).astype(int)
    overall_precision = precision_score(y, y_pred_overall)
    overall_recall = recall_score(y, y_pred_overall)
    overall_f1 = f1_score(y, y_pred_overall)
    overall_auc = roc_auc_score(y, oof_lr)

    return overall_precision, overall_recall, overall_f1, overall_auc, oof_lr

In [None]:
# test 5-fold CV on logistic regr
precision, recall, f1, auc, oof_predictions = cv_lr_oof_metrics(X, y, n_splits=5)

print("\n" + "="*60)
print("Overall LR Performance (5-Fold CV):")
print("="*60)
print(f"Precision: {precision:.5f}")
print(f"Recall:    {recall:.5f}")
print(f"F1 Score:  {f1:.5f}")
print(f"ROC AUC:   {auc:.5f}")
print("="*60)
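The overall scores above are computed on the pooled out-of-fold predictions, which is not the same as averaging the per-fold scores (the approach taken by `cv_rf_oof_metrics` later in this notebook). A toy illustration with made-up labels and predictions shows how the two can differ:

```python
import numpy as np
from sklearn.metrics import precision_score

# two toy validation folds: (true labels, hard predictions)
folds = [
    (np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0])),  # precision 1/1
    (np.array([1, 0, 0, 0]), np.array([1, 1, 1, 0])),  # precision 1/3
]

# option 1: average the per-fold scores
mean_of_folds = np.mean([precision_score(t, p) for t, p in folds])

# option 2: pool all out-of-fold predictions, then score once
y_true_all = np.concatenate([t for t, _ in folds])
y_pred_all = np.concatenate([p for _, p in folds])
pooled = precision_score(y_true_all, y_pred_all)

print(f"mean of per-fold precision: {mean_of_folds:.3f}")  # 0.667
print(f"pooled OOF precision:       {pooled:.3f}")         # 0.500
```

Neither convention is wrong, but the numbers are only comparable across models when the same one is used for all of them.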

_Gaussian Naive Bayes_

In [None]:
def cv_nb_oof_metrics(X, y, n_splits=5):
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    oof_nb = np.zeros(len(y), dtype=float)
    
    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y), 1):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        # build features per fold
        features = build_complete_pipeline()
        X_train = features.fit_transform(X_train)
        X_val = features.transform(X_val)

        # convert sparse to dense only for Naive Bayes
        if hasattr(X_train, 'toarray'):
            X_train = X_train.toarray()
        if hasattr(X_val, 'toarray'):
            X_val = X_val.toarray()

        # apply SVMSMOTE only on training fold
        smote = SVMSMOTE(random_state=42)
        X_train, y_train = smote.fit_resample(X_train, y_train)

        # init naive bayes
        model = GaussianNB()
        model.fit(X_train, y_train)
        oof_nb[val_idx] = model.predict_proba(X_val)[:,1]

        # calculate metrics for this fold
        y_pred = (oof_nb[val_idx] >= 0.5).astype(int)
        fold_precision = precision_score(y_val, y_pred)
        fold_recall = recall_score(y_val, y_pred)
        fold_f1 = f1_score(y_val, y_pred)
        fold_auc = roc_auc_score(y_val, oof_nb[val_idx])

        print(f"[NB] Fold {fold}: Precision={fold_precision:.5f}, Recall={fold_recall:.5f}, F1={fold_f1:.5f}, AUC={fold_auc:.5f}")

    # calculate overall metrics
    y_pred_overall = (oof_nb >= 0.5).astype(int)
    overall_precision = precision_score(y, y_pred_overall)
    overall_recall = recall_score(y, y_pred_overall)
    overall_f1 = f1_score(y, y_pred_overall)
    overall_auc = roc_auc_score(y, oof_nb)
    
    return overall_precision, overall_recall, overall_f1, overall_auc, oof_nb
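`GaussianNB` does not accept sparse input, hence the `toarray()` calls above. As a rough back-of-the-envelope check of what that costs, the snippet estimates the dense size of one training fold. The shapes are taken from the Random Forest log further down and are assumed to be similar here; `float64` storage is also an assumption.

```python
def dense_size_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory needed for a dense n_rows x n_cols matrix (float64 by default)."""
    return n_rows * n_cols * bytes_per_value


# shapes as logged for fold 1 of the Random Forest run (assumed comparable for Naive Bayes)
train_bytes = dense_size_bytes(14304, 5187)       # before oversampling
resampled_bytes = dense_size_bytes(27222, 5187)   # after SVMSMOTE

print(f"dense train fold:     {train_bytes / 1e9:.2f} GB")      # 0.59 GB
print(f"dense resampled fold: {resampled_bytes / 1e9:.2f} GB")  # 1.13 GB
```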

In [None]:
# test 5-fold CV on naive bayes
precision, recall, f1, auc, oof_predictions = cv_nb_oof_metrics(X, y, n_splits=5)

print("\n" + "="*60)
print("Overall Naive Bayes Performance (5-Fold CV):")
print("="*60)
print(f"Precision: {precision:.5f}")
print(f"Recall:    {recall:.5f}")
print(f"F1 Score:  {f1:.5f}")
print(f"ROC AUC:   {auc:.5f}")
print("="*60)

Batches: 100%|██████████| 224/224 [01:44<00:00,  2.14it/s]
Batches: 100%|██████████| 224/224 [01:53<00:00,  1.97it/s]
Batches: 100%|██████████| 56/56 [00:32<00:00,  1.74it/s]
Batches: 100%|██████████| 56/56 [00:25<00:00,  2.21it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NB] Fold 1: Precision=0.37183, Recall=0.76301, F1=0.50000, AUC=0.84839


Batches: 100%|██████████| 224/224 [01:30<00:00,  2.48it/s]
Batches: 100%|██████████| 224/224 [01:33<00:00,  2.39it/s]
Batches: 100%|██████████| 56/56 [00:25<00:00,  2.22it/s]
Batches: 100%|██████████| 56/56 [00:24<00:00,  2.28it/s]


[NB] Fold 2: Precision=0.39414, Recall=0.69942, F1=0.50417, AUC=0.82468


Batches: 100%|██████████| 224/224 [01:42<00:00,  2.19it/s]
Batches: 100%|██████████| 224/224 [01:35<00:00,  2.35it/s]
Batches: 100%|██████████| 56/56 [00:26<00:00,  2.12it/s]
Batches: 100%|██████████| 56/56 [00:28<00:00,  1.95it/s]


[NB] Fold 3: Precision=0.45255, Recall=0.71676, F1=0.55481, AUC=0.83601


Batches: 100%|██████████| 224/224 [01:31<00:00,  2.46it/s]
Batches: 100%|██████████| 224/224 [01:32<00:00,  2.43it/s]
Batches: 100%|██████████| 56/56 [00:24<00:00,  2.28it/s]
Batches: 100%|██████████| 56/56 [00:23<00:00,  2.40it/s]


[NB] Fold 4: Precision=0.35562, Recall=0.67630, F1=0.46614, AUC=0.80667


Batches: 100%|██████████| 224/224 [01:32<00:00,  2.43it/s]
Batches: 100%|██████████| 224/224 [01:31<00:00,  2.44it/s]
Batches: 100%|██████████| 56/56 [00:25<00:00,  2.22it/s]
Batches: 100%|██████████| 56/56 [00:23<00:00,  2.36it/s]


[NB] Fold 5: Precision=0.40892, Recall=0.63218, F1=0.49661, AUC=0.79202

Overall Naive Bayes Performance (5-Fold CV):
Precision: 0.39374
Recall:    0.69746
F1 Score:  0.50333
ROC AUC:   0.82152
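As a quick sanity check on the numbers above, F1 is the harmonic mean of precision and recall, so the reported overall F1 can be reproduced from the reported overall precision and recall. The mean of the five per-fold F1 values is shown for comparison; it differs slightly because the overall score is computed on pooled out-of-fold predictions.

```python
# overall (pooled out-of-fold) values reported above
precision, recall = 0.39374, 0.69746
f1_from_pr = 2 * precision * recall / (precision + recall)

# per-fold F1 values reported above
fold_f1 = [0.50000, 0.50417, 0.55481, 0.46614, 0.49661]
mean_fold_f1 = sum(fold_f1) / len(fold_f1)

print(f"F1 from overall precision/recall: {f1_from_pr:.5f}")    # 0.50333
print(f"mean of per-fold F1:              {mean_fold_f1:.5f}")  # 0.50435
```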


_Random Forest_

In [None]:
def cv_rf_oof_metrics(X, y, n_splits=5):
    print("[CV] Starting RandomForest CV with", n_splits, "folds")
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    oof = np.zeros(len(y), dtype=float)
    metrics = []

    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y), start=1):
        print(f"\n========== FOLD {fold} ==========")
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        print("[CV] Building pipeline for this fold...")
        features = build_complete_pipeline()

        print("[CV] Fit-transform TRAIN data...")
        X_train_f = features.fit_transform(X_train, y_train)
        print("[CV] TRAIN feature matrix shape:", X_train_f.shape)

        print("[CV] Transform VAL data...")
        X_val_f = features.transform(X_val)
        print("[CV] VAL feature matrix shape:", X_val_f.shape)

        # SMOTE on training fold
        print("[CV] Applying SVMSMOTE on train fold...")
        smote = SVMSMOTE(random_state=42)
        X_train_res, y_train_res = smote.fit_resample(X_train_f, y_train)
        print("[CV] After SMOTE: X_train_res shape:", X_train_res.shape)

        # Random Forest
        print("[CV] Training RandomForestClassifier...")
        model = RandomForestClassifier(
            n_estimators=300,
            max_depth=None,
            random_state=42,
            n_jobs=-1
        )
        model.fit(X_train_res, y_train_res)

        # Predictions
        print("[CV] Predicting on VAL fold...")
        val_proba = model.predict_proba(X_val_f)[:, 1]
        oof[val_idx] = val_proba
        val_pred = (val_proba >= 0.5).astype(int)

        # Metrics
        fold_precision = precision_score(y_val, val_pred)
        fold_recall = recall_score(y_val, val_pred)
        fold_f1 = f1_score(y_val, val_pred)
        fold_auc = roc_auc_score(y_val, val_proba)

        print(f"[FOLD {fold}] Precision={fold_precision:.5f}, "
              f"Recall={fold_recall:.5f}, F1={fold_f1:.5f}, AUC={fold_auc:.5f}")

        metrics.append((fold_precision, fold_recall, fold_f1, fold_auc))

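    # Aggregate metrics: mean of the per-fold scores
    # (unlike the LR and NB functions, which score the pooled out-of-fold predictions)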
    metrics = np.array(metrics)
    overall_precision = metrics[:, 0].mean()
    overall_recall = metrics[:, 1].mean()
    overall_f1 = metrics[:, 2].mean()
    overall_auc = metrics[:, 3].mean()

    print("\n==== OVERALL OUT-OF-FOLD METRICS ====")
    print(f"Precision: {overall_precision:.5f}")
    print(f"Recall:    {overall_recall:.5f}")
    print(f"F1 Score:  {overall_f1:.5f}")
    print(f"ROC AUC:   {overall_auc:.5f}")
    print("=====================================")

    return overall_precision, overall_recall, overall_f1, overall_auc, oof

print("[MAIN] Starting 5-fold CV with Random Forest...")
precision, recall, f1, auc, oof_predictions = cv_rf_oof_metrics(X, y, n_splits=5)

print("\n" + "="*60)
print("Overall Random Forest Performance (5-Fold CV):")
print("="*60)
print(f"Precision: {precision:.5f}")
print(f"Recall:    {recall:.5f}")
print(f"F1 Score:  {f1:.5f}")
print(f"ROC AUC:   {auc:.5f}")
print("="*60)
print("[MAIN] Finished.")


[INFO] Loading CSV...
[INFO] Data loaded: (17880, 18)
[MAIN] Preparing X and y...
[MAIN] Starting 5-fold CV with Random Forest...
[CV] Starting RandomForest CV with 5 folds

[CV] Building pipeline for this fold...
[PIPE] Building full FeatureUnion...
[CV] Fit-transform TRAIN data...
[NUM] Running numeric features...
[NUM] Output shape: (14304, 4)
[CAT WRAP] Fitting...
[CAT WRAP] Transforming...
[CAT] Running categorical features...
[CAT] Output shape: (14304, 5)
[TXT] Running textual features...
[SIM] Computing similarity for 14304 rows...
[EMBED] Loading sentence transformer on mps


Batches: 100%|██████████| 224/224 [03:43<00:00,  1.00it/s]


[SIM] Computing similarity for 14304 rows...


Batches: 100%|██████████| 224/224 [03:50<00:00,  1.03s/it]


[TXT] Output shape: (14304, 16)
[TFIDF] Combining text fields...
[TFIDF] Combined text length: 14304
[CV] TRAIN feature matrix shape: (14304, 5187)
[CV] Transform VAL data...
[NUM] Running numeric features...
[NUM] Output shape: (3576, 4)
[CAT WRAP] Transforming...
[CAT] Running categorical features...
[CAT] Output shape: (3576, 5)
[TXT] Running textual features...
[SIM] Computing similarity for 3576 rows...


Batches: 100%|██████████| 56/56 [00:59<00:00,  1.07s/it]


[SIM] Computing similarity for 3576 rows...


Batches: 100%|██████████| 56/56 [01:00<00:00,  1.08s/it]


[TXT] Output shape: (3576, 16)
[TFIDF] Combining text fields...
[TFIDF] Combined text length: 3576
[CV] VAL feature matrix shape: (3576, 5187)
[CV] Applying SVMSMOTE on train fold...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[CV] After SMOTE: X_train_res shape: (27222, 5187)
[CV] Training RandomForestClassifier...
[CV] Predicting on VAL fold...
[FOLD 1] Precision=0.97521, Recall=0.68208, F1=0.80272, AUC=0.98941

[CV] Building pipeline for this fold...
[PIPE] Building full FeatureUnion...
[CV] Fit-transform TRAIN data...
[NUM] Running numeric features...
[NUM] Output shape: (14304, 4)
[CAT WRAP] Fitting...
[CAT WRAP] Transforming...
[CAT] Running categorical features...
[CAT] Output shape: (14304, 5)
[TXT] Running textual features...
[SIM] Computing similarity for 14304 rows...

Batches: 100%|██████████| 224/224 [04:17<00:00,  1.15s/it]


[SIM] Computing similarity for 14304 rows...


Batches: 100%|██████████| 224/224 [04:23<00:00,  1.17s/it]


[TXT] Output shape: (14304, 16)
[TFIDF] Combining text fields...
[TFIDF] Combined text length: 14304
[CV] TRAIN feature matrix shape: (14304, 5188)
[CV] Transform VAL data...
[NUM] Running numeric features...
[NUM] Output shape: (3576, 4)
[CAT WRAP] Transforming...
[CAT] Running categorical features...
[CAT] Output shape: (3576, 5)
[TXT] Running textual features...
[SIM] Computing similarity for 3576 rows...


Batches: 100%|██████████| 56/56 [01:05<00:00,  1.17s/it]


[SIM] Computing similarity for 3576 rows...


Batches: 100%|██████████| 56/56 [01:09<00:00,  1.24s/it]


[TXT] Output shape: (3576, 16)
[TFIDF] Combining text fields...
[TFIDF] Combined text length: 3576
[CV] VAL feature matrix shape: (3576, 5188)
[CV] Applying SVMSMOTE on train fold...
[CV] After SMOTE: X_train_res shape: (27222, 5188)
[CV] Training RandomForestClassifier...
[CV] Predicting on VAL fold...
[FOLD 2] Precision=0.96825, Recall=0.70520, F1=0.81605, AUC=0.98085

[CV] Building pipeline for this fold...
[PIPE] Building full FeatureUnion...
[CV] Fit-transform TRAIN data...
[NUM] Running numeric features...
[NUM] Output shape: (14304, 4)
[CAT WRAP] Fitting...
[CAT WRAP] Transforming...
[CAT] Running categorical features...
[CAT] Output shape: (14304, 5)
[TXT] Running textual features...
[SIM] Computing similarity for 14304 rows...


Batches: 100%|██████████| 224/224 [04:25<00:00,  1.18s/it]


[SIM] Computing similarity for 14304 rows...


Batches: 100%|██████████| 224/224 [04:36<00:00,  1.23s/it]


[TXT] Output shape: (14304, 16)
[TFIDF] Combining text fields...
[TFIDF] Combined text length: 14304
[CV] TRAIN feature matrix shape: (14304, 5189)
[CV] Transform VAL data...
[NUM] Running numeric features...
[NUM] Output shape: (3576, 4)
[CAT WRAP] Transforming...
[CAT] Running categorical features...
[CAT] Output shape: (3576, 5)
[TXT] Running textual features...
[SIM] Computing similarity for 3576 rows...


Batches: 100%|██████████| 56/56 [01:15<00:00,  1.35s/it]


[SIM] Computing similarity for 3576 rows...


Batches: 100%|██████████| 56/56 [01:21<00:00,  1.45s/it]


[TXT] Output shape: (3576, 16)
[TFIDF] Combining text fields...
[TFIDF] Combined text length: 3576
[CV] VAL feature matrix shape: (3576, 5189)
[CV] Applying SVMSMOTE on train fold...
[CV] After SMOTE: X_train_res shape: (27222, 5189)
[CV] Training RandomForestClassifier...
[CV] Predicting on VAL fold...
[FOLD 3] Precision=0.96581, Recall=0.65318, F1=0.77931, AUC=0.98736

[CV] Building pipeline for this fold...
[PIPE] Building full FeatureUnion...
[CV] Fit-transform TRAIN data...
[NUM] Running numeric features...
[NUM] Output shape: (14304, 4)
[CAT WRAP] Fitting...
[CAT WRAP] Transforming...
[CAT] Running categorical features...
[CAT] Output shape: (14304, 5)
[TXT] Running textual features...
[SIM] Computing similarity for 14304 rows...


Batches: 100%|██████████| 224/224 [04:07<00:00,  1.11s/it]


[SIM] Computing similarity for 14304 rows...


Batches: 100%|██████████| 224/224 [04:41<00:00,  1.26s/it]


[TXT] Output shape: (14304, 16)
[TFIDF] Combining text fields...
[TFIDF] Combined text length: 14304
[CV] TRAIN feature matrix shape: (14304, 5186)
[CV] Transform VAL data...
[NUM] Running numeric features...
[NUM] Output shape: (3576, 4)
[CAT WRAP] Transforming...
[CAT] Running categorical features...
[CAT] Output shape: (3576, 5)
[TXT] Running textual features...
[SIM] Computing similarity for 3576 rows...


Batches: 100%|██████████| 56/56 [01:14<00:00,  1.33s/it]


[SIM] Computing similarity for 3576 rows...


Batches: 100%|██████████| 56/56 [06:47<00:00,  7.27s/it] 


[TXT] Output shape: (3576, 16)
[TFIDF] Combining text fields...
[TFIDF] Combined text length: 3576
[CV] VAL feature matrix shape: (3576, 5186)
[CV] Applying SVMSMOTE on train fold...
[CV] After SMOTE: X_train_res shape: (27222, 5186)
[CV] Training RandomForestClassifier...
[CV] Predicting on VAL fold...
[FOLD 4] Precision=0.97619, Recall=0.71098, F1=0.82274, AUC=0.98923

[CV] Building pipeline for this fold...
[PIPE] Building full FeatureUnion...
[CV] Fit-transform TRAIN data...
[NUM] Running numeric features...
[NUM] Output shape: (14304, 4)
[CAT WRAP] Fitting...
[CAT WRAP] Transforming...
[CAT] Running categorical features...
[CAT] Output shape: (14304, 5)
[TXT] Running textual features...
[SIM] Computing similarity for 14304 rows...


Batches: 100%|██████████| 224/224 [03:32<00:00,  1.06it/s]


[SIM] Computing similarity for 14304 rows...


Batches: 100%|██████████| 224/224 [04:33<00:00,  1.22s/it]


[TXT] Output shape: (14304, 16)
[TFIDF] Combining text fields...
[TFIDF] Combined text length: 14304
[CV] TRAIN feature matrix shape: (14304, 5188)
[CV] Transform VAL data...
[NUM] Running numeric features...
[NUM] Output shape: (3576, 4)
[CAT WRAP] Transforming...
[CAT] Running categorical features...
[CAT] Output shape: (3576, 5)
[TXT] Running textual features...
[SIM] Computing similarity for 3576 rows...


Batches: 100%|██████████| 56/56 [01:17<00:00,  1.38s/it]


[SIM] Computing similarity for 3576 rows...


Batches: 100%|██████████| 56/56 [01:14<00:00,  1.33s/it]


[TXT] Output shape: (3576, 16)
[TFIDF] Combining text fields...
[TFIDF] Combined text length: 3576
[CV] VAL feature matrix shape: (3576, 5188)
[CV] Applying SVMSMOTE on train fold...
[CV] After SMOTE: X_train_res shape: (27224, 5188)
[CV] Training RandomForestClassifier...
[CV] Predicting on VAL fold...
[FOLD 5] Precision=0.97368, Recall=0.63793, F1=0.77083, AUC=0.98328

==== OVERALL OUT-OF-FOLD METRICS ====
Precision: 0.97183
Recall:    0.67788
F1 Score:  0.79833
ROC AUC:   0.98603

Overall Random Forest Performance (5-Fold CV):
Precision: 0.97183
Recall:    0.67788
F1 Score:  0.79833
ROC AUC:   0.98603
[MAIN] Finished.
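
The overall figures in this log are labelled out-of-fold metrics: they are computed once from the pooled validation predictions, as in the XGBoost function in the next cell, rather than by averaging the per-fold scores. The two approaches generally give slightly different numbers. A minimal sketch of the difference follows; the `folds` counts and the `prf` helper are made up for illustration and are not taken from this run.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# hypothetical (tp, fp, fn) counts for three validation folds
folds = [(120, 4, 53), (113, 4, 60), (111, 3, 63)]

# option 1: score each fold separately, then average the scores
per_fold = [prf(*counts) for counts in folds]
mean_f1 = sum(f1 for _, _, f1 in per_fold) / len(per_fold)

# option 2 (out-of-fold): pool the counts first, then score once
tp, fp, fn = (sum(col) for col in zip(*folds))
pooled_precision, pooled_recall, pooled_f1 = prf(tp, fp, fn)

print(f"mean of fold F1s: {mean_f1:.5f}")
print(f"pooled OOF F1:    {pooled_f1:.5f}")
```

Pooling weights every posting equally, whereas averaging weights every fold equally; with stratified folds of near-identical size the gap is usually small.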


In [None]:
def cv_xgb_oof_metrics(X, y, n_splits=5):
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    oof_xgb = np.zeros(len(y), dtype=float)

    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y), 1):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        # build features per fold
        features = build_complete_pipeline()
        X_train_f = features.fit_transform(X_train)
        X_val_f = features.transform(X_val)

        # oversample minority class (fraud) on the train data
        smote = SVMSMOTE(random_state=42)
        X_res, y_res = smote.fit_resample(X_train_f, y_train)

        # define xgboost model (binary classifier)
        xgb = XGBClassifier(
            n_estimators=300,
            max_depth=4,
            learning_rate=0.1,
            subsample=0.8,
            colsample_bytree=0.8,
            objective="binary:logistic",
            eval_metric="logloss",
            tree_method="hist",
            n_jobs=-1,
            random_state=42
        )

        # fit on resampled training data
        xgb.fit(X_res, y_res)

        # out-of-fold predicted probabilities for class 1 (fraud)
        oof_xgb[val_idx] = xgb.predict_proba(X_val_f)[:, 1]

        # hard predictions for per-fold metrics
        y_pred = (oof_xgb[val_idx] >= 0.5).astype(int)

        fold_precision = precision_score(y_val, y_pred, zero_division=0)
        fold_recall = recall_score(y_val, y_pred, zero_division=0)
        fold_f1 = f1_score(y_val, y_pred, zero_division=0)
        fold_auc = roc_auc_score(y_val, oof_xgb[val_idx])

        print(
            f"[XGB] Fold {fold}: "
            f"Precision={fold_precision:.5f}, "
            f"Recall={fold_recall:.5f}, "
            f"F1={fold_f1:.5f}, "
            f"AUC={fold_auc:.5f}"
        )

    # overall metrics across all folds (using oof predictions)
    y_pred_overall = (oof_xgb >= 0.5).astype(int)
    overall_precision = precision_score(y, y_pred_overall, zero_division=0)
    overall_recall = recall_score(y, y_pred_overall, zero_division=0)
    overall_f1 = f1_score(y, y_pred_overall, zero_division=0)
    overall_auc = roc_auc_score(y, oof_xgb)

    return overall_precision, overall_recall, overall_f1, overall_auc, oof_xgb


In [None]:
# test 5-fold CV on XGBoost 
precision, recall, f1, auc, oof_predictions = cv_xgb_oof_metrics(X, y, n_splits=5)

print("\n" + "="*60)
print("Overall XGBoost Performance (5-Fold CV):")
print("="*60)
print(f"Precision: {precision:.5f}")
print(f"Recall:    {recall:.5f}")
print(f"F1 Score:  {f1:.5f}")
print(f"ROC AUC:   {auc:.5f}")
print("="*60)

#### Deep Learning Models

Because of compute constraints, we ran the deep learning models with 3 folds instead of 5. They also do not use minority oversampling; class imbalance is handled with class weights instead.
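
For reference, the `'balanced'` class weights follow the formula `n_samples / (n_classes * class_count)`. The sketch below reproduces the fold-1 weights printed in the BiLSTM training log further down, assuming that training fold held 11,342 legitimate and 578 fraudulent postings. Those counts are inferred from the logged weights and an 11,920-row training fold; the notebook does not print them. `balanced_class_weights` is an illustrative helper, not part of the pipeline.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in sorted(counts.items())}

# assumed fold composition: 11,342 legitimate (0) and 578 fraudulent (1) postings
labels = [0] * 11342 + [1] * 578
weights = balanced_class_weights(labels)
print(weights)
```

The rare class ends up with a weight roughly 20 times larger, so each fraudulent posting contributes about as much to the loss as 20 legitimate ones.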

_Bi-directional LSTM_
- incorporates engineered features + raw text columns

In [None]:
def build_hybrid_model(engineered_feature_dim, vocab_size=10000, max_length=200):
    # handle our feature engineering columns
    engineered_input = Input(shape=(engineered_feature_dim,), name='engineered_features')
    
    x_eng = Dense(512, activation='relu')(engineered_input)
    x_eng = Dropout(0.4)(x_eng)
    x_eng = Dense(256, activation='relu')(x_eng)
    x_eng = Dropout(0.3)(x_eng)
    
    # handle the text inputs for lstm to learn in addition to engineered features
    # shared embedding dimension
    embedding_dim = 128
    lstm_units = 128
    
    # text inputs
    title_input = Input(shape=(1,), dtype=tf.string, name='title')
    company_profile_input = Input(shape=(1,), dtype=tf.string, name='company_profile')
    description_input = Input(shape=(1,), dtype=tf.string, name='description')
    requirements_input = Input(shape=(1,), dtype=tf.string, name='requirements')
    benefits_input = Input(shape=(1,), dtype=tf.string, name='benefits')
    
    # shared text vectorization layer
    text_vectorizer = TextVectorization(
        max_tokens=vocab_size,
        output_mode='int',
        output_sequence_length=max_length
    )
    
    # shared embedding layer
    embedding_layer = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        mask_zero=True
    )
    
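    # process each text column through its own bi-lstm (vectorizer and embedding are shared)
    # note: recurrent_dropout > 0 rules out the fused cuDNN LSTM kernel, so GPU training is slower
    def process_text(text_input):
        x = text_vectorizer(text_input)
        x = embedding_layer(x)
        x = Bidirectional(LSTM(lstm_units, dropout=0.3, recurrent_dropout=0.2))(x)
        return x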
    
    title_features = process_text(title_input)
    company_profile_features = process_text(company_profile_input)
    description_features = process_text(description_input)
    requirements_features = process_text(requirements_input)
    benefits_features = process_text(benefits_input)
    
    # concatenate all text features
    x_text = Concatenate()([
        title_features,
        company_profile_features,
        description_features,
        requirements_features,
        benefits_features
    ])
    
    # combine engineered features and text features
    combined = Concatenate()([x_eng, x_text])
    
    x = Dense(256, activation='relu')(combined)
    x = Dropout(0.3)(x)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.3)(x)
    
    output = Dense(1, activation='sigmoid', name='output')(x)
    
    # build the model
    model = Model(
        inputs=[
            engineered_input,
            title_input,
            company_profile_input,
            description_input,
            requirements_input,
            benefits_input
        ],
        outputs=output
    )
    
    model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='binary_crossentropy',
        metrics=[
            'accuracy',  # included only as a sanity check
            tf.keras.metrics.AUC(name='auc'),
            tf.keras.metrics.Precision(name='precision'),
            tf.keras.metrics.Recall(name='recall')
        ]
    )
    
    return model, text_vectorizer


def cv_hybrid_with_class_weights(X, y, n_splits=5):
    """
    stratified k-fold cv with class weights (no smote);
    returns overall out-of-fold precision, recall, f1, auc and the oof probabilities
    """
    from sklearn.utils.class_weight import compute_class_weight
    
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    oof_predictions = np.zeros(len(y), dtype=float)
    
    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y), 1):
        print(f"\n{'='*60}")
        print(f"Fold {fold}/{n_splits}")
        print(f"{'='*60}")
        
        # split data
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
        
        # build features per fold
        print("Building engineered features...")
        feature_pipeline = build_complete_pipeline()
        X_train_engineered = feature_pipeline.fit_transform(X_train)
        X_val_engineered = feature_pipeline.transform(X_val)
        
        # convert to dense
        if hasattr(X_train_engineered, 'toarray'):
            X_train_engineered = X_train_engineered.toarray()
            X_val_engineered = X_val_engineered.toarray()
        
        # compute class weights
        class_weights = compute_class_weight(
            'balanced',
            classes=np.unique(y_train),
            y=y_train
        )
        class_weight_dict = {i: w for i, w in enumerate(class_weights)}
        print(f"Class weights: {class_weight_dict}")
        
        # prepare text data
        print("Preparing text data...")
        # convert to numpy arrays of strings (required for keras with mixed input types)
        X_train_text = {
            'title': np.array(X_train['title'].fillna('').astype(str).tolist(), dtype=object),
            'company_profile': np.array(X_train['company_profile'].fillna('').astype(str).tolist(), dtype=object),
            'description': np.array(X_train['description'].fillna('').astype(str).tolist(), dtype=object),
            'requirements': np.array(X_train['requirements'].fillna('').astype(str).tolist(), dtype=object),
            'benefits': np.array(X_train['benefits'].fillna('').astype(str).tolist(), dtype=object)
        }

        X_val_text = {
            'title': np.array(X_val['title'].fillna('').astype(str).tolist(), dtype=object),
            'company_profile': np.array(X_val['company_profile'].fillna('').astype(str).tolist(), dtype=object),
            'description': np.array(X_val['description'].fillna('').astype(str).tolist(), dtype=object),
            'requirements': np.array(X_val['requirements'].fillna('').astype(str).tolist(), dtype=object),
            'benefits': np.array(X_val['benefits'].fillna('').astype(str).tolist(), dtype=object)
        }
        
        # build model
        print("Building hybrid model...")
        model, text_vectorizer = build_hybrid_model(
            engineered_feature_dim=X_train_engineered.shape[1]
        )
        
        # adapt text vectorizer to training data
        all_text = np.concatenate([
            X_train['title'].fillna('').values,
            X_train['company_profile'].fillna('').values,
            X_train['description'].fillna('').values,
            X_train['requirements'].fillna('').values,
            X_train['benefits'].fillna('').values
        ])
        text_vectorizer.adapt(all_text)
        
        # prepare inputs
        train_inputs = {
            'engineered_features': X_train_engineered,
            **X_train_text
        }
        
        val_inputs = {
            'engineered_features': X_val_engineered,
            **X_val_text
        }
        
        # train the model
        print("Training model...")
        history = model.fit(
            train_inputs,
            y_train,
            validation_data=(val_inputs, y_val),
            class_weight=class_weight_dict,  # handles imbalance
            epochs=3, # use 3 epochs for validation since 20 takes too long
            batch_size=32,
            verbose=1,
            callbacks=[
                EarlyStopping(monitor='val_auc', patience=2, mode='max', restore_best_weights=True),  # patience=2 for 3 epochs
                ReduceLROnPlateau(monitor='val_loss', patience=1, factor=0.5)  # patience=1 for 3 epochs
            ]
        )
        
        # predict
        oof_predictions[val_idx] = model.predict(val_inputs, verbose=0).flatten()
        
        # metrics
        y_pred = (oof_predictions[val_idx] >= 0.5).astype(int)
        fold_precision = precision_score(y_val, y_pred)
        fold_recall = recall_score(y_val, y_pred)
        fold_f1 = f1_score(y_val, y_pred)
        fold_auc = roc_auc_score(y_val, oof_predictions[val_idx])
        
        print(f"\n[Hybrid] Fold {fold}: Precision={fold_precision:.5f}, Recall={fold_recall:.5f}, F1={fold_f1:.5f}, AUC={fold_auc:.5f}")
        
        # clear memory
        del model
        tf.keras.backend.clear_session()
    
    # overall metrics
    y_pred_overall = (oof_predictions >= 0.5).astype(int)
    overall_precision = precision_score(y, y_pred_overall)
    overall_recall = recall_score(y, y_pred_overall)
    overall_f1 = f1_score(y, y_pred_overall)
    overall_auc = roc_auc_score(y, oof_predictions)
    
    return overall_precision, overall_recall, overall_f1, overall_auc, oof_predictions

In [None]:
# test 3-fold CV on LSTM
precision, recall, f1, auc, oof_predictions = cv_hybrid_with_class_weights(X, y, n_splits=3)

print("\n" + "="*60)
print("Overall Bi-directional LSTM Performance (3-Fold CV):")  # 3 folds for validation since 5 folds is too slow
print("="*60)
print(f"Precision: {precision:.5f}")
print(f"Recall:    {recall:.5f}")
print(f"F1 Score:  {f1:.5f}")
print(f"ROC AUC:   {auc:.5f}")
print("="*60)


Fold 1/3
Building engineered features...


[INFO] Using cuda for sentence transformer


Class weights: {0: 0.5254805149003703, 1: 10.311418685121108}
Preparing text data...
Building hybrid model...


I0000 00:00:1763270710.209881      19 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 12944 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
I0000 00:00:1763270710.210487      19 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13942 MB memory:  -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5


Training model...
Epoch 1/3
373/373 ━━━━━━━━━━━━━━━━━━━━ 2017s 5s/step - accuracy: 0.8783 - auc: 0.8851 - loss: 0.4120 - precision: 0.2559 - recall: 0.7127 - val_accuracy: 0.9151 - val_auc: 0.9910 - val_loss: 0.1633 - val_precision: 0.3610 - val_recall: 0.9826 - learning_rate: 0.0010
Epoch 2/3
373/373 ━━━━━━━━━━━━━━━━━━━━ 1966s 5s/step - accuracy: 0.9586 - auc: 0.9918 - loss: 0.0904 - precision: 0.5417 - recall: 0.9731 - val_accuracy: 0.9757 - val_auc: 0.9810 - val_loss: 0.0759 - val_precision: 0.6857 - val_recall: 0.9167 - learning_rate: 0.0010
Epoch 3/3
373/373 ━━━━━━━━━━━━━━━━━━━━ 1956s 5s/step - accuracy: 0.9906 - auc: 0.9988 - loss: 0.0246 - precision: 0.8425 - recall: 0.9961 - val_accuracy: 0.9836 - val_auc: 0.9767 - val_loss: 0.0642 - val_precision: 0.7969 - val_recall: 0.8854 - learning_rate: 0.0010

[Hybrid] Fold 1: Precision=0.36097, Recall=0.98264, F1=0.52799, AUC=0.99210

F

Class weights: {0: 0.5254341884862911, 1: 10.329289428076256}
Preparing text data...
Building hybrid model...
Training model...
Epoch 1/3
373/373 ━━━━━━━━━━━━━━━━━━━━ 1999s 5s/step - accuracy: 0.9008 - auc: 0.8598 - loss: 0.4352 - precision: 0.2654 - recall: 0.6615 - val_accuracy: 0.9413 - val_auc: 0.9915 - val_loss: 0.1243 - val_precision: 0.4512 - val_recall: 0.9758 - learning_rate: 0.0010
Epoch 2/3
373/373 ━━━━━━━━━━━━━━━━━━━━ 1963s 5s/step - accuracy: 0.9685 - auc: 0.9940 - loss: 0.0925 - precision: 0.6145 - recall: 0.9664 - val_accuracy: 0.9628 - val_auc: 0.9838 - val_loss: 0.0877 - val_precision: 0.5714 - val_recall: 0.9273 - learning_rate: 0.0010
Epoch 3/3
373/373 ━━━━━━━━━━━━━━━━━━━━ 1955s 5s/step - accuracy: 0.9785 - auc: 0.9917 - loss: 0.0747 - precision: 0.6941 - recall: 0.9727 - val_accuracy: 0.9404 - val_auc: 0.9861 - val_loss: 0.1826 - val_precision: 0.4476 - val_recall:

Class weights: {0: 0.5254341884862911, 1: 10.329289428076256}
Preparing text data...
Building hybrid model...
Training model...
Epoch 1/3
373/373 ━━━━━━━━━━━━━━━━━━━━ 1980s 5s/step - accuracy: 0.8447 - auc: 0.8918 - loss: 0.4462 - precision: 0.2474 - recall: 0.7996 - val_accuracy: 0.9332 - val_auc: 0.9922 - val_loss: 0.1574 - val_precision: 0.4183 - val_recall: 0.9654 - learning_rate: 0.0010
Epoch 2/3
373/373 ━━━━━━━━━━━━━━━━━━━━ 1948s 5s/step - accuracy: 0.9695 - auc: 0.9959 - loss: 0.0788 - precision: 0.6227 - recall: 0.9697 - val_accuracy: 0.9661 - val_auc: 0.9916 - val_loss: 0.0697 - val_precision: 0.5969 - val_recall: 0.9273 - learning_rate: 0.0010
Epoch 3/3
373/373 ━━━━━━━━━━━━━━━━━━━━ 1945s 5s/step - accuracy: 0.9856 - auc: 0.9978 - loss: 0.0437 - precision: 0.7798 - recall: 0.9934 - val_accuracy: 0.9824 - val_auc: 0.9840 - val_loss: 0.0634 - val_precision: 0.7674 - val_recall:

_DistilBERT_
- fine-tuned solely on the text columns, to see how it fares against our customised BiLSTM
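
Since DistilBERT receives a single string per posting, the text columns have to be flattened into one input. One caveat when doing this with a plain `str()` cast is that missing values turn into the literal token `nan`. A minimal sketch of a join that skips missing fields instead is shown here; `join_fields`, `is_missing` and the sample row are hypothetical, and this is an alternative to the cell that follows, not a description of it.

```python
import math

SEP = " [SEP] "

def is_missing(value):
    """True for None, float NaN, or blank strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()

def join_fields(row, columns, sep=SEP):
    """Join the non-missing values of `columns` from a dict-like row."""
    parts = [str(row.get(col)).strip() for col in columns if not is_missing(row.get(col))]
    return sep.join(parts)

# hypothetical posting with one NaN field and one absent field
row = {"title": "Marketing Intern", "department": float("nan"), "description": "Food52 ..."}
combined = join_fields(row, ["title", "department", "description", "benefits"])
print(combined)
```

Dropping empty fields also leaves more of the 256-token budget for the longer free-text columns, at the cost of losing the fixed field positions.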

In [None]:
# config
CSV_PATH   = "../data/fake_job_postings.csv"
MODEL_NAME = "distilbert-base-uncased"

N_SPLITS   = 3
EPOCHS     = 3
BATCH_SIZE = 16
MAX_LENGTH = 256
LR        = 2e-5
RANDOM_SEED = 42

np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

# load data
print("[INFO] Loading CSV...")
df = pd.read_csv(CSV_PATH)
print("[INFO] Data loaded:", df.shape)

# target
labels = df["fraudulent"].astype(int).values

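# build text input (distilbert will see only this)
# note: str() turns missing values (NaN) into the literal string "nan",
# which is why "nan" tokens show up in the combined text sample
def make_text(row):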
    parts = [
        str(row.get("title", "")),
        str(row.get("location", "")),
        str(row.get("department", "")),
        str(row.get("employment_type", "")),
        str(row.get("required_experience", "")),
        str(row.get("required_education", "")),
        str(row.get("industry", "")),
        str(row.get("function", "")),
        str(row.get("company_profile", "")),
        str(row.get("description", "")),
        str(row.get("requirements", "")),
        str(row.get("benefits", "")),
    ]
    return " [SEP] ".join(parts)

print("[INFO] Building combined text column...")
df["combined_text"] = df.apply(make_text, axis=1)
texts = df["combined_text"].fillna("").tolist()

print("[INFO] Example text sample:\n", texts[0][:500], "...\n")

# dataset class
class JobPostingDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.texts = list(texts)
        self.labels = list(labels)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = int(self.labels[idx])

        encoding = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )

        item = {key: val.squeeze(0) for key, val in encoding.items()}
        item["labels"] = torch.tensor(label, dtype=torch.long)
        return item

# device + tokenizer
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print(f"[INFO] Using device: {device}")

print("[INFO] Loading DistilBERT tokenizer...")
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)

# train / eval helpers
def train_one_epoch(model, train_loader, optimizer, device, epoch, num_epochs):
    model.train()
    total_loss = 0.0

    print(f"\n[TRAIN] Epoch {epoch+1}/{num_epochs}")
    for step, batch in enumerate(train_loader, start=1):
        batch = {k: v.to(device) for k, v in batch.items()}

        outputs = model(**batch)
        loss = outputs.loss

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        total_loss += loss.item()

        if step % 50 == 0 or step == 1:
            print(f"  Step {step:4d} | Batch loss: {loss.item():.4f}")

    avg_loss = total_loss / len(train_loader)
    print(f"[TRAIN] Epoch {epoch+1} finished. Avg loss: {avg_loss:.4f}")
    return avg_loss


def eval_on_loader(model, val_loader, device):
    model.eval()
    all_labels = []
    all_probs  = []

    print("\n[EVAL] Running on validation set...")
    with torch.no_grad():
        for batch in val_loader:
            # labels on cpu for metrics
            labels = batch["labels"].numpy()
            all_labels.append(labels)

            batch = {k: v.to(device) for k, v in batch.items()}

            outputs = model(**batch)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()

            all_probs.append(probs)

    all_labels = np.concatenate(all_labels)
    all_probs  = np.concatenate(all_probs)
    preds = (all_probs >= 0.5).astype(int)

    precision = precision_score(all_labels, preds)
    recall    = recall_score(all_labels, preds)
    f1        = f1_score(all_labels, preds)
    auc       = roc_auc_score(all_labels, all_probs)

    print(f"[EVAL] Precision: {precision:.5f}")
    print(f"[EVAL] Recall:    {recall:.5f}")
    print(f"[EVAL] F1:        {f1:.5f}")
    print(f"[EVAL] ROC AUC:   {auc:.5f}")

    return precision, recall, f1, auc

# 3-fold stratified cross-validation
skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_SEED)

fold_metrics = []

for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels), start=1):
    print("\n" + "=" * 60)
    print(f"[FOLD {fold}/{N_SPLITS}]")
    print("=" * 60)

    # split texts and labels for this fold
    X_train_texts = [texts[i] for i in train_idx]
    X_val_texts   = [texts[i] for i in val_idx]
    y_train       = labels[train_idx]
    y_val         = labels[val_idx]

    print(f"[FOLD {fold}] Train size: {len(X_train_texts)} | Val size: {len(X_val_texts)}")

    # build datasets and loaders
    train_dataset = JobPostingDataset(X_train_texts, y_train, tokenizer, max_length=MAX_LENGTH)
    val_dataset   = JobPostingDataset(X_val_texts,   y_val,   tokenizer, max_length=MAX_LENGTH)

    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader   = DataLoader(val_dataset,   batch_size=BATCH_SIZE, shuffle=False)

    # re-init model + optimizer for this fold
    print(f"[FOLD {fold}] Loading DistilBERT model...")
    model = DistilBertForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=2,
    )
    model.to(device)

    optimizer = AdamW(model.parameters(), lr=LR)

    best_f1 = 0.0
    best_metrics = None

    # train for epochs on this fold
    for epoch in range(EPOCHS):
        train_one_epoch(model, train_loader, optimizer, device, epoch, EPOCHS)
        precision, recall, f1, auc = eval_on_loader(model, val_loader, device)

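        # always record the first epoch so best_metrics is never None, even if f1 stays at 0
        if best_metrics is None or f1 > best_f1: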
            print(f"[FOLD {fold}] New best F1: {f1:.5f} (prev {best_f1:.5f})")
            best_f1 = f1
            best_metrics = (precision, recall, f1, auc)

    print(f"\n[FOLD {fold}] Best metrics:")
    print(f"  Precision: {best_metrics[0]:.5f}")
    print(f"  Recall:    {best_metrics[1]:.5f}")
    print(f"  F1:        {best_metrics[2]:.5f}")
    print(f"  ROC AUC:   {best_metrics[3]:.5f}")

    fold_metrics.append(best_metrics)

# summary across folds
fold_metrics = np.array(fold_metrics)

mean_precision = fold_metrics[:, 0].mean()
mean_recall    = fold_metrics[:, 1].mean()
mean_f1        = fold_metrics[:, 2].mean()
mean_auc       = fold_metrics[:, 3].mean()

std_precision  = fold_metrics[:, 0].std()
std_recall     = fold_metrics[:, 1].std()
std_f1         = fold_metrics[:, 2].std()
std_auc        = fold_metrics[:, 3].std()

print("\n" + "=" * 60)
print("  3-FOLD CV SUMMARY (Best per fold, then mean ± std)")
print("=" * 60)
for i, (p, r, f1, auc) in enumerate(fold_metrics, start=1):
    print(f"Fold {i}: P={p:.5f}, R={r:.5f}, F1={f1:.5f}, AUC={auc:.5f}")

print("\n[MEAN ± STD]")
print(f"Precision: {mean_precision:.5f} ± {std_precision:.5f}")
print(f"Recall:    {mean_recall:.5f} ± {std_recall:.5f}")
print(f"F1:        {mean_f1:.5f} ± {std_f1:.5f}")
print(f"ROC AUC:   {mean_auc:.5f} ± {std_auc:.5f}")
print("=" * 60)


[INFO] Loading CSV...
[INFO] Data loaded: (17880, 18)
[INFO] Building combined text column...
[INFO] Example text sample:
 Marketing Intern [SEP] US, NY, New York [SEP] Marketing [SEP] Other [SEP] Internship [SEP] nan [SEP] nan [SEP] Marketing [SEP] We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them supe ...

[INFO] Using device: mps
[INFO] Loading DistilBERT tokenizer...

[FOLD 1/3]
[FOLD 1] Train size: 11920 | Val size: 5960
[FOLD 1] Loading DistilBERT model...


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.


[TRAIN] Epoch 1/3
  Step    1 | Batch loss: 0.6608
  Step   50 | Batch loss: 0.0417
  Step  100 | Batch loss: 0.0609
  Step  150 | Batch loss: 0.2594
  Step  200 | Batch loss: 0.2317
  Step  250 | Batch loss: 0.0163
  Step  300 | Batch loss: 0.0138
  Step  350 | Batch loss: 0.0062
  Step  400 | Batch loss: 0.0325
  Step  450 | Batch loss: 0.0130
  Step  500 | Batch loss: 0.0492
  Step  550 | Batch loss: 0.1361
  Step  600 | Batch loss: 0.0094
  Step  650 | Batch loss: 0.0274
  Step  700 | Batch loss: 0.0097
[TRAIN] Epoch 1 finished. Avg loss: 0.1183

[EVAL] Running on validation set...
[EVAL] Precision: 0.74917
[EVAL] Recall:    0.78819
[EVAL] F1:        0.76819
[EVAL] ROC AUC:   0.98164
[FOLD 1] New best F1: 0.76819 (prev 0.00000)

[TRAIN] Epoch 2/3
  Step    1 | Batch loss: 0.3793
  Step   50 | Batch loss: 0.0038
  Step  100 | Batch loss: 0.0376
  Step  150 | Batch loss: 0.2105
  Step  200 | Batch loss: 0.0141
  Step  250 | Batch loss: 0.0287
  Step  300 | Batch loss: 0.1520
  Step 

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.


[TRAIN] Epoch 1/3
  Step    1 | Batch loss: 0.7212
  Step   50 | Batch loss: 0.3663
  Step  100 | Batch loss: 0.0228
  Step  150 | Batch loss: 0.3797
  Step  200 | Batch loss: 0.1537
  Step  250 | Batch loss: 0.0425
  Step  300 | Batch loss: 0.0364
  Step  350 | Batch loss: 0.3200
  Step  400 | Batch loss: 0.0084
  Step  450 | Batch loss: 0.0220
  Step  500 | Batch loss: 0.5117
  Step  550 | Batch loss: 0.0079
  Step  600 | Batch loss: 0.0135
  Step  650 | Batch loss: 0.1147
  Step  700 | Batch loss: 0.0340
[TRAIN] Epoch 1 finished. Avg loss: 0.1194

[EVAL] Running on validation set...
[EVAL] Precision: 0.95775
[EVAL] Recall:    0.47059
[EVAL] F1:        0.63109
[EVAL] ROC AUC:   0.96852
[FOLD 2] New best F1: 0.63109 (prev 0.00000)

[TRAIN] Epoch 2/3
  Step    1 | Batch loss: 0.2329
  Step   50 | Batch loss: 0.0126
  Step  100 | Batch loss: 0.0280
  Step  150 | Batch loss: 0.0123
  Step  200 | Batch loss: 0.0282
  Step  250 | Batch loss: 0.0309
  Step  300 | Batch loss: 0.0995
  Step 

KeyboardInterrupt: 

### Full Training

We trained the final BiLSTM and DistilBERT models on a stratified 80/20 split of the full dataset.

_BiLSTM_

In [None]:
# train final bi-lstm model on full dataset with train/test split
def train_final_model(X, y, test_size=0.2, epochs=15, patience=5, save_dir='models'):
    os.makedirs(save_dir, exist_ok=True)
    
    print("="*60)
    print("FINAL MODEL TRAINING - BI-DIRECTIONAL LSTM")
    print("="*60)
    
    # stratified train/test split
    print("\n[1/6] Splitting data...")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, 
        test_size=test_size, 
        random_state=42, 
        stratify=y
    )
    
    print(f"Train samples: {len(X_train):,} ({len(X_train)/len(X)*100:.1f}%)")
    print(f"Test samples:  {len(X_test):,} ({len(X_test)/len(X)*100:.1f}%)")
    print(f"Train fraud rate: {y_train.mean()*100:.2f}%")
    print(f"Test fraud rate:  {y_test.mean()*100:.2f}%")
    
    # build feature pipeline
    print("\n[2/6] Building feature pipeline on train set...")
    pipeline = build_complete_pipeline()
    X_train_engineered = pipeline.fit_transform(X_train)
    X_test_engineered = pipeline.transform(X_test)
    
    print(f"Feature pipeline fitted!")
    print(f"Engineered features shape: {X_train_engineered.shape}")
    
    if hasattr(X_train_engineered, 'toarray'):
        print("Converting sparse matrices to dense...")
        X_train_engineered = X_train_engineered.toarray()
        X_test_engineered = X_test_engineered.toarray()
    
    # compute class weights
    print("\n[3/6] Computing class weights...")
    from sklearn.utils.class_weight import compute_class_weight
    class_weights = compute_class_weight(
        'balanced',
        classes=np.unique(y_train),
        y=y_train
    )
    class_weight_dict = {i: w for i, w in enumerate(class_weights)}
    print(f"Class weights: {class_weight_dict}")
    
    # prepare text data
    print("\n[4/6] Preparing text data...")
    
    X_train_text = {
        'title': np.array(X_train['title'].fillna('').astype(str).tolist(), dtype=object),
        'company_profile': np.array(X_train['company_profile'].fillna('').astype(str).tolist(), dtype=object),
        'description': np.array(X_train['description'].fillna('').astype(str).tolist(), dtype=object),
        'requirements': np.array(X_train['requirements'].fillna('').astype(str).tolist(), dtype=object),
        'benefits': np.array(X_train['benefits'].fillna('').astype(str).tolist(), dtype=object)
    }
    
    X_test_text = {
        'title': np.array(X_test['title'].fillna('').astype(str).tolist(), dtype=object),
        'company_profile': np.array(X_test['company_profile'].fillna('').astype(str).tolist(), dtype=object),
        'description': np.array(X_test['description'].fillna('').astype(str).tolist(), dtype=object),
        'requirements': np.array(X_test['requirements'].fillna('').astype(str).tolist(), dtype=object),
        'benefits': np.array(X_test['benefits'].fillna('').astype(str).tolist(), dtype=object)
    }
    
    print("Text data prepared!")
    
    # build and train model
    print("\n[5/6] Building hybrid Bi-LSTM model...")
    model, text_vectorizer = build_hybrid_model(
        engineered_feature_dim=X_train_engineered.shape[1]
    )
    
    print("Adapting text vectorizer to training data...")
    all_train_text = np.concatenate([
        X_train['title'].fillna('').values,
        X_train['company_profile'].fillna('').values,
        X_train['description'].fillna('').values,
        X_train['requirements'].fillna('').values,
        X_train['benefits'].fillna('').values
    ])
    text_vectorizer.adapt(all_train_text)
    
    train_inputs = {
        'engineered_features': X_train_engineered,
        **X_train_text
    }
    
    test_inputs = {
        'engineered_features': X_test_engineered,
        **X_test_text
    }
    
    callbacks = [
        EarlyStopping(
            monitor='val_auc', 
            patience=patience, 
            mode='max', 
            restore_best_weights=True,
            verbose=1
        ),
        ReduceLROnPlateau(
            monitor='val_loss', 
            patience=3, 
            factor=0.5,
            min_lr=0.00001,
            verbose=1
        ),
        ModelCheckpoint(
            filepath=os.path.join(save_dir, 'best_model.h5'),
            monitor='val_auc',
            mode='max',
            save_best_only=True,
            verbose=1
        )
    ]
    
    print(f"\nTraining model for up to {epochs} epochs...")
    print(f"Early stopping patience: {patience} epochs")
    print(f"Validation split: 10% of training data")
    print("="*60)
    
    history = model.fit(
        train_inputs,
        y_train,
        validation_split=0.1,
        class_weight=class_weight_dict,
        epochs=epochs,
        batch_size=32,
        verbose=1,
        callbacks=callbacks
    )
    
    # evaluate on test set
    print("\n[6/6] Evaluating on test set...")
    
    y_pred_proba = model.predict(test_inputs, verbose=0).flatten()
    y_pred = (y_pred_proba >= 0.5).astype(int)
    
    test_auc = roc_auc_score(y_test, y_pred_proba)
    test_precision = precision_score(y_test, y_pred)
    test_recall = recall_score(y_test, y_pred)
    test_f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraudulent'])
    
    test_metrics = {
        'auc': test_auc,
        'precision': test_precision,
        'recall': test_recall,
        'f1': test_f1,
        'confusion_matrix': conf_matrix,
        'classification_report': class_report
    }
    
    print("\n" + "="*60)
    print("FINAL TEST SET EVALUATION")
    print("="*60)
    print(f"Test AUC:       {test_auc:.5f}")
    print(f"Test Precision: {test_precision:.5f}")
    print(f"Test Recall:    {test_recall:.5f}")
    print(f"Test F1:        {test_f1:.5f}")
    
    print("\nConfusion Matrix:")
    print("                 Predicted")
    print("              Legit   Fraud")
    print(f"Actual Legit  {conf_matrix[0][0]:5d}   {conf_matrix[0][1]:5d}")
    print(f"       Fraud  {conf_matrix[1][0]:5d}   {conf_matrix[1][1]:5d}")
    
    print("\nClassification Report:")
    print(class_report)
    
    print("="*60)
    print("TRAINING COMPLETE")
    print("="*60)
    
    # compare with CV result
    cv_auc = 0.99184
    auc_diff = test_auc - cv_auc
    print(f"\nPerformance Comparison:")
    print(f"   CV AUC:   {cv_auc:.5f}")
    print(f"   Test AUC: {test_auc:.5f}")
    print(f"   Difference: {auc_diff:+.5f}")
    
    if test_auc >= 0.990:
        print("\nTest performance matches CV. Model is ready.")
    elif test_auc >= 0.985:
        print("\nTest performance is strong. Consider if further tuning is needed.")
    else:
        print("\nTest performance dropped significantly. Consider hyperparameter tuning.")
    
    print("="*60)
    
    return model, pipeline, test_metrics, history, y_pred_proba, y_test


In [None]:
model, pipeline, test_metrics, history, y_pred_proba, y_test = train_final_model(
    X, y,
    test_size=0.2,
    epochs=15, # max 15 epochs
    patience=5, # stop if no improvement for 5 epochs
    save_dir='models'
)

FINAL MODEL TRAINING - BI-DIRECTIONAL LSTM

[1/6] Splitting data...
Train samples: 14,304 (80.0%)
Test samples:  3,576 (20.0%)
Train fraud rate: 4.84%
Test fraud rate:  4.84%

[2/6] Building feature pipeline on train set...
[INFO] Using cuda for sentence transformer
Feature pipeline fitted!
Engineered features shape: (14304, 5187)
Converting sparse matrices to dense...

[3/6] Computing class weights...
Class weights: {0: 0.5254573506722504, 1: 10.32034632034632}

[4/6] Preparing text data...
Text data prepared!

[5/6] Building hybrid Bi-LSTM model...


I0000 00:00:1763375839.097009      19 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 12944 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
I0000 00:00:1763375839.097610      19 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13942 MB memory:  -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5


Adapting text vectorizer to training data...

Training model for up to 15 epochs...
Early stopping patience: 5 epochs
Validation split: 10% of training data
Epoch 1/15
403/403 ━━━━━━━━━━━━━━━━━━━━ 0s 5s/step - accuracy: 0.8863 - auc: 0.8724 - loss: 0.4083 - precision: 0.2412 - recall: 0.6763
Epoch 1: val_auc improved from -inf to 0.98039, saving model to /kaggle/working/best_model.h5
403/403 ━━━━━━━━━━━━━━━━━━━━ 2084s 5s/step - accuracy: 0.8863 - auc: 0.8726 - loss: 0.4080 - precision: 0.2414 - recall: 0.6767 - val_accuracy: 0.9154 - val_auc: 0.9804 - val_loss: 0.1714 - val_precision: 0.3118 - val_recall: 0.9298 - learning_rate: 0.0010
Epoch 2/15
403/403 ━━━━━━━━━━━━━━━━━━━━ 0s 5s/step - accuracy: 0.9663 - auc: 0.9956 - loss: 0.0831 - precision: 0.6039 - recall: 0.9635
Epoch 2: val_auc did not improve from 0.98039
403/403 ━━━━━━━━━━━━━━━━━━━━ 2020s 5s/step - accuracy: 0.9663 - auc: 0.9955 - loss: 0.0832 - precision: 0.6038 - recall: 0.9635 - val_accuracy: 0.9448 - val_auc: 0.9632 - val_loss: 0.1247 - val_precision: 0.4127 - val_recall: 0.9123 - learning_rate: 0.0010
Epoch 3/15
403/403 ━━━━━━━━━━━━━━━━━━━━ 0s 5s/step - accuracy: 0.9768 - auc: 0.9965 - loss: 0.0556 - precision: 0.6839 - recall: 0.9893
Epoch 3: val_auc did not improve
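
The class weights printed in step [3/6] follow scikit-learn's `'balanced'` heuristic, `n_samples / (n_classes * class_count)`. The sketch below reproduces that arithmetic in plain Python. The helper name `balanced_class_weights` is hypothetical, and the class counts (13,611 legitimate and 693 fraudulent postings) are an assumption inferred from the printed weights and the 14,304 training rows; the notebook does not print them directly.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * cnt)
        for cls, cnt in sorted(counts.items())
    }


# Assumed class counts (inferred, not printed by the notebook):
# 13,611 legitimate (0) and 693 fraudulent (1) postings, 14,304 rows in total.
y_train_example = [0] * 13611 + [1] * 693

weights = balanced_class_weights(y_train_example)
print(weights)
```

Under these assumed counts, the minority class receives roughly twenty times the weight of the majority class, which is consistent with the high training recall and lower precision seen in the epoch logs.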

_DistilBERT_

In [None]:
# target
labels = df["fraudulent"].astype(int).values

# build text input - one big text per row
def make_text(row):
    parts = [
        str(row.get("title", "")),
        str(row.get("location", "")),
        str(row.get("department", "")),
        str(row.get("employment_type", "")),
        str(row.get("required_experience", "")),
        str(row.get("required_education", "")),
        str(row.get("industry", "")),
        str(row.get("function", "")),
        str(row.get("company_profile", "")),
        str(row.get("description", "")),
        str(row.get("requirements", "")),
        str(row.get("benefits", "")),
    ]
    # join with separators so bert gets some structure
    return " [SEP] ".join(parts)

print("[INFO] Building combined text column...")
df["combined_text"] = df.apply(make_text, axis=1)

texts = df["combined_text"].fillna("").tolist()
print("[INFO] Example text sample:\n", texts[0][:500], "...\n")

# train-val split
print("[INFO] Splitting train/validation...")
X_train_texts, X_val_texts, y_train, y_val = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,
)

print("[INFO] Train size:", len(X_train_texts))
print("[INFO] Val size:  ", len(X_val_texts))

# dataset + dataloader
class JobPostingDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.texts = list(texts)
        self.labels = list(labels)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = int(self.labels[idx])

        encoding = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )

        # squeeze to remove batch dimension
        item = {key: val.squeeze(0) for key, val in encoding.items()}
        item["labels"] = torch.tensor(label, dtype=torch.long)
        return item


print("[INFO] Loading DistilBERT tokenizer...")
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

train_dataset = JobPostingDataset(X_train_texts, y_train, tokenizer, max_length=256)
val_dataset   = JobPostingDataset(X_val_texts,   y_val,   tokenizer, max_length=256)

BATCH_SIZE = 16

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader   = DataLoader(val_dataset,   batch_size=BATCH_SIZE, shuffle=False)

# model + device
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print(f"[INFO] Using device: {device}")

print("[INFO] Loading DistilBERT classification head...")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
)
model.to(device)

optimizer = AdamW(model.parameters(), lr=2e-5)

# training + evaluation loops
EPOCHS = 3

def train_one_epoch(epoch):
    model.train()
    total_loss = 0.0

    print(f"\n[TRAIN] Epoch {epoch+1}/{EPOCHS}")
    for step, batch in enumerate(train_loader, start=1):
        batch = {k: v.to(device) for k, v in batch.items()}

        outputs = model(**batch)
        loss = outputs.loss

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        total_loss += loss.item()

        if step % 50 == 0 or step == 1:
            print(f"  Step {step:4d} | Batch loss: {loss.item():.4f}")

    avg_loss = total_loss / len(train_loader)
    print(f"[TRAIN] Epoch {epoch+1} finished. Avg loss: {avg_loss:.4f}")
    return avg_loss


def eval_on_val():
    model.eval()
    all_labels = []
    all_probs  = []

    print("\n[EVAL] Running on validation set...")
    with torch.no_grad():
        for batch in val_loader:
            labels = batch["labels"].numpy()
            batch = {k: v.to(device) for k, v in batch.items()}

            outputs = model(**batch)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()

            all_labels.append(labels)
            all_probs.append(probs)

    all_labels = np.concatenate(all_labels)
    all_probs  = np.concatenate(all_probs)
    preds = (all_probs >= 0.5).astype(int)

    precision = precision_score(all_labels, preds)
    recall    = recall_score(all_labels, preds)
    f1        = f1_score(all_labels, preds)
    auc       = roc_auc_score(all_labels, all_probs)

    print(f"[EVAL] Precision: {precision:.5f}")
    print(f"[EVAL] Recall:    {recall:.5f}")
    print(f"[EVAL] F1:        {f1:.5f}")
    print(f"[EVAL] ROC AUC:   {auc:.5f}")

    return precision, recall, f1, auc


# full training run
best_f1 = 0.0
best_state_dict = None

for epoch in range(EPOCHS):
    train_one_epoch(epoch)
    precision, recall, f1, auc = eval_on_val()

    if f1 > best_f1:
        print(f"[MODEL] New best F1: {f1:.5f} (prev {best_f1:.5f}). Saving weights in memory.")
        best_f1 = f1
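        # state_dict() returns references to the live parameters, so clone them;
        # otherwise later epochs would overwrite the "best" snapshot in place
        best_state_dict = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}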

print("\n======================================")
print("  TRAINING COMPLETE")
print(f"  Best validation F1: {best_f1:.5f}")  # report the best epoch, not the last one
print("======================================")

if best_state_dict is not None:
    model.load_state_dict(best_state_dict)
    save_path = "distilbert_fakejobs_finetuned"
    print(f"[SAVE] Saving best model to: {save_path}")
    model.save_pretrained(save_path)
    tokenizer.save_pretrained(save_path)
    print("[SAVE] Done.")


[INFO] Loading CSV...
[INFO] Data loaded: (17880, 18)
[INFO] Building combined text column...
[INFO] Example text sample:
 Marketing Intern [SEP] US, NY, New York [SEP] Marketing [SEP] Other [SEP] Internship [SEP] nan [SEP] nan [SEP] Marketing [SEP] We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them supe ...

[INFO] Splitting train/validation...
[INFO] Train size: 14304
[INFO] Val size:   3576
[INFO] Loading DistilBERT tokenizer...
[INFO] Using device: mps
[INFO] Loading DistilBERT classification head...


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.


[TRAIN] Epoch 1/3
  Step    1 | Batch loss: 0.6872
  Step   50 | Batch loss: 0.2436
  Step  100 | Batch loss: 0.1258
  Step  150 | Batch loss: 0.0801
  Step  200 | Batch loss: 0.1151
  Step  250 | Batch loss: 0.1016
  Step  300 | Batch loss: 0.0682
  Step  350 | Batch loss: 0.2627
  Step  400 | Batch loss: 0.0625
  Step  450 | Batch loss: 0.0052
  Step  500 | Batch loss: 0.0358
  Step  550 | Batch loss: 0.0273
  Step  600 | Batch loss: 0.0193
  Step  650 | Batch loss: 0.0165
  Step  700 | Batch loss: 0.0263
  Step  750 | Batch loss: 0.0132
  Step  800 | Batch loss: 0.0122
  Step  850 | Batch loss: 0.0047
[TRAIN] Epoch 1 finished. Avg loss: 0.1120

[EVAL] Running on validation set...
[EVAL] Precision: 0.89542
[EVAL] Recall:    0.79191
[EVAL] F1:        0.84049
[EVAL] ROC AUC:   0.98648
[MODEL] New best F1: 0.84049 (prev 0.00000). Saving weights in memory.

[TRAIN] Epoch 2/3
  Step    1 | Batch loss: 0.0062
  Step   50 | Batch loss: 0.0248
  Step  100 | Batch loss: 0.0072
  Step  150 | 
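
The example text sample in the output above contains literal `nan` tokens because `str()` applied to a missing value yields the string `"nan"`. A minimal sketch of a NaN-safe alternative follows. The helper names `is_missing` and `join_fields` are hypothetical and not part of the notebook, and the sketch only handles `None` and float NaN (which covers NumPy float NaN, since `numpy.float64` subclasses `float`).

```python
import math


def is_missing(value):
    """True for None and for float NaN (the usual pandas missing marker)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def join_fields(row, fields, sep=" [SEP] "):
    """Join the selected fields, mapping missing values to empty strings."""
    parts = ["" if is_missing(row.get(f)) else str(row.get(f)) for f in fields]
    return sep.join(parts)


# Toy row with one missing field; a dict stands in for a DataFrame row here.
row = {
    "title": "Marketing Intern",
    "required_education": float("nan"),
    "function": "Marketing",
}
fields = ["title", "required_education", "function"]

naive = " [SEP] ".join(str(row.get(f, "")) for f in fields)
clean = join_fields(row, fields)

print(naive)
print(clean)
```

Keeping an empty slot for a missing field preserves field positions in the joined text; dropping the slot entirely is an equally reasonable choice. Either way, changing the input text would require retraining, so the metrics logged above apply only to the original `make_text`.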

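In the validation logs shown above, the BiLSTM favors recall (about 0.91 to 0.93, at a precision of 0.31 to 0.41), while the fine-tuned DistilBERT favors precision (about 0.90, at a recall of 0.79). Either can serve as the final model depending on the use case: the BiLSTM when missing a fraudulent posting is costlier, DistilBERT when false alarms are costlier. Please refer to our report for further discussion.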
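
Both training cells binarize predicted probabilities at a fixed 0.5 cutoff (`y_pred_proba >= 0.5` and `all_probs >= 0.5`), so part of the precision/recall balance of either model can be adjusted by moving that cutoff. The sketch below illustrates the effect on toy data. The helper `precision_recall_at` and the labels and probabilities are made up for illustration and are not results from this notebook.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Compute (precision, recall) for the positive class at a given cutoff."""
    tp = fp = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels and scores (illustrative only).
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.85, 0.95]

results = {thr: precision_recall_at(y_true, y_prob, thr) for thr in (0.3, 0.5, 0.7)}
for thr, (p, r) in results.items():
    print(f"threshold={thr:.1f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data, raising the cutoff increases precision and lowers recall. On real data recall can only fall or stay flat as the cutoff rises, while precision usually, but not always, increases, so the cutoff should be chosen on a held-out validation set rather than on the test split.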