# 2. Proposed Approaches for Classification/Detection

In [18]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

from datasets import Dataset
from transformers import AutoTokenizer

In [19]:
recruitment_df_processed = pd.read_csv('recruitment_df_processed.csv')

As stated earlier, the dataset is highly imbalanced, with only about 5% of observations tagged as fraudulent. We handle this imbalance at the model level using class weights.
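
To make the class-weight idea concrete, here is a minimal sketch of how scikit-learn's `"balanced"` heuristic weights each class as `n_samples / (n_classes * class_count)`. The 95/5 toy labels are an assumption chosen to mimic the imbalance described above; they are not the real dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels: 95 non-fraudulent (0) and 5 fraudulent (1)
toy_labels = np.array([0] * 95 + [1] * 5)

# "balanced" weights: n_samples / (n_classes * count_of_each_class)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=toy_labels
)
class_weights = dict(zip([0, 1], weights))

# The rare class gets a weight of 10.0, the common class about 0.53
print(class_weights)
```

Passing `class_weight="balanced"` to `LogisticRegression` applies the same weighting to the loss, so errors on fraudulent advertisements count for more during training.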

## 2A. Classification Models with Features and Empirical Rules

Our baseline for this fraud detection task is a Logistic Regression model trained on the features given in the dataset together with a set of empirical, rule-based features.

Through EDA, we found that features such as required experience, required education and whether the job advertisement had a company profile, logo or screening questions differed between fraudulent and non-fraudulent job advertisements, so the natural step is to use them as regressors. We also found that derived features, such as the lengths of the company profile, description and requirements, together with other signals like money in the description or title, differed significantly between the two classes, so we include these in the logistic regression as well.

### Preparing Features

List of features used:
1. Binary Features: `has_company_profile`, `has_company_logo`, `has_questions`, `money_in_desc`, `money_in_title`, `url_in_description`, `consecutive_punct`
2. Categorical Features: `employment_type`, `required_experience`, `required_education`, `industry`
3. Numerical Features: `company_profile_stripped_len`, `description_stripped_len`, `requirements_stripped_len`

These features have been generated above in the EDA stage and the code below prepares the features for model training (e.g. converting all binary features to '1's and '0's).

In [20]:
df = recruitment_df_processed.copy()

df["fraud_target"] = df["fraudulent"].map({'t':1, 'f':0}).astype(int)

# --- Feature groups ---
bin_cols = ["has_company_profile", "has_company_logo", "has_questions", 
            "money_in_desc", "money_in_title", 
            "url_in_description", "consecutive_punct"]

cat_cols = ["employment_type", "required_experience", "required_education", "industry"]

num_cols = ["company_profile_stripped_len", "description_stripped_len", "requirements_stripped_len"]

# --- Binary features: coerce to {0,1} ---
to01 = {
    True: 1, False: 0,
    't': 1, 'f': 0, 'T': 1, 'F': 0,
    'true': 1, 'false': 0, 'TRUE': 1, 'FALSE': 0,
    '1': 1, '0': 0,
    1: 1, 0: 0
}
for b in bin_cols:
    df[b] = df[b].map(to01)
    df[b] = df[b].fillna(0).astype(int) 

# --- Categorical features: fill NAs with "Unknown" ---
for c in cat_cols:
    df[c] = df[c].fillna("Unknown")

# --- Target ---
y = df["fraud_target"]

# --- Preprocessing pipeline ---
preprocessor = ColumnTransformer(
    transformers=[
        ("bin", "passthrough", bin_cols),  # already 0/1
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ]
)

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(class_weight="balanced", max_iter=1000))
])

# --- Train/test split ---
X = df[bin_cols + cat_cols + num_cols]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

## 2B. TF-IDF and Classification Models

While the previous classification model uses metadata features, term frequency-inverse document frequency (TF-IDF) captures the wording of the job title, company profile, description, requirements and benefits. TF-IDF is a simple way of letting the model learn patterns from the write-ups before we proceed to more advanced methods like word embeddings and LLMs.

### Preparing Features

We previously used BeautifulSoup to clean up the job description. Here we also build a composite text by concatenating the job title, company profile, description, requirements and benefits before fitting the TF-IDF transformation. We combine these texts so that the logistic regression model can treat the entire advertisement as a single document.

We also convert all text to lowercase to ensure that the model does not treat capitalised and non-capitalised words like "Job" and "job" as different tokens. This reduces sparsity and improves generalisation.

We also let the vectoriser pick up frequent two-word phrases, not just single words, by setting `ngram_range=(1, 2)`.

In [21]:
# --- Build composite text (title + profile + description + requirements + benefits) ---
text_cols = ["title", "company_profile_stripped", "description_stripped", "requirements_stripped", "benefits_stripped"]

df["text_all"] = (
    df[text_cols]
    .fillna("")
    .agg(" ".join, axis=1)
    .str.lower()   # lowercase everything
)

X_text = df["text_all"]
y = df["fraud_target"]

In [22]:
# --- Train/test split (stratified) ---
X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42, stratify=y
)

# --- TF-IDF + class-weighted Logistic Regression pipeline ---
tfidf_lr = Pipeline([
    ("tfidf", TfidfVectorizer(
        lowercase=True,          # lowercase everything 
        stop_words="english",    # remove common stopwords
        ngram_range=(1, 2),      # unigrams + bigrams catch short phrases
        max_df=0.9,              # drop terms in >90% of docs (too common)
        min_df=5,                # keep terms that appear in at least 5 docs
        max_features=200         # cap dimensionality
    )),
    ("clf", LogisticRegression(
        class_weight="balanced", # handle class imbalance at the model level
        max_iter=2000,
        solver="liblinear"
    ))
])

The cell above defines a pipeline that vectorises the composite texts into a TF-IDF feature matrix and feeds it into a class-weighted logistic regression.

## 2C. Combination of TF-IDF, Features and Empirical Rules

In the following cell, we combine the TF-IDF scores, the standard features and the rule-based features. By combining them, we hope the model will learn both the distinctive patterns in the text and the non-linguistic signals that TF-IDF cannot capture.

One limitation of combining both sets of features is that the large number of TF-IDF features might overshadow the structured features, causing the model to underweight the non-linguistic signals. If feature importance suggests this is the case after running the model, we can adjust the number of features in the TF-IDF vector (`max_features`).


In [23]:
# ---------- Build X, y ----------
text_col = "text_all"
X = pd.concat([df[[text_col]], df[bin_cols + cat_cols + num_cols]], axis=1)
y = df["fraud_target"]

# ---------- Preprocessor: text + categoricals + numerics + binaries ----------
preprocessor = ColumnTransformer(
    transformers=[
        ("txt", TfidfVectorizer(
            lowercase=True,
            stop_words="english",
            ngram_range=(1, 2),
            max_df=0.9,
            min_df=5,
            max_features=200
        ), text_col),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
        ("bin", "passthrough", bin_cols),
    ],
    remainder="drop"
)

# ---------- Class-weighted Logistic Regression (combined model) ----------
combined_lr = Pipeline([
    ("prep", preprocessor),
    ("clf", LogisticRegression(
        class_weight="balanced",
        max_iter=2000,
        solver="liblinear"
    ))
])

# ---------- Train/test split (stop here if only preparing) ----------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

## 2D. Word Embeddings and Transformer Models

For our last method, we tokenise the composite texts with a model-specific tokenizer and fine-tune a transformer model on them, using the fraudulent/non-fraudulent label as the target, to predict whether a job advertisement is fraudulent.

We choose encoder-only transformer models over generative large language models (LLMs) because they are smaller and more efficient to fine-tune on labelled data, and here we are working with labelled fraud data. Encoder-only transformers also use bidirectional attention, which lets them capture linguistic patterns across the whole composite text of a job advertisement, whereas generative LLMs are better suited to text generation.

### Preparing Data

To prepare our data, we create the composite text field. This time, we do 3 things differently from before:
1. We prepend a tag to each component of the composite text. If the text belongs to the job title feature, we put `[JOB TITLE]` in front of it, and if the text belongs to the job description feature, we put `[JOB DESCRIPTION]` in front of it. This helps the transformer model better understand the linguistic patterns in the composite text. We don't do this for the TF-IDF model because these tags would inflate term frequencies without helping the TF-IDF model pick up linguistic cues.
2. We also include `required_education` and `required_experience` this time because the transformer model can pick up these cues through its self-attention mechanism. In the TF-IDF vector, by contrast, they would only add to the term frequencies of a few words without carrying linguistic significance, so we excluded them from the composite text there.
3. We do not convert the text to lowercase because RoBERTa is pretrained on cased text.
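
The tagging scheme described above can be sketched for a single record as follows. The helper `build_tagged_text` and the example record are hypothetical and only illustrate the format; missing fields are treated as empty strings so that one absent field does not wipe out the whole composite.

```python
import math

# Field name -> section tag, in the order the composite text is assembled
SECTION_TAGS = [
    ("title", "[JOB TITLE]"),
    ("company_profile_stripped", "[COMPANY PROFILE]"),
    ("description_stripped", "[JOB DESCRIPTION]"),
    ("requirements_stripped", "[JOB REQUIREMENTS]"),
    ("benefits_stripped", "[BENEFITS]"),
    ("required_education", "[REQUIRED EDUCATION]"),
    ("required_experience", "[REQUIRED EXPERIENCE]"),
]

def build_tagged_text(record: dict) -> str:
    """Hypothetical helper: join tagged fields, treating missing values as empty."""
    parts = []
    for field, tag in SECTION_TAGS:
        value = record.get(field)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            value = ""
        parts.append(f"{tag} {str(value).strip()}")
    # Collapse repeated whitespace; casing is left untouched
    return " ".join(" ".join(parts).split())

example_record = {
    "title": "Data Entry Clerk",
    "company_profile_stripped": float("nan"),  # missing profile
    "description_stripped": "Earn $500 daily!  ",
    "required_education": "Unknown",
}
tagged = build_tagged_text(example_record)
print(tagged)
```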

In [24]:
# Normalize all source fields used in the composite
text_fields = [
    "title", "company_profile_stripped", "description_stripped", "requirements_stripped",
    "benefits_stripped", "required_education", "required_experience"
]

# Make empty categories explicit
df["required_education"]  = df["required_education"].replace({"": "Unknown"}).fillna("Unknown")
df["required_experience"] = df["required_experience"].replace({"": "Unknown"}).fillna("Unknown")

# Fill missing text with empty strings; otherwise a single NaN field would turn
# the whole concatenated composite below into NaN
df[text_fields] = df[text_fields].fillna("").astype(str)

# Use explicit section tags so the encoder gets structure cues
df["text"] = (
    "[JOB TITLE] "          + df["title"].str.strip() + " "
    "[COMPANY PROFILE] "    + df["company_profile_stripped"].str.strip() + " "
    "[JOB DESCRIPTION] "    + df["description_stripped"].str.strip() + " "
    "[JOB REQUIREMENTS] "   + df["requirements_stripped"].str.strip() + " "
    "[BENEFITS] "           + df["benefits_stripped"].str.strip() + " "
    "[REQUIRED EDUCATION] " + df["required_education"].str.strip() + " "
    "[REQUIRED EXPERIENCE] "+ df["required_experience"].str.strip()
).str.replace(r"\s+", " ", regex=True).str.strip()

# Ensure text is clean strings
df["text"] = df["text"].fillna("").astype(str)
df["label"] = df["fraud_target"].astype(int)

In [25]:
X_train_text, X_val_text, y_train, y_val = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# ======= RoBERTa-specific tokenization =======
train_ds = Dataset.from_pandas(pd.DataFrame({"text": X_train_text.values, "label": y_train.values}))
val_ds   = Dataset.from_pandas(pd.DataFrame({"text": X_val_text.values,   "label": y_val.values}))

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=256
    )

train_tok = train_ds.map(tokenize, batched=True, remove_columns=["text"]).with_format("torch")
val_tok   = val_ds.map(tokenize,   batched=True, remove_columns=["text"]).with_format("torch")

print("Prepared RoBERTa tokenized datasets.")
print("Train examples:", len(train_tok), " | Val examples:", len(val_tok))

Map: 100%|██████████| 14304/14304 [00:04<00:00, 3570.15 examples/s]
Map: 100%|██████████| 3576/3576 [00:01<00:00, 3406.71 examples/s]

Prepared RoBERTa tokenized datasets.
Train examples: 14304  | Val examples: 3576





## Evaluation of approaches 2A - 2D

Having chosen 2A as a baseline model, we can see whether subsequent approaches improve model performance. 

Given that the dataset is highly imbalanced, with less than 5% of observations tagged as fraudulent, accuracy is not a sufficient metric: a simplistic model that predicts non-fraudulent for every observation would achieve an accuracy of ~95%. Hence we propose using precision and recall instead.
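
The accuracy problem can be seen in a small sketch. The 95/5 toy labels below are an assumption that mirrors the imbalance described above, and the "model" simply predicts non-fraudulent for everything.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical toy labels: 95 non-fraudulent (0) and 5 fraudulent (1)
y_true_toy = np.array([0] * 95 + [1] * 5)
y_pred_all_negative = np.zeros_like(y_true_toy)  # always predict non-fraudulent

acc = accuracy_score(y_true_toy, y_pred_all_negative)
prec = precision_score(y_true_toy, y_pred_all_negative, zero_division=0)
rec = recall_score(y_true_toy, y_pred_all_negative, zero_division=0)
f1 = f1_score(y_true_toy, y_pred_all_negative, zero_division=0)

# Accuracy looks strong (0.95) even though no fraudulent ad is caught
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```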

In fraud detection, the cost of leaving up fraudulent job advertisements is higher than the cost of wrongly taking down non-fraudulent ones, so minimising false negatives is important. However, we still need to balance precision and recall, because maximising recall alone would produce too many false positives.

Hence we should maximise the F1-score, which balances out the importance of precision and recall. 

After selecting the best performing model based on the F1-score, we will calibrate the classification threshold using the precision-recall curve. PR-AUC summarises the whole curve in a single threshold-independent number, so it cannot by itself pick a threshold; instead, we choose the point on the curve that gives the highest F1-score, so that the model catches as many fraudulent job advertisements as possible (high recall) while minimising false alarms (high precision).
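
A minimal sketch of threshold selection from the precision-recall curve is shown below. The labels and predicted probabilities are invented for illustration, and picking the F1-maximising threshold is one reasonable criterion rather than the only option.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical labels and predicted fraud probabilities
y_true_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_scores_toy = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true_toy, y_scores_toy)

# precision/recall have one more entry than thresholds; drop the final point
p, r = precision[:-1], recall[:-1]
f1_per_threshold = 2 * p * r / (p + r + 1e-12)  # small epsilon avoids 0/0

best_idx = int(np.argmax(f1_per_threshold))
best_threshold = float(thresholds[best_idx])
best_f1 = float(f1_per_threshold[best_idx])

# Average precision is a threshold-independent summary of the PR curve
pr_auc = average_precision_score(y_true_toy, y_scores_toy)
print(f"best threshold={best_threshold:.2f}, F1={best_f1:.3f}, PR-AUC={pr_auc:.3f}")
```

Advertisements with a predicted probability at or above `best_threshold` would then be flagged as fraudulent.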