# 2. Proposed Approaches for Classification/Detection

In [4]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

In [18]:
recruitment_df_processed = pd.read_csv('recruitment_df_processed.csv')

As stated earlier, the dataset is highly imbalanced: only 5% of observations are tagged as fraudulent. We handle this imbalance at the model level by using class weights.
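To make the class-weight idea concrete, the sketch below shows what scikit-learn's `"balanced"` setting computes, using a hypothetical toy label vector with the same 5% fraud rate (the array `y_toy` is an illustration, not the real data). Each class is weighted by `n_samples / (n_classes * class_count)`, so the rare class counts for more in the loss.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels: 95 legitimate (0) and 5 fraudulent (1) postings
y_toy = np.array([0] * 95 + [1] * 5)

# "balanced" weights each class by n_samples / (n_classes * class_count)
classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weights = dict(zip(classes.tolist(), weights.tolist()))

# Fraudulent postings receive a weight of 100 / (2 * 5) = 10,
# legitimate ones 100 / (2 * 95), roughly 0.53
print(class_weights)
```

Passing `class_weight="balanced"` to `LogisticRegression`, as the models in this section do, applies the same weighting internally.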

## 2A. Classification Models with Features and Empirical Rules

The baseline for this fraud detection task is a Logistic Regression model trained on the features given in the dataset together with a list of empirical rules.

EDA showed that features such as required experience, required education, and whether the job advertisement had a company profile, logo or screening questions differed between fraudulent and non-fraudulent job advertisements, so the natural step is to use these features as regressors. Derived features such as the lengths of the company profile, description and requirements, together with other signals like money in the description and money in the title, also differed significantly between the two groups, so we include them in the logistic regression as well.

### Preparing Features

List of features used:
1. Binary Features: `has_company_profile`, `has_company_logo`, `has_questions`, `money_in_desc`, `money_in_title`, `url_in_description`, `consecutive_punct`
2. Categorical Features: `employment_type`, `required_experience`, `required_education`
3. Numerical Features: `company_profile_stripped_len`, `description_stripped_len`, `requirements_stripped_len`

These features have been generated above in the EDA stage and the code below prepares the features for model training (e.g. converting all binary features to '1's and '0's).

In [19]:
df = recruitment_df_processed.copy()

df["fraud_target"] = df["fraudulent"].map({'t':1, 'f':0}).astype(int)

# --- Feature groups ---
bin_cols = ["has_company_profile", "has_company_logo", "has_questions", 
            "money_in_desc", "money_in_title", 
            "url_in_description", "consecutive_punct"]

cat_cols = ["employment_type", "required_experience", "required_education"]

num_cols = ["company_profile_stripped_len", "description_stripped_len", "requirements_stripped_len"]

# --- Binary features: coerce to {0,1} ---
to01 = {
    True: 1, False: 0,
    't': 1, 'f': 0, 'T': 1, 'F': 0,
    'true': 1, 'false': 0, 'TRUE': 1, 'FALSE': 0,
    '1': 1, '0': 0,
    1: 1, 0: 0
}
for b in bin_cols:
    df[b] = df[b].map(to01)  # map known representations
    df[b] = df[b].fillna(0).astype(int)  # default missing → 0

# --- Categorical features: fill NAs with "Unknown" ---
for c in cat_cols:
    df[c] = df[c].fillna("Unknown")

# --- Target ---
y = df["fraud_target"]

# --- Preprocessing pipeline ---
preprocessor = ColumnTransformer(
    transformers=[
        ("bin", "passthrough", bin_cols),  # already 0/1
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ]
)

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(class_weight="balanced", max_iter=1000))
])

# --- Train/test split (if training) ---
X = df[bin_cols + cat_cols + num_cols]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
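
The cell above defines the pipeline and the split but stops before fitting. A minimal sketch of the fit-and-evaluate step follows, run on synthetic data so that it is self-contained. The data frame `X_demo`, its generated values, and the names `demo_clf`, `X_tr`, `X_te` are assumptions for illustration only; with the real data the equivalent step would be `clf.fit(X_train, y_train)` followed by the same metrics on `X_test`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the processed dataset (hypothetical values, ~5% fraud)
rng = np.random.default_rng(42)
n = 1000
y_demo = (rng.random(n) < 0.05).astype(int)
logo_prob = np.where(y_demo == 1, 0.3, 0.8)  # assumption: fraud ads show a logo less often
X_demo = pd.DataFrame({
    "has_company_logo": (rng.random(n) < logo_prob).astype(int),
    "employment_type": rng.choice(["Full-time", "Part-time", "Unknown"], size=n),
    "description_stripped_len": rng.normal(1200, 300, size=n) - 400 * y_demo,
})

# Same structure as the pipeline above, reduced to one column per feature group
demo_clf = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("bin", "passthrough", ["has_company_logo"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["employment_type"]),
        ("num", StandardScaler(), ["description_stripped_len"]),
    ])),
    ("classifier", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
demo_clf.fit(X_tr, y_tr)

# Accuracy is misleading at ~5% positives, so report per-class precision/recall
# and average precision (a summary of the precision-recall curve)
y_pred = demo_clf.predict(X_te)
y_score = demo_clf.predict_proba(X_te)[:, 1]
ap = average_precision_score(y_te, y_score)
print(classification_report(y_te, y_pred, digits=3, zero_division=0))
print(f"Average precision: {ap:.3f}")
```

Precision and recall on the fraudulent class are likely to be more informative here than overall accuracy, since a model that predicts "not fraudulent" for every advertisement would already score about 95% accuracy.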

## 2B. TF-IDF and Classification Models

While the previous classification model uses metadata features, term frequency-inverse document frequency (TF-IDF) captures the wording of the job title, company profile, description and requirements. TF-IDF is a simple way to let the model learn patterns from the write-ups before proceeding to more advanced methods such as word embeddings and LLMs.
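
As a small illustration of how TF-IDF weights terms, the sketch below fits scikit-learn's `TfidfVectorizer` with default settings on three hypothetical one-line advertisements (the documents are made up for this example). A word that appears in every document, such as "work", receives a lower weight than a word that appears in only one, such as "money".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not taken from the dataset
docs = [
    "earn money fast work from home",
    "software engineer work with python",
    "data analyst work with sql",
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)  # sparse matrix: one row per document
vocab = vec.vocabulary_          # maps each term to its column index

# Compare two terms that each occur once in the first document
row0 = tfidf[0].toarray().ravel()
money_weight = row0[vocab["money"]]
work_weight = row0[vocab["work"]]

# "work" occurs in all three documents, so its inverse document frequency
# (and therefore its weight) is lower than that of "money"
print(f"money: {money_weight:.3f}, work: {work_weight:.3f}")
```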

### Preparing Features

We previously used BeautifulSoup to clean up the job description. Here we also build a composite text by concatenating the job title, company profile, description and requirements before fitting the TF-IDF transformation. We combine these fields so that the logistic regression model can treat the entire advertisement as a single document.

In [20]:
# --- Build composite text (title + profile + description + requirements) ---
text_cols = ["title", "company_profile_stripped", "description_stripped", "requirements_stripped"]

df["text_all"] = df[text_cols].fillna("").agg(" ".join, axis=1)

# (Optional) light cleanup: collapse whitespace
df["text_all"] = df["text_all"].str.replace(r"\s+", " ", regex=True).str.strip()

X_text = df["text_all"]
y = df["fraud_target"]

In [None]:
# from sklearn.feature_extraction.text import TfidfVectorizer  # needed by the pipeline below

# # --- Train/test split (stratified) ---
# X_train, X_test, y_train, y_test = train_test_split(
#     X_text, y, test_size=0.2, random_state=42, stratify=y
# )

# # --- TF-IDF + class-weighted Logistic Regression pipeline ---
# tfidf_lr = Pipeline([
#     ("tfidf", TfidfVectorizer(
#         lowercase=True,
#         stop_words="english",    # remove common stopwords
#         ngram_range=(1, 2),      # unigrams + bigrams catch short phrases
#         max_df=0.9,              # drop terms in >90% of docs (too common)
#         min_df=5,                # keep terms that appear in at least 5 docs
#         max_features=50000       # cap dimensionality (tune as needed)
#     )),
#     ("clf", LogisticRegression(
#         class_weight="balanced", # handle class imbalance at the model level
#         max_iter=2000,
#         solver="liblinear"       # good default for sparse features (supports L1 and L2)
#         # try 'saga' if you want elastic-net regularization
#     ))
# ])
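
The pipeline in the cell above is commented out and has not been run. As a hedged sketch of how such a pipeline behaves end to end, the example below fits a similar TF-IDF plus class-weighted Logistic Regression pipeline on a hypothetical ten-document toy corpus. The texts, the labels and the names `toy_tfidf_lr`, `legit_ads` and `fraud_ads` are assumptions for illustration, and `min_df` is set to 1 and `max_df`/`max_features` are dropped because the corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 8 legitimate ads (label 0) and 2 fraudulent ones (label 1)
legit_ads = [
    "software engineer backend team python",
    "senior accountant audit team finance",
    "registered nurse hospital night shift",
    "marketing manager brand campaigns",
    "data analyst reporting sql dashboards",
    "mechanical engineer design manufacturing",
    "customer support specialist software product",
    "project manager construction team",
]
fraud_ads = [
    "earn money fast work home no experience",
    "earn cash quickly wire transfer home",
]
toy_texts = legit_ads + fraud_ads
toy_labels = [0] * len(legit_ads) + [1] * len(fraud_ads)

toy_tfidf_lr = Pipeline([
    ("tfidf", TfidfVectorizer(
        lowercase=True,
        stop_words="english",
        ngram_range=(1, 2),  # unigrams + bigrams
        min_df=1,            # keep every term: the toy corpus is tiny
    )),
    ("clf", LogisticRegression(
        class_weight="balanced",
        max_iter=2000,
        solver="liblinear",
    )),
])
toy_tfidf_lr.fit(toy_texts, toy_labels)

# Score two unseen ads: one reusing fraud-like wording, one legitimate-like wording
new_ads = ["earn money fast home", "software engineer backend team"]
fraud_proba = toy_tfidf_lr.predict_proba(new_ads)[:, 1]
for ad, p in zip(new_ads, fraud_proba):
    print(f"{p:.3f}  {ad}")
```

On the real data, the split would use `X_text` and `y` from the previous cell, and the vocabulary limits (`min_df`, `max_df`, `max_features`) would need tuning.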

## 2C. Combination of TF-IDF, Features and Empirical Rules

## 2D. Word Embeddings and LLMs