                                                    Notebook 03 — Modeling (Classification)

This notebook applies supervised classification algorithms. The goal is to predict whether a crime results in an arrest.

The target variable is:

- Arrest = 1 => an arrest was made

- Arrest = 0 => no arrest was made

All data preprocessing steps were completed in Notebook 02 — Data Preparation.

The main objectives of this notebook are: 

- Train one baseline classification model

- Train one improved classification model

- Compare their performance using standard classification metrics

In [1]:
import sys

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Make the project root importable so the local utils package can be found
sys.path.append("..")

from utils.visualization import metrics_table, compute_metrics



**Data Loading** 

The training and test sets were generated in Notebook 02 (Data Preparation):

- X_train, X_test: feature matrices

- y_train, y_test: target vectors (Arrest)


In [2]:
X_train = pd.read_csv("../data/X_train.csv")
X_test  = pd.read_csv("../data/X_test.csv")
y_train = pd.read_csv("../data/y_train.csv").squeeze()
y_test  = pd.read_csv("../data/y_test.csv").squeeze()


All datasets are already cleaned, encoded, scaled, and split.
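
The `.squeeze()` calls in the loading cell matter because `pd.read_csv` returns a DataFrame even for a one-column file, while scikit-learn estimators expect a 1-D target. A minimal sketch of the difference, using an in-memory CSV with a hypothetical `Arrest` header as a stand-in for the real `y_train.csv` (whose exact layout is an assumption here):

```python
import io

import pandas as pd

# Hypothetical stand-in for y_train.csv: a single column with a header row.
csv_text = "Arrest\n0\n1\n0\n"

y_frame = pd.read_csv(io.StringIO(csv_text))             # 2-D DataFrame
y_series = pd.read_csv(io.StringIO(csv_text)).squeeze()  # 1-D Series

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)
```

If a target file could ever contain a single row, `.squeeze("columns")` is the safer variant, since it only collapses the column axis and never returns a bare scalar.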

In [3]:
X_train.shape, X_test.shape

((1079526, 41), (269882, 41))

    Baseline Model — Logistic Regression

Why Logistic Regression?

Logistic Regression is one of the simplest models for classification.

We use it as a baseline because: 

- It trains fast

- It provides a reference performance for comparison

**Model Training**

Our dataset is imbalanced (most crimes do not lead to an arrest). We saw this in Notebook 02, where the target variable distribution shows:

- Arrest = 0 : 4 875 610 cases

- Arrest = 1 : 1 871 430 cases.

So we apply class weighting during model training (class_weight='balanced') to give more weight to arrest cases and keep the model from favoring the majority class (Arrest = 0).
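
To make the effect of `class_weight='balanced'` concrete, the sketch below applies scikit-learn's documented heuristic, `n_samples / (n_classes * count_per_class)`, to the class counts quoted above. This is only an illustration: the weights actually used in training are derived from `y_train`, whose class proportions may differ slightly from the full-dataset counts.

```python
# Class counts quoted above from Notebook 02 (full dataset, not y_train).
counts = {0: 4_875_610, 1: 1_871_430}

n_samples = sum(counts.values())
n_classes = len(counts)

# scikit-learn's "balanced" heuristic:
#   weight_c = n_samples / (n_classes * count_c)
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

for c, w in weights.items():
    print(f"Arrest = {c}: weight = {w:.4f}")
```

With these counts, each arrest case carries about 2.6 times the weight of a non-arrest case, which is simply the ratio of the two class sizes.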

In [4]:
# Define and train the Logistic Regression model
logistic_model = LogisticRegression(
    max_iter=1000,
    class_weight='balanced',
    random_state=42
)


logistic_model.fit(X_train, y_train)

**Predictions**

In [5]:
y_pred_log = logistic_model.predict(X_test)

**Evaluation Metrics**

The model is evaluated using four metrics:

- Accuracy => how often the model is correct

- Precision => when the model predicts an arrest, how often it is right

- Recall => how many real arrests the model detects

- F1-score => balance between precision and recall
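
The four definitions above can be written out from the confusion-matrix counts. The sketch below is a plain-Python illustration on made-up toy labels; the helper name `classification_metrics` is hypothetical, and the notebook itself relies on `sklearn.metrics` and the project's `compute_metrics` helper.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = arrest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # arrests correctly predicted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed arrests
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-arrests correctly predicted

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, made up for illustration only (not taken from the dataset).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

toy_metrics = classification_metrics(y_true, y_pred)
print(toy_metrics)
```

In this toy case the classifier finds 2 of the 4 real arrests (recall 0.5) and 2 of its 3 arrest predictions are right (precision of about 0.67), which shows how accuracy (0.7) can look acceptable while recall stays low.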

In [6]:
log_m = compute_metrics(y_test, y_pred_log, "Logistic Regression")

log_accuracy = accuracy_score(y_test, y_pred_log)
log_precision = precision_score(y_test, y_pred_log)
log_recall = recall_score(y_test, y_pred_log)
log_f1 = f1_score(y_test, y_pred_log)


print("Logistic Regression Results")
print(f"Accuracy : {log_accuracy:.4f}")
print(f"Precision: {log_precision:.4f}")
print(f"Recall : {log_recall:.4f}")
print(f"F1-score : {log_f1:.4f}")


Logistic Regression Results
Accuracy : 0.8613
Precision: 0.8995
Recall : 0.5635
F1-score : 0.6929


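Based on the results:
- The model is correct on about 86% of test cases (accuracy 0.8613)

- It is conservative when predicting arrests: about 90% of its predicted arrests are real (precision 0.8995)

- But it misses about 44% of real arrest cases (recall 0.5635)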


    Improved Model — Random Forest Classifier

Why Random Forest? 

Random Forest is an ensemble model that combines the predictions of many decision trees.

Random Forest is chosen as an improved model because:

- It can learn more complex patterns

- It works well with large datasets

- It usually performs better than simple models
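
To illustrate what "combining many decision trees" means in practice, the sketch below fits a small forest on synthetic data (not the crime dataset) and checks that the forest's probability estimate is the average of its individual trees' estimates. The dataset size and `n_estimators=25` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only; not the crime dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

forest = RandomForestClassifier(n_estimators=25, random_state=42)
forest.fit(X, y)

# Each fitted tree is available in `estimators_`.
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average the per-tree class probabilities and compare with the forest's output.
mean_proba = tree_probas.mean(axis=0)
matches_forest = np.allclose(mean_proba, forest.predict_proba(X))

print(len(forest.estimators_), matches_forest)
```

Because each tree is trained on a different bootstrap sample and considers random feature subsets at each split, averaging them tends to reduce the variance of any single tree.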

**Model Training**

In [7]:
rf_model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)


rf_model.fit(X_train, y_train)

**Predictions**

In [8]:
y_pred_rf = rf_model.predict(X_test)

**Evaluation Metrics**

In [9]:
rf_m  = compute_metrics(y_test, y_pred_rf, "Random Forest")

rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_precision = precision_score(y_test, y_pred_rf)
rf_recall = recall_score(y_test, y_pred_rf)
rf_f1 = f1_score(y_test, y_pred_rf)


print("Random Forest Results")
print(f"Accuracy : {rf_accuracy:.4f}")
print(f"Precision: {rf_precision:.4f}")
print(f"Recall : {rf_recall:.4f}")
print(f"F1-score : {rf_f1:.4f}")


Random Forest Results
Accuracy : 0.8681
Precision: 0.9200
Recall : 0.5751
F1-score : 0.7077


Based on the results:
- The model is slightly more accurate (0.8681 vs 0.8613)

- It finds slightly more real arrests (recall 0.5751 vs 0.5635)

- It gives a better overall balance than Logistic Regression (F1-score 0.7077 vs 0.6929)


    Model Comparison

The table below summarizes the performance of both models on the four metrics.

In [10]:
results = metrics_table([log_m, rf_m])
results

Model                Accuracy  Precision  Recall    F1-score
Logistic Regression  0.861336  0.899472   0.563522  0.692924
Random Forest        0.868146  0.920033   0.575053  0.707742


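The Random Forest model outperforms Logistic Regression on all four metrics, although the gains are modest (between roughly 0.7 and 2.1 percentage points). Both models still miss more than 40% of real arrests (recall below 0.58).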
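
The size of each gap can be computed directly from the comparison table. The sketch below copies the scores by hand into a plain dictionary (the `scores` structure is an illustration, not the output format of `metrics_table`) and reports the gain of Random Forest over the baseline in percentage points.

```python
# Scores copied from the comparison table above.
scores = {
    "Logistic Regression": {"Accuracy": 0.861336, "Precision": 0.899472,
                            "Recall": 0.563522, "F1-score": 0.692924},
    "Random Forest": {"Accuracy": 0.868146, "Precision": 0.920033,
                      "Recall": 0.575053, "F1-score": 0.707742},
}

baseline = scores["Logistic Regression"]
improved = scores["Random Forest"]

# Absolute gain of Random Forest over the baseline, in percentage points.
gains = {metric: (improved[metric] - baseline[metric]) * 100
         for metric in baseline}

for metric, gain in gains.items():
    print(f"{metric:<10} +{gain:.2f} pp")
```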

To ensure a clean workflow and avoid retraining, the trained models are saved for later evaluation in Notebook 04.

In [11]:
import os

# Create the models directory if it doesn't exist
os.makedirs("../models", exist_ok=True)


In [12]:
import joblib

# Save the trained models
joblib.dump(logistic_model, "../models/logistic_model.pkl")
joblib.dump(rf_model, "../models/rf_model.pkl")

['../models/rf_model.pkl']
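
A later notebook can restore the saved models with `joblib.load`. The sketch below shows a minimal save-and-reload round trip on a small synthetic model written to a temporary directory; it does not touch the real `../models` files, and how Notebook 04 actually loads them is an assumption.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only; not the crime dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "logistic_model.pkl")
    joblib.dump(model, path)

    # Reload the model from disk while the temporary file still exists.
    restored = joblib.load(path)

# The restored model should predict exactly like the original.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickled scikit-learn models are best reloaded with the same scikit-learn version that saved them, so it helps to record the library version alongside the `.pkl` files.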