Data Preparation

This notebook prepares the Chicago Crimes dataset for supervised machine learning.
All preprocessing steps are based on the insights obtained during the Exploratory Data Analysis (EDA).

The main objectives are:
- Clean the dataset
- Handle missing values
- Select relevant features
- Encode categorical variables
- Scale numerical features when needed
- Split the data into training and testing sets

In [108]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from utils.preprocessing import (
    drop_redundant_columns,
    extract_datetime_features,
    fill_missing_values,
    remove_outliers_iqr
)

In [109]:
# Load the dataset
df = pd.read_csv("../data/city_of_chicago_crimes_2001_to_present.csv")
df.shape


(6747040, 22)

In [110]:
# Target variable
df["Arrest"] = df["Arrest"].astype(int)


The target variable must be numerical in order to be used by supervised machine learning algorithms.
Here, Arrest is converted from a boolean variable to a binary variable:

1 → arrest

0 → no arrest

In [111]:
# check imbalance between arrest and no arrest
print(df["Arrest"].value_counts(dropna=False))

Arrest
0    4875610
1    1871430
Name: count, dtype: int64


=> The dataset is imbalanced: about 72% of records have no arrest and about 28% have one. This will be taken into account later during modeling and evaluation.
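
As a quick check, the class shares can be derived from the counts printed above. This is only a small sketch that re-enters those two counts by hand; on the full DataFrame, `df["Arrest"].value_counts(normalize=True)` would give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({0: 4875610, 1: 1871430}, name="Arrest")

# Share of each class in the full dataset
shares = counts / counts.sum()
print(shares.round(3))
```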

In [112]:
# Remove Duplicate Records
df = df.drop_duplicates(subset="ID")
print("After duplicates:", df.shape)


After duplicates: (6747040, 22)


Each crime record should appear only once in the dataset.

Removing duplicate records keeps the data consistent and prevents the model from learning the same event multiple times. Here the shape is unchanged after the call, so no duplicate IDs were present.
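
To see how many rows would be removed before actually dropping them, the duplicates can be counted first. The snippet below is a minimal sketch on a toy frame (the values are made up for illustration); only the column name ID comes from the dataset.

```python
import pandas as pd

# Toy frame with one repeated ID (illustrative values only)
toy = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Primary Type": ["THEFT", "BATTERY", "BATTERY", "ROBBERY"],
})

# duplicated() flags every repeat after the first occurrence
n_dupes = toy.duplicated(subset="ID").sum()

# drop_duplicates() keeps the first occurrence by default
deduped = toy.drop_duplicates(subset="ID")
print(n_dupes, deduped.shape)
```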

In [113]:
# Drop Irrelevant and Redundant Columns
df = drop_redundant_columns(df)




***The dropped columns are:***

- ID, Case Number (identifiers)
- Block (high-cardinality text)
- X Coordinate, Y Coordinate (duplicate of latitude and longitude)
- Location (string version of the coordinates)
- Updated On (not related to the crime itself)
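
The helper drop_redundant_columns lives in utils/preprocessing.py and is not shown in this notebook. A minimal sketch of what such a helper could look like, assuming it simply drops the columns listed above, is given here; the function name with the _sketch suffix and the toy row are hypothetical.

```python
import pandas as pd

# Columns listed above; the real helper in utils/preprocessing.py may differ
REDUNDANT_COLUMNS = [
    "ID", "Case Number", "Block",
    "X Coordinate", "Y Coordinate",
    "Location", "Updated On",
]

def drop_redundant_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame without the identifier and duplicate-location columns."""
    # errors="ignore" skips any listed column that is not present
    return df.drop(columns=REDUNDANT_COLUMNS, errors="ignore")

# Toy row for illustration only
toy = pd.DataFrame({"ID": [1], "Block": ["toy block"], "Latitude": [41.9], "Arrest": [0]})
remaining = drop_redundant_columns_sketch(toy).columns.tolist()
print(remaining)
```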

In [114]:
# drop columns with too many missing values
df = df.drop(columns=["Ward", "Community Area"])


In the EDA (Notebook 01) we observed that:

- Ward has more than 600k missing values
- Community Area has more than 600k missing values

That is roughly 9% of the 6.7 million rows. Filling these columns would introduce a lot of artificial data, so we drop them completely.
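
One way to make this kind of decision reproducible is to compute the share of missing values per column and compare it with a cutoff. The sketch below uses a toy frame and a hypothetical threshold of 0.4 chosen only for the example; it is not the rule used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Ward": [1.0, np.nan, np.nan, 4.0],
    "District": [11.0, 7.0, np.nan, 2.0],
    "Year": [2015, 2015, 2015, 2015],
})

# Fraction of missing cells per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

threshold = 0.4  # hypothetical cutoff for this example
to_drop = missing_share[missing_share > threshold].index.tolist()
print(missing_share)
print(to_drop)
```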

In [115]:
df.head()


Unnamed: 0,Date,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,FBI Code,Year,Latitude,Longitude
0,03/18/2015 07:44:00 PM,041A,BATTERY,AGGRAVATED: HANDGUN,STREET,0,False,1111,11.0,04B,2015,41.891399,-87.744385
1,03/18/2015 11:00:00 PM,4625,OTHER OFFENSE,PAROLE VIOLATION,STREET,1,False,725,7.0,26,2015,41.773372,-87.665319
2,03/18/2015 10:45:00 PM,0486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,0,True,222,2.0,08B,2015,41.813861,-87.596643
3,03/18/2015 10:30:00 PM,0460,BATTERY,SIMPLE,APARTMENT,0,False,225,2.0,08B,2015,41.800802,-87.622619
4,03/18/2015 09:00:00 PM,031A,ROBBERY,ARMED: HANDGUN,SIDEWALK,0,False,1113,11.0,03,2015,41.878065,-87.743354


The Date column contains both date and time information. We extract:

**Hour**

**Day**

**Month**

**Day of week** (Monday, Tuesday, etc.)

In [116]:
# Work on a representative subset to reduce computation time
df = df.sample(frac=0.2, random_state=42)

# extract datetime features
df = extract_datetime_features(df)
df[["Month", "Day", "Hour", "DayOfWeek"]].isna().sum()

print("After extracting datetime features:", df.shape)




After extracting datetime features: (1349408, 16)
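
The helper extract_datetime_features is defined in utils/preprocessing.py and is not shown here. The following is a minimal sketch of how such an extraction could be written, assuming the timestamp layout seen in df.head() (month/day/year with a 12-hour clock) and assuming that the Date column is dropped afterwards, which is consistent with the shapes reported in this notebook. The function name with the _sketch suffix is hypothetical.

```python
import pandas as pd

def extract_datetime_features_sketch(df: pd.DataFrame, date_col: str = "Date") -> pd.DataFrame:
    """Add Month, Day, Hour and DayOfWeek, then drop the raw date column."""
    out = df.copy()
    # An explicit format avoids per-value format inference; unparsable values become NaT
    parsed = pd.to_datetime(out[date_col], format="%m/%d/%Y %I:%M:%S %p", errors="coerce")
    out["Month"] = parsed.dt.month
    out["Day"] = parsed.dt.day
    out["Hour"] = parsed.dt.hour
    out["DayOfWeek"] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6
    return out.drop(columns=[date_col])

# Two timestamps taken from the df.head() preview
toy = pd.DataFrame({"Date": ["03/18/2015 07:44:00 PM", "03/18/2015 11:00:00 PM"]})
features = extract_datetime_features_sketch(toy)
print(features)
```

With this encoding the day of week is an integer, which matches its later use as a numerical feature.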


After removing columns with too many missing values, some cells in the dataset are still empty.

- For numerical columns, we replace missing values with the median (the middle value).
- For categorical columns, we replace missing values with the mode (the most common value).

In [117]:
# handle remaining missing values
df = fill_missing_values(df)
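
The helper fill_missing_values is not shown in this notebook. A minimal sketch of the median/mode rule described above could look like this; the function name with the _sketch suffix and the toy values are hypothetical, and the real helper may treat edge cases differently.

```python
import numpy as np
import pandas as pd

def fill_missing_values_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and all other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode()
            if not mode.empty:  # mode() is empty when every value is missing
                out[col] = out[col].fillna(mode.iloc[0])
    return out

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Latitude": [41.0, np.nan, 43.0],
    "Location Description": ["STREET", None, "STREET"],
})
filled = fill_missing_values_sketch(toy)
print(filled)
```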


In the EDA (Notebook 01) we observed that ***IUCR, FBI Code and Primary Type*** all describe the same crime at different levels of detail.

We already analyzed arrest outcomes using Primary Type, so we keep it and drop the others.

In [118]:
# remove redundant crime encoding columns
df = df.drop(columns=["IUCR", "FBI Code"])


Feature selection is based on the exploratory analysis performed in Notebook 01.
Only variables that were shown to vary with arrest outcomes or provide essential
context (crime type, time, location, and administrative area) were retained.
Redundant, highly incomplete, or unexplored variables were excluded.


In [119]:
# Feature selection based on EDA findings

features = [
    "Primary Type",  # Crime category
    "Domestic",      # Contextual information
    "District",      # Administrative differences
    "Latitude",      # Location
    "Longitude",
    "Year",          # Long-term trends
    "Month",
    "Day",
    "Hour",
    "DayOfWeek",
]


X = df[features]
y = df["Arrest"]


Most machine learning models require numerical input. We apply one-hot encoding to the categorical variables, using drop_first=True so that one redundant dummy column per variable is dropped.

In [120]:
# encode categorical variables
X = pd.get_dummies(X, drop_first=True)
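
A small toy example can make the effect of this call easier to see. Called without the columns argument, pd.get_dummies only encodes object, string and category columns, so boolean and numeric columns pass through unchanged. The frame below is illustrative; if District is stored as a float, as in the df.head() preview, it would stay a single numeric column rather than being one-hot encoded.

```python
import pandas as pd

# Toy frame mimicking three of the selected features (illustrative values only)
toy = pd.DataFrame({
    "Primary Type": ["BATTERY", "ROBBERY", "THEFT", "BATTERY"],
    "Domestic": [False, False, True, False],
    "District": [11.0, 7.0, 2.0, 2.0],
})

# Only the object column is expanded; drop_first removes the first category (BATTERY)
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

To one-hot encode a numeric code such as District as well, it would have to be passed explicitly through the columns argument or cast to a categorical type first.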


We split the dataset into:

- **80% training**
- **20% testing**

The split is stratified on the target (stratify=y), so both sets keep the same arrest ratio.

In [121]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
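
The effect of stratify=y can be checked on a toy target. The sketch below uses 100 made-up rows with a 25% positive rate (chosen so the numbers divide evenly, not the real arrest rate) and shows that both splits keep that rate.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 75 negatives and 25 positives (illustrative only)
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([0] * 75 + [1] * 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Positive rate in each split
print(y_tr.mean(), y_te.mean())
```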

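Scaling transforms numerical features so they are centered around 0 and have similar ranges, preventing features with large values from dominating the model. The scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.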

In [122]:
scaler = StandardScaler()

num_features = [
    "Latitude",
    "Longitude",
    "Year",
    "Month",
    "Day",
    "Hour",
    "DayOfWeek",
]

# Scale numerical features
X_train[num_features] = scaler.fit_transform(X_train[num_features])
X_test[num_features] = scaler.transform(X_test[num_features])
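
The fit-on-train, transform-on-test pattern can be illustrated with a tiny made-up array. The point of the sketch is that the test values are scaled with the training mean and standard deviation, so the scaled test set is not expected to have mean 0 itself.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustrative values only)
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[10.0], [30.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_, train_scaled.ravel(), test_scaled.ravel())
```

StandardScaler divides by the population standard deviation (ddof=0), whereas pandas .std() uses ddof=1 by default; with more than a million training rows the two are practically identical.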

An IQR-based outlier filter was implemented but not applied, because extreme values represent valid crime incidents rather than data errors.

In [123]:
# X_train = remove_outliers_iqr(X_train, num_features)
# y_train = y_train.loc[X_train.index]
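
For reference, an IQR filter of this kind usually keeps rows that fall within 1.5 times the interquartile range of the first and third quartiles. The helper remove_outliers_iqr is not shown in this notebook, so the version below is only a sketch under that common convention; the function name with the _sketch suffix, the parameter k and the toy values are hypothetical.

```python
import pandas as pd

def remove_outliers_iqr_sketch(df: pd.DataFrame, columns: list, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values lie in [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Toy column with one extreme value (illustrative only)
toy = pd.DataFrame({"value": [1, 2, 3, 4, 100]})
kept = remove_outliers_iqr_sketch(toy, ["value"])
print(kept)
```

Because rows are removed, the target has to be realigned afterwards by index, which is what the commented line y_train.loc[X_train.index] does.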

In [124]:
X_train.shape, X_test.shape


((1079526, 41), (269882, 41))

In [125]:
X_train[num_features].mean()

Latitude    -8.316517e-14
Longitude    7.671358e-13
Year        -4.613421e-15
Month       -6.234459e-17
Day         -4.133489e-18
Hour        -9.470165e-17
DayOfWeek    5.448405e-17
dtype: float64

In [126]:
X_train[num_features].std()

Latitude     1.0
Longitude    1.0
Year         1.0
Month        1.0
Day          1.0
Hour         1.0
DayOfWeek    1.0
dtype: float64

In [127]:
X_train.to_csv("../data/X_train.csv", index=False)
X_test.to_csv("../data/X_test.csv", index=False)
y_train.to_csv("../data/y_train.csv", index=False)
y_test.to_csv("../data/y_test.csv", index=False)
