Data Preparation

This notebook prepares the Chicago Crimes dataset for supervised machine learning.
All preprocessing steps are based on the insights obtained during the Exploratory Data Analysis (EDA).

The main objectives are:
- Clean the dataset
- Handle missing values
- Select relevant features
- Encode categorical variables
- Scale numerical features when needed
- Split the data into training and testing sets

In [1]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from utils.preprocessing import (
    drop_redundant_columns,
    extract_datetime_features,
    fill_missing_values,
    remove_outliers_iqr
)

In [2]:
#Load Dataset
df = pd.read_csv("../data/city_of_chicago_crimes_2001_to_present.csv")
df.shape


(6747040, 22)

In [3]:
#Target Variable
df["Arrest"] = df["Arrest"].astype(int)


The target variable must be numerical in order to be used by supervised machine learning algorithms.
Here, Arrest is converted from a boolean variable to a binary variable:

1 → arrest

0 → no arrest

In [4]:
# Remove Duplicate Records
df = df.drop_duplicates(subset="ID")


Each crime record should appear only once in the dataset.
Removing duplicate records keeps the data consistent and prevents the model from learning the same event multiple times.
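
Before dropping rows, it can help to count how many duplicates exist. The following is a minimal sketch on a toy frame (the `demo` data is invented for illustration and is not taken from the Chicago dataset):

```python
import pandas as pd

# Toy data: the record with ID 101 appears twice.
demo = pd.DataFrame({
    "ID": [101, 102, 101],
    "Primary Type": ["THEFT", "BATTERY", "THEFT"],
})

# duplicated() flags every repeat after the first occurrence of an ID.
n_dupes = int(demo.duplicated(subset="ID").sum())

# drop_duplicates() keeps the first occurrence by default.
deduped = demo.drop_duplicates(subset="ID")

print(n_dupes, len(deduped))
```
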

In [5]:
# Drop Irrelevant and Redundant Columns
df = drop_redundant_columns(df)




***The dropped columns are:***

- ID, Case Number (identifiers)
- Block (high-cardinality text)
- X Coordinate, Y Coordinate (duplicates of Latitude and Longitude)
- Location (string version of the coordinates)
- Updated On (not related to the crime itself)
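
The helper `drop_redundant_columns` lives in `utils/preprocessing.py` and is not shown in this notebook. As a rough sketch, assuming it simply drops the columns listed above, it could look like the following (the function name `drop_redundant_columns_sketch` and the toy data are hypothetical):

```python
import pandas as pd

# Assumed column list, copied from the description above.
REDUNDANT_COLUMNS = [
    "ID", "Case Number", "Block",
    "X Coordinate", "Y Coordinate",
    "Location", "Updated On",
]

def drop_redundant_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame without the identifier and duplicate-location columns."""
    # errors="ignore" keeps the call safe if a column is already absent.
    return df.drop(columns=REDUNDANT_COLUMNS, errors="ignore")

# Toy frame with only some of the columns present.
demo = pd.DataFrame({
    "ID": [1, 2],
    "Block": ["001XX W EXAMPLE ST", "002XX N EXAMPLE AVE"],
    "Primary Type": ["BATTERY", "THEFT"],
    "Latitude": [41.89, 41.77],
})
cleaned = drop_redundant_columns_sketch(demo)
print(list(cleaned.columns))
```
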

In [6]:
# drop columns with too many missing values
df = df.drop(columns=["Ward", "Community Area"])


In the EDA (Notebook 01) we observed that:

- Ward has more than 600k missing values
- Community Area has more than 600k missing values

Filling these columns would introduce a lot of artificial data, so we drop them completely.
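
The missing-value counts behind this decision can be reproduced with `isna()`. The sketch below uses a toy frame, and the 40% cut-off is an illustrative assumption only; the notebook does not state a numeric threshold.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in two columns.
demo = pd.DataFrame({
    "Ward": [1.0, np.nan, np.nan, 4.0],
    "District": [11.0, 7.0, np.nan, 2.0],
    "Primary Type": ["BATTERY", "THEFT", "ROBBERY", "BATTERY"],
})

missing_count = demo.isna().sum()    # absolute number of missing cells per column
missing_share = demo.isna().mean()   # fraction of missing cells per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("missing", ascending=False))

# Hypothetical threshold, chosen only for this example.
threshold = 0.4
to_drop = missing_share[missing_share > threshold].index.tolist()
print(to_drop)
```
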

In [7]:
df.head()


Unnamed: 0,Date,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,FBI Code,Year,Latitude,Longitude
0,03/18/2015 07:44:00 PM,041A,BATTERY,AGGRAVATED: HANDGUN,STREET,0,False,1111,11.0,04B,2015,41.891399,-87.744385
1,03/18/2015 11:00:00 PM,4625,OTHER OFFENSE,PAROLE VIOLATION,STREET,1,False,725,7.0,26,2015,41.773372,-87.665319
2,03/18/2015 10:45:00 PM,0486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,0,True,222,2.0,08B,2015,41.813861,-87.596643
3,03/18/2015 10:30:00 PM,0460,BATTERY,SIMPLE,APARTMENT,0,False,225,2.0,08B,2015,41.800802,-87.622619
4,03/18/2015 09:00:00 PM,031A,ROBBERY,ARMED: HANDGUN,SIDEWALK,0,False,1113,11.0,03,2015,41.878065,-87.743354


In [8]:
# Work on a representative subset to reduce computation time
df = df.sample(frac=0.2, random_state=42)

# Then extract datetime features

df = extract_datetime_features(df)
df[["Month", "Day", "Hour", "DayOfWeek"]].isna().sum()





Month        0
Day          0
Hour         0
DayOfWeek    0
dtype: int64

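The Date column contains both date and time information.
We extract four temporal features from it: Month, Day, Hour, and DayOfWeek.
The check above confirms that none of them contains missing values after parsing.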
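
The implementation of `extract_datetime_features` is not shown here. A minimal sketch of such a helper, assuming the `MM/DD/YYYY hh:mm:ss AM/PM` layout visible in the `df.head()` output, might look like this (the function name, the explicit format string, and the decision to keep the original Date column are assumptions):

```python
import pandas as pd

def extract_datetime_features_sketch(df: pd.DataFrame, date_col: str = "Date") -> pd.DataFrame:
    """Add Month, Day, Hour and DayOfWeek columns derived from date_col."""
    out = df.copy()
    # An explicit format avoids per-row format inference;
    # errors="coerce" turns unparseable strings into NaT instead of raising.
    parsed = pd.to_datetime(
        out[date_col], format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
    )
    out["Month"] = parsed.dt.month
    out["Day"] = parsed.dt.day
    out["Hour"] = parsed.dt.hour
    out["DayOfWeek"] = parsed.dt.dayofweek  # Monday = 0, Sunday = 6
    return out

# Two timestamps taken from the df.head() output above.
demo = pd.DataFrame({"Date": ["03/18/2015 07:44:00 PM", "03/18/2015 11:00:00 PM"]})
features = extract_datetime_features_sketch(demo)
print(features[["Month", "Day", "Hour", "DayOfWeek"]])
```
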

In [9]:
# handle remaining missing values
df = fill_missing_values(df)


After removing columns with too many missing values, some cells in the dataset are still empty.

- For numerical columns, we replace missing values with the median (the middle value).
- For categorical columns, we replace missing values with the mode (the most common value).
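
The median/mode rule described above can be sketched as follows. This is a hypothetical stand-in for `fill_missing_values`, whose actual code is in `utils/preprocessing.py` and may differ (for example in how it treats boolean columns or columns that are entirely empty):

```python
import numpy as np
import pandas as pd

def fill_missing_values_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column median and other gaps with the column mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() can return several values; take the first one.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Toy frame with one numeric and one categorical gap.
demo = pd.DataFrame({
    "Latitude": [41.0, np.nan, 43.0, 45.0],
    "Location Description": ["STREET", None, "STREET", "APARTMENT"],
})
filled = fill_missing_values_sketch(demo)
print(filled)
```
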

In [10]:
# remove redundant crime encoding columns
df = df.drop(columns=["IUCR", "FBI Code"])


In the EDA (Notebook 01) we saw that ***IUCR, FBI Code and Primary Type*** all describe the same crime at different levels of detail.

We already analyzed arrest outcomes using Primary Type, so we keep it and drop the others.
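
One way to check that a finer code is only a refinement of Primary Type is to count how many distinct crime types each code maps to. The sketch below uses a toy frame built from the IUCR and Primary Type pairs visible in the `df.head()` output; it illustrates the check and is not a result computed on the full dataset.

```python
import pandas as pd

# Pairs copied from the first rows shown earlier, with one repeated row.
demo = pd.DataFrame({
    "IUCR": ["041A", "0486", "0460", "031A", "4625", "0486"],
    "Primary Type": ["BATTERY", "BATTERY", "BATTERY", "ROBBERY", "OTHER OFFENSE", "BATTERY"],
})

# Number of distinct Primary Type values seen for each IUCR code.
types_per_code = demo.groupby("IUCR")["Primary Type"].nunique()

# True when every code belongs to exactly one Primary Type.
is_refinement = bool((types_per_code == 1).all())
print(is_refinement)
```
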

In [11]:
# Feature selection based on EDA findings

features = [
    "Primary Type",     # Crime category (EDA showed strong variation)
    "Domestic",         # Boolean context
    "District",         # Administrative area with different arrest rates
    "Latitude",         # Location (non-redundant)
    "Longitude",
    "Year",             # Temporal pattern observed in EDA
    "Month",
    "Day",
    "Hour",
    "DayOfWeek"
]

X = df[features]
y = df["Arrest"]


Feature selection is based on the exploratory analysis performed in Notebook 01.
Only variables that were shown to vary with arrest outcomes or provide essential
context (crime type, time, location, and administrative area) were retained.
Redundant, highly incomplete, or unexplored variables were excluded to keep the
model simple and interpretable.


In [12]:
# encode categorical variables
X = pd.get_dummies(X, drop_first=True)


Machine learning models require numerical inputs.
We apply one-hot encoding to categorical features.
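
To illustrate what `drop_first=True` does, here is a small sketch on invented data. The first category in sorted order becomes the implicit baseline, so a row with all dummy columns set to false belongs to that baseline category.

```python
import pandas as pd

# Toy frame: one categorical and one numerical column.
demo = pd.DataFrame({
    "Primary Type": ["BATTERY", "ROBBERY", "THEFT", "BATTERY"],
    "Hour": [19, 23, 9, 12],
})

# Non-categorical columns are passed through unchanged;
# BATTERY (first in sorted order) is dropped and acts as the baseline.
encoded = pd.get_dummies(demo, drop_first=True)
print(list(encoded.columns))
```
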

In [13]:
# scale numerical features
scaler = StandardScaler()

num_features = [
    "Latitude",
    "Longitude",
    "Year",
    "Month",
    "Day",
    "Hour",
    "DayOfWeek"
]

X[num_features] = scaler.fit_transform(X[num_features])


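We scale only the numerical features listed in num_features; District, Domestic, and the one-hot encoded columns are left unchanged.
Note that this scaler is fitted on the full dataset. To avoid leaking test-set statistics into training, the scaler is refitted on the training set only after the train/test split.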

In [14]:
# outlier detection

# X = remove_outliers_iqr(X, num_features)
# y = y.loc[X.index]

# Outlier detection was considered, but no rows were removed
# because extreme values represent valid crime incidents

X.head()



Unnamed: 0,Domestic,District,Latitude,Longitude,Year,Month,Day,Hour,DayOfWeek,Primary Type_ASSAULT,...,Primary Type_OTHER OFFENSE,Primary Type_PROSTITUTION,Primary Type_PUBLIC INDECENCY,Primary Type_PUBLIC PEACE VIOLATION,Primary Type_RITUALISM,Primary Type_ROBBERY,Primary Type_SEX OFFENSE,Primary Type_STALKING,Primary Type_THEFT,Primary Type_WEAPONS VIOLATION
1587123,False,4.0,-1.050961,1.89698,-0.462382,1.340142,-0.750967,-0.029421,0.006574,True,...,False,False,False,False,False,False,False,False,False,False
946346,True,7.0,-0.584799,0.176125,-1.061302,0.739694,-1.656426,-0.029421,-1.50773,True,...,False,False,False,False,False,False,False,False,False,False
3593204,False,25.0,0.971581,-1.70417,1.134739,-0.761426,1.625863,-1.216749,-0.498194,False,...,False,False,False,False,False,True,False,False,False,False
1368205,False,15.0,0.46253,-1.186367,-0.462382,-1.361874,-1.316879,1.454738,1.016111,False,...,False,False,False,False,False,False,False,False,False,False
1671047,False,4.0,-2.146417,2.013447,-0.262742,0.739694,-0.750967,0.267411,1.520879,False,...,False,False,False,False,False,False,False,False,False,False


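As required by the project, we considered a simple IQR-based outlier detection
on the numerical features. The removal step is left commented out, so no rows are
dropped: extreme values correspond to valid crime incidents.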
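
For reference, an IQR filter of the kind named above can be sketched as follows. This is a hypothetical version of `remove_outliers_iqr`; the multiplier `k=1.5` is the conventional default and is an assumption about the real helper, as is the function name.

```python
import pandas as pd

def remove_outliers_iqr_sketch(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive on both ends; NaN values evaluate to False.
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Toy data: the last latitude is far from the others.
X_demo = pd.DataFrame(
    {"Latitude": [41.7, 41.8, 41.9, 42.0, 36.6]},
    index=[10, 11, 12, 13, 14],
)
y_demo = pd.Series([0, 1, 0, 0, 1], index=X_demo.index)

kept = remove_outliers_iqr_sketch(X_demo, ["Latitude"])
y_kept = y_demo.loc[kept.index]  # keep the target aligned with the remaining rows
print(list(kept.index))
```
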

In [15]:
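# train / test split (stratified on the target to preserve the arrest rate)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Work on explicit copies so the column assignments below act on independent frames
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train[num_features] = scaler.fit_transform(X_train[num_features])
X_test[num_features]  = scaler.transform(X_test[num_features])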



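We split the data into training (80%) and testing (20%) sets, stratified on Arrest so that both sets keep the same arrest rate.
The scaler is then refitted on the training set only and applied to the test set.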
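
The split-then-scale order can be illustrated on synthetic data. The sketch below is not the notebook's data: the feature values and the 25% positive rate are invented, and it only demonstrates that stratification preserves the class ratio and that the scaler statistics come from the training rows alone.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and a target with 25% positives.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "Hour": rng.integers(0, 24, size=200).astype(float),
    "Latitude": rng.normal(41.8, 0.1, size=200),
})
y_demo = pd.Series([1] * 50 + [0] * 150)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
X_tr, X_te = X_tr.copy(), X_te.copy()

cols = ["Hour", "Latitude"]
demo_scaler = StandardScaler()
X_tr[cols] = demo_scaler.fit_transform(X_tr[cols])  # statistics learned from training rows
X_te[cols] = demo_scaler.transform(X_te[cols])      # same statistics reused on test rows

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Because the test set is transformed with the training statistics, its column means are close to, but not exactly, zero.
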

In [16]:
if 'X_train' in globals() and 'X_test' in globals():
    print(X_train.shape, X_test.shape)
else:
    print("X_train and X_test are not defined. Ensure the train/test split cell has been executed successfully.")


(1079526, 41) (269882, 41)


In [17]:
# After scaling, each numerical feature should have a standard deviation of about 1
X_train[num_features].std()


Latitude     1.0
Longitude    1.0
Year         1.0
Month        1.0
Day          1.0
Hour         1.0
DayOfWeek    1.0
dtype: float64

In [18]:
X_train[num_features].mean()


Latitude     2.720994e-17
Longitude   -6.713628e-18
Year        -1.632333e-18
Month       -4.100579e-17
Day         -2.274735e-17
Hour         2.435336e-17
DayOfWeek    1.089319e-18
dtype: float64