In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

df = pd.read_csv('crimeTime.csv', engine='python', on_bad_lines='skip')

Import the pandas and NumPy libraries, then load crimeTime.csv into a dataframe (malformed rows are skipped).

<br>**sklearn.model_selection** for splitting data into training/testing sets
<br>**sklearn.preprocessing** includes OneHotEncoder, which turns categorical variables into binary (0/1) numeric features
<br>**sklearn.metrics** to measure accuracy
<br>**tensorflow.keras** is a framework for building neural network models, an MLP in my case


In [4]:
#one-hot encode 'Category' (crime type) for classification
encoder = OneHotEncoder(sparse_output=False)
category_encoded = encoder.fit_transform(df[['Category']])
category_labels = encoder.categories_[0]

#create dataframe for encoded labels
category_df = pd.DataFrame(category_encoded, columns=category_labels)

This step transforms the "Category" column into a group of binary columns, one for every unique category.
<br> Each crime type becomes its own 0/1 column, and together these columns form the target the model learns to predict.
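
<br>As a rough illustration of what one-hot encoding produces, the sketch below rebuilds the idea in plain Python. The sample labels and the helper name `one_hot` are made up for illustration and are not taken from crimeTime.csv; the notebook itself relies on OneHotEncoder, which likewise lists its categories in sorted order.

```python
def one_hot(values):
    """Return (labels, rows): sorted unique labels and one binary row per value."""
    labels = sorted(set(values))
    index = {label: i for i, label in enumerate(labels)}
    rows = []
    for value in values:
        row = [0.0] * len(labels)
        row[index[value]] = 1.0  # exactly one column is switched on per row
        rows.append(row)
    return labels, rows

# Made-up sample values, for illustration only
sample = ["THEFT", "ASSAULT", "THEFT", "FRAUD"]
labels, rows = one_hot(sample)
print(labels)
for value, row in zip(sample, rows):
    print(value, row)
```

Every row sums to 1, which is what lets the same encoded array double as the target for a softmax classifier.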

In [5]:
#one-hot encode 'ordinalDistrict'
district_encoder = OneHotEncoder(sparse_output=False)
district_encoded = district_encoder.fit_transform(df[['ordinalDistrict']])
district_labels = district_encoder.categories_[0]

#create dataframe for encoded districts
district_df = pd.DataFrame(district_encoded, columns=[f'District_{int(d)}' for d in district_labels])

Exactly the same as the previous step, but this time it transforms the "ordinalDistrict" column, producing one District_N column per district.

In [10]:
#merge with main dataframe
df = pd.concat([df[['ordinalDOW', 'Time_Minutes']], district_df, category_df], axis=1)

#normalize 'Time_Minutes'
df['Time_Minutes'] = df['Time_Minutes'] / 1440 #1440 minutes = 1 day

#split dataset into features (X) and target (y)
X = df.drop(category_labels, axis=1)
y = category_encoded

#train-test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

**df = pd.concat()** combines the original dataframe columns (ordinalDOW and Time_Minutes) with the new dataframe columns created in the previous two steps.

<br>Time_Minutes is then normalized into a 0-1 range by dividing by 1440, because 1440 minutes = 1 day.

<br>The dataset is then separated into features (X, everything except the category columns) and the target (y, the one-hot encoded categories).

<br>Finally, the rows are split into two sets, 70% for training and 30% for testing.
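
<br>The arithmetic behind these two steps can be sketched in plain Python. This is a simplified stand-in, assuming a shuffle-then-slice split; it is not scikit-learn's exact train_test_split algorithm, and the helper names `normalize_minutes` and `split_indices` are hypothetical.

```python
import random

MINUTES_PER_DAY = 1440

def normalize_minutes(minutes):
    """Scale minutes since midnight to a fraction of the day (0 to just under 1)."""
    return minutes / MINUTES_PER_DAY

def split_indices(n_rows, test_size=0.3, seed=0):
    """Shuffle row indices, then slice them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

print(normalize_minutes(720))  # noon lands halfway through the day
train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as random_state=0 in the notebook: rerunning the cell gives the same split.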

In [7]:
#define MLP architecture for classification
model = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(128, activation='relu'),
    BatchNormalization(),
    Dropout(0.3),
    Dense(64, activation='relu'),
    BatchNormalization(),
    Dropout(0.3),
    Dense(32, activation='relu'),
    BatchNormalization(),
    Dropout(0.3),
    Dense(y_train.shape[1], activation='softmax')
])

This step creates the multi-layer perceptron (MLP) with:

*   **Input Layer** that receives one input per feature in the training data
*   **3 Hidden Layers** (128 -> 64 -> 32 neurons), each followed by batch normalization and dropout
    *   Batch Normalization normalizes the outputs from the previous layer (faster training)
    *   Dropout randomly turns off 30% of the neurons during training (helps prevent overfitting)
*   **A final Dense Layer** with a softmax activation that outputs the probability for each crime category
*   ReLU, the activation function in the hidden layers, outputs max(0, x); it is cheap to compute because it only compares each value with 0
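
The two activation functions named above can be written out in a few lines of plain Python. This is only a sketch of the math, not how Keras implements them; the input scores are made up.

```python
import math

def relu(x):
    """ReLU: pass positive values through, clamp negatives to 0."""
    return max(0.0, x)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

print([relu(x) for x in (-2.0, 0.0, 3.5)])

# Made-up scores for three classes
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest score always receives the largest probability, which is why the class with the highest softmax output is taken as the prediction.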





In [9]:
#compile the model with a smaller learning rate
optimizer = Adam(learning_rate=0.0005)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

#train the model
model.fit(X_train, y_train,
          epochs=1, #set to 1 for demonstration purposes because each pass through could take minutes
          batch_size=32,
          validation_split=0.2,
          verbose=1)

37267/37267 ━━━━━━━━━━━━━━━━━━━━ 202s 5ms/step - accuracy: 0.2346 - loss: 2.5798 - val_accuracy: 0.2422 - val_loss: 2.5393


<keras.src.callbacks.history.History at 0x7b83b691ab10>

The Adam (Adaptive Moment Estimation) optimizer adapts the step size for each parameter individually, starting from a base learning rate of 0.0005.

<br>The model is then compiled with the categorical_crossentropy loss function
*   it is suited to multi-class classification and penalizes the gap between the predicted probabilities and the one-hot true labels
*   epochs is the # of full passes the model makes through the training data
*   batch size is the # of samples the model processes before updating the weights
*   validation split is 20%, meaning that 80% of the training samples are used for fitting and the remaining 20% are held out to evaluate performance at the end of each epoch
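
For a single sample, categorical cross-entropy reduces to the negative log of the probability assigned to the true class. The sketch below shows this in plain Python with made-up probabilities; the `eps` clamp is an assumption added here to avoid log(0), not a claim about the exact value Keras uses.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one sample: -sum(t * log(p)) over the classes."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0.0, 1.0, 0.0]         # one-hot label: class 1 is correct
confident = [0.05, 0.90, 0.05]   # made-up prediction, mostly right
unsure = [0.40, 0.30, 0.30]      # made-up prediction, mostly wrong

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

Read the other way around, a loss of about 2.58, as in the training output above, corresponds to giving the true class a probability of roughly exp(-2.58) ≈ 0.076 on (geometric) average.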



In [None]:
#evaluate the model on the test data
y_pred_prob = model.predict(X_test)
y_pred = np.argmax(y_pred_prob, axis=1)
y_test_labels = np.argmax(y_test, axis=1)

#calculate accuracy
accuracy = accuracy_score(y_test_labels, y_pred)
print("Model Accuracy:", accuracy)

The final step is to evaluate the model on the held-out test set.
<br>np.argmax converts both the predicted probabilities and the one-hot labels to class indices; accuracy_score then compares them and returns the fraction of correct predictions.
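
<br>The same evaluation logic, written without NumPy or scikit-learn so each step is visible. The labels and probabilities are made up, and the helpers `argmax` and `accuracy` are illustrative stand-ins for np.argmax and accuracy_score.

```python
def argmax(row):
    """Index of the largest value in a row (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(true_rows, prob_rows):
    """Fraction of rows where the predicted class matches the true class."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(true_rows, prob_rows))
    return hits / len(true_rows)

# Made-up one-hot labels and predicted probabilities for three samples
y_true = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
y_prob = [[0.2, 0.7, 0.1], [0.3, 0.6, 0.1], [0.1, 0.2, 0.7]]

acc = accuracy(y_true, y_prob)
print("Accuracy:", acc)
```

The second sample is misclassified (predicted class 1, true class 0), so two of the three predictions count as correct.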