# San Francisco Crime Classification

This notebook analyzes San Francisco crime data from the Kaggle SF Crime competition, which covers incidents reported between January 2003 and May 2015, and predicts the crime category from time and location features. A CatBoostClassifier is trained on the hour, month, day of week, police district, and coordinates of each incident, with the aim of helping law enforcement better understand and predict crime patterns. The analysis includes data exploration, feature engineering, and model evaluation.

Dataset: https://www.kaggle.com/competitions/sf-crime/data

Hugging Face: https://huggingface.co/spaces/alperugurcan/crime-predictor

In [1]:
# 1. Import Libraries
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

In [2]:
# 2. Load Data
train = pd.read_csv("/kaggle/input/sf-crime/train.csv.zip")
test = pd.read_csv("/kaggle/input/sf-crime/test.csv.zip")

In [3]:
# 3. Feature Engineering
def process_data(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Parse timestamps and extract hour and month
    df['Dates'] = pd.to_datetime(df['Dates'])
    df['Hour'] = df['Dates'].dt.hour
    df['Month'] = df['Dates'].dt.month

    # Encode categoricals as integer codes (sorted order of the values present)
    df['DayOfWeek'] = pd.Categorical(df['DayOfWeek']).codes
    df['PdDistrict'] = pd.Categorical(df['PdDistrict']).codes

    # Keep only the model features
    features = ['Hour', 'Month', 'DayOfWeek', 'PdDistrict', 'X', 'Y']
    return df[features]

In [4]:
# Process datasets
X_train = process_data(train)
X_test = process_data(test)
y_train = train['Category']
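
`pd.Categorical(...).codes` assigns codes from the sorted values present in whichever frame it receives, so the train and test encodings agree only when both files contain the same set of days and districts. The sketch below shows one way to pin the category list explicitly. The helper `encode_with_categories` and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd

def encode_with_categories(series, categories):
    """Map values to integer codes using a fixed category list.

    Values that are not in `categories` receive the code -1.
    """
    return pd.Categorical(series, categories=categories).codes

# Toy frames standing in for train and test (values are illustrative only).
toy_train = pd.DataFrame({"PdDistrict": ["BAYVIEW", "MISSION", "TARAVAL"]})
toy_test = pd.DataFrame({"PdDistrict": ["TARAVAL", "BAYVIEW"]})

# Learn the category list once, from the training data.
districts = sorted(toy_train["PdDistrict"].unique())

train_codes = encode_with_categories(toy_train["PdDistrict"], districts)
test_codes = encode_with_categories(toy_test["PdDistrict"], districts)

# Per-frame encoding, as process_data does, for comparison.
naive_test_codes = pd.Categorical(toy_test["PdDistrict"]).codes
```

In this toy case `TARAVAL` gets code 2 with the shared list but code 1 when the test frame is encoded on its own, because `MISSION` is absent from it.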

In [7]:
# 4. Train Model
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.1,
    eval_metric='MultiClass',
    verbose=100,
    task_type='GPU'  # Requires a CUDA-capable GPU; set to 'CPU' otherwise
)
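
The notebook imports `train_test_split` but fits on all of the training data, so the loss it reports is a training loss only. A minimal sketch of a stratified holdout split follows, using toy stand-ins for `X_train` and `y_train`. The toy data, the 80/20 ratio, and the names `X_fit`, `X_val`, `y_fit`, `y_val` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for X_train / y_train (random values, illustrative only).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "Hour": rng.integers(0, 24, size=100),
    "X": rng.normal(-122.42, 0.02, size=100),
})
y_toy = pd.Series(["ASSAULT", "LARCENY/THEFT"] * 50, name="Category")

# Hold out 20% of the rows; stratify keeps class proportions similar in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
```

With the real data, passing the held-out part as `model.fit(X_fit, y_fit, eval_set=(X_val, y_val))` should make CatBoost report a validation loss alongside the training loss.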

In [8]:
model.fit(X_train, y_train)

0:	learn: 3.3420487	total: 100ms	remaining: 50s
100:	learn: 2.4813666	total: 7.49s	remaining: 29.6s
200:	learn: 2.4529673	total: 14.8s	remaining: 22.1s
300:	learn: 2.4343192	total: 22.2s	remaining: 14.7s
400:	learn: 2.4200870	total: 29.7s	remaining: 7.33s
499:	learn: 2.4093618	total: 37.1s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7de5f4f26c80>
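
The `learn` column in the log is the `MultiClass` metric on the training data, which, as far as the CatBoost documentation describes it, is the multi-class log loss: the mean negative log-probability assigned to the true class. The NumPy sketch below computes that quantity for a uniform guess over 39 classes (the number of categories in `train.csv`) as a reference point. The function `multiclass_log_loss` is a hypothetical helper, not a CatBoost API.

```python
import numpy as np

def multiclass_log_loss(probs, true_idx, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    # Clip to avoid log(0) when a class receives zero probability.
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    true_idx = np.asarray(true_idx)
    # Pick, for each row, the probability of that row's true class.
    picked = probs[np.arange(len(true_idx)), true_idx]
    return float(-np.mean(np.log(picked)))

# Uniform guessing over 39 categories gives a loss of ln(39).
n_classes = 39
uniform = np.full((4, n_classes), 1.0 / n_classes)
baseline = multiclass_log_loss(uniform, [0, 5, 10, 38])
```

The uniform baseline is ln(39), roughly 3.66, so the final training loss of about 2.41 is well below uninformed guessing. Because it is measured on the training data, it says nothing yet about performance on unseen incidents.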

In [12]:
predictions = model.predict_proba(X_test)
categories = model.classes_

# One probability column per category, in the order used by the model
submission = pd.DataFrame(predictions, columns=categories)

# Kaggle expects the Id column first
submission.insert(0, 'Id', test['Id'])

submission.to_csv('submission.csv', index=False)
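
Before uploading, it can be worth checking that the frame has the expected shape. The sketch below is a hypothetical helper, `check_submission`, that verifies the `Id` column comes first, the row count matches, and each row of probabilities sums to 1, as `predict_proba` output should. The toy frame and the tolerance are assumptions.

```python
import numpy as np
import pandas as pd

def check_submission(df, n_expected_rows):
    """Return True if `df` looks like a valid submission, else raise ValueError."""
    if df.columns[0] != "Id":
        raise ValueError("first column should be 'Id'")
    if len(df) != n_expected_rows:
        raise ValueError("expected one row per test record")
    probs = df.drop(columns="Id").to_numpy(dtype=float)
    if np.isnan(probs).any():
        raise ValueError("probabilities contain NaN")
    if not np.allclose(probs.sum(axis=1), 1.0, atol=1e-6):
        raise ValueError("each row of probabilities should sum to 1")
    return True

# Toy submission with two categories (values are illustrative only).
toy = pd.DataFrame({"Id": [0, 1], "ARSON": [0.2, 0.5], "ASSAULT": [0.8, 0.5]})
is_valid = check_submission(toy, n_expected_rows=2)
```

With the notebook's objects the call would be `check_submission(submission, len(test))`.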

In [13]:
model.save_model('model.cbm')