# <a id='toc1_'></a>[Libraries and Data](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Libraries and Data](#toc1_)    
    - [Data Cleaning Notes](#toc1_1_1_)    
  - [Features](#toc1_2_)    
- [Data Cleaning](#toc2_)    
  - [Feature Engineering](#toc2_1_)    
  - [X and y](#toc2_2_)    
  - [Encoding and scaling](#toc2_3_)    
- [Baseline Model  - Logistic Regression](#toc3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [81]:
import pandas as pd
import os
import sys
import numpy as np

# Data cleaning
#from library import la_functions as la

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Modelling
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report

# Export model
import pickle

# Path to the folder containing the personalized functions
folder_path = os.path.abspath(os.path.join('..', 'library'))
sys.path.insert(0, folder_path)

# Now you can import your module or functions
import la_functions as la

In [82]:
# Load the raw data from the raw_data folder
df = pd.read_csv('../raw_data/data.csv')

### <a id='toc1_1_1_'></a>[Data Cleaning Notes](#toc0_)
- Note: every crime record is unique; however, once a number of columns are removed, some rows appear to be duplicates.
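
The note above can be illustrated with a small sketch on made-up records. Treating `division_number` as the identifying column is an assumption made for the example; the values are invented.

```python
import pandas as pd

# Made-up records: the identifiers differ, so every full row is unique.
records = pd.DataFrame({
    "division_number": [1001, 1002, 1003],  # assumed to act as the record ID
    "crime_code": [624, 624, 745],
    "victim_age": [36, 36, 76],
})

# No duplicates while the identifying column is present.
full_duplicates = records.duplicated().sum()

# After dropping the identifier, the first two rows become indistinguishable.
reduced = records.drop(columns=["division_number"])
reduced_duplicates = reduced.duplicated().sum()

print(full_duplicates, reduced_duplicates)
```

Checking `duplicated()` before and after a column drop is a quick way to tell genuine duplicates from rows that only look identical in the reduced view.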

## <a id='toc1_2_'></a>[Features](#toc0_)

- What does '-' mean?
- Code drop-down values using a dictionary mapping
- What does H mean?

# <a id='toc2_'></a>[Data Cleaning](#toc0_)

In [83]:
# Remove victim_sex rows with missing data
col_remove_sex = ['X', '-']
df = df[~df['victim_sex'].isin(col_remove_sex)]
df = df[df['victim_sex'].notnull()]
len(df)


664923

In [84]:
# Remove victim_descent rows with missing data
df = df[df['victim_descent'].notnull()]
col_remove_descent = ['X', '-']
df = df[~df['victim_descent'].isin(col_remove_descent)]
len(df)

656473

In [85]:
# Remove rows with a negative victim_age (rows with age 0 are kept)
df = df[df['victim_age'] >= 0]
len(df)

656450
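
The filter in this cell compares with `>= 0`, so it removes negative ages and leaves zeros in place. Whether a zero is a real age or a placeholder is not stated here. A minimal sketch on made-up values of how the two comparisons differ:

```python
import pandas as pd

# Made-up ages, including negative values and zeros.
ages = pd.DataFrame({"victim_age": [-2, -1, 0, 0, 25, 76]})

non_negative = ages[ages["victim_age"] >= 0]      # drops negatives, keeps zeros
strictly_positive = ages[ages["victim_age"] > 0]  # drops negatives and zeros

print(len(non_negative), len(strictly_positive))
```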

In [87]:
# Convert crime descriptions to lower case
df['crime_description'] = df['crime_description'].str.lower()

In [88]:
# Group weapons into type and prepare for encoding
df['weapon_description'].fillna('None', inplace=True)
weapon_types = {
    'STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)': 'Physical Force',
    'UNKNOWN WEAPON/OTHER WEAPON': 'Unknown',
    'VERBAL THREAT': 'Verbal Threat',
    'HAND GUN': 'Firearm',
    'SEMI-AUTOMATIC PISTOL': 'Firearm',
    'KNIFE WITH BLADE 6INCHES OR LESS': 'Knife',
    'UNKNOWN FIREARM': 'Firearm',
    'OTHER KNIFE': 'Knife',
    'MACE/PEPPER SPRAY': 'Chemical',
    'VEHICLE': 'Vehicle',
    'ROCK/THROWN OBJECT': 'Thrown Object',
    'PIPE/METAL PIPE': 'Blunt Object',
    'BOTTLE': 'Blunt Object',
    'STICK': 'Blunt Object',
    'FOLDING KNIFE': 'Knife',
    'CLUB/BAT': 'Blunt Object',
    'KITCHEN KNIFE': 'Knife',
    'AIR PISTOL/REVOLVER/RIFLE/BB GUN': 'Firearm',
    'KNIFE WITH BLADE OVER 6 INCHES IN LENGTH': 'Knife',
    'BLUNT INSTRUMENT': 'Blunt Object',
    'HAMMER': 'Blunt Object',
    'SIMULATED GUN': 'Firearm',
    'REVOLVER': 'Firearm',
    'MACHETE': 'Knife',
    'OTHER FIREARM': 'Firearm',
    'OTHER CUTTING INSTRUMENT': 'Knife',
    'PHYSICAL PRESENCE': 'Physical Force',
    'UNKNOWN TYPE CUTTING INSTRUMENT': 'Knife',
    'SCREWDRIVER': 'Sharp Object',
    'CONCRETE BLOCK/BRICK': 'Blunt Object',
    'FIRE': 'Fire',
    'BELT FLAILING INSTRUMENT/CHAIN': 'Blunt Object',
    'SCISSORS': 'Sharp Object',
    'RIFLE': 'Firearm',
    'FIXED OBJECT': 'Blunt Object',
    'STUN GUN': 'Electric Weapon',
    'GLASS': 'Sharp Object',
    'AXE': 'Sharp Object',
    'BOARD': 'Blunt Object',
    'SHOTGUN': 'Firearm',
    'CAUSTIC CHEMICAL/POISON': 'Chemical',
    'SWITCH BLADE': 'Knife',
    'BRASS KNUCKLES': 'Blunt Object',
    'BOMB THREAT': 'Explosive',
    'TOY GUN': 'Firearm',
    'TIRE IRON': 'Blunt Object',
    'SCALDING LIQUID': 'Chemical',
    'SWORD': 'Sharp Object',
    'RAZOR BLADE': 'Sharp Object',
    'HECKLER & KOCH 93 SEMIAUTOMATIC ASSAULT RIFLE': 'Firearm',
    'DIRK/DAGGER': 'Knife',
    'EXPLOXIVE DEVICE': 'Explosive',
    'ASSAULT WEAPON/UZI/AK47/ETC': 'Firearm',
    'DEMAND NOTE': 'Threat',
    'ICE PICK': 'Sharp Object',
    'RAZOR': 'Sharp Object',
    'LIQUOR/DRUGS': 'Chemical',
    'SEMI-AUTOMATIC RIFLE': 'Firearm',
    'DOG/ANIMAL (SIC ANIMAL ON)': 'Animal',
    'ROPE/LIGATURE': 'Strangling',
    'STARTER PISTOL/REVOLVER': 'Firearm',
    'CLEAVER': 'Knife',
    'BOWIE KNIFE': 'Knife',
    'SAWED OFF RIFLE/SHOTGUN': 'Firearm',
    'AUTOMATIC WEAPON/SUB-MACHINE GUN': 'Firearm',
    'BOW AND ARROW': 'Projectile',
    'SYRINGE': 'Sharp Object',
    'STRAIGHT RAZOR': 'Sharp Object',
    'MARTIAL ARTS WEAPONS': 'Physical Force',
    'UNK TYPE SEMIAUTOMATIC ASSAULT RIFLE': 'Firearm',
    'BLACKJACK': 'Blunt Object',
    'RELIC FIREARM': 'Firearm',
    'ANTIQUE FIREARM': 'Firearm',
    'UZI SEMIAUTOMATIC ASSAULT RIFLE': 'Firearm',
    'MAC-11 SEMIAUTOMATIC ASSAULT WEAPON': 'Firearm',
    'MAC-10 SEMIAUTOMATIC ASSAULT WEAPON': 'Firearm',
    'HECKLER & KOCH 91 SEMIAUTOMATIC ASSAULT RIFLE': 'Firearm',
    'M1-1 SEMIAUTOMATIC ASSAULT RIFLE': 'Firearm',
    'M-14 SEMIAUTOMATIC ASSAULT RIFLE': 'Firearm',
    'None': 'None'
}

# Creating a new column 'weapon_type' based on the mapping
df['weapon_type'] = df['weapon_description'].map(weapon_types)

# Additional engineering could include severity


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['weapon_description'].fillna('None', inplace=True)
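
The warning above suggests assigning the result of `fillna` back to the column. A minimal sketch of that form, followed by a check for descriptions that the dictionary does not cover (`map` returns NaN for those). The sample rows, the two-entry excerpt of the mapping, and the `'LASER POINTER'` value are made up for illustration.

```python
import pandas as pd

# Made-up sample; 'LASER POINTER' is a hypothetical value absent from the mapping.
sample = pd.DataFrame({"weapon_description": ["HAND GUN", None, "LASER POINTER"]})
weapon_types_excerpt = {"HAND GUN": "Firearm", "None": "None"}

# Assign the filled column back instead of using inplace=True on a column selection.
sample["weapon_description"] = sample["weapon_description"].fillna("None")

# Map descriptions to types; keys missing from the dictionary become NaN.
sample["weapon_type"] = sample["weapon_description"].map(weapon_types_excerpt)

# List the descriptions that were not mapped so they can be added to the dictionary.
unmapped = sample.loc[sample["weapon_type"].isna(), "weapon_description"].unique()
print(unmapped)
```

An empty `unmapped` array would indicate that every description in the data has an entry in the mapping.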


In [115]:
# Drop rows with a null premise_code
df = df.dropna(subset=['premise_code'])

## <a id='toc2_1_'></a>[Feature Engineering](#toc0_)

In [89]:
# Split 'date_occurred' into year, month, day and hour ('date_occurred' itself is kept)
df['date_occurred'] = pd.to_datetime(df['date_occurred'])
df['year_occurred'] = df['date_occurred'].dt.year
df['month_occurred'] = df['date_occurred'].dt.month
df['day_occurred'] = df['date_occurred'].dt.day
df['hour_occurred'] = df['date_occurred'].dt.hour

In [90]:
# Create new column 'time_of_day' based on the hour ('hour_occurred' itself is kept)
def categorize_time(hour):
    if 6 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 18:
        return 'afternoon'
    elif 18 <= hour < 24:
        return 'evening'
    else:
        return 'night'

df['time_of_day'] = df['hour_occurred'].apply(categorize_time)
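
The same bucketing can also be expressed with `pd.cut`, which avoids a Python-level call per row. This is a sketch of an alternative, not the approach used in this notebook; the bin edges are chosen to reproduce `categorize_time` for hours 0 to 23. One difference to keep in mind: a missing hour falls into `'night'` with the function, but stays missing with `pd.cut`.

```python
import pandas as pd

def categorize_time(hour):
    # Same rules as the notebook cell above.
    if 6 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 18:
        return 'afternoon'
    elif 18 <= hour < 24:
        return 'evening'
    else:
        return 'night'

hours = pd.Series(range(24))

# Left-closed bins: [0, 6) night, [6, 12) morning, [12, 18) afternoon, [18, 24) evening.
bins = [0, 6, 12, 18, 24]
labels = ['night', 'morning', 'afternoon', 'evening']
time_of_day = pd.cut(hours, bins=bins, labels=labels, right=False).astype(str)

# Compare against the row-wise function for every hour of the day.
matches = (time_of_day == hours.apply(categorize_time)).all()
print(matches)
```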

## <a id='toc2_2_'></a>[X and y](#toc0_)

In [91]:
df.columns

Index(['division_number', 'date_reported', 'date_occurred', 'area',
       'area_name', 'reporting_district', 'part', 'crime_code',
       'crime_description', 'modus_operandi', 'victim_age', 'victim_sex',
       'victim_descent', 'premise_code', 'premise_description', 'weapon_code',
       'weapon_description', 'status', 'status_description', 'crime_code_1',
       'crime_code_2', 'crime_code_3', 'crime_code_4', 'location',
       'cross_street', 'latitude', 'longitude', 'weapon_type', 'year_occurred',
       'month_occurred', 'day_occurred', 'hour_occurred', 'time_of_day'],
      dtype='object')

In [116]:
# Features that lack relevance
irrelevant_cols = df[['division_number', 'date_reported', 'reporting_district', 'part', 'modus_operandi', 'status', 'status_description', 'crime_code_1',
       'crime_code_2', 'crime_code_3', 'crime_code_4','cross_street']]

# Features that are duplicates of numerical features (e.g. description vs code)
cat_duplicates = df[['area_name', 'crime_description', 'premise_description', 'weapon_description', 'location']]
num_duplicates = df[['weapon_code', 'date_occurred']]


X = df[['victim_age', 'crime_code', 'victim_sex', 'victim_descent', 'premise_code', 'latitude', 'longitude', 'weapon_type', 'year_occurred',
       'month_occurred', 'day_occurred', 'hour_occurred', 'time_of_day' ]]


In [117]:
def assign_gravity(crime_description):
    if any(word in crime_description for word in ['petty theft', 'vandalism', 'minor fraud', 'trespass','stole']):
        return 1  # Low Gravity
    elif any(word in crime_description for word in ['burglary', 'serious fraud', 'aggravated assault', 'robbery']):
        return 2  # Medium Gravity
    elif any(word in crime_description for word in ['homicide', 'rape', 'kidnapping', 'arson','dead','penetration','penis','child pornography']):
        return 3  # High Gravity
    else:
        return 1  # Default to Low Gravity if not clearly fitting other categories


df['crime_gravity'] = df.crime_description.apply(assign_gravity)
y = df['crime_gravity']
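
Because `assign_gravity` matches substrings and returns on the first tier that matches, the order of the checks and the choice of keywords both affect the label. The sketch below copies the function and applies it to a few lower-cased descriptions taken from class names that appear in this notebook's output; it is meant as a sanity check, not as a validated severity scale.

```python
def assign_gravity(crime_description):
    # Copied from the cell above: first matching tier wins.
    if any(word in crime_description for word in ['petty theft', 'vandalism', 'minor fraud', 'trespass', 'stole']):
        return 1  # Low Gravity
    elif any(word in crime_description for word in ['burglary', 'serious fraud', 'aggravated assault', 'robbery']):
        return 2  # Medium Gravity
    elif any(word in crime_description for word in ['homicide', 'rape', 'kidnapping', 'arson', 'dead', 'penetration', 'penis', 'child pornography']):
        return 3  # High Gravity
    else:
        return 1  # Default to Low Gravity

examples = [
    'battery - simple assault',                        # no keyword matches: default
    'attempted robbery',                               # 'robbery' matches tier 2
    'arson',                                           # 'arson' matches tier 3
    'assault with deadly weapon on police officer',    # 'dead' matches inside 'deadly'
    'assault with deadly weapon, aggravated assault',  # tier 2 is checked before 'dead'
]

gravity = {text: assign_gravity(text) for text in examples}
for text, level in gravity.items():
    print(level, text)
```

The last two examples show the effect of ordering: both contain `'deadly'`, but the one that also contains `'aggravated assault'` is assigned tier 2 because that tier is tested first.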

## <a id='toc2_3_'></a>[Encoding and scaling](#toc0_)

In [118]:
num_values = X.select_dtypes(include=['number'])
cat_values = X.select_dtypes(include=['object'])

In [119]:
# One-hot encode the categorical columns; dtype=int yields 0/1 directly,
# so no separate True/False replacement is needed
X_cat_encoded = pd.get_dummies(X, columns=cat_values.columns, dtype=int)
X_cat_encoded.head()


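| | victim_age | crime_code | premise_code | latitude | longitude | year_occurred | month_occurred | day_occurred | hour_occurred | victim_sex_F | ... | weapon_type_Strangling | weapon_type_Threat | weapon_type_Thrown Object | weapon_type_Unknown | weapon_type_Vehicle | weapon_type_Verbal Threat | time_of_day_afternoon | time_of_day_evening | time_of_day_morning | time_of_day_night |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 624 | 501.0 | 34.0141 | -118.2978 | 2020 | 1 | 8 | 22 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 25 | 624 | 102.0 | 34.0459 | -118.2545 | 2020 | 1 | 1 | 3 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 76 | 745 | 502.0 | 34.1685 | -118.4019 | 2020 | 1 | 1 | 17 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 25 | 121 | 735.0 | 34.0452 | -118.2534 | 2020 | 1 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | 23 | 442 | 404.0 | 34.0483 | -118.2631 | 2020 | 1 | 2 | 13 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |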


In [120]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_cat_encoded)
X_scaled_df = pd.DataFrame(X_scaled, columns = X_cat_encoded.columns)
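
The scaler above is fitted on every row before the data is split, so the minimum and maximum of the held-out rows influence the scaling. One common way to avoid this is to put the scaler and the classifier in a single `Pipeline`, which refits the scaler on the training portion of each fold. A minimal sketch on synthetic data; `make_classification` and the parameter values are stand-ins chosen for the example, not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic three-class data standing in for the encoded crime features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)

# Scaler and classifier travel together, so each CV fold fits the scaler
# on its own training rows only.
model = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

cv_results = cross_validate(model, X_demo, y_demo, cv=5)
mean_accuracy = cv_results['test_score'].mean()
print(mean_accuracy)
```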


## Basic Logistic Regression

In [122]:
X_scaled.shape, y.shape

((656449, 52), (656449,))

In [123]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
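
Class 3 is rare in this data (the classification report for this split lists 2,221 class-3 rows out of 131,290). Passing `stratify=y` to `train_test_split` keeps the class proportions the same in both parts. The sketch below uses made-up labels with a similar imbalance; it illustrates the option and is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 70% class 1, 28% class 2, 2% class 3.
y_demo = np.array([1] * 700 + [2] * 280 + [3] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo preserves the class shares in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

rare_in_test = int((y_te == 3).sum())
print(rare_in_test, len(y_te))
```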

In [124]:
lr_model = LogisticRegression(max_iter=100000)

cv_results = cross_validate(lr_model, X_scaled, y, cv=5)

accuracy = cv_results['test_score'].mean()
accuracy

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.9187872716665767

In [125]:
lr_model.fit(X_train, y_train)
lr_model.score(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9200965802737837

In [126]:
y_pred = lr_model.predict(X_test)

In [128]:
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           1       0.95      0.95      0.95     93051
           2       0.85      0.89      0.87     36018
           3       0.60      0.03      0.05      2221

    accuracy                           0.92    131290
   macro avg       0.80      0.62      0.62    131290
weighted avg       0.92      0.92      0.91    131290

Confusion Matrix:
[[88743  4286    22]
 [ 3962 32038    18]
 [  998  1163    60]]
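
The per-class figures in the report can be recomputed from the confusion matrix, which makes the weak spot explicit: only 60 of the 2,221 class-3 rows are predicted as class 3. A short sketch using the matrix printed above (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([
    [88743, 4286, 22],
    [3962, 32038, 18],
    [998, 1163, 60],
])

# Recall: correct predictions divided by the number of true rows per class.
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class.
precision = np.diag(cm) / cm.sum(axis=0)

# Accuracy: all correct predictions divided by all rows.
accuracy = np.trace(cm) / cm.sum()

print(np.round(recall, 2), np.round(precision, 2), round(accuracy, 2))
```

The overall accuracy of 0.92 is dominated by classes 1 and 2; the macro-averaged recall of 0.62 reflects the class-3 recall of about 0.03.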


# <a id='toc3_'></a>[Baseline Model  - Logistic Regression](#toc0_)

In [7]:
X = df[['victim_age','latitude','longitude','day_occurred','month_occurred','year_occurred']]
y = df['crime_gravity']

In [21]:
df.victim_age.value_counts()

victim_age
 0      211842
 30      19421
 35      19008
 31      18603
 29      18552
         ...  
 97         63
-1          60
-2          13
 120         1
-3           1
Name: count, Length: 103, dtype: int64

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [36]:
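# DATA CLEANING PIPES
def isolate_age(X):
    # Keep ages in the range 1-99 and set every other value to NaN.
    # where() works element-wise on both a Series and a DataFrame.
    return X.where((X >= 1) & (X <= 99))

def dropna(X):
    # Note: dropping rows inside a transformer changes the row count,
    # so X and y no longer line up when this runs in a pipeline.
    return X.dropna()

# Remove null values from Age
dropna_pipe = FunctionTransformer(dropna)

# Select valid age range
age_range_pipe = FunctionTransformer(isolate_age)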


In [37]:
# Preprocess numerical data
norm_scaler = MinMaxScaler()
preprocessor = ColumnTransformer(
    transformers=[
        #('dropna_pipe', dropna_pipe, ['victim_age']),
        #('age_range_pipe', age_range_pipe, ['victim_age']),
        ('num', norm_scaler, ['victim_age', 'latitude', 'longitude', 'day_occurred', 'month_occurred', 'year_occurred'])
    ])
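
The two age steps are commented out in the preprocessor. One way to clean ages inside a pipeline without changing the number of rows is to mask out-of-range values and then impute them. The sketch below swaps row dropping for median imputation, which is an assumption made for the example and not the method used in this notebook; `mask_invalid_age` and the sample ages are hypothetical.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

def mask_invalid_age(X):
    # Hypothetical helper: ages outside 1-99 become NaN, other values are kept.
    return X.where((X >= 1) & (X <= 99))

# Mask, fill the gaps with the median of the valid ages, then scale to [0, 1].
age_pipe = Pipeline(steps=[
    ('mask', FunctionTransformer(mask_invalid_age)),
    ('impute', SimpleImputer(strategy='median')),
    ('scale', MinMaxScaler()),
])

# Made-up ages: 0, 120 and -1 fall outside the valid range.
ages = pd.DataFrame({'victim_age': [0, 25, 120, -1, 40]})
scaled_ages = age_pipe.fit_transform(ages)
print(scaled_ages.ravel())
```

A pipeline like this could take the place of the plain scaler for the `victim_age` column in a `ColumnTransformer`, since the output keeps one row per input row.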

In [38]:
# Logistic Regression model
lr_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

In [39]:
lr_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [41]:
accuracy = lr_model.score(X_test, y_test)
print("Accuracy:", accuracy)

Accuracy: 0.1892666627586611
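
An accuracy figure is easier to interpret next to a trivial reference model. `DummyClassifier` with `strategy='most_frequent'` always predicts the most common class, so its accuracy equals that class's share of the labels. A minimal sketch on made-up labels (the class shares are invented for the example):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 70% class 1, 25% class 2, 5% class 3.
y_demo = np.array([1] * 70 + [2] * 25 + [3] * 5)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this strategy

# Always predicts the most frequent class seen during fit.
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_demo, y_demo)

dummy_accuracy = dummy.score(X_demo, y_demo)
print(dummy_accuracy)
```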


In [42]:
y_pred = lr_model.predict(X_test)

In [44]:
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [45]:
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)


Classification Report:
                                                          precision    recall  f1-score   support

                                                   ARSON       0.00      0.00      0.00       473
            ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER       0.00      0.00      0.00       209
          ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT       0.00      0.00      0.00      9717
                                       ATTEMPTED ROBBERY       0.00      0.00      0.00       883
                                BATTERY - SIMPLE ASSAULT       0.11      0.56      0.18     13639
                                BATTERY ON A FIREFIGHTER       0.00      0.00      0.00        54
                                 BATTERY POLICE (SIMPLE)       0.00      0.00      0.00       457
                             BATTERY WITH SEXUAL CONTACT       0.00      0.00      0.00       727
BEASTIALITY, CRIME AGAINST NATURE SEXUAL ASSLT WITH ANIM       0.00      0.00      0.00       


In [None]:
columns_keep = [
#'division_number',
#'date_reported',
'date_occurred',
#'area',
'area_name',
#'reporting_district',
#'part',
#'crime_code',
'crime_description',
#'modus_operandi',
'victim_age',
'victim_sex',
'victim_descent',
#'premise_code',
'premise_description',
#'weapon_code',
'weapon_description',
#'status',
#'status_description',
#'crime_code_1',
#'crime_code_2',
#'crime_code_3',
#'crime_code_4',
'location',
#'cross_street',
'latitude',
'longitude',
]