# Capstone: Visualizations

**San Francisco Police Department Incident Reports**

This notebook generates the visualizations for the final report of this Capstone Project. See the 
[README.md](https://github.com/fazeelgm/UCB_ML_AI_Capstone/blob/main/README.md) 
for details.

## Imports & Utilities

### Imports

In [7]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Import utilities
import time

# Export DataFrames as images
import dataframe_image as dfi

# import project utils
import sys
sys.path.append('../src')

import data_utils
from data_utils import Config

In [8]:
# Configure logging
import logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

# logging.getLogger().setLevel(logging.DEBUG)
# logging.getLogger().setLevel(logging.INFO)

### Utility Functions

In [10]:
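def time_secs_to_msg(lapse_time_secs, mins_label='m', secs_label='s'):
    """Format an elapsed time in seconds as 'N.NNs' or 'Mm N.NNs'."""
    # Strictly less than 60: exactly 60 seconds must render as '1m 0.00s', not '0.00s'
    if lapse_time_secs < 60:
        return f'{lapse_time_secs:.2f}{secs_label}'
    else:
        return f'{lapse_time_secs//60:,.0f}{mins_label} {lapse_time_secs%60:.2f}{secs_label}'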

## The Data

Details for the data can be found in the project [README.md](https://github.com/fazeelgm/UCB_ML_AI_Capstone/blob/main/README.md).

### Read the Data

In [13]:
# Which dataset to work from? Select sample size percentage

sample_file = data_utils.select_sample_csv_file(pct=10)
# sample_file = data_utils.select_sample_csv_file(pct=100)
# sample_file = data_utils.select_sample_csv_file(pct=75)
# sample_file = data_utils.select_sample_csv_file(pct=50)
# sample_file = data_utils.select_sample_csv_file(pct=25)

print(f'Selected sample file: {sample_file}')

Selected sample file: ../data/incidents_clean_10_pct.csv


In [14]:
current_raw_df, current_clean_df = data_utils.get_clean_data_from_csv(sample_file)

Reading file: ../data/incidents_clean_10_pct.csv ... Done: 89,458 rows, 37 columns
... Converting datetime to timeseries ... Done
... Setting index to datetime ... Done
Done


### Apply Feature Engineering Learnings

We will reuse the learnings from the Exploratory Data Analysis (EDA) and apply them to clean the data using three shared methods:

* `data_utils.preprocess_data()`
* `data_utils.fix_data_artifacts()`
* `data_utils.apply_synthetic_features()`

Please refer to the EDA notebook, 
[ExploratoryDataAnalysis.ipynb](https://github.com/fazeelgm/UCB_ML_AI_Capstone/blob/main/notebooks/ExploratoryDataAnalysis.ipynb), for details.

In [17]:
data = data_utils.preprocess_data(current_raw_df.copy())

Pre-processing ... 
... Dropping unwanted columns ... 
... preprocess_drop_cols: Column Unnamed: 0 dropped
... preprocess_drop_cols: Column esncag_-_boundary_file dropped
... preprocess_drop_cols: Column central_market/tenderloin_boundary_polygon_-_updated dropped
... preprocess_drop_cols: Column civic_center_harm_reduction_project_boundary dropped
... preprocess_drop_cols: Column hsoc_zones_as_of_2018-06-05 dropped
... preprocess_drop_cols: Column invest_in_neighborhoods_(iin)_areas dropped
... preprocess_drop_cols: Column report_type_code dropped
... preprocess_drop_cols: Column report_type_description dropped
... preprocess_drop_cols: Column filed_online dropped
... preprocess_drop_cols: Column intersection dropped
... preprocess_drop_cols: Column cnn dropped
... preprocess_drop_cols: Column point dropped
... preprocess_drop_cols: Column supervisor_district dropped
... preprocess_drop_cols: Column supervisor_district_2012 dropped
... preprocess_drop_cols: Column current_supervisor_d

In [18]:
# Fix data value artifacts that were discovered during EDA
data = data_utils.fix_data_artifacts(data)

Fixing data artifacts (in-place) ... 
... Category column:
    ..."Human Trafficking*"
    ..."Motor Vehicle Theft"
    ..."Weapons Offence"
Done


Create the new, synthetic features that were introduced during EDA:

In [20]:
data = data_utils.apply_synthetic_features(data)

Generating synthetic feature columns (in-place) ... 
... Adding columns ['hour', 'minute', 'day', 'month']'
... Adding column ['weekend']
... Adding column ['season']
... Adding column ['holiday']
... Adding column ['tod']
Done
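
The log above names the columns that `apply_synthetic_features()` adds but not how they are derived. The following is a minimal sketch of how the calendar-based ones could be computed from the `DatetimeIndex`. The helper `add_calendar_features` and the meteorological month-to-season mapping are assumptions made for illustration, not the code in `data_utils`. The `holiday` and `tod` columns are left out because their rules are not shown in this notebook.

```python
import pandas as pd

# Assumed mapping (meteorological seasons); the project may define seasons differently.
SEASON_BY_MONTH = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall',
}

def add_calendar_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for part of data_utils.apply_synthetic_features()."""
    out = df.copy()
    idx = out.index  # expected to be a DatetimeIndex
    out['hour'] = idx.hour
    out['minute'] = idx.minute
    out['day'] = idx.day
    out['month'] = idx.month
    # Monday is 0, so Saturday and Sunday are 5 and 6
    out['weekend'] = (idx.dayofweek >= 5).astype(int)
    out['season'] = out['month'].map(SEASON_BY_MONTH)
    return out

# Two timestamps taken from the data.head(2) output in this notebook
demo = pd.DataFrame(
    {'category': ['Other Miscellaneous', 'Burglary']},
    index=pd.to_datetime(['2024-08-01 08:01:00', '2021-11-25 23:30:00']),
)
demo = add_calendar_features(demo)
print(demo[['hour', 'minute', 'day', 'month', 'weekend', 'season']])
```

For these two rows the sketch agrees with the `weekend` and `season` values shown in the `data.head(2)` output, which is consistent with, but does not prove, the assumed rules.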


In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 82888 entries, 2024-08-01 08:01:00 to 2018-10-02 16:53:00
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   date             82888 non-null  object 
 1   time             82888 non-null  object 
 2   year             82888 non-null  int64  
 3   dow              82888 non-null  object 
 4   category         82888 non-null  object 
 5   resolution       82888 non-null  object 
 6   police_district  82888 non-null  object 
 7   neighborhood     82888 non-null  object 
 8   latitude         82888 non-null  float64
 9   longitude        82888 non-null  float64
 10  hour             82888 non-null  int64  
 11  minute           82888 non-null  int64  
 12  day              82888 non-null  int64  
 13  month            82888 non-null  int64  
 14  weekend          82888 non-null  int64  
 15  season           82888 non-null  object 
 16  holiday          82888 

In [22]:
data.head(2)

datetime,date,time,year,dow,category,resolution,police_district,neighborhood,latitude,longitude,hour,minute,day,month,weekend,season,holiday,tod
2024-08-01 08:01:00,2024/08/01,08:01,2024,Thursday,Other Miscellaneous,Open or Active,Mission,Mission,37.768272,-122.419983,8,1,1,8,0,Summer,False,Morning
2021-11-25 23:30:00,2021/11/25,23:30,2021,Thursday,Burglary,Open or Active,Northern,Haight Ashbury,37.773757,-122.432467,23,30,25,11,0,Fall,False,Evening


In [24]:
# data.to_csv('../data/incidents_10.csv')

## Load Models

In [26]:
import datetime
import joblib

In [27]:
# Edit the model file prefix and suffix for the models you want loaded
timestamp_prefix = 'saved_'
timestamp_suffix = '_incidents_clean_10_pct_2024-10-10_2205.pkl'

saved_models = {
    'RandomForestClassifier',
    'XGBClassifier',
    # 'TEST_FAILURE'
}

loaded_models = {}
for model in saved_models:
    print(f'Loading models ...')
    
    try:
        saved_file = Config.MODELS_DIR / f'{timestamp_prefix}{model}{timestamp_suffix}'
        print(f'... Loading {model} from {saved_file}')
        loaded_models[model] = joblib.load(saved_file)
    except Exception as exc:
        # Catch Exception rather than using a bare except, so Ctrl-C still interrupts the loop
        print(f'... FAILED {model}: {exc!r}')

print(f'Done')

Loading models ...
... Loading XGBClassifier from ../models/saved_XGBClassifier_incidents_clean_10_pct_2024-10-10_2205.pkl
Loading models ...
... Loading RandomForestClassifier from ../models/saved_RandomForestClassifier_incidents_clean_10_pct_2024-10-10_2205.pkl
Done


In [28]:
loaded_models

{'XGBClassifier': XGBClassifier(base_score=None, booster=None, callbacks=None,
               colsample_bylevel=0.8390144719977516, colsample_bynode=None,
               colsample_bytree=0.9416576386904312, device=None,
               early_stopping_rounds=None, enable_categorical=False,
               eval_metric=None, feature_types=None, gamma=None,
               grow_policy=None, importance_type=None,
               interaction_constraints=None, learning_rate=0.02806554771929606,
               max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
               max_delta_step=None, max_depth=95, max_leaves=None,
               min_child_weight=9, missing=nan, monotone_constraints=None,
               multi_strategy=None, n_estimators=109, n_jobs=None,
               num_parallel_tree=None, objective='multi:softprob', ...),
 'RandomForestClassifier': XGBClassifier(base_score=None, booster=None, callbacks=None,
               colsample_bylevel=0.8390144719977516, colsample_b
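
The repr above shows the `'RandomForestClassifier'` key holding an `XGBClassifier` instance, which suggests that the saved file for that name does not contain the expected estimator. A quick sanity check after loading could compare each key with the class name of the object it maps to. The helper `find_mismatched_models` and the two placeholder classes below are hypothetical and exist only to make the sketch self-contained.

```python
def find_mismatched_models(models: dict) -> dict:
    """Return {key: actual_class_name} for entries whose class name differs from the key."""
    return {
        name: type(model).__name__
        for name, model in models.items()
        if type(model).__name__ != name
    }

# Placeholder classes standing in for the real estimators (illustration only)
class XGBClassifier:
    pass

class RandomForestClassifier:
    pass

# Mimics the situation shown in the repr: both keys hold an XGBClassifier
demo_models = {
    'XGBClassifier': XGBClassifier(),
    'RandomForestClassifier': XGBClassifier(),
}
mismatches = find_mismatched_models(demo_models)
print(mismatches)
```

Running the same helper on `loaded_models` would flag any pickle that was saved under the wrong name before it is used downstream.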

## Data Preparation

### Create Train/Test Splits

In [31]:
X = data.drop('category', axis='columns')
y = data['category']

In [32]:
# OneHot Encode the features and drop the first value to reduce multicollinearity
X = pd.get_dummies(X, drop_first=True)
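
As a small illustration of what `drop_first=True` does, the sketch below encodes a toy frame (made-up values, not project data). Numeric columns pass through unchanged, and each object column becomes one indicator column per level except the first level in sorted order.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'year': [2024, 2021, 2021],
    'season': ['Summer', 'Fall', 'Winter'],
})

encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Here `season_Fall` is the dropped level, so a row with both remaining indicators set to `False` represents Fall. Because every distinct value of an object column becomes its own indicator, high-cardinality columns can widen the feature matrix considerably.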

In [33]:
# Consistent random_state for the project
print(f'Project-wide random_state: {Config.RANDOM_STATE}')

Project-wide random_state: 42


In [34]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    stratify=y, random_state=Config.RANDOM_STATE)
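
The `stratify=y` argument keeps the class proportions of `y` the same in both splits, which matters when some incident categories are rare. A minimal sketch on made-up labels (an 80/20 class mix, not project data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels for illustration only
X_toy = pd.DataFrame({'x': range(100)})
y_toy = pd.Series(['A'] * 80 + ['B'] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

test_counts = y_te.value_counts().to_dict()
train_counts = y_tr.value_counts().to_dict()
print(train_counts, test_counts)
```

Both splits retain the 80/20 ratio of the full label set, whereas an unstratified split could leave a rare class under-represented in the test set.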

In [35]:
print('AFTER TRAIN_TEST_SPLIT: Data{}, X_train{}, X_test{}, y_train{}, y_test{}'
      .format(data.shape, X_train.shape, X_test.shape, y_train.shape, y_test.shape))

AFTER TRAIN_TEST_SPLIT: Data(82888, 18), X_train(66310, 3976), X_test(16578, 3976), y_train(66310,), y_test(16578,)


In [36]:
# spot-check feature encoding
X.T.iloc[:, 0:5]

datetime,2024-08-01 08:01:00,2021-11-25 23:30:00,2018-06-20 21:00:00,2022-07-06 12:41:00,2021-02-27 23:02:00
year,2024,2021,2018,2022,2021
latitude,37.768272,37.773757,37.723642,37.777457,37.770063
longitude,-122.419983,-122.432467,-122.461251,-122.413158,-122.403878
hour,8,23,21,12,23
minute,1,30,0,41,2
...,...,...,...,...,...
season_Summer,True,False,True,True,False
season_Winter,False,False,False,False,True
tod_Evening,False,True,True,False,True
tod_Morning,True,False,False,False,False


### Feature Scaling

In [38]:
from sklearn.preprocessing import StandardScaler

# Scale the data - we'll use StandardScaler for the baseline model
logging.debug('Scaling data')
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print('AFTER SCALING: Data{}, X_train_scaled{}, X_test_scaled{}, y_train{}, y_test{}'
      .format(data.shape, X_train_scaled.shape, X_test_scaled.shape, y_train.shape, y_test.shape))

AFTER SCALING: Data(82888, 18), X_train_scaled(66310, 3976), X_test_scaled(16578, 3976), y_train(66310,), y_test(16578,)
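
The cell above fits the scaler on the training split only and then reuses the learned mean and standard deviation on the test split, so no test-set statistics leak into preprocessing. A minimal sketch with toy numbers (not project data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values for illustration only
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the train statistics

print(scaler.mean_, scaler.scale_)
print(test_scaled)
```

The test value is standardized against the training mean of 2.0, so it lands outside the range of the scaled training data instead of being re-centered to zero.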


## SHAP

In [40]:
import shap

In [41]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 16578 entries, 2019-09-27 21:00:00 to 2021-03-19 12:38:00
Columns: 3976 entries, year to tod_Night
dtypes: bool(3968), float64(2), int64(6)
memory usage: 63.9 MB


In [42]:
X_test.iloc[0]

year                   2019
latitude          37.765117
longitude       -122.418579
hour                     21
minute                    0
                    ...    
season_Summer         False
season_Winter         False
tod_Evening            True
tod_Morning           False
tod_Night             False
Name: 2019-09-27 21:00:00, Length: 3976, dtype: object

In [74]:
loaded_models.keys()

dict_keys(['XGBClassifier', 'RandomForestClassifier'])

In [43]:
%%time
# Select the model to explain with SHAP
model = loaded_models['RandomForestClassifier']

CPU times: user 5 µs, sys: 2 µs, total: 7 µs
Wall time: 19.1 µs


In [70]:
%%time
explainer = shap.TreeExplainer(model)

CPU times: user 4.57 s, sys: 804 ms, total: 5.37 s
Wall time: 5.26 s


In [72]:
%%time

# shap_values = explainer.shap_values(X_test)
# X_test.iloc[0] returns a Series, which pandas upcasts to dtype object because the
# columns have mixed types. XGBoost then rejects it with:
#   ValueError: DataFrame.dtypes for data must be int, float, bool or category. ...
#   Invalid columns:2019-09-27 21:00:00: object
# Selecting with a list keeps a one-row DataFrame and the original column dtypes.
shap_values = explainer.shap_values(X_test.iloc[[0]])
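
The `ValueError` quoted above comes from how pandas represents a single row. A sketch on a toy frame (made-up values with the same mix of int, float and bool columns as `X_test`) shows the difference between selecting a row as a Series and as a one-row DataFrame:

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    'year': [2019, 2021],
    'latitude': [37.765117, 37.770063],
    'tod_Evening': [True, False],
})

row_series = toy.iloc[0]    # Series: one dtype for all values
row_frame = toy.iloc[[0]]   # DataFrame: per-column dtypes preserved

print(row_series.dtype)
print(row_frame.dtypes.astype(str).tolist())
```

The Series falls back to `object` because no single numeric dtype can hold int, float and bool values together, while the one-row DataFrame keeps `int64`, `float64` and `bool`, which are dtypes the error message lists as accepted.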

In [None]:
explainer.expected_value[0]

In [None]:
shap_values.shape, X_test.shape

In [None]:
shap_values[0][0,:]

In [None]:
shap_values.shape[1]

In [None]:
%%time
shap.initjs()

In [None]:
%%time

# `exp` was never defined in this notebook; the explainer object is named `explainer`
# shap.force_plot(explainer.expected_value[0], shap_values[0][0,:], X_test.iloc[0,:])
shap.force_plot(explainer.expected_value[0], shap_values[:, 0], X_test.iloc[0, :])