## Initialization
We'll start off by importing the necessary libraries and modules. Unlike a typical notebook, this one keeps the functions written for this project in a separate notebook, which we import as well.

In [2]:
# Importing the standard libraries and modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

In [1]:
# Importing the file containing our unique functions
"""
vals_by_col
date_to_col
secondary_unique
counts_to_portions
expand_categories
"""
# pip install ipynb
from ipynb.fs.full.blog_post_functions import *

## Load Data and Perform Initial Inspection
We gathered our data from https://data.seattle.gov/Public-Safety/Call-Data/33kz-ixgy, which provided a dataset named "Call Data" containing data on emergency calls to the Seattle Police Department call center.

The full dataset is just over 5 million rows, which is well beyond GitHub's file size limit. We will therefore perform our analysis on a random subset containing 10,000 rows.

In [None]:
"""
DO NOT RUN!

This cell will not run; it is only included to show how we obtained our subset.
"""

df = pd.read_csv('Call_Data.csv')

np.random.seed(0)
df_sample = np.random.choice(df.shape[0], 10000, replace = False)
df_subset = df.loc[df_sample]

df_subset.to_csv('Call_Data_Subset.csv')
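
As an alternative sketch of the subsetting step, pandas can draw the random rows directly with `DataFrame.sample`. The toy frame below is an assumption standing in for the full `Call_Data.csv`, and this version is not guaranteed to pick the same rows as the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the full dataset (assumption: the real file is far larger).
full_df = pd.DataFrame({"Priority": np.arange(100) % 9 + 1})

# Draw 10 rows without replacement; random_state makes the draw repeatable.
subset = full_df.sample(n=10, replace=False, random_state=0)

# index=False keeps the old row labels from being written out as an extra
# 'Unnamed: 0' column. With no path given, to_csv returns the CSV as a string.
csv_text = subset.to_csv(index=False)
```

Writing with `index=False` is a design choice: the subset used in this notebook was saved with its index, which is why an `Unnamed: 0` column shows up after loading.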

In [3]:
# Loading the Data (pretending we're starting with the 10,000 row dataset)
df = pd.read_csv('Call_Data_Subset.csv')

## Initial Inspection
1. Check the data types and how the values are formatted.
2. Check for missing values and how many there are.
3. Look for anything unusual. Is there anything we haven't encountered before?
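
The three checks above can also be bundled into one summary table. This is a minimal sketch: `inspect_frame` is a hypothetical helper (it is not one of the functions in `blog_post_functions`), and the toy frame only mimics two of the real columns.

```python
import numpy as np
import pandas as pd

def inspect_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, null count, and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
        "n_unique": frame.nunique(dropna=True),  # NaN is not counted as a value
    })

# Toy data for illustration only.
toy = pd.DataFrame({
    "Priority": [3, 2, 3, 1],
    "Sector": ["KING", np.nan, "MARY", "KING"],
})
summary = inspect_frame(toy)
print(summary)
```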

In [7]:
# Overall shape
df.shape

(10000, 12)

In [9]:
# Columns
df.columns

Index(['Unnamed: 0', 'CAD Event Number', 'Event Clearance Description',
       'Call Type', 'Priority', 'Initial Call Type', 'Final Call Type',
       'Original Time Queued', 'Arrived Time', 'Precinct', 'Sector', 'Beat'],
      dtype='object')


In [10]:
# What do the first few rows look like?
df.head(10)

| | Unnamed: 0 | CAD Event Number | Event Clearance Description | Call Type | Priority | Initial Call Type | Final Call Type | Original Time Queued | Arrived Time | Precinct | Sector | Beat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2255988 | 2017000106644 | PROBLEM SOLVING PROJECT | ONVIEW | 4 | REQUEST TO WATCH | --PREMISE CHECKS - REQUEST TO WATCH | 03/27/2017 05:02:47 AM | 03/27/2017 05:02:47 AM | SOUTH | OCEAN | O3 |
| 1 | 602424 | 2022000027644 | ASSISTANCE RENDERED | ONVIEW | 9 | -ASSIGNED DUTY - DETAIL BY SUPERVISOR | --MISCHIEF OR NUISANCE - GENERAL | 02/02/2022 09:42:35 AM | 02/02/2022 09:42:35 AM | WEST | KING | K1 |
| 2 | 4322263 | 2020000160933 | ASSISTANCE RENDERED | 911 | 3 | CHILD - ABAND, ABUSED, MOLESTED, NEGLECTED | --DISTURBANCE - OTHER | 05/15/2020 11:23:01 PM | 05/15/2020 11:33:28 PM | SOUTH | ROBERT | R1 |
| 3 | 1660533 | 2021000000962 | ASSISTANCE RENDERED | ONVIEW | 7 | PREMISE CHECK, OFFICER INITIATED ONVIEW ONLY | --PREMISE CHECKS - CRIME PREVENTION | 01/02/2021 10:16:02 AM | 01/02/2021 10:16:02 AM | WEST | DAVID | D3 |
| 4 | 717846 | 2018000405368 | DUPLICATED OR CANCELLED BY RADIO | 911 | 1 | UNKNOWN - ANI/ALI - LANDLINE (INCLUDES OPEN LINE) | UNKNOWN - ANI/ALI - LANDLINE (INCLUDES OPEN LINE) | 10/29/2018 05:54:26 PM | 01/01/1900 12:00:00 AM | NORTH | LINCOLN | L2 |
| 5 | 593820 | 2021000157964 | NO POLICE ACTION POSSIBLE OR NECESSARY | 911 | 2 | TRESPASS | --PROWLER - TRESPASS | 06/24/2021 11:02:12 PM | 01/01/1900 12:00:00 AM | NORTH | LINCOLN | L3 |
| 6 | 2468962 | 2019000255821 | UNABLE TO LOCATE INCIDENT OR COMPLAINANT | 911 | 3 | NUISANCE - MISCHIEF | --MISCHIEF OR NUISANCE - GENERAL | 07/12/2019 08:51:59 PM | 07/12/2019 09:12:50 PM | SOUTHWEST | FRANK | F1 |
| 7 | 2651296 | 2020000063174 | ASSISTANCE RENDERED | ONVIEW | 9 | OFF DUTY EMPLOYMENT | -OFF DUTY EMPLOYMENT | 02/19/2020 04:19:09 PM | 02/19/2020 04:19:09 PM | WEST | MARY | M1 |
| 8 | 346189 | 2019000117647 | REPORT WRITTEN (NO ARREST) | TELEPHONE OTHER, NOT 911 | 3 | SUSPICIOUS PERSON, VEHICLE OR INCIDENT | --SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON | 04/03/2019 08:19:42 AM | 04/03/2019 08:50:03 AM | NORTH | BOY | B2 |
| 9 | 784470 | 2017000073616 | ASSISTANCE RENDERED | ONVIEW | 7 | PREMISE CHECK, OFFICER INITIATED ONVIEW ONLY | --PREMISE CHECKS - REQUEST TO WATCH | 03/01/2017 04:38:28 AM | 03/01/2017 04:38:28 AM | NORTH | LINCOLN | L1 |


In [11]:
# Data Types
df.dtypes

Unnamed: 0                      int64
CAD Event Number                int64
Event Clearance Description    object
Call Type                      object
Priority                        int64
Initial Call Type              object
Final Call Type                object
Original Time Queued           object
Arrived Time                   object
Precinct                       object
Sector                         object
Beat                           object
dtype: object

In [12]:
# Missing Data
df.isnull().sum()

Unnamed: 0                      0
CAD Event Number                0
Event Clearance Description     0
Call Type                       0
Priority                        0
Initial Call Type               0
Final Call Type                 0
Original Time Queued            0
Arrived Time                    0
Precinct                        0
Sector                         95
Beat                            0
dtype: int64

In [6]:
# Unique Values
col_uniques = vals_by_col(df, df.columns[2:])
print(col_uniques)

Event Clearance Description    [ASSISTANCE RENDERED, REPORT WRITTEN (NO ARRES...
Call Type                      [ONVIEW, 911, TELEPHONE OTHER, NOT 911, ALARM ...
Priority                                            [3, 2, 7, 1, 9, 4, 5, 6, -1]
Initial Call Type              [PREMISE CHECK, OFFICER INITIATED ONVIEW ONLY,...
Final Call Type                [--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON, --P...
Original Time Queued           [03/27/2017 05:02:47 AM, 05/18/2016 06:04:43 P...
Arrived Time                   [01/01/1900 12:00:00 AM, 03/27/2017 05:02:47 A...
Precinct                          [WEST, NORTH, SOUTH, EAST, SOUTHWEST, UNKNOWN]
Sector                         [KING, MARY, EDWARD, DAVID, UNION, SAM, NORA, ...
Beat                           [K3, M3, E2, D2, N3, M1, K2, M2, D1, K1, Q3, U...
dtype: object
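
The definition of `vals_by_col` lives in the separate functions notebook and is not shown here. Judging only from the output above (a Series indexed by column name, holding the unique values of each column), a hypothetical stand-in might look like the sketch below. The name `vals_by_col_sketch` and the toy data are assumptions, and the real helper may differ.

```python
import pandas as pd

def vals_by_col_sketch(frame: pd.DataFrame, columns) -> pd.Series:
    """Hypothetical stand-in: map each column name to a list of its unique values."""
    return pd.Series(
        {col: frame[col].unique().tolist() for col in columns},
        dtype=object,
    )

# Toy data for illustration only.
toy = pd.DataFrame({
    "Call Type": ["ONVIEW", "911", "ONVIEW"],
    "Priority": [4, 3, 4],
})
uniques = vals_by_col_sketch(toy, toy.columns)

# Iterating over a Series yields its values, so this counts uniques per column.
counts = [len(vals) for vals in uniques]
```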


In [8]:
# How many uniques are there?
[len(uniques) for uniques in col_uniques]

[25, 7, 9, 219, 245, 10000, 9291, 6, 17, 72]

In [10]:
# Potentially hidden Missing Data: "-"
"""
Hard to tell from the view above, but further examination revealed that there was another type of missing data point,
which presented itself as a dash, "-"
"""
# Collect the names of the columns whose unique values include a lone dash
marked_null = [name for name, values in col_uniques.items() if '-' in values]
print(marked_null)

['Event Clearance Description']
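
The same scan can be widened to other values that may be standing in for missing data. In the sketch below, only `"-"` is confirmed as a missing marker by the check above. Treating `"UNKNOWN"` and the `01/01/1900 12:00:00 AM` timestamp (both visible in the unique values listed earlier) as placeholders is an assumption, and `placeholder_counts` is a hypothetical helper.

```python
import pandas as pd

# Assumed placeholder tokens; only "-" is confirmed as missing data.
PLACEHOLDERS = {"-", "UNKNOWN", "01/01/1900 12:00:00 AM"}

def placeholder_counts(frame: pd.DataFrame, tokens=PLACEHOLDERS) -> pd.Series:
    """Count, per column, the cells whose value is one of the placeholder tokens."""
    counts = frame.isin(list(tokens)).sum()
    return counts[counts > 0]  # keep only columns that contain a placeholder

# Toy data for illustration only.
toy = pd.DataFrame({
    "Event Clearance Description": ["ASSISTANCE RENDERED", "-", "-"],
    "Precinct": ["WEST", "UNKNOWN", "NORTH"],
    "Priority": [3, 2, 7],
})
found = placeholder_counts(toy)
print(found)
```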


In [12]:
df['Event Clearance Description'].value_counts()

ASSISTANCE RENDERED                                                    4348
REPORT WRITTEN (NO ARREST)                                             2016
UNABLE TO LOCATE INCIDENT OR COMPLAINANT                                693
CITATION ISSUED (CRIMINAL OR NON-CRIMINAL)                              490
PHYSICAL ARREST MADE                                                    314
NO POLICE ACTION POSSIBLE OR NECESSARY                                  270
PROBLEM SOLVING PROJECT                                                 264
FALSE COMPLAINT/UNFOUNDED                                               264
OTHER REPORT MADE                                                       259
FOLLOW-UP REPORT MADE                                                   161
RESPONDING UNIT(S) CANCELLED BY RADIO                                   141
DUPLICATED OR CANCELLED BY RADIO                                         74
-                                                                        63
STREET CHECK

In [13]:
df['Event Clearance Description'].value_counts().loc['-']

63

In [17]:
# Remove nulls and blanks in data
df.dropna(axis = 0, inplace = True)
df.drop(df[df['Event Clearance Description'] == '-'].index, inplace = True)
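
The two removal steps above can also be written as one chain that leaves the original frame untouched: convert the dash to `NaN` first, then drop every row with a null. This is a sketch on toy data. Resetting the index at the end is an optional extra that the cell above does not do.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Event Clearance Description": [
        "ASSISTANCE RENDERED", "-", "STREET CHECK", "PHYSICAL ARREST MADE",
    ],
    "Sector": ["KING", "MARY", np.nan, "DAVID"],
})

cleaned = (
    # Nested dict: in this column only, replace "-" with NaN.
    toy.replace({"Event Clearance Description": {"-": np.nan}})
       .dropna(axis=0)            # drop rows with any null, including the former dashes
       .reset_index(drop=True)    # optional: renumber the remaining rows
)
```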

In [19]:
# Additionally, the first two columns are essentially covered by the index, so we'll remove them
df = df.drop(['Unnamed: 0', 'CAD Event Number'], axis=1)
df.shape

(9843, 10)

## Potential Questions
1. What time of the year / week / day do the majority of the calls come in?
2. Does the number of calls / priority of calls change by precinct, sector or beat?
3. What determines priority?

After performing some research on Computer-Aided Dispatch (CAD) systems, it appears the system itself assigns the Priority rating. Unfortunately, we're missing some of the data points actually used in assigning it. Even so, let's see how closely a model built on the data we do have can predict Priority ratings.

4. Can we build a model to accurately predict Priority ratings?

## Potential Question 1
What time of the year / week / day do the majority of the calls come in?
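
One possible starting point, sketched on a few timestamps copied from the rows shown earlier: parse `Original Time Queued` and count calls by month, weekday, and hour. The format string is inferred from those sample values, and this sketch is independent of the `date_to_col` helper, whose definition is not shown here.

```python
import pandas as pd

# Format inferred from values such as "03/27/2017 05:02:47 AM" (assumption).
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"

toy = pd.DataFrame({"Original Time Queued": [
    "03/27/2017 05:02:47 AM",
    "02/02/2022 09:42:35 AM",
    "05/15/2020 11:23:01 PM",
    "03/01/2017 04:38:28 AM",
]})

queued = pd.to_datetime(toy["Original Time Queued"], format=TIME_FORMAT)

# Call counts at three granularities.
calls_by_month = queued.dt.month.value_counts().sort_index()
calls_by_weekday = queued.dt.day_name().value_counts()
calls_by_hour = queued.dt.hour.value_counts().sort_index()
```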

## Potential Question 2
Does the number of calls / priority of calls change by precinct, sector or beat?
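
A minimal sketch of how this could be approached: group by the location column and compute the call count and mean Priority for each group. The toy frame is illustrative only. Swapping `"Precinct"` for `"Sector"` or `"Beat"` gives the finer-grained views.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Precinct": ["SOUTH", "WEST", "SOUTH", "WEST", "NORTH"],
    "Priority": [4, 9, 3, 7, 1],
})

# Named aggregation: one output column per statistic.
by_precinct = toy.groupby("Precinct")["Priority"].agg(
    n_calls="count",
    mean_priority="mean",
)
print(by_precinct)
```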

## Potential Question 3
What determines priority?
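
A first look could be a cross-tabulation of a candidate column against Priority. The sketch below uses `Call Type` on toy data. Whether that column actually drives Priority is exactly the open question, so the column choice is an assumption.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Call Type": ["ONVIEW", "ONVIEW", "911", "911", "911"],
    "Priority": [4, 9, 3, 1, 3],
})

# normalize="index" turns each row into shares that sum to 1,
# i.e. the Priority distribution within each call type.
share = pd.crosstab(toy["Call Type"], toy["Priority"], normalize="index")
print(share)
```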

## Potential Question Freestyle
Did we uncover anything interesting that we should explore further before building a model?

## Creating the Predictive Model
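
As a rough sketch of the modeling step, using only the scikit-learn pieces imported at the top of the notebook: one-hot encode a couple of categorical columns, split the data, fit a `LinearRegression`, and score it. The toy rows and the choice of predictor columns are assumptions for illustration, so the scores produced here say nothing about the real data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Toy rows for illustration only; the notebook would use the cleaned df.
toy = pd.DataFrame({
    "Call Type": ["ONVIEW", "911", "911", "ONVIEW", "911", "ONVIEW", "911", "ONVIEW"],
    "Precinct": ["SOUTH", "WEST", "SOUTH", "WEST", "NORTH", "NORTH", "WEST", "SOUTH"],
    "Priority": [4, 3, 3, 7, 2, 9, 1, 7],
})

# One-hot encode the categorical predictors (assumed feature set).
X = pd.get_dummies(toy[["Call Type", "Precinct"]], dtype=float)
y = toy["Priority"]

# Hold out a quarter of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
```

Priority is an ordered category rather than a continuous quantity, so a linear regression is only a rough first fit.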