# CAPSTONE: Chicago Crime Stats EDA/Modeling Preparation

**Author: Darius Smith**

**BrainStation | Data Science | April 11, 2023**

This notebook is a continuation of the Chicago Crime Stats CAPSTONE project. The previous notebook, Chicago Crime Stats Data Wrangling/Cleaning and Basic EDA, introduced the crime data. The first step was to clean the data, followed by some basic EDA: examining what each column contained, noting any insights, and creating visuals for a few numerical and categorical columns to get a picture of the data. This notebook goes deeper by exploring the following:

>  **What is the relationship between 'domestic' and 'arrest'? What is the relationship between 'description' and 'arrest'? What is the relationship between 'domestic' and 'location'?**

This will be done using visuals as well as some statistical modeling techniques. From there, the initial exploration will be extended into a problem statement that is more predictive in nature:

>**"Using machine learning, how may we predict how 'Arrests' change with respect to time and location description?"**

## Dataset Information

This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified.

## Data Dictionary  

Data about *WHERE* incidents occurred:

- **Block** - The partially redacted address where the incident occurred, placing it on the same block as the actual address. (categorical)

- **Location Description** - Description of the location where the incident occurred. (categorical)

- **Beat** - Indicates the beat where the incident occurred. A beat is the smallest police geographic area; each beat has a dedicated police beat car. Three (3) to five (5) beats make up a police sector, and three (3) sectors make up a police district. The Chicago Police Department has (22) police districts. (categorical)

- **District** - Indicates the police district where the incident occurred. (categorical)

- **Ward** - The ward (City Council district) where the incident occurred. (numerical)

- **Community Area** - Indicates the community area where the incident occurred. Chicago has (77) community areas. (categorical)

- **X Coordinate** - The x coordinate of the location where the incident occurred in the State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block. (numerical)

- **Y Coordinate** - The y coordinate of the location where the incident occurred in the State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block. (numerical)

- **Latitude** - The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block. (numerical)

- **Longitude** - The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block. (numerical)

- **Location** - The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block. (numerical)



Data about *WHEN* incidents occurred:


- **Date** - Date when the incident occurred. (numerical)

- **Year** - Year the incident occurred. (numerical)

- **ID** - Unique identifier for the record. (numerical) 

- **Case Number** - The Chicago Police Department RD Number (Records Division Number), which is unique to the incident. (categorical) 


Data about *WHAT* the incident was:

- **IUCR** - The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description. (categorical)

- **Primary Type** - The primary description of the IUCR code. (categorical)

- **Description** - The secondary description of the IUCR code, a subcategory of the primary description. (categorical)

- **Domestic** - Indicates whether the incident was domestic related as defined by the Illinois Domestic Violence Act. (categorical) 

- **FBI Code** - Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS). (categorical) 


Data about *CONSEQUENCES* of the incident
 
- **Arrest** - Indicates whether an arrest was made. (categorical) 


*WHEN* data was updated by the city of Chicago
 
- **Updated On** - Date and time the record was last updated. (numerical)



**Exploratory Questions of Interest:**

- What is the relationship between 'domestic' and 'arrest'?
- What is the relationship between 'description' and 'arrest'?
- What is the relationship between 'domestic' and 'location'?
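
One way to start on questions like these is a contingency table between two categorical columns. The sketch below is illustrative only: it uses a small made-up frame (`toy`) rather than the real crime data, and it assumes the columns have already been encoded as 0/1.

```python
import pandas as pd

# Made-up records standing in for the real data (values are NOT from the dataset).
toy = pd.DataFrame({
    "Domestic": [1, 1, 0, 0, 0, 0],
    "Arrest":   [0, 1, 0, 1, 1, 1],
})

# Row-normalised contingency table: within each Domestic group,
# what share of incidents ended in an arrest (1) versus no arrest (0)?
arrest_by_domestic = pd.crosstab(toy["Domestic"], toy["Arrest"], normalize="index")
print(arrest_by_domestic)
```

The same call could be pointed at 'Description' or 'Location Description' in place of 'Domestic', although those columns have many more levels, so the resulting table would likely need sorting or filtering before it is readable.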

In [50]:
#Import packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [51]:
#Viewing the clean dataframe. 
crime_df2 = pd.read_csv("crimes_chicago_2001_to_present_clean.csv")
crime_df2.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10224738,HY411648,2015-09-05 13:30:00,043XX S WOOD ST,486,BATTERY,DOMESTIC BATTERY SIMPLE,RESIDENCE,False,True,...,12.0,61.0,08B,1165074.0,1875917.0,2015,02/10/2018 03:50:01 PM,41.815117,-87.67,"(41.815117282, -87.669999562)"
1,10224739,HY411615,2015-09-04 11:30:00,008XX N CENTRAL AVE,870,THEFT,POCKET-PICKING,CTA BUS,False,False,...,29.0,25.0,06,1138875.0,1904869.0,2015,02/10/2018 03:50:01 PM,41.89508,-87.7654,"(41.895080471, -87.765400451)"
2,11646166,JC213529,2018-09-01 00:01:00,082XX S INGLESIDE AVE,810,THEFT,OVER $500,RESIDENCE,False,True,...,8.0,44.0,06,0.0,0.0,2018,04/06/2019 04:04:43 PM,0.0,0.0,0
3,10224740,HY411595,2015-09-05 12:45:00,035XX W BARRY AVE,2023,NARCOTICS,POSS: HEROIN(BRN/TAN),SIDEWALK,True,False,...,35.0,21.0,18,1152037.0,1920384.0,2015,02/10/2018 03:50:01 PM,41.937406,-87.71665,"(41.937405765, -87.716649687)"
4,10224741,HY411610,2015-09-05 13:00:00,0000X N LARAMIE AVE,560,ASSAULT,SIMPLE,APARTMENT,False,True,...,28.0,25.0,08A,1141706.0,1900086.0,2015,02/10/2018 03:50:01 PM,41.881903,-87.755121,"(41.881903443, -87.755121152)"


In [52]:
crime_df2.tail(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
7742467,12936285,JF526139,2022-06-27 10:05:00,025XX N HALSTED ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,unknown,False,False,...,43.0,7.0,11,1170513.0,1917030.0,2022,01/03/2023 03:46:28 PM,41.927817,-87.648846,"(41.927817456, -87.648845932)"
7742468,12936301,JF526810,2022-12-22 18:00:00,020XX W CORNELIA AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,...,32.0,5.0,14,1161968.0,1923233.0,2022,01/03/2023 03:46:28 PM,41.945022,-87.680072,"(41.945021752, -87.680071764)"
7742469,12938501,JF523997,2022-12-26 22:30:00,021XX W DEVON AVE,915,MOTOR VEHICLE THEFT,"TRUCK, BUS, MOTOR HOME",PARKING LOT / GARAGE (NON RESIDENTIAL),False,False,...,50.0,2.0,7,1160681.0,1942466.0,2022,01/03/2023 03:46:28 PM,41.997825,-87.684267,"(41.997824802, -87.684266677)"
7742470,12936397,JF526745,2022-12-19 14:00:00,044XX N ROCKWELL ST,620,BURGLARY,UNLAWFUL ENTRY,APARTMENT,False,False,...,47.0,4.0,5,1158237.0,1929586.0,2022,01/03/2023 03:46:28 PM,41.962532,-87.693611,"(41.962531969, -87.693611152)"
7742471,12935341,JF525383,2022-12-20 06:45:00,027XX W ROOSEVELT RD,810,THEFT,OVER $500,STREET,False,False,...,28.0,29.0,6,1158071.0,1894595.0,2022,01/03/2023 03:46:28 PM,41.866517,-87.695179,"(41.866517317, -87.695178701)"


In [53]:
#Checking to see if we have the proper datatypes. 
crime_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7742472 entries, 0 to 7742471
Data columns (total 22 columns):
 #   Column                Dtype  
---  ------                -----  
 0   ID                    int64  
 1   Case Number           object 
 2   Date                  object 
 3   Block                 object 
 4   IUCR                  object 
 5   Primary Type          object 
 6   Description           object 
 7   Location Description  object 
 8   Arrest                bool   
 9   Domestic              bool   
 10  Beat                  int64  
 11  District              float64
 12  Ward                  float64
 13  Community Area        float64
 14  FBI Code              object 
 15  X Coordinate          float64
 16  Y Coordinate          float64
 17  Year                  int64  
 18  Updated On            object 
 19  Latitude              float64
 20  Longitude             float64
 21  Location              object 
dtypes: bool(2), float64(7), int64(3), object(10)

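**'Date' is stored as `object` (string) rather than 'datetime', so it would need to be parsed before any time-based analysis. 'Arrest' and 'Domestic' are boolean and need to be converted to binary (0/1) columns. After that, columns that are unnecessary for EDA and modeling will be dropped.**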
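
The `info()` output above lists 'Date' with dtype `object`, which is what pandas reports for strings read from a CSV. If datetime operations are needed later, one option is `pd.to_datetime`. The sketch below uses a made-up two-row frame; the `Hour` and `Month` columns are hypothetical derived features, not columns of the original dataset.

```python
import pandas as pd

# Two made-up rows using the same timestamp layout as the 'Date' column.
toy = pd.DataFrame({
    "Date": ["2015-09-05 13:30:00", "2018-09-01 00:01:00"],
    "Arrest": [False, True],
})

# Parse the strings into a datetime64 column; an explicit format avoids guessing.
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d %H:%M:%S")

# Hypothetical time features that a later model could use.
toy["Hour"] = toy["Date"].dt.hour
toy["Month"] = toy["Date"].dt.month

print(toy.dtypes)
```

Passing `parse_dates=["Date"]` to `pd.read_csv` is another way to get a datetime column at load time.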

In [54]:
#Observing the array for Arrest. It is True or False. 
crime_df2['Arrest'].unique()

array([False,  True])

In [55]:
#Observing the array for Domestic.
crime_df2['Domestic'].unique()

array([ True, False])

In [56]:
# Converting Domestic into a binary column (True -> 1, False -> 0).
crime_df2['Domestic'] = np.where(crime_df2['Domestic'] == True,1,0)

In [57]:
crime_df2['Domestic'].value_counts()

0    6673294
1    1069178
Name: Domestic, dtype: int64
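
From the counts above, 1,069,178 of 7,742,472 incidents (roughly 13.8%) are flagged as domestic. As a sketch of how the same conversion might be applied to both boolean columns and how proportions could be read off directly, the example below uses a made-up four-row frame; `astype(int)` is shown as an alternative to `np.where`, not as the notebook's own method.

```python
import pandas as pd

# Made-up boolean columns mirroring 'Arrest' and 'Domestic'.
toy = pd.DataFrame({
    "Arrest":   [False, True, False, False],
    "Domestic": [True, False, False, False],
})

# astype(int) maps True -> 1 and False -> 0, equivalent to the np.where call.
for col in ["Arrest", "Domestic"]:
    toy[col] = toy[col].astype(int)

# normalize=True returns proportions instead of raw counts.
domestic_share = toy["Domestic"].value_counts(normalize=True)
print(domestic_share)
```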