# Exploratory Analysis

This notebook documents the decisions made during the exploratory analysis of the Austin crime project.

## The Required Imports

Here we'll import all the modules required to run the code cells in this notebook.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from wrangle import wrangle_crime_data
from prepare import split_data

## Acquire and Prepare Data

Here we'll acquire and prepare the data using the wrangle module.

In [2]:
df = wrangle_crime_data()
df.shape

Using cached csv


(349530, 17)

## Split the Data

We need to split the data before exploring, because we are about to examine the relationship between the target variable and the other features, and that exploration should only look at the training data.

In [3]:
train, validate, test = split_data(df)
train.shape, validate.shape, test.shape

((195736, 17), (83888, 17), (69906, 17))
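
The shapes above work out to roughly 56% train, 24% validate, and 20% test. The internals of split_data are not shown in this notebook, so the following is only a minimal sketch of a split with similar proportions. The function name split_data_sketch, the fractions, and the seed are all assumptions made for illustration, and the toy frame stands in for the crime data.

```python
import pandas as pd


def split_data_sketch(df, test_frac=0.2, validate_frac=0.3, seed=42):
    """Hypothetical stand-in for prepare.split_data (fractions are assumed).

    Requires a unique index, because rows are removed by index label.
    """
    # Hold out the test set first.
    test = df.sample(frac=test_frac, random_state=seed)
    remainder = df.drop(test.index)
    # Carve the validate set out of what is left.
    validate = remainder.sample(frac=validate_frac, random_state=seed)
    train = remainder.drop(validate.index)
    return train, validate, test


# Toy frame standing in for the crime data.
toy = pd.DataFrame({"x": range(1000)})
toy_train, toy_validate, toy_test = split_data_sketch(toy)
print(toy_train.shape, toy_validate.shape, toy_test.shape)
```

Holding out the test rows first, then taking validate from the remainder, keeps the three sets disjoint.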

From here on we'll only use the train variable.

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 195736 entries, 256829 to 125659
Data columns (total 17 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   crime_type        195736 non-null  object        
 1   family_violence   195736 non-null  object        
 2   occurrence_time   195736 non-null  datetime64[ns]
 3   occurrence_date   195736 non-null  datetime64[ns]
 4   report_time       195736 non-null  datetime64[ns]
 5   report_date       195736 non-null  datetime64[ns]
 6   location_type     195736 non-null  object        
 7   address           195736 non-null  object        
 8   zip_code          195736 non-null  float64       
 9   council_district  195736 non-null  float64       
 10  sector            195736 non-null  object        
 11  district          195736 non-null  object        
 12  latitude          195736 non-null  float64       
 13  longitude         195736 non-null  float64       
 14 

## Does the difference between time of incident and report time relate to a crime being solved/closed?

In this section we'll explore the relationship between a case's clearance status and the time elapsed between when the incident occurred and when it was reported.

### Engineer Time Difference Feature

In order to answer this question we will need to create a feature which contains the difference in time between when a crime occurred and when it was reported.

In [7]:
train['time_to_report'] = train.report_time - train.occurrence_time
train.time_to_report.head()

256829    2 days 19:26:00
123369    0 days 21:19:00
319089    1 days 03:10:00
37631    16 days 22:47:00
221040    0 days 16:17:00
Name: time_to_report, dtype: timedelta64[ns]

Now we have the difference between when a crime occurred and when it was reported. This column has a lot of unique values, so to gain meaningful insights from it we will need to bin it.
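
One possible way to bin a timedelta column is to convert it to hours and pass it to pd.cut. This is a sketch only: the toy values imitate the time_to_report output above, and the bin edges and labels are illustrative assumptions, not choices made in this project.

```python
import pandas as pd

# Toy values imitating the time_to_report column.
time_to_report = pd.Series(pd.to_timedelta([
    "0 days 00:00:00",
    "0 days 03:34:00",
    "0 days 21:19:00",
    "2 days 19:26:00",
    "16 days 22:47:00",
]))

# Work in hours so the bin edges are plain numbers.
hours = time_to_report.dt.total_seconds() / 3600

# Assumed edges and labels; intervals are right-closed, e.g. (1, 24].
edges = [-float("inf"), 1, 24, 24 * 7, float("inf")]
labels = ["1 hour or less", "1 to 24 hours", "1 to 7 days", "over 7 days"]
binned = pd.cut(hours, bins=edges, labels=labels)

print(binned.value_counts(sort=False))
```

The open-ended first and last bins mean every value gets a label, including any negative differences where a report time precedes the occurrence time.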

We'll

In [23]:
# 'm' in a Timedelta string means minutes, so spell the unit out explicitly
(train.time_to_report < pd.Timedelta(minutes=1)).value_counts()

False    113865
True      81871
Name: time_to_report, dtype: int64

In [27]:
# Rows reported more than one minute after the incident occurred
train[train.time_to_report > pd.Timedelta(minutes=1)][['time_to_report', 'occurrence_time', 'report_time']]

| | time_to_report | occurrence_time | report_time |
|---|---|---|---|
| 256829 | 2 days 19:26:00 | 2019-09-20 17:00:00 | 2019-09-23 12:26:00 |
| 123369 | 0 days 21:19:00 | 2021-01-10 14:00:00 | 2021-01-11 11:19:00 |
| 319089 | 1 days 03:10:00 | 2019-02-20 13:00:00 | 2019-02-21 16:10:00 |
| 37631 | 16 days 22:47:00 | 2021-12-11 12:00:00 | 2021-12-28 10:47:00 |
| 221040 | 0 days 16:17:00 | 2020-01-21 16:00:00 | 2020-01-22 08:17:00 |
| ... | ... | ... | ... |
| 142843 | 0 days 12:18:00 | 2020-11-01 02:15:00 | 2020-11-01 14:33:00 |
| 126993 | 0 days 22:14:00 | 2020-12-27 10:45:00 | 2020-12-28 08:59:00 |
| 296143 | 0 days 03:34:00 | 2019-05-12 02:20:00 | 2019-05-12 05:54:00 |
| 319723 | 0 days 03:42:00 | 2019-02-18 00:19:00 | 2019-02-18 04:01:00 |


In [26]:
81871 / 195736

0.41827257121837574
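
The division above gives the share of training rows reported within one minute of the incident. The same proportion can be read directly from the boolean mask, since the mean of a boolean Series is the fraction of True values. The sketch below uses toy values, not the project data.

```python
import pandas as pd

# Toy values standing in for train.time_to_report.
time_to_report = pd.Series(pd.to_timedelta([
    "00:00:00", "00:00:00", "00:00:30", "00:05:00", "2 days",
]))

# True where the report came in under one minute after the incident.
within_one_minute = time_to_report < pd.Timedelta(minutes=1)

# Mean of a boolean Series is the share of True values.
share = within_one_minute.mean()
print(share)
```

This avoids hard-coding the counts, so the figure stays correct if the split changes.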