# Data Preparation

This notebook documents the steps and decisions made in the second iteration of the Austin Crime project.

## The Required Imports

Here we'll import all the required modules for this notebook.

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from acquire import get_crime_data
import prepare
from wrangle import wrangle_crime_data

## Acquire the Data

We'll acquire the data using the `get_crime_data` function from the `acquire` module. The data originally comes from the source API, but going forward we use the cache file `Crime_Reports.csv` (the run below reads from that cache).

In [2]:
# Acquire the data (this run reads the cached csv)

df = get_crime_data()
df.shape

Using cached csv


(500000, 31)
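
The acquire step above relies on a cache file so the API is not hit on every run. The following is a minimal sketch of that pattern, not the actual `acquire.get_crime_data` implementation: `load_with_cache`, its `fetch` argument, and `fake_fetch` are hypothetical names, and the real function presumably makes an API request where this sketch takes a callable.

```python
import os
import tempfile

import pandas as pd


def load_with_cache(fetch, cache_path, use_cache=True):
    """Read cache_path if it exists; otherwise call fetch() and write the cache.

    `fetch` is a hypothetical zero-argument callable standing in for the API request.
    """
    if use_cache and os.path.exists(cache_path):
        print('Using cached csv')
        return pd.read_csv(cache_path)
    df = fetch()
    # index=False keeps the row index out of the file.
    df.to_csv(cache_path, index=False)
    return df


# Stand-in for the API call so the sketch runs offline (made-up rows).
calls = []


def fake_fetch():
    calls.append(1)
    return pd.DataFrame({
        'incident_report_number': [1, 2],
        'crime_type': ['THEFT', 'BURGLARY'],
    })


cache_file = os.path.join(tempfile.mkdtemp(), 'Crime_Reports.csv')
first = load_with_cache(fake_fetch, cache_file)   # fetches, then writes the cache
second = load_with_cache(fake_fetch, cache_file)  # reads the cache instead
```

Writing the cache with `index=False` matters: if the index is written and the file is later read back without `index_col`, pandas adds an `Unnamed: 0` column.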

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 31 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   incident_report_number       500000 non-null  int64  
 1   crime_type                   500000 non-null  object 
 2   ucr_code                     500000 non-null  int64  
 3   family_violence              500000 non-null  object 
 4   occ_date_time                500000 non-null  object 
 5   occ_date                     500000 non-null  object 
 6   occ_time                     500000 non-null  int64  
 7   rep_date_time                500000 non-null  object 
 8   rep_date                     500000 non-null  object 
 9   rep_time                     500000 non-null  int64  
 10  location_type                498336 non-null  object 
 11  address                      500000 non-null  object 
 12  zip_code                     497118 non-null  float64
 13 

### Limit Time Frame of the Data

We are only interested in crimes that occurred between 2018 and 2021, so here we'll remove all observations whose occurrence date (`occ_date`) falls outside this time frame.

In [4]:
# Let's see how the date information is stored in the dataframe.

df.head(1).occ_date

0    2022-05-28T00:00:00.000
Name: occ_date, dtype: object

In [5]:
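# Set the occ_date column to a datetime type.
# The raw strings look like '2022-05-28T00:00:00.000', so the format includes the time part.

df['occ_date'] = pd.to_datetime(df['occ_date'], format='%Y-%m-%dT%H:%M:%S.%f')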

In [6]:
df.occ_date.head()

0   2022-05-28
1   2022-05-28
2   2022-05-28
3   2022-05-28
4   2022-05-28
Name: occ_date, dtype: datetime64[ns]

In [7]:
# Subset the data to include observations between 2018-01-01 and 2021-12-31.

df = df[(df.occ_date >= '2018-01-01') & (df.occ_date <= '2021-12-31')]
df.shape

(401978, 31)
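
The parse-and-filter step can be checked on a handful of rows before running it on the full frame. The sketch below uses made-up dates in the same string format as `occ_date`; the `sample` frame is an assumption for illustration. It uses `Series.between`, which includes both endpoints by default, as an alternative to the two chained comparisons.

```python
import pandas as pd

# Made-up rows in the same string format as the raw occ_date column.
sample = pd.DataFrame({
    'occ_date': [
        '2017-12-31T00:00:00.000',
        '2018-01-01T00:00:00.000',
        '2021-12-31T00:00:00.000',
        '2022-05-28T00:00:00.000',
    ]
})

# Parse with a format that matches the full timestamp string.
sample['occ_date'] = pd.to_datetime(sample['occ_date'], format='%Y-%m-%dT%H:%M:%S.%f')

# Keep rows from 2018-01-01 through 2021-12-31, endpoints included.
in_range = sample['occ_date'].between('2018-01-01', '2021-12-31')
subset = sample[in_range]
```

Only the two middle rows survive, which confirms that both boundary dates are kept.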

### New Decisions

*After discussing and researching what "cleared by exception" means, we decided that instead of lumping cleared by arrest and cleared by exception together, it might be better to drop cleared by exception altogether. The two can mean very different things, so depending on the proportion of cleared by exception values, it may make sense to drop all of them for the sake of data integrity.*

In [8]:
# Checking the proportions of cleared by arrest, cleared by exception, and not cleared
df.clearance_status.value_counts()

N    281250
C     73852
O      1668
Name: clearance_status, dtype: int64

In [9]:
# Checking as a percentage
df.clearance_status.value_counts(normalize=True)

N    0.788323
C    0.207002
O    0.004675
Name: clearance_status, dtype: float64

**Cleared by exception makes up less than 1 percent of our data (about 0.47 percent of rows with a recorded clearance status). Moving forward, we will drop all rows with this value.**
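
One detail worth keeping in mind: the three counts above sum to 356,770, while the frame has 401,978 rows. `value_counts` excludes missing values by default, so the percentages are shares of rows with a recorded status. The sketch below, on a made-up `status` series, shows how `dropna=False` changes the denominator.

```python
import numpy as np
import pandas as pd

# Made-up statuses, including two missing values.
status = pd.Series(['N', 'N', 'C', 'O', np.nan, 'N', np.nan, 'C'],
                   name='clearance_status')

# Default: proportions among non-missing rows only (6 rows here).
share_of_known = status.value_counts(normalize=True)

# dropna=False: proportions among all rows, with missing values as their own group.
share_of_all = status.value_counts(normalize=True, dropna=False)

# Share of rows with no recorded status.
missing_share = status.isna().mean()
```

Either denominator keeps cleared by exception well under 1 percent in the real data, so the decision is unaffected; the choice only changes how the percentage should be described.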

In [12]:
# Remove rows with the 'O' value (cleared by exception)
df = df[df.clearance_status != 'O']

In [13]:
df.clearance_status.value_counts()

N    281250
C     73852
Name: clearance_status, dtype: int64

**Testing this change after adding it to prepare.py**

In [15]:
df = wrangle_crime_data(drop_cleared_by_exception=True)
df.head()

Using cached csv


Unnamed: 0,crime_type,family_violence,occurrence_time,occurrence_date,report_time,report_date,location_type,address,zip_code,council_district,sector,district,latitude,longitude,clearance_status,clearance_date,cleared,time_to_report
34573,ASSAULT ON PUBLIC SERVANT,N,2021-12-31 23:50:00,2021-12-31,2021-12-31 23:50:00,2021-12-31,COMMERCIAL / OFFICE BUILDING,111 CONGRESS AVE,78701.0,9.0,GE,3,30.263739,-97.743651,cleared by arrest,2022-01-03,True,0 days 00:00:00
34574,THEFT,N,2021-12-31 23:50:00,2021-12-31,2022-01-07 14:12:00,2022-01-07,OTHER / UNKNOWN,6936 E BEN WHITE BLVD SVRD WB,78741.0,3.0,HE,5,30.215264,-97.703019,not cleared,2022-01-10,False,6 days 14:22:00
34575,PUBLIC INTOXICATION,N,2021-12-31 23:50:00,2021-12-31,2021-12-31 23:50:00,2021-12-31,HWY / ROAD / ALLEY/ STREET/ SIDEWALK,406 E 6TH ST,78701.0,9.0,GE,2,30.2673,-97.738857,cleared by arrest,2021-12-31,True,0 days 00:00:00
34576,DOC DISCHARGE GUN - PUB PLACE,N,2021-12-31 23:47:00,2021-12-31,2021-12-31 23:47:00,2021-12-31,RESIDENCE / HOME,1202 E ST JOHNS AVE,78752.0,4.0,ID,1,30.328049,-97.693683,not cleared,2022-01-05,False,0 days 00:00:00
34577,AGG ASLT STRANGLE/SUFFOCATE,Y,2021-12-31 23:40:00,2021-12-31,2022-01-01 00:44:00,2022-01-01,RESIDENCE / HOME,10000 N LAMAR BLVD,78758.0,4.0,ED,1,30.369262,-97.695105,not cleared,2022-01-05,False,0 days 01:04:00


In [16]:
df.clearance_status.value_counts()

not cleared          275577
cleared by arrest     72431
Name: clearance_status, dtype: int64