# <center><u>**Crime Data Part 1**</u></center>
* Authored By: Eric N Valdez
* Date: 02/11/2024

## <u>Chicago Crime Data</u>
* **We have prepared a zip file with the Chicago crime data** which you can download [here](https://drive.google.com/file/d/1avxUlCAros-R9GF6SKXqM_GopzO7VwA5/view).
* `Original Source:` Chicago Data Portal: Crimes 2001 to Present
    * Data Description:
        * All crimes reported in the city of Chicago, with their details. [View Preview](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2/data)
    * Includes:
        * Type of crime, exact date/time, lat/long, district
    * `Note:` 
    * We have provided a .zip file (linked above) with the data in a repo-friendly format. If you are curious about the code that converts the downloaded file into the .zip file of individual years, please see [this helper notebook.](https://github.com/coding-dojo-data-science/preparing-chicago-crime-data/blob/admin/Workflow%20-%20Prep%20Chicago%20Crime%20Data.ipynb)

* **Supplemental Data: Holiday Data**
    * Check the lesson on "Feature Engineering: Holidays" to see how to use the Python 'holidays' package to add holidays to your dataset.
* **Notes/Considerations:**
    * You may need to keep 2 forms of the data:
        * The `original` individual crime data with a datetime index. `(Each row is 1 crime)`
        * A resampled / converted crime counts version `(Each row is 1 day)`  

# <u>Task</u>
Your task is to answer a series of questions about trends in crimes in Chicago for a reporter for the local newspaper.

**Stakeholder Questions to Answer (Pick at least 3 topics)**:

<u>Select 3 or more of the following topics to analyze:</u>

### <u>Helper Notes</u>
* Load the data
* Holidays
* pd.to_datetime
* df.set_index('Date')
* df.sort_index()
* crime_counts = df.groupby('Primary Type').resample('D').size()
* df_counts = crime_counts.unstack(level = 0)
* df_counts = df_counts.fillna(0)
* Then feature engineering
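
The steps above can be sketched end to end on a tiny DataFrame. The rows below are made up for illustration (they are not real Chicago records); only the column names `Date` and `Primary Type` and the timestamp format come from the dataset, and the `ID` values are placeholders.

```python
import pandas as pd

# Tiny made-up sample (NOT real Chicago records) using the dataset's column names
raw = pd.DataFrame({
    "ID": [1, 2, 3, 4],  # placeholder IDs
    "Date": ["01/02/2001 01:00:00 PM", "01/01/2001 09:30:00 AM",
             "01/01/2001 11:15:00 PM", "01/03/2001 12:05:00 AM"],
    "Primary Type": ["THEFT", "BATTERY", "THEFT", "BATTERY"],
})

# 1) Parse the 12-hour timestamps, 2) index by date, 3) sort chronologically
raw["Date"] = pd.to_datetime(raw["Date"], format="%m/%d/%Y %I:%M:%S %p")
sample = raw.set_index("Date").sort_index()

# 4) Count crimes per type per day, 5) move types into columns, 6) fill gaps with 0
crime_counts = sample.groupby("Primary Type").resample("D").size()
df_counts = crime_counts.unstack(level=0).fillna(0)
print(df_counts)
```

In the result each row is one day and each column is one crime type, which is the "crime counts" form of the data described earlier.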

## <u>Imports</u>

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os
import calendar

# imports for supplemental data
!pip install holidays
import holidays
import datetime as dt
from holidays import country_holidays

# import tick customization tools
import matplotlib.ticker as mticks
import matplotlib.dates as mdates

## Setting figures to timeseries-friendly
plt.rcParams['figure.figsize'] = (12,4)
plt.rcParams['figure.facecolor'] = 'white'
sns.set_context("talk", font_scale=0.9)

# set random seed
SEED = 321
np.random.seed(SEED)

#display more columns
pd.set_option('display.max_columns',50)



In [2]:
# function to format y-axis units
def thousands(x, pos):
    """formats count in thousands"""
    new_x = x / 1000
    return f"{new_x:,.0f}K"
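
A formatter like `thousands` only takes effect once it is attached to an axis. The sketch below shows one way to do that with `matplotlib.ticker.FuncFormatter`; the plotted values are made up purely to give the axis something to label.

```python
import matplotlib.pyplot as plt
import matplotlib.ticker as mticks

def thousands(x, pos):
    """formats count in thousands"""
    return f"{x / 1000:,.0f}K"

# Made-up values, used only to demonstrate the tick labels
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1_000, 25_000, 40_000])

# FuncFormatter calls thousands(value, position) for every major tick on the y-axis
ax.yaxis.set_major_formatter(mticks.FuncFormatter(thousands))

print(thousands(12_000, None))
```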

In [3]:
# # Set the path to the directory containing  CSV files
# csv_files_path = 'Data/*.csv'
# # Use the glob module to get a list of all CSV files in the specified directory
# file_list = glob.glob(csv_files_path)
# file_list
### Changing the load process to see if it helps the code below
file = "Data/Chicago-Crime*.csv"
crime_data = sorted(glob.glob(file))
crime_data

['Data\\Chicago-Crime_2001.csv',
 'Data\\Chicago-Crime_2002.csv',
 'Data\\Chicago-Crime_2003.csv',
 'Data\\Chicago-Crime_2004.csv',
 'Data\\Chicago-Crime_2005.csv',
 'Data\\Chicago-Crime_2006.csv',
 'Data\\Chicago-Crime_2007.csv',
 'Data\\Chicago-Crime_2008.csv',
 'Data\\Chicago-Crime_2009.csv',
 'Data\\Chicago-Crime_2010.csv',
 'Data\\Chicago-Crime_2011.csv',
 'Data\\Chicago-Crime_2012.csv',
 'Data\\Chicago-Crime_2013.csv',
 'Data\\Chicago-Crime_2014.csv',
 'Data\\Chicago-Crime_2015.csv',
 'Data\\Chicago-Crime_2016.csv',
 'Data\\Chicago-Crime_2017.csv',
 'Data\\Chicago-Crime_2018.csv',
 'Data\\Chicago-Crime_2019.csv',
 'Data\\Chicago-Crime_2020.csv',
 'Data\\Chicago-Crime_2021.csv',
 'Data\\Chicago-Crime_2022.csv']

In [4]:
# # Looking over data
# df = pd.concat([pd.read_csv(f) for f in file_list])
# df
# looking to see if this new code will help me out better for the codes below
df = pd.concat([pd.read_csv(f, lineterminator='\n') for f in crime_data])
df.head()

Unnamed: 0,ID,Date,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Latitude,Longitude
0,1326041,01/01/2001 01:00:00 AM,BATTERY,SIMPLE,RESIDENCE,False,False,1624,16.0,,41.95785,-87.749185
1,1319931,01/01/2001 01:00:00 PM,BATTERY,SIMPLE,RESIDENCE,False,True,825,8.0,,41.783892,-87.684841
2,1324743,01/01/2001 01:00:00 PM,GAMBLING,ILLEGAL ILL LOTTERY,STREET,True,False,313,3.0,,41.780412,-87.61197
3,1310717,01/01/2001 01:00:00 AM,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,2424,24.0,,42.012391,-87.678032
4,1318099,01/01/2001 01:00:00 AM,BATTERY,SIMPLE,RESIDENCE PORCH/HALLWAY,False,True,214,2.0,,41.819538,-87.62002


In [5]:
# CORRECT - properly recognizes dates and does not interpret them as seconds
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p')
df['Date']

0        2001-01-01 01:00:00
1        2001-01-01 13:00:00
2        2001-01-01 13:00:00
3        2001-01-01 01:00:00
4        2001-01-01 01:00:00
                 ...        
238853   2022-12-31 12:50:00
238854   2022-12-31 12:50:00
238855   2022-12-31 00:52:00
238856   2022-12-31 12:52:00
238857   2022-12-31 12:59:00
Name: Date, Length: 7713109, dtype: datetime64[ns]
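
The explicit `format` string matters because the raw timestamps use a 12-hour clock with an AM/PM suffix: `%I` reads the 12-hour value and `%p` reads the suffix. A minimal check on three strings in the dataset's format:

```python
import pandas as pd

# Three timestamps in the dataset's 12-hour format
stamps = pd.Series(["01/01/2001 01:00:00 AM",
                    "01/01/2001 01:00:00 PM",
                    "12/31/2022 12:52:00 AM"])

# %I = hour on a 12-hour clock, %p = AM/PM marker
parsed = pd.to_datetime(stamps, format="%m/%d/%Y %I:%M:%S %p")

# 1 AM stays hour 1, 1 PM becomes hour 13, 12:52 AM becomes hour 0
print(parsed.dt.hour.tolist())
```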

In [6]:
#Inspect the new index of your dataframe.
df = df.set_index('Date')
df.head(3)

Date,ID,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Latitude,Longitude
2001-01-01 01:00:00,1326041,BATTERY,SIMPLE,RESIDENCE,False,False,1624,16.0,,41.95785,-87.749185
2001-01-01 13:00:00,1319931,BATTERY,SIMPLE,RESIDENCE,False,True,825,8.0,,41.783892,-87.684841
2001-01-01 13:00:00,1324743,GAMBLING,ILLEGAL ILL LOTTERY,STREET,True,False,313,3.0,,41.780412,-87.61197


In [7]:
# Checking the dataframe index
df.index

DatetimeIndex(['2001-01-01 01:00:00', '2001-01-01 13:00:00',
               '2001-01-01 13:00:00', '2001-01-01 01:00:00',
               '2001-01-01 01:00:00', '2001-01-01 01:00:00',
               '2001-01-01 01:00:00', '2001-01-01 01:00:00',
               '2001-01-01 01:00:00', '2001-01-01 01:00:00',
               ...
               '2022-12-31 12:41:00', '2022-12-31 00:42:00',
               '2022-12-31 00:44:00', '2022-12-31 00:45:00',
               '2022-12-31 12:45:00', '2022-12-31 12:50:00',
               '2022-12-31 12:50:00', '2022-12-31 00:52:00',
               '2022-12-31 12:52:00', '2022-12-31 12:59:00'],
              dtype='datetime64[ns]', name='Date', length=7713109, freq=None)

## `Feature Engineering Holidays`

In [8]:
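# making a date range that covers full dataset
# ('Date' is now the index, so read the bounds from df.index; normalize() drops the time of day)
all_days = pd.date_range(df.index.min().normalize(), df.index.max().normalize())
all_days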

In [None]:
# Create an instance of the US country holidays.
us_holidays = country_holidays('US')
us_holidays

In [None]:
# Testing first date
print(all_days[0])
us_holidays.get(all_days[0])

In [None]:
# Getting us holidays for all dates
holiday_list = [us_holidays.get(day) for day in all_days]

In [None]:
# For a specific subdivisions (e.g. state or province):
il_holidays = country_holidays('US', subdiv='IL')
il_holidays

In [None]:
# Saving both holiday types as columns ('Date' is the index, so iterate over df.index)
df["US Holiday"] = [us_holidays.get(day) for day in df.index]
df['IL Holiday'] = [il_holidays.get(day) for day in df.index]
df.head()
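
Looking up a holiday for every one of several million rows repeats the same lookup for each crime on the same day. One possible alternative, sketched below, looks up each calendar day once and reuses the answer. `holiday_names` is a hypothetical helper, not part of the `holidays` package, and the demo uses a small hand-written dict as a stand-in for `us_holidays` so that it runs without the package installed.

```python
import datetime as dt
import pandas as pd

def holiday_names(index, lookup):
    """Return one holiday name (or None) per timestamp in `index`.

    `lookup` is any object with a .get(date) method, such as the notebook's
    `us_holidays`. Each calendar day is looked up once and then reused.
    """
    days = pd.DatetimeIndex(index).normalize()  # drop the time of day
    name_by_day = {day: lookup.get(day.date()) for day in days.unique()}
    return pd.Series([name_by_day[day] for day in days], index=index)

# Stand-in lookup (hypothetical) so the sketch runs without the `holidays` package
demo_lookup = {dt.date(2022, 1, 1): "New Year's Day",
               dt.date(2022, 7, 4): "Independence Day"}
demo_index = pd.to_datetime(["2022-01-01 01:00", "2022-01-01 13:00",
                             "2022-01-02 09:00", "2022-07-04 21:30"])

demo_holidays = holiday_names(demo_index, demo_lookup)
print(demo_holidays)
```

With the real data this could be used as `df["US Holiday"] = holiday_names(df.index, us_holidays).to_numpy()`; assigning the raw values avoids aligning on an index that contains duplicate timestamps.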

In [None]:
# US Holidays
df['US Holiday'].value_counts()

In [None]:
# Illinois Holidays
df['IL Holiday'].value_counts()

In [None]:
# Saving a binary is holiday feature
df['Is_Holiday'] = df['US Holiday'].notna()
df['Is_Holiday'].value_counts()

In [None]:
# 'Date' was already converted to datetime and set as the index in the cells above,
# so repeating those steps here would raise a KeyError. Confirm the index type instead.
df.index.dtype

In [None]:
# Sorting df index
df = df.sort_index()

In [None]:
# Creating a Copy df_crime
df_crime = df.copy()

In [None]:
# Grouping by "Primary Type" and resampling to daily
crime_counts = df_crime.groupby('Primary Type').resample('D').size()
crime_counts

In [None]:
df_counts = crime_counts.unstack(level=0)
df_counts = df_counts.fillna(0)
df_counts

In [None]:
# Adding Date Only, Hour of Day, Year, Month and Day of Week to df_crime (Feature Engineering)
df_crime['Date Only'] = df_crime.index.date
df_crime["Date Only"] = df_crime["Date Only"].astype(str)

df_crime['HourOfDay'] = df_crime.index.hour
df_crime['Year'] = df_crime.index.year
df_crime['Month'] = df_crime.index.month
df_crime['DayOfWeek'] = df_crime.index.day_name()
df_crime.head()

## <u>Topic 1) Comparing Police Districts</u>
* Which district had the most crimes in 2022?
* Which had the least?

In [None]:
df['District'].value_counts()

### `The highest number of crimes was in district 8; the lowest was in district 21.`
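
Note that `df['District'].value_counts()` counts every year in the dataset. To answer the question for 2022 alone, one option is to slice the sorted DatetimeIndex by year before counting. The sketch below uses made-up rows (not real data), so its district numbers say nothing about the actual result.

```python
import pandas as pd

# Made-up rows (NOT real data) indexed by date, mimicking the `District` column
toy = pd.DataFrame(
    {"District": [8.0, 8.0, 21.0, 8.0, 8.0, 11.0, 11.0, 11.0]},
    index=pd.to_datetime(["2021-05-01", "2021-06-01", "2022-01-15", "2022-02-01",
                          "2022-02-20", "2022-03-10", "2022-07-04", "2022-12-31"]),
).sort_index()

# Partial-string indexing keeps only the rows whose timestamp falls in 2022
counts_2022 = toy.loc["2022", "District"].value_counts()

print("most crimes:", counts_2022.idxmax())
print("fewest crimes:", counts_2022.idxmin())
```

On the notebook's data the equivalent call would be `df.loc["2022", "District"].value_counts()`, which relies on `df` having been sorted by its index.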

## <u>Topic 2) Crimes Across the Years:</u>
* Is the total number of crimes increasing or decreasing across the years?

In [None]:
#df = df_crime.resample("A").size()
df_annual  = df_crime.groupby("Year").size()
df_annual.head()

In [None]:
ax = df_annual.plot(style='o-')

* `Are there any individual crimes that are doing the opposite` ***`(e.g., decreasing when overall crime is increasing or vice-versa)`***?

In [None]:
df_annual_by_crime = df_crime.groupby(['Primary Type'])['Year'].value_counts().sort_index()
df_annual_by_crime
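
One simple way to flag crime types that move against the overall trend is to compare the sign of each type's net change with the sign of the overall net change. The sketch below does this on made-up rows (not real data). Comparing only the first and last year is a deliberately crude assumption; a fitted slope or a rolling average may be a better measure on the real series.

```python
import pandas as pd

# Made-up records (NOT real data): one row per crime
toy = pd.DataFrame({
    "Year": [2001] * 5 + [2002] * 4 + [2003] * 3,
    "Primary Type": ["THEFT", "THEFT", "THEFT", "BATTERY", "NARCOTICS",
                     "THEFT", "THEFT", "BATTERY", "NARCOTICS",
                     "THEFT", "NARCOTICS", "NARCOTICS"],
})

# Rows = years, columns = crime types, values = counts
by_year = pd.crosstab(toy["Year"], toy["Primary Type"])

# Net change from the first year to the last, overall and per crime type
totals = by_year.sum(axis=1)
overall_change = totals.iloc[-1] - totals.iloc[0]
type_change = by_year.iloc[-1] - by_year.iloc[0]

# Crime types whose net change has the opposite sign to the overall change
opposite = type_change[type_change * overall_change < 0]
print("overall change:", overall_change)
print(opposite)
```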

## <u>Topic 3) Comparing AM vs PM Rush Hour:</u>
* Are crimes more common during the AM rush hour or PM rush hour?
    * You can consider any crime that occurred between 7 AM - 10 AM as AM rush hour.
    * You can consider any crime that occurred between 4 - 7 PM as PM rush hour.
* `Answer the questions:` What are the top 5 most common crimes during AM rush hour? What are the top 5 most common crimes during PM rush hour?
* `Answer the questions:` Are Motor Vehicle Thefts more common during AM rush hour or PM rush hour?
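
The rush-hour windows can be selected from the hour of the DatetimeIndex. The sketch below runs on made-up rows (not real data) and assumes each window includes its start hour and excludes its end hour, so a crime at exactly 10:00 AM is not counted as AM rush hour.

```python
import pandas as pd

# Made-up rows (NOT real data) with a DatetimeIndex, like the notebook's df
toy = pd.DataFrame(
    {"Primary Type": ["THEFT", "BATTERY", "THEFT", "MOTOR VEHICLE THEFT",
                      "THEFT", "MOTOR VEHICLE THEFT", "BATTERY"]},
    index=pd.to_datetime(["2022-03-01 07:15", "2022-03-01 08:40",
                          "2022-03-01 09:59", "2022-03-01 10:00",
                          "2022-03-01 16:05", "2022-03-01 17:30",
                          "2022-03-01 18:45"]),
)

hours = toy.index.hour
# Assumption: windows include the start hour and exclude the end hour
am_rush = toy[(hours >= 7) & (hours < 10)]    # 7:00 AM up to 9:59 AM
pm_rush = toy[(hours >= 16) & (hours < 19)]   # 4:00 PM up to 6:59 PM

print("AM total:", len(am_rush), "PM total:", len(pm_rush))
print(am_rush["Primary Type"].value_counts().head(5))  # top 5 AM crime types
print(pm_rush["Primary Type"].value_counts().head(5))  # top 5 PM crime types

# Motor vehicle thefts in each window
am_mvt = (am_rush["Primary Type"] == "MOTOR VEHICLE THEFT").sum()
pm_mvt = (pm_rush["Primary Type"] == "MOTOR VEHICLE THEFT").sum()
print("MVT AM:", am_mvt, "MVT PM:", pm_mvt)
```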

## <u>Topic 4) Comparing Months:</u>
* `Answer the questions:` What months have the most crime? What months have the least?

* `Answer the questions:` Are there any individual crimes that do not follow this pattern? If so, which crimes?
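
A possible starting point for both questions is to count crimes per calendar month (pooled across years) and then compare each crime type's peak month with the overall peak. The rows below are made up (not real data), and "does not follow the pattern" is interpreted here, as an assumption, to mean "peaks in a different month than total crime".

```python
import calendar
import pandas as pd

# Made-up rows (NOT real data) with a DatetimeIndex
toy = pd.DataFrame(
    {"Primary Type": ["THEFT", "THEFT", "THEFT", "BATTERY", "BATTERY", "THEFT"]},
    index=pd.to_datetime(["2021-07-04", "2021-07-19", "2022-07-02",
                          "2021-02-10", "2022-02-11", "2022-12-24"]),
)

# Total crimes per calendar month (1-12), pooled across years
monthly_totals = toy.groupby(toy.index.month).size()
busiest = calendar.month_name[monthly_totals.idxmax()]
quietest = calendar.month_name[monthly_totals.idxmin()]

# Counts per month and crime type: rows = months, columns = crime types
by_type = toy.groupby([toy.index.month, "Primary Type"]).size().unstack(fill_value=0)

# Crime types whose peak month differs from the overall peak month
peak_month = by_type.idxmax()
deviating = peak_month[peak_month != monthly_totals.idxmax()]

print(busiest, quietest)
print(deviating)
```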

## <u>Topic 5) Comparing Holidays:</u>

* `Answer the questions:` What are the top 3 holidays with the largest number of crimes?

In [None]:
# Looking at US Holidays
df['US Holiday'].value_counts()

`The top 3 holidays for crime are` 

* `Answer the questions:` For each of the top 3 holidays with the most crime, what are the top 5 most common crimes on that holiday?

In [None]:
df.groupby('US Holiday')['Primary Type'].value_counts()
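
The grouped counts above cover every holiday at once. To narrow them to the top 3 holidays and the top 5 crime types on each, one option is to take the head of `value_counts()` twice. The sketch uses made-up rows (not real data), so the ranking it prints is not the answer for Chicago.

```python
import pandas as pd

# Made-up rows (NOT real data); None marks a non-holiday, as in the notebook
toy = pd.DataFrame({
    "US Holiday": ["New Year's Day"] * 4 + ["Independence Day"] * 3
                  + ["Labor Day"] * 2 + ["Christmas Day"] + [None] * 2,
    "Primary Type": ["BATTERY", "BATTERY", "THEFT", "ASSAULT",
                     "THEFT", "THEFT", "BATTERY",
                     "THEFT", "ASSAULT",
                     "THEFT",
                     "THEFT", "BATTERY"],
})

# value_counts() sorts descending and skips missing values, so head(3) is the top 3
top_holidays = toy["US Holiday"].value_counts().head(3)

# For each top holiday, the (up to) 5 most common crime types
top_crimes = {
    name: toy.loc[toy["US Holiday"] == name, "Primary Type"].value_counts().head(5)
    for name in top_holidays.index
}

print(top_holidays)
print(top_crimes["New Year's Day"])
```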

In [None]:
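# Plot New Year's Day crime counts by type
# (`holiday` holds the per-holiday crime type counts computed in the previous cell)
holiday = df.groupby('US Holiday')['Primary Type'].value_counts()
ax = holiday.loc["New Year's Day"].plot(kind='bar')
ax.set(title="New Year's Day Crimes", ylabel='# of Crimes');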

## <u>Topic 6) What cycles (seasonality) can you find in this data?</u>
* **Make sure to select the data of interest and that it is resampled to the frequency you want.** `(See the "Suggested data to check for seasons" list at the bottom of topic for suggestions).`
* **Use statsmodels.tsa.seasonal.seasonal_decompose() to decompose the time series.**
    *  `Note:` seasonal_decompose cannot read data resampled as minutes or smaller, and if you try seconds, you will crash your computer. Keep your resampling at hours or more.
* **Show and describe each cycle you can find.**
    * ***(Hint: If your seasonal results are too dense to read, try zooming in to look at just one year or one month and try different levels of resampling).***
    * `Answer the questions:` How long is a cycle?
    * `Answer the questions:` What is the magnitude of the cycle? ***(Compare min and max)***
* <u>Suggested data to check for seasons:</u>
    * Total Crime `(Daily)`
    * Total Crime `(Weekly)`
    * Total Crime `(Monthly)`
    * Select a Primary Type of interest to you `(Daily)`
    * Select a Primary Type of interest to you `(Weekly)`
    * Select a Primary Type of interest to you `(Monthly)`