# Part--1 Defining the CHICAGO CRIMES Dataset and Our Goals
This dataset reflects reported incidents of crime that occurred in the City of Chicago from 2001 to the present. The data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system.
We will go through Exploratory Data Analysis first, then do some Time Series Analysis.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import pydeck as pdk
import folium

# Exploring the Dataset.

# Let's visualize the null values and possibly gain some insight into the data-gathering process. For this we use the missingno library.

We first load the dataset and take a look at it; after that we can decide how to clean it. It is always a good idea to inspect the null values closely, and "missingno" helps with that.

In [None]:
df = pd.read_csv("Crimes_-_2001_to_Present.csv", low_memory=False) # loading the dataset.

In [None]:
df.head(5) # we have 22 columns in the dataset

In [None]:
df.isnull().sum() # let's see the number of missing values per column.

In [None]:
df.shape # Here we see that we are dealing with more than 7M rows of data.

In [None]:
df.info() # this returns the column names and their dtypes,
          # as well as the DataFrame's memory usage.

In [None]:
list(df.columns) # this way we can also take a look at the column
                 # names

# Observations about the data.
From the above cells we can observe that there are 22 columns and well over 7.2 million rows. We also see the data type of every column.
The Date column needs to be converted to a datetime type (pandas `datetime64`, the counterpart of Python's `datetime.datetime`) so that we can extract the month, time, and day-of-week information.
Some of the columns in the dataset overlap in the information they carry, so it would not make sense to keep all of them in our final dataframe. We will handle this later.
Feeding such a redundant set of columns into a machine learning (ML) task is bad practice, as it does not help the model generalize well on the data.
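
As a minimal sketch of the date conversion described above, the snippet below parses two made-up timestamps and pulls out the calendar fields. The 12-hour month/day/year format string is an assumption about how the raw Date column is written; adjust it if the file differs.

```python
import pandas as pd

# Two made-up timestamps in an assumed month/day/year 12-hour format.
sample = pd.DataFrame({"Date": ["01/05/2015 01:30:00 PM", "07/03/2016 11:15:00 AM"]})

# Parse the strings; an explicit format avoids per-row format guessing.
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y %I:%M:%S %p")

# The .dt accessor exposes the calendar fields used later in the notebook.
sample["month"] = sample["Date"].dt.month
sample["day_of_week"] = sample["Date"].dt.dayofweek  # Monday=0 ... Sunday=6
sample["hour_of_day"] = sample["Date"].dt.hour
print(sample)
```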

If we are dealing with a real-world dataset, it is common to see missing values. To build a good machine learning model we need a good understanding of how the NaN values are distributed in our dataset.

The missingno library offers a very nice way to visualize the distribution of NaN values.

In [None]:
import missingno as msno

# Visualize missing values as a matrix

The matrix view makes it quick to spot patterns of missingness in the dataset. The columns X Coordinate, Y Coordinate, Latitude, Longitude, and Location share a similar pattern of missing values, while the other columns show a different pattern.

In [None]:
msno.matrix(df)

The bar chart gives an idea of how many missing values there are in each column.

It shows bars that are proportional to the number of non-missing values as well as providing the actual number of non-missing values. We get an idea of how much of each column is missing.

In [None]:
msno.bar(df)

The heatmap shows the correlation of missingness between every pair of columns. In our example, the correlation between X Coordinate and Latitude is 1, which means that whenever one of them is present the other one is present as well.


A value near -1 means if one variable appears then the other variable is very likely to be missing.
A value near 0 means there is no dependence between the occurrence of missing values of two variables.
A value near 1 means if one variable appears then the other variable is very likely to be present.
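
The three cases above can be reproduced by hand with plain pandas. This is only a sketch of the idea (correlating the 0/1 "is missing" indicators), not missingno's own code, and the toy columns, in particular `alt_note`, are made up for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
toy = pd.DataFrame({
    # x_coordinate and latitude are missing in exactly the same rows.
    "x_coordinate": [nan, nan, 3.0, 4.0, nan, nan, 7.0, 8.0],
    "latitude":     [nan, nan, 41.9, 41.8, nan, nan, 41.7, 41.6],
    # alt_note (hypothetical) is present only where x_coordinate is missing.
    "alt_note":     ["a", "b", nan, nan, "c", "d", nan, nan],
    # ward is missing independently of x_coordinate.
    "ward":         [nan, 2.0, nan, 4.0, nan, 6.0, nan, 8.0],
})

# 1 where a value is missing, 0 where it is present.
nullity = toy.isnull().astype(int)
nullity_corr = nullity.corr()
print(nullity_corr.round(2))
```

In this toy frame `x_coordinate` and `latitude` correlate at 1, `x_coordinate` and `alt_note` at -1, and `x_coordinate` and `ward` at 0.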

In [None]:
msno.heatmap(df)

Below we plot a dendrogram, which shows a hierarchical clustering of the columns based on the correlation of their missing values. Columns whose missing values are strongly related are placed in the same cluster.

In [None]:
msno.dendrogram(df)

# we can also visualize this using seaborn.

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(df.isnull(), cbar = False, cmap = "viridis")

We can see that most of the null values fall in 2001 and 2002. Perhaps due to a lack of organization in the early days of this dataset, it is mostly latitudes, longitudes, X/Y coordinates, and wards that are missing. After that, the dataset gets steadily better at capturing the latitude and longitude of each record.
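
The per-year pattern can also be checked numerically rather than read off the plot. The sketch below uses a made-up eight-row frame in place of the real data; on the full dataset the same expression would be applied to `df["Latitude"]` and `df["Year"]`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame: early rows miss most coordinates.
toy = pd.DataFrame({
    "Year": [2001, 2001, 2001, 2001, 2010, 2010, 2010, 2010],
    "Latitude": [np.nan, np.nan, np.nan, 41.9, 41.8, 41.7, np.nan, 41.6],
})

# The mean of a boolean mask is the share of True values, here the share of NaNs.
missing_share = toy["Latitude"].isnull().groupby(toy["Year"]).mean()
print(missing_share)
```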

# Now that we have a basic understanding of the null values and the structure of the dataset, we can write a class that loads the dataset, with methods to transform and clean the data.

In [None]:
class DataTransformer():
    def __init__(self, dataframe_path, nrows):
        self.nrows = nrows
        self.dataframe = pd.read_csv(dataframe_path,
                                    nrows=self.nrows,
                                    low_memory=False)
        
    def transformer(self):
        # Transforming the Date column from object to datetime.
        self.dataframe["Date"] = pd.to_datetime(self.dataframe["Date"])
        # Extracting meaningful data from the Date column, which is now in datetime format.
        self.dataframe["month"] = self.dataframe["Date"].dt.month
        self.dataframe["day_of_month"] = self.dataframe["Date"].dt.day
        self.dataframe["day_of_year"] = self.dataframe["Date"].dt.dayofyear
        self.dataframe["day_of_week"] = self.dataframe["Date"].dt.dayofweek
        self.dataframe["hour_of_day"] = self.dataframe["Date"].dt.hour
        # Renaming all columns to lowercase, with spaces replaced by underscores.
        lowercase = lambda x: x.replace(" ", "_").lower()
        self.dataframe.rename(lowercase, axis="columns", inplace=True)

    def cleaning(self):
        self.transformer()
        # In some tests we see a few points outside the Chicago region,
        # so it's better to remove those points.
        self.dataframe = self.dataframe[
            (self.dataframe.latitude >= 41.64) & (self.dataframe.longitude <= -87.50)
        ].copy()

        # In the dataset there are 23 points that have x_coordinate and y_coordinate
        # set to zero, so it is wise to remove them: mark them as NaN, then drop them.
        coords = ['x_coordinate', 'y_coordinate']
        self.dataframe[coords] = self.dataframe[coords].replace(0.0, np.nan)
        self.dataframe.dropna(inplace=True)

        # In the dataset we have two ways of getting the year:
        # one is dataset["date"].dt.year, the other is the year column (dataset.year).
        # Keep only the rows where these two values agree.
        self.dataframe = self.dataframe[
            self.dataframe["date"].dt.year == self.dataframe["year"]
        ].copy()

        # It is always good to filter out possible duplicates.
        self.dataframe.drop_duplicates(subset=['id', 'case_number'], inplace=True)

        return self.dataframe

In [None]:
data = DataTransformer("Crimes_-_2001_to_Present.csv",7245927).cleaning()
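
Two details of the cleaning step are easy to get wrong: a boolean filter returns a new frame and has no effect unless its result is assigned, and zero coordinates only disappear once they are turned into NaN before `dropna`. The sketch below shows both on a made-up three-row frame; it is an illustration, not a replacement for the class above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2015-03-01", "2016-06-10", "2017-09-20"]),
    "year": [2015, 2016, 2018],                    # last row disagrees with its date
    "x_coordinate": [1170000.0, 0.0, 1160000.0],   # middle row has a zero coordinate
})

# Boolean indexing returns a new frame, so the result must be assigned.
consistent = toy[toy["date"].dt.year == toy["year"]]

# Treat a zero coordinate as missing, then drop the incomplete rows.
cleaned = consistent.assign(
    x_coordinate=consistent["x_coordinate"].replace(0.0, np.nan)
).dropna()
print(len(toy), len(consistent), len(cleaned))
```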

# Part--2 EDA SECTION

Let's explore the dataset to find some trends and other meaningful phenomena.
In this part we try to answer some questions and catch some patterns. Along the way data visualization techniques will come in handy; we will mostly use seaborn and plotly.

# Let's take a look at Categories of crime in Chicago

Here we can see that the common crimes include THEFT, BATTERY, and CRIMINAL DAMAGE

In [None]:
data["primary_type"].value_counts() # let's plot this

In [None]:
fig = px.bar(data["primary_type"].value_counts())  # plotly draws its own figure, no plt.figure needed
fig.show()

We clearly see that Theft and Battery are among the most frequently committed crimes over the years.
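
For reference, here is a small sketch of what `value_counts` returns, using six made-up records instead of the real `primary_type` column. Passing `normalize=True` gives shares instead of raw counts, which can be easier to compare.

```python
import pandas as pd

# Six made-up records standing in for data["primary_type"].
crimes = pd.Series(
    ["THEFT", "BATTERY", "THEFT", "NARCOTICS", "THEFT", "BATTERY"],
    name="primary_type",
)

counts = crimes.value_counts()                # sorted from most to least frequent
shares = crimes.value_counts(normalize=True)  # fractions that sum to 1
print(counts)
print(shares)
```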

# It's a good idea to look at the Crime Trends over these years.

Let's focus on the crime categories, which are stored in the primary_type column.

In [None]:
data["primary_type"].unique()

In [None]:
len(data["primary_type"].unique()) # we can see that we are dealing with 35 different crimes
                                   # in our dataset.

# let's visualize all the crimes based on crime's latitude and longitude, for this we use seaborn.

In [None]:
sns.lmplot(x = 'longitude', 
           y = 'latitude',data=data,fit_reg=False, hue="district",palette='Paired',height=10,
           ci=3,scatter_kws={"marker": "D", 
                        "s": 10, "alpha": 0.9})
ax = plt.gca()
ax.set_title("All Crime Distribution per District")

In [None]:
# geographical distribution scatter plots by crime
g = sns.lmplot(x="longitude",y="latitude",col="primary_type",hue="district",data=data, 
               col_wrap=4, height=10, fit_reg=False, sharey=False, palette='Paired',ci=3,
               scatter_kws={"marker": "D",
                            "s": 2, "alpha": 0.9})

Homicides are clustered in the top left and the bottom middle of the scatter plot.

# Let's further look at the top crimes in the city over the years. probably we can catch some neat trends in the dataset.

Let's take a look at the yearly THEFT, CRIMINAL DAMAGE, BATTERY, NARCOTICS, and HOMICIDE counts.
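
Each plot in this section needs the same table of counts per year and crime type. As a hedged alternative to recomputing `groupby(...).value_counts().unstack()` in every cell, the table can be built once with `pd.crosstab`; the sketch below uses a made-up five-row frame.

```python
import pandas as pd

# Toy stand-in for the cleaned data.
toy = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020],
    "primary_type": ["THEFT", "THEFT", "BATTERY", "THEFT", "NARCOTICS"],
})

# Rows are years, columns are crime types, cells are incident counts.
# Combinations that never occur are filled with 0 instead of NaN.
yearly = pd.crosstab(toy["year"], toy["primary_type"])
theft_per_year = yearly["THEFT"]
print(yearly)
```

On the real data, `yearly.reset_index()` could then be passed as `data=` to each bar plot.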

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x='year',
            y='THEFT',
            data=data.groupby(['year'])['primary_type'].value_counts().unstack().reset_index(),
            palette='husl').\
            set_title("CHICAGO THEFT RATES: 2001 - 2020")

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x='year',
            y='CRIMINAL DAMAGE',
            linewidth=4,
            data=data.groupby(['year'])['primary_type'].value_counts().unstack().reset_index(),
            palette='rocket').\
            set_title("CHICAGO CRIMINAL DAMAGE RATES: 2001 - 2020")

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x='year',
            y='BATTERY',
            data=data.groupby(['year'])['primary_type'].value_counts().unstack().reset_index(),
            palette='Set2').\
            set_title("CHICAGO BATTERY RATES: 2001 - 2020")

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x='year',
            y='NARCOTICS',
            data=data.groupby(['year'])['primary_type'].value_counts().unstack().reset_index(),
            palette='tab10').\
            set_title("CHICAGO NARCOTICS RATES: 2001 - 2020")

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x='year',
            y='HOMICIDE',
            data=data.groupby(['year'])['primary_type'].value_counts().unstack().reset_index(),
            palette='hls').\
            set_title("CHICAGO HOMICIDE RATES: 2001 - 2020")

# Let's also visualize Monthly Crime Rates, We can capture monthly trends.

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
month_nms = ['January','February','March','April','May','June','July','August','September','October','November','December']    
fig = sns.barplot(x='month',
                  y='THEFT',
                  data=data.groupby(['year','month'])['primary_type'].value_counts().unstack().reset_index(),palette="crest")

ax.set_xticklabels(month_nms)
plt.title("CHICAGO THEFT RATES by MONTH -- All Years")

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
month_nms = ['January','February','March','April','May','June','July','August','September','October','November','December']    
fig = sns.barplot(x='month',
                  y='BATTERY',
                  data=data.groupby(['year','month'])['primary_type'].value_counts().unstack().reset_index(),palette="husl")

ax.set_xticklabels(month_nms)
plt.title("CHICAGO BATTERY RATES by MONTH -- All Years")

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
month_nms = ['January','February','March','April','May','June','July','August','September','October','November','December']    
fig = sns.barplot(x='month',
                  y='CRIMINAL DAMAGE',
                  data=data.groupby(['year','month'])['primary_type'].value_counts().unstack().reset_index(),palette="viridis")

ax.set_xticklabels(month_nms)
plt.title("CHICAGO CRIMINAL DAMAGE RATES by MONTH -- All Years")

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
month_nms = ['January','February','March','April','May','June','July','August','September','October','November','December']    
fig = sns.barplot(x='month',
                  y='NARCOTICS',
                  data=data.groupby(['year','month'])['primary_type'].value_counts().unstack().reset_index(),palette="mako")

ax.set_xticklabels(month_nms)
plt.title("CHICAGO NARCOTICS RATES by MONTH -- All Years")

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
month_nms = ['January','February','March','April','May','June','July','August','September','October','November','December']    
fig = sns.barplot(x='month',
                  y='HOMICIDE',
                  data=data.groupby(['year','month'])['primary_type'].value_counts().unstack().reset_index(),palette="flare")

ax.set_xticklabels(month_nms)
plt.title("CHICAGO HOMICIDE RATES by MONTH -- All Years")

# Day of the Week Crime Rates

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
week_days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']  # dt.dayofweek numbers Monday as 0
fig = sns.barplot(x='day_of_week',y='THEFT',data=data.groupby(['year','day_of_week'])['primary_type'].value_counts().unstack().reset_index(),palette='Set2')
ax.set_xticklabels(week_days)
plt.title('THEFT RATE BY DAY OF THE WEEK')

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
week_days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']  # dt.dayofweek numbers Monday as 0
fig = sns.barplot(x='day_of_week',y='BATTERY',data=data.groupby(['year','day_of_week'])['primary_type'].value_counts().unstack().reset_index(),palette='husl')
ax.set_xticklabels(week_days)
plt.title('BATTERY RATE BY DAY OF THE WEEK')

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
week_days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']  # dt.dayofweek numbers Monday as 0
fig = sns.barplot(x='day_of_week',y='CRIMINAL DAMAGE',data=data.groupby(['year','day_of_week'])['primary_type'].value_counts().unstack().reset_index(),palette='mako')
ax.set_xticklabels(week_days)
plt.title('CRIMINAL DAMAGE RATE BY DAY OF THE WEEK')

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
week_days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']  # dt.dayofweek numbers Monday as 0
fig = sns.barplot(x='day_of_week',y='NARCOTICS',data=data.groupby(['year','day_of_week'])['primary_type'].value_counts().unstack().reset_index(),palette='rocket_r')
ax.set_xticklabels(week_days)
plt.title('NARCOTICS RATE BY DAY OF THE WEEK')

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
week_days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']  # dt.dayofweek numbers Monday as 0
fig = sns.barplot(x='day_of_week',y='HOMICIDE',data=data.groupby(['year','day_of_week'])['primary_type'].value_counts().unstack().reset_index(),palette='rocket')
ax.set_xticklabels(week_days)
plt.title('HOMICIDE BY DAY OF THE WEEK')
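
pandas numbers the days of the week from Monday (0) to Sunday (6), so the tick labels for a `day_of_week` axis need to start at Monday. The short check below, on three hand-picked dates, shows the mapping and compares it with the built-in `day_name()`.

```python
import pandas as pd

# 2021-03-01 was a Monday, 2021-03-06 a Saturday, 2021-03-07 a Sunday.
dates = pd.Series(pd.to_datetime(["2021-03-01", "2021-03-06", "2021-03-07"]))

codes = dates.dt.dayofweek  # Monday=0 ... Sunday=6
week_days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
names = codes.map(lambda i: week_days[i])

print(codes.tolist())
print(names.tolist())
```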

# Let's also visualize Hourly Crime Rates.

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
fig = sns.barplot(x='hour_of_day',y='THEFT',data=data.groupby(['year','hour_of_day'])['primary_type'].value_counts().unstack().reset_index(),palette='mako',alpha=.9)
plt.title('THEFT BY HOUR OF THE DAY')

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
fig = sns.barplot(x='hour_of_day',y='BATTERY',data=data.groupby(['year','hour_of_day'])['primary_type'].value_counts().unstack().reset_index(),palette='mako',alpha=.9)
plt.title('BATTERY BY HOUR OF THE DAY')

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
fig = sns.barplot(x='hour_of_day',y='CRIMINAL DAMAGE',data=data.groupby(['year','hour_of_day'])['primary_type'].value_counts().unstack().reset_index(),palette='mako',alpha=.9)
plt.title('CRIMINAL DAMAGE BY HOUR OF THE DAY')

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
fig = sns.barplot(x='hour_of_day',y='NARCOTICS',data=data.groupby(['year','hour_of_day'])['primary_type'].value_counts().unstack().reset_index(),palette='mako',alpha=.9)
plt.title('NARCOTICS BY HOUR OF THE DAY')

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
fig = sns.barplot(x='hour_of_day',y='HOMICIDE',data=data.groupby(['year','hour_of_day'])['primary_type'].value_counts().unstack().reset_index(),palette='mako',alpha=.9)
plt.title('HOMICIDE BY HOUR OF THE DAY')

In [None]:
data.head(1)

In [None]:
# Keep only homicides so the counts match the title (the unfiltered frame counts every crime type).
corr = data[data['primary_type'] == 'HOMICIDE'].groupby(['year','district']).size().unstack().fillna(0)
fig, ax = plt.subplots(figsize=(20,10))
sns.set(font_scale=1.0)
sns.heatmap(corr.dropna(axis=1), cbar_kws={'label': 'HOMICIDE'}, annot=True,linewidths=0.2,cmap='Blues',robust=True,)
plt.title('HOMICIDE vs DISTRICT vs YEAR')

In [None]:
# Keep only narcotics cases so the counts match the title.
corr = data[data['primary_type'] == 'NARCOTICS'].groupby(['year','district']).size().unstack().fillna(0)
fig, ax = plt.subplots(figsize=(20,10))
sns.set(font_scale=1.0)
sns.heatmap(corr.dropna(axis=1), cbar_kws={'label': 'NARCOTICS'}, annot=True,linewidths=0.2,cmap='magma',robust=True,)
plt.title('NARCOTICS vs DISTRICT vs YEAR')

# Most Dangerous & Least Dangerous Police Districts

# LET'S take a look at the top places that crimes occur.

In [None]:
plt.figure(figsize = (12, 8))
sns.countplot(y= 'location_description', data = data, order = data['location_description'].value_counts().iloc[:10].index)

In [None]:
data.groupby(['district']).count().arrest.reset_index().sort_values("arrest", ascending=False)

In [None]:
with sns.plotting_context('notebook',font_scale=1.5):
    # count() counts every reported incident in a district (every non-null arrest flag),
    # not only the incidents that led to an arrest.
    incidents_by_district = data.groupby(['district']).count().arrest.reset_index().sort_values("arrest", ascending=False)
    fig, ax = plt.subplots(figsize=(20,6))
    sns.barplot(x='district',
                y='arrest',
                data=incidents_by_district,
                palette='magma',
                order = list(incidents_by_district['district'].astype(int)))
    ax.set_ylabel('reported incidents')
    plt.title('REPORTED INCIDENTS PER DISTRICT')
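
Note that `count()` on the `arrest` column counts rows, whatever the value of the flag. The sketch below, on a made-up frame and assuming `arrest` holds booleans, contrasts the three aggregations: `count` gives incidents, `sum` gives arrests, and `mean` gives the arrest rate.

```python
import pandas as pd

toy = pd.DataFrame({
    "district": [1, 1, 1, 2, 2],
    "arrest": [True, False, False, True, True],
})

by_district = toy.groupby("district")["arrest"].agg(
    incidents="count",   # number of rows per district
    arrests="sum",       # True counts as 1
    arrest_rate="mean",  # share of incidents with an arrest
)
print(by_district)
```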

# Visualizing with MAP

In [None]:
# Group once by a plain column name; get_group then takes the crime type directly.
by_type = data.groupby("primary_type")
THEFT = by_type.get_group("THEFT")
BATTERY = by_type.get_group("BATTERY")
CRIMINAL_DAMAGE = by_type.get_group("CRIMINAL DAMAGE")
NARCOTICS = by_type.get_group("NARCOTICS")
HOMICIDE = by_type.get_group("HOMICIDE")
MOTOR_VEHICLE_THEFT = by_type.get_group("MOTOR VEHICLE THEFT")
ROBBERY = by_type.get_group("ROBBERY")

In [None]:
fig = px.density_mapbox(THEFT, lat='latitude', lon='longitude', z='district', radius=1,
                        center=dict(lat=41.8781, lon=-87.6298), zoom=8,
                        mapbox_style="stamen-watercolor")
fig.show()

In [None]:
fig = px.density_mapbox(BATTERY, lat='latitude', lon='longitude', z='district', radius=1,
                        center=dict(lat=41.8781, lon=-87.6298), zoom=8,
                        mapbox_style="carto-darkmatter")
fig.show()

In [None]:
fig = px.density_mapbox(CRIMINAL_DAMAGE, lat='latitude', lon='longitude', z='district', radius=1,
                        center=dict(lat=41.8781, lon=-87.6298), zoom=8,
                        mapbox_style="carto-darkmatter")
fig.show()

In [None]:
fig = px.density_mapbox(NARCOTICS, lat='latitude', lon='longitude', z='district', radius=1,
                        center=dict(lat=41.8781, lon=-87.6298), zoom=8,
                        mapbox_style="carto-darkmatter")
fig.show()

In [None]:
fig = px.density_mapbox(HOMICIDE, lat='latitude', lon='longitude', z='district', radius=1,
                        center=dict(lat=41.8781, lon=-87.6298), zoom=8,
                        mapbox_style="carto-darkmatter")
fig.show()

In [None]:
fig = px.density_mapbox(MOTOR_VEHICLE_THEFT, lat='latitude', lon='longitude', z='district', radius=1,
                        center=dict(lat=41.8781, lon=-87.6298), zoom=8,
                        mapbox_style="carto-darkmatter")
fig.show()

In [None]:
fig = px.density_mapbox(ROBBERY, lat='latitude', lon='longitude', z='district', radius=1,
                        center=dict(lat=41.8781, lon=-87.6298), zoom=8,
                        mapbox_style="carto-darkmatter")
fig.show()

In [None]:
import plotly.express as px
import geopandas as gpd

fig = px.scatter_geo(NARCOTICS,
                    lat=NARCOTICS.latitude,
                    lon=NARCOTICS.longitude,
                    color=NARCOTICS.district,
                    )
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(
        title = 'NARCOTICS MAP',
        geo_scope="north america", 
        margin={"r":0,"t":0,"l":0,"b":0}
    
    )
fig.show()


In [None]:
kidnapping_df = data.groupby("primary_type").get_group("KIDNAPPING")

In [None]:
import plotly.express as px
import geopandas as gpd

fig = px.scatter_geo(kidnapping_df,
                    lat=kidnapping_df.latitude,
                    lon=kidnapping_df.longitude,
                    color=kidnapping_df.district,
                    )
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(
        title = 'KIDNAPPING MAP',
        geo_scope="north america", 
        margin={"r":0,"t":0,"l":0,"b":0}
    
    )
fig.show()

In [None]:
import json
# Load the police district boundaries; the context manager closes the file afterwards.
with open("chicago_police_districts.geojson", "r") as f:
    chicago_geo = json.load(f)

In [None]:
data.head(2)

In [None]:
import plotly.express as px

df = data
geojson = chicago_geo

fig = px.choropleth(df, geojson=geojson, color="district",
                    locations="district", featureidkey="properties.dist_num",
                    #projection="district"
                   )
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

# Part--3 Time Series Analysis

Here we want to predict the number of thefts in Chicago.
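
Before modelling, it helps to be clear about what the monthly series measures. The sketch below, on six made-up timestamps, contrasts two options: counting the events in each month, and averaging the number of events per distinct timestamp within each month. The second one is undefined (NaN) for a month without any event.

```python
import pandas as pd

# Six made-up theft timestamps; two share the same minute, March has none.
stamps = pd.to_datetime([
    "2020-01-03 10:00", "2020-01-03 10:00", "2020-01-20 18:30",
    "2020-02-11 09:15",
    "2020-04-02 23:45", "2020-04-28 07:05",
])
events = pd.Series(1, index=pd.DatetimeIndex(stamps))

# Option 1: number of events per calendar month ("MS" = month start).
monthly_counts = events.resample("MS").size()

# Option 2: events per distinct timestamp, then the monthly mean of those counts.
per_stamp = events.groupby(level=0).size()
monthly_mean = per_stamp.resample("MS").mean()

print(monthly_counts)
print(monthly_mean)
```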

In [None]:
data.index = pd.DatetimeIndex(data.date) #setting the index to be the date

In [None]:
data.head(2)

In [None]:
data["primary_type"] = pd.Categorical(data["primary_type"])
data["location_description"] = pd.Categorical(data["location_description"])
data["description"] = pd.Categorical(data["description"])

In [None]:
data.dtypes # so we have changed primary_type / location_description / description
            # from object to categorical and here we can see the results.

# Plot the crimes per month

In [None]:
plt.figure(figsize=(20,12))
data.resample("M").size().plot(legend=False)
plt.title("Rate of crimes per Month")
plt.xlabel("Months")
plt.ylabel("Crime Rate")
plt.show()

In [None]:
plt.figure(figsize=(20,12))
data.resample("D").size().rolling(365).sum().plot(legend=False)
plt.title("Rolling sum of crimes")
plt.xlabel("Date")
plt.ylabel("Number of Crimes (365-day rolling sum)")
plt.show()

In [None]:
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
data.groupby([data.index.dayofweek]).size().plot(kind="barh")
plt.ylabel("Day of The Week")
plt.yticks(np.arange(7), days)
plt.xlabel("Number Of Crimes")
plt.title("Crimes per Day of The Week")
plt.show()

In [None]:
plt.figure(figsize=(20,12))
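location_description = data.groupby([data["location_description"]]).size().sort_values(ascending=False)
# Plot only the ten most frequent locations so that the labels stay readable.
top_locations = location_description.head(10)
sns.barplot(x=top_locations.values, y=top_locations.index.astype(str))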

In [None]:
location_description

In [None]:
location_by_type = data.pivot_table(values="id", index="location_description", columns="primary_type",
                                   aggfunc=np.size).fillna(0)

In [None]:
location_by_type

In [None]:
from sklearn.cluster import AgglomerativeClustering as AC
def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result    
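
The function above applies min-max scaling, (x - min) / (max - min), column by column. If a column is constant, max equals min and the division yields NaN, which the clustering step further down cannot handle. A hedged, vectorized variant that maps constant columns to 0 might look like this (`normalize_safe` and the toy column `RARE` are names made up for this sketch):

```python
import pandas as pd

def normalize_safe(df):
    """Min-max scale each column to [0, 1]; constant columns become 0."""
    span = df.max() - df.min()
    # Replace a zero range by 1 so constant columns scale to 0 instead of NaN.
    return (df - df.min()) / span.replace(0, 1)

toy = pd.DataFrame({"THEFT": [0.0, 5.0, 10.0], "RARE": [2.0, 2.0, 2.0]})
scaled = normalize_safe(toy)
print(scaled)
```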

In [None]:
df = normalize(location_by_type)

# This Heatmap shows location frequency for each Crime

In [None]:
ix = AC(3).fit(df.T).labels_.argsort()
plt.figure(figsize=(50,50))
plt.imshow(df.T.iloc[ix,:], cmap="Reds")
plt.colorbar(fraction=0.03)
plt.xticks(np.arange(df.shape[0]), df.index, rotation="vertical")
plt.yticks(np.arange(df.shape[1]), df.columns[ix])  # rows were reordered by ix, so the labels must be too
plt.title("Normalized Location Frequency for each Crime")
plt.grid(False)
plt.show()

# Let's begin with our first Time Series Model which is ARIMA

# Here we focus on the most common crime which is Theft

In [None]:
crimes_theft = data[data["primary_type"] == "THEFT"]

In [None]:
crimes_theft = crimes_theft.drop("date", axis=1)

In [None]:
#crimes_theft["date"].min(), crimes_theft["date"].max()
crimes_theft = crimes_theft.sort_index()

In [None]:
crimes_theft

In [None]:
import statsmodels.api as sm

In [None]:
crimes_theft = data[data["primary_type"] == "THEFT"]
crimes_theft = crimes_theft.groupby([crimes_theft["date"]]).size()

In [None]:
crimes_theft.head(5)

In [None]:
# Resampling by month

In [None]:
plottable = crimes_theft.resample("MS").mean()

In [None]:
plottable

In [None]:
# monthly mean of the number of thefts recorded per timestamp, from 2001 to 2020

In [None]:
# Months without any recorded theft come out as NaN; fill them with 1.
plottable = plottable.fillna(1)
plottable.plot(figsize=(20,12))
plt.show()

In [None]:
from pylab import rcParams
rcParams["figure.figsize"] = 18, 8
decomposition = sm.tsa.seasonal_decompose(plottable, model="additive")
fig = decomposition.plot()
plt.show()

In [None]:
# we can see that the model is additive
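
An additive model assumes the series is the sum of a trend, a seasonal component, and a residual: y_t = T_t + S_t + R_t. The sketch below illustrates that idea by hand on synthetic data with a centered moving average. It is not the statsmodels implementation, the numbers are made up, and the residual is zero here to keep the check exact.

```python
import numpy as np

period = 12
t = np.arange(48)                      # four years of monthly data
trend = 100.0 + 0.5 * t                # linear trend
pattern = np.array([-3, -2, -1, 0, 1, 2, 3, 2, 1, 0, -1, -2], dtype=float)  # sums to 0
seasonal = np.tile(pattern, len(t) // period)
series = trend + seasonal              # additive composition

# Centered 2x12 moving average: half weight on the two end points.
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend_est = np.convolve(series, weights, mode="valid")  # estimates for t = 6 .. 41

# Subtract the estimated trend, then average by position within the year.
half = period // 2
detrended = series[half:-half] - trend_est
positions = t[half:-half] % period
seasonal_est = np.array([detrended[positions == m].mean() for m in range(period)])

print(np.round(seasonal_est, 6))
```

Because the seasonal pattern sums to zero over a full period and the trend is linear, the moving average recovers the trend exactly and the monthly averages recover the pattern.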

# Finding the Parameter combination to feed to the SARIMAX Function of the ARIMA model.

In [None]:
import itertools
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
print("Example of Parameter combinations for Seasonal ARIMA")
print("SARIMAX: {} x {}".format(pdq[1], seasonal_pdq[1]))
print("SARIMAX: {} x {}".format(pdq[1], seasonal_pdq[2]))
print("SARIMAX: {} x {}".format(pdq[2], seasonal_pdq[3]))
print("SARIMAX: {} x {}".format(pdq[2], seasonal_pdq[4]))

# We should find best possible Combination with lowest AIC value

In [None]:
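best_aic, best_order = np.inf, None
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(plottable, order=param, seasonal_order=param_seasonal,
                                           enforce_stationarity=False,
                                           enforce_invertibility=False)
            result = mod.fit(disp=False)  # disp=False silences the optimizer output
        except Exception:
            # Some parameter combinations fail to fit; skip them.
            continue
        print("ARIMA{}x{}  -  AIC:{}".format(param, param_seasonal, result.aic))
        # Remember the combination with the lowest AIC seen so far.
        if result.aic < best_aic:
            best_aic, best_order = result.aic, (param, param_seasonal)
print("Lowest AIC:", best_aic, "for", best_order)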

# Choosing the Values with the Lowest AIC as per the above Step

In [None]:
mod = sm.tsa.statespace.SARIMAX(plottable,
                               order=(0, 1, 1),
                               seasonal_order=(0, 0, 0, 12),
                               enforce_stationarity=False,
                               enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])

In [None]:
results.plot_diagnostics(figsize=(20,12))
plt.show()

In [None]:
pred = results.get_prediction(start=pd.to_datetime("2012-01-01"), dynamic=False)

pred_ci = pred.conf_int()
ax = plottable["2001":].plot(label="observed")
pred.predicted_mean.plot(ax=ax, label="One-Step ahead Forecast",
                        alpha=0.9,
                        color="k",
                        figsize=(20,12))
ax.fill_between(pred_ci.index,
               pred_ci.iloc[:,0],
               pred_ci.iloc[:, 1],
               color="k",
               alpha=0.2)
ax.set_xlabel("Date")
ax.set_ylabel("Theft Rates")
plt.legend()
plt.show()

# LSTM Approach

In [None]:
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
import datetime as dt

In [None]:
pip install tensorflow
