# Timeseries
## Processing and visualization with specific tools

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

* Load the dataset chicago_crimes.csv, which contains the number of crimes in Chicago aggregated on an hourly basis and split into **Arrest** (at least one arrest was made) and **No Arrest** (no arrest was made). Note: the dataset is a simplified version of the official dataset from the city of Chicago available [here](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2). The pre-processing steps are in the notebook chicago_crimes_preproc.ipynb.

In [None]:
# Adapted and simplified from https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2
# (the original dataset is 1.6 GB!)
df_crimes_hourly = pd.read_csv(os.path.join("data", "chicago_crimes.csv"))
df_crimes_hourly.head()

* Create a new column **Total** as the sum of **Arrest** + **No Arrest**

In [None]:
df_crimes_hourly["Total"] = df_crimes_hourly["Arrest"] + df_crimes_hourly["No Arrest"]

* Convert the **Date** column to datetime format

In [None]:
df_crimes_hourly["Date"] = pd.to_datetime(df_crimes_hourly["Date"])

* Set the **Date** column as index of the dataframe (you may modify the existing dataframe df_crimes_hourly)

In [None]:
df_crimes_hourly.set_index("Date", inplace=True)
df_crimes_hourly.head()

* Select the portion of the dataset from January 2001 until the end of April 2021 (included). Use this portion for the rest of the analysis.

In [None]:
df_crimes_hourly = df_crimes_hourly.loc["2001-01":"2021-04"]

In [None]:
df_crimes_hourly.head()

In [None]:
df_crimes_hourly.tail()
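The selection above relies on partial string slicing of a `DatetimeIndex`, where a month label used as the stop of a slice covers the whole month. A minimal sketch on a synthetic hourly index (the `demo` frame is made up for illustration and is not the Chicago data):

```python
import pandas as pd

# Synthetic hourly index spanning the end of April 2021 (illustrative only).
idx = pd.date_range("2021-03-30", "2021-05-02 23:00", freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"Total": 1}, index=idx)

# The stop label "2021-04" is expanded to the end of the month,
# so every hour up to 2021-04-30 23:00 is kept.
sliced = demo.loc[:"2021-04"]

print(sliced.index.min())  # 2021-03-30 00:00:00
print(sliced.index.max())  # 2021-04-30 23:00:00
print(len(sliced))         # 32 days * 24 hours = 768
```

This is why the `tail()` of the sliced dataframe should end at 23:00 on April 30, 2021, assuming the data contains that hour.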

* Visualize the dataset above with a line plot using plotly (note: it could take a few seconds). What do you observe?

In [None]:
px.line(df_crimes_hourly, y=["Arrest", "No Arrest", "Total"])

Many crimes seem to be committed around New Year's Eve. In fact, similar peaks appear around midnight on the first day of every month. Perhaps the authorities record some crimes at particular default timestamps.
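One way to check an observation like this numerically is to flag the hours that fall at midnight on the first day of a month and compare their mean count with all other hours. The sketch below uses synthetic Poisson counts with artificially inflated first-of-month midnights; the `demo` frame and the size of the spike are assumptions for the demo, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic hourly counts for one year (illustrative only).
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", "2020-12-31 23:00", freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"Total": rng.poisson(40, size=len(idx))}, index=idx)

# Flag midnight on the first day of each month.
is_first_midnight = (demo.index.day == 1) & (demo.index.hour == 0)

# Assumption for the demo: inject an artificial spike at those timestamps.
demo.loc[is_first_midnight, "Total"] += 200

# Compare the mean count in the flagged hours with all other hours.
mean_flagged = demo.loc[is_first_midnight, "Total"].mean()
mean_other = demo.loc[~is_first_midnight, "Total"].mean()
print(f"first-of-month midnight: {mean_flagged:.1f}, other hours: {mean_other:.1f}")
```

Applied to `df_crimes_hourly`, the same mask would show whether the peaks seen in the plot are systematic.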

* Resample and visualize the dataset at a monthly time base. What do you observe?

In [None]:
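# Sum the hourly counts within each calendar month ("M" = month end; pandas >= 2.2 prefers the alias "ME")
df_crimes_monthly = df_crimes_hourly.resample("M").sum()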

In [None]:
px.line(df_crimes_monthly, y=["Arrest", "No Arrest", "Total"])

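The time series **Arrest** seems to have a strong yearly seasonality. The seasonality is perhaps less strong in the **No Arrest** column. If the two series follow the seasons differently, the share of crimes that end in an arrest must change over the year; we expect more criminal acts to go unpunished in summer. The arrest percentage computed next tests this.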

In [None]:
df_crimes_monthly["Arrest_Perc"] = df_crimes_monthly["Arrest"]/df_crimes_monthly["Total"]*100

In [None]:
px.line(df_crimes_monthly, y="Arrest_Perc")

This is indeed the case. Note also that **Arrest_Perc** dropped after the onset of Covid-19.

* Compute and visualize the average number of **Arrest**/**No Arrest**/**Total** for the 7 different days of the week. What do you observe?

In [None]:
df_weekday_avg = df_crimes_hourly.groupby(df_crimes_hourly.index.weekday).mean()
df_weekday_avg.index.name = "weekday"
dict_weekday = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
df_weekday_avg.index = df_weekday_avg.index.map(dict_weekday)
df_weekday_avg

In [None]:
px.bar(df_weekday_avg)

There are fewer crimes on Sundays.

* Compute and visualize the hourly average of **Arrest**/**No Arrest**/**Total**. What do you observe?

In [None]:
df_hour_avg = df_crimes_hourly.groupby(df_crimes_hourly.index.hour).mean()
df_hour_avg.index.name="Hour"
px.line(df_hour_avg)

There is a natural daily cycle: a peak at 12:00, a high rate during the evening and night, and a minimum in the early morning. This makes sense!
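The peak and minimum hours can also be read off programmatically with `idxmax` and `idxmin` on the hourly averages. A minimal sketch on a synthetic daily cycle (the cosine shape and the hours 18 and 6 are arbitrary choices for the demo, not values from the Chicago data):

```python
import numpy as np
import pandas as pd

# Two weeks of synthetic hourly data (illustrative only).
idx = pd.date_range("2021-01-01", periods=24 * 14, freq=pd.Timedelta(hours=1))
hours = idx.hour.to_numpy()

# Assumed shape: a cosine with its maximum at 18:00 and its minimum at 06:00.
total = 50 + 20 * np.cos(2 * np.pi * (hours - 18) / 24)
demo = pd.DataFrame({"Total": total}, index=idx)

# Same grouping as in the notebook: average per hour of the day.
hour_avg = demo.groupby(demo.index.hour).mean()

peak_hour = hour_avg["Total"].idxmax()    # hour with the highest average
trough_hour = hour_avg["Total"].idxmin()  # hour with the lowest average
print(peak_hour, trough_hour)
```

On the real data, `df_hour_avg["Total"].idxmax()` and `df_hour_avg["Total"].idxmin()` would give the corresponding hours.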

* Compute and visualize the monthly average of **Arrest**/**No Arrest**/**Total**. What do you observe?

### Extra

* What does the following command do?

In [None]:
df_crimes_ma = df_crimes_hourly.resample("D").sum().rolling(31, center=True).sum()

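It first resamples the signal to a daily basis by summing the hourly counts, then applies a centered rolling sum over a 31-day window around each data point. The first and last 15 days do not have a complete window, so their values are NaN.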
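The edge behavior of a centered window can be seen on a small synthetic series. This is a sketch with made-up data (a constant daily series of ones), so that each complete 31-day window sums to exactly 31:

```python
import numpy as np
import pandas as pd

# 60 days of constant synthetic data (illustrative only).
idx = pd.date_range("2021-01-01", periods=60, freq="D")
daily = pd.Series(np.ones(60), index=idx)

# Centered window: position i uses days i-15 .. i+15.
rolled = daily.rolling(31, center=True).sum()

n_missing = int(rolled.isna().sum())  # 15 incomplete windows at each end
first_valid = rolled.iloc[15]         # first position with a full window
print(n_missing, first_valid)         # 30 31.0
```

Replacing `.sum()` with `.mean()` would give a moving average instead of a moving sum.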

In [None]:
df_crimes_ma.iloc[13:24]

In [None]:
px.line(df_crimes_ma, y=["Arrest", "No Arrest", "Total"])

In [None]:
px.line(df_crimes_monthly, y=["Arrest", "No Arrest", "Total"])

The result is similar to the monthly resampling, but it is defined at a daily resolution.
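The link between the two curves can be made concrete: for a 31-day month, the centered 31-day rolling sum evaluated on the 16th covers exactly that month, so it equals the monthly total. A minimal sketch on synthetic daily counts (random integers, not the Chicago data):

```python
import numpy as np
import pandas as pd

# Three months of synthetic daily counts (illustrative only).
rng = np.random.default_rng(1)
idx = pd.date_range("2021-01-01", "2021-03-31", freq="D")
daily = pd.Series(rng.integers(0, 100, size=len(idx)), index=idx)

# Centered 31-day rolling sum, as in the notebook.
rolled = daily.rolling(31, center=True).sum()

# The window centered on January 16 spans January 1 to January 31.
january_total = daily.loc["2021-01"].sum()
rolled_mid_january = rolled.loc[pd.Timestamp("2021-01-16")]
print(january_total, rolled_mid_january)
```

For months with fewer than 31 days the window spills into the neighboring months, so the match is only approximate.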

* Compute and visualize the average number of **Arrest**/**No Arrest**/**Total** for the 12 different months. What do you observe?

In [None]:
df_month_avg = df_crimes_hourly.groupby(df_crimes_hourly.index.month).mean()
df_month_avg.index = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df_month_avg.index.name = "Month"

In [None]:
fig = px.bar(df_month_avg)
fig.show()

There are more crimes in the warmer months of the year.