In [2]:
import pandas as pd
import os

# Time series analysis
In this analysis we aim to predict the `number of accidents` and the `severity of the accidents`.

## Setting up the dataset
Our goal is to build a "found" time series, that is, a dataset cleaned and structured so that it works with state-of-the-art time series analysis algorithms. This is necessary because the original Kaggle dataset was not designed for time series analysis and is therefore not organized as one.

In the following, we'll look at the main problems with this dataset, before outlining the strategy we've developed to address them.

### The problem of a messy dataset
The reader should bear in mind that the dataset used is a *government dataset*. As such, due to politics, budgets and other exogenous considerations, missing and inaccurate values, incomplete rows and discontinuities should be expected.

And that is not all. Given the nature of the dataset, we should expect the *time discounting* phenomenon to reduce the accuracy of the recorded information. More specifically, since this is an accident-related dataset, each event is recorded only after a traffic officer has intervened and a report has been written. The recorded timestamp is therefore likely to be some distance from the actual time of the accident, and details outside the report may be misreported.

### The problem of daylight saving time
<p align="center">
    <img src="img/time_zone.png" width="60%"/>
</p>

As the dataset covers accidents that took place in the United Kingdom, the problem of mismatched time zones does not arise. The whole country uses a single time zone, so we do not have to infer a zone from each accident's latitude and longitude. In winter, local time (GMT) also coincides with UTC; in summer, the clocks move to British Summer Time (UTC+1).

However, daylight saving time remains a problem: when the clocks go forward in spring, one hour of local time does not exist at all, and when they go back in autumn, one hour of local time occurs twice.
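
To make the two cases concrete, here is a minimal sketch using pandas time zone localization. The dates are the 2017 UK clock changes, chosen purely for illustration, and the choice of `'NaT'` for the `nonexistent` and `ambiguous` arguments is an assumption about how one might flag such rows, not the strategy adopted in this notebook.

```python
import pandas as pd

# Naive wall-clock times around the 2017 UK clock changes (illustrative only).
spring = pd.Series(pd.to_datetime(
    ['2017-03-26 00:30', '2017-03-26 01:30', '2017-03-26 02:30']))
autumn = pd.Series(pd.to_datetime(
    ['2017-10-29 00:30', '2017-10-29 01:30', '2017-10-29 02:30']))

# 01:30 on 26 March never happened: the clocks jumped from 01:00 to 02:00.
spring_local = spring.dt.tz_localize('Europe/London', nonexistent='NaT')

# 01:30 on 29 October happened twice: once in BST and once in GMT.
autumn_local = autumn.dt.tz_localize('Europe/London', ambiguous='NaT')

print(spring_local)
print(autumn_local)
```

In both series the middle value becomes `NaT`, while the surrounding values are localized normally. Once the events are aggregated to a coarse frequency such as days, a one-hour ambiguity of this kind has little effect on the counts.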

<!-- 
- time discounting, huge dataset => downsample with aggregation, timestamp should prevent lookahead
- daylight savings => needs investigation, however we don't care due to the aggregation
- utc (photo), fortunately
- null values?
-->

In [3]:
DATASET_NAME = 'Accident_Information.csv'

In [16]:
df = pd.read_csv(os.path.join('../dataset', DATASET_NAME), low_memory=False)

In [81]:
df_pruned = (
    df[['Accident_Severity', 'Date', 'Time']]
    .assign(timestamp=pd.to_datetime(df['Date'] + ' ' + df['Time']))
    .drop(columns=['Date', 'Time'])
)
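
The following sketch shows how the concatenation above behaves when a `Time` value is missing. The toy rows, the ISO date layout and the explicit `format` string are assumptions made for illustration; the real file may use a different layout.

```python
import pandas as pd

# Toy rows; the date layout and the missing time are hypothetical.
raw = pd.DataFrame({
    'Accident_Severity': ['Slight', 'Serious', 'Fatal'],
    'Date': ['2017-01-01', '2017-01-01', '2017-01-02'],
    'Time': ['08:00', None, '23:45'],
})

# A missing Time makes the concatenated string missing as well.
combined = raw['Date'] + ' ' + raw['Time']

# An explicit format avoids per-row format guessing; errors='coerce'
# turns anything unparsable into NaT instead of raising.
timestamp = pd.to_datetime(combined, format='%Y-%m-%d %H:%M', errors='coerce')
print(timestamp)
```

Rows without a usable time end up with a `NaT` timestamp, which is one way null values can appear in the pruned frame.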

In [72]:
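# DataFrame.size counts cells (rows * columns), not bytes; "Mi" means 2**20
(
    f'Original size = {df.size / 1024**2:.2f} Mi cells',
    f'Pruned size = {df_pruned.size / 1024**2:.2f} Mi cells',
)

('Original size = 66.38 Mi cells', 'Pruned size = 3.90 Mi cells')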
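
Note that `DataFrame.size` returns the number of cells (rows times columns), so the figures above measure cells rather than bytes. To estimate the actual memory footprint, one option is `DataFrame.memory_usage(deep=True)`. Below is a minimal sketch on a toy frame; the `megabytes` helper and the toy rows are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical rows).
frame = pd.DataFrame({
    'Accident_Severity': ['Slight', 'Serious', 'Fatal', 'Slight'],
    'Date': ['2017-01-01'] * 4,
    'Time': ['08:00', '09:30', '17:45', '23:10'],
})


def megabytes(data: pd.DataFrame) -> float:
    """Approximate in-memory footprint in MiB, including string contents."""
    return data.memory_usage(deep=True).sum() / 1024**2


n_cells = frame.size  # rows * columns, here 4 * 3
print(f'{n_cells} cells, about {megabytes(frame):.6f} MiB')
```

With `deep=True`, pandas also inspects the contents of object columns, which matters for string-heavy data such as dates and times stored as text.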

In [82]:
n_null_rows = df_pruned.isnull().any(axis=1).sum()
print(f'{n_null_rows} rows have a null value, so I remove them')
df_pruned = df_pruned.dropna()

156 rows have a null value, so I remove them
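
Since the aim is to predict the number of accidents and their severity, the cleaned events eventually need to become a regularly spaced series. Here is a minimal sketch of one way to do that by downsampling with aggregation. The daily frequency, the toy rows and the variable names are assumptions for illustration, not choices made by this notebook.

```python
import pandas as pd

# Toy events standing in for df_pruned (hypothetical rows).
events = pd.DataFrame({
    'Accident_Severity': ['Slight', 'Serious', 'Slight'],
    'timestamp': pd.to_datetime(
        ['2017-01-01 08:00', '2017-01-01 17:30', '2017-01-03 09:15']),
}).set_index('timestamp')

# Number of accidents per day; days without events get a count of 0.
daily_counts = events.resample('D').size()

# Accidents per day broken down by severity: one indicator column per
# severity level, summed within each day.
daily_by_severity = (
    pd.get_dummies(events['Accident_Severity'], dtype=int)
    .resample('D')
    .sum()
)

print(daily_counts)
print(daily_by_severity)
```

Aggregating to a coarser frequency also softens the effect of imprecise timestamps, since an event recorded an hour late usually still falls on the same day.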


## Exploratory analysis

<!-- 
- plot of 12 TS 1 for each year, and then 1 for each month
- seasonality is additive or multiplicative? Should plot the residuals from the mean
- stationarity, Augmented Dickey-Fuller (and KPSS?)
- BoxCox
-->