# Data Exploration

This notebook describes the data exploration steps.

## Install dependencies

In [3]:
%pip install -r data/requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Load data

In [48]:
import pandas as pd

df = pd.read_sql_table('offenses', 'sqlite:///data/data.sqlite')

### Dataset
The first ten rows of the dataset are shown below. The available columns are:
- `exceedance`: The exceedance of the speed limit in km/h
- `datetime`: The date and time of the offense
- `lat`: The latitude of the offense
- `lon`: The longitude of the offense
- `temperature`: The temperature at the offense location at the offense time
- `precipitation`: The precipitation at the offense location at the offense time
- `wind speed`: The wind speed at the offense location at the offense time
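
As a quick sanity check, the column list above can be compared against what was actually loaded. The following is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration, and the small `sample` frame is a stand-in built from two rows of the dataset's `head` output so that the snippet runs on its own.

```python
import pandas as pd

# Column names taken from the list above.
EXPECTED_COLUMNS = [
    "exceedance", "datetime", "lat", "lon",
    "temperature", "precipitation", "wind speed",
]

def missing_columns(df: pd.DataFrame) -> list[str]:
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]

# Stand-in frame for illustration; in the notebook, pass the loaded `df` instead.
sample = pd.DataFrame({
    "exceedance": [6, 17],
    "datetime": pd.to_datetime(["2018-01-01 00:00:29", "2018-01-01 00:01:11"]),
    "lat": [50.951697, 50.947906],
    "lon": [6.981953, 6.941059],
    "temperature": [7.4, 7.4],
    "precipitation": [0.8, 0.8],
    "wind speed": [28.4, 28.4],
})

# An empty list means every documented column is present.
print(missing_columns(sample))

# Check that the timestamps are stored as a datetime dtype rather than strings.
print(pd.api.types.is_datetime64_any_dtype(sample["datetime"]))
```

Running `missing_columns(df)` on the real data frame should give an empty list if the table matches the description above.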

In [50]:
df.head(10)

| | exceedance | datetime | lat | lon | temperature | precipitation | wind speed |
|---|---|---|---|---|---|---|---|
| 0 | 6 | 2018-01-01 00:00:29 | 50.951697 | 6.981953 | 7.4 | 0.8 | 28.4 |
| 1 | 17 | 2018-01-01 00:01:11 | 50.947906 | 6.941059 | 7.4 | 0.8 | 28.4 |
| 2 | 6 | 2018-01-01 00:06:44 | 50.951697 | 6.981953 | 7.4 | 0.8 | 28.4 |
| 3 | 8 | 2018-01-01 00:08:34 | 50.936374 | 6.935985 | 7.4 | 0.8 | 28.4 |
| 4 | 21 | 2018-01-01 00:12:08 | 50.947906 | 6.941059 | 7.4 | 0.8 | 28.4 |
| 5 | 9 | 2018-01-01 00:14:17 | 50.947906 | 6.941059 | 7.4 | 0.8 | 28.4 |
| 6 | 16 | 2018-01-01 00:19:01 | 50.947906 | 6.941059 | 7.4 | 0.8 | 28.4 |
| 7 | 12 | 2018-01-01 00:20:21 | 50.936374 | 6.935985 | 7.4 | 0.8 | 28.4 |
| 8 | 52 | 2018-01-01 00:20:45 | 50.947906 | 6.941059 | 7.4 | 0.8 | 28.4 |
| 9 | 6 | 2018-01-01 00:21:11 | 50.947906 | 6.941059 | 7.4 | 0.8 | 28.4 |


### Data exploration
To get a first impression of the data, I ran the pandas [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) method on the dataset. The output is shown below.

In [52]:
df.describe()

| | exceedance | datetime | lat | lon | temperature | precipitation | wind speed |
|---|---|---|---|---|---|---|---|
| count | 665634.0 | 665634 | 665634.0 | 665634.0 | 665634.0 | 665634.0 | 662922.0 |
| mean | 10.613538 | 2019-08-11 14:12:28.312513280 | 50.952222 | 6.950637 | 13.105685 | 0.086298 | 12.980096 |
| min | 6.0 | 2018-01-01 00:00:29 | 50.849154 | 6.846078 | -11.9 | 0.0 | 0.0 |
| 25% | 7.0 | 2018-10-02 20:00:29 | 50.947906 | 6.941059 | 6.9 | 0.0 | 7.9 |
| 50% | 9.0 | 2019-05-12 07:38:51.500000 | 50.947906 | 6.941059 | 12.3 | 0.0 | 11.9 |
| 75% | 12.0 | 2020-05-25 12:51:47.500000 | 50.951697 | 6.976871 | 18.9 | 0.0 | 16.9 |
| max | 171.0 | 2021-12-31 23:57:25 | 51.046593 | 7.085497 | 40.8 | 32.7 | 72.0 |
| std | 6.268805 | | 0.024807 | 0.034965 | 7.827022 | 0.522429 | 6.646248 |


The `count` row shows that the `exceedance`, `datetime`, `lat`, `lon`, `temperature` and `precipitation` columns each have 665634 non-null values. The `wind speed` column has only 662922, which means that 2712 rows (about 0.4% of the dataset) are missing a wind speed value. This share is small, so we can ignore these rows in our analysis.
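
The number of missing values does not have to be read off the `count` row; it can also be computed directly. The following is a minimal sketch under stated assumptions: `missing_summary` is a hypothetical helper name introduced here, and `sample` is a small stand-in frame so that the snippet runs on its own. In the notebook, the loaded `df` would be passed instead.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column (hypothetical helper)."""
    missing = df.isna().sum()
    return pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(2),
    })

# Stand-in frame for illustration; one wind speed value is deliberately missing.
sample = pd.DataFrame({
    "exceedance": [6, 17, 6, 8],
    "wind speed": [28.4, None, 28.4, 28.4],
})

summary = missing_summary(sample)
print(summary)

# One way to ignore the incomplete rows: drop them before any
# analysis that involves wind speed.
cleaned = sample.dropna(subset=["wind speed"])
print(len(cleaned))
```

Dropping with `subset=["wind speed"]` removes only rows that lack a wind speed value, so the other columns stay complete for analyses that do not involve wind.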