In [1]:
%load_ext autoreload
%load_ext watermark

In [2]:
%autoreload 2

In [3]:
%watermark -ntz -p numpy,pandas,scipy,sklearn

Thu Aug 25 2022 12:39:36 India Standard Time 

numpy 1.18.5
pandas 1.0.4
scipy 1.4.1
sklearn 0.23.1


In [4]:
from pathlib import Path

import numpy as np
import pandas as pd
import scipy.stats as stats
import sklearn

import traffic_exercise.data

In [5]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

In [84]:
import seaborn as sns

---
# Exercise 6: Create prediction dataset
In exercise 1,
you cleaned raw weather and traffic data
to create datasets suitable for analysis.
In subsequent exercises,
you explored those datasets
to obtain insights
into the relationship between
rainfall,
temperature
and traffic
across Cumbria.
In this exercise,
you will combine those insights
to create a dataset which contains information
relevant to training a predictive model.

At the end of this exercise,
you should have analysed and reasoned about
data features which carry some predictive signal
as to the traffic level on a given day
_at a given counting site_.
You will have created this predictive dataset
for use in a later exercise.

## Learning objectives
Objectives which _may_ be met during this exercise.

- I can identify key features which drive traffic levels (_DATA 2_)
- I can identify shortcomings in the data and suggest ways to improve the analysis (_DATA 2_)
- I can measure simple relationships between data points (_STAT 1_)
- I can appropriately partition the data into test/train/validation sets (_ML 1_)
- I can log the results and methods of my analysis to aid reproducibility (_DATA 1_)

_Refer to the [exercise document](../references/exercise_background.md#development-objectives) for more information on objectives_


---
## Step 1: Prepare data

**Tasks:**
- Load weather and traffic data (train, val, test datasets)
- Choose a counting site for which traffic is particularly correlated with weather
    - Refer back to previous exercises

In [40]:
traffic_data_path = Path.cwd().resolve().parent /'analysis' / "data" / "interim" / "counter_data.csv"
weather_data_path = Path.cwd().resolve().parent /'analysis'/ "data" / "interim" / "weather_data_i.csv"

In [41]:
traffic_df = pd.read_csv(traffic_data_path)

traffic_df["Date" ] = pd.to_datetime(traffic_df["Date"])

#Original time is end of a timespan, subtract 1 hour to get start of period. 
#This reduces drawing artifacts where the last hour of each day has next days date.
traffic_df["Date"] = traffic_df["Date"] - pd.Timedelta(hours=1)

# TODO add month, weekday, hour columns
# TODO turn "Date" column into date only (i.e. year, month, day)
traffic_df['Month'] = traffic_df['Date'].apply(lambda x: x.month)
traffic_df['Day'] = traffic_df['Date'].apply(lambda x: x.day)
traffic_df['Weekday'] = traffic_df['Date'].apply(lambda x: x.dayofweek)
traffic_df["Hour"] = traffic_df["Date"].apply(lambda x: x.hour)
traffic_df['Date'] = traffic_df['Date'].apply(lambda d: d.date())
traffic_df.describe(include=["object", "int64", "datetime64[ns]"])

Unnamed: 0,Date,Hour Ending,Special day,Counter ID,Counts,Month,Day,Weekday,Hour
count,186624,186624.0,45120,186624.0,186624.0,186624.0,186624.0,186624.0,186624.0
unique,365,,4,,,,,,
top,2019-05-25,,o,,,,,,
freq,576,,40416,,,,,,
mean,,12.5,,44059.051569,116.501983,6.636574,15.743699,3.002829,11.5
std,,6.922205,,13443.770875,181.584278,3.381432,8.760233,2.000775,6.922205
min,,1.0,,20011.0,0.0,1.0,1.0,0.0,0.0
25%,,6.75,,30023.0,8.0,4.0,8.0,1.0,5.75
50%,,12.5,,50011.0,49.0,7.0,16.0,3.0,11.5
75%,,18.25,,50077.0,137.0,10.0,23.0,5.0,17.25


Load weather data

In [42]:
weather_df = pd.read_csv(weather_data_path)
weather_df["Date"] = pd.to_datetime(weather_df["Date"])
weather_df["Month"] = weather_df["Date"].apply(lambda d: d.month)
weather_df["Day"] = weather_df["Date"].apply(lambda d: d.weekday)
weather_df["Hour"] = weather_df["Date"].apply(lambda d: d.hour)
weather_df["Date"] = weather_df["Date"].apply(lambda d: d.date())

In [44]:
weather_df.shape

(365, 6)

In [43]:
traffic_df = traffic_df.replace({'Special day' : {np.nan : 'norm'}})

In [45]:
traffic_df['Special day'].value_counts()

norm    141504
o        40416
bo        2592
b         2064
s           48
Name: Special day, dtype: int64

In [86]:
weather_df

Unnamed: 0,Date,Rainfall (mm),MaxApparentTemp (degC),Month,Day,Hour
0,2019-01-01,0.0,7.7,1,<built-in method weekday of Timestamp object a...,0
1,2019-01-02,0.0,0.4,1,<built-in method weekday of Timestamp object a...,0
2,2019-01-03,0.0,1.0,1,<built-in method weekday of Timestamp object a...,0
3,2019-01-04,0.0,2.8,1,<built-in method weekday of Timestamp object a...,0
4,2019-01-05,0.0,3.4,1,<built-in method weekday of Timestamp object a...,0
...,...,...,...,...,...,...
360,2019-12-27,0.6,5.8,12,<built-in method weekday of Timestamp object a...,0
361,2019-12-28,0.3,6.8,12,<built-in method weekday of Timestamp object a...,0
362,2019-12-29,0.0,6.3,12,<built-in method weekday of Timestamp object a...,0
363,2019-12-30,0.0,7.3,12,<built-in method weekday of Timestamp object a...,0


In [50]:
traffic_df[traffic_df['Counter ID'] == 30023]['Counts'].mean()

102.07955689828802

In [46]:
spday = traffic_df[['Date','Counter ID','Special day']]
spday = spday.drop_duplicates()

In [47]:
spday

Unnamed: 0,Date,Counter ID,Special day
0,2019-01-01,20011,bo
24,2019-01-02,20011,o
48,2019-01-03,20011,o
72,2019-01-04,20011,o
96,2019-01-05,20011,o
...,...,...,...
186504,2019-12-27,60006,o
186528,2019-12-28,60006,o
186552,2019-12-29,60006,o
186576,2019-12-30,60006,o


In [48]:
temp = traffic_df.groupby(['Counter ID','Date']).agg({'Counts': ['max', 'mean']})
temp.columns = temp.columns.droplevel()

In [49]:
temp['mean'] = np.round(temp['mean'],2)
temp.reset_index(inplace = True)
temp.rename(columns = {'mean':'Daily_mean_traffic', 'max': 'Daily_peak_traffic'}, inplace = True)

In [50]:
temp

Unnamed: 0,Counter ID,Date,Daily_peak_traffic,Daily_mean_traffic
0,20011,2019-01-01,210,47.46
1,20011,2019-01-02,210,53.50
2,20011,2019-01-03,120,38.33
3,20011,2019-01-04,148,44.54
4,20011,2019-01-05,119,37.42
...,...,...,...,...
7771,60006,2019-12-27,73,22.00
7772,60006,2019-12-28,126,23.83
7773,60006,2019-12-29,103,25.04
7774,60006,2019-12-30,94,24.21


In [51]:
temp[temp['Counter ID'] == 30023]['Daily_mean_traffic'].mean()

102.0796978851964

In [51]:
df = spday.merge(temp, on = ['Date', 'Counter ID'])

In [53]:
df.shape

(7776, 5)

In [54]:
df['Prev_day_mean'] = df.groupby('Counter ID')['Daily_mean_traffic'].shift(1)
df['Prev_day_peak'] = df.groupby('Counter ID')['Daily_peak_traffic'].shift(1)
df['Prev_week_mean'] = df.groupby('Counter ID')['Daily_mean_traffic'].shift(7)
df['Prev_week_peak'] = df.groupby('Counter ID')['Daily_peak_traffic'].shift(7)

In [60]:
df[df.Prev_day_mean.isna()].groupby('Counter ID')['Date'].min()

Counter ID
20011    2019-01-01
20012    2019-01-01
20053    2019-01-01
20054    2019-01-01
30021    2019-01-01
30022    2019-01-31
30023    2019-01-01
30024    2019-01-25
50003    2019-01-15
50004    2019-01-16
50009    2019-01-01
50010    2019-01-01
50011    2019-01-01
50012    2019-01-01
50043    2019-01-01
50044    2019-01-01
50053    2019-01-01
50054    2019-01-01
50077    2019-01-01
50078    2019-01-01
60003    2019-01-01
60004    2019-01-01
60005    2019-01-01
60006    2019-01-01
Name: Date, dtype: object

In [71]:
df[df['Counter ID'] == 60006]

Unnamed: 0,Date,Counter ID,Special day,Daily_peak_traffic,Daily_mean_traffic,Prev_day_mean,Prev_day_peak,Prev_week_mean,Prev_week_peak
7450,2019-01-01,60006,bo,83,22.00,,,,
7451,2019-01-02,60006,o,123,27.88,22.00,83.0,,
7452,2019-01-03,60006,o,76,20.92,27.88,123.0,,
7453,2019-01-04,60006,o,69,22.42,20.92,76.0,,
7454,2019-01-05,60006,o,52,17.08,22.42,69.0,,
...,...,...,...,...,...,...,...,...,...
7771,2019-12-27,60006,o,73,22.00,13.12,56.0,19.92,83.0
7772,2019-12-28,60006,o,126,23.83,22.00,73.0,15.96,71.0
7773,2019-12-29,60006,o,103,25.04,23.83,126.0,15.29,74.0
7774,2019-12-30,60006,o,94,24.21,25.04,103.0,19.08,61.0


In [72]:
#ohe = OneHotEncoder()
#ct = make_column_transformer( (ohe , ['Special day']), remainder = 'passthrough' )

In [74]:
#pd.DataFrame(ct.fit_transform(df))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0,1,0,0,0,2019-01-01,20011,210,47.46,,,,
1,0,0,0,1,0,2019-01-02,20011,210,53.5,47.46,210,,
2,0,0,0,1,0,2019-01-03,20011,120,38.33,53.5,210,,
3,0,0,0,1,0,2019-01-04,20011,148,44.54,38.33,120,,
4,0,0,0,1,0,2019-01-05,20011,119,37.42,44.54,148,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7771,0,0,0,1,0,2019-12-27,60006,73,22,13.12,56,19.92,83
7772,0,0,0,1,0,2019-12-28,60006,126,23.83,22,73,15.96,71
7773,0,0,0,1,0,2019-12-29,60006,103,25.04,23.83,126,15.29,74
7774,0,0,0,1,0,2019-12-30,60006,94,24.21,25.04,103,19.08,61


In [75]:
df.head()

Unnamed: 0,Date,Counter ID,Special day,Daily_peak_traffic,Daily_mean_traffic,Prev_day_mean,Prev_day_peak,Prev_week_mean,Prev_week_peak
0,2019-01-01,20011,bo,210,47.46,,,,
1,2019-01-02,20011,o,210,53.5,47.46,210.0,,
2,2019-01-03,20011,o,120,38.33,53.5,210.0,,
3,2019-01-04,20011,o,148,44.54,38.33,120.0,,
4,2019-01-05,20011,o,119,37.42,44.54,148.0,,


In [61]:
dummies = pd.get_dummies(df['Special day']).rename(columns=lambda x: 'special_day_' + str(x))
df = pd.concat([df, dummies], axis=1)
df.drop(['Special day'], inplace=True, axis=1)

In [77]:
df.head()

Unnamed: 0,Date,Counter ID,Daily_peak_traffic,Daily_mean_traffic,Prev_day_mean,Prev_day_peak,Prev_week_mean,Prev_week_peak,special_day_b,special_day_bo,special_day_norm,special_day_o,special_day_s
0,2019-01-01,20011,210,47.46,,,,,0,1,0,0,0
1,2019-01-02,20011,210,53.5,47.46,210.0,,,0,0,0,1,0
2,2019-01-03,20011,120,38.33,53.5,210.0,,,0,0,0,1,0
3,2019-01-04,20011,148,44.54,38.33,120.0,,,0,0,0,1,0
4,2019-01-05,20011,119,37.42,44.54,148.0,,,0,0,0,1,0


In [62]:
df['Weekday'] = df['Date'].apply(lambda x: x.weekday())
df['weekend'] = np.where(df["Weekday"] >= 4 , 1, 0)


In [63]:
df = df.merge(weather_df, on = 'Date')

In [64]:
df.shape

(7776, 20)

In [143]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7776 entries, 0 to 7775
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Date                    7776 non-null   object 
 1   Counter ID              7776 non-null   int64  
 2   Daily_peak_traffic      7776 non-null   int64  
 3   Daily_mean_traffic      7776 non-null   float64
 4   Prev_day_mean           7752 non-null   float64
 5   Prev_day_peak           7752 non-null   float64
 6   Prev_week_mean          7608 non-null   float64
 7   Prev_week_peak          7608 non-null   float64
 8   special_day_b           7776 non-null   uint8  
 9   special_day_bo          7776 non-null   uint8  
 10  special_day_norm        7776 non-null   uint8  
 11  special_day_o           7776 non-null   uint8  
 12  special_day_s           7776 non-null   uint8  
 13  Weekday                 7776 non-null   int64  
 14  weekend                 7776 non-null   

In [65]:
df['Month'] = df['Date'].apply(lambda x: x.month)

In [34]:
df.head()

Unnamed: 0,Date,Counter ID,Daily_peak_traffic,Daily_mean_traffic,Prev_day_mean,Prev_day_peak,Prev_week_mean,Prev_week_peak,special_day_b,special_day_bo,special_day_norm,special_day_o,special_day_s,Weekday,weekend,Rainfall (mm),MaxApparentTemp (degC),Month,Day,Hour
0,2019-01-01,20011,210,47.46,,,,,0,1,0,0,0,1,0,0.0,7.7,1,<built-in method weekday of Timestamp object a...,0
1,2019-01-01,20012,169,38.42,,,,,0,1,0,0,0,1,0,0.0,7.7,1,<built-in method weekday of Timestamp object a...,0
2,2019-01-01,20053,226,68.58,,,,,0,1,0,0,0,1,0,0.0,7.7,1,<built-in method weekday of Timestamp object a...,0
3,2019-01-01,20054,238,70.71,,,,,0,1,0,0,0,1,0,0.0,7.7,1,<built-in method weekday of Timestamp object a...,0
4,2019-01-01,30021,144,40.79,,,,,0,1,0,0,0,1,0,0.0,7.7,1,<built-in method weekday of Timestamp object a...,0


In [66]:
df.columns

Index(['Date', 'Counter ID', 'Daily_peak_traffic', 'Daily_mean_traffic',
       'Prev_day_mean', 'Prev_day_peak', 'Prev_week_mean', 'Prev_week_peak',
       'special_day_b', 'special_day_bo', 'special_day_norm', 'special_day_o',
       'special_day_s', 'Weekday', 'weekend', 'Rainfall (mm)',
       'MaxApparentTemp (degC)', 'Month', 'Day', 'Hour'],
      dtype='object')

In [67]:
#train_counters = [60003,50010,50078,50053,30021,30023,20054,50003,50044,30022,50011,20012,60006,20053,20011,30024]
test_counters = [50077, 60004,50012,60005,50004,50009,50043,50054]

In [68]:
train = df[ ~(df['Counter ID'].isin(test_counters) & df['Month'].isin([7,8,9,10,11,12]) )]
test = df[ (df['Counter ID'].isin(test_counters) & df['Month'].isin([7,8,9,10,11,12]) )]

In [77]:
print(train.shape,test.shape)

(6399, 14) (1377, 14)


In [74]:
train[train['Counter ID'] == 50077].groupby(['Counter ID', 'Month'], as_index = False).count()

Unnamed: 0,Counter ID,Month,Date,Daily_peak_traffic,Daily_mean_traffic,Prev_day_mean,Prev_day_peak,Prev_week_mean,Prev_week_peak,special_day_b,special_day_bo,special_day_norm,special_day_o,special_day_s,Weekday,weekend,Rainfall (mm),MaxApparentTemp (degC),Day,Hour
0,50077,1,29,29,29,28,28,22,22,29,29,29,29,29,29,29,29,29,29,29
1,50077,2,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28
2,50077,3,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31
3,50077,4,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29
4,50077,5,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31
5,50077,6,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30


In [75]:
test.columns

Index(['Date', 'Counter ID', 'Daily_peak_traffic', 'Daily_mean_traffic',
       'Prev_day_mean', 'Prev_day_peak', 'Prev_week_mean', 'Prev_week_peak',
       'special_day_b', 'special_day_bo', 'special_day_norm', 'special_day_o',
       'special_day_s', 'Weekday', 'weekend', 'Rainfall (mm)',
       'MaxApparentTemp (degC)', 'Month', 'Day', 'Hour'],
      dtype='object')

In [78]:
train.drop(['Date','Daily_peak_traffic','Weekday','Month','Day','Hour'], axis = 1, inplace = True)
test.drop(['Date','Daily_peak_traffic','Weekday','Month','Day','Hour'], axis = 1, inplace = True)

KeyError: "['Date' 'Daily_peak_traffic' 'Weekday' 'Month' 'Day' 'Hour'] not found in axis"

In [113]:
np.round(train.groupby('Counter ID')['Daily_mean_traffic'].mean(), 2).reset_index().rename(columns = {'Daily_mean_traffic': 'Counter_avg_traffic'})

Unnamed: 0,Counter ID,Counter_avg_traffic
0,20011,44.91
1,20012,48.1
2,20053,96.06
3,20054,93.38
4,30021,47.85
5,30022,47.98
6,30023,102.08
7,30024,103.42
8,50003,383.61
9,50004,372.76


In [None]:
train.merge(np.round(train.groupby('Counter ID')['Daily_mean_traffic'].mean(), 2).reset_index().rename(columns = {'Daily_mean_traffic': 'Counter_avg_traffic'}), on = 'Counter ID')

In [79]:
train = train.merge(np.round(train.groupby('Counter ID')['Daily_mean_traffic'].mean(), 2).reset_index().rename(columns = {'Daily_mean_traffic': 'Counter_avg_traffic'}), on = 'Counter ID')

In [80]:
test = test.merge(np.round(train.groupby('Counter ID')['Daily_mean_traffic'].mean(), 2).reset_index().rename(columns = {'Daily_mean_traffic': 'Counter_avg_traffic'}),on = 'Counter ID')

In [81]:
train = train[['special_day_b','special_day_bo','special_day_norm','special_day_o','special_day_s','weekend','Rainfall (mm)','MaxApparentTemp (degC)','Counter_avg_traffic','Prev_day_mean','Prev_day_peak','Prev_week_mean','Prev_week_peak','Daily_mean_traffic']]
test = test[['special_day_b','special_day_bo','special_day_norm','special_day_o','special_day_s','weekend','Rainfall (mm)','MaxApparentTemp (degC)','Counter_avg_traffic','Prev_day_mean','Prev_day_peak','Prev_week_mean','Prev_week_peak','Daily_mean_traffic']]

In [83]:
train.corr()

Unnamed: 0,special_day_b,special_day_bo,special_day_norm,special_day_o,special_day_s,weekend,Rainfall (mm),MaxApparentTemp (degC),Counter_avg_traffic,Prev_day_mean,Prev_day_peak,Prev_week_mean,Prev_week_peak,Daily_mean_traffic
special_day_b,1.0,-0.013623,-0.2164,-0.058022,-0.002064,0.013521,-0.042194,0.089139,0.003761,0.038847,0.042546,0.015666,0.019026,0.032916
special_day_bo,-0.013623,1.0,-0.2164,-0.058022,-0.002064,-0.101582,-0.0571,0.001642,-0.005691,-0.016352,-0.012532,-0.007245,-0.011421,-0.025754
special_day_norm,-0.2164,-0.2164,1.0,-0.921691,-0.032783,0.084221,0.032807,-0.168,-0.006466,-0.04212,-0.043408,-0.04862,-0.032763,-0.044103
special_day_o,-0.058022,-0.058022,-0.921691,1.0,-0.00879,-0.063741,-0.005306,0.148803,0.007467,0.03674,0.036183,0.048227,0.031667,0.044092
special_day_s,-0.002064,-0.002064,-0.032783,-0.00879,1.0,0.020316,-0.008994,0.024009,-0.002988,0.002403,-0.000695,-0.000897,-0.000411,0.001493
weekend,0.013521,-0.101582,0.084221,-0.063741,0.020316,1.0,0.090664,0.010369,0.00053,0.041809,0.007798,0.027275,0.028346,0.029598
Rainfall (mm),-0.042194,-0.0571,0.032807,-0.005306,-0.008994,0.090664,1.0,-0.066733,0.006516,-0.005997,-0.014764,0.007075,0.00618,-0.029582
MaxApparentTemp (degC),0.089139,0.001642,-0.168,0.148803,0.024009,0.010369,-0.066733,1.0,0.014888,0.14419,0.108054,0.138809,0.103843,0.142795
Counter_avg_traffic,0.003761,-0.005691,-0.006466,0.007467,-0.002988,0.00053,0.006516,0.014888,1.0,0.966452,0.966753,0.967659,0.967576,0.966519
Prev_day_mean,0.038847,-0.016352,-0.04212,0.03674,0.002403,0.041809,-0.005997,0.14419,0.966452,1.0,0.983553,0.96941,0.9627,0.978083


In [82]:
print(train.shape,test.shape)

(6399, 14) (1377, 14)


In [122]:
train.to_csv(Path.cwd().resolve().parent /'analysis'/ "data" / "interim" / 'train.csv', index = False)
test.to_csv(Path.cwd().resolve().parent /'analysis'/ "data" / "interim" / 'test.csv', index = False)

---
## Step 2: Create data features

In previous exercises,
you explored the traffic and weather data
in great detail
and will have identified features
which may be of use in a predictive model.
In this step,
you will engineer these features.

**Tasks:**
- Compile a list of features which may be useful to a predictive model. Feature to consider are:
    - Previous day's peak and mean traffic
    - Peak and mean traffic on the same day of the preceding week
    - Rainfall and temperature of each day (assume these are predictions, not measurements, and therefore available in practice)
    - Special day status (e.g. bank holiday)
    - The current month
    - Whether the day is mid-week or weekend
- Engineer the features and add them to the dataset
- Consider your features in the context of train/validation/test datasets. Have you accidentally provided information which should not be available (look-ahead bias)?

# Making A Paralle model

In [88]:
# Practice
df_33 = traffic_df.copy()
df_33['previous'] = df_33.groupby('Counter ID')['Counts'].shift(1)

In [94]:
df_33.Weekday.unique()

array([1, 2, 3, 4, 5, 6, 0], dtype=int64)

In [101]:
temp1 = temp.copy()

In [103]:
temp1['Prev_day_mean'] = temp1.groupby('Counter ID')['Daily_mean_traffic'].shift(1)
temp1['Prev_day_peak'] = temp1.groupby('Counter ID')['Daily_peak_traffic'].shift(1)
temp1['Prev_week_mean'] = temp1.groupby('Counter ID')['Daily_mean_traffic'].shift(7)
temp1['Prev_week_peak'] = temp1.groupby('Counter ID')['Daily_peak_traffic'].shift(7)

In [104]:
temp1.head(22)

Unnamed: 0,Counter ID,Date,Daily_peak_traffic,Daily_mean_traffic,Prev_day_mean,Prev_day_peak,Prev_week_mean,Prev_week_peak
0,20011,2019-01-01,210,47.46,,,,
1,20011,2019-01-02,210,53.5,47.46,210.0,,
2,20011,2019-01-03,120,38.33,53.5,210.0,,
3,20011,2019-01-04,148,44.54,38.33,120.0,,
4,20011,2019-01-05,119,37.42,44.54,148.0,,
5,20011,2019-01-06,181,39.38,37.42,119.0,,
6,20011,2019-01-07,125,32.54,39.38,181.0,,
7,20011,2019-01-08,111,29.17,32.54,125.0,47.46,210.0
8,20011,2019-01-09,120,32.04,29.17,111.0,53.5,210.0
9,20011,2019-01-10,113,29.21,32.04,120.0,38.33,120.0


In [120]:
df_1.head()

Unnamed: 0,Date,Counter ID,Special day,Month,Hour,Weekday,Counts
0,2019-01-01,20011,bo,1,0,1,12
1,2019-01-01,20011,bo,1,1,1,10
2,2019-01-01,20011,bo,1,2,1,2
3,2019-01-01,20011,bo,1,3,1,8
4,2019-01-01,20011,bo,1,4,1,2


In [119]:
traffic_df = pd.read_csv(traffic_data_path)

traffic_df["Date" ] = pd.to_datetime(traffic_df["Date"])

#Original time is end of a timespan, subtract 1 hour to get start of period. 
#This reduces drawing artifacts where the last hour of each day has next days date.
traffic_df["Date"] = traffic_df["Date"] - pd.Timedelta(hours=1)

# TODO add month, weekday, hour columns
# TODO turn "Date" column into date only (i.e. year, month, day)
traffic_df['Month'] = traffic_df['Date'].apply(lambda x: x.month)
traffic_df['Day'] = traffic_df['Date'].apply(lambda x: x.day)
traffic_df['Weekday'] = traffic_df['Date'].apply(lambda x: x.dayofweek)
traffic_df["Hour"] = traffic_df["Date"].apply(lambda x: x.hour)

In [121]:
df_1 = traffic_df.copy()
df_1.drop(['Hour Ending','Day'], axis = 1, inplace = True)

In [122]:
df_1 = df_1[['Date', 'Counter ID', 'Special day', 'Month', 'Hour',
        'Weekday', 'Counts']]

In [131]:
df_2 = df_1.set_index('Date')

In [142]:
df_2['previous_day_same_hour'] = df_2.groupby('Counter ID')['Counts'].shift(1, freq = 'D')

TypeError: Argument 'tuples' has incorrect type (expected numpy.ndarray, got DatetimeArray)

In [141]:
df_2

Unnamed: 0_level_0,Counter ID,Special day,Month,Hour,Weekday,Counts,previous_day_same_hour
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2019-01-01 00:00:00,20011,bo,1,0,1,12,
2019-01-01 01:00:00,20011,bo,1,1,1,10,12.0
2019-01-01 02:00:00,20011,bo,1,2,1,2,10.0
2019-01-01 03:00:00,20011,bo,1,3,1,8,2.0
2019-01-01 04:00:00,20011,bo,1,4,1,2,8.0
...,...,...,...,...,...,...,...
2019-12-31 19:00:00,60006,o,12,19,1,6,12.0
2019-12-31 20:00:00,60006,o,12,20,1,6,6.0
2019-12-31 21:00:00,60006,o,12,21,1,2,6.0
2019-12-31 22:00:00,60006,o,12,22,1,0,2.0


---
## Step 3: Investigate features
You have likely created a lot of features,
some of which overlap with others.
In this step,
you will investigate the predictive signal
each feature provides.

**Tasks:**
- Estimate [VC dimension](https://towardsdatascience.com/measuring-the-power-of-a-classifier-c765a7446c1c) of dataset
    - How many features will be useful for traffic prediction?
- Calculate correlation between features
    - Are there any which overlap strongly?
    - Can they be removed or combined?
- Calculate correlation between each feature and traffic
    - Consider removing features with low signal

---
## Step 4: Pre-process

- Normalise parameters
- Apply a noise reduction method such as PCA

---
## Step 5: Save data

**Tasks:**
- Complete the `create_prediction_dataset` function in `src/traffic_exercise/data/prediction_dataset.py` to create the dataset as in this notebook
- Save the dataset to `data/processed`

---
# Review

After this exercise:
- [ ] I can identify key features which drive traffic levels (_DATA 2_)
- [ ] I can identify shortcomings in the data and suggest ways to improve the analysis (_DATA 2_)
- [ ] I can measure simple relationships between data points (_STAT 1_)
- [ ] I can appropriately partition the data into test/train/validation sets (_ML 1_)
- [ ] I can log the results and methods of my analysis to aid reproducibility (_DATA 1_)