# Data preparation for prediction

The task for the prediction is as follows:

... forecasting total system-level demand in the next hour ...

Therefore we need a dataset with an hourly resolution. An intuitive measure of system-level demand is the number of trips per hour. To achieve this, we resample the whole dataset by the hour and count the trips in each hour.
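As a minimal sketch of this idea on toy data (the timestamps and the name `toy_trips` are made up for illustration and are not taken from the dataset), grouping by hourly buckets and counting rows gives one demand value per hour, including a zero for hours without trips:

```python
import pandas as pd

# hypothetical toy trips, one row per trip
toy_trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2017-01-01 00:05:00",
        "2017-01-01 00:40:00",
        "2017-01-01 02:15:00",
    ]),
    "bike_id": [1, 2, 3],
})

# count trips per hourly bucket; buckets without trips get a count of 0
toy_trips_per_hour = (
    toy_trips.groupby(pd.Grouper(key="start_time", freq="60min"))["bike_id"]
    .count()
)
print(toy_trips_per_hour)
```

The 01:00 bucket contains no trips but still appears in the result, which is why the hourly index has no gaps.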

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rcParams

In [2]:
#read dataset
path = "../chicago_2017_prepared.csv"
bike_data = pd.read_csv(path, parse_dates=["start_time","end_time"])
bike_data.head(1)

Unnamed: 0,start_time,end_time,start_station_id,end_station_id,start_station_name,end_station_name,bike_id,user_type,start_hour,start_day,start_month,start_weekday,max_temp,min_temp,precip,start_lat,start_long,end_lat,end_long
0,2017-01-01 00:00:36,2017-01-01 00:06:32,414,191,Canal St & Taylor St,Canal St & Monroe St (*),2511,Customer,0,1,1,6,-0.6,-0.6,0.0,41.870257,-87.639474,41.880884,-87.639525


The data contains trip-specific information such as the start and end station. Since this information cannot be used for system-level demand, those features are not needed for the prediction.

In [8]:
# feature engineering
# add trip duration feature
timedelta = bike_data["end_time"] - bike_data["start_time"]
# note: .dt.seconds drops the days component, so a trip longer than 24 hours would be under-reported
bike_data["duration"] = timedelta.dt.seconds
bike_data["duration_min"] = bike_data["duration"] // 60
# add weekend dummy variable
bike_data["is_weekend"] = bike_data["start_weekday"].isin([5,6])
# only include possible features
feature_data = bike_data[["start_time","bike_id","start_hour","start_day","start_month","start_weekday","is_weekend","duration_min","max_temp","precip"]]
feature_data.head(1)


Unnamed: 0,start_time,bike_id,start_hour,start_day,start_month,start_weekday,is_weekend,duration_min,max_temp,precip
0,2017-01-01 00:00:36,2511,0,1,1,6,True,5,-0.6,0.0


In [9]:
#aggregate by hour (for demand prediction in the next hour)
feature_data_by_hour = feature_data.groupby(pd.Grouper(key='start_time',freq='1H')).agg({"bike_id":"count","start_hour":"mean","start_day":"mean","start_month":"mean","start_weekday":"mean","is_weekend":"mean","duration_min":"mean","max_temp":"mean","precip":"mean"})
feature_data_by_hour.rename(columns={"duration_min":"mean_duration_min","bike_id":"trip_amount"}, inplace=True)
feature_data_by_hour.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8760 entries, 2017-01-01 00:00:00 to 2017-12-31 23:00:00
Freq: H
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   trip_amount        8760 non-null   int64  
 1   start_hour         8743 non-null   float64
 2   start_day          8743 non-null   float64
 3   start_month        8743 non-null   float64
 4   start_weekday      8743 non-null   float64
 5   is_weekend         8743 non-null   float64
 6   mean_duration_min  8743 non-null   float64
 7   max_temp           8743 non-null   float64
 8   precip             8743 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 684.4 KB


As we can see, 17 of the 8,760 hours contain no trips, which results in null values in every aggregated column except trip_amount. However, many predictive models cannot handle null values. To make the data usable, we forward fill the weather data, set the mean duration to 0, and recompute the calendar features from the hourly index.
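The two fill strategies can be compared on a small toy frame. This is only a sketch: the name `toy_hourly` and its values are hypothetical, chosen so that the 01:00 bucket has no trips.

```python
import numpy as np
import pandas as pd

# hypothetical hourly frame: the 01:00 bucket had no trips
toy_hourly = pd.DataFrame(
    {
        "trip_amount": [2, 0, 1],
        "max_temp": [-0.6, np.nan, -2.8],
        "mean_duration_min": [7.5, np.nan, 4.0],
    },
    index=pd.date_range("2017-01-01", periods=3, freq="60min"),
)

# in this toy frame the hours with missing aggregates are the hours without trips
empty_hours = toy_hourly.index[toy_hourly["max_temp"].isna()]

# weather: carry the last observed value forward
toy_hourly["max_temp"] = toy_hourly["max_temp"].ffill()
# duration: without trips there is nothing to average, so 0 is used as a placeholder
toy_hourly["mean_duration_min"] = toy_hourly["mean_duration_min"].fillna(0)
print(toy_hourly)
```

Forward filling suits the weather columns because the temperature still exists in an hour without trips, whereas a mean trip duration does not, so a constant is used there instead.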

In [11]:
# deal with NaN values
feature_data_by_hour.sort_index(inplace=True)
feature_data_by_hour["max_temp"] = feature_data_by_hour["max_temp"].ffill()
feature_data_by_hour["precip"] = feature_data_by_hour["precip"].ffill()
feature_data_by_hour["mean_duration_min"] = feature_data_by_hour["mean_duration_min"].fillna(value=0)
# recompute the calendar features from the hourly index so that empty hours get values too
feature_data_by_hour["start_hour"] = feature_data_by_hour.index.hour
feature_data_by_hour["start_day"] = feature_data_by_hour.index.day
feature_data_by_hour["start_month"] = feature_data_by_hour.index.month
feature_data_by_hour["start_weekday"] = feature_data_by_hour.index.dayofweek
feature_data_by_hour["is_weekend"] = feature_data_by_hour["start_weekday"].isin([5,6]).astype("int64")
feature_data_by_hour.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8760 entries, 2017-01-01 00:00:00 to 2017-12-31 23:00:00
Freq: H
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   trip_amount        8760 non-null   int64  
 1   start_hour         8760 non-null   int64  
 2   start_day          8760 non-null   int64  
 3   start_month        8760 non-null   int64  
 4   start_weekday      8760 non-null   int64  
 5   is_weekend         8760 non-null   int64  
 6   mean_duration_min  8760 non-null   float64
 7   max_temp           8760 non-null   float64
 8   precip             8760 non-null   float64
dtypes: float64(3), int64(6)
memory usage: 684.4 KB


In [6]:
feature_data_by_hour.head()

start_time,trip_amount,start_hour,start_day,start_month,start_weekday,is_weekend,mean_duration_min,max_temp,precip
2017-01-01 00:00:00,46,0,1,1,6,1,25.652174,-0.6,0.0
2017-01-01 01:00:00,46,1,1,1,6,1,10.891304,-2.2,0.0
2017-01-01 02:00:00,36,2,1,1,6,1,8.027778,-2.8,0.0
2017-01-01 03:00:00,18,3,1,1,6,1,11.111111,-3.3,0.0
2017-01-01 04:00:00,6,4,1,1,6,1,8.0,-3.3,0.0


In [12]:
feature_data_by_hour.describe()

Unnamed: 0,trip_amount,start_hour,start_day,start_month,start_weekday,is_weekend,mean_duration_min,max_temp,precip
count,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0
mean,437.095205,11.5,15.720548,6.526027,3.008219,0.287671,13.968518,11.540594,0.087443
std,498.340244,6.922582,8.796749,3.448048,2.003519,0.452703,8.133743,10.917582,0.282499
min,0.0,0.0,1.0,1.0,0.0,0.0,0.0,-19.4,0.0
25%,57.75,5.75,8.0,4.0,1.0,0.0,9.91435,3.3,0.0
50%,234.5,11.5,16.0,7.0,3.0,0.0,12.178727,11.7,0.0
75%,660.0,17.25,23.0,10.0,5.0,1.0,16.317608,20.6,0.0
max,2852.0,23.0,31.0,12.0,6.0,1.0,263.25,35.0,1.0


The dataset now has an hourly resolution and no missing values, so it is ready to be used for creating ML models.
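Since the task is to forecast demand in the next hour, one possible way to turn the hourly table into a supervised dataset is to shift `trip_amount` by one hour. The sketch below is an assumption about a later modelling step, not something this notebook does. It uses a stand-in frame `toy_hourly` built from the first five `trip_amount` values shown above, and the column name `trip_amount_next_hour` is hypothetical.

```python
import pandas as pd

# hypothetical stand-in for feature_data_by_hour with only the demand column
toy_hourly = pd.DataFrame(
    {"trip_amount": [46, 46, 36, 18, 6]},
    index=pd.date_range("2017-01-01", periods=5, freq="60min"),
)

# target: demand of the following hour (hypothetical column name)
toy_hourly["trip_amount_next_hour"] = toy_hourly["trip_amount"].shift(-1)

# the last hour has no following hour, so its target is NaN and the row is dropped
toy_supervised = toy_hourly.dropna(subset=["trip_amount_next_hour"])
print(toy_supervised)
```

Each remaining row then pairs the features known at a given hour with the demand observed one hour later.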

In [13]:
# save data as csv
# feature_data_by_hour.to_csv("load_prediction_data.csv")
