# Prediction of Bikesharing in Washington DC

Optimized using Dask

One dataset contains hourly data and the other one has daily data from the years 2011 and 2012.

The following variables are included in the data:

* instant: Record index
* dteday: Date
* season: Season (1: spring, 2: summer, 3: fall, 4: winter)
* yr: Year (0: 2011, 1:2012)
* mnth: Month (1 to 12)
* hr: Hour (0 to 23, only available in the hourly dataset)
* holiday: whether day is holiday or not (extracted from Holiday Schedule)
* weekday: Day of the week
* workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
* weathersit: Weather situation (extracted from Freemeteo)
    * 1: Clear, Few clouds, Partly cloudy
    * 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    * 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* temp: Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
* atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
* hum: Normalized humidity. The values are divided by 100 (max)
* windspeed: Normalized wind speed. The values are divided by 67 (max)
* casual: count of casual users
* registered: count of registered users
* cnt: count of total rental bikes including both casual and registered (Our target variable)
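
The normalization formulas listed above can be inverted to recover values in their original units. The snippet below is a minimal sketch of that arithmetic; the helper name `denormalize` is ours and is not part of the notebook, and the sample values come from the first hourly record.

```python
# Constants taken from the variable descriptions above.
TEMP_MIN, TEMP_MAX = -8.0, 39.0
ATEMP_MIN, ATEMP_MAX = -16.0, 50.0
HUM_MAX = 100.0
WINDSPEED_MAX = 67.0


def denormalize(value, lo, hi):
    """Invert (t - t_min) / (t_max - t_min) to recover the original value."""
    return value * (hi - lo) + lo


# First hourly record: temp=0.24, atemp=0.2879, hum=0.81, windspeed=0.0
temp_c = denormalize(0.24, TEMP_MIN, TEMP_MAX)       # about 3.28 degrees Celsius
atemp_c = denormalize(0.2879, ATEMP_MIN, ATEMP_MAX)  # about 3.0 degrees Celsius
humidity_pct = 0.81 * HUM_MAX                        # about 81 percent
wind = 0.0 * WINDSPEED_MAX                           # no wind recorded
```

So a normalized `temp` of 0.24 corresponds to roughly 3.3 degrees Celsius.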

We are tasked with building a predictive model that estimates how many people will use the service on an hourly basis. We therefore train on the first seven quarters of the data (January 2011 through September 2012) and hold out the last quarter of 2012 for validation. Since the holdout data is not used for training, the evaluation metric computed on it (R2 score) gives an unbiased estimate of the model's predictive power.
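
The time-based split can be illustrated on a few in-memory records. This is a plain-Python sketch, not the Dask code used later; `split_by_date` is a hypothetical helper and the rows are made up for illustration.

```python
from datetime import date

HOLDOUT_START = date(2012, 10, 1)  # first day of the last quarter of 2012


def split_by_date(rows, cutoff):
    """Return (train, holdout): rows before the cutoff and rows on or after it."""
    train = [r for r in rows if r["dteday"] < cutoff]
    holdout = [r for r in rows if r["dteday"] >= cutoff]
    return train, holdout


# Made-up records, for illustration only.
rows = [
    {"dteday": date(2011, 1, 1), "cnt": 16},
    {"dteday": date(2012, 9, 30), "cnt": 120},
    {"dteday": date(2012, 10, 1), "cnt": 95},
    {"dteday": date(2012, 12, 31), "cnt": 40},
]
train, holdout = split_by_date(rows, HOLDOUT_START)
```

Splitting by date rather than at random keeps every holdout observation later in time than every training observation.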

Import the necessary libraries

In [1]:
import helpers as hp
from dask import dataframe as dd
from distributed import Client, progress
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score as metric_scorer
# sklearn.externals.joblib was removed from scikit-learn; import joblib directly
from joblib import parallel_backend

### Setting Key Values

The following values are used throughout the code. This cell provides a central place where they can be managed.

In [2]:
SEED = 1
DATA_PATH = 'https://gist.githubusercontent.com/f-loguercio/f5c10c97fe9afe58f77cd102ca81719b/raw/99fb846b22abc8855de305c2159a57a77c9764cf/bikesharing_hourly.csv'
DATA_PATH2 = 'https://gist.githubusercontent.com/f-loguercio/14ac934fabcca41093a51efef335f8f2/raw/58e00b425c711ac1da2fb75f851f4fc9ce814cfa/bikesharing_daily.csv'
PREC_PATH = 'https://gist.githubusercontent.com/akoury/6fb1897e44aec81cced8843b920bad78/raw/b1161d2c8989d013d6812b224f028587a327c86d/precipitation.csv'
TARGET_VARIABLE = 'cnt'
ESTIMATORS = 50

### Set up Dask

In [3]:
client = Client()
client

Client: Scheduler: tcp://127.0.0.1:49636, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 8.59 GB


### Data Loading

In [4]:
def read_data(input_path):
    return dd.read_csv(input_path)

data = read_data(DATA_PATH)
data_daily = read_data(DATA_PATH2)

In [5]:
data.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


### Precipitation Data

To improve my model, I will add hourly precipitation data obtained from the National Climatic Data Center (https://www.ncdc.noaa.gov/cdo-web/datasets).

However, since most of the values are 0, I will convert them to a boolean that indicates whether rain was present at that specific hour.

In [6]:
precipitation = read_data(PREC_PATH)
data = data.merge(precipitation, how='left', on=['dteday','hr'])
data['precipitation'] = data['precipitation'].fillna(0)

In [7]:
# Flag hours with any recorded precipitation (1) versus none (0)
data['precipitation'] = (data['precipitation'] > 0).astype(int).astype('category')
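
The logic of the merge and the conversion can be checked on a tiny example. The sketch below uses plain dictionaries instead of Dask; the readings are made up and `rain_flag` is a hypothetical helper.

```python
# Hypothetical precipitation readings keyed by (date, hour); values are made up.
precipitation = {("2011-01-01", 0): 0.3, ("2011-01-01", 2): 0.0}

# Hours present in the bike sharing data; hour 1 has no precipitation record.
hours = [("2011-01-01", 0), ("2011-01-01", 1), ("2011-01-01", 2)]


def rain_flag(key, readings):
    """Return 1 if any precipitation was recorded for that hour, else 0.

    Hours missing from `readings` are treated as 0, mirroring fillna(0)
    after the left merge.
    """
    return int(readings.get(key, 0.0) > 0)


flags = [rain_flag(h, precipitation) for h in hours]
```

Here `flags` is `[1, 0, 0]`: only the hour with a positive reading is marked as rainy.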

In [8]:
data.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,precipitation
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16,1
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40,1
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32,1
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13,1
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1,1


Coerce date to datetime

In [9]:
data['dteday'] = dd.to_datetime(data.dteday, format='%Y-%m-%d')
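
The format string should match the separators in the data, which stores dates such as `2011-01-01`. The standard-library sketch below shows how a strict parser behaves; pandas and Dask may be more lenient depending on the version, so this illustrates the principle only.

```python
from datetime import datetime

raw = "2011-01-01"  # the form dteday takes in the CSV

# The hyphenated format matches the data.
parsed = datetime.strptime(raw, "%Y-%m-%d")

# A slash-separated format does not match and raises ValueError.
try:
    datetime.strptime(raw, "%Y/%m/%d")
    slash_format_ok = True
except ValueError:
    slash_format_ok = False
```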

### Data types

We review the data types for each column

In [10]:
data.dtypes

instant                   int64
dteday           datetime64[ns]
season                    int64
yr                        int64
mnth                      int64
hr                        int64
holiday                   int64
weekday                   int64
workingday                int64
weathersit                int64
temp                    float64
atemp                   float64
hum                     float64
windspeed               float64
casual                    int64
registered                int64
cnt                       int64
precipitation          category
dtype: object

### Converting columns to their true categorical type
Now we convert the data types of numerical columns that are actually categorical

In [11]:
data['season'] = data['season'].astype('category')
data['yr'] = data['yr'].astype('category')
data['mnth'] = data['mnth'].astype('category')
data['hr'] = data['hr'].astype('category')
data['holiday'] = data['holiday'].astype('category')
data['weekday'] = data['weekday'].astype('category')
data['workingday'] = data['workingday'].astype('category')
data['weathersit'] = data['weathersit'].astype('category')
data.dtypes

instant                   int64
dteday           datetime64[ns]
season                 category
yr                     category
mnth                   category
hr                     category
holiday                category
weekday                category
workingday             category
weathersit             category
temp                    float64
atemp                   float64
hum                     float64
windspeed               float64
casual                    int64
registered                int64
cnt                       int64
precipitation          category
dtype: object

### Adding lag of registered users

To improve the predictive power of our model, we add two lagged features: the number of registered users one hour earlier and 24 hours earlier.

In [12]:
lagged = data['registered'].shift(1).rename(str('registered') + '_' + str(1))

In [13]:
def add_lag(df, col, lag):
    lagged = df[col].shift(lag).rename(str(col) + '_' + str(lag))
    # We will lose the first "lag" observations, but that is negligible
    #lagged[0:(lag)] = lagged.compute()[lag:(lag*2)]
    return lagged

data = dd.concat([data, add_lag(data, 'registered', 1), add_lag(data, 'registered', 24)], axis = 1)
# Drop the rows that have nan due to lag
data = data.dropna(how = 'any')

data.head()

We're assuming that the indexes of each dataframes are aligned. This assumption is not generally safe.


Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,precipitation,registered_1,registered_24
24,25,2011-01-02,1,0,1,0,0,0,0,2,0.46,0.4545,0.88,0.2985,4,13,17,1,24.0,13.0
25,26,2011-01-02,1,0,1,1,0,0,0,2,0.44,0.4394,0.94,0.2537,1,16,17,1,13.0,32.0
26,27,2011-01-02,1,0,1,2,0,0,0,2,0.42,0.4242,1.0,0.2836,1,8,9,1,16.0,27.0
27,28,2011-01-02,1,0,1,3,0,0,0,2,0.46,0.4545,0.94,0.194,2,4,6,1,8.0,10.0
28,29,2011-01-02,1,0,1,4,0,0,0,2,0.46,0.4545,0.94,0.194,2,1,3,1,4.0,1.0
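
What the lag does can be seen on the first five `registered` values from the earlier output (13, 32, 27, 10, 1). This is a plain-Python sketch; the `shift` function below is our own stand-in for the dataframe method, and a lag of 2 replaces the lag of 24 so the example stays short.

```python
def shift(values, lag):
    """Shift a sequence down by `lag` positions, padding the start with None."""
    if lag <= 0:
        return list(values)
    return [None] * min(lag, len(values)) + list(values[:-lag])


registered = [13, 32, 27, 10, 1]

registered_1 = shift(registered, 1)  # value from the previous hour
registered_2 = shift(registered, 2)  # value from two hours earlier

# Keep only the rows where every lag is available, as dropna does.
complete_rows = [
    i for i in range(len(registered))
    if registered_1[i] is not None and registered_2[i] is not None
]
```

The first `lag` rows have no earlier value to borrow, which is why the table above starts at index 24 once the 24-hour lag is added.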


### Dropping Columns

We drop the casual and registered columns. Since cnt is their sum, keeping them would leak the target into the features.

In [14]:
data = hp.drop_columns(data, ['casual', 'registered'])
data.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt,precipitation,registered_1,registered_24
24,25,2011-01-02,1,0,1,0,0,0,0,2,0.46,0.4545,0.88,0.2985,17,1,24.0,13.0
25,26,2011-01-02,1,0,1,1,0,0,0,2,0.44,0.4394,0.94,0.2537,17,1,13.0,32.0
26,27,2011-01-02,1,0,1,2,0,0,0,2,0.42,0.4242,1.0,0.2836,9,1,16.0,27.0
27,28,2011-01-02,1,0,1,3,0,0,0,2,0.46,0.4545,0.94,0.194,6,1,8.0,10.0
28,29,2011-01-02,1,0,1,4,0,0,0,2,0.46,0.4545,0.94,0.194,3,1,4.0,1.0


## Baseline

A basic linear model is created in order to set a baseline which further models will be compared against

In [15]:
base_holdout = data[data['dteday'] >= '2012-10-01'].copy()
base_holdout = hp.drop_columns(base_holdout, ['dteday'])
base_data = data[data['dteday'] < '2012-10-01'].copy()
base_data = hp.drop_columns(base_data, ['dteday'])

X_train = base_data.loc[:, base_data.columns != TARGET_VARIABLE]
y_train = base_data.loc[:, TARGET_VARIABLE]

Use dask to fit the baseline model

In [16]:
model = LinearRegression()

with parallel_backend('dask'):
    model.fit(X_train, y_train)
    pred = model.predict(base_holdout.loc[:, base_holdout.columns != TARGET_VARIABLE])
    
y = base_holdout.loc[:, TARGET_VARIABLE]
score = metric_scorer(y, pred)
print('Baseline score: ' + str(score))



Baseline score: 0.7636766048033885
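
For reference, the R2 score compares the model's squared errors with those of a model that always predicts the mean of the observed values. The function below is a minimal sketch of that definition, not scikit-learn's implementation; the name `r2` is ours, and the sample values are the first five `cnt` values from the hourly data.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [16, 40, 32, 13, 1]
mean_value = sum(y_true) / len(y_true)

perfect = r2(y_true, y_true)                          # every prediction exact
mean_only = r2(y_true, [mean_value] * len(y_true))    # always predict the mean
```

A perfect model scores 1 and the mean-only model scores 0, so the baseline above sits between the two.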


## Data Preparation and Feature Engineering

Multiple data preparation and feature engineering steps will be performed in order to improve the model's prediction power

### Extracting Day Variable

Here we extract the day from the date variable

In [17]:
def extract_day(df):
    df['day'] = df['dteday'].dt.day
    df = hp.convert_to_category(df, ['day'])
    return df

data = extract_day(data)
data.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt,precipitation,registered_1,registered_24,day
24,25,2011-01-02,1,0,1,0,0,0,0,2,0.46,0.4545,0.88,0.2985,17,1,24.0,13.0,2
25,26,2011-01-02,1,0,1,1,0,0,0,2,0.44,0.4394,0.94,0.2537,17,1,13.0,32.0,2
26,27,2011-01-02,1,0,1,2,0,0,0,2,0.42,0.4242,1.0,0.2836,9,1,16.0,27.0,2
27,28,2011-01-02,1,0,1,3,0,0,0,2,0.46,0.4545,0.94,0.194,6,1,8.0,10.0,2
28,29,2011-01-02,1,0,1,4,0,0,0,2,0.46,0.4545,0.94,0.194,3,1,4.0,1.0,2
