## 0. Objectives of this practical
1. Build a decision tree model to predict daily bicycle rentals, and evaluate it using the R2 and RMSE metrics;
2. Tune the maximum tree depth hyperparameter (`max_depth`) of this model, using two validation approaches.

## 1. Load Libs

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, GridSearchCV, validation_curve

import statsmodels.api as sm

pd.set_option('display.max_rows', 300) # specifies number of rows to show
pd.options.display.float_format = '{:40,.4f}'.format # specifies default number format to 4 decimal places
plt.style.use('ggplot') # specifies that graphs should use ggplot styling
%matplotlib inline


## 2. Data Loading and Exploration

The dataset you will use relates to daily counts of rented bicycles from the bicycle rental company Capital-Bikeshare in Washington D.C., along with weather and seasonal information. The goal here is to predict how many bikes will be rented depending on the weather and the day. The original data can be downloaded from the UCI Machine Learning Repository.

The dataset used in this workshop has been slightly processed by Christoph Molnar using the processing R script from this GitHub repository. Here, the dataset is provided as a CSV file on Moodle.

Here is a list of the variables in the dataset:

* Count of bicycles including both casual and registered users. The count is used as the response in the regression task.
* Indicator of the season, either spring, summer, fall or winter.
* Indicator whether the day was a holiday or not.
* The year: either 2011 or 2012.
* Number of days since the 01.01.2011 (the first day in the dataset). This predictor was introduced to take account of the trend over time.
* Indicator whether the day was a working day or weekend.
* The weather situation on that day. One of:
    * 'GOOD': including clear, few clouds, partly cloudy, cloudy
    * 'MISTY': including mist + clouds, mist + broken clouds, mist + few clouds, mist
    * 'RAIN/SNOW/STORM': including light snow, light rain + thunderstorm + scattered clouds, light rain + scattered clouds, heavy rain + ice pellets + thunderstorm + mist, snow + mist
* Temperature in degrees Celsius.
* Relative humidity in percent (0 to 100).
* Wind speed in km/h.

We will use the pandas package to load and explore this dataset:

* Import the bike rental dataset as a pandas DataFrame (call it bike_rental)
* Inspect the data
* Calculate summary statistics on all attributes
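
The summary-statistics step can be sketched as follows. This is a minimal illustration on a small made-up DataFrame (`toy` and its values are assumptions, not the real bike data); the same call should work on `bike_rental` once it is loaded.

```python
import pandas as pd

# Hypothetical toy data standing in for bike_rental (values are made up)
toy = pd.DataFrame({
    'cnt': [985, 801, 1349, 1562],
    'season': ['SPRING', 'SPRING', 'SUMMER', 'FALL'],
    'temp': [8.2, 9.1, 1.2, 1.4],
})

# include='all' summarises numerical and categorical columns together:
# numerical columns get mean/std/quantiles, categorical ones get unique/top/freq
summary = toy.describe(include='all')
print(summary)
```

Without `include='all'`, `describe()` reports only the numerical columns.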



In [5]:
bike_rental = pd.read_csv('https://raw.githubusercontent.com/huanfachen/Spatial_Data_Science/main/Dataset/daily_count_bike_rental.csv')
# drop the year variable as it is not useful
bike_rental = bike_rental.drop('yr', axis=1)

Check out the dataset:

In [13]:
bike_rental.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   cnt              731 non-null    int64  
 1   season           731 non-null    object 
 2   mnth             731 non-null    object 
 3   holiday          731 non-null    object 
 4   weekday          731 non-null    object 
 5   workingday       731 non-null    object 
 6   weathersit       731 non-null    object 
 7   temp             731 non-null    float64
 8   hum              731 non-null    float64
 9   windspeed        731 non-null    float64
 10  days_since_2011  731 non-null    int64  
dtypes: float64(3), int64(2), object(6)
memory usage: 62.9+ KB


In [7]:
bike_rental.shape

(731, 11)

In [9]:
bike_rental.columns

Index(['cnt', 'season', 'mnth', 'holiday', 'weekday', 'workingday',
       'weathersit', 'temp', 'hum', 'windspeed', 'days_since_2011'],
      dtype='object')

In [11]:
bike_rental.head()

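|   | cnt | season | mnth | holiday | weekday | workingday | weathersit | temp | hum | windspeed | days_since_2011 |
|---|-----|--------|------|---------|---------|------------|------------|------|-----|-----------|-----------------|
| 0 | 985 | SPRING | JAN | NO HOLIDAY | SAT | NO WORKING DAY | MISTY | 8.1758 | 80.5833 | 10.7499 | 0 |
| 1 | 801 | SPRING | JAN | NO HOLIDAY | SUN | NO WORKING DAY | MISTY | 9.0835 | 69.6087 | 16.6521 | 1 |
| 2 | 1349 | SPRING | JAN | NO HOLIDAY | MON | WORKING DAY | GOOD | 1.2291 | 43.7273 | 16.6367 | 2 |
| 3 | 1562 | SPRING | JAN | NO HOLIDAY | TUE | WORKING DAY | GOOD | 1.4 | 59.0435 | 10.7398 | 3 |
| 4 | 1600 | SPRING | JAN | NO HOLIDAY | WED | WORKING DAY | GOOD | 2.667 | 43.6957 | 12.5223 | 4 |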


In bike_rental, there are two data types: <span style="color:yellow">categorical (aka object)</span>, and <span style="color:gold">numerical (including int64 and float64)</span>.
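
One way to list the columns of each type is `select_dtypes`. The sketch below uses a small hypothetical DataFrame (`toy`, with made-up values) rather than the real data, and the variable names `categorical_cols` and `numerical_cols` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with one categorical and two numerical columns
toy = pd.DataFrame({
    'cnt': [985, 801],
    'season': ['SPRING', 'SUMMER'],
    'temp': [8.2, 9.1],
})

# Numerical columns: int64 and float64 both match 'number'
numerical_cols = toy.select_dtypes(include='number').columns.tolist()

# Categorical columns: everything that is not numerical
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numerical_cols, categorical_cols)
```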

Before undertaking regression, some data processing should be done, which includes:

* Converting categorical variables into dummy variables (aka one-hot encoding);

* Splitting the data into training and testing sets;

* For linear regression, dealing with multicollinearity (and removing some variables if necessary).

(<span style="color:cyan">Also clean up the column names if they contain special characters.</span>)
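
Cleaning up column names that contain special characters could look like the sketch below. The regular expression and the toy column names are assumptions chosen for illustration (spaces and slashes, as would arise from category labels such as 'NO HOLIDAY' or 'RAIN/SNOW/STORM'); this practical does not prescribe a particular naming scheme.

```python
import pandas as pd

# Hypothetical dummy-variable columns whose names contain a space and slashes
toy = pd.DataFrame({
    'holiday_NO HOLIDAY': [1, 0],
    'weathersit_RAIN/SNOW/STORM': [0, 1],
})

# Replace every run of characters other than letters, digits or '_' with '_'
toy.columns = toy.columns.str.replace(r'[^0-9A-Za-z_]+', '_', regex=True)
clean_names = toy.columns.tolist()
print(clean_names)
```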

Note that the reason for one-hot encoding is that sklearn decision trees don't handle categorical data directly. However, other packages have more efficient ways of handling categorical variables than one-hot encoding for tree-based models.

We will discuss this issue in more depth next week. For now, we stick to one-hot encoding.

[This link](https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree) is a good reference.




### 2.1 Converting categorical variables

First, we need to convert categorical variables into dummy/indicator variables, using One-Hot Encoding.

In [14]:
bike_rentail_numeric = pd.get_dummies(bike_rental)

In [15]:
# check out the new dataFrame
bike_rentail_numeric.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 35 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   cnt                         731 non-null    int64  
 1   temp                        731 non-null    float64
 2   hum                         731 non-null    float64
 3   windspeed                   731 non-null    float64
 4   days_since_2011             731 non-null    int64  
 5   season_FALL                 731 non-null    bool   
 6   season_SPRING               731 non-null    bool   
 7   season_SUMMER               731 non-null    bool   
 8   season_WINTER               731 non-null    bool   
 9   mnth_APR                    731 non-null    bool   
 10  mnth_AUG                    731 non-null    bool   
 11  mnth_DEZ                    731 non-null    bool   
 12  mnth_FEB                    731 non-null    bool   
 13  mnth_JAN                    731 non

Remember that a categorical variable with K categories (or levels) usually enters a regression as a sequence of K-1 dummy variables. The level that is left out becomes the reference level, and this is important for interpreting the regression model.
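
To make the K versus K-1 point concrete, here is a minimal sketch on a hypothetical one-column DataFrame (`toy` is made up for illustration). It uses the `drop_first=True` option of `pd.get_dummies`, which is an alternative to dropping columns by hand.

```python
import pandas as pd

# Hypothetical categorical variable with K = 4 levels
toy = pd.DataFrame({'season': ['SPRING', 'SUMMER', 'FALL', 'WINTER', 'SPRING']})

# Full one-hot encoding: one column per level (K columns)
all_levels = pd.get_dummies(toy)

# drop_first=True removes the first level in sorted order (K - 1 columns)
k_minus_1 = pd.get_dummies(toy, drop_first=True)

print(all_levels.columns.tolist())
print(k_minus_1.columns.tolist())
```

Note that `drop_first=True` always drops the first level in sorted order (FALL in this toy example), so it gives no control over which level becomes the reference.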

Here we manually choose the reference level for each categorical variable and exclude them from the DataFrame. You can change the reference levels if you want.

In [16]:
bike_rental_final = bike_rentail_numeric.drop(['season_SPRING', 'mnth_JAN', 'holiday_NO HOLIDAY', 'weekday_MON', 'workingday_WORKING DAY', 'weathersit_GOOD'], axis=1)

# double check the result
bike_rental_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   cnt                         731 non-null    int64  
 1   temp                        731 non-null    float64
 2   hum                         731 non-null    float64
 3   windspeed                   731 non-null    float64
 4   days_since_2011             731 non-null    int64  
 5   season_FALL                 731 non-null    bool   
 6   season_SUMMER               731 non-null    bool   
 7   season_WINTER               731 non-null    bool   
 8   mnth_APR                    731 non-null    bool   
 9   mnth_AUG                    731 non-null    bool   
 10  mnth_DEZ                    731 non-null    bool   
 11  mnth_FEB                    731 non-null    bool   
 12  mnth_JUL                    731 non-null    bool   
 13  mnth_JUN                    731 non

### 2.2 Splitting data into random train and test subsets

By default, train_test_split splits the data 75:25 (train:test). Other proportions can be specified; check out the documentation for details.

Remember that the split should be random in order to avoid selection bias. Here, we set random_state=100 to guarantee reproducibility.

From the documentation:

The first argument of this function:

`*arrays`: sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

The output of this function:

`splitting`: list, length=2 * len(arrays)
List containing train-test split of inputs.

Here we input two dataframes (X and Y) and will get four outputs (train_x, test_x, train_y, test_y).
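
The behaviour described above can be checked on a small example. This is a sketch on hypothetical toy data (the 8-row `X` and `Y` below are made up); it shows the two-inputs-to-four-outputs pattern, the default 75:25 split, and the effect of fixing `random_state`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 rows, two predictors (X) and one response (Y)
X = pd.DataFrame({'temp': np.arange(8.0), 'hum': np.arange(8.0) * 10})
Y = pd.DataFrame({'cnt': np.arange(8) * 100})

# Two inputs give four outputs; with no test_size given, 25% goes to the test set
train_x, test_x, train_y, test_y = train_test_split(X, Y, random_state=100)

# Repeating the call with the same random_state reproduces the same split
train_x2, test_x2, train_y2, test_y2 = train_test_split(X, Y, random_state=100)

print(train_x.shape, test_x.shape)
print(train_x.equals(train_x2))
```

The rows of `train_x` and `train_y` stay aligned because both inputs are shuffled with the same permutation.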
