# Bike Sharing Demand

Forecast use of a city bikeshare system

[Get started on this competition through Kaggle Scripts](https://www.kaggle.com/c/bike-sharing-demand/forums/t/13228/kaggle-scripts/69563#post69563)

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able rent a bike from a one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

![Bikes](https://storage.googleapis.com/kaggle-competitions/kaggle/3948/media/bikes.png)

Acknowledgements
----------------

Kaggle is hosting this competition for the machine learning community to use for fun and practice. This dataset was provided by Hadi Fanaee Tork using data from [Capital Bikeshare](http://www.capitalbikeshare.com/system-data). We also thank the UCI machine learning repository for [hosting the dataset](http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset). If you use the problem in publication, please cite:

Fanaee-T, Hadi, and Gama, Joao, _Event labeling combining ensemble detectors and background knowledge_, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.

Dataset Description
-------------------

[See, fork, and run a random forest benchmark model through Kaggle Scripts](https://www.kaggle.com/users/993/ben-hamner/bike-sharing-demand/random-forest-benchmark)

You are provided hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.

Data Fields
-----------

**datetime** - hourly date + timestamp    
**season** -  1 = spring, 2 = summer, 3 = fall, 4 = winter   
**holiday** - whether the day is considered a holiday  
**workingday** - whether the day is neither a weekend nor holiday  
**weather** - 1: Clear, Few clouds, Partly cloudy, Partly cloudy  
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist  
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds  
4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog   
**temp** - temperature in Celsius  
**atemp** - "feels like" temperature in Celsius  
**humidity** - relative humidity  
**windspeed** - wind speed  
**casual** - number of non-registered user rentals initiated  
**registered** - number of registered user rentals initiated  
**count** - number of total rentals

Link: https://www.kaggle.com/c/bike-sharing-demand

In [1]:
import numpy as np
import pandas as pd
from catboost import (
    CatBoostRegressor,
    Pool,
    sum_models,
    to_regressor,
    EFeaturesSelectionAlgorithm,
    EShapCalcType,
)
from sklearn.model_selection import TimeSeriesSplit, train_test_split
from sklearn.metrics import mean_squared_error
from tqdm.notebook import tqdm

In [2]:
%load_ext nb_black

<IPython.core.display.Javascript object>

In [3]:
sampleSubmission_df = pd.read_csv("../../data/bike-sharing-demand/sampleSubmission.csv")
sampleSubmission_df

Unnamed: 0,datetime,count
0,2011-01-20 00:00:00,0
1,2011-01-20 01:00:00,0
2,2011-01-20 02:00:00,0
3,2011-01-20 03:00:00,0
4,2011-01-20 04:00:00,0
...,...,...
6488,2012-12-31 19:00:00,0
6489,2012-12-31 20:00:00,0
6490,2012-12-31 21:00:00,0
6491,2012-12-31 22:00:00,0


<IPython.core.display.Javascript object>

In [4]:
test_df = pd.read_csv(
    "../../data/bike-sharing-demand/test.csv",
    parse_dates=["datetime"],
    index_col="datetime",
)
test_df

Unnamed: 0_level_0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2011-01-20 00:00:00,1,0,1,1,10.66,11.365,56,26.0027
2011-01-20 01:00:00,1,0,1,1,10.66,13.635,56,0.0000
2011-01-20 02:00:00,1,0,1,1,10.66,13.635,56,0.0000
2011-01-20 03:00:00,1,0,1,1,10.66,12.880,56,11.0014
2011-01-20 04:00:00,1,0,1,1,10.66,12.880,56,11.0014
...,...,...,...,...,...,...,...,...
2012-12-31 19:00:00,1,0,1,2,10.66,12.880,60,11.0014
2012-12-31 20:00:00,1,0,1,2,10.66,12.880,60,11.0014
2012-12-31 21:00:00,1,0,1,1,10.66,12.880,60,11.0014
2012-12-31 22:00:00,1,0,1,1,10.66,13.635,56,8.9981


<IPython.core.display.Javascript object>

In [5]:
train_df = pd.read_csv(
    "../../data/bike-sharing-demand/train.csv",
    parse_dates=["datetime"],
    index_col="datetime",
)
train_df

Unnamed: 0_level_0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...
2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336
2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241
2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168
2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129


<IPython.core.display.Javascript object>

In [6]:
df = pd.concat([train_df, test_df])
df["isTest"] = df.index.isin(test_df.index).astype(int)
df

Unnamed: 0_level_0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,isTest
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3.0,13.0,16.0,0
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8.0,32.0,40.0,0
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5.0,27.0,32.0,0
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3.0,10.0,13.0,0
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0.0,1.0,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
2012-12-31 19:00:00,1,0,1,2,10.66,12.880,60,11.0014,,,,1
2012-12-31 20:00:00,1,0,1,2,10.66,12.880,60,11.0014,,,,1
2012-12-31 21:00:00,1,0,1,1,10.66,12.880,60,11.0014,,,,1
2012-12-31 22:00:00,1,0,1,1,10.66,13.635,56,8.9981,,,,1


<IPython.core.display.Javascript object>

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 17379 entries, 2011-01-01 00:00:00 to 2012-12-31 23:00:00
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   season      17379 non-null  int64  
 1   holiday     17379 non-null  int64  
 2   workingday  17379 non-null  int64  
 3   weather     17379 non-null  int64  
 4   temp        17379 non-null  float64
 5   atemp       17379 non-null  float64
 6   humidity    17379 non-null  int64  
 7   windspeed   17379 non-null  float64
 8   casual      10886 non-null  float64
 9   registered  10886 non-null  float64
 10  count       10886 non-null  float64
 11  isTest      17379 non-null  int64  
dtypes: float64(6), int64(6)
memory usage: 1.7 MB


<IPython.core.display.Javascript object>

In [8]:
(df.isna().sum() / len(df)).sort_values(ascending=False)

casual        0.373612
registered    0.373612
count         0.373612
season        0.000000
holiday       0.000000
workingday    0.000000
weather       0.000000
temp          0.000000
atemp         0.000000
humidity      0.000000
windspeed     0.000000
isTest        0.000000
dtype: float64

<IPython.core.display.Javascript object>

# Feature engineering

In [9]:
# https://www.aboutdatablog.com/post/extracting-features-from-dates-in-pandas
df["year"] = df.index.year
df["month"] = df.index.month
df["day"] = df.index.day
df["hour"] = df.index.hour
df["minute"] = df.index.minute
df["day_of_year"] = df.index.day_of_year
df["week"] = df.index.isocalendar().week
df["day_of_week"] = df.index.day_of_week
df["quarter"] = df.index.quarter
df["is_weekend"] = df.index.weekday.isin([5, 6]).astype(int)

df

Unnamed: 0_level_0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,...,year,month,day,hour,minute,day_of_year,week,day_of_week,quarter,is_weekend
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3.0,13.0,...,2011,1,1,0,0,1,52,5,1,1
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8.0,32.0,...,2011,1,1,1,0,1,52,5,1,1
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5.0,27.0,...,2011,1,1,2,0,1,52,5,1,1
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3.0,10.0,...,2011,1,1,3,0,1,52,5,1,1
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0.0,1.0,...,2011,1,1,4,0,1,52,5,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2012-12-31 19:00:00,1,0,1,2,10.66,12.880,60,11.0014,,,...,2012,12,31,19,0,366,1,0,4,0
2012-12-31 20:00:00,1,0,1,2,10.66,12.880,60,11.0014,,,...,2012,12,31,20,0,366,1,0,4,0
2012-12-31 21:00:00,1,0,1,1,10.66,12.880,60,11.0014,,,...,2012,12,31,21,0,366,1,0,4,0
2012-12-31 22:00:00,1,0,1,1,10.66,13.635,56,8.9981,,,...,2012,12,31,22,0,366,1,0,4,0


<IPython.core.display.Javascript object>

In [10]:
# https://www.analyticsvidhya.com/blog/2019/12/6-powerful-feature-engineering-techniques-time-series/
# https://habr.com/ru/company/ods/blog/327242/


def extract_ts_features(df, max_lag=7, rolling_size=4, expanding_size=2):
    ts_df = pd.DataFrame()

    for col_name in df.columns:
        # shift
        for lag in range(1, max_lag + 1):
            ts_df["{}_lag_{}".format(col_name, lag)] = df[col_name].shift(lag)

        # rolling
        ts_df["{}_rolling_mean".format(col_name)] = (
            df[col_name].rolling(rolling_size).mean()
        )

        # expanding window
        ts_df["{}_expanding_mean".format(col_name)] = (
            df[col_name].expanding(expanding_size).mean()
        )

    return ts_df.fillna(0)


df1 = df[["temp", "atemp", "humidity", "windspeed"]].copy()
ts_df = extract_ts_features(df1)
ts_df

Unnamed: 0_level_0,temp_lag_1,temp_lag_2,temp_lag_3,temp_lag_4,temp_lag_5,temp_lag_6,temp_lag_7,temp_rolling_mean,temp_expanding_mean,atemp_lag_1,...,humidity_expanding_mean,windspeed_lag_1,windspeed_lag_2,windspeed_lag_3,windspeed_lag_4,windspeed_lag_5,windspeed_lag_6,windspeed_lag_7,windspeed_rolling_mean,windspeed_expanding_mean
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2011-01-01 00:00:00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.000000,0.000,...,0.000000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
2011-01-01 01:00:00,9.84,0.00,0.00,0.00,0.00,0.00,0.00,0.00,9.430000,14.395,...,80.500000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
2011-01-01 02:00:00,9.02,9.84,0.00,0.00,0.00,0.00,0.00,0.00,9.293333,13.635,...,80.333333,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
2011-01-01 03:00:00,9.02,9.02,9.84,0.00,0.00,0.00,0.00,9.43,9.430000,13.635,...,79.000000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
2011-01-01 04:00:00,9.84,9.02,9.02,9.84,0.00,0.00,0.00,9.43,9.512000,14.395,...,78.200000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2012-12-31 19:00:00,10.66,10.66,10.66,11.48,11.48,10.66,9.84,10.66,20.378711,13.635,...,62.723453,8.9981,6.0032,12.9980,8.9981,15.0013,11.0014,12.9980,9.750175,12.737170
2012-12-31 20:00:00,10.66,10.66,10.66,10.66,11.48,11.48,10.66,10.66,20.378151,12.880,...,62.723297,11.0014,8.9981,6.0032,12.9980,8.9981,15.0013,11.0014,9.251025,12.737070
2012-12-31 21:00:00,10.66,10.66,10.66,10.66,10.66,11.48,11.48,10.66,20.377592,12.880,...,62.723140,11.0014,11.0014,8.9981,6.0032,12.9980,8.9981,15.0013,10.500575,12.736970
2012-12-31 22:00:00,10.66,10.66,10.66,10.66,10.66,10.66,11.48,10.66,20.377033,12.880,...,62.722753,11.0014,11.0014,11.0014,8.9981,6.0032,12.9980,8.9981,10.500575,12.736755


<IPython.core.display.Javascript object>

In [11]:
df = df.join(ts_df)
df

Unnamed: 0_level_0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,...,humidity_expanding_mean,windspeed_lag_1,windspeed_lag_2,windspeed_lag_3,windspeed_lag_4,windspeed_lag_5,windspeed_lag_6,windspeed_lag_7,windspeed_rolling_mean,windspeed_expanding_mean
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3.0,13.0,...,0.000000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8.0,32.0,...,80.500000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5.0,27.0,...,80.333333,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3.0,10.0,...,79.000000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0.0,1.0,...,78.200000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2012-12-31 19:00:00,1,0,1,2,10.66,12.880,60,11.0014,,,...,62.723453,8.9981,6.0032,12.9980,8.9981,15.0013,11.0014,12.9980,9.750175,12.737170
2012-12-31 20:00:00,1,0,1,2,10.66,12.880,60,11.0014,,,...,62.723297,11.0014,8.9981,6.0032,12.9980,8.9981,15.0013,11.0014,9.251025,12.737070
2012-12-31 21:00:00,1,0,1,1,10.66,12.880,60,11.0014,,,...,62.723140,11.0014,11.0014,8.9981,6.0032,12.9980,8.9981,15.0013,10.500575,12.736970
2012-12-31 22:00:00,1,0,1,1,10.66,13.635,56,8.9981,,,...,62.722753,11.0014,11.0014,11.0014,8.9981,6.0032,12.9980,8.9981,10.500575,12.736755


<IPython.core.display.Javascript object>

# Prepare

In [12]:
X_test = (
    df[df["isTest"] == 1]
    .drop(["casual", "registered", "count", "isTest"], axis=1)
    .copy()
)
X_test

Unnamed: 0_level_0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,year,month,...,humidity_expanding_mean,windspeed_lag_1,windspeed_lag_2,windspeed_lag_3,windspeed_lag_4,windspeed_lag_5,windspeed_lag_6,windspeed_lag_7,windspeed_rolling_mean,windspeed_expanding_mean
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2011-01-20 00:00:00,1,0,1,1,10.66,11.365,56,26.0027,2011,1,...,61.885919,8.9981,6.0032,15.0013,15.0013,26.0027,23.9994,26.0027,14.001325,12.800608
2011-01-20 01:00:00,1,0,1,1,10.66,13.635,56,0.0000,2011,1,...,61.885378,26.0027,8.9981,6.0032,15.0013,15.0013,26.0027,23.9994,10.251000,12.799433
2011-01-20 02:00:00,1,0,1,1,10.66,13.635,56,0.0000,2011,1,...,61.884838,0.0000,26.0027,8.9981,6.0032,15.0013,15.0013,26.0027,8.750200,12.798257
2011-01-20 03:00:00,1,0,1,1,10.66,12.880,56,11.0014,2011,1,...,61.884298,0.0000,0.0000,26.0027,8.9981,6.0032,15.0013,15.0013,9.251025,12.798092
2011-01-20 04:00:00,1,0,1,1,10.66,12.880,56,11.0014,2011,1,...,61.883757,11.0014,0.0000,0.0000,26.0027,8.9981,6.0032,15.0013,5.500700,12.797927
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2012-12-31 19:00:00,1,0,1,2,10.66,12.880,60,11.0014,2012,12,...,62.723453,8.9981,6.0032,12.9980,8.9981,15.0013,11.0014,12.9980,9.750175,12.737170
2012-12-31 20:00:00,1,0,1,2,10.66,12.880,60,11.0014,2012,12,...,62.723297,11.0014,8.9981,6.0032,12.9980,8.9981,15.0013,11.0014,9.251025,12.737070
2012-12-31 21:00:00,1,0,1,1,10.66,12.880,60,11.0014,2012,12,...,62.723140,11.0014,11.0014,8.9981,6.0032,12.9980,8.9981,15.0013,10.500575,12.736970
2012-12-31 22:00:00,1,0,1,1,10.66,13.635,56,8.9981,2012,12,...,62.722753,11.0014,11.0014,11.0014,8.9981,6.0032,12.9980,8.9981,10.500575,12.736755


<IPython.core.display.Javascript object>

In [13]:
X_train = (
    df[df["isTest"] == 0]
    .drop(["casual", "registered", "count", "isTest"], axis=1)
    .copy()
)
X_train

Unnamed: 0_level_0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,year,month,...,humidity_expanding_mean,windspeed_lag_1,windspeed_lag_2,windspeed_lag_3,windspeed_lag_4,windspeed_lag_5,windspeed_lag_6,windspeed_lag_7,windspeed_rolling_mean,windspeed_expanding_mean
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,2011,1,...,0.000000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,2011,1,...,80.500000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,2011,1,...,80.333333,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,2011,1,...,79.000000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,2011,1,...,78.200000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,2012,12,...,61.886694,23.9994,26.0027,23.9994,19.0012,12.9980,12.9980,19.0012,25.001050,12.799965
2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,2012,12,...,61.886245,26.0027,23.9994,26.0027,23.9994,19.0012,12.9980,12.9980,22.751525,12.800167
2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,2012,12,...,61.886163,15.0013,26.0027,23.9994,26.0027,23.9994,19.0012,12.9980,20.001175,12.800369
2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,2012,12,...,61.886082,15.0013,15.0013,26.0027,23.9994,26.0027,23.9994,19.0012,15.502125,12.799745


<IPython.core.display.Javascript object>

In [14]:
y_train = df[df["isTest"] == 0][["count"]].copy()
y_train["count"] = np.log1p(y_train["count"])  # RMSLE
y_train

Unnamed: 0_level_0,count
datetime,Unnamed: 1_level_1
2011-01-01 00:00:00,2.833213
2011-01-01 01:00:00,3.713572
2011-01-01 02:00:00,3.496508
2011-01-01 03:00:00,2.639057
2011-01-01 04:00:00,0.693147
...,...
2012-12-19 19:00:00,5.820083
2012-12-19 20:00:00,5.488938
2012-12-19 21:00:00,5.129899
2012-12-19 22:00:00,4.867534


<IPython.core.display.Javascript object>

In [15]:
X_train, X_true, y_train, y_true = train_test_split(
    X_train, y_train, test_size=0.1, shuffle=False, random_state=42
)
X_train.shape, X_true.shape, y_train.shape, y_true.shape

((9797, 54), (1089, 54), (9797, 1), (1089, 1))

<IPython.core.display.Javascript object>

# Train

## Hyperparameter tuning

In [16]:
tscv = TimeSeriesSplit(n_splits=5)
tscv

TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None)

<IPython.core.display.Javascript object>

In [17]:
model = CatBoostRegressor(logging_level="Silent")

# https://docs.aws.amazon.com/sagemaker/latest/dg/catboost-tuning.html
tuned_params = {
    "learning_rate": [
        0.001,
        0.002,
        0.003,
        0.004,
        0.005,
        0.006,
        0.007,
        0.008,
        0.009,
        0.01,
    ],
    "depth": [4, 5, 6, 7, 8, 9, 10],
    "l2_leaf_reg": [2, 3, 4, 5, 6, 7, 8, 9, 10],
    "random_strength": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "iterations": [500, 600, 700, 800, 900, 1000],
}

grid_search_result = model.randomized_search(
    tuned_params, Pool(X_train, y_train), cv=tscv, verbose=False, plot=True
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

<IPython.core.display.Javascript object>

In [18]:
best_model_params = grid_search_result["params"]
best_model_params

{'depth': 9,
 'l2_leaf_reg': 8,
 'iterations': 1000,
 'random_strength': 3.0,
 'learning_rate': 0.008}

<IPython.core.display.Javascript object>

## Feature selection

In [19]:
importance_df = pd.DataFrame(
    {
        "Column": X_train.columns,
        "Score": model.get_feature_importance(),
    }
).sort_values(by="Score", ascending=False)

X_sf = X_train[importance_df["Column"]]
y_sf = y_train

X_sf.shape, y_sf.shape

((9797, 54), (9797, 1))

<IPython.core.display.Javascript object>

In [20]:
num_list = list(range(10, X_sf.shape[1], 3))
num_list

[10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52]

<IPython.core.display.Javascript object>

In [21]:
def select_features_loop(X, y, num_features=10):
    X = X.iloc[:, :num_features]

    X_sub_train, X_sub_val, y_sub_train, y_sub_val = train_test_split(
        X, y, test_size=0.1, shuffle=False, random_state=42
    )

    model = CatBoostRegressor(**best_model_params, logging_level="Silent")
    model.fit(
        Pool(X_sub_train, y_sub_train),
        eval_set=Pool(X_sub_val, y_sub_val),
        verbose=False,
    )

    score = mean_squared_error(y_true, model.predict(X_true), squared=False)

    return [num_features, score]


loss_list = []
for num_features in tqdm(num_list):
    loss_values = select_features_loop(X_sf, y_sf, num_features)
    loss_list.append(loss_values)

pd.DataFrame(loss_list, columns=["num_features", "score"]).set_index(
    "num_features"
).sort_values(by="score").head()

  0%|          | 0/15 [00:00<?, ?it/s]

Unnamed: 0_level_0,score
num_features,Unnamed: 1_level_1
46,0.445987
43,0.452033
25,0.454561
52,0.455287
49,0.455546


<IPython.core.display.Javascript object>

In [22]:
X_train = X_train.iloc[:, :46]
X_train.columns

Index(['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp',
       'humidity', 'windspeed', 'year', 'month', 'day', 'hour', 'minute',
       'day_of_year', 'week', 'day_of_week', 'quarter', 'is_weekend',
       'temp_lag_1', 'temp_lag_2', 'temp_lag_3', 'temp_lag_4', 'temp_lag_5',
       'temp_lag_6', 'temp_lag_7', 'temp_rolling_mean', 'temp_expanding_mean',
       'atemp_lag_1', 'atemp_lag_2', 'atemp_lag_3', 'atemp_lag_4',
       'atemp_lag_5', 'atemp_lag_6', 'atemp_lag_7', 'atemp_rolling_mean',
       'atemp_expanding_mean', 'humidity_lag_1', 'humidity_lag_2',
       'humidity_lag_3', 'humidity_lag_4', 'humidity_lag_5', 'humidity_lag_6',
       'humidity_lag_7', 'humidity_rolling_mean', 'humidity_expanding_mean',
       'windspeed_lag_1'],
      dtype='object')

<IPython.core.display.Javascript object>

## Loop

In [23]:
def train_loop(X_train, y_train):
    ensemble = []

    for i, (train_index, val_index) in enumerate(tscv.split(X_train)):
        X_sub_train, X_sub_val = X_train.iloc[train_index], X_train.iloc[val_index]
        y_sub_train, y_sub_val = y_train.iloc[train_index], y_train.iloc[val_index]

        model = CatBoostRegressor(**best_model_params, logging_level="Silent")

        model.fit(
            Pool(X_sub_train, y_sub_train),
            eval_set=Pool(X_sub_val, y_sub_val),
            verbose=False,
        )

        ensemble.append(model)
        print(model.best_score_)

    return ensemble


ensemble = train_loop(X_train, y_train)

{'learn': {'RMSE': 0.3691517750318588}, 'validation': {'RMSE': 0.7724490462121082}}
{'learn': {'RMSE': 0.34902052907150627}, 'validation': {'RMSE': 0.5194982514177947}}
{'learn': {'RMSE': 0.33727648760437695}, 'validation': {'RMSE': 0.6617173371175511}}
{'learn': {'RMSE': 0.3306123350788192}, 'validation': {'RMSE': 0.4758906974556173}}
{'learn': {'RMSE': 0.3269694855362372}, 'validation': {'RMSE': 0.42524697144876444}}


<IPython.core.display.Javascript object>

In [24]:
models_avrg = to_regressor(
    sum_models(ensemble, weights=[1.0 / len(ensemble)] * len(ensemble))
)
models_avrg

<catboost.core.CatBoostRegressor at 0x7f5fcb046e50>

<IPython.core.display.Javascript object>

# Validate

In [25]:
y_preds_1 = models_avrg.predict(X_true)
y_preds_1

array([5.33364941, 5.36218346, 5.4966825 , ..., 4.76573145, 4.51426993,
       4.24447422])

<IPython.core.display.Javascript object>

In [26]:
mean_squared_error(y_true, y_preds_1, squared=False)

0.6878338095465699

<IPython.core.display.Javascript object>

# Combine frame

In [27]:
X_train_1 = pd.concat([X_train, X_true])
y_train_1 = pd.concat([y_train, y_true])

X_train_1.shape, y_train_1.shape

((10886, 54), (10886, 1))

<IPython.core.display.Javascript object>

In [28]:
ensemble = train_loop(X_train_1, y_train_1)

{'learn': {'RMSE': 0.3694597209692936}, 'validation': {'RMSE': 0.7610691264188832}}
{'learn': {'RMSE': 0.3441705950051839}, 'validation': {'RMSE': 0.6047196462406584}}
{'learn': {'RMSE': 0.3315627863476515}, 'validation': {'RMSE': 0.6854839170811264}}
{'learn': {'RMSE': 0.33447700662826485}, 'validation': {'RMSE': 0.41764445181691007}}
{'learn': {'RMSE': 0.32019396463354305}, 'validation': {'RMSE': 0.42252799903395305}}


<IPython.core.display.Javascript object>

In [29]:
models_avrg = to_regressor(
    sum_models(ensemble, weights=[1.0 / len(ensemble)] * len(ensemble))
)
models_avrg

<catboost.core.CatBoostRegressor at 0x7f5fcb091c40>

<IPython.core.display.Javascript object>

# Submission

In [30]:
y_preds_avrg = models_avrg.predict(X_test)
y_preds_avrg

array([2.80635494, 2.30283064, 1.93065592, ..., 4.48283304, 4.25620547,
       4.03190534])

<IPython.core.display.Javascript object>

In [31]:
submission = pd.DataFrame(
    {"datetime": X_test.index, "count": np.exp(y_preds_avrg)}
).set_index("datetime")
submission

Unnamed: 0_level_0,count
datetime,Unnamed: 1_level_1
2011-01-20 00:00:00,16.549484
2011-01-20 01:00:00,10.002456
2011-01-20 02:00:00,6.894031
2011-01-20 03:00:00,5.271689
2011-01-20 04:00:00,5.326224
...,...
2012-12-31 19:00:00,140.586401
2012-12-31 20:00:00,111.740839
2012-12-31 21:00:00,88.484999
2012-12-31 22:00:00,70.541802


<IPython.core.display.Javascript object>

In [32]:
submission.to_csv("../../data/bike-sharing-demand/submission.csv")

<IPython.core.display.Javascript object>