#  Gradient Boosting Regression over the Bike Sharing Dataset


## The dataset

Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data they generate make them attractive for research. Unlike other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.

## The task

This is [my approach](http://alonsopg.com/) to the UCI [bike sharing dataset](https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset). I will load the file ``data/bike_day_raw.csv``, which has the regression target ``cnt``. The file holds daily bike rental counts from the citybike platform. The ``cnt`` column is the number of rentals, which we want to predict from date and weather data.

I will split the data into a training and a test set using ``KFold`` (a single ``train_test_split`` is an alternative).
I then use the ``XGBRegressor`` class from ``xgboost`` to learn a regression model on this data. I also show how to evaluate it with the ``score`` method, which returns the $R^2$, and with the ``mean_squared_error`` function from ``sklearn.metrics`` (which can also be written by hand in NumPy).
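
As a reference for the two metrics mentioned above, here is a minimal NumPy sketch of the mean squared error and $R^2$. The helper names `mse` and `r2` and the toy values are made up for illustration; they are not part of the notebook or of scikit-learn.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


# Toy values, not taken from the bike data
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 8.0, 9.0]
print("MSE:", mse(y_true, y_hat))
print("R^2:", r2(y_true, y_hat))
```

An $R^2$ of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the targets.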

In [1]:
import pandas as pd
print('pandas version:',pd.__version__)
df = pd.read_csv("/Users/user/Jupyter/Courses/advanced_training-master/data/bike_day_raw.csv")
df.tail(2)

pandas version: 0.18.1


Unnamed: 0,season,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
729,1,12,0,0,0,1,0.255833,0.2317,0.483333,0.350754,1796
730,1,12,0,1,1,2,0.215833,0.223487,0.5775,0.154846,2729


In [2]:
X = df.drop("cnt", axis=1).values
y = df['cnt'].values
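
Note that `X` as built above still contains the `Unnamed: 0` column, which is the row index saved with the CSV. The notebook does not say whether that is intended. If it is not, one option is to read that column as the index so it never reaches the feature matrix. The sketch below assumes the first header field of the file is empty (which is what pandas displays as `Unnamed: 0`) and uses two rows copied from the `df.tail(2)` output instead of the real file; the names `csv_text`, `df_noidx`, `X_noidx` and `y_noidx` are illustrative.

```python
import io

import pandas as pd

# Two rows copied from the df.tail(2) output above. The empty first header
# field is an assumption about how the original file was written.
csv_text = """,season,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
729,1,12,0,0,0,1,0.255833,0.2317,0.483333,0.350754,1796
730,1,12,0,1,1,2,0.215833,0.223487,0.5775,0.154846,2729
"""

# index_col=0 turns the first column into the row index instead of a feature
df_noidx = pd.read_csv(io.StringIO(csv_text), index_col=0)

X_noidx = df_noidx.drop("cnt", axis=1).values
y_noidx = df_noidx["cnt"].values

print(df_noidx.columns.tolist())
print(X_noidx.shape)
```

Dropping the index changes the feature set, so the scores reported later in this notebook would not carry over unchanged.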

In [4]:
# Alternative: a single random split
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.model_selection import KFold

# 10 shuffled folds with a fixed seed for reproducibility
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# Nothing is fitted inside this loop, so once it finishes the variables
# below hold the split from the last fold only.
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# Containers for the true targets and the predictions
y_ = []
y_pred = []
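
The loop above only keeps the split from its last iteration, so the cells that follow fit and score the model on a single fold. If the `y_` and `y_pred` lists are meant to accumulate results across all folds (an assumption; the notebook does not say), the fit and predict steps would have to move inside the loop. The sketch below shows that pattern using the `sklearn.model_selection` API. To keep it runnable without the CSV file or `xgboost`, synthetic data and `LinearRegression` stand in for the bike data and `XGBRegressor`; all `*_demo` and `*_all` names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the bike data (for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = X_demo @ np.array([1.0, -2.0, 3.0]) + 0.1 * rng.randn(100)

kf_demo = KFold(n_splits=10, shuffle=True, random_state=0)
y_true_all, y_pred_all = [], []

for train_index, test_index in kf_demo.split(X_demo):
    # Fit a fresh model on each training fold
    model = LinearRegression().fit(X_demo[train_index], y_demo[train_index])
    # Store the out-of-fold predictions next to the true targets
    y_true_all.extend(y_demo[test_index])
    y_pred_all.extend(model.predict(X_demo[test_index]))

# Every sample is predicted exactly once, by a model that never saw it
print("out-of-fold MSE:", mean_squared_error(y_true_all, y_pred_all))
print("out-of-fold R^2:", r2_score(y_true_all, y_pred_all))
```

Scores computed this way use every row as test data once, which makes them less dependent on which fold happens to come last.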

In [6]:
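import xgboost as xgb

# 3000 shallow trees with a small learning rate, fitted on the training fold
gbm = xgb.XGBRegressor(max_depth=4,
                       n_estimators=3000,
                       learning_rate=0.05).fit(X_train, y_train)

# Predict rental counts for the held-out fold
prediction = gbm.predict(X_test)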

In [7]:
from sklearn.metrics import mean_squared_error

# Collect the true values and predictions, then compute the test MSE
y_.extend(y_test)
y_pred.extend(prediction)
mean_squared_error(y_, y_pred)

1680724.7873219044

In [8]:
gbm.score(X_test, y_test)

0.54419497467833078

In [9]:
gbm.score(X_train, y_train)

0.99991067786119536

The training $R^2$ is close to 1 while the test $R^2$ is about 0.54, so the model fits the training fold far better than the held-out fold, which is a sign of overfitting.

## Testing data

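To see how the regression model behaves, I randomly picked a chunk of the dataset and used it as input for new predictions. Because this chunk comes from the same file the model was fitted on, most of its rows were probably seen during training, so close agreement on them says little about how well the model generalizes.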

In [10]:
df_test = pd.read_csv('/Users/user/Jupyter/Courses/advanced_training-master/data/bike_day_raw_sample_test.csv')
df_test.tail(2)

Unnamed: 0,season,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
60,1,3,0,3,1,1,0.335,0.320071,0.449583,0.307833,2134
61,1,3,0,4,1,1,0.198333,0.200133,0.318333,0.225754,1685


In [11]:
X_new_test = df_test.drop("cnt", axis=1).values

In [12]:
prediction = gbm.predict(X_new_test)

In [13]:
df_test['predictions'] = prediction
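
With the predictions attached, an absolute-error column makes it easy to spot the rows where the model is far off. The sketch below uses a small hand-built frame with four `cnt`/`predictions` pairs copied from rows 0, 1, 2 and 9 of this sample's output rather than the full `df_test`; the frame name `demo` and the column name `abs_error` are illustrative.

```python
import pandas as pd

# Four cnt/predictions pairs copied from rows 0, 1, 2 and 9 of the sample
demo = pd.DataFrame({
    "cnt": [985, 801, 1349, 1321],
    "predictions": [1001.282532, 833.500610, 1352.128418, 1843.287842],
})

# Absolute difference between predicted and observed rentals
demo["abs_error"] = (demo["predictions"] - demo["cnt"]).abs()

# Largest errors first
print(demo.sort_values("abs_error", ascending=False))
```

On the real frame the same two lines would read `df_test["abs_error"] = (df_test["predictions"] - df_test["cnt"]).abs()` followed by a sort.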

In [14]:
df_test

Unnamed: 0,season,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt,predictions
0,1,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,985,1001.282532
1,1,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,801,833.500610
2,1,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,1349,1352.128418
3,1,1,0,2,1,1,0.200000,0.212122,0.590435,0.160296,1562,1549.748535
4,1,1,0,3,1,1,0.226957,0.229270,0.436957,0.186900,1600,1626.992432
5,1,1,0,4,1,1,0.204348,0.233209,0.518261,0.089565,1606,1603.551392
6,1,1,0,5,1,2,0.196522,0.208839,0.498696,0.168726,1510,1514.599487
7,1,1,0,6,0,2,0.165000,0.162254,0.535833,0.266804,959,971.346741
8,1,1,0,0,0,1,0.138333,0.116175,0.434167,0.361950,822,830.308472
9,1,1,0,1,1,1,0.150833,0.150888,0.482917,0.223267,1321,1843.287842
