# 0. Intro and aim of the study

The original bike sharing analysis was created by Claudia Franco, Francesca Manoni, Demetris Perdikos, Marcos Berges, Martin Hofbauer, Rahul Verma and Gerald Walravens (MBD O1) for the Python class in the second term of the Master Business Analytics & Big Data. The aim was to carry out a complete machine learning analysis using Python libraries such as numpy, pandas and sklearn.

The analysis has since been simplified and reworked to run on Dask, a tool for scaling analytics and reducing computing time. A linear regression on the Bike Sharing dataset reaches a final R2 score of about 0.63 on the test set.

This study aims to predict the total number of bike sharing users in Washington on an hourly basis for the last quarter of 2012. The quality of this time-series demand prediction is measured with the R2 metric, which indicates how much of the variance in the observed counts the final model explains. Along the way, MSE and MAE are measured to keep the prediction error from growing too large.
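
To make the three metrics concrete, here is a minimal plain-Python sketch of MSE, MAE and R2 on a handful of made-up values. The helper names (`mse`, `mae`, `r2`) and the toy numbers are assumptions for illustration only; the notebook itself relies on `dask_ml.metrics` and `sklearn.metrics`.

```python
# Illustrative only: hand-rolled regression metrics on made-up numbers.

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R2: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy hourly counts and predictions (not taken from the dataset).
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]

print("MSE:", mse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))
print("R2:", r2(y_true, y_pred))
```

MSE and MAE are expressed in (squared) units of the target, while R2 is unitless and equals 1 for a perfect fit.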

# 1. Loading libraries and activating Dask client

In [27]:
from distributed import Client

from dask import dataframe as dd
from dask_ml.preprocessing import Categorizer, DummyEncoder, StandardScaler
from dask_ml.linear_model import LinearRegression
from dask_ml.metrics import mean_squared_error

from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

In [2]:
client = Client()
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://127.0.0.1:53161 | Workers: 4 |
| Dashboard: http://127.0.0.1:8787/status | Cores: 4 |
| | Memory: 8.59 GB |


# 2. Data loading and transformations

The data is loaded directly from a GitHub gist.

In [3]:
hour_data = dd.read_csv(
    "https://gist.githubusercontent.com/geraldwal/b5a83f4c670abe0a662abce558e5d433/raw/bce4bbfc63355606e4503964e25798b5d2190b9b/hour%2520-%2520Python%2520Bike%2520Sharing",
    sep=",",
    parse_dates=["dteday"],
)

In [4]:
hour_data.head()

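| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |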


In [5]:
hour_data = hour_data.drop(["dteday", "casual", "registered"], axis=1)

In [6]:
# One new name per remaining column, in the original column order.
new_columns = [
    "instant", "season", "year", "month", "hour", "holiday", "weekday",
    "workingday", "weather", "temp", "atemp", "humidity", "windspeed", "count",
]
hour_data = hour_data.rename(columns=dict(zip(hour_data.columns, new_columns)))
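
Because `zip` stops at the shorter of its two inputs, a mismatch between the number of existing columns and the number of new names goes unnoticed when the mapping is built this way. A minimal sketch of a length-checked alternative follows; `build_rename_map` is a hypothetical helper, not part of the notebook or of Dask.

```python
# Illustrative only: a rename mapping that refuses mismatched lengths.

def build_rename_map(old_names, new_names):
    """Pair old and new column names, failing loudly if the counts differ."""
    old_names, new_names = list(old_names), list(new_names)
    if len(old_names) != len(new_names):
        raise ValueError(
            f"{len(old_names)} existing columns but {len(new_names)} new names"
        )
    return dict(zip(old_names, new_names))


mapping = build_rename_map(["yr", "mnth", "hr"], ["year", "month", "hour"])
print(mapping)

# One name too many is rejected instead of being silently dropped.
try:
    build_rename_map(["yr", "mnth"], ["year", "month", "day"])
    rejected = False
except ValueError as exc:
    rejected = True
    print("rejected:", exc)
```

The resulting dictionary can be passed to `rename(columns=...)` in the same way as the `dict(zip(...))` expression.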

In [7]:
hour_data.describe()

| npartitions=1 | instant | season | year | month | hour | holiday | weekday | workingday | weather | temp | atemp | humidity | windspeed | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | float64 | float64 | float64 | float64 | int64 |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Because Dask evaluates lazily, `describe()` only shows the structure of the result here; the statistics themselves are computed once `.compute()` is called on it.


We continue working with a copy of the original dataframe.

In [8]:
hour_data_copy = hour_data.copy()

## Scaling 

In [9]:
scaler = StandardScaler()
normalize = scaler.fit_transform(
    hour_data_copy.drop(
        [
            "season",
            "year",
            "month",
            "hour",
            "holiday",
            "weekday",
            "workingday",
            "weather",
            "count",
        ],
        axis=1,
    )
)
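
`StandardScaler` rescales each numeric column to zero mean and unit variance, i.e. z = (x - mean) / std. The dependency-free sketch below applies that formula to made-up values; the `standardize` helper is an illustration, and the use of the population standard deviation is an assumption of this sketch.

```python
# Illustrative only: z-score standardization of one column of made-up values.
from statistics import fmean, pstdev


def standardize(values):
    """Return (x - mean) / std for each value, using the population std."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (ddof = 0)
    return [(v - mu) / sigma for v in values]


temps = [0.2, 0.4, 0.6, 0.8]  # toy values, not taken from the dataset
scaled = standardize(temps)

print([round(z, 4) for z in scaled])
print("mean:", round(fmean(scaled), 6), "std:", round(pstdev(scaled), 6))
```

After scaling, columns measured on different ranges contribute on a comparable scale to the regression coefficients.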

In [10]:
categoricals = hour_data_copy.loc[
    :,
    [
        "season",
        "year",
        "month",
        "hour",
        "holiday",
        "weekday",
        "workingday",
        "weather",
        "count",
    ],
]

In [11]:
hour_data_copy = dd.concat([categoricals, normalize], axis=1)

We're assuming that the indexes of each dataframes are aligned. This assumption is not generally safe.


In [12]:
hour_data_copy.head()

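| | season | year | month | hour | holiday | weekday | workingday | weather | count | instant | temp | atemp | humidity | windspeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 16 | -1.731951 | -1.334648 | -1.093281 | 0.947372 | -1.553889 |
| 1 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 40 | -1.731752 | -1.438516 | -1.181732 | 0.895539 | -1.553889 |
| 2 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 32 | -1.731552 | -1.438516 | -1.181732 | 0.895539 | -1.553889 |
| 3 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 13 | -1.731353 | -1.334648 | -1.093281 | 0.63637 | -1.553889 |
| 4 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 1 | -1.731154 | -1.334648 | -1.093281 | 0.63637 | -1.553889 |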


## One-hot encoding

In [13]:
to_pipeline = [
    "season",
    "year",
    "month",
    "hour",
    "holiday",
    "weekday",
    "workingday",
    "weather",
]

In [14]:
pipeline = make_pipeline(Categorizer(columns=to_pipeline), DummyEncoder(columns=to_pipeline))
hour_data_copy_dummied = pipeline.fit_transform(hour_data_copy)
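
The `Categorizer` and `DummyEncoder` steps turn each categorical column into one 0/1 indicator column per category, named `<column>_<category>`. The plain-Python sketch below shows that idea on made-up rows. The `one_hot` helper is hypothetical and far simpler than the dask-ml transformers, and ordering categories by first appearance is an assumption of this sketch.

```python
# Illustrative only: one-hot encoding of a single column in a list of row dicts.

def one_hot(rows, column):
    """Replace `column` in every row with one 0/1 indicator per category."""
    # Unique categories, kept in order of first appearance.
    categories = list(dict.fromkeys(row[column] for row in rows))
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# Toy rows (not taken from the dataset).
rows = [
    {"season": 1, "count": 16},
    {"season": 2, "count": 40},
    {"season": 1, "count": 32},
]

encoded = one_hot(rows, "season")
for row in encoded:
    print(row)
```

Each row ends up with exactly one indicator set to 1 per encoded column, so a linear model can fit a separate coefficient for every category.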

In [15]:
hour_data_copy_dummied.head()

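| | count | instant | temp | atemp | humidity | windspeed | season_1 | season_2 | season_3 | season_4 | ... | weekday_2 | weekday_3 | weekday_4 | weekday_5 | workingday_0 | workingday_1 | weather_1 | weather_2 | weather_3 | weather_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16 | -1.731951 | -1.334648 | -1.093281 | 0.947372 | -1.553889 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 40 | -1.731752 | -1.438516 | -1.181732 | 0.895539 | -1.553889 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 32 | -1.731552 | -1.438516 | -1.181732 | 0.895539 | -1.553889 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 13 | -1.731353 | -1.334648 | -1.093281 | 0.63637 | -1.553889 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 1 | -1.731154 | -1.334648 | -1.093281 | 0.63637 | -1.553889 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |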


## Renaming columns

In [16]:
hour_data_ready = hour_data_copy_dummied.rename(
    columns={
        "holiday_0": "no_holiday",
        "holiday_1": "yes_holiday",
        "month_1": "jan",
        "month_2": "feb",
        "month_3": "mar",
        "month_4": "apr",
        "month_5": "may",
        "month_6": "jun",
        "month_7": "jul",
        "month_8": "aug",
        "month_9": "sep",
        "month_10": "oct",
        "month_11": "nov",
        "month_12": "dec",
        "weekday_0": "sun",
        "weekday_1": "mon",
        "weekday_2": "tue",
        "weekday_3": "wed",
        "weekday_4": "thu",
        "weekday_5": "fri",
        "weekday_6": "sat",
        "season_1": "winter",
        "season_2": "spring",
        "season_3": "summer",
        "season_4": "autumn",
    }
)

In [17]:
hour_data_ready.head()

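| | count | instant | temp | atemp | humidity | windspeed | winter | spring | summer | autumn | ... | tue | wed | thu | fri | workingday_0 | workingday_1 | weather_1 | weather_2 | weather_3 | weather_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16 | -1.731951 | -1.334648 | -1.093281 | 0.947372 | -1.553889 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 40 | -1.731752 | -1.438516 | -1.181732 | 0.895539 | -1.553889 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 32 | -1.731552 | -1.438516 | -1.181732 | 0.895539 | -1.553889 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 13 | -1.731353 | -1.334648 | -1.093281 | 0.63637 | -1.553889 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 1 | -1.731154 | -1.334648 | -1.093281 | 0.63637 | -1.553889 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |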


In [18]:
list(hour_data_ready.columns.values)

['count',
 'instant',
 'temp',
 'atemp',
 'humidity',
 'windspeed',
 'winter',
 'spring',
 'summer',
 'autumn',
 'year_0',
 'year_1',
 'jan',
 'feb',
 'mar',
 'apr',
 'may',
 'jun',
 'jul',
 'aug',
 'sep',
 'oct',
 'nov',
 'dec',
 'hour_0',
 'hour_1',
 'hour_2',
 'hour_3',
 'hour_4',
 'hour_5',
 'hour_6',
 'hour_7',
 'hour_8',
 'hour_9',
 'hour_10',
 'hour_11',
 'hour_12',
 'hour_13',
 'hour_14',
 'hour_15',
 'hour_16',
 'hour_17',
 'hour_18',
 'hour_19',
 'hour_20',
 'hour_21',
 'hour_22',
 'hour_23',
 'no_holiday',
 'yes_holiday',
 'sat',
 'sun',
 'mon',
 'tue',
 'wed',
 'thu',
 'fri',
 'workingday_0',
 'workingday_1',
 'weather_1',
 'weather_2',
 'weather_3',
 'weather_4']

# 3. Modeling

In [19]:
def train_test_split(data, to_predict):
    """Chronological split: first 87.5% of rows for training, the rest for testing."""
    X = data.loc[:, data.columns != to_predict]
    y = data.loc[:, to_predict]
    n_rows = len(data)
    training_size = int(n_rows * 0.875)
    # .loc slices by label and includes both endpoints, hence "training_size - 1".
    X_train, X_test, y_train, y_test = (
        X.loc[0 : training_size - 1],
        X.loc[training_size:n_rows],
        y.loc[0 : training_size - 1],
        y.loc[training_size:n_rows],
    )
    return X_train, X_test, y_train, y_test
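
The split is chronological rather than random: the earliest 87.5% of the rows are used for training and the most recent rows are held out for testing, which suits a time series. A small sketch of the same idea with plain lists follows, where `chronological_split` and the toy data are assumptions for illustration only.

```python
# Illustrative only: time-ordered train/test split on a toy sequence.

def chronological_split(rows, train_fraction=0.875):
    """Keep the first `train_fraction` of rows for training, the rest for testing."""
    training_size = int(len(rows) * train_fraction)
    return rows[:training_size], rows[training_size:]


hours = list(range(16))  # stand-in for 16 consecutive hourly records
train, test = chronological_split(hours)

print(len(train), len(test))
print("held-out rows:", test)
```

Python list slices exclude the end position, whereas label-based `.loc` slices include it, which is why the dataframe version subtracts one from the training size.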

In [20]:
x_train_predict_count, x_test_predict_count, y_train_predict_count, y_test_predict_count = train_test_split(
    hour_data_ready, "count"
)

In [22]:
x_train_count_arr = x_train_predict_count.values
x_test_count_arr = x_test_predict_count.values
y_train_count_arr = y_train_predict_count.values
y_test_count_arr = y_test_predict_count.values

In [32]:
def score_LR(X_train, X_test, y_train, y_test):
    """Fit a linear regression, print its parameters and test metrics, return the predictions."""
    lm = LinearRegression()
    lm.fit(X_train, y_train)
    prediction = lm.predict(X_test)
    print("Intercept:", lm.intercept_)
    print("Coefficients:", lm.coef_)
    print("Mean squared error (MSE): {:.4f}".format(mean_squared_error(y_test, prediction)))
    # r2_score comes from scikit-learn, so the Dask arrays are materialized first.
    print(
        "Variance score (R2): {:.4f}".format(
            r2_score(y_test.compute(), prediction.compute())
        )
    )
    return prediction

In [33]:
final_score = score_LR(
    x_train_count_arr,
    x_test_count_arr,
    y_train_count_arr, 
    y_test_count_arr
)

Intercept: -46.421127400617294
Coefficients: [ 148.66141379   22.60485303   22.6770613    15.96465478  -15.94294589
   -3.68236923  -17.52717981   11.43352744   -2.06408918   23.97717062
  -22.56348446   26.43423051   -8.27196824   -5.96714878    7.66168076
    4.2477644    18.66064904    6.22753288   -9.64980103    6.6001612
   27.73752728    5.40501316  -12.17358007  -14.6738021  -116.70899821
 -134.09366865 -142.24206469 -153.28042062 -156.49856594 -140.35480876
  -83.26472839   46.37095241  178.73004416   38.85054415  -12.00259646
   12.11060736   49.08520415   45.07535983   29.90706797   37.58642202
   97.62746904  250.83693165  223.4889404   119.59354425   40.69506859
   -7.80853413  -44.85723373  -84.49724908   15.07434171   -3.16770353
    8.74035944   -3.51544206   -1.24139372    1.63980441    3.18344622
    3.50487695    6.81555993    0.85460768    3.27248246   15.60130956
    6.11664937  -45.33401064]
Mean squared error (MSE): 14814.5672
Variance score (R2): 0.6348
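
Since MSE is expressed in squared units, its square root (RMSE) is easier to relate to hourly user counts. The short calculation below derives it from the MSE printed above; it is offered as a reading aid and was not part of the original notebook.

```python
# Illustrative only: convert the reported MSE into an RMSE.
import math

reported_mse = 14814.5672  # MSE printed by score_LR
rmse = math.sqrt(reported_mse)

print(f"RMSE: {rmse:.1f} users per hour")
```

Read together with the R2 of 0.6348, this suggests the model explains roughly 63% of the variance in the test period while its typical error is still on the order of 120 users per hour.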
