# Regression Benchmark 

### Problem Statement

Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another.

Unlike other transport services such as bus or subway, these systems explicitly record the duration of travel and the departure and arrival positions. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city.

To make the process seamless and ensure that enough bikes are available, we need to predict the count of bikes required in the coming month based on past data.

The dataset contains the following variables:

- date:       Date in "yyyy-mm-dd" format
- season:     Four categories: 1 = spring, 2 = summer, 3 = fall, 4 = winter
- month:      Extracted from the date variable
- hour:       Hour of the day
- holiday:    Whether the day is a holiday or not (1/0)
- workingday: Whether the day is neither a weekend nor a holiday (1/0)
- weather:    Four categories of weather
            * 1 = Clear, Few clouds, Partly cloudy
            * 2 = Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
            * 3 = Light Snow and Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
            * 4 = Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
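- temp:       hourly temperature in Celsius, apparently stored as a normalized value (the sample rows below show values between 0 and 1)
- atemp:      "feels like" temperature in Celsius, apparently normalized in the same way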
- humidity:   relative humidity
- windspeed:  wind speed


- registered: number of registered users
- casual:     number of non-registered users
- count:      number of total rentals (registered + casual)
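Because `count` is defined as `registered + casual`, those two columns would leak the target if they were used as predictors. A minimal sketch of a consistency check follows; the rows are made up for illustration, and on the real file the same expression would be applied to `df`.

```python
import pandas as pd

# Toy rows using the column names from the list above (values are made up).
toy = pd.DataFrame({
    "casual":     [3, 8, 5],
    "registered": [13, 32, 27],
    "count":      [16, 40, 32],
})

# The target should equal the sum of its two components on every row.
is_consistent = (toy["casual"] + toy["registered"] == toy["count"]).all()
print(is_consistent)
```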

### Importing Libraries

In [1]:
#importing libraries 

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

### Importing Dataset

In [2]:
df=pd.read_csv('hour.csv')
df.shape

(17379, 16)

In [3]:
df.head().T

,0,1,2,3,4
instant,1,2,3,4,5
date,2011-01-01,2011-01-01,2011-01-01,2011-01-01,2011-01-01
season,1,1,1,1,1
month,1,1,1,1,1
hour,0,1,2,3,4
holiday,0,0,0,0,0
weekday,6,6,6,6,6
workingday,0,0,0,0,0
weather,1,1,1,1,1
temp,0.24,0.22,0.22,0.24,0.24


### Shuffling and Creating Train and Test Set

Task 1:

- Shuffle the dataset
- Create train and test sets

In [4]:
## Shuffling the Dataset
from sklearn.utils import shuffle
data = shuffle(df, random_state=33)

In [5]:
#creating 4 divisions
div = int(data.shape[0]/4)

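# 3 parts to train set and 1 part to test set
# .copy() makes both frames independent of data, so prediction columns
# can be added later without a SettingWithCopyWarning
train = data.iloc[:3*div+1].copy()
test = data.iloc[3*div+1:].copy()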

In [6]:
train.shape, test.shape, data.shape

((13033, 16), (4346, 16), (17379, 16))
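The manual shuffle-and-slice above can also be done in one call with scikit-learn's `train_test_split`. The sketch below uses a made-up 20-row frame (`toy`, `toy_train` and `toy_test` are illustrative names, not part of the notebook). It will not reproduce the exact rows of the split above, and the set sizes may differ by one row, because the shuffling and rounding are done differently.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the bike-sharing data (values are made up).
toy = pd.DataFrame({"hour": range(20), "count": range(100, 120)})

# shuffle=True is the default, so rows are shuffled before splitting.
# test_size=0.25 keeps about three quarters of the rows for training.
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=33)

print(toy_train.shape, toy_test.shape)
```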

In [7]:
train.head()

,instant,date,season,month,hour,holiday,weekday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
11215,11216,2012-04-17,2,4,12,0,2,1,1,0.6,0.6212,0.33,0.3582,65,179,244
12940,12941,2012-06-28,3,6,9,0,4,1,1,0.7,0.6515,0.54,0.1642,33,318,351
11502,11503,2012-04-29,2,4,11,0,0,0,1,0.46,0.4545,0.51,0.0,128,283,411
13018,13019,2012-07-01,3,7,15,0,0,0,1,0.9,0.8182,0.37,0.2836,101,244,345
117,118,2011-01-06,1,1,2,0,4,1,1,0.16,0.2273,0.64,0.0,0,2,2


In [8]:
test.head().T

,4697,3004,12137,8179,10761
instant,4698,3005,12138,8180,10762
date,2011-07-19,2011-05-10,2012-05-25,2011-12-12,2012-03-29
season,3,2,2,4,2
month,7,5,5,12,3
hour,15,2,22,11,12
holiday,0,0,0,0,0
weekday,2,2,5,1,4
workingday,1,1,1,1,1
weather,1,1,1,2,1
temp,0.88,0.44,0.66,0.26,0.5


## Simple Mean (mean of count)

Task 2:

- Calculate the mean of target variable

In [9]:
# use the training-set mean of count as the prediction for every test row
test['simple_mean'] = train['count'].mean()

In [10]:
test

,instant,date,season,month,hour,holiday,weekday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,simple_mean
4697,4698,2011-07-19,3,7,15,0,2,1,1,0.88,0.8182,0.44,0.1940,48,110,158,189.342285
3004,3005,2011-05-10,2,5,2,0,2,1,1,0.44,0.4394,0.58,0.1343,1,4,5,189.342285
12137,12138,2012-05-25,2,5,22,0,5,1,1,0.66,0.6212,0.74,0.1045,65,147,212,189.342285
8179,8180,2011-12-12,4,12,11,0,1,1,2,0.26,0.2879,0.60,0.0896,15,100,115,189.342285
10761,10762,2012-03-29,2,3,12,0,4,1,1,0.50,0.4848,0.42,0.4925,64,228,292,189.342285
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10435,10436,2012-03-15,1,3,22,0,4,1,1,0.52,0.5000,0.68,0.1343,32,137,169,189.342285
57,58,2011-01-03,1,1,12,0,1,1,1,0.22,0.2121,0.35,0.2985,13,48,61,189.342285
578,579,2011-01-26,1,1,9,0,3,1,3,0.22,0.2121,0.87,0.2985,3,55,58,189.342285
5848,5849,2011-09-06,3,9,4,0,2,1,2,0.54,0.5152,0.94,0.2985,1,3,4,189.342285


Task 3:

- Import mean absolute error from sklearn
- Calculate the mean absolute error

In [11]:
#calculating mean absolute error

from sklearn.metrics import mean_absolute_error as MAE
simple_mean_error = MAE(test['count'], test['simple_mean'])
simple_mean_error

142.0241113490485
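For reference, the mean absolute error is the average of the absolute differences between actual and predicted values, MAE = (1/n) Σ |y_i − ŷ_i|. The sketch below spells this out with NumPy on made-up numbers; `mean_absolute_error_manual`, `train_counts` and `test_counts` are illustrative names, not part of the notebook.

```python
import numpy as np

def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).mean()

# Made-up counts: the baseline predicts the training mean for every test row.
train_counts = [10, 20, 60]            # training mean = 30
test_counts = [20, 50]
baseline = np.full(len(test_counts), np.mean(train_counts))

err = mean_absolute_error_manual(test_counts, baseline)
print(err)  # (|20 - 30| + |50 - 30|) / 2 = 15.0
```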

## Mean Count with respect to Weekday

Task 4:
- Check the average count for each weekday
- Make predictions using the weekday-wise average

In [17]:
train.weekday.unique()

array([2, 4, 0, 5, 3, 6, 1], dtype=int64)

In [32]:
# calculating mean count based on day of week
# Hint: use pivot table
# (the string 'mean' avoids the FutureWarning that newer pandas versions raise for np.mean)
weekday_mean = pd.pivot_table(train, values='count', index=['weekday'],
                              aggfunc='mean')
weekday_mean

weekday,count
0,174.494295
1,186.847598
2,189.835757
3,191.622293
4,194.846818
5,199.472747
6,188.762657


In [38]:
# initializing new column to zero (float, since it will hold means)
test['weekday_mean'] = 0.0

# For every unique entry in weekday
for day in train['weekday'].unique():

    # Assign the mean value corresponding to unique entry
    # (.loc avoids chained assignment, which may fail to write to test)
    test.loc[test['weekday'] == day, 'weekday_mean'] = train.loc[train['weekday'] == day, 'count'].mean()

In [41]:
#calculating mean absolute error
weekday_mean_error = MAE(test['count'] , test['weekday_mean'] )
weekday_mean_error

142.12704251800122
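The per-weekday loop above can also be written without an explicit loop, using `groupby` to compute the group means and `Series.map` to look them up for each test row. This is a minimal sketch on made-up frames (`toy_train` and `toy_test` are illustrative names), not a replacement for the notebook's own cells.

```python
import pandas as pd

# Made-up train and test frames with the two columns the benchmark needs.
toy_train = pd.DataFrame({"weekday": [0, 0, 1, 1, 2],
                          "count":   [10, 30, 100, 200, 50]})
toy_test = pd.DataFrame({"weekday": [0, 1, 2, 1],
                         "count":   [25, 120, 40, 180]})

# One mean per weekday, computed on the training rows only.
weekday_means = toy_train.groupby("weekday")["count"].mean()

# Look up the training mean for each test row's weekday.
toy_test["weekday_mean"] = toy_test["weekday"].map(weekday_means)

print(toy_test)
```

A weekday that never appears in the training rows would map to `NaN`, which makes a missing group easy to spot before computing the error.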

## Mean Count with respect to Month

Task 5:

- Print month-wise average count using pivot table
- Use month-wise average count as predictions
- Calculate the Error

In [46]:
# calculating mean count based on month
# use pivot table
month_wise_count = pd.pivot_table(train, values="count", index="month", aggfunc="mean")
month_wise_count

month,count
1,96.75766
2,113.195939
3,150.336331
4,184.640216
5,220.746903
6,244.203892
7,227.442202
8,239.842767
9,241.808999
10,225.172669


In [42]:
# initializing new column to zero (float, since it will hold means)
test['month_wise_mean'] = 0.0

# For every unique entry in month variable
for i in train["month"].unique():

    # Assign the mean value corresponding to unique entry
    # (.loc avoids chained assignment, which may fail to write to test)
    test.loc[test['month'] == i, 'month_wise_mean'] = train.loc[train['month'] == i, 'count'].mean()

In [43]:
#calculating mean absolute error (actual values first, predictions second, as in the other cells)
month_wise_mean_error = MAE(test["count"], test["month_wise_mean"])
month_wise_mean_error

134.42417095178655

## Mean Count with respect to both Month and Workingday

In [None]:
combo = pd.pivot_table(train, values='count', index=['month', 'workingday'], aggfunc='mean')
combo

Task 6:
- Predict average count based on month and workingday variables

In [63]:
# Initiating new column (float, since it will hold means)
test['combo_mean'] = 0.0

# For every unique value in month
for i in train["month"].unique():
    # For every unique value in workingday
    for j in train["workingday"].unique():

        # Calculate and assign the mean to the new column for rows matching both values
        # (.loc avoids chained assignment, which may fail to write to test)
        test.loc[(test['month'] == i) & (test['workingday'] == j), 'combo_mean'] = train.loc[(train['month'] == i) & (train['workingday'] == j), 'count'].mean()

In [64]:
#calculating mean absolute error
combo_mean_error = MAE(test['count'] , test['combo_mean'] )
combo_mean_error

134.09622679979694
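The two-key benchmark can also be expressed as a `groupby` followed by a left `merge`. The sketch below uses made-up frames and adds one assumption that is not in the notebook: a (month, workingday) combination that never occurs in the training rows falls back to the overall training mean. The names `toy_train`, `toy_test`, `group_means` and `scored` are illustrative.

```python
import pandas as pd

# Made-up frames; the combination (month=2, workingday=1) is absent from toy_train.
toy_train = pd.DataFrame({
    "month":      [1, 1, 1, 2, 2],
    "workingday": [0, 1, 1, 0, 0],
    "count":      [10, 30, 50, 80, 120],
})
toy_test = pd.DataFrame({
    "month":      [1, 1, 2, 2],
    "workingday": [0, 1, 0, 1],
    "count":      [12, 45, 90, 70],
})

keys = ["month", "workingday"]

# One mean per (month, workingday) pair, computed on the training rows only.
group_means = (
    toy_train.groupby(keys)["count"].mean().rename("combo_mean").reset_index()
)

# A left merge keeps every test row; unseen combinations get NaN.
scored = toy_test.merge(group_means, on=keys, how="left")

# Assumed fallback: use the overall training mean where no group mean exists.
scored["combo_mean"] = scored["combo_mean"].fillna(toy_train["count"].mean())

print(scored)
```

The same pattern extends to any list of key columns by changing `keys`.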