<h1 align="center"> Supervised Methods: DC Bike Rental </h1>

<h3 align="center"> IST 5520: Data Science and Machine Learning with Python </h3>
<h3 align="center"> Estelle Lu </h3>
<h3 align="center"> Last Update: 11/13/2022 </h3>

## 1. Data

### 1.1. Import Data

The data file is "DC_Bike_Rentals.csv".

In [1]:
import pandas as pd

dat = pd.read_csv("./DC_Bike_Rentals.csv")
dat.head()

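| | hour | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 16 |
| 1 | 1 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 40 |
| 2 | 2 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 32 |
| 3 | 3 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 13 |
| 4 | 4 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 1 |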


In [2]:
dat.sample(10).transpose()

| | 5322 | 7470 | 6061 | 4207 | 10404 | 9556 | 5826 | 5122 | 6228 | 2256 |
|---|---|---|---|---|---|---|---|---|---|---|
| hour | 20.0 | 15.0 | 19.0 | 8.0 | 22.0 | 13.0 | 23.0 | 12.0 | 18.0 | 22.0 |
| season | 4.0 | 2.0 | 1.0 | 4.0 | 4.0 | 4.0 | 1.0 | 4.0 | 1.0 | 2.0 |
| holiday | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| workingday | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| weather | 1.0 | 1.0 | 3.0 | 1.0 | 2.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 |
| temp | 21.32 | 22.14 | 9.84 | 18.86 | 14.76 | 24.6 | 18.86 | 18.04 | 16.4 | 30.34 |
| atemp | 25.0 | 25.76 | 12.88 | 22.725 | 16.665 | 27.275 | 22.725 | 21.97 | 20.455 | 34.85 |
| humidity | 59.0 | 37.0 | 87.0 | 82.0 | 66.0 | 88.0 | 72.0 | 100.0 | 40.0 | 66.0 |
| windspeed | 22.0028 | 26.0027 | 7.0015 | 0.0 | 16.9979 | 8.9981 | 26.0027 | 8.9981 | 15.0013 | 8.9981 |
| count | 221.0 | 271.0 | 161.0 | 417.0 | 74.0 | 154.0 | 61.0 | 33.0 | 468.0 | 129.0 |


In [3]:
dat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   hour        10886 non-null  int64  
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   count       10886 non-null  int64  
dtypes: float64(3), int64(7)
memory usage: 850.6 KB


In [4]:
dat.describe()

| | hour | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | count |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 |
| mean | 11.541613 | 2.506614 | 0.028569 | 0.680875 | 1.418427 | 20.23086 | 23.655084 | 61.88646 | 12.799395 | 191.574132 |
| std | 6.915838 | 1.116174 | 0.166599 | 0.466159 | 0.633839 | 7.79159 | 8.474601 | 19.245033 | 8.164537 | 181.144454 |
| min | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.82 | 0.76 | 0.0 | 0.0 | 1.0 |
| 25% | 6.0 | 2.0 | 0.0 | 0.0 | 1.0 | 13.94 | 16.665 | 47.0 | 7.0015 | 42.0 |
| 50% | 12.0 | 3.0 | 0.0 | 1.0 | 1.0 | 20.5 | 24.24 | 62.0 | 12.998 | 145.0 |
| 75% | 18.0 | 4.0 | 0.0 | 1.0 | 2.0 | 26.24 | 31.06 | 77.0 | 16.9979 | 284.0 |
| max | 23.0 | 4.0 | 1.0 | 1.0 | 4.0 | 41.0 | 45.455 | 100.0 | 56.9969 | 977.0 |


In [5]:
# Count of classes

dat['season'].value_counts()

4    2734
2    2733
3    2733
1    2686
Name: season, dtype: int64

In [6]:
dat['holiday'].value_counts()

0    10575
1      311
Name: holiday, dtype: int64

In [7]:
dat['weather'].value_counts()  


1    7192
2    2834
3     859
4       1
Name: weather, dtype: int64

In [8]:
dat['workingday'].value_counts()

1    7412
0    3474
Name: workingday, dtype: int64

In [9]:
dat['count'].value_counts()

5      169
4      149
3      144
6      135
2      132
      ... 
801      1
629      1
825      1
589      1
636      1
Name: count, Length: 822, dtype: int64

### 1.2. Preprocess Data

The categorical columns `season` and `weather` are stored as integer codes, so most algorithms would treat them as ordered numeric values if used directly. Let's use one-hot encoding for these categorical variables.

In [10]:
X = pd.get_dummies(data=dat.drop('count', axis=1),
                   columns=['season', 'weather'],
                   prefix=['season', 'weather'])
# holiday and workingday are already 0/1 indicators, so they are left as-is
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   hour        10886 non-null  int64  
 1   holiday     10886 non-null  int64  
 2   workingday  10886 non-null  int64  
 3   temp        10886 non-null  float64
 4   atemp       10886 non-null  float64
 5   humidity    10886 non-null  int64  
 6   windspeed   10886 non-null  float64
 7   season_1    10886 non-null  uint8  
 8   season_2    10886 non-null  uint8  
 9   season_3    10886 non-null  uint8  
 10  season_4    10886 non-null  uint8  
 11  weather_1   10886 non-null  uint8  
 12  weather_2   10886 non-null  uint8  
 13  weather_3   10886 non-null  uint8  
 14  weather_4   10886 non-null  uint8  
dtypes: float64(3), int64(4), uint8(8)
memory usage: 680.5 KB
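
To make the encoding step concrete, here is a minimal sketch on a small made-up frame (the `toy` data is hypothetical, not taken from the bike rental file). It also shows the optional `drop_first=True` argument, which this notebook does not use; it omits one level per variable because that level is implied when all the other indicators are 0. Note that recent pandas versions may report the indicator columns as `bool` rather than `uint8`.

```python
import pandas as pd

# Hypothetical toy data with an integer-coded categorical column
toy = pd.DataFrame({
    "season": [1, 2, 3, 4, 2],
    "temp": [9.8, 20.5, 30.3, 14.8, 22.1],
})

# One indicator column per category level (what the notebook does)
full = pd.get_dummies(toy, columns=["season"], prefix=["season"])

# Optional variant: drop the first level to avoid a redundant column
reduced = pd.get_dummies(toy, columns=["season"], prefix=["season"],
                         drop_first=True)

print(list(full.columns))     # ['temp', 'season_1', 'season_2', 'season_3', 'season_4']
print(list(reduced.columns))  # ['temp', 'season_2', 'season_3', 'season_4']
```

For distance-based methods such as k-NN, keeping all levels is usually harmless; dropping one level matters more for linear models, where redundant columns cause collinearity.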


In [11]:
X.describe().transpose()


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| hour | 10886.0 | 11.541613 | 6.915838 | 0.0 | 6.0 | 12.0 | 18.0 | 23.0 |
| holiday | 10886.0 | 0.028569 | 0.166599 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| workingday | 10886.0 | 0.680875 | 0.466159 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| temp | 10886.0 | 20.23086 | 7.79159 | 0.82 | 13.94 | 20.5 | 26.24 | 41.0 |
| atemp | 10886.0 | 23.655084 | 8.474601 | 0.76 | 16.665 | 24.24 | 31.06 | 45.455 |
| humidity | 10886.0 | 61.88646 | 19.245033 | 0.0 | 47.0 | 62.0 | 77.0 | 100.0 |
| windspeed | 10886.0 | 12.799395 | 8.164537 | 0.0 | 7.0015 | 12.998 | 16.9979 | 56.9969 |
| season_1 | 10886.0 | 0.246739 | 0.431133 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| season_2 | 10886.0 | 0.251056 | 0.433641 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| season_3 | 10886.0 | 0.251056 | 0.433641 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


In [12]:
y = dat['count'].copy()
y

0         16
1         40
2         32
3         13
4          1
        ... 
10881    336
10882    241
10883    168
10884    129
10885     88
Name: count, Length: 10886, dtype: int64

### 1.3. Data Partition

We use the `sklearn.model_selection.train_test_split()` function to split the dataset into training and test sets. Its major parameters are:

- **test_size** : float, int, or None (default is None). If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size. If train size is also None, test size is set to 0.25.

- **train_size** : float, int, or None (default is None). If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

- **random_state** : Pseudo-random number generator state used for random sampling.

For details of this function, refer to http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [13]:
from sklearn.model_selection import train_test_split

# Simple 60/40 train/test split
train_y, test_y, train_X, test_X = train_test_split(
    y, X,
    test_size=0.4,
    # random_state=123,  # uncomment to make the split reproducible
)

In [15]:
train_X.shape

(6531, 15)

In [16]:
train_y.shape

(6531,)

In [17]:
test_y.shape

(4355,)

In [18]:
test_X.shape

(4355, 15)
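
The shapes above follow from `test_size=0.4`: scikit-learn rounds the test share up, so 10886 × 0.4 = 4354.4 becomes 4355 test rows and the remaining 6531 rows go to training. The sketch below, on small made-up arrays (`X_toy` and `y_toy` are hypothetical), illustrates both the rounding and the effect of fixing `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 observations, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two splits with the same seed
a_train, a_test, _, _ = train_test_split(X_toy, y_toy,
                                         test_size=0.4, random_state=123)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy,
                                         test_size=0.4, random_state=123)

print(a_train.shape, a_test.shape)     # (6, 2) (4, 2)
print(np.array_equal(a_test, b_test))  # True: same seed, same split
```

Without a fixed `random_state`, each run draws a different split, so the metrics reported later in this notebook will vary slightly from run to run.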

### 1.4. Normalize Data

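Because k-NN computes distances between observations, variables measured on different scales should be normalized so that no single variable dominates the distance. Here we rescale every feature to the range [0, 1] with min-max scaling, fitting the scaler on the training set only and then applying the same transformation to both sets.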


In [19]:
from sklearn import preprocessing

# Create a scaler to do the transformation
scaler = preprocessing.MinMaxScaler().fit(train_X)

In [20]:
# Transform training X
train_X_scale = scaler.transform(train_X)
train_X_scale = pd.DataFrame(train_X_scale)
train_X_scale.columns = train_X.columns

train_X_scale.describe().transpose()  # transposed for display only

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| hour | 6531.0 | 0.503339 | 0.300158 | 0.0 | 0.26087 | 0.521739 | 0.782609 | 1.0 |
| holiday | 6531.0 | 0.029858 | 0.170207 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| workingday | 6531.0 | 0.678303 | 0.467163 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| temp | 6531.0 | 0.482303 | 0.193847 | 0.0 | 0.326531 | 0.489796 | 0.632653 | 1.0 |
| atemp | 6531.0 | 0.520095 | 0.192994 | 0.0 | 0.362012 | 0.534426 | 0.689655 | 1.0 |
| humidity | 6531.0 | 0.617434 | 0.1916 | 0.0 | 0.47 | 0.61 | 0.77 | 1.0 |
| windspeed | 6531.0 | 0.225702 | 0.143805 | 0.0 | 0.12284 | 0.228047 | 0.298225 | 1.0 |
| season_1 | 6531.0 | 0.244373 | 0.429748 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| season_2 | 6531.0 | 0.250038 | 0.433068 | 0.0 | 0.0 | 0.0 | 0.5 | 1.0 |
| season_3 | 6531.0 | 0.251416 | 0.433861 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


In [21]:
# Transform test X
test_X_scale = scaler.transform(test_X)
test_X_scale = pd.DataFrame(test_X_scale)
test_X_scale.columns = test_X.columns

test_X_scale.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| hour | 4355.0 | 0.499516 | 0.301503 | 0.0 | 0.26087 | 0.478261 | 0.73913 | 1.0 |
| holiday | 4355.0 | 0.026636 | 0.161036 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| workingday | 4355.0 | 0.68473 | 0.464677 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| temp | 4355.0 | 0.48429 | 0.194038 | 0.0 | 0.326531 | 0.489796 | 0.632653 | 0.959184 |
| atemp | 4355.0 | 0.522639 | 0.192745 | 0.017184 | 0.362012 | 0.534426 | 0.689655 | 1.017298 |
| humidity | 4355.0 | 0.62101 | 0.193721 | 0.0 | 0.47 | 0.62 | 0.78 | 1.0 |
| windspeed | 4355.0 | 0.222855 | 0.142401 | 0.0 | 0.12284 | 0.193018 | 0.298225 | 0.912308 |
| season_1 | 4355.0 | 0.250287 | 0.433228 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| season_2 | 4355.0 | 0.252583 | 0.434544 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| season_3 | 4355.0 | 0.250517 | 0.43336 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
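
Notice that the scaled test set has an `atemp` maximum slightly above 1 (1.017298). This is expected: the scaler uses the minimum and maximum of the training data, so a test value beyond the training range maps outside [0, 1]. The sketch below reproduces the effect on a small made-up column (the `train` and `test` arrays are hypothetical) and checks the scaler against the formula (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
train = np.array([[0.0], [10.0], [40.0]])
test = np.array([[20.0], [45.0]])   # 45 exceeds the training maximum of 40

# Fit on the training data only, as in the notebook
scaler = MinMaxScaler().fit(train)
scaled_test = scaler.transform(test)

# Manual min-max formula using the training minimum and maximum
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(scaled_test.ravel())               # values 0.5 and 1.125
print(np.allclose(manual, scaled_test))  # True
```

Fitting the scaler on the training data alone keeps information about the test set out of the preprocessing step, at the cost of occasional values slightly outside [0, 1].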



## 2. k-Nearest Neighbors (k-NN)

### 2.1. Train a k-NN Classifier

In [22]:
from sklearn import neighbors

# KNN: K=5, default measure of distance (euclidean)
knn5 = neighbors.KNeighborsClassifier(n_neighbors=5, 
                                      weights='uniform', 
                                      algorithm='auto')

In [23]:
knn5.fit(train_X_scale, train_y)

#print(train_X)

KNeighborsClassifier()

Now, let's use the test dataset to assess the performance of the trained model.

In [24]:
pred_y_knn5 = knn5.predict(test_X_scale)


In [25]:
from sklearn import metrics

# Print confusion matrix
cm = metrics.confusion_matrix(test_y, pred_y_knn5)
print(cm)


[[10  9  6 ...  0  0  0]
 [12 11  8 ...  0  0  0]
 [16 12 13 ...  0  0  0]
 ...
 [ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]]


In [26]:
# Calculate classification accuracy
metrics.accuracy_score(test_y, pred_y_knn5)

0.019058553386911595
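
The accuracy is very low because `count` takes hundreds of distinct values, and the classifier treats each one as a separate class; a prediction only counts as correct if it matches the count exactly. As a point of comparison (not the method used in this notebook), the following sketch contrasts `KNeighborsClassifier` with `KNeighborsRegressor`, which averages the neighbours' target values instead of taking a majority vote. The data here is synthetic and all variable names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Synthetic count-like target driven by the first feature (assumption)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 1, size=(500, 2))
y_syn = np.round(200 * X_syn[:, 0] + rng.normal(0, 5, size=500)).astype(int)

X_tr, X_te = X_syn[:400], X_syn[400:]
y_tr, y_te = y_syn[:400], y_syn[400:]

clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
reg = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)

pred_clf = clf.predict(X_te)  # always one of the values seen in training
pred_reg = reg.predict(X_te)  # mean of the 5 nearest neighbours' targets

print("classifier MAE:", np.mean(np.abs(y_te - pred_clf)))
print("regressor  MAE:", np.mean(np.abs(y_te - pred_reg)))
```

Which variant performs better on the bike rental data would need to be checked on that data; the sketch only illustrates how the two estimators form their predictions.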

### 2.2. Tune the k-NN Classifier

The choice of the parameter value k affects the performance of the k-NN algorithm.

In the following, we tune the k parameter based on accuracy.

In [33]:
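for k in range(1, 21):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k,
                                         weights='uniform',
                                         algorithm='auto')
    knn.fit(train_X_scale, train_y)
    pred_y = knn.predict(test_X_scale)
    # Report classification accuracy for this k
    print("Accuracy is : ", metrics.accuracy_score(test_y, pred_y), " for k =", k)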


If we use overall accuracy as the measure, the optimal hyperparameter is k = 2. 

We can also use different performance metrics to tune the parameter.
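
The loops in this section select k by scoring on the test set. A common alternative, sketched here as an assumption rather than as part of this notebook's workflow, is to choose k by cross-validation on the training data so that the test set stays untouched until the final evaluation. The example uses `GridSearchCV` on a synthetic two-class problem; the data and variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (assumption, for illustration only)
rng = np.random.default_rng(1)
X_syn = rng.uniform(0, 1, size=(300, 3))
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 1).astype(int)

# Try k = 1..20 with 5-fold cross-validation, scored by accuracy
search = GridSearchCV(
    KNeighborsClassifier(weights="uniform", algorithm="auto"),
    param_grid={"n_neighbors": list(range(1, 21))},
    scoring="accuracy",
    cv=5,
)
search.fit(X_syn, y_syn)

print(search.best_params_, search.best_score_)
```

The `scoring` argument accepts other metric names as well, for example `"neg_mean_absolute_error"`, which makes it straightforward to tune against a different performance measure.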

Now, let's use the test dataset to assess the performance of the k-NN model with k = 2.

In [34]:
knn_final = neighbors.KNeighborsClassifier(n_neighbors = 2, 
                                      weights='uniform',                                    
                                      algorithm='auto')
knn_final.fit(train_X_scale, train_y)
pred_y_knn_final = knn_final.predict(test_X_scale)
print(pred_y_knn_final)

[126  79 263 ...  70 170  15]


In [89]:
# Performance for different k values in terms of RMSE
import numpy as np

for k in range(1, 21):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k,
                                         weights='uniform',
                                         algorithm='auto')
    knn.fit(train_X_scale, train_y)
    pred_y = knn.predict(test_X_scale)
    # Calculate RMSE as the square root of MSE
    # (the squared=False argument is not available in recent scikit-learn versions)
    print("RMSE is : ", np.sqrt(metrics.mean_squared_error(test_y, pred_y)), " for k =", k)


RMSE is :  114.37314301818253  for k = 1
RMSE is :  113.87276017953988  for k = 2
RMSE is :  123.6160735482843  for k = 3
RMSE is :  134.02084934703814  for k = 4
RMSE is :  141.70669000831035  for k = 5
RMSE is :  147.68700807321045  for k = 6
RMSE is :  153.93706498773415  for k = 7
RMSE is :  157.49629704246874  for k = 8
RMSE is :  160.16497020518136  for k = 9
RMSE is :  162.74960305654557  for k = 10
RMSE is :  164.68077738028913  for k = 11
RMSE is :  166.7132760697553  for k = 12
RMSE is :  168.25295933137636  for k = 13
RMSE is :  169.51608846357567  for k = 14
RMSE is :  170.44771664712363  for k = 15
RMSE is :  171.27468957255334  for k = 16
RMSE is :  172.06450943636662  for k = 17
RMSE is :  172.75321287974805  for k = 18
RMSE is :  173.34605328817133  for k = 19
RMSE is :  173.08406338122433  for k = 20


In [None]:
# In terms of RMSE (lower is better), k=2 appears to be the optimal value

In [35]:
# Performance for different k values in terms of MAE
for k in range(1, 21):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k,
                                         weights='uniform',
                                         algorithm='auto')
    knn.fit(train_X_scale, train_y)
    pred_y = knn.predict(test_X_scale)
    # Calculate MAE
    print("MAE is : ", metrics.mean_absolute_error(test_y, pred_y), " for k =", k)




MAE is :  75.04982778415614  for k = 1
MAE is :  73.67623421354764  for k = 2
MAE is :  81.09207807118256  for k = 3
MAE is :  87.52537313432836  for k = 4
MAE is :  92.86199770378875  for k = 5
MAE is :  97.71067738231918  for k = 6
MAE is :  101.52743972445465  for k = 7
MAE is :  103.90723306544201  for k = 8
MAE is :  106.34006888633755  for k = 9
MAE is :  108.82594718714121  for k = 10
MAE is :  110.36096440872561  for k = 11
MAE is :  111.32514351320322  for k = 12
MAE is :  111.95177956371987  for k = 13
MAE is :  112.32812858783008  for k = 14
MAE is :  113.04270952927669  for k = 15
MAE is :  113.5804822043628  for k = 16
MAE is :  114.00987370838116  for k = 17
MAE is :  114.60505166475316  for k = 18
MAE is :  114.54787600459242  for k = 19
MAE is :  114.73892078071182  for k = 20


In [None]:
# In terms of MAE (lower is better), k=2 appears to be the optimal value

In [32]:
# Performance for different k values in terms of R squared
for k in range(1, 21):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k,
                                         weights='uniform',
                                         algorithm='auto')
    knn.fit(train_X_scale, train_y)
    pred_y = knn.predict(test_X_scale)
    # Calculate R squared
    print("R squared is : ", metrics.r2_score(test_y, pred_y), " for k =", k)


R squared is :  0.5670851473960734  for k = 1
R squared is :  0.5736523073728383  for k = 2
R squared is :  0.4821811634649248  for k = 3
R squared is :  0.4109671437498301  for k = 4
R squared is :  0.34591452319498994  for k = 5
R squared is :  0.28508772036860586  for k = 6
R squared is :  0.23983265281518906  for k = 7
R squared is :  0.2105904682773695  for k = 8
R squared is :  0.18142141440278203  for k = 9
R squared is :  0.14743571322619453  for k = 10
R squared is :  0.12442822536573706  for k = 11
R squared is :  0.10763140027230877  for k = 12
R squared is :  0.09612537252965758  for k = 13
R squared is :  0.08664016769250849  for k = 14
R squared is :  0.07358070951117013  for k = 15
R squared is :  0.06320769605862109  for k = 16
R squared is :  0.05422652201847333  for k = 17
R squared is :  0.0441216983619922  for k = 18
R squared is :  0.04328031683748146  for k = 19
R squared is :  0.03978037603814333  for k = 20


In [None]:
# In terms of R squared (higher is better), k=2 appears to be the optimal value
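
The three loops above refit the same 20 models once per metric. As a possible simplification, the sketch below fits each model once, records RMSE, MAE, and R squared together, and then picks the best k by RMSE. It runs on synthetic data with hypothetical names (`X_syn`, `y_syn`, `results`); on the bike rental data, `train_X_scale`, `train_y`, `test_X_scale`, and `test_y` would take their place.

```python
import numpy as np
from sklearn import metrics, neighbors

# Synthetic integer target (assumption, for illustration only)
rng = np.random.default_rng(2)
X_syn = rng.uniform(0, 1, size=(400, 2))
y_syn = np.round(100 * X_syn[:, 0] + 50 * X_syn[:, 1]).astype(int)
X_tr, X_te = X_syn[:300], X_syn[300:]
y_tr, y_te = y_syn[:300], y_syn[300:]

results = []
for k in range(1, 21):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    pred = knn.predict(X_te)
    results.append({
        "k": k,
        "rmse": np.sqrt(metrics.mean_squared_error(y_te, pred)),
        "mae": metrics.mean_absolute_error(y_te, pred),
        "r2": metrics.r2_score(y_te, pred),
    })

# Lower RMSE is better; ties resolve to the smallest k
best = min(results, key=lambda r: r["rmse"])
print(best)
```

Collecting the metrics in one list also makes it easy to pass `results` to `pd.DataFrame` for a side-by-side table or plot.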