# Seoul Bike Sharing

**Attribute Information:**

**-Rented Bike Count:** Count of bikes rented at each hour, continuous numeric value<br>
**-Hour:** Hour of the day, numeric value<br>
**-Temperature:** in Celsius, numeric value<br>
**-Humidity:** in %, numeric value<br>
**-Wind speed:** in m/s, numeric value<br>
**-Visibility:** in units of 10 m, numeric value<br>
**-Dew point temperature:** in Celsius, numeric value<br>
**-Solar radiation:** in MJ/m2, numeric value<br>
**-Rainfall:** in mm, numeric value<br>
**-Snowfall:** in cm, numeric value<br>
**-Seasons:** Winter, Spring, Summer, Autumn, categorical value<br>
**-Holiday:** Holiday/No holiday, binary value<br>
**-Functional Day:** NoFunc (non-functional hours), Fun (functional hours), binary value<br>


### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from scipy import stats
from sklearn import preprocessing, datasets, linear_model, metrics
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor


import sys
sys.path.insert(1, '../RegressionAlgorithms/')
from knn import *
import linearRegressionNumpy

### Get the Data

In [None]:
data = pd.read_csv('SeoulBikeData.csv', delimiter = ',', engine='python')

In [None]:
data

### Basic Data Information 

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
print(data.isnull().sum(axis=0))

### Exploratory Data Analysis

**Rented Bike Count**

*Histogram of Rented Bike Count Distribution*

In [None]:
fig = plt.figure(figsize = (20,5))
sns.set_style('darkgrid')
bins = np.arange(0, 3540, 100).tolist()
data['Rented Bike Count'].hist(bins=bins)
plt.xticks(bins)
plt.xlabel('Rented Bike Count')

**Rented Bike Count vs Season**

In [None]:
data['Seasons'].unique()

*Box plot of Rented Bike Count vs Seasons*

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(x=data['Seasons'], y=data['Rented Bike Count'])
plt.show()

**Rented Bike Count vs Holiday**

*Box plot of Rented Bike Count vs Holiday*

In [None]:
plt.figure(figsize=(4, 6))
sns.boxplot(x=data['Holiday'], y=data['Rented Bike Count'])
plt.show()

**Rented Bike Count vs Temperature**

*Plot of Rented Bike Count vs Temperature*

In [None]:
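temperature_col = next(c for c in data.columns if c.startswith('Temperature'))  # 'Temperature(°C)', looked up by prefix
fig = sns.jointplot(x=data[temperature_col], y=data['Rented Bike Count'], kind='reg')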

**Rented Bike Count vs Rainfall**

*Plot of Rented Bike Count vs Rainfall*

In [None]:
fig = sns.jointplot(x=data['Rainfall(mm)'], y=data['Rented Bike Count'], kind='reg')

*Removing outliers*

In [None]:
outliers = data[(data['Rainfall(mm)'] >= 20)]
data = data.drop(outliers.index)
data.index = np.arange(1, len(data) + 1)
outliers

In [None]:
fig = sns.jointplot(x=data['Rainfall(mm)'], y=data['Rented Bike Count'], kind='reg')

**Rented Bike Count vs Wind Speed**

*Plot of Rented Bike Count vs Wind Speed*

In [None]:
fig = sns.jointplot(data=data, x="Wind speed (m/s)", y="Rented Bike Count", kind='reg')

*Removing outliers*

In [None]:
outliers = data[(data['Wind speed (m/s)'] >6)]
data = data.drop(outliers.index)
data.index = np.arange(1, len(data) + 1)
outliers

In [None]:
fig = sns.jointplot(data=data, x="Wind speed (m/s)", y="Rented Bike Count", kind='reg')

**Rented Bike Count vs Snowfall**

*Plot of Rented Bike Count vs Snowfall*

In [None]:
fig = plt.figure(figsize = (25,15))
ax1 = fig.add_subplot(2,3,1)
ax1.scatter(data=data, x="Snowfall (cm)", y="Rented Bike Count")

*Distribution only with snowy days*

In [None]:
data_snowy = data.loc[(data['Snowfall (cm)'] > 0)]
data_snowy.index = np.arange(1, len(data_snowy) + 1)
data_snowy

In [None]:
fig = sns.jointplot(data=data_snowy, x="Snowfall (cm)", y="Rented Bike Count", kind='reg')

**Rented Bike Count vs Hour**

*Plot of Rented Bike Count vs Hour*

In [None]:
fig = sns.jointplot(data=data, x="Hour", y="Rented Bike Count", kind='reg')

**Feature Engineering on Date**

In [None]:
data[['Day','Month','Year']] = data['Date'].str.extract(r'(\d+)/(\d+)/(\d+)', expand=True)
data = data.drop(['Date'], axis=1)
data[['Day','Month','Year']] = data[['Day','Month','Year']].astype(float)
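
The same split can also be obtained by parsing the column as dates instead of using a regular expression. A minimal sketch on a toy frame, assuming the `Date` strings are in day/month/year order (`%d/%m/%Y`); the toy values and that format are assumptions, not taken from the CSV:

```python
import pandas as pd

# Toy frame standing in for the real data; the day/month/year order is an assumption.
toy = pd.DataFrame({'Date': ['01/12/2017', '15/06/2018']})

# Parse once, then read the components from the datetime accessor.
parsed = pd.to_datetime(toy['Date'], format='%d/%m/%Y')
toy['Day'] = parsed.dt.day
toy['Month'] = parsed.dt.month
toy['Year'] = parsed.dt.year
toy = toy.drop(columns='Date')

print(toy)
```

Parsing with an explicit `format` fails loudly on malformed dates, whereas the regular expression would silently produce missing values.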

**Visualization**

*Bike Count vs Year*

In [None]:
plt.figure(figsize=(4, 6))
sns.boxplot(x=data['Year'], y=data['Rented Bike Count'])
plt.show()

*Bike Count vs Month*

In [None]:
plt.figure(figsize=(20, 6))
sns.boxplot(x=data['Month'], y=data['Rented Bike Count'])
plt.show()

*Rented Bike Count vs Hour*

In [None]:
plt.figure(figsize=(20, 6))
sns.boxplot(x=data['Hour'], y=data['Rented Bike Count'])
plt.show()

## Data Pre-processing

*Preprocess Binary Data*

In [None]:
data = data.replace(to_replace=['No Holiday', 'Holiday'], value=[0, 1])
data = data.replace(to_replace=['No', 'Yes'], value=[0, 1])

*Preprocess Non Ordinal Data*

In [None]:
one_hot = pd.get_dummies(data["Seasons"])
data = data.drop("Seasons",axis = 1)
data = data.join(one_hot.astype(float))

In [None]:
X = data.drop('Rented Bike Count', axis=1)
y = data['Rented Bike Count']

In [None]:
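scaler = StandardScaler()
scaler = scaler.fit(X)
# Keep the scaled features as a DataFrame so column names and the index stay available later
X_scaled = pd.DataFrame(scaler.transform(X), columns=X.columns, index=X.index)

# Split the scaled features and the target into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.30)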
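
Fitting the scaler on the full feature matrix, as above, lets test-set statistics influence the training features. One possible alternative is to wrap the scaler and the model in a scikit-learn pipeline so that scaling is learned from the training split only. The sketch below uses synthetic data; the arrays, coefficients, and `random_state` are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * [1.0, 50.0, 1000.0]
y_demo = X_demo @ np.array([2.0, 0.1, 0.01]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)

# The scaler inside the pipeline is fitted on X_tr only when pipe.fit is called.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
score = pipe.score(X_te, y_te)  # R^2 on the held-out split

print('R^2 on test split: %.4f' % score)
```

The same pipeline object can be passed to `cross_val_score` or `GridSearchCV`, in which case the scaler is refitted inside every fold.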

In [None]:
data.info()

## Regression Tasks

*Regression Algorithms from Sklearn*

### Linear Regression

In [None]:
start = time.time()
model = linear_model.LinearRegression().fit(X_train, y_train)
end = time.time()

# Make predictions using the testing set
y_pred1 = model.predict(X_test)

# The coefficients
print('Coefficients: \n', model.coef_, model.intercept_)

**Evaluation metrics**

In [None]:
print('cross validation score: ', cross_val_score(model, X_train, y_train, cv=10))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, y_pred1))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred1))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred1))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred1)))
print("Time: %0.2f" % (end - start), "seconds")
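
For reference, `cross_val_score` clones the estimator, refits it on each training fold, and returns one score per held-out fold (R² by default for regressors), so it expects features and true targets rather than predictions. A small sketch on synthetic data; the arrays, coefficients, and seed are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=300)

# Shuffled 10-fold split; each fold is scored with R^2 on data the model did not see.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv)

print('R^2 per fold:', np.round(scores, 3))
print('mean +/- std: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```

Reporting the mean and spread of the fold scores is usually easier to compare across models than the raw array.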

### KNN Regression

In [None]:
start = time.time()
model = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
end = time.time()

# Make predictions using the testing set
y_pred= model.predict(X_test)

**Evaluation metrics**

In [None]:
print('cross validation score: ', cross_val_score(model, X_train, y_train, cv=10))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, y_pred))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("Time: %0.2f" % (end - start), "seconds")

### Decision Tree Regression

In [None]:
start = time.time()
model = DecisionTreeRegressor(random_state = 0).fit(X_train, y_train)
end = time.time()

# Make predictions using the testing set
y_pred = model.predict(X_test)

**Evaluation metrics**

In [None]:
print('cross validation score: ', cross_val_score(model, X_train, y_train, cv=10))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, y_pred))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("Time: %0.2f" % (end - start), "seconds")

### Random Forest Regressor

In [None]:
start = time.time()
model = RandomForestRegressor().fit(X_train, y_train)
end = time.time()

# Make predictions using the testing set
y_pred = model.predict(X_test)

**Evaluation metrics**

In [None]:
print('cross validation score: ', cross_val_score(model, X_train, y_train, cv=10))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, y_pred))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("Time: %0.2f" % (end - start), "seconds")

# Our Regression Algorithms

### Linear Regression Function (MSE)

In [None]:
try:
    del X_train['bias']
except KeyError:
    print('no bias to remove X_train')
try:
    del X_test['bias']
except KeyError:
    print('no bias to remove X_test')
try:
    del X['bias']
except KeyError:
    print('no bias to remove X')




print('\n Seoul: Linear Regression Function (MSE):')    
alphaMethod = 'const'
mu = 1
convCritList = [1e5, 1e4, 1e3, 1e2, 1e1, 1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
print('epsilon       | sum total error:   | sum relative error:  | iterations | Rsquare |    time/s')
for convergenceCriterion in convCritList:
    start = time.time()
    weights, score, iterations = linearRegressionNumpy.linearRegression(X_train, y_train, mu = mu, 
                                                        convergenceCriterion = convergenceCriterion, lossFunction = 'MSE', 
                                                        alphaMethod = alphaMethod, printOutput = False)
    end = time.time()
    yPred2 = linearRegressionNumpy.predictLinearRegression(X_test, weights)
    


    print('{:13.0E} | {:19}| {:21}| {:11}| {:8.4f}| {:10.5f}'.format(convergenceCriterion, 
                                        str(np.sum(yPred2-y_pred1)), 
                                        str(np.sum((yPred2-y_pred1)/y_pred1)),
                                        str(iterations),
                                        r2_score(y_test, yPred2),
                                        end-start))


print('\nFinal weights for smallest epsilon = {:2.0E}:'.format(convCritList[-1]))
print('weights = ', weights, '\n')

plt.title('SeoulBikeSharing: scikit prediction')
plt.plot(y_pred1)
plt.ylabel('Rented Bike Count')
plt.savefig('SeoulBikeSharing_scikit_prediction_MSE.jpeg', bbox_inches='tight')
plt.show()

plt.title('SeoulBikeSharing: our prediction (MSE)')
plt.plot(yPred2)
plt.ylabel('Rented Bike Count')
plt.savefig('SeoulBikeSharing_our_prediction_MSE.jpeg', bbox_inches='tight')
plt.show()

plt.title('SeoulBikeSharing: our prediction (MSE) vs. scikit prediction')
plt.plot(yPred2-y_pred1)
plt.ylabel('total error')
plt.savefig('SeoulBikeSharing_total_error_MSE.jpeg', bbox_inches='tight')
plt.show()

plt.title('SeoulBikeSharing: our prediction (MSE) vs. scikit prediction')
plt.plot((yPred2-y_pred1)/y_pred1)
plt.ylabel('relative error')
plt.savefig('SeoulBikeSharing_relative_error_MSE.jpeg', bbox_inches='tight')
plt.show()
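
The internals of `linearRegressionNumpy.linearRegression` are not shown in this notebook. As a rough illustration of how a convergence criterion (epsilon) typically controls the number of iterations in gradient descent on the MSE, the stand-alone sketch below uses a hypothetical `gd_linear_regression` helper on synthetic data. Its step size, stopping rule, and names are assumptions and not the module's actual implementation:

```python
import numpy as np

def gd_linear_regression(X, y, step=0.1, epsilon=1e-6, max_iter=100_000):
    """Batch gradient descent on the MSE (hypothetical helper).

    Stops when the loss changes by less than epsilon between iterations.
    """
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend a bias column
    w = np.zeros(Xb.shape[1])
    prev_loss = np.inf
    for iteration in range(1, max_iter + 1):
        residual = Xb @ w - y
        loss = np.mean(residual ** 2)
        if abs(prev_loss - loss) < epsilon:
            break
        # Gradient of the MSE with respect to the weights.
        w -= step * (2.0 / len(y)) * (Xb.T @ residual)
        prev_loss = loss
    return w, iteration

# Noise-free synthetic data with known weights [3.0, 1.5, -2.0] (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = 3.0 + X_demo @ np.array([1.5, -2.0])

w_loose, it_loose = gd_linear_regression(X_demo, y_demo, epsilon=1e-1)
w_tight, it_tight = gd_linear_regression(X_demo, y_demo, epsilon=1e-10)

print('epsilon=1e-1 :', it_loose, 'iterations, weights', np.round(w_loose, 3))
print('epsilon=1e-10:', it_tight, 'iterations, weights', np.round(w_tight, 3))
```

A smaller epsilon costs more iterations but brings the weights closer to the least-squares solution, which is the trade-off the table printed above explores.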



**Evaluation metrics**

In [None]:
print('\n Seoul: Linear Regression Function (MSE):')
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, yPred2))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, yPred2))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, yPred2))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, yPred2)))

### Linear Regression Function (MAE)

In [None]:
try:
    del X_train['bias']
except KeyError:
    print('no bias to remove X_train')
try:
    del X_test['bias']
except KeyError:
    print('no bias to remove X_test')
try:
    del X['bias']
except KeyError:
    print('no bias to remove X')




print('\n \n Seoul: Linear Regression Function (MAE):')    
alphaMethod = 'const'
mu = 1
convCritList = [1e5, 1e4, 1e3, 1e2, 1e1, 1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]
print('epsilon       | sum total error:   | sum relative error:  | iterations | Rsquare |    time/s')
for convergenceCriterion in convCritList:
    start = time.time()
    weights, score, iterations = linearRegressionNumpy.linearRegression(X_train, y_train, mu = mu, 
                                                        convergenceCriterion = convergenceCriterion, lossFunction = 'MAE', 
                                                        alphaMethod = alphaMethod, printOutput = False)
    end = time.time()
    yPred2 = linearRegressionNumpy.predictLinearRegression(X_test, weights)



    print('{:13.0E} | {:19}| {:21}| {:11}| {:8.4f}| {:10.5f}'.format(convergenceCriterion, 
                                        str(np.sum(yPred2-y_pred1)), 
                                        str(np.sum((yPred2-y_pred1)/y_pred1)),
                                        str(iterations),
                                        r2_score(y_test, yPred2),
                                        end-start))
    
print('\nFinal weights for smallest epsilon = {:2.0E}:'.format(convCritList[-1]))
print('weights = ', weights, '\n')

plt.title('SeoulBikeSharing: scikit prediction')
plt.plot(y_pred1)
plt.ylabel('Rented Bike Count')
plt.savefig('SeoulBikeSharing_scikit_prediction_MAE.jpeg', bbox_inches='tight')
plt.show()

plt.title('SeoulBikeSharing: our prediction (MAE)')
plt.plot(yPred2)
plt.ylabel('Rented Bike Count')
plt.savefig('SeoulBikeSharing_our_prediction_MAE.jpeg', bbox_inches='tight')
plt.show()

plt.title('SeoulBikeSharing: our prediction (MAE) vs. scikit prediction')
plt.plot(yPred2-y_pred1)
plt.ylabel('total error')
plt.savefig('SeoulBikeSharing_total_error_MAE.jpeg', bbox_inches='tight')
plt.show()

plt.title('SeoulBikeSharing: our prediction (MAE) vs. scikit prediction')
plt.plot((yPred2-y_pred1)/y_pred1)
plt.ylabel('relative error')
plt.savefig('SeoulBikeSharing_relative_error_MAE.jpeg', bbox_inches='tight')
plt.show()


**Evaluation metrics**

In [None]:
print('\n Seoul: Linear Regression Function (MAE):')
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, yPred2))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, yPred2))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, yPred2))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, yPred2)))

### Linear Regression Function (RMSE)

In [None]:
try:
    del X_train['bias']
except KeyError:
    print('no bias to remove X_train')
try:
    del X_test['bias']
except KeyError:
    print('no bias to remove X_test')
try:
    del X['bias']
except KeyError:
    print('no bias to remove X')



print('\n Seoul: Linear Regression Function (RMSE):')

alphaMethod = 'const'
mu = 1
convCritList = [1e5, 1e4, 1e3, 1e2, 1e1, 1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8]
print('epsilon       | sum total error:   | sum relative error:  | iterations | Rsquare |    time/s')
for convergenceCriterion in convCritList:
    start = time.time()
    weights, score, iterations = linearRegressionNumpy.linearRegression(X_train, y_train, mu = mu, 
                                                        convergenceCriterion = convergenceCriterion, lossFunction = 'RMSE', 
                                                        alphaMethod = alphaMethod, printOutput = False)
    end = time.time()
    yPred2 = linearRegressionNumpy.predictLinearRegression(X_test, weights)



    print('{:13.0E} | {:19}| {:21}| {:11}| {:8.4f}| {:10.5f}'.format(convergenceCriterion, 
                                        str(np.sum(yPred2-y_pred1)), 
                                        str(np.sum((yPred2-y_pred1)/y_pred1)),
                                        str(iterations),
                                        r2_score(y_test, yPred2),
                                        end-start))

print('\nFinal weights for smallest epsilon = {:2.0E}:'.format(convCritList[-1]))
print('weights = ', weights, '\n')

plt.title('SeoulBikeSharing: scikit prediction')
plt.plot(y_pred1)
plt.ylabel('Rented Bike Count')
plt.savefig('SeoulBikeSharing_scikit_prediction_RMSE.jpeg', bbox_inches='tight')
plt.show()

plt.title('SeoulBikeSharing: our prediction (RMSE)')
plt.plot(yPred2)
plt.ylabel('Rented Bike Count')
plt.savefig('SeoulBikeSharing_our_prediction_RMSE.jpeg', bbox_inches='tight')
plt.show()

plt.title('SeoulBikeSharing: our prediction (RMSE) vs. scikit prediction')
plt.plot(yPred2-y_pred1)
plt.ylabel('total error')
plt.savefig('SeoulBikeSharing_total_error_RMSE.jpeg', bbox_inches='tight')
plt.show()

plt.title('SeoulBikeSharing: our prediction (RMSE) vs. scikit prediction')
plt.plot((yPred2-y_pred1)/y_pred1)
plt.ylabel('relative error')
plt.savefig('SeoulBikeSharing_relative_error_RMSE.jpeg', bbox_inches='tight')
plt.show()


**Evaluation metrics**

In [None]:
print('\n Seoul: Linear Regression Function (RMSE):')
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, yPred2))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, yPred2))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, yPred2))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, yPred2)))

### Correlation Analysis

In [None]:
corr = data.corr()
plt.subplots(figsize=(20,9))
sns.heatmap(corr, annot=True)

In [None]:
top_feature = corr.index[abs(corr['Rented Bike Count'])>0.3]
plt.subplots(figsize=(12, 8))
top_corr = data[top_feature].corr()
sns.heatmap(top_corr, annot=True)
plt.show()

In [None]:
corr = data.corr()
corr.sort_values(['Rented Bike Count'], ascending=False, inplace=True)
corr['Rented Bike Count']

### KNN

**Dictionary creation to apply the mathematical functions of the algorithm**

Training Data Option:
- 0: All Data (except the target)
- 1: X_train/y_train (train_test_split)

In [None]:
training_data_option = 0

In [None]:
if training_data_option == 0:
    training_data = data
elif training_data_option == 1:
    training_data = data[data.index.isin(X_train.index)]
    test_data = data[data.index.isin(X_test.index)]
    
training_data

In [None]:
if training_data_option == 0:
    training_dictionary = training_data.to_dict('records')
elif training_data_option == 1:
    training_dictionary = training_data.to_dict('records')
    test_dictionary = test_data.to_dict('index')

In [None]:
#training_dictionary

In [None]:
len(training_dictionary)

**Forecasting instances**

In [None]:
y_test

**Algorithm parameters**

In [None]:
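mode = 1 # 1 = KNeighbors; 2 = RadiusNeighbors
n_neighbours = 5
distance_function = 1 # 1 = Euclidean Distance; 2 = Manhattan Distance
radius = 0 # 0 indicates no radius
label = 'Rented Bike Count'
# Look the temperature columns up by prefix so the list does not depend on how the degree sign was decoded
temperature_col = next(c for c in data.columns if c.startswith('Temperature'))
dew_point_col = next(c for c in data.columns if c.startswith('Dew point temperature'))
features = [temperature_col, 'Hour', dew_point_col, 'Winter']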

**Algorithm initialization**

In [None]:
knn = KNN(training_dictionary, label, features, mode, n_neighbours, distance_function, radius)
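
The `KNN` class imported from `knn.py` is not shown in this notebook. As a hedged illustration of what k-nearest-neighbour regression with the two distance options listed above computes, the sketch below uses a hypothetical `knn_predict` helper on toy data: it averages the labels of the `k` closest training points. The helper, its signature, and the toy arrays are assumptions and not the class's actual interface:

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=5, distance='euclidean'):
    """Predict one value as the mean label of the k nearest training points (hypothetical helper)."""
    diffs = np.asarray(X_train, dtype=float) - np.asarray(query, dtype=float)
    if distance == 'euclidean':
        dists = np.sqrt((diffs ** 2).sum(axis=1))
    elif distance == 'manhattan':
        dists = np.abs(diffs).sum(axis=1)
    else:
        raise ValueError('distance must be "euclidean" or "manhattan"')
    nearest = np.argsort(dists)[:k]  # indices of the k smallest distances
    return float(np.mean(np.asarray(y_train, dtype=float)[nearest]))

# Toy one-feature data set (illustrative only).
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0.0, 10.0, 20.0, 100.0])

# The two nearest neighbours of 1.2 are the points at 1.0 and 2.0.
pred = knn_predict(X_toy, y_toy, query=[1.2], k=2)
print(pred)  # 15.0
```

Because both distances sum over raw feature differences, features on larger scales dominate the neighbour search unless the inputs are scaled first.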

**Execution of the algorithm (forecasting)**

In [None]:
results = []

start = time.time()

if training_data_option == 0:
    for x in y_test.index:
        #print(x)
        target = training_dictionary[x-1]
        #print(target)
        result = knn.run(target)
        #print(result)
        results.append(result)
elif training_data_option == 1:
    for x in y_test.index:
        #print(x)
        target = test_dictionary[x]
        #print(target)
        result = knn.run(target)
        #print(result)
        results.append(result)
    
end = time.time()

**Predictions**

In [None]:
predictions = pd.Series(results,index=y_test.index)

In [None]:
predictions

**Evaluation metrics**

In [None]:
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, predictions))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("Time: %0.2f" % (end - start), "seconds")