# **Visualize Dataset**

## Objectives

* The objective is to predict bike rental usage from inputs such as temperature, humidity, and wind speed.

## Inputs

* instant: record index
* dteday : date
* season : season (1:spring, 2:summer, 3:fall, 4:winter)
* yr : year (0: 2011, 1:2012)
* mnth : month (1 to 12)
* hr : hour (0 to 23)
* holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
* weekday : day of the week
* workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
* weathersit :
    1. Clear, Few clouds, Partly cloudy
    2. Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    3. Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    4. Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* temp : Normalized temperature in Celsius. The values are divided by 41 (max)
* hum: Normalized humidity. The values are divided by 100 (max)
* windspeed: Normalized wind speed. The values are divided by 67 (max)
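
Because the continuous inputs are stored as fractions of the stated maxima, multiplying by the same maximum gives back a value in the original unit. The sketch below assumes the scaling is exactly the division described in the list above; `MAX_VALUES` and `denormalize` are illustrative names, not part of the dataset or of any library.

```python
# Maximum values quoted in the input descriptions above
MAX_VALUES = {"temp": 41.0, "hum": 100.0, "windspeed": 67.0}


def denormalize(column, value):
    """Convert a normalized value back to its original unit (assumes value = raw / max)."""
    return value * MAX_VALUES[column]


# Example: a normalized temperature of 0.5 (made-up value)
print(denormalize("temp", 0.5))
```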


## Outputs

* cnt: count of total rental bikes including both casual and registered  

## Additional Comments

* Data Reference:

Hadi Fanaee-T
Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto, INESC Porto, Campus da FEUP, Rua Dr. Roberto Frias, 378, 4200-465 Porto, Portugal


# Change working directory

We need to change the working directory from the current folder to its parent folder.
* os.getcwd() returns the current directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.

* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")
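
To see what `os.path.dirname()` returns without actually moving the working directory, it can be tried on a plain string. This is only a sketch: the path below is a hypothetical location and `parent_of` is an illustrative helper name.

```python
import os


def parent_of(path):
    """Return the parent directory of a path without touching the working directory."""
    return os.path.dirname(path)


example_path = "/home/user/project/notebooks"  # hypothetical location
print(parent_of(example_path))
```

Running the `os.chdir(os.path.dirname(...))` cell more than once moves up one more level each time, so it is worth confirming the result with `os.getcwd()` afterwards.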

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Import Libraries

In [None]:
! pip install tensorflow==2.2.0
import tensorflow as tf
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Load and Inspect data

In [None]:
bike = pd.read_csv('inputs/datasets/raw/bike_sharing_daily.csv')
bike

In [None]:
sns.heatmap(bike.isnull())

In [None]:
bike = bike.drop(labels= ['instant'], axis=1)

In [None]:
bike

In [None]:
bike = bike.drop(labels= ['casual', 'registered'], axis = 1)
bike

In [None]:
bike.dteday = pd.to_datetime(bike.dteday, format= '%m/%d/%Y')
bike

In [None]:
bike.index = pd.DatetimeIndex(bike.dteday)
bike

In [None]:
bike = bike.drop(labels= ['dteday'], axis=1)
bike

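* Plot the weekly usage

In [None]:
def CtnWeekly():
    # asfreq('W') keeps one observation per week (the value recorded on each Sunday);
    # it does not sum or average the days of the week
    bike['cnt'].asfreq('W').plot(linewidth = 3)
    plt.title('Bike Usage Per week')
    plt.xlabel('Week')
    plt.ylabel('Bike Rental')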

In [None]:
CtnWeekly()
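
The difference between sampling with `asfreq` and aggregating with `resample` is easy to miss in a plot. The sketch below uses two weeks of made-up daily counts (not the bike data) to show both; which one is appropriate depends on whether the chart should show a single day per week or the weekly total.

```python
import pandas as pd

# Two weeks of made-up daily counts, starting on Monday 2011-01-03
days = pd.date_range("2011-01-03", periods=14, freq="D")
daily = pd.Series(range(1, 15), index=days)

# asfreq picks the value observed on each week-ending Sunday
sampled = daily.asfreq("W")

# resample groups each Monday-to-Sunday week and sums it
totals = daily.resample("W").sum()

print(sampled.tolist())
print(totals.tolist())
```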

* Bike usage per Month

In [None]:
bike['cnt'].asfreq('M').plot(linewidth = 3)
plt.title('Bike Usage Per Month')
plt.xlabel('Month')
plt.ylabel('Bike Rental')

* Bike usage Per Quarter

In [None]:
bike['cnt'].asfreq('Q').plot(linewidth = 3)
plt.title('Bike Usage Per Quarter')
plt.xlabel('Quarter')
plt.ylabel('Bike Rental')

# Visualise the entire Data

In [None]:
sns.pairplot(bike)

* Divide Data into Numerical and Categorical

In [None]:
X_numerical = bike[['temp', 'hum', 'windspeed', 'cnt']]

sns.pairplot(X_numerical, diag_kind='kde')
plt.show()

In [None]:
X_numerical

In [None]:
def show_distribution(var_data):
    # Histogram (left) and horizontal boxplot (right) for one numeric column
    fig, ax = plt.subplots(1, 2, figsize=(8, 8))

    ax[0].hist(var_data, bins=100)
    ax[0].set_xlabel('Value')
    ax[0].set_ylabel('Frequency')

    mean_val = var_data.mean()
    median_val = var_data.median()

    # Mark the mean (magenta) and the median (black) on the histogram
    ax[0].axvline(mean_val, color = 'magenta' , linestyle='dashed', linewidth = 2)
    ax[0].axvline(median_val, color = 'black' , linestyle='dashed', linewidth = 2)

    ax[1].boxplot(var_data, vert=False)
    ax[1].set_xlabel('Value')

    fig.suptitle(var_data.name)

    plt.show()

In [None]:
for col in X_numerical:
    show_distribution(X_numerical[col])

# Visualising Categorical Variables

In [None]:
X_cat = bike[['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']]

In [None]:
X_cat

* Build boxplots of the categorical variables against the target variable 'cnt' to see how each predictor relates to the target.

In [None]:
plt.figure(figsize=(25, 10))
plt.subplot(2,3,1)
sns.boxplot(x = 'season', y = 'cnt', data = bike)
plt.subplot(2,3,2)
sns.boxplot(x = 'mnth', y = 'cnt', data = bike)
plt.subplot(2,3,3)
sns.boxplot(x = 'weathersit', y = 'cnt', data = bike)
plt.subplot(2,3,4)
sns.boxplot(x = 'holiday', y = 'cnt', data = bike)
plt.subplot(2,3,5)
sns.boxplot(x = 'weekday', y = 'cnt', data = bike)
plt.subplot(2,3,6)
sns.boxplot(x = 'workingday', y = 'cnt', data = bike)
plt.show()

* Check categorical variables frequency

In [None]:
for col in X_cat:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    # value_counts() counts rows (days) per category, not rentals
    cat_count = bike[col].value_counts().sort_index()
    cat_count.plot.bar(ax=ax)
    ax.set_title(col + ' counts')
    ax.set_xlabel(col)
    ax.set_ylabel('Number of days')
    plt.show()

# **Use Correlation**

* Find the correlation between each numerical variable and the label using scatter plots

In [None]:
for col in X_numerical:
    correlation_value = bike[col].corr(bike['cnt'])
    fig = plt.figure(figsize=(9, 6))
    plt.scatter(x=bike[col],y=bike['cnt'], color='steelblue')
    plt.title("correlation_value: " + str(correlation_value))
    plt.xlabel(col) 
    plt.ylabel("Rentals")
    plt.show()
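
`Series.corr()` computes the Pearson correlation coefficient by default. As a rough illustration of what that number means, the sketch below computes it by hand with NumPy on made-up values; `pearson` is an illustrative helper, not a library function.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float((x_centered * y_centered).sum() / denominator)


temp_demo = [0.1, 0.2, 0.3, 0.4]     # made-up normalized temperatures
rentals_demo = [100, 200, 300, 400]  # made-up rental counts
print(pearson(temp_demo, rentals_demo))
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship. Note that `cnt` is itself a column of `X_numerical`, so its correlation with the label is trivially 1.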

# Comparing categorical features with rentals cnt

In [None]:
for col in X_cat:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    # One box of 'cnt' values per category of the current column
    bike.boxplot(column='cnt', by=col, ax=ax)
    ax.set_title('Label by ' + col)
    ax.set_ylabel("Bike Rentals")
    plt.show()

In [None]:
sns.heatmap(X_numerical.corr(), annot =True)

<hr>

# **CREATE TRAINING AND TESTING DATASET**

* Separate features and labels

In [None]:
X = bike[['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit','temp', 'hum', 'windspeed']].values
y = bike['cnt'].values


print(f'Features: {X[:5]}, \nLabels: {y[:5]}')

* Divide data into 70% for Training and 30% for Testing

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state= 0)
print (f'X_train: {X_train.shape} \nX_test: {X_test.shape} \ny_train: {y_train.shape} \ny_test: {y_test.shape}')
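
To check how a 70/30 split divides the rows, the same call can be run on a tiny made-up array. This is a sketch with demo names (`X_demo`, `y_demo`), separate from the bike data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows with two features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.3 puts 3 of the 10 rows in the test set; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

print(X_tr.shape, X_te.shape)
```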

* Linear Regression
* Fit the linear regression on the training set

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X_train, y_train)
print(model)

In [None]:
predictions = model.predict(X_test)
np.set_printoptions(suppress=True)

print(f'Predicted labels: {np.round(predictions)[:10]}')
print(f'Actual labels: {y_test[:10]}')

* Draw a scatter plot that compares the predictions to the actual labels.
* Overlay a trend line to get a general sense for how well the predicted labels align with the true labels.

In [None]:
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# Overlay the regression line
# Fit a degree-1 polynomial to the points (y_test, predictions). polyfit returns the
# least-squares coefficients ordered from the highest degree to the lowest.
z = np.polyfit(y_test, predictions, 1)

p = np.poly1d(z) # Define the polynomial function
print(f'Polynomial function: {p}')
plt.plot(y_test, p(y_test), color='magenta') # p(y_test) evaluates the polynomial at every point in y_test
plt.show()
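
The trend line comes from `np.polyfit` with degree 1, which returns the slope and intercept of the least-squares line. A small sketch on made-up points that lie exactly on y = 2x + 1 shows the order of the returned coefficients:

```python
import numpy as np

# Made-up points on the line y = 2x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 * x_demo + 1.0

# Degree-1 fit: coefficients come back highest degree first (slope, then intercept)
slope, intercept = np.polyfit(x_demo, y_demo, 1)

# poly1d turns the coefficients into a callable function
line = np.poly1d([slope, intercept])
print(slope, intercept, line(4.0))
```

In a predicted-versus-actual plot, a model with perfect predictions would give a trend line with slope 1 and intercept 0, so the fitted coefficients give a quick sense of systematic over- or under-prediction.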

* Evaluate the Model using MSE, RMSE, R<sup>2</sup>

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, predictions)
print(f'MSE: {mse}')

rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')

r2 = r2_score(y_test, predictions)
print(f'R2: {r2}')
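
For reference, the three metrics can also be computed directly from their definitions. The sketch below uses made-up labels and predictions; `regression_metrics` is an illustrative helper and should agree with the scikit-learn functions used above.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R2) computed from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))                   # mean of squared errors
    rmse = float(np.sqrt(mse))                             # same unit as the label
    ss_res = float(np.sum(residuals ** 2))                 # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2


mse_demo, rmse_demo, r2_demo = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(mse_demo, rmse_demo, r2_demo)
```

RMSE is expressed in the same unit as the label (rentals per day here), which usually makes it easier to interpret than MSE.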

# **Linear Algorithm**

* Train the regression model by using a Lasso algorithm

In [None]:
from sklearn.linear_model import Lasso

# Fit Lasso model on training set
model = Lasso().fit(X_train, y_train)
print(model)

# Evaluate the model using the test data
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
print(f'MSE: {mse}')

rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')

r2 = r2_score(y_test, predictions)
print(f'R2: {r2}')

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# Overlay the regression line (degree-1 least-squares fit of predictions against y_test)
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z) # Define the polynomial function
print(f'Polynomial function: {p}')
plt.plot(y_test, p(y_test), color='magenta') # p(y_test) evaluates the polynomial at every point in y_test
plt.show()
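
Lasso is linear regression with an L1 penalty on the coefficients, controlled by `alpha` (scikit-learn's default is 1.0). The sketch below, on synthetic data rather than the bike data, compares the total coefficient size of an ordinary least-squares fit and a Lasso fit; the value `alpha=0.5` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: only the first two of five columns influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Sum of absolute coefficient values for each model
ols_l1 = float(np.abs(ols.coef_).sum())
lasso_l1 = float(np.abs(lasso.coef_).sum())
print(ols_l1, lasso_l1)
print(lasso.coef_)
```

The penalty shrinks coefficients toward zero and can set some of them exactly to zero, which is why Lasso is sometimes used as a simple form of feature selection.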

# **Decision Tree Algorithm**

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text

# Fit the decision tree model on the training set (also known as model training)
model = DecisionTreeRegressor().fit(X_train, y_train)
print(model)

# Visualize the model tree
tree = export_text(model)
print(tree)
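
By default `export_text` labels the columns `feature_0`, `feature_1`, and so on. Passing `feature_names` makes the printed rules easier to read. A minimal sketch on made-up data with a single feature; the depth limit is only there to keep the output short.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up data: one feature, two clearly separated groups of labels
X_demo = np.array([[0.1], [0.2], [0.8], [0.9]])
y_demo = np.array([10.0, 12.0, 50.0, 52.0])

# max_depth=1 allows a single split
small_tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_demo, y_demo)

# feature_names replaces the generic feature_0 label in the printed rules
tree_text = export_text(small_tree, feature_names=["temp"])
print(tree_text)
```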

# **Evaluate the Trained Model**

In [None]:
# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f'MSE: {mse}')
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
r2 = r2_score(y_test, predictions)
print(f'R2: {r2}')

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# Overlay the regression line (degree-1 least-squares fit of predictions against y_test)
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z) # Define the polynomial function
print(f'Polynomial function: {p}')
plt.plot(y_test, p(y_test), color='magenta') # p(y_test) evaluates the polynomial at every point in y_test
plt.show()

# **Ensemble Algorithm**

* Use Ensemble Algorithm to improve over the linear model

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Train the model
model = RandomForestRegressor().fit(X_train, y_train)
print (model, "\n")

# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f'MSE: {mse}')
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
r2 = r2_score(y_test, predictions)
print(f'R2: {r2}')

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# Overlay the regression line (degree-1 least-squares fit of predictions against y_test)
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z) # Define the polynomial function
print(f'Polynomial function: {p}')
plt.plot(y_test, p(y_test), color='magenta') # p(y_test) evaluates the polynomial at every point in y_test
plt.show()

* Trying a boosting ensemble algorithm for good measure

In [None]:
# Train the model
from sklearn.ensemble import GradientBoostingRegressor

# Fit a gradient boosting model on the training set
model = GradientBoostingRegressor().fit(X_train, y_train)
print (model, "\n")

# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f'MSE: {mse}')
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
r2 = r2_score(y_test, predictions)
print(f'R2: {r2}')

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# Overlay the regression line (degree-1 least-squares fit of predictions against y_test)
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z) # Define the polynomial function
print(f'Polynomial function: {p}')
plt.plot(y_test, p(y_test), color='magenta') # p(y_test) evaluates the polynomial at every point in y_test
plt.show()

# **Optimize Hyperparameters**

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, r2_score

# Use a Gradient Boosting algorithm
alg = GradientBoostingRegressor()

# Try these hyperparameter values
params = {
 'learning_rate': [0.1, 0.5, 1.0],
 'n_estimators' : [50, 100, 150]
 }

# Find the best hyperparameter combination to optimize the R2 metric
score = make_scorer(r2_score)
gridsearch = GridSearchCV(alg, params, scoring=score, cv=3, return_train_score=True)
gridsearch.fit(X_train, y_train)
print("Best parameter combination:", gridsearch.best_params_, "\n")

# Get the best model
model=gridsearch.best_estimator_
print(model, "\n")

# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
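
The cost of a grid search grows with the number of hyperparameter combinations times the number of cross-validation folds. The sketch below counts them for the grid used above with scikit-learn's `ParameterGrid`; nothing is trained here.

```python
from sklearn.model_selection import ParameterGrid

# Same grid and fold count as in the search above
params = {
    "learning_rate": [0.1, 0.5, 1.0],
    "n_estimators": [50, 100, 150],
}
cv_folds = 3

# ParameterGrid enumerates every combination of the listed values
combinations = list(ParameterGrid(params))
n_fits = len(combinations) * cv_folds

print(len(combinations), n_fits)
print(combinations[0])
```

With the default `refit=True`, `GridSearchCV` then trains the best combination once more on the whole training set, which is the model returned by `best_estimator_`.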

* Encoding categorical variables

In [None]:
# Train the model
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
import numpy as np

# Define preprocessing for numeric columns (scale them)
# Column order in X: season, yr, mnth, holiday, weekday, workingday, weathersit, temp, hum, windspeed
numeric_features = [7,8,9]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

# Define preprocessing for categorical features (encode them)
categorical_features = [0,1,2,3,4,5,6]
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', GradientBoostingRegressor())])


# Fit the pipeline to train a gradient boosting model on the training set
model = pipeline.fit(X_train, y_train)
print (model)
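
Because `X` is a plain NumPy array, the `ColumnTransformer` addresses columns by position, so the index lists have to match the order in which the features were selected. One way to avoid counting by hand is to derive the positions from the names. This is a sketch; `feature_names` repeats the column list used to build `X`, and treating `temp`, `hum`, and `windspeed` as the only numeric columns is an assumption based on the input descriptions.

```python
# Column order used when X was built from the bike DataFrame
feature_names = ['season', 'yr', 'mnth', 'holiday', 'weekday',
                 'workingday', 'weathersit', 'temp', 'hum', 'windspeed']

# Assumption: these three are the continuous features, the rest are categorical codes
numeric_names = ['temp', 'hum', 'windspeed']

numeric_idx = [i for i, name in enumerate(feature_names) if name in numeric_names]
categorical_idx = [i for i, name in enumerate(feature_names) if name not in numeric_names]

print(numeric_idx)
print(categorical_idx)
```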

* Check how the model performs with the test data

In [None]:
# Get predictions
predictions = model.predict(X_test)

# Display metrics
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()

* Try an alternative algorithm

In [None]:
# Use a different estimator in the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', RandomForestRegressor())])


# Fit the pipeline to train a random forest model on the training set
model = pipeline.fit(X_train, y_train)
print (model, "\n")

# Get predictions
predictions = model.predict(X_test)

# Display metrics
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions - Preprocessed')
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()

# **Use the Trained Model**

* Save the model

In [None]:
import joblib

# Save the model as a pickle file
filename = './inputs/datasets/raw/bike-share.pkl'
joblib.dump(model, filename)
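
A quick way to confirm that a saved model survives the round trip is to reload it and compare predictions. The sketch below does this with a tiny stand-in model and a temporary folder, so it does not touch the project files; `demo_model` and the file name are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model trained on made-up data
X_demo = np.array([[0.0], [1.0], [2.0]])
y_demo = np.array([1.0, 3.0, 5.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary folder and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.pkl")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The reloaded model should give the same predictions as the original
same = bool(np.allclose(demo_model.predict(X_demo), restored.predict(X_demo)))
print(same)
```

Pickled scikit-learn models are generally only safe to load with the same library version that saved them, and pickle files from untrusted sources should not be loaded.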

* Load and predict labels for new data

In [None]:
# Load the model from the file
loaded_model = joblib.load('./inputs/datasets/raw/bike-share.pkl')

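# Create a numpy array containing a new observation (for example tomorrow's seasonal and weather forecast information)
# Values must follow the training column order: season, yr, mnth, holiday, weekday, workingday, weathersit, temp, hum, windspeed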
X_new = np.array([[1,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869]]).astype('float64')
print ('New sample: {}'.format(list(X_new[0])))

# Use the model to predict tomorrow's rentals
result = loaded_model.predict(X_new)
print('Prediction: {:.0f} rentals'.format(np.round(result[0])))

* Suppose you have a weather forecast for the next five days; you could use the model to predict bike rentals for each day based on the expected weather conditions.

In [None]:
# An array of features based on five-day weather forecast
X_new = np.array([[0,1,1,0,0,1,0.344167,0.363625,0.805833,0.160446],
                  [0,1,0,1,0,1,0.363478,0.353739,0.696087,0.248539],
                  [0,1,0,2,0,1,0.196364,0.189405,0.437273,0.248309],
                  [0,1,0,3,0,1,0.2,0.212122,0.590435,0.160296],
                  [0,1,0,4,0,1,0.226957,0.22927,0.436957,0.1869]])

# Use the model to predict rentals
results = loaded_model.predict(X_new)
print('5-day rental predictions:')
for prediction in results:
    print(np.round(prediction))
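
Hand-typed arrays make it easy to put a value in the wrong position. One option is to write each observation as a dictionary and let a small helper arrange it in the training column order. This is a sketch: `FEATURE_ORDER`, `to_feature_row`, and the forecast values are illustrative and made up.

```python
import numpy as np

# Column order used when the model was trained
FEATURE_ORDER = ['season', 'yr', 'mnth', 'holiday', 'weekday',
                 'workingday', 'weathersit', 'temp', 'hum', 'windspeed']


def to_feature_row(observation):
    """Arrange a dict of named features into a 1 x n array in training order."""
    return np.array([[observation[name] for name in FEATURE_ORDER]], dtype="float64")


# Hypothetical forecast, written in any key order
forecast = {
    'temp': 0.3, 'hum': 0.6, 'windspeed': 0.2,
    'season': 1, 'yr': 1, 'mnth': 1,
    'holiday': 0, 'weekday': 3, 'workingday': 1, 'weathersit': 2,
}

row = to_feature_row(forecast)
print(row)
```

The resulting `row` has the shape expected by `predict`, and a missing feature raises a `KeyError` instead of silently shifting the remaining values.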

# Push files to Repo

In [None]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# index=False leaves the DatetimeIndex (dteday) out of the saved file
bike.to_csv("outputs/datasets/collection/bike_sharing_daily.csv", index=False)
