# Bike Sharing Demand Prediction

The following notebook is a typical example of a data science workflow that does not
yet use experiment tracking. It will serve as a starting point for our exercises. Your
task will be to add experiment tracking to the notebook.

Let's first go over the steps of the workflow:

- Load the data
- Split the dataset by year
- Exploratory Data Analysis (EDA)
- Define numerical and categorical features
- Train-test split
- Build a reusable training pipeline
- Train a model using linear regression
- Evaluate model performance
- Handle the skewed target with a log transform
- Train a random forest regressor

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error, r2_score, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

In [2]:
RANDOM_STATE = 42

## Step 1: Loading the Dataset into a Pandas DataFrame

- Load the dataset into a Pandas DataFrame
- Inspect (part of) the dataset using the .head() method

In [None]:
bike_sharing_data = pd.read_csv("../data/hour.csv")
bike_sharing_data.head()

## Step 2: Split the Dataset by Year

- Split the dataset by year. The column 'yr' encodes the year: 0 = 2011 and 1 = 2012.

In [4]:
data_2011 = bike_sharing_data[bike_sharing_data["yr"] == 0]
data_2012 = bike_sharing_data[bike_sharing_data["yr"] == 1]
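
As a small illustration of the year split, the sketch below applies the same boolean-mask filter to a toy DataFrame. The toy values are made up, and the trailing .copy() is an optional addition (not part of the notebook) that makes each subset an independent DataFrame, which avoids pandas' SettingWithCopyWarning if columns are added to a subset later.

```python
import pandas as pd

# Toy stand-in for hour.csv (made-up values); 'yr' uses 0 = 2011, 1 = 2012.
toy_data = pd.DataFrame({
    "yr": [0, 0, 1, 1, 1],
    "cnt": [16, 40, 32, 13, 1],
})

# A boolean mask keeps only the rows where the condition holds.
# .copy() is an optional addition that decouples each subset from toy_data.
toy_2011 = toy_data[toy_data["yr"] == 0].copy()
toy_2012 = toy_data[toy_data["yr"] == 1].copy()

print(len(toy_2011), len(toy_2012))
```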

## Step 3: Exploratory Data Analysis (EDA)

### Step 3.1: Plot the distribution of the target variable

- Plot the distribution of the target variable ('cnt') using sns.displot().
- Show either absolute counts (stat="count") or the probability density (stat="density")

In [None]:
sns.displot(data_2011["cnt"], kde=True, stat="density")
plt.xlabel("Count")
plt.ylabel("Probability density")
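
Besides inspecting the plot, the skew of a distribution can be quantified numerically. The following is a minimal sketch on synthetic data, not on the bike sharing dataset: the log-normal sample is an assumption chosen only because it is right-skewed, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed values (log-normal), standing in for a count target.
toy_counts = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

# Series.skew() returns the sample skewness; values near 0 indicate symmetry.
skew_raw = toy_counts.skew()
skew_log = np.log1p(toy_counts).skew()

print(f"Skewness (raw): {skew_raw:.2f}")
print(f"Skewness (log1p): {skew_log:.2f}")
```

A clearly positive skewness on the raw values that shrinks after np.log1p is one indication that a log transform of the target may be worth trying.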

### Step 3.2: Investigate feature-target relationships

- Plot the target ('cnt') against a feature, for example 'season'. Feel free to try out other features.

In [None]:
sns.boxplot(data=data_2011, x="season", y="cnt")

## Step 4: Define Numerical and Categorical Features

In [7]:
NUMERICAL_FEATURES = ["temp", "hum", "windspeed"]
CATEGORICAL_FEATURES = ["season", "mnth", "hr", "holiday", "weekday", "workingday", "weathersit"]

FEATURES = NUMERICAL_FEATURES + CATEGORICAL_FEATURES

TARGET = "cnt"

In [None]:
features_target = data_2011[FEATURES + [TARGET]]
features_target.head()

## Step 5: Train-test Split


- Split the dataset into a training set and a test set
- Split both the training and test sets into "input" and "output" (i.e., "X_train", "y_train", "X_test", "y_test")
    - The "input" contains all the features and serves as the input to the model training and the predictions.
    - The "output" is used as the target in the model training and as ground truth in the evaluations.

In [9]:
train_data, test_data = train_test_split(features_target, random_state=RANDOM_STATE)

X_train = train_data[FEATURES]
y_train = train_data[TARGET]

X_test = test_data[FEATURES]
y_test = test_data[TARGET]
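
To make the behavior of the split concrete, here is a small sketch on a toy DataFrame (made-up values). It relies on two properties of train_test_split: when neither test_size nor train_size is given, 25% of the rows go to the test set, and a fixed random_state makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with 100 rows (made-up values).
toy_data = pd.DataFrame({
    "temp": [i / 100 for i in range(100)],
    "cnt": list(range(100)),
})

# Default split: 75% training rows, 25% test rows.
toy_train, toy_test = train_test_split(toy_data, random_state=42)

# Repeating the call with the same random_state yields the same rows.
toy_train_again, _ = train_test_split(toy_data, random_state=42)

print(len(toy_train), len(toy_test))
print(toy_train.index.equals(toy_train_again.index))
```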

## Step 6: Building a Reusable Training Pipeline

Define a function that builds a reusable training pipeline. The function
takes an estimator and returns a Scikit-Learn pipeline containing the preprocessing
and model training steps.

In [10]:
numerical_transformer = Pipeline([
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("encoder", OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ("numerical", numerical_transformer, NUMERICAL_FEATURES),
    ("categorical", categorical_transformer, CATEGORICAL_FEATURES),
])

def build_pipeline(estimator):
    return Pipeline([
        ("preprocessor", preprocessor),
        ("estimator", estimator),
    ])

## Step 7: Train a Model using Linear Regression

Create a linear regression pipeline and fit the pipeline to the training data.

In [None]:
pipeline = build_pipeline(LinearRegression())
pipeline.fit(X_train, y_train)

## Step 8: Evaluate Model Performance

In [12]:
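def root_mean_squared_log_error(y_true, y_pred):
    # RMSLE: root of the mean squared difference between log1p-transformed values.
    # Inputs must be greater than -1; otherwise np.log1p returns -inf or NaN.
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    squared_log_diff = log_diff ** 2
    mean_squared_log_diff = np.mean(squared_log_diff)
    return np.sqrt(mean_squared_log_diff)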

In [13]:
def calculate_metrics(y_true, y_pred):
    rmse = root_mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmsle = root_mean_squared_log_error(y_true, y_pred)

    print(f"RMSE: {rmse:.5f}")
    print(f"R² score: {r2:.5f}")
    print(f"MAE: {mae:.5f}")
    print(f"RMSLE: {rmsle:.5f}")
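
A short worked example may help build intuition for the RMSLE. The sketch below repeats the metric in compact form and applies it to made-up values. It also shows that a prediction below -1 produces NaN, which can happen when a linear model is trained on the untransformed target. Clipping predictions at zero before scoring is an assumption of this sketch, not something the notebook does.

```python
import numpy as np


def root_mean_squared_log_error(y_true, y_pred):
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return np.sqrt(np.mean(log_diff ** 2))


y_true = np.array([9.0, 99.0])
y_pred = np.array([99.0, 9.0])

# Each log difference has magnitude ln(100) - ln(10) = ln(10),
# so the RMSLE equals ln(10).
rmsle = root_mean_squared_log_error(y_true, y_pred)
print(rmsle, np.log(10))

# np.log1p is undefined below -1, so a strongly negative prediction gives NaN.
negative_pred = np.array([-2.0, 9.0])
with np.errstate(invalid="ignore"):
    rmsle_negative = root_mean_squared_log_error(y_true, negative_pred)

# Assumption: clip negative predictions to zero before computing the metric.
rmsle_clipped = root_mean_squared_log_error(y_true, np.clip(negative_pred, 0, None))
print(rmsle_negative, rmsle_clipped)
```

Because the metric works on log differences, it penalizes relative rather than absolute errors: predicting 99 instead of 9 costs as much as predicting 9 instead of 99.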

## Step 9: Handle Skewed Training Data with Log Transform

- Recreate the linear regression pipeline and fit it to the log-transformed target
- Create predictions using the new model
- Invert the log transformation on the predictions
- Evaluate the model performance using the calculate_metrics() function and compare the results

In [14]:
y_train_log_transformed = np.log1p(y_train)

In [None]:
pipeline = build_pipeline(LinearRegression())
pipeline.fit(X_train, y_train_log_transformed)

predictions = pipeline.predict(X_test)
predictions = np.expm1(predictions)

calculate_metrics(y_test, predictions)
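
The manual transform-fit-invert pattern above can also be delegated to Scikit-Learn's TransformedTargetRegressor, which applies func to the target before fitting and inverse_func to the predictions. The sketch below is an alternative, not the notebook's approach. It uses synthetic data and a bare LinearRegression instead of the full pipeline to stay short.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic features and a skewed, strictly positive target (made-up relation).
X = rng.uniform(0, 1, size=(200, 2))
y = np.expm1(1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1])

# Manual approach, as in the notebook: transform, fit, invert.
manual = LinearRegression().fit(X, np.log1p(y))
manual_predictions = np.expm1(manual.predict(X))

# Wrapped approach: the regressor handles both transformations itself.
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
wrapped.fit(X, y)
wrapped_predictions = wrapped.predict(X)

print(np.allclose(manual_predictions, wrapped_predictions))
```

One advantage of the wrapper is that the inverse transformation cannot be forgotten at prediction time, since it is part of the fitted object.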

## Step 10: Different Model - Random Forest Regressor

- Create a random forest regressor pipeline.
- Fit the pipeline, create predictions, and evaluate the model.
- Compare the performance of the two model types.

In [None]:
pipeline = build_pipeline(RandomForestRegressor(random_state=RANDOM_STATE))
pipeline.fit(X_train, y_train_log_transformed)

predictions = pipeline.predict(X_test)
predictions = np.expm1(predictions)

calculate_metrics(y_test, predictions)
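
Comparing models by reading printed output becomes tedious as the number of runs grows. One possible approach, sketched below, is to return the metrics as a dictionary and collect them in a DataFrame. The function collect_metrics, the model names, and all values are hypothetical. np.sqrt(mean_squared_error(...)) is used here as an equivalent of root_mean_squared_error.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def collect_metrics(y_true, y_pred):
    # Hypothetical variant of calculate_metrics() that returns the values
    # instead of printing them.
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "r2": float(r2_score(y_true, y_pred)),
        "mae": float(mean_absolute_error(y_true, y_pred)),
    }


# Made-up ground truth and predictions for two placeholder models.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
made_up_predictions = {
    "model_a": np.array([12.0, 18.0, 33.0, 37.0]),
    "model_b": np.array([10.0, 20.0, 30.0, 40.0]),
}

# One row per model, one column per metric.
comparison = pd.DataFrame({
    name: collect_metrics(y_true, y_pred)
    for name, y_pred in made_up_predictions.items()
}).T

print(comparison)
```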