# ML2 Online Learning: Bike Sharing


**IMPORTANT!!**: Notebook Presentation
- Include plots whenever possible to support explanations
- Markdown cells should be informative and well written

---

*Table of Contents*:
1.  [Introduction and Objectives](#1-introduction-and-objectives)
2.  [Data Preparation](#2-data-preparation)
3.  [Concept Drift](#3-concept-drift)
4.  [Batch Learning](#4-batch-learning)
5.  [Stream Learning](#5-stream-learning)
6.  [Results](#6-results)
7.  [Conclusions](#conclusions)

In [18]:
# Import packages (no random seeds are set in this cell yet)
import pandas as pd
from river import stream
from rich import print

## 1. Introduction and Objectives

We must make sure to answer these questions:

- Problem description 
    - In non-ML terms (**Done**)
    - In ML terms:
        - What is the problem? (**Done**)
        - Type of problem (regression) (**Done**)
        - Is the dataset imbalanced? (**Done**)
        - Is the dataset influenced by drift? (**Done**)
        - Chosen metrics and justification (**Done. May need to be modified/expanded later**)
        - Assumptions for addressing the problem?
- Dataset Selection
    -  Justification + explanation of suitability (**Done**)
    -  Mention that this is an external dataset (**Done**)
    -  Does the dataset require preparation?
        -  "A prepared dataset is allowed but may not receive maximum points."
        -  We should ask the professor about this. Does grouping the data into 2-hour intervals count as preparation?

### Problem Description
Bike sharing systems have become increasingly popular in many cities around the world as an eco-friendly mode of transportation. Accurate predictions of bike demand are needed to ensure availability and to optimize distribution. Therefore, predicting future bike rental activity can help improve the overall efficiency of these systems.

In this project, we use the *Bike Sharing Dataset* from Kaggle (https://www.kaggle.com/datasets/alfiansyahputrans/bike-sharing-dataset) to develop a model that predicts the number of bike rentals based on several features such as weather conditions, time of day, and day of the week, among others. The dataset consists of two CSV files: one containing daily observations and another providing hourly records. We chose to work with the hourly dataset, as it contains a significantly larger number of records and offers finer temporal granularity, a helpful feature when predicting demand patterns in a stream learning setting.

From a machine learning perspective, we are addressing a regression problem, where the target variable is a numerical value representing the number of bike rentals. The dataset provides real-world data that varies over the course of two years (2011 and 2012) due to different temporal factors. This makes it well suited for a stream learning task, in which data arrives sequentially over time and the model must incrementally adapt to evolving patterns.

Looking at the distribution of the target variable, moderate values are far more common than extreme ones: the median of `cnt` is 142 rentals per hour, while the maximum is 977. This imbalance may hinder the learning process, as the model may be biased towards predicting the more common values.

The dataset may also be affected by concept drift, since bike rental demand can change over time due to factors such as changes in weather patterns, seasonal variations, or even changes in the popularity of the bike sharing system itself.

To evaluate the model's performance, we use the Mean Absolute Error (MAE) as the primary metric. MAE measures the average magnitude of the errors in a set of predictions, expressed in the same units as the target (bike rentals). It is a common metric for regression problems and is less sensitive to outliers than squared-error metrics such as MSE or RMSE, which matters because unexpected events may produce extreme rental counts.
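To make the robustness argument concrete, the sketch below computes MAE and RMSE by hand on a handful of made-up hourly counts (the numbers are hypothetical, not taken from the dataset). A single badly missed spike dominates RMSE much more than MAE.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: the average of |y - y_hat|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, shown only for comparison."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Hypothetical rental counts; the last one mimics an unexpected event.
y_true = [100, 120, 140, 160, 900]
y_pred = [110, 110, 150, 150, 200]

mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(f"MAE:  {mae_value:.1f}")   # four errors of 10 and one of 700
print(f"RMSE: {rmse_value:.1f}")  # the squared outlier dominates
```

In the notebook itself, the same quantity would be tracked incrementally with a metric from River's `metrics` module rather than with these helper functions.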

## 2. Data Preparation

- Brief description about how the data was studied (justification of data type conversions)
- Has the data been normalized or standardized? Why?
- In case the dataset contains nominal features, has one-hot encoding been performed? (This doesn't apply to our case)
- Is the definition of new features required? + Explanation
- Is the categorization of any features required? 
- Specific adaptations to the selected problem.
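On the normalization question: in a streaming setting the global mean and variance are not known in advance, so scaling has to be updated one sample at a time. The sketch below is a hand-written illustration using Welford's running mean and variance; the class name `RunningStandardizer` is hypothetical, and the five input values are simply the min, quartiles, and max of `windspeed` from the summary table. River provides `preprocessing.StandardScaler` for the same purpose, which is what a pipeline would normally use.

```python
import math


class RunningStandardizer:
    """Standardize one feature using a running mean and variance (Welford)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def learn_one(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform_one(self, x):
        if self.n < 2:
            return 0.0  # not enough data to estimate a spread yet
        std = math.sqrt(self.m2 / self.n)
        return (x - self.mean) / std if std > 0 else 0.0


scaler = RunningStandardizer()
scaled = []
for value in [0.0, 0.1045, 0.194, 0.2537, 0.8507]:
    scaler.learn_one(value)                      # update statistics first
    scaled.append(scaler.transform_one(value))   # then scale the sample

print([round(z, 3) for z in scaled])
```

Note that `temp`, `atemp`, and `hum` already lie in the [0, 1] range according to the summary statistics, so whether further scaling helps is something to check empirically.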

---
- Should we add plots such as registered vs month?

In [11]:
dataset = "./data/hour.csv"
data = pd.read_csv(dataset)
display(data.head())
display(data.describe())
print(data.dtypes)

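| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |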


| | instant | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 |
| mean | 8690.0 | 2.50164 | 0.502561 | 6.537775 | 11.546752 | 0.02877 | 3.003683 | 0.682721 | 1.425283 | 0.496987 | 0.475775 | 0.627229 | 0.190098 | 35.676218 | 153.786869 | 189.463088 |
| std | 5017.0295 | 1.106918 | 0.500008 | 3.438776 | 6.914405 | 0.167165 | 2.005771 | 0.465431 | 0.639357 | 0.192556 | 0.17185 | 0.19293 | 0.12234 | 49.30503 | 151.357286 | 181.387599 |
| min | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 4345.5 | 2.0 | 0.0 | 4.0 | 6.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.34 | 0.3333 | 0.48 | 0.1045 | 4.0 | 34.0 | 40.0 |
| 50% | 8690.0 | 3.0 | 1.0 | 7.0 | 12.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.5 | 0.4848 | 0.63 | 0.194 | 17.0 | 115.0 | 142.0 |
| 75% | 13034.5 | 3.0 | 1.0 | 10.0 | 18.0 | 0.0 | 5.0 | 1.0 | 2.0 | 0.66 | 0.6212 | 0.78 | 0.2537 | 48.0 | 220.0 | 281.0 |
| max | 17379.0 | 4.0 | 1.0 | 12.0 | 23.0 | 1.0 | 6.0 | 1.0 | 4.0 | 1.0 | 1.0 | 1.0 | 0.8507 | 367.0 | 886.0 | 977.0 |


In [12]:
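# Stream the rows one at a time; without `target`, iter_csv yields None as the label
hours = stream.iter_csv(dataset, target="cnt", converters={"cnt": int})
sample, target = next(hours)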
print(sample)
print(target)

## 3. Concept Drift

- Use at least two concept drift detectors (verify if the proposed problem is suitable for online learning)
    - Which ones?
    - Add a brief description of why these detectors.
    - Exemplify drift in plots
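As a reference for what a detector does internally, here is a minimal hand-rolled sketch of the Page-Hinkley test on a synthetic stream whose mean jumps halfway through. It is not River's implementation: the class name, the parameter values, and the synthetic data are all assumptions chosen for illustration, and this version only flags upward shifts. In the notebook, River's `drift.ADWIN` and `drift.PageHinkley` would be the natural candidates.

```python
import random


class PageHinkleySketch:
    """Hand-rolled Page-Hinkley test for an upward shift in the mean."""

    def __init__(self, delta=0.005, threshold=5.0, min_samples=30):
        self.delta = delta            # tolerated magnitude of change
        self.threshold = threshold    # alarm level for the test statistic
        self.min_samples = min_samples
        self.n = 0
        self.mean = 0.0
        self.cum_sum = 0.0
        self.min_cum_sum = 0.0

    def update(self, x):
        """Feed one value; return True when a drift is flagged."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum_sum += x - self.mean - self.delta
        self.min_cum_sum = min(self.min_cum_sum, self.cum_sum)
        if self.n < self.min_samples:
            return False
        return self.cum_sum - self.min_cum_sum > self.threshold


# Synthetic stream: the mean jumps from 0.2 to 0.6 at index 500.
rng = random.Random(42)
stream_values = [rng.gauss(0.2, 0.05) for _ in range(500)]
stream_values += [rng.gauss(0.6, 0.05) for _ in range(500)]

detector = PageHinkleySketch()
first_drift_at = None
for i, value in enumerate(stream_values):
    if detector.update(value):
        first_drift_at = i
        break

print(f"First drift flagged at index: {first_drift_at}")
```

The statistic grows only while recent values stay above the running mean, so the alarm fires shortly after the change point rather than exactly at it. Plotting the stream with a vertical line at the flagged index is one simple way to exemplify the drift.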

## 4. Batch Learning

Steps:
1. ~~Load the dataset~~ (It should already be loaded)
2. Split into train, val, test
    - Is the split correctly made, i.e., if required, is the data stratified or grouped? Tip: Batch learning can be done by defining the pipelines in River and using the built-in wrapper to perform the remaining operations.
    - Is a cross-validation mechanism used?
3. Train with ML method(s)
    - Have any model hyperparameters been tuned?
    - Have different models been compared? Have the models been correctly adjusted/compared? Make sure no test data is used in the training/validation phase.
4. Evaluate
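Because the records are ordered in time, one reasonable option for step 2 is a chronological split, so that validation and test data always come after the training data and no future information leaks backwards. The sketch below assumes a 60/20/20 split; the function name and the fractions are illustrative choices, not requirements of the assignment.

```python
def chronological_split(records, train_frac=0.6, val_frac=0.2):
    """Split time-ordered records into train/val/test without shuffling."""
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    n = len(records)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return records[:train_end], records[train_end:val_end], records[val_end:]


# Stand-in for the 17,379 hourly rows: just their positional indices.
rows = list(range(17379))
train, val, test = chronological_split(rows)
print(len(train), len(val), len(test))
```

With a pandas DataFrame the same slicing can be done with `data.iloc[...]`, provided the rows are sorted by `instant` first.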

**Important**: Comment on how the technique deals with concept drift

## 5. Stream Learning

- Implement a River pipeline
- Choose a suitable metric from the API
- Use **at least** three complementary ML models
    - One should be a Hoeffding Tree
    - How should your model(s) be modified? (Default values might not be suitable)
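Whatever models end up in the pipelines, they are typically evaluated with progressive validation (test-then-train): each sample is first used to score a prediction and only afterwards to update the model. The sketch below shows that loop in plain Python with a tiny SGD linear regressor as a stand-in model. The classes, the synthetic stream, and the learning rate are all assumptions for illustration; in River the same loop is provided by `evaluate.progressive_val_score`, and the models would be River estimators such as `tree.HoeffdingTreeRegressor`.

```python
import math
import random


class OnlineLinearRegressor:
    """Tiny SGD linear regressor, a stand-in for a real incremental model."""

    def __init__(self, n_features, lr=0.05):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.lr = lr

    def predict_one(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def learn_one(self, x, y):
        error = self.predict_one(x) - y
        self.bias -= self.lr * error
        self.weights = [w - self.lr * error * xi
                        for w, xi in zip(self.weights, x)]


def synthetic_stream(n, seed=0):
    """Made-up stream: cyclic hour features plus one weather-like feature."""
    rng = random.Random(seed)
    for i in range(n):
        angle = 2 * math.pi * (i % 24) / 24
        x = [math.sin(angle), math.cos(angle), rng.random()]
        y = 2.0 + 1.5 * x[0] - 1.0 * x[1] + 3.0 * x[2] + rng.gauss(0, 0.1)
        yield x, y


def progressive_validation(model, stream):
    """Test-then-train: predict each sample first, then learn from it."""
    abs_error_sum = 0.0
    n = 0
    for x, y in stream:
        y_pred = model.predict_one(x)      # 1. predict before seeing the label
        abs_error_sum += abs(y - y_pred)   # 2. accumulate the absolute error
        n += 1
        model.learn_one(x, y)              # 3. only now learn from the sample
    return abs_error_sum / n


stream_mae = progressive_validation(OnlineLinearRegressor(3), synthetic_stream(2000))
print(f"Progressive MAE: {stream_mae:.3f}")
```

Because every prediction is made before the corresponding label is seen, the resulting MAE is an honest estimate of online performance and needs no separate test set.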

**Evaluation**: 
- Does the notebook contain at least 3 stream learning pipelines with their corresponding models?
- Are pipelines used correctly in the solution? 
- Is one of the models a Hoeffding Tree? 
- Are the metrics selected suitable to evaluate the performance of the models? 

## 6. Results
**Evaluation**: 
- Is there a plot (or plots) that compares the batch learning results with those from the stream learning approaches?
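One way to put batch and stream results on the same axis is to compute the same rolling MAE over both prediction series and plot the two curves against time. The helper below is a minimal sketch of that idea; the function name, the window size, and the example numbers are hypothetical.

```python
from collections import deque


def rolling_mae(y_true, y_pred, window=24):
    """MAE over a sliding window, returning one value per time step."""
    errors = deque(maxlen=window)
    curve = []
    for t, p in zip(y_true, y_pred):
        errors.append(abs(t - p))
        curve.append(sum(errors) / len(errors))
    return curve


# Made-up targets and predictions, with a window of 2 for readability.
curve = rolling_mae([10, 20, 30, 40], [12, 18, 33, 40], window=2)
print(curve)
```

Applying `rolling_mae` to the batch model's test predictions and to each stream model's progressive predictions over the same time span gives curves that can be drawn together, which also makes any degradation after a drift visible.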

## Conclusions

**Evaluation**:
- Are the conclusions supported by the results in the notebook? 
- Do the results and conclusions offer some open questions and future work? 