# Bike Sharing Demand Prediction with AutoGluon

## Project Overview

This report details the process and findings of using the AutoGluon library for tabular prediction to forecast bike-sharing demand, as part of the Kaggle "Bike Sharing Demand" competition. The objective was to leverage AutoGluon's automated machine learning capabilities to build robust predictive models and optimize performance, ultimately aiming for a strong score on the Kaggle leaderboard.

Predicting bike-sharing demand is a highly relevant problem for companies operating on-demand services, enabling them to anticipate demand fluctuations, optimize resource allocation, and enhance customer experience.

## 1. Dataset Loading and Initial Analysis

The project began by loading the `train.csv`, `test.csv`, and `sampleSubmission.csv` datasets provided by the competition into Pandas DataFrames. The `datetime` column was parsed directly into datetime objects upon loading, ensuring proper temporal handling.
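A minimal sketch of this loading step. To stay self-contained it reads two inline rows laid out like `train.csv` (the row values are illustrative only); the commented lines show the equivalent calls on the real files, assuming they sit in the working directory.

```python
import io

import pandas as pd

# Two illustrative rows in the train.csv column layout (values are placeholders).
SAMPLE_CSV = io.StringIO(
    "datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,"
    "casual,registered,count\n"
    "2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16\n"
    "2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40\n"
)

# parse_dates converts the column to datetime64 while loading.
train = pd.read_csv(SAMPLE_CSV, parse_dates=["datetime"])

# With the real competition files the calls would look like this:
# train = pd.read_csv("train.csv", parse_dates=["datetime"])
# test = pd.read_csv("test.csv", parse_dates=["datetime"])
# submission = pd.read_csv("sampleSubmission.csv", parse_dates=["datetime"])

print(train.dtypes["datetime"])
```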

Initial inspection of the datasets (`df.head()` and `df.info()`) revealed the presence of features such as `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`, and the target variable `count` (along with `casual` and `registered` which sum up to `count` in the training data).

A time-series plot of the bike rental count was generated to visualize the demand patterns over time:

**Bike Rental Count Over Time**

![Bike Rental Count Over Time Plot](model_train_score.png)

This visualization clearly showed strong seasonality (yearly and monthly cycles) and daily patterns, including peaks during morning and evening rush hours, which are critical insights for feature engineering.

## 2. Feature Engineering and Data Preprocessing: Discoveries and Impact on Performance

The initial exploratory data analysis (EDA) showed that much of the temporal signal sits in the raw `datetime` column, so new features were engineered from it to capture those patterns. The following steps shaped model performance and the resulting Kaggle score:

* **Temporal Features**: The time-series plot revealed strong periodicity. This led to the extraction of:
    * `year`: To capture yearly trends (e.g., changes in bike usage over different years).
    * `month`: To account for monthly variations (e.g., summer vs. winter demand).
    * `day`: To capture daily patterns within a month.
    * `hour`: Crucially, to identify intra-day demand spikes (e.g., rush hours).
    * `dayofweek`: To distinguish demand on weekdays vs. weekends.
    * `weekofyear`: To capture weekly seasonality.
    * `is_weekend`: A binary feature derived from `dayofweek` to explicitly highlight weekend vs. weekday.

* **Data Leakage Prevention**: The `casual` and `registered` columns were dropped from the training set. EDA confirmed they sum to `count`, so they encode the target directly. They are also absent from `test.csv`, so a model that relied on them could not be applied at prediction time, and any validation score computed with them would be artificially inflated.

* **Categorical Type Conversion**: Features like `season`, `holiday`, `workingday`, `weather`, and the newly extracted temporal features (`year`, `month`, `day`, `hour`, `dayofweek`, `is_weekend`, `weekofyear`) were explicitly converted to the `category` data type. This ensured AutoGluon treated them as discrete categories rather than continuous numerical values, which is appropriate for these types of features and allows models to learn distinct patterns associated with each category.

* **Log Transformation of Target**: Histograms of the `count` variable showed a heavily skewed distribution (many low counts, few high counts). A **log1p transformation** (`np.log1p()`) was applied to the `count` target variable, which turned the skewed distribution into a more symmetrical shape. The competition is scored with Root Mean Squared Logarithmic Error (RMSLE), so minimizing Root Mean Squared Error (RMSE) on the log1p-transformed target optimizes the leaderboard metric directly. The inverse transformation (`np.expm1()`) was applied to predictions before submission.

These strategic feature engineering and preprocessing steps directly provided AutoGluon's models with a richer and more appropriate representation of the data, which was essential for achieving a competitive Kaggle score.
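The preprocessing steps above can be sketched as follows. The original notebook code is not shown in this report, so the helper names `add_time_features`, `preprocess`, and `CATEGORICAL_COLS` are hypothetical, and the tiny `demo` frame holds placeholder values.

```python
import numpy as np
import pandas as pd

# Columns converted to the pandas "category" dtype, as listed above.
CATEGORICAL_COLS = [
    "season", "holiday", "workingday", "weather",
    "year", "month", "day", "hour", "dayofweek", "is_weekend", "weekofyear",
]


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with temporal features extracted from `datetime`."""
    out = df.copy()
    dt = out["datetime"].dt
    out["year"] = dt.year
    out["month"] = dt.month
    out["day"] = dt.day
    out["hour"] = dt.hour
    out["dayofweek"] = dt.dayofweek  # Monday=0 ... Sunday=6
    out["weekofyear"] = dt.isocalendar().week.astype(int)
    out["is_weekend"] = (out["dayofweek"] >= 5).astype(int)
    return out


def preprocess(df: pd.DataFrame, is_train: bool) -> pd.DataFrame:
    """Apply feature extraction, category conversion and target handling."""
    out = add_time_features(df)
    for col in CATEGORICAL_COLS:
        out[col] = out[col].astype("category")
    if is_train:
        # casual + registered == count, so both columns leak the target.
        out = out.drop(columns=["casual", "registered"])
        out["count"] = np.log1p(out["count"])
    return out


# Placeholder rows: a Saturday morning and a Monday evening.
demo = pd.DataFrame({
    "datetime": pd.to_datetime(["2011-01-01 08:00:00", "2011-01-03 17:00:00"]),
    "season": [1, 1], "holiday": [0, 0], "workingday": [0, 1], "weather": [1, 2],
    "temp": [9.84, 10.66], "atemp": [14.395, 12.88],
    "humidity": [81, 70], "windspeed": [0.0, 6.0],
    "casual": [3, 10], "registered": [13, 90], "count": [16, 100],
})
train_ready = preprocess(demo, is_train=True)
print(train_ready.dtypes)
```

Calling `preprocess(test_df, is_train=False)` on the test frame would apply the same feature extraction while leaving out the target-specific steps.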


## 3. Model Training with AutoGluon

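AutoGluon's `TabularPredictor` was used to train the predictive model. The `label` was set to `'count'` (already log1p-transformed), and the `eval_metric` was specified as `'root_mean_squared_error'` (RMSE). Kaggle scores this competition with RMSLE; RMSE computed on the log1p-transformed target is the same quantity, so the validation metric matches the competition's evaluation metric.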

The `predictor.fit()` method was invoked with the following key configurations:

* `train_data`: The preprocessed training DataFrame.

* `presets='best_quality'`: This powerful preset instructs AutoGluon to train a wide array of diverse models (e.g., LightGBM, CatBoost, XGBoost, Neural Networks, Random Forests, Extra Trees, KNN) and combine them into multi-layer stack ensembles with bagging. This strategy aims for the highest possible predictive accuracy by leveraging the strengths of different algorithms and reducing variance through ensembling.

* `time_limit=3600`: A time limit of 1 hour (3600 seconds) was set for the training process.
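A hedged sketch of the training call with the configuration listed above. The wrapper `train_predictor` and the constant names are hypothetical; the AutoGluon import is placed inside the function so the snippet can be loaded without AutoGluon installed, and the actual fit is left commented out because it runs for up to an hour.

```python
import pandas as pd

LABEL = "count"  # log1p-transformed target column
EVAL_METRIC = "root_mean_squared_error"
FIT_KWARGS = {"presets": "best_quality", "time_limit": 3600}  # 3600 s = 1 hour


def train_predictor(train_df: pd.DataFrame):
    """Fit a TabularPredictor on the preprocessed training frame."""
    # Imported lazily so this module loads even where AutoGluon is missing.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=LABEL, eval_metric=EVAL_METRIC)
    predictor.fit(train_data=train_df, **FIT_KWARGS)
    return predictor


# Usage (long-running, so left commented out):
# predictor = train_predictor(train_ready)
# predictor.fit_summary()
# print(predictor.leaderboard())
```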

The `predictor.fit_summary()` provided a comprehensive overview of the training run, including the performance of individual base models and various ensemble levels. The `leaderboard` showed the performance of each trained model on the validation set.

**Best Performing Model (Validation):**
From the `fit_summary()` and `leaderboard()` output:

* The top-performing model on the validation set was `WeightedEnsemble_L3` with a `score_val` of **-0.254243**. AutoGluon negates error metrics so that higher is always better, so this corresponds to a **Root Mean Squared Error (RMSE) of 0.254243** on the log1p-transformed validation targets. This ensemble combines the predictions of the lower-level models in the stack.
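The conversion from `score_val` to RMSE can be illustrated with a one-row stand-in for the leaderboard. The frame below is built by hand from the single value reported above; it is not the actual `leaderboard()` output, which contains more models and columns.

```python
import pandas as pd

# Hand-built stand-in for predictor.leaderboard(), using the reported value.
leaderboard = pd.DataFrame(
    {"model": ["WeightedEnsemble_L3"], "score_val": [-0.254243]}
)

# score_val is in higher-is-better form; flip the sign to recover the RMSE.
leaderboard["rmse_val"] = -leaderboard["score_val"]

# The best model has the highest score_val (the smallest RMSE).
best = leaderboard.sort_values("score_val", ascending=False).iloc[0]
print(best["model"], best["rmse_val"])
```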

## 4. Model Performance Comparison

### 4.1. Model Training Performance (Validation RMSE)

The plot below illustrates the validation RMSE of the model iterations. For this report, "initial" represents the performance of the `best_quality` AutoGluon ensemble after initial feature engineering and target transformation.

**Model Training Scores (Validation RMSE)**

![Model Training Scores (Validation RMSE)](model_train_score.png)

As seen, the validation RMSE achieved by the `best_quality` AutoGluon ensemble was **0.254243**, measured on the log1p-transformed target.

### 4.2. Kaggle Competition Score (Public RMSLE)

Predictions on the test set were inverse-transformed with `np.expm1()`, clipped to be non-negative, and converted to integer values. The resulting `submission.csv` file was then submitted to the Kaggle competition.
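A minimal sketch of this post-processing, assuming the predictions arrive on the log1p scale. The helper `make_submission` is hypothetical, and rounding to the nearest integer is an assumption: the report only states that integer values were ensured.

```python
import numpy as np
import pandas as pd


def make_submission(datetimes: pd.Series, log_preds) -> pd.DataFrame:
    """Turn log1p-scale predictions into a Kaggle-style submission frame."""
    counts = np.expm1(np.asarray(log_preds, dtype=float))  # invert log1p
    counts = np.clip(counts, 0, None)       # demand cannot be negative
    counts = np.round(counts).astype(int)   # assumption: round to nearest
    return pd.DataFrame({"datetime": datetimes.to_numpy(), "count": counts})


# Placeholder timestamps and predictions, including one slightly negative value.
demo_times = pd.Series(pd.to_datetime([
    "2011-01-20 00:00:00", "2011-01-20 01:00:00",
    "2011-01-20 02:00:00", "2011-01-20 03:00:00",
]))
demo_log_preds = [-0.2, 0.0, np.log1p(16), np.log1p(99.6)]

submission = make_submission(demo_times, demo_log_preds)
print(submission)

# Writing the file for upload:
# submission.to_csv("submission.csv", index=False)
```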

**Kaggle Test Scores (RMSE)**

![Kaggle Test Scores (RMSE)](model_test_score.png)

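The first submission to the Kaggle competition yielded a **Public Score of 0.37093**, which Kaggle reports as RMSLE on test data whose labels are hidden. Because the validation RMSE of 0.254243 was computed on the log1p scale, the two numbers are directly comparable, and the gap between them suggests the validation estimate was somewhat optimistic.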
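The relationship between the validation metric and the leaderboard metric can be checked numerically. The sketch below uses made-up counts and two hypothetical helpers, `rmse` and `rmsle`, to show that RMSLE on raw counts equals RMSE on log1p-transformed counts.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def rmsle(y_true, y_pred) -> float:
    """Root mean squared logarithmic error, using log(1 + x)."""
    return rmse(np.log1p(y_true), np.log1p(y_pred))


# Made-up hourly counts, for illustration only.
counts_true = np.array([16, 40, 32, 13, 1])
counts_pred = np.array([20, 35, 30, 10, 3])

on_raw_scale = rmsle(counts_true, counts_pred)
on_log_scale = rmse(np.log1p(counts_true), np.log1p(counts_pred))
print(on_raw_scale, on_log_scale)
```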

## 5. Hyperparameter Tuning and Impact: Explaining Performance Changes

AutoGluon's `best_quality` preset automates model selection and ensembling, which is a major advantage for rapid development and high accuracy. Instead of requiring a manual hyperparameter search for each algorithm, it trains each model family with configurations supplied by the preset and relies on bagging and multi-layer stacking for accuracy. An explicit hyperparameter search runs only when `hyperparameter_tune_kwargs` is passed to `fit()`, which is not among the configurations listed in Section 3.

Three choices in this first iteration shaped the model's performance and the resulting Kaggle score:

* **Feature Engineering**: As detailed in Section 2, the creation of specific temporal features (`year`, `month`, `day`, `hour`, `dayofweek`, `weekofyear`, `is_weekend`) provided the model with a richer understanding of the underlying patterns in bike demand. These changes affected the outcome by giving the models more direct signals related to seasonality and daily cycles, allowing them to learn more accurate relationships and reducing prediction error.
* **Target Transformation**: The `np.log1p()` transformation reduced the skew of the `count` variable. Training with RMSE on this scale also aligns the training objective with the competition's RMSLE metric, so errors are penalized in relative rather than absolute terms. This led to more stable learning and a lower score on the competition metric.
* **AutoGluon's `best_quality` Preset (Hyperparameter Strategy)**: This preset is a single high-level setting that bundles many modeling decisions. It affected the outcome through:
    * **Automated Model Selection**: It trained a diverse set of base models (LightGBM, CatBoost, XGBoost, Neural Networks, etc.), each with its own strengths. This diversity reduces the risk of relying on a single model's weaknesses.
    * **Ensembling and Stacking**: It automatically created multi-layer ensembles (like `WeightedEnsemble_L2` and `WeightedEnsemble_L3`). These ensembles combine the predictions of multiple models and average out individual model errors, which typically improves generalization and lowers RMSE on unseen data (and thus the Kaggle score).
    * **Preset Hyperparameter Configurations**: Each base model is trained with configurations supplied by the preset rather than hand-picked values. A per-model hyperparameter search is available through `hyperparameter_tune_kwargs`, but it is not part of the configuration listed in Section 3.

The table below outlines these key "hyperparameter uses" (or strategic changes) along with the Kaggle score received from this iteration:

**Hyperparameter Table**



| model   | hpo_strategy_1         | hpo_strategy_2              | hpo_strategy_3     | score   |
| :------ | :--------------------- | :-------------------------- | :----------------- | :------ |
| initial | AG_best_quality_preset | DateTime_Features_Extracted | Log_Transform_Target | 0.37093 |

## Conclusion

This project successfully demonstrated the application of AutoGluon for predicting bike-sharing demand. The robust feature engineering, including the extraction of temporal features and log transformation of the target, combined with AutoGluon's `best_quality` preset and its automated ensembling capabilities, resulted in a strong initial performance on the Kaggle competition with a Public Score of **0.37093**.

Further improvements could be explored through:

* **More Advanced Feature Engineering**: Creating additional domain-specific features (e.g., rush hour categories, temperature categories, wind/humidity categories as suggested in the project, interactions between features).

* **Extended Training Time**: Allowing AutoGluon more time (`time_limit`) could lead to even better models as it could explore more complex ensembles or perform more extensive hyperparameter searches.

* **Specific Model Tuning**: While `best_quality` is comprehensive, more targeted fine-tuning of specific top-performing algorithms (like LightGBM or XGBoost) or custom ensemble configurations could potentially yield marginal gains.

* **Cross-Validation Strategy**: Exploring different cross-validation setups within AutoGluon if default bagging is not sufficient for certain data characteristics.
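As an illustration of the first suggestion, the sketch below adds a rush-hour flag and a temperature band. The helper `add_domain_features`, the rush-hour windows (7 to 9 and 17 to 19 on working days), and the temperature cut points (10 and 25) are all assumptions chosen for the example, not values taken from this project.

```python
import pandas as pd

# Assumed rush-hour windows; tune these against the hourly demand profile.
RUSH_HOURS = [7, 8, 9, 17, 18, 19]


def add_domain_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a rush-hour flag and a temperature band."""
    out = df.copy()
    hour = out["hour"].astype(int)
    is_workday = out["workingday"].astype(int) == 1
    out["rush_hour"] = (is_workday & hour.isin(RUSH_HOURS)).astype(int)
    # Assumed cut points on the `temp` column: cold <= 10 < mild <= 25 < hot.
    out["temp_band"] = pd.cut(
        out["temp"],
        bins=[-float("inf"), 10, 25, float("inf")],
        labels=["cold", "mild", "hot"],
    )
    return out


# Placeholder rows: workday rush hour, weekend morning, workday midday.
demo_rows = pd.DataFrame({
    "hour": [8, 8, 13],
    "workingday": [1, 0, 1],
    "temp": [5.0, 18.0, 30.0],
})
extra = add_domain_features(demo_rows)
print(extra)
```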