## **Problem Statement**

In this tutorial, we will work with the **Rossmann Store Sales** dataset to learn and apply gradient boosting methods for real-world forecasting.
Rossmann operates over 3,000 drug stores across several European countries, and accurate sales forecasts are essential for staff planning, inventory decisions, and store management.

The dataset we use **is not the official Kaggle competition dataset**.
It is a publicly shared copy uploaded by a Kaggle community user:

📌 **Dataset Source:**
**"Rossmann Store Sales", uploaded by Kaggle user Pratyusha Kar**
[https://www.kaggle.com/datasets/pratyushakar/rossmann-store-sales](https://www.kaggle.com/datasets/pratyushakar/rossmann-store-sales)

Although this is not the original competition page, the data follows the structure of the original challenge closely and is well suited to learning and experimentation.

Our goal remains the same as the original forecasting task:
to build a machine learning model that **predicts the `Sales` column** based on historical data, promotions, holidays, competition, seasonality, and store metadata.

This project focuses on:

* **Data preprocessing**
* **Feature engineering**
* **Handling time-related data**
* **Training gradient boosting models** (XGBoost, LightGBM, CatBoost)
* **Evaluating predictions**
* **Understanding real-world forecasting challenges**
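
As a rough illustration of the "handling time-related data" step, the sketch below parses a `Date` column and splits it into calendar parts that a gradient boosting model can use directly. The toy frame, the `split_date` helper, and the new column names (`Year`, `Month`, `Day`, `WeekOfYear`) are assumptions made for this example; only the `Date` and `Sales` column names come from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy_df = pd.DataFrame({"Date": ["2015-07-31", "2015-01-01"], "Sales": [5263, 0]})


def split_date(df):
    """Return a copy with Date parsed and calendar parts added as columns.

    Hypothetical helper: the new column names are a choice made for this sketch.
    """
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    out["Year"] = out["Date"].dt.year
    out["Month"] = out["Date"].dt.month
    out["Day"] = out["Date"].dt.day
    # ISO week number; cast to a plain int for downstream models
    out["WeekOfYear"] = out["Date"].dt.isocalendar().week.astype(int)
    return out


dated_df = split_date(toy_df)
print(dated_df)
```

Tree-based models cannot read a raw date string, so splitting it into numeric parts is one common way to expose weekly and yearly seasonality.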

In [None]:
import pandas as pd

In [None]:
ross_df = pd.read_csv('train.csv')    # historical sales, one row per store per day
store_df = pd.read_csv('store.csv')   # store metadata, one row per store
test_df = pd.read_csv('test.csv')     # assumes test.csv sits next to train.csv


Let's merge the information from `store_df` into `ross_df` and `test_df`.

In [6]:
merged_df = ross_df.merge(store_df, how='left', on='Store')
merged_test_df = test_df.merge(store_df, how='left', on='Store')
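
To see what a left merge on `Store` does, here is a minimal sketch on toy data. The frames `sales` and `stores` and their values are made up for illustration. The `validate="many_to_one"` argument is an optional safeguard that is not part of the merge calls in this notebook; it makes pandas raise an error if a store appears more than once in the metadata table.

```python
import pandas as pd

# Hypothetical miniature versions of the sales and store tables.
sales = pd.DataFrame({"Store": [1, 1, 2], "Sales": [5263, 5020, 6064]})
stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# Left merge: every sales row is kept and the matching store metadata is attached.
# validate="many_to_one" raises MergeError if `stores` has duplicate Store keys.
merged = sales.merge(stores, how="left", on="Store", validate="many_to_one")

# A left merge against unique keys must not change the number of rows.
assert len(merged) == len(sales)
print(merged)
```

The same row-count check can be applied to `merged_df` and `merged_test_df` to confirm that the merge did not duplicate any rows.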

In [8]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 18 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   Store                      1017209 non-null  int64  
 1   DayOfWeek                  1017209 non-null  int64  
 2   Date                       1017209 non-null  object 
 3   Sales                      1017209 non-null  int64  
 4   Customers                  1017209 non-null  int64  
 5   Open                       1017209 non-null  int64  
 6   Promo                      1017209 non-null  int64  
 7   StateHoliday               1017209 non-null  object 
 8   SchoolHoliday              1017209 non-null  int64  
 9   StoreType                  1017209 non-null  object 
 10  Assortment                 1017209 non-null  object 
 11  CompetitionDistance        1014567 non-null  float64
 12  CompetitionOpenSinceMonth  693861 non-null   float64
 13  CompetitionO

> **EXERCISE**: Perform exploratory data analysis and visualization on the dataset. Study the distribution of values in each column, and their relationship with the target column `Sales`.
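
One possible starting point for the exercise is sketched below on a toy frame. The values are invented and only the column names (`Sales`, `Customers`, `Open`, `Promo`) come from the dataset. In the notebook, the same calls could be run on `merged_df` instead of `toy_df`.

```python
import pandas as pd

# Toy stand-in for merged_df; the numbers are illustrative only.
toy_df = pd.DataFrame({
    "Sales": [5263, 0, 8314, 6064, 0, 7200],
    "Customers": [555, 0, 821, 625, 0, 700],
    "Open": [1, 0, 1, 1, 0, 1],
    "Promo": [1, 0, 1, 0, 0, 1],
})

# Distribution of the target column
print(toy_df["Sales"].describe())

# Closed days have zero sales in this toy data, so compare promotions on open days only
open_df = toy_df[toy_df["Open"] == 1]
mean_sales_by_promo = open_df.groupby("Promo")["Sales"].mean()
print(mean_sales_by_promo)

# Linear correlation of each numeric column with the target
sales_corr = toy_df.corr()["Sales"].sort_values(ascending=False)
print(sales_corr)
```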