# TITLE HERE

Spring 2025 CMSC320 Final Project

Collaborators: Marvin Lin, Christopher Su, Tanish Bollam, Ayan Banerjee

### Contributions:

Marvin:

Chris:

Tanish:

Ayan:

## Introduction:

**Introduction**
For our final project in CMSC320, we are analyzing the Rossmann Store Sales dataset, originally published by Florian Knauer and Will Cukierski (2015) on Kaggle. This dataset contains historical sales data for over 1,000 Rossmann stores across several years, with features including daily sales figures, customer counts, promotional activity, store-specific attributes, and more.

**Research Question**
Our primary goal is to answer the question:
Can we accurately predict daily sales for a given store using historical data and store-level features?
To do this, we will develop a machine learning model that forecasts future store sales based on past trends and available metadata.

**Motivation**
Forecasting sales is a crucial task for retail businesses. Accurate predictions enable better inventory planning, workforce allocation, and promotional strategy. With the rise of data-driven decision-making, being able to reliably anticipate future demand gives stores a competitive edge. By tackling this problem, we hope to gain insight into how various factors—such as promotions, holidays, and store type—impact daily sales performance.



## Data Curation

This part of the project involves finding and collecting data relevant to our research question.

We will begin by importing relevant Python libraries necessary for this process.

**Imports**

In [2]:
import pandas as pd
import numpy as np

# Libraries for preprocessing and hypothesis testing
from sklearn.preprocessing import LabelEncoder
from scipy.stats import ttest_ind
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

Each of these libraries supports a different stage of the data science process:

* **Pandas** is a library used for data manipulation and analysis. It provides data structures such as DataFrame and Series that allow us to efficiently handle structured data.
* **Numpy** is a fundamental library for numerical computing. It offers support for arrays, mathematical functions, linear algebra, and statistical operations, among other functionality.
* **Scikit-learn** is a library primarily used for machine learning. It provides efficient tools for data analysis and modeling, covering tasks such as classification, regression, clustering, and dimensionality reduction.
* **Scipy-stats** is a module within the SciPy library that provides various functions for statistical computations. These include probability distributions, statistical tests, descriptive statistics, etc.
* **Matplotlib** and **seaborn** are libraries for creating visualizations in Python, such as line graphs, histograms, and bar charts.

Next, we must choose a relevant dataset for our topic. Based on our objective of predicting future store sales from historical data, we chose the [Rossmann Store Sales dataset](https://www.kaggle.com/competitions/rossmann-store-sales) to work with. Rossmann is one of the largest drug store chains in Europe, with over 4000 stores. The dataset includes historical sales data for 1,115 Rossmann stores, with key features such as Date, Customers, Sales, Store information, and Promotions that will be important for predicting future sales.

The data is split into the following files:
* **train.csv** - historical data including Sales.
* **test.csv** - historical data excluding Sales.
* **store.csv** - supplemental information about the stores.

After downloading these datasets, we can begin by loading them into pandas DataFrames.

In [10]:
# Load data into pandas DataFrames
train_df = pd.read_csv('DATASETS/train.csv', low_memory=False)
test_df = pd.read_csv('DATASETS/test.csv', low_memory=False)
store_df = pd.read_csv('DATASETS/store.csv', low_memory=False)

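Our analysis focuses on **train.csv** and **store.csv**. We set **test.csv** aside because it does not include the Sales column, so it cannot be used to measure prediction error. The next step is to combine the sales records with the store attributes into one DataFrame.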

## Data Preprocessing

In [None]:
df = pd.merge(train_df, store_df, on='Store', how='left')
test_df = pd.merge(test_df, store_df, on='Store', how='left')

The **train_df** and **test_df** dataframes only include a Store identifier. However, each store has associated attributes (StoreType, Assortment, CompetitionDistance, etc.) in store_df that are crucial for modeling sales. The how='left' parameter in pd.merge ensures that all rows from train_df are preserved. 

By doing a left join on the dataframes, we ensure every record in our training and testing data includes the relevant store features. This helps our machine learning model learn how differences between stores affect sales.
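
The behavior of the left join described above can be checked on a small example. The sketch below uses made-up toy tables (`sales` and `stores` are hypothetical stand-ins for **train_df** and **store_df**, not the real data). It also passes `validate="many_to_one"`, which is an optional safeguard we are assuming is useful here, not something the notebook itself does.

```python
import pandas as pd

# Toy stand-ins for train_df and store_df (values are made up for illustration).
sales = pd.DataFrame({
    "Store": [1, 1, 2, 3],
    "Sales": [5263, 5020, 6064, 8314],
})
stores = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
})

# how="left" keeps every row of `sales`; validate="many_to_one" raises a
# MergeError if a Store id appears more than once in `stores`.
merged = pd.merge(sales, stores, on="Store", how="left", validate="many_to_one")

# Store 3 has no row in `stores`, so its StoreType is filled with NaN.
print(merged)
print("Row count preserved:", len(merged) == len(sales))
```

Rows without a match in the store table are kept with missing attributes, which is one reason missing values may still need handling after the merge.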

In [12]:
# label encode all the following columns
cols = ['StateHoliday', 'StoreType', 'Assortment', 'PromoInterval']
for col in cols:
    df[col] = df[col].astype(str)
    df[col] = LabelEncoder().fit_transform(df[col])

In [13]:
# Cast the binary flag columns to integers
cols = ['Open', 'Promo', 'Promo2', 'SchoolHoliday']
for col in cols:
    df[col] = df[col].astype(int)
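
To make the label-encoding step above concrete, here is a minimal sketch on a made-up categorical column (`store_type` is a hypothetical sample, not taken from the dataset). It fills missing values with the string "Missing" before encoding, which is one possible way to keep them as their own category.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample of a categorical column with one missing value.
store_type = pd.Series(["a", "c", np.nan, "a", "d"])

encoder = LabelEncoder()
codes = encoder.fit_transform(store_type.fillna("Missing").astype(str))

# classes_ holds the sorted labels; the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

The integer codes carry no real ordering. Tree-based models can usually work with them, but a linear model will treat them as numeric magnitudes, so one-hot encoding may be a better fit there.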

In [None]:
df.dtypes

Store                          int64
DayOfWeek                      int64
Date                          object
Sales                          int64
Customers                      int64
Open                           int32
Promo                          int32
StateHoliday                   int32
SchoolHoliday                  int32
StoreType                      int32
Assortment                     int32
CompetitionDistance          float64
CompetitionOpenSinceMonth    float64
CompetitionOpenSinceYear     float64
Promo2                         int32
Promo2SinceWeek              float64
Promo2SinceYear              float64
PromoInterval                  int32
dtype: object



## Data Exploration & Summary Statistics

## ML Algorithm Design & Development

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Filter dataset
df_model = df[(df["Open"] == 1) & (df["Sales"] > 0)].copy()

# Extract date
df_model["Date"] = pd.to_datetime(df_model["Date"])
df_model["Year"] = df_model["Date"].dt.year
df_model["Month"] = df_model["Date"].dt.month
df_model["Day"] = df_model["Date"].dt.day

# label encoding
label_cols = ['StateHoliday', 'StoreType', 'Assortment', 'PromoInterval']
for col in label_cols:
    df_model[col] = df_model[col].fillna("Missing").astype(str)
    df_model[col] = LabelEncoder().fit_transform(df_model[col])

# remove unused columns
df_model.drop(["Date", "Customers", "Open"], axis=1, inplace=True)

# split features and target
X = df_model.drop("Sales", axis=1)
y = df_model["Sales"]

# handle missing values
imputer = SimpleImputer(strategy="mean")
X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

# models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, n_jobs=-1, random_state=42)
}

# score
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

rmse_scorer = make_scorer(rmse, greater_is_better=False)

# cross validate with k-fold (5 folds)
for name, model in models.items():
    print(f"--- {name} ---")
    scores = cross_val_score(model, X, y, cv=5, scoring=rmse_scorer)
    print(f"CV RMSE (mean ± std): {-scores.mean():.2f} ± {scores.std():.2f}\n")


--- Linear Regression ---
CV RMSE (mean ± std): 2800.26 ± 76.48

--- Random Forest ---
CV RMSE (mean ± std): 1132.34 ± 97.21

--- XGBoost ---
CV RMSE (mean ± std): 1818.25 ± 32.16



In our 5-fold cross-validation on the Rossmann sales data, Random Forest achieved the lowest error with a mean RMSE of 1132.34, well below both XGBoost (1818.25) and Linear Regression (2800.26). This suggests that Random Forest captures non-linear patterns in the dataset that a linear model misses. XGBoost was more consistent across folds (standard deviation of 32.16 versus 97.21 for Random Forest) but had a higher error, likely because its hyperparameters were not tuned. Linear Regression had the highest error, suggesting it is too simple for the sales dynamics. Overall, Random Forest is currently the most accurate of the three models, although its fold-to-fold variation is the largest.
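
As a possible refinement of the cross-validation above (a sketch under stated assumptions, not the procedure that produced the reported numbers), the imputer can be placed inside a scikit-learn pipeline so that it is refit on each training fold. The example runs on synthetic data (`X_demo`, `y_demo`, and `demo_models` are hypothetical names), uses scikit-learn's built-in `"neg_root_mean_squared_error"` scorer, and assumes shuffled folds.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with a non-linear target (NOT the Rossmann data).
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = 10 * np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(0, 0.5, size=300)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # inject some missing values

demo_models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Assumption: shuffled 5-fold split with a fixed seed for reproducibility.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in demo_models.items():
    # The imputer is refit on each training fold, so the held-out fold
    # does not influence the imputed means.
    pipeline = make_pipeline(SimpleImputer(strategy="mean"), model)
    scores = cross_val_score(
        pipeline, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
    )
    results[name] = (-scores.mean(), scores.std())
    print(f"{name}: CV RMSE (mean ± std): {-scores.mean():.2f} ± {scores.std():.2f}")
```

Because the sales data is ordered in time, a time-aware splitter such as `TimeSeriesSplit` from `sklearn.model_selection` may be a more realistic choice than shuffled folds when the goal is forecasting future sales.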

## ML Algorithm Training & Analysis

## Analysis & Conclusion