<a href="https://colab.research.google.com/github/samsung-ai-course/6-7-edition/blob/main/Supervised%20Learning/regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Store Sales - Time Series Forecasting
This notebook covers exploratory data analysis (EDA) and feature engineering for the Store Sales dataset.

## Dataset Overview
- **Goal:** Predict daily store sales.
- **Features:** Date, store information, promotions, and more.
- **Target:** Sales column.

### Exercises:
- Conduct EDA to understand trends and relationships.
- Engineer meaningful features to improve forecasting accuracy.

### Dataset Link:
Download the dataset from [Kaggle Store Sales Time Series Forecasting](https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data) or from the course GitHub repository, or read the files directly from GitHub.

## 1. Load and Explore the Data

In [337]:
# If you don't have the dataset yet, download and unzip it locally.

# 1. Download the archive
# !wget https://github.com/samsung-ai-course/6-7-edition/raw/main/Supervised%20Learning/Datasets/store-sales-time-series-forecasting.zip

# 2. Unzip the downloaded file
# !unzip store-sales-time-series-forecasting.zip -d store_sales_data

In [338]:
# If you want to unzip programmatically (assuming the archive is already in this folder, shown in the file panel on the left)
# !unzip Supervised\ Learning/Datasets/store-sales-time-series-forecasting.zip -d Supervised\ Learning/Datasets/


In [339]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
train = pd.read_csv("Datasets/store_sales_data/train.csv", parse_dates=['date'])
stores = pd.read_csv("Datasets/store_sales_data/stores.csv")
oil = pd.read_csv("Datasets/store_sales_data/oil.csv", parse_dates=['date'])
holidays = pd.read_csv("Datasets/store_sales_data/holidays_events.csv", parse_dates=['date'])

# Preview dataset
# train.head()

In [340]:
# Summary of train dataset
# train.info()

### Question 1: What is the date range of the training data? Use `.min()` and `.max()` on the `date` column.

In [341]:
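# Date range (stored as `date_span` to avoid shadowing the built-in `range`)
date_min, date_max = train["date"].min(), train["date"].max()
date_span = date_max - date_min
print(date_min, date_max, date_span)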


## 2. Exploratory Data Analysis

In [342]:
# Plot sales over time
# plt.figure(figsize=(12, 6))
# sns.lineplot(x='date', y='sales', data=train, errorbar=None)
# plt.title('Daily Sales Over Time')
# plt.xlabel('Date')
# plt.ylabel('Sales')
# plt.show()

### Question 2: Are there noticeable trends or seasonality in sales data? What hypotheses can you form based on the plot?
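
One way to separate trend from weekly seasonality is to smooth the daily totals with a 7-day rolling mean and to look at the average per weekday. The sketch below uses a synthetic series (made-up numbers, not the Store Sales data) so that it runs on its own. On the real data, `daily` could be built with something like `train.groupby('date')['sales'].sum()`, which is an assumption about how you choose to aggregate.

```python
import numpy as np
import pandas as pd

# Synthetic daily totals (hypothetical data): a linear trend plus a weekend bump.
dates = pd.date_range("2013-01-01", periods=56, freq="D")
weekend_bump = np.where(dates.dayofweek >= 5, 50.0, 0.0)
daily = pd.Series(100.0 + np.arange(56) + weekend_bump, index=dates, name="sales")

# A 7-day window always covers exactly one weekly cycle,
# so the rolling mean removes the weekly pattern and leaves the trend.
trend = daily.rolling(window=7).mean()

# Average sales per weekday (0 = Monday, 6 = Sunday) shows the weekly pattern itself.
weekday_profile = daily.groupby(daily.index.dayofweek).mean()

print(trend.dropna().head())
print(weekday_profile)
```

If the smoothed line still rises or falls, that suggests a trend. If `weekday_profile` is far from flat, that suggests weekly seasonality.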

In [343]:
# Aggregate sales by year and month
train['year'] = train['date'].dt.year
train['month'] = train['date'].dt.month
monthly_sales = train.groupby(['year', 'month'])['sales'].sum().reset_index()

# Plot monthly sales
# plt.figure(figsize=(12, 6))
# sns.lineplot(x='month', y='sales', hue='year', data=monthly_sales, marker='o')
# plt.title('Monthly Sales Trends by Year')
# plt.xlabel('Month')
# plt.ylabel('Sales')
# plt.legend(title='Year')
# plt.show()

### Question 3: Which months tend to have higher or lower sales? Can this be linked to holidays or promotions?
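
To rank months, it can help to pivot the aggregated table so each year is a column, then average across years. Averaging rather than summing keeps a year with missing months from distorting the ranking. The sketch below uses a small made-up `monthly_sales` frame with the same column names as the one built above. The values are hypothetical.

```python
import pandas as pd

# Toy stand-in for `monthly_sales` (hypothetical values); 2014 has no December.
monthly_sales = pd.DataFrame({
    "year":  [2013, 2013, 2013, 2014, 2014],
    "month": [1, 2, 12, 1, 2],
    "sales": [100.0, 90.0, 150.0, 120.0, 110.0],
})

# One row per month, one column per year; missing year/month pairs become NaN.
by_month = monthly_sales.pivot(index="month", columns="year", values="sales")

# Mean across years (NaN is skipped), sorted from strongest to weakest month.
mean_by_month = by_month.mean(axis=1).sort_values(ascending=False)
print(mean_by_month)
```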

## 3. Feature Engineering

In [344]:
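# Merge train dataset with oil prices, holidays, and store metadata
# Note: holidays_events.csv can list several events on the same date, so the left merge
# on 'date' may duplicate train rows; compare len(train) before and after to check.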
train = train.merge(oil, on='date', how='left')
train = train.merge(holidays, on='date', how='left')
train = train.merge(stores, on='store_nbr', how='left')

# Fill missing oil prices with forward fill
# (`fillna(method='ffill')` is deprecated in recent pandas; `.ffill()` is the equivalent call)
train['dcoilwtico'] = train['dcoilwtico'].ffill()
# What is this really doing?

# Create new features
train['day_of_week'] = train['date'].dt.day_of_week
train['is_weekend'] = train['date'].dt.day_of_week.isin([5, 6])
train['year_month'] = train['date'].dt.to_period('M')
# Preview engineered features
train[['date', 'sales', 'dcoilwtico', 'day_of_week', 'is_weekend', 'year_month']].head()

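|   | date | sales | dcoilwtico | day_of_week | is_weekend | year_month |
|---|------|-------|------------|-------------|------------|------------|
| 0 | 2013-01-01 | 0.0 | NaN | 1 | False | 2013-01 |
| 1 | 2013-01-01 | 0.0 | NaN | 1 | False | 2013-01 |
| 2 | 2013-01-01 | 0.0 | NaN | 1 | False | 2013-01 |
| 3 | 2013-01-01 | 0.0 | NaN | 1 | False | 2013-01 |
| 4 | 2013-01-01 | 0.0 | NaN | 1 | False | 2013-01 |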
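
The preview still shows missing `dcoilwtico` values on the first date. Forward fill copies the last known value forward, so rows that come before the first known price have nothing to copy from. A minimal sketch on a toy frame (the prices are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy oil series (hypothetical prices) with gaps at the start and in the middle.
oil_demo = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=5, freq="D"),
    "dcoilwtico": [np.nan, 93.14, np.nan, np.nan, 92.97],
})

# Forward fill: each gap takes the most recent earlier value.
forward = oil_demo["dcoilwtico"].ffill()

# A backward fill afterwards covers the leading gap, using the next known value.
both = forward.bfill()

print(pd.DataFrame({"raw": oil_demo["dcoilwtico"], "ffill": forward, "ffill+bfill": both}))
```

Note that a backward fill borrows a value from the future, which matters once the data is used for forecasting.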


### Question 4: How does oil price (`dcoilwtico`) correlate with sales? Plot and discuss.

In [345]:
# Correlation between oil price and sales
# plt.figure(figsize=(12, 6))
# sns.scatterplot(x='dcoilwtico', y='sales', data=train, alpha=0.5)
# plt.title('Oil Price vs Sales')
# plt.xlabel('Oil Price')
# plt.ylabel('Sales')
# plt.show()
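
A scatter plot of every row mixes all stores and product families, so it can help to aggregate to one row per date and compute a correlation coefficient as well. The sketch below uses a toy long-format frame (made-up numbers, two hypothetical stores) with the same column names as `train`. The rank-based correlation is computed by ranking both columns first, which is one simple way to get a Spearman-style coefficient.

```python
import pandas as pd

# Toy long-format data (hypothetical values): two stores, six days, one oil price per day.
dates = pd.date_range("2013-01-01", periods=6, freq="D")
oil_price = [95.0, 94.0, 94.5, 92.0, 92.5, 90.0]
toy = pd.DataFrame({
    "date": list(dates) * 2,
    "store_nbr": [1] * 6 + [2] * 6,
    "sales": [4.0, 5.0, 5.0, 7.0, 6.0, 8.0, 6.0, 7.0, 6.0, 8.0, 8.0, 10.0],
    "dcoilwtico": oil_price * 2,
})

# One row per date: total sales and that day's oil price.
daily = toy.groupby("date").agg(
    sales=("sales", "sum"),
    dcoilwtico=("dcoilwtico", "first"),
)

# Pearson measures linear association; correlating the ranks measures monotonic association.
pearson = daily["sales"].corr(daily["dcoilwtico"])
rank_corr = daily["sales"].rank().corr(daily["dcoilwtico"].rank())

print(f"Pearson: {pearson:.3f}, rank correlation: {rank_corr:.3f}")
```

A strong coefficient here would not by itself show that oil prices drive sales, because both series can follow the same long-term trend.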

## 4. Training Session

Based on the EDA and feature engineering above, train a simple linear regression model.

In [346]:
# dummy_list = ["family", "type_x", "locale", "locale_name", "transferred", "city", "state", "type_y"]
# dum = pd.get_dummies(train, columns=dummy_list, drop_first=True)  # pass the list itself, not [dummy_list]
train.dtypes

id                      int64
date           datetime64[ns]
store_nbr               int64
family                 object
sales                 float64
onpromotion             int64
year                    int32
month                   int32
dcoilwtico            float64
type_x                 object
locale                 object
locale_name            object
description            object
transferred            object
city                   object
state                  object
type_y                 object
cluster                 int64
day_of_week             int32
is_weekend               bool
year_month          period[M]
dtype: object
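
The `object` columns above cannot be passed to a linear regression as they are, which is why the next cell one-hot encodes them. A minimal sketch of what `pd.get_dummies` with `drop_first=True` does, on a toy frame (the rows are made up):

```python
import pandas as pd

# Toy frame (hypothetical rows) with one categorical and one numeric column.
toy = pd.DataFrame({
    "family": ["BEVERAGES", "DAIRY", "BEVERAGES", "PRODUCE"],
    "onpromotion": [0, 3, 1, 0],
})

# Each category becomes its own indicator column. drop_first=True drops the first
# category (here BEVERAGES), which then acts as the baseline: all indicators are 0.
encoded = pd.get_dummies(toy, columns=["family"], drop_first=True)
print(encoded)
```

Dropping one level avoids perfectly collinear indicator columns, which matters for a linear model that also fits an intercept.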

In [347]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler  # only needed for the optional scaling step below

dummy_list = ["family", "type_x", "locale", "locale_name", "transferred", "city", "state", "type_y"]

# Select features and target variable
train = train.sort_values(by="date")
train = pd.get_dummies(train, columns=dummy_list, drop_first=True)
features = train.drop(columns=["sales", "date", "description", "year_month"]).columns
target = 'sales'

# Handle missing values (fill forward and backward)
train = train.ffill().bfill()

# Chronological split: rows are sorted by date, so the first 80% train the model and the last 20% test it
split_index = int(0.8 * len(train))  # Split at 80% of the data
train_data = train.iloc[:split_index]
test_data = train.iloc[split_index:]

X_train, y_train = train_data[features], train_data[target]
X_test, y_test = test_data[features], test_data[target]

# Apply feature scaling (Standardization)
# std = StandardScaler()
# X_train_scaled = std.fit_transform(X_train)
# X_test_scaled = std.transform(X_test)

# Initialize and train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")


Mean Squared Error: 731582.9153527936
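
MSE is expressed in squared sales units, so the number above is hard to read on its own. Taking the square root gives an error in the same units as `sales`, and comparing against a naive baseline shows whether the model learned anything. The sketch below reuses the printed MSE. The baseline part uses made-up arrays because it is only meant to show the calculation.

```python
import numpy as np

# RMSE from the MSE printed above: same units as the `sales` column.
mse = 731582.9153527936
rmse = float(np.sqrt(mse))
print(f"RMSE: {rmse:.2f}")

# Naive baseline on toy arrays (hypothetical values): always predict the training mean.
y_train_demo = np.array([0.0, 10.0, 20.0, 30.0])
y_test_demo = np.array([5.0, 25.0])
baseline_pred = np.full_like(y_test_demo, y_train_demo.mean())
baseline_mse = float(np.mean((y_test_demo - baseline_pred) ** 2))
print(f"Baseline MSE: {baseline_mse}")
```

On the real data, the same baseline could be computed from `y_train` and `y_test`. A model whose MSE is not clearly below the baseline's is not adding much.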


## 5. Extra Questions
1. Create a lag feature for sales (e.g., `sales_lag_1` for the previous day). How does this improve your understanding of the data?
2. Engineer a feature indicating the number of holidays in the past 7 days. Does it help explain sales trends?
3. Add one or both of these new features to the model. Do they change the predictions?
4. Split the data into training and validation sets for future modeling. How would you ensure no data leakage in a time-series setup? (We will talk about this next, but think about it)
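
As a starting point for these questions, here is a minimal sketch on a toy frame (made-up data, two hypothetical stores). It assumes one row per store per day. The real dataset also has a `family` column, so the grouping there would likely need both `store_nbr` and `family`. The `is_holiday` flag and the choice to count the current day inside the 7-day window are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical): 10 days, 2 stores, one row per store per day.
dates = pd.date_range("2017-01-01", periods=10, freq="D")
toy = pd.DataFrame({
    "date": list(dates) * 2,
    "store_nbr": [1] * 10 + [2] * 10,
    "sales": [float(v) for v in range(20)],
})
toy = toy.sort_values(["store_nbr", "date"]).reset_index(drop=True)

# 1. Lag feature: previous day's sales within each store (first day of a store is NaN).
toy["sales_lag_1"] = toy.groupby("store_nbr")["sales"].shift(1)

# 2. Holidays in the past 7 days (current day included), computed on a daily calendar.
calendar = pd.DataFrame({"date": dates, "is_holiday": [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]})
calendar["holidays_past_7d"] = calendar["is_holiday"].rolling(window=7, min_periods=1).sum()
toy = toy.merge(calendar[["date", "holidays_past_7d"]], on="date", how="left")

# 4. Time-based split: every validation date is later than every training date.
cutoff = pd.Timestamp("2017-01-08")
train_part = toy[toy["date"] < cutoff]
valid_part = toy[toy["date"] >= cutoff]

print(toy.head(12))
print(len(train_part), len(valid_part))
```

Both features only look backwards in time, which is what keeps them usable at prediction time. Splitting on a date rather than a row index also keeps all rows from the same day on the same side of the split.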

In [348]:
#Have fun ;)