<a href="https://colab.research.google.com/github/dianalves00/6-7-edition/blob/main/Supervised%20Learning/regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Store Sales - Time Series Forecasting
This notebook covers exploratory data analysis (EDA) and feature engineering for the Store Sales dataset.

## Dataset Overview
- **Goal:** Predict daily store sales.
- **Features:** Date, store information, promotions, and more.
- **Target:** Sales column.

### Exercises:
- Conduct EDA to understand trends and relationships.
- Engineer meaningful features to improve forecasting accuracy.

### Dataset Link:
Download the dataset from [Kaggle Store Sales Time Series Forecasting](https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data) or from the course GitHub repository, or load it directly from GitHub.

## 1. Load and Explore the Data

In [None]:
# If you don't have the dataset yet, download it and unzip it locally
# 1. Download the zip file
!wget https://github.com/samsung-ai-course/6-7-edition/raw/main/Supervised%20Learning/Datasets/store-sales-time-series-forecasting.zip

# 2. Unzip the downloaded file
!unzip store-sales-time-series-forecasting.zip -d store_sales_data

--2024-11-24 19:25:07--  https://github.com/samsung-ai-course/6-7-edition/raw/main/Supervised%20Learning/Datasets/store-sales-time-series-forecasting.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/samsung-ai-course/6-7-edition/main/Supervised%20Learning/Datasets/store-sales-time-series-forecasting.zip [following]
--2024-11-24 19:25:08--  https://raw.githubusercontent.com/samsung-ai-course/6-7-edition/main/Supervised%20Learning/Datasets/store-sales-time-series-forecasting.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22416355 (21M) [application/zip]
Saving to: ‘store-sales-time-series-

In [None]:
# If you want to unzip programmatically (assuming the zip file is already in this folder, visible in the file browser on the left)
!unzip Supervised\ Learning/Datasets/store-sales-time-series-forecasting.zip -d Supervised\ Learning/Datasets/


In [None]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

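# Load dataset
# DATA_DIR assumes the wget/unzip cell above was used; adjust it if you unzipped elsewhere
DATA_DIR = "store_sales_data"
train = pd.read_csv(f"{DATA_DIR}/train.csv", parse_dates=['date'])
stores = pd.read_csv(f"{DATA_DIR}/stores.csv")
oil = pd.read_csv(f"{DATA_DIR}/oil.csv", parse_dates=['date'])
holidays = pd.read_csv(f"{DATA_DIR}/holidays_events.csv", parse_dates=['date'])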

# Preview dataset
train.head()

In [None]:
# Summary of train dataset
train.info()

### Question 1: What is the date range of the training data? Use `.min()` and `.max()` on the `date` column.
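
A minimal sketch of the pattern on a toy frame. The `toy` name and its values are made up for illustration; in the notebook the same calls would apply to `train['date']`.

```python
import pandas as pd

# Toy stand-in for the training data (illustrative values only)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-03", "2020-01-01", "2020-01-10"]),
    "sales": [5.0, 3.0, 8.0],
})

first_day = toy["date"].min()            # earliest timestamp in the column
last_day = toy["date"].max()             # latest timestamp in the column
span_days = (last_day - first_day).days  # number of days between the two

print(first_day.date(), last_day.date(), span_days)
```

Subtracting the two timestamps gives a `Timedelta`, which is a quick way to check how many days the data covers.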

In [None]:
# Date range

## 2. Exploratory Data Analysis

In [None]:
# Plot sales over time
plt.figure(figsize=(12, 6))
sns.lineplot(x='date', y='sales', data=train, errorbar=None)  # errorbar=None replaces the deprecated ci=None (seaborn >= 0.12)
plt.title('Daily Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()

### Question 2: Are there noticeable trends or seasonality in sales data? What hypotheses can you form based on the plot?
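
One way to look for trend and weekly seasonality is to aggregate to daily totals, smooth them with a rolling mean, and average by weekday. The sketch below runs on synthetic data with a built-in weekend bump; the `toy`, `daily`, `trend`, and `by_weekday` names are illustrative, not part of the notebook.

```python
import pandas as pd

# Synthetic data: two stores, 14 days, higher sales on weekends (illustrative only)
dates = pd.date_range("2021-01-04", periods=14, freq="D")  # starts on a Monday
rows = []
for store in (1, 2):
    for d in dates:
        base = 10.0 * store
        bump = 5.0 if d.dayofweek >= 5 else 0.0  # Saturday/Sunday uplift
        rows.append({"date": d, "store_nbr": store, "sales": base + bump})
toy = pd.DataFrame(rows)

# Total sales per day across all stores
daily = toy.groupby("date")["sales"].sum()

# A 7-day rolling mean averages out the weekly cycle and leaves the trend
trend = daily.rolling(window=7).mean()

# Mean sales per weekday (0 = Monday) exposes the weekly pattern
by_weekday = daily.groupby(daily.index.dayofweek).mean()

print(by_weekday)
```

On the real data, plotting `daily` and `trend` together would likely make any long-term growth and recurring dips easier to see than the raw line plot.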

In [None]:
# Aggregate sales by year and month
train['year'] = train['date'].dt.year
train['month'] = train['date'].dt.month
monthly_sales = train.groupby(['year', 'month'])['sales'].sum().reset_index()

# Plot monthly sales
plt.figure(figsize=(12, 6))
sns.lineplot(x='month', y='sales', hue='year', data=monthly_sales, marker='o')
plt.title('Monthly Sales Trends by Year')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.legend(title='Year')
plt.show()

### Question 3: Which months tend to have higher or lower sales? Can this be linked to holidays or promotions?

## 3. Feature Engineering
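
Before working through the cell below, it may help to see forward fill and date-derived features on a tiny example. This is a sketch on made-up values, not the notebook's solution; the `toy` frame and the `oil_ffill` column are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame: Friday to Monday, with gaps in the oil price (illustrative values)
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=4, freq="D"),
    "dcoilwtico": [np.nan, 50.0, np.nan, 52.0],
})

# Forward fill copies the last known value into each gap.
# A gap at the very start has no earlier value, so it stays NaN.
toy["oil_ffill"] = toy["dcoilwtico"].ffill()

# Date-derived features
toy["day_of_week"] = toy["date"].dt.dayofweek      # 0 = Monday ... 6 = Sunday
toy["is_weekend"] = toy["day_of_week"] >= 5        # Saturday or Sunday
toy["year_month"] = toy["date"].dt.to_period("M")  # monthly period, e.g. 2021-01

print(toy)
```

Note that forward fill only behaves sensibly when the rows are in chronological order, so it is worth checking the sort order before filling.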

In [None]:
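# Merge train dataset with oil prices, holidays, and store metadata
# Note: holidays_events.csv can hold several rows for the same date, so that merge may duplicate train rows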
train = train.merge(oil, on='date', how='left')
train = train.merge(holidays, on='date', how='left')
train = train.merge(stores, on='store_nbr', how='left')

# Fill missing oil prices with forward fill
train['dcoilwtico'] = train['dcoilwtico'].ffill()  # fillna(method='ffill') is deprecated in recent pandas
# Question: what is this really doing?

# Create new features
train['day_of_week'] = #TODO
train['is_weekend'] = #TODO
train['year_month'] = #TODO (hint: see https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.to_period.html)

# Preview engineered features
train[['date', 'sales', 'dcoilwtico', 'day_of_week', 'is_weekend']].head()

### Question 4: How does oil price (`dcoilwtico`) correlate with sales? Plot and discuss.
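
Alongside the scatter plot, a single correlation coefficient can summarize the linear relationship. The sketch below assumes one reasonable approach: aggregate to one row per day first, then compute the Pearson correlation. The data and the `toy`, `daily`, and `r` names are made up for illustration.

```python
import pandas as pd

# Toy data: two rows per day (e.g. two stores), oil price identical within a day
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-01"] * 2 + ["2021-01-02"] * 2 + ["2021-01-03"] * 2
    ),
    "sales": [10.0, 20.0, 8.0, 18.0, 6.0, 16.0],
    "dcoilwtico": [50.0, 50.0, 55.0, 55.0, 60.0, 60.0],
})

# One row per day: total sales and that day's oil price
daily = toy.groupby("date").agg(
    sales=("sales", "sum"),
    oil=("dcoilwtico", "first"),
)

# Pearson correlation between daily sales and oil price
r = daily["sales"].corr(daily["oil"])
print(round(r, 3))
```

A strong coefficient on the real data would not by itself show that oil drives sales, since both series can simply trend over the same period.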

In [None]:
# Correlation between oil price and sales
plt.figure(figsize=(12, 6))
sns.scatterplot(x='dcoilwtico', y='sales', data=train, alpha=0.5)
plt.title('Oil Price vs Sales')
plt.xlabel('Oil Price')
plt.ylabel('Sales')
plt.show()

## 4. Training Season

Based on the EDA and feature engineering done above, train a simple linear regression model.
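
The fit, predict, and evaluate steps can be sketched on synthetic data where the true relationship is known. Everything here is an assumption for illustration: the coefficients, the `toy` frame, and the choice to hold out the last rows rather than a random sample.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: sales = 100 + 20 * is_weekend - 0.5 * oil price (no noise)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "is_weekend": rng.integers(0, 2, size=50),
    "dcoilwtico": rng.uniform(40, 60, size=50),
})
toy["sales"] = 100 + 20 * toy["is_weekend"] - 0.5 * toy["dcoilwtico"]

features = ["is_weekend", "dcoilwtico"]
X, y = toy[features], toy["sales"]

# Hold out the last 10 rows for evaluation
X_fit, X_eval = X.iloc[:-10], X.iloc[-10:]
y_fit, y_eval = y.iloc[:-10], y.iloc[-10:]

# Fit on the first part, predict on the held-out part
model = LinearRegression().fit(X_fit, y_fit)
y_hat = model.predict(X_eval)

# MSE is in squared units; RMSE is in the same units as sales
mse = mean_squared_error(y_eval, y_hat)
rmse = np.sqrt(mse)

print(model.coef_, model.intercept_, rmse)
```

Because the synthetic target is an exact linear function of the features, the model should recover the coefficients almost perfectly; real sales data will be far noisier.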

In [None]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Select features and target variable
features = #TODO
target = 'sales'

# Handle missing values (if any); replace with more robust imputation if necessary
train = #TODO

# Split data into training and testing sets
X = train[features]
y = train[target]

# Question: In this dataset, train and test are already separated. Why would we split again? Is there a reason? Is this correct?
# P.S. This is a time series.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = #TODO

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

## 5. Extra Questions
1. Create a lag feature for sales (e.g., `sales_lag_1` for the previous day). How does this improve your understanding of the data?
2. Engineer a feature indicating the number of holidays in the past 7 days. Does it help explain sales trends?
3. Use one or both of these new features in the model. Do they change the predictions?
4. Split the data into training and validation sets for future modeling. How would you ensure no data leakage in a time-series setup? (We will talk about this next, but think about it)
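
The ideas in the questions above can be sketched on a toy series. This is one possible approach under stated assumptions, not the expected answer: `is_holiday` is a hypothetical 0/1 flag, `holidays_past_7d` counts holidays in the 7 days before each row (excluding the row's own day), and the cutoff date is arbitrary.

```python
import pandas as pd

# Toy data: one store, 10 consecutive days (illustrative values)
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=10, freq="D"),
    "store_nbr": 1,
    "sales": [10.0, 12.0, 11.0, 15.0, 14.0, 13.0, 16.0, 18.0, 17.0, 19.0],
    "is_holiday": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # hypothetical 0/1 flag
}).sort_values(["store_nbr", "date"])

# Lag feature: previous day's sales within each store
toy["sales_lag_1"] = toy.groupby("store_nbr")["sales"].shift(1)

# Holidays in the previous 7 days; shifting first keeps today's value out
toy["holidays_past_7d"] = toy.groupby("store_nbr")["is_holiday"].transform(
    lambda s: s.shift(1).rolling(window=7, min_periods=1).sum()
)

# Chronological split: rows before the cutoff are used for fitting,
# rows from the cutoff onward are used for validation
cutoff = pd.Timestamp("2021-01-08")
fit_part = toy[toy["date"] < cutoff]
valid_part = toy[toy["date"] >= cutoff]

print(toy[["date", "sales", "sales_lag_1", "holidays_past_7d"]])
```

Both features only look backward in time, and the split keeps every validation date after every fitting date, which are the two properties that guard against leakage here. On the real data the grouping would probably need to include `family` as well as `store_nbr`, since each store has one series per product family.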

In [None]:
#Have fun ;)