# **Demand Forecasting Optimization for Corporation Favorita: A Time Series Regression ML Approach**

### **Hypothesis:**
`Null Hypothesis (H0)`: There is no significant impact of promotions on the sales of products at Corporation Favorita stores.  

`Alternative Hypothesis (H1)`: Promotions significantly impact the sales of products at Corporation Favorita stores.



### **Analytical Questions**
1. Is the train dataset complete (has all the required dates)?
2. Which dates have the lowest and highest sales for each year (excluding days the store was closed)?
3. Compare the sales for each month across the years and determine which month of which year had the highest sales.
4. Did the earthquake impact sales?
5. Are certain stores or groups of stores selling more products? (Cluster, city, state, type)
6. Are sales affected by promotions, oil prices and holidays?
7. What analysis can we get from the date and its extractable features?
8. Which product families and stores did the promotions affect?
9. What is the difference between RMSLE, RMSE, MSE (or why is the MAE greater than all of them?)
10. Does the payment of public-sector wages on the 15th and last day of each month influence store sales?

## **Importing Necessary packages**

In [1]:
import pyodbc
from dotenv import dotenv_values
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import product

import warnings
warnings.filterwarnings('ignore') 

## **Data Collection**

Database Connection

In [2]:
# Loading environment variables from .env file
environment_variables = dotenv_values('.env')

# Getting the values for the credentials set in the .env file
server = environment_variables.get("SERVER")
database = environment_variables.get("DATABASE")
username = environment_variables.get("USERNAME")
password = environment_variables.get("PASSWORD")

# Creating a connection string
connection_string = f"DRIVER={{SQL Server}}; \
                    SERVER={server}; \
                    DATABASE={database}; \
                    UID={username}; \
                    PWD={password};"

# Connecting to the server
connection = pyodbc.connect(connection_string)

Loading data from the database

In [3]:
# Loading Oil dataset 
oil = pd.read_sql_query("SELECT * FROM dbo.oil", connection)

# Saving the DataFrame to a CSV file
oil.to_csv('data/oil.csv', index=False)

oil.head()

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.139999
2,2013-01-03,92.970001
3,2013-01-04,93.120003
4,2013-01-07,93.199997


In [4]:
# Loading holidays_events dataset
holidays_events = pd.read_sql_query("SELECT * FROM dbo.holidays_events", connection)

# Saving the DataFrame to a CSV file
holidays_events.to_csv('data/holidays_events.csv', index=False)

holidays_events.head()

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False


In [5]:
# Loading stores dataset
stores = pd.read_sql_query("SELECT * FROM dbo.stores", connection)

# Saving the DataFrame to a CSV file
stores.to_csv('data/stores.csv', index=False)

stores.head()

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4


In [6]:
train = pd.read_csv(r"C:\Users\Elisha Stanley\Downloads\TECH\Data Science\Azubi\Project 03\train.csv")

train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0
1,1,2013-01-01,1,BABY CARE,0.0,0
2,2,2013-01-01,1,BEAUTY,0.0,0
3,3,2013-01-01,1,BEVERAGES,0.0,0
4,4,2013-01-01,1,BOOKS,0.0,0


In [7]:
test = pd.read_csv('Data/test.csv')

test.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion
0,3000888,2017-08-16,1,AUTOMOTIVE,0
1,3000889,2017-08-16,1,BABY CARE,0
2,3000890,2017-08-16,1,BEAUTY,2
3,3000891,2017-08-16,1,BEVERAGES,20
4,3000892,2017-08-16,1,BOOKS,0


In [8]:
transactions = pd.read_csv('Data/transactions.csv')

transactions.head()

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922


In [9]:
sample_submission = pd.read_csv('Data/sample_submission.csv')

sample_submission.head()

Unnamed: 0,id,sales
0,3000888,0.0
1,3000889,0.0
2,3000890,0.0
3,3000891,0.0
4,3000892,0.0


## **Data Cleaning & Preparation**

In [10]:
dataset_list = [stores, train, test, transactions, oil, holidays_events, sample_submission]

Exploring data information

In [11]:
def data_information(datasets):
    for data in datasets:
        print(data.info())
        print('_' * 50)

data_information(dataset_list)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   store_nbr  54 non-null     int64 
 1   city       54 non-null     object
 2   state      54 non-null     object
 3   type       54 non-null     object
 4   cluster    54 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.2+ KB
None
__________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000888 entries, 0 to 3000887
Data columns (total 6 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 137.4+ MB
None
__________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28512 entries, 0 to 28511
Dat

Checking for null values

In [12]:
def show_missing_val(datasets):
    for data in datasets:
        print(data.isnull().sum())
        print('_' * 50)

show_missing_val(dataset_list)

store_nbr    0
city         0
state        0
type         0
cluster      0
dtype: int64
__________________________________________________
id             0
date           0
store_nbr      0
family         0
sales          0
onpromotion    0
dtype: int64
__________________________________________________
id             0
date           0
store_nbr      0
family         0
onpromotion    0
dtype: int64
__________________________________________________
date            0
store_nbr       0
transactions    0
dtype: int64
__________________________________________________
date           0
dcoilwtico    43
dtype: int64
__________________________________________________
date           0
type           0
locale         0
locale_name    0
description    0
transferred    0
dtype: int64
__________________________________________________
id       0
sales    0
dtype: int64
__________________________________________________


In [13]:
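# Filling 'dcoilwtico' missing values with the median
# (assigning back avoids chained in-place modification, which newer pandas versions warn about)
oil['dcoilwtico'] = oil['dcoilwtico'].fillna(oil['dcoilwtico'].median())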
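Because oil prices form a time series, a single global median can sit far from the neighbouring values of a gap. As a possible alternative (not the approach used in this notebook), the sketch below fills gaps by linear interpolation and back-fills a leading gap. The `demo` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the `oil` frame (illustrative values only)
demo = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03",
                            "2013-01-04", "2013-01-07"]),
    "dcoilwtico": [np.nan, 93.14, np.nan, 93.12, 93.20],
})

filled = demo.copy()
# Interior gaps take the value halfway between their neighbours;
# a gap at the very start has no earlier neighbour, so it is back-filled.
filled["dcoilwtico"] = filled["dcoilwtico"].interpolate(method="linear").bfill()

print(filled)
```

Whether interpolation or the median is preferable depends on how the oil price is used downstream; comparing both on a validation split would settle it.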

Dealing with duplicate records

In [14]:
# Function to drop duplicate rows from a dataset in place
def drop_duplicates(df):
    df.drop_duplicates(inplace=True)

# Applying it to every dataset
for data in dataset_list:
    drop_duplicates(data)

Formatting Date columns

In [15]:
# Function to convert dates to datetime format
def convert_dates(df):
    df['date'] = pd.to_datetime(df['date'])
    return df

train = convert_dates(train)
test = convert_dates(test)
holidays_events = convert_dates(holidays_events)
transactions = convert_dates(transactions)
oil = convert_dates(oil)

## **Hypothesis Testing**

`Null Hypothesis (H0)`: There is no significant impact of promotions on the sales of products at Corporation Favorita stores.  

`Alternative Hypothesis (H1)`: Promotions significantly impact the sales of products at Corporation Favorita stores.
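One way to put H0 to the test is a permutation test on the difference in mean sales between rows with and without promotions. The sketch below is an assumption about how such a test could be run, not the notebook's confirmed method. `permutation_test` is a hypothetical helper, and the small `demo` frame with made-up values stands in for `train`.

```python
import numpy as np
import pandas as pd

def permutation_test(df, n_permutations=2000, seed=0):
    """Two-sided permutation test on mean sales: promoted vs. not promoted."""
    rng = np.random.default_rng(seed)
    promoted = (df["onpromotion"] > 0).to_numpy()
    sales = df["sales"].to_numpy(dtype=float)
    observed = sales[promoted].mean() - sales[~promoted].mean()

    count = 0
    for _ in range(n_permutations):
        # Shuffling the labels simulates a world in which H0 holds
        shuffled = rng.permutation(promoted)
        diff = sales[shuffled].mean() - sales[~shuffled].mean()
        if abs(diff) >= abs(observed):
            count += 1

    # Add-one correction keeps the p-value strictly positive
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value

# Made-up stand-in for `train` (illustrative values only)
demo = pd.DataFrame({
    "sales":       [10.0, 12.0, 11.0, 9.0, 30.0, 28.0, 35.0, 31.0],
    "onpromotion": [0, 0, 0, 0, 3, 1, 5, 2],
})

observed_diff, p_value = permutation_test(demo)
print(f"difference in means: {observed_diff:.2f}, p-value: {p_value:.4f}")
```

A p-value below the chosen significance level (commonly 0.05) would lead to rejecting H0. On the full train data the loop is slow, so a smaller `n_permutations` or a random sample of rows may be needed.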

## **Answering Analytical Questions**

### 1. Is the train dataset complete (has all the required dates)?

In [16]:
train['date'].min(), train['date'].max()

(Timestamp('2013-01-01 00:00:00'), Timestamp('2017-08-15 00:00:00'))

In [17]:
# Getting dates which are not in the train dataset
missing_dates = pd.date_range(start= '2013-01-01', end='2017-08-15').difference(train.date)

missing_dates

DatetimeIndex(['2013-12-25', '2014-12-25', '2015-12-25', '2016-12-25'], dtype='datetime64[ns]', freq=None)

In [18]:
# Getting all unique stores and families
uniques_stores = train.store_nbr.unique()
unique_family = train.family.unique()

In [19]:
# Filling in the missing dates by pairing each one with every unique store and family
replace_dates = list(product(missing_dates, uniques_stores, unique_family))

# Creating a dataframe for the replaced dates
replace_dates = pd.DataFrame(replace_dates, columns=['date', 'store_nbr', 'family'])
replace_dates.head()

Unnamed: 0,date,store_nbr,family
0,2013-12-25,1,AUTOMOTIVE
1,2013-12-25,1,BABY CARE
2,2013-12-25,1,BEAUTY
3,2013-12-25,1,BEVERAGES
4,2013-12-25,1,BOOKS


In [20]:
# Adding replaced dates to our train data
train_data = pd.concat([train, replace_dates], ignore_index=True)
train_data.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0.0,2013-01-01,1,AUTOMOTIVE,0.0,0.0
1,1.0,2013-01-01,1,BABY CARE,0.0,0.0
2,2.0,2013-01-01,1,BEAUTY,0.0,0.0
3,3.0,2013-01-01,1,BEVERAGES,0.0,0.0
4,4.0,2013-01-01,1,BOOKS,0.0,0.0
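The rows appended by `pd.concat` carry no `id`, `sales` or `onpromotion`, so those columns hold `NaN` for the four added dates. That is also why the integer columns now display as floats. Below is a minimal sketch of one way to handle this, assuming the stores were closed on 25 December so that zero sales is a reasonable fill. That assumption is not confirmed by the data shown here, and the `demo` frame is a made-up stand-in for `train_data`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `train_data`: one appended row followed by one real row
demo = pd.DataFrame({
    "id": [np.nan, 0.0],
    "date": pd.to_datetime(["2013-12-25", "2013-12-24"]),
    "store_nbr": [1, 1],
    "family": ["AUTOMOTIVE", "AUTOMOTIVE"],
    "sales": [np.nan, 2.0],
    "onpromotion": [np.nan, 0.0],
})

# Assumption: no sales and no promotions on the added dates
demo[["sales", "onpromotion"]] = demo[["sales", "onpromotion"]].fillna(0)
demo["onpromotion"] = demo["onpromotion"].astype("int64")

# Restore chronological order so the added dates sit where they belong
demo = demo.sort_values(["date", "store_nbr", "family"]).reset_index(drop=True)

print(demo)
```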


In [21]:
# Checking for missing dates again
missing_dates = pd.date_range(start='2013-01-01', end='2017-08-15').difference(train_data.date)

missing_dates

DatetimeIndex([], dtype='datetime64[ns]', freq='D')

### 2. Which dates have the lowest and highest sales for each year (excluding days the store was closed)?
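A possible way to approach this question is to total sales per day, drop days with zero total sales, and then pick the extreme dates within each year. The sketch treats a zero-sales day as a day the stores were closed, which is an assumption. The `demo` frame is a made-up stand-in for `train_data`.

```python
import pandas as pd

# Made-up stand-in for `train_data` (illustrative values only)
demo = pd.DataFrame({
    "date": pd.to_datetime([
        "2013-01-01", "2013-01-02", "2013-01-02", "2013-01-03",
        "2014-01-01", "2014-01-02", "2014-01-03",
    ]),
    "sales": [0.0, 5.0, 7.0, 20.0, 0.0, 8.0, 3.0],
})

# Total sales per calendar day across all stores and families
daily = demo.groupby("date")["sales"].sum()

# Assumption: a day with zero total sales means the stores were closed
open_days = daily[daily > 0]

# For each year, the dates with the lowest and the highest total sales
by_year = open_days.groupby(open_days.index.year)
extremes = pd.DataFrame({
    "lowest_date": by_year.idxmin(),
    "lowest_sales": by_year.min(),
    "highest_date": by_year.idxmax(),
    "highest_sales": by_year.max(),
})

print(extremes)
```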

### 3. Compare the sales for each month across the years and determine which month of which year had the highest sales.
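Monthly totals per year can be laid out as a month-by-year table, which makes the comparison across years direct. The following is a sketch on made-up data, with `demo` standing in for `train_data`.

```python
import pandas as pd

# Made-up stand-in for `train_data` (illustrative values only)
demo = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-05", "2013-01-20", "2013-02-10",
                            "2014-01-15", "2014-02-03"]),
    "sales": [10.0, 15.0, 40.0, 30.0, 20.0],
})

# Total sales for every (year, month) pair
monthly = demo.groupby([
    demo["date"].dt.year.rename("year"),
    demo["date"].dt.month.rename("month"),
])["sales"].sum()

# Rows are months and columns are years, so each month can be compared across years
monthly_table = monthly.unstack("year")

# The (year, month) pair with the highest total sales
best_year, best_month = monthly.idxmax()

print(monthly_table)
print(f"Highest sales: month {best_month} of {best_year}")
```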

### 4. Did the earthquake impact sales?
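The competition data description places the earthquake on 16 April 2016. A simple first look, sketched here under that assumption, compares mean daily sales in equal windows before and after the event. `before_after_mean` is a hypothetical helper and the series values are made up.

```python
import pandas as pd

# Assumption: earthquake date taken from the dataset description
EARTHQUAKE_DATE = pd.Timestamp("2016-04-16")

def before_after_mean(daily_sales, event_date, window_days=30):
    """Mean daily sales in equal windows before and after event_date (event day excluded)."""
    window = pd.Timedelta(days=window_days)
    idx = daily_sales.index
    before = daily_sales[(idx >= event_date - window) & (idx < event_date)]
    after = daily_sales[(idx > event_date) & (idx <= event_date + window)]
    return before.mean(), after.mean()

# Made-up daily totals around the event (illustrative values only)
dates = pd.date_range("2016-04-13", "2016-04-19")
daily = pd.Series([100.0, 110.0, 90.0, 150.0, 140.0, 160.0, 150.0], index=dates)

before_mean, after_mean = before_after_mean(daily, EARTHQUAKE_DATE, window_days=3)
print(f"mean before: {before_mean:.1f}, mean after: {after_mean:.1f}")
```

A difference between the two windows does not by itself prove an earthquake effect, since seasonality and holidays in the same period can move sales as well.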

### 5. Are certain stores or groups of stores selling more products? (Cluster, city, state, type)
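Answering this requires joining sales with the store attributes on `store_nbr` and then aggregating by each attribute. The sketch below uses made-up frames (`sales_demo`, `stores_demo`) in place of `train_data` and `stores`.

```python
import pandas as pd

# Made-up stand-ins for `train_data` and `stores` (illustrative values only)
sales_demo = pd.DataFrame({
    "store_nbr": [1, 1, 2, 3],
    "sales": [10.0, 5.0, 20.0, 7.0],
})
stores_demo = pd.DataFrame({
    "store_nbr": [1, 2, 3],
    "city": ["Quito", "Quito", "Guayaquil"],
    "cluster": [13, 13, 8],
})

# A left join keeps every sales row, even if a store were missing from `stores`
merged = sales_demo.merge(stores_demo, on="store_nbr", how="left")

# Total sales per grouping attribute, largest first
by_city = merged.groupby("city")["sales"].sum().sort_values(ascending=False)
by_cluster = merged.groupby("cluster")["sales"].sum().sort_values(ascending=False)

print(by_city)
print(by_cluster)
```

The same `groupby` pattern applies to `state` and `type`.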

### 6. Are sales affected by promotions, oil prices and holidays?
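As a first pass, daily sales can be correlated with promotions and the oil price, and mean sales compared between holidays and ordinary days. The sketch below assumes daily aggregates have already been built. All frames and values are made up, and the column names follow the datasets loaded above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2017-01-01", periods=5)

# Made-up daily aggregates (illustrative values only)
daily = pd.DataFrame({
    "date": dates,
    "sales": [100.0, 120.0, 140.0, 160.0, 180.0],
    "onpromotion": [0, 1, 2, 3, 4],
})
oil_demo = pd.DataFrame({"date": dates, "dcoilwtico": [50.0, 49.0, 48.0, 47.0, 46.0]})
holidays_demo = pd.DataFrame({"date": pd.to_datetime(["2017-01-05"])})

merged = daily.merge(oil_demo, on="date", how="left")
merged["day_type"] = np.where(merged["date"].isin(holidays_demo["date"]),
                              "holiday", "ordinary")

# Pearson correlation of sales with promotions and with the oil price
correlations = merged[["sales", "onpromotion", "dcoilwtico"]].corr()["sales"]

# Mean sales on holidays versus ordinary days
day_type_means = merged.groupby("day_type")["sales"].mean()

print(correlations)
print(day_type_means)
```

Correlation only describes association, so a strong value here would motivate a closer look rather than establish that one variable drives the other.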

### 7. What analysis can we get from the date and its extractable features?
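Once `date` is a datetime column, pandas exposes calendar components through the `.dt` accessor. The sketch below derives a handful of features that are commonly useful for sales data. `add_date_features` is a hypothetical helper and the choice of features is an assumption.

```python
import pandas as pd

def add_date_features(df):
    """Return a copy of df with calendar features derived from its 'date' column."""
    out = df.copy()
    d = out["date"].dt
    out["year"] = d.year
    out["month"] = d.month
    out["day"] = d.day
    out["day_of_week"] = d.dayofweek      # Monday=0 ... Sunday=6
    out["is_weekend"] = d.dayofweek >= 5
    out["is_month_end"] = d.is_month_end
    return out

demo = pd.DataFrame({"date": pd.to_datetime(["2017-08-15", "2017-08-31", "2017-09-02"])})
features = add_date_features(demo)

print(features)
```

Grouping sales by any of these columns (for example `day_of_week`) then shows weekly or monthly patterns.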

### 8. Which product families and stores did the promotions affect?
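One rough way to see where promotions matter is to compare mean sales with and without promotions inside each product family. The sketch below does that on made-up data. The `uplift` column is a hypothetical name for the difference in means.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `train_data` (illustrative values only)
demo = pd.DataFrame({
    "family": ["BEVERAGES", "BEVERAGES", "BEVERAGES", "BOOKS", "BOOKS", "BOOKS"],
    "sales": [100.0, 180.0, 220.0, 5.0, 4.0, 6.0],
    "onpromotion": [0, 4, 6, 0, 0, 1],
})

# Label each row by whether any item of the family was on promotion
demo["promo_status"] = np.where(demo["onpromotion"] > 0, "promo", "no_promo")

# Mean sales per family, split by promotion status
mean_sales = demo.groupby(["family", "promo_status"])["sales"].mean().unstack("promo_status")
mean_sales["uplift"] = mean_sales["promo"] - mean_sales["no_promo"]

# Families with the largest difference first
ranking = mean_sales.sort_values("uplift", ascending=False)

print(ranking)
```

Replacing `family` with `store_nbr` in the `groupby` gives the same comparison per store.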

### 9. What is the difference between RMSLE, RMSE, MSE (or why is the MAE greater than all of them?)
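The four metrics differ in how they weight errors: MAE averages absolute errors, MSE averages squared errors, RMSE is the square root of MSE, and RMSLE applies the RMSE formula to `log(1 + value)`. The sketch below computes all four on made-up numbers. `regression_metrics` is a hypothetical helper.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and RMSLE for non-negative targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    # RMSLE compares log1p values, so it needs non-negative inputs
    rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "RMSLE": rmsle}

# Made-up values where every error is below 1 in magnitude
small = regression_metrics([1.0, 2.0, 3.0], [1.5, 2.2, 2.7])
# Made-up values with one large error
large = regression_metrics([10.0, 20.0, 30.0], [12.0, 20.0, 45.0])

print(small)
print(large)
```

When all errors are smaller than 1, squaring shrinks them, so MSE can fall below MAE. RMSLE works on a log scale and is therefore often smaller too. RMSE, however, is never smaller than MAE on the same data, so MAE cannot exceed all three at once.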

### 10. Does the payment of public-sector wages on the 15th and last day of each month influence store sales?
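A simple check is to flag the 15th and the last day of each month as paydays and compare mean daily sales between flagged and unflagged days. The sketch below runs on made-up data, with `demo` standing in for `train_data`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `train_data` (illustrative values only)
demo = pd.DataFrame({
    "date": pd.to_datetime(["2017-07-14", "2017-07-15", "2017-07-16",
                            "2017-07-30", "2017-07-31"]),
    "sales": [100.0, 150.0, 110.0, 90.0, 160.0],
})

# Total sales per calendar day
daily = demo.groupby("date")["sales"].sum().to_frame()

# Payday: the 15th or the last day of the month
is_payday = (daily.index.day == 15) | daily.index.is_month_end
daily["day_type"] = np.where(is_payday, "payday", "other")

payday_means = daily.groupby("day_type")["sales"].mean()

print(payday_means)
```

Since spending may follow a payday rather than fall on it, widening the flag to the one or two days after each payday is a natural variation to try.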

## **Exploratory Data Analysis (EDA)**