### Project Title: 
##### Time Series Analysis: "Demand Forecasting for Inventory Optimization at Corporation Favorita"


### 1. Business Understanding
#### Business Scenario
As data scientists at Corporation Favorita, a large Ecuador-based grocery retailer, we are tasked with ensuring that the right quantity of products is always in stock.
To do this, we have decided to build a series of machine learning models to forecast product demand across various locations. We have been provided with several datasets to support this project.



#### Project Description
This project aims to ensure optimal inventory levels at Corporation Favorita by leveraging machine learning models to forecast product demand across various locations. Accurate demand forecasting will help maintain the right quantity of products in stock, reducing instances of overstocking and stockouts, thereby enhancing customer satisfaction and minimizing operational costs. The project follows the CRISP-DM framework and uses data provided by the marketing and sales teams to develop and validate predictive models.


#### Business Objective
The primary objective of this project is to develop and implement a series of machine learning models to accurately forecast the demand for various products across different locations of Corporation Favorita. By achieving this objective, Corporation Favorita aims to optimize its inventory management, ensuring that the right quantity of products is consistently in stock. 



#### Hypothesis Testing
Null Hypothesis (H0): Promotional activities and sales do not have a significant impact on product demand in the various stores.

Alternate Hypothesis (H1): Sales data has a significant impact on product demand in the various stores.

Alternate Hypothesis (H2): Promotional activities have a significant impact on product demand in the various stores.
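
One common way to test a hypothesis such as H2 is a two-sample t-test comparing sales with and without promotions. The sketch below is only an illustration under stated assumptions: it uses synthetic arrays rather than the Favorita data, Welch's variant of the test (`equal_var=False`), and a significance level of 0.05. None of these choices come from the project itself.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for daily sales (assumption: NOT the Favorita data)
rng = np.random.default_rng(seed=42)
sales_with_promo = rng.normal(loc=120.0, scale=15.0, size=200)
sales_without_promo = rng.normal(loc=100.0, scale=15.0, size=200)

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = ttest_ind(sales_with_promo, sales_without_promo, equal_var=False)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

With real data, the two arrays would be the `sales` values split by whether `onpromotion` is zero or positive.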

#### Analytical Questions
1. Is the train dataset complete (has all the required dates)?
2. Which dates have the lowest and highest sales for each year (excluding days the store was closed)?
3. Compare the sales for each month across the years and determine which month of which year had the highest sales.
4. Did the earthquake impact sales?
5. Are certain stores or groups of stores selling more products? (Cluster, city, state, type)
6. Are sales affected by promotions, oil prices and holidays?
7. What analysis can we get from the date and its extractable features?
8. Which product families and stores did the promotions affect?
9. What is the difference between RMSLE, RMSE, and MSE (or why is the MAE greater than all of them)?
10. Does the payment of public-sector wages on the 15th and last day of the month influence store sales?
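
Question 9 concerns how the error metrics relate to one another. As a minimal illustration, the snippet below computes MAE, MSE, RMSE and RMSLE by hand on made-up values (an assumption for demonstration, not project data).

```python
import math

# Made-up actual and predicted sales (illustrative values only)
y_true = [10.0, 0.0, 250.0, 40.0]
y_pred = [12.0, 1.0, 200.0, 50.0]

n = len(y_true)
errors = [p - t for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error (squared units)
rmse = math.sqrt(mse)                   # back in the same units as sales

# RMSLE compares log(1 + value), so it reflects relative rather than absolute error
rmsle = math.sqrt(
    sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / n
)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  RMSLE={rmsle:.3f}")
```

For any set of errors, RMSE is at least as large as MAE. RMSLE is measured on a log scale, so it can be far smaller than either and is not directly comparable to them.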

### 2. Data Understanding
#### Sourcing the Dataset
The datasets were sourced from a GitHub repository, a OneDrive account, and a SQL Server database.

The GitHub repository contains two datasets: train and transactions.

The OneDrive data was downloaded manually due to permission issues and also contains two datasets. These are to be used for testing purposes.

The datasets hosted on the SQL Server database were queried, and the resulting DataFrames were each saved as a single CSV file.
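
Saving a query result to CSV and reading it back can be sketched as follows. This is an illustrative example with a tiny made-up DataFrame and an in-memory buffer in place of a real file. Passing `index=False` to `to_csv` is a suggestion, not necessarily what was done in this project; it keeps the row index from reappearing as an `Unnamed: 0` column on reload.

```python
import io
import pandas as pd

# Tiny made-up frame standing in for a query result (illustrative only)
oil_sample = pd.DataFrame(
    {"date": ["2013-01-02", "2013-01-03"], "dcoilwtico": [93.14, 92.97]}
)

# Default behaviour: the index is written and comes back as "Unnamed: 0"
with_index = io.StringIO()
oil_sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the real columns are written
without_index = io.StringIO()
oil_sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```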

In [3]:
# Install required packages
# !pip install statsmodels

# Libraries for SQL database connections
import pyodbc
from dotenv import dotenv_values

# Warning handling
import warnings
warnings.filterwarnings('ignore')
from statsmodels.tools.sm_exceptions import ValueWarning

# Libraries for handling data
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import calplot

# Feature processing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Statistical models and tests
# from pmdarima import auto_arima
from scipy import stats
from scipy.stats import ttest_ind
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Error evaluation
from sklearn.metrics import mean_squared_error, mean_squared_log_error, mean_absolute_error

# Modelling
import xgboost as xgb
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV

import joblib

Data from SQL Server database

In [None]:
# # Load the datasets from the database
# # Load environment variables from the .env file
# environment_variables = dotenv_values('.env')

# # Access database credentials from the environment variables dictionary
# server = environment_variables.get("SERVER")
# database = environment_variables.get("DATABASE")
# password = environment_variables.get("PASSWORD")
# username = environment_variables.get("USER")

# # Construct the connection string
# connection_string = f"DRIVER=ODBC Driver 17 for SQL Server;SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"

# # Connect to the database
# try:
#     connection = pyodbc.connect(connection_string)
#     print("Connection successful!")
# except Exception as e:
#     print("Error:", e)

# # SQL queries to extract data from the tables
# oil_query = "SELECT * FROM dbo.oil"
# holiday_query = "SELECT * FROM dbo.holidays_events"
# store_query = "SELECT * FROM dbo.store"

# # Suppress warnings
# warnings.filterwarnings('ignore')

# # Execute the queries and load the results into pandas DataFrames
# oil_data = pd.read_sql_query(oil_query, connection)
# holiday_data = pd.read_sql_query(holiday_query, connection)
# store_data = pd.read_sql_query(store_query, connection)

In [16]:
holiday_data = pd.read_csv("holiday_data.csv")
holiday_data

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,02/03/2012,Holiday,Local,Manta,Fundacion de Manta,False
1,01/04/2012,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,12/04/2012,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,14/04/2012,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,21/04/2012,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
...,...,...,...,...,...,...
345,22/12/2017,Additional,National,Ecuador,Navidad-3,False
346,23/12/2017,Additional,National,Ecuador,Navidad-2,False
347,24/12/2017,Additional,National,Ecuador,Navidad-1,False
348,25/12/2017,Holiday,National,Ecuador,Navidad,False


In [18]:
stores_data = pd.read_csv("stores_data.csv")
stores_data

Unnamed: 0.1,Unnamed: 0,store_nbr,city,state,type,cluster
0,0,1,Quito,Pichincha,D,13
1,1,2,Quito,Pichincha,D,13
2,2,3,Quito,Pichincha,D,8
3,3,4,Quito,Pichincha,D,9
4,4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4
5,5,6,Quito,Pichincha,D,13
6,6,7,Quito,Pichincha,D,8
7,7,8,Quito,Pichincha,D,8
8,8,9,Quito,Pichincha,B,6
9,9,10,Quito,Pichincha,C,15


In [19]:
oil_data = pd.read_csv("oil_data.csv")
oil_data

Unnamed: 0.1,Unnamed: 0,date,dcoilwtico
0,0,01/01/2013,
1,1,02/01/2013,93.139999
2,2,03/01/2013,92.970001
3,3,04/01/2013,93.120003
4,4,07/01/2013,93.199997
...,...,...,...
1213,1213,25/08/2017,47.650002
1214,1214,28/08/2017,46.400002
1215,1215,29/08/2017,46.459999
1216,1216,30/08/2017,45.959999
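
The oil output above shows dates in day-first form (`DD/MM/YYYY`), leftover index columns, and a missing price on the first row. A hedged sketch of how these could be handled follows. It works on a few rows copied from that output into an in-memory CSV, and the cleaning steps are suggestions, not the project's confirmed approach.

```python
import io
import pandas as pd

# A few rows copied from the oil output above, as an in-memory CSV
csv_text = """Unnamed: 0,date,dcoilwtico
0,01/01/2013,
1,02/01/2013,93.139999
2,03/01/2013,92.970001
3,04/01/2013,93.120003
4,07/01/2013,93.199997
"""
oil_sample = pd.read_csv(io.StringIO(csv_text))

# Drop leftover index columns such as "Unnamed: 0"
oil_sample = oil_sample.loc[:, ~oil_sample.columns.str.startswith("Unnamed")]

# Parse with an explicit day-first format so 02/01/2013 is 2 January, not 1 February
oil_sample["date"] = pd.to_datetime(oil_sample["date"], format="%d/%m/%Y")

# Count missing oil prices
missing_prices = int(oil_sample["dcoilwtico"].isna().sum())
print(oil_sample.dtypes)
print("Missing prices:", missing_prices)
```

Giving the format explicitly avoids silent month/day swaps when these dates are later joined to the train data, which uses `YYYY-MM-DD`.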


Load train data from GitHub

In [22]:
transactions_data = pd.read_csv("transactions.csv")
transactions_data

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922
...,...,...,...
83483,2017-08-15,50,2804
83484,2017-08-15,51,1573
83485,2017-08-15,52,2255
83486,2017-08-15,53,932


In [23]:
train_data = pd.read_csv("train.csv")
train_data

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0
1,1,2013-01-01,1,BABY CARE,0.000,0
2,2,2013-01-01,1,BEAUTY,0.000,0
3,3,2013-01-01,1,BEVERAGES,0.000,0
4,4,2013-01-01,1,BOOKS,0.000,0
...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


Load test data from OneDrive

In [24]:
test_data = pd.read_csv("test.csv")
test_data

Unnamed: 0,id,date,store_nbr,family,onpromotion
0,3000888,2017-08-16,1,AUTOMOTIVE,0
1,3000889,2017-08-16,1,BABY CARE,0
2,3000890,2017-08-16,1,BEAUTY,2
3,3000891,2017-08-16,1,BEVERAGES,20
4,3000892,2017-08-16,1,BOOKS,0
...,...,...,...,...,...
28507,3029395,2017-08-31,9,POULTRY,1
28508,3029396,2017-08-31,9,PREPARED FOODS,0
28509,3029397,2017-08-31,9,PRODUCE,1
28510,3029398,2017-08-31,9,SCHOOL AND OFFICE SUPPLIES,9


#### Exploratory Data Analysis (EDA)

#### The Train Data

In [26]:
# Check for nulls in the train dataset
train_data.isnull().sum()

id             0
date           0
store_nbr      0
family         0
sales          0
onpromotion    0
dtype: int64

In [None]:
#Check for duplicates in the train dataset
train_data.duplicated().sum()
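
Analytical question 1 asks whether the train data covers every calendar date. One way to check this, sketched here under assumptions, is to compare the dates present against a full daily range built with `pd.date_range`. The example uses a short made-up date column instead of `train_data`.

```python
import pandas as pd

# Made-up dates with one gap (illustrative; use train_data["date"] in practice)
sample = pd.DataFrame(
    {"date": ["2013-01-01", "2013-01-02", "2013-01-04", "2013-01-05"]}
)
sample["date"] = pd.to_datetime(sample["date"])

# Every calendar day between the first and last observed date
expected_dates = pd.date_range(sample["date"].min(), sample["date"].max(), freq="D")

# Dates in the full range that never occur in the data
observed_dates = pd.DatetimeIndex(sample["date"].unique())
missing_dates = expected_dates.difference(observed_dates)
print(missing_dates)
```

An empty result would indicate that no dates are missing between the first and last observation.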