## Analyzing historical demand and predicting sales using Machine Learning

Capstone project for Intermediate Data Science

Sumithra Candasamy

### The Project

Predict the sales based on historical demand and holiday markdown events for 45 stores located in different regions. Each store contains many departments and there is a need to project the sales for each department in each store. Markdowns affect sales, and the challenge is to predict which departments are affected and the extent of the impact.

This project explores the data sets and builds models that forecast sales for a future timeframe. With these forecasts, the client can make better inventory decisions.

### The Data

The data for this project comes from Kaggle. It consists of store information, training and test data, and additional feature data.

-  _stores.csv_

> This file contains anonymized information about the 45 stores, indicating the type and size of store.

-  _train.csv_

> This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:
- the store number
- the department number
- the weekly date
-  sales for the given department in the given store
- whether the week is a special holiday week


-  _test.csv_

> This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

-  _features.csv_

> This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:
- the store number
- the week
- average temperature in the region
- cost of fuel in the region
- MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
- CPI - the consumer price index
- Unemployment - the unemployment rate
- IsHoliday - whether the week is a special holiday week
> For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):
>
> - Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
> - Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
> - Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
> - Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

### Methodology

I used two time series forecasting procedures: Prophet (FBProphet) and ARIMA.

**Prophet** is a procedure for forecasting time series data. It is based on an additive model where non-linear trends are fit with yearly and weekly seasonality, plus holidays. It works best with daily periodicity data with at least one year of historical data. Prophet is robust to missing data, shifts in the trend, and large outliers.
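
As a compact summary of the additive model described above, the decomposition can be written as follows. The symbol names are the conventional ones and are not taken from this project's code:

$$
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
$$

Here $g(t)$ is the non-linear trend, $s(t)$ is the periodic (seasonal) component, $h(t)$ is the effect of holidays, and $\varepsilon_t$ is the error term not explained by the other three.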

In time series analysis, an autoregressive integrated moving average (**ARIMA**) model is a generalization of an autoregressive moving average (ARMA) model. Both ARMA and ARIMA models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting). ARIMA models are applied in some cases where data show evidence of non-stationarity, where an initial differencing step (corresponding to the "integrated" part of the model) can be applied one or more times to eliminate the non-stationarity.
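
To make the differencing step concrete, here is a small sketch on toy series (the values are made up for illustration and are not project data). One round of differencing turns a linear trend into a constant, and two rounds do the same for a quadratic trend:

```python
import numpy as np

t = np.arange(10, dtype=float)

# Toy series with deterministic trends (illustrative values only).
linear = 3.0 + 2.0 * t
quadratic = 1.0 + 0.5 * t**2

# d = 1: first differences y[t] - y[t-1] remove a linear trend.
first_diff = np.diff(linear, n=1)

# d = 2: differencing twice removes a quadratic trend.
second_diff = np.diff(quadratic, n=2)

print(first_diff)   # every entry equals the slope, 2.0
print(second_diff)  # every entry equals 1.0
```

Each round of differencing shortens the series by one observation, which is worth keeping in mind for short store/department histories.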

### Libraries

pandas for:
-  data loading, wrangling, cleaning and manipulation
-  feature selection and engineering
-  descriptive statistics

numpy for:
- array data structure, matrix manipulation    

matplotlib for:
- data visualization

scikit-learn for:
- model evaluation

### Data Wrangling and Cleaning

The data comes as CSV files, which load easily into pandas DataFrames. Each column is read with its appropriate data type (int, float, bool), and dates are parsed automatically. Some weeks are missing from train.csv; these are filled with 0 sales.
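
The loading and gap-filling steps might look roughly like the sketch below. The column names (`Store`, `Dept`, `Date`, `Weekly_Sales`, `IsHoliday`) are assumed from the Kaggle file layout, the helper names are hypothetical, and the inline CSV is toy data standing in for train.csv:

```python
import io

import pandas as pd

# Toy stand-in for train.csv (made-up values; column names are an assumption).
SAMPLE_TRAIN_CSV = """Store,Dept,Date,Weekly_Sales,IsHoliday
1,1,2010-02-05,100.0,False
1,1,2010-02-12,150.0,True
1,1,2010-02-26,90.0,False
1,2,2010-02-05,200.0,False
1,2,2010-02-19,210.0,False
"""


def load_sales(source):
    """Read the sales file with explicit dtypes and parsed dates."""
    return pd.read_csv(
        source,
        parse_dates=["Date"],
        dtype={"Store": "int64", "Dept": "int64",
               "Weekly_Sales": "float64", "IsHoliday": "bool"},
    )


def fill_missing_weeks(df):
    """Give every store/department a row for every Friday, using 0 for gaps."""
    all_weeks = pd.date_range(df["Date"].min(), df["Date"].max(), freq="W-FRI")
    pieces = []
    for (store, dept), group in df.groupby(["Store", "Dept"]):
        sales = (group.set_index("Date")["Weekly_Sales"]
                 .reindex(all_weeks, fill_value=0.0))
        pieces.append(pd.DataFrame({
            "Store": store,
            "Dept": dept,
            "Date": all_weeks,
            "Weekly_Sales": sales.to_numpy(),
        }))
    return pd.concat(pieces, ignore_index=True)


raw = load_sales(io.StringIO(SAMPLE_TRAIN_CSV))
filled = fill_missing_weeks(raw)
print(filled)
```

This sketch drops `IsHoliday` from the filled frame, since the flag for a missing week would have to be looked up from the features file rather than guessed.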

### Feature Selection and Engineering

The dataset contains historical sales for 81 departments across 45 stores. I used these sales together with the holiday information, which revealed the seasonality of sales in the different store/department combinations.

### Model Fitting

#### Prophet

The Prophet model is configured with the holiday information and yearly seasonality. The data was split into train and test sets, and the model was fit on the train set. A future dataframe covering the length of the test set was then created, and the predict method was used to forecast over that timeframe.
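
A minimal sketch of this workflow is shown below. It is not the project's actual code: the function names are hypothetical, the holiday dates are the ones listed in the data description, and the weekly frequency `W-FRI` is an assumption based on the dates falling on Fridays. Prophet expects the input frame to have columns `ds` (date) and `y` (value), and the holiday frame to have columns `holiday` and `ds`.

```python
import pandas as pd

# Holiday weeks as listed in the data description.
HOLIDAY_WEEKS = {
    "Super Bowl": ["2010-02-12", "2011-02-11", "2012-02-10", "2013-02-08"],
    "Labor Day": ["2010-09-10", "2011-09-09", "2012-09-07", "2013-09-06"],
    "Thanksgiving": ["2010-11-26", "2011-11-25", "2012-11-23", "2013-11-29"],
    "Christmas": ["2010-12-31", "2011-12-30", "2012-12-28", "2013-12-27"],
}


def build_holidays(holiday_weeks=HOLIDAY_WEEKS):
    """Build the holiday table in the layout Prophet expects."""
    frames = [
        pd.DataFrame({"holiday": name, "ds": pd.to_datetime(dates)})
        for name, dates in holiday_weeks.items()
    ]
    return pd.concat(frames, ignore_index=True)


def prophet_forecast(train, horizon_weeks, holidays):
    """Fit on `train` (columns ds, y) and forecast `horizon_weeks` ahead."""
    # Deferred import so the holiday table can be built without Prophet
    # installed; older releases use `from fbprophet import Prophet`.
    from prophet import Prophet

    model = Prophet(
        holidays=holidays,
        yearly_seasonality=True,
        weekly_seasonality=False,  # weekly data has no within-week pattern
        daily_seasonality=False,
    )
    model.fit(train)
    future = model.make_future_dataframe(periods=horizon_weeks, freq="W-FRI")
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(horizon_weeks)


holidays = build_holidays()
print(holidays.head())
```

In practice `prophet_forecast` would be called once per store/department series, with `horizon_weeks` set to the number of weeks in the held-out test split.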

#### ARIMA

Three integers (p, d, q) parametrize an ARIMA model. Together they capture autocorrelation, trend, and noise in the series.

- p is the auto-regressive part of the model. It lets us incorporate the effect of past values into the model.
- d is the integrated part of the model. It sets how many times the series is differenced.
- q is the moving average part of the model. It lets us express the model error as a linear combination of the errors observed at previous time points.
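
For reference, the three parts can be written in one equation using the backshift operator $B$, where $B\,y_t = y_{t-1}$. The coefficient symbols are the conventional ones and do not come from this project:

$$
\left(1 - \sum_{i=1}^{p} \phi_i B^i\right)(1 - B)^d\, y_t = c + \left(1 + \sum_{j=1}^{q} \theta_j B^j\right)\varepsilon_t
$$

The $\phi_i$ terms are the p auto-regressive coefficients, $(1 - B)^d$ applies differencing d times, the $\theta_j$ terms are the q moving average coefficients, $c$ is a constant, and $\varepsilon_t$ is the error at time $t$.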

The Augmented Dickey-Fuller (ADF) test is run to check whether each time series is stationary.

Autocorrelation (ACF) and partial autocorrelation (PACF) plots of the data are used to pick a range for the p and q parameters. A grid search over that range then finds the best values of p and q.

Since the time series have seasonal effects, we can use seasonal ARIMA, denoted ARIMA(p,d,q)(P,D,Q)s. Here, (p, d, q) are the non-seasonal parameters, while (P, D, Q) follow the same definitions but apply to the seasonal component of the time series.
The term s is the number of observations in one seasonal cycle (4 for quarterly data, 12 for monthly data, 52 for weekly data with a yearly cycle).
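
Written out with the backshift operator $B$ (where $B\,y_t = y_{t-1}$), the seasonal model takes the following standard form. The polynomial symbols are conventional notation and are not taken from this project:

$$
\phi_p(B)\,\Phi_P(B^s)\,(1 - B)^d\,(1 - B^s)^D\, y_t = \theta_q(B)\,\Theta_Q(B^s)\,\varepsilon_t
$$

Here $\phi_p$ and $\theta_q$ are the non-seasonal auto-regressive and moving average polynomials of degree p and q, $\Phi_P$ and $\Theta_Q$ are their seasonal counterparts of degree P and Q, and $(1 - B^s)^D$ differences the series against the value one seasonal cycle earlier. For example, with $s = 52$ and $D = 1$, each week is compared with the same week of the previous year.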

### Model Evaluation

RMSE and MAE are computed for both models across all stores and departments. The Prophet forecasts have lower RMSE and MAE than the ARIMA forecasts, so Prophet appears to be the more accurate of the two models.
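
The two metrics can be computed with scikit-learn as in the sketch below. The helper name is hypothetical, and the numbers are toy values chosen for illustration, not results from this project:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error


def evaluate_forecast(actual, predicted):
    """Return MAE and RMSE for one forecast."""
    mae = float(mean_absolute_error(actual, predicted))
    rmse = float(np.sqrt(mean_squared_error(actual, predicted)))
    return {"MAE": mae, "RMSE": rmse}


# Toy weekly sales and two competing forecasts (made-up values).
actual = [100.0, 120.0, 90.0, 110.0]
forecast_a = [110.0, 115.0, 95.0, 100.0]
forecast_b = [130.0, 100.0, 70.0, 140.0]

scores_a = evaluate_forecast(actual, forecast_a)
scores_b = evaluate_forecast(actual, forecast_b)
print(scores_a)
print(scores_b)
```

RMSE penalizes large misses more heavily than MAE, so reporting both helps separate a model that is slightly off everywhere from one that is badly off in a few weeks.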

### Results

Overall, the Prophet model performs better than the ARIMA model. However, some departments, namely 38 and 72, show a relatively high MAE with the Prophet model. It would be worth consulting the product manager for more information on these departments.
Prophet also handles store/department combinations that have 0 sales in certain weeks better, whereas for these series the ARIMA search does not find good (p, d, q) values to build a model.