# Fast Moving Consumer Goods Sales Forecast - Part I

# 1. Introduction
We'll start with an overview of how machine learning models work and how they are used. This may feel basic if you've done statistical modeling or machine learning before. Don't worry, we will progress to building powerful models soon.

This week will have you build models as you go through following scenario:

As a graduate of the MIT SCM program you are hired by a large fast moving consumer goods (FMCG) company. They have put you in charge of creating a supply chain analytics department. The board wants to create this department as the company for decades relied on the intuition of the experienced staff. 
When talking with this staff you find out that they identified patterns in the revenues based on patterns such as holidays and sports events.

Machine learning works the same way.  We'll start with a model called the Decision Tree. There are fancier models that give more accurate predictions. But decision trees are easy to understand, and they are the basic building block for some of the best models in data science.

For simplicity, we'll start with the simplest possible decision tree. 

![First Decision Trees](https://www.dropbox.com/s/zwlagalivi46y6f/decisiontree1.png?dl=1)

It divides the data into only two categories. The predicted price for any week is the historical average of weekly sales in the same category. 

We use data to decide how to break the sales data into two groups, and then again to determine the predicted revenue in each group. In this example we predict the sales of week with or without a holiday. This step of capturing patterns from data is called **fitting** or **training** the model. The data used to **fit** the model is called the **training data**.  

The details of how the model is fit (e.g. how to split up the data) is complex enough that we will save it for later. After the model has been fit, you can apply it to new data to **predict** weekly sales in the future.

---
## Improving the Decision Tree
Which of the following two decision trees is more likely to result from fitting the FMCG sales training data?

![First Decision Trees](https://www.dropbox.com/s/116u9op450wwaze/Drawing1%20%285%29.png?dl=1)

The decision tree on the right (Decision Tree 2) probably makes more sense, because it captures the reality that in the week before a holiday revenue will be higher than usual. The biggest shortcoming of this model is that it doesn't capture other factors such as which month of the year it is. A FMCG store will probably have higher sales in the week before Christmas and Thanksgiving than in the week of Presidents day.

You can capture more factors using a tree that has more "splits." These are called "deeper" trees. A decision tree that also considers the month of the year might look like this:
![Depth 2 Tree](https://www.dropbox.com/s/jzakh55vj1q4xr3/Drawing1%20%284%29.png?dl=1)

You predict the sales of any week by tracing through the decision tree, always picking the path corresponding to that characteristics of that week. The predicted price for that week is at the bottom of the tree.  The point at the bottom where we make a prediction is called a **leaf.**   

The splits and values at the leaves will be determined by the data, so it's time for you to check out the data you will be working with.

---
# 2. Using Pandas to Get Familiar With Your Data

The first step in any machine learning project is familiarize yourself with the data. You'll use the Pandas library for this. Pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as `pd`.  We do this with the command:

In [1]:
import pandas as pd

The most important part of the Pandas library is the DataFrame.  A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database. 

Pandas has powerful methods for most things you'll want to do with this type of data.  

As an example, we'll look at [data about weekly retail sales at Walmart stores](https://www.kaggle.com/datasets/rutuspatel/walmart-dataset-retail). The example (Walmart Retail Dataset) data is at the file path **`Walmart_Store_sales.csv`**.

We load and explore the data with the following commands:

In [2]:
from datetime import datetime
# save filepath to variable for easier access
walmart_file_path = 'https://www.dropbox.com/s/ns7envvzoqyypui/Walmart_Store_sales.csv?dl=1'

# read the data and store data in DataFrame titled walmart_data
# Parse date column from day-month-year into Pandas 
walmart_data = pd.read_csv(walmart_file_path,parse_dates=['Date'], date_parser=lambda x: datetime.strptime(x, '%d-%M-%Y').date()) 

In [3]:
walmart_data.head(5)

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,2010-01-05,1643690.9,0,42.31,2.572,211.096358,8.106
1,1,2010-01-12,1641957.44,1,38.51,2.548,211.24217,8.106
2,1,2010-01-19,1611968.17,0,39.93,2.514,211.289143,8.106
3,1,2010-01-26,1409727.59,0,46.63,2.561,211.319643,8.106
4,1,2010-01-05,1554806.68,0,46.5,2.625,211.350143,8.106


In [4]:
#List data types of each column
walmart_data.dtypes

Store                    int64
Date            datetime64[ns]
Weekly_Sales           float64
Holiday_Flag             int64
Temperature            float64
Fuel_Price             float64
CPI                    float64
Unemployment           float64
dtype: object

In [5]:
# print a summary of the data in Walmart data
walmart_data.describe(datetime_is_numeric=True)

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
count,6435.0,6435,6435.0,6435.0,6435.0,6435.0,6435.0,6435.0
mean,23.0,2011-01-02 21:59:09.650349824,1046965.0,0.06993,60.663782,3.358607,171.578394,7.999151
min,1.0,2010-01-01 00:00:00,209986.2,0.0,-2.06,2.472,126.064,3.879
25%,12.0,2010-01-23 00:00:00,553350.1,0.0,47.46,2.933,131.735,6.891
50%,23.0,2011-01-14 00:00:00,960746.0,0.0,62.67,3.445,182.616521,7.874
75%,34.0,2012-01-06 00:00:00,1420159.0,0.0,74.94,3.735,212.743293,8.622
max,45.0,2012-01-31 00:00:00,3818686.0,1.0,100.14,4.468,227.232807,14.313
std,12.988182,,564366.6,0.255049,18.444933,0.45902,39.356712,1.875885


## Interpreting Data Description
The results show 8 numbers for each column in your original dataset. The first number, the **count**,  shows how many rows have non-missing values. In this case there are no missing values.

Missing values could arise for many reasons. For example, a store might be closed temporarily during an emergency such as a hurricane and subsequently have no sales data. We'll come back to the topic of missing data.

The second value is the **mean**, which is the average.  Under that, **std** is the standard deviation, which measures how numerically spread out the values are.

To interpret the **min**, **25%**, **50%**, **75%** and **max** values, imagine sorting each column from lowest to highest value.  The first (smallest) value is the min.  If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values.  That is the **25%** value (pronounced "25th percentile").  The 50th and 75th percentiles are defined analogously, and the **max** is the largest number.

---
# 3. Selecting Data for Modeling

Your dataset had too many variables to wrap your head around, or even to print out nicely.  How can you pare down this overwhelming amount of data to something you can understand?

We'll start by picking a few variables using our intuition. Later we will show you statistical techniques to automatically prioritize variables.

To choose variables/columns, we'll need to see a list of all columns in the dataset. That is done with the **columns** property of the DataFrame (the bottom line of code below).


In [6]:
#import pandas as pd
#from datetime import datetime

# save filepath to variable for easier access
walmart_file_path = 'https://www.dropbox.com/s/ns7envvzoqyypui/Walmart_Store_sales.csv?dl=1'

# read the data and store data in DataFrame titled walmart_data
# Parse date column from day-month-year into Pandas 
walmart_data = pd.read_csv(walmart_file_path,parse_dates=['Date'], date_parser=lambda x: datetime.strptime(x, '%d-%M-%Y')) 
walmart_data.columns

Index(['Store', 'Date', 'Weekly_Sales', 'Holiday_Flag', 'Temperature',
       'Fuel_Price', 'CPI', 'Unemployment'],
      dtype='object')

In [7]:
# dropna drops missing values (think of na as "not available")
walmart_data = walmart_data.dropna(axis=0)

There are many ways to select a subset of your data. We will focus on two approaches for now.

1. Dot notation, which we use to select the "prediction target"
2. Selecting with a column list, which we use to select the "features" 

## Selecting The Prediction Target 
You can pull out a variable with **dot-notation**.  This single column is stored in a **Series**, which is broadly like a DataFrame with only a single column of data.  

We'll use the dot notation to select the column we want to predict, which is called the **prediction target**. By convention, the prediction target is called **y**. So the code we need to save the weekly sales in the Walmart data is

In [8]:
y = walmart_data.Weekly_Sales

## Choosing "Features"
The columns that are inputted into our model (and later used to make predictions) are called "features." In our case, those would be the columns used to determine the weekly sales. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features. 

For now, we'll build a model with only a few features. Later on you'll see how to iterate and compare models built with different features.

We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).

Here is an example:

In [9]:
walmart_features = ['Store', 'Fuel_Price', 'Unemployment', 'CPI', 'Temperature', 'Holiday_Flag']

By convention, this data is called **X**.

In [10]:
X = walmart_data[walmart_features]

Let's quickly review the data we'll be using to predict weekly sales using the `describe` method and the `head` method, which shows the top few rows.

In [11]:
X.describe()

Unnamed: 0,Store,Fuel_Price,Unemployment,CPI,Temperature,Holiday_Flag
count,6435.0,6435.0,6435.0,6435.0,6435.0,6435.0
mean,23.0,3.358607,7.999151,171.578394,60.663782,0.06993
std,12.988182,0.45902,1.875885,39.356712,18.444933,0.255049
min,1.0,2.472,3.879,126.064,-2.06,0.0
25%,12.0,2.933,6.891,131.735,47.46,0.0
50%,23.0,3.445,7.874,182.616521,62.67,0.0
75%,34.0,3.735,8.622,212.743293,74.94,0.0
max,45.0,4.468,14.313,227.232807,100.14,1.0


In [12]:
X.head()

Unnamed: 0,Store,Fuel_Price,Unemployment,CPI,Temperature,Holiday_Flag
0,1,2.572,8.106,211.096358,42.31,0
1,1,2.548,8.106,211.24217,38.51,1
2,1,2.514,8.106,211.289143,39.93,0
3,1,2.561,8.106,211.319643,46.63,0
4,1,2.625,8.106,211.350143,46.5,0


Visually checking your data with these commands is an important part of a data scientist's job.  You'll frequently find surprises in the dataset that deserve further inspection.

---
# 4. Building Your Model

You will use the **scikit-learn** library to create your models.  When coding, this library is written as **sklearn**, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames. 

The steps to building and using a model are:
* **Define:** What type of model will it be?  A decision tree?  Some other type of model? Some other parameters of the model type are specified too.
* **Fit:** Capture patterns from provided data. This is the heart of modeling.
* **Predict:** Just what it sounds like
* **Evaluate**: Determine how accurate the model's predictions are.

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable.

In [13]:
#Import DecisionTreeRegressor from sklearn.tree
from sklearn.tree import DecisionTreeRegressor

# Define model (specify a number for random_state to ensure same results each run)
walmart_model = DecisionTreeRegressor(random_state=1)

# Fit model
walmart_model.fit(X, y)

DecisionTreeRegressor(random_state=1)

Many machine learning models allow some randomness in model training. Specifying a number for `random_state` ensures you get the same results in each run. This is considered a good practice. You use any number, and model quality won't depend meaningfully on exactly what value you choose.

We now have a fitted model that we can use to make predictions.

In practice, you'll want to make predictions for the future and not the past for which we already know the weekly sales. But we'll make predictions for the first few rows of the training data to see how the predict function works.


In [14]:
print("Making predictions for the following 5 weeks:")
print(X.head())
print("The predictions are")
print(walmart_model.predict(X.head()))

Making predictions for the following 5 weeks:
   Store  Fuel_Price  Unemployment         CPI  Temperature  Holiday_Flag
0      1       2.572         8.106  211.096358        42.31             0
1      1       2.548         8.106  211.242170        38.51             1
2      1       2.514         8.106  211.289143        39.93             0
3      1       2.561         8.106  211.319643        46.63             0
4      1       2.625         8.106  211.350143        46.50             0
The predictions are
[1643690.9  1641957.44 1611968.17 1409727.59 1554806.68]


You've built a model. But how good is it?

You will learn to use model validation to measure the quality of your model. Measuring model quality is the key to iteratively improving your models.

---
# 5. What is Model Validation

You'll want to evaluate almost every model you ever build. In many applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens.

Many people make a huge mistake when measuring predictive accuracy. They make predictions with their *training data* and compare those predictions to the target values in the *training data*. You'll see the problem with this approach and how to solve it in a moment, but let's think about how we'd do this first.

You'd first need to summarize the model quality into an understandable way. If you compare predicted and actual weekly sales of a long period, you'll likely find mix of good and bad predictions. But looking through a list of 10,000 predicted and actual values would be tedious and pointless. We need to summarize this into a single metric.

There are many metrics for summarizing model quality, but we'll start with one called **Mean Absolute Error** (also called **MAE**). Let's break down this metric starting with the last word, error.

The prediction error for each week is: <br>
```
error=actual−predicted
```
 
So, if the weekly sales is \$150,000 and you predicted it would be \$100,000 then the error is \$50,000.

With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be said as

> On average, our predictions are off by about X.

To calculate MAE, we first need a model. 

In [15]:
#import pandas as pd

# Load data
walmart_file_path = 'https://www.dropbox.com/s/ns7envvzoqyypui/Walmart_Store_sales.csv?dl=1'
walmart_data = pd.read_csv(walmart_file_path) 
# Filter rows with missing price values
filtered_walmart_data = walmart_data.dropna(axis=0)
# Choose target and features
y = filtered_walmart_data.Weekly_Sales
walmart_features = ['Fuel_Price', 'Unemployment', 'CPI', 'Temperature', 'Holiday_Flag']
X = filtered_walmart_data[walmart_features]

from sklearn.tree import DecisionTreeRegressor
# Define model
walmart_model = DecisionTreeRegressor()
# Fit model
walmart_model.fit(X, y)

DecisionTreeRegressor()

Once we have a model, here is how we calculate the Mean Absolute Error (MAE):

In [16]:
from sklearn.metrics import mean_absolute_error

predicted_weekly_sales = walmart_model.predict(X)
mean_absolute_error(y, predicted_weekly_sales)

104402.50066822066

And to calculate the Mean Absolute Percentage Error (MAPE)

In [17]:
from sklearn.metrics import mean_absolute_percentage_error

predicted_weekly_sales = walmart_model.predict(X)
mean_absolute_percentage_error(y, predicted_weekly_sales)

0.13253033531516809

And the Mean Squared Error (MSE)

In [18]:
from sklearn.metrics import mean_squared_error

predicted_weekly_sales = walmart_model.predict(X)
mean_squared_error(y, predicted_weekly_sales)

59974765422.80942

And finally to calculate the Root Mean Squared Error (RMSE), we can add the parameter squared=False

In [19]:
mean_squared_error(y, predicted_weekly_sales, squared=False)

244897.4589962285

## The Problem with "In-Sample" Scores

The measure we just computed can be called an "in-sample" score. We used a single "sample" of weekly sales for both building the model and evaluating it. Here's why this is bad.

Imagine that the temparture is unrelated to weekly sales. 

However, in the sample of data you used to build the model, all weeks with low temperature had a higher weekly sales value. The model's job is to find patterns that predict weekly sales, so it will see this pattern, and it will always predict high prices for weeks with low temperatures.

Since this pattern was derived from the training data, the model will appear accurate in the training data.

But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.

Since models' practical value come from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use those to test the model's accuracy on data it hasn't seen before. This data is called **validation data**.

## Coding the Train-Test Split

The scikit-learn library has a function `train_test_split` to break up the data into two pieces. We'll use some of that data as training data to fit the model, and we'll use the other data as validation data to calculate `mean_absolute_error`.

Here is the code:

In [20]:
#Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# Define model
walmart_model = DecisionTreeRegressor()

# Fit model
walmart_model.fit(train_X, train_y)

# Get predicted prices on validation data
val_predictions = walmart_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

428447.76891443966


## Wow!

Your mean absolute error for the in-sample data was about 104,000 dollars. Out-of-sample it is more than 425,000 dollars.

This is the difference between a model that is almost exactly right, and one that is unusable for most practical purposes. As a point of reference, the average weekly sales in the validation data is about 1 million dollars. So the error in new data is almost half of the average weekly sales.

There are many ways to improve this model, such as experimenting to find better features or different model types. 

In the next class, you will first learn the concepts of underfitting and overfitting, and you will be able to apply these ideas to make your models more accurate.