### 16.0 Problem Statement
#### ZERO-IN ON IMPACT OF MARKDOWNS

In this section, we will implement a Gaussian Naive Bayes classifier in Python to assess the impact of the **MarkDowns** on **Clearance_Clothings [Clothings on sales]**.

Bayes’ theorem is based on conditional probability. Conditional probability lets us calculate the probability that something will happen, given that something else has already happened. The Naive Bayes classifier applies Bayes’ theorem under the assumption that all the features are independent of each other given the class. Even if the features actually depend on each other, the Naive Bayes classifier treats each of them as contributing independently to the probability that the target event occurs.
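In symbols, the idea can be written as follows. The notation is introduced here for illustration only: $y$ is the class (the sales category) and $x_1, \dots, x_n$ are the features (the markdowns).

```latex
P(y \mid x_1, \dots, x_n)
  = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
  \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The equality is Bayes’ theorem. The proportionality uses the "naive" independence assumption and drops the denominator, which is the same for every class. The predicted class is the $y$ that maximizes the right-hand side.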

### 16.1 Gaussian Naive Bayes
Gaussian Naive Bayes is a special type of Naive Bayes (NB) algorithm. It is used when the features have continuous values, and it assumes that, within each class, every feature follows a Gaussian (i.e., normal) distribution.
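To make the idea concrete, here is a minimal from-scratch sketch using NumPy on made-up toy data. The function names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical helpers written for this illustration, not part of any library, and the small variance constant is an assumption added to avoid division by zero.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant avoids division by zero (an assumption of this sketch)
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class with the highest log prior + summed Gaussian log-likelihood."""
    # Broadcasting gives arrays of shape (n_samples, n_classes, n_features)
    diff = X[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    log_posterior = np.log(priors)[None, :] + log_pdf.sum(axis=2)
    return classes[np.argmax(log_posterior, axis=1)]

# Toy data: two well-separated classes on two continuous features
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(X_toy, y_toy)
toy_pred = predict_gaussian_nb(np.array([[1.1, 2.1], [5.1, 5.9]]), *params)
print(toy_pred)
```

Working in log space avoids numerical underflow when many small per-feature likelihoods are multiplied together.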



### 16.2 The Dataset
The dataset for this analysis will be extracted from the master retail sales dataset. The five markdowns will be used to predict whether clearance clothing sales are **Weak** (<=10k, class 0), **Average** (>10k and <=20k, class 1), **Strong** (>20k and <=30k, class 2), or **Very Strong** (>30k, class 3).

### 16.3 Import required Python machine learning packages
We need to import the pandas, numpy and sklearn libraries. From sklearn, we import preprocessing utilities, including an imputer class that helps to fill in missing values (NB: missing values have already been taken care of in our master dataset). In recent scikit-learn releases the Imputer class has been replaced by SimpleImputer in sklearn.impute, and train_test_split lives in sklearn.model_selection, so the imports below use those locations.

In [1]:
# Required Python Machine learning Packages
import pandas as pd
import numpy as np
# For preprocessing the data (SimpleImputer replaces the older Imputer class)
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
# To split the dataset into train and test datasets
from sklearn.model_selection import train_test_split
# To model the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
%matplotlib inline



These are the Python machine learning packages we are going to use to build the Gaussian Naive Bayes classifier. Let’s talk about why we need them.

The train_test_split function splits the dataset into training and testing sets. The accuracy_score function will be used to calculate the accuracy of our Gaussian Naive Bayes model.

### 16.4 Data Importing
For importing the data and manipulating it, we are going to use pandas dataframes. First of all, we will load the dataset.

In [26]:
# Read the file 'master_dataset.xlsx' into a DataFrame df using the read_excel() function.
df = pd.read_excel('master_dataset.xlsx', sheet_name='Sheet1')

We are saving our data into the “df” dataframe. To check the length and dimensions of the dataframe, we can use len() and “.shape”; to check the feature names and summary information about the dataset, we can use .keys() and .info().

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8190 entries, 0 to 8189
Data columns (total 95 columns):
Store                     8190 non-null int64
Date                      8190 non-null datetime64[ns]
Temperature               8190 non-null float64
Fuel_Price                8190 non-null float64
MarkDown1                 8190 non-null float64
MarkDown2                 8190 non-null float64
MarkDown3                 8190 non-null float64
MarkDown4                 8190 non-null float64
MarkDown5                 8190 non-null float64
CPI                       8190 non-null float64
Unemployment              8190 non-null float64
IsHoliday                 8190 non-null bool
Type                      8190 non-null object
Size                      8190 non-null int64
Jewelry                   8190 non-null float64
Pets                      8190 non-null float64
TV_Video                  8190 non-null float64
Cell_Phones               8190 non-null float64
Pharmaceutical            8190
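The other inspection calls mentioned above (len(), .shape and .keys()) can be tried on a small stand-in frame. The `toy_df` below is hypothetical data made up for this sketch, not the master dataset.

```python
import pandas as pd

# Hypothetical stand-in for the master dataset
toy_df = pd.DataFrame({
    'Store': [1, 1, 2],
    'MarkDown1': [10382.9, 2475.8, 7174.1],
    'IsHoliday': [False, True, False],
})

n_rows = len(toy_df)                 # number of rows
dims = toy_df.shape                  # (rows, columns)
column_names = list(toy_df.keys())   # same labels as toy_df.columns

print(n_rows, dims, column_names)
toy_df.info()                        # dtypes and non-null counts per column
```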

#### 16.4.1 The MarkDowns
One of the key interests in our dataset is the markdowns. There are five of them: MarkDown1 (promotions carried out from Easter, i.e. spring), MarkDown2 (promotions carried out from Thanksgiving), MarkDown3 (promotions carried out from Christmas), MarkDown4 (promotions carried out from Labor Day) and MarkDown5 (promotions carried out from summer).

We will extract these features into a dataframe named markdown_df.

In [28]:
#create markdown dataset
markdown_df = df[['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5'] ]

In [29]:
#display the first five rows of the markdown dataset
markdown_df.head()

| | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 |
|---|---|---|---|---|---|
| 0 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 |
| 1 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 |
| 2 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 |
| 3 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 |
| 4 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 |


#### 16.4.2 Clearance Clothing Sales
Clearance clothing sales will be used as the target variable. Weekly sales range from 0 to 40,000. We have therefore divided the weekly sales into four categories:

0 - Weak - Sales below or equal to 10k

1 - Average - Sales between 10k and 20k (including 20k)

2 - Strong - Sales between 20k and 30k (including 30k)

3 - Very Strong - Sales above 30k

Let us create a target column, df['range'], from the existing Clearance_Clothings column.

In [31]:
#Create new column for the different categories of clearance clothing

conditions = [
    (df['Clearance_Clothings'] <= 10000),
    (df['Clearance_Clothings']> 10000) & (df['Clearance_Clothings'] <= 20000),
    (df['Clearance_Clothings']> 20000) & (df['Clearance_Clothings'] <= 30000),
    (df['Clearance_Clothings']> 30000)]
choices = ['Weak', 'Average', 'Strong', 'Very Strong']
df['range'] = np.select(conditions, choices, default='Normal')
print(df.head())

   Store       Date  Temperature  Fuel_Price  MarkDown1  MarkDown2  MarkDown3  \
0      1 2010-05-02        42.31       2.572    10382.9    6115.67     215.07   
1      1 2010-12-02        38.51       2.548    10382.9    6115.67     215.07   
2      1 2010-02-19        39.93       2.514    10382.9    6115.67     215.07   
3      1 2010-02-26        46.63       2.561    10382.9    6115.67     215.07   
4      1 2010-05-03        46.50       2.625    10382.9    6115.67     215.07   

   MarkDown4  MarkDown5         CPI   ...     Musical_Instruments  Star_Wars  \
0    2406.62    6551.42  211.096358   ...                57022.45  118966.90   
1    2406.62    6551.42  211.242170   ...                57845.36  126907.41   
2    2406.62    6551.42  211.289143   ...                59462.22  122267.65   
3    2406.62    6551.42  211.319643   ...                63011.44  135066.75   
4    2406.62    6551.42  211.350143   ...                57335.17  125048.08   

  Movies_TV  Video_Games  Portab
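The same binning can also be sketched with pd.cut, which additionally gives the integer codes 0 to 3 listed above. This is an alternative shown on made-up values (`toy_sales` is hypothetical), not the method used in this notebook. Note that, unlike np.select with a default, pd.cut leaves missing values as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sales values, chosen to sit on and around the bin edges
toy_sales = pd.Series([0, 10000, 10000.01, 20000, 25000, 30000, 40000],
                      name='Clearance_Clothings')

bins = [-np.inf, 10000, 20000, 30000, np.inf]
labels = ['Weak', 'Average', 'Strong', 'Very Strong']

# right=True (the default) makes each bin include its upper edge,
# matching the "<=" comparisons in the conditions above
toy_range = pd.cut(toy_sales, bins=bins, labels=labels)
toy_codes = toy_range.cat.codes   # 0 = Weak, 1 = Average, 2 = Strong, 3 = Very Strong

print(pd.DataFrame({'sales': toy_sales, 'range': toy_range, 'code': toy_codes}))
```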

In [32]:
# create master dataset for analysis - assign the name clearance_df to it
clearance_df = df[['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5', 'range']]

In [33]:
#view the first five rows of clearance_df
clearance_df.head()

| | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | range |
|---|---|---|---|---|---|---|
| 0 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 | Weak |
| 1 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 | Weak |
| 2 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 | Weak |
| 3 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 | Average |
| 4 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 | Weak |


### 16.5 Data preprocessing

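For preprocessing, we are going to make a duplicate copy of our dataframe, copying clearance_df into a new dataframe, clearance_df_rev. We use .copy() because a plain assignment would only create a second name for the same object, so later changes to clearance_df_rev would also change clearance_df.


In [34]:
#Make a duplicate copy of clearance_df
clearance_df_rev = clearance_df.copy()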

Before we proceed, we need some summary statistics of our dataframe. For this, we can use the describe() method, which generates various summary statistics, excluding NaN values.

We pass the “include” parameter with the value “all” to specify that we want summary statistics for all the attributes, not only the numeric ones.

In [35]:
# Dataset basic statistics

clearance_df_rev.describe(include= 'all')

| | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | range |
|---|---|---|---|---|---|---|
| count | 8190.0 | 8190.0 | 8190.0 | 8190.0 | 8190.0 | 8190 |
| unique | | | | | | 4 |
| top | | | | | | Weak |
| freq | | | | | | 4756 |
| mean | 8887.617797 | 6107.224317 | 928.78522 | 3130.176556 | 4544.031686 | |
| std | 9180.062712 | 8960.310896 | 7528.138611 | 5183.784963 | 9679.725089 | |
| min | -2781.45 | -265.76 | -179.26 | 0.22 | -185.17 | |
| 25% | 2475.8475 | 197.84 | 14.54 | 263.0875 | 1882.345 | |
| 50% | 7174.09 | 2006.62 | 117.92 | 1772.815 | 3508.88 | |
| 75% | 11435.1425 | 8716.86 | 327.84 | 3834.44 | 5588.33 | |


For naive Bayes, we need to convert all the data values into one format. We are going to encode all the labels with values between 0 and n_classes - 1.

#### 16.5.1 Label Encoding
For implementing this, we are going to use the LabelEncoder of the scikit-learn library. Applied to a numeric column, LabelEncoder replaces each distinct value with its rank among the sorted unique values of that column. For encoding, we can also use a one-hot encoder, which encodes each category as a set of binary (0/1) columns.

In [42]:
#label encoding
le = preprocessing.LabelEncoder()
MarkDown1_cat = le.fit_transform(clearance_df.MarkDown1)
MarkDown2_cat = le.fit_transform(clearance_df.MarkDown2)
MarkDown3_cat = le.fit_transform(clearance_df.MarkDown3)
MarkDown4_cat = le.fit_transform(clearance_df.MarkDown4)
MarkDown5_cat = le.fit_transform(clearance_df.MarkDown5)

In [43]:
#initialize the encoded categorical columns
clearance_df_rev['MarkDown1'] = MarkDown1_cat
clearance_df_rev['MarkDown2'] = MarkDown2_cat
clearance_df_rev['MarkDown3'] = MarkDown3_cat
clearance_df_rev['MarkDown4'] = MarkDown4_cat
clearance_df_rev['MarkDown5'] = MarkDown5_cat


In [44]:
clearance_df_rev.head()

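| | range | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 |
|---|---|---|---|---|---|---|
| 0 | Weak | 3226 | 2326 | 2149 | 2269 | 3465 |
| 1 | Weak | 3226 | 2326 | 2149 | 2269 | 3465 |
| 2 | Weak | 3226 | 2326 | 2149 | 2269 | 3465 |
| 3 | Average | 3226 | 2326 | 2149 | 2269 | 3465 |
| 4 | Weak | 3226 | 2326 | 2149 | 2269 | 3465 |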


Next, we create a new pandas dataframe from the encoded columns.

In [47]:
clearance_df_new = pd.DataFrame({'MarkDown1_cat':MarkDown1_cat, 'MarkDown2_cat':MarkDown2_cat, 'MarkDown3_cat':MarkDown3_cat,
                                'MarkDown4_cat':MarkDown4_cat, 'MarkDown5_cat':MarkDown5_cat})

In [48]:
# See the first 5 rows of clearance_df_new
clearance_df_new.head()

| | MarkDown1_cat | MarkDown2_cat | MarkDown3_cat | MarkDown4_cat | MarkDown5_cat |
|---|---|---|---|---|---|
| 0 | 3226 | 2326 | 2149 | 2269 | 3465 |
| 1 | 3226 | 2326 | 2149 | 2269 | 3465 |
| 2 | 3226 | 2326 | 2149 | 2269 | 3465 |
| 3 | 3226 | 2326 | 2149 | 2269 | 3465 |
| 4 | 3226 | 2326 | 2149 | 2269 | 3465 |
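To illustrate the difference between the two encoders mentioned in this section, here is a small sketch on a made-up categorical column (`toy_labels` is hypothetical). It uses pandas' get_dummies for the one-hot step, which is one of several ways to do it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column
toy_labels = pd.Series(['Weak', 'Average', 'Weak', 'Strong', 'Very Strong'],
                       name='range')

# LabelEncoder: one integer per class, assigned in sorted order of the class names
encoder = LabelEncoder()
label_codes = encoder.fit_transform(toy_labels)
print(list(encoder.classes_))
print(label_codes.tolist())

# One-hot: one 0/1 indicator column per class, exactly one 1 in each row
one_hot = pd.get_dummies(toy_labels, dtype=int)
print(one_hot)
```

Label encoding keeps a single column but implies an ordering between the codes. One-hot encoding avoids that implied ordering at the cost of one extra column per category.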


### 16.6 Standardization of Data
All the data values of our dataframe are now numeric. Next, we need to bring them onto a single scale. We do this by standardizing each column: subtracting its mean and dividing by its standard deviation.

In [49]:
#Standardization of Data
num_features = ['MarkDown1_cat', 'MarkDown2_cat', 'MarkDown3_cat', 'MarkDown4_cat', 'MarkDown5_cat']

scaled_features = {}
for each in num_features:
    mean, std = clearance_df_new[each].mean(), clearance_df_new[each].std()
    scaled_features[each] = [mean, std]
    clearance_df_new.loc[:, each] = (clearance_df_new[each] - mean)/std

We have converted our data values into standardized values. Let us print and check the output of dataframe.

In [52]:
print(clearance_df_new.tail())

      MarkDown1_cat  MarkDown2_cat  MarkDown3_cat  MarkDown4_cat  \
8185      -0.200620       0.014843      -1.538562       0.482946   
8186       0.582277       0.378397       1.129279       1.102460   
8187      -0.439337       0.295132       0.086583      -0.419120   
8188      -0.676483       0.041816       0.638196      -0.843903   
8189      -1.559892      -0.062559      -1.609369      -1.613284   

      MarkDown5_cat  
8185      -0.019395  
8186      -1.092529  
8187      -0.602795  
8188      -1.502871  
8189      -0.820454  
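As a cross-check on the manual loop above, the sketch below compares it with scikit-learn's StandardScaler on made-up values (`toy_features` is hypothetical). The two differ only in the standard deviation formula: pandas' .std() defaults to the sample formula (ddof=1), while StandardScaler uses the population formula (ddof=0).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical encoded features
toy_features = pd.DataFrame({
    'MarkDown1_cat': [3226.0, 100.0, 1500.0, 4000.0],
    'MarkDown2_cat': [2326.0, 50.0, 900.0, 3100.0],
})

# Manual z-score, as in the loop above (pandas .std() uses ddof=1)
manual = (toy_features - toy_features.mean()) / toy_features.std()

# StandardScaler divides by the population standard deviation (ddof=0)
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy_features),
                      columns=toy_features.columns)

# The two results differ only by the constant factor sqrt((n - 1) / n)
n = len(toy_features)
ratio = np.sqrt((n - 1) / n)
print(np.allclose(manual.values, scaled.values * ratio))
```

With 8190 rows the factor is very close to 1, so the choice between the two makes little practical difference here.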


### 16.7 Data Slicing

Let’s split the data into training and test set. We can easily perform this step using sklearn’s train_test_split() method.

In [53]:
# split the dataset to training and test set
features = clearance_df_new.values
target = clearance_df['range'].values
features_train, features_test, target_train, target_test = train_test_split(features,
                                                                            target, test_size = 0.33, random_state = 10)

Using the above code snippet, we have divided the data into a feature set and a target set. The feature set consists of 5 columns (the predictor variables), and the target set consists of 1 column with the class values.

features_train and target_train contain the training data, while features_test and target_test contain the testing data. With test_size = 0.33, roughly one third of the rows are held out for testing.
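The split can be sketched on made-up data as follows. The `stratify` argument is an optional extra that is not used in the notebook's own call. It keeps the class proportions similar in both subsets, which can help when one class dominates.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 5 predictor columns, imbalanced target
rng = np.random.RandomState(10)
toy_X = rng.normal(size=(100, 5))
toy_y = np.array(['Weak'] * 60 + ['Average'] * 40)

X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.33, random_state=10, stratify=toy_y)

print(X_train.shape, X_test.shape)
# Share of 'Weak' in the test set stays close to the overall 60%
print((y_test == 'Weak').mean())
```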

### 16.8 Gaussian Naive Bayes Implementation
After completing the data preprocessing, it’s time to apply a machine learning algorithm to the data. We are going to use sklearn’s GaussianNB class.

In [54]:
#Implement GNB
clf = GaussianNB()
clf.fit(features_train, target_train)
target_pred = clf.predict(features_test)

We have built a GaussianNB classifier. The classifier is trained on the training data using the fit() method. After training, the model is ready to make predictions: we call the predict() method with the test set features as its argument.
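Beyond predict(), a fitted GaussianNB also exposes the quantities it learned and can return class probabilities. The sketch below shows this on made-up, well-separated toy data (`X_demo` and `y_demo` are hypothetical).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical standardized features for two classes
X_demo = np.array([[-1.0, -1.2], [-0.8, -1.0], [-1.1, -0.9],
                   [1.0, 1.1], [0.9, 1.3], [1.2, 0.8]])
y_demo = np.array(['Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Strong'])

demo_clf = GaussianNB()
demo_clf.fit(X_demo, y_demo)

print(demo_clf.classes_)       # class order used for the predict_proba columns
print(demo_clf.class_prior_)   # estimated P(class) from training frequencies
print(demo_clf.theta_)         # per-class mean of each feature

new_rows = np.array([[-0.9, -1.0], [1.1, 1.0]])
demo_pred = demo_clf.predict(new_rows)
demo_proba = demo_clf.predict_proba(new_rows)   # each row sums to 1
print(demo_pred)
print(demo_proba.round(3))
```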

### 16.9 Accuracy of our Gaussian Naive Bayes model

It’s time to test the quality of our model. We have made some predictions. Let’s compare the model’s prediction with actual target values for the test set. By following this method, we are going to calculate the accuracy of our model.

In [59]:
# accuracy_score
accuracy_score(target_test, target_pred, normalize = True)

0.60858305586385497

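Our model is giving an accuracy of about 61%. This is a reasonable start for a simple implementation. For context, the most frequent class (Weak) makes up 4756 of the 8190 rows (about 58%) in the full dataset, so the gain over always predicting Weak is small.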

In [60]:
print ("Accuracy is ", accuracy_score(target_test, target_pred, normalize = True)*100)

Accuracy is  60.8583055864
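A single accuracy figure is easier to interpret next to a baseline and a per-class breakdown. The sketch below uses made-up labels (`y_true` and `y_model` are hypothetical, not this notebook's results) to compare a model against a most-frequent-class baseline and to print a confusion matrix.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and model predictions
y_true = np.array(['Weak', 'Weak', 'Weak', 'Average', 'Average', 'Strong'])
y_model = np.array(['Weak', 'Weak', 'Average', 'Average', 'Weak', 'Strong'])

# Baseline: always predict the most frequent class (features are ignored)
dummy_X = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(dummy_X, y_true)
y_baseline = baseline.predict(dummy_X)

model_acc = accuracy_score(y_true, y_model)
baseline_acc = accuracy_score(y_true, y_baseline)
print(model_acc, baseline_acc)

# Rows are true classes, columns are predicted classes, in the order of `labels`
labels = ['Weak', 'Average', 'Strong']
cm = confusion_matrix(y_true, y_model, labels=labels)
print(cm)
```

In the notebook itself, the same calls could be applied to target_test and target_pred to see which sales categories the model confuses.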
