### 8.0.1 Support Vector Classification
In our retail dataset, the task is to predict weekly sales as represented by the five MarkDown columns. For each of the five markdowns we have samples on which we fit an estimator, so that it can predict the classes to which unseen samples belong.

We are going to use an estimator for classification, that is, a Python object that implements the methods fit(X, y) and predict(T). The estimator we use is the class sklearn.svm.SVC, which implements support vector classification.
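
The fit(X, y) / predict(T) pattern described above can be sketched on a small synthetic dataset. The data, the name toy_clf, and the parameter values below are assumptions made for illustration only; they are not taken from the retail dataset.

```python
import numpy as np
from sklearn import svm

# Synthetic stand-in data (an assumption, not the retail dataset):
# two well-separated clusters in two dimensions.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(5.0, 0.5, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Hypothetical classifier with illustrative parameter values.
toy_clf = svm.SVC(gamma=0.1, C=1.0)
toy_clf.fit(X, y)  # learn from the labelled samples

# Unseen samples, one near each cluster centre.
T = np.array([[0.0, 0.0], [5.0, 5.0]])
pred = toy_clf.predict(T)
print(pred)
```

Each unseen sample is assigned the label of the cluster it sits in, which is the same workflow applied to the retail data later in this section.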

In [1]:
from sklearn import svm

In [2]:
clf = svm.SVC(gamma=0.001, C=100.)

#### 8.0.1.1 Choosing the parameters of the model
The constructor of the estimator in 8.0.1 takes as arguments the parameters of the model.

##### 8.0.1.1.1 Manual Setting of Gamma
First we set the value of gamma manually. The estimator instance clf (a classifier) is then fitted to the data; that is, it learns from the training samples.
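
To get a feel for what gamma controls, recall that SVC uses the RBF kernel by default, K(x, x') = exp(-gamma * ||x - x'||^2). The sketch below evaluates that kernel for two hypothetical points at two gamma values. The points and the second gamma value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical points at Euclidean distance 5 from each other.
a = np.array([[0.0, 0.0]])
b = np.array([[3.0, 4.0]])

# Kernel similarity for a small and a larger gamma.
similarity = {}
for gamma in (0.001, 0.1):
    similarity[gamma] = rbf_kernel(a, b, gamma=gamma)[0, 0]
    print(gamma, similarity[gamma])
```

A small gamma keeps distant points similar, so each training sample influences a wide region. A larger gamma makes similarity fall off quickly with distance. Because the kernel depends on raw distances, the scale of the features affects which gamma is reasonable.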

In [3]:
# libraries
#%matplotlib notebook

import pandas as pd
import numpy as np

import matplotlib
import seaborn
import matplotlib.dates as md
from matplotlib import pyplot as plt

from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.covariance import EllipticEnvelope
from pyemma import msm
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

In [4]:
# load master_dataset.xlsx (recent pandas versions name this argument sheet_name, not sheetname)
df = pd.read_excel('master_dataset.xlsx', sheet_name='Sheet1')

In [5]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8190 entries, 0 to 8189
Data columns (total 95 columns):
Store                     8190 non-null int64
Date                      8190 non-null datetime64[ns]
Temperature               8190 non-null float64
Fuel_Price                8190 non-null float64
MarkDown1                 8190 non-null float64
MarkDown2                 8190 non-null float64
MarkDown3                 8190 non-null float64
MarkDown4                 8190 non-null float64
MarkDown5                 8190 non-null float64
CPI                       8190 non-null float64
Unemployment              8190 non-null float64
IsHoliday                 8190 non-null bool
Type                      8190 non-null object
Size                      8190 non-null int64
Jewelry                   8190 non-null float64
Pets                      8190 non-null float64
TV_Video                  8190 non-null float64
Cell_Phones               8190 non-null float64
Pharmaceutical            8190

#### MarkDown 1 Prediction
Here we predict the class of MarkDown1 from the sales of the other product categories and the remaining parameters.

In [6]:
# Use MarkDown1 sales as a target for prediction. 
df_MarkDown1 = df['MarkDown1']

Split MarkDown1 into five classes: Low-Low Sales (LLS) represented by 0, Low Sales (LS) by 1, Average Sales (AS) by 2, Above Average Sales (AAS) by 3, and High Sales (HS) by 4.

In [7]:
def MarkDown1_Split(x):
    # Map a MarkDown1 value to a class 0-4.
    # A value exactly on a boundary (e.g. 4444) goes to the upper class.
    if x < 4444:
        return 0
    elif x < 8888:
        return 1
    elif x < 13332:
        return 2
    elif x < 17776:
        return 3
    else:
        return 4
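
The same binning can also be written with pandas.cut, which makes the bin edges explicit. This is a sketch rather than the notebook's method: it assumes half-open bins in which a boundary value goes to the upper class, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical MarkDown1 values: one per class, including one exact boundary.
sample = pd.Series([100.0, 4444.0, 9000.0, 15000.0, 20000.0])

# Same thresholds as MarkDown1_Split, with open-ended outer bins.
bins = [-float("inf"), 4444, 8888, 13332, 17776, float("inf")]

# right=False gives bins of the form [low, high); labels=False returns
# the integer bin index, which plays the role of the class number.
scores = pd.cut(sample, bins=bins, labels=False, right=False)
print(scores.tolist())
```

Keeping the thresholds in one list makes it harder to leave a boundary value uncovered than with a chain of comparisons.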

In [8]:
# Discretize MarkDown1 into MarkDown_Score.
# Copy the column into its own DataFrame so the assignment below
# does not raise a SettingWithCopyWarning.
df_MarkDown1 = df[['MarkDown1']].copy()
df_MarkDown1["MarkDown_Score"] = df_MarkDown1["MarkDown1"].apply(MarkDown1_Split)

In [9]:
print(df_MarkDown1["MarkDown_Score"].head())

0    2
1    2
2    2
3    2
4    2
Name: MarkDown_Score, dtype: int64


In [10]:
# Drop identifier and non-numeric columns, plus the MarkDown columns that serve as prediction targets
df.drop(['Store', 'Date', 'IsHoliday', 'Type', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5'], axis=1, inplace=True)

As a training set, we use all rows of our dataset except the last one. We select this training set with the [:-1] Python slice syntax, which produces a new object containing all but the last row of df.

In [11]:
clf.fit(df[:-1], df_MarkDown1["MarkDown_Score"][:-1])

SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Now we can predict new values. In particular, we ask the classifier for the class of the last data point in the retail dataset, which was not used to train the classifier:

In [12]:
clf.predict(df[-1:])

array([1], dtype=int64)

This corresponds to class 1, that is, Low Sales (LS) for MarkDown1.

##### 8.0.1.1.2 Automatic Setting of Gamma
It is possible to automatically find good values for the parameters. In order to do this, we will be using tools such as grid search and cross validation.
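
A minimal sketch of such a search, assuming a scikit-learn version that provides sklearn.model_selection.GridSearchCV. The synthetic data, the candidate values in param_grid, and the choice of 5 folds are all assumptions for illustration; with the retail data, X and y would be the feature frame and the MarkDown_Score column.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the retail dataset).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 3)),
               rng.normal(3.0, 1.0, size=(30, 3))])
y = np.array([0] * 30 + [1] * 30)

# Hypothetical candidate values for the two SVC parameters.
param_grid = {"gamma": [0.001, 0.01, 0.1], "C": [1.0, 10.0, 100.0]}

# Try every (gamma, C) combination, scoring each with 5-fold cross-validation.
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # combination with the best mean CV score
print(search.best_score_)   # that mean cross-validated accuracy
```

After fitting, search.best_estimator_ holds an SVC refitted on all of the data with the selected parameters, so it can be used for predict in place of a manually configured clf.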