## WhiteBox Error and Sensitivity Analysis on Wine Quality Data

This tutorial covers:
* [Importing the wine quality dataset](#wine_quality)
* [Handling categorical features](#categorical)
* [Building a model](#model)
* [Deploying WhiteBoxError graphics](#wbox_error)
* [Deploying WhiteBoxSensitivity graphics](#wbox_sensitivity)

In [13]:
import numpy as np  # needed later for np.number in select_dtypes
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

from whitebox.utils import utils as wb_utils
from whitebox.eval import WhiteBoxError, WhiteBoxSensitivity

### Import wine quality dataset <a id=wine_quality></a>
Perform basic exploratory data analysis to better understand which types of columns are available.

In [5]:
df = pd.read_csv('./datasets/winequality.csv')

In [6]:
df.head()

Unnamed: 0,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,quality,Type,AlcoholContent
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,5,Red,Low
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,5,Red,Low
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,5,Red,Low
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,6,Red,Low
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,5,Red,Low


In [7]:
df.dtypes # it looks like most of our columns are numeric, with the exception of Type and AlcoholContent

fixed.acidity           float64
volatile.acidity        float64
citric.acid             float64
residual.sugar          float64
chlorides               float64
free.sulfur.dioxide     float64
total.sulfur.dioxide    float64
density                 float64
pH                      float64
sulphates               float64
quality                   int64
Type                     object
AlcoholContent           object
dtype: object
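
The split between numeric and non-numeric columns can also be pulled out programmatically rather than read off the `dtypes` listing. A minimal sketch, using a small hypothetical frame (`toy`) as a stand-in for the wine data; the values are illustrative only:

```python
import pandas as pd

# Hypothetical three-row stand-in for the wine data (illustrative values).
toy = pd.DataFrame({
    'pH': [3.51, 3.20, 3.26],
    'quality': [5, 5, 6],
    'Type': ['Red', 'Red', 'White'],
    'AlcoholContent': ['Low', 'Low', 'Medium'],
})

# Anything that is not numeric is a candidate for dummy encoding later on.
non_numeric_cols = toy.select_dtypes(exclude='number').columns.tolist()
numeric_cols = toy.select_dtypes(include='number').columns.tolist()

print(non_numeric_cols)
print(numeric_cols)
```

On the real data, the same two calls on `df` should single out `Type` and `AlcoholContent`, matching the listing above.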

In [8]:
df.groupby('AlcoholContent')['fixed.acidity'].count() # most of our data resides in low/medium alcohol content

AlcoholContent
High       852
Low       2832
Medium    2813
Name: fixed.acidity, dtype: int64

In [7]:
df.groupby('Type')['fixed.acidity'].count() # and most of our data is white wine

Type
Red      1599
White    4898
Name: fixed.acidity, dtype: int64
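
Raw counts are often easier to compare as shares. The sketch below rebuilds an `AlcoholContent` column from the counts printed above (a stand-in for `df['AlcoholContent']`) and uses `value_counts(normalize=True)` to get proportions:

```python
import pandas as pd

# Stand-in column rebuilt from the group counts shown in the output above.
alcohol = pd.Series(
    ['High'] * 852 + ['Low'] * 2832 + ['Medium'] * 2813,
    name='AlcoholContent',
)

# normalize=True returns each level's share of the rows instead of its count.
shares = alcohol.value_counts(normalize=True).round(3)
print(shares)
```

With the real frame, `df['AlcoholContent'].value_counts(normalize=True)` would be the equivalent one-liner.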

### Handling categorical data <a id=categorical></a>

We can rely on pandas to convert our string/category columns into dummy variables that can be used in our models.

In [9]:
# dependent variable
ydepend = 'quality'

# create the model data frame, with categorical columns converted to dummies
model_df = pd.get_dummies(df.loc[:, df.columns != ydepend])
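
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a hypothetical three-row frame (`toy`). Numeric columns pass through unchanged, and each string column is expanded into one indicator column per level, named `<column>_<level>`:

```python
import pandas as pd

# Hypothetical stand-in for the feature columns of the wine data.
toy = pd.DataFrame({
    'pH': [3.51, 3.20, 3.00],
    'Type': ['Red', 'Red', 'White'],
    'AlcoholContent': ['Low', 'High', 'Medium'],
})

encoded = pd.get_dummies(toy)

# One indicator column per level of each string column.
print(encoded.columns.tolist())
print(encoded)
```

Depending on the pandas version, the indicator columns are stored as `uint8` or `bool`; both work as inputs to scikit-learn estimators.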

### Build model <a id=model></a>

In [14]:
# build the model
modelobj = RandomForestRegressor()

modelobj.fit(model_df, df.loc[:, ydepend])

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
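
The model above is fit on the full dataset, which is fine for demonstrating the WhiteBox tooling but gives no out-of-sample check. As an optional aside, the sketch below shows a hold-out evaluation with scikit-learn. It runs on synthetic data generated with NumPy, so the features, target and hyperparameters (`n_estimators=50`, `random_state=0`) are assumptions for illustration, not settings from this tutorial:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data: the target depends mainly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Hold out a quarter of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# score() returns R^2 on the held-out rows.
r2_test = model.score(X_test, y_test)
print(round(r2_test, 3))
```

The same pattern would apply to `model_df` and `df[ydepend]` if a held-out estimate of model quality is wanted before inspecting errors.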

### Create WhiteBoxError <a id=wbox_error></a>

In [27]:
# specify keepfeaturelist as a subset of columns we want to focus on

keepfeaturelist = ['fixed.acidity', 
                  'quality', 
                  'AlcoholContent', 
                  'sulphates', 
                  'volatile.acidity', 
                  'residual.sugar', 
                  'free.sulfur.dioxide', 
                  'ALL DATA']


# add a constant column so that grouping by it shows how the model performs across the entire dataset
df['ALL DATA'] = 'ALL DATA'

# specify the groupby variables
groupbyvars = ['AlcoholContent', 'ALL DATA']

# instantiate wbox error
WB = WhiteBoxError(modelobj=modelobj,
                   model_df=model_df,
                   ydepend=ydepend,
                   cat_df=df, 
                   groupbyvars=groupbyvars,
                   keepfeaturelist=keepfeaturelist,
                   autoformat_types=True,
                   verbose=None,
                   round_num=4)

In [28]:
# run wbox error
WB.run(output_type='html', output_path='winequality_example.html')

Percent Complete:  0%

  return np.nanmean(a, axis, out=out, keepdims=keepdims)


Percent Complete: 86%

In [21]:
WB._save('winequality_example.html')

In [22]:
WB.outputs

[{'Data': [{'ALL DATA': 'ALL DATA',
    'errNeg': 0.0,
    'errPos': 0.0,
    'groupByValue': 'Low',
    'groupByVarName': 'AlcoholContent',
    'predictedYSmooth': 5.2},
   {'ALL DATA': 'ALL DATA',
    'errNeg': -0.1,
    'errPos': 0.1,
    'groupByValue': 'Medium',
    'groupByVarName': 'AlcoholContent',
    'predictedYSmooth': 6.0},
   {'ALL DATA': 'ALL DATA',
    'errNeg': -0.1,
    'errPos': 0.1,
    'groupByValue': 'High',
    'groupByVarName': 'AlcoholContent',
    'predictedYSmooth': 6.6}],
  'Type': 'Categorical'},
 {'Data': [{'errNeg': -0.1,
    'errPos': 0.1,
    'groupByValue': 'ALL DATA',
    'groupByVarName': 'ALL DATA',
    'predictedYSmooth': 5.5,
    'residual.sugar': 0.9},
   {'errNeg': -0.1,
    'errPos': 0.1,
    'groupByValue': 'ALL DATA',
    'groupByVarName': 'ALL DATA',
    'predictedYSmooth': 5.8,
    'residual.sugar': 1.0},
   {'errNeg': -0.1,
    'errPos': 0.1,
    'groupByValue': 'ALL DATA',
    'groupByVarName': 'ALL DATA',
    'predictedYSmooth': 5.9,
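
The output records carry `errPos` and `errNeg` values per group. One plausible reading, which the source does not confirm, is that they summarize over-prediction and under-prediction separately within each group. The sketch below illustrates that idea in plain pandas on a hypothetical frame; it is not WhiteBoxError's implementation, and the column names `mean_over` and `mean_under` are made up for this example:

```python
import pandas as pd

# Hypothetical actuals and predictions for two alcohol-content groups.
toy = pd.DataFrame({
    'group': ['Low', 'Low', 'Low', 'High', 'High'],
    'actual': [5.0, 5.0, 6.0, 7.0, 6.0],
    'pred': [5.5, 4.5, 6.0, 6.0, 6.5],
})

toy['error'] = toy['pred'] - toy['actual']

# Keep positive and negative errors in separate columns (NaN elsewhere),
# so a grouped mean averages each side on its own.
toy['mean_over'] = toy['error'].where(toy['error'] > 0)
toy['mean_under'] = toy['error'].where(toy['error'] < 0)

summary = toy.groupby('group')[['mean_over', 'mean_under']].mean()
print(summary)
```

Splitting the two sides avoids positive and negative errors cancelling each other out, which a single mean error per group would hide.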
    

### WhiteBox Sensitivity Analysis <a id=wbox_sensitivity></a>

In [None]:
# whitebox sensitivity behaves very similarly to WhiteBox Error

# specify your dependent variable
ydepend = 'quality'


# specify groupby variables
groupbyvars = ['Type']

# we need to create dummy variables before the model can use the categorical columns
dummydf = df.copy(deep=True)

# create dummies for all categorical columns; Type and AlcoholContent are
# object columns (see df.dtypes above), so select object as well as category
cat_cols = dummydf.select_dtypes(include=['object', 'category']).columns
dummies = pd.concat([pd.get_dummies(dummydf.loc[:, col], prefix=col) for col in cat_cols], axis=1)
finaldf = pd.concat([dummydf.select_dtypes(include=[np.number]), dummies], axis=1)



# create train dataset for fitting model
xtrain = finaldf.loc[:, finaldf.columns != ydepend].copy(deep = True)
# create dependent variable dataset
ytrain = finaldf.loc[:, ydepend]

# fit the model
modelobj.fit(xtrain, ytrain)

In [None]:

# specify featuredict as a subset of columns we want to focus on
featuredict = {'fixed.acidity': 'FIXED ACIDITY',
               'Type': 'TYPE',
               'quality': 'SUPERQUALITY',
               'AlcoholContent': 'AC',
               'sulphates': 'SULPHATES',
               'volatile.acidity': 'VOLATILE ACIDITY',
               'residual.sugar': 'RESIDUAL SUGAR',
               'free.sulfur.dioxide': 'FREE SULFUR DIOXIDE'}

# instantiate whitebox sensitivity
WB = WhiteBoxSensitivity(modelobj=modelobj,
                         model_df=finaldf,
                         ydepend=ydepend,
                         cat_df=df,
                         groupbyvars=groupbyvars,
                         featuredict=featuredict,
                         verbose=None)
# run
WB.run()
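
For intuition about what a sensitivity analysis measures, the sketch below shifts one feature at a time and records the average change in the model's predictions. This is a generic one-at-a-time perturbation, not WhiteBoxSensitivity's actual algorithm; the helper `perturbation_sensitivity`, the stand-in model `toy_predict`, and the one-standard-deviation shift are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd


def perturbation_sensitivity(predict, X, column, scale=1.0):
    """Mean change in prediction after shifting `column` by scale * its std."""
    baseline = predict(X)
    shifted = X.copy()
    shifted[column] = shifted[column] + scale * X[column].std()
    return float(np.mean(predict(shifted) - baseline))


def toy_predict(X):
    # Hypothetical stand-in model: depends on sulphates only, ignores pH.
    return 2.0 * X['sulphates'].to_numpy() + 0.0 * X['pH'].to_numpy()


X = pd.DataFrame({
    'sulphates': [0.56, 0.68, 0.65, 0.58],
    'pH': [3.51, 3.20, 3.26, 3.16],
})

sens_sulphates = perturbation_sensitivity(toy_predict, X, 'sulphates')
sens_ph = perturbation_sensitivity(toy_predict, X, 'pH')
print(sens_sulphates, sens_ph)
```

A feature the model ignores, such as `pH` in this toy model, shows zero sensitivity, while the shift in `sulphates` moves predictions by twice its standard deviation. A fitted estimator can be plugged in by passing `modelobj.predict` as `predict`.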

In [None]:
# save the final outputs to disk
WB.save(fpath = '../output/example_winequality_sensitivity.html')