## 3.2 - Model Generation

This section covers the training, testing, and tuning of both the Support Vector Machine and the Decision Tree models.

### 3.3.1 - Support Vector Machine Training

The SVM training experiments covered the following:
- a) use of multi-class for quality range of 0-1
- b) use of binary class for quality range of 0,1
- c) use of all features - standardized using StandardScaler
- d) use of select features with high correlation to quality outcome using StandardScaler
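
Items c) and d) both standardize features with StandardScaler. As a rough illustration of what that step computes, the sketch below rescales a single column to zero mean and unit variance using only the standard library. The `standardize` helper and the sample values are made up for illustration; StandardScaler applies the same idea column by column to the whole feature matrix and, like this sketch, uses the population standard deviation.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit variance (population std)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column has no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical feature column (illustrative values only, not from the dataset).
alcohol = [9.4, 9.8, 10.5, 11.2, 12.1]
scaled = standardize(alcohol)
print([round(z, 3) for z in scaled])
```

Standardizing matters for an RBF-kernel SVM because the kernel depends on distances between samples, so a feature with a large numeric range would otherwise dominate.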

The experiments take advantage of scikit-learn's pipeline and grid search capabilities. A grid search takes a range of values for each hyperparameter along with the preprocessing steps, the model, and the cross-validation settings; scikit-learn then runs through every combination and reports the "best" one in terms of accuracy.
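
To make "all the combinations" concrete, the sketch below enumerates the candidates for the `svm__C` and `svm__gamma` values used in this section and counts the resulting fits. It only mimics the bookkeeping with `itertools.product` and does not train anything; the variable names are illustrative. Parameters given a single value, such as `svm__max_iter`, do not change the count and are left out.

```python
from itertools import product

# Same value grid as the SVM experiment in this section.
param_grid = {
    "svm__C": [0.001, 0.1, 10, 100, 10e5],
    "svm__gamma": [0.1, 0.01],
}
cv_folds = 10

# Every combination of one value per parameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

# Each candidate is fitted once per cross-validation fold.
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

Five values of `C` times two values of `gamma` gives 10 candidates, and with 10 folds that is 100 fits, which agrees with the "Fitting 10 folds for each of 10 candidates, totalling 100 fits" line in the search output.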

#### 3.3.1.1 Loading and setting up test and training datasets.



In [5]:
# Development only: reload edited modules (not relevant to the model itself)
%load_ext autoreload
%autoreload

from utils.helpers import *
df_red, df_white, df_all = pull_and_load_data(force = False)

# this is the full 
#X_train = get_df_no_color(df_all, binary = True)
X_train, y_train = get_features_and_labels(df_all, binary = True)



The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
path exist and not forced
path exist and not forced


In [15]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

model_pipeline = Pipeline([
    ('scale', StandardScaler()), ('svm', SVC())])

# References:
# - GridSearchCV API: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
# - https://scikit-learn.org/stable/auto_examples/plot_kernel_ridge_regression.html#sphx-glr-auto-examples-plot-kernel-ridge-regression-py
# - *** https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py
# - https://stackoverflow.com/a/45394598/140618
# - Scoring parameter: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
# - RBF SVM parameters: https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

# Hyperparameter grid; the 'svm__' prefix routes each value to the 'svm' pipeline step.
parameters = {
    # 'scale__with_mean': [True, False],  # .77 with these, .84 without?
    # 'scale__with_std': [True, False],
    'svm__C': [0.001, 0.1, 10, 100, 10e5],  # 10e5 == 1,000,000
    'svm__gamma': [0.1, 0.01],
    'svm__max_iter': [100000],
    'svm__random_state': [42]}

# TODO: kill 'refit' https://scikit-learn.org/stable/modules/grid_search.html#multimetric-grid-search
grid = GridSearchCV(
    model_pipeline, parameters, cv=10,
    # scoring={
    #     'accuracy': 'accuracy',
    #     'precision': 'precision',
    #     'recall': 'recall',
    #     'f1': 'f1'
    # },
    # refit='accuracy',
    verbose=10
).fit(X_train, y_train)

Fitting 10 folds for each of 10 candidates, totalling 100 fits
[CV] svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000 ..............


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000, score=0.632, total=   0.8s
[CV] svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000 ..............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s remaining:    0.0s


[CV]  svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000, score=0.632, total=   0.8s
[CV] svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000 ..............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.6s remaining:    0.0s


[CV]  svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000, score=0.632, total=   0.8s
[CV] svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000 ..............


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    2.4s remaining:    0.0s


[CV]  svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000, score=0.632, total=   0.9s
[CV] svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000 ..............


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    3.3s remaining:    0.0s


[CV]  svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000, score=0.634, total=   0.8s
[CV] svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000 ..............


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    4.0s remaining:    0.0s


[CV]  svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000, score=0.634, total=   0.8s
[CV] svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000 ..............


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    4.8s remaining:    0.0s


[CV]  svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000, score=0.634, total=   0.8s
[CV] svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000 ..............


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    5.6s remaining:    0.0s


[CV]  svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000, score=0.633, total=   0.8s
[CV] svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000 ..............


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    6.4s remaining:    0.0s


[CV]  svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000, score=0.633, total=   0.8s
[CV] svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000 ..............


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    7.2s remaining:    0.0s


[CV]  svm__C=0.001, svm__gamma=0.1, svm__max_iter=100000, score=0.633, total=   0.8s
[CV] svm__C=0.001, svm__gamma=0.01, svm__max_iter=100000 .............
[CV]  svm__C=0.001, svm__gamma=0.01, svm__max_iter=100000, score=0.632, total=   0.8s
[CV] svm__C=0.001, svm__gamma=0.01, svm__max_iter=100000 .............
[CV]  svm__C=0.001, svm__gamma=0.01, svm__max_iter=100000, score=0.632, total=   0.8s
[CV] svm__C=0.001, svm__gamma=0.01, svm__max_iter=100000 .............
[CV]  svm__C=0.001, svm__gamma=0.01, svm__max_iter=100000, score=0.632, total=   0.8s
[CV] svm__C=0.001, svm__gamma=0.01, svm__max_iter=100000 .............
[CV]  svm__C=0.001, svm__gamma=0.01, svm__max_iter=100000, score=0.632, total=   0.8s
[CV] svm__C=0.001, svm__gamma=0.01, svm__max_iter=100000 .............
[CV]  svm__C=0.001, svm__gamma=0.01, svm__max_iter=100000, score=0.634, total=   0.8s
[CV] svm__C=0.001, svm__gamma=0.01, svm__max_iter=100000 .............
[CV]  svm__C=0.001, svm__gamma=0.01, svm__max_iter=100000, 



[CV]  svm__C=100, svm__gamma=0.1, svm__max_iter=100000, score=0.597, total=   2.2s
[CV] svm__C=100, svm__gamma=0.1, svm__max_iter=100000 ................




[CV]  svm__C=100, svm__gamma=0.1, svm__max_iter=100000, score=0.566, total=   2.3s
[CV] svm__C=100, svm__gamma=0.1, svm__max_iter=100000 ................




[CV]  svm__C=100, svm__gamma=0.1, svm__max_iter=100000, score=0.574, total=   2.2s
[CV] svm__C=100, svm__gamma=0.1, svm__max_iter=100000 ................




[CV]  svm__C=100, svm__gamma=0.1, svm__max_iter=100000, score=0.655, total=   2.2s
[CV] svm__C=100, svm__gamma=0.1, svm__max_iter=100000 ................




[CV]  svm__C=100, svm__gamma=0.1, svm__max_iter=100000, score=0.665, total=   2.3s
[CV] svm__C=100, svm__gamma=0.1, svm__max_iter=100000 ................




[CV]  svm__C=100, svm__gamma=0.1, svm__max_iter=100000, score=0.685, total=   2.2s
[CV] svm__C=100, svm__gamma=0.1, svm__max_iter=100000 ................




[CV]  svm__C=100, svm__gamma=0.1, svm__max_iter=100000, score=0.769, total=   2.3s
[CV] svm__C=100, svm__gamma=0.1, svm__max_iter=100000 ................




[CV]  svm__C=100, svm__gamma=0.1, svm__max_iter=100000, score=0.746, total=   2.3s
[CV] svm__C=100, svm__gamma=0.1, svm__max_iter=100000 ................




[CV]  svm__C=100, svm__gamma=0.1, svm__max_iter=100000, score=0.740, total=   2.4s
[CV] svm__C=100, svm__gamma=0.1, svm__max_iter=100000 ................




[CV]  svm__C=100, svm__gamma=0.1, svm__max_iter=100000, score=0.716, total=   2.3s
[CV] svm__C=100, svm__gamma=0.01, svm__max_iter=100000 ...............
[CV]  svm__C=100, svm__gamma=0.01, svm__max_iter=100000, score=0.580, total=   1.2s
[CV] svm__C=100, svm__gamma=0.01, svm__max_iter=100000 ...............
[CV]  svm__C=100, svm__gamma=0.01, svm__max_iter=100000, score=0.682, total=   1.2s
[CV] svm__C=100, svm__gamma=0.01, svm__max_iter=100000 ...............
[CV]  svm__C=100, svm__gamma=0.01, svm__max_iter=100000, score=0.635, total=   1.1s
[CV] svm__C=100, svm__gamma=0.01, svm__max_iter=100000 ...............
[CV]  svm__C=100, svm__gamma=0.01, svm__max_iter=100000, score=0.702, total=   1.2s
[CV] svm__C=100, svm__gamma=0.01, svm__max_iter=100000 ...............
[CV]  svm__C=100, svm__gamma=0.01, svm__max_iter=100000, score=0.717, total=   1.2s
[CV] svm__C=100, svm__gamma=0.01, svm__max_iter=100000 ...............
[CV]  svm__C=100, svm__gamma=0.01, svm__max_iter=100000, score=0.734, t



[CV]  svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000, score=0.522, total=   3.6s
[CV] svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000 ..........




[CV]  svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000, score=0.572, total=   3.7s
[CV] svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000 ..........




[CV]  svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000, score=0.474, total=   3.8s
[CV] svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000 ..........




[CV]  svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000, score=0.585, total=   3.8s
[CV] svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000 ..........




[CV]  svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000, score=0.560, total=   3.7s
[CV] svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000 ..........




[CV]  svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000, score=0.625, total=   3.7s
[CV] svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000 ..........




[CV]  svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000, score=0.682, total=   3.8s
[CV] svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000 ..........




[CV]  svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000, score=0.663, total=   3.7s
[CV] svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000 ..........




[CV]  svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000, score=0.700, total=   3.7s
[CV] svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000 ..........




[CV]  svm__C=1000000.0, svm__gamma=0.1, svm__max_iter=100000, score=0.638, total=   3.8s
[CV] svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000 .........




[CV]  svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000, score=0.382, total=   4.6s
[CV] svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000 .........




[CV]  svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000, score=0.543, total=   4.4s
[CV] svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000 .........




[CV]  svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000, score=0.642, total=   4.3s
[CV] svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000 .........




[CV]  svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000, score=0.595, total=   4.4s
[CV] svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000 .........




[CV]  svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000, score=0.529, total=   4.2s
[CV] svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000 .........




[CV]  svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000, score=0.509, total=   4.3s
[CV] svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000 .........




[CV]  svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000, score=0.706, total=   4.4s
[CV] svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000 .........




[CV]  svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000, score=0.643, total=   4.3s
[CV] svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000 .........




[CV]  svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000, score=0.533, total=   4.5s
[CV] svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000 .........


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  2.7min finished


[CV]  svm__C=1000000.0, svm__gamma=0.01, svm__max_iter=100000, score=0.610, total=   4.3s


In [12]:
print("score = %3.2f" %(grid.score(X_train,y_train)))
print(grid.best_params_)

score = 0.77
{'scale__with_mean': True, 'scale__with_std': True, 'svm__C': 10, 'svm__gamma': 0.01, 'svm__max_iter': 100000}


In [13]:
grid

GridSearchCV(cv=10, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('scale',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('svm',
                                        SVC(C=1.0, break_ties=False,
                                            cache_size=200, class_weight=None,
                                            coef0=0.0,
                                            decision_function_shape='ovr',
                                            degree=3, gamma='scale',
                                            kernel='rbf', max_iter=-1,
                                            probability=False,
                                            random_state=None, shrinking=True,
                                            tol=0....

In [16]:
grid.best_params_

{'svm__C': 10, 'svm__gamma': 0.01, 'svm__max_iter': 100000}
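
The keys in `best_params_` follow the `<step>__<parameter>` naming convention: the part before the double underscore names a pipeline step (`svm` here) and the part after it names a parameter of that step. The sketch below only illustrates the naming scheme with a hypothetical `group_by_step` helper; it is not how scikit-learn implements parameter routing.

```python
def group_by_step(params):
    """Group 'step__param' keys into {step: {param: value}}."""
    grouped = {}
    for key, value in params.items():
        # Split on the first double underscore only.
        step, _, name = key.partition("__")
        grouped.setdefault(step, {})[name] = value
    return grouped

# Best parameters as reported by the grid search above.
best_params = {"svm__C": 10, "svm__gamma": 0.01, "svm__max_iter": 100000}
print(group_by_step(best_params))
```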

In [18]:
grid.best_estimator_

Pipeline(memory=None,
         steps=[('scale',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm',
                 SVC(C=10, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='ovr', degree=3,
                     gamma=0.01, kernel='rbf', max_iter=100000,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)
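
The selected estimator uses the RBF kernel with `gamma=0.01`. For reference, the RBF kernel scores the similarity of two samples as `exp(-gamma * ||x - y||^2)`, so a smaller gamma lets each training point influence a wider neighbourhood and tends to give a smoother decision boundary. A minimal sketch comparing the two gamma values from the grid follows; the `rbf_kernel` helper and the sample points are illustrative and not part of the project code.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Two hypothetical standardized samples, squared distance = 2.
a = [0.0, 0.0]
b = [1.0, 1.0]

for gamma in (0.1, 0.01):
    print(gamma, round(rbf_kernel(a, b, gamma), 4))
```

For the same pair of points the similarity is higher with `gamma=0.01` than with `gamma=0.1`, which is the sense in which the smaller value is the "softer" setting.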

In [19]:
grid.best_score_


0.7075711745881238
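
`best_score_` is the mean cross-validated score of the winning candidate, averaged over the 10 folds. The same averaging can be reproduced by hand for the candidates whose fold scores appear in full in the verbose log above. The sketch below uses the rounded scores printed there, so the means are approximate; the dictionary layout is an illustrative choice.

```python
from statistics import fmean

# Per-fold accuracy copied from the verbose log (rounded to 3 decimals there).
# Keys are (C, gamma).
fold_scores = {
    (100, 0.1): [0.597, 0.566, 0.574, 0.655, 0.665,
                 0.685, 0.769, 0.746, 0.740, 0.716],
    (1e6, 0.1): [0.522, 0.572, 0.474, 0.585, 0.560,
                 0.625, 0.682, 0.663, 0.700, 0.638],
    (1e6, 0.01): [0.382, 0.543, 0.642, 0.595, 0.529,
                  0.509, 0.706, 0.643, 0.533, 0.610],
}

# Unweighted mean over the folds for each candidate.
mean_scores = {cand: fmean(scores) for cand, scores in fold_scores.items()}

for (C, gamma), mean in mean_scores.items():
    print(f"C={C:g}, gamma={gamma:g}: mean accuracy = {mean:.4f}")
```

All three means fall below the reported `best_score_` of roughly 0.708, which is consistent with `C=10`, `gamma=0.01` being selected.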

In [24]:
grid.cv_results_

{'mean_fit_time': array([0.75247395, 0.72971303, 0.60026951, 0.69851217, 0.91480985,
        0.65935476, 2.23463337, 1.13741827, 3.70260365, 4.3303313 ]),
 'std_fit_time': array([0.02802653, 0.00808969, 0.00973134, 0.01524635, 0.02227772,
        0.01221629, 0.0603345 , 0.01973854, 0.05604559, 0.11668648]),
 'mean_score_time': array([0.0459583 , 0.04760151, 0.03926413, 0.04352386, 0.03264506,
        0.03571587, 0.03076854, 0.03417389, 0.0297637 , 0.04072535]),
 'std_score_time': array([0.00024252, 0.00262029, 0.00027601, 0.00270929, 0.00037764,
        0.00094388, 0.00064822, 0.00040129, 0.00060382, 0.00107654]),
 'param_svm__C': masked_array(data=[0.001, 0.001, 0.1, 0.1, 10, 10, 100, 100, 1000000.0,
                    1000000.0],
              mask=[False, False, False, False, False, False, False, False,
                    False, False],
        fill_value='?',
             dtype=object),
 'param_svm__gamma': masked_array(data=[0.1, 0.01, 0.1, 0.01, 0.1, 0.01, 0.1, 0.01, 0.1, 0.01]