# Fitting SVM Models

##### import statements

In [2]:
import os
import pandas as pd
import sys
sys.path.insert(0, os.path.join(os.path.dirname(os.getcwd()), 'src'))
import models.svm_modeling as msm

##### read in the data

In [3]:
df = pd.read_csv(os.path.join(os.path.dirname(os.getcwd()),'data','interim', 'full_feature_data.csv'))

### The `svm_model` function provides a flexible way of fitting data to a range of models. It takes three required arguments: *df* (your dataframe), *model* (a short string naming the model), and *features* (a list of features, a list of feature prefixes, or *'all'* to use every feature). It also takes two optional arguments, each with a default value: *features_provided* (default=False; set to True only when passing a pre-selected list of features) and *y* (default='Flaw_Depth'; the column to use as the dependent variable).

### This function uses `GridSearchCV` from scikit-learn to optimize hyperparameters, one of which is *kernel*. The '*rbf*' and '*linear*' kernels consistently performed the best. To consider both during model fitting, use *'svm'* as the model argument.

### Run the cell below for an example. This will return a GridSearchCV model object, which contains considerable information regarding the model fit and can be used to make predictions. The function will also return a scaler object (more on that later). Please note that the grid search process may take a minute or two.

In [4]:
model, scaler = msm.svm_model(df=df, model='svm', features='all')

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 120 out of 120 | elapsed:  1.5min finished


Fitting complete!
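
### The fit count in the log is the number of hyperparameter combinations multiplied by the number of cross-validation folds. The sketch below reproduces that arithmetic with plain Python. The grid values are assumptions chosen only to match the reported counts (24 and 3 candidates); the actual grids inside `svm_model` are not shown in this notebook, and `param_grids` and `count_fits` are hypothetical names.

```python
from itertools import product

# Hypothetical grids: the real values used by svm_model are not shown here.
param_grids = {
    "svm": {
        "kernel": ["linear", "rbf"],
        "C": [0.1, 1.0, 10.0],
        "gamma": [0.0001, 0.001, 0.01, 0.1],
    },
    "svm_lin": {"C": [0.1, 1.0, 10.0]},
}
n_folds = 5


def count_fits(grid, folds):
    """Return (number of candidates, total number of fits) for a grid."""
    # Every combination of one value per hyperparameter is one candidate.
    candidates = list(product(*grid.values()))
    # Each candidate is fitted once per cross-validation fold.
    return len(candidates), len(candidates) * folds


for name, grid in param_grids.items():
    n_candidates, n_fits = count_fits(grid, n_folds)
    print(f"{name}: {n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

### Because the count grows multiplicatively, adding one more value to any hyperparameter in the combined grid adds a proportional number of fits, which is why restricting the search to a single kernel runs noticeably faster.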


In [5]:
print("The best hyperparameters were {}".format(model.best_params_))

The best hyperparameters were {'C': 0.1, 'gamma': 0.0001, 'kernel': 'linear'}


In [6]:
print("The best RMSE value was {}".format(abs(model.best_score_)))

The best RMSE value was 0.053667248496671614
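
### The `abs()` call is there because scikit-learn scorers follow a "greater is better" convention, so error metrics such as RMSE are reported as negative numbers. Which scorer `svm_model` passes to `GridSearchCV` is not shown in this notebook, so treat the sketch below as an illustration of the sign convention only. The `rmse` helper and the sample values are made up for the example.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up depths, for illustration only.
y_true = [0.10, 0.20, 0.30]
y_pred = [0.12, 0.18, 0.33]

error = rmse(y_true, y_pred)
score = -error          # what a "greater is better" scorer would report
recovered = abs(score)  # flips the sign back to an ordinary RMSE

print(f"score = {score:.4f}, RMSE = {recovered:.4f}")
```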


### You could also run the same model but consider only one kernel. To do this, set model to either '*svm_lin*' or '*svm_rbf*'.

In [7]:
model_svm_lin, scaler = msm.svm_model(df=df, model='svm_lin', features='all')

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:   15.4s finished


Fitting complete!


In [8]:
print("The best hyperparameters were {}".format(model_svm_lin.best_params_))

The best hyperparameters were {'C': 0.1}


In [9]:
print("The best RMSE value was {}".format(abs(model_svm_lin.best_score_)))

The best RMSE value was 0.053667248496671614


### If you only want to use a subset of features, e.g., all phase values, you can pass a list of feature prefixes to the features argument. For example, `features=['Amp', 'Phase']`, `features=['AB']`, `features=['A_', 'B_', 'AB']`, etc.
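
### Prefix selection can be pictured as a simple `startswith` match against the column names. The sketch below is an assumption about how such matching might work, not the code inside `svm_model`; `select_by_prefix` is a hypothetical helper, and the column list is a shortened stand-in for the real dataframe columns.

```python
def select_by_prefix(columns, prefixes):
    """Return the columns whose names start with any of the given prefixes."""
    # str.startswith accepts a tuple and matches if any prefix fits.
    return [c for c in columns if c.startswith(tuple(prefixes))]


# Shortened stand-in for the dataframe's columns.
columns = ["Amp_1", "Phase_1", "A_Value_1", "B_Value_1", "AB_Ratio_1", "Flaw_Depth"]

print(select_by_prefix(columns, ["AB"]))        # ['AB_Ratio_1']
print(select_by_prefix(columns, ["A_", "B_"]))  # ['A_Value_1', 'B_Value_1']
print(select_by_prefix(columns, ["A"]))         # ['Amp_1', 'A_Value_1', 'AB_Ratio_1']
```

### The last call shows why the trailing underscore in `'A_'` matters: under this kind of matching, a bare `'A'` would also pick up the `Amp` and `AB_Ratio` columns.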

In [10]:
model_AB, scaler = msm.svm_model(df=df, model='svm_lin', features=['AB'])

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:   29.1s finished


Fitting complete!


In [11]:
print("The best hyperparameters were {}".format(model_AB.best_params_))


In [12]:
print("The best RMSE value was {}".format(abs(model_AB.best_score_)))


### If a subset of features has already been selected, that can be passed as a list to the features argument. For example, below is the list of features identified by lasso regression.

In [13]:
feature_list = ['Amp_5', 'Amp_7', 'Amp_18', 'Amp_20', 'Phase_1', 'Phase_2', 'Phase_3', 'Phase_4', 'Phase_5', 'Phase_6',
            'Phase_8', 'Phase_10', 'Phase_11', 'Phase_12', 'Phase_13', 'Phase_14', 'Phase_15', 'Phase_16', 'Phase_17',
            'Phase_18', 'Phase_19', 'Phase_20', 'A_Value_1', 'A_Value_6', 'A_Value_10', 'A_Value_19', 'B_Value_4', 
            'B_Value_6', 'B_Value_12', 'B_Value_15', 'B_Value_16', 'B_Value_20', 'AB_Ratio_1', 'AB_Ratio_2',
            'AB_Ratio_3', 'AB_Ratio_4', 'AB_Ratio_5', 'AB_Ratio_6', 'AB_Ratio_7', 'AB_Ratio_8', 'AB_Ratio_9',
            'AB_Ratio_10', 'AB_Ratio_11', 'AB_Ratio_12', 'AB_Ratio_13', 'AB_Ratio_14', 'AB_Ratio_15', 'AB_Ratio_16',
            'AB_Ratio_17', 'AB_Ratio_18', 'AB_Ratio_19', 'AB_Ratio_20']

#### Remember to set `features_provided=True`!

In [14]:
model_w_features, scaler = msm.svm_model(df=df, model='svm_lin', features=feature_list, features_provided=True)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:   14.1s finished


Fitting complete!


In [15]:
print("The best hyperparameters were {}".format(model_w_features.best_params_))

The best hyperparameters were {'C': 10.0}


In [16]:
print("The best RMSE value was {}".format(abs(model_w_features.best_score_)))

The best RMSE value was 0.049820446059445966


### If another variable is of interest as the dependent variable for the model, that can also be selected. For example, percent depth could be used instead of depth.

In [17]:
model_pct_depth, scaler_pct_depth = msm.svm_model(df=df, model='svm_lin', features=feature_list, features_provided=True, y='Pct_Depth')

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  1.1min finished


Fitting complete!


In [18]:
print("The best hyperparameters were {}".format(model_pct_depth.best_params_))

The best hyperparameters were {'C': 1.0}


In [19]:
print("The best RMSE value was {}".format(abs(model_pct_depth.best_score_)))

The best RMSE value was 6.26938609336819


### To use any of these models for predictions, you can use the `make_predictions_on_test` function, which takes a test data dataframe, scaler, feature list, and model as arguments.

### This function requires the features as a full list. If you built a model using a list of prefixes (e.g., `features=['AB']`), you can get the full list from the `get_feature_list` function, which takes a dataframe and either your list of prefixes or *'all'*.

##### all AB features

In [20]:
AB_features = msm.get_feature_list(feats=['AB'], df=df)

In [21]:
AB_features

['AB_Ratio_1',
 'AB_Ratio_2',
 'AB_Ratio_3',
 'AB_Ratio_4',
 'AB_Ratio_5',
 'AB_Ratio_6',
 'AB_Ratio_7',
 'AB_Ratio_8',
 'AB_Ratio_9',
 'AB_Ratio_10',
 'AB_Ratio_11',
 'AB_Ratio_12',
 'AB_Ratio_13',
 'AB_Ratio_14',
 'AB_Ratio_15',
 'AB_Ratio_16',
 'AB_Ratio_17',
 'AB_Ratio_18',
 'AB_Ratio_19',
 'AB_Ratio_20']

##### All features

In [22]:
all_features = msm.get_feature_list(feats='all', df=df)

In [23]:
all_features

['Amp_1',
 'Amp_2',
 'Amp_3',
 'Amp_4',
 'Amp_5',
 'Amp_6',
 'Amp_7',
 'Amp_8',
 'Amp_9',
 'Amp_10',
 'Amp_11',
 'Amp_12',
 'Amp_13',
 'Amp_14',
 'Amp_15',
 'Amp_16',
 'Amp_17',
 'Amp_18',
 'Amp_19',
 'Amp_20',
 'Phase_1',
 'Phase_2',
 'Phase_3',
 'Phase_4',
 'Phase_5',
 'Phase_6',
 'Phase_7',
 'Phase_8',
 'Phase_9',
 'Phase_10',
 'Phase_11',
 'Phase_12',
 'Phase_13',
 'Phase_14',
 'Phase_15',
 'Phase_16',
 'Phase_17',
 'Phase_18',
 'Phase_19',
 'Phase_20',
 'A_Value_1',
 'A_Value_2',
 'A_Value_3',
 'A_Value_4',
 'A_Value_5',
 'A_Value_6',
 'A_Value_7',
 'A_Value_8',
 'A_Value_9',
 'A_Value_10',
 'A_Value_11',
 'A_Value_12',
 'A_Value_13',
 'A_Value_14',
 'A_Value_15',
 'A_Value_16',
 'A_Value_17',
 'A_Value_18',
 'A_Value_19',
 'A_Value_20',
 'B_Value_1',
 'B_Value_2',
 'B_Value_3',
 'B_Value_4',
 'B_Value_5',
 'B_Value_6',
 'B_Value_7',
 'B_Value_8',
 'B_Value_9',
 'B_Value_10',
 'B_Value_11',
 'B_Value_12',
 'B_Value_13',
 'B_Value_14',
 'B_Value_15',
 'B_Value_16',
 'B_Value_17',
 

### Also note that the data need to be scaled before any predictions can be made. The scaler is the second output of the `svm_model` function, and is used in `make_predictions_on_test`.
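
### The key point is that the scaler is fitted on the training data only and then reused, unchanged, on the test data. The sketch below illustrates this with a minimal standardizing scaler in plain Python. It assumes the scaler standardizes each feature to zero mean and unit variance; the notebook does not say which scaler `svm_model` returns, and `SimpleStandardScaler` is a hypothetical class written for this example.

```python
from statistics import mean, pstdev


class SimpleStandardScaler:
    """Minimal column-wise standardizer, for illustration only."""

    def fit(self, rows):
        # Learn per-column mean and standard deviation from the training rows.
        cols = list(zip(*rows))
        self.means = [mean(c) for c in cols]
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        self.stds = [pstdev(c) or 1.0 for c in cols]
        return self

    def transform(self, rows):
        # Apply the training statistics; nothing is re-estimated here.
        return [
            [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


# Made-up feature rows, for illustration only.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[2.0, 40.0]]

demo_scaler = SimpleStandardScaler().fit(train)
test_scaled = demo_scaler.transform(test)
print(test_scaled)
```

### Refitting a scaler on the test data would shift the features relative to what the model saw during training, so predictions should always use the scaler returned alongside the fitted model.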

### Let's make predictions for the hold-out set data using the Pct_Depth linear SVM model.

In [24]:
test_data = pd.read_csv(os.path.join(os.path.dirname(os.getcwd()),'data','interim', 'full_feature_test_data.csv'))

In [25]:
y_pred = msm.make_predictions_on_test(test_data, scaler_pct_depth, feature_list, model_pct_depth)

In [26]:
y_pred

array([ 97.51116081,  51.85992351,  18.77678606,  81.22103548,
        24.54162278,  72.75761092,  50.80852322,  32.94653143,
        57.73985622, 103.70184532,  55.73729727,  19.49718895,
        82.37510505,  29.03672756,  76.17466569,  50.81760467,
        34.48040757,  59.81694934, 100.22991088,  53.04275728,
        15.37465095,  84.75258588,  21.70587555,  69.8753622 ,
        48.82267917,  35.8181494 ,  57.1907746 ,  95.08656482,
        52.31684282,  19.47428177,  83.71796603,  21.79211948,
        71.43679718,  50.33158531,  39.30363302,  60.54409507,
        97.49790541,  78.93967895,  24.50707271,  58.82178414,
        30.08781843,  73.39432502,  33.84538672,  57.26371057,
        55.63807382, 101.86926768,  79.6579654 ,  26.92031197,
        58.47138335,  31.34369996,  72.67674553,  33.48640171,
        51.63729656,  56.73795606, 102.4840243 ,  79.62513402,
        28.62998928,  55.25402093,  29.08647599,  74.10156977,
        30.92957231,  57.63585664,  57.40013605,  97.40

### Finally, although this function is called `svm_model`, it also accepts model arguments for OLS linear regression (*'lin_reg'*), ridge regression (*'ridge'*), elastic net regression (*'elastic'*), and lasso regression (*'lasso'*).

### Here's a quick example for linear regression.

In [27]:
model, scaler = msm.svm_model(df=df, model='lin_reg', features='all')

Fitting 5 folds for each of 40 candidates, totalling 200 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting complete!


[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    5.9s finished


In [28]:
print("The best hyperparameters were {}".format(model.best_params_))

The best hyperparameters were {'copy_X': True, 'fit_intercept': True, 'n_jobs': 1, 'normalize': False}


In [29]:
print("The best RMSE value was {}".format(abs(model.best_score_)))

The best RMSE value was 0.050194079813211966
