# Hyperparameter Tuning

We will tune the hyperparameters of the Support Vector Regression (SVR) algorithm using scikit-learn's grid search class [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

In [3]:
%conda install scikit-learn pandas numpy

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/tim/opt/miniconda3/envs/training

  added / updated specs:
    - numpy
    - pandas
    - scikit-learn


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    joblib-0.16.0              |             py_0         210 KB
    mkl_fft-1.1.0              |   py37hc64f4ea_0         135 KB
    numpy-1.18.5               |   py37h1da2735_0           5 KB
    numpy-base-1.18.5          |   py37h3304bdc_0         3.9 MB
    pandas-1.0.5               |   py37h959d312_0         7.4 MB
    scikit-learn-0.23.1        |   py37h603561c_0         4.6 MB
    scipy-1.5.0                |   py37h912ce22_0        13.2 MB
    threadpoolctl-2.1.0        |     pyh5ca1d4c_0          17 KB
    ------------------------------------------------------------
                              

In [4]:
# Import modules
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
import pandas as pd 
import numpy as np
import sklearn.svm
import sklearn.ensemble

In [5]:
# Use same evaluate_model function as before
def evaluate_model(model_fn, print_result=False):
    '''
    Takes a prediction function model_fn and evaluates its
    mean absolute error against the housing prices test set.
    If print_result is True, prints a human-readable summary;
    otherwise returns the unrounded mean absolute error as a float.
    '''
    test_data = pd.read_csv("https://raw.githubusercontent.com/eliiza/ml-training-data/master/housing_price_data/test_data.csv")
    actual_values = test_data['SalePrice']
    # Pass in all columns except SalePrice
    test_input = test_data.filter(regex='^(?!SalePrice$).*')
    predicted_saleprice = model_fn(test_input)
    mae = np.mean(np.abs(predicted_saleprice-actual_values))
    if print_result:
        # Print a readable summary; nothing is returned in this mode
        print("The model is inaccurate by $%.2f on average." % mae)
        return None
    return mae

In [6]:
# Functions for data encoding

# Step 1: Feature Engineering
def encode_data(data,scaler = None):
    """
    Encode a dataframe of house price data using the desired feature engineering process. 
    The scaler argument allows you to either scale the data anew (scaler = None), 
    or use previously derived scaling parameters
    e.g. when you want to encode test data using the scaling parameters from the training dataset.
    Returns a dataframe of engineered features and the scaler object.
    """
    
    features = data.copy()
    
    # Numerical features
    features = features[['OverallQual','GrLivArea','BedroomAbvGr','FullBath','YearBuilt']]
    features['QualAreaInteract'] = features['OverallQual'] * features['GrLivArea']
    
    # Ordinal feature - map to numerical as before
    cond_map = {'Ex':5,'Gd':4,'TA':3,'Fa':2,'Po':1}
    features['KitchenQual'] = data['KitchenQual'].map(cond_map).fillna(0)
    
    # Categorical features - one-hot encode using pre-written helper functions (below)
    features['CentralAir'] = data['CentralAir'] == 'Y'
    electrical = encode_electrical(data['Electrical'])
    heating = encode_heating(data['Heating'])
    features = pd.concat([features,electrical,heating],axis=1)
    
    # Convert to float data type for scaling process
    features = features.astype(float)
    
    # Scale all the features
    # If no `scaler` object in the function arguments - carry out scaling anew
    # If `scaler` object in the function arguments - use those scaling parameters
    if scaler is None:
        scaler = MinMaxScaler()
        scaler.fit(features)
    features = pd.DataFrame(scaler.transform(features), 
                            columns = ['OverallQual','GrLivArea','BedroomAbvGr','FullBath','YearBuilt',
                                       'QualAreaInteract','KitchenQual',
                                       'CentralAir','FuseA','FuseF','FuseP','Mix','SBrkr',
                                       'GasA','GasW','Grav','Wall'])
    
    # Return the desired data frame and the scaling parameters used
    return(features,scaler)


# Helper functions for one hot encoding

def encode_electrical(electrical):
    """
    Create data frame with one column per category in 'electrical' column, rows are Boolean with respect to
    category string in electrical column.
    """
    one_hot_encoding = pd.DataFrame()
    one_hot_encoding['FuseA'] = electrical == 'FuseA'
    one_hot_encoding['FuseF'] = electrical == 'FuseF'
    one_hot_encoding['FuseP'] = electrical == 'FuseP'
    one_hot_encoding['Mix']   = electrical == 'Mix'
    one_hot_encoding['SBrkr'] = electrical == 'SBrkr'
    return(one_hot_encoding)

def encode_heating(heating):
    """
    Create data frame with one column per category in 'heating' column, rows are Boolean with respect to
    category string in heating column.
    """
    one_hot_encoding = pd.DataFrame()
    one_hot_encoding['GasA'] = heating == 'GasA'
    one_hot_encoding['GasW'] = heating == 'GasW'
    one_hot_encoding['Grav'] = heating == 'Grav'
    one_hot_encoding['Wall'] = heating == 'Wall'
    return(one_hot_encoding)
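
The `scaler` argument of `encode_data` exists so that test data are scaled with the minimum and maximum learned from the training data, rather than with their own. As a rough illustration of what that means, here is a minimal pure-Python sketch of min-max scaling. The helper names `fit_min_max` and `apply_min_max` and the area values are made up for this example and are not part of scikit-learn.

```python
# Minimal sketch of min-max scaling; helper names and values are hypothetical.
def fit_min_max(values):
    """Learn the scaling parameters (minimum and maximum) from training values."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values with previously learned parameters: (x - lo) / (hi - lo)."""
    return [(v - lo) / (hi - lo) for v in values]

train_area = [1000.0, 1500.0, 2000.0]   # hypothetical training GrLivArea values
test_area = [1250.0, 2500.0]            # hypothetical test GrLivArea values

# Fit on the training data only, then reuse the parameters for the test data
lo, hi = fit_min_max(train_area)
scaled_train = apply_min_max(train_area, lo, hi)
scaled_test = apply_min_max(test_area, lo, hi)

print(scaled_train)  # [0.0, 0.5, 1.0]
print(scaled_test)   # [0.25, 1.5]
```

Note that a test value larger than the training maximum scales to more than 1, which is expected when the training parameters are reused.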

In [7]:
# Import training data
training_set = pd.read_csv("https://raw.githubusercontent.com/eliiza/ml-training-data/master/housing_price_data/training_data.csv")

# Feature engineering as in the previous notebook (using the encode_data function defined above)
training_features, training_scaler = encode_data(training_set)

## Default Support Vector Regression hyperparameters

How well do the default SVR hyperparameters perform?

In [8]:
# SVR

# Step 2: Train the model
predictor = sklearn.svm.SVR(gamma='auto') 
predictor.fit(training_features, training_set['SalePrice'])

# Step 3: Create a function that can make predictions using the model
def svr_model(input_data, scaler=training_scaler):
    input_features,_ = encode_data(input_data,scaler)
    predictions = predictor.predict(input_features)
    return(predictions)

# Step 4: Evaluate
evaluate_model(svr_model, print_result=True)

The model is inaccurate by $51635.91 on average.


## Hyperparameter Tuning

We will try to tune these hyperparameters:

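- `C` = regularization parameter: the strength of the penalty applied to errors that fall outside the margin
- `epsilon` = margin size: the width of the tube around the prediction within which no penalty is applied
- `kernel` = type of kernel transformation applied to the features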
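
To make the roles of `epsilon` and `C` more concrete, the sketch below computes the epsilon-insensitive loss that SVR is built around: errors smaller than `epsilon` cost nothing, and larger errors are charged only for the part that exceeds `epsilon`. In the SVR objective the sum of these losses is multiplied by `C`. The function name and the price values are hypothetical, chosen only for illustration.

```python
# Minimal sketch of the epsilon-insensitive loss; names and values are hypothetical.
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    return [max(0.0, abs(a - p) - epsilon) for a, p in zip(actual, predicted)]

actual = [200000.0, 150000.0, 300000.0]      # hypothetical sale prices
predicted = [200080.0, 150400.0, 290000.0]   # hypothetical predictions

# Absolute errors are 80, 400 and 10000; with epsilon = 100 the first is free
losses = epsilon_insensitive_loss(actual, predicted, epsilon=100)
print(losses)  # [0.0, 300.0, 9900.0]

# A larger C weights these losses more heavily relative to the
# regularization term, so the model fits the training data more tightly.
C = 1E6
penalty = C * sum(losses)
```

Because sale prices are in the hundreds of thousands of dollars, `epsilon` values in the hundreds and large `C` values are plausible candidates for the grid.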

In [9]:
# Define the chosen hyperparameter values for the Grid Search
hyperparameters = {'C':[1E5, 1E6, 1E7], 'epsilon':[100, 250, 500], 
                   'kernel': ['linear', 'rbf']}

# Define the estimator to train
svr = sklearn.svm.SVR(gamma='auto')

# Define the Grid Search CV to undertake
svr_grid_search = GridSearchCV(svr, hyperparameters, cv = 10)
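
It can help to estimate how much work the grid search will do before running it. The sketch below enumerates the hyperparameter combinations using only the standard library and counts the cross-validation fits. It is an illustration of the idea, not how `GridSearchCV` is implemented internally.

```python
from itertools import product

# Same grid as above
hyperparameters = {'C': [1E5, 1E6, 1E7], 'epsilon': [100, 250, 500],
                   'kernel': ['linear', 'rbf']}
n_folds = 10  # matches cv = 10

# Build every combination of one value per hyperparameter
names = list(hyperparameters)
combinations = [dict(zip(names, values))
                for values in product(*hyperparameters.values())]

n_candidates = len(combinations)   # 3 * 3 * 2 = 18
n_fits = n_candidates * n_folds    # one fit per candidate per fold = 180
print(n_candidates, n_fits)        # 18 180
```

With the default `refit=True`, the best candidate is then fitted once more on the full training set, so adding a value to any hyperparameter list increases the run time noticeably.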

In [11]:
# Run the Grid Search Cross Validation process
svr_grid_search.fit(training_features, training_set['SalePrice'])

GridSearchCV(cv=10, estimator=SVR(gamma='auto'),
             param_grid={'C': [100000.0, 1000000.0, 10000000.0],
                         'epsilon': [100, 250, 500],
                         'kernel': ['linear', 'rbf']})

In [12]:
# See what hyperparameter combination was best
svr_grid_search.best_params_

{'C': 1000000.0, 'epsilon': 100, 'kernel': 'rbf'}
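
The "best" combination is the best according to the search's scoring metric. When no `scoring` argument is given, `GridSearchCV` uses the estimator's own `score` method, which for regressors is R², whereas `evaluate_model` reports the mean absolute error. The two metrics can rank models differently, as this pure-Python sketch with made-up numbers suggests. The function names mirror common metric names but are defined here by hand.

```python
# Minimal sketch comparing MAE and R^2 on hypothetical predictions.
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r2_score(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [100.0, 200.0, 300.0, 400.0]
model_a = [110.0, 210.0, 310.0, 410.0]   # small error on every sample
model_b = [100.0, 200.0, 300.0, 430.0]   # one larger error, otherwise exact

mae_a, mae_b = mean_absolute_error(actual, model_a), mean_absolute_error(actual, model_b)
r2_a, r2_b = r2_score(actual, model_a), r2_score(actual, model_b)

# model_b wins on MAE (7.5 vs 10.0) but model_a wins on R^2,
# because squared errors punish the single large miss more heavily
print(mae_a, mae_b)
print(r2_a, r2_b)
```

If the goal is to minimise the mean absolute error, one option is to pass `scoring='neg_mean_absolute_error'` when constructing `GridSearchCV`, so that candidates are selected by the same metric used in the final evaluation.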

In [13]:
def tuned_svr_model(input_data):
    """
    Apply the tuned SVR model to input data (e.g. the test set).
    Returns predictions for the 'SalePrice' column.
    """
    # Create scaler for use on test data
    training_features, scaler = encode_data(training_set)
    # Apply scaler for feature engineering on test data
    input_features,_ = encode_data(input_data,scaler)
    # Predict using the test data
    predictions = svr_grid_search.best_estimator_.predict(input_features)
    return(predictions)

In [14]:
# Evaluate the tuned model
evaluate_model(tuned_svr_model, print_result=True)

The model is inaccurate by $20804.21 on average.
