![eo_logo.png](eo_logo.png)

# Tuning SVM and XGBoost

This notebook shows how to tune the hyperparameters of an SVM and an XGBoost classifier using a **genetic algorithm**. Both models are evaluated on the sonar dataset.

The optimisation process includes three main steps:<br>
1) Coding the **evaluation function** - taking a dictionary of parameters as argument and returning a scalar <br>
2) Defining the **search space**, a list of integer, real or categorical parameters<br>
3) Running the **optimisation function**<br>
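
To make the three steps concrete before using the library, here is a minimal, self-contained sketch of a genetic search written with the standard library only. It is not how `evolution_opt` is implemented: `toy_optimise`, `sample`, the dictionary-based `search_space` format and the `mutation_rate` parameter are all hypothetical names chosen for this illustration.

```python
import random

# Step 1: evaluation function - a dictionary of parameters in, a scalar out.
# This toy score is highest (0) at x = 3, y = -1.5.
def evaluate(params):
    return -(params["x"] - 3) ** 2 - (params["y"] + 1.5) ** 2

# Step 2: search space - hypothetical format: name -> (kind, low, high).
search_space = {"x": ("int", -10, 10), "y": ("real", -5.0, 5.0)}

def sample(spec, rng):
    """Draw one random value for a single parameter."""
    kind, low, high = spec
    return rng.randint(low, high) if kind == "int" else rng.uniform(low, high)

# Step 3: optimisation function - select parents, recombine, mutate, repeat.
def toy_optimise(evaluate, search_space, population_size=10, n_rounds=200,
                 mutation_rate=0.3, seed=0):
    rng = random.Random(seed)
    population = [{name: sample(spec, rng) for name, spec in search_space.items()}
                  for _ in range(population_size)]
    for _ in range(n_rounds):
        # Selection: the better half of the population survives as parents.
        population.sort(key=evaluate, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            # Crossover: each parameter is inherited from one of the two parents.
            child = {name: rng.choice((a[name], b[name])) for name in search_space}
            # Mutation: occasionally re-draw a parameter from its range.
            for name, spec in search_space.items():
                if rng.random() < mutation_rate:
                    child[name] = sample(spec, rng)
            children.append(child)
        population = parents + children
    return max(population, key=evaluate)

best = toy_optimise(evaluate, search_space)
print(best, evaluate(best))
```

Because the parents are carried over unchanged into the next round, the best score in this sketch can never get worse from one round to the next.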

The dataset can be found in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)).

General evolution_opt documentation can be found [here](https://eliottkalfon.github.io/evolution_opt/)



## Importing the main packages

In [1]:
import numpy as np  # np.mean is used in the evaluation functions below
import pandas as pd
from evolution_opt.genetic import *

from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

import xgboost as xgb

## Reading and preparing the sonar dataset

In [2]:
# load dataset
dataframe = pd.read_csv('sonar_dataset.txt', header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:60].astype(float)
Y = dataset[:,60]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
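
As an illustration of what the encoding step produces, the sketch below reproduces the mapping in plain Python. `LabelEncoder` assigns integers to the sorted class labels; the example assumes the label column holds the strings `'M'` (mine) and `'R'` (rock), and the four-element `labels` list is made up for the demonstration.

```python
# Hypothetical slice of the label column (assumed to contain 'M' and 'R').
labels = ["R", "M", "M", "R"]

# Sorted unique classes, each mapped to its position in that order.
classes = sorted(set(labels))
to_int = {label: index for index, label in enumerate(classes)}

encoded = [to_int[label] for label in labels]
print(classes)  # ['M', 'R']
print(encoded)  # [1, 0, 0, 1]
```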

## Coding the SVM evaluation function

In [3]:
def evaluate_svm(param_dict):
    '''
    This function will evaluate an SVM classifier using the parameters of a given individual
    '''
    # build a pipeline that standardizes the features, then fits the SVM
    estimators = []
    estimators.append(('standardize', StandardScaler()))
    estimators.append(('svm', SVC(C=param_dict['C'], kernel=param_dict['kernel'], 
                                  gamma=param_dict['gamma'], degree=param_dict['degree'])))
    pipeline = Pipeline(estimators)
    kfold = StratifiedKFold(n_splits=3, shuffle=True)
    results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
    #Returns the average accuracy across the cross validation splits
    return np.mean(results)
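
Note that `StratifiedKFold(shuffle=True)` without a `random_state` draws different folds on every call, so the same parameters can receive slightly different scores. One possible way to reduce that noise is to average several evaluations. The sketch below is only an illustration: `averaged` is a hypothetical helper, and `noisy_score` is a stand-in for `evaluate_svm` so that the example runs without the dataset.

```python
import random
from statistics import mean

def averaged(evaluate, n_repeats=5):
    """Wrap an evaluation function so that it returns the mean of n_repeats calls."""
    def wrapper(param_dict):
        return mean(evaluate(param_dict) for _ in range(n_repeats))
    return wrapper

# Stand-in for a noisy cross-validation score (an assumption for this demo).
rng = random.Random(0)
def noisy_score(param_dict):
    return 0.8 + rng.uniform(-0.05, 0.05)

stable_score = averaged(noisy_score, n_repeats=20)
print(stable_score({}))
```

In the notebook, `averaged(evaluate_svm)` could be passed to the optimiser in place of `evaluate_svm`, at the cost of `n_repeats` times more model fits. Passing a fixed `random_state` to `StratifiedKFold` is a cheaper alternative that makes the score deterministic for a given set of parameters.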

## Defining the search space

In [4]:
search_space = [
    Integer(1, 1000, 'C', step = 10),
    Categorical(['rbf', 'linear', 'poly', 'sigmoid'], 'kernel'),
    Real(0.001, 0.1, 'gamma', precision = 3),
    Integer(1,5, 'degree')
]

## Running the optimisation function

In [5]:
best_params = optimise(evaluate_svm, search_space,
             minimize=False, population_size=10,
             n_rounds=500, n_children=10, verbose=False)

Number of Iterations: 500
Best score: 0.9376811594202898
Best parameters: {'C': 721, 'kernel': 'rbf', 'gamma': 0.015, 'degree': 1}


In [6]:
best_params

{'C': 721, 'kernel': 'rbf', 'gamma': 0.015, 'degree': 1}
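
When reading this result, keep in mind that not every tuned parameter affects every kernel: in scikit-learn's `SVC`, `degree` is only used by the `'poly'` kernel, and `gamma` is ignored by the `'linear'` kernel. The `'degree': 1` reported above therefore has no influence on the winning `'rbf'` model. The sketch below filters a parameter dictionary down to the entries that matter for its kernel; `USED_BY_KERNEL` and `effective_params` are hypothetical names, and the table only covers the four parameters tuned in this notebook.

```python
# Which of the tuned parameters each SVC kernel actually uses.
USED_BY_KERNEL = {
    "linear": {"C"},
    "rbf": {"C", "gamma"},
    "sigmoid": {"C", "gamma"},
    "poly": {"C", "gamma", "degree"},
}

def effective_params(param_dict):
    """Drop parameters that the selected kernel ignores."""
    keep = USED_BY_KERNEL[param_dict["kernel"]] | {"kernel"}
    return {name: value for name, value in param_dict.items() if name in keep}

best_svm_params = {'C': 721, 'kernel': 'rbf', 'gamma': 0.015, 'degree': 1}
print(effective_params(best_svm_params))
# {'C': 721, 'kernel': 'rbf', 'gamma': 0.015}
```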

## Coding the XGBoost evaluation function

In [7]:
def evaluate_xgboost(param_dict):
    '''
    This function will evaluate an XGBoost classifier using the parameters of a given individual
    '''
    # build a pipeline that standardizes the features, then fits the XGBoost classifier
    estimators = []
    estimators.append(('standardize', StandardScaler()))
    estimators.append(('xgb', xgb.XGBClassifier(objective='binary:logistic', 
                                                learning_rate=param_dict['learning_rate'],
                                                gamma=param_dict['gamma'], 
                                                max_depth=param_dict['max_depth'],
                                                min_child_weight=param_dict['min_child_weight'], 
                                                subsample=param_dict['subsample'], 
                                                colsample_bytree=param_dict['colsample'])))
    pipeline = Pipeline(estimators)
    kfold = StratifiedKFold(n_splits=3, shuffle=True)
    results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
    #Returns the average accuracy across the cross validation splits
    return np.mean(results)

In [8]:
search_space = [
    Real(0, 0.9, 'learning_rate'),
    Integer(2, 15, 'max_depth'),
    Real(0, 5, 'gamma'),
    Integer(2, 15, 'min_child_weight'),
    Real(0.1, 1, 'subsample'),
    Real(0.1, 1, 'colsample')
]

In [9]:
best_params = optimise(evaluate_xgboost, search_space,
             minimize=False, population_size=10,
             n_rounds=500, n_children=10, verbose=False)

Number of Iterations: 500
Best score: 0.8898550724637682
Best parameters: {'learning_rate': 0.182, 'max_depth': 3, 'gamma': 0.392, 'min_child_weight': 3, 'subsample': 0.71, 'colsample': 0.68}


In [10]:
best_params

{'learning_rate': 0.182,
 'max_depth': 3,
 'gamma': 0.392,
 'min_child_weight': 3,
 'subsample': 0.71,
 'colsample': 0.68}
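
The search space uses the short name `colsample`, while `evaluate_xgboost` passes that value to `XGBClassifier` as `colsample_bytree`. To reuse the tuned values for a final model, the dictionary has to be translated in the same way. The helper below is a minimal sketch of that translation; `to_xgb_kwargs` is a hypothetical name and not part of `evolution_opt` or XGBoost.

```python
def to_xgb_kwargs(best_params):
    """Translate the tuned parameter dictionary into XGBClassifier keyword arguments."""
    kwargs = dict(best_params)  # copy, so the original dictionary is left untouched
    kwargs["colsample_bytree"] = kwargs.pop("colsample")
    kwargs["objective"] = "binary:logistic"  # same objective as in evaluate_xgboost
    return kwargs

tuned = {'learning_rate': 0.182, 'max_depth': 3, 'gamma': 0.392,
         'min_child_weight': 3, 'subsample': 0.71, 'colsample': 0.68}
print(to_xgb_kwargs(tuned))
```

With these keyword arguments, a final model could be built as `xgb.XGBClassifier(**to_xgb_kwargs(best_params))`, placed in the same `Pipeline` after `StandardScaler()`, and fitted on `X` and `encoded_Y`.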