# Machine Learning Engineer Nanodegree
## Using Supervised Classification Algorithms to Predict Bank Term Deposit Subscription
Fabiano Shoji Yoschitaki  
July 8th, 2018

## Project Design

As described in the capstone proposal document, the project is composed of the following activities:

- **Data and Library Loading:** the first step is to load the Bank Marketing dataset in CSV format from the UCI Machine Learning Repository, along with all the libraries needed for the project.

- **Data Exploration:** in this step, we visualize the data, print some samples, check its dimensions, identify the most relevant features, and show its statistical summary.

- **Data Preparation:** after exploring the data, we apply pre-processing tasks: cleaning the data, removing null values, converting categorical features into dummy/indicator variables, and splitting the data into training and testing datasets.

- **Model Selection:** with the prepared data, we try various supervised classification algorithms, compare their results, and choose the best one (based on the accuracy score) for model tuning.

- **Model Tuning:** after choosing the best model, we apply grid search with cross-validation to tune the model's hyper-parameters.

- **Final Evaluation:** in this step, we evaluate the accuracy score of the tuned model on the testing dataset.

-----------
### 1. Data and Library Loading
In this section, we will load the dataset and the libraries used in the project.  

#### 1.1. Library Loading
Loading all libraries needed for the project.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from time import time
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in newer pandas
from IPython.display import display

from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from xgboost.sklearn import XGBClassifier
from xgboost import plot_importance
from sklearn.svm import SVC
from sklearn.dummy import DummyClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import make_scorer

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')
sns.set(style="whitegrid")
%matplotlib inline

#### 1.2. Data Loading
Loading the dataset from the CSV file.

In [2]:
bank_full_data = pd.read_csv('bank-full.csv', delimiter=';')
print("Bank dataset was loaded successfully!")

Bank dataset was loaded successfully!


-----------
### 2. Data Exploration
Here we will apply some methods/techniques for Exploratory Data Analysis to better understand the data.

#### 2.1. Data Dimensions
Printing the number of rows and columns in the data.

In [None]:
print("The dataset has {} rows and {} columns".format(bank_full_data.shape[0], bank_full_data.shape[1]))

#### 2.2. Data Information
Printing information about column dtypes, non-null counts and memory usage.

In [None]:
bank_full_data.info()

#### 2.3. Data Samples
Printing the first 10 rows of the data.

In [None]:
bank_full_data.head(10)

#### 2.4. Data Descriptive Statistics
Displaying a statistical summary of the numeric columns.

In [None]:
bank_full_data.describe()

#### 2.5. Data General Information
Exploring features information.

In [None]:
# Calculate number of clients
n_clients = len(bank_full_data)

# Calculate clients who have subscribed
n_clients_subscribed = len(bank_full_data[bank_full_data['y'] == 'yes'])

# Calculate clients who haven't subscribed
n_clients_not_subscribed = len(bank_full_data[bank_full_data['y'] == 'no'])

# Calculate subscription rate
subscription_rate = float(n_clients_subscribed)/float(n_clients) * 100

# Print the results
print("Total number of clients: {}".format(n_clients))
print("Number of clients who have subscribed: {}".format(n_clients_subscribed))
print("Number of clients who haven't subscribed: {}".format(n_clients_not_subscribed))
print("Subscription rate of the dataset: {:.2f}%".format(subscription_rate))
print("No-Subscription rate of the dataset: {:.2f}%".format(100-subscription_rate))
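The same class balance can be read off in one step with `value_counts`. The sketch below uses a small hypothetical Series (`toy_y`) in place of `bank_full_data['y']`, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for bank_full_data['y']; the values are made up.
toy_y = pd.Series(['no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no'], name='y')

# normalize=True returns fractions per class instead of raw counts.
class_share = toy_y.value_counts(normalize=True) * 100

print(class_share.round(2))
```

On the real data, replacing `toy_y` with `bank_full_data['y']` should reproduce the two rates printed by the cell above.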

#### 2.6. Visualization
Generating some graphs for visualization.

In [None]:
plt.figure(figsize=(8, 5))
plt.title("Distribution of Clients Subscribed vs Not Subscribed")
bank_full_data.groupby("y")['y'].count().plot.bar()

In [None]:
age_histogram = sns.distplot(bank_full_data['age'], bins=10)
plt.title('Distribution by Age')
age_histogram.figure.set_size_inches(12, 6)
plt.show()

In [None]:
figure = plt.figure(figsize=(12, 6))
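# Restrict to numeric columns; non-numeric columns cannot be correlated
corr = bank_full_data.select_dtypes(include='number').corr()
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask, annot=True, cmap="Blues")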
figure.suptitle('Correlation Matrix', fontsize=15)
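To see what the triangular mask does, here is a minimal sketch on a hypothetical 4x4 matrix shape. Cells set to `True` are hidden by `sns.heatmap`. The second mask, built with `k=1`, is an optional variant (not used in the notebook) that keeps the diagonal visible.

```python
import numpy as np

n = 4  # hypothetical number of numeric columns

# Same construction as in the cell above: hide the upper triangle and the diagonal.
mask = np.zeros((n, n), dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Variant: k=1 starts one step above the diagonal, so the diagonal stays visible.
mask_keep_diag = np.zeros((n, n), dtype=bool)
mask_keep_diag[np.triu_indices_from(mask_keep_diag, k=1)] = True

print(mask)
print(mask_keep_diag)
```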

-----------
### 3. Data Preparation
In this section we will apply some methods/techniques for Data Preprocessing.

#### 3.1. Checking for null values
If the dataset has null values, we must remove them before training.

In [None]:
bank_full_data.isnull().sum()
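One caveat: `isnull()` only detects real missing values. The processed column names later in this notebook (`job_unknown`, `education_unknown`, `contact_unknown`, `poutcome_unknown`) show that this dataset also uses the literal label `'unknown'`, which `isnull()` will not count. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical rows that mimic the dataset's 'unknown' placeholder.
toy = pd.DataFrame({
    'job': ['admin.', 'unknown', 'services', 'unknown'],
    'education': ['primary', 'tertiary', 'unknown', 'secondary'],
    'age': [30, 41, 52, 28],
})

# True missing values (NaN/None): none in this toy frame.
null_counts = toy.isnull().sum()

# Placeholder labels must be counted explicitly on the categorical columns.
cat_cols = ['job', 'education']
unknown_counts = (toy[cat_cols] == 'unknown').sum()

print(null_counts)
print(unknown_counts)
```

Whether to keep `'unknown'` as its own category (as this notebook does through dummy variables) or treat it as missing is a modelling choice.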

#### 3.1. Preprocessing Features
Applying `pandas.get_dummies` to convert categorical features into binary indicator variables. We also replace 'yes' with 1 and 'no' with 0.

In [3]:
def preprocess_features(X):
    ''' Preprocesses the bank data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index=X.index)

    # Investigate each feature column for the data
    # (DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement)
    for col, col_data in X.items():
        
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If the column is still non-numeric, it is categorical: convert to dummy variables
        if col_data.dtype == object:
            # Example: 'marital' => 'marital_divorced', 'marital_married' and 'marital_single'
            col_data = pd.get_dummies(col_data, prefix=col)  
                    
        # Collect the revised columns
        output = output.join(col_data)
    
    return output
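For comparison, the same encoding can be sketched without an explicit loop over columns. This is an alternative, not the notebook's method. The toy frame and the `binary_cols` list are hypothetical. Note that `pd.get_dummies(..., columns=...)` appends the dummy columns at the end, so the target `y` is no longer the last column and should then be selected by name rather than by position.

```python
import pandas as pd

# Hypothetical miniature of the bank data.
toy = pd.DataFrame({
    'age': [58, 44, 33],
    'marital': ['married', 'single', 'married'],
    'housing': ['yes', 'yes', 'no'],
    'y': ['no', 'no', 'yes'],
})

# Assumption: the yes/no columns are known up front.
binary_cols = ['housing', 'y']

encoded = toy.copy()
for col in binary_cols:
    # Explicit mapping; any value other than yes/no would become NaN.
    encoded[col] = encoded[col].map({'yes': 1, 'no': 0})

# One-hot encode the remaining categorical column; dtype=int gives 0/1 values.
encoded = pd.get_dummies(encoded, columns=['marital'], dtype=int)

print(list(encoded.columns))
```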

In [4]:
bank_full_data = preprocess_features(bank_full_data)
print("Processed feature columns ({} total features): \n{}".format(len(bank_full_data.columns), list(bank_full_data.columns)))

Processed feature columns (49 total features): 
['age', 'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid', 'job_management', 'job_retired', 'job_self-employed', 'job_services', 'job_student', 'job_technician', 'job_unemployed', 'job_unknown', 'marital_divorced', 'marital_married', 'marital_single', 'education_primary', 'education_secondary', 'education_tertiary', 'education_unknown', 'default', 'balance', 'housing', 'loan', 'contact_cellular', 'contact_telephone', 'contact_unknown', 'day', 'month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep', 'duration', 'campaign', 'pdays', 'previous', 'poutcome_failure', 'poutcome_other', 'poutcome_success', 'poutcome_unknown', 'y']


#### 3.2. Identifying Feature and Target Columns
Separating the feature columns from the target column.

In [5]:
# Extract feature columns
feature_cols = list(bank_full_data.columns[:-1])

# Extract target column 'y' (subscribed/not subscribed)
target_col = bank_full_data.columns[-1] 

# Show the list of columns
print("Feature columns:\n{}".format(feature_cols))
print("\nTarget column: {}".format(target_col))

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = bank_full_data[feature_cols]
y_all = bank_full_data[target_col]

# Show the feature information by printing the first five rows
print("\nFeature values:")
print(X_all.head())

Feature columns:
['age', 'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid', 'job_management', 'job_retired', 'job_self-employed', 'job_services', 'job_student', 'job_technician', 'job_unemployed', 'job_unknown', 'marital_divorced', 'marital_married', 'marital_single', 'education_primary', 'education_secondary', 'education_tertiary', 'education_unknown', 'default', 'balance', 'housing', 'loan', 'contact_cellular', 'contact_telephone', 'contact_unknown', 'day', 'month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep', 'duration', 'campaign', 'pdays', 'previous', 'poutcome_failure', 'poutcome_other', 'poutcome_success', 'poutcome_unknown']

Target column: y

Feature values:
   age  job_admin.  job_blue-collar  job_entrepreneur  job_housemaid  \
0   58           0                0                 0              0   
1   44           0                0                 0      

In [None]:
X_all.head(10)

#### 3.3. Training and Testing Datasets
Splitting data into training and testing datasets.

In [6]:
# Shuffle and split the dataset: 70% for training and 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.3, random_state=10)

print("Training set has {} samples with {:.2f}% of 'yes' (subscribed) and {:.2f}% of 'no' (not subscribed)."
      .format(X_train.shape[0], 
        100 * len(y_train[y_train == 1])/len(y_train), 
        100 * len(y_train[y_train == 0])/len(y_train)))

print("Testing set has {} samples with {:.2f}% of 'yes' (subscribed) and {:.2f}% of 'no' (not subscribed)."
      .format(X_test.shape[0], 
        100 * len(y_test[y_test == 1])/len(y_test), 
        100 * len(y_test[y_test == 0])/len(y_test)))

Training set has 31647 samples with 11.79% of 'yes' (subscribed) and 88.21% of 'no' (not subscribed).
Testing set has 13564 samples with 11.49% of 'yes' (subscribed) and 88.51% of 'no' (not subscribed).
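The two sets end up with slightly different class shares (11.79% vs 11.49%) because the split is purely random. If identical proportions are wanted, `train_test_split` accepts a `stratify` argument. The sketch below is an optional variant on hypothetical toy labels, not what the notebook runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 12% positives.
y_toy = np.array([1] * 120 + [0] * 880)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=10, stratify=y_toy)

print("train positive share: {:.4f}".format(y_tr.mean()))
print("test positive share:  {:.4f}".format(y_te.mean()))
```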


#### 3.4. Feature Scaling
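Standardizing the features so that each has mean 0 and standard deviation 1. The scaler is fitted on the training set only and then applied to the testing set. Standardization shifts and rescales each feature; it does not make its distribution normal.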

In [7]:
sc = StandardScaler()
# Keep column header names for final importance plot.
X_train = pd.DataFrame(sc.fit_transform(X_train), columns=X_all.columns)
X_test = pd.DataFrame(sc.transform(X_test), columns=X_all.columns)
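A quick check of what `StandardScaler` guarantees, sketched on hypothetical skewed toy data: the training column ends up with mean 0 and standard deviation 1, while the test column, transformed with the training statistics, is only approximately centred.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Hypothetical skewed feature (exponential), one column.
train = rng.exponential(scale=1000.0, size=(500, 1))
test = rng.exponential(scale=1000.0, size=(200, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics learned from train only
test_scaled = scaler.transform(test)        # reuses the train mean and std

print("train mean/std:", train_scaled.mean(), train_scaled.std())
print("test mean/std: ", test_scaled.mean(), test_scaled.std())
```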

-----------
### 4. Model Selection
In this section we will train several supervised classification algorithms and compare them in order to choose the best one to tune in the next section.

#### 4.1. Selected Supervised Classification Algorithms
List of chosen algorithms:
- Gaussian Naive Bayes (GaussianNB)
- Decision Trees
- Bagging (Ensemble Methods) 
- AdaBoost (Ensemble Methods) 
- Random Forest (Ensemble Methods)
- Linear Discriminant Analysis (LDA)
- K-Nearest Neighbors (KNeighbors)
- Stochastic Gradient Descent (SGDC)
- Support Vector Machines (SVM)
- Logistic Regression (LR)
- eXtreme Gradient Boosting (XGBoost)

In [None]:
classifiers = []
classifiers.append(GaussianNB())
classifiers.append(DecisionTreeClassifier(random_state=1))
classifiers.append(BaggingClassifier(random_state=1))
classifiers.append(AdaBoostClassifier(random_state=1))
classifiers.append(RandomForestClassifier(random_state=1))
classifiers.append(LinearDiscriminantAnalysis())
classifiers.append(KNeighborsClassifier())
classifiers.append(SGDClassifier())
classifiers.append(SVC(random_state=1))
classifiers.append(LogisticRegression(random_state=1))
classifiers.append(XGBClassifier(random_state=1))

#### 4.2. Training, Predicting and Scoring Functions
Defining functions to train models, predict labels and show metric results: accuracy and f1-score.

In [8]:
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()    
    # Print the results
    print("Trained model in {:.4f} seconds".format(end-start))
    
def predict_labels(clf, features, target):
    ''' Makes predictions using a fitted classifier. '''
    # `target` is not used here; it is kept so that existing calls remain valid
    y_pred = clf.predict(features)
    return y_pred

def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Trains a classifier and reports accuracy and F1-score on the training and test sets. '''
    print("\nTraining classifier {} using a training set size of {}".format(clf.__class__.__name__, len(X_train)))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Predict labels for training and testing sets
    y_pred_train = predict_labels(clf, X_train, y_train)
    y_pred_test = predict_labels(clf, X_test, y_test)
    
    # Print the results of prediction for both training and testing
    print("Accuracy score for training set: {:.4f}.".format(accuracy_score(y_train, y_pred_train)))
    print("F1 score for training set: {:.4f}.".format(f1_score(y_train, y_pred_train)))
    print("Accuracy score for test set: {:.4f}.".format(accuracy_score(y_test, y_pred_test)))
    print("F1 score for test set: {:.4f}.".format(f1_score(y_test, y_pred_test)))    
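Because the two metrics can diverge on imbalanced data, it helps to see how F1 is derived. The sketch below computes precision, recall and F1 by hand from a confusion matrix and compares the result with `f1_score`. The labels are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical labels: 2 positives out of 10, one of them found, one false alarm.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1_manual = 2 * precision * recall / (precision + recall)

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1 (manual):", f1_manual)
print("F1 (sklearn):", f1_score(y_true, y_pred))
```

Here accuracy is 0.8 while F1 is only 0.5, which mirrors the gap between the two scores in the results below.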

#### 4.3. Evaluation
Evaluating all classifiers.

In [None]:
for clf in classifiers:    
    train_predict(clf, X_train, y_train, X_test, y_test)

-----------
### 5. Model Tuning

#### 5.1. Cross Validation
Creating random data splits for training and testing sets.

In [9]:
cv_sets = ShuffleSplit(test_size=0.20, random_state=1)
cv_sets.get_n_splits(X_train)
print(cv_sets)

ShuffleSplit(n_splits=10, random_state=1, test_size=0.2, train_size=None)
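A small sketch of what this splitter yields, using a hypothetical 50-row array: ten independent random splits, each holding out 20% of the rows. Note that the grid searches in this section pass `cv=5` rather than `cv_sets`. In scikit-learn, an integer `cv` combined with a classifier resolves to stratified k-fold, so `cv_sets` would have to be passed explicitly (`cv=cv_sets`) to take effect.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical data: 50 rows, one feature.
X_toy = np.arange(50).reshape(-1, 1)

cv = ShuffleSplit(test_size=0.20, random_state=1)  # n_splits defaults to 10

# Each iteration returns index arrays for the train and validation parts.
sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in cv.split(X_toy)]

print(len(sizes), "splits:", sizes[0])
```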


In [10]:
tuned_params = {}

#### 5.2. Scoring for Grid Search Cross Validation
Defining multiple evaluation metrics for the grid search: accuracy and F1-score.

In [11]:
scoring = {
    'Accuracy': make_scorer(accuracy_score), 
    'F1-Score': make_scorer(f1_score)
}
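With a scoring dictionary like this, `GridSearchCV` records every metric in `cv_results_` under keys of the form `mean_test_<name>`, while `refit` decides which metric selects the best candidate. The sketch below shows how to read both metrics side by side. It uses a decision tree and synthetic data as stand-ins (both are assumptions chosen to keep the example fast), not the XGBoost model tuned in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (hypothetical): about 15% positives.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=1)

scoring = {
    'Accuracy': make_scorer(accuracy_score),
    'F1-Score': make_scorer(f1_score)
}

gs = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid={'max_depth': [2, 4]},
    scoring=scoring,
    cv=5,
    refit='Accuracy')  # best_params_ and best_score_ refer to this metric
gs.fit(X_toy, y_toy)

# One row per candidate, both metrics averaged over the folds.
for params, acc, f1 in zip(gs.cv_results_['params'],
                           gs.cv_results_['mean_test_Accuracy'],
                           gs.cv_results_['mean_test_F1-Score']):
    print(params, round(acc, 4), round(f1, 4))
```

Inspecting `mean_test_F1-Score` alongside the accuracy column shows whether the accuracy-selected candidate is also competitive on F1.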

#### 5.3. Default XGBoost
Creating XGBoost instance for tuning.

In [12]:
model_v0 = XGBClassifier(
    random_state=1)
train_predict(model_v0, X_train, y_train, X_test, y_test)


Training classifier XGBClassifier using a training set size of 31647
Trained model in 5.6580 seconds
Accuracy score for training set: 0.9085.
F1 score for training set: 0.5046.
Accuracy score for test set: 0.9082.
F1 score for test set: 0.4916.


#### 5.4. First tune: scale_pos_weight
Tuning scale_pos_weight parameter.

In [13]:
params_0 = {
    'scale_pos_weight': [x for x in range(1,11)]
}
gs_0 = GridSearchCV(
    estimator=model_v0, 
    param_grid=params_0,
    scoring=scoring, 
    cv=5, 
    refit='Accuracy')
gs_0.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=1,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'scale_pos_weight': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
       pre_dispatch='2*n_jobs', refit='Accuracy',
       return_train_score='warn',
       scoring={'Accuracy': make_scorer(accuracy_score), 'F1-Score': make_scorer(f1_score)},
       verbose=0)

In [14]:
tuned_params['scale_pos_weight'] = gs_0.best_params_['scale_pos_weight']
print("The best parameters and values found were: {}, best score: {}".format(tuned_params, gs_0.best_score_))

The best parameters and values found were: {'scale_pos_weight': 1}, best score: 0.9037191518943344
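For context, the XGBoost parameter documentation suggests the ratio of negative to positive instances as a typical value to consider for `scale_pos_weight`. The sketch below computes that ratio from the class shares printed for the training set in section 3.3. The search above selects on accuracy, which is probably why it keeps the default of 1; a search refitted on F1 might favour a larger value, but that is not tested here.

```python
# Class shares taken from the training-set printout in section 3.3.
pct_yes = 11.79  # subscribed (positive class)
pct_no = 88.21   # not subscribed (negative class)

# Heuristic starting point: negatives divided by positives.
ratio = pct_no / pct_yes

print("negative/positive ratio: {:.2f}".format(ratio))
```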


In [15]:
model_v1 = XGBClassifier(
    random_state=1,
    scale_pos_weight=tuned_params['scale_pos_weight'])
train_predict(model_v1, X_train, y_train, X_test, y_test)


Training classifier XGBClassifier using a training set size of 31647
Trained model in 7.2180 seconds
Accuracy score for training set: 0.9085.
F1 score for training set: 0.5046.
Accuracy score for test set: 0.9082.
F1 score for test set: 0.4916.


#### 5.5. Second tune: objective
Tuning objective parameter.

In [16]:
params_1 = {
    'objective': [
        'reg:linear', 
        'reg:logistic',
        'binary:logistic',
        'binary:logitraw',
        'binary:hinge',
        'count:poisson',]
}
 
gs_1 = GridSearchCV(
    estimator=model_v1, 
    param_grid=params_1, 
    scoring=scoring, 
    cv=5, 
    refit='Accuracy')
gs_1.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=1,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'objective': ['reg:linear', 'reg:logistic', 'binary:logistic', 'binary:logitraw', 'binary:hinge', 'count:poisson']},
       pre_dispatch='2*n_jobs', refit='Accuracy',
       return_train_score='warn',
       scoring={'Accuracy': make_scorer(accuracy_score), 'F1-Score': make_scorer(f1_score)},
       verbose=0)

In [17]:
tuned_params['objective'] = gs_1.best_params_['objective']
print("The best parameters and values found were: {}, best score: {}".format(gs_1.best_params_, gs_1.best_score_))

The best parameters and values found were: {'objective': 'reg:linear'}, best score: 0.9039719404682909


In [18]:
model_v2 = XGBClassifier(
    random_state=1,
    scale_pos_weight=tuned_params['scale_pos_weight'],
    objective=tuned_params['objective'])
train_predict(model_v2, X_train, y_train, X_test, y_test)


Training classifier XGBClassifier using a training set size of 31647
Trained model in 7.0007 seconds
Accuracy score for training set: 0.9086.
F1 score for training set: 0.5088.
Accuracy score for test set: 0.9082.
F1 score for test set: 0.4978.


#### 5.6. Third tune: max_depth and min_child_weight
Tuning max_depth and min_child_weight parameters.

In [19]:
params_2 = {
    'max_depth': [x for x in range(3, 9)],
    'min_child_weight': [x for x in range(1, 7)]
}
gs_2 = GridSearchCV(
    estimator=model_v2, 
    param_grid=params_2,
    scoring=scoring, 
    cv=5,
    refit='Accuracy')
gs_2.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=1,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [3, 4, 5, 6, 7, 8], 'min_child_weight': [1, 2, 3, 4, 5, 6]},
       pre_dispatch='2*n_jobs', refit='Accuracy',
       return_train_score='warn',
       scoring={'Accuracy': make_scorer(accuracy_score), 'F1-Score': make_scorer(f1_score)},
       verbose=0)

In [20]:
tuned_params['max_depth'] = gs_2.best_params_['max_depth']
tuned_params['min_child_weight'] = gs_2.best_params_['min_child_weight']
print("The best parameters and values found were: {}, best score: {}".format(gs_2.best_params_, gs_2.best_score_))

The best parameters and values found were: {'max_depth': 6, 'min_child_weight': 6}, best score: 0.9060890447751762
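The cost of a grid search grows with the product of the grid sizes. A minimal sketch with scikit-learn's `ParameterGrid` counts the candidates for this step; the grid is rebuilt here so that the block is self-contained.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as params_2 above.
params_2 = {
    'max_depth': list(range(3, 9)),
    'min_child_weight': list(range(1, 7))
}

n_candidates = len(ParameterGrid(params_2))  # every combination of the two lists
n_folds = 5
n_fits = n_candidates * n_folds              # the final refit adds one more fit

print("{} candidates x {} folds = {} fits".format(n_candidates, n_folds, n_fits))
```

The selected `min_child_weight` of 6 sits on the upper edge of the searched range, so extending that range may be worth trying.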


In [21]:
model_v3 = XGBClassifier(
    random_state=1,
    scale_pos_weight=tuned_params['scale_pos_weight'],
    objective=tuned_params['objective'], 
    max_depth=tuned_params['max_depth'],
    min_child_weight=tuned_params['min_child_weight'])
train_predict(model_v3, X_train, y_train, X_test, y_test)


Training classifier XGBClassifier using a training set size of 31647
Trained model in 14.1944 seconds
Accuracy score for training set: 0.9380.
F1 score for training set: 0.6873.
Accuracy score for test set: 0.9110.
F1 score for test set: 0.5444.


#### 5.7. Fourth tune: subsample and colsample_bytree
Tuning subsample and colsample_bytree parameters.

In [22]:
params_3 = {
    'subsample':[0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree':[0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
}
gs_3 = GridSearchCV(
    estimator=model_v3, 
    param_grid=params_3,
    scoring=scoring, 
    cv=5,
    refit='Accuracy')
gs_3.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=6, min_child_weight=6, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=1,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0], 'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]},
       pre_dispatch='2*n_jobs', refit='Accuracy',
       return_train_score='warn',
       scoring={'Accuracy': make_scorer(accuracy_score), 'F1-Score': make_scorer(f1_score)},
       verbose=0)

In [23]:
tuned_params['colsample_bytree'] = gs_3.best_params_['colsample_bytree']
tuned_params['subsample'] = gs_3.best_params_['subsample']
print("The best parameters and values found were: {}, best score: {}".format(gs_3.best_params_, gs_3.best_score_))

The best parameters and values found were: {'colsample_bytree': 1.0, 'subsample': 0.8}, best score: 0.9064366290643663


In [24]:
model_v4 = XGBClassifier(
    random_state=1,
    scale_pos_weight=tuned_params['scale_pos_weight'],
    objective=tuned_params['objective'], 
    max_depth=tuned_params['max_depth'],
    min_child_weight=tuned_params['min_child_weight'], 
    subsample=tuned_params['subsample'], 
    colsample_bytree=tuned_params['colsample_bytree'])
train_predict(model_v4, X_train, y_train, X_test, y_test)


Training classifier XGBClassifier using a training set size of 31647
Trained model in 15.1605 seconds
Accuracy score for training set: 0.9373.
F1 score for training set: 0.6830.
Accuracy score for test set: 0.9115.
F1 score for test set: 0.5476.


#### 5.8. Fifth tune: reg_alpha and reg_lambda
Tuning reg_alpha and reg_lambda parameters.

In [25]:
params_4 = {
    'reg_alpha':[x for x in range(0, 6)],
    'reg_lambda':[x for x in range(1, 7)]
}
gs_4 = GridSearchCV(
    estimator=model_v4, 
    param_grid=params_4, 
    scoring=scoring, 
    cv=5,
    refit='Accuracy')
gs_4.fit(X_train, y_train)

SyntaxError: keyword argument repeated (<ipython-input-25-687bd7c50edc>, line 9)

In [None]:
tuned_params['reg_alpha'] = gs_4.best_params_['reg_alpha']
tuned_params['reg_lambda'] = gs_4.best_params_['reg_lambda']
print("The best parameters and values found were: {}, best score: {}".format(gs_4.best_params_, gs_4.best_score_))

In [None]:
model_v5 = XGBClassifier(
    random_state=1,
    scale_pos_weight=tuned_params['scale_pos_weight'],
    objective=tuned_params['objective'], 
    max_depth=tuned_params['max_depth'],
    min_child_weight=tuned_params['min_child_weight'], 
    subsample=tuned_params['subsample'], 
    colsample_bytree=tuned_params['colsample_bytree'],
    reg_alpha=tuned_params['reg_alpha'],
    reg_lambda=tuned_params['reg_lambda'])
train_predict(model_v5, X_train, y_train, X_test, y_test)

#### 5.9. Sixth tune: gamma
Tuning gamma parameter.

In [None]:
params_5 = {
    'gamma':[x * 0.1 for x in range(0, 11)]
}
gs_5 = GridSearchCV(
    estimator=model_v5, 
    param_grid=params_5,
    scoring=scoring, 
    cv=5,
    refit='Accuracy')
gs_5.fit(X_train, y_train)

In [None]:
tuned_params['gamma'] = gs_5.best_params_['gamma']
print("The best parameters and values found were: {}, best score: {}".format(gs_5.best_params_, gs_5.best_score_))

In [None]:
model_v6 = XGBClassifier(
    random_state=1,
    scale_pos_weight=tuned_params['scale_pos_weight'],
    objective=tuned_params['objective'], 
    max_depth=tuned_params['max_depth'],
    min_child_weight=tuned_params['min_child_weight'], 
    subsample=tuned_params['subsample'], 
    colsample_bytree=tuned_params['colsample_bytree'],
    reg_alpha=tuned_params['reg_alpha'],
    reg_lambda=tuned_params['reg_lambda'],
    gamma=tuned_params['gamma'])
train_predict(model_v6, X_train, y_train, X_test, y_test)

#### 5.10. Tuned Parameters
Printing the best values found for model tuning.

In [None]:
print("{:<25} {:<10}".format('Parameter','Value'))
for k, v in tuned_params.items():
    print("{:<25} {:<10}".format(k, v))

-----------
### 6. Final Evaluation

#### 6.1. Model Benchmark
Creating two benchmarks, a dummy classifier that always predicts the majority class and an untuned XGBoost model, and comparing them with the final tuned model.

In [None]:
dummy_clf = DummyClassifier(strategy='most_frequent', random_state=1)
train_predict(dummy_clf, X_train, y_train, X_test, y_test)
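What to expect from this baseline can be sketched on hypothetical toy labels: a majority-class predictor scores an accuracy equal to the majority share, and an F1 of 0 because it never predicts the positive class. `zero_division=0` is passed here to make that outcome explicit without a warning.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 88% negatives, 12% positives; features are irrelevant here.
y_toy = np.array([0] * 88 + [1] * 12)
X_toy = np.zeros((100, 1))

baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)

acc = accuracy_score(y_toy, y_pred)
f1 = f1_score(y_toy, y_pred, zero_division=0)

print("accuracy: {:.2f}, F1: {:.2f}".format(acc, f1))
```

With roughly 88% negatives in the real data, any model has to beat an accuracy of about 0.88 to improve on this baseline.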

In [None]:
xgb_untuned_model = XGBClassifier(random_state=1)
train_predict(xgb_untuned_model, X_train, y_train, X_test, y_test)

In [None]:
final_xgb_tuned_model = XGBClassifier(
    random_state=1,
    scale_pos_weight=1,
    objective='reg:logistic', 
    max_depth=6, 
    min_child_weight=3, 
    subsample=1, 
    colsample_bytree=1,
    reg_alpha=1,
    reg_lambda=1,
    gamma=0.7)
train_predict(final_xgb_tuned_model, X_train, y_train, X_test, y_test)

#### 6.2. Feature Importance
Printing feature importance of the final tuned model.

In [None]:
chart = plot_importance(final_xgb_tuned_model, max_num_features=6)
chart.figure.set_size_inches(14, 5)
plt.title('Top 6 Most Important Features',fontsize=15)
plt.yticks(fontsize=12)
plt.ylabel('Feature',fontsize=15)
plt.show()