# XGBoost (Extreme Gradient Boosting)

## Installation Guide





Following [part 1](https://www.ibm.com/developerworks/community/blogs/jfp/entry/Installing_XGBoost_For_Anaconda_on_Windows?lang=en) worked for me.

The only extra step was downloading the prebuilt [Windows x64 binaries](http://www.picnet.com.au/blogs/guido/post/2016/09/22/xgboost-windows-x64-binaries-for-download/).

Nothing else was needed.

More info:

http://www.picnet.com.au/blogs/guido/post/2016/09/22/how-to-build-xgboost-on-windows/

http://xgboost.readthedocs.io/en/latest/build.html

https://github.com/dmlc/xgboost/blob/master/doc/build.md#python-package-installation






In [1]:
import xgboost as xgb
import numpy as np

In [2]:
data = np.random.rand(5,10) # 5 entities, each contains 10 features
label = np.random.randint(2, size=5) # binary target
dtrain = xgb.DMatrix( data, label=label)

In [3]:
dtest = dtrain

In [4]:
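# Booster parameters: shallow trees, step size (eta) of 1, logistic loss for a binary target.
# Plain keys are used; the older 'bst:' prefix and the 'silent' flag may be ignored by recent XGBoost releases.
param = {'max_depth': 2, 'eta': 1, 'verbosity': 0, 'objective': 'binary:logistic'}
param['nthread'] = 4
param['eval_metric'] = 'auc'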

In [5]:
evallist  = [(dtest,'eval'), (dtrain,'train')]

In [6]:
num_round = 10
bst = xgb.train( param, dtrain, num_round, evallist )

[0]	eval-auc:0.5	train-auc:0.5
[1]	eval-auc:0.5	train-auc:0.5
[2]	eval-auc:0.5	train-auc:0.5
[3]	eval-auc:0.5	train-auc:0.5
[4]	eval-auc:0.5	train-auc:0.5
[5]	eval-auc:0.5	train-auc:0.5
[6]	eval-auc:0.5	train-auc:0.5
[7]	eval-auc:0.5	train-auc:0.5
[8]	eval-auc:0.5	train-auc:0.5
[9]	eval-auc:0.5	train-auc:0.5


## Documentation



[Documentation](http://xgboost.readthedocs.io/en/latest/)

[tqchen github](https://github.com/tqchen/xgboost/tree/master/demo/guide-python)

[dmlc github](https://github.com/dmlc/xgboost)

* "Gradient boosting" was proposed by Friedman in the paper *Greedy Function Approximation: A Gradient Boosting Machine*.
* XGBoost builds on this original model.

* It is a supervised learning method.

## Objective Function : Training Loss + Regularization

$$\text{Obj}(θ) = L(θ) + Ω(θ)$$

* $L$ is the training loss function, and 
* $Ω$ is the regularization term. 

### Training Loss

The training loss measures how predictive our model is on training data.

Example 1, squared error for linear regression (the mean squared error up to a constant factor $1/n$):

$$L(θ)= \sum_i(y_i-\hat{y}_i)^2$$

Example 2, Logistic Loss for Logistic Regression:

$$ L(θ) = \sum_i \left[ y_i \ln(1 + e^{-\hat{y}_i}) + (1-y_i) \ln(1 + e^{\hat{y}_i}) \right] $$
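
The two losses above can be written as a short NumPy sketch. The function names are illustrative, and `y_hat` in the logistic loss is assumed to be the raw score before the sigmoid, as the formula implies.

```python
import numpy as np

def squared_error_loss(y, y_hat):
    """Sum of squared errors between targets y and predictions y_hat."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.sum((y - y_hat) ** 2))

def logistic_loss(y, y_hat):
    """Logistic loss for 0/1 targets y and raw (pre-sigmoid) scores y_hat."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    # np.logaddexp(0, x) evaluates ln(1 + e^x) without overflow for large |x|
    return float(np.sum(y * np.logaddexp(0, -y_hat)
                        + (1 - y) * np.logaddexp(0, y_hat)))

print(squared_error_loss([1.0, 2.0], [1.0, 4.0]))
print(logistic_loss([1, 0], [0.0, 0.0]))
```

A confident correct score (large positive `y_hat` for a positive example) gives a loss near zero, while a confident wrong score is penalized roughly linearly in its magnitude.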

### Regularization Term

The regularization term controls the complexity of the model, which helps us to avoid overfitting. 

[XGBoost vs GBM](https://www.quora.com/What-is-the-difference-between-the-R-gbm-gradient-boosting-machine-and-xgboost-extreme-gradient-boosting/answer/Tianqi-Chen-1)

* Specifically, XGBoost uses a more regularized model formalization to control overfitting, which gives it better performance.

* In terms of the model, it might be more accurately called regularized gradient boosting.
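
As an illustration of what such a regularization term can look like, the XGBoost tutorial defines the complexity of a single tree as $Ω(f) = γT + \frac{1}{2}λ\sum_j w_j^2$, where $T$ is the number of leaves and $w_j$ are the leaf weights. That formula is not stated in this notebook and is brought in here as an assumption; the function name is illustrative, and the arguments are named after the `gamma` and `reg_lambda` parameters of `XGBClassifier`.

```python
import numpy as np

def tree_complexity(leaf_weights, gamma=0.0, reg_lambda=1.0):
    """Complexity penalty of one tree: gamma * T + 0.5 * lambda * sum(w_j^2)."""
    w = np.asarray(leaf_weights, dtype=float)
    return float(gamma * w.size + 0.5 * reg_lambda * np.sum(w ** 2))

# Hypothetical leaf weights for a small and a larger tree
small_tree = tree_complexity([0.5, -0.5], gamma=1.0, reg_lambda=1.0)
large_tree = tree_complexity([2.0, -1.0, 0.5, -0.5], gamma=1.0, reg_lambda=1.0)
print(small_tree, large_tree)
```

Under this penalty, a tree with more leaves or larger leaf weights has to reduce the training loss by more to be worth adding.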



In [7]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

In [8]:
df = sns.load_dataset('titanic')

In [9]:
df.dropna(inplace=True)

## Data Pre-processing

In [10]:
# Take an explicit copy so the column assignment below modifies X itself, not a view of df
X = df[['pclass', 'sex', 'age']].copy()

In [11]:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()

In [12]:
# Encode 'sex' as a 0/1 column; ravel() flattens the (n, 1) array returned by LabelBinarizer
X['sex'] = lb.fit_transform(X['sex']).ravel()


In [13]:
y = df['survived']

***

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [16]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [17]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    '''
    Print the accuracy score, classification report and confusion matrix of a classifier.
    '''
    if train:
        # Training performance
        print("Train Result:\n")
        y_pred = clf.predict(X_train)
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, y_pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_train, y_pred)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, y_pred)))

        # 10-fold cross-validated accuracy on the training set
        res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))

    else:
        # Test performance
        print("Test Result:\n")
        y_pred = clf.predict(X_test)
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, y_pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_test, y_pred)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, y_pred)))
        

## Model Fitting and Prediction

In [18]:
import xgboost as xgb

In [19]:
xgb_clf = xgb.XGBClassifier(max_depth=5, n_estimators=10000, learning_rate=0.3,
                            n_jobs=-1)

In [20]:
xgb_clf.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.3, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=None, n_estimators=10000, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [21]:
print_score(xgb_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.9685

Classification Report: 
               precision    recall  f1-score   support

           0       0.93      0.98      0.95        41
           1       0.99      0.97      0.98        86

    accuracy                           0.97       127
   macro avg       0.96      0.97      0.96       127
weighted avg       0.97      0.97      0.97       127


Confusion Matrix: 
 [[40  1]
 [ 3 83]]

Average Accuracy: 	 0.8410
Accuracy SD: 		 0.0892


In [22]:
print_score(xgb_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.6545

Classification Report: 
               precision    recall  f1-score   support

           0       0.47      0.50      0.49        18
           1       0.75      0.73      0.74        37

    accuracy                           0.65        55
   macro avg       0.61      0.61      0.61        55
weighted avg       0.66      0.65      0.66        55


Confusion Matrix: 
 [[ 9  9]
 [10 27]]



***

# Ensemble of ensembles - model stacking

## Introduction

* **Ensemble with different types of classifiers**: 
  * Different types of classifiers (e.g., logistic regression, decision trees, random forest) are fitted on the same training data.
  * Results are combined by either 
    * majority voting (classification) or 
    * averaging (regression).
  

* **Ensemble with a single type of classifier**: 
  * Bootstrap samples are drawn from the training data.
  * A model of the same type (e.g., a decision tree or a random forest) is fitted on each bootstrap sample.
  * The results of all models are combined to create the ensemble.
  * Suitable for highly flexible models that are prone to overfitting (high variance).
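
The bootstrap step in the second approach can be sketched with NumPy alone. The sample size and seed are arbitrary choices for illustration; only the sampling is shown, not the model fitting.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# A bootstrap sample draws n_samples row indices with replacement
boot_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are "out-of-bag" and can act as a held-out set
in_bag = np.unique(boot_idx)
oob_idx = np.setdiff1d(np.arange(n_samples), in_bag)

print(f"distinct rows in the bootstrap sample: {in_bag.size / n_samples:.1%}")
print(f"out-of-bag rows: {oob_idx.size}")
```

On average a bootstrap sample contains about 63% of the distinct rows (1 - 1/e), so each model in the ensemble sees a somewhat different view of the training data.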

***

## Combining Method

* **Majority voting or average**: 
  * Classification: the class with the largest number of votes (mode).
  * Regression: the average (mean) of the predictions.
  
  
* **Meta-classifier applied to predicted outcomes**: 
  * Each individual classifier outputs a binary outcome (0 / 1).
  * A meta-classifier is trained on top of these outcomes.
  
  
* **Meta-classifier applied to predicted probabilities**: 
  * Each individual classifier outputs a probability.
  * A meta-classifier is trained on top of these probabilities.
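
The first combining method needs no fitting and can be sketched directly. The prediction arrays below are hypothetical values, not outputs of the models in this notebook.

```python
import numpy as np

# Hypothetical hard predictions: three classifiers (rows) on five samples (columns)
votes = np.array([[0, 1, 1, 0, 1],
                  [0, 1, 0, 0, 1],
                  [1, 1, 0, 0, 0]])

# Majority vote for a binary problem: class 1 wins if more than half the models predict it
majority = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

# Hypothetical regression outputs: three models (rows) on two samples (columns)
preds = np.array([[2.0, 3.5],
                  [2.4, 3.1],
                  [1.9, 3.3]])

# Averaging: the ensemble prediction is the mean over models
average = preds.mean(axis=0)

print(majority)
print(average)
```

Using an odd number of classifiers avoids ties in the binary vote.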
  

## Data Loading and Prep

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [24]:
try:
    df = pd.read_csv("data/WA_Fn-UseC_-HR-Employee-Attrition.csv")
except FileNotFoundError:
    # Fall back to the hosted copy when the local file is missing
    url = "https://raw.githubusercontent.com/anthonyng2/intro_ml_finance/master/data/WA_Fn-UseC_-HR-Employee-Attrition.csv"
    df = pd.read_csv(url)

# Drop identifier and constant columns
df.pop('EmployeeNumber')
df.pop('Over18')
df.pop('StandardHours')
df.pop('EmployeeCount')
# Separate the target and binarize it (No / Yes -> 0 / 1)
y = df['Attrition']
X = df
X.pop('Attrition')
from sklearn import preprocessing
le = preprocessing.LabelBinarizer()
y = le.fit_transform(y)
# One-hot encode the categorical columns
ind_BusinessTravel = pd.get_dummies(df['BusinessTravel'], prefix='BusinessTravel')
ind_Department = pd.get_dummies(df['Department'], prefix='Department')
ind_EducationField = pd.get_dummies(df['EducationField'], prefix='EducationField')
ind_Gender = pd.get_dummies(df['Gender'], prefix='Gender')
ind_JobRole = pd.get_dummies(df['JobRole'], prefix='JobRole')
ind_MaritalStatus = pd.get_dummies(df['MaritalStatus'], prefix='MaritalStatus')
ind_OverTime = pd.get_dummies(df['OverTime'], prefix='OverTime')
# Combine the indicator columns with the integer-valued features, column-wise
df1 = pd.concat([ind_BusinessTravel, ind_Department, ind_EducationField, ind_Gender, 
                 ind_JobRole, ind_MaritalStatus, ind_OverTime, df.select_dtypes(['int64'])], axis=1)
df1.dropna(inplace=True)
df1.shape

(1470, 51)
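
The column-by-column encoding above can also be done in a single `pd.get_dummies` call by passing the `columns` argument. The sketch below uses a toy frame with made-up values; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the dataset above
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Gender": ["Female", "Male", "Male"],
    "OverTime": ["Yes", "No", "Yes"],
})

# Listed columns are replaced by indicator columns named <column>_<category>;
# all other columns are passed through unchanged
encoded = pd.get_dummies(toy, columns=["Gender", "OverTime"])
print(encoded.columns.tolist())
```

One difference from the cell above: this form keeps every column that is not listed, whereas the notebook keeps only the `int64` columns alongside the indicators.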

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df1, y)

In [26]:
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix, roc_auc_score
def print_score(clf, X_train, X_test, y_train, y_test, train=True):
    '''
    v0.1 Follows the scikit-learn convention for the order of inputs.
    Print the accuracy score, classification report, confusion matrix and ROC AUC of a classifier.
    '''
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_train)
    if train:
        # Training performance
        print("Train Result:\n")
        y_pred = clf.predict(X_train)
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, y_pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_train, y_pred)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, y_pred)))
        print("ROC AUC: {0:.4f}\n".format(roc_auc_score(lb.transform(y_train), 
                                                        lb.transform(y_pred))))

        #cv_res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        #print("Average Accuracy: \t {0:.4f}".format(np.mean(cv_res)))
        #print("Accuracy SD: \t\t {0:.4f}".format(np.std(cv_res)))

    else:
        # Test performance
        print("Test Result:\n")
        y_pred = clf.predict(X_test)
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, y_pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_test, y_pred)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, y_pred)))
        print("ROC AUC: {0:.4f}\n".format(roc_auc_score(lb.transform(y_test), lb.transform(y_pred))))
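
`print_score` computes ROC AUC from the hard class labels returned by `clf.predict`. `roc_auc_score` also accepts continuous scores such as predicted probabilities, and the two inputs can give different values because hard labels discard the ranking among samples on the same side of the threshold. A small sketch with made-up numbers (the 0.5 threshold is an assumption):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1, 1])
proba = np.array([0.20, 0.45, 0.40, 0.48, 0.90])

# Hard labels at a 0.5 threshold, comparable to what clf.predict returns
hard = (proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard)
print(round(auc_from_proba, 4), round(auc_from_labels, 4))
```

With a fitted classifier, the probability-based version would use `clf.predict_proba(X)[:, 1]` as the score.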
        

## Approach 1

### Model 1: Decision Tree

In [27]:
from sklearn.tree import DecisionTreeClassifier

In [28]:
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [29]:
print_score(tree_clf, X_train, X_test, y_train, y_test, train=True)
print_score(tree_clf, X_train, X_test, y_train, y_test, train=False)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       919
           1       1.00      1.00      1.00       183

    accuracy                           1.00      1102
   macro avg       1.00      1.00      1.00      1102
weighted avg       1.00      1.00      1.00      1102


Confusion Matrix: 
 [[919   0]
 [  0 183]]

ROC AUC: 1.0000

Test Result:

accuracy score: 0.8234

Classification Report: 
               precision    recall  f1-score   support

           0       0.91      0.89      0.90       314
           1       0.41      0.46      0.43        54

    accuracy                           0.82       368
   macro avg       0.66      0.67      0.67       368
weighted avg       0.83      0.82      0.83       368


Confusion Matrix: 
 [[278  36]
 [ 29  25]]

ROC AUC: 0.6742



### Model 2: Random Forest

In [30]:
from sklearn.ensemble import RandomForestClassifier

In [31]:
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train.ravel())

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [32]:
print_score(rf_clf, X_train, X_test, y_train, y_test, train=True)
print_score(rf_clf, X_train, X_test, y_train, y_test, train=False)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       919
           1       1.00      1.00      1.00       183

    accuracy                           1.00      1102
   macro avg       1.00      1.00      1.00      1102
weighted avg       1.00      1.00      1.00      1102


Confusion Matrix: 
 [[919   0]
 [  0 183]]

ROC AUC: 1.0000

Test Result:

accuracy score: 0.8696

Classification Report: 
               precision    recall  f1-score   support

           0       0.87      0.99      0.93       314
           1       0.80      0.15      0.25        54

    accuracy                           0.87       368
   macro avg       0.84      0.57      0.59       368
weighted avg       0.86      0.87      0.83       368


Confusion Matrix: 
 [[312   2]
 [ 46   8]]

ROC AUC: 0.5709



In [33]:
en_en = pd.DataFrame()

In [34]:
tree_clf.predict_proba(X_train)

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [0., 1.],
       [1., 0.],
       [1., 0.]])

In [36]:
en_en['tree_clf'] = pd.DataFrame(tree_clf.predict_proba(X_train))[1]
en_en['rf_clf'] =  pd.DataFrame(rf_clf.predict_proba(X_train))[1]
col_name = en_en.columns
en_en = pd.concat([en_en, pd.DataFrame(y_train).reset_index(drop=True)], axis=1)

In [37]:
en_en.head()

|   | tree_clf | rf_clf | 0 |
|---|---|---|---|
| 0 | 0.0 | 0.02 | 0 |
| 1 | 0.0 | 0.24 | 0 |
| 2 | 0.0 | 0.15 | 0 |
| 3 | 0.0 | 0.07 | 0 |
| 4 | 0.0 | 0.06 | 0 |


In [38]:
tmp = list(col_name)
tmp.append('ind')
en_en.columns = tmp

In [39]:
en_en.head()

|   | tree_clf | rf_clf | ind |
|---|---|---|---|
| 0 | 0.0 | 0.02 | 0 |
| 1 | 0.0 | 0.24 | 0 |
| 2 | 0.0 | 0.15 | 0 |
| 3 | 0.0 | 0.07 | 0 |
| 4 | 0.0 | 0.06 | 0 |


### Meta Classifier

In [40]:
from sklearn.linear_model import LogisticRegression

In [41]:
m_clf = LogisticRegression(fit_intercept=False)

In [42]:
m_clf.fit(en_en[['tree_clf', 'rf_clf']], en_en['ind'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=False,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [43]:
en_test = pd.DataFrame()

In [44]:
# Meta-features for the test set: class-1 probabilities from each base model
en_test['tree_clf'] = pd.DataFrame(tree_clf.predict_proba(X_test))[1]
en_test['rf_clf'] =  pd.DataFrame(rf_clf.predict_proba(X_test))[1]
# Final prediction from the meta classifier
en_test['combined'] = m_clf.predict(en_test[['tree_clf', 'rf_clf']])

In [45]:
col_name = en_test.columns
tmp = list(col_name)
tmp.append('ind')

In [46]:
tmp

['tree_clf', 'rf_clf', 'combined', 'ind']

In [47]:
en_test = pd.concat([en_test, pd.DataFrame(y_test).reset_index(drop=True)], axis=1)

In [48]:
en_test.columns = tmp

In [49]:
print(pd.crosstab(en_test['ind'], en_test['combined']))

combined    0   1
ind              
0         278  36
1          29  25


In [50]:
print(round(accuracy_score(en_test['ind'], en_test['combined']), 4))

0.8234


In [51]:
print(classification_report(en_test['ind'], en_test['combined']))

              precision    recall  f1-score   support

           0       0.91      0.89      0.90       314
           1       0.41      0.46      0.43        54

    accuracy                           0.82       368
   macro avg       0.66      0.67      0.67       368
weighted avg       0.83      0.82      0.83       368
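
In the cells above, the meta classifier is trained on probabilities that the base models predict for their own training data. A fully grown decision tree reproduces its training labels, so the meta classifier can simply copy `tree_clf`, which is consistent with the stacked test results matching the decision tree's results. One common remedy, not used in this notebook, is to build the meta-features from out-of-fold predictions. The sketch below assumes synthetic data from `make_classification` and arbitrary hyperparameters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the attrition data (assumption for illustration)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

base_models = {
    "tree_clf": DecisionTreeClassifier(random_state=0),
    "rf_clf": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Out-of-fold probabilities: each training row is scored by a model
# that did not see that row during fitting
meta_train = np.column_stack([
    cross_val_predict(model, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# For the test set, refit each base model on the full training set
meta_test = np.column_stack([
    model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    for model in base_models.values()
])

# The meta classifier learns how much to trust each base model
meta_clf = LogisticRegression().fit(meta_train, y_tr)
stacked_acc = meta_clf.score(meta_test, y_te)
print(f"stacked test accuracy: {stacked_acc:.4f}")
```

Because the out-of-fold tree probabilities no longer match the training labels exactly, the meta classifier has to weigh both base models instead of relying on one.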



***

## Data Loading and Prep

In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [53]:
try:
    df = pd.read_csv("data/WA_Fn-UseC_-HR-Employee-Attrition.csv")
except FileNotFoundError:
    # Fall back to the hosted copy when the local file is missing
    url = "https://raw.githubusercontent.com/anthonyng2/intro_ml_finance/master/data/WA_Fn-UseC_-HR-Employee-Attrition.csv"
    df = pd.read_csv(url)

# Drop identifier and constant columns
df.pop('EmployeeNumber')
df.pop('Over18')
df.pop('StandardHours')
df.pop('EmployeeCount')
# Separate the target and binarize it (No / Yes -> 0 / 1)
y = df['Attrition']
X = df
X.pop('Attrition')
from sklearn import preprocessing
le = preprocessing.LabelBinarizer()
y = le.fit_transform(y)
# One-hot encode the categorical columns
ind_BusinessTravel = pd.get_dummies(df['BusinessTravel'], prefix='BusinessTravel')
ind_Department = pd.get_dummies(df['Department'], prefix='Department')
ind_EducationField = pd.get_dummies(df['EducationField'], prefix='EducationField')
ind_Gender = pd.get_dummies(df['Gender'], prefix='Gender')
ind_JobRole = pd.get_dummies(df['JobRole'], prefix='JobRole')
ind_MaritalStatus = pd.get_dummies(df['MaritalStatus'], prefix='MaritalStatus')
ind_OverTime = pd.get_dummies(df['OverTime'], prefix='OverTime')
# Combine the indicator columns with the integer-valued features, column-wise
df1 = pd.concat([ind_BusinessTravel, ind_Department, ind_EducationField, ind_Gender, 
                 ind_JobRole, ind_MaritalStatus, ind_OverTime, df.select_dtypes(['int64'])], axis=1)
df1.dropna(inplace=True)
df1.shape

(1470, 51)

In [54]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df1, y,
                                                    test_size=0.2)

In [56]:
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix, roc_auc_score
def print_score(clf, X_train, X_test, y_train, y_test, train=True):
    '''
    v0.1 Follows the scikit-learn convention for the order of inputs.
    Print the accuracy score, classification report, confusion matrix and ROC AUC of a classifier.
    '''
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_train)
    if train:
        # Training performance
        print("Train Result:\n")
        y_pred = clf.predict(X_train)
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, y_pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_train, y_pred)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, y_pred)))
        print("ROC AUC: {0:.4f}\n".format(roc_auc_score(lb.transform(y_train), 
                                                        lb.transform(y_pred))))

        #cv_res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        #print("Average Accuracy: \t {0:.4f}".format(np.mean(cv_res)))
        #print("Accuracy SD: \t\t {0:.4f}".format(np.std(cv_res)))

    else:
        # Test performance
        print("Test Result:\n")
        y_pred = clf.predict(X_test)
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, y_pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_test, y_pred)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, y_pred)))
        print("ROC AUC: {0:.4f}\n".format(roc_auc_score(lb.transform(y_test), lb.transform(y_pred))))
        

## Approach 2

### Model 1: Decision Tree

In [57]:
from sklearn.tree import DecisionTreeClassifier

In [58]:
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [59]:
print_score(tree_clf, X_train, X_test, y_train, y_test, train=True)
print_score(tree_clf, X_train, X_test, y_train, y_test, train=False)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       989
           1       1.00      1.00      1.00       187

    accuracy                           1.00      1176
   macro avg       1.00      1.00      1.00      1176
weighted avg       1.00      1.00      1.00      1176


Confusion Matrix: 
 [[989   0]
 [  0 187]]

ROC AUC: 1.0000

Test Result:

accuracy score: 0.7517

Classification Report: 
               precision    recall  f1-score   support

           0       0.85      0.84      0.85       244
           1       0.28      0.30      0.29        50

    accuracy                           0.75       294
   macro avg       0.57      0.57      0.57       294
weighted avg       0.76      0.75      0.75       294


Confusion Matrix: 
 [[206  38]
 [ 35  15]]

ROC AUC: 0.5721



### Model 2: Random Forest

In [60]:
from sklearn.ensemble import RandomForestClassifier

In [61]:
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train.ravel())

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [62]:
print_score(rf_clf, X_train, X_test, y_train, y_test, train=True)
print_score(rf_clf, X_train, X_test, y_train, y_test, train=False)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       989
           1       1.00      1.00      1.00       187

    accuracy                           1.00      1176
   macro avg       1.00      1.00      1.00      1176
weighted avg       1.00      1.00      1.00      1176


Confusion Matrix: 
 [[989   0]
 [  0 187]]

ROC AUC: 1.0000

Test Result:

accuracy score: 0.8435

Classification Report: 
               precision    recall  f1-score   support

           0       0.85      0.99      0.91       244
           1       0.75      0.12      0.21        50

    accuracy                           0.84       294
   macro avg       0.80      0.56      0.56       294
weighted avg       0.83      0.84      0.79       294


Confusion Matrix: 
 [[242   2]
 [ 44   6]]

ROC AUC: 0.5559



In [63]:
en_en = pd.DataFrame()

In [64]:
tree_clf.predict_proba(X_train)

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [1., 0.],
       [0., 1.]])

In [66]:
en_en['tree_clf'] = pd.DataFrame(tree_clf.predict_proba(X_train))[1]
en_en['rf_clf'] =  pd.DataFrame(rf_clf.predict_proba(X_train))[1]
col_name = en_en.columns
en_en = pd.concat([en_en, pd.DataFrame(y_train).reset_index(drop=True)], axis=1)

In [67]:
en_en.head()

|   | tree_clf | rf_clf | 0 |
|---|---|---|---|
| 0 | 0.0 | 0.09 | 0 |
| 1 | 0.0 | 0.05 | 0 |
| 2 | 0.0 | 0.06 | 0 |
| 3 | 0.0 | 0.06 | 0 |
| 4 | 0.0 | 0.08 | 0 |


In [68]:
tmp = list(col_name)
tmp.append('ind')
en_en.columns = tmp

In [69]:
en_en.head()

|   | tree_clf | rf_clf | ind |
|---|---|---|---|
| 0 | 0.0 | 0.09 | 0 |
| 1 | 0.0 | 0.05 | 0 |
| 2 | 0.0 | 0.06 | 0 |
| 3 | 0.0 | 0.06 | 0 |
| 4 | 0.0 | 0.08 | 0 |


### Meta Classifier

In [70]:
from sklearn.linear_model import LogisticRegression

In [71]:
m_clf = LogisticRegression(fit_intercept=False)

In [72]:
m_clf.fit(en_en[['tree_clf', 'rf_clf']], en_en['ind'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=False,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [73]:
en_test = pd.DataFrame()

In [74]:
# Meta-features for the test set: class-1 probabilities from each base model
en_test['tree_clf'] = pd.DataFrame(tree_clf.predict_proba(X_test))[1]
en_test['rf_clf'] =  pd.DataFrame(rf_clf.predict_proba(X_test))[1]
# Final prediction from the meta classifier
en_test['combined'] = m_clf.predict(en_test[['tree_clf', 'rf_clf']])

In [75]:
col_name = en_test.columns
tmp = list(col_name)
tmp.append('ind')

In [76]:
tmp

['tree_clf', 'rf_clf', 'combined', 'ind']

In [77]:
en_test = pd.concat([en_test, pd.DataFrame(y_test).reset_index(drop=True)], axis=1)

In [78]:
en_test.columns = tmp

In [79]:
print(pd.crosstab(en_test['ind'], en_test['combined']))

combined    0   1
ind              
0         206  38
1          35  15


In [80]:
print(round(accuracy_score(en_test['ind'], en_test['combined']), 4))

0.7517


In [81]:
print(classification_report(en_test['ind'], en_test['combined']))

              precision    recall  f1-score   support

           0       0.85      0.84      0.85       244
           1       0.28      0.30      0.29        50

    accuracy                           0.75       294
   macro avg       0.57      0.57      0.57       294
weighted avg       0.76      0.75      0.75       294



***

## Single Classifier

In [82]:

try:
    df = pd.read_csv("data/WA_Fn-UseC_-HR-Employee-Attrition.csv")
except FileNotFoundError:
    # Fall back to the hosted copy when the local file is missing
    url = "https://raw.githubusercontent.com/anthonyng2/intro_ml_finance/master/data/WA_Fn-UseC_-HR-Employee-Attrition.csv"
    df = pd.read_csv(url)


In [83]:
df.head()

|   | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [84]:
df.Attrition.value_counts() / df.Attrition.count()

No     0.838776
Yes    0.161224
Name: Attrition, dtype: float64

In [85]:
from sklearn.ensemble import RandomForestClassifier

In [86]:
from sklearn.ensemble import BaggingClassifier

In [87]:
from sklearn.ensemble import AdaBoostClassifier

In [88]:
class_weight = {0:0.834, 1:0.166}

In [89]:
# Class proportions in the training labels (y_train is an (n, 1) array, so flatten it first)
pd.Series(y_train.ravel()).value_counts() / len(y_train)

0    0.840986
1    0.159014
dtype: float64
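
The `class_weight` dictionary above gives the larger weight to the majority class (0). If the aim is to compensate for the class imbalance, a common alternative is to weight each class inversely to its frequency, using the formula scikit-learn documents for `class_weight='balanced'`: `n_samples / (n_classes * np.bincount(y))`. The sketch below uses made-up labels with roughly the same imbalance; it reflects an assumption about intent, not the notebook's method, and `balanced_weight` is an illustrative name.

```python
import numpy as np

# Hypothetical labels with roughly the imbalance seen above (84% vs 16%)
y_toy = np.array([0] * 84 + [1] * 16)

# Inverse-frequency weights: n_samples / (n_classes * count_per_class)
counts = np.bincount(y_toy)
weights = len(y_toy) / (len(counts) * counts)
balanced_weight = {int(c): float(w) for c, w in enumerate(weights)}
print(balanced_weight)
```

With these weights the minority class (1) counts for more per sample, so misclassifying an attrition case costs more during training.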

In [90]:
forest = RandomForestClassifier(class_weight=class_weight)

In [91]:
ada = AdaBoostClassifier(base_estimator=forest, n_estimators=100,
                         learning_rate=0.5, random_state=42)

In [92]:
ada.fit(X_train, y_train.ravel())

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=RandomForestClassifier(bootstrap=True,
                                                         ccp_alpha=0.0,
                                                         class_weight={0: 0.834,
                                                                       1: 0.166},
                                                         criterion='gini',
                                                         max_depth=None,
                                                         max_features='auto',
                                                         max_leaf_nodes=None,
                                                         max_samples=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                              

In [93]:
print_score(ada, X_train, X_test, y_train, y_test, train=True)
print_score(ada, X_train, X_test, y_train, y_test, train=False)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       989
           1       1.00      1.00      1.00       187

    accuracy                           1.00      1176
   macro avg       1.00      1.00      1.00      1176
weighted avg       1.00      1.00      1.00      1176


Confusion Matrix: 
 [[989   0]
 [  0 187]]

ROC AUC: 1.0000

Test Result:

accuracy score: 0.8571

Classification Report: 
               precision    recall  f1-score   support

           0       0.86      0.99      0.92       244
           1       0.83      0.20      0.32        50

    accuracy                           0.86       294
   macro avg       0.85      0.60      0.62       294
weighted avg       0.85      0.86      0.82       294


Confusion Matrix: 
 [[242   2]
 [ 40  10]]

ROC AUC: 0.5959



In [94]:
bag_clf = BaggingClassifier(base_estimator=ada, n_estimators=50,
                            max_samples=1.0, max_features=1.0, bootstrap=True,
                            bootstrap_features=False, n_jobs=-1,
                            random_state=42)

In [95]:
bag_clf.fit(X_train, y_train.ravel())

BaggingClassifier(base_estimator=AdaBoostClassifier(algorithm='SAMME.R',
                                                    base_estimator=RandomForestClassifier(bootstrap=True,
                                                                                          ccp_alpha=0.0,
                                                                                          class_weight={0: 0.834,
                                                                                                        1: 0.166},
                                                                                          criterion='gini',
                                                                                          max_depth=None,
                                                                                          max_features='auto',
                                                                                          max_leaf_nodes=None,
                                                   

In [96]:
print_score(bag_clf, X_train, X_test, y_train, y_test, train=True)
print_score(bag_clf, X_train, X_test, y_train, y_test, train=False)

Train Result:

accuracy score: 0.9702

Classification Report: 
               precision    recall  f1-score   support

           0       0.97      1.00      0.98       989
           1       1.00      0.81      0.90       187

    accuracy                           0.97      1176
   macro avg       0.98      0.91      0.94      1176
weighted avg       0.97      0.97      0.97      1176


Confusion Matrix: 
 [[989   0]
 [ 35 152]]

ROC AUC: 0.9064

Test Result:

accuracy score: 0.8469

Classification Report: 
               precision    recall  f1-score   support

           0       0.85      0.99      0.91       244
           1       0.73      0.16      0.26        50

    accuracy                           0.85       294
   macro avg       0.79      0.57      0.59       294
weighted avg       0.83      0.85      0.80       294


Confusion Matrix: 
 [[241   3]
 [ 42   8]]

ROC AUC: 0.5739



***