# HR attrition data example

In this section, we use IBM Watson's HR Attrition data, shared in Kaggle datasets under an open source license agreement at https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset (the data has been used in the book with prior permission from the data administrator), to predict whether employees will attrite or not based on independent explanatory variables:

In [1]:
import pandas as pd 
hrattr_data = pd.read_csv("../datasets/WA_Fn-UseC_-HR-Employee-Attrition.csv") 
 
hrattr_data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


There are 1,470 observations and 35 variables in this data; the top five rows are shown above for a quick glance at the variables.

The following code converts the Yes and No categories of `Attrition` into 1 and 0 for modeling purposes. scikit-learn does not fit models on character/categorical variables directly, so such variables must be dummy coded before they can be used:

In [2]:
hrattr_data['Attrition_ind'] = 0 
hrattr_data.loc[hrattr_data['Attrition'] =='Yes', 'Attrition_ind'] = 1

Dummy variables are created for all seven categorical variables (listed here in alphabetical order): `BusinessTravel`, `Department`, `EducationField`, `Gender`, `JobRole`, `MaritalStatus`, and `OverTime`. Four variables are left out of the analysis: `EmployeeCount`, `Over18`, and `StandardHours` do not change across the observations, and `EmployeeNumber` is only an identifier:

In [3]:
dummy_busnstrvl = pd.get_dummies(hrattr_data['BusinessTravel'], prefix='busns_trvl') 
dummy_dept = pd.get_dummies(hrattr_data['Department'], prefix='dept') 
dummy_edufield = pd.get_dummies(hrattr_data['EducationField'], prefix='edufield') 
dummy_gender = pd.get_dummies(hrattr_data['Gender'], prefix='gend') 
dummy_jobrole = pd.get_dummies(hrattr_data['JobRole'], prefix='jobrole') 
dummy_maritstat = pd.get_dummies(hrattr_data['MaritalStatus'], prefix='maritalstat')  
dummy_overtime = pd.get_dummies(hrattr_data['OverTime'], prefix='overtime') 

Continuous variables are separated and will be combined with the created dummy variables later:

In [4]:
continuous_columns = ['Age','DailyRate','DistanceFromHome', 'Education', 'EnvironmentSatisfaction',
                      'HourlyRate','JobInvolvement','JobLevel','JobSatisfaction', 'MonthlyIncome', 
                      'MonthlyRate', 'NumCompaniesWorked','PercentSalaryHike',  'PerformanceRating', 
                      'RelationshipSatisfaction','StockOptionLevel', 'TotalWorkingYears', 
                      'TrainingTimesLastYear','WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 
                      'YearsSinceLastPromotion','YearsWithCurrManager'] 
 
hrattr_continuous = hrattr_data[continuous_columns]

In the following step, the dummy variables derived from the categorical variables are combined with the continuous variables and the target indicator:

In [5]:
hrattr_data_new = pd.concat([dummy_busnstrvl, dummy_dept, dummy_edufield, dummy_gender, dummy_jobrole, 
                             dummy_maritstat, dummy_overtime, hrattr_continuous, hrattr_data['Attrition_ind']],
                            axis=1)

In [6]:
hrattr_data_new.head()

Unnamed: 0,busns_trvl_Non-Travel,busns_trvl_Travel_Frequently,busns_trvl_Travel_Rarely,dept_Human Resources,dept_Research & Development,dept_Sales,edufield_Human Resources,edufield_Life Sciences,edufield_Marketing,edufield_Medical,...,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition_ind
0,0,0,1,0,0,1,0,1,0,0,...,1,0,8,0,1,6,4,0,5,1
1,0,1,0,0,1,0,0,1,0,0,...,4,1,10,3,3,10,7,1,7,0
2,0,0,1,0,1,0,0,0,0,0,...,2,0,7,3,3,0,0,0,0,1
3,0,1,0,0,1,0,0,1,0,0,...,3,0,8,3,3,8,7,3,0,0
4,0,0,1,0,1,0,0,0,0,1,...,4,1,6,3,3,2,2,2,2,0


> Note: Here, we have not removed one derived dummy variable for each categorical variable, because multi-collinearity does not create a problem in decision trees as it does in logistic or linear regression. We can therefore use all the derived variables in the rest of the chapter, as all the models use decision trees as the underlying model, even when they are combined into ensembles.
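To see why the full set of dummy columns is redundant for regression-type models, the following minimal sketch one-hot encodes a few department labels in plain Python. The helper `one_hot` and the sample list are illustrative only and are not part of the chapter's code. Every row's dummies sum to 1, so any one column can be rebuilt from the others, which is exactly the collinearity that linear and logistic regression are sensitive to and that trees tolerate.

```python
# Illustrative helper (not from the chapter): one-hot encode a list of labels.
def one_hot(values, prefix):
    categories = sorted(set(values))
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows

# Department values taken from the first five rows shown earlier.
departments = ["Sales", "Research & Development", "Research & Development",
               "Research & Development", "Research & Development"]
columns, rows = one_hot(departments, "dept")

# Each row has exactly one 1, so the dummy columns always sum to 1.
row_sums = [sum(r) for r in rows]

# Consequently the last column is fully determined by the remaining ones.
reconstructed_last = [1 - sum(r[:-1]) for r in rows]
actual_last = [r[-1] for r in rows]

print(columns)
print(row_sums)
print(reconstructed_last == actual_last)
```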

Once the basic data has been prepared, it is split 70-30 for training and testing purposes:

In [7]:
# Train and Test split 
from sklearn.model_selection import train_test_split 
x_train,x_test,y_train,y_test = \
    train_test_split(hrattr_data_new.drop (['Attrition_ind'], axis=1),
                     hrattr_data_new['Attrition_ind'],   
                     test_size = 0.3, 
                     random_state=42)

The `DecisionTreeClassifier` from scikit-learn, available in the `tree` submodule, is used for modeling purposes:

In [8]:
# Decision Tree Classifier 
from sklearn.tree import DecisionTreeClassifier

The parameters selected for the DT classifier are shown in the following code: the splitting criterion is Gini, the maximum depth is 5, the minimum number of observations required to qualify for a split is 2, and the minimum number of samples that must be present in a terminal node is 1:

In [9]:
dt_fit = DecisionTreeClassifier(criterion="gini", max_depth=5,min_samples_split=2,  min_samples_leaf=1,random_state=42) 
dt_fit.fit(x_train,y_train) 
 
print ("\nDecision Tree - Train Confusion  Matrix\n\n", pd.crosstab(y_train, dt_fit.predict(x_train),rownames = ["Actuall"],colnames = ["Predicted"]))    

from sklearn.metrics import accuracy_score, classification_report    
print ("\nDecision Tree - Train accuracy\n\n",round(accuracy_score (y_train, dt_fit.predict(x_train)),3)) 
print ("\nDecision Tree - Train Classification Report\n", classification_report(y_train, dt_fit.predict(x_train))) 
 
print ("\n\nDecision Tree - Test Confusion Matrix\n\n",pd.crosstab(y_test, dt_fit.predict(x_test),rownames = ["Actuall"],colnames = ["Predicted"])) 
print ("\nDecision Tree - Test accuracy",round(accuracy_score(y_test, dt_fit.predict(x_test)),3)) 
print ("\nDecision Tree - Test Classification Report\n", classification_report( y_test, dt_fit.predict(x_test)))


Decision Tree - Train Confusion  Matrix

 Predicted    0   1
Actuall           
0          844   9
1           98  78

Decision Tree - Train accuracy

 0.896

Decision Tree - Train Classification Report
               precision    recall  f1-score   support

           0       0.90      0.99      0.94       853
           1       0.90      0.44      0.59       176

   micro avg       0.90      0.90      0.90      1029
   macro avg       0.90      0.72      0.77      1029
weighted avg       0.90      0.90      0.88      1029



Decision Tree - Test Confusion Matrix

 Predicted    0   1
Actuall           
0          361  19
1           49  12

Decision Tree - Test accuracy 0.846

Decision Tree - Test Classification Report
               precision    recall  f1-score   support

           0       0.88      0.95      0.91       380
           1       0.39      0.20      0.26        61

   micro avg       0.85      0.85      0.85       441
   macro avg       0.63      0.57      0.59       

By carefully observing the results, we can infer that, even though the test accuracy is high (84.6%), the precision and recall of one category (Attrition = Yes) are low (precision = 0.39 and recall = 0.20). This could be a serious issue if management uses this model to proactively offer extra benefits to employees with a high chance of attrition, because the model fails to identify most of the employees who will actually leave. Hence, we need to look for modifications; one way is to control the model by using class weights. With class weights, we can increase the importance of a particular class at the cost of an increase in other errors.
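The accuracy, precision, and recall figures quoted above follow directly from the test confusion matrix. As a quick check, this small sketch recomputes them in plain Python from the four printed counts (the variable names are chosen here for illustration):

```python
# Counts copied from the test confusion matrix printed above.
tn, fp = 361, 19   # actual 0: predicted 0, predicted 1
fn, tp = 49, 12    # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_one = tp / (tp + fp)   # of those flagged as attriters, how many actually left
recall_one = tp / (tp + fn)      # of those who actually left, how many were flagged

print(round(accuracy, 3), round(precision_one, 2), round(recall_one, 2))
```

Accuracy is dominated by the 380 non-attriters, which is why it stays high even though only 12 of the 61 attriters are caught.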

For example, by increasing the class weight of category 1, we can identify more of the employees who actually attrite, but in doing so we will also mark some employees who are not going to leave as potential attriters (which should be acceptable).
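One way to picture the effect of class weights is through node impurity: in a weighted tree, each observation counts in proportion to its class weight when class proportions are computed. The sketch below is a simplified illustration under that assumption. The function `weighted_gini` is hypothetical and is not scikit-learn's internal code; the counts are the training class counts of 853 and 176 reported earlier.

```python
def weighted_gini(counts, weights):
    """Gini impurity of a node after scaling each class count by its class weight."""
    weighted = [c * w for c, w in zip(counts, weights)]
    total = sum(weighted)
    return 1.0 - sum((x / total) ** 2 for x in weighted)

# Training class counts for classes 0 and 1 (support values in the train report).
root_counts = [853, 176]

equal = weighted_gini(root_counts, [0.5, 0.5])
skewed = weighted_gini(root_counts, [0.3, 0.7])

# Effective share of class 1 at the root under each weighting.
share_equal = 176 * 0.5 / (853 * 0.5 + 176 * 0.5)
share_skewed = 176 * 0.7 / (853 * 0.3 + 176 * 0.7)

print(round(equal, 3), round(skewed, 3))
print(round(share_equal, 3), round(share_skewed, 3))
```

With weights of 0.3 and 0.7, class 1 behaves as if it made up roughly a third of the root node instead of about a sixth, so splits that isolate attriters reduce impurity more and are more likely to be chosen.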

Another classical example of the use of class weights is found in banking. When giving loans, it is better to reject some good applications than to accept bad loans. Hence, in this case too, it is a better idea to give higher weight to defaulters than to non-defaulters.

### Tuning class weights in decision tree classifier

In the following code, class weights are tuned to see how the performance of the decision tree changes with all other parameters kept the same. A placeholder DataFrame is created to store the precision and recall results for each weight combination:

In [10]:
import numpy as np

dummyarray = np.empty((6,10))
dt_wttune = pd.DataFrame(dummyarray)

The metrics captured are the weights for the zero and one categories (the weights sum to 1, so a weight of 0.2 for the zero category implies 0.8 for the one category), training and testing accuracy, and precision for the zero category, the one category, and overall. Recall for the zero category, the one category, and overall is recorded in the same way:

In [11]:
dt_wttune.columns = ["zero_wght","one_wght","tr_accuracy", "tst_accuracy", "prec_zero","prec_one", "prec_ovll", 
                     "recl_zero","recl_one","recl_ovll"]

Weights for the zero category are tried from 0.01 to 0.5, because we do not want to explore cases where the zero category is given a higher weight than the one category:

In [12]:
from sklearn.metrics import precision_score, recall_score

zero_clwghts = [0.01,0.1,0.2,0.3,0.4,0.5]

for i, zero_wght in enumerate(zero_clwghts):

    clwght = {0:zero_wght,1:1.0-zero_wght}
    dt_fit = DecisionTreeClassifier(criterion="gini",max_depth=5,min_samples_split=2,
                                    min_samples_leaf=1,random_state=42,class_weight = clwght)

    dt_fit.fit(x_train,y_train)
    # Predict once per data split and reuse the predictions for every metric
    train_pred = dt_fit.predict(x_train)
    test_pred = dt_fit.predict(x_test)

    dt_wttune.loc[i, 'zero_wght'] = clwght[0]
    dt_wttune.loc[i, 'one_wght'] = clwght[1]
    dt_wttune.loc[i, 'tr_accuracy'] = round(accuracy_score(y_train,train_pred),3)
    dt_wttune.loc[i, 'tst_accuracy'] = round(accuracy_score(y_test,test_pred),3)

    # Compute the test metrics directly instead of parsing the text of classification_report,
    # whose layout (and therefore token positions) differs between scikit-learn versions.
    # "Overall" is taken here as the support-weighted average across both classes.
    dt_wttune.loc[i, 'prec_zero'] = round(precision_score(y_test,test_pred,pos_label=0),2)
    dt_wttune.loc[i, 'prec_one'] = round(precision_score(y_test,test_pred,pos_label=1),2)
    dt_wttune.loc[i, 'prec_ovll'] = round(precision_score(y_test,test_pred,average='weighted'),2)

    dt_wttune.loc[i, 'recl_zero'] = round(recall_score(y_test,test_pred,pos_label=0),2)
    dt_wttune.loc[i, 'recl_one'] = round(recall_score(y_test,test_pred,pos_label=1),2)
    dt_wttune.loc[i, 'recl_ovll'] = round(recall_score(y_test,test_pred,average='weighted'),2)

    print ("\nClass Weights",clwght,"Train accuracy:",round(accuracy_score(y_train,train_pred),3),"Test accuracy:",round(accuracy_score(y_test,test_pred),3))
    print ("Test Confusion Matrix\n\n",pd.crosstab(y_test,test_pred,rownames = ["Actuall"],colnames = ["Predicted"]))


Class Weights {0: 0.01, 1: 0.99} Train accuracy: 0.342 Test accuracy: 0.272
Test Confusion Matrix

 Predicted   0    1
Actuall           
0          65  315
1           6   55

Class Weights {0: 0.1, 1: 0.9} Train accuracy: 0.806 Test accuracy: 0.732
Test Confusion Matrix

 Predicted    0   1
Actuall           
0          282  98
1           20  41

Class Weights {0: 0.2, 1: 0.8} Train accuracy: 0.871 Test accuracy: 0.83
Test Confusion Matrix

 Predicted    0   1
Actuall           
0          341  39
1           36  25

Class Weights {0: 0.3, 1: 0.7} Train accuracy: 0.881 Test accuracy: 0.839
Test Confusion Matrix

 Predicted    0   1
Actuall           
0          345  35
1           36  25

Class Weights {0: 0.4, 1: 0.6} Train accuracy: 0.894 Test accuracy: 0.832
Test Confusion Matrix

 Predicted    0   1
Actuall           
0          346  34
1           40  21

Class Weights {0: 0.5, 1: 0.5} Train accuracy: 0.896 Test accuracy: 0.846
Test Confusion Matrix

 Predicted    0   1
Actual

From the preceding output, we can see that at class weight values of 0.3 (for zero) and 0.7 (for one), the decision tree identifies a higher number of attriters (25 out of 61, against 12 with the unweighted tree) with only a small drop in test accuracy (83.9% against 84.6%).
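The trade-off across weights is easier to compare when each confusion matrix is reduced to accuracy, recall, and precision for class 1. The sketch below does this in plain Python using the counts printed above; the row for weights 0.5/0.5 is left out because its matrix is not fully shown in the output. The dictionary name `results` is chosen here for illustration.

```python
# zero-class weight -> (TN, FP, FN, TP), copied from the test confusion matrices above.
results = {
    0.01: (65, 315, 6, 55),
    0.1: (282, 98, 20, 41),
    0.2: (341, 39, 36, 25),
    0.3: (345, 35, 36, 25),
    0.4: (346, 34, 40, 21),
}

summary = {}
for zero_wght, (tn, fp, fn, tp) in results.items():
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    recall_one = tp / (tp + fn)        # share of actual attriters that were caught
    precision_one = tp / (tp + fp)     # share of flagged employees who actually left
    summary[zero_wght] = (round(accuracy, 3), round(recall_one, 2), round(precision_one, 2))
    print(zero_wght, summary[zero_wght])
```

Very small weights for class 0 catch almost every attriter but flag most of the workforce, while weights of 0.3/0.7 keep accuracy close to the unweighted tree and roughly double the number of attriters found.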

### Random forest classifier

In [13]:
# Random Forest Classifier 
from sklearn.ensemble import RandomForestClassifier

The parameters used in the random forest are as follows. `n_estimators`, the number of individual decision trees, is 5000. `max_features` controls how many features are considered at each split; for a classification problem such as this one, sqrt(p) of the p features are used, which is requested here with `'sqrt'` (older scikit-learn releases also accepted `'auto'`, which meant the same thing for classifiers and has been removed in recent releases). `min_samples_leaf` sets the minimum number of observations required in a terminal node:

In [14]:
rf_fit = RandomForestClassifier(n_estimators=5000, criterion="gini", max_depth=5, min_samples_split=2,
                                bootstrap=True, max_features='sqrt', random_state=42, min_samples_leaf=1, 
                                class_weight = {0:0.3,1:0.7})
rf_fit.fit(x_train,y_train)       

print ("\nRandom Forest - Train Confusion Matrix\n\n",pd.crosstab(y_train,rf_fit.predict(x_train),rownames = ["Actuall"],colnames = ["Predicted"]))      
print ("\nRandom Forest - Train accuracy",round(accuracy_score(y_train,rf_fit.predict(x_train)),3))
print ("\nRandom Forest  - Train Classification Report\n",classification_report(y_train,rf_fit.predict(x_train)))

print ("\n\nRandom Forest - Test Confusion Matrix\n\n",pd.crosstab(y_test,rf_fit.predict(x_test),rownames = ["Actuall"],colnames = ["Predicted"]))      
print ("\nRandom Forest - Test accuracy",round(accuracy_score(y_test,rf_fit.predict(x_test)),3))
print ("\nRandom Forest - Test Classification Report\n",classification_report(y_test,rf_fit.predict(x_test)))


Random Forest - Train Confusion Matrix

 Predicted    0    1
Actuall            
0          841   12
1           76  100

Random Forest - Train accuracy 0.914

Random Forest  - Train Classification Report
               precision    recall  f1-score   support

           0       0.92      0.99      0.95       853
           1       0.89      0.57      0.69       176

   micro avg       0.91      0.91      0.91      1029
   macro avg       0.90      0.78      0.82      1029
weighted avg       0.91      0.91      0.91      1029



Random Forest - Test Confusion Matrix

 Predicted    0   1
Actuall           
0          373   7
1           47  14

Random Forest - Test accuracy 0.878

Random Forest - Test Classification Report
               precision    recall  f1-score   support

           0       0.89      0.98      0.93       380
           1       0.67      0.23      0.34        61

   micro avg       0.88      0.88      0.88       441
   macro avg       0.78      0.61      0.64     
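The two sources of randomness behind the random forest settings used above (`bootstrap=True` and sqrt(p) features per split) can be illustrated without scikit-learn. This is a minimal sketch, not the library's implementation: the value `n_features = 51` is an assumption about the number of columns in `x_train` and should be replaced with the real count, and the seed is arbitrary.

```python
import math
import random

rng = random.Random(42)

n_samples = 1029    # training rows, as reported in the train classification report
n_features = 51     # assumed number of columns in x_train; adjust to the real count

# Bootstrap: draw n_samples row indices with replacement for one tree.
bootstrap_rows = [rng.randrange(n_samples) for _ in range(n_samples)]
unique_fraction = len(set(bootstrap_rows)) / n_samples

# Feature subsampling: sqrt(p) candidate features examined at one split.
n_candidates = max(1, int(math.sqrt(n_features)))
candidate_features = rng.sample(range(n_features), n_candidates)

print(round(unique_fraction, 3), n_candidates, sorted(candidate_features))
```

Each tree therefore sees only about 63% of the distinct training rows and a small random subset of features at every split, which is what decorrelates the trees before their votes are averaged.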

The random forest classifier produced 87.8% test accuracy, compared with 87.3% for bagging, and identifies 14 actually attrited employees, whereas bagging identified 13. The following code plots the variable importance:

In [15]:
import matplotlib.pyplot as plt

# Plot of variable importance by mean decrease in Gini impurity
model_ranks = pd.Series(rf_fit.feature_importances_,index=x_train.columns, name='Importance').sort_values(ascending=False)
model_ranks.index.name = 'Variables'
# Keep the 31 most important variables, ordered so the largest bar is drawn at the top
top_features = model_ranks.iloc[:31].sort_values(ascending=True)
plt.figure(figsize=(20,10))
ax = top_features.plot(kind='barh')
_ = ax.set_title("Variable Importance Plot")
_ = ax.set_xlabel('Mean decrease in Gini')
_ = ax.set_yticklabels(top_features.index, fontsize=13)

plt.show()

<Figure size 2000x1000 with 1 Axes>

From the variable importance plot, monthly income appears to be the most significant variable, followed by overtime, total working years, stock option level, years at company, and so on. This gives us some insight into the major factors that determine whether an employee will remain with the company or leave the organization.

#### Grid search

Tuning parameters plays a critical role in a machine learning model. Here, we show a grid search example of how to tune a random forest model:

In [16]:
# Random Forest Classifier - Grid Search 
from sklearn.pipeline import Pipeline 
from sklearn.model_selection import train_test_split,GridSearchCV 
 
pipeline = Pipeline([ ('clf',RandomForestClassifier(criterion='gini',class_weight = {0:0.3,1:0.7}))])

The tuning parameters are the same ones used for the random forest earlier; the difference is that grid search evaluates every combination of the listed values through the pipeline. There are 3 x 3 x 2 x 2 = 36 parameter combinations, and with five-fold cross-validation each one is fitted 5 times, giving 36 x 5 = 180 fits:

In [17]:
parameters = {
        'clf__n_estimators':(2000,3000,5000),
        'clf__max_depth':(5,15,30),
        'clf__min_samples_split':(2,3),
        'clf__min_samples_leaf':(1,2)  }

grid_search = GridSearchCV(pipeline,parameters,n_jobs=-1,cv=5,verbose=1,scoring='accuracy')
grid_search.fit(x_train,y_train)

print ('Best Training score: %0.3f' % grid_search.best_score_)
print ('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print ('\t%s: %r' % (param_name, best_parameters[param_name]))

predictions = grid_search.predict(x_test)

print ("Testing accuracy:",round(accuracy_score(y_test, predictions),4))
print ("\nComplete report of Testing data\n",classification_report(y_test, predictions))
print ("\n\nRandom Forest Grid Search- Test Confusion Matrix\n\n",pd.crosstab(y_test, predictions,rownames = ["Actuall"],colnames = ["Predicted"]))      


Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   50.1s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:  4.0min finished


Best Training score: 0.867
Best parameters set:
	clf__max_depth: 5
	clf__min_samples_leaf: 1
	clf__min_samples_split: 2
	clf__n_estimators: 3000
Testing accuracy: 0.873

Complete report of Testing data
               precision    recall  f1-score   support

           0       0.89      0.98      0.93       380
           1       0.62      0.21      0.32        61

   micro avg       0.87      0.87      0.87       441
   macro avg       0.75      0.60      0.62       441
weighted avg       0.85      0.87      0.85       441



Random Forest Grid Search- Test Confusion Matrix

 Predicted    0   1
Actuall           
0          372   8
1           48  13


In the preceding results, grid search does not provide much advantage over the random forest already explored (87.3% test accuracy against 87.8%). In practice, however, it usually gives better and more robust results than a simple exploration of models, because it systematically evaluates many different parameter combinations and picks the best one found by cross-validation.
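The figures of 36 candidates and 180 fits reported by the grid search can be reproduced by enumerating the grid by hand. This sketch uses only `itertools.product` and mirrors the `parameters` dictionary above; the names `param_grid`, `candidates`, and `cv_folds` are chosen here for illustration.

```python
from itertools import product

# Same values as in the `parameters` dictionary above.
param_grid = {
    "clf__n_estimators": (2000, 3000, 5000),
    "clf__max_depth": (5, 15, 30),
    "clf__min_samples_split": (2, 3),
    "clf__min_samples_leaf": (1, 2),
}
cv_folds = 5

# Every combination of one value per parameter is one candidate model.
names = sorted(param_grid)
candidates = [dict(zip(names, combo))
              for combo in product(*(param_grid[name] for name in names))]

# Each candidate is fitted once per cross-validation fold.
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)
```

Because the number of fits is the product of all list lengths times the number of folds, adding even one more value to a parameter list increases the run time noticeably when each fit trains thousands of trees.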

### AdaBoost classifier

In [18]:
# Adaboost Classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

A decision stump is used as the base classifier for AdaBoost. In the following code, the depth of the tree is fixed at 1, so the tree makes only a single decision (one split), which makes it a weak classifier:

In [19]:
dtree = DecisionTreeClassifier(criterion='gini',max_depth=1)

In AdaBoost, the decision stump is used as the base estimator: it is first fitted on the whole dataset, and then additional copies of the classifier are fitted on the same dataset, up to 5000 times, with the weights of incorrectly classified observations adjusted between rounds. The learning rate of 0.05 shrinks the contribution of each classifier. There is a trade-off between the learning rate and the number of estimators: a low learning rate combined with a large number of estimators can converge closer to the optimum, but at the expense of computing power. Note that in scikit-learn 1.2 and later, the `base_estimator` argument is named `estimator`:

In [20]:
adabst_fit = AdaBoostClassifier(base_estimator= dtree,n_estimators=5000,learning_rate=0.05,random_state=42)

adabst_fit.fit(x_train, y_train)
print ("\nAdaBoost - Train Confusion Matrix\n\n", pd.crosstab(y_train, adabst_fit.predict(x_train), rownames = ["Actuall"],colnames = ["Predicted"]))
print ("\nAdaBoost - Train accuracy",round(accuracy_score(y_train,adabst_fit.predict(x_train)), 3))
print ("\nAdaBoost  - Train Classification Report\n",classification_report(y_train,adabst_fit.predict(x_train)))


AdaBoost - Train Confusion Matrix

 Predicted    0    1
Actuall            
0          844    9
1           55  121

AdaBoost - Train accuracy 0.938

AdaBoost  - Train Classification Report
               precision    recall  f1-score   support

           0       0.94      0.99      0.96       853
           1       0.93      0.69      0.79       176

   micro avg       0.94      0.94      0.94      1029
   macro avg       0.93      0.84      0.88      1029
weighted avg       0.94      0.94      0.93      1029



The AdaBoost results are much better than those of the best random forest classifier in terms of recall for class 1. Although test accuracy drops slightly to 86.8% compared with the best accuracy of 87.8%, the number of 1's correctly predicted rises to 23, against 14 from the random forest, at the expense of some increase in misclassified 0's. This is real progress in terms of identifying actual attriters.
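The role of the learning rate in AdaBoost can be seen in a single reweighting round. The sketch below follows the textbook discrete AdaBoost update for two classes; it is a simplified illustration, not scikit-learn's exact implementation, and the function `reweight` and the toy setup of 10 observations are hypothetical.

```python
import math

def reweight(weights, misclassified, learning_rate):
    """One discrete AdaBoost round for a two-class problem."""
    err = sum(w for w, m in zip(weights, misclassified) if m) / sum(weights)
    # Vote of this weak learner, shrunk by the learning rate.
    alpha = learning_rate * math.log((1.0 - err) / err)
    # Misclassified observations are up-weighted, then all weights are normalized.
    updated = [w * math.exp(alpha) if m else w for w, m in zip(weights, misclassified)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Toy setup: 10 equally weighted observations, the stump gets 3 of them wrong.
weights = [0.1] * 10
misclassified = [True] * 3 + [False] * 7

alpha_full, w_full = reweight(weights, misclassified, learning_rate=1.0)
alpha_slow, w_slow = reweight(weights, misclassified, learning_rate=0.05)

# Share of total weight now sitting on the 3 misclassified observations.
print(round(sum(w_full[:3]), 3), round(sum(w_slow[:3]), 3))
```

With a learning rate of 1.0, the misclassified observations jump to half of the total weight after one round; with 0.05 they move only slightly above their original 30%, which is why a low learning rate needs many more estimators to shift focus to the hard cases.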

### Gradient boosting classifier

Gradient boosting is one of the competition-winning algorithms that work on the principle of boosting weak learners iteratively by shifting focus towards problematic observations that were difficult to predict in previous iterations and performing an ensemble of weak learners, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, but it generalizes them by allowing optimization of an arbitrary differentiable loss function.
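To make "shifting focus towards problematic observations" concrete for a differentiable loss, the sketch below computes the starting point and the first set of pseudo-residuals for the log loss in plain Python. It is a simplified illustration under stated assumptions (a constant log-odds starting model and residuals of the form y - p), not the library's code, and it reuses the training class counts of 853 and 176 reported earlier.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Training class counts from the reports above: 853 stayed (0), 176 left (1).
n_zero, n_one = 853, 176

# Stage 0: a constant model, the log-odds of the positive class.
f0 = math.log(n_one / n_zero)
p0 = sigmoid(f0)

# Negative gradient of the log loss with respect to the raw score is (y - p).
residual_for_one = 1 - p0    # attriters: the raw score needs to be pushed up
residual_for_zero = 0 - p0   # non-attriters: the raw score needs to be pushed down

# A later stage adds learning_rate * (a small tree fitted to these residuals).
learning_rate = 0.05
print(round(p0, 3), round(residual_for_one, 3), round(residual_for_zero, 3))
```

Observations that the current model predicts poorly have the largest residuals, so the next tree concentrates on them; swapping in a different differentiable loss only changes how the residuals are computed.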

In [21]:
# Gradientboost Classifier
from sklearn.ensemble import GradientBoostingClassifier

The parameters used in the gradient boosting algorithm are as follows. Deviance is used for the loss, as the problem we are trying to solve is 0/1 binary classification (scikit-learn 1.1 and later name this loss `'log_loss'`). The learning rate is 0.05, the number of trees to build is 5000, the maximum depth of each tree is 1, the minimum number of samples per leaf/terminal node is 1, and the minimum number of samples needed in a node to qualify for splitting is 2:

In [22]:
gbc_fit = GradientBoostingClassifier (loss='deviance', learning_rate=0.05, n_estimators=5000, min_samples_split=2, min_samples_leaf=1, max_depth=1, random_state=42 ) 

gbc_fit.fit(x_train,y_train) 

print ("\nGradient Boost - Train Confusion Matrix\n\n",pd.crosstab(y_train, gbc_fit.predict(x_train),rownames = ["Actuall"],colnames = ["Predicted"]))
print ("\nGradient Boost - Train accuracy",round(accuracy_score(y_train, gbc_fit.predict(x_train)),3))
print ("\nGradient Boost - Train Classification Report\n",classification_report( y_train, gbc_fit.predict(x_train)))

print ("\n\nGradient Boost - Test Confusion Matrix\n\n",pd.crosstab(y_test, gbc_fit.predict(x_test),rownames = ["Actuall"],colnames = ["Predicted"]))
print ("\nGradient Boost - Test accuracy",round(accuracy_score(y_test, gbc_fit.predict(x_test)),3))
print ("\nGradient Boost - Test Classification Report\n",classification_report( y_test, gbc_fit.predict(x_test)))

If we analyze the results, gradient boosting has given better results than AdaBoost: it reaches a test accuracy of 87.5% and captures the most 1's so far (24), compared with a test accuracy of 86.8% for AdaBoost. It is no wonder that so many data scientists use this algorithm to win competitions!

### Extreme gradient boosting - XGBoost classifier

XGBoost is an algorithm developed in 2014 by Tianqi Chen, based on gradient boosting principles. It has created a storm in the data science community since its inception. XGBoost has been developed with deep consideration of both system optimization and machine learning principles. The goal of the library is to push the extremes of the computation limits of machines to provide scalable, portable, and accurate results:

In [None]:
# Xgboost Classifier
import xgboost as xgb
xgb_fit = xgb.XGBClassifier(max_depth=2, n_estimators=5000, learning_rate=0.05)
xgb_fit.fit(x_train, y_train)

print ("\nXGBoost - Train Confusion Matrix\n\n",pd.crosstab(y_train, xgb_fit.predict(x_train),rownames = ["Actuall"],colnames = ["Predicted"]))     
print ("\nXGBoost - Train accuracy",round(accuracy_score(y_train, xgb_fit.predict(x_train)),3))
print ("\nXGBoost  - Train Classification Report\n",classification_report(y_train, xgb_fit.predict(x_train)))
print ("\n\nXGBoost - Test Confusion Matrix\n\n",pd.crosstab(y_test, xgb_fit.predict(x_test),rownames = ["Actuall"],colnames = ["Predicted"]))     
print ("\nXGBoost - Test accuracy",round(accuracy_score(y_test, xgb_fit.predict(x_test)),3))
print ("\nXGBoost - Test Classification Report\n",classification_report(y_test, xgb_fit.predict(x_test)))

The results obtained from XGBoost are very similar to those of gradient boosting. The test accuracy obtained was 87.1%, against 87.5% for gradient boosting, and the number of 1's identified is 23 compared with 24 in gradient boosting. The greatest advantages of XGBoost over gradient boosting are its computational performance and the options available to tune the model. By changing a few of them, XGBoost can be made to beat gradient boosting as well!