# Python-MLearning: Classification using Extreme Gradient Boosting (XGBoost) and Sklearn Library

## Project and Results Presentation


By: Hector Alvaro Rojas &nbsp;&nbsp;|&nbsp;&nbsp; Data Science, Visualizations and Applied Statistics &nbsp;&nbsp;|&nbsp;&nbsp; March 19, 2018<br>
Url: [http://www.arqmain.net](http://www.arqmain.net) &nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp; GitHub: [https://github.com/arqmain](https://github.com/arqmain)
<hr>

## I INTRODUCTION

This project applies the Extreme Gradient Boosting (XGBoost) algorithm to a classification problem and evaluates the models with an accuracy measure (`accuracy_score`) and the ROC curve.

This is a two-class approach to classify Normal and Abnormal orthopedic patients. The dataset is available from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/vertebral+column#), but this project uses an already pre-processed copy, which you can download from [here](http://www.arqmain.net/MLearning/Datasets/column_2C_weka.csv).

The dataset contains 310 observations (patients). Six columns hold measurements of each patient. These columns are the variables (features): pelvic_incidence, pelvic_tilt numeric, lumbar_lordosis_angle, sacral_slope, pelvic_radius, and degree_spondylolisthesis.

The seventh column is the class of the observed patient. Every patient belongs to one of two classes: Normal or Abnormal.



The following visualization shows how the patients are distributed over the classification variable "class". The graph projects the variables onto a two-dimensional space whose two orthogonal axes account for 85.8% of the total variation in the original data.

<img src="img/PC2.png" />

This graph shows that the two classes are separated to a real degree. They could also be considered linearly separable.
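The text does not say how the two axes were obtained. A projection onto the first two principal components is one common way to produce such a plot, so the sketch below assumes PCA on standardized features with scikit-learn. The function name `project_2d` and the synthetic demo data are illustrative and not part of the original project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_2d(X):
    """Project the rows of X onto the first two principal components.

    Returns the 2-D coordinates and the fraction of total variance
    that the two components retain together.
    """
    X_std = StandardScaler().fit_transform(X)  # assumption: scale before PCA
    pca = PCA(n_components=2)
    coords = pca.fit_transform(X_std)
    return coords, float(pca.explained_variance_ratio_.sum())


# Synthetic stand-in for the six measurement columns (not the real data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(310, 2))
X_demo = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(310, 6))

coords, retained = project_2d(X_demo)
print(coords.shape, round(retained, 3))
```

On the real data, `retained` would be the figure to compare with the 85.8% quoted above, and a scatter plot of `coords` colored by class would correspond to the image.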

## II GENERAL DATA CHECKING

Load the data and list its columns:

    import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
    import pandas_profiling

    # read the CSV (comma-separated values) file into a DataFrame
    data = pd.read_csv('data/column_2C_weka.csv')
    data.columns

Output:

    Index(['pelvic_incidence', 'pelvic_tilt numeric', 'lumbar_lordosis_angle',
           'sacral_slope', 'pelvic_radius', 'degree_spondylolisthesis', 'class'],
          dtype='object')

Generate the profile report:

    pandas_profiling.ProfileReport(data)

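<b>Dataset info</b>

| Item | Value |
|---|---|
| Number of variables | 7 |
| Number of observations | 310 |
| Total Missing (%) | 0.0% |
| Total size in memory | 17.0 KiB |
| Average record size in memory | 56.3 B |

<b>Variable types</b>

| Type | Count |
|---|---|
| Numeric | 6 |
| Categorical | 1 |
| Boolean | 0 |
| Date | 0 |
| Text (Unique) | 0 |
| Rejected | 0 |
| Unsupported | 0 |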

<b>class</b> (categorical)

| Statistic | Value |
|---|---|
| Distinct count | 2 |
| Unique (%) | 0.6% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |

| Value | Count | Frequency (%) |
|---|---|---|
| Abnormal | 210 | 67.7% |
| Normal | 100 | 32.3% |
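Because the two classes are not balanced, the accuracy figures reported later are easier to interpret against a majority-class baseline. The short computation below derives that baseline from the class counts above; the variable names are illustrative.

```python
# Class counts reported in the profile above.
counts = {"Abnormal": 210, "Normal": 100}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent class would be
# right this often on data with the same class mix.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

print(total)                         # 310
print(majority_label)                # Abnormal
print(round(baseline_accuracy, 3))   # 0.677
```

Any model worth keeping should beat roughly 0.68 accuracy on a test set with this class mix.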

<b>degree_spondylolisthesis</b> (numeric)

| Statistic | Value |
|---|---|
| Distinct count | 310 |
| Unique (%) | 100.0% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Zeros (%) | 0.0% |
| Mean | 26.297 |
| Minimum | -11.058 |
| 5-th percentile | -4.0831 |
| Q1 | 1.6037 |
| Median | 11.768 |
| Q3 | 41.287 |
| 95-th percentile | 81.691 |
| Maximum | 418.54 |
| Range | 429.6 |
| Interquartile range | 39.684 |
| Standard deviation | 37.559 |
| Coef of variation | 1.4283 |
| Kurtosis | 38.069 |
| MAD | 25.771 |
| Skewness | 4.318 |
| Sum | 8152 |
| Variance | 1410.7 |
| Memory size | 2.5 KiB |

Common values (first 10 listed in the report):

| Value | Count | Frequency (%) |
|---|---|---|
| 49.67209559 | 1 | 0.3% |
| 67.72731595 | 1 | 0.3% |
| 5.415825143 | 1 | 0.3% |
| 5.988550702 | 1 | 0.3% |
| 1.517203356 | 1 | 0.3% |
| -1.537383074 | 1 | 0.3% |
| 30.34120327 | 1 | 0.3% |
| -0.622526643 | 1 | 0.3% |
| 58.05754155 | 1 | 0.3% |
| 51.80589921 | 1 | 0.3% |

Five smallest values:

| Value | Count | Frequency (%) |
|---|---|---|
| -11.05817866 | 1 | 0.3% |
| -10.67587083 | 1 | 0.3% |
| -10.09310817 | 1 | 0.3% |
| -9.569249858 | 1 | 0.3% |
| -8.941709421 | 1 | 0.3% |

Five largest values:

| Value | Count | Frequency (%) |
|---|---|---|
| 118.3533701 | 1 | 0.3% |
| 124.9844057 | 1 | 0.3% |
| 145.3781432 | 1 | 0.3% |
| 148.7537109 | 1 | 0.3% |
| 418.5430821 | 1 | 0.3% |
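The maximum of 418.54 for degree_spondylolisthesis sits far above its 95th percentile of 81.691. One common way to quantify this is Tukey's 1.5 × IQR rule. The sketch below applies it to the quartiles and extreme values reported above; the rule itself is an assumption here and is not something the original analysis states it used.

```python
# Quartiles and extreme values as reported in the profile above.
q1, q3 = 1.6037, 41.287
largest = [118.3533701, 124.9844057, 145.3781432, 148.7537109, 418.5430821]
smallest = [-11.05817866, -10.67587083, -10.09310817, -9.569249858, -8.941709421]

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond the fences are the ones the rule flags as outliers.
flagged_high = [v for v in largest if v > upper_fence]
flagged_low = [v for v in smallest if v < lower_fence]

print(round(lower_fence, 2), round(upper_fence, 2))   # -57.92 100.81
print(len(flagged_high), len(flagged_low))            # 5 0
```

All five of the largest values exceed the upper fence, with 418.54 by far the most extreme. Tree-based models such as XGBoost split on thresholds, so a single extreme value like this typically matters less than it would for a linear model.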

<b>lumbar_lordosis_angle</b> (numeric)

| Statistic | Value |
|---|---|
| Distinct count | 280 |
| Unique (%) | 90.3% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Zeros (%) | 0.0% |
| Mean | 51.931 |
| Minimum | 14.0 |
| 5-th percentile | 26.852 |
| Q1 | 37.0 |
| Median | 49.562 |
| Q3 | 63.0 |
| 95-th percentile | 85.595 |
| Maximum | 125.74 |
| Range | 111.74 |
| Interquartile range | 26.0 |
| Standard deviation | 18.554 |
| Coef of variation | 0.35728 |
| Kurtosis | 0.16181 |
| MAD | 14.997 |
| Skewness | 0.59945 |
| Sum | 16099 |
| Variance | 344.25 |
| Memory size | 2.5 KiB |

Common values (first 10 listed in the report):

| Value | Count | Frequency (%) |
|---|---|---|
| 41.99999999 | 4 | 1.3% |
| 35.0 | 4 | 1.3% |
| 51.99999999 | 4 | 1.3% |
| 46.99999999 | 4 | 1.3% |
| 37.0 | 3 | 1.0% |
| 57.99999999 | 3 | 1.0% |
| 34.0 | 3 | 1.0% |
| 62.99999999 | 2 | 0.6% |
| 47.99999999 | 2 | 0.6% |
| 50.99999999 | 2 | 0.6% |

Five smallest values:

| Value | Count | Frequency (%) |
|---|---|---|
| 14.0 | 1 | 0.3% |
| 15.5 | 1 | 0.3% |
| 15.59036345 | 1 | 0.3% |
| 19.0710746 | 1 | 0.3% |
| 20.0308863 | 1 | 0.3% |

Five largest values:

| Value | Count | Frequency (%) |
|---|---|---|
| 93.89277881 | 1 | 0.3% |
| 95.15763273 | 1 | 0.3% |
| 96.28306169 | 1 | 0.3% |
| 100.7442198 | 1 | 0.3% |
| 125.7423855 | 1 | 0.3% |

<b>pelvic_incidence</b> (numeric)

| Statistic | Value |
|---|---|
| Distinct count | 310 |
| Unique (%) | 100.0% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Zeros (%) | 0.0% |
| Mean | 60.497 |
| Minimum | 26.148 |
| 5-th percentile | 35.989 |
| Q1 | 46.43 |
| Median | 58.691 |
| Q3 | 72.878 |
| 95-th percentile | 87.869 |
| Maximum | 129.83 |
| Range | 103.69 |
| Interquartile range | 26.447 |
| Standard deviation | 17.237 |
| Coef of variation | 0.28492 |
| Kurtosis | 0.22378 |
| MAD | 14.308 |
| Skewness | 0.52044 |
| Sum | 18754 |
| Variance | 297.1 |
| Memory size | 2.5 KiB |

Common values (first 10 listed in the report):

| Value | Count | Frequency (%) |
|---|---|---|
| 58.52162283 | 1 | 0.3% |
| 50.81926781 | 1 | 0.3% |
| 42.51727249 | 1 | 0.3% |
| 72.64385013 | 1 | 0.3% |
| 59.72614016 | 1 | 0.3% |
| 46.39026008 | 1 | 0.3% |
| 86.04127982 | 1 | 0.3% |
| 48.0306238 | 1 | 0.3% |
| 65.66534698 | 1 | 0.3% |
| 59.59554032 | 1 | 0.3% |

Five smallest values:

| Value | Count | Frequency (%) |
|---|---|---|
| 26.14792141 | 1 | 0.3% |
| 30.14993632 | 1 | 0.3% |
| 30.74193812 | 1 | 0.3% |
| 31.23238734 | 1 | 0.3% |
| 31.27601184 | 1 | 0.3% |

Five largest values:

| Value | Count | Frequency (%) |
|---|---|---|
| 95.48022873 | 1 | 0.3% |
| 96.65731511 | 1 | 0.3% |
| 115.9232606 | 1 | 0.3% |
| 118.1446548 | 1 | 0.3% |
| 129.8340406 | 1 | 0.3% |

<b>pelvic_radius</b> (numeric)

| Statistic | Value |
|---|---|
| Distinct count | 310 |
| Unique (%) | 100.0% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Zeros (%) | 0.0% |
| Mean | 117.92 |
| Minimum | 70.083 |
| 5-th percentile | 95.339 |
| Q1 | 110.71 |
| Median | 118.27 |
| Q3 | 125.47 |
| 95-th percentile | 139.14 |
| Maximum | 163.07 |
| Range | 92.988 |
| Interquartile range | 14.758 |
| Standard deviation | 13.317 |
| Coef of variation | 0.11294 |
| Kurtosis | 0.93461 |
| MAD | 10.002 |
| Skewness | -0.17683 |
| Sum | 36555 |
| Variance | 177.35 |
| Memory size | 2.5 KiB |

Common values (first 10 listed in the report):

| Value | Count | Frequency (%) |
|---|---|---|
| 141.0881494 | 1 | 0.3% |
| 131.8024914 | 1 | 0.3% |
| 124.6460723 | 1 | 0.3% |
| 128.0636203 | 1 | 0.3% |
| 128.9056892 | 1 | 0.3% |
| 78.99945411 | 1 | 0.3% |
| 124.1158358 | 1 | 0.3% |
| 122.0929536 | 1 | 0.3% |
| 98.62251165 | 1 | 0.3% |
| 119.3356546 | 1 | 0.3% |

Five smallest values:

| Value | Count | Frequency (%) |
|---|---|---|
| 70.08257486 | 1 | 0.3% |
| 78.99945411 | 1 | 0.3% |
| 81.0245406 | 1 | 0.3% |
| 82.45603817 | 1 | 0.3% |
| 84.24141517 | 1 | 0.3% |

Five largest values:

| Value | Count | Frequency (%) |
|---|---|---|
| 147.8946372 | 1 | 0.3% |
| 148.5255624 | 1 | 0.3% |
| 151.8398566 | 1 | 0.3% |
| 157.848799 | 1 | 0.3% |
| 163.0710405 | 1 | 0.3% |

<b>pelvic_tilt numeric</b> (numeric)

| Statistic | Value |
|---|---|
| Distinct count | 310 |
| Unique (%) | 100.0% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Zeros (%) | 0.0% |
| Mean | 17.543 |
| Minimum | -6.5549 |
| 5-th percentile | 3.3834 |
| Q1 | 10.667 |
| Median | 16.358 |
| Q3 | 22.12 |
| 95-th percentile | 37.55 |
| Maximum | 49.432 |
| Range | 55.987 |
| Interquartile range | 11.453 |
| Standard deviation | 10.008 |
| Coef of variation | 0.57051 |
| Kurtosis | 0.67618 |
| MAD | 7.6173 |
| Skewness | 0.67655 |
| Sum | 5438.3 |
| Variance | 100.17 |
| Memory size | 2.5 KiB |

Common values (first 10 listed in the report):

| Value | Count | Frequency (%) |
|---|---|---|
| 24.1888846 | 1 | 0.3% |
| 9.652074879 | 1 | 0.3% |
| 15.40221253 | 1 | 0.3% |
| 39.84466878 | 1 | 0.3% |
| 22.21848205 | 1 | 0.3% |
| 17.44383762 | 1 | 0.3% |
| 17.89940172 | 1 | 0.3% |
| 13.29178975 | 1 | 0.3% |
| 33.42595126 | 1 | 0.3% |
| 29.39654543 | 1 | 0.3% |

Five smallest values:

| Value | Count | Frequency (%) |
|---|---|---|
| -6.554948347 | 1 | 0.3% |
| -5.845994341 | 1 | 0.3% |
| -3.759929872 | 1 | 0.3% |
| -2.970024337 | 1 | 0.3% |
| -1.329412398 | 1 | 0.3% |

Five largest values:

| Value | Count | Frequency (%) |
|---|---|---|
| 42.68919513 | 1 | 0.3% |
| 46.55005318 | 1 | 0.3% |
| 48.06953097 | 1 | 0.3% |
| 48.90365265 | 1 | 0.3% |
| 49.4318636 | 1 | 0.3% |

<b>sacral_slope</b> (numeric)

| Statistic | Value |
|---|---|
| Distinct count | 281 |
| Unique (%) | 90.6% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Zeros (%) | 0.0% |
| Mean | 42.954 |
| Minimum | 13.367 |
| 5-th percentile | 23.489 |
| Q1 | 33.347 |
| Median | 42.405 |
| Q3 | 52.696 |
| 95-th percentile | 63.435 |
| Maximum | 121.43 |
| Range | 108.06 |
| Interquartile range | 19.349 |
| Standard deviation | 13.423 |
| Coef of variation | 0.3125 |
| Kurtosis | 3.0074 |
| MAD | 10.613 |
| Skewness | 0.79258 |
| Sum | 13316 |
| Variance | 180.18 |
| Memory size | 2.5 KiB |

Common values (first 10 listed in the report):

| Value | Count | Frequency (%) |
|---|---|---|
| 45.0 | 3 | 1.0% |
| 56.30993248 | 3 | 1.0% |
| 35.41705528 | 3 | 1.0% |
| 33.11134196 | 3 | 1.0% |
| 34.38034472 | 2 | 0.6% |
| 55.92280472 | 2 | 0.6% |
| 33.21525149 | 2 | 0.6% |
| 29.7448813 | 2 | 0.6% |
| 48.17983012 | 2 | 0.6% |
| 52.88313932 | 2 | 0.6% |

Five smallest values:

| Value | Count | Frequency (%) |
|---|---|---|
| 13.3669307 | 1 | 0.3% |
| 13.51656811 | 1 | 0.3% |
| 15.38846783 | 1 | 0.3% |
| 16.26020471 | 1 | 0.3% |
| 17.38697218 | 1 | 0.3% |

Five largest values:

| Value | Count | Frequency (%) |
|---|---|---|
| 77.19573393 | 1 | 0.3% |
| 78.40782459 | 1 | 0.3% |
| 78.79405249 | 1 | 0.3% |
| 79.69515353 | 1 | 0.3% |
| 121.4295656 | 1 | 0.3% |

<b>Sample</b> (first five rows)

| | pelvic_incidence | pelvic_tilt numeric | lumbar_lordosis_angle | sacral_slope | pelvic_radius | degree_spondylolisthesis | class |
|---|---|---|---|---|---|---|---|
| 0 | 63.027818 | 22.552586 | 39.609117 | 40.475232 | 98.672917 | -0.2544 | Abnormal |
| 1 | 39.056951 | 10.060991 | 25.015378 | 28.99596 | 114.405425 | 4.564259 | Abnormal |
| 2 | 68.832021 | 22.218482 | 50.092194 | 46.613539 | 105.985135 | -3.530317 | Abnormal |
| 3 | 69.297008 | 24.652878 | 44.311238 | 44.64413 | 101.868495 | 11.211523 | Abnormal |
| 4 | 49.712859 | 9.652075 | 28.317406 | 40.060784 | 108.168725 | 7.918501 | Abnormal |
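Before modeling, the six measurement columns have to be separated from the class column, and the class labels need a numeric encoding for XGBoost. A minimal sketch with pandas follows. It uses three rows copied from the sample above instead of the full file, and the mapping Normal → 0, Abnormal → 1 is an assumption, since the original encoding is not shown.

```python
import pandas as pd

# Three rows copied from the sample above; the real file has 310 rows.
data = pd.DataFrame(
    {
        "pelvic_incidence": [63.027818, 39.056951, 68.832021],
        "pelvic_tilt numeric": [22.552586, 10.060991, 22.218482],
        "lumbar_lordosis_angle": [39.609117, 25.015378, 50.092194],
        "sacral_slope": [40.475232, 28.99596, 46.613539],
        "pelvic_radius": [98.672917, 114.405425, 105.985135],
        "degree_spondylolisthesis": [-0.2544, 4.564259, -3.530317],
        "class": ["Abnormal", "Abnormal", "Abnormal"],
    }
)

X = data.drop(columns="class")                        # six numeric features
y = data["class"].map({"Normal": 0, "Abnormal": 1})   # assumed encoding

print(X.shape)      # (3, 6)
print(y.tolist())   # [1, 1, 1]
```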


## III RESULTS FROM TWO CLASS APPROACH USING TRAIN-TEST-SPLIT METHOD

All parameters were left at their default values. The parameter values listed below are those defaults. We list the same parameters that we later tune with Grid Search cross-validation using 10 folds.

### 31 XGBoost Based on Original Features

>* gamma = 0
>* learning_rate = 0.1
>* max_depth = 3
>* min_child_weight = 1
>* n_estimators = 100
>* reg_alpha = 0
>* Accuracy = 0.8172
>* AUC = 0.88
>* Confusion matrix:
> <img src="img/CMatrix1.png" />
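A hold-out evaluation of this kind can be sketched as follows. The parameter values are the ones listed above and are passed explicitly, because the defaults of `XGBClassifier` have changed in newer XGBoost releases. The helper names `make_xgb` and `holdout_scores`, the 30% test share, the stratified split, and the seed are assumptions, since the original split is not documented.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split


def make_xgb():
    """Build an XGBClassifier with the parameter values listed above."""
    from xgboost import XGBClassifier  # requires the xgboost package

    return XGBClassifier(
        gamma=0,
        learning_rate=0.1,
        max_depth=3,
        min_child_weight=1,
        n_estimators=100,
        reg_alpha=0,
    )


def holdout_scores(model, X, y, test_size=0.3, seed=0):
    """Fit on a stratified training split and score on the held-out part.

    test_size and seed are assumptions, not the original settings.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
```

With the real data this would be called as `holdout_scores(make_xgb(), X, y)`, where `X` holds the six measurement columns and `y` encodes the two classes as 0 and 1.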


### 32 XGBoost Based on Rescaled Features Using Standardization

>* gamma = 0
>* learning_rate = 0.1
>* max_depth = 3
>* min_child_weight = 1
>* n_estimators = 100
>* reg_alpha = 0
>* Accuracy = 0.8172
>* AUC = 0.92
>* Confusion matrix:
> <img src="img/CMatrix2.png" />

### 33 XGBoost Based on Rescaled Features Using Normalization

>* gamma = 0
>* learning_rate = 0.1
>* max_depth = 3
>* min_child_weight = 1
>* n_estimators = 100
>* reg_alpha = 0
>* Accuracy = 0.8172
>* AUC = 0.88
>* Confusion matrix:
> <img src="img/CMatrix3.png" />
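The two rescalings are not specified further in the text. A common reading is z-score standardization and min-max normalization, so the sketch below assumes scikit-learn's `StandardScaler` and `MinMaxScaler`. Each scaler is fitted on the training rows only, so that no test-set statistics leak into training. The toy matrix is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix (made-up values): 4 training rows, 1 test row.
X_train = np.array([[10.0, 100.0],
                    [20.0, 110.0],
                    [30.0, 120.0],
                    [40.0, 130.0]])
X_test = np.array([[25.0, 140.0]])

# Standardization: subtract the training mean, divide by the training std.
std_scaler = StandardScaler().fit(X_train)
X_train_std = std_scaler.transform(X_train)
X_test_std = std_scaler.transform(X_test)

# Normalization (assumed min-max): map the training range onto [0, 1].
mm_scaler = MinMaxScaler().fit(X_train)
X_train_mm = mm_scaler.transform(X_train)
X_test_mm = mm_scaler.transform(X_test)

print(X_train_std.mean(axis=0))                         # approximately [0, 0]
print(X_train_mm.min(axis=0), X_train_mm.max(axis=0))   # zeros and ones
print(X_test_mm)   # second value exceeds 1: it lies outside the training range
```

Decision trees split on feature thresholds, so a monotonic rescaling of each feature generally leaves the learned partitions unchanged. That fits the identical accuracy reported for the three variants in this section.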


### 34 ROC Curves of the Three Models in the Same Plot

> <img src="img/ROC1.png" />
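A ROC curve plots the true-positive rate against the false-positive rate as the decision threshold on the predicted probability varies, and AUC is the area under that curve. A minimal sketch with scikit-learn's `roc_curve` and `auc` follows. The labels and scores are made up and are not the project's predictions.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Made-up labels (1 = Abnormal, 0 = Normal) and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.65, 0.55])

# One (fpr, tpr) point per distinct threshold on the score.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

print(round(roc_auc, 4))   # 0.875
```

Here 14 of the 16 positive/negative pairs are ranked correctly, which gives 14/16 = 0.875. Drawing `fpr` against `tpr` for each model on one set of axes produces a combined figure like the one above.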

## IV RESULTS FROM K FOLD CROSS-VALIDATION METHOD

We ran a Grid Search cross-validation with 10 folds over the following parameter values:

<b>gamma</b>: [0, 0.1, 0.5]<br>
<b>learning_rate</b>: [0.05, 0.1]<br>
<b>max_depth</b>: [5, 7]<br>
<b>min_child_weight</b>: range(1,6)<br>
<b>n_estimators</b>: [100, 200]<br>
<b>reg_alpha</b>: [1e-5, 1e-2, 0.1, 1, 100]<br>

The rest of the parameters were left at their default values.
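The size of this search can be worked out directly from the grid, and the search itself can be set up with scikit-learn's `GridSearchCV`. In the sketch below the grid is the one listed above. The helper `build_search`, the accuracy scoring, the stratified shuffled folds, and the seed are assumptions.

```python
from sklearn.model_selection import GridSearchCV, ParameterGrid, StratifiedKFold

# Search space as listed above.
param_grid = {
    "gamma": [0, 0.1, 0.5],
    "learning_rate": [0.05, 0.1],
    "max_depth": [5, 7],
    "min_child_weight": list(range(1, 6)),
    "n_estimators": [100, 200],
    "reg_alpha": [1e-5, 1e-2, 0.1, 1, 100],
}

n_candidates = len(ParameterGrid(param_grid))   # 3 * 2 * 2 * 5 * 2 * 5
n_folds = 10
print(n_candidates)             # 600
print(n_candidates * n_folds)   # 6000 model fits, plus one final refit


def build_search(estimator, seed=0):
    """Wrap an estimator in a 10-fold grid search scored by accuracy.

    Stratified folds, shuffling, and the seed are assumptions.
    """
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    return GridSearchCV(estimator, param_grid, scoring="accuracy", cv=cv)
```

Calling `build_search(XGBClassifier()).fit(X, y)` would then expose the winning combination as `best_params_` and its mean cross-validated accuracy as `best_score_`.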

### 41 XGBoost Based on Original Features

>* gamma = 0
>* learning_rate = 0.05
>* max_depth = 5
>* min_child_weight = 5
>* n_estimators = 100
>* reg_alpha = 0.1
>* Accuracy = 0.8525
>* AUC = 0.99
>* Confusion matrix:
> <img src="img/CMatrix21.png" />


### 42 XGBoost Based on Rescaled Features Using Standardization

>* gamma = 0
>* learning_rate = 0.05
>* max_depth = 5
>* min_child_weight = 5
>* n_estimators = 100
>* reg_alpha = 0.1
>* Accuracy = 0.8525
>* AUC = 0.99
>* Confusion matrix:
> <img src="img/CMatrix22.png" />


### 43 XGBoost Based on Rescaled Features Using Normalization

>* gamma = 0
>* learning_rate = 0.05
>* max_depth = 5
>* min_child_weight = 5
>* n_estimators = 100
>* reg_alpha = 0.1
>* Accuracy = 0.8571
>* AUC = 0.97
>* Confusion matrix:
> <img src="img/CMatrix23.png" />

### 44 ROC Curves of the Three Models in the Same Plot

> <img src="img/ROC2.png" />
