<h2>1. Importing the Dataset</h2>

In [1]:
import pandas as pd
import numpy as np

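# Breast cancer dataset; the target column `Class` is 2 (benign) or 4 (malignant)
df = pd.read_csv('../../../data/clean/Data_XGBoost.csv')
display(df.head())
# Features: every column except the last. Target: the last column (`Class`).
# Note: this slice also keeps `Sample code number`, which is an ID column.
x = df.iloc[:, :-1].values
y = df.iloc[:, -1].values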

,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2
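
The positional slice `df.iloc[:, :-1]` keeps every column except the last, so the preview above suggests that `Sample code number`, an identifier rather than a measurement, ends up among the features. A minimal sketch of selecting features by name instead is shown here, using the five preview rows as stand-in data. The names `preview`, `feature_cols`, `x_named` and `y_named` are illustrative and not part of the original notebook, and the results reported later were produced with the ID column included.

```python
import pandas as pd

# Stand-in data: the five rows shown in the preview above.
preview = pd.DataFrame({
    "Sample code number": [1000025, 1002945, 1015425, 1016277, 1017023],
    "Clump Thickness": [5, 5, 3, 6, 4],
    "Uniformity of Cell Size": [1, 4, 1, 8, 1],
    "Uniformity of Cell Shape": [1, 4, 1, 8, 1],
    "Marginal Adhesion": [1, 5, 1, 1, 3],
    "Single Epithelial Cell Size": [2, 7, 2, 3, 2],
    "Bare Nuclei": [1, 10, 2, 4, 1],
    "Bland Chromatin": [3, 3, 3, 3, 3],
    "Normal Nucleoli": [1, 2, 1, 7, 1],
    "Mitoses": [1, 1, 1, 1, 1],
    "Class": [2, 2, 2, 2, 2],
})

# Assumption: the ID column carries no diagnostic signal, so it is excluded.
feature_cols = [c for c in preview.columns
                if c not in ("Sample code number", "Class")]
x_named = preview[feature_cols].to_numpy()
y_named = preview["Class"].to_numpy()
print(x_named.shape, y_named.shape)
```

Selecting by name also keeps the split stable if columns are later reordered in the CSV.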


---
<h2>2. Splitting the Dataset</h2>

In [2]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print("train dataset size : {} observations\ntest dataset size : {} observations".format(x_train.shape[0], x_test.shape[0]))

train dataset size : 546 observations
test dataset size : 137 observations
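
The 546/137 split follows from `test_size=0.2` applied to 683 rows: the test count is rounded up and the training set takes the remainder. The sketch below reproduces only that arithmetic and a single shuffled cut. It is a simplified illustration, not scikit-learn's implementation, and the seed shown will not reproduce the same rows as `random_state=0`.

```python
import math
import numpy as np

n_samples = 683      # 546 + 137, the sizes printed above
test_size = 0.2

# The test count is rounded up; the training set takes the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Shuffle the row indices once, then cut them into two disjoint parts.
rng = np.random.default_rng(0)   # illustrative seed, not sklearn's RandomState(0)
shuffled = rng.permutation(n_samples)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
print(n_train, n_test)
```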


---
<h2>3. Training the Model on the Training Dataset</h2>

In [3]:
from xgboost import XGBClassifier

# XGBoost provides XGBRegressor for regression and XGBClassifier for classification
xgbc = XGBClassifier()
xgbc.fit(x_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
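
The labels in this dataset are 2 and 4. The XGBoost version used above accepted them, but more recent releases expect class labels to be consecutive integers starting at 0 and may reject `[2, 4]`. One possible workaround, sketched here with NumPy only, is to encode the labels before fitting and decode the predictions afterwards. The arrays `y_demo` and `pred_encoded` are illustrative stand-ins, not data from the notebook.

```python
import numpy as np

# Illustrative labels in the notebook's coding: 2 = benign, 4 = malignant.
y_demo = np.array([2, 2, 4, 4, 2, 2, 2, 4, 2, 2])

# classes holds the sorted unique labels; y_encoded holds their positions (0 or 1).
classes, y_encoded = np.unique(y_demo, return_inverse=True)

# A classifier fitted on y_encoded would predict 0/1;
# indexing into classes maps those predictions back to 2/4.
pred_encoded = np.array([0, 1, 1])   # hypothetical model output
pred_original = classes[pred_encoded]
print(classes, y_encoded, pred_original)
```

With this mapping, `xgbc.fit(x_train, y_encoded)` would see labels 0 and 1, and the original coding is recovered only for display.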

---
<h2>4. Predicting on the Test Dataset and Displaying the Results</h2>

In [4]:
y_pred = xgbc.predict(x_test)

# Show actual and predicted labels side by side for the first 10 test rows
pd.DataFrame({'y actual': y_test, 'y prediction': y_pred}).head(10)

,y actual,y prediction
0,2,2
1,2,2
2,4,4
3,4,4
4,2,2
5,2,2
6,2,2
7,4,4
8,2,2
9,2,2


---
<h2>5. Making the Confusion Matrix</h2>

In [5]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)
# Rows are actual classes and columns are predicted classes, in label order (2, 4)
print("\nThe confusion matrix shows:"
      "\n\t- {} correct predictions of class 2 (benign)"
      "\n\t- {} correct predictions of class 4 (malignant)".format(cm[0, 0], cm[1, 1]))

[[85  2]
 [ 1 49]]

The confusion matrix shows:
	- 85 correct predictions of class 2 (benign)
	- 49 correct predictions of class 4 (malignant)
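
The off-diagonal entries matter as much as the diagonal: 2 benign samples were predicted malignant and 1 malignant sample was predicted benign. The sketch below derives accuracy, precision and recall from the printed matrix, treating malignant (class 4) as the positive class, which is an assumption made here for illustration.

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class,
# in label order (2 = benign, 4 = malignant).
cm = np.array([[85, 2],
               [1, 49]])

tn, fp = cm[0]   # benign samples: kept benign / flagged malignant
fn, tp = cm[1]   # malignant samples: missed / correctly flagged

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted malignant, the share truly malignant
recall = tp / (tp + fn)      # of the truly malignant, the share detected
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

In a screening setting, recall on the malignant class is often the figure to watch, since a missed malignant case is usually costlier than a false alarm.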


---
<h2>6. Applying k-Fold Cross Validation</h2>

In [6]:
from sklearn.model_selection import cross_val_score

acc = cross_val_score(estimator=xgbc, X=x_train, y=y_train, cv=10, n_jobs=-1)
print("Accuracy: {:.2f}%".format(acc.mean()*100))
print("Standard Deviation: {:.2f}%".format(acc.std()*100))
print("\nThe 10 fold accuracies lie roughly between {:.2f}% and {:.2f}% (mean ± one standard deviation), so the spread across folds is low.".format((acc.mean()-acc.std())*100, (acc.mean()+acc.std())*100))

Accuracy: 96.53%
Standard Deviation: 2.63%

The 10 fold accuracies lie roughly between 93.89% and 99.16% (mean ± one standard deviation), so the spread across folds is low.
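
To make the `cv=10` setting concrete, the sketch below partitions the 546 training rows into 10 folds and holds each one out in turn. It shows plain, unshuffled k-fold only. For a classifier, `cross_val_score` with an integer `cv` uses stratified folds, so the actual fold membership differs from this illustration.

```python
import numpy as np

n_train = 546   # training rows reported in section 2
k = 10

# Split the row indices into k nearly equal, non-overlapping folds.
folds = np.array_split(np.arange(n_train), k)

for i, val_idx in enumerate(folds):
    # Fold i is held out for scoring; the other k-1 folds are used for fitting.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    assert len(train_idx) + len(val_idx) == n_train

print([len(f) for f in folds])
```

Each row is scored exactly once, which is why the mean of the 10 fold accuracies is a steadier estimate than a single train/test split.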
