<h2>1. Importing the Dataset</h2>

In [1]:
import pandas as pd
import numpy as np

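# Breast cancer dataset: the target column "Class" is 2 (benign) or 4 (malignant)
df = pd.read_csv('../../../data/clean/Data_XGBoost.csv')
display(df.head())
# Features: every column except the last; target: the last column ("Class").
# Note that this keeps "Sample code number", an identifier, among the features.
x = df.iloc[:, :-1].values
y = df.iloc[:, -1].values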

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


---
<h2>2. Splitting the Dataset</h2>

In [2]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print("train dataset size : {} observations\ntest dataset size : {} observations".format(x_train.shape[0], x_test.shape[0]))

train dataset size : 546 observations
test dataset size : 137 observations
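
The sizes above follow from the 683 rows in the dataset (546 + 137). The sketch below reproduces that arithmetic in plain Python, assuming the test share is rounded up and the remainder goes to the training set, which is how `train_test_split` appears to size the sets when only `test_size` is given as a fraction. The helper `split_indices` is a hypothetical name, and its shuffle uses Python's `random` module, so the selected rows will differ from scikit-learn's.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=0):
    """Shuffle row indices and split them into train and test parts."""
    # Round the test share up; the training set gets whatever is left.
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(683)
print(len(train_idx), len(test_idx))  # 546 137
```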


---
<h2>3. Training the Model on the Training Set [CatBoost <a href="https://catboost.ai/">source</a>]</h2>
<ul style="font-size:15px">
    <li>CatBoost is a gradient boosting algorithm built on decision trees
    <li>CatBoost is designed to work well with its <strong>default parameters</strong>, so it often needs little or no hyperparameter tuning
    <li>CatBoost handles categorical features natively, so we do not have to spend time and effort converting them to numbers beforehand
    <li>CatBoost can be used for regression (CatBoostRegressor) or classification (CatBoostClassifier)
</ul>
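
To make "gradient boosting on decision trees" concrete, here is a toy sketch for squared-error regression on a single feature, using depth-1 trees (stumps). It is not CatBoost's implementation: CatBoost adds techniques such as ordered boosting and symmetric trees that are not shown here. The function names `fit_stump`, `boost`, and `mse` and the sample data are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on one feature that minimizes squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Fit stumps one after another, each on the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, predictions


def mse(ys, predictions):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
base, stumps, predictions = boost(xs, ys)
print("MSE before boosting:", mse(ys, [base] * len(ys)))
print("MSE after boosting: ", mse(ys, predictions))
```

Each new stump corrects part of the error left by the ones before it, and the learning rate keeps every individual correction small.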

In [3]:
from catboost import CatBoostClassifier

cbc = CatBoostClassifier()
cbc.fit(x_train, y_train)

...
616:	learn: 0.0205628	total: 858ms	remaining: 532ms
617:	learn: 0.0205132	total: 859ms	remaining: 531ms
618:	learn: 0.0204583	total: 861ms	remaining: 530ms
619:	learn: 0.0204181	total: 862ms	remaining: 528ms
620:	learn: 0.0203960	total: 864ms	remaining: 527ms
621:	learn: 0.0203647	total: 865ms	remaining: 526ms
622:	learn: 0.0203238	total: 866ms	remaining: 524ms
623:	learn: 0.0202722	total: 867ms	remaining: 523ms
624:	learn: 0.0202170	total: 868ms	remaining: 521ms
625:	learn: 0.0202039	total: 869ms	remaining: 519ms
626:	learn: 0.0201620	total: 870ms	remaining: 518ms
627:	learn: 0.0201252	total: 871ms	remaining: 516ms
628:	learn: 0.0200745	total: 872ms	remaining: 515ms
629:	learn: 0.0200285	total: 874ms	remaining: 513ms
630:	learn: 0.0199697	total: 875ms	remaining: 512ms
631:	learn: 0.0199337	total: 876ms	remaining: 510ms
632:	learn: 0.0198978	total: 878ms	remaining: 509ms
633:	learn: 0.0198646	total: 879ms	remaining: 507ms
634:	learn: 0.0197812	total: 880ms	remaining: 50...

<catboost.core.CatBoostClassifier at 0x1e8203f1790>

---
<h2>4. Predicting the Test Set and Displaying the Results</h2>

In [4]:
y_pred = cbc.predict(x_test)

pd.DataFrame(data=np.stack((y_test, y_pred), axis=1),
             index=None, columns=['y actual', 'y prediction'],
             copy=False).head(10)

,y actual,y prediction
0,2,2
1,2,2
2,4,4
3,4,4
4,2,2
5,2,2
6,2,2
7,4,4
8,2,2
9,2,2


---
<h2>5. Making the Confusion Matrix</h2>

In [5]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred))
print("\nThe confusion matrix shows that:"
      "\n\t- 84 of the 87 class 2 (benign) samples are classified correctly; 3 are misclassified as malignant"
      "\n\t- all 50 class 4 (malignant) samples are classified correctly")

[[84  3]
 [ 0 50]]

The confusion matrix shows that:
	- 84 of the 87 class 2 (benign) samples are classified correctly; 3 are misclassified as malignant
	- all 50 class 4 (malignant) samples are classified correctly
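
The usual summary metrics can be read straight off the confusion matrix above. The short sketch below does this in plain Python, assuming the scikit-learn layout in which rows are actual classes and columns are predicted classes; the `metrics` dictionary is just an illustrative container.

```python
# Confusion matrix copied from the output above:
# rows are actual classes, columns are predicted classes.
matrix = [[84, 3],
          [0, 50]]
labels = [2, 4]  # 2 = benign, 4 = malignant

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total
print("accuracy: {:.4f}".format(accuracy))  # accuracy: 0.9781

metrics = {}
for i, label in enumerate(labels):
    predicted_as_label = sum(row[i] for row in matrix)  # column sum
    actually_label = sum(matrix[i])                     # row sum
    precision = matrix[i][i] / predicted_as_label
    recall = matrix[i][i] / actually_label
    metrics[label] = (precision, recall)
    print("class {}: precision {:.4f}, recall {:.4f}".format(label, precision, recall))
# class 2: precision 1.0000, recall 0.9655
# class 4: precision 0.9434, recall 1.0000
```

A recall of 1.0 for class 4 means no malignant sample in the test set was missed, while the three benign samples flagged as malignant lower that class's precision.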


---
<h2>6. Applying k-Fold Cross Validation</h2>

In [6]:
from sklearn.model_selection import cross_val_score

acc = cross_val_score(estimator=cbc, X=x_train, y=y_train, cv=10, n_jobs=-1)
print("Accuracy: {:.2f}%".format(acc.mean()*100))
print("Standard Deviation: {:.2f}%".format(acc.std()*100))
print("\nOne standard deviation around the mean of the 10 fold accuracies spans {:.2f}% to {:.2f}%, so the spread across folds is low.".format((acc.mean()-acc.std())*100, (acc.mean()+acc.std())*100))

Accuracy: 97.26%
Standard Deviation: 2.03%

One standard deviation around the mean of the 10 fold accuracies spans 95.23% to 99.29%, so the spread across folds is low.
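
As a rough illustration of what happens behind `cross_val_score`, the sketch below splits the 546 training rows into 10 folds and summarizes a list of fold scores. It is a simplification: for a classifier with an integer `cv`, scikit-learn uses stratified folds, whereas this sketch uses plain contiguous folds. The helpers `k_fold_indices` and `mean_and_std` are hypothetical, and `example_scores` is made-up data, not the notebook's actual fold accuracies.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def mean_and_std(scores):
    """Mean and population standard deviation (NumPy's default, ddof=0)."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, variance ** 0.5


folds = list(k_fold_indices(546, 10))
print([len(validation) for _, validation in folds])
# [55, 55, 55, 55, 55, 55, 54, 54, 54, 54]

# Hypothetical fold accuracies, for illustration only.
example_scores = [0.96, 1.00, 0.98, 0.94, 0.98]
mean, std = mean_and_std(example_scores)
print("{:.2f}% +/- {:.2f}%".format(mean * 100, std * 100))
```

Each row is used for validation exactly once and for training in the other nine folds, which is why the mean score is a steadier estimate than a single train/test split.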
