# XGBoost - Breast Cancer Classification

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### EDA

In [4]:
data = pd.read_csv('../data/Breast_cancer.csv')
data.head()

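| | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |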


In [5]:
data['Class'] = data['Class'].map({2:0,4:1})
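
Series.map returns NaN for any value that is missing from the dictionary, so an unexpected label would silently turn into a missing value. A minimal sketch of a sanity check, run on a small hypothetical frame (`toy` is illustrative and not the real dataset):

```python
import pandas as pd

# Hypothetical toy frame using the same label coding as the dataset (2 and 4)
toy = pd.DataFrame({"Class": [2, 4, 4, 2, 2]})

# Same mapping as the cell above: 2 -> 0, 4 -> 1
toy["Class"] = toy["Class"].map({2: 0, 4: 1})

# Any value outside {2, 4} would have become NaN, so check for that explicitly
n_unmapped = int(toy["Class"].isna().sum())
class_counts = toy["Class"].value_counts().to_dict()

print(n_unmapped)
print(class_counts)
```

Running the same two checks on `data['Class']` would confirm that every row was mapped and show the class balance.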

In [6]:
data.shape

(683, 11)

In [7]:
data.isna().sum()

Sample code number             0
Clump Thickness                0
Uniformity of Cell Size        0
Uniformity of Cell Shape       0
Marginal Adhesion              0
Single Epithelial Cell Size    0
Bare Nuclei                    0
Bland Chromatin                0
Normal Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64

### Splitting X and y

In [9]:
X = data.drop('Class',axis=1)
y = data['Class']
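
Note that X as built above still contains 'Sample code number'. If that column is only a sample identifier (an assumption; the source does not say), it carries no diagnostic signal and could be excluded from the features. A hedged sketch on a hypothetical two-row frame with a subset of the columns:

```python
import pandas as pd

# Hypothetical two-row frame with a subset of the dataset's columns
toy = pd.DataFrame({
    "Sample code number": [1000025, 1002945],
    "Clump Thickness": [5, 5],
    "Mitoses": [1, 1],
    "Class": [0, 0],
})

# Drop both the target and the assumed identifier column from the features
X_toy = toy.drop(columns=["Class", "Sample code number"])
y_toy = toy["Class"]

print(list(X_toy.columns))
```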

### Splitting into training and test sets

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
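
The split above is purely random. When the classes are imbalanced, passing `stratify` keeps the class ratio identical in both subsets. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical) rather than the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 30% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Both subsets end up with 30% positives, which makes test metrics less sensitive to an unlucky split.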

### XGBoost

In [24]:
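import xgboost as xgb
from sklearn.metrics import confusion_matrix, accuracy_score

# Fit a gradient-boosted classifier with default hyperparameters
classifier = xgb.XGBClassifier()
classifier.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(cm)
accuracy_score(y_test, y_pred)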

[[88  5]
 [16 91]]


0.895
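
The accuracy can be recomputed by hand from the confusion matrix printed above, and the same four counts also give precision and recall for the positive class. A small worked example, assuming scikit-learn's layout (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix as printed above (rows: true class, columns: predicted class)
cm = np.array([[88, 5],
               [16, 91]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are truly positive
recall = tp / (tp + fn)           # share of true positives that were detected

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.895 precision=0.948 recall=0.850
```

Recall is lower than accuracy here because 16 positive samples were predicted as negative.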

In [26]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))

Accuracy: 90.38 %


In [32]:
# Wrap features and labels in XGBoost's DMatrix format
data_dmatrix = xgb.DMatrix(data=X, label=y)
params = {"objective": "binary:logistic", "colsample_bytree": 0.3, "learning_rate": 0.1,
          "max_depth": 5, "alpha": 10}
# 3-fold cross-validation with early stopping on the classification error
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=5000, early_stopping_rounds=10, metrics="error",
                    as_pandas=True, seed=123)
cv_results

| | train-error-mean | train-error-std | test-error-mean | test-error-std |
|---|---|---|---|---|
| 0 | 0.433992 | 0.031829 | 0.521021 | 0.017808 |
| 1 | 0.4205 | 0.047129 | 0.517011 | 0.018851 |
| 2 | 0.375987 | 0.058684 | 0.510999 | 0.024529 |
| 3 | 0.145994 | 0.035966 | 0.243972 | 0.056433 |
| 4 | 0.137992 | 0.026529 | 0.229975 | 0.05317 |
| 5 | 0.129495 | 0.014457 | 0.185979 | 0.019854 |
| 6 | 0.137996 | 0.008762 | 0.181981 | 0.017177 |
| 7 | 0.138496 | 0.00741 | 0.16999 | 0.011948 |
| 8 | 0.138996 | 0.008962 | 0.16999 | 0.010034 |
| 9 | 0.141498 | 0.005983 | 0.169993 | 0.011995 |
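
One common use of this table is to pick the number of boosting rounds for a final model. The sketch below copies the test-error-mean column from the table above and finds the round with the lowest value; breaking ties in favor of the earliest round is a choice made here for illustration.

```python
import numpy as np

# test-error-mean column copied from the table above, one value per boosting round
test_error_mean = np.array([
    0.521021, 0.517011, 0.510999, 0.243972, 0.229975,
    0.185979, 0.181981, 0.169990, 0.169990, 0.169993,
])

best_round = int(np.argmin(test_error_mean))   # zero-based index of the lowest error
best_error = float(test_error_mean[best_round])
n_rounds = best_round + 1                      # number of trees to train

print(best_round, best_error, n_rounds)
```

With the real DataFrame, `cv_results["test-error-mean"].idxmin()` gives the same zero-based index.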


xgb.DMatrix is XGBoost's internal data structure. It holds the features and labels in a format optimized for memory efficiency and training speed.

data=X: The feature matrix X.
label=y: The target labels y.

Beyond features and labels, a DMatrix also supports:

- Handling missing values (NaN entries are treated as missing by default).
- Storing additional per-row information such as weights and base margin.

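The params dictionary contains the hyperparameters for the XGBoost model: objective="binary:logistic" trains a binary classifier that outputs probabilities, colsample_bytree=0.3 samples 30% of the columns for each tree, learning_rate=0.1 shrinks the contribution of each new tree, max_depth=5 caps the depth of each tree, and alpha=10 applies L1 regularization to the leaf weights.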

xgb.cv performs cross-validation to evaluate the model's performance and tune the number of boosting rounds.

dtrain=data_dmatrix: The data to be used for cross-validation.

params=params: The model hyperparameters.

nfold=3: Number of folds for cross-validation.

The data is split into 3 parts. Each part serves once as the validation set while the other 2 are used for training, and the metric is averaged over the 3 runs.
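
To make the fold rotation concrete, the sketch below splits nine hypothetical sample indices into 3 folds with scikit-learn's KFold. This is an illustration only: xgb.cv creates its own folds internally, so the exact assignment of rows will differ.

```python
import numpy as np
from sklearn.model_selection import KFold

# Nine hypothetical sample indices, small enough to inspect by eye
indices = np.arange(9)

# Illustration only: xgb.cv builds its folds internally, not via this class
kf = KFold(n_splits=3, shuffle=True, random_state=123)

fold_sizes = []
seen_in_validation = []
for train_idx, val_idx in kf.split(indices):
    fold_sizes.append((len(train_idx), len(val_idx)))
    seen_in_validation.extend(val_idx.tolist())

print(fold_sizes)
print(sorted(seen_in_validation))
```

Each fold trains on 6 samples and validates on 3, and every sample appears in a validation set exactly once.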

num_boost_round=5000: Maximum number of boosting rounds (iterations).

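early_stopping_rounds=10: Stops training once the cross-validated metric (here the mean test error) has not improved for 10 consecutive rounds.
This limits overfitting and saves computation, since the maximum of 5000 rounds does not have to be run in full.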
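
The stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not XGBoost's implementation; the function name `early_stopping_round` and the error values are hypothetical.

```python
def early_stopping_round(scores, patience):
    """Return the index of the best round, stopping once `patience`
    consecutive rounds fail to improve on the lowest score seen so far."""
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(scores):
        if score < best_score:
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds in a row
    return best_round


# Hypothetical validation errors: improvement stalls after round 2
errors = [0.50, 0.30, 0.20, 0.21, 0.22, 0.20, 0.19]
print(early_stopping_round(errors, patience=3))   # 2
print(early_stopping_round(errors, patience=10))  # 6
```

The two calls show the trade-off: a small patience stops at round 2 and misses the later improvement at round 6, while a larger patience finds it at the cost of more rounds.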

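metrics="error": Evaluation metric.
"error" is the binary classification error rate: the fraction of samples whose predicted class differs from the true label, where a predicted probability above 0.5 counts as the positive class.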
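
As a worked example of this metric, the snippet below applies the 0.5 threshold to a handful of hypothetical probabilities and counts the mismatches. The arrays `probs` and `labels` are made up for illustration.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels for six samples
probs = np.array([0.10, 0.80, 0.65, 0.30, 0.55, 0.95])
labels = np.array([0, 1, 0, 0, 1, 1])

# Probabilities above 0.5 are counted as the positive class
predicted = (probs > 0.5).astype(int)

# Error rate = wrong predictions / all predictions
error = float(np.mean(predicted != labels))
print(predicted.tolist(), error)
```

Only the third sample is misclassified, so the error is 1/6. Accuracy is simply 1 minus this value.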

as_pandas=True: Returns results as a pandas DataFrame.

seed=123: Sets a random seed for reproducibility.

In [14]:
#END