<a href="https://colab.research.google.com/github/audrey-siqueira/Data-Science-Projects/blob/master/XGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Breast Cancer Prediction using XGBoost**
---
<p align="justify">
Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Breast cancer is usually investigated when an abnormal lump is found (through self-examination or an X-ray) or a tiny speck of calcium is seen on an X-ray. After a suspicious lump is found, the doctor runs further tests to determine whether it is cancerous and, if so, whether it has spread to other parts of the body.

![](https://drive.google.com/uc?id=1ttUyHZU7S9_eMwmXUEovAubOG_7raTW7)

<p align="justify">
The goal in this case is to use machine learning to predict whether a patient's tumor is benign or malignant. A dataset of previously diagnosed patients was provided for this task; it records the measured parameters of each tumor, such as the uniformity of cell size and shape.
<p align="justify">

The proposed model is **XGBoost**, using the tumor parameters as independent variables and the diagnosis, **Malignant (4) or Benign (2)**, as the dependent variable.

<p align=center>
<img src="https://drive.google.com/uc?export=view&id=1ttUyHZU7S9_eMwmXUEovAubOG_7raTW7" width="60%"></p>

<p align="justify">
The dataset was divided into 2 parts. The Training set is used to fit the model to the parameter values. The Test set is used to compare the model's predictions with the true labels and measure the accuracy of the fitted model.




**The code is described step by step below:**

## **Importing the libraries**


The 3 libraries needed for the project are imported.
- Pandas for data manipulation and analysis
- Numpy for mathematical operations
- Matplotlib for graphical visualizations

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## **Importing the data set**


The .csv file containing the dataset is imported.

The column headers and the first five rows can be viewed in the table below:

In [None]:
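# Load the dataset (the path is specific to the author's Google Drive)
dataset = pd.read_csv('/content/drive/My Drive/Colab Notebooks/10-XGBoost/XGBoost/Data.csv')
# Features: every column except the last (this includes the 'Sample code number' ID)
X = dataset.iloc[:, :-1].values
# Target: the last column, 'Class' (2 = benign, 4 = malignant)
Y = dataset.iloc[:, -1].values
dataset.head()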

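| | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |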
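
Note that `dataset.iloc[:, :-1]` keeps **Sample code number** among the features, although it is an identifier rather than a measurement. The sketch below shows one possible variant that drops it by name. It rebuilds a small DataFrame from the five rows shown above as a stand-in for `Data.csv`; the names `sample`, `features` and `target` are hypothetical, and the results reported later in this notebook were obtained with the ID column included, so they would not necessarily carry over.

```python
import io
import pandas as pd

# The five rows printed by dataset.head(), used here as a stand-in for Data.csv
csv_text = """Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Hypothetical variant: keep only the nine measurements as features
features = sample.drop(columns=["Sample code number", "Class"])
target = sample["Class"]

print(features.shape)
print(list(features.columns))
```

Selecting columns by name also protects against silent errors if the column order of the file ever changes.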


## **Splitting the dataset into the Training set and Test set**

The `train_test_split` function from the ***sklearn.model_selection*** module was used to divide the dataset into a Training set and a Test set.

The features (X) and the labels (Y) were each split, resulting in 4 final arrays:
**X_train**, **X_test** and **Y_train**, **Y_test**

80% of the samples were used for Training and 20% for Testing.

In [None]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state fixes the shuffle for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
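
As a rough check of what an 80/20 split produces, the sketch below runs the same call on placeholder arrays. It assumes the dataset has 683 rows and 10 feature columns; the row count is not stated in this notebook, so treat it as an assumption. The arrays `X_demo` and `Y_demo` contain dummy values, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: 683 rows, 10 feature columns; placeholder values, not the real data
n_rows = 683
X_demo = np.zeros((n_rows, 10))
Y_demo = np.array([2, 4] * (n_rows // 2) + [2])  # dummy labels in the 2/4 encoding

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=0
)

print("train:", X_tr.shape, "test:", X_te.shape)
```

Under that assumption the Test set holds 137 samples, which matches the total of the confusion matrix reported further down. Passing `stratify=Y` to `train_test_split` is an option when the class proportions should be preserved in both parts.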

## **Fitting XGBoost to the Training Set**

The **xgboost** library was used to apply the **XGBoost** method.

Using the **Training set** of X and Y values, a gradient-boosted ensemble of decision trees is fitted.

In [None]:
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, Y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
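
The parameter listing above comes from an older xgboost release. Recent releases of `XGBClassifier` expect class labels to be consecutive integers starting at 0, so fitting directly on the labels 2 and 4 is likely to raise an error there. A minimal sketch of one way to handle this, using a small placeholder label array (`Y_raw` is hypothetical, not the notebook's `Y_train`):

```python
import numpy as np

# Placeholder labels in the dataset's encoding (2 = benign, 4 = malignant)
Y_raw = np.array([2, 2, 4, 2, 4])

# Map 2 -> 0 and 4 -> 1 before calling classifier.fit
Y_encoded = np.where(Y_raw == 4, 1, 0)

# Map predictions back to the original codes for reporting
Y_decoded = np.where(Y_encoded == 1, 4, 2)

print(Y_encoded)
print(Y_decoded)
```

scikit-learn's `LabelEncoder` offers the same mapping with `fit_transform` and `inverse_transform` if a reusable object is preferred.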

## **Making the Confusion Matrix**

The confusion matrix counts the correct and incorrect predictions for each class on the Test set, and the accuracy score gives the overall fraction of correct predictions.

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(Y_test, y_pred)
print(cm)
accuracy_score(Y_test, y_pred)


[[84  3]
 [ 0 50]]


0.9781021897810219
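
The matrix above can be read by hand. scikit-learn puts the true classes in rows and the predicted classes in columns, in sorted label order, so the first row and column correspond to class 2 (benign) and the second to class 4 (malignant). The sketch below recomputes the accuracy from the printed counts and derives a few related metrics; treating malignant as the "positive" class is a choice made here for illustration.

```python
# Confusion matrix printed above: rows are true classes, columns are predicted
# classes, in sorted label order (2 = benign, 4 = malignant)
cm = [[84, 3],
      [0, 50]]

tn, fp = cm[0]  # benign cases: predicted benign, wrongly predicted malignant
fn, tp = cm[1]  # malignant cases: missed, correctly predicted malignant

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
sensitivity = tp / (tp + fn)  # share of malignant tumors that were detected
specificity = tn / (tn + fp)  # share of benign tumors that were recognized
precision = tp / (tp + fp)    # share of malignant predictions that were correct

print(f"samples     = {total}")
print(f"accuracy    = {accuracy:.4f}")
print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
print(f"precision   = {precision:.4f}")
```

All three errors are benign tumors predicted as malignant; no malignant tumor in the Test set was missed.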

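## **Applying k-Fold Cross Validation**

k-fold cross validation gives a more robust accuracy estimate than a single split. The Training set is divided into 10 folds; the model is trained on 9 of them and evaluated on the remaining one, and this is repeated until each fold has served once as the validation set. The mean and standard deviation of the 10 accuracies are then reported.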

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = Y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 96.53 %
Standard Deviation: 2.07 %
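
To make the procedure concrete, the sketch below writes out the loop that `cross_val_score` performs when `cv` is an integer and the estimator is a classifier (stratified folds, a fresh copy of the model per fold, accuracy as the score). It uses synthetic data from `make_classification` and a `DecisionTreeClassifier` as stand-ins, both assumptions chosen so the example runs without the notebook's data or xgboost; the numbers it prints are unrelated to the results above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and a stand-in classifier (not the notebook's)
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# Explicit loop: train on 9 folds, score on the held-out fold, repeat 10 times
folds = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The one-line equivalent used in the notebook
library_scores = cross_val_score(estimator=model, X=X_demo, y=y_demo, cv=10)

print("Accuracy: {:.2f} %".format(manual_scores.mean() * 100))
print("Standard Deviation: {:.2f} %".format(manual_scores.std() * 100))
print("Matches cross_val_score:", np.allclose(manual_scores, library_scores))
```

A small standard deviation across folds suggests that the accuracy does not depend heavily on which samples happen to land in the validation fold.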


## **Predicting a new result**

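Predicting the class of a single sample. The input must be a 2D array with the same feature columns, in the same order, as the Training set; here the first row of the Test set is used as an example. The model returns 2 (benign) or 4 (malignant).

In [None]:
print(classifier.predict(X_test[:1]))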


## **Conclusion**

The XGBoost model classified the tumors well, reaching an accuracy of about 97.8% on the Test set and 96.53% (standard deviation 2.07%) in 10-fold cross validation.