# Breast Cancer Classification

## Project Overview

This project builds a classification model to predict whether a breast tumor is benign or malignant, based on features extracted from a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains real-valued features that describe characteristics of the cell nuclei present in the image.

The dataset used in this project is the **Breast Cancer Wisconsin (Diagnostic) Dataset**, which is publicly available on the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29). The project involves data preprocessing, feature analysis, and building a machine learning model to accurately classify the tumors.

## Data Source

This dataset is freely available on Kaggle at the following link:
> [https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data/data](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data/data)

## Dataset Description

The dataset contains the following attributes:

1. **ID number**: Unique identifier for each sample.
2. **Diagnosis**: Target variable (M = malignant, B = benign).
3. **Features**: There are 30 real-valued features computed for each cell nucleus in the image. These features include:
   - **Radius** (mean of distances from center to points on the perimeter)
   - **Texture** (standard deviation of gray-scale values)
   - **Perimeter**
   - **Area**
   - **Smoothness** (local variation in radius lengths)
   - **Compactness** (perimeter² / area - 1.0)
   - **Concavity** (severity of concave portions of the contour)
   - **Concave points** (number of concave portions of the contour)
   - **Symmetry**
   - **Fractal dimension** ("coastline approximation" - 1)
   
   The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in a total of 30 features. For instance:
   - Field 3 is Mean Radius.
   - Field 13 is Radius Standard Error (SE).
   - Field 23 is Worst Radius.

- **Number of Instances**: 569
- **Number of Attributes**: 32 (ID number, diagnosis, and 30 real-valued features)
- **Missing Values**: None
- **Class Distribution**: 357 benign, 212 malignant
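
The field layout described above (10 base measurements, each reported as mean, standard error, and worst) can be sketched in a few lines. This is an illustrative reconstruction rather than code from the project: the `_mean` and `_worst` suffixes follow the column names visible in the `df.head()` output in the Load Data section, while the `_se` suffix and the helper name `field_number` are assumptions.

```python
# Ten base measurements, in the order listed in the dataset description.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Three statistics per measurement; "se" is an assumed suffix for standard error.
STATISTICS = ["mean", "se", "worst"]

# Columns are grouped by statistic: all means, then all SEs, then all worsts.
feature_columns = [
    f"{feature}_{statistic}"
    for statistic in STATISTICS
    for feature in BASE_FEATURES
]


def field_number(feature: str, statistic: str) -> int:
    """Return the 1-based field number (fields 1 and 2 are ID and diagnosis)."""
    return 3 + 10 * STATISTICS.index(statistic) + BASE_FEATURES.index(feature)


print(len(feature_columns))             # 30
print(field_number("radius", "mean"))   # 3
print(field_number("radius", "se"))     # 13
print(field_number("radius", "worst"))  # 23
```

The three printed field numbers match the examples given above (fields 3, 13, and 23 for radius).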

## Attribute Information

1. **ID number**
2. **Diagnosis**: (M = malignant, B = benign)
3. **Ten real-valued features** are computed for each cell nucleus:
   - **Radius** (mean of distances from center to points on the perimeter)
   - **Texture** (standard deviation of gray-scale values)
   - **Perimeter**: Perimeter of the cell nucleus.
   - **Area**: Area of the cell nucleus.
   - **Smoothness** (local variation in radius lengths)
   - **Compactness** (perimeter² / area - 1.0)
   - **Concavity** (severity of concave portions of the contour)
   - **Concave points** (number of concave portions of the contour)
   - **Symmetry**
   - **Fractal dimension** ("coastline approximation" - 1)


## Problem Statement

- **Model Training**: Train the model on the cleaned data to classify whether a tumor is malignant or benign.
- **Model Evaluation**: Evaluate the performance of the trained model using several evaluation metrics: accuracy, precision, recall, and F1 score.


### Load Libraries

In [10]:
# General

import pandas as pd
import numpy as np
import os
import warnings

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Model and Evaluation Metrics
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Model Optimization
from sklearn.model_selection import GridSearchCV

### Settings

In [2]:
# Warnings
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
csv_path = os.path.join(data_path, "data_cleaned.csv")

### Load Data

In [3]:
df = pd.read_csv(csv_path)

In [4]:
# Check Data
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


### Preprocessing

In [5]:
# Separate input and output features
# Note: X still contains the "Unnamed: 0" index column shown in the output above
X = df.drop("diagnosis", axis=1)
y = df["diagnosis"]

In [6]:
# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
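# Standardize the features so they share a common scale (the scaler is fit on the training data only)
# Note: train_evaluate below fits on the unscaled X_train / X_test, so X_train_s and X_test_s go unused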
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

### Model Training and Evaluation

In [15]:
# Define a function to train and evaluate a model
def train_evaluate(model):
    # Train the model
    model.fit(X_train, y_train)

    # Make Prediction with trained model
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Print Evaluation metrics for training and testing data
    print("=" * 60)
    print("EVALUATION METRICS FOR TRAINING")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_train, y_train_pred)}")
    print(f"Precision: {precision_score(y_train, y_train_pred)}")
    print(f"Recall: {recall_score(y_train, y_train_pred)}")
    print(f"F1: {f1_score(y_train, y_train_pred)}")
    print("=" * 60)
    print("EVALUATION METRICS FOR TESTING")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_test, y_test_pred)}")
    print(f"Precision: {precision_score(y_test, y_test_pred)}")
    print(f"Recall: {recall_score(y_test, y_test_pred)}")
    print(f"F1: {f1_score(y_test, y_test_pred)}")

In [16]:
# Try XGBoost Classifier

# Define model
xgbc = XGBClassifier()
train_evaluate(xgbc)

EVALUATION METRICS FOR TRAINING
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1: 1.0
EVALUATION METRICS FOR TESTING
Accuracy: 0.956140350877193
Precision: 0.9523809523809523
Recall: 0.9302325581395349
F1: 0.9411764705882352


### Insights

The performance of the XGBoost classifier on this dataset shows that the model is performing very well but might be slightly **overfitting**. Let's discuss the metrics:

#### Training Metrics (Perfect 1.0 scores):

- **Accuracy (1.0)**: The model correctly classifies all the training data points.
- **Precision (1.0)**: All predicted positive instances (breast cancer cases) in the training data are actually positive.
- **Recall (1.0)**: The model finds all the positive instances in the training data, meaning it detects all breast cancer cases.
- **F1 Score (1.0)**: Since both precision and recall are 1.0, the F1 score (harmonic mean of precision and recall) is also 1.0.

Perfect scores on the training data suggest that the model may have overfitted to the training set, since it has learned to classify every training sample correctly.

#### Testing Metrics (Near-perfect but slightly lower):

- **Accuracy (0.9561)**: The model correctly classifies **95.6%** of the test data points. This is still very good but lower than the perfect training accuracy.
- **Precision (0.9524)**: About **95.2%** of the breast cancer cases predicted by the model are actual cases, showing the model is still making highly reliable predictions.
- **Recall (0.9302)**: The model identifies **93%** of actual breast cancer cases in the test data. Some cases are missed, but the recall is still very strong.
- **F1 Score (0.9412)**: This indicates a good balance between precision and recall, although not perfect like in the training set.
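
To make these four numbers concrete, the sketch below computes them from confusion-matrix counts. The counts used (40 true positives, 2 false positives, 3 false negatives, 69 true negatives) are not printed by the notebook; they are inferred here from the reported test metrics, assuming the 20% split of 569 samples yields 114 test rows and that malignant is the positive class. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Inferred counts (an assumption, see the note above), not notebook output.
test_metrics = classification_metrics(tp=40, fp=2, fn=3, tn=69)

for name, value in test_metrics.items():
    print(f"{name}: {value:.4f}")
```

With these assumed counts the sketch reproduces the reported test metrics (0.9561, 0.9524, 0.9302, 0.9412). Read this way, the untuned model misclassifies 5 of 114 test rows: 2 benign samples flagged as malignant and 3 malignant samples missed.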



### Model Optimization

In [17]:
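# Define a function to tune hyperparameters
def tune_hyperparameter(model, param_grid):
    # Define the grid search (5-fold CV; a classifier's default score is accuracy)
    gscv = GridSearchCV(
        model,
        param_grid=param_grid,
        cv=5,
        verbose=2
    )
    # Run the grid search
    # Note: this fits on the full dataset (X, y), so the held-out test rows
    # also influence which hyperparameters are selected
    gscv.fit(X, y)

    # Print the best mean cross-validated score
    print(f"Best Score:{gscv.best_score_}")
    # Return the best hyperparameter set
    best_params = gscv.best_params_
    return best_params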

In [24]:
# Define hyperparameter dictionary for XGBClassifier
param_dict = {
    "n_estimators": [100],
    "max_depth": [3, 4],
    "min_child_weight": [3, 4],
    "colsample_bytree": [0.5, 1.0],
    "alpha": [0, 1],
    # "labmda" looks like a misspelling of "lambda" (L2 regularization);
    # it is kept as run, because the log below records this spelling
    "labmda": [0, 1],
    "gamma": [0, 0.1, 1.0]
}

# Define XGBoost classifier
xgbr_ht = XGBClassifier()

# Hyperparameter tuning to get the best hyperparameters
best_params = tune_hyperparameter(xgbr_ht, param_dict)
print(best_params)

Fitting 5 folds for each of 96 candidates, totalling 480 fits
[CV] END alpha=0, colsample_bytree=0.5, gamma=0, labmda=0, max_depth=3, min_child_weight=3, n_estimators=100; total time=   0.3s
[CV] END alpha=0, colsample_bytree=0.5, gamma=0, labmda=0, max_depth=3, min_child_weight=3, n_estimators=100; total time=   0.1s
[CV] END alpha=0, colsample_bytree=0.5, gamma=0, labmda=0, max_depth=3, min_child_weight=3, n_estimators=100; total time=   0.2s
[CV] END alpha=0, colsample_bytree=0.5, gamma=0, labmda=0, max_depth=3, min_child_weight=3, n_estimators=100; total time=   0.1s
[CV] END alpha=0, colsample_bytree=0.5, gamma=0, labmda=0, max_depth=3, min_child_weight=3, n_estimators=100; total time=   0.1s
[CV] END alpha=0, colsample_bytree=0.5, gamma=0, labmda=0, max_depth=3, min_child_weight=4, n_estimators=100; total time=   0.1s
[CV] END alpha=0, colsample_bytree=0.5, gamma=0, labmda=0, max_depth=3, min_child_weight=4, n_estimators=100; total time=   0.1s
[CV] END alpha=0, colsample_bytree=
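
The first log line reports 96 candidates and 480 fits. A quick way to sanity-check a grid before launching a search is to count its combinations. The sketch below does so with the standard library only, copying the value lists from `param_dict` above (the `labmda` key is spelled as it appears in the notebook); the variable names `n_candidates`, `total_fits`, and `all_candidates` are illustrative.

```python
from itertools import product
from math import prod

# Value lists copied from the notebook's param_dict.
param_dict = {
    "n_estimators": [100],
    "max_depth": [3, 4],
    "min_child_weight": [3, 4],
    "colsample_bytree": [0.5, 1.0],
    "alpha": [0, 1],
    "labmda": [0, 1],  # key spelled as in the notebook
    "gamma": [0, 0.1, 1.0],
}
cv_folds = 5

# The grid size is the product of the list lengths.
n_candidates = prod(len(values) for values in param_dict.values())
total_fits = n_candidates * cv_folds

# Enumerate the combinations explicitly to confirm the count.
all_candidates = [
    dict(zip(param_dict, combo)) for combo in product(*param_dict.values())
]

print(n_candidates, total_fits)  # 96 480
```

Because the count multiplies across parameters, adding one more two-valued parameter would double the number of fits to 960.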

In [25]:
# Train with best parameter set
model = XGBClassifier(**best_params)
train_evaluate(model)

EVALUATION METRICS FOR TRAINING
Accuracy: 0.9978021978021978
Precision: 1.0
Recall: 0.9940828402366864
F1: 0.9970326409495549
EVALUATION METRICS FOR TESTING
Accuracy: 0.9736842105263158
Precision: 0.9761904761904762
Recall: 0.9534883720930233
F1: 0.9647058823529412


### Conclusion

After hyperparameter tuning, the XGBoost classifier's performance on this dataset has improved, particularly in terms of its generalization to the test set. Let's discuss the metrics:

#### Training Metrics:

- **Accuracy (0.9978)**: The model correctly classifies about **99.78%** of the training data points, slightly below the previous perfect accuracy of **1.0**. This small drop suggests the model is memorizing the training data less and learning more general patterns.
- **Precision (1.0)**: Every training sample predicted as positive (a breast cancer case) is actually positive, as before.
- **Recall (0.9941)**: The model identifies **99.41%** of actual breast cancer cases in the training data, slightly lower than before but still excellent.
- **F1 Score (0.9970)**: The F1 score is close to **1.0**, indicating a very strong balance between precision and recall, with the model performing exceptionally well on the training set.

#### Testing Metrics (Improved performance):

- **Accuracy (0.9737)**: The model correctly classifies **97.37%** of the test data, an improvement from the previous test accuracy of **95.61%**. This shows better generalization after hyperparameter tuning.
- **Precision (0.9762)**: About **97.62%** of the breast cancer cases predicted by the model are actual cases, a slight improvement from the previous precision of 95.24%. This indicates fewer false positives.
- **Recall (0.9535)**: The model identifies **95.35%** of actual breast cancer cases in the test data, a notable improvement from the previous recall of **93.02%**. This shows that the model is detecting more true positives (fewer false negatives) in the test set.
- **F1 Score (0.9647)**: The F1 score, which balances precision and recall, is **96.47%**, indicating that the model performs well on the test data and has a good trade-off between precision and recall.

#### Summary

- **Generalization Improved**: After hyperparameter tuning, the model's test performance improved on all four metrics (accuracy, precision, recall, and F1 score), and the gap between training and test scores narrowed. One caveat: the grid search was fit on the full dataset (`X`, `y`), test rows included, so these test scores are likely somewhat optimistic.

- **Reduced Overfitting**: The slight reductions in training accuracy **(from 1.0 to 0.9978)** and training recall **(from 1.0 to 0.9941)** suggest that the model is no longer overfitting as much. It’s learning more general patterns rather than memorizing the training data.