# Breast Cancer Classification

## Project Overview

This project involves building a classification model to predict whether a breast cancer tumor is benign or malignant based on features extracted from a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains real-valued features that describe characteristics of the cell nuclei present in the image. The classification task involves predicting the diagnosis (benign or malignant) using these features.

The dataset used in this project is the **Breast Cancer Wisconsin (Diagnostic) Dataset**, which is publicly available on the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29). The project involves data preprocessing, feature analysis, and building a machine learning model to accurately classify the tumors.

## Data Source

This dataset is freely available in Kaggele in the following link:
> [https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data/data]

## Dataset Description

The dataset contains the following attributes:

1. **ID number**: Unique identifier for each sample.
2. **Diagnosis**: Target variable (M = malignant, B = benign).
3. **Features**: There are 30 real-valued features computed for each cell nucleus in the image. These features include:
   - **Radius** (mean of distances from center to points on the perimeter)
   - **Texture** (standard deviation of gray-scale values)
   - **Perimeter**
   - **Area**
   - **Smoothness** (local variation in radius lengths)
   - **Compactness** (perimeter² / area - 1.0)
   - **Concavity** (severity of concave portions of the contour)
   - **Concave points** (number of concave portions of the contour)
   - **Symmetry**
   - **Fractal dimension** ("coastline approximation" - 1)
   
   The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in a total of 30 features. For instance:
   - Field 3 is Mean Radius.
   - Field 13 is Radius Standard Error (SE).
   - Field 23 is Worst Radius.

- **Number of Instances**: 569
- **Number of Attributes**: 32 (ID number, diagnosis, and 30 real-valued features)
- **Missing Values**: None
- **Class Distribution**: 357 benign, 212 malignant

## Attribute Information

1. **ID number**
2. **Diagnosis**: (M = malignant, B = benign)
3. **Ten real-valued features** are computed for each cell nucleus:
   - **Radius** (mean of distances from center to points on the perimeter)
   - **Texture** (standard deviation of gray-scale values)
   - **Perimeter**: Perimeter of the tumor.
   - **Area**: Area of the tumor.
   - **Smoothness** (local variation in radius lengths)
   - **Compactness** (perimeter² / area - 1.0)
   - **Concavity** (severity of concave portions of the contour)
   - **Concave points** (number of concave portions of the contour)
   - **Symmetry**
   - **Fractal dimension** ("coastline approximation" - 1)


## Problem Statement

- **Model Training**: Train the model with the cleaned data to classify a tumor wheather it is malignant or benign.
- **Model Evaluation**: Evaluate the performance of the trained model using different evaluation metrics such as accuracy, precision, recall and F1 score.


### Load Libraries

In [2]:
# General

import pandas as pd
import numpy as np
import os
import warnings

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Model and Evaluation Metrics
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Model Optimization
from sklearn.model_selection import GridSearchCV

### Settings

In [3]:
# Warnings
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
csv_path = os.path.join(data_path, "data_cleaned.csv")

### Load Data

In [4]:
df = pd.read_csv(csv_path)

In [5]:
# Check Data
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


### Preprocessing

In [6]:
# Seperate Input and Output Features
X = df.drop("diagnosis", axis= 1)
y = df["diagnosis"]

In [7]:
# Split training and testing data
X_train,X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 42)

In [8]:
# Standardize the data to make them in same scale
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

### Model Training and Evaluation

In [9]:
# Define a function to train and evaluate a model
def train_evaluate(model):
    # Train the model
    model.fit(X_train, y_train)

    # Make Prediction with trained model
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Print Evaluation metrics for training and testing data
    print("=" * 60)
    print("EVALUATION METRICS FOR TRAINING")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_train, y_train_pred)}")
    print(f"Precision: {precision_score(y_train, y_train_pred)}")
    print(f"Recall: {recall_score(y_train, y_train_pred)}")
    print(f"F1: {f1_score(y_train, y_train_pred)}")
    print("=" * 60)
    print("EVALUATION METRICS FOR TESTING")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_test, y_test_pred)}")
    print(f"Precision: {precision_score(y_test, y_test_pred)}")
    print(f"Recall: {recall_score(y_test, y_test_pred)}")
    print(f"F1: {f1_score(y_test, y_test_pred)}")

In [10]:
# Try SVM Classifier

# Define model
svc = SVC()
train_evaluate(svc)

EVALUATION METRICS FOR TRAINING
Accuracy: 0.9142857142857143
Precision: 0.9642857142857143
Recall: 0.7988165680473372
F1: 0.8737864077669902
EVALUATION METRICS FOR TESTING
Accuracy: 0.9473684210526315
Precision: 1.0
Recall: 0.8604651162790697
F1: 0.9249999999999999


### Insights

The evaluation metrics for the SVM classifier on the Breast Cancer classification task show good overall performance, especially in terms of generalization to the testing data. Here’s a detailed analysis:

#### Training Metrics (Perfect 1.0 scores):

- **Accuracy (0.91)**: The model correctly classifies **91.4%** of the training data. This indicates that the model has learned the patterns in the training data well, but there is room for improvement, especially in recall.
- **Precision (0.96)**: The model is highly precise on the training data, correctly identifying **96.4%** of the patients predicted to have breast cancer as actual positives. This means the model is making very few false positive predictions.
- **Recall (0.80)**: The recall of **79.9%** indicates that the model is correctly identifying **80%** of the actual positive cases (patients with breast cancer). However, it is missing about **20%** of the true positive cases (false negatives), which is a concern for a health-related classification task.
- **F1 Score (0.87)**: The F1 score, which balances precision and recall, is **87.4%**, indicating solid performance. However, it also highlights that recall is pulling down the overall score compared to the very high precision.

#### Testing Metrics (Near-perfect but slightly lower):

- **Accuracy (0.95)**: The model correctly classifies **94.7%** of the testing data, a strong performance and an improvement over the training accuracy. This suggests the model generalizes well to unseen data.
- **Precision (1.0)**: Precision is perfect (**100%**) on the test data, meaning that all patients the model predicts as having breast cancer actually have the disease. There are zero false positives, which is excellent in medical diagnostics, where false positives can lead to unnecessary anxiety and additional tests.
- **Recall (0.86)**: The recall on the test set is **86%**, meaning the model correctly identifies **86%** of the actual positive cases. While this is an improvement over the training recall (**79.9%**), there is still room to increase recall so that fewer cases of actual breast cancer are missed.
- **F1 Score (0.92)**: The F1 score on the test data is **92.5%**, reflecting a well-balanced model that handles both precision and recall well.

#### Analysis of Performance:

- **High Precision, Lower Recall:** Both in the training and testing results, precision is significantly higher than recall. This indicates that while the model is very good at making correct positive predictions (few false positives), it is still missing a proportion of actual positive cases (some false negatives). In the context of breast cancer detection, high precision is critical because false positives can lead to unnecessary medical interventions. However, recall is equally important because failing to detect actual cases of breast cancer (false negatives) can have serious consequences.
- **Generalization:** The model’s testing performance is strong and even slightly better than the training performance. This suggests that the model generalizes well and is not overfitting. The improvement in testing accuracy (**0.95 vs. 0.91**) and recall (**0.86 vs. 0.80**) is somewhat unusual but can happen when the test set is simpler or more representative of the general population than the training set.
- **F1 Score Balance:** The F1 score, which balances precision and recall, is quite high in both training (**0.87**) and testing (**0.92**). The improvement in the test F1 score suggests that the model's trade-off between precision and recall is more favorable when generalized to unseen data.
- **Room for Recall Improvement:** Although the model performs well overall, the relatively lower recall, especially on the training data (**0.80**), indicates that there are still some missed cases of breast cancer (false negatives). Given the context of breast cancer classification, improving recall would likely be a priority.


### Model Optimization

In [11]:
# Define a function to tune hyperparameter
def tune_hyperparameter(model, param_grid):
    # Define Grid Search 
    gscv = GridSearchCV(
        model,
        param_grid= param_grid,
        cv = 5,
        verbose= 2
    )
    # Train with grid search
    gscv.fit(X, y)

    # Print Best Score
    print(f"Best Score:{gscv.best_score_}")
    # Get Best hyperparameter set
    best_params = gscv.best_params_
    return best_params

In [12]:
# Define hyperparameter dictionary for XGBoostRegressor
param_dict = {
    "C": [ 0.1, 1, 10],
    "kernel": [ 'linear', 'rbf'],
    "gamma": [0, 0.1, 1.0]
}

# Define SVM Classifier
svc_ht = SVC()

# Hyperpermeter tuning to get best hyperparameters
best_params = tune_hyperparameter(svc_ht, param_dict)
print(best_params)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV] END ......................C=0.1, gamma=0, kernel=linear; total time=   0.0s
[CV] END ......................C=0.1, gamma=0, kernel=linear; total time=   0.0s
[CV] END ......................C=0.1, gamma=0, kernel=linear; total time=   0.0s
[CV] END ......................C=0.1, gamma=0, kernel=linear; total time=   0.0s
[CV] END ......................C=0.1, gamma=0, kernel=linear; total time=   0.0s
[CV] END .........................C=0.1, gamma=0, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=0, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=0, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=0, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=0, kernel=rbf; total time=   0.0s
[CV] END ....................C=0.1, gamma=0.1, kernel=linear; total time=   0.1s
[CV] END ....................C=0.1, gamma=0.1, k

In [13]:
# Train with best parameter set
model = SVC(**best_params)
train_evaluate(model)

EVALUATION METRICS FOR TRAINING
Accuracy: 0.9714285714285714
Precision: 0.9814814814814815
Recall: 0.9408284023668639
F1: 0.9607250755287009
EVALUATION METRICS FOR TESTING
Accuracy: 0.956140350877193
Precision: 0.975
Recall: 0.9069767441860465
F1: 0.9397590361445783


### Conclusion

After hyperparameter tuning, the SVM classifier for Breast Cancer classification shows significant improvements across both training and testing metrics. Let's discuss the metrics,

#### Training Metrics:

- **Accuracy (0.97)**: The model correctly classifies **97.1%** of the training data, indicating a strong fit to the training set.
- **Precision (0.98)**: Precision is very high, meaning almost all predictions labeled as positive (patients with breast cancer) are correct. The model rarely makes false positive predictions.
- **Recall (0.94)**: Recall has improved significantly compared to the default configuration, indicating that the model is now correctly identifying **94%** of actual positive cases (patients with breast cancer), which is crucial for minimizing false negatives.
- **F1 Score (0.96)**: The F1 score, a balance between precision and recall, is very high, reflecting the model's strong overall performance in identifying positive cases with few false positives and false negatives.

#### Testing Metrics (Improved performance):

- **Accuracy (0.96)**: The model achieves **95.6%** accuracy on the test data, which is a very strong result and shows good generalization to unseen data.
- **Precision (0.98)**: Similar to the training set, the precision on the test set is very high, meaning the model is making highly reliable positive predictions with almost no false positives.
- **Recall (0.91)**: The recall on the test set is **90.7%**, showing that the model identifies most of the actual positive cases in the test set, though a small percentage of cases may still go undetected (false negatives).
**F1 Score (0.94)**: The F1 score on the test set is **0.94**, indicating that the model performs well in balancing precision and recall, even on unseen data.

#### Summary

- **Improved Recall:** Compared to the previous results, recall has improved on both the training and testing datasets **(from 0.80 to 0.94 in training and from 0.86 to 0.91 in testing)**. This means the model is now identifying a higher proportion of true positives, which is essential in breast cancer detection. Fewer cases are being missed.

- **Generalization to Unseen Data:** The testing performance closely mirrors the training performance, with only a slight drop in recall **(0.94 to 0.91)** and F1 score **(0.96 to 0.94)**. This suggests that the model generalizes well to new, unseen data, avoiding overfitting and maintaining high predictive power.