#  Breast Cancer Detection using Machine Learning
**Author:** Anushka Kandwal
**Goal:** Classify tumors as Benign or Malignant using ML models (Logistic Regression, Random Forest, SVM).  

**This project uses the Breast Cancer Wisconsin Diagnostic dataset from Kaggle. We train multiple models, evaluate them, and create an interactive dashboard for predictions.**

**Tech Stack:** Python | pandas | scikit-learn | matplotlib | seaborn  

**Result:** Achieved ~96-97% accuracy with strong recall for the Malignant class.

---
###  Getting Started in Google Colab
1. Upload this notebook to your Google Drive.
2. Open it with **Google Colab**.
3. (Optional) Mount your Drive if your dataset is in Drive:
   ```python
   from google.colab import drive
   drive.mount('/content/drive')
   ```
4. Update the CSV path if needed, then run all cells.

Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')


## Step 1: Create Project Directory
We first create a dedicated folder in Google Drive to store all project files, including dataset, models, and notebooks. This ensures everything is organized and easy to access.

In [None]:
import os

project_dir = "/content/drive/MyDrive/breast_cancer_project"
os.makedirs(project_dir, exist_ok=True)

print("Project directory created at:", project_dir)


## Step 2: Upload Kaggle API Key
We upload the `kaggle.json` API key file so that Kaggle datasets can be downloaded directly from Colab.


In [None]:
from google.colab import files
files.upload()  # Upload kaggle.json here when prompted


## Step 3: Configure Kaggle API
We rename and move the uploaded `kaggle.json` to the proper directory (`~/.kaggle`) so that the Kaggle API can access it.


In [None]:
# Move the uploaded key to ~/.kaggle/kaggle.json
# (if Colab renamed a repeated upload to "kaggle (1).json", use that name as the source)
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json

# Verify setup
!ls -l ~/.kaggle


## Step 4: Download and Extract Dataset from Kaggle
We navigate to our project directory in Google Drive, download the Breast Cancer Wisconsin dataset using the Kaggle API,
and unzip it directly into the project folder. This ensures the dataset is stored permanently in Drive for easy access.

In [None]:

%cd "$project_dir"

# Download the dataset
!kaggle datasets download -d uciml/breast-cancer-wisconsin-data

# Unzip it into your Drive folder
!unzip -o breast-cancer-wisconsin-data.zip -d "$project_dir"
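
If a Kaggle API key is not available, scikit-learn bundles a copy of the same Wisconsin Diagnostic data. The sketch below loads it as an alternative. Two differences are easy to miss: the bundled column names are spelled differently (for example `mean radius` rather than `radius_mean`), and the bundled target is encoded as 0 = malignant, 1 = benign, the opposite of the mapping used later in this notebook. The names `X_alt` and `y_alt` are illustrative and do not appear elsewhere in the notebook.

```python
from sklearn.datasets import load_breast_cancer

# Load the bundled copy of the Wisconsin Diagnostic dataset as pandas objects.
bunch = load_breast_cancer(as_frame=True)
X_alt = bunch.data            # 30 numeric feature columns
y_sklearn = bunch.target      # scikit-learn encoding: 0 = malignant, 1 = benign

# Flip the encoding so that 1 = Malignant and 0 = Benign, matching this notebook.
y_alt = 1 - y_sklearn

print("Features shape:", X_alt.shape)
print("Malignant cases:", int(y_alt.sum()))
```

With this route the `id` and `Unnamed: 32` columns never appear, so the column-dropping step further down would not apply.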


## Step 5: Load Dataset into Pandas
We load the Breast Cancer dataset from our Google Drive project folder into a Pandas DataFrame.
This allows us to perform exploratory data analysis (EDA) and preprocessing on the dataset.

In [None]:
import pandas as pd

data_path = os.path.join(project_dir, "data.csv")
df = pd.read_csv(data_path)

print("Dataset loaded successfully from Drive!")
print("Shape:", df.shape)
df.head()


## Step 6: Explore Dataset Structure
We check the dataset shape, column names, and general information to understand the data types and non-null counts of each column.


In [None]:

print("Shape:", df.shape)
print("\nColumn names:\n", df.columns.tolist())


print("\nData info:")
df.info()

# Quick look at first few rows
df.head()


## Step 7: Check for Missing Values and Duplicates
We verify data quality by checking for missing values in each column and identifying duplicate rows.
This helps ensure that our dataset is clean before preprocessing and model training.

In [None]:
# Check for missing values
print("Missing values per column:\n", df.isnull().sum())

# Check duplicates
print("\nDuplicates:", df.duplicated().sum())


## Step 8: Dataset Distribution
We count the Benign (B) and Malignant (M) cases in the dataset.
This shows whether the classes are balanced or imbalanced.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

print(df['diagnosis'].value_counts())

plt.figure(figsize=(5,4))
sns.countplot(x='diagnosis', data=df, palette='Set2')
plt.title("Class Distribution (Benign vs Malignant)")
plt.show()
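
Raw counts are easier to judge as proportions. The sketch below shows the arithmetic on a small, made-up list of labels (not the real data); `imbalance_ratio` is an illustrative name for the ratio of the largest class to the smallest.

```python
from collections import Counter

# Hypothetical labels standing in for df['diagnosis']; not the real data.
labels = ["B", "B", "B", "M", "B", "M", "B", "B", "M", "B"]

counts = Counter(labels)
total = sum(counts.values())

# Share of each class as a fraction of all samples.
proportions = {cls: n / total for cls, n in counts.items()}

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(proportions)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

On the actual DataFrame, `df['diagnosis'].value_counts(normalize=True)` returns the same proportions directly.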


## Step 9: Data Cleaning: Removing Irrelevant Columns

In [None]:
df.drop(columns=['id'], inplace=True)
if 'Unnamed: 32' in df.columns:
    df.drop(columns=['Unnamed: 32'], inplace=True)

print("Remaining columns:", len(df.columns))
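
A column such as `Unnamed: 32` usually arises when every line of a CSV ends with a trailing delimiter, so pandas reads one extra, empty field. The sketch below reproduces that on a toy CSV (made-up values) and shows a name-independent way to drop such columns with `dropna(axis=1, how="all")`. Whether this is exactly why the Kaggle file has the column is an assumption.

```python
import io
import pandas as pd

# Toy CSV whose lines end with a trailing comma (hypothetical data).
csv_text = "id,radius_mean,texture_mean,\n1,14.2,20.1,\n2,11.8,17.5,\n"

toy = pd.read_csv(io.StringIO(csv_text))
print(toy.columns.tolist())   # the empty trailing field becomes an 'Unnamed: N' column

# Drop every column that contains only missing values.
toy_clean = toy.dropna(axis=1, how="all")
print(toy_clean.columns.tolist())
```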


In [None]:
# Encode the target: Malignant (M) -> 1, Benign (B) -> 0
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
df['diagnosis'].value_counts()


### Feature Correlation Heatmap with Target Variable (`diagnosis`)

In [None]:
plt.figure(figsize=(10,8))
corr = df.corr(numeric_only=True)
sns.heatmap(corr[['diagnosis']].sort_values(by='diagnosis', ascending=False),
            annot=True, cmap='coolwarm')
plt.title("Feature Correlation with Diagnosis")
plt.show()


### Data Preparation: Train-Test Split and Feature Scaling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate features and target
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train size:", X_train.shape)
print("Test size:", X_test.shape)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Data scaled successfully.")
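
The scaler is fitted on the training split only and then reused on the test split, so no information from the test data leaks into preprocessing. The sketch below spells out the same arithmetic for a single feature in plain Python, using made-up values. `StandardScaler` uses the population standard deviation, which is what `pstdev` computes here.

```python
from statistics import fmean, pstdev

# Hypothetical values of a single feature; not taken from the dataset.
train_values = [10.0, 12.0, 14.0, 16.0, 18.0]
test_values = [11.0, 20.0]

# "Fit": learn the mean and population standard deviation from training data only.
mu = fmean(train_values)
sigma = pstdev(train_values)

# "Transform": apply the training statistics to both splits.
train_scaled = [(x - mu) / sigma for x in train_values]
test_scaled = [(x - mu) / sigma for x in test_values]

print("Train mean after scaling:", round(fmean(train_scaled), 6))
print("Test values after scaling:", [round(z, 3) for z in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction; the test values generally do not, which is expected.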


### Importing libraries

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns


### Model 1: Logistic Regression — Training and Evaluation

In [None]:
# Initialize and train
lr = LogisticRegression(random_state=42, max_iter=1000)
lr.fit(X_train_scaled, y_train)

# Predict
y_pred_lr = lr.predict(X_test_scaled)

# Evaluate
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("\nClassification Report:\n", classification_report(y_test, y_pred_lr))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Logistic Regression Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
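
The confusion matrix is where the recall for the Malignant class comes from. The sketch below reads the four cells in scikit-learn's layout (rows are actual classes, columns are predicted classes, in the order 0 then 1) and derives accuracy, recall, and precision. The counts are invented for illustration and are not results from this notebook.

```python
# Hypothetical 2x2 confusion matrix in scikit-learn's layout:
# rows = actual class, columns = predicted class, class order [0, 1].
cm_example = [
    [50, 2],   # actual Benign:    50 true negatives, 2 false positives
    [3, 45],   # actual Malignant:  3 false negatives, 45 true positives
]

tn, fp = cm_example[0]
fn, tp = cm_example[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_malignant = tp / (tp + fn)       # share of actual malignant cases caught
precision_malignant = tp / (tp + fp)    # share of malignant predictions that are right

print(f"Accuracy:  {accuracy:.3f}")
print(f"Recall:    {recall_malignant:.3f}")
print(f"Precision: {precision_malignant:.3f}")
```

In a screening setting a false negative (a missed malignant tumor) is typically the costlier error, which is why recall for class 1 is worth reading alongside accuracy.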


### Ensemble Learning with Random Forest for Breast Cancer Classification

In [None]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)

y_pred_rf = rf.predict(X_test_scaled)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))

cm = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm, annot=True, fmt="d", cmap="Greens")
plt.title("Random Forest Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


### Support Vector Machine (SVM) Classification with Performance Evaluation and Visualization

In [None]:
svm = SVC(kernel='rbf', probability=True, random_state=42)
svm.fit(X_train_scaled, y_train)

y_pred_svm = svm.predict(X_test_scaled)

print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("\nClassification Report:\n", classification_report(y_test, y_pred_svm))

cm = confusion_matrix(y_test, y_pred_svm)
sns.heatmap(cm, annot=True, fmt="d", cmap="Oranges")
plt.title("SVM Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


### ROC Curve Comparison for Logistic Regression, Random Forest, and SVM Models

In [None]:
plt.figure(figsize=(8,6))

models = {'Logistic Regression': lr, 'Random Forest': rf, 'SVM': svm}

for name, model in models.items():
    y_prob = model.predict_proba(X_test_scaled)[:,1]
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc = roc_auc_score(y_test, y_prob)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.2f})")

plt.plot([0,1], [0,1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison")
plt.legend()
plt.show()
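
Each point on a ROC curve is a (false positive rate, true positive rate) pair at one probability threshold, and the AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below computes both by hand on four made-up samples. `roc_point` and `pairwise_auc` are illustrative helpers, not scikit-learn functions.

```python
# Hypothetical labels (1 = Malignant) and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

def roc_point(y_true, y_score, threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print("ROC point at threshold 0.5:", roc_point(y_true, y_score, 0.5))
print("AUC:", pairwise_auc(y_true, y_score))
```

The pairwise loop is quadratic in the number of samples, so it is only meant to make the definition concrete; `roc_auc_score` is the practical choice.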


### For Model Saving

In [None]:
!pip install -q joblib

import joblib


### Selecting and Saving the Best Model (Random Forest)

In [None]:
best_model = rf  # Random Forest
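
The cell above fixes the choice by hand. One way to make the selection rule explicit is to rank the candidates by a chosen metric, as in the sketch below. The metric values are placeholders, not results from this notebook, and ranking by malignant recall first is an assumption about what matters most here.

```python
# Placeholder metrics for illustration only; substitute the values
# printed by the evaluation cells above.
results = {
    "Logistic Regression": {"recall_malignant": 0.90, "accuracy": 0.95},
    "Random Forest":       {"recall_malignant": 0.92, "accuracy": 0.95},
    "SVM":                 {"recall_malignant": 0.92, "accuracy": 0.94},
}

def selection_key(name):
    """Rank by malignant recall first, then by accuracy as a tie-breaker."""
    metrics = results[name]
    return (metrics["recall_malignant"], metrics["accuracy"])

best_name = max(results, key=selection_key)
print("Selected model:", best_name)
```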


In [None]:
model_path = os.path.join(project_dir, "best_model.pkl")
scaler_path = os.path.join(project_dir, "scaler.pkl")

# Save the trained model
joblib.dump(best_model, model_path)

# Save the scaler used for feature scaling
joblib.dump(scaler, scaler_path)

print("Model saved at:", model_path)
print("Scaler saved at:", scaler_path)


### Loading the Saved Model and Scaler and Verifying Predictions

In [None]:
# Load model & scaler from Drive
loaded_model = joblib.load(model_path)
loaded_scaler = joblib.load(scaler_path)

# Example: predict on first 5 test samples
X_sample = X_test.iloc[:5]
X_sample_scaled = loaded_scaler.transform(X_sample)
preds = loaded_model.predict(X_sample_scaled)
print("Predictions (0=Benign, 1=Malignant):", preds)


### Breast Cancer Prediction Function Using Trained Random Forest Model

In [None]:
def predict_breast_cancer(model, scaler, input_data):
    """
    Predict breast cancer based on input features.

    Parameters:
    - model: trained ML model (Random Forest)
    - scaler: fitted StandardScaler
    - input_data: pandas DataFrame with the same features as training data

    Returns:
    - predictions: list of 'Benign' or 'Malignant'
    """
    # Scale features
    input_scaled = scaler.transform(input_data)

    # Predict
    pred_numeric = model.predict(input_scaled)

    # Convert to human-readable
    pred_labels = ['Benign' if x==0 else 'Malignant' for x in pred_numeric]

    return pred_labels
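
The function assumes `input_data` has the same features, in the same order, as the training data. A small guard can make that assumption explicit before scaling. The sketch below shows the idea for a single row given as a dictionary; `align_features` is a hypothetical helper, and the three feature names are only an example subset.

```python
def align_features(row, feature_names):
    """Return the values of `row` ordered like `feature_names`.

    Raises ValueError if a feature is missing or an unexpected one is present.
    """
    missing = [f for f in feature_names if f not in row]
    extra = [f for f in row if f not in feature_names]
    if missing or extra:
        raise ValueError(f"missing features: {missing}; unexpected features: {extra}")
    return [float(row[f]) for f in feature_names]

# Hypothetical three-feature example; the real model expects all training columns.
expected_order = ["radius_mean", "texture_mean", "area_mean"]
user_row = {"area_mean": 650.0, "radius_mean": 14.1, "texture_mean": 19.3}

print(align_features(user_row, expected_order))
```

For a DataFrame, selecting `input_data[feature_names]` reorders the columns in the same way and raises a `KeyError` if one is missing.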


### Sample Predictions and Comparison with Actual Labels

In [None]:
# Select first 5 test samples
X_sample = X_test.iloc[:5]

# Make predictions
predictions = predict_breast_cancer(loaded_model, loaded_scaler, X_sample)
print("Predictions:", predictions)

# Compare with actual labels
print("Actual:", ['Malignant' if x==1 else 'Benign' for x in y_test.iloc[:5]])


### Interactive Widget Setup for User Input in Breast Cancer Prediction

In [None]:
import ipywidgets as widgets
from IPython.display import display
import pandas as pd

# Feature names in correct order
feature_names = X_train.columns.tolist()

# Pre-fill with mean values from training data
feature_means = X_train.mean()


### Creating Scrollable Input Form for Interactive Feature Entry

In [None]:
# Dictionary to hold widgets
feature_widgets = {}

# VBox children list
widget_list = []

for f in feature_names:
    w = widgets.FloatText(
        value=float(feature_means[f]),
        description=f,
        step=0.01,
        layout=widgets.Layout(width='350px')
    )
    feature_widgets[f] = w
    widget_list.append(w)

# Make scrollable container
input_form = widgets.VBox(widget_list, layout=widgets.Layout(
    height='500px', overflow_y='scroll', border='1px solid gray', padding='10px'
))
display(input_form)


### Interactive Breast Cancer Prediction with Real-Time Confidence Output

In [None]:
predict_button = widgets.Button(description="Predict Breast Cancer", button_style='success')
output = widgets.Output()

def on_predict_clicked(b):
    with output:
        output.clear_output()
        # Read input values in correct order
        input_data = {f: [feature_widgets[f].value] for f in feature_names}
        input_df = pd.DataFrame(input_data)

        # Predict
        pred_label = predict_breast_cancer(loaded_model, loaded_scaler, input_df)[0]
        prob = loaded_model.predict_proba(loaded_scaler.transform(input_df))[0][1]

        print(f"Prediction: {pred_label}")
        print(f"Confidence (Malignant probability): {prob*100:.2f}%")

predict_button.on_click(on_predict_clicked)
display(predict_button, output)

