# Breast Cancer Classification

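This notebook covers data loading, preprocessing, model training, and evaluation for the Breast Cancer Wisconsin dataset. We train six classification models (logistic regression, k-nearest neighbors, decision tree, random forest, support vector machine, and XGBoost) and compare their performance on a held-out test set.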

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
import pickle


In [2]:
# Load the dataset
data = pd.read_csv('../data/breast_cancer.csv')
data.head()
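
Before preprocessing, it can help to check the table's shape, missing values, and class balance. The sketch below is only an illustration: it assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) as a stand-in for `../data/breast_cancer.csv`, whose exact columns are not shown in this notebook. The names `frame`, `n_missing`, and `class_counts` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: the bundled dataset stands in for the CSV loaded above.
frame = load_breast_cancer(as_frame=True).frame

# In the bundled copy, target 0 means malignant and 1 means benign.
frame['diagnosis'] = frame['target'].map({0: 'M', 1: 'B'})

n_rows, n_cols = frame.shape
n_missing = int(frame.isna().sum().sum())           # total missing cells
class_counts = frame['diagnosis'].value_counts()    # rows per class

print(f"rows={n_rows}, columns={n_cols}, missing values={n_missing}")
print(class_counts)
```

If the CSV contains extra columns (for example an identifier), the same checks should make them visible before they reach the models.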

In [3]:
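# Data preprocessing
# Separate the features from the target column
X = data.drop('diagnosis', axis=1)
y = data['diagnosis'].map({'M': 1, 'B': 0})  # M: malignant, B: benign

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)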
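
The split above does not stratify on the label. Since the two classes are not equally frequent, one possible variant passes `stratify=y` so that both splits keep roughly the same malignant share. This is a sketch under assumptions, not the notebook's method: it uses scikit-learn's bundled dataset in place of the CSV, and `X_demo`, `y_demo`, `train_rate`, and `test_rate` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: the bundled dataset stands in for ../data/breast_cancer.csv.
bunch = load_breast_cancer(as_frame=True)
X_demo = bunch.data
# The bundled target uses 0 = malignant, 1 = benign;
# flip it to match the notebook's mapping (M -> 1, B -> 0).
y_demo = 1 - bunch.target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of malignant cases in each split; stratification keeps them close.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"malignant share: train={train_rate:.3f}, test={test_rate:.3f}")
```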


In [4]:
# Train each model on the scaled training set and evaluate it on the test set
models = {
    'Logistic Regression': LogisticRegression(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Support Vector Machine': SVC(),
    'XGBoost': XGBClassifier(random_state=42)
}

# Map each model name to its classification report (a nested dict of metrics)
results = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[model_name] = classification_report(y_test, y_pred, output_dict=True)
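
`confusion_matrix` is imported at the top of the notebook but not used in the loop above. A minimal sketch of how it could complement the classification report follows, shown for a single model. It assumes scikit-learn's bundled dataset as a stand-in for the CSV, and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: bundled dataset in place of ../data/breast_cancer.csv.
X_raw, y_raw = load_breast_cancer(return_X_y=True)
y_demo = 1 - y_raw  # flip so that 1 = malignant, 0 = benign, as in the notebook

X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_demo, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)

# Rows are true labels, columns are predicted labels, both ordered [0, 1].
cm = confusion_matrix(y_te, model.predict(X_te), labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"missed malignant cases (false negatives): {fn}")
```

For a diagnostic task, the false-negative count is often the number worth watching, since accuracy alone does not show which kind of error a model makes.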


In [5]:
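# Save the model chosen for deployment (Random Forest here, as an example;
# swap in whichever model scores best in the metrics below)
best_model = models['Random Forest']
# Note: the fitted scaler is not saved, so new inputs must be scaled the same way before predicting
with open('../models/best_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)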
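
The cell above pickles only the classifier, so the fitted `StandardScaler` would have to be recreated separately at prediction time. One possible alternative, sketched here under assumptions, bundles the scaler and the model in a scikit-learn `Pipeline` and pickles that single object. The bundled dataset stands in for the CSV, and the round trip goes through memory (`pickle.dumps`) rather than the notebook's `../models/` path.

```python
import pickle

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled dataset in place of ../data/breast_cancer.csv.
X_raw, y_raw = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.2, random_state=42)

# The pipeline scales the inputs and then classifies them, on unscaled data in and out.
pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42))
pipeline.fit(X_tr, y_tr)

# Serialize and restore the whole pipeline, scaler included.
payload = pickle.dumps(pipeline)
restored = pickle.loads(payload)

same_predictions = np.array_equal(pipeline.predict(X_te), restored.predict(X_te))
print(f"restored pipeline reproduces predictions: {same_predictions}")
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.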


In [6]:
# Display the test-set accuracy of each model
for model_name, metrics in results.items():
    print(f"{model_name}: accuracy = {metrics['accuracy']:.4f}")
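
The loop above prints accuracy only, although each report produced with `output_dict=True` also holds per-class precision, recall, and F1. A sketch of collecting these into one sorted table follows. It is an illustration under assumptions: it uses the bundled dataset, only two of the six models, and the hypothetical names `demo_models`, `demo_results`, and `summary`.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: bundled dataset in place of ../data/breast_cancer.csv.
X_raw, y_raw = load_breast_cancer(return_X_y=True)
y_demo = 1 - y_raw  # flip so that 1 = malignant, 0 = benign, as in the notebook

X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_demo, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

demo_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'K-Nearest Neighbors': KNeighborsClassifier(),
}
demo_results = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    demo_results[name] = classification_report(y_te, model.predict(X_te), output_dict=True)

# One row per model; the '1' key holds the metrics for the malignant class.
rows = [
    {
        'model': name,
        'accuracy': report['accuracy'],
        'malignant_precision': report['1']['precision'],
        'malignant_recall': report['1']['recall'],
        'malignant_f1': report['1']['f1-score'],
    }
    for name, report in demo_results.items()
]
summary = pd.DataFrame(rows).set_index('model').sort_values('accuracy', ascending=False)
print(summary.round(3))
```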


## Conclusion

In this notebook, we trained six classification models on the Breast Cancer Wisconsin dataset, compared their accuracy on a held-out test set, and saved a trained model to `../models/best_model.pkl` for deployment.