
# Breast Cancer ML Project

This notebook performs the following steps:
1. **Exploratory Data Analysis (EDA)**
2. **Feature Engineering**
3. **Model Training**
4. **Model Selection**

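The dataset is the Breast Cancer Wisconsin (Diagnostic) dataset bundled with scikit-learn (`sklearn.datasets.load_breast_cancer`): 569 samples, 30 numeric features, and a binary target (0 = malignant, 1 = benign).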


In [None]:

# Load required libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Display the first few rows of the dataset
df.head()
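
Before moving on, it can help to confirm how the target is encoded and how balanced the two classes are. The following is a minimal sketch that reloads the dataset so it runs on its own; the names `label_names`, `labels`, and `class_counts` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Map the integer labels to their names (0 -> malignant, 1 -> benign)
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
labels = pd.Series(data.target).map(label_names)

# Count samples per class to check for imbalance
class_counts = labels.value_counts()
print(class_counts)
```

A moderately uneven class ratio like this one is a reason to look at per-class metrics later, not only overall accuracy.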



## 1. Exploratory Data Analysis (EDA)

In this section, we will:
- Understand the data structure.
- Check for missing values.
- Explore feature distributions and correlations.


In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

# Check for missing values
print("Missing values in each column:\n", df.isnull().sum())

# Summary statistics
print("\nSummary statistics:\n", df.describe())

# Pairplot for selected features
sns.pairplot(df, vars=df.columns[:5], hue="target", diag_kind="kde")
plt.show()

# Correlation heatmap (31 x 31 cells, so use a large figure and small annotations)
plt.figure(figsize=(20, 16))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm", annot_kws={"size": 6})
plt.title("Correlation Heatmap")
plt.show()
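
A full correlation heatmap over 31 columns can be hard to read. As a complement, the sketch below ranks the features by the absolute value of their Pearson correlation with the target. It rebuilds `df` so that it is self-contained; `target_corr` is a name chosen here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Absolute Pearson correlation of each feature with the target, strongest first
target_corr = (
    df.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr.head(5))
```

Correlation with a binary target only captures linear association, so this ranking is a quick screening aid and not a feature-importance measure.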



## 2. Feature Engineering

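Here, we will:
- Split the dataset into training (80%) and test (20%) sets.
- Standardize the features to zero mean and unit variance, fitting the scaler on the training set only so that no test-set statistics leak into training.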


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define features and target
X = df.drop(columns=["target"])
y = df["target"]

# Split into train and test sets (80/20), stratified to keep the class ratio in both
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature engineering complete.")
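
To see what standardization does, the sketch below checks the column statistics after scaling. It assumes an 80/20 split with `random_state=42` and reloads the data so it can run independently; the variable names `train_means`, `train_stds`, and `test_means` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_scaled = scaler.transform(X_test)        # the same statistics reused on the test data

# Training columns end up with mean ~0 and standard deviation ~1
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
# Test columns are only approximately centered, which is expected
test_means = X_test_scaled.mean(axis=0)

print("max |train mean|:", np.abs(train_means).max())
print("max |train std - 1|:", np.abs(train_stds - 1).max())
print("max |test mean|:", np.abs(test_means).max())
```

The test-set means are not exactly zero because the scaler never saw those rows, which is the intended behavior.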



## 3. Model Training

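We will train three classifiers (a random forest, logistic regression, and a support vector machine) on the standardized training data and report each model's accuracy on the test set.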


In [None]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Train and evaluate multiple models
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Support Vector Machine": SVC(random_state=42)
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    predictions = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, predictions)
    accuracies[name] = accuracy
    print(f"{name} Accuracy: {accuracy:.4f}")
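
A single train/test split gives one accuracy number per model, and it can shift with the random seed. One common alternative, shown here as a sketch and not as part of the original workflow, is 5-fold cross-validation on the training data. The scaler sits inside a `Pipeline` so that it is refit on each fold. The names `candidates` and `cv_scores` are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Support Vector Machine": SVC(random_state=42),
}

cv_scores = {}
for name, estimator in candidates.items():
    # The pipeline refits the scaler within each fold, so a fold's validation
    # rows never influence the scaling statistics used to train on that fold
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.4f} (std {scores.std():.4f})")
```

Comparing models on these cross-validated scores leaves the test set untouched until the final evaluation.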



## 4. Model Selection

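We select the model with the highest test-set accuracy. Because the test set is also used to make this choice, the winner's reported accuracy is an optimistic estimate of its performance on new data.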


In [None]:

# Select the best model
best_model_name = max(accuracies, key=accuracies.get)
best_model_accuracy = accuracies[best_model_name]

print(f"The best model is {best_model_name} with an accuracy of {best_model_accuracy:.4f}.")
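
Accuracy alone does not show which kind of error a model makes, and for this dataset a missed malignant case matters more than a false alarm. The sketch below prints a confusion matrix and per-class precision and recall. Logistic regression is used purely as a stand-in for whichever model was selected, and the snippet reloads and splits the data so it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Logistic regression is an assumed stand-in for the selected model
model = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000, random_state=42)
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign)
cm = confusion_matrix(y_test, predictions, labels=[0, 1])
print(cm)

# Per-class precision and recall; low recall for "malignant" means missed cancers
print(classification_report(y_test, predictions, target_names=list(data.target_names)))
```

The top-right cell of the matrix counts malignant samples predicted as benign, which is usually the error to minimize in this setting.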
