# Logistic Regression — Breast Cancer Dataset
This notebook demonstrates **Logistic Regression** using the real-world **Breast Cancer dataset** from `sklearn.datasets`.

Steps covered:
1. Setup & imports
2. Load dataset
3. Exploratory Data Analysis (EDA)
4. Train/test split
5. Fit Logistic Regression model
6. Evaluate performance
7. ROC curve & AUC
8. Feature importance (coefficients)


## 1. Setup & Imports

In [None]:
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc

# Reproducibility
np.random.seed(42)

print("Python:", sys.version.split()[0])
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)

## 2. Load Breast Cancer Dataset

In [None]:
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = pd.Series(cancer.target)

print("Dataset shape:", X.shape)
print("Classes:", cancer.target_names)
X.head()

## 3. Exploratory Data Analysis (EDA)
Check the class distribution and the pairwise correlations among the first ten features.

In [None]:
# Class distribution
sns.countplot(x=y, palette='Set2')
plt.title("Class Distribution (0 = Malignant, 1 = Benign)")
plt.show()

# Correlation heatmap of the first 10 features (column order, not a ranking)
plt.figure(figsize=(12, 8))
sns.heatmap(X.iloc[:, :10].corr(), cmap='coolwarm', annot=True, fmt='.2f')
plt.title("Correlation Heatmap (First 10 Features)")
plt.show()
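
The count plot can be complemented with a numeric summary of the class balance. The snippet below is a minimal sketch that reloads the dataset so it runs on its own; the `summary` table is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
y = pd.Series(cancer.target, name="target")

# Count samples per class, ordered by label (0, 1), then relabel with class names.
counts = y.value_counts().sort_index()
counts.index = list(cancer.target_names)  # 0 -> malignant, 1 -> benign

# `summary` is an illustrative name, not part of the notebook above.
summary = pd.DataFrame({
    "count": counts,
    "proportion": (counts / counts.sum()).round(3),
})
print(summary)
```

The classes are moderately imbalanced, which is one reason the split in the next section uses `stratify=y`.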

## 4. Split Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_train.shape, X_test.shape
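
Passing `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in both subsets. A quick way to check this, sketched here as a standalone snippet that reloads the data with `as_frame=True`, is to compare the fraction of benign samples in each subset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Labels are 0/1, so the mean is the fraction of benign samples (label 1).
for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    print(f"{name:>5}: n={len(labels):3d}, benign fraction={labels.mean():.3f}")
```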

## 5. Fit Logistic Regression Model

In [None]:
# Features are not scaled, so the solver needs a high iteration cap to converge
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

print("Intercept:", model.intercept_)
print("Number of coefficients:", len(model.coef_[0]))
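
To make the role of the intercept and coefficients concrete: for a binary problem with default settings, the fitted model scores each sample as z = w · x + b and passes that score through the sigmoid 1 / (1 + e^(-z)) to get the probability of class 1. The sketch below refits the same model on its own copy of the data and checks that a hand-computed sigmoid agrees with `predict_proba`; `manual_prob` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# Linear score z = w . x + b for every test sample.
z = X_test @ model.coef_[0] + model.intercept_[0]
# The sigmoid maps the score to P(class 1), i.e. P(benign).
manual_prob = 1.0 / (1.0 + np.exp(-z))

sklearn_prob = model.predict_proba(X_test)[:, 1]
print("Max absolute difference:", np.abs(manual_prob - sklearn_prob).max())
```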

## 6. Model Evaluation

In [None]:
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))
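
When reading the confusion matrix, note that scikit-learn puts true labels on rows and predicted labels on columns, in label order (0 = malignant, 1 = benign). The toy example below uses small hand-written label vectors (hypothetical, not model output) to show how per-class recall falls out of the matrix; recall on the malignant class is the share of malignant cases the model catches.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for illustration only: 0 = malignant, 1 = benign.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
print(cm)

# Recall for a class = correct predictions in its row / row total.
malignant_recall = cm[0, 0] / cm[0].sum()
benign_recall = cm[1, 1] / cm[1].sum()
print("Malignant recall:", malignant_recall)
print("Benign recall:", benign_recall)

# Cross-check against scikit-learn, treating label 0 as the positive class.
assert np.isclose(malignant_recall, recall_score(y_true, y_pred, pos_label=0))
```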

## 7. ROC Curve & AUC

In [None]:
# Column 1 holds P(class 1) = P(benign); roc_curve treats label 1 as positive
y_prob = model.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6,6))
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
plt.plot([0,1], [0,1], 'r--', label="Chance level")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
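
One way to interpret AUC: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below checks this on four hand-written scores (hypothetical values, not model output) by comparing the area under the ROC curve with a direct pairwise count.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical labels and scores for illustration only.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Area under the ROC curve, computed the same way as in the cell above.
fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Fraction of (positive, negative) pairs ranked correctly; ties count as half.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print("AUC from ROC curve:", roc_auc)
print("Pairwise ranking estimate:", pairwise)
```

Three of the four positive/negative pairs are ordered correctly here, so both values come out to 0.75.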

## 8. Feature Importance (Coefficients)
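The sign of each coefficient shows which class a feature pushes the prediction toward (positive values favor class 1, benign). Because the features were not standardized before fitting, coefficient magnitudes depend on each feature's units and are not directly comparable across features.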

In [None]:
coef_importance = pd.Series(model.coef_[0], index=X.columns)
coef_importance = coef_importance.sort_values()

plt.figure(figsize=(8,10))
sns.barplot(x=coef_importance.values, y=coef_importance.index, palette='coolwarm')
plt.title("Feature Importance (Logistic Regression Coefficients)")
plt.show()
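
If the goal is to compare coefficient magnitudes across features, one common option is to standardize the inputs first. The following is a sketch of that variant, not the model fitted above: it swaps in a `StandardScaler` + `LogisticRegression` pipeline (the names `pipe` and `ranked` are introduced here) and ranks features by absolute coefficient.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling is fitted on the training data only, inside the pipeline.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# Coefficients are now per standard deviation of each feature.
coefs = pd.Series(pipe.named_steps["logisticregression"].coef_[0], index=X.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

print(ranked.head(10))
print("Test accuracy:", pipe.score(X_test, y_test))
```

Even with scaling, strongly correlated features (several of the size measurements in this dataset) can share or trade off weight, so the ranking is a rough guide rather than a definitive measure of importance.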