# Breast Cancer Classification using Logistic Regression

**Objective:**
Build a binary classifier that uses logistic regression to separate malignant from benign breast tumors.

**Dataset:** Breast Cancer Wisconsin Dataset  
**Tools:** Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn  

**Author:** Diya B  
**Internship Task – AI/ML Internship**

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, classification_report,
    accuracy_score, precision_score, recall_score,
    roc_curve, roc_auc_score
)

## Load Dataset

In [None]:
data = load_breast_cancer()

df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

df.head()
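
The integer labels in `target` are easier to interpret next to their class names. The following is a minimal sketch that reads the `target_names` attribute of the loaded dataset; the names `label_to_name` and `class_counts` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names is indexed by label: 0 -> 'malignant', 1 -> 'benign'
label_to_name = {i: str(name) for i, name in enumerate(data.target_names)}

# Count samples per class, reported by name rather than by integer label
class_counts = pd.Series(data.target).map(label_to_name).value_counts()

print(label_to_name)
print(class_counts)
```

Because label 1 is the benign class, scikit-learn metrics that default to `pos_label=1` report on benign tumors unless told otherwise.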

## Dataset Overview

In [None]:
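df.info()      # prints row and column counts, dtypes, and non-null counts; returns None
df.describe()  # summary statistics, shown as the cell output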

## Feature-Target Split

In [None]:
X = df.drop('target', axis=1)
y = df['target']

## Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
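
To see what `stratify=y` does, one can compare the class proportions in the full data with those in each split. This is a small self-contained check, assuming the same split parameters as above; the `*_share` names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of each class in the full data and in each split
full_share = y.value_counts(normalize=True).sort_index()
train_share = y_train.value_counts(normalize=True).sort_index()
test_share = y_test.value_counts(normalize=True).sort_index()

print("Full :", full_share.round(3).to_dict())
print("Train:", train_share.round(3).to_dict())
print("Test :", test_share.round(3).to_dict())
```

With stratification the three sets of proportions should agree closely, so the test set is not accidentally skewed toward one class.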

## Feature Scaling

In [None]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
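
The scaler is fitted on the training rows only and then reused on the test rows, which keeps test-set statistics out of training. A quick sanity check of that behavior, written as a standalone sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from the training rows
X_test_scaled = scaler.transform(X_test)        # the same statistics are reused here

# Training columns end up with mean ~0 and standard deviation ~1
print(X_train_scaled.mean(axis=0).round(3)[:5])
print(X_train_scaled.std(axis=0).round(3)[:5])

# Test columns are only approximately standardized, which is expected
print(X_test_scaled.mean(axis=0).round(3)[:5])
```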

## Train Logistic Regression Model

In [None]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
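
The scaling and training steps above can also be bundled into a single estimator. The sketch below uses scikit-learn's `make_pipeline` as one possible alternative arrangement, not as the approach this notebook takes; `pipeline` and `test_accuracy` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# fit() learns the scaling statistics from the training rows only;
# score() reuses those statistics to transform the test rows.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

test_accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means it is refitted correctly whenever the estimator is refitted, for example inside cross-validation.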

## Predictions

In [None]:
y_pred = model.predict(X_test_scaled)
# Column 1 of predict_proba is P(class 1); in this dataset class 1 is benign
y_prob = model.predict_proba(X_test_scaled)[:, 1]

## Confusion Matrix

In [None]:
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=data.target_names, yticklabels=data.target_names)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
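
The four cells of the matrix are easier to discuss with explicit labels. The sketch below rebuilds the model from scratch so it can run on its own and wraps the matrix in a labeled DataFrame; `cm_labeled` and `missed_malignant` are names chosen for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# Rows are actual classes, columns are predicted classes, in label order 0, 1
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
names = [str(n) for n in data.target_names]  # ['malignant', 'benign']
cm_labeled = pd.DataFrame(
    cm,
    index=[f"actual {n}" for n in names],
    columns=[f"predicted {n}" for n in names],
)
print(cm_labeled)

# Malignant tumors predicted as benign: the costly error in this setting
missed_malignant = int(cm[0, 1])
print("Missed malignant cases:", missed_malignant)
```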

## Classification Metrics

In [None]:
# precision_score and recall_score default to pos_label=1, which is the benign class here
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))

print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
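
Since the clinically important class is malignant (label 0), it can help to report precision and recall for that class explicitly. The following standalone sketch passes `pos_label=0` and adds class names to the report; variable names such as `recall_malignant` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# Default pos_label=1 scores the benign class
recall_benign = recall_score(y_test, y_pred)

# pos_label=0 scores the malignant class instead
recall_malignant = recall_score(y_test, y_pred, pos_label=0)
precision_malignant = precision_score(y_test, y_pred, pos_label=0)

print(f"Recall (benign)      : {recall_benign:.3f}")
print(f"Recall (malignant)   : {recall_malignant:.3f}")
print(f"Precision (malignant): {precision_malignant:.3f}")

# Per-class report with readable class names instead of 0 and 1
names = [str(n) for n in data.target_names]
print(classification_report(y_test, y_pred, target_names=names))
```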

## ROC Curve & AUC Score

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = roc_auc_score(y_test, y_prob)

plt.figure(figsize=(6,5))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.3f}")
plt.plot([0,1],[0,1],'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

print("ROC-AUC Score:", roc_auc)
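
Two properties of the AUC are worth checking: it is the area under the plotted curve, and it does not depend on which class is called positive as long as the scores are flipped accordingly. A standalone sketch using `sklearn.metrics.auc`; `area_from_curve` and `roc_auc_malignant` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)

proba = model.predict_proba(X_test_scaled)
p_malignant, p_benign = proba[:, 0], proba[:, 1]

# Area under the ROC curve, computed from the curve points (trapezoidal rule)
fpr, tpr, thresholds = roc_curve(y_test, p_benign)
area_from_curve = auc(fpr, tpr)

# The same quantity from the convenience function
roc_auc = roc_auc_score(y_test, p_benign)

# Treating malignant as the positive class gives the same AUC
roc_auc_malignant = roc_auc_score(1 - y_test, p_malignant)

print(f"auc(fpr, tpr)        : {area_from_curve:.4f}")
print(f"roc_auc_score benign : {roc_auc:.4f}")
print(f"roc_auc_score malign.: {roc_auc_malignant:.4f}")
```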

## Threshold Tuning

In [None]:
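# y_prob is P(class 1) and class 1 is benign, so a threshold below 0.5 predicts
# benign more often: benign recall can rise, malignant recall can only fall or stay equal.
custom_threshold = 0.3
y_custom = (y_prob >= custom_threshold).astype(int)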

print("Custom Threshold Confusion Matrix:")
print(confusion_matrix(y_test, y_custom))

print("\nClassification Report:\n")
print(classification_report(y_test, y_custom))
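
If the goal is to miss fewer malignant tumors, one option is to threshold the probability of the malignant class directly. The sketch below is an illustration under that assumption, not the notebook's original procedure; it sweeps a few hypothetical thresholds on `P(malignant)` and reports recall and precision for the malignant class.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)

# Column 0 of predict_proba is P(class 0) = P(malignant)
p_malignant = model.predict_proba(X_test_scaled)[:, 0]

results = []
for threshold in (0.3, 0.5, 0.7):  # illustrative values only
    # Predict malignant (label 0) when P(malignant) reaches the threshold
    y_hat = np.where(p_malignant >= threshold, 0, 1)
    rec = recall_score(y_test, y_hat, pos_label=0)
    prec = precision_score(y_test, y_hat, pos_label=0, zero_division=0)
    results.append((threshold, rec, prec))
    print(f"threshold={threshold:.1f}  recall={rec:.3f}  precision={prec:.3f}")
```

A lower threshold on `P(malignant)` flags more tumors as malignant, so malignant recall can only stay the same or increase, usually at some cost in precision.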

## Sigmoid Function Visualization & Explanation

In [None]:
x = np.linspace(-10,10,100)
sigmoid = 1 / (1 + np.exp(-x))

plt.figure(figsize=(6,4))
plt.plot(x, sigmoid)
plt.xlabel("z")
plt.ylabel("Sigmoid Output")
plt.title("Sigmoid Function")
plt.show()
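
Logistic regression computes a linear score z from the features and passes it through the sigmoid to obtain the probability of class 1. The standalone sketch below checks this against the fitted model: `decision_function` returns z, and applying the sigmoid by hand should reproduce `predict_proba`. The names `z_manual` and `p_manual` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)

# Linear score: z = w . x + b
z_manual = X_test_scaled @ model.coef_[0] + model.intercept_[0]
z_model = model.decision_function(X_test_scaled)

# Sigmoid of the score gives P(class 1)
p_manual = 1 / (1 + np.exp(-z_manual))
p_model = model.predict_proba(X_test_scaled)[:, 1]

print("Scores match       :", np.allclose(z_manual, z_model))
print("Probabilities match:", np.allclose(p_manual, p_model))
```

The default 0.5 threshold on the probability corresponds to z = 0, which is why the sigmoid curve crosses 0.5 at the origin.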

## Feature Importance Analysis

In [None]:
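# Features were standardized, so coefficient magnitudes are comparable.
# Positive values push toward class 1 (benign), negative toward class 0 (malignant).
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
}).sort_values(by='Coefficient', key=np.abs, ascending=False)  # rank by magnitude

feature_importance.head(10)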
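
Coefficients can also be read as odds ratios: exponentiating a coefficient gives the multiplicative change in the odds of class 1 (benign) for a one-standard-deviation increase in that feature, holding the others fixed. A standalone sketch; the `OddsRatio` column is an addition made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)

coef = model.coef_[0]
table = pd.DataFrame({
    "Feature": data.feature_names,
    "Coefficient": coef,
    # Odds of benign are multiplied by this factor per +1 standard deviation
    "OddsRatio": np.exp(coef),
}).sort_values(by="Coefficient", key=np.abs, ascending=False)

print(table.head(10))
```

An odds ratio below 1 means that larger values of the feature make a malignant prediction more likely.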

## Conclusion

This project implemented a binary classifier for the Breast Cancer Wisconsin dataset using logistic regression on standardized features. The model was evaluated with a confusion matrix, precision, recall, ROC-AUC, and a custom decision threshold.

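In medical diagnosis, recall on the malignant class matters most, because a false negative there is a missed cancer. In this dataset label 1 is benign, so lowering the threshold on the predicted probability from 0.5 to 0.3 favors benign predictions: it can raise benign recall but cannot raise malignant recall. Catching more malignant cases requires a threshold above 0.5 on P(benign), or treating malignant as the positive class. The coefficient analysis indicates which standardized features carry the most weight in the prediction and in which direction.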

Overall, the notebook walks through a complete machine learning workflow, from loading and splitting the data through scaling, training, evaluation, and threshold analysis, with results measured on a single held-out test split.
