# 🧠 Breast Cancer Detection Pipeline with Logistic Regression

This notebook walks you through an end-to-end classical ML pipeline using logistic regression and feature importance, all based on scikit-learn's breast cancer dataset.

**In this notebook you will:**
1. Load & explore the data  
2. Split into training/test sets  
3. Scale features  
4. Train a logistic regression model  
5. Evaluate performance  
6. Analyze top predictive features  
7. Save model artifacts  

---

```python
# Cell 1: Imports & Data Loading
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import joblib

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.inspection import permutation_importance

%matplotlib inline

# Load the dataset
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

# Quick info
print(f"Dataset shape: {X.shape}")
print("Target distribution:")
print(y.value_counts().rename({0: 'malignant', 1: 'benign'}))
```

### What's in this dataset?

- **569 samples**, each with **30 numeric features**
- Binary target labels:  
  - `0 = malignant` (cancerous)  
  - `1 = benign` (non-cancerous)  
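- Features: 10 measurements of the cell nuclei in each image (radius, texture, perimeter, and so on), each reported as a mean, a standard error, and a "worst" value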


```python
# Cell 2: Exploratory Data Analysis (EDA)
df = pd.concat([X, y.rename('target')], axis=1)

# Count of classes
plt.figure(figsize=(4,4))
sns.countplot(x='target', data=df)
plt.xticks([0,1], ['malignant', 'benign'])
plt.title("Class Distribution")
plt.show()

# Correlation heatmap
plt.figure(figsize=(12,10))
sns.heatmap(df.corr(), cmap='coolwarm', fmt=".2f", square=True, linewidths=0.5)
plt.title("Feature Correlation Matrix")
plt.show()
```

**Observations:**
- The dataset is moderately imbalanced (~63% benign, ~37% malignant), which is why the split below is stratified.
- Many features are strongly correlated, so removing redundant ones may help the model.
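
To make the correlation observation concrete, the following is a minimal sketch that lists the most strongly correlated feature pairs. The 0.95 cutoff and the name `high_corr_pairs` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer(as_frame=True).data

# Absolute pairwise Pearson correlations between the 30 features
corr = X.corr().abs()

# Keep only the strict upper triangle so each pair is listed once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)

# 0.95 is an arbitrary threshold chosen for illustration
high_corr_pairs = pairs[pairs > 0.95]
print(high_corr_pairs)
```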


```python
# Cell 3: Train/Test Split and Feature Scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Save the scaler
joblib.dump(scaler, 'scaler.pkl')

print("X_train shape:", X_train_s.shape, "| X_test shape:", X_test_s.shape)
```

**Why scale features?**  
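Logistic regression does not strictly require scaling, but its default L2 penalty acts on raw coefficient sizes, so features measured in very different units (for example `mean area` versus `mean smoothness`) are penalized unevenly. Standardizing each feature to zero mean and unit variance puts them on an equal footing, helps the solver converge, and makes the coefficients comparable. The scaler is fit on the training split only, so no test-set statistics leak into training.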
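
As a quick check of what standardization does, this sketch repeats the same split and inspects the scaled columns. The variable names `train_means`, `train_stds`, and `test_means` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Raw features differ in spread by orders of magnitude
print(X_train[["mean area", "mean smoothness"]].std())

# Fit on the training split only, then apply the same transform to both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

train_means = X_train_s.mean(axis=0)  # approximately 0 for every feature
train_stds = X_train_s.std(axis=0)    # approximately 1 for every feature
test_means = X_test_s.mean(axis=0)    # close to 0, but not exactly 0

print(np.round(train_means[:3], 6), np.round(train_stds[:3], 6))
```

The test-set means are not exactly zero because the test data is transformed with training-set statistics, which is the intended behavior.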


```python
# Cell 4: Train Logistic Regression
model = LogisticRegression(solver='liblinear', random_state=42)
model.fit(X_train_s, y_train)

# Save trained model
joblib.dump(model, 'cancer_model.pkl')

print("Model training complete.")
```

We use the `liblinear` solver since it's effective for small to medium-sized datasets with binary targets.


```python
# Cell 5: Evaluate Model Performance
y_pred = model.predict(X_test_s)

print("✅ Accuracy:", accuracy_score(y_test, y_pred))
print("\n📋 Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=data.target_names,
            yticklabels=data.target_names)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
```

- **Accuracy** is the fraction of all predictions that were correct.
- **Classification report** lists precision, recall, and F1-score per class, which is more informative than accuracy when classes are imbalanced.
- **Confusion matrix** breaks predictions down by actual class (rows) and predicted class (columns), so missed malignant cases are visible directly.
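
To show how these metrics relate, here is a small worked example that derives accuracy, precision, and recall for the malignant class from a confusion matrix. The ten labels are made up for illustration and are not results from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only (0 = malignant, 1 = benign)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

tp = cm[0, 0]  # malignant correctly flagged
fn = cm[0, 1]  # malignant missed (predicted benign)
fp = cm[1, 0]  # benign wrongly flagged as malignant
tn = cm[1, 1]  # benign correctly cleared

accuracy = (tp + tn) / cm.sum()
precision_malignant = tp / (tp + fp)
recall_malignant = tp / (tp + fn)

print(cm)
print(accuracy, precision_malignant, recall_malignant)

# scikit-learn agrees when malignant (0) is named as the positive label
sk_precision = precision_score(y_true, y_hat, pos_label=0)
sk_recall = recall_score(y_true, y_hat, pos_label=0)
```

Note that scikit-learn treats label `1` (benign) as the positive class by default, so `pos_label=0` is needed to score the malignant class.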


```python
# Cell 6: Feature Importance Analysis
r = permutation_importance(
    model, X_test_s, y_test, n_repeats=10, random_state=42
)
importances = pd.Series(r.importances_mean, index=data.feature_names)
top10 = importances.nlargest(10)

plt.figure(figsize=(6,5))
top10.plot(kind='barh')
plt.title("Top 10 Feature Importances (Permutation)")
plt.xlabel("Importance")
plt.gca().invert_yaxis()
plt.show()

print("\nTop features contributing to model decisions:")
# Series.iteritems() was removed in pandas 2.0; items() works in all versions
for feat, score in top10.items():
    print(f" • {feat}: importance score ≈ {score:.4f}")
```

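Permutation importance shuffles the values of one feature at a time and measures how much the model's score drops. The larger the drop in accuracy, the more the model relies on that feature. Because many features here are correlated, a shuffled feature's information can partly survive in its neighbors, so individual importances may be understated.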
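
The idea can be sketched by hand in a few lines. This is a simplified illustration, not scikit-learn's implementation; the helper `manual_permutation_importance` and the name `clf` are hypothetical and introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
clf = LogisticRegression(solver="liblinear", random_state=42).fit(X_train_s, y_train)


def manual_permutation_importance(estimator, X, y, n_repeats=10, seed=42):
    """Mean drop in accuracy after shuffling each column (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    baseline = estimator.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for k in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column, breaking its link to the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, k] = baseline - estimator.score(X_perm, y)
    return drops.mean(axis=1)


manual_importances = manual_permutation_importance(clf, X_test_s, y_test)
top5 = np.argsort(manual_importances)[::-1][:5]
print(top5, manual_importances[top5])
```

The exact numbers will differ from `permutation_importance` because the random shuffles differ, but the ranking should be broadly similar.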


---

## 🏁 Conclusion

You have built a full machine learning pipeline for breast cancer detection:

1. **Explored** data distribution and feature correlations  
2. **Scaled** numeric features  
3. **Trained** a logistic regression model  
4. **Evaluated** model accuracy, precision, recall  
5. **Interpreted** model behavior via feature importance  
6. **Saved** models and scaler for future use  
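
As a sketch of how the saved artifacts might be reused, the example below writes the scaler and model to a temporary directory and reloads them to score new rows. The temporary directory and the use of the first five test rows as "new samples" are assumptions made so the example is self-contained; in the notebook the files are `scaler.pkl` and `cancer_model.pkl` in the working directory.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(solver="liblinear", random_state=42)
model.fit(scaler.transform(X_train), y_train)
original_pred = model.predict(scaler.transform(X_test))

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = Path(tmp_dir) / "scaler.pkl"
    model_path = Path(tmp_dir) / "cancer_model.pkl"
    joblib.dump(scaler, scaler_path)
    joblib.dump(model, model_path)

    # Later, or in another process: load both artifacts back
    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)

# New data must go through the same scaler before prediction
new_samples = X_test.iloc[:5]
reloaded_pred = loaded_model.predict(loaded_scaler.transform(new_samples))
print(reloaded_pred)
```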

**Next steps you can try:**
- Add **GridSearchCV** to tune regularization strength (`C`)  
- Experiment with **other models** like Random Forest or SVM  
- Visualize **ROC curve** and compute AUC  
- Build a simple **Flask** or **Streamlit** app to serve predictions  
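
A possible starting point for the first and third suggestions is sketched below: a pipeline that tunes `C` with `GridSearchCV` and reports test-set ROC AUC. The grid values, the 5-fold setting, and the choice of `roc_auc` as the scoring metric are illustrative assumptions, not recommendations from the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Putting the scaler inside the pipeline refits it on each CV training fold
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=42),
)

# Illustrative grid; smaller C means stronger regularization
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

best_C = search.best_params_["logisticregression__C"]

# AUC uses the predicted probability of class 1 (benign)
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])

print("Best C:", best_C)
print("Test ROC AUC:", round(test_auc, 4))
```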
---

Make sure your `venv` is activated when you run this notebook. To get started:

```bash
pip install -r requirements.txt
jupyter notebook
```