# In-class Worksheet ‚Äî scikit-learn Intro (Binary Classification)

**Dataset:** Wisconsin Breast Cancer (Diagnostic)  
**Goal:** Follow along as the lesson progresses. Fill in each **TODO** section.

## Rules (for this worksheet)
- This is **not** a quiz. Use your notes and discuss with peers.
- Keep `random_state=42` where specified so results are reproducible.
- Do **not** rename variables that the worksheet defines (the grader will look for them).

## Grading
At the end, you will run a correction cell that prints a **score / 100** plus feedback.  
The score is only to help you self-check completion; it is not an exam.

---

## üîé Why Exact Variable Names Matter

This worksheet includes an automatic self-check grading script.

The grader:
- Checks specific variable names (e.g., `X_train`, `pipe`, `test_accuracy`)
- Checks specific Pipeline step names (`"scaler"` and `"clf"`)
- Verifies expected structures and shapes

Even if your code is logically correct, using different variable or step names
(e.g., `"scale"` instead of `"scaler"`) may cause the grader to mark it as incorrect.

This is done only to:
- Keep grading consistent
- Avoid ambiguity
- Ensure reproducibility

In real projects, step names are flexible ‚Äî here they must match exactly.


## 0) Setup

Run the cell below to import libraries.


In [1]:
# Imports (run once)
import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, ShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import (
    confusion_matrix, accuracy_score,
    precision_score, recall_score, f1_score,
    roc_curve, auc
)

print("Setup complete.")


Setup complete.


## 1) Load the dataset

**Tasks**
1. Load the breast cancer dataset using `load_breast_cancer`.
2. Store:
   - features in `X`
   - labels in `y`
3. Print:
   - `X.shape`
   - class distribution (counts of 0 and 1)

**Notes**
- This is a **binary** classification dataset.


In [9]:
# TODO 1: Load dataset
# - X, y
# - print X.shape
# - print class counts (0 and 1)

# YOUR CODE HERE
data = load_breast_cancer()
X = data.data
y = data.target
print("X.shape:", X.shape)
print("Class distribution (0 and 1):", dict(zip(*np.unique(y, return_counts=True))))


X.shape: (569, 30)
Class distribution (0 and 1): {0: 212, 1: 357}


## 2) Train/test split (stratified)

‚ö†Ô∏è **IMPORTANT ‚Äî Use the exact variable names below (required for grading):**

You must create:

- `X_train`
- `X_test`
- `y_train`
- `y_test`

**Tasks**
1. Split into train/test with:
   - `test_size=0.2`
   - `random_state=42`
   - `stratify=y`
2. Store results exactly in:
   - `X_train, X_test, y_train, y_test`
3. Print class proportions in train and test to confirm stratification.


In [None]:
# TODO 2: Stratified split
# - X_train, X_test, y_train, y_test
# - print class proportions in train/test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

_, counts_train = np.unique(y_train, return_counts=True)
_, counts_test = np.unique(y_test, return_counts=True)
print("Train class proportions (0, 1):", counts_train / counts_train.sum())
print("Test class proportions (0, 1):", counts_test / counts_test.sum())




Train / Test size (test_size=0.2): 0.7996485061511424 , 0.20035149384885764
Train class proportions (0, 1): [0.37362637 0.62637363]
Test class proportions (0, 1): [0.36842105 0.63157895]


## 3) Build a Pipeline and train a classifier

We will use:

- `StandardScaler()`
- `LogisticRegression(max_iter=2000)`

‚ö†Ô∏è **IMPORTANT ‚Äî Use the exact names below (required for grading):**

Create a Pipeline named **`pipe`** with the following step names:

```python
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000))
])
```

**Tasks**
1. Build the `pipe` exactly as shown above (same step names).
2. Fit it on the training data.
3. Compute:
   - `test_accuracy` using `pipe.score(X_test, y_test)`
   - `y_pred` using `pipe.predict(X_test)`

Print `test_accuracy`.


In [12]:
# TODO 3: Pipeline + training
# - pipe
# - fit
# - test_accuracy
# - y_pred

# YOUR CODE HERE
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, random_state=42))
])
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
y_pred = pipe.predict(X_test)
print("Test accuracy:", test_accuracy)

Test accuracy: 0.9824561403508771


## 4) Evaluate with confusion matrix + metrics

**Tasks**
1. Compute the confusion matrix and store it in `cm`.
2. Compute and store the following floats:
   - `test_precision`
   - `test_recall`
   - `test_f1`

Print the confusion matrix and the metrics.


In [None]:
# TODO 4: Confusion matrix + metrics
# - cm, test_precision, test_recall, test_f1

# YOUR CODE HERE
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

## 5) Adjust the decision threshold (binary)

In medical screening, **recall** is often important.

‚ö†Ô∏è **IMPORTANT ‚Äî Use the exact variable names below (required for grading):**

You must create:

- `probs_pos`
- `tau` (must be exactly `0.30`)
- `y_pred_tau`
- `recall_tau`
- `cm_tau`

**Tasks**
1. Compute predicted probabilities for the positive class:
   - store in `probs_pos`
2. Set `tau = 0.30`
3. Create predictions with this threshold:
   - store in `y_pred_tau`
4. Compute and store:
   - `recall_tau`
   - `cm_tau`

Print `recall_tau` and `cm_tau`.


In [None]:
# TODO 5: Threshold adjustment
# - probs_pos
# - tau = 0.30
# - y_pred_tau
# - recall_tau
# - cm_tau

# YOUR CODE HERE


## 6) ROC curve + AUC (binary)

**Tasks**
1. Using `probs_pos`, compute:
   - `fpr`, `tpr`, `_` using `roc_curve`
2. Compute:
   - `roc_auc` using `auc(fpr, tpr)`

Print `roc_auc`.


In [None]:
# TODO 6: ROC + AUC
# - fpr, tpr, roc_auc

# YOUR CODE HERE


## 7) Cross-validation with ShuffleSplit

**Tasks**
1. Create a `ShuffleSplit` object named `cv_ss` with:
   - `n_splits=20`
   - `test_size=0.2`
   - `random_state=42`
2. Compute cross-validation accuracy scores with:
   - `cross_val_score(pipe, X, y, cv=cv_ss, scoring="accuracy")`
   - store the array in `cv_scores`
3. Store:
   - `cv_mean` (mean of `cv_scores`)
   - `cv_std` (std of `cv_scores`)

Print `cv_mean` and `cv_std`.


In [None]:
# TODO 7: Cross-validation with ShuffleSplit
# - cv_ss, cv_scores, cv_mean, cv_std

# YOUR CODE HERE


## 8) Hyperparameter search: GridSearchCV

‚ö†Ô∏è **IMPORTANT ‚Äî Use the exact names below (required for grading):**

You must create:

- `param_grid`
- `grid`
- `best_params`
- `best_f1`

Parameter grid must be:

```python
param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0, 100.0]
}
```

`GridSearchCV` must use:

- `estimator=pipe`
- `cv=5`
- `scoring="f1"`
- `n_jobs=-1`

After fitting, store:

```python
best_params = grid.best_params_
best_f1 = grid.best_score_
```

Print `best_params` and `best_f1`.


In [None]:
# TODO 8: GridSearchCV
from sklearn.model_selection import GridSearchCV

# - param_grid
# - grid
# - best_params, best_f1

# YOUR CODE HERE


## 9) Correction / self-check (run at the end)

Run the cell below **after you completed all TODOs**.
It will output a score and feedback.


In [13]:
# Run correction (self-check)
from grader_breast_cancer import grade

result = grade(globals())
result


Status,Pts,Item,Feedback
1) Load dataset ‚Äî 8/8,1) Load dataset ‚Äî 8/8,1) Load dataset ‚Äî 8/8,1) Load dataset ‚Äî 8/8
PASS,4,X loaded with correct shape,"Shape: (569, 30)"
PASS,4,y loaded with correct shape,"Shape: (569,)"
2) Train/test split ‚Äî 10/10,2) Train/test split ‚Äî 10/10,2) Train/test split ‚Äî 10/10,2) Train/test split ‚Äî 10/10
PASS,10,Stratified split shapes correct,"Train (455, 30), Test (114, 30)"
3) Pipeline + train ‚Äî 20/20,3) Pipeline + train ‚Äî 20/20,3) Pipeline + train ‚Äî 20/20,3) Pipeline + train ‚Äî 20/20
PASS,10,Pipeline: StandardScaler + LogisticRegression,Pipeline structure OK.
PASS,6,test_accuracy computed (close to reference),test_accuracy=0.982
PASS,4,y_pred computed with correct shape,"Shape: (114,)"
4) Evaluation metrics ‚Äî 0/15,4) Evaluation metrics ‚Äî 0/15,4) Evaluation metrics ‚Äî 0/15,4) Evaluation metrics ‚Äî 0/15
