# AppliedML.ipynb Walkthrough (Exam-Focused)

This is a guided walkthrough of `AppliedML.ipynb`, using `AppliedML_solutions.ipynb` as a reference. It explains *what to do* in each part and *why*, with minimal “magic”.

## What you’re building

- **Goal:** predict whether an animal’s `outcome_type` is **Adoption** (`adopted = 1`) or **not** (`adopted = 0`).
- **Model:** logistic regression (binary classifier that outputs probabilities).
- **Main pipeline:** load → clean → label → train/test split → one-hot encode categoricals → standardize → fit model → evaluate with different thresholds → interpret coefficients.

## Libraries used (what they’re for)

- `pandas` (`pd`): load CSV, manipulate tabular data (`DataFrame`), one-hot encoding via `pd.get_dummies`.
- `numpy` (`np`): numeric operations, random splitting, thresholding, counting TP/FP/FN/TN.
- `sklearn.linear_model.LogisticRegression`: logistic regression classifier with `.fit()` and `.predict_proba()`.
- `matplotlib.pyplot` (`plt`) + `seaborn` (`sn`): plotting (curves, confusion matrix heatmap, coefficient bars).

---

## Part A — Load, encode, split, standardize (only `numpy` + `pandas`)

### A1) Load + drop missing rows


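Load the CSV into a `DataFrame` named `original_data` with `pd.read_csv` (using the path from the notebook), then drop the rows with missing entries, which the exercise allows:

```python
import numpy as np
import pandas as pd

print('rows before:', len(original_data))
original_data = original_data.dropna().reset_index(drop=True)
print('rows after: ', len(original_data))
```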
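
To see what `dropna()` and `reset_index(drop=True)` do before running them on the real table, here is a minimal sketch on a toy `DataFrame`. The values and the `age_days` column are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; values and 'age_days' are hypothetical.
toy = pd.DataFrame({
    'animal_type': ['Dog', 'Cat', None, 'Dog'],
    'age_days': [365.0, np.nan, 30.0, 730.0],
})

# Count missing values per column before deciding to drop anything.
missing_per_column = toy.isna().sum()

# dropna() removes every row with at least one missing value;
# reset_index(drop=True) renumbers the surviving rows 0..n-1.
toy_clean = toy.dropna().reset_index(drop=True)

print(missing_per_column)
print(toy_clean)
```

Checking `isna().sum()` first is a cheap way to confirm that dropping rows does not throw away most of the data.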

### A2) Create label and features table

Turn `outcome_type` into a binary label and drop the original column:

```python
data_features = original_data.copy()
data_features['adopted'] = (data_features['outcome_type'] == 'Adoption').astype(int)
data_features = data_features.drop(columns=['outcome_type'])
```
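
A quick way to sanity-check the new label is to look at the class balance, which matters later when judging accuracy. The sketch below uses a toy table; the non-adoption outcome names are illustrative and may not match the real dataset.

```python
import pandas as pd

# Toy outcomes; 'Transfer' and 'Euthanasia' are illustrative values only.
toy = pd.DataFrame({
    'outcome_type': ['Adoption', 'Transfer', 'Adoption', 'Euthanasia']
})

# The comparison yields booleans; astype(int) maps True -> 1 and False -> 0.
toy['adopted'] = (toy['outcome_type'] == 'Adoption').astype(int)

# Share of each class (index 0 = not adopted, index 1 = adopted).
class_share = toy['adopted'].value_counts(normalize=True)
print(class_share)
```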

### A3) Split into train/test (80/20)

A simple split without external libraries:

```python
def split_set(df, ratio=0.8, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.random(len(df)) < ratio
    return df[mask].reset_index(drop=True), df[~mask].reset_index(drop=True)

train, test = split_set(data_features, ratio=0.8, seed=0)
```

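If you want to match the solutions notebook's code *exactly*, remove the `seed` and use `np.random.rand(...)` instead. Without a fixed seed the split, and therefore every metric downstream, changes from run to run.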
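
Note that the mask-based split gives each row an independent 80% chance of landing in the training set, so the training share is only approximately 80%. If an exact cut is wanted, one possible alternative (not the notebook's method) is to shuffle row positions and slice. `split_exact` and the toy table are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd

def split_exact(df, ratio=0.8, seed=0):
    """Shuffle row positions, then cut at exactly int(ratio * len(df)) rows."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))   # shuffled positions 0..n-1
    cut = int(ratio * len(df))         # number of training rows
    train = df.iloc[order[:cut]].reset_index(drop=True)
    test = df.iloc[order[cut:]].reset_index(drop=True)
    return train, test

toy = pd.DataFrame({'row_id': range(10)})  # hypothetical toy table
toy_train, toy_test = split_exact(toy)
print(len(toy_train), len(toy_test))
```

Either approach keeps the two sets disjoint; the difference is only whether the sizes are exact or approximate.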

### A4) Dummy-variable encoding for categoricals

Pick the categorical columns and one-hot encode:

```python
categorical_columns = [
    'sex_upon_outcome', 'animal_type', 'intake_condition',
    'intake_type', 'sex_upon_intake'
]

train_enc = pd.get_dummies(train, columns=categorical_columns)
```

Important: the test set must have the **same columns in the same order** as the training set:

```python
test_enc = pd.get_dummies(test, columns=categorical_columns)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```
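
The sketch below shows why the `reindex` step is needed, using made-up category values: the toy test set is missing two categories seen in training and contains one that training never saw.

```python
import pandas as pd

# Made-up rows: the test set lacks 'Cat'/'Bird' and has an unseen 'Other'.
toy_train = pd.DataFrame({'animal_type': ['Dog', 'Cat', 'Bird'],
                          'adopted': [1, 0, 1]})
toy_test = pd.DataFrame({'animal_type': ['Dog', 'Other'],
                         'adopted': [0, 1]})

toy_train_enc = pd.get_dummies(toy_train, columns=['animal_type'])
toy_test_raw = pd.get_dummies(toy_test, columns=['animal_type'])

# Before alignment the two tables have different columns.
print(list(toy_train_enc.columns))
print(list(toy_test_raw.columns))

# reindex adds the missing training columns (filled with 0)
# and drops columns that never appeared in training.
toy_test_enc = toy_test_raw.reindex(columns=toy_train_enc.columns, fill_value=0)
print(list(toy_test_enc.columns))
```

A test row with an unseen category ends up with zeros in every dummy column for that variable, which is the usual compromise when the model has no coefficient for that category.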

### A5) Separate labels from features

```python
train_y = train_enc['adopted']
train_X = train_enc.drop(columns=['adopted'])

test_y = test_enc['adopted']
test_X = test_enc.drop(columns=['adopted'])
```

### A6) Standardize (mean 0, variance 1) using *training stats only*

Compute mean/std on training features and apply to both:

```python
means = train_X.mean()
stds = train_X.std().replace(0, 1)  # guard against constant columns

train_X_std = (train_X - means) / stds
test_X_std  = (test_X  - means) / stds
```
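
A sanity check on synthetic data illustrates what to expect after this step. The column names `x` and `const` and the random values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic features: one varying column and one constant column.
toy_train_X = pd.DataFrame({'x': rng.normal(5, 2, size=200), 'const': 1.0})
toy_test_X = pd.DataFrame({'x': rng.normal(5, 2, size=50), 'const': 1.0})

means = toy_train_X.mean()
stds = toy_train_X.std().replace(0, 1)  # constant column: std 0 -> 1

toy_train_std = (toy_train_X - means) / stds
toy_test_std = (toy_test_X - means) / stds

# Training 'x' has mean 0 and std 1 up to floating-point error.
# The constant column becomes all zeros instead of NaN thanks to the guard.
print(toy_train_std.mean().round(6))
print(toy_train_std.std().round(6))
# Test 'x' is only close to mean 0, because it reuses the training statistics.
print(toy_test_std.mean().round(3))
```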

That completes Part A (preprocessing).

---

## Part B — Train logistic regression + confusion matrix + manual metrics

### B1) Fit logistic regression

```python
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(solver='lbfgs', max_iter=10000)
logreg.fit(train_X_std, train_y)
```

### B2) Get predicted probabilities on test

`predict_proba` returns an `N x 2` array whose columns follow the order of `logreg.classes_`. With labels 0 and 1, `[:, 0]` is `P(class=0)` and `[:, 1]` is `P(class=1)`:

```python
proba = logreg.predict_proba(test_X_std)
```
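
To take some of the "magic" out of `predict_proba`: for a binary logistic regression, the class-1 probability is the logistic (sigmoid) function applied to a linear score. The sketch below reproduces that with plain `numpy`. The weights `w`, intercept `b`, and inputs `X` are made-up numbers; in a fitted model they would correspond to `logreg.coef_[0]`, `logreg.intercept_[0]`, and the standardized features.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, intercept, and three feature rows.
w = np.array([1.5, -0.5])
b = 0.25
X = np.array([[0.0, 0.0],
              [1.0, 2.0],
              [-2.0, 1.0]])

p1 = sigmoid(X @ w + b)                     # P(class=1) for each row
toy_proba = np.column_stack([1 - p1, p1])   # same two-column layout as predict_proba
print(toy_proba)
```

Each row sums to 1, and a positive linear score corresponds to `P(class=1) > 0.5`.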

### B3) Threshold to get binary predictions

For threshold `t=0.5`:

```python
t = 0.5
pred_y = (proba[:, 1] > t).astype(int)
```

### B4) Confusion matrix (TP, FP, FN, TN)

Define TP/FP/FN/TN using boolean logic:

```python
TP = np.sum((pred_y == 1) & (test_y.values == 1))
TN = np.sum((pred_y == 0) & (test_y.values == 0))
FP = np.sum((pred_y == 1) & (test_y.values == 0))
FN = np.sum((pred_y == 0) & (test_y.values == 1))
```

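Many conventions exist; be explicit. The layout below puts predicted classes on the rows and actual classes on the columns (note that `sklearn.metrics.confusion_matrix` uses actual classes on the rows instead, giving `[[TN, FP], [FN, TP]]`):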

```python
cm = np.array([[TP, FP],
               [FN, TN]])
cm
```
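
A small worked example makes the convention concrete. The labels and predictions below are made up; the row and column names are only there to document the layout.

```python
import numpy as np
import pandas as pd

# Made-up actual labels and predictions for 10 animals.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))
TN = np.sum((y_pred == 0) & (y_true == 0))
FP = np.sum((y_pred == 1) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))

# Rows = predicted class, columns = actual class.
cm = pd.DataFrame([[TP, FP], [FN, TN]],
                  index=['predicted 1', 'predicted 0'],
                  columns=['actual 1', 'actual 0'])
print(cm)
```

With this layout, each row sums to the number of predictions of that class and each column sums to the number of actual members of that class.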

### B5) Manual metrics (positive and negative class)

With TP/FP/FN/TN:

- **Accuracy:** `(TP + TN) / (TP + TN + FP + FN)`
- **Positive precision:** `TP / (TP + FP)`
- **Positive recall (TPR):** `TP / (TP + FN)`
- **Positive F1:** `2 * prec * rec / (prec + rec)`

For the **negative class**, treat “negative” as the class of interest:

- **Negative precision:** `TN / (TN + FN)`
- **Negative recall (TNR / specificity):** `TN / (TN + FP)`
- **Negative F1:** computed the same way from negative precision/recall
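
The formulas above can be wrapped in one small helper. This is a sketch with hypothetical names (`class_metrics`, `prf`) and made-up counts; it assumes all denominators are nonzero, so it has no guard against division by zero.

```python
def class_metrics(TP, FP, FN, TN):
    """Return accuracy plus (precision, recall, F1) for both classes."""
    def prf(hit, false_alarm, miss):
        prec = hit / (hit + false_alarm)
        rec = hit / (hit + miss)
        return prec, rec, 2 * prec * rec / (prec + rec)

    acc = (TP + TN) / (TP + TN + FP + FN)
    pos = prf(TP, FP, FN)  # positive class: hits are TP
    neg = prf(TN, FN, FP)  # negative class: the two error types swap roles
    return acc, pos, neg

# Made-up counts for illustration.
acc, pos, neg = class_metrics(TP=3, FP=1, FN=2, TN=4)
print('accuracy:', acc)
print('positive (precision, recall, F1):', pos)
print('negative (precision, recall, F1):', neg)
```

The only difference between the two classes is which count plays the role of "hit" and which errors count against precision versus recall.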

---

## Part C — Sweep threshold and plot metrics vs threshold

### C1) Loop over thresholds and compute metrics

```python
thresholds = np.linspace(0, 1, 100)
rows = []

for t in thresholds:
    pred_y = (proba[:, 1] > t).astype(int)
    TP = np.sum((pred_y == 1) & (test_y.values == 1))
    TN = np.sum((pred_y == 0) & (test_y.values == 0))
    FP = np.sum((pred_y == 1) & (test_y.values == 0))
    FN = np.sum((pred_y == 0) & (test_y.values == 1))

    acc = (TP + TN) / (TP + TN + FP + FN)
    prec_p = TP / (TP + FP) if (TP + FP) else np.nan
    rec_p  = TP / (TP + FN) if (TP + FN) else np.nan
    f1_p   = 2*prec_p*rec_p/(prec_p+rec_p) if (prec_p+rec_p) else np.nan

    prec_n = TN / (TN + FN) if (TN + FN) else np.nan
    rec_n  = TN / (TN + FP) if (TN + FP) else np.nan
    f1_n   = 2*prec_n*rec_n/(prec_n+rec_n) if (prec_n+rec_n) else np.nan

    rows.append([t, acc, prec_p, rec_p, f1_p, prec_n, rec_n, f1_n])

score = pd.DataFrame(rows, columns=[
    'Threshold', 'Accuracy', 'Precision P', 'Recall P', 'F1 score P',
    'Precision N', 'Recall N', 'F1 score N'
]).set_index('Threshold')
```
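
Once the metrics sit in a table indexed by threshold, picking the threshold that maximizes a metric is a one-liner. The sketch below uses a toy table with made-up F1 values; the `NaN` at `t = 1.0` mirrors what happens in the loop when no sample is predicted positive and precision is undefined.

```python
import numpy as np
import pandas as pd

# Toy table with made-up values, shaped like the 'score' DataFrame.
toy_score = pd.DataFrame({
    'Threshold': [0.0, 0.25, 0.5, 0.75, 1.0],
    'F1 score P': [0.60, 0.70, 0.80, 0.65, np.nan],
}).set_index('Threshold')

# idxmax returns the index label of the maximum and skips NaN by default.
best_t = toy_score['F1 score P'].idxmax()
best_f1 = toy_score.loc[best_t, 'F1 score P']
print(best_t, best_f1)
```

Keep in mind that a threshold chosen on the test set gives an optimistic estimate; outside an exam setting, a separate validation split is the usual choice for this step.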

### C2) Plot

```python
score['Accuracy'].plot(grid=True).set_title('Accuracy vs threshold')
```

And a grid of the other metrics:

```python
import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=2, ncols=3, sharex=True, sharey=True, figsize=(10, 5))
cols = ['Precision P', 'Recall P', 'F1 score P', 'Precision N', 'Recall N', 'F1 score N']
for ax, col in zip(axs.flat, cols):
    score[col].plot(ax=ax, grid=True)
    ax.set_title(col)
```

Interpretation tip: increasing `t` usually **reduces** predicted positives → often **increases precision** and **decreases recall** for the positive class.
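
The hedges "usually" and "often" are deliberate. Recall for the positive class can never go up as `t` increases, but precision is not guaranteed to rise. The made-up example below shows precision going up and then down again while recall keeps falling.

```python
import numpy as np

# Made-up labels and class-1 probabilities, sorted by probability.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p1 = np.array([0.1, 0.3, 0.35, 0.45, 0.6, 0.7, 0.8, 0.9])

results = {}
for t in (0.2, 0.5, 0.75):
    pred = (p1 > t).astype(int)
    TP = np.sum((pred == 1) & (y_true == 1))
    FP = np.sum((pred == 1) & (y_true == 0))
    FN = np.sum((pred == 0) & (y_true == 1))
    precision = float(TP / (TP + FP))
    recall = float(TP / (TP + FN))
    results[t] = (precision, recall)
    print(f't={t}: precision={precision:.3f}, recall={recall:.3f}')
```

Here the negative example with probability 0.8 keeps hurting precision at the highest threshold, which is the kind of pattern the precision-vs-threshold plot can reveal.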

---

## Part D — Interpret coefficients (feature importance)

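For a binary logistic regression, `coef_` has shape `(1, n_features)`, so `coef_[0]` holds one coefficient per feature. A positive coefficient increases the log-odds of class 1 as the feature grows, and a negative one decreases them; the larger the magnitude, the stronger the effect.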

```python
coefs = pd.DataFrame({
    'name': train_X_std.columns,
    'value': logreg.coef_[0]
}).sort_values('value')

plt.figure(figsize=(6, 8))
plt.barh(coefs['name'], coefs['value'], alpha=0.7)
plt.title('Logistic regression coefficients')
plt.show()
```

Exam note: coefficients are comparable only if features are on comparable scales — that’s why standardization matters.
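
Because coefficients live on the log-odds scale, exponentiating them gives a more readable quantity: the factor by which the odds of class 1 are multiplied when a standardized feature increases by 1 (one training standard deviation), holding the others fixed. The feature names and coefficient values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and made-up coefficients.
toy_coefs = pd.DataFrame({
    'name': ['feature_a', 'feature_b', 'feature_c'],
    'value': [0.7, 0.0, -1.2],
})

# exp(coef) > 1: odds of class 1 go up; < 1: odds go down; == 1: no effect.
toy_coefs['odds_ratio'] = np.exp(toy_coefs['value'])
print(toy_coefs)
```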

---

## Quiz answers (from `AppliedML_solutions.ipynb`)

- **Question 1:** **a) F1 Score** (best single-number summary when classes are imbalanced and you care about both precision and recall).
- **Question 2:** **d) True positive rate is 0.95** because `TPR = TP/(TP+FN) = 100/(100+5) ≈ 0.95`.

---

## Quick exam checklist (common pitfalls)

- **Always align dummy columns:** one-hot encode train, then `reindex` test to training columns.
- **Standardize using training stats only:** never compute mean/std on test.
- **Know your confusion matrix convention:** write down which axis is actual vs predicted before computing metrics.
- **Threshold moves precision/recall:** higher threshold → fewer predicted positives → typically higher precision, lower recall (positive class).
- **Imbalanced classes:** accuracy can be misleading; prefer precision/recall/F1 and also inspect the confusion matrix.
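
The last pitfall is easy to demonstrate with made-up data: a "classifier" that always predicts the majority class scores high accuracy while finding no positives at all.

```python
import numpy as np

# Made-up labels: 5 positives, 95 negatives.
y_true = np.array([1] * 5 + [0] * 95)
# Trivial baseline: always predict the negative class.
y_pred = np.zeros_like(y_true)

accuracy = float(np.mean(y_pred == y_true))
TP = np.sum((y_pred == 1) & (y_true == 1))
FN = np.sum((y_pred == 0) & (y_true == 1))
recall_p = float(TP / (TP + FN))

print('accuracy:', accuracy)
print('positive recall:', recall_p)
```

Positive precision is undefined here because nothing is predicted positive, which is why the threshold loop in Part C guards its divisions with `np.nan`.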
