# ECON7880 - Assignment 3 (Template)

**Notes**  
- **Note 1:** for answers with Python, display both code and results clearly.  
- **Note 2:** for answers with manual calculation, please display all calculation steps clearly.

This notebook mirrors the structure and style you used for Assignment 2: a brief environment setup followed by one section per question, each with Markdown of the question text and ready-to-run code stubs.

## 0) Environment Setup

In [None]:

# Core libraries
import numpy as np
import pandas as pd

# Viz
import matplotlib.pyplot as plt

# ML utilities (you can import more as needed)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, roc_curve, auc

# Plot settings
plt.rcParams['figure.figsize'] = (7.5, 5.0)
plt.rcParams['axes.grid'] = True


## 1) Question 1. [30 points @ 6 points each]

A firm collected 5 training instances with 2 features $X_1$ and $X_2$, and their **Type** values:

| Instance | $X_1$ | $X_2$ | Type |
|:--:|:--:|:--:|:--:|
| 0 | 13.4 | 11.2 | 1 |
| 1 | 7.9  | 2.1  | 0 |
| 2 | 7.1  | 8.9  | 1 |
| 3 | 7.3  | 6.9  | 0 |
| 4 | 10.7 | 8.9  | 1 |



**(a)** Use Python to plot the 5 instances with $X_1$ on the x-axis and $X_2$ on the y-axis. Color the instances according to their **Type** values.

Given a new instance with $(X_1, X_2) = (6.5, 2.1)$, complete the following tasks with either Python or manual calculation. **Round results to 4 decimal places if you use manual calculation. No need to round if you work with Python.**



In [None]:

# --- Q1 (a) Plot points by Type ---
import pandas as pd
import matplotlib.pyplot as plt

train = pd.DataFrame({
    "Instance": [0, 1, 2, 3, 4],
    "X1": [13.4, 7.9, 7.1, 7.3, 10.7],
    "X2": [11.2, 2.1, 8.9, 6.9, 8.9],
    "Type": [1, 0, 1, 0, 1]
})
new_pt = pd.Series({"X1": 6.5, "X2": 2.1})

# Scatter plot
fig, ax = plt.subplots()
for t, df_g in train.groupby("Type"):
    ax.scatter(df_g["X1"], df_g["X2"], label=f"Type={t}", s=70)
ax.scatter(new_pt["X1"], new_pt["X2"], marker="*", s=200, label="New point")
ax.set_xlabel("X1")
ax.set_ylabel("X2")
ax.legend()
plt.show()


**Answer (a):** The matplotlib scatter plot shows Type=1 and Type=0 points in different colors, with the new instance at $(6.5, 2.1)$ highlighted by a star. Visually it sits closest to instance 1 (Type 0) at $(7.9, 2.1)$ in the lower-left corner, which foreshadows the distance calculations below.

**(b)** Calculate the **Euclidean Distance** between the new instance and each training instance using both $X_1$ and $X_2$.



In [None]:

# --- Q1 (b) Euclidean distance to new_pt ---
import numpy as np

def euclidean(u, v):
    u = np.asarray(u); v = np.asarray(v)
    return float(np.linalg.norm(u - v))

new_xy = np.array([new_pt["X1"], new_pt["X2"]])
train["euclid_dist"] = train[["X1","X2"]].apply(lambda r: euclidean(r.values, new_xy), axis=1)
train[["Instance","X1","X2","Type","euclid_dist"]]


**(c)** Calculate the **Cosine Distance** between the new instance and each training instance using both $X_1$ and $X_2$.



In [None]:

# --- Q1 (c) Cosine distance to new_pt ---
def cosine_distance(u, v):
    u = np.asarray(u); v = np.asarray(v)
    num = np.dot(u, v)
    den = np.linalg.norm(u) * np.linalg.norm(v)
    cos_sim = num / den
    return float(1 - cos_sim)

train["cosine_dist"] = train[["X1","X2"]].apply(lambda r: cosine_distance(r.values, new_xy), axis=1)
train[["Instance","X1","X2","Type","euclid_dist","cosine_dist"]].sort_values("Instance")


**Answers (b)–(c):** Distances from the new instance to each training instance under both metrics.

| Instance | X1 | X2 | Type | Euclidean Distance | Cosine Distance |
|---:|---:|---:|:--:|---:|---:|
| 0 | 13.4 | 11.2 | 1 | 11.4202 | 0.0727 |
| 1 | 7.9 | 2.1 | 0 | 1.4000 | 0.0014 |
| 2 | 7.1 | 8.9 | 1 | 6.8264 | 0.1663 |
| 3 | 7.3 | 6.9 | 0 | 4.8662 | 0.0973 |
| 4 | 10.7 | 8.9 | 1 | 7.9925 | 0.0718 |

Cosine distance determines the neighbors for the majority vote in (d); Euclidean distance determines the neighbors and weights for the weighted vote in (e).
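As a cross-check on the distances reported for (b) and (c), the same quantities can be recomputed with the standard library only. This is a minimal sketch rather than part of the assignment template; the helper names `euclid` and `cosine_dist` are illustrative.

```python
import math

# Training points and query copied from the Question 1 table.
train_pts = [(13.4, 11.2), (7.9, 2.1), (7.1, 8.9), (7.3, 6.9), (10.7, 8.9)]
query = (6.5, 2.1)

def euclid(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_dist(u, v):
    """One minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))

euclid_d = [euclid(p, query) for p in train_pts]
cosine_d = [cosine_dist(p, query) for p in train_pts]

for i, (e, c) in enumerate(zip(euclid_d, cosine_d)):
    print(f"Instance {i}: Euclidean = {e:.4f}, cosine = {c:.4f}")
```

Note that the two metrics rank the instances differently: instance 3 is the second-nearest by Euclidean distance but only the fourth-nearest by cosine distance, which is why (d) and (e) use different neighbor sets.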

**(d)** What is the predicted **Type** value for the new instance using **3-NN and majority vote (based on Cosine Distance)**? What is the estimated class probability?



In [None]:

# --- Q1 (d) 3-NN majority vote using COSINE distance ---
k = 3
nbrs_cos = train.sort_values("cosine_dist").head(k)
pred_type_majority = int((nbrs_cos["Type"].sum() >= (k/2)))  # tie -> 1
prob_majority = nbrs_cos["Type"].mean()

print(f"Predicted Type (3-NN majority on cosine): {pred_type_majority}")
print(f"Estimated class probability: {prob_majority:.4f}")
nbrs_cos[["Instance","Type","cosine_dist"]]


**Answer (d):** The three cosine-distance neighbors are instances 1, 4, and 0; two of them have Type=1, so 3-NN majority predicts `Type = 1` with an estimated class probability of 0.6667.
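The majority vote in (d) can also be traced step by step without pandas. The sketch below assumes the same data as the Question 1 table and uses illustrative names (`ranked`, `votes`); it is a check on the answer, not a replacement for the cell above.

```python
import math
from collections import Counter

train_pts = [(13.4, 11.2), (7.9, 2.1), (7.1, 8.9), (7.3, 6.9), (10.7, 8.9)]
types = [1, 0, 1, 0, 1]
query = (6.5, 2.1)

def cosine_dist(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))

# Rank instance indices by cosine distance and keep the k nearest.
k = 3
ranked = sorted(range(len(train_pts)),
                key=lambda i: cosine_dist(train_pts[i], query))
neighbors = ranked[:k]

# Unweighted vote: each neighbor contributes one vote for its Type.
votes = Counter(types[i] for i in neighbors)
pred = votes.most_common(1)[0][0]
prob_type1 = votes[1] / k

print("Neighbors:", neighbors)
print("Predicted Type:", pred)
print(f"Estimated P(Type=1): {prob_type1:.4f}")
```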


**(e)** What is the predicted **Type** value for the new instance using **3-NN and weighted voting (based on Euclidean Distance)**? What is the estimated class probability?

Please report the results in one or two tables. For example, answers for Q1(b)-(c) can be organized as below:

| Instance | X1 | X2 | Type | (b) Euclidean Distance | (c) Cosine Distance |
|---:|---:|---:|---:|---:|---:|
| 0 | 13.4 | 11.2 | 1 |   |   |
| 1 | 7.9  | 2.1  | 0 |   |   |
| 2 | 7.1  | 8.9  | 1 |   |   |
| 3 | 7.3  | 6.9  | 0 |   |   |
| 4 | 10.7 | 8.9  | 1 |   |   |

In [None]:

# --- Q1 (e) 3-NN weighted vote using EUCLIDEAN distance ---
# Weights = 1 / distance (guarding against zero distance)
eps = 1e-9
nbrs_euc = train.sort_values("euclid_dist").head(3).copy()
nbrs_euc["w"] = 1.0 / (nbrs_euc["euclid_dist"] + eps)
weighted_sum = (nbrs_euc["w"] * nbrs_euc["Type"]).sum()
total_w = nbrs_euc["w"].sum()
prob_weighted = float(weighted_sum / total_w)
pred_type_weighted = int(prob_weighted >= 0.5)

print(f"Predicted Type (3-NN weighted on euclidean): {pred_type_weighted}")
print(f"Estimated class probability: {prob_weighted:.4f}")
nbrs_euc[["Instance","Type","euclid_dist","w"]]


**Answer (e):** Inverse-distance weights heavily favor instance 1 (Type=0), so the weighted 3-NN vote predicts `Type = 0` with an estimated probability of 0.1374 for Type=1.


**Details for (e):** 3-NN uses the three smallest Euclidean distances (flagged "Yes") and inverse-distance weights $w_i = 1/d_i$.

| Instance | Type | Euclidean Distance | Weight $1/d_i$ | Used in 3-NN? |
|---:|:--:|---:|---:|:---:|
| 0 | 1 | 11.4202 | 0.0876 | No |
| 1 | 0 | 1.4000 | 0.7143 | Yes |
| 2 | 1 | 6.8264 | 0.1465 | Yes |
| 3 | 0 | 4.8662 | 0.2055 | Yes |
| 4 | 1 | 7.9925 | 0.1251 | No |

Using only the three nearest neighbors (as required by 3-NN) gives $P(Type=1) = 0.1374$ and predicts Type 0 because the two closest points are both Type 0.
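For a manual-calculation version of (e), the weighted probability can be written out with the rounded weights of the three nearest neighbors (instances 1, 3, and 2). Only instance 2 has Type=1, so it alone appears in the numerator:

$$
P(\text{Type}=1)
= \frac{w_2}{w_1 + w_3 + w_2}
= \frac{1/6.8264}{1/1.4000 + 1/4.8662 + 1/6.8264}
= \frac{0.1465}{0.7143 + 0.2055 + 0.1465}
= \frac{0.1465}{1.0663}
\approx 0.1374
$$

Since $0.1374 < 0.5$, the weighted vote predicts Type 0.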

## 2) Question 2. [30 points]

A firm collected 6 instances with 2 features $X_1$ and $X_2$:

| Instance | $X_1$ | $X_2$ |
|:--:|:--:|:--:|
| 0 | 1 | 4 |
| 1 | 1 | 3 |
| 2 | 0 | 5 |
| 3 | 5 | 2 |
| 4 | 6 | 3 |
| 5 | 4 | 0 |

With instances **0** and **3** selected as initial centroids, we'd like to simulate the **$k$-means** algorithm to separate all instances into two clusters ($k=2$). Complete the following with Python or manual calculation. **Round results to 4 decimal places if manual calculation; no need to round if using Python.**



**(a)** Compute Euclidean distance from each instance to the initial centroids.  


In [None]:
# --- Q2 setup & (a) Distances to initial centroids ---
points = np.array([
    [1, 4],
    [1, 3],
    [0, 5],
    [5, 2],
    [6, 3],
    [4, 0],
], dtype=float)

idx0, idx3 = 0, 3
centroids = np.vstack([points[idx0], points[idx3]])  # initial centroids (instances 0 and 3)

def pairwise_euclid(a, b):
    # a: (n, d), b: (m, d) -> (n, m)
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

dists = pairwise_euclid(points, centroids)
df2 = pd.DataFrame(points, columns=["X1", "X2"])
df2.insert(0, "Instance", range(len(df2)))
df2["dist_c0"] = dists[:, 0]
df2["dist_c1"] = dists[:, 1]
df2


**(b)** Assign instances to the two clusters by finding their closest centroids.  


In [None]:
# --- Q2 (b) Cluster assignments via nearest centroid ---
assign = dists.argmin(axis=1)
df2["cluster"] = assign
df2


**(c)** Compute the clustering quality with $\text{SSE} = \sum_{i=1}^k \sum_{p \in C_i} d(p, m_i)^2$.  


In [None]:
# --- Q2 (c) SSE for current clustering ---
sse = float(np.sum((dists[np.arange(len(points)), assign]) ** 2))
print(f"SSE (iteration 1): {sse:.4f}")


**(d)** Compute the **mean feature values** for instances in the two clusters respectively, in the format of $(X_1, X_2)$.  


In [None]:
# --- Q2 (d) Updated centroids from cluster means ---
means = np.vstack([
    points[assign == 0].mean(axis=0),
    points[assign == 1].mean(axis=0),
])
print("Updated centroids (iteration 1 means):", means)


**Answers (a)–(d):** Distances to the initial centroids (instances 0 and 3) together with the resulting cluster labels.

| Instance | X1 | X2 | Dist to Inst 0 | Dist to Inst 3 | Cluster (iter 1) |
|---:|---:|---:|---:|---:|:--:|
| 0 | 1 | 4 | 0.0000 | 4.4721 | 0 |
| 1 | 1 | 3 | 1.0000 | 4.1231 | 0 |
| 2 | 0 | 5 | 1.4142 | 5.8310 | 0 |
| 3 | 5 | 2 | 4.4721 | 0.0000 | 1 |
| 4 | 6 | 3 | 5.0990 | 1.4142 | 1 |
| 5 | 4 | 0 | 5.0000 | 2.2361 | 1 |

SSE (iteration 1) = 10.0000. Cluster means for part (d): $C_0 = (0.6667, 4.0000)$ and $C_1 = (5.0000, 1.6667)$.
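For the manual-calculation route, the SSE in (c) and the means in (d) follow from the squared distances of each instance to its assigned initial centroid (instances 0, 1, 2 to instance 0; instances 3, 4, 5 to instance 3):

$$
\begin{aligned}
\text{SSE} &= \underbrace{(0^2 + 1^2 + 1.4142^2)}_{\text{cluster 0}} + \underbrace{(0^2 + 1.4142^2 + 2.2361^2)}_{\text{cluster 1}} = (0 + 1 + 2) + (0 + 2 + 5) = 10 \\
m_0 &= \left(\tfrac{1 + 1 + 0}{3},\ \tfrac{4 + 3 + 5}{3}\right) = (0.6667,\ 4.0000) \\
m_1 &= \left(\tfrac{5 + 6 + 4}{3},\ \tfrac{2 + 3 + 0}{3}\right) = (5.0000,\ 1.6667)
\end{aligned}
$$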

**(e)** **Update** the cluster centroids with the means from (d), then repeat steps (a)–(d) once. Will the clustering result (i.e., cluster labels) change? Any improvement in terms of **SSE**?

**Suggested table for (a)–(d):**

| Instance | X1 | X2 | (a) Dist to **Inst 0** | (a) Dist to **Inst 3** | (b) Cluster Label | (d) Updated Centroid |
|---:|---:|---:|---:|---:|:--:|:--:|
| 0 |   |   |   |   |   |   |
| … |   |   |   |   |   |   |
| 5 |   |   |   |   |   |   |

**(c) SSE:** ____

In [None]:
# --- Q2 (e) Repeat after updating centroids ---
centroids2 = means.copy()
d2 = pairwise_euclid(points, centroids2)
assign2 = d2.argmin(axis=1)
sse2 = float(np.sum((d2[np.arange(len(points)), assign2]) ** 2))

df2_iter2 = pd.DataFrame(points, columns=["X1", "X2"])
df2_iter2.insert(0, "Instance", range(len(df2_iter2)))
df2_iter2["dist_c0"] = d2[:, 0]
df2_iter2["dist_c1"] = d2[:, 1]
df2_iter2["cluster"] = assign2

display(df2_iter2)
print(f"SSE (iteration 2): {sse2:.4f}")
print("Centroids (iteration 2):", centroids2)
print("Did labels change?", not np.array_equal(assign, assign2))
print(f"SSE improvement: {sse - sse2:.4f}")


**Answer (e):** After updating the centroids to (0.6667, 4.0000) and (5.0000, 1.6667), every instance keeps the same label `[0,0,0,1,1,1]`, but SSE drops from 10.0000 to 9.3333 (an improvement of 0.6667).


**Iteration 2 check for part (e):** Distances to the updated centroids and confirmation that labels are unchanged.

| Instance | Dist to $C_0=(0.6667, 4.0000)$ | Dist to $C_1=(5.0000, 1.6667)$ | Cluster (iter 2) |
|---:|---:|---:|:--:|
| 0 | 0.3333 | 4.6308 | 0 |
| 1 | 1.0541 | 4.2164 | 0 |
| 2 | 1.2019 | 6.0093 | 0 |
| 3 | 4.7726 | 0.3333 | 1 |
| 4 | 5.4263 | 1.6667 | 1 |
| 5 | 5.2068 | 1.9437 | 1 |

SSE (iteration 2) = 9.3333, so the procedure improves the objective by 0.6667 while keeping the same labels.
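Because the labels do not change after the update, recomputing the means would reproduce the same centroids, so $k$-means has converged on this data. The loop below is a minimal standard-library sketch of that assign/update cycle; the helper names and the cap of 10 iterations are assumptions made for illustration, and the code does not handle empty clusters (none occur here).

```python
import math

points = [(1, 4), (1, 3), (0, 5), (5, 2), (6, 3), (4, 0)]
centroids = [points[0], points[3]]  # initial centroids: instances 0 and 3

def assign_labels(pts, cents):
    """Index of the nearest centroid for every point."""
    return [min(range(len(cents)), key=lambda j: math.dist(p, cents[j]))
            for p in pts]

def mean_point(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))

def sse(pts, cents, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, cents[l]) ** 2 for p, l in zip(pts, labels))

labels = assign_labels(points, centroids)
history = [sse(points, centroids, labels)]  # SSE after each assignment step

for _ in range(10):  # illustrative safety cap on the number of updates
    centroids = [mean_point([p for p, l in zip(points, labels) if l == j])
                 for j in range(2)]
    new_labels = assign_labels(points, centroids)
    history.append(sse(points, centroids, new_labels))
    if new_labels == labels:  # no label changed, so the algorithm has converged
        break
    labels = new_labels

print("Final labels:", labels)
print("SSE per iteration:", [round(s, 4) for s in history])
```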

## 3) Question 3. [24 points]

A bank trained a classification model to predict the likelihood of default for each customer. There are **1000 customers** in the database: the "No Default" cases take up **80%** of the data while the "Default" cases take up **20%**. Applying this classifier on this dataset yields the following confusion matrix:

**Confusion matrix**

|              | Predicted: Default | Predicted: No Default |
|:-------------|-------------------:|----------------------:|
| **Actual: Default**    | 150 | 50  |
| **Actual: No Default** | 100 | 700 |

As the average lending amount is **$100** and interest rate is **15%**, the **cost-benefit matrix** (negative numbers mean cost) is:

|              | Predicted: Default | Predicted: No Default |
|:-------------|-------------------:|----------------------:|
| **Actual: Default**    | $0  | -$100 |
| **Actual: No Default** | $0  | $15  |



**(a)** Which group ("Default" or "No Default") will you consider as the positive class?  


**Answer (a):** Treat `Default` as the positive class because the firm's goal is to catch defaults; all recall/precision metrics therefore refer to the default customers.


**(b)** **[8 points @ 2 points each]** Calculate the following scores for this model: (i) **Accuracy**; (ii) **True positive rate (Sensitivity/Recall)**; (iii) **True negative rate (Specificity)**; (iv) **Precision (for the positive class only)**.  


In [None]:
# --- Q3 setup & (b) Classification metrics ---
cm = np.array([[150, 50],
               [100, 700]])
TP, FN = cm[0, 0], cm[0, 1]
FP, TN = cm[1, 0], cm[1, 1]

N = cm.sum()
pos = cm[0].sum()
neg = cm[1].sum()

accuracy = (TP + TN) / N
tpr = TP / pos
tnr = TN / neg
precision = TP / (TP + FP)

print(f"Accuracy:  {accuracy:.4f}")
print(f"TPR:       {tpr:.4f}")
print(f"TNR:       {tnr:.4f}")
print(f"Precision: {precision:.4f}")


**Answer (b):** Accuracy = 0.8500, TPR = 0.7500, TNR = 0.8750, Precision (Default) = 0.6000.
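With `Default` as the positive class, the confusion matrix gives $TP = 150$, $FN = 50$, $FP = 100$, $TN = 700$, and the four scores can be written out as follows for the manual-calculation route:

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{N} = \frac{150 + 700}{1000} = 0.8500 \\
\text{TPR} &= \frac{TP}{TP + FN} = \frac{150}{200} = 0.7500 \\
\text{TNR} &= \frac{TN}{TN + FP} = \frac{700}{800} = 0.8750 \\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{150}{250} = 0.6000
\end{aligned}
$$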


**(c)** Calculate the **expected value (per person)** for this model.  


In [None]:
# --- Q3 (c) Expected value per person ---
benefit = np.array([[0, -100],
                    [0,   15]], dtype=float)
ev_total = float((cm * benefit).sum())
ev_per_person = ev_total / N
print(f"Expected value per person (this model): ${ev_per_person:.4f}")


**Answer (c):** Using the cost-benefit matrix (default + approved loan costs $100, hence -$100), the model yields $5,500 total benefit, i.e., **$5.50 per customer**.
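Written out cell by cell (count times payoff, in dollars), the expected value per customer is:

$$
\text{EV} = \frac{150 \cdot 0 + 50 \cdot (-100) + 100 \cdot 0 + 700 \cdot 15}{1000}
= \frac{-5000 + 10500}{1000}
= 5.50
$$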


**(d)** Assume we aim to target the same proportion of customers as in the first table, with only **positive predictions** targeted. **Write down the confusion matrix for a random classifier.**  


In [None]:
# --- Q3 (d) Random classifier confusion matrix ---
pred_pos_rate = (TP + FP) / N
rand_tp = round(pos * pred_pos_rate)
rand_fp = round(neg * pred_pos_rate)
rand_fn = pos - rand_tp
rand_tn = neg - rand_fp

cm_rand = np.array([[rand_tp, rand_fn],
                    [rand_fp, rand_tn]])
print("Random classifier confusion matrix (matching positive rate):")
print(cm_rand)


**Answer (d):** Matching the same 25% positive prediction rate gives the random-classifier confusion matrix `[[50, 150], [200, 600]]` (rows = Actual Default/No Default).


**(e)** Calculate the overall **expected value (per person)** for the random classifier in step (d).

In [None]:
# --- Q3 (e) Expected value for random classifier ---
ev_rand_total = float((cm_rand * benefit).sum())
ev_rand_per_person = ev_rand_total / N
print(f"Expected value per person (random base): ${ev_rand_per_person:.4f}")


**Answer (e):** Plugging that matrix into the same payoffs gives an expected value of **-$6.00 per customer**, so the trained model is substantially better than random guessing.
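Parts (c) through (e) can be tied together in one plain-Python sketch that builds the random-classifier matrix from the 25% targeting rate and evaluates both classifiers with the same payoffs. The function name `expected_value` and the list-of-lists layout are assumptions made for this illustration.

```python
# Rows: actual Default, actual No Default. Columns: predicted Default, predicted No Default.
cm_model = [[150, 50], [100, 700]]
# Payoffs in dollars: a missed default costs 100, a repaid approved loan earns 15.
payoff = [[0, -100], [0, 15]]

def expected_value(cm, pay):
    """Average payoff per customer: sum(count * payoff) / total count."""
    total = sum(c * p
                for row_c, row_p in zip(cm, pay)
                for c, p in zip(row_c, row_p))
    n = sum(sum(row) for row in cm)
    return total / n

n = sum(sum(row) for row in cm_model)
target_rate = (cm_model[0][0] + cm_model[1][0]) / n  # share predicted Default

# A random classifier flags the same share of each actual class.
cm_random = []
for row in cm_model:
    actual = sum(row)
    flagged = round(actual * target_rate)
    cm_random.append([flagged, actual - flagged])

ev_model = expected_value(cm_model, payoff)
ev_random = expected_value(cm_random, payoff)

print("Random confusion matrix:", cm_random)
print(f"EV per customer, model:  {ev_model:.2f}")
print(f"EV per customer, random: {ev_random:.2f}")
print(f"Gain over random:        {ev_model - ev_random:.2f}")
```

The gap of 11.50 dollars per customer comes entirely from the second column: the model lets through 50 defaults instead of 150 and approves 700 good customers instead of 600.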


## 4) Question 4. [16 points]

Two classifiers (**Model A** and **Model B**) are used to predict whether the **Fed Funds rate will increase** or not (class label: 1 = increase, 0 = no increase), with each quarter considered as an instance. The estimated probabilities of increase over the past 6 quarters by model A and B respectively are displayed in the following table:

| Quarter | Actual Class | Model A | Model B |
|:--:|:--:|:--:|:--:|
| 0 | 1 | 0.43 | 0.63 |
| 1 | 1 | 0.52 | 0.53 |
| 2 | 1 | 0.85 | 0.56 |
| 3 | 1 | 0.69 | 0.71 |
| 4 | 0 | 0.03 | 0.18 |
| 5 | 0 | 0.31 | 0.76 |



**(a)** Plot the **ROC curve** for the 2 classifiers together with the **random classifier**. Please calculate the **TP** and **FP** rates with the following cutoff values **[0, 0.2, 0.4, 0.5, 0.6, 0.8, 1]** before plotting the ROC curve.  
(**Note:** you may need to calculate each model's TP and FP rates at each cutoff first. The visualization can be done either manually or with Python.)



In [None]:

# --- Q4 ROC computation and plot ---
y_true = np.array([1,1,1,1,0,0])
pA = np.array([0.43, 0.52, 0.85, 0.69, 0.03, 0.31])
pB = np.array([0.63, 0.53, 0.56, 0.71, 0.18, 0.76])

cutoffs = [0, 0.2, 0.4, 0.5, 0.6, 0.8, 1]
def tpr_fpr_at_cutoffs(y, p, cuts):
    pts = []
    pos = (y==1).sum()
    neg = (y==0).sum()
    for c in cuts:
        yhat = (p >= c).astype(int)
        TP = ((yhat==1) & (y==1)).sum()
        FP = ((yhat==1) & (y==0)).sum()
        TPR = TP / pos if pos else 0.0
        FPR = FP / neg if neg else 0.0
        pts.append((FPR, TPR))
    return np.array(pts)

ptsA = tpr_fpr_at_cutoffs(y_true, pA, cutoffs)
ptsB = tpr_fpr_at_cutoffs(y_true, pB, cutoffs)

# Also compute AUC (continuous ROC using sklearn for reference)
fprA, tprA, _ = roc_curve(y_true, pA)
fprB, tprB, _ = roc_curve(y_true, pB)
aucA = auc(fprA, tprA)
aucB = auc(fprB, tprB)

fig, ax = plt.subplots()
ax.plot([0,1], [0,1], 'k--', label="Random")
ax.plot(ptsA[:,0], ptsA[:,1], 'o-', label=f"Model A (AUC={aucA:.3f})")
ax.plot(ptsB[:,0], ptsB[:,1], 'o-', label=f"Model B (AUC={aucB:.3f})")
ax.set_xlabel("False Positive Rate (1 - Specificity)")
ax.set_ylabel("True Positive Rate (Sensitivity)")
ax.set_title("ROC Curves at Specified Cutoffs")
ax.legend()
plt.show()

# Write your reasoning for (b) below this cell.


**Answer (a):** True/false-positive rates at the requested cutoffs (the random classifier lies on the $TPR = FPR$ diagonal shown in the ROC plot).

| Cutoff | TP_A | FP_A | TPR_A | FPR_A | TP_B | FP_B | TPR_B | FPR_B |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 0.0 | 4 | 2 | 1.000 | 1.000 | 4 | 2 | 1.000 | 1.000 |
| 0.2 | 4 | 1 | 1.000 | 0.500 | 4 | 1 | 1.000 | 0.500 |
| 0.4 | 4 | 0 | 1.000 | 0.000 | 4 | 1 | 1.000 | 0.500 |
| 0.5 | 3 | 0 | 0.750 | 0.000 | 4 | 1 | 1.000 | 0.500 |
| 0.6 | 2 | 0 | 0.500 | 0.000 | 2 | 1 | 0.500 | 0.500 |
| 0.8 | 1 | 0 | 0.250 | 0.000 | 0 | 0 | 0.000 | 0.000 |
| 1.0 | 0 | 0 | 0.000 | 0.000 | 0 | 0 | 0.000 | 0.000 |

Model A reaches the $(0,1)$ corner (TPR = 1, FPR = 0) at cutoff 0.4. Model B's best point is (FPR, TPR) = (0.5, 1.0), and at cutoff 0.6 it falls onto the random diagonal at (0.5, 0.5), consistent with the AUC comparison in part (b).

**(b)** Which model is better? Why?

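**Answer (b):** Model A is better. Its ROC points lie on or above Model B's at every false-positive rate, and it reaches the perfect corner (FPR = 0, TPR = 1). The AUC values from `roc_curve`/`auc` are 1.000 for Model A versus 0.500 for Model B: Model A scores every actual increase above every non-increase, whereas Model B's ranking is no better than random guessing on this sample. The two models coincide only at the extreme cutoffs (0, 0.2, and 1).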
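The AUC values quoted above can be checked without scikit-learn by using the ranking interpretation of AUC: the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half. The function name `pairwise_auc` is illustrative.

```python
y_true = [1, 1, 1, 1, 0, 0]
p_a = [0.43, 0.52, 0.85, 0.69, 0.03, 0.31]
p_b = [0.63, 0.53, 0.56, 0.71, 0.18, 0.76]

def pairwise_auc(y, scores):
    """Share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y) if t == 1]
    neg = [s for s, t in zip(scores, y) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc_a = pairwise_auc(y_true, p_a)
auc_b = pairwise_auc(y_true, p_b)

print(f"AUC Model A: {auc_a:.3f}")
print(f"AUC Model B: {auc_b:.3f}")
```

Model B loses all four pairs involving quarter 5, whose score of 0.76 exceeds every positive quarter's score, and wins all four pairs involving quarter 4, which gives 4 of 8 pairs.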


---

### References / reminders (not graded)
- Problem statements are transcribed from **Assignment 3**.  
- This template mirrors the layout you used in Assignment 2 (section headers, environment setup, and code stubs).