# **CS6140 - Machine Learning, Spring 2025**
## **Homework 3**

Submission Instructions:
- Please complete this homework assignment in the same notebook provided.
- Submit your completed assignment on Canvas by the deadline.

Submission Deadline:
**Feb 16th, 2025**

<p align="justify">
Please read the instructions carefully when answering questions and ensure your code works correctly before submission. The grader will run your code for grading the coding questions.
</p>

This homework has four questions.


#@markdown ### Enter your first and last names below:
First Name = "Zhechao" #@param {type:"string"}
Last Name = "Jin" #@param {type:"string"}

# **Model Performance Optimization**

XYZ company has provided you with a real dataset generated by their machine learning model on their website. The dataset includes the following columns:

`item_id` = Unique ID for items  
`new_price` = Price for items on XYZ website   
`levels` = Different category levels (20000, 30000, 40000, 50000, 60000 and 80000)  
`upper_bound` = Model A's prediction for upper bound price  
`reliability_score` = The score determines when to switch between Model A and Model B. If the score is less than 0.67, Model B's predictions are used.  
`final_ceiling` = Model B's prediction for upper bound price  
`resultHigh` = Ground truth


The code below reports the current performance of the ML model, both overall and grouped by `levels`. Your task is to optimize the precision, recall, and F1 scores under the four scenarios outlined in the corresponding questions.

In [20]:
import pandas as pd
import numpy as np

df = pd.read_csv('labels.csv')
m = df.groupby('levels').resultHigh.value_counts()
rows = []

# Calculate per-level metrics
for i in np.sort(df.levels.unique()):
    counts = m[i]
    # Series.get returns the default when a label is absent for this level;
    # attribute access (counts.FP) would raise AttributeError, not KeyError.
    fp = counts.get('FP', 0)
    tp = counts.get('TP', 0)
    fn = counts.get('FN', 0)
    tn = counts.get('TN', 0)
    totanomalies = tp + fp

    try:
        precision = tp / (tp + fp)
    except ZeroDivisionError:
        precision = np.nan
    try:
        recall = tp / (tp + fn)
    except ZeroDivisionError:
        recall = np.nan
    try:
        f1 = 2 * precision * recall / (precision + recall)
    except ZeroDivisionError:
        f1 = np.nan

    rows.append([i, m[i].sum(), precision, recall, f1, tp, fp, fn, tn])

# Calculate overall metrics
t = df['resultHigh'].value_counts()  # Get overall counts

try:
    precision = t.TP / (t.TP + t.FP)
except ZeroDivisionError:
    precision = np.nan
try:
    recall = t.TP / (t.TP + t.FN)
except ZeroDivisionError:
    recall = np.nan
try:
    f1 = 2 * precision * recall / (precision + recall)
except ZeroDivisionError:
    f1 = np.nan

# Append overall metrics to the rows
rows.append(['Overall', t.sum(), precision, recall, f1, t.TP, t.FP, t.FN, t.get('TN', 0)])

# Create the final DataFrame
metric_df = pd.DataFrame(rows, columns=['Levels', 'Total Samples', 'Precision', 'Recall', 'F1', 'TP', 'FP', 'FN', 'TN'])

# Display the final DataFrame
metric_df

| | Levels | Total Samples | Precision | Recall | F1 | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000 | 370 | 0.88 | 0.409938 | 0.559322 | 66 | 9 | 95 | 200 |
| 1 | 30000 | 549 | 0.852713 | 0.4 | 0.544554 | 110 | 19 | 165 | 255 |
| 2 | 40000 | 1401 | 0.729614 | 0.330097 | 0.454545 | 170 | 63 | 345 | 823 |
| 3 | 50000 | 159 | 0.636364 | 0.194444 | 0.297872 | 7 | 4 | 29 | 119 |
| 4 | 60000 | 425 | 0.970588 | 0.375 | 0.540984 | 99 | 3 | 165 | 158 |
| 5 | 80000 | 582 | 0.637931 | 0.268116 | 0.377551 | 37 | 21 | 101 | 423 |
| 6 | Overall | 3486 | 0.804276 | 0.352052 | 0.489735 | 489 | 119 | 900 | 1978 |
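
As a quick check on the table above, the Overall row can be reproduced directly from its confusion counts. The sketch below hard-codes those counts (TP = 489, FP = 119, FN = 900) instead of reading `labels.csv`; the helper name `prf` is illustrative and not part of the assignment code.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) from confusion counts; NaN where undefined."""
    nan = float("nan")
    precision = tp / (tp + fp) if (tp + fp) > 0 else nan
    recall = tp / (tp + fn) if (tp + fn) > 0 else nan
    # F1 written directly in counts; algebraically equal to 2PR / (P + R)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else nan
    return precision, recall, f1

# Counts taken from the Overall row of the table
precision, recall, f1 = prf(tp=489, fp=119, fn=900)
print(round(precision, 6), round(recall, 6), round(f1, 6))
```

The rounded values should agree with the Precision, Recall, and F1 columns of the Overall row. Note that TN does not enter any of the three metrics, which is why turning an FP into a TN can only raise precision and leaves recall unchanged.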


## **Q1: Based on `final_ceiling`**

The engineering team suggests that false positives (FP) could be reduced by applying a multiplier to `final_ceiling`. For example, `item_id = 45995610` is currently labeled as an FP because its `new_price` exceeds the `final_ceiling` (i.e., $26.49 > 23.23$). In other words, Model B incorrectly predicts this item as *high price*, leading to the FP label. If we multiply the `final_ceiling` by 1.2, Model B would classify it as a true negative (TN) since $26.49 < (1.2 \times 23.23)$.

Could you reduce FP by adjusting the `final_ceiling` with a multiplier? Try values in the range [1, 3] in increments of 0.1, and report the best precision, recall, and F1 scores you achieve for each `levels` value and overall (similar to the table above). Also report the best multiplier per level and overall in an additional column of the table.
There's no need to focus on reducing false negatives (FN) in this case.
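
The reclassification rule in the example can be checked on the single quoted item. This is a minimal sketch that uses only the two prices from the text; `is_flagged_high` is a hypothetical helper, and building the grid with `np.linspace` is just one way to avoid floating-point drift at the endpoints.

```python
import numpy as np

def is_flagged_high(new_price, ceiling, multiplier=1.0):
    """True when the price exceeds the scaled ceiling, i.e. the item is flagged as high price."""
    return new_price > ceiling * multiplier

# item_id = 45995610, values quoted in the question text
new_price, final_ceiling = 26.49, 23.23

flag_before = is_flagged_high(new_price, final_ceiling)       # 26.49 > 23.23, stays FP
flag_after = is_flagged_high(new_price, final_ceiling, 1.2)   # 26.49 < 27.876, becomes TN

# Candidate multipliers 1.0, 1.1, ..., 3.0
multipliers = np.round(np.linspace(1.0, 3.0, 21), 1)
print(flag_before, flag_after, len(multipliers))
```

Because only FP rows can change label under this rule, the TP and FN counts stay fixed while the multiplier varies.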


In [21]:
import pandas as pd
import numpy as np

# Load dataset
file_path = "labels.csv"
df = pd.read_csv(file_path)

# Define the range of multipliers to test
multipliers = np.arange(1.0, 3.1, 0.1)

# Initialize dictionaries to store best results
best_results = {metric: {} for metric in ["Precision", "Recall", "F1"]}
best_multipliers = {metric: {} for metric in ["Precision", "Recall", "F1"]}

# Iterate through multipliers and calculate metrics
for multiplier in multipliers:
    # Adjust final ceiling
    df["adjusted_final_ceiling"] = df["final_ceiling"] * multiplier

    # Update classification based on new ceiling
    df["new_FP"] = (df["new_price"] > df["adjusted_final_ceiling"]) & (df["resultHigh"] == "FP")
    df["new_TN"] = ~df["new_FP"] & (df["resultHigh"] == "FP")
    df["new_TP"] = df["resultHigh"] == "TP"
    df["new_FN"] = df["resultHigh"] == "FN"

    # Aggregate and store best results
    for level in list(df["levels"].unique()) + ["Overall"]:
        level_df = df if level == "Overall" else df[df["levels"] == level]

        tp, fp, fn = level_df["new_TP"].sum(), level_df["new_FP"].sum(), level_df["new_FN"].sum()

        precision = tp / (tp + fp) if (tp + fp) > 0 else np.nan
        recall = tp / (tp + fn) if (tp + fn) > 0 else np.nan
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else np.nan

        for metric, value in zip(["Precision", "Recall", "F1"], [precision, recall, f1]):
            if level not in best_results[metric] or (value > best_results[metric][level]):
                best_results[metric][level] = value
                best_multipliers[metric][level] = multiplier

# Create DataFrames for reporting
result_dfs = {
    metric: pd.DataFrame.from_dict(best_results[metric], orient="index", columns=[f"Best {metric}"])
    .assign(**{f"Best Multiplier ({metric})": best_multipliers[metric]})
    for metric in ["Precision", "Recall", "F1"]
}

# Display results
# Use a separate loop variable so the dataset in `df` is not overwritten
for metric, result_df in result_dfs.items():
    print(f"\nBest {metric} Metrics:\n", result_df)





Best Precision Metrics:
          Best Precision  Best Multiplier (Precision)
40000          0.885417                          2.7
20000          0.985075                          2.3
30000          0.990991                          2.8
80000          1.000000                          2.8
60000          1.000000                          2.9
50000          1.000000                          1.7
Overall        0.953216                          2.9

Best Recall Metrics:
          Best Recall  Best Multiplier (Recall)
40000       0.330097                       1.0
20000       0.409938                       1.0
30000       0.400000                       1.0
80000       0.268116                       1.0
60000       0.375000                       1.0
50000       0.194444                       1.0
Overall     0.352052                       1.0

Best F1 Metrics:
           Best F1  Best Multiplier (F1)
40000    0.480905                   2.7
20000    0.578947                   2.3
30000    0.5

## **Q2: Based on `upper_bound`**

The engineering team suggests that false positives (FP) can be reduced by adjusting the `upper_bound` with a multiplier. For example, `item_id = 304268026` is labeled as an FP because its `new_price` exceeds the `upper_bound` (i.e., $137.31 > 118.47$). In this case, Model A incorrectly classifies the item as *high price*, leading to the FP label. If the `upper_bound` is multiplied by 1.2, Model A would predict it as a true negative (TN), since $137.31 < (1.2 \times 118.47)$.

Could you reduce FP by adjusting the `upper_bound` with a multiplier? Try values in the range [1, 3] in increments of 0.1, and report the best precision, recall, and F1 scores you achieve for each `levels` value and overall (similar to the table above). Also report the best multiplier per level and overall in an additional column of the table.
There's no need to focus on reducing false negatives (FN) in this case.
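
For a single FP item, the smallest multiplier that flips it to TN follows from the ratio of price to bound. The sketch below applies that idea to the quoted item and snaps the result up to the 0.1 grid; the variable names are illustrative only and do not come from the assignment code.

```python
import numpy as np

# item_id = 304268026, values quoted in the question text
new_price, upper_bound = 137.31, 118.47

# The item stops being flagged once multiplier * upper_bound >= new_price
ratio = new_price / upper_bound

# Multipliers 1.0, 1.1, ..., 3.0; keep the first one that clears the ratio
grid = np.round(np.linspace(1.0, 3.0, 21), 1)
smallest = float(grid[grid >= ratio][0])
print(round(ratio, 3), smallest)
```

The same ratio computed per row gives a quick view of how large a multiplier each FP would need, which can help interpret the per-level optimum found by the sweep.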


In [22]:
# Load dataset
file_path = "labels.csv"
df = pd.read_csv(file_path)

# Define the range of multipliers to test for upper_bound
multipliers = np.arange(1.0, 3.1, 0.1)

# Initialize dictionaries to store best results for upper_bound adjustment
best_results_upper = {metric: {} for metric in ["Precision", "Recall", "F1"]}
best_multipliers_upper = {metric: {} for metric in ["Precision", "Recall", "F1"]}

# Iterate through multipliers and calculate metrics
for multiplier in multipliers:
    df["adjusted_upper_bound"] = df["upper_bound"] * multiplier

    # Update classifications based on new upper_bound
    df["new_FP"] = (df["new_price"] > df["adjusted_upper_bound"]) & (df["resultHigh"] == "FP")
    df["new_TN"] = ~df["new_FP"] & (df["resultHigh"] == "FP")
    df["new_TP"] = df["resultHigh"] == "TP"
    df["new_FN"] = df["resultHigh"] == "FN"

    # Aggregate and store best results
    for level in list(df["levels"].unique()) + ["Overall"]:
        level_df = df if level == "Overall" else df[df["levels"] == level]

        tp, fp, fn = level_df["new_TP"].sum(), level_df["new_FP"].sum(), level_df["new_FN"].sum()

        precision = tp / (tp + fp) if (tp + fp) > 0 else np.nan
        recall = tp / (tp + fn) if (tp + fn) > 0 else np.nan
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else np.nan

        for metric, value in zip(["Precision", "Recall", "F1"], [precision, recall, f1]):
            if level not in best_results_upper[metric] or (value > best_results_upper[metric][level]):
                best_results_upper[metric][level] = value
                best_multipliers_upper[metric][level] = multiplier

# Create DataFrames for reporting
result_dfs_upper = {
    metric: pd.DataFrame.from_dict(best_results_upper[metric], orient="index", columns=[f"Best {metric}"])
    .assign(**{f"Best Multiplier ({metric})": best_multipliers_upper[metric]})
    for metric in ["Precision", "Recall", "F1"]
}

# Display results
# Use a separate loop variable so the dataset in `df` is not overwritten
for metric, result_df in result_dfs_upper.items():
    print(f"\nBest {metric} Metrics (Upper Bound):\n", result_df)



Best Precision Metrics (Upper Bound):
          Best Precision  Best Multiplier (Precision)
40000          1.000000                          1.8
20000          0.985075                          1.4
30000          1.000000                          1.3
80000          1.000000                          2.6
60000          1.000000                          1.0
50000          1.000000                          1.2
Overall        0.997959                          2.6

Best Recall Metrics (Upper Bound):
          Best Recall  Best Multiplier (Recall)
40000       0.330097                       1.0
20000       0.409938                       1.0
30000       0.400000                       1.0
80000       0.268116                       1.0
60000       0.375000                       1.0
50000       0.194444                       1.0
Overall     0.352052                       1.0

Best F1 Metrics (Upper Bound):
           Best F1  Best Multiplier (F1)
40000    0.496350                   1.8
20000    0

## **Q3: Based on `reliability_score`**

The engineering team suggests that the `reliability_score` may help reduce false positives (FP) by determining when to switch between `final_ceiling` and `upper_bound`. For example, `item_id = 45995610` is labeled as FP because its `new_price` exceeds the `final_ceiling` (i.e., $26.49 > 23.23$). However, if the `upper_bound` is used for this item, it would be classified as a true negative (TN) since $26.49 < 62.07$. In this case, the `reliability_score` is 0.649277023, so setting a threshold of `reliability_score < 0.65` might allow the use of `upper_bound` instead. However, this is a simplistic approach as it only considers this specific item, and a more generalized threshold should be found to optimize precision, recall, and F1 scores.

Can you optimize these metrics by setting a cut-off point on the `reliability_score` to switch between `final_ceiling` and `upper_bound`? Propose cut-off points both at the `levels` level and overall. Also report the best cut-off points per level and overall in an additional column in the table.
There's no need to consider the multipliers from Q1 and Q2 in this part.
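
The switching rule can be illustrated on the quoted item. The sketch below assumes the convention from the column description (a score below the cut-off selects Model B's `final_ceiling`, otherwise Model A's `upper_bound`); `active_ceiling` is a hypothetical helper and the cut-off of 0.60 is an arbitrary illustrative value, not a recommended one.

```python
def active_ceiling(score, final_ceiling, upper_bound, cutoff):
    """Use Model A's upper_bound at or above the cutoff, otherwise Model B's final_ceiling."""
    return upper_bound if score >= cutoff else final_ceiling

# item_id = 45995610, values quoted in the question text
new_price, final_ceiling, upper_bound, score = 26.49, 23.23, 62.07, 0.649277023

# Current rule (cutoff 0.67): Model B is used and the item is flagged as high price
flag_current = new_price > active_ceiling(score, final_ceiling, upper_bound, cutoff=0.67)

# Hypothetical lower cutoff of 0.60: Model A is used and the flag disappears
flag_lowered = new_price > active_ceiling(score, final_ceiling, upper_bound, cutoff=0.60)
print(flag_current, flag_lowered)
```

A cut-off chosen from one item will not generalize, so the search has to evaluate each candidate cut-off against all rows of a level.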


In [23]:
file_path = "labels.csv"
df = pd.read_csv(file_path)

# Convert resultHigh column into numerical labels for clarity
df["resultHigh"] = df["resultHigh"].map({"TP": 1, "FP": 0, "FN": -1, "TN": 2})

# compute the best threshold for precision, recall, and F1-score
def evaluate_thresholds(metric):
    best_results = []

    for level in list(df["levels"].unique()) + ["Overall"]:
        subset = df if level == "Overall" else df[df["levels"] == level]
        best_threshold, best_value = None, 0

        for t in sorted(subset["reliability_score"].unique()):
            predictions = np.where(subset["reliability_score"] >= t, subset["upper_bound"], subset["final_ceiling"])

            # Update FP to TN based on threshold switch
            fp = np.sum((subset["new_price"] > predictions) & (subset["resultHigh"] == 0))
            tp = np.sum(subset["resultHigh"] == 1)
            fn = np.sum(subset["resultHigh"] == -1)

            precision = tp / (tp + fp) if (tp + fp) > 0 else 0
            recall = tp / (tp + fn) if (tp + fn) > 0 else 0
            f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

            value = {"precision": precision, "recall": recall, "f1": f1}[metric]

            if value > best_value:
                best_value, best_threshold = value, t

        best_results.append([level, best_threshold, best_value])

    return pd.DataFrame(best_results, columns=["Levels", f"Best Cutoff for {metric.capitalize()}", f"Best {metric.capitalize()}"])

# Compute optimal thresholds for Precision, Recall, and F1-score
precision_df, recall_df, f1_df = evaluate_thresholds("precision"), evaluate_thresholds("recall"), evaluate_thresholds("f1")

# Display the results
print("Best Cutoffs for Precision:\n", precision_df)
print("\nBest Cutoffs for Recall:\n", recall_df)
print("\nBest Cutoffs for F1:\n", f1_df)



Best Cutoffs for Precision:
     Levels  Best Cutoff for Precision  Best Precision
0    40000                   0.262438        0.971429
1    20000                   0.262438        0.956522
2    30000                   0.295948        0.940171
3    80000                   0.295948        0.770833
4    60000                   0.262438        1.000000
5    50000                   0.295948        0.875000
6  Overall                   0.262438        0.947674

Best Cutoffs for Recall:
     Levels  Best Cutoff for Recall  Best Recall
0    40000                0.262438     0.330097
1    20000                0.262438     0.409938
2    30000                0.295948     0.400000
3    80000                0.295948     0.268116
4    60000                0.262438     0.375000
5    50000                0.295948     0.194444
6  Overall                0.262438     0.352052

Best Cutoffs for F1:
     Levels  Best Cutoff for F1   Best F1
0    40000            0.262438  0.492754
1    20000            0

## **Q4: Based on `reliability_score`, `upper_bound`, `final_ceiling`**

If you've successfully completed the first three questions, you can now combine everything to propose a more comprehensive method. Can you optimize precision, recall, and F1 scores by setting a threshold on `reliability_score` to switch between `final_ceiling` and `upper_bound`, while also determining an optimal multiplier for each? Report the best multipliers along with the cut-off points per level and overall in additional columns in the table.
Again, there's no need to focus on reducing false negatives (FN) in this case.
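
One way to read this question is as a joint search over the cut-off and both multipliers at once. The sketch below shows that idea on a small synthetic frame; the rows in `toy`, the helper `precision_at`, and the short candidate lists are all assumptions made for illustration and are not taken from `labels.csv`.

```python
import itertools
import numpy as np
import pandas as pd

# Synthetic rows for illustration only; not taken from labels.csv
toy = pd.DataFrame({
    "new_price":         [26.49, 137.31, 50.00, 80.00],
    "final_ceiling":     [23.23, 100.00, 40.00, 60.00],
    "upper_bound":       [62.07, 118.47, 45.00, 70.00],
    "reliability_score": [0.65, 0.90, 0.30, 0.70],
    "resultHigh":        ["FP", "FP", "TP", "FN"],
})

def precision_at(frame, cutoff, mult_fc, mult_ub):
    """Precision after re-checking FP rows against the scaled, switched ceiling."""
    ceiling = np.where(frame["reliability_score"] >= cutoff,
                       frame["upper_bound"] * mult_ub,
                       frame["final_ceiling"] * mult_fc)
    is_fp = frame["resultHigh"] == "FP"
    fp = int(((frame["new_price"] > ceiling) & is_fp).sum())  # FPs that remain flagged
    tp = int((frame["resultHigh"] == "TP").sum())             # TPs are left unchanged
    return tp / (tp + fp) if (tp + fp) > 0 else float("nan")

# Deliberately short candidate lists; the real search would use finer grids
cutoffs = [0.5, 0.67, 0.8]
multipliers = [1.0, 1.2, 1.5]

# Evaluate every (cutoff, final_ceiling multiplier, upper_bound multiplier) combination
best = max(itertools.product(cutoffs, multipliers, multipliers),
           key=lambda combo: precision_at(toy, *combo))
best_precision = precision_at(toy, *best)
print(best, best_precision)
```

`max` keeps the first combination that reaches the top score, so ties resolve toward the smallest cut-off and multipliers in iteration order. A joint search can differ from simply reusing the separate optima of Q1 to Q3, because the best multiplier for each bound depends on which rows the cut-off routes to it.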


In [24]:
# Reload dataset
file_path = "labels.csv"
df = pd.read_csv(file_path)

# Convert resultHigh column into numerical labels for clarity
df["resultHigh"] = df["resultHigh"].map({"TP": 1, "FP": 0, "FN": -1, "TN": 2})

# Best values per level from Q1, Q2, and Q3
best_values_per_level = {
    "20000": {"threshold": 0.262438, "multiplier_fc": 2.3, "multiplier_ub": 1.4},
    "30000": {"threshold": 0.295948, "multiplier_fc": 2.8, "multiplier_ub": 1.3},
    "40000": {"threshold": 0.262438, "multiplier_fc": 2.7, "multiplier_ub": 1.8},
    "50000": {"threshold": 0.295948, "multiplier_fc": 1.7, "multiplier_ub": 1.2},
    "60000": {"threshold": 0.262438, "multiplier_fc": 2.9, "multiplier_ub": 1.0},
    "80000": {"threshold": 0.295948, "multiplier_fc": 2.8, "multiplier_ub": 2.6},
    "Overall": {"threshold": 0.262438, "multiplier_fc": 2.9, "multiplier_ub": 2.6},
}

# Initialize results storage
final_results = []

# Iterate over levels and overall
for level in list(df["levels"].unique()) + ["Overall"]:
    subset = df if level == "Overall" else df[df["levels"] == level]

    # Retrieve best values for this level
    best_reliability_threshold = best_values_per_level[str(level)]["threshold"]
    best_multiplier_fc = best_values_per_level[str(level)]["multiplier_fc"]
    best_multiplier_ub = best_values_per_level[str(level)]["multiplier_ub"]

    # Apply best multipliers
    adjusted_fc = subset["final_ceiling"] * best_multiplier_fc
    adjusted_ub = subset["upper_bound"] * best_multiplier_ub

    # Apply the best reliability score threshold
    adjusted_predictions = np.where(subset["reliability_score"] >= best_reliability_threshold, adjusted_ub, adjusted_fc)

    # Compute metrics
    fp = np.sum((subset["new_price"] > adjusted_predictions) & (subset["resultHigh"] == 0))
    tp = np.sum(subset["resultHigh"] == 1)
    fn = np.sum(subset["resultHigh"] == -1)

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    final_results.append([level, best_reliability_threshold, best_multiplier_fc, best_multiplier_ub, precision, recall, f1])

# Convert results into DataFrame
final_results_df = pd.DataFrame(final_results, columns=[
    "Levels", "Best Reliability Threshold", "Best Multiplier (Final Ceiling)",
    "Best Multiplier (Upper Bound)", "Precision", "Recall", "F1 Score"
])

# Display results
print("Final Optimized Results Per Level (Q4):\n", final_results_df)


Final Optimized Results Per Level (Q4):
     Levels  Best Reliability Threshold  Best Multiplier (Final Ceiling)  \
0    40000                    0.262438                              2.7   
1    20000                    0.262438                              2.3   
2    30000                    0.295948                              2.8   
3    80000                    0.295948                              2.8   
4    60000                    0.262438                              2.9   
5    50000                    0.295948                              1.7   
6  Overall                    0.262438                              2.9   

   Best Multiplier (Upper Bound)  Precision    Recall  F1 Score  
0                            1.8   1.000000  0.330097  0.496350  
1                            1.4   0.985075  0.409938  0.578947  
2                            1.3   1.000000  0.400000  0.571429  
3                            2.6   1.000000  0.268116  0.422857  
4                           