<a href="https://colab.research.google.com/github/gabrielomara/K-Means-Analysis/blob/main/Poisonous_Mushrooms_ML_Cost_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [165]:
!pip install ucimlrepo

import pandas as pd
import numpy as np
from ucimlrepo import fetch_ucirepo
from sklearn.linear_model import LogisticRegression



In [166]:
# fetch dataset
mushroom = fetch_ucirepo(id=73)

# data (as pandas dataframes)
k = mushroom.data.features
f = mushroom.data.targets

#copy the data so the original dataset is left untouched when we modify it
#(pd.DataFrame(k) does not copy by default, so use .copy() explicitly)
x = k.copy()
y = f.copy()

# metadata
mushroom.metadata

# variable information
mushroom.variables

#create a df combining poisonous with variables to have an overview of the table
df = pd.concat([x, y], axis=1)

# Step 2 & 3

In [167]:
#function to replace missing values with the most frequent value in each column
def fill_missing_values():
    l = x.columns
    for i in l:
        most = x[i].value_counts().idxmax()
        x[i] = x[i].fillna(most)

fill_missing_values()

In [168]:
#replace values that appear in no more than 1/10 of instances with the column's most frequent value
def remove_rare():
    l = x.columns
    for i in l:
        value_counts = x[i].value_counts()
        for value, count in value_counts.items():
            # Check if the count is less than or equal to 813 (1/10th of instances)
            if count <= 813:
                # Replace the rare value with the most frequent value
                most = x[i].value_counts().idxmax()
                x[i] = x[i].replace(value, most)

remove_rare()
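As a minimal sketch of the same idea on a toy column, the helper below collapses rare values into the most frequent one. The name `collapse_rare` and the `min_frac` parameter are hypothetical (they are not part of the notebook); the fraction is used in place of the hard-coded count of 813.

```python
import pandas as pd

def collapse_rare(col, min_frac=0.1):
    """Replace values seen in at most min_frac of the rows with the column mode.

    Hypothetical helper for illustration; the notebook uses a fixed count of 813.
    """
    counts = col.value_counts()
    most = counts.idxmax()
    rare = counts[counts <= min_frac * len(col)].index
    # Keep values that are not rare, replace the rest with the mode.
    return col.where(~col.isin(rare), most)

# Toy data: "a" is common, "b" and "c" each appear in 1 of 10 rows.
toy = pd.Series(["a"] * 8 + ["b"] + ["c"])
result = collapse_rare(toy)
print(result.value_counts().to_dict())
```

Expressing the cutoff as a fraction means it scales with the number of rows, which may be convenient if the dataset size changes.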

In [169]:
#assign each variable a numerical value as opposed to categorical
for column_name in x.columns:
    x[column_name] = x[column_name].astype('category')
    x[column_name] = x[column_name].cat.codes
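To illustrate what this encoding does, here is a small sketch on made-up values (the letters below are illustrative, not taken from the dataset). For string labels pandas sorts the categories, so the codes follow alphabetical order.

```python
import pandas as pd

# Toy column of single-letter category labels (illustrative values only).
toy = pd.Series(["x", "b", "x", "s"])

# Categories are sorted (b, s, x), so b -> 0, s -> 1, x -> 2.
codes = toy.astype("category").cat.codes
print(codes.tolist())  # [2, 0, 2, 1]
```

Note that the resulting integers imply an ordering that the original labels do not have, which a linear model such as logistic regression will treat as meaningful.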

**Step 4: Considering a specific business setting...**

The market for rare mushrooms is niche; however, being the first to find and cultivate a new species of mushroom could be a profitable endeavour.

Yet we would have to be sure that the mushroom was edible before trying it ourselves. In this case, it would be good to have a model that could predict whether or not it was edible, as a mistake here could be deadly.

**Step 5: False positives and negatives...**

In this scenario the cost of a false negative is extremely high and may be a matter of life or death.

If you were to pick and eat a mushroom that was a false negative (predicted edible but actually poisonous), you would be eating something poisonous, with effects ranging from stomach discomfort to nerve damage and possibly death. This could lead to very high medical bills or even legal issues.

A false positive, on the other hand, is relatively benign because the mushroom is simply not eaten. There is still an opportunity cost to consider: an edible mushroom falsely labelled as "poisonous" might otherwise have sold well and been a profitable fungus to cultivate.

**Cost Ratio**

I estimate that the cost of a false negative is around 1,000 times higher than that of a false positive.

This reflects the potentially deadly consequences, high medical fees and legal damages that a false negative could incur, which I put at €10,000, versus the lost mushroom from a false positive, which I put at €10 (adding a little to account for the opportunity cost of picking the mushroom and its commercial potential).

The ratio is therefore 1,000:1, meaning that false negatives are far more costly than false positives.
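As a rough worked example of what this ratio implies, the sketch below computes the break-even probability at which calling a mushroom "edible" and calling it "poisonous" have the same expected cost. It assumes the model's probabilities are well calibrated, which a fitted logistic regression need not satisfy, so an empirically chosen threshold may differ. The function name `expected_costs` is hypothetical.

```python
# Costs from the text (in euros).
cost_fp = 10       # edible mushroom wrongly flagged as poisonous
cost_fn = 10_000   # poisonous mushroom wrongly flagged as edible

def expected_costs(p_poisonous):
    """Expected cost of each decision, given P(poisonous) for one mushroom."""
    cost_if_called_edible = p_poisonous * cost_fn
    cost_if_called_poisonous = (1 - p_poisonous) * cost_fp
    return cost_if_called_edible, cost_if_called_poisonous

# Break-even point: p * cost_fn == (1 - p) * cost_fp
break_even = cost_fp / (cost_fp + cost_fn)
print(round(break_even, 6))  # 0.000999
```

Under that calibration assumption, any mushroom with more than roughly a 0.1% chance of being poisonous would be cheaper to reject than to eat.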

In [170]:
#step 6 function to calculate the costs of prediction
def calculate_cost(y_actual, y_pred):
    cost_FP = 0
    cost_FN = 0
    for i in range(len(y_pred)):
        if y_pred[i] == 'p' and y_actual.iloc[i]['poisonous'] == 'e':
            cost_FP += 10
        elif y_pred[i] == 'e' and y_actual.iloc[i]['poisonous'] == 'p':
            cost_FN += 10000
    cost = cost_FN + cost_FP
    return cost
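The same cost can also be computed without an explicit Python loop. The following is a minimal sketch of a vectorized variant; the name `calculate_cost_vectorized` and its keyword parameters are hypothetical and not part of the notebook.

```python
import numpy as np

def calculate_cost_vectorized(y_actual, y_pred, cost_fp=10, cost_fn=10_000):
    """Total cost of predictions; labels are 'p' (poisonous) or 'e' (edible)."""
    # ravel() flattens a one-column DataFrame or a plain list to 1-D.
    actual = np.asarray(y_actual).ravel()
    pred = np.asarray(y_pred).ravel()
    n_fp = np.sum((pred == "p") & (actual == "e"))  # edible flagged as poisonous
    n_fn = np.sum((pred == "e") & (actual == "p"))  # poisonous flagged as edible
    return int(n_fp * cost_fp + n_fn * cost_fn)

# One false positive (index 0) and one false negative (index 2).
actual = ["e", "e", "p", "p"]
pred = ["p", "e", "e", "p"]
print(calculate_cost_vectorized(actual, pred))  # 10010
```

With around 800 test rows per fold and 100 thresholds, avoiding the per-row loop can noticeably shorten the run time.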

In [171]:
#step 7 creating candidate thresholds for mapping probabilities to predictions

thresholds = np.linspace(0, 1, 100)
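One detail worth checking is the grid spacing: `np.linspace(0, 1, 100)` includes both endpoints, so the step is 1/99 rather than 0.01. The short sketch below confirms this, which is why candidate thresholds take values such as 0.0505 instead of 0.05.

```python
import numpy as np

thresholds = np.linspace(0, 1, 100)

# 100 points spanning [0, 1] inclusive leave 99 gaps between them.
step = thresholds[1] - thresholds[0]
print(step)           # approximately 1/99
print(thresholds[5])  # approximately 5/99
```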


In [172]:
#step 8 creating a 100 x 10 matrix of zeros (one row per threshold, one column per fold)

out = np.zeros((100,10))


In [173]:
#step 9 creating fold_vec: a shuffled vector assigning each row to one of 10 folds
n = np.ceil(len(y) / 10)
fold_vec = np.concatenate([np.arange(1, 11)] * int(n))
fold_vec = fold_vec[0:len(y)]
np.random.seed(1)
fold_vec = np.random.permutation(fold_vec)
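To see that this construction yields balanced folds, here is a small sketch on an illustrative size of 25 rows (the notebook uses `len(y)`). It uses `np.tile` and `np.random.default_rng` as stand-ins for the concatenate and seed calls, so the shuffled order will not match the notebook's, but the fold sizes are unaffected by shuffling.

```python
import numpy as np

n_rows = 25   # illustrative size only
n_folds = 10

# Repeat the labels 1..10 enough times to cover every row, then trim.
reps = int(np.ceil(n_rows / n_folds))
fold_vec = np.tile(np.arange(1, n_folds + 1), reps)[:n_rows]

# Shuffle so that fold membership is random.
rng = np.random.default_rng(1)
fold_vec = rng.permutation(fold_vec)

# Count rows per fold (index 0 of bincount is unused, so drop it).
sizes = np.bincount(fold_vec)[1:]
print(sizes.tolist())  # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

Fold sizes differ by at most one row, so each fold contributes a comparable amount to the summed cost.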


In [174]:
#step 10 logistic regression and prediction cost for candidate thresholds

for i in range(10):
    # Determine the test and train indices
    test_i = np.where(fold_vec == i + 1)
    train_i = np.where(fold_vec != i + 1)

    #split data into training and testing sets
    x_train = x.iloc[train_i]
    y_train = y.iloc[train_i]
    x_test = x.iloc[test_i]
    y_test = y.iloc[test_i]

    #initialize and fit the logistic regression model
    mod = LogisticRegression(max_iter=1000)
    mod.fit(x_train, y_train.values.ravel())

    #predicted probability of class 'p' (poisonous), the second entry in mod.classes_
    y_pred_prob = mod.predict_proba(x_test)[:, 1]

    for j, threshold in enumerate(thresholds):
        y_pred = ['p' if prob >= threshold else 'e' for prob in y_pred_prob]
        out[j, i] = calculate_cost(y_test, y_pred)
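The inner loop over thresholds can also be expressed with NumPy broadcasting. The sketch below uses made-up probabilities and labels to show the idea for a single fold; it is an alternative formulation under those toy assumptions, not the notebook's method.

```python
import numpy as np

probs = np.array([0.02, 0.40, 0.90])   # illustrative P(poisonous) per sample
actual = np.array(["e", "p", "p"])     # illustrative true labels
thresholds = np.array([0.0, 0.5, 1.0])

# Shape (n_thresholds, n_samples): True where a sample is called poisonous.
called_p = probs[None, :] >= thresholds[:, None]
is_p = actual == "p"

n_fp = (called_p & ~is_p).sum(axis=1)   # edible but called poisonous
n_fn = (~called_p & is_p).sum(axis=1)   # poisonous but called edible
costs = n_fp * 10 + n_fn * 10_000
print(costs.tolist())  # [10, 10000, 20000]
```

Each entry of `costs` corresponds to one threshold, so for a given fold it could be written into one column of the cost matrix in a single assignment.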




In [175]:
#step 11 determine the threshold with the best performance and its associated cost
# Sum each threshold's cost across the 10 folds (row-wise sum)
row_sums = out.sum(axis=1)

# Find the index of the minimum sum
min_index = np.argmin(row_sums)

# Extract the best threshold and cost
best_threshold = thresholds[min_index]
best_cost = row_sums[min_index]

print(f" the best cost is {best_cost}, with a threshold of {best_threshold}")


 the best cost is 8480.0, with a threshold of 0.05050505050505051


**Step 12**

It is clear that, given the high cost of false negatives, the very low threshold of about 0.05 leads to the lowest cost.

For this reason I will refine the threshold grid to home in on a more exact value at the lower end of the scale.

In [176]:
#Refining thresholds:
thresholds2 = np.linspace(0, 0.1, 100)

In [177]:
#create new out to store cost values for each test threshold
out2 = np.zeros((100,10))

In [179]:
#calculate new costs for each new threshold and store it in "out2"
for i in range(10):
    # Determine the test and train indices
    test_i = np.where(fold_vec == i + 1)
    train_i = np.where(fold_vec != i + 1)

    #split data into training and testing sets
    x_train = x.iloc[train_i]
    y_train = y.iloc[train_i]
    x_test = x.iloc[test_i]
    y_test = y.iloc[test_i]

    #initialize and fit the logistic regression model
    mod = LogisticRegression(max_iter=1000)
    mod.fit(x_train, y_train.values.ravel())

    #predict probabilities for the test set
    y_pred_prob = mod.predict_proba(x_test)[:, 1]

    for j, threshold in enumerate(thresholds2):
        y_pred = ['p' if prob >= threshold else 'e' for prob in y_pred_prob]
        out2[j, i] = calculate_cost(y_test, y_pred)

In [180]:
#calculate new cost with refined thresholds:

row_sums2 = out2.sum(axis=1)

# Find the index of the minimum sum
min_index = np.argmin(row_sums2)

# Extract the best threshold and cost
best_threshold_refined = thresholds2[min_index]
best_cost_refined = row_sums2[min_index]

print("Best Threshold (Refined):", best_threshold_refined)
print("Best Cost (Refined):", best_cost_refined)

Best Threshold (Refined): 0.05454545454545454
Best Cost (Refined): 8180.0


We can see above that with the refined threshold of 0.0545, our cost has decreased ever so slightly, from 8480 to 8180.