In [1]:
# packages
import pandas as pd
from mod02_build_bot_predictor import train_model

### Define a function to extract predictions from the model

In [2]:
def predict_bot(df, model=None):
    """
    Predict whether each account is a bot (1) or human (0).
    """
    if model is None:
        model = train_model()

    preds = model.predict(df)
    return pd.Series(preds, index=df.index)

### Define a function to evaluate model error

In [3]:
def confusion_matrix_and_metrics(y_true, y_pred):
    """
    Computes confusion matrix and common error rates for binary classification.

    Assumes labels:
      0 = negative class
      1 = positive class

    Returns:
      dict with:
        tn, fp, fn, tp
        misclassification_rate
        false_positive_rate
        false_negative_rate
    """
    tn = fp = fn = tp = 0

    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 0:
            tn += 1
        elif yt == 0 and yp == 1:
            fp += 1
        elif yt == 1 and yp == 0:
            fn += 1
        elif yt == 1 and yp == 1:
            tp += 1
        else:
            raise ValueError("Labels must be 0 or 1")

    total = tn + fp + fn + tp

    misclassification_rate = (fp + fn) / total if total > 0 else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    false_negative_rate = fn / (fn + tp) if (fn + tp) > 0 else 0.0

    return {
        "tp": tp,
        "tn": tn,
        "fp": fp,
        "fn": fn,
        "misclassification_rate": misclassification_rate,
        "false_positive_rate": false_positive_rate,
        "false_negative_rate": false_negative_rate,
    }
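
As a quick sanity check on the counting logic, the same four cells can be tallied with `collections.Counter` on a small hand-made example. This is a minimal sketch: the helper name `confusion_counts` and the toy labels are illustrative assumptions, not part of the assignment code, and unlike the function above it does not validate that labels are 0 or 1.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Tally (true, predicted) label pairs. Hypothetical helper for cross-checking."""
    pairs = Counter(zip(y_true, y_pred))
    # Counter returns 0 for pairs that never occur, so empty cells are handled.
    return {
        "tn": pairs[(0, 0)],
        "fp": pairs[(0, 1)],
        "fn": pairs[(1, 0)],
        "tp": pairs[(1, 1)],
    }


# Toy labels made up for illustration: 3 humans (0) and 3 bots (1).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
false_positive_rate = counts["fp"] / (counts["fp"] + counts["tn"])
false_negative_rate = counts["fn"] / (counts["fn"] + counts["tp"])
print(counts, false_positive_rate, false_negative_rate)
```

With these toy labels, one of the three humans is flagged as a bot and one of the three bots is missed, so both rates come out to one third.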


### Load the data

In [4]:
TRAIN_PATH = "mod02_data/train.csv"
train = pd.read_csv(TRAIN_PATH)

TEST_PATH = "mod02_data/test.csv"
test = pd.read_csv(TEST_PATH)

### Format the data by independent vs. dependent variables

In [5]:
X_train = train.drop(columns=["is_bot"])
y_train = train['is_bot']

X_test = test.drop(columns=["is_bot"])
y_test = test['is_bot']

### Build the model on training data

In [6]:
model = train_model(X_train, y_train)

### Get the model predictions on training and test data

In [7]:
y_pred_train = predict_bot(X_train, model)
y_pred_test = predict_bot(X_test, model)

### Check results on the training set (data used to build the model)

In [8]:
confusion_matrix_and_metrics(y_train, y_pred_train)

{'tp': 152,
 'tn': 2611,
 'fp': 26,
 'fn': 211,
 'misclassification_rate': 0.079,
 'false_positive_rate': 0.009859689040576413,
 'false_negative_rate': 0.581267217630854}

### Check results on the test set (new data not yet seen by the model)

In [9]:
confusion_matrix_and_metrics(y_test, y_pred_test)

{'tp': 38,
 'tn': 850,
 'fp': 24,
 'fn': 88,
 'misclassification_rate': 0.112,
 'false_positive_rate': 0.02745995423340961,
 'false_negative_rate': 0.6984126984126984}
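
To compare the two result sets side by side, the rates can be recomputed from the reported counts and subtracted. This is a minimal sketch: the helper `error_rates` is a hypothetical name introduced here, and the counts are copied from the two outputs above.

```python
def error_rates(tp, tn, fp, fn):
    """Recompute the three error rates from confusion-matrix counts (hypothetical helper)."""
    total = tp + tn + fp + fn
    return {
        "misclassification_rate": (fp + fn) / total,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Counts copied from the training and test outputs above.
train_rates = error_rates(tp=152, tn=2611, fp=26, fn=211)
test_rates = error_rates(tp=38, tn=850, fp=24, fn=88)

# Positive gap means the model does worse on the unseen test data.
gaps = {name: test_rates[name] - train_rates[name] for name in train_rates}

for name, gap in gaps.items():
    print(f"{name}: train={train_rates[name]:.3f} test={test_rates[name]:.3f} gap={gap:+.3f}")
```

All three rates are higher on the test set than on the training set, with the largest increase in the false negative rate.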

# Discussion Questions

### Based on the misclassification rate of your model, discuss your confidence in the ability to predict a bot. 

The model's misclassification rate is about 7.9% on the training data and 11.2% on the test data, which suggests that it performs well overall. However, I have only limited confidence in its ability to predict a bot, because the false negative rate is high: about 58% on the training data and about 70% on the test data. In other words, the model rarely misclassifies a human as a bot, but it misses most of the actual bots and labels them as human, letting them get away. In a high-stakes scenario (e.g., life or death), I would not put confidence in the model.
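
One way to put the misclassification rate in context is to compare it with a trivial baseline that labels every account as human. The sketch below uses only the test-set counts reported above; the variable names, and the choice of an "always human" baseline, are assumptions made for illustration.

```python
# Test-set counts copied from the results above.
tp, tn, fp, fn = 38, 850, 24, 88

total = tp + tn + fp + fn

# Share of all test accounts the model gets wrong.
misclassification_rate = (fp + fn) / total

# Share of actual bots the model catches (1 - false negative rate).
recall = tp / (tp + fn)

# Share of accounts flagged as bots that really are bots.
precision = tp / (tp + fp)

# Error of a baseline that always predicts "human": it misses every bot.
baseline_error = (tp + fn) / total

print(f"model error:    {misclassification_rate:.3f}")
print(f"baseline error: {baseline_error:.3f}")
print(f"recall:         {recall:.3f}")
print(f"precision:      {precision:.3f}")
```

Under these counts, always predicting "human" would be wrong on 12.6% of test accounts, so the model's 11.2% is only a modest improvement, and it catches roughly 30% of the actual bots. This is why the misclassification rate alone can overstate how well the model detects bots when bots are a small share of the data.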

### What are potential ramifications of false positives from the model?

A false positive means that a human is incorrectly classified as a bot. Since the training data appears to describe users and their accounts, one ramification is that real users could be reported or removed from the platform when they have done nothing wrong. This could cause frustration and unrest among users and erode trust in the platform if real people are punished without reason. In another context, such as the novel "Do Androids Dream of Electric Sheep?", where life and death are at stake, a human could be incorrectly identified as a bot and retired by the bounty hunters.

### What are potential ramifications of false negatives from the model?

A false negative means that a bot is incorrectly labeled as a human user. One ramification is that bots continue to operate undetected, which may contribute to spam and misinformation on platforms. With a high false negative rate, letting these bots continue defeats the purpose of the model and undermines the platform. In a broader context, these bots could continue living among humans, possibly with destructive tendencies. But if they are benign and show real emotion, the question becomes: how bad is a false negative compared to a false positive?