In [1]:
conda install pandas numpy matplotlib scikit-learn seaborn

Jupyter detected...
2 channel Terms of Service accepted
Retrieving notices: done
Channels:
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [2]:
# packages
import pandas as pd
from mod02_build_bot_predictor import train_model

### Define a function to extract predictions from the model

In [3]:
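def predict_bot(df, model=None):
    """
    Predict whether each account is a bot (1) or human (0).

    df: feature columns only (no "is_bot" label column).
    model: a fitted model with a .predict() method; if omitted,
           train_model() is called with no arguments.
    Returns a pd.Series of predictions aligned to df.index.
    """
    if model is None:
        model = train_model()

    preds = model.predict(df)
    return pd.Series(preds, index=df.index)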

### Define a function to evaluate model error

In [4]:
def confusion_matrix_and_metrics(y_true, y_pred):
    """
    Computes confusion matrix and common error rates for binary classification.

    Assumes labels:
      0 = negative class
      1 = positive class

    Returns:
      dict with:
        tn, fp, fn, tp
        misclassification_rate
        false_positive_rate
        false_negative_rate
    """
    tn = fp = fn = tp = 0

    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 0:
            tn += 1
        elif yt == 0 and yp == 1:
            fp += 1
        elif yt == 1 and yp == 0:
            fn += 1
        elif yt == 1 and yp == 1:
            tp += 1
        else:
            raise ValueError("Labels must be 0 or 1")

    total = tn + fp + fn + tp

    misclassification_rate = (fp + fn) / total if total > 0 else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    false_negative_rate = fn / (fn + tp) if (fn + tp) > 0 else 0.0

    return {
        "tp": tp,
        "tn": tn,
        "fp": fp,
        "fn": fn,
        "misclassification_rate": misclassification_rate,
        "false_positive_rate": false_positive_rate,
        "false_negative_rate": false_negative_rate,
    }
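
The four counts can also be gathered in one pass by counting (true, predicted) pairs. The sketch below is an alternative written for illustration, not part of the assignment code; `confusion_counts` and the demo lists are hypothetical names, and the tiny input is small enough to check by hand.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true, predicted) label pairs for 0/1 labels."""
    pairs = Counter(zip(y_true, y_pred))

    # Reject anything that is not a 0/1 label pair.
    unexpected = set(pairs) - {(0, 0), (0, 1), (1, 0), (1, 1)}
    if unexpected:
        raise ValueError("Labels must be 0 or 1")

    # Counter returns 0 for pairs that never occurred.
    return {
        "tn": pairs[(0, 0)],
        "fp": pairs[(0, 1)],
        "fn": pairs[(1, 0)],
        "tp": pairs[(1, 1)],
    }


# Hand-checkable demo: 3 humans (0) followed by 3 bots (1).
y_true_demo = [0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 0, 0]
demo_counts = confusion_counts(y_true_demo, y_pred_demo)
print(demo_counts)
```

Two humans are correctly passed, one human is flagged, one bot is caught, and two bots are missed, which should match the printed counts.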


### Load the data

In [5]:
TRAIN_PATH = "mod02_data/train.csv"
train = pd.read_csv(TRAIN_PATH)

TEST_PATH = "mod02_data/test.csv"
test = pd.read_csv(TEST_PATH)

### Split the data into features (independent variables) and target (dependent variable)

In [6]:
X_train = train.drop(columns=["is_bot"])
y_train = train["is_bot"]

X_test = test.drop(columns=["is_bot"])
y_test = test["is_bot"]

### Build the model on training data

In [7]:
model = train_model(X_train, y_train)

### Get the model predictions on training and test data

In [8]:
y_pred_train = predict_bot(X_train, model)
y_pred_test = predict_bot(X_test, model)

### Check results on the training set (data used to build the model)

In [9]:
confusion_matrix_and_metrics(y_train, y_pred_train)

{'tp': 188,
 'tn': 2628,
 'fp': 9,
 'fn': 175,
 'misclassification_rate': 0.06133333333333333,
 'false_positive_rate': 0.0034129692832764505,
 'false_negative_rate': 0.4820936639118457}

### Check results on the test set (new data not yet seen by the model)

In [10]:
confusion_matrix_and_metrics(y_test, y_pred_test)

{'tp': 30,
 'tn': 850,
 'fp': 24,
 'fn': 96,
 'misclassification_rate': 0.12,
 'false_positive_rate': 0.02745995423340961,
 'false_negative_rate': 0.7619047619047619}
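
To read the two result sets side by side, the rates can be recomputed from the counts printed above and compared. This is a minimal sketch: the counts are copied from the two outputs, and `error_rates` is a hypothetical helper that skips the zero-denominator guards because none of these counts are zero.

```python
# Counts copied from the training and test outputs above.
train_counts = {"tp": 188, "tn": 2628, "fp": 9, "fn": 175}
test_counts = {"tp": 30, "tn": 850, "fp": 24, "fn": 96}


def error_rates(c):
    """Same three rates as confusion_matrix_and_metrics, from counts."""
    total = c["tp"] + c["tn"] + c["fp"] + c["fn"]
    return {
        "misclassification_rate": (c["fp"] + c["fn"]) / total,
        "false_positive_rate": c["fp"] / (c["fp"] + c["tn"]),
        "false_negative_rate": c["fn"] / (c["fn"] + c["tp"]),
    }


train_rates = error_rates(train_counts)
test_rates = error_rates(test_counts)

# A positive gap means the model does worse on unseen data.
for name in train_rates:
    gap = test_rates[name] - train_rates[name]
    print(f"{name}: train={train_rates[name]:.3f} "
          f"test={test_rates[name]:.3f} gap={gap:+.3f}")
```

All three rates are higher on the test set, with the largest jump in the false negative rate.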

# Discussion Questions

### Based on the misclassification rate of your model, discuss your confidence in the ability to predict a bot. 

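The test misclassification rate is 12%, which puts accuracy at 88%. That number is less impressive than it looks because the classes are imbalanced: the test set contains 874 humans (TN + FP) and only 126 bots (TP + FN). A model that labeled every account as human would already reach 87.4% accuracy, so this model barely improves on guessing. Its false negative rate of 76% also means it misses roughly three out of four bots, so confidence in its ability to predict a bot should be low.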
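
The "always guess human" comparison can be checked directly from the test counts. A minimal sketch, using the counts from the test output above; the variable names are illustrative only.

```python
# Counts copied from the test-set output above.
test_counts = {"tp": 30, "tn": 850, "fp": 24, "fn": 96}

n_humans = test_counts["tn"] + test_counts["fp"]  # true label 0
n_bots = test_counts["tp"] + test_counts["fn"]    # true label 1
total = n_humans + n_bots

# Accuracy of a baseline that always predicts the larger class.
baseline_accuracy = max(n_humans, n_bots) / total

# Accuracy of the trained model on the same data.
model_accuracy = (test_counts["tp"] + test_counts["tn"]) / total

print(f"humans={n_humans}, bots={n_bots}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"model accuracy:    {model_accuracy:.3f}")
```

The gap between the two accuracies is under one percentage point, which is why accuracy alone says little about bot detection here.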

### What are potential ramifications of false positives from the model?

In the real world, a false positive means a real person is treated as a bot. That could have serious consequences, such as users being repeatedly locked out of a platform because it keeps flagging them. If it happens often enough, users lose trust in the platform and may leave it.
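
One way to size this risk is to ask what share of the accounts the model flags as bots are actually human. A rough sketch from the test counts above; the variable names are illustrative, and "flagged" here simply means a predicted label of 1.

```python
# Counts copied from the test-set output above.
test_counts = {"tp": 30, "tn": 850, "fp": 24, "fn": 96}

# Every account the model labeled as a bot.
flagged = test_counts["tp"] + test_counts["fp"]

# Precision: share of flagged accounts that really are bots.
precision = test_counts["tp"] / flagged

# The complement: share of flagged accounts that are real people.
humans_among_flagged = test_counts["fp"] / flagged

print(f"flagged accounts: {flagged}")
print(f"precision: {precision:.3f}")
print(f"humans among flagged: {humans_among_flagged:.3f}")
```

So although the false positive rate across all humans is low, a large share of the accounts that would actually be blocked belong to real people.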

### What are potential ramifications of false negatives from the model?

A false negative lets a bot through as if it were human. That allows spam, fake data, and manipulation of any data the platform gathers. It also wastes resources that are meant to serve humans.
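
The scale of the problem on the test set can be read off the same counts: recall is the share of real bots the model catches, and the false negative rate is the share it lets through. A minimal sketch with illustrative variable names:

```python
# Counts copied from the test-set output above.
test_counts = {"tp": 30, "tn": 850, "fp": 24, "fn": 96}

# All accounts whose true label is bot.
bots_total = test_counts["tp"] + test_counts["fn"]

# Recall: share of real bots that were caught.
recall = test_counts["tp"] / bots_total

# False negative rate: share of real bots that were missed.
false_negative_rate = test_counts["fn"] / bots_total

print(f"bots in test set: {bots_total}")
print(f"bots missed: {test_counts['fn']}")
print(f"recall: {recall:.3f}")
print(f"false negative rate: {false_negative_rate:.3f}")
```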