# Project: Bot or Not? Detecting Fake Accounts with Classification

*Machine Learning Foundations for Beginners*

*Codecademy Live Learning*

<a href="https://colab.research.google.com/github/dougyd92/ML-Foudations/blob/main/Projects/Project 2 Bot or Not Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ⚠️ IMPORTANT: Save Your Own Copy Before You Start!

You are viewing a **read-only** notebook. Colab will let you run code and make edits, but **your changes will NOT be saved** unless you make your own copy first.

> A student in a previous cohort lost all their work because they didn't do this. Don't let it happen to you!

**Do this now, before anything else:**

1. Go to **File → Save a copy in Drive** (or **File → Save a copy in GitHub** if you prefer)
2. Verify that your notebook title now says **"Copy of..."** or is saved in your own Drive/repo
3. Continue working **only in your copy**

If you're not sure whether you're in your own copy, check the title bar at the top of the page.

# Overview

At companies like Meta, detecting spam accounts and fake profiles is a core machine learning problem. 
Rather than analyzing what accounts *say*, these systems focus on how accounts *behave* — features like posting frequency, account age, follower-to-following ratio, and profile completeness.

In this project, you will use account-level behavioral features to classify Twitter accounts as **bots** or **legitimate human users**. 
You will train and compare two different classifiers — logistic regression and random forest — and evaluate them using the classification metrics from class.

## Learning Objectives
- Train and compare classification models (logistic regression vs. random forest)
- Evaluate classifiers using confusion matrices, precision, recall, F1, and ROC/AUC
- Interpret feature importances from a tree-based model
- Think critically about error types and their real-world consequences

## Core Requirements
1. Complete the data preprocessing (train/test split and feature scaling)
2. Train a logistic regression model
3. Train a random forest model
4. Evaluate and compare both models using the provided helper function
5. Interpret feature importances
6. Answer the short reflection questions

## Optional Enhancements (Bonus)
- Try XGBoost or SVM
- Hyperparameter tuning with GridSearchCV
- Threshold tuning with precision-recall curves

## Due Date
**TODO: Fill in due date**

# Getting the Dataset

We will use the **Twitter Bot Accounts** dataset from Kaggle. This dataset contains ~37,000 Twitter accounts labeled as bot or human, with behavioral features like follower count, posting frequency, and account age.

**Download instructions:**

1. Go to: https://www.kaggle.com/datasets/davidmartngutirrez/twitter-bots-accounts
2. You will need a free Kaggle account to download
3. **Important:** Download **Version 2** of the dataset (use the version selector on the page). Version 2 contains all the feature columns we need.
4. Unzip the downloaded file. You should have a CSV file.
5. Upload it to this Colab notebook using the cell below.

**Licensing note:** This dataset is for educational use. Do not redistribute the raw data in your portfolio. Instead, link to the original Kaggle source.

# Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
    roc_curve, auc, f1_score
)

# Reproducibility
np.random.seed(42)

# Plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## Load the Dataset

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
# TODO (Instructor): Update the filename below to match the actual CSV filename in the dataset
df = pd.read_csv('twitter_bots_data.csv')  # <-- update filename if needed
print(f'Dataset shape: {df.shape}')
df.head()

# Exploratory Data Analysis

This section is pre-completed. Read through the plots and observations — they will inform the modeling decisions later.

## Column Descriptions

**TODO (Instructor):** After loading the dataset, fill in this section with:
- A table or list describing what each column represents
- Note which columns are features vs. identifiers vs. the target label
- Note data types (numeric, boolean, text, etc.)

In [None]:
# TODO (Instructor): Display basic dataset info
# df.info()
# df.describe()

## Class Balance

**TODO (Instructor):** Plot the distribution of bot vs. human labels. Annotate with percentages. 
Note the class ratio and whether it is balanced enough to use accuracy, or whether we need to rely on other metrics.

In [None]:
# TODO (Instructor): Class balance bar chart
# Example:
# df['label_column'].value_counts().plot(kind='bar')
# plt.title('Class Distribution: Bot vs Human')
# plt.ylabel('Count')
# plt.show()

## Feature Distributions: Bot vs. Human

**TODO (Instructor):** Create side-by-side or overlapping distribution plots for key behavioral features, 
split by class (bot vs. human). Good candidates include:
- Follower count
- Friend/following count
- Statuses (tweet) count
- Favourites count
- Account age
- Average tweets per day

Note: Many of these features are likely heavily skewed. Consider using log-scale or log-transformed plots 
for readability. Write a brief observation after each plot.
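
As a rough illustration of why a log transform helps with skewed counts, the sketch below uses made-up follower counts (not values from the dataset) and `np.log1p`, which computes log(1 + x) so that zero counts stay defined. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical follower counts spanning several orders of magnitude
followers = pd.Series([0, 9, 99, 999, 99_999], name="followers_count")

# log1p(x) = log(1 + x), which is safe for zero counts
log_followers = np.log1p(followers)

# On the raw scale the maximum dwarfs the median
print(followers.max() / followers.median())

# On the log scale the values are spread far more evenly
print(log_followers.round(2).tolist())
```

On the raw scale a histogram of these values would be dominated by the largest account; after the transform each order of magnitude takes up a similar amount of space.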

In [None]:
# TODO (Instructor): Feature distribution plots, bot vs human
# Consider using log scale for skewed features
# Write observations as markdown cells after the plots

## Correlation Analysis

**TODO (Instructor):** Create a correlation heatmap of the numeric features. Note any highly correlated pairs 
and whether any features are strongly correlated with the target label.

In [None]:
# TODO (Instructor): Correlation heatmap

## EDA Summary

**TODO (Instructor):** Write a brief summary (bullet points) of the key findings from EDA that are relevant to modeling. For example:
- The class ratio is approximately X% bot / Y% human
- Features A, B, C show the clearest separation between bots and humans
- Features D and E are highly correlated with each other
- Several features are heavily right-skewed

# Data Preprocessing

The feature selection, engineering, and cleanup steps have been completed for you. 
Your tasks in this section are to **split the data** and **scale the features**.

## Feature Selection

**TODO (Instructor):** Drop non-useful columns (IDs, raw text fields like screen_name or description, etc.). 
Explain briefly why each is dropped (e.g., "screen_name is a unique identifier, not a predictive feature").

In [None]:
# TODO (Instructor): Drop non-useful columns
# Example:
# df = df.drop(columns=['id', 'screen_name', 'description', ...])
# print(f'Remaining columns: {list(df.columns)}')

## Feature Engineering

**TODO (Instructor):** Create 2-3 derived features that capture behavioral signals. For example:
- `follower_friend_ratio`: followers / (friends + 1). How "popular" is the account relative to how many accounts it follows?
- `follower_acq_rate`: followers / (account_age_days + 1). How fast did the account gain followers?
- `tweets_per_day` (if not already in the dataset): statuses / (account_age_days + 1)

Explain the intuition for each: why might this feature help distinguish bots from real users?
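
A minimal sketch of ratio features on a toy DataFrame is shown here. The column names `followers_count`, `friends_count`, and `account_age_days` are assumptions borrowed from this notebook's commented examples and may not match the real CSV; the values are made up.

```python
import pandas as pd

# Toy accounts with made-up values; column names are assumed, not confirmed
toy = pd.DataFrame({
    "followers_count": [10, 5000, 0],
    "friends_count": [0, 49, 999],
    "account_age_days": [0, 99, 9],
})

# Adding 1 to each denominator avoids division by zero
# (for example, an account created today or one that follows nobody)
toy["follower_friend_ratio"] = toy["followers_count"] / (toy["friends_count"] + 1)
toy["follower_acq_rate"] = toy["followers_count"] / (toy["account_age_days"] + 1)

print(toy)
```

The first row would cause a division-by-zero in both ratios without the `+ 1` adjustment.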

In [None]:
# TODO (Instructor): Feature engineering
# Example:
# df['follower_friend_ratio'] = df['followers_count'] / (df['friends_count'] + 1)
# df['follower_acq_rate'] = df['followers_count'] / (df['account_age_days'] + 1)

## Handle Missing Values

**TODO (Instructor):** Check for and handle any missing values. Document the approach taken.

In [None]:
# TODO (Instructor): Handle missing values
# print(df.isnull().sum())

## Prepare Features and Target

**TODO (Instructor):** Separate features (X) and target (y). Make sure the target is the bot/human label column.

In [None]:
# TODO (Instructor): Define X and y
# X = df.drop(columns=['label_column'])
# y = df['label_column']
# print(f'Features shape: {X.shape}')
# print(f'Target distribution:\n{y.value_counts()}')

## Train/Test Split

Split the data into training and test sets.

**Requirements:**
- Use an 80/20 split
- Use `random_state=42` for reproducibility
- Use stratification to preserve the class ratio in both sets

*Hint: check the `stratify` parameter of `train_test_split`.*
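
To see what stratification does, here is a small sketch on synthetic labels (80 of one class, 20 of the other). The names `X_toy` and `y_toy` are placeholders for this illustration, not the project's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 of class 0 and 20 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the original 20% positive rate
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the positive rate in a small test set can drift away from the overall rate purely by chance.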

In [None]:
# YOUR CODE: Split the data into training and test sets
# X_train, X_test, y_train, y_test = ...



# Verify the split
print(f'Training set: {X_train.shape[0]} samples')
print(f'Test set:     {X_test.shape[0]} samples')
print(f'\nTraining class distribution:\n{y_train.value_counts(normalize=True).round(3)}')
print(f'\nTest class distribution:\n{y_test.value_counts(normalize=True).round(3)}')

## Feature Scaling

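Standardize the features so that each has mean 0 and standard deviation 1 on the training set. 
This matters for logistic regression: scikit-learn's `LogisticRegression` applies L2 regularization by default, which penalizes coefficients unevenly when features are on very different scales, and the solver tends to converge more slowly on unscaled data.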

**Requirements:**
- Use `StandardScaler`
- **Fit** the scaler on the training set only
- **Transform** both the training and test sets

*Why fit on train only? To avoid data leakage — the test set should be treated as unseen data.*
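
The fit/transform distinction can be sketched on toy arrays. The arrays and the name `scaler_demo` below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays standing in for a train/test split (values are made up)
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler_demo.transform(test)        # reuses the train statistics

print(scaler_demo.mean_, scaler_demo.scale_)
print(test_scaled)
```

Note that the scaled test data does not need to have mean 0 itself: it is expressed in units learned from the training set, which is what the model will see at prediction time.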

In [None]:
# YOUR CODE: Scale the features
# scaler = StandardScaler()
# X_train_scaled = ...
# X_test_scaled = ...


# Model 1: Logistic Regression (Baseline)

Start with logistic regression as a baseline. It's fast, interpretable, and gives us a reference point for comparison. 
If a more complex model can't beat logistic regression, the added complexity isn't worth it.
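
As a hedged sketch of the fit / predict / `predict_proba` pattern, the example below trains on synthetic data from `make_classification` rather than on the Twitter dataset. The names `X_demo`, `y_demo`, and `clf_demo` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, unrelated to the Twitter dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

clf_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

proba = clf_demo.predict_proba(X_demo[:5])  # shape (5, 2): one column per class
print(clf_demo.classes_)                    # column order used by predict_proba
print(proba[:, 1])                          # probability of class 1
```

Checking `classes_` is a cheap way to confirm which column corresponds to the positive class before passing probabilities to an ROC curve.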

In [None]:
# YOUR CODE: Train a logistic regression model
# - Initialize LogisticRegression (max_iter=1000 gives the solver more iterations to converge)
# - Fit on the SCALED training data
# - Generate predictions on the SCALED test data
# - Generate predicted probabilities on the SCALED test data (use .predict_proba())
#   Note: predict_proba returns one column per class, ordered as in model.classes_.
#   You want the probability of the positive class (bot): column [:, 1] when labels are 0/1.




# Model 2: Random Forest

Now train a random forest — an ensemble of decision trees that handles nonlinear relationships 
and feature interactions without needing explicit feature engineering. 
Because trees split on feature thresholds, random forests are largely insensitive to feature scaling, but we'll use the scaled data for consistency.
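
The claim about scaling can be checked empirically. The sketch below is an illustration on synthetic data, not a proof: it fits two forests with the same seed, one on raw features and one on standardized features, and measures how often their predictions agree on held-out rows. All names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (not the Twitter dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_fit, X_new, y_fit = X_demo[:200], X_demo[200:], y_demo[:200]

# Scale using statistics from the fitting portion only
scaler_demo = StandardScaler().fit(X_fit)

rf_raw = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_fit, y_fit)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=42).fit(
    scaler_demo.transform(X_fit), y_fit
)

# Fraction of held-out rows where the two forests predict the same class
agreement = (rf_raw.predict(X_new) == rf_scaled.predict(scaler_demo.transform(X_new))).mean()
print(f"Prediction agreement: {agreement:.3f}")
```

The two forests should agree on all or nearly all rows, since standardization preserves the ordering of values within each feature and therefore the available split points.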

In [None]:
# YOUR CODE: Train a random forest classifier
# - Initialize RandomForestClassifier with n_estimators=200 and random_state=42
# - Fit on the SCALED training data
# - Generate predictions on the SCALED test data
# - Generate predicted probabilities on the SCALED test data (positive class: [:, 1])




# Evaluation & Comparison

Use the helper function below to evaluate both models. It will print a classification report, 
plot a confusion matrix, and plot a ROC curve.
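
To connect the confusion matrix to the numbers in the classification report, here is a small worked example on hand-made labels. It assumes the encoding 1 = bot and 0 = human, which may differ from the dataset's actual label column.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hand-made labels: 1 = bot, 0 = human (an assumed encoding for this sketch)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # of accounts flagged as bots, how many really are bots
recall = tp / (tp + fn)      # of actual bots, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(precision, recall, f1)
```

Here 3 of the 4 flagged accounts are bots and 3 of the 4 bots are caught, so precision, recall, and F1 all come out to 0.75.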

In [None]:
def evaluate_model(model_name, y_true, y_pred, y_proba):
    """Print classification report, plot confusion matrix and ROC curve for a model."""

    print(f'=== {model_name} ===')
    print()
    print(classification_report(y_true, y_pred, digits=3))

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Confusion matrix
    ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred, ax=axes[0], cmap='Blues', colorbar=False
    )
    axes[0].set_title(f'{model_name} — Confusion Matrix')

    # ROC Curve
    fpr, tpr, _ = roc_curve(y_true, y_proba)
    roc_auc = auc(fpr, tpr)
    axes[1].plot(fpr, tpr, color='steelblue', lw=2, label=f'AUC = {roc_auc:.3f}')
    axes[1].plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--', label='Random (AUC = 0.5)')
    axes[1].set_xlabel('False Positive Rate')
    axes[1].set_ylabel('True Positive Rate')
    axes[1].set_title(f'{model_name} — ROC Curve')
    axes[1].legend(loc='lower right')

    plt.tight_layout()
    plt.show()
    print()

## Evaluate Logistic Regression

In [None]:
# YOUR CODE: Call evaluate_model for logistic regression
# evaluate_model('Logistic Regression', y_test, y_pred_lr, y_proba_lr)


## Evaluate Random Forest

In [None]:
# YOUR CODE: Call evaluate_model for random forest
# evaluate_model('Random Forest', y_test, y_pred_rf, y_proba_rf)


## Model Comparison Questions

Answer each question in 1–2 sentences.

**1.** Which model had higher F1 score on the **bot** class?

*Your answer:*

**2.** Look at the confusion matrices. Which model produces more **false positives** (real users wrongly flagged as bots)? Which produces more **false negatives** (bots that slip through)?

*Your answer:*

**3.** Imagine this model is deployed on a real social media platform. Which type of error is more damaging: banning a real user (false positive) or letting a bot through (false negative)?

*Your answer:*

# Feature Importance

One advantage of random forests is that they provide a built-in measure of feature importance — 
how much each feature contributes to reducing impurity across all trees in the forest.
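
As a quick sanity check on what `feature_importances_` returns, the sketch below fits a forest on synthetic data where only the first two columns carry signal. The data and the names `X_demo`, `rf_demo`, and `importances_demo` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False, the first 2 columns are informative and the last 3 are noise
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)
importances_demo = rf_demo.feature_importances_

print(importances_demo.round(3))   # impurity-based importances, one per feature
print(importances_demo.sum())      # normalized to sum to 1
```

Because the importances are normalized, they are best read as relative shares within one model rather than as absolute effect sizes.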

In [None]:
# Feature importance from the random forest model (pre-filled)

# NOTE: If you used different variable names for your random forest model or feature DataFrame,
# update 'rf_model' and 'X' below to match your code.

feature_names = X.columns  # original feature names before scaling
importances = rf_model.feature_importances_

# Sort by importance
sorted_idx = np.argsort(importances)

fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(range(len(sorted_idx)), importances[sorted_idx], color='steelblue')
ax.set_yticks(range(len(sorted_idx)))
ax.set_yticklabels(feature_names[sorted_idx])
ax.set_xlabel('Feature Importance (Gini)')
ax.set_title('Random Forest — Feature Importance')
plt.tight_layout()
plt.show()

## Feature Importance Questions

**1.** What are the top 3 most important features according to the random forest?

*Your answer:*

**2.** Does this match what you saw in the EDA plots above? (Yes/no, plus one sentence explaining why.)

*Your answer:*

# Reflection

Answer each question in 1–2 sentences.

**1.** What was the main advantage of random forest over logistic regression on this problem (or vice versa)?

*Your answer:*

**2.** Name one thing you would try to improve this model if you had more time.

*Your answer:*

**3.** What's one thing you learned or found surprising in this project?

*Your answer:*

---

# Optional Enhancements

The sections below are **not required**. They are opportunities to explore further if you have time and interest. 
Starter code is provided to help you get going.

## Enhancement A: Try XGBoost

XGBoost is a gradient boosting library that often achieves state-of-the-art results on tabular data. 
Train an XGBoost classifier and compare it to your previous models.

In [None]:
# Optional: XGBoost
# First, install xgboost if needed (uncomment the line below)
# !pip install xgboost

from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.1,
    random_state=42,
    eval_metric='logloss'
)

# YOUR CODE: Fit the model, generate predictions and probabilities, then evaluate
# xgb_model.fit(...)
# y_pred_xgb = ...
# y_proba_xgb = ...
# evaluate_model('XGBoost', y_test, y_pred_xgb, y_proba_xgb)

## Enhancement B: Hyperparameter Tuning

Use grid search with cross-validation to find better hyperparameters for your random forest.

In [None]:
# Optional: Hyperparameter tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
}

# YOUR CODE: Run grid search
# grid_search = GridSearchCV(
#     estimator=RandomForestClassifier(random_state=42),
#     param_grid=param_grid,
#     cv=5,
#     scoring='f1',
#     n_jobs=-1,
#     verbose=1
# )
# grid_search.fit(X_train_scaled, y_train)
# print(f'Best parameters: {grid_search.best_params_}')
# print(f'Best CV F1 score: {grid_search.best_score_:.3f}')

## Enhancement C: Threshold Tuning

The default classification threshold is 0.5, but the optimal threshold depends on how you want to balance 
precision and recall. Plot the precision-recall curve and experiment with different thresholds.
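
The trade-off can be seen on a handful of made-up probabilities. This sketch assumes labels encoded as 1 = bot and 0 = human; the arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted bot probabilities (1 = bot is an assumed encoding)
y_true_demo = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_proba_demo = np.array([0.1, 0.2, 0.4, 0.45, 0.55, 0.6, 0.8, 0.9])

for threshold in (0.3, 0.5, 0.7):
    # Flag an account as a bot when its probability meets the threshold
    y_pred_demo = (y_proba_demo >= threshold).astype(int)
    p = precision_score(y_true_demo, y_pred_demo)
    r = recall_score(y_true_demo, y_pred_demo)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold flags fewer accounts, so precision rises while recall falls; lowering it has the opposite effect.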

In [None]:
# Optional: Precision-Recall curve and threshold tuning
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay

# Plot precision-recall curve for your best model
# Example using random forest probabilities:
# PrecisionRecallDisplay.from_predictions(y_test, y_proba_rf)
# plt.title('Random Forest — Precision-Recall Curve')
# plt.show()

# Try a custom threshold:
# custom_threshold = 0.6  # experiment with different values
# y_pred_custom = (y_proba_rf >= custom_threshold).astype(int)
# print(f'\nResults at threshold = {custom_threshold}:')
# print(classification_report(y_test, y_pred_custom, digits=3))

## Enhancement D: Try an SVM

Support Vector Machines find the maximum-margin decision boundary between classes. 
Try an SVM with an RBF kernel and compare to your other models.

In [None]:
# Optional: SVM
from sklearn.svm import SVC

svm_model = SVC(
    kernel='rbf',
    probability=True,  # needed for predict_proba and ROC curves
    random_state=42
)

# YOUR CODE: Fit the model, generate predictions and probabilities, then evaluate
# Note: SVM can be slow on large datasets. If it takes too long, try using a
# subset of the training data, e.g. X_train_scaled[:5000] with the matching y_train[:5000]

# svm_model.fit(...)
# y_pred_svm = ...
# y_proba_svm = ...
# evaluate_model('SVM (RBF)', y_test, y_pred_svm, y_proba_svm)

---

# Submission Checklist

Before submitting, make sure you have:

- [ ] Train/test split with stratification
- [ ] Features scaled with StandardScaler (fit on train only)
- [ ] Logistic regression model trained and evaluated
- [ ] Random forest model trained and evaluated
- [ ] Model comparison questions answered
- [ ] Feature importance questions answered
- [ ] Reflection questions answered

**To submit:** Share a link to your completed notebook (Google Drive or GitHub).