Botbuster

Use case and product vision

Botbuster is part of a multi-layer detection approach aimed at reducing bot and autonomous-agent abuse while keeping user friction low.

  1. Behavioral biometrics — Signals from how people actually use the page: mouse dynamics, keyboard timing, scroll patterns, and related session rhythm. Humans show natural variability, hesitation, micro-corrections, and bursty activity; many scripted agents do not. This repository implements that layer end-to-end: turning interaction JSON into engineered features and training a classifier to separate human sessions from synthetic ones.

  2. Fingerprinting and environment signals — A complementary layer (typically in the browser or edge) that captures client and session context to spot automation frameworks, inconsistent identities, or replayed environments. That logic lives outside this offline training repo but pairs with behavioral scores in a full deployment.

  3. Compute challenges — Lightweight proof-of-work or attestation-style checks used sparingly so that only suspicious traffic pays an extra cost, preserving a smooth path for legitimate users.

Together, behavioral models, fingerprinting, and challenges raise the bar for attackers who can mimic one signal class but not all of them at once.

Reference / upstream project: github.com/be1ani/botbuster

Overview

This repo provides a minimal, portable Python pipeline: labeled interaction files under data/human/ and data/synthetic/ → features.csv → gradient-boosted classifier → inference.py for scoring new sessions. Shared defaults live in the botbuster package (paths.py, constants.py).

Repository layout

Path                  Role
data/human/           Labeled human interaction JSON (one file per collection)
data/synthetic/       Labeled synthetic/bot JSON
extract_features.py   Builds features.csv from the directories above
train_model.py        Trains an sklearn GradientBoostingClassifier and saves a joblib model
inference.py          Loads the model + column order from the CSV and scores a JSON file
botbuster/            Shared paths and constants used by the scripts
models/v0/            Default location for the trained .pkl
docs/diagrams/        Mermaid sources for architecture figures
docs/images/          Generated PNG diagrams (from docs/generate_graphs.sh)

Quick start

Create a virtual environment, install dependencies, then run the pipeline from the repository root so import botbuster resolves.

cd /path/to/botbuster
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

python extract_features.py
python train_model.py
python inference.py data/synthetic/some-session.json

CLI options

  • extract_features.py: --human-dir, --bot-dir, -o / --output override defaults (data/human, data/synthetic, features.csv at repo root).
  • train_model.py: --csv and -o / --model-out override features.csv and models/v0/bot_detection_model.pkl.
  • inference.py: --model and --csv (paths relative to the script directory unless absolute).

Labels in the CSV: bot/synthetic = 0, human = 1 (see botbuster/constants.py).

Architecture

Training and inference connect as follows. The model must see features in the same column order as in features.csv; inference reads that order from the CSV header.

flowchart LR
  subgraph inputs
    H[data/human/*.json]
    B[data/synthetic/*.json]
  end
  H --> EF[extract_features.py]
  B --> EF
  EF --> CSV[features.csv]
  CSV --> TM[train_model.py]
  TM --> M[models/v0/bot_detection_model.pkl]
  M --> INF[inference.py]
  J[Any interactions JSON] --> INF
  INF --> OUT[Printed predictions + probabilities]
  1. Feature extraction walks human and synthetic directories, parses each JSON session, computes behavioral statistics, and writes one CSV row per session with metadata columns and a label.
  2. Training reads the CSV, drops metadata columns, fits a GradientBoostingClassifier, and serializes the estimator with joblib.
  3. Inference reloads the model, reads the column order from the same CSV header the model was trained on, extracts features from the input JSON via the same code path as training data, and emits per-session predictions.
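The training step (2) can be sketched on a toy frame as follows; the feature column names, metadata handling, and `random_state` here are illustrative assumptions, while the label convention (0 = bot/synthetic, 1 = human) follows botbuster/constants.py:

```python
# Minimal sketch of the training step on a toy frame (column names here
# are assumptions; the real script reads them from features.csv and
# botbuster.constants, and persists the estimator with joblib).
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Toy feature matrix: two sessions per class, clearly separable.
df = pd.DataFrame({
    "time_inter_event_mean": [0.40, 0.35, 0.05, 0.05],
    "time_burstiness":       [0.60, 0.55, -0.90, -0.85],
    "label":                 [1, 1, 0, 0],   # 1 = human, 0 = bot/synthetic
})
X, y = df.drop(columns=["label"]), df["label"]

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X, y)
# joblib.dump(clf, "models/v0/bot_detection_model.pkl") would persist it.
preds = clf.predict(X)
```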

Module and dependency graph

flowchart TB
  subgraph scripts
    EF[extract_features.py]
    TM[train_model.py]
    INF[inference.py]
  end
  subgraph pkg[botbuster package]
    P[paths.py]
    C[constants.py]
  end
  EF --> P
  EF --> C
  TM --> P
  TM --> C
  INF --> P
  INF --> C
  INF --> EF
  • botbuster.paths: Resolves the repository root and default locations for data, features.csv, and the default model path. Scripts accept CLI overrides so paths stay portable across machines.
  • botbuster.constants: Single source for label integers (LABEL_BOT, LABEL_HUMAN) and metadata column names excluded from the model matrix.
  • inference.py imports process_json_file from extract_features.py so training and scoring always use the same feature definitions.
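A minimal sketch of the column-order recovery described above; the metadata names in META_COLS are hypothetical (the real set lives in botbuster.constants), and feature_columns/to_row are illustrative helpers rather than the repo's actual functions:

```python
# Sketch: recover the training column order from the CSV header so that
# inference feeds features to the model in the same order as training.
import csv
from io import StringIO

META_COLS = {"user_id", "session_id", "source_file", "label"}  # hypothetical names

def feature_columns(csv_file):
    """Model feature names, in the exact order of the training CSV header."""
    header = next(csv.reader(csv_file))
    return [c for c in header if c not in META_COLS]

def to_row(features, columns):
    """Align a {name: value} dict to the training column order (missing -> 0)."""
    return [features.get(c, 0.0) for c in columns]

cols = feature_columns(StringIO("user_id,f_b,f_a,label\nu1,2,3,1\n"))
```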

Documentation graphs

The Mermaid sources render to two PNG figures: the end-to-end pipeline and the scripts-and-package dependency graph. To regenerate them:

bash docs/generate_graphs.sh

This uses the Mermaid CLI via npx when mmdc is not on your PATH. Chromium runs with the repo-local docs/puppeteer.json flags so headless rendering works on locked-down Linux hosts. Output is written to docs/images/.

Interaction JSON format

Each file is a nested object: user_id → session_id → list of events. Events include action (mouse_move, click, keypress, scroll, …), timestamp (milliseconds), and action-specific fields such as x / y for pointer events.
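A hypothetical file in this shape, with the traversal an extractor would perform (field names follow the text; the user and session IDs and values are invented):

```python
# Illustrative session in the nested user_id -> session_id -> events format.
import json

raw = """
{
  "user-1": {
    "sess-1": [
      {"action": "mouse_move", "timestamp": 1000, "x": 10, "y": 20},
      {"action": "mouse_move", "timestamp": 1016, "x": 14, "y": 22},
      {"action": "click",      "timestamp": 1450, "x": 14, "y": 22},
      {"action": "keypress",   "timestamp": 2100, "key": "a"}
    ]
  }
}
"""

data = json.loads(raw)
for user_id, sessions in data.items():
    for session_id, events in sessions.items():
        times = [e["timestamp"] / 1000.0 for e in events]  # ms -> s
```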

Feature reference

The following sections document every engineered column used by the classifier. Features fall into six categories: time-based, mouse movement, click, scroll, keystroke, and session-level.


1. Time-Based Features

These features analyze the temporal patterns of events, capturing the rhythm and timing characteristics of user interactions.

time_inter_event_mean

  • Description: Mean time between consecutive events
  • Units: Seconds
  • Purpose: Captures the average pace of interactions. Humans typically have more variable pacing compared to bots.

time_inter_event_median

  • Description: Median time between consecutive events
  • Units: Seconds
  • Purpose: Provides a robust measure of central tendency, less affected by outliers than the mean.

time_inter_event_std

  • Description: Standard deviation of inter-event times
  • Units: Seconds
  • Purpose: Measures variability in timing. Higher values indicate more irregular, human-like behavior.

time_inter_event_skew

  • Description: Skewness of inter-event time distribution
  • Purpose: Measures asymmetry in timing patterns. Positive skew indicates occasional long pauses (human behavior).

time_inter_event_kurtosis

  • Description: Kurtosis of inter-event time distribution
  • Purpose: Measures the "tailedness" of the distribution. High kurtosis indicates more extreme values (long pauses or rapid bursts).

time_burstiness

  • Description: Burstiness index calculated as (std - mean) / (std + mean)
  • Range: -1 to 1
  • Purpose: Quantifies how bursty the interaction pattern is. Values closer to 1 indicate more bursty behavior (typical of humans), while values closer to -1 indicate more regular patterns (typical of bots).
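The burstiness formula above can be computed directly from the list of inter-event gaps; the fallback value for sessions with too few gaps is an assumption:

```python
# Burstiness index (std - mean) / (std + mean) over inter-event gaps.
import statistics

def burstiness(gaps):
    if len(gaps) < 2:
        return 0.0  # assumed fallback for insufficient data
    mean = statistics.mean(gaps)
    std = statistics.pstdev(gaps)
    return (std - mean) / (std + mean) if (std + mean) > 0 else 0.0
```

Perfectly regular gaps give std = 0 and hence a score of -1; a few long pauses among rapid events push the score toward +1.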

time_longest_pause

  • Description: Maximum time gap between any two consecutive events
  • Units: Seconds
  • Purpose: Captures the longest thinking/reading pause, which is characteristic of human behavior.

time_inter_event_p10, time_inter_event_p25, time_inter_event_p75, time_inter_event_p90, time_inter_event_p95, time_inter_event_p99

  • Description: Percentiles (10th, 25th, 75th, 90th, 95th, 99th) of inter-event times
  • Units: Seconds
  • Purpose: Provides detailed distribution information, capturing both typical and extreme timing behaviors.
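The whole time-based block can be sketched with numpy and scipy.stats (plausible dependencies here, since scikit-learn already pulls in scipy); the function name is illustrative:

```python
# Sketch: time-based features from raw millisecond timestamps.
import numpy as np
from scipy import stats

def time_features(timestamps_ms):
    t = np.sort(np.asarray(timestamps_ms, dtype=float)) / 1000.0  # ms -> s
    gaps = np.diff(t)
    feats = {
        "time_inter_event_mean": gaps.mean(),
        "time_inter_event_median": np.median(gaps),
        "time_inter_event_std": gaps.std(),
        "time_inter_event_skew": stats.skew(gaps),
        "time_inter_event_kurtosis": stats.kurtosis(gaps),
        "time_longest_pause": gaps.max(),
    }
    for p in (10, 25, 75, 90, 95, 99):
        feats[f"time_inter_event_p{p}"] = np.percentile(gaps, p)
    return feats
```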

2. Mouse Movement Features

These features analyze mouse movement patterns, capturing the natural dynamics and micro-movements characteristic of human motor control.

Speed Features

mouse_speed_mean

  • Description: Average mouse movement speed
  • Units: Pixels per second
  • Purpose: Humans typically have more variable speeds compared to bots.

mouse_speed_median

  • Description: Median mouse movement speed
  • Units: Pixels per second
  • Purpose: Robust measure of typical movement speed.

mouse_speed_std

  • Description: Standard deviation of mouse speeds
  • Units: Pixels per second
  • Purpose: Measures speed variability. Higher values indicate more natural, human-like movement.

mouse_speed_skew

  • Description: Skewness of speed distribution
  • Purpose: Captures asymmetry in speed patterns.

mouse_speed_kurtosis

  • Description: Kurtosis of speed distribution
  • Purpose: Measures the presence of extreme speed values.

mouse_speed_p10, mouse_speed_p25, mouse_speed_p75, mouse_speed_p90, mouse_speed_p95, mouse_speed_p99

  • Description: Percentiles of mouse speed distribution
  • Units: Pixels per second
  • Purpose: Detailed speed distribution information.

Acceleration and Jerk Features

mouse_acceleration_mean

  • Description: Average rate of change of mouse speed
  • Units: Pixels per second²
  • Purpose: Humans show more variable acceleration patterns. Bots often have more uniform acceleration.

mouse_acceleration_std

  • Description: Standard deviation of acceleration
  • Units: Pixels per second²
  • Purpose: Measures acceleration variability.

mouse_acceleration_skew

  • Description: Skewness of acceleration distribution
  • Purpose: Captures asymmetry in acceleration patterns.

mouse_jerk_mean

  • Description: Average rate of change of acceleration (3rd derivative of position)
  • Units: Pixels per second³
  • Purpose: Jerk is a key indicator of smoothness. Human movements have natural jerk, while bot movements are often unnaturally smooth.

mouse_jerk_std

  • Description: Standard deviation of jerk
  • Units: Pixels per second³
  • Purpose: Measures jerk variability.
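Speed, acceleration, and jerk are successive finite differences of position; a sketch over (t, x, y) samples in seconds and pixels (the function name and exact differencing scheme are assumptions):

```python
# Sketch: speed (px/s), acceleration (px/s^2), and jerk (px/s^3) as
# successive finite differences over a pointer trajectory.
import numpy as np

def kinematics(t, x, y):
    t, x, y = (np.asarray(v, dtype=float) for v in (t, x, y))
    dt = np.diff(t)
    dist = np.hypot(np.diff(x), np.diff(y))  # step lengths in pixels
    speed = dist / dt
    accel = np.diff(speed) / dt[1:]
    jerk = np.diff(accel) / dt[2:]
    return speed, accel, jerk
```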

Curvature Features

mouse_curvature_mean

  • Description: Average curvature of mouse path
  • Purpose: Humans rarely move in perfectly straight lines. Curvature captures the natural arc of human movements.

mouse_curvature_std

  • Description: Standard deviation of curvature
  • Purpose: Measures variability in path curvature.

mouse_curvature_velocity_corr

  • Description: Correlation between curvature and velocity
  • Range: -1 to 1
  • Purpose: In human movements, there's often a relationship between speed and curvature (slower around curves). Bots may lack this natural correlation.

Path Characteristics

mouse_path_straightness

  • Description: Ratio of total path length to straight-line distance from start to end
  • Range: ≥ 1.0 (1.0 = perfectly straight)
  • Purpose: Humans rarely take perfectly straight paths. Values significantly above 1.0 indicate more natural, meandering paths.

mouse_direction_changes

  • Description: Count of significant direction changes (>45 degrees)
  • Purpose: Humans make frequent small corrections. Bots may have fewer or more abrupt direction changes.
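Both path features can be sketched directly; the 45-degree threshold follows the text, and the function names are illustrative:

```python
# Sketch: path straightness ratio and count of >45° direction changes.
import numpy as np

def path_straightness(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    path_len = np.hypot(np.diff(x), np.diff(y)).sum()
    chord = np.hypot(x[-1] - x[0], y[-1] - y[0])
    return path_len / chord if chord > 0 else 1.0

def direction_changes(x, y, threshold_deg=45.0):
    angles = np.arctan2(np.diff(y), np.diff(x))       # heading per segment
    turns = np.abs(np.diff(np.unwrap(angles)))        # turn between segments
    return int((turns > np.deg2rad(threshold_deg)).sum())
```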

mouse_movement_entropy

  • Description: Entropy of movement direction angles
  • Purpose: Measures the randomness/diversity of movement directions. Higher entropy indicates more varied, human-like movement patterns.

Micro-Movements

mouse_tremor_peak_power

  • Description: Peak power in the 8-12 Hz frequency range (human tremor range)
  • Purpose: Humans exhibit natural hand tremor in the 8-12 Hz range. This feature detects the presence of this micro-movement, which is difficult for bots to replicate.
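One way to compute band-limited peak power is via an FFT, assuming the speed signal has been resampled to a uniform rate fs; the real extractor's resampling and windowing choices are not documented here, so this is only a sketch:

```python
# Sketch: peak spectral power in the 8-12 Hz (tremor) band of a signal
# sampled uniformly at fs Hz.
import numpy as np

def tremor_peak_power(signal, fs):
    sig = np.asarray(signal, dtype=float)
    sig = sig - sig.mean()                      # remove DC component
    power = np.abs(np.fft.rfft(sig)) ** 2
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
    band = (freqs >= 8.0) & (freqs <= 12.0)
    return float(power[band].max()) if band.any() else 0.0
```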

3. Click Features

These features analyze clicking behavior, capturing the hesitation, approach patterns, and click characteristics that distinguish human from bot behavior.

Click Hesitation

click_hesitation_mean

  • Description: Average time between last mouse movement and click
  • Units: Seconds
  • Purpose: Humans typically pause briefly before clicking (hesitation). Bots often click immediately after reaching the target.

click_hesitation_median

  • Description: Median hesitation time
  • Units: Seconds
  • Purpose: Robust measure of typical hesitation.

click_hesitation_std

  • Description: Standard deviation of hesitation times
  • Units: Seconds
  • Purpose: Measures variability in hesitation patterns.

Pre-Click Path

click_pre_path_mean

  • Description: Average path length traveled in the 500ms before a click
  • Units: Pixels
  • Purpose: Humans often make small adjustments before clicking. Bots may have shorter or more direct paths.

click_pre_path_median

  • Description: Median pre-click path length
  • Units: Pixels
  • Purpose: Robust measure of typical pre-click movement.

Target Approach

click_approach_speed_ratio_mean

  • Description: Average ratio of speed when closest to target vs. speed when farthest from target (in last 200ms)
  • Purpose: Humans typically slow down as they approach a target (Fitts' law). This ratio captures this deceleration pattern. Values < 1.0 indicate natural deceleration.

click_approach_speed_ratio_median

  • Description: Median approach speed ratio
  • Purpose: Robust measure of typical approach behavior.

Click Distribution

click_x_entropy

  • Description: Entropy of click X-coordinates
  • Purpose: Measures the randomness of click positions. Humans click in more varied locations.

click_y_entropy

  • Description: Entropy of click Y-coordinates
  • Purpose: Measures the randomness of click positions vertically.
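Histogram-based Shannon entropy is one plausible implementation for these coordinate features; the bin count here is an assumption:

```python
# Sketch: Shannon entropy (bits) of a coordinate distribution via a
# fixed-bin histogram.
import numpy as np

def coord_entropy(values, bins=10):
    counts, _ = np.histogram(np.asarray(values, dtype=float), bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

Identical click positions give entropy 0; clicks spread evenly across the bins approach log2(bins).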

Double-Click Detection

click_double_click_count

  • Description: Number of double-clicks detected (clicks 200-500ms apart)
  • Purpose: Humans sometimes double-click by accident or habit. Bots rarely exhibit this behavior.

click_double_click_mean_time

  • Description: Average time between clicks in double-click pairs
  • Units: Seconds
  • Purpose: Typical double-click timing is around 200-500ms.

4. Scroll Features

These features analyze scrolling behavior, capturing the natural rhythm and patterns of human scrolling.

Scroll Velocity

scroll_velocity_mean

  • Description: Average scrolling velocity
  • Units: Pixels per second
  • Purpose: Humans scroll at variable speeds. Bots may scroll at more constant rates.

scroll_velocity_std

  • Description: Standard deviation of scroll velocity
  • Units: Pixels per second
  • Purpose: Measures variability in scrolling speed.

Scroll Acceleration

scroll_acceleration_mean

  • Description: Average rate of change of scroll velocity
  • Units: Pixels per second²
  • Purpose: Captures how smoothly scrolling speed changes.

scroll_acceleration_std

  • Description: Standard deviation of scroll acceleration
  • Units: Pixels per second²
  • Purpose: Measures acceleration variability.

Scroll Patterns

scroll_burstiness

  • Description: Mean time between scroll events
  • Units: Seconds
  • Purpose: Captures the bursty nature of scrolling. Humans scroll in bursts with pauses.

scroll_direction_entropy

  • Description: Entropy of scroll directions (up/down)
  • Purpose: Measures the randomness of scroll direction changes. Humans scroll both up and down naturally.

scroll_idle_ratio

  • Description: Ratio of time spent not scrolling (gaps > 1 second) to total session time
  • Range: 0 to 1
  • Purpose: Humans pause to read content. Higher values indicate more natural reading behavior.
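A sketch of the idle-ratio computation, with the 1-second gap threshold from the text:

```python
# Sketch: fraction of session time spent in gaps longer than 1 second.
def idle_ratio(timestamps_s, gap_threshold=1.0):
    ts = sorted(timestamps_s)
    total = ts[-1] - ts[0]
    if total <= 0:
        return 0.0  # assumed fallback for degenerate sessions
    idle = sum(b - a for a, b in zip(ts, ts[1:]) if b - a > gap_threshold)
    return idle / total
```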

5. Keystroke Features

These features analyze typing patterns, capturing the rhythm and timing characteristics of human typing.

Inter-Key Timing

keystroke_inter_key_mean

  • Description: Average time between consecutive keystrokes
  • Units: Seconds
  • Purpose: Humans type at variable speeds. Bots may type at unnaturally constant rates.

keystroke_inter_key_median

  • Description: Median inter-key time
  • Units: Seconds
  • Purpose: Robust measure of typical typing speed.

keystroke_inter_key_std

  • Description: Standard deviation of inter-key times
  • Units: Seconds
  • Purpose: Measures typing rhythm variability. Higher values indicate more natural, human-like typing.

keystroke_inter_key_skew

  • Description: Skewness of inter-key time distribution
  • Purpose: Captures asymmetry in typing patterns. Positive skew indicates occasional long pauses (thinking/correction).

Typing Patterns

keystroke_burstiness

  • Description: Burstiness index for typing, calculated as (std - mean) / (std + mean)
  • Range: -1 to 1
  • Purpose: Quantifies how bursty the typing pattern is. Humans type in bursts with pauses, while bots may type more continuously.

6. Session-Level Features

These features provide aggregate statistics about the entire session, capturing overall interaction patterns.

Event Counts

session_total_events

  • Description: Total number of events in the session
  • Purpose: Overall activity level indicator.

session_mouse_move_count

  • Description: Total number of mouse movement events
  • Purpose: Measures mouse activity level.

session_click_count

  • Description: Total number of click events
  • Purpose: Measures clicking activity.

session_keypress_count

  • Description: Total number of keypress events
  • Purpose: Measures typing activity.

session_scroll_count

  • Description: Total number of scroll events
  • Purpose: Measures scrolling activity.

Event Ratios

session_mouse_move_ratio

  • Description: Proportion of mouse movement events to total events
  • Range: 0 to 1
  • Purpose: Relative frequency of mouse movements.

session_click_ratio

  • Description: Proportion of click events to total events
  • Range: 0 to 1
  • Purpose: Relative frequency of clicks.

session_keypress_ratio

  • Description: Proportion of keypress events to total events
  • Range: 0 to 1
  • Purpose: Relative frequency of typing.

session_scroll_ratio

  • Description: Proportion of scroll events to total events
  • Range: 0 to 1
  • Purpose: Relative frequency of scrolling.

session_<event_type>_ratio

  • Description: Proportion of any other event type to total events
  • Range: 0 to 1
  • Purpose: Captures the relative frequency of any event type in the session.
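The count and ratio columns, including the generic session_<event_type>_ratio template, can be produced in a single pass over the event list (sketch; the dict keys mirror the column names above):

```python
# Sketch: per-type event counts and ratios for one session.
from collections import Counter

def session_ratios(events):
    counts = Counter(e["action"] for e in events)
    total = sum(counts.values())
    feats = {"session_total_events": total}
    for action, n in counts.items():
        feats[f"session_{action}_count"] = n
        feats[f"session_{action}_ratio"] = n / total
    return feats
```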

Session Characteristics

session_idle_ratio

  • Description: Proportion of time spent in idle periods (gaps > 1 second) to total session time
  • Range: 0 to 1
  • Purpose: Humans pause to read, think, or process information. Higher values indicate more natural human behavior.

session_unique_elements

  • Description: Number of unique DOM elements interacted with
  • Purpose: Measures the diversity of interactions. Humans explore more, while bots may focus on specific elements.

session_time_entropy

  • Description: Entropy of inter-event time distribution
  • Purpose: Measures the randomness/variability of timing patterns across the entire session. Higher entropy indicates more natural, human-like timing.

Feature Statistics

The feature extraction uses several statistical measures:

  • Mean: Average value
  • Median: Middle value (50th percentile)
  • Standard Deviation (std): Measure of variability
  • Skewness: Measure of distribution asymmetry
  • Kurtosis: Measure of distribution "tailedness"
  • Percentiles (p10, p25, p75, p90, p95, p99): Values below which a given percentage of observations fall
  • Entropy: Measure of randomness/diversity in a distribution

Why These Features Matter for Bot Detection

Human Characteristics Captured:

  1. Natural Variability: Humans show high variability in timing, speed, and movement patterns
  2. Micro-movements: Natural hand tremor and micro-corrections
  3. Hesitation and Pauses: Thinking time, reading pauses, and hesitation before actions
  4. Smooth Deceleration: Natural slowing when approaching targets (Fitts' law)
  5. Bursty Patterns: Activity in bursts with natural pauses
  6. Curved Paths: Rarely perfectly straight movements
  7. Error Patterns: Occasional mistakes and corrections

Bot Characteristics Detected:

  1. Unnatural Regularity: Too consistent timing and movement patterns
  2. Perfect Precision: Lack of micro-movements and tremor
  3. Immediate Actions: No hesitation or pauses
  4. Constant Speed: Lack of natural acceleration/deceleration
  5. Straight Paths: Overly direct movements
  6. Perfect Execution: Lack of errors or corrections

Feature Extraction Notes

  • All timestamps are converted from milliseconds to seconds
  • Missing features are filled with 0
  • Infinite values are replaced with 0
  • Features are extracted per session (a sequence of events for a user)
  • The feature extraction handles edge cases (empty events, insufficient data, etc.)
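The NaN/infinity handling in the notes above amounts to a small cleanup pass over each feature row (sketch):

```python
# Sketch: coerce NaN and ±inf feature values to 0 before writing a CSV row.
import math

def clean(feats):
    return {k: (0.0 if not math.isfinite(v) else v) for k, v in feats.items()}
```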

Total Feature Count

The system extracts more than 70 features across all categories, providing a comprehensive representation of user interaction patterns for bot detection.
