Botbuster is part of a dual-layer detection approach aimed at reducing bot and autonomous-agent abuse while keeping user friction low.
- Behavioral biometrics — Signals from how people actually use the page: mouse dynamics, keyboard timing, scroll patterns, and related session rhythm. Humans show natural variability, hesitation, micro-corrections, and bursty activity; many scripted agents do not. This repository implements that layer end-to-end: turning interaction JSON into engineered features and training a classifier to separate human sessions from synthetic ones.
- Fingerprinting and environment signals — A complementary layer (typically in the browser or at the edge) that captures client and session context to spot automation frameworks, inconsistent identities, or replayed environments. That logic lives outside this offline training repo but pairs with behavioral scores in a full deployment.
- Compute challenges — Lightweight proof-of-work or attestation-style checks used sparingly so that only suspicious traffic pays an extra cost, preserving a smooth path for legitimate users.
Together, behavioral models, fingerprinting, and challenges raise the bar for attackers who can mimic one signal class but not all of them at once.
Reference / upstream project: github.com/be1ani/botbuster
This repo provides a minimal, portable Python pipeline: labeled interaction files under `data/human/` and `data/synthetic/` → `features.csv` → gradient-boosted classifier → `inference.py` for scoring new sessions. Shared defaults live in the `botbuster` package (`paths.py`, `constants.py`).
- Use case and product vision
- Overview
- Repository layout
- Quick start
- CLI options
- Architecture
- Documentation graphs
- Interaction JSON format
- Feature reference
| Path | Role |
|---|---|
| `data/human/` | Labeled human interaction JSON (one file per collection) |
| `data/synthetic/` | Labeled synthetic/bot JSON |
| `extract_features.py` | Builds `features.csv` from the directories above |
| `train_model.py` | Trains an sklearn `GradientBoostingClassifier` and saves a joblib model |
| `inference.py` | Loads the model + column order from CSV and scores a JSON file |
| `botbuster/` | Shared paths and constants used by the scripts |
| `models/v0/` | Default location for the trained `.pkl` |
| `docs/diagrams/` | Mermaid sources for architecture figures |
| `docs/images/` | Generated PNG diagrams (from `docs/generate_graphs.sh`) |
Create a virtual environment, install dependencies, then run the pipeline from the repository root so `import botbuster` resolves.
```shell
cd /path/to/botbuster
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python extract_features.py
python train_model.py
python inference.py data/synthetic/some-session.json
```

- `extract_features.py`: `--human-dir`, `--bot-dir`, and `-o/--output` override the defaults (`data/human`, `data/synthetic`, and `features.csv` at the repo root).
- `train_model.py`: `--csv` and `-o/--model-out` override `features.csv` and `models/v0/bot_detection_model.pkl`.
- `inference.py`: `--model` and `--csv` (paths are relative to the script directory unless absolute).
Labels in the CSV: bot/synthetic = 0, human = 1 (see `botbuster/constants.py`).
Training and inference connect as follows. The model must see features in the same column order as in `features.csv`; inference reads that order from the CSV header.
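The column-order alignment step can be sketched as follows. The column and metadata names here are illustrative assumptions; the real names come from the `features.csv` header and `botbuster/constants.py`.

```python
def ordered_feature_row(feature_dict, csv_header,
                        metadata_cols=("user_id", "session_id", "label")):
    """Arrange extracted features in the column order the model was trained on.

    `metadata_cols` is an assumed set of non-feature columns; the real names
    live in botbuster/constants.py.
    """
    feature_cols = [c for c in csv_header if c not in metadata_cols]
    # Missing features default to 0, matching the extraction conventions.
    return [feature_dict.get(c, 0) for c in feature_cols]

# Hypothetical header as it might appear in features.csv:
header = ["user_id", "session_id", "time_inter_event_mean", "mouse_speed_mean", "label"]
row = ordered_feature_row({"mouse_speed_mean": 412.5, "time_inter_event_mean": 0.21}, header)
```

Reading the order from the CSV header rather than hard-coding it keeps training and inference in lockstep even when features are added or removed.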
```mermaid
flowchart LR
  subgraph inputs
    H[data/human/*.json]
    B[data/synthetic/*.json]
  end
  H --> EF[extract_features.py]
  B --> EF
  EF --> CSV[features.csv]
  CSV --> TM[train_model.py]
  TM --> M[models/v0/bot_detection_model.pkl]
  M --> INF[inference.py]
  J[Any interactions JSON] --> INF
  INF --> OUT[Printed predictions + probabilities]
```
- Feature extraction walks the human and synthetic directories, parses each JSON session, computes behavioral statistics, and writes one CSV row per session with metadata columns and a label.
- Training reads the CSV, drops metadata columns, fits a `GradientBoostingClassifier`, and serializes the estimator with joblib.
- Inference reloads the model, reads the column order from the same CSV header the model was trained on, extracts features from the input JSON via the same code path as training data, and emits per-session predictions.
```mermaid
flowchart TB
  subgraph scripts
    EF[extract_features.py]
    TM[train_model.py]
    INF[inference.py]
  end
  subgraph pkg[botbuster package]
    P[paths.py]
    C[constants.py]
  end
  EF --> P
  EF --> C
  TM --> P
  TM --> C
  INF --> P
  INF --> C
  INF --> EF
```
- `botbuster.paths`: Resolves the repository root and default locations for data, `features.csv`, and the default model path. Scripts accept CLI overrides so paths stay portable across machines.
- `botbuster.constants`: Single source for label integers (`LABEL_BOT`, `LABEL_HUMAN`) and metadata column names excluded from the model matrix.
- `inference.py` imports `process_json_file` from `extract_features.py` so training and scoring always use the same feature definitions.
To regenerate the PNG diagrams from the Mermaid sources:

```shell
bash docs/generate_graphs.sh
```

This uses the Mermaid CLI via `npx` when `mmdc` is not on your PATH. Chromium runs with the repo-local `docs/puppeteer.json` flags so headless rendering works on locked-down Linux hosts. Output is written to `docs/images/`.
Each file is a nested object: `user_id` → `session_id` → list of events. Events include `action` (`mouse_move`, `click`, `keypress`, `scroll`, …), `timestamp` (milliseconds), and action-specific fields such as `x`/`y` for pointer events.
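A minimal file matching that shape might look like the following; the field names follow the description above, while the concrete IDs and values are illustrative.

```python
import json

# Hypothetical interactions file: user_id -> session_id -> list of events.
raw = """
{
  "user-123": {
    "session-abc": [
      {"action": "mouse_move", "timestamp": 1700000000000, "x": 104, "y": 220},
      {"action": "click",      "timestamp": 1700000000450, "x": 105, "y": 221},
      {"action": "keypress",   "timestamp": 1700000000900}
    ]
  }
}
"""

data = json.loads(raw)
# Walk the nesting the same way the extraction scripts would.
for user_id, sessions in data.items():
    for session_id, events in sessions.items():
        actions = [e["action"] for e in events]
```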
The following sections document every engineered column used by the classifier. Features fall into six categories: time-based, mouse movement, click, scroll, keystroke, and session-level.
These features analyze the temporal patterns of events, capturing the rhythm and timing characteristics of user interactions.
- Description: Mean time between consecutive events
- Units: Seconds
- Purpose: Captures the average pace of interactions. Humans typically have more variable pacing compared to bots.
- Description: Median time between consecutive events
- Units: Seconds
- Purpose: Provides a robust measure of central tendency, less affected by outliers than the mean.
- Description: Standard deviation of inter-event times
- Units: Seconds
- Purpose: Measures variability in timing. Higher values indicate more irregular, human-like behavior.
- Description: Skewness of inter-event time distribution
- Purpose: Measures asymmetry in timing patterns. Positive skew indicates occasional long pauses (human behavior).
- Description: Kurtosis of inter-event time distribution
- Purpose: Measures the "tailedness" of the distribution. High kurtosis indicates more extreme values (long pauses or rapid bursts).
- Description: Burstiness index calculated as `(std - mean) / (std + mean)`
- Range: -1 to 1
- Purpose: Quantifies how bursty the interaction pattern is. Values closer to 1 indicate more bursty behavior (typical of humans), while values closer to -1 indicate more regular patterns (typical of bots).
- Description: Maximum time gap between any two consecutive events
- Units: Seconds
- Purpose: Captures the longest thinking/reading pause, which is characteristic of human behavior.
`time_inter_event_p10`, `time_inter_event_p25`, `time_inter_event_p75`, `time_inter_event_p90`, `time_inter_event_p95`, `time_inter_event_p99`
- Description: Percentiles (10th, 25th, 75th, 90th, 95th, 99th) of inter-event times
- Units: Seconds
- Purpose: Provides detailed distribution information, capturing both typical and extreme timing behaviors.
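The time-based features above can be sketched from a sorted list of event timestamps; the feature names mirror the percentile columns listed above but the non-percentile names are assumptions, and this uses only the stdlib rather than the repo's actual code.

```python
import statistics

def time_features(timestamps_ms):
    """Sketch of the time-based features; raw timestamps arrive in milliseconds."""
    ts = [t / 1000.0 for t in sorted(timestamps_ms)]  # ms -> seconds
    gaps = [b - a for a, b in zip(ts, ts[1:])]        # inter-event times
    if len(gaps) < 2:
        return {}  # insufficient data; the real pipeline fills missing features with 0
    mean = statistics.mean(gaps)
    std = statistics.pstdev(gaps)
    burstiness = (std - mean) / (std + mean) if (std + mean) > 0 else 0.0
    return {
        "time_inter_event_mean": mean,
        "time_inter_event_median": statistics.median(gaps),
        "time_inter_event_std": std,
        "time_inter_event_burstiness": burstiness,  # (std - mean) / (std + mean)
        "time_inter_event_max_gap": max(gaps),
    }

feats = time_features([0, 200, 250, 1250])  # gaps: 0.2 s, 0.05 s, 1.0 s
```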
These features analyze mouse movement patterns, capturing the natural dynamics and micro-movements characteristic of human motor control.
- Description: Average mouse movement speed
- Units: Pixels per second
- Purpose: Humans typically have more variable speeds compared to bots.
- Description: Median mouse movement speed
- Units: Pixels per second
- Purpose: Robust measure of typical movement speed.
- Description: Standard deviation of mouse speeds
- Units: Pixels per second
- Purpose: Measures speed variability. Higher values indicate more natural, human-like movement.
- Description: Skewness of speed distribution
- Purpose: Captures asymmetry in speed patterns.
- Description: Kurtosis of speed distribution
- Purpose: Measures the presence of extreme speed values.
`mouse_speed_p10`, `mouse_speed_p25`, `mouse_speed_p75`, `mouse_speed_p90`, `mouse_speed_p95`, `mouse_speed_p99`
- Description: Percentiles of mouse speed distribution
- Units: Pixels per second
- Purpose: Detailed speed distribution information.
- Description: Average rate of change of mouse speed
- Units: Pixels per second²
- Purpose: Humans show more variable acceleration patterns. Bots often have more uniform acceleration.
- Description: Standard deviation of acceleration
- Units: Pixels per second²
- Purpose: Measures acceleration variability.
- Description: Skewness of acceleration distribution
- Purpose: Captures asymmetry in acceleration patterns.
- Description: Average rate of change of acceleration (3rd derivative of position)
- Units: Pixels per second³
- Purpose: Jerk is a key indicator of smoothness. Human movements have natural jerk, while bot movements are often unnaturally smooth.
- Description: Standard deviation of jerk
- Units: Pixels per second³
- Purpose: Measures jerk variability.
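Speed, acceleration, and jerk are successive finite differences of the pointer trajectory. A minimal sketch, assuming mouse samples arrive as `(x, y, timestamp_ms)` tuples:

```python
def mouse_dynamics(points):
    """Derive speed (px/s), acceleration (px/s^2), and jerk (px/s^3)
    from consecutive (x, y, timestamp_ms) mouse samples."""
    speeds, dts = [], []
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        dt = (t1 - t0) / 1000.0  # ms -> seconds
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        dist = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        speeds.append(dist / dt)
        dts.append(dt)
    # Each derivative consumes one sample: n points -> n-1 speeds -> n-2 accels ...
    accels = [(v1 - v0) / dt for v0, v1, dt in zip(speeds, speeds[1:], dts[1:])]
    jerks = [(a1 - a0) / dt for a0, a1, dt in zip(accels, accels[1:], dts[2:])]
    return speeds, accels, jerks

# Two steady 500 px/s steps, then a sudden stop:
speeds, accels, jerks = mouse_dynamics([(0, 0, 0), (30, 40, 100), (60, 80, 200), (60, 80, 300)])
```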
- Description: Average curvature of mouse path
- Purpose: Humans rarely move in perfectly straight lines. Curvature captures the natural arc of human movements.
- Description: Standard deviation of curvature
- Purpose: Measures variability in path curvature.
- Description: Correlation between curvature and velocity
- Range: -1 to 1
- Purpose: In human movements, there's often a relationship between speed and curvature (slower around curves). Bots may lack this natural correlation.
- Description: Ratio of total path length to straight-line distance from start to end
- Range: ≥ 1.0 (1.0 = perfectly straight)
- Purpose: Humans rarely take perfectly straight paths. Values significantly above 1.0 indicate more natural, meandering paths.
- Description: Count of significant direction changes (>45 degrees)
- Purpose: Humans make frequent small corrections. Bots may have fewer or more abrupt direction changes.
- Description: Entropy of movement direction angles
- Purpose: Measures the randomness/diversity of movement directions. Higher entropy indicates more varied, human-like movement patterns.
- Description: Peak power in the 8-12 Hz frequency range (human tremor range)
- Purpose: Humans exhibit natural hand tremor in the 8-12 Hz range. This feature detects the presence of this micro-movement, which is difficult for bots to replicate.
These features analyze clicking behavior, capturing the hesitation, approach patterns, and click characteristics that distinguish human from bot behavior.
- Description: Average time between last mouse movement and click
- Units: Seconds
- Purpose: Humans typically pause briefly before clicking (hesitation). Bots often click immediately after reaching the target.
- Description: Median hesitation time
- Units: Seconds
- Purpose: Robust measure of typical hesitation.
- Description: Standard deviation of hesitation times
- Units: Seconds
- Purpose: Measures variability in hesitation patterns.
- Description: Average path length traveled in the 500ms before a click
- Units: Pixels
- Purpose: Humans often make small adjustments before clicking. Bots may have shorter or more direct paths.
- Description: Median pre-click path length
- Units: Pixels
- Purpose: Robust measure of typical pre-click movement.
- Description: Average ratio of speed when closest to target vs. speed when farthest from target (in last 200ms)
- Purpose: Humans typically slow down as they approach a target (Fitts' law). This ratio captures this deceleration pattern. Values < 1.0 indicate natural deceleration.
- Description: Median approach speed ratio
- Purpose: Robust measure of typical approach behavior.
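One plausible way to compute the approach ratio, assuming each pre-click window is summarized as `(distance_to_target_px, speed_px_s)` pairs (the real sampling format may differ):

```python
def approach_speed_ratio(pre_click_samples):
    """Speed at the sample nearest the target divided by speed at the farthest
    sample, over the last 200 ms before a click. Values below 1.0 indicate
    the Fitts'-law deceleration described above."""
    nearest = min(pre_click_samples, key=lambda s: s[0])
    farthest = max(pre_click_samples, key=lambda s: s[0])
    if farthest[1] == 0:
        return 1.0  # degenerate: no movement far from target
    return nearest[1] / farthest[1]

# Pointer slowing from 600 px/s far away to 120 px/s right at the target:
ratio = approach_speed_ratio([(80, 600.0), (30, 400.0), (5, 120.0)])  # 120 / 600
```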
- Description: Entropy of click X-coordinates
- Purpose: Measures the randomness of click positions. Humans click in more varied locations.
- Description: Entropy of click Y-coordinates
- Purpose: Measures the randomness of click positions vertically.
- Description: Number of double-clicks detected (clicks 200-500ms apart)
- Purpose: Humans sometimes double-click by accident or habit. Bots rarely exhibit this behavior.
- Description: Average time between clicks in double-click pairs
- Units: Seconds
- Purpose: Typical double-click timing is around 200-500ms.
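Using the 200-500 ms window above, double-click detection can be sketched as a pass over sorted click timestamps:

```python
def count_double_clicks(click_times_ms, lo_ms=200, hi_ms=500):
    """Count click pairs spaced 200-500 ms apart and return (count, mean interval in s)."""
    times = sorted(click_times_ms)
    intervals, i = [], 0
    while i < len(times) - 1:
        gap = times[i + 1] - times[i]
        if lo_ms <= gap <= hi_ms:
            intervals.append(gap / 1000.0)  # seconds, for the mean-interval feature
            i += 2  # consume both clicks of the pair
        else:
            i += 1
    mean_interval = sum(intervals) / len(intervals) if intervals else 0.0
    return len(intervals), mean_interval

# Two double-clicks (300 ms and 450 ms apart) among six clicks:
count, mean_interval = count_double_clicks([0, 300, 5000, 5040, 9000, 9450])
```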
These features analyze scrolling behavior, capturing the natural rhythm and patterns of human scrolling.
- Description: Average scrolling velocity
- Units: Pixels per second
- Purpose: Humans scroll at variable speeds. Bots may scroll at more constant rates.
- Description: Standard deviation of scroll velocity
- Units: Pixels per second
- Purpose: Measures variability in scrolling speed.
- Description: Average rate of change of scroll velocity
- Units: Pixels per second²
- Purpose: Captures how smoothly scrolling speed changes.
- Description: Standard deviation of scroll acceleration
- Units: Pixels per second²
- Purpose: Measures acceleration variability.
- Description: Mean time between scroll events
- Units: Seconds
- Purpose: Captures the bursty nature of scrolling. Humans scroll in bursts with pauses.
- Description: Entropy of scroll directions (up/down)
- Purpose: Measures the randomness of scroll direction changes. Humans scroll both up and down naturally.
- Description: Ratio of time spent not scrolling (gaps > 1 second) to total session time
- Range: 0 to 1
- Purpose: Humans pause to read content. Higher values indicate more natural reading behavior.
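The pause ratio is the summed duration of gaps over one second divided by total session time; a minimal sketch over scroll-event timestamps:

```python
def pause_ratio(timestamps_ms, gap_threshold_s=1.0):
    """Fraction of the session spent in gaps longer than `gap_threshold_s`."""
    ts = [t / 1000.0 for t in sorted(timestamps_ms)]  # ms -> seconds
    total = ts[-1] - ts[0]
    if total <= 0:
        return 0.0
    paused = sum(b - a for a, b in zip(ts, ts[1:]) if b - a > gap_threshold_s)
    return paused / total

# Two scroll bursts separated by a 3-second reading pause in a 4-second session:
r = pause_ratio([0, 200, 400, 3400, 3600, 4000])  # 3.0 / 4.0
```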
These features analyze typing patterns, capturing the rhythm and timing characteristics of human typing.
- Description: Average time between consecutive keystrokes
- Units: Seconds
- Purpose: Humans type at variable speeds. Bots may type at unnaturally constant rates.
- Description: Median inter-key time
- Units: Seconds
- Purpose: Robust measure of typical typing speed.
- Description: Standard deviation of inter-key times
- Units: Seconds
- Purpose: Measures typing rhythm variability. Higher values indicate more natural, human-like typing.
- Description: Skewness of inter-key time distribution
- Purpose: Captures asymmetry in typing patterns. Positive skew indicates occasional long pauses (thinking/correction).
- Description: Burstiness index for typing, calculated as `(std - mean) / (std + mean)`
- Range: -1 to 1
- Purpose: Quantifies how bursty the typing pattern is. Humans type in bursts with pauses, while bots may type more continuously.
These features provide aggregate statistics about the entire session, capturing overall interaction patterns.
- Description: Total number of events in the session
- Purpose: Overall activity level indicator.
- Description: Total number of mouse movement events
- Purpose: Measures mouse activity level.
- Description: Total number of click events
- Purpose: Measures clicking activity.
- Description: Total number of keypress events
- Purpose: Measures typing activity.
- Description: Total number of scroll events
- Purpose: Measures scrolling activity.
- Description: Proportion of mouse movement events to total events
- Range: 0 to 1
- Purpose: Relative frequency of mouse movements.
- Description: Proportion of click events to total events
- Range: 0 to 1
- Purpose: Relative frequency of clicks.
- Description: Proportion of keypress events to total events
- Range: 0 to 1
- Purpose: Relative frequency of typing.
- Description: Proportion of scroll events to total events
- Range: 0 to 1
- Purpose: Relative frequency of scrolling.
- Description: Proportion of any other event type to total events
- Range: 0 to 1
- Purpose: Captures the relative frequency of any event type in the session.
- Description: Proportion of time spent in idle periods (gaps > 1 second) to total session time
- Range: 0 to 1
- Purpose: Humans pause to read, think, or process information. Higher values indicate more natural human behavior.
- Description: Number of unique DOM elements interacted with
- Purpose: Measures the diversity of interactions. Humans explore more, while bots may focus on specific elements.
- Description: Entropy of inter-event time distribution
- Purpose: Measures the randomness/variability of timing patterns across the entire session. Higher entropy indicates more natural, human-like timing.
The feature extraction uses several statistical measures:
- Mean: Average value
- Median: Middle value (50th percentile)
- Standard Deviation (std): Measure of variability
- Skewness: Measure of distribution asymmetry
- Kurtosis: Measure of distribution "tailedness"
- Percentiles (p10, p25, p75, p90, p95, p99): Values below which a given percentage of observations fall
- Entropy: Measure of randomness/diversity in a distribution
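The entropy measure used for direction, click-position, and timing features is a histogram-based Shannon entropy; a stdlib sketch (the bin count is an assumption):

```python
import math
from collections import Counter

def shannon_entropy(values, bins=10):
    """Shannon entropy (bits) of a histogram over `values`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant stream carries no information
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

low = shannon_entropy([5.0] * 20)                                 # no variability
high = shannon_entropy([0.1, 2.3, 4.8, 7.7, 9.9, 1.4, 6.2, 8.8])  # spread out
```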
- Natural Variability: Humans show high variability in timing, speed, and movement patterns
- Micro-movements: Natural hand tremor and micro-corrections
- Hesitation and Pauses: Thinking time, reading pauses, and hesitation before actions
- Smooth Deceleration: Natural slowing when approaching targets (Fitts' law)
- Bursty Patterns: Activity in bursts with natural pauses
- Curved Paths: Rarely perfectly straight movements
- Error Patterns: Occasional mistakes and corrections
- Unnatural Regularity: Too consistent timing and movement patterns
- Perfect Precision: Lack of micro-movements and tremor
- Immediate Actions: No hesitation or pauses
- Constant Speed: Lack of natural acceleration/deceleration
- Straight Paths: Overly direct movements
- Perfect Execution: Lack of errors or corrections
- All timestamps are converted from milliseconds to seconds
- Missing features are filled with 0
- Infinite values are replaced with 0
- Features are extracted per session (a sequence of events for a user)
- The feature extraction handles edge cases (empty events, insufficient data, etc.)
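The missing- and infinite-value conventions above amount to a small sanitation pass; a sketch (the real pipeline likely does this with pandas):

```python
import math

def sanitize(features):
    """Replace missing and non-finite feature values with 0, per the notes above."""
    clean = {}
    for name, value in features.items():
        if value is None or (isinstance(value, float) and not math.isfinite(value)):
            clean[name] = 0
        else:
            clean[name] = value
    return clean

row = sanitize({"a": 1.5, "b": float("inf"), "c": float("nan"), "d": None})
```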
The system extracts roughly 70 features across all categories, providing a comprehensive representation of user interaction patterns for effective bot detection.


