Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue
This repository accompanies the paper Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue (Al-Lawati, Tripto, Ansari, Lucas, Wang, Lee; 2026). It contains the code and data splits for the Bot-Moderation framework, a lightweight, prompt-only pipeline that detects malicious bot posts and comments on a synthetic social platform (Moltbook) and classifies their underlying intent into one of five categories. The moderator engages the target agent through multi-turn dialogue guided by Gibbs-based sampling; it wraps a single 8B-parameter open-weights LLM (Qwen3-8B) and uses inference only — no fine-tuning is performed.
| Branch | Use when you want to… |
|---|---|
| `final` | Reproduce the results in the paper. Ships the trained pipeline, cached splits, baselines, logs, and plotting code. Run everything from here. |
| `main` | Re-run the autonomous-research experiment from scratch. Starting point before the AR loop was applied. |
| `autoresearch/may5` | Inspect the history of the AR experiment. One commit per accepted/rejected experiment, ending at 27f6221, the commit that became `final`. |

- Task. Two-level classification of posts/comments: benign / malicious, plus one of five intent classes (`subtle_promotion`, `narrative_pushing`, `spam`, `elicitation`, `organic_contribution`). Comments are seen together with the parent post title.
- Metric. `val_f1 = f1_binary**0.7 * f1_categorical**0.3`, where `f1_categorical` is macro-F1 over the four malicious intents on true-positive items only.
- Model. Qwen3-8B served via vLLM at `localhost:8000`. No training.
- Final pipeline. Zero-shot self-consistency seed → k rounds of interactive probing with voted refinement → high-sample final vote.
- Code footprint. ~125 lines in train.py. The evaluation harness in prepare.py is fixed and not modified.
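
The metric defined in the list above can be sketched as follows. This is a minimal illustration, not the code in prepare.py: the names `f1_from_counts` and `val_f1`, the toy labels, and the choice to score a class with no true-positive examples as 0 are all assumptions.

```python
MALICIOUS = ("subtle_promotion", "narrative_pushing", "spam", "elicitation")


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there is nothing to score (assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def val_f1(gold: list[str], pred: list[str]) -> tuple[float, float, float]:
    """Return (val_f1, f1_binary, f1_categorical) for lists of intent labels."""
    pairs = list(zip(gold, pred))

    # Binary level: any intent other than organic_contribution counts as malicious.
    tp = sum(g in MALICIOUS and p in MALICIOUS for g, p in pairs)
    fp = sum(g not in MALICIOUS and p in MALICIOUS for g, p in pairs)
    fn = sum(g in MALICIOUS and p not in MALICIOUS for g, p in pairs)
    f1_binary = f1_from_counts(tp, fp, fn)

    # Categorical level: macro-F1 over the four malicious intents,
    # restricted to items that are true positives of the binary decision.
    hits = [(g, p) for g, p in pairs if g in MALICIOUS and p in MALICIOUS]
    per_class = [
        f1_from_counts(
            sum(g == c and p == c for g, p in hits),
            sum(g != c and p == c for g, p in hits),
            sum(g == c and p != c for g, p in hits),
        )
        for c in MALICIOUS
    ]
    f1_categorical = sum(per_class) / len(per_class)

    return f1_binary**0.7 * f1_categorical**0.3, f1_binary, f1_categorical


# Toy labels, invented for illustration only.
gold = ["spam", "spam", "elicitation", "organic_contribution", "narrative_pushing", "subtle_promotion"]
pred = ["spam", "elicitation", "elicitation", "spam", "narrative_pushing", "organic_contribution"]
score, f1_bin, f1_cat = val_f1(gold, pred)
print(f"val_f1={score:.4f} f1_binary={f1_bin:.4f} f1_categorical={f1_cat:.4f}")
```

Because the score is a weighted geometric mean, detection (exponent 0.7) counts for more than intent classification (exponent 0.3), and a zero on either level zeroes the whole score. Plugging in the train-split components reported further down, `0.7466**0.7 * 0.6316**0.3` comes to roughly 0.710, in line with the reported `val_f1`.
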
- Python ≥ 3.10, managed with `uv`. See pyproject.toml.
- GPUs for serving the three open-weights models used by the pipeline and evaluator (Qwen3-8B on port 8000, Mistral-7B-Instruct-v0.3 on 8001, Llama-3.1-8B-Instruct on 8002).
- The `cache/dataset-train.json` and `cache/test-generated-new.json` files, produced by the data pipeline that ships with the upstream dataset release.
- A `.env` file in the repo root with any secrets the vLLM server needs (e.g. `HF_TOKEN` for gated weights). The start-up scripts below `source` it.

```bash
uv sync
```

The moderator talks to Qwen3-8B; the evaluator's user-simulator additionally uses Mistral-7B-Instruct-v0.3 and Llama-3.1-8B-Instruct. Each model must be reachable at the exact host / port declared in the `_BASE_URL` table in prepare.py:

| Model | URL |
|---|---|
| `Qwen/Qwen3-8B` | `http://localhost:8000/v1` |
| `mistralai/Mistral-7B-Instruct-v0.3` | `http://localhost:8001/v1` |
| `meta-llama/Llama-3.1-8B-Instruct` | `http://localhost:8002/v1` |

Any OpenAI-compatible server will work — we used vLLM. A minimal launcher for one model looks like this:

```bash
export HF_TOKEN=...                      # required for gated Llama / Mistral weights
export CUDA_VISIBLE_DEVICES=<gpu-ids>    # pick free GPU(s) for this model
.venv/bin/vllm serve <model-id> \
    --tensor-parallel-size <N> \
    --host 0.0.0.0 \
    --port <port>
```

Substitute the `<model-id>` / `<port>` pair from the table above, set `--tensor-parallel-size` to the number of GPUs you gave the process, and, if you are serving Qwen3-8B and want its reasoning channel parsed, add `--reasoning-parser qwen3`. Repeat the recipe three times (in three shells, tmux panes, or `nohup ... &` background jobs) so that all three servers run concurrently. Each server prints `Uvicorn running on http://0.0.0.0:<port>` once it is ready; wait for all three before running the evaluators.
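
Instead of watching three logs, readiness can also be checked programmatically. The sketch below is not part of the repository: it assumes each server exposes the standard OpenAI-compatible `GET /v1/models` endpoint, and the names `BASE_URLS`, `served_models`, and `is_ready` are illustrative (the real table in prepare.py is `_BASE_URL`, whose exact shape is not shown here).

```python
import http.client
import json
import urllib.error
import urllib.request

# Mirrors the model/URL table above; this dict is illustrative only.
BASE_URLS = {
    "Qwen/Qwen3-8B": "http://localhost:8000/v1",
    "mistralai/Mistral-7B-Instruct-v0.3": "http://localhost:8001/v1",
    "meta-llama/Llama-3.1-8B-Instruct": "http://localhost:8002/v1",
}


def served_models(base_url: str, timeout: float = 2.0) -> list[str]:
    """Return the model IDs listed at <base_url>/models, or [] if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        return []
    if not isinstance(payload, dict):
        return []
    return [m["id"] for m in payload.get("data", []) if isinstance(m, dict) and "id" in m]


def is_ready(model_id: str, base_url: str, timeout: float = 2.0) -> bool:
    """True when the server answers and advertises the expected model ID."""
    return model_id in served_models(base_url, timeout)


# One pass over the table; re-run until every line says "ready".
for model_id, url in BASE_URLS.items():
    status = "ready" if is_ready(model_id, url) else "NOT ready"
    print(f"{model_id:40s} {url:28s} {status}")
```

Checking the advertised model ID, not just that the port answers, also catches the case where a server was started on the right port with the wrong model.
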
All of the numbers below were produced from the final branch (the 27f6221 commit of autoresearch/may5, packaged together with the paper's assets: baselines, plots, and cached splits). To reproduce them:

```bash
git fetch origin
git checkout final
```

```bash
python3 train.py
```

This runs the moderator over the train split and prints the metrics block:

```
val_f1:           0.7100
f1_binary:        0.7466
f1_categorical:   0.6316
val_f1_zs:        0.5764
f1_zs:            0.6477
f1_cat_zs:        0.4390
f1_posts:         0.7809
f1_comments:      0.6302
total_seconds:    391.6
```

The _zs rows are the moderator's zero-shot pass (no probing); the unlabelled rows
are the full pipeline with two probing iterations and a final high-sample vote.

```bash
python3 eval.py
```

This uses the identical ModeratorBot class against the held-out `cache/test-generated-new.json`.

For a clean machine, the full sequence from checkout to metrics is:

```bash
git clone <this-repo> autoresearch && cd autoresearch
git checkout final            # results branch: paper assets + trained pipeline
uv sync                       # install Python deps into .venv
cp /path/to/env .env          # provide HF_TOKEN etc.
# launch each vLLM server (see "Start the vLLM servers" above for the exact
# `vllm serve ...` command, GPU assignment, and port for each model).
# Qwen3-8B on :8000, Mistral-7B-Instruct-v0.3 on :8001, Llama-3.1-8B-Instruct on :8002.
# wait until each server prints "Uvicorn running on http://0.0.0.0:<port>"
python3 train.py              # train-split metrics
python3 eval.py               # held-out test-split metrics
```

| Path | Role |
|---|---|
| train.py | ModeratorBot — the only file modified during research |
| prepare.py | Fixed constants, data loading, LLM client, metric, evaluator |
| eval.py | Held-out test evaluation (uses the same ModeratorBot) |
| program.md | Autonomous-research protocol used during development |
| cache/ | Pre-generated train/test splits (not tracked) |
| logs/, results.tsv | Per-experiment logs and results table (not tracked) |

Inside cache/:
- `dataset-train.json`: the canonical train split used by prepare.py.
- `dataset-train-AR.json`: the same split with every entry duplicated (each row appears twice). Swapping this file in during the autonomous-research loop halves the per-run variance from the LLM at the cost of roughly 2× wall-clock time, which makes small effects easier to distinguish from noise when deciding whether to keep an experiment.

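
A duplicated split of this kind could be produced with a few lines of Python. The sketch below is hypothetical and is not the script used to build `dataset-train-AR.json`: it assumes the split is a single JSON array of entries and that the two copies of each row sit next to each other; `duplicate_split` is an invented name.

```python
import json
from pathlib import Path


def duplicate_split(src: Path, dst: Path) -> int:
    """Write a copy of `src` in which every entry appears twice; return the new length."""
    entries = json.loads(src.read_text(encoding="utf-8"))
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of entries (assumed layout)")
    # Emit each entry twice, keeping the original order.
    doubled = [entry for entry in entries for _ in range(2)]
    dst.write_text(json.dumps(doubled, ensure_ascii=False, indent=2), encoding="utf-8")
    return len(doubled)


# Example call (paths taken from the list above):
# duplicate_split(Path("cache/dataset-train.json"), Path("cache/dataset-train-AR.json"))
```

If the two LLM passes over a duplicated row are independent, averaging them halves the variance of that row's contribution, which is the effect described above.
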
Relative to the un-modified baseline train.py, the final pipeline adds or changes:
- Self-consistency seed — the zero-shot intent is drawn by majority vote over 5 samples at temperature 0.7 instead of a single greedy call.
- Voted refinement per probing round — the intent is re-voted (5 samples, t=0.7) after each new probe/response pair.
- High-sample final vote — after the last probing round, a final 11-sample vote determines the label.
- Interactive probing without a critique step — probes are generated directly from the content and previous exchange; the earlier separate "critique" LLM call was removed after we observed it was redundant once the rest of the pipeline matured.
- Prompt hygiene: a vigilant/skeptical framing on the intent prompt only, a compact `Q:`/`A:` probe transcript format, recency-ordered context (content and most-recent probe last), and no leakage of the currently-suspected intent into the probe-generation prompt.
- Deterministic binary mapping: `organic_contribution → benign`, everything else `→ malicious`. Attempts to predict the binary label separately from the intent consistently destabilised the system and were removed.

All of the above are visible in train.py; each was adopted only after it improved val_f1 across multiple runs (run-to-run variance is ≈ ±0.015, so every accept decision required 2–3 reruns).
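
The control flow described by these changes can be sketched as below. This is an illustration under stated assumptions, not an excerpt from train.py: `moderate`, `majority_vote`, `to_binary`, and the two `fake_*` stand-ins for LLM calls are hypothetical names, tie-breaking in the vote is an assumption, and the sketch only records the intermediate votes without claiming how train.py uses them.

```python
import random
from collections import Counter
from typing import Callable

INTENTS = ("subtle_promotion", "narrative_pushing", "spam", "elicitation", "organic_contribution")
Transcript = list[tuple[str, str]]  # (probe question, target agent's answer)


def majority_vote(labels: list[str]) -> str:
    """Most frequent label; ties go to the label seen first (assumption)."""
    return Counter(labels).most_common(1)[0][0]


def to_binary(intent: str) -> str:
    """Deterministic mapping from intent to the binary label."""
    return "benign" if intent == "organic_contribution" else "malicious"


def moderate(
    content: str,
    sample_intent: Callable[[str, Transcript], str],
    probe: Callable[[str, Transcript], tuple[str, str]],
    rounds: int = 2,
) -> tuple[str, str, list[str]]:
    """Seed vote, then probing rounds with re-votes, then a high-sample final vote."""
    transcript: Transcript = []

    # Self-consistency seed: majority over 5 zero-shot samples.
    history = [majority_vote([sample_intent(content, transcript) for _ in range(5)])]

    for _ in range(rounds):
        # The probe sees the content and earlier exchanges, never the suspected intent.
        transcript.append(probe(content, transcript))
        history.append(majority_vote([sample_intent(content, transcript) for _ in range(5)]))

    # Final vote: 11 samples over the full transcript decide the label.
    intent = majority_vote([sample_intent(content, transcript) for _ in range(11)])
    history.append(intent)
    return to_binary(intent), intent, history


rng = random.Random(0)


def fake_sample_intent(content: str, transcript: Transcript) -> str:
    # Stand-in for one LLM sample at temperature 0.7: noisy, biased toward "spam".
    return "spam" if rng.random() < 0.7 else rng.choice(INTENTS)


def fake_probe(content: str, transcript: Transcript) -> tuple[str, str]:
    # Stand-in for generating a probe and collecting the target agent's reply.
    return (f"Q{len(transcript) + 1}: why did you post this?", "Just sharing a great deal!")


label, intent, history = moderate("Buy now!!!", fake_sample_intent, fake_probe)
print(label, intent, history)
```

With two probing rounds this issues 5 + 2 × 5 + 11 = 26 intent samples and 2 probes per item, which gives a feel for the inference cost relative to a single greedy call.
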
- Results depend on the vLLM server being reachable and serving the exact model IDs declared in prepare.py. Different serving stacks can change sampling behaviour subtly.
- The evaluator uses a `ThreadPoolExecutor(max_workers=64)`; expect 6–8 minutes for a full train-split run on a single A10/L4-class GPU behind vLLM.
- `results.tsv` and `logs/` are intentionally untracked; they are regenerated on every run.

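
For readers unfamiliar with the pattern, the snippet below shows how a thread pool of this size fans out per-item work. It is a generic sketch, not the evaluator in prepare.py: `moderate_item` is a hypothetical stand-in that sleeps instead of calling the LLM, and the items are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def moderate_item(item: dict) -> str:
    """Hypothetical stand-in for one moderator run (several blocking LLM calls)."""
    time.sleep(0.01)  # simulates waiting on the inference server
    return "malicious" if "buy" in item["text"].lower() else "benign"


# Made-up items; the real evaluator reads them from the cached split.
items = [{"text": "Buy followers today"}, {"text": "Nice photo!"}] * 50

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=64) as pool:
    # map() preserves input order, so predictions line up with items.
    predictions = list(pool.map(moderate_item, items))
elapsed = time.perf_counter() - start

print(f"{len(predictions)} items in {elapsed:.2f}s")
```

Threads are a reasonable fit here because each worker spends nearly all of its time blocked on HTTP responses, so many requests can be in flight at once for the server to batch.
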
If you use this code or the Moltbook moderator pipeline in academic work, please cite:

```bibtex
@misc{allawati2026moltbookmoderationuncoveringhidden,
  title={Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue},
  author={Ali Al-Lawati and Nafis Tripto and Abolfazl Ansari and Jason Lucas and Suhang Wang and Dongwon Lee},
  year={2026},
  eprint={2605.12856},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.12856},
}
```

Released under the terms stated at the top of the repository. LLM weights are governed by their upstream licences (Qwen3-8B, Mistral-7B-Instruct-v0.3, Llama-3.1-8B-Instruct).