Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue

This repository accompanies the paper Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue (Al-Lawati, Tripto, Ansari, Lucas, Wang, Lee; 2026). It contains the code and data splits for the Bot-Moderation framework, a lightweight, prompt-only pipeline that detects malicious bot posts and comments on a synthetic social platform (Moltbook) and classifies their underlying intent into one of five categories. The moderator engages the target agent through multi-turn dialogue guided by Gibbs-based sampling; it wraps a single 8B-parameter open-weights LLM (Qwen3-8B) and uses inference only — no fine-tuning is performed.

Branches

Branch	Use when you want to…
`final`	Reproduce the results in the paper. Ships the tuned pipeline, cached splits, baselines, logs, and plotting code. Run everything from here.
`main`	Re-run the autonomous-research experiment from scratch. Starting point before the AR loop was applied.
`autoresearch/may5`	Inspect the history of the AR experiment. One commit per accepted/rejected experiment, ending at `27f6221` — the commit that became `final`.

Release summary

Task. Two-level classification of posts/comments: benign / malicious, plus one of five intent classes (subtle_promotion, narrative_pushing, spam, elicitation, organic_contribution). Comments are seen together with the parent post title.
Metric. val_f1 = f1_binary**0.7 * f1_categorical**0.3, where f1_categorical is macro-F1 over the four malicious intents on true-positive items only.
Model. Qwen3-8B served via vLLM at localhost:8000. No training.
Final pipeline. Zero-shot self-consistency seed → k rounds of interactive probing with voted refinement → high-sample final vote.
Code footprint. ~125 lines in train.py. The evaluation harness in prepare.py is fixed and not modified.
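
The metric above can be sketched in a few lines. This is an illustrative reading, not the code in prepare.py: the function names are hypothetical, and restricting the categorical score to items that are both truly malicious and predicted malicious is an assumed interpretation of "true-positive items only".

```python
MALICIOUS = ("subtle_promotion", "narrative_pushing", "spam", "elicitation")
BENIGN = "organic_contribution"


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def combine(f1_binary: float, f1_categorical: float) -> float:
    """Weighted geometric mean reported as val_f1."""
    return f1_binary ** 0.7 * f1_categorical ** 0.3


def score(gold: list[str], pred: list[str]) -> dict[str, float]:
    """Score intent predictions; gold and pred hold one intent label per item."""
    pairs = list(zip(gold, pred))

    # Binary level: any intent other than organic_contribution counts as malicious.
    tp = sum(g != BENIGN and p != BENIGN for g, p in pairs)
    fp = sum(g == BENIGN and p != BENIGN for g, p in pairs)
    fn = sum(g != BENIGN and p == BENIGN for g, p in pairs)
    f1_binary = f1_from_counts(tp, fp, fn)

    # Categorical level (assumed): macro-F1 over the four malicious intents,
    # computed only on items that are malicious and were flagged as malicious.
    hits = [(g, p) for g, p in pairs if g != BENIGN and p != BENIGN]
    per_class = [
        f1_from_counts(
            sum(g == c and p == c for g, p in hits),
            sum(g != c and p == c for g, p in hits),
            sum(g == c and p != c for g, p in hits),
        )
        for c in MALICIOUS
    ]
    f1_categorical = sum(per_class) / len(per_class)

    return {
        "f1_binary": f1_binary,
        "f1_categorical": f1_categorical,
        "val_f1": combine(f1_binary, f1_categorical),
    }


# Plugging in the component scores reported for the full pipeline.
print(f"{combine(0.7466, 0.6316):.4f}")
```

Because the binary score carries the larger exponent, a drop in detection quality costs more than an equal drop in intent classification. With the component scores reported later in this README, combine agrees with the printed val_f1 values to within rounding.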

Quick start

Requirements

Python ≥ 3.10, managed with uv. See pyproject.toml.
GPUs for serving the three open-weights models used by the pipeline and evaluator (Qwen3-8B on port 8000, Mistral-7B-Instruct-v0.3 on 8001, Llama-3.1-8B-Instruct on 8002).
The cache/dataset-train.json and cache/test-generated-new.json files, produced by the data pipeline that ships with the upstream dataset release.
A .env file in the repo root with any secrets the vLLM server needs (e.g. HF_TOKEN for gated weights). The start-up scripts below source it.

Install

uv sync

Start the vLLM servers

The moderator talks to Qwen3-8B; the evaluator's user-simulator additionally uses Mistral-7B-Instruct-v0.3 and Llama-3.1-8B-Instruct. Each model must be reachable at the exact host / port declared in the _BASE_URL table in prepare.py:

Model	URL
`Qwen/Qwen3-8B`	`http://localhost:8000/v1`
`mistralai/Mistral-7B-Instruct-v0.3`	`http://localhost:8001/v1`
`meta-llama/Llama-3.1-8B-Instruct`	`http://localhost:8002/v1`

Any OpenAI-compatible server will work — we used vLLM. A minimal launcher for one model looks like this:

export HF_TOKEN=...                  # required for gated Llama / Mistral weights
export CUDA_VISIBLE_DEVICES=<gpu-ids> # pick free GPU(s) for this model
.venv/bin/vllm serve <model-id> \
    --tensor-parallel-size <N> \
    --host 0.0.0.0 \
    --port <port>

Substitute the <model-id> / <port> pair from the table above, set --tensor-parallel-size to the number of GPUs you gave the process, and — if you are serving Qwen3-8B and want its reasoning channel parsed — add --reasoning-parser qwen3. Repeat the recipe three times (in three shells, tmux panes, or nohup ... & background jobs) so that all three servers run concurrently.

Each server prints Uvicorn running on http://0.0.0.0:<port> once it is ready; wait for all three before running the evaluators.
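
Instead of watching three logs, readiness can also be checked programmatically. The sketch below is not part of the repository: it assumes each server answers the OpenAI-compatible GET /v1/models route and reports the model under the same ID used in the table above (which holds for vLLM unless a different served model name is configured). The helper names are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request

# Model ID -> base URL, copied from the table above.
BASE_URLS = {
    "Qwen/Qwen3-8B": "http://localhost:8000/v1",
    "mistralai/Mistral-7B-Instruct-v0.3": "http://localhost:8001/v1",
    "meta-llama/Llama-3.1-8B-Instruct": "http://localhost:8002/v1",
}


def fetch_model_ids(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the model IDs listed by GET {base_url}/models."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def missing_models(fetch=fetch_model_ids) -> list[str]:
    """Models from BASE_URLS that are not yet served at their expected URL."""
    missing = []
    for model_id, base_url in BASE_URLS.items():
        try:
            served = fetch(base_url)
        except (urllib.error.URLError, OSError, ValueError):
            served = []  # server not up yet, or not answering with JSON
        if model_id not in served:
            missing.append(model_id)
    return missing


def wait_until_ready(poll_seconds: float = 10.0, max_wait: float = 1800.0) -> bool:
    """Poll until every model is served; return False if max_wait elapses first."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if not missing_models():
            return True
        time.sleep(poll_seconds)
    return False
```

Calling wait_until_ready() before python3 train.py blocks until all three endpoints list their model; missing_models() on its own gives a one-off status report.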

Check out the tuned configuration

All of the numbers below were produced from the final branch (which is the 27f6221 commit of autoresearch/may5, packaged together with the paper's assets — baselines, plots, cached splits). To reproduce them:

git fetch origin
git checkout final

Train-set evaluation (the loop we optimised)

python3 train.py

This runs the moderator over the train split and prints the metrics block:

val_f1:           0.7100
f1_binary:        0.7466
f1_categorical:   0.6316
val_f1_zs:        0.5764
f1_zs:            0.6477
f1_cat_zs:        0.4390
f1_posts:         0.7809
f1_comments:      0.6302
total_seconds:    391.6

The rows with a _zs suffix report the moderator's zero-shot pass (no probing); the rows without the suffix report the full pipeline with two probing iterations and a final high-sample vote.
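
The printed block is plain key: value text, so the gain from probing can be read off with a small parser. This is a convenience sketch, not something the repository ships; parse_metrics is a hypothetical helper, and pairing f1_binary with f1_zs and f1_categorical with f1_cat_zs is an assumption based on the row names.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse lines of the form 'name: value' into a dict, skipping anything else."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            continue  # not a numeric row
    return metrics


# The values reported above for the train split.
REPORTED = """\
val_f1:           0.7100
f1_binary:        0.7466
f1_categorical:   0.6316
val_f1_zs:        0.5764
f1_zs:            0.6477
f1_cat_zs:        0.4390
"""

m = parse_metrics(REPORTED)
gain = {
    "val_f1": m["val_f1"] - m["val_f1_zs"],
    "f1_binary": m["f1_binary"] - m["f1_zs"],                # assumed pairing
    "f1_categorical": m["f1_categorical"] - m["f1_cat_zs"],  # assumed pairing
}
for name, delta in gain.items():
    print(f"{name}: {delta:+.4f}")
```

On the reported numbers, the largest difference between the two passes is in the categorical score.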

Test-set evaluation

python3 eval.py

This runs the same ModeratorBot class against the held-out split in cache/test-generated-new.json.

End-to-end one-shot

For a clean machine, the full sequence from checkout to metrics is:

git clone <this-repo> autoresearch && cd autoresearch
git checkout final                 # results branch: paper assets + tuned pipeline
uv sync                            # install Python deps into .venv
cp /path/to/env .env               # provide HF_TOKEN etc.

# launch each vLLM server (see "Start the vLLM servers" above for the exact
# `vllm serve ...` command, GPU assignment, and port for each model).
# Qwen3-8B on :8000, Mistral-7B-Instruct-v0.3 on :8001, Llama-3.1-8B-Instruct on :8002.

# wait until each server prints "Uvicorn running on http://0.0.0.0:<port>"

python3 train.py   # train-split metrics
python3 eval.py    # held-out test-split metrics

Repository layout

Path	Role
train.py	Defines `ModeratorBot`; the only file modified during research
prepare.py	Fixed constants, data loading, LLM client, metric, evaluator
eval.py	Held-out test evaluation (uses the same `ModeratorBot`)
program.md	Autonomous-research protocol used during development
`cache/`	Pre-generated train/test splits (not tracked)
`logs/`, `results.tsv`	Per-experiment logs and results table (not tracked)

Inside cache/:

dataset-train.json — the canonical train split used by prepare.py.
dataset-train-AR.json — the same split with every entry duplicated (each row appears twice). Swapping this file in during the autonomous-research loop halves the per-run variance from the LLM at the cost of roughly 2× wall-clock time, which makes small effects easier to distinguish from noise when deciding whether to keep an experiment.
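
A duplicated split of this kind could be produced as follows. This is a sketch under two assumptions that the README does not confirm: that the split is a top-level JSON array, and that the two copies of a row sit back to back. The function names are hypothetical.

```python
import json
from pathlib import Path


def duplicate_rows(rows: list) -> list:
    """Return a new list in which every row appears twice, back to back."""
    return [row for row in rows for _ in range(2)]


def write_duplicated_split(src: Path, dst: Path) -> int:
    """Read a JSON-array split from src, write the doubled version to dst."""
    rows = json.loads(src.read_text(encoding="utf-8"))
    if not isinstance(rows, list):
        raise ValueError(f"{src} is not a top-level JSON array")
    doubled = duplicate_rows(rows)
    dst.write_text(json.dumps(doubled, ensure_ascii=False, indent=2), encoding="utf-8")
    return len(doubled)
```

The variance claim follows from averaging: if the two passes over the same row are independent draws from the LLM, the mean of the two has half the variance of a single draw, which is why the doubled split gives a steadier val_f1 for twice the compute.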

What changed in this release

Relative to the un-modified baseline train.py, the final pipeline adds or changes:

Self-consistency seed — the zero-shot intent is drawn by majority vote over 5 samples at temperature 0.7 instead of a single greedy call.
Voted refinement per probing round — the intent is re-voted (5 samples, t=0.7) after each new probe/response pair.
High-sample final vote — after the last probing round, a final 11-sample vote determines the label.
Interactive probing without a critique step — probes are generated directly from the content and previous exchange; the earlier separate "critique" LLM call was removed after we observed it was redundant once the rest of the pipeline matured.
Prompt hygiene — a vigilant/skeptical framing on the intent prompt only, a compact Q: / A: probe transcript format, recency-ordered context (content and most-recent probe last), and no leakage of the currently-suspected intent into the probe-generation prompt.
Deterministic binary mapping — organic_contribution → benign, everything else → malicious. Attempts to predict the binary label separately from the intent consistently destabilised the system and were removed.

All of the above are visible in train.py. Each change was adopted only after it improved val_f1 across multiple runs: run-to-run variation in val_f1 is roughly ±0.015, so every accept decision required 2–3 reruns.
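
The control flow described in the list above can be sketched as follows. This is not the code in train.py: the three callables are hypothetical stand-ins for LLM calls (sample_intent for one intent sample at temperature 0.7, make_probe for probe generation, ask_target for the target agent's reply), and prompts are left out entirely.

```python
from collections import Counter
from typing import Callable

BENIGN = "organic_contribution"
Transcript = list[tuple[str, str]]  # (probe question, target answer) pairs


def vote(sample: Callable[[], str], n: int) -> str:
    """Majority vote over n samples; ties go to the label sampled first."""
    counts = Counter(sample() for _ in range(n))
    return counts.most_common(1)[0][0]


def moderate(
    content: str,
    sample_intent: Callable[[str, Transcript], str],
    make_probe: Callable[[str, Transcript], str],
    ask_target: Callable[[str], str],
    rounds: int = 2,
    n_vote: int = 5,
    n_final: int = 11,
) -> tuple[str, str, list[str]]:
    """Return (binary label, final intent, history of voted intents)."""
    transcript: Transcript = []

    # Self-consistency seed: zero-shot intent by majority vote.
    history = [vote(lambda: sample_intent(content, transcript), n_vote)]

    for _ in range(rounds):
        # The probe sees the content and earlier exchanges only,
        # never the currently suspected intent.
        question = make_probe(content, transcript)
        transcript.append((question, ask_target(question)))
        # Voted refinement after each new probe/response pair.
        history.append(vote(lambda: sample_intent(content, transcript), n_vote))

    # High-sample final vote determines the intent.
    intent = vote(lambda: sample_intent(content, transcript), n_final)
    history.append(intent)

    # Deterministic binary mapping.
    label = "benign" if intent == BENIGN else "malicious"
    return label, intent, history
```

With the defaults this issues 5 + 2 × 5 + 11 = 26 intent samples and two probe generations per item. The history list is only there to make the intermediate votes inspectable in this sketch.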

Reproducibility notes

Results depend on all three vLLM servers being reachable and serving the exact model IDs declared in prepare.py. Different serving stacks can subtly change sampling behaviour.
The evaluator uses a ThreadPoolExecutor(max_workers=64); expect 6–8 minutes for a full train-split run on a single A10/L4-class GPU behind vLLM.
results.tsv and logs/ are intentionally untracked — they are regenerated on every run.
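
For readers unfamiliar with the concurrency pattern mentioned above, a minimal sketch follows. It is not the evaluator in prepare.py: evaluate_concurrently and moderate_one are hypothetical names, and only the ThreadPoolExecutor(max_workers=64) setting is taken from this README.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def evaluate_concurrently(
    items: Iterable[dict],
    moderate_one: Callable[[dict], str],
    max_workers: int = 64,
) -> list[str]:
    """Apply moderate_one to every item in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs, not of completion.
        return list(pool.map(moderate_one, items))


# Toy usage with a stand-in for the real per-item moderation call.
demo = evaluate_concurrently(
    [{"id": i} for i in range(8)],
    lambda item: f"item-{item['id']}",
    max_workers=4,
)
print(demo[:3])
```

Threads are a reasonable fit here because each worker spends most of its time waiting on an HTTP response, and a server such as vLLM can batch requests that arrive concurrently. Lowering max_workers trades throughput for less pressure on a smaller GPU.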

Citation

If you use this code or the Moltbook moderator pipeline in academic work, please cite:

@misc{allawati2026moltbookmoderationuncoveringhidden,
      title={Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue}, 
      author={Ali Al-Lawati and Nafis Tripto and Abolfazl Ansari and Jason Lucas and Suhang Wang and Dongwon Lee},
      year={2026},
      eprint={2605.12856},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.12856}, 
}

License

Released under the terms stated at the top of the repository. LLM weights are governed by their upstream licences (Qwen3-8B, Mistral-7B-Instruct-v0.3, Llama-3.1-8B-Instruct).
