A high-speed desktop tool for searching, filtering, analyzing, and exporting Danbooru metadata for AI training dataset curation.
This project is based on ThetaCursed/Danbooru-Dataset-Filter and keeps the same core idea: a fast local Danbooru metadata explorer powered by Polars, Apache Parquet, and PyQt6. This extended version adds a plugin system, richer Danbooru-style query syntax, memory-safe processing modes, thumbnail previews, and multiple analysis extensions for tag research.
Designed for LoRA, checkpoint, Flux/SDXL/Anima, ControlNet, and general computer-vision dataset curation workflows.
The original Danbooru Dataset Filter is great for quickly filtering millions of Danbooru records by tags, score, favorites, rating, orientation, and date. This version turns that workflow into a larger research dashboard for people who need to explore tag relationships, tag growth, dataset trends, and query performance before exporting image URLs.
- Fast local search over Danbooru Parquet metadata with Polars lazy queries.
- Unified Danbooru-style query box supporting tags, negation, OR, wildcards, metadata filters, and order modifiers.
- Rating, date, score, favorites, orientation, and MD5 dedup filters from the GUI.
- Two search modes:
  - Fast in-process search for normal usage.
  - Memory-safe isolated search for huge queries that should release temporary native memory after completion.
- Optional thumbnail preview using Danbooru CDN thumbnail URLs.
- Extension loader that automatically loads Python plugins from `./extensions/`.
- Export image URLs to TXT for downstream downloaders or training pipelines.
- Dark Catppuccin-style UI optimized for long curation sessions.
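
The memory-safe isolated mode listed above can be pictured as running the heavy work in a child process, so that everything it allocated is returned to the operating system when the child exits. The following is a minimal sketch of that idea only; the `CHILD` script and the `isolated_search` helper are hypothetical and are not taken from the app's source, which may isolate searches differently.

```python
import json
import subprocess
import sys

# Hypothetical child script: does the heavy work, prints a small JSON summary, exits.
CHILD = """
import json, sys
limit = int(sys.argv[1])
big = list(range(limit))  # stands in for a large temporary search result
print(json.dumps({"rows": len(big), "total": sum(big)}))
"""


def isolated_search(limit):
    """Run the work in a separate Python process and return only its summary."""
    done = subprocess.run(
        [sys.executable, "-c", CHILD, str(limit)],
        capture_output=True,
        text=True,
        check=True,  # raise if the child process fails
    )
    # The child's memory is released once it exits; only the JSON text remains.
    return json.loads(done.stdout)
```

The trade-off is start-up and serialization overhead on every search, which is why a fast in-process mode is still useful for normal queries.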
| Area | Upstream project | This extended version |
|---|---|---|
| Core purpose | Fast GUI metadata filtering and URL export | Same core filtering workflow, plus analysis and research tooling |
| Query input | Include/exclude style filtering | Unified Danbooru-style query box with OR, negation, wildcards, numeric ranges, and `order:` syntax |
| Data loading | Main local Parquet database | Combines local clean metadata and optional API-synced metadata when available |
| Memory behavior | Fast local Polars search | Adds a memory-safe isolated search mode for very large queries |
| Extensions | Not the main focus | Built-in extension API using setup(app) / register(app), extension buttons, get_current_df(), and data_updated |
| Visual preview | Table preview and tags | Optional CDN thumbnails, colored tags, clickable tag preview, and image viewing workflow |
| Analytics | Basic curation workflow | Tag analytics dashboard, global tag comparison charts, related-tag mapping, bubble visualization, and tag-growth discovery |
| Tag discovery | Manual query exploration | Low-RAM normalized tag explosion finder with baseline/recent windows and tag-type filters |
| Research workflow | Find and export a dataset | Explore trends, compare tags over time, inspect related tags, then export URLs |
1girl score:>50 rating:g,s order:score
hatsune_miku OR megurine_luka -lowres favcount:>=25
*miku* order:favcount
width:>=1024 height:>=1024 rating:e
score:100..500 date order:random
Supported query concepts include:
- Exact tags: `1girl`, `solo`, `blue_archive`
- Negative tags: `-lowres`, `-bad_id`
- Simple OR: `hatsune_miku OR megurine_luka`
- Wildcards: `*miku*`
- Numeric metadata: `score:>50`, `favcount:10..200`, `width:>=1024`, `height:<2048`, `id:123456`
- Ratings: `rating:g`, `rating:s`, `rating:q`, `rating:e`, or comma groups like `rating:g,s`
- Sorting from the query: `order:score`, `order:favcount`, `order:date`, `order:id`, `order:random`
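
To make the query concepts above concrete, here is a small sketch of how such a query string could be split into structured filters. It is an illustration under stated assumptions, not the app's actual parser: `ParsedQuery`, `parse_range`, `parse_query`, and `tags_match` are hypothetical names, and only the syntax shown in the examples is handled.

```python
import re
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Optional

# Numeric keys taken from the examples above; any other token is treated as a tag.
NUMERIC_KEYS = {"score", "favcount", "width", "height", "id"}
_COMPARISON = re.compile(r"^(>=|<=|>|<)?(\d+)$")


@dataclass
class ParsedQuery:
    include: list = field(default_factory=list)  # list of OR groups
    exclude: list = field(default_factory=list)  # negated tags
    numeric: dict = field(default_factory=dict)  # key -> (low, high), inclusive
    ratings: set = field(default_factory=set)
    order: Optional[str] = None


def parse_range(value):
    """Turn '>50', '10..200', or '123456' into an inclusive (low, high) pair."""
    if ".." in value:
        low, high = value.split("..", 1)
        return int(low), int(high)
    match = _COMPARISON.match(value)
    if match is None:
        raise ValueError(f"unsupported numeric filter: {value!r}")
    op, number = match.group(1), int(match.group(2))
    bounds = {">": (number + 1, None), ">=": (number, None),
              "<": (None, number - 1), "<=": (None, number)}
    return bounds.get(op, (number, number))


def parse_query(text):
    query, join_next = ParsedQuery(), False
    for token in text.split():
        if token == "OR":
            join_next = True  # the next tag joins the previous OR group
            continue
        key, sep, value = token.partition(":")
        if sep and key in NUMERIC_KEYS:
            query.numeric[key] = parse_range(value)
        elif sep and key == "rating":
            query.ratings.update(value.split(","))
        elif sep and key == "order":
            query.order = value
        elif token.startswith("-") and len(token) > 1:
            query.exclude.append(token[1:])
        elif join_next and query.include:
            query.include[-1].append(token)
        else:
            query.include.append([token])
        join_next = False
    return query


def tags_match(query, tags):
    """Check the tag part of a query; wildcards use shell-style matching."""
    def hit(pattern):
        return any(fnmatchcase(tag, pattern) for tag in tags)
    wanted = all(any(hit(p) for p in group) for group in query.include)
    return wanted and not any(hit(p) for p in query.exclude)
```

For example, `parse_query("hatsune_miku OR megurine_luka -lowres favcount:>=25")` yields one OR group with both character tags, one excluded tag, and a lower bound of 25 on `favcount`.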
Place extension files in the `extensions/` folder. The app loads every `.py` file with a `setup(app)` or `register(app)` entry point.
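
The loading rule described above can be sketched with the standard `importlib` machinery. This is an assumption about how such a loader might look, not the app's real code; `load_extensions` is a hypothetical helper, and the choice to skip a failing plugin rather than abort is also an assumption.

```python
import importlib.util
from pathlib import Path


def load_extensions(app, folder="extensions"):
    """Import each .py file in `folder` and call its setup(app) or register(app)."""
    loaded = []
    for path in sorted(Path(folder).glob("*.py")):
        spec = importlib.util.spec_from_file_location(f"ext_{path.stem}", path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
            entry = getattr(module, "setup", None) or getattr(module, "register", None)
            if callable(entry):
                entry(app)
                loaded.append(path.name)
        except Exception as exc:  # one broken plugin should not stop the others
            print(f"Skipping {path.name}: {exc}")
    return loaded
```

Files without either entry point are imported but otherwise ignored, which matches the stated rule that only files exposing `setup(app)` or `register(app)` act as extensions.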
A dashboard for analyzing the current filtered dataset:
- Date distribution by year, month, or day.
- Rating distribution.
- Top character, copyright, artist, and general tags.
- Optional normalization and smoothing for date charts.
Compare multiple tag queries over time:
- Supports year/month/day grouping.
- Uses the same advanced query parser when available.
- Can normalize tag counts against total posts for the selected period.
- Caches chart data so repeated comparisons are faster.
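
Normalizing against total posts matters because upload volume changes over time: a tag can gain raw posts while its share stays flat. The sketch below illustrates the idea with plain Python; `normalized_series` is a hypothetical helper and the ISO `YYYY-MM-DD` date strings are an assumption about the input, not the extension's actual data model.

```python
from collections import Counter


def normalized_series(tag_dates, all_dates, period="month"):
    """Share of posts per period that carry the tag.

    Dates are ISO 'YYYY-MM-DD' strings, so a prefix identifies the period.
    """
    width = {"year": 4, "month": 7, "day": 10}[period]
    totals = Counter(d[:width] for d in all_dates)
    hits = Counter(d[:width] for d in tag_dates)
    return {p: hits[p] / totals[p] for p in sorted(totals)}


all_dates = ["2024-01-05", "2024-01-20",
             "2024-02-01", "2024-02-02", "2024-02-03", "2024-02-04"]
tag_dates = ["2024-01-05", "2024-02-01", "2024-02-02"]
series = normalized_series(tag_dates, all_dates)
```

Here the raw count doubles from January to February, but `series` holds the same share for both months, because total uploads doubled as well.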
Find tags that are growing unusually fast between two date windows:
- Streams Parquet batches with PyArrow instead of loading everything into RAM.
- Compares a baseline window against a recent/explosion window.
- Scores tags by normalized growth, recent count, and percentage delta.
- Supports tag-type filters: artist, copyright, character, general, and meta.
- Supports seed tags, exclusion lists, only-new tags, exclude-new tags, and artist-dominance filtering.
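
As a rough illustration of the baseline-versus-recent comparison, the sketch below scans posts one batch at a time and ranks tags by the change in their share of posts. The scoring formula, the `scan` and `growth_table` helpers, and the `(date, tags)` tuple layout are all assumptions made for this example; the extension's real scoring and its PyArrow batch handling may differ.

```python
from collections import Counter


def scan(batches, baseline, recent):
    """Single pass over batches; each post is (iso_date, tags)."""
    b_tags, r_tags = Counter(), Counter()
    b_posts = r_posts = 0
    for batch in batches:  # only one batch is held in memory at a time
        for created, tags in batch:
            if baseline[0] <= created <= baseline[1]:
                b_posts += 1
                b_tags.update(set(tags))
            elif recent[0] <= created <= recent[1]:
                r_posts += 1
                r_tags.update(set(tags))
    return (b_tags, b_posts), (r_tags, r_posts)


def growth_table(base, recent, min_recent=2):
    """Rows of (tag, recent_count, share_delta, percent_delta), best first."""
    (b_tags, b_posts), (r_tags, r_posts) = base, recent
    rows = []
    for tag, r_count in r_tags.items():
        if r_count < min_recent:
            continue
        b_rate = b_tags[tag] / b_posts if b_posts else 0.0
        r_rate = r_count / r_posts
        # A tag absent from the baseline has no meaningful percentage delta.
        pct = None if b_rate == 0 else (r_rate - b_rate) / b_rate * 100
        rows.append((tag, r_count, r_rate - b_rate, pct))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


batches = [
    [("2023-01-10", ["a", "b"]), ("2023-02-01", ["a"])],
    [("2024-01-10", ["a", "c"]), ("2024-01-11", ["c"]),
     ("2024-01-12", ["c", "b"]), ("2024-01-13", ["a"])],
]
base, recent = scan(batches, ("2023-01-01", "2023-12-31"), ("2024-01-01", "2024-12-31"))
rows = growth_table(base, recent)
```

In this toy data, tag `c` is new in the recent window and ranks first, while tag `a` keeps the same raw count but loses share because the recent window has more posts.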
Explore tag relationships inside the current search result:
- Counts tag coverage by category.
- Maps related tags from a selected tag.
- Opens an interactive bubble graph of co-occurring tags.
- Includes category legend, search highlighting, edge-strength controls, zoom controls, context menus, CSV export, and clipboard helpers.
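
Related-tag mapping comes down to counting which tags appear together with a selected tag. A minimal sketch follows, assuming each post is simply a list of tag strings; `related_tags` and its strength measure (shared posts divided by posts containing the selected tag) are hypothetical and may not match how the mapper weights its edges.

```python
from collections import Counter


def related_tags(posts, selected, top=10):
    """Rank tags that co-occur with `selected`.

    Returns (tag, shared_posts, strength), where strength is the fraction
    of posts containing `selected` that also contain the tag.
    """
    co_occurring = Counter()
    base = 0
    for tags in posts:
        tags = set(tags)
        if selected in tags:
            base += 1
            co_occurring.update(tags - {selected})
    return [(tag, n, n / base) for tag, n in co_occurring.most_common(top)]


posts = [
    ["miku", "twintails", "solo"],
    ["miku", "twintails"],
    ["luka", "solo"],
    ["miku"],
]
related = related_tags(posts, "miku")
```

A strength threshold on the third field is one plausible way to implement an edge-strength control: weaker pairs are simply filtered out before drawing.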
You can download the required data files from the repository's Releases page.
Two release ZIPs are available:
- `statbooru.zip`: ready-to-use package with the `.exe`, extensions, and data included.
- `data.zip`: data-only package for users who want to run Statbooru from their own Python environment.
If you use statbooru.zip, extract it and run the executable.
If you use data.zip, extract it and place the included data/ folder next to main.py.
git clone https://github.com/Y1-studio/statbooru.git
cd statbooru
python -m venv .venv

Activate it:

# Windows
.venv\Scripts\activate
# Linux / macOS
source .venv/bin/activate

pip install polars pyarrow PyQt6 requests matplotlib

Optional but recommended for a repo release:

pip freeze > requirements.txt

Create a `data/` folder next to `main.py`:
The required metadata files can be downloaded from the repository's Releases page.
Download data.zip if you only need the metadata files for a self-made Python environment.
project-root/
├── data/
│   ├── danbooru2026_clean.parquet
│   ├── tags_dictionary.parquet
│   ├── danbooru_api_clean.parquet      # optional
│   └── tags_dictionary_API.parquet     # optional
├── extensions/
├── main.py
└── README.md
The base upstream project points users to the Danbooru 2026 clean metadata files on Hugging Face. This extended app expects the same style of local Parquet metadata and tag dictionary files.
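
Before launching, it can help to confirm that the layout above is in place. The following sketch checks for the file names listed in the tree; `check_data_dir` is a hypothetical helper and is not part of the app.

```python
from pathlib import Path

# File names as listed in the layout above.
REQUIRED = ["danbooru2026_clean.parquet", "tags_dictionary.parquet"]
OPTIONAL = ["danbooru_api_clean.parquet", "tags_dictionary_API.parquet"]


def check_data_dir(root="."):
    """Return (missing required files, optional files that are present)."""
    data = Path(root) / "data"
    missing = [name for name in REQUIRED if not (data / name).is_file()]
    optional_present = [name for name in OPTIONAL if (data / name).is_file()]
    return missing, optional_present
```

Running `check_data_dir()` from the project root returns an empty first list when both required Parquet files are in `data/`.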
Recommended file names:
extensions/
├── analytics.py
├── tag_compare.py
├── tag_explosion_finder.py
└── interactive_tag_mapper_extension_v9.py
If your files were downloaded with names like analytics(2).py or tag_compare(1).py, rename them before committing.
python main.py

- Start the app with `python main.py`.
- Enter a query such as `1girl score:>50 rating:g,s order:favcount`.
- Adjust score/favorite thresholds, rating, orientation, deduplication, and date filters.
- Choose Fast in-process for speed or Memory-safe isolated for massive searches.
- Run the search.
- Use the extensions to analyze results:
  - Open Tag Analytics for distribution summaries.
  - Open Compare Tags Globally to chart tag trends.
  - Open Low-RAM Tag Explosions to discover rising tags.
  - Open Interactive Tag Mapper to inspect related tags and co-occurrence clusters.
- Export image URLs to `.txt` when the dataset looks right.
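
The final export step, combined with MD5 deduplication, can be sketched as follows. The column names `md5` and `file_url` come from the metadata columns this README mentions; `export_urls` and the row-as-dict layout are assumptions for illustration, and the URLs in the usage note are placeholders.

```python
from pathlib import Path


def export_urls(rows, out_path):
    """Write one URL per line, keeping the first row seen for each MD5."""
    seen = set()
    urls = []
    for row in rows:
        md5, url = row.get("md5"), row.get("file_url")
        if not url or md5 in seen:
            continue  # skip rows without a URL and repeated images
        if md5:
            seen.add(md5)
        urls.append(url)
    text = "\n".join(urls) + ("\n" if urls else "")
    Path(out_path).write_text(text, encoding="utf-8")
    return len(urls)
```

Given two rows that share an MD5 and one row without a URL, only the first of the duplicates is written, and the resulting TXT file can be passed directly to an external downloader.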
project-root/
├── main.py
├── README.md
├── extensions/
│   ├── analytics.py
│   ├── tag_compare.py
│   ├── tag_explosion_finder.py
│   └── interactive_tag_mapper_extension_v9.py
└── data/
    ├── danbooru2026_clean.parquet
    ├── tags_dictionary.parquet
    ├── danbooru_api_clean.parquet
    └── tags_dictionary_API.parquet
Extensions can integrate with the host app through:
def setup(app):
    # create buttons, dialogs, and hooks here
    ...

Useful host methods/signals:

- `app.add_extension_button(button)`: add a button to the main extension button row.
- `app.get_current_df()`: access the current filtered Polars DataFrame.
- `app.get_results_df()`: compatibility alias for `get_current_df()`.
- `app.get_unlimited_lazy_df()`: access a lazy pipeline when available.
- `app.data_updated`: react when the active search result changes.
- This tool works on local metadata. It does not download full images by itself unless you use the exported URLs with an external downloader.
- Thumbnail preview requires network access to the Danbooru CDN.
- Very large searches can temporarily use significant RAM. Use Memory-safe isolated mode when working with huge result sets.
- The app assumes Danbooru-style metadata columns such as `id`, `rating`, `score`, `fav_count`, `file_url`, `created_at`, `md5`, and tag category columns.
- Required metadata files are distributed through GitHub Releases as `data.zip`.
- The ready-to-use release package is distributed as `statbooru.zip` and includes the executable plus data.
Based on ThetaCursed/Danbooru-Dataset-Filter, a fast Polars/PyQt6 GUI for curating Danbooru image datasets.
This extended version adds plugin-based analytics, richer query parsing, memory-focused processing, and tag research tools on top of the original concept.