Y1-studio/statbooru

Statbooru

A high-speed desktop tool for searching, filtering, analyzing, and exporting Danbooru metadata for AI training dataset curation.

This project is based on ThetaCursed/Danbooru-Dataset-Filter and keeps the same core idea: a fast local Danbooru metadata explorer powered by Polars, Apache Parquet, and PyQt6. This extended version adds a plugin system, richer Danbooru-style query syntax, memory-safe processing modes, thumbnail previews, and multiple analysis extensions for tag research.

Designed for LoRA, checkpoint, Flux/SDXL/Anima, ControlNet, and general computer-vision dataset curation workflows.


Why this exists

The original Danbooru Dataset Filter is great for quickly filtering millions of Danbooru records by tags, score, favorites, rating, orientation, and date. This version turns that workflow into a larger research dashboard for people who need to explore tag relationships, tag growth, dataset trends, and query performance before exporting image URLs.


Key features

  • Fast local search over Danbooru Parquet metadata with Polars lazy queries.
  • Unified Danbooru-style query box supporting tags, negation, OR, wildcards, metadata filters, and order modifiers.
  • Rating, date, score, favorites, orientation, and MD5 dedup filters from the GUI.
  • Two search modes:
    • Fast in-process search for normal usage.
    • Memory-safe isolated search for huge queries that should release temporary native memory after completion.
  • Optional thumbnail preview using Danbooru CDN thumbnail URLs.
  • Extension loader that automatically loads Python plugins from ./extensions/.
  • Export image URLs to TXT for downstream downloaders or training pipelines.
  • Dark Catppuccin-style UI optimized for long curation sessions.
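
One way to picture the memory-safe isolated mode listed above is to run the heavy query in a short-lived child process, so the operating system reclaims all of its memory when it exits. The following is a minimal standard-library sketch, not Statbooru's actual implementation: the worker script, the `run_isolated` helper, and the JSON hand-off are all assumptions made for illustration.

```python
import json
import subprocess
import sys

# Hypothetical worker: reads a JSON request on stdin, filters rows, prints JSON.
# A real worker would run the Polars query here instead of a list comprehension.
WORKER_SOURCE = """
import json, sys
request = json.load(sys.stdin)
rows = [r for r in request["rows"] if r["score"] > request["min_score"]]
json.dump(rows, sys.stdout)
"""


def run_isolated(rows, min_score):
    """Run the filter in a child process and return its decoded result."""
    payload = json.dumps({"rows": rows, "min_score": min_score})
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=payload,
        capture_output=True,
        text=True,
        check=True,  # raise if the worker exits with an error
    )
    # The child has exited by now, so its temporary memory is already released.
    return json.loads(completed.stdout)
```

Calling `run_isolated([{"id": 1, "score": 10}, {"id": 2, "score": 80}], min_score=50)` returns only the second row. The trade-off is the cost of starting a process and serializing the result, which is why a fast in-process mode remains useful for ordinary queries.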

What changed from ThetaCursed/Danbooru-Dataset-Filter?

Area | Upstream project | This extended version
--- | --- | ---
Core purpose | Fast GUI metadata filtering and URL export | Same core filtering workflow, plus analysis and research tooling
Query input | Include/exclude style filtering | Unified Danbooru-style query box with OR, negation, wildcards, numeric ranges, and order: syntax
Data loading | Main local Parquet database | Combines local clean metadata and optional API-synced metadata when available
Memory behavior | Fast local Polars search | Adds a memory-safe isolated search mode for very large queries
Extensions | Not the main focus | Built-in extension API using setup(app) / register(app), extension buttons, get_current_df(), and data_updated
Visual preview | Table preview and tags | Optional CDN thumbnails, colored tags, clickable tag preview, and image viewing workflow
Analytics | Basic curation workflow | Tag analytics dashboard, global tag comparison charts, related-tag mapping, bubble visualization, and tag-growth discovery
Tag discovery | Manual query exploration | Low-RAM normalized tag explosion finder with baseline/recent windows and tag-type filters
Research workflow | Find and export a dataset | Explore trends, compare tags over time, inspect related tags, then export URLs

Query examples

1girl score:>50 rating:g,s order:score
hatsune_miku OR megurine_luka -lowres favcount:>=25
*miku* order:favcount
width:>=1024 height:>=1024 rating:e
score:100..500 date order:random

Supported query concepts include:

  • Exact tags: 1girl, solo, blue_archive
  • Negative tags: -lowres, -bad_id
  • Simple OR: hatsune_miku OR megurine_luka
  • Wildcards: *miku*
  • Numeric metadata: score:>50, favcount:10..200, width:>=1024, height:<2048, id:123456
  • Ratings: rating:g, rating:s, rating:q, rating:e, or comma groups like rating:g,s
  • Sorting from the query: order:score, order:favcount, order:date, order:id, order:random
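
To make the syntax above concrete, here is a minimal sketch of how such a query string could be split into tags, exclusions, OR groups, numeric filters, ratings, and an order modifier. It is not the app's parser: `ParsedQuery`, `parse_numeric`, and `parse_query` are hypothetical names, and the tuple encoding of filters is an assumption chosen for readability.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

NUMERIC_KEYS = {"score", "favcount", "width", "height", "id"}
RANGE_RE = re.compile(r"^(-?\d+)\.\.(-?\d+)$")       # e.g. 100..500
COMPARE_RE = re.compile(r"^(>=|<=|>|<)?(-?\d+)$")    # e.g. >50, >=1024, 123456


@dataclass
class ParsedQuery:
    include: list = field(default_factory=list)   # required tags (may contain *)
    exclude: list = field(default_factory=list)   # tags written with a leading -
    any_of: list = field(default_factory=list)    # groups of tags joined by OR
    filters: dict = field(default_factory=dict)   # e.g. {"score": (">", 50)}
    ratings: list = field(default_factory=list)   # e.g. ["g", "s"]
    order: Optional[str] = None


def parse_numeric(value):
    """Turn '>50', '10..200', or '123456' into a small tuple."""
    match = RANGE_RE.match(value)
    if match:
        return ("range", int(match.group(1)), int(match.group(2)))
    match = COMPARE_RE.match(value)
    if match:
        return (match.group(1) or "==", int(match.group(2)))
    raise ValueError(f"unsupported numeric filter: {value!r}")


def parse_query(query):
    parsed = ParsedQuery()
    tokens = query.split()
    i = 0
    while i < len(tokens):
        token = tokens[i]
        # Collect "a OR b OR c" chains into a single any_of group.
        if i + 1 < len(tokens) and tokens[i + 1] == "OR":
            group = [token]
            while i + 2 < len(tokens) and tokens[i + 1] == "OR":
                group.append(tokens[i + 2])
                i += 2
            parsed.any_of.append(group)
            i += 1
            continue
        key, sep, value = token.partition(":")
        if sep and key in NUMERIC_KEYS:
            parsed.filters[key] = parse_numeric(value)
        elif sep and key == "rating":
            parsed.ratings = value.split(",")
        elif sep and key == "order":
            parsed.order = value
        elif token.startswith("-") and len(token) > 1:
            parsed.exclude.append(token[1:])
        else:
            parsed.include.append(token)  # plain or wildcard tag
        i += 1
    return parsed
```

With this sketch, `parse_query("1girl score:>50 rating:g,s order:score")` yields one included tag, a `(">", 50)` score filter, two ratings, and `order == "score"`. Wildcard tags such as `*miku*` are kept verbatim in `include`; matching them (for example with `fnmatch`) would be a separate step.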

Extensions included

Place extension files in the extensions/ folder. The app loads every .py file with a setup(app) or register(app) entry point.
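
The loading rule above can be sketched with the standard library's `importlib`. This is an illustrative assumption about how such a loader might work, not the code in main.py; `load_extensions` is a hypothetical name, and the error handling shown is one possible choice.

```python
import importlib.util
from pathlib import Path


def load_extensions(app, folder="extensions"):
    """Import every .py file in `folder` and call its setup(app) or register(app)."""
    loaded = []
    for path in sorted(Path(folder).glob("*.py")):
        spec = importlib.util.spec_from_file_location(f"ext_{path.stem}", path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except Exception as exc:  # a broken plugin should not crash the host
            print(f"Skipping {path.name}: {exc}")
            continue
        # Prefer setup(app); fall back to register(app) if only that exists.
        entry = getattr(module, "setup", None) or getattr(module, "register", None)
        if callable(entry):
            entry(app)
            loaded.append(path.name)
    return loaded
```

Files without either entry point are imported but otherwise ignored, which matches the rule that only modules exposing `setup(app)` or `register(app)` act as extensions.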

📊 Tag Analytics

A dashboard for analyzing the current filtered dataset:

  • Date distribution by year, month, or day.
  • Rating distribution.
  • Top character, copyright, artist, and general tags.
  • Optional normalization and smoothing for date charts.
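
As a rough illustration of the date charts described above, the sketch below groups post dates by year, month, or day and applies a trailing moving average. The smoothing method the extension actually uses is not documented here, so the moving average and both function names are assumptions.

```python
from collections import Counter
from datetime import date


def date_distribution(dates, grouping="month"):
    """Count posts per year, month, or day; keys sort chronologically."""
    formats = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}
    fmt = formats[grouping]
    counts = Counter(d.strftime(fmt) for d in dates)
    return dict(sorted(counts.items()))


def moving_average(values, window=3):
    """Trailing moving average; early points average over what is available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

For example, three posts dated 2024-01-05, 2024-01-20, and 2024-03-01 grouped by month give two posts for `2024-01` and one for `2024-03`. Months with no posts are simply absent, so a chart would need to fill those gaps before smoothing.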

📈 Compare Tags Globally

Compare multiple tag queries over time:

  • Supports year/month/day grouping.
  • Uses the same advanced query parser when available.
  • Can normalize tag counts against total posts for the selected period.
  • Caches chart data so repeated comparisons are faster.
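
Normalizing against total posts means plotting a tag's share of each period rather than its raw count, which removes the effect of overall site growth. Here is a minimal sketch of that idea plus a simple cache; the function names and the cache key are assumptions, not the extension's real internals.

```python
def normalized_share(tag_counts, total_counts):
    """Return tag posts as a percentage of all posts for each period."""
    shares = {}
    for period, total in sorted(total_counts.items()):
        hits = tag_counts.get(period, 0)
        shares[period] = 100.0 * hits / total if total else 0.0
    return shares


# Hypothetical cache keyed by (query, grouping) so repeated charts skip recomputation.
_chart_cache = {}


def cached_series(query, grouping, compute):
    """Call compute(query, grouping) once per key and reuse the result afterwards."""
    key = (query, grouping)
    if key not in _chart_cache:
        _chart_cache[key] = compute(query, grouping)
    return _chart_cache[key]
```

With 50 matching posts out of 1000 in one year and 30 out of 300 in the next, the raw count falls while the normalized share doubles from 5% to 10%, which is the kind of trend the normalized chart is meant to expose.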

💥 Low-RAM Tag Explosions

Find tags that are growing unusually fast between two date windows:

  • Streams Parquet batches with PyArrow instead of loading everything into RAM.
  • Compares a baseline window against a recent/explosion window.
  • Scores tags by normalized growth, recent count, and percentage delta.
  • Supports tag-type filters: artist, copyright, character, general, and meta.
  • Supports seed tags, exclusion lists, only-new tags, exclude-new tags, and artist-dominance filtering.
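
The comparison above can be sketched in plain Python. Each window is consumed in a single pass, so the posts can come from a generator that streams batches from disk. The exact scoring formula used by the extension is not stated here; the growth ratio, the one-post baseline floor, and the function names below are assumptions.

```python
from collections import Counter


def count_window(posts):
    """Count posts per tag in one pass; `posts` may be a generator."""
    counts, total = Counter(), 0
    for tags in posts:
        counts.update(set(tags))  # count each tag once per post
        total += 1
    return counts, total


def tag_growth(baseline_posts, recent_posts, min_recent=2):
    """Rank tags by how much their per-post rate grew between two windows."""
    base_counts, base_total = count_window(baseline_posts)
    recent_counts, recent_total = count_window(recent_posts)
    if not base_total or not recent_total:
        return []
    rows = []
    for tag, recent in recent_counts.items():
        if recent < min_recent:
            continue  # ignore tags too rare to be meaningful
        base = base_counts.get(tag, 0)
        base_rate = base / base_total
        recent_rate = recent / recent_total
        # Assumption: floor the baseline at one post so new tags stay finite.
        growth = recent_rate / max(base_rate, 1 / base_total)
        pct_delta = 100 * (recent_rate - base_rate) / base_rate if base else None
        rows.append({"tag": tag, "recent": recent, "growth": growth,
                     "pct_delta": pct_delta, "is_new": base == 0})
    return sorted(rows, key=lambda r: r["growth"], reverse=True)
```

Because rates are divided by each window's post count, windows of different lengths remain comparable. The `is_new` flag is the sort of field an only-new or exclude-new filter could act on.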

🧭 Interactive Tag Mapper + Bubble Visualizer

Explore tag relationships inside the current search result:

  • Counts tag coverage by category.
  • Maps related tags from a selected tag.
  • Opens an interactive bubble graph of co-occurring tags.
  • Includes category legend, search highlighting, edge-strength controls, zoom controls, context menus, CSV export, and clipboard helpers.
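
Related-tag mapping comes down to counting which tags appear on the same posts as a selected tag. A minimal sketch follows, with the assumption that edge strength is the share of the selected tag's posts that also carry the other tag; `related_tags` is a hypothetical helper, not the extension's API.

```python
from collections import Counter


def related_tags(posts, selected, top=10):
    """Rank tags that co-occur with `selected` as (tag, count, strength) tuples."""
    co_counts = Counter()
    matches = 0
    for tags in posts:
        tag_set = set(tags)
        if selected in tag_set:
            matches += 1
            co_counts.update(tag_set - {selected})
    # Strength is the fraction of `selected` posts that also have the tag.
    return [(tag, count, count / matches)
            for tag, count in co_counts.most_common(top)]
```

A bubble graph could then size each node by `count` and draw an edge only when `strength` exceeds a chosen threshold, which is one plausible reading of the edge-strength control.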

Installation

Option A: Download from Releases

Prebuilt packages are published on the repository's Releases page.

Two release ZIPs are available:

  • statbooru.zip: ready-to-use package with the .exe, extensions, and data included.
  • data.zip: data-only package for users who want to run Statbooru from their own Python environment.

If you use statbooru.zip, extract it and run the executable.

If you use data.zip, extract it and place the included data/ folder next to main.py.

Option B: Run from source

1. Clone the repository

git clone https://github.com/Y1-studio/statbooru.git
cd statbooru

2. Create a virtual environment

python -m venv .venv

Activate it:

# Windows
.venv\Scripts\activate

# Linux / macOS
source .venv/bin/activate

3. Install dependencies

pip install polars pyarrow PyQt6 requests matplotlib

Optional, but recommended before publishing a release, so the installed versions are recorded:

pip freeze > requirements.txt

4. Add the metadata files

Download data.zip from the repository's Releases page; it contains only the metadata files needed for a self-managed Python environment. Extract it so that a data/ folder sits next to main.py:

project-root/
├── data/
│   ├── danbooru2026_clean.parquet
│   ├── tags_dictionary.parquet
│   ├── danbooru_api_clean.parquet          # optional
│   └── tags_dictionary_API.parquet         # optional
├── extensions/
├── main.py
└── README.md

The base upstream project points users to the Danbooru 2026 clean metadata files on Hugging Face. This extended app expects the same style of local Parquet metadata and tag dictionary files.

5. Install extensions

Recommended file names:

extensions/
├── analytics.py
├── tag_compare.py
├── tag_explosion_finder.py
└── interactive_tag_mapper_extension_v9.py

If your files were downloaded with names like analytics(2).py or tag_compare(1).py, rename them before committing.

6. Run the app

python main.py


Usage workflow

  1. Start the app with python main.py.
  2. Enter a query such as 1girl score:>50 rating:g,s order:favcount.
  3. Adjust score/favorite thresholds, rating, orientation, deduplication, and date filters.
  4. Choose Fast in-process for speed or Memory-safe isolated for massive searches.
  5. Run the search.
  6. Use the extensions to analyze results:
    • Open Tag Analytics for distribution summaries.
    • Open Compare Tags Globally to chart tag trends.
    • Open Low-RAM Tag Explosions to discover rising tags.
    • Open Interactive Tag Mapper to inspect related tags and co-occurrence clusters.
  7. Export image URLs to .txt when the dataset looks right.
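
The final export step produces a plain text file with one URL per line, a format most downloaders accept. The sketch below shows that idea on a list of row dictionaries. The `file_url` key matches the column name mentioned in this README, but `export_urls` and the duplicate handling are illustrative assumptions.

```python
from pathlib import Path


def export_urls(rows, out_path, url_key="file_url"):
    """Write one URL per line, skipping empty and duplicate URLs; return the count."""
    seen = set()
    urls = []
    for row in rows:
        url = row.get(url_key)
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    text = "\n".join(urls) + ("\n" if urls else "")
    Path(out_path).write_text(text, encoding="utf-8")
    return len(urls)
```

The order of first appearance is preserved, so a result sorted by score or favorites stays sorted in the exported file.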

Project structure

project-root/
├── main.py
├── README.md
├── extensions/
│   ├── analytics.py
│   ├── tag_compare.py
│   ├── tag_explosion_finder.py
│   └── interactive_tag_mapper_extension_v9.py
└── data/
    ├── danbooru2026_clean.parquet
    ├── tags_dictionary.parquet
    ├── danbooru_api_clean.parquet
    └── tags_dictionary_API.parquet

Extension API notes

Extensions can integrate with the host app through:

def setup(app):
    # create buttons, dialogs, and hooks here
    ...

Useful host methods/signals:

  • app.add_extension_button(button): add a button to the main extension button row.
  • app.get_current_df(): access the current filtered Polars DataFrame.
  • app.get_results_df(): compatibility alias for get_current_df().
  • app.get_unlimited_lazy_df(): access a lazy pipeline when available.
  • app.data_updated: react when the active search result changes.

Notes and limitations

  • This tool works on local metadata only. It does not download full images; to fetch them, pass the exported URLs to an external downloader.
  • Thumbnail preview requires network access to the Danbooru CDN.
  • Very large searches can temporarily use significant RAM. Use Memory-safe isolated mode when working with huge result sets.
  • The app assumes Danbooru-style metadata columns such as id, rating, score, fav_count, file_url, created_at, md5, and tag category columns.
  • Required metadata files are distributed through GitHub Releases as data.zip.
  • The ready-to-use release package is distributed as statbooru.zip and includes the executable plus data.

Credits

Based on ThetaCursed/Danbooru-Dataset-Filter, a fast Polars/PyQt6 GUI for curating Danbooru image datasets.

This extended version adds plugin-based analytics, richer query parsing, memory-focused processing, and tag research tools on top of the original concept.

About

Statbooru is an analytics tool built on Danbooru metadata. It helps artists and researchers discover stable, recognizable characters to draw, identify subjects that may attract attention, compare character popularity, analyze popular formats and tag patterns, and track how popularity changes over time.
