drajb/sonic-phoenix

Sonic Phoenix

A reproducible, resumable, language-agnostic pipeline for turning a chaotic folder of audio files into a clean Language/Artist/Album/Artist - Title.ext library with correct ID3 tags, embedded cover art, and synchronised lyrics — then optionally pushing the result into Spotify playlists and exposing it to AI agents for on-demand playlist curation.

The pipeline is broken into six numbered phases (01-06) plus a Phase 7 AI skill. Each phase is a series of small, independent scripts that write their state to disk so you can stop, inspect, edit, and resume at any point without redoing work.


Table of contents

  1. How it works at a glance
  2. What this does
  3. Supported formats
  4. Requirements
  5. Install
  6. Configure
  7. Optional: language hint files
  8. Running the pipeline
  9. Phase-by-phase reference
  10. Utility scripts
  11. ClawHub AI skill (Phase 7)
  12. Data files produced
  13. Troubleshooting
  14. Design notes
  15. License

How it works at a glance

Raw audio files (any state of disorganisation)
  │
  ▼
Phase 1 — Acoustic fingerprint every file via Shazam (20-way concurrent, resumable)
  │
  ▼
Phase 2 — SHA-256 deduplication + language classification via langdetect + hint files
  │
  ▼
Phase 3 — Fuzzy artist consolidation + structural enforcement into Language/Artist/Album/
  │
  ▼
Phase 4 — Metadata enrichment: iTunes tags, LrcLib synchronised lyrics, HD cover art
  │
  ▼
Phase 5 — Catalog finalisation: merge all data sources into a single read-only JSON
  │
  ▼
Phase 6 — (Optional) Spotify sync: mirror playlists + auto-generate genre "Essentials"
  │
  ▼
Phase 7 — (Optional) AI skill: on-demand playlist curation via any OpenClaw-compatible agent

What this does

Given a pile of audio files scattered across an arbitrary folder tree — with broken, missing, or misleading metadata — the pipeline will:

  1. Identify every track by acoustic fingerprint via ShazamIO, cross-validating against existing tags rather than blindly trusting them.
  2. Deduplicate bit-for-bit via SHA-256 (never deletes — stages duplicates for your review).
  3. Classify each track's language using langdetect plus optional per-language hint files you control.
  4. Organise everything into <SORTED_ROOT>/<Language>/<Artist>/<Album>/<Artist> - <Title>.<ext>.
  5. Enrich the library by fetching canonical metadata from the iTunes Search API, synchronised lyrics from LrcLib, and embedding 1000x1000 cover art.
  6. Sync the finalised local library up to Spotify as either a full mirror playlist or a curated set of per-genre "Essentials" playlists cross-referenced against your actual listening history.
  7. Expose the structured catalog to AI coding agents via a ClawHub skill for natural-language playlist curation on demand.

No part of the pipeline assumes any particular language. Drop a JSON file per language you care about into config/language_hints/ and the pipeline routes into those buckets automatically.
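As a rough illustration of the target layout in step 4, the destination path could be assembled as below. This is a minimal sketch, not the project's code: `safe` and `target_path` are hypothetical helper names, and the character-replacement rule is an assumption.

```python
import re
from pathlib import Path

# Characters that are invalid in Windows file names; replaced with "_".
# (Assumption: the real pipeline may sanitise names differently.)
_UNSAFE = re.compile(r'[<>:"/\\|?*]')


def safe(part: str) -> str:
    """Make one path component filesystem-safe (hypothetical helper)."""
    cleaned = _UNSAFE.sub("_", part).strip(" .")
    return cleaned or "Unknown"


def target_path(sorted_root, language, artist, album, title, ext):
    """Build <SORTED_ROOT>/<Language>/<Artist>/<Album>/<Artist> - <Title>.<ext>."""
    ext = ext if ext.startswith(".") else f".{ext}"
    filename = f"{safe(artist)} - {safe(title)}{ext.lower()}"
    return Path(sorted_root) / safe(language) / safe(artist) / safe(album) / filename
```

For example, `target_path("/music/Sorted", "English", "AC/DC", "Back in Black", "Hells Bells", "MP3")` keeps the slash in the artist name from creating an extra directory level.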


Supported formats

| Format | Extension | Shazam ID | ID3 enrichment | Notes |
| --- | --- | --- | --- | --- |
| MP3 | .mp3 | Direct | Full | Primary target format. No FFmpeg needed. |
| FLAC | .flac | Via FFmpeg | Vorbis tags | Requires FFmpeg on PATH or in <MUSIC_ROOT>/ffmpeg/. |
| AAC/M4A | .m4a, .aac | Via FFmpeg | MP4 tags | Requires FFmpeg. |
| WAV | .wav | Via FFmpeg | Minimal | Lossless but no native tag support. |
| OGG Vorbis | .ogg | Via FFmpeg | Vorbis tags | Requires FFmpeg. |
| WMA | .wma | Via FFmpeg | ASF tags | Legacy format. Requires FFmpeg. |
| Opus | .opus | Via FFmpeg | Vorbis tags | Requires FFmpeg. |

Requirements

  • Python 3.12. Python 3.13+ will not work today: the shazamio-core wheel does not yet publish binaries for 3.13, and the source build needs Rust. Python 3.10 and 3.11 will work if you downgrade langdetect, but 3.12 is the supported target.
  • ~2 GB of free disk for the Shazam cache, the metadata catalog, cover art thumbnails, and Duplicates_Staging.
  • (Optional) FFmpeg on PATH. Shazam only needs it to decode non-MP3 containers (FLAC/M4A/OPUS). If you don't have it globally, drop a portable build at <MUSIC_ROOT>/ffmpeg/ and the pipeline will pick it up automatically.
  • (Optional) A Spotify developer app if you want Phase 6. See Phase 6: Spotify sync below.

Install

1. Clone the repo into your music folder

The zero-config layout the defaults expect is <MUSIC_ROOT>/sonic-phoenix/. You don't have to follow it — every path is overridable via environment variables — but it's the shortest path to running.

cd /path/to/your/music      # whatever you want MUSIC_ROOT to be
git clone https://github.com/drajb/sonic-phoenix.git
cd sonic-phoenix

2. Create a Python 3.12 virtual environment

# macOS / Linux
python3.12 -m venv .venv
source .venv/bin/activate

# Windows (PowerShell)
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1

# Windows (cmd)
py -3.12 -m venv .venv
.venv\Scripts\activate.bat

Verify:

python --version     # should print Python 3.12.x

3. Install Python dependencies

pip install --upgrade pip
pip install -r requirements.txt

The heavy dependency is shazamio-core (a Rust binary wheel). If pip tries to build it from source you are not on Python 3.12 — stop and fix the interpreter.


Configure

All configuration lives in environment variables. The only one you are required to set is MUSIC_ROOT.

1. Copy the env template

# macOS / Linux
cp .env.example .env

# Windows (PowerShell)
Copy-Item .env.example .env

2. Edit .env

Minimum:

MUSIC_ROOT=/absolute/path/to/your/music

On Windows you can use forward OR back slashes: MUSIC_ROOT=C:/Users/you/Music and MUSIC_ROOT=C:\Users\you\Music both work.

Everything else is optional. See .env.example for the full list with inline documentation.

3. Verify the config

python config.py

Prints a summary of every resolved path and tells you whether Spotify credentials were picked up. If MUSIC_ROOT points somewhere that doesn't exist, the individual scripts will fail loudly via config.require_music_root() — not silently.

4. .env is gitignored

The repo ships with a .gitignore that excludes .env, .venv/, .data/, Sorted/, Duplicates_Staging/, and the Spotify token cache. You cannot accidentally commit credentials or your music.

Environment variables reference

| Variable | Required | Default | Purpose |
| --- | --- | --- | --- |
| MUSIC_ROOT | Yes | | Absolute path to the root of your music collection |
| SORTED_ROOT | No | <MUSIC_ROOT>/Sorted | Where the organised library lives |
| DATA_DIR | No | <MUSIC_ROOT>/.data | Where pipeline state files are written |
| DUPLICATES_STAGING | No | <MUSIC_ROOT>/sonic-phoenix/Duplicates_Staging | Where bit-for-bit duplicates are staged |
| UNIDENTIFIED_DIR | No | <SORTED_ROOT>/Unidentified | Where tracks that can't be classified land |
| FFMPEG_BIN | No | <MUSIC_ROOT>/ffmpeg/bin | Path to FFmpeg binary (only for non-MP3 formats) |
| SHAZAM_CONCURRENCY | No | 20 | Parallel Shazam lookups. Lower if rate-limited. |
| ITUNES_COUNTRIES | No | US,GB | iTunes country codes to rotate through for enrichment |
| SPOTIFY_CLIENT_ID | Phase 6 only | | Spotify developer app Client ID |
| SPOTIFY_CLIENT_SECRET | Phase 6 only | | Spotify developer app Client Secret |
| SPOTIFY_REDIRECT_URI | No | http://127.0.0.1:8888/callback | Spotify OAuth redirect URI |
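The defaults in the table chain off one another (SORTED_ROOT derives from MUSIC_ROOT, UNIDENTIFIED_DIR from SORTED_ROOT). A minimal sketch of that resolution order follows. `resolve_paths` is a hypothetical name and not the actual contents of config.py, and only a subset of the variables is shown.

```python
import os
from pathlib import Path


def resolve_paths(env=None):
    """Resolve a subset of the documented variables, applying the table's defaults."""
    env = os.environ if env is None else env
    raw_root = env.get("MUSIC_ROOT")
    if not raw_root:
        # MUSIC_ROOT is the only required variable, so fail loudly.
        raise RuntimeError("MUSIC_ROOT is not set")
    music_root = Path(raw_root)
    sorted_root = Path(env.get("SORTED_ROOT") or music_root / "Sorted")
    countries = env.get("ITUNES_COUNTRIES") or "US,GB"
    return {
        "MUSIC_ROOT": music_root,
        "SORTED_ROOT": sorted_root,
        "DATA_DIR": Path(env.get("DATA_DIR") or music_root / ".data"),
        "UNIDENTIFIED_DIR": Path(env.get("UNIDENTIFIED_DIR") or sorted_root / "Unidentified"),
        "SHAZAM_CONCURRENCY": int(env.get("SHAZAM_CONCURRENCY") or 20),
        "ITUNES_COUNTRIES": [c.strip() for c in countries.split(",") if c.strip()],
    }
```

Passing a plain dict instead of reading `os.environ` directly makes the resolution easy to check without touching the real environment.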

Optional: language hint files

The pipeline's language classifier uses langdetect by default, which does fine on obvious cases (English titles -> English, Spanish titles -> Spanish) but has a well-known failure mode: Latin-script transliterations of non-Latin-script languages (Hindi/Urdu/Punjabi written with English letters) get classified as English.

The fix is an explicit per-language hint file. Every file at config/language_hints/<Language>.json is loaded automatically. The filename (without .json) is the target folder name under SORTED_ROOT.

This is how you add a new language. It's the single user-facing extension point.
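The filename-to-bucket rule above can be sketched in a few lines. This is an illustration under assumptions, not the project's loader: `load_language_hints` is a hypothetical name, and skipping `merge_groups`, `genres`, and `artist_map` as non-language files is a guess based on how those files are described elsewhere in this README.

```python
import json
from pathlib import Path

# Assumption: these JSON files share the directory but are not language buckets.
NON_LANGUAGE_FILES = {"merge_groups", "genres", "artist_map"}


def load_language_hints(hints_dir):
    """Map each <Language>.json file to its parsed content, keyed by the filename stem."""
    hints = {}
    # glob("*.json") is non-recursive, so templates under examples/ stay inactive.
    for path in sorted(Path(hints_dir).glob("*.json")):
        if path.stem in NON_LANGUAGE_FILES:
            continue
        with path.open(encoding="utf-8") as fh:
            hints[path.stem] = json.load(fh)
    return hints
```

The keys of the returned dict double as the folder names that would appear under SORTED_ROOT.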

Starter templates

Example files ship under config/language_hints/examples/:

  • English.json — template for any Latin-script language
  • Hindi.json — the transliterated-Hindi use case the system was built for
  • Spanish.json — a second Latin-script example to show the pattern
  • merge_groups.json — optional, consumed by 03F --merge-languages
  • genres.json — optional, consumed by 05C_confidence_auditor

To activate any of them, copy them one directory up (out of examples/):

cp config/language_hints/examples/English.json config/language_hints/English.json
cp config/language_hints/examples/Hindi.json   config/language_hints/Hindi.json

Then edit your copies. The full field reference lives in config/language_hints/examples/README.md.

Language-agnostic by construction. There is nothing Hindi- or English-specific in the Python code. If you only care about French and Japanese, ship only French.json and Japanese.json — the pipeline will produce Sorted/French/ and Sorted/Japanese/ folders and nothing else.


Running the pipeline

The scripts are designed to be run in order. Each one is an independent Python file that imports config and picks up its inputs from disk, so you can absolutely stop after any phase, inspect the intermediate state under <MUSIC_ROOT>/.data/, fix anything by hand, and resume.

The happy path (minimal run)

For a first-time run against a fresh pile of audio files, the minimum sequence that gets you from chaos to a clean sorted library is:

# Phase 1 — identify everything via acoustic fingerprint
python 01A_extract_metadata.py
python 01D_shazam_all_files.py       # long-running; resumable

# Phase 2 — classify by language and build the hash catalog
python 02A_catalog_music.py
python 02D_organize_music.py          # physically moves files into Sorted/<Lang>/<Artist>/

# Phase 3 — audit and re-sort
python 03A_consolidate_by_artist.py
python 03D_titanium_resort.py         # requires config/language_hints/*.json

# Phase 4 — enrich with tags, lyrics, cover art
python 04I_polish_and_enrich_v6.py

# Phase 5 — finalise the master catalog
python 05I_finalize_catalog.py

That's it. You now have a clean library under <SORTED_ROOT>/ and a read-only catalog at <DATA_DIR>/final_catalog.json.

Optional extras

  • If you want to purge junk or residue files before enrichment: 05D_force_delete_residue.py, 05F_final_scrub.py. Both are marked DESTRUCTIVE UTILITY, so read their docstrings first.
  • If you want empty-folder cleanup mid-pipeline: 05E_final_cleanup.py, 05H_final_vacuum.py.
  • If you want to push everything to Spotify: see Phase 6: Spotify sync below.
  • If you want deep art fetch (1000x1000 HD): 04F_deep_art_sync.py.
  • If you want AI-driven playlist curation: see Phase 7: ClawHub AI skill below.

The historical / kitchen-sink path

Phases 1-5 each have multiple script versions that record the evolution of the project. The 01A-01E, 04A-04I, and 05A-05I scripts are a chronological record: running a series in letter order reproduces the history that led to 04I (the canonical enrichment script). Reading them in order is by far the fastest way to understand why the project does what it does, but you do not need to run every one. The "happy path" above is the canonical sequence.

Every script has a docstring at the top marked with one of these statuses:

| Status | Meaning |
| --- | --- |
| CANONICAL | The recommended version. Run this. |
| HISTORICAL | An earlier iteration kept for reference. Safe to skip. |
| UTILITY | Standalone tool, not part of the main flow. |
| DESTRUCTIVE UTILITY | Removes files. Read the docstring before running. |
| LIBRARY | Imported by other scripts. Not directly runnable. |

Phase-by-phase reference

Phase 1 — Discovery & identification

| Script | Status | What it does |
| --- | --- | --- |
| 01A_extract_metadata.py | HISTORICAL | Pulls ID3 tags via mutagen. Useful as a quick sanity scan of what your files claim to contain, but existing tags are often unreliable. |
| 01B_shazam_identify.py | HISTORICAL | Smoke test for shazamio: identifies a single file. |
| 01C_shazam_by_hash.py | HISTORICAL | Hash-keyed Shazam cache. Predecessor to 01D. |
| 01D_shazam_all_files.py | CANONICAL | 20-way concurrent Shazam over the whole library. Resumable. Writes to .data/shazam_final_results.json. This is the script you actually run. |
| 01E_test_matching.py | UTILITY | Diagnostics for fuzzy string matching. Handy when debugging why an artist didn't match. |
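The "20-way concurrent, resumable" behaviour described for 01D can be approximated with a semaphore plus a dict of already-finished work. The sketch below is an assumption about the general shape, not the script's code: `identify_all` is a hypothetical name, and the `identify` coroutine is injected by the caller so that no particular shazamio call is implied.

```python
import asyncio


async def identify_all(paths, identify, done, concurrency=20, save=None):
    """Run identify(path) for every path not already in `done`, at most `concurrency` at a time.

    `done` maps path -> result from earlier runs; `save`, if given, is called with
    `done` after each new result so progress survives an interruption.
    """
    gate = asyncio.Semaphore(concurrency)

    async def worker(path):
        async with gate:  # limits the number of in-flight lookups
            done[path] = await identify(path)
        if save is not None:
            save(done)

    pending = [p for p in paths if p not in done]  # resume: skip finished work
    await asyncio.gather(*(worker(p) for p in pending))
    return done
```

Lowering `concurrency` plays the same role as lowering SHAZAM_CONCURRENCY when lookups are being rate-limited.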

Phase 2 — Consolidation & initial sort

| Script | Status | What it does |
| --- | --- | --- |
| 02A_catalog_music.py | CANONICAL | Builds catalog.json with one entry per file: {hash, tags, language, source}. Uses langdetect for language classification. |
| 02B_analyze_catalog.py | UTILITY | Prints human-readable stats over the catalog (tracks per language, duplicates, etc). Read-only. |
| 02C_organize_files.py | HISTORICAL | Early prototype mover. Superseded by 02D. |
| 02D_organize_music.py | CANONICAL | Physically moves every file into <SORTED_ROOT>/<Lang>/<Artist>/. Stages duplicates into Duplicates_Staging/ instead of deleting. |
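Bit-for-bit deduplication with staging instead of deletion can be sketched as follows. The function names are hypothetical and the collision-renaming scheme is an assumption; the project's scripts may differ in both.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large audio files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_duplicates(files, staging_dir):
    """Keep the first file per hash; move later identical files into staging_dir."""
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    seen, staged = {}, []
    for path in map(Path, files):
        key = sha256_of(path)
        if key not in seen:
            seen[key] = path
            continue
        dest = staging / path.name
        counter = 1
        while dest.exists():  # never overwrite something already staged
            dest = staging / f"{path.stem}.{counter}{path.suffix}"
            counter += 1
        shutil.move(str(path), str(dest))
        staged.append(dest)
    return seen, staged
```

Because duplicates are moved rather than removed, anything staged by mistake can simply be moved back.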

Phase 3 — Hierarchical audit & re-sort

| Script | Status | What it does |
| --- | --- | --- |
| 03A_consolidate_by_artist.py | CANONICAL | Merges feature-credited artist folders into their canonical parent (e.g. "Akon feat Eminem" -> "Akon"). Reads overrides from config/language_hints/artist_map.json. |
| 03B_master_audit_sort.py | CANONICAL | Hint-driven audit pass. Flags orphans, empty folders, and residue files. Requires config/language_hints/*.json. |
| 03C_high_confidence_resort.py | HISTORICAL | Re-evaluates low-confidence classifications against the hash catalog. |
| 03D_titanium_resort.py | CANONICAL | Final structural enforcement. Uses hint files' artists, dna, keywords, and lang_codes to hard-route every remaining ambiguous artist to a language. |
| 03E_scan_remnants.py | UTILITY | Sweeps MUSIC_ROOT for unprocessed leftovers not under Sorted/. |
| 03F_reorganize_binary.py | UTILITY | Resolves byte-level hash mismatches. With --merge-languages it unifies adjacent language buckets per merge_groups.json. |
| 03G_diagnose_shankar.py | HISTORICAL | Targeted debugger for feature-credit parsing edge cases. Named after the Bollywood trio Shankar-Ehsaan-Loy, which was the original test case. |
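The feature-credit merge that 03A performs ("Akon feat Eminem" -> "Akon") might look roughly like this. It is a sketch only: the regular expression, the `canonical_artist` name, and the assumption that artist_map.json is a flat `{"variant": "canonical"}` mapping are all illustrative guesses.

```python
import re

# Matches a trailing feature credit such as " feat X", " (ft. X)" or " featuring X".
_FEAT = re.compile(r"\s*[\(\[]?\s*\b(?:feat|ft|featuring)\b\.?\s+.*$", re.IGNORECASE)


def canonical_artist(name, overrides=None):
    """Return the parent artist for a feature-credited name.

    `overrides` stands in for an artist_map.json-style mapping (format assumed).
    """
    overrides = overrides or {}
    if name in overrides:
        return overrides[name]
    stripped = _FEAT.sub("", name).strip()
    return stripped or name  # never return an empty artist
```

Hyphenated group names such as Shankar-Ehsaan-Loy pass through unchanged because nothing in them matches a feature-credit keyword.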

Phase 4 — Enrichment (tags, lyrics, cover art)

| Script | Status | What it does |
| --- | --- | --- |
| 04A_enrich_library.py | HISTORICAL | First iteration of the enricher. |
| 04B_enrich_library_v2.py | HISTORICAL | v2 with structured API error handling. |
| 04C_polish_library.py | HISTORICAL | Tightens enrichment bounds and artwork dimension minimums. |
| 04D_fetch_lyrics.py | HISTORICAL | Standalone Lyrics.ovh wrapper. Superseded by LrcLib in 04I. |
| 04E_art_decorator.py | HISTORICAL | Standalone ID3 APIC image embedder. |
| 04F_deep_art_sync.py | UTILITY | Deep art rescue: forces 1000x1000 fetch when the normal pass failed. Run this if your library has sporadic missing cover art after 04I. |
| 04G_polish_and_enrich.py | HISTORICAL | Master enricher, v3. |
| 04H_polish_and_enrich_v5.py | HISTORICAL | v5 with strict subset matcher. |
| 04I_polish_and_enrich_v6.py | CANONICAL / CROWN JEWEL | The one you run. iTunes country rotation, HTTP 429/403 backoff, synchronised lyrics via LrcLib, optional Pillow-based APIC embedding, strict subset matcher. |
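Country rotation combined with exponential backoff, as listed for 04I, can be sketched generically. Everything named here is hypothetical: `RateLimited` stands in for however the caller signals an HTTP 429/403, and `fetch` is injected so that no specific iTunes client call is implied.

```python
import time


class RateLimited(Exception):
    """Raised by the injected fetch function on HTTP 429/403 (hypothetical)."""


def fetch_with_rotation(fetch, query, countries=("US", "GB"),
                        max_rounds=4, base_delay=1.0, sleep=time.sleep):
    """Try each country in turn; after a full failed round, back off and retry."""
    for attempt in range(max_rounds):
        for country in countries:
            try:
                return fetch(query, country)
            except RateLimited:
                continue  # rotate to the next storefront
        if attempt < max_rounds - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return None  # caller records the track as not enriched
```

With more entries in `countries`, each round makes more attempts before any sleep happens, which is why a longer ITUNES_COUNTRIES list tends to reduce waiting.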

Phase 5 — Cleaning & finalisation

| Script | Status | What it does |
| --- | --- | --- |
| 05A_repair_json.py | UTILITY | Fixes trailing-comma / truncation corruption in the JSON data stores. Run if a prior script was killed mid-write. |
| 05B_sanitize_results_json.py | UTILITY | Strips junk tags before the final migration. |
| 05C_confidence_auditor.py | UTILITY | Confidence-scored audit over the classified library. Uses config/language_hints/genres.json to rank suspicious classifications. Read-only. |
| 05D_force_delete_residue.py | DESTRUCTIVE UTILITY | Hard-purges residue files matching a denoise pattern. Read the docstring first. |
| 05E_final_cleanup.py | UTILITY | Bottom-up empty-folder vacuum. Never touches the root or per-language tops. |
| 05F_final_scrub.py | DESTRUCTIVE UTILITY | Nuclear scrub of garbage extensions and fragment files. Skips the Sorted/ subtree entirely. |
| 05G_final_migration.py | HISTORICAL | Final migration engine that writes ID3 tags before moving. Superseded by 04I's in-place enrichment. |
| 05H_final_vacuum.py | UTILITY | Zero-remnant vacuum for a specific language bucket (defaults to "Rescued"). |
| 05I_finalize_catalog.py | CANONICAL | Writes .data/final_catalog.json, the read-only master catalog. Three-tier classification (ID3 -> Shazam -> filename fallback). Run this as the last step of the main pipeline. |
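The three-tier order named for 05I (ID3, then Shazam, then filename) amounts to "first source with a usable artist and title wins". A minimal sketch under that reading follows; the `classify` name, the dict shapes, and the "Unknown" placeholder are assumptions.

```python
from pathlib import Path


def classify(path, id3=None, shazam=None):
    """Pick artist/title from ID3 tags, else the Shazam result, else the filename."""
    for source, tags in (("id3", id3), ("shazam", shazam)):
        if tags and tags.get("artist") and tags.get("title"):
            return {"artist": tags["artist"], "title": tags["title"], "source": source}
    # Filename fallback: expects the library's "<Artist> - <Title>.<ext>" convention.
    stem = Path(path).stem
    artist, sep, title = stem.partition(" - ")
    if sep:
        return {"artist": artist.strip(), "title": title.strip(), "source": "filename"}
    return {"artist": "Unknown", "title": stem, "source": "filename"}
```

Recording the `source` alongside each entry makes it easy to audit later how many tracks fell through to the filename tier.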

Phase 6 — Spotify sync

Entirely optional. Skip the whole phase if you just want a clean local library.

| Script | Status | What it does |
| --- | --- | --- |
| 06B_spotify_setup.py | CANONICAL | First Phase 6 script: completes the Spotify OAuth handshake and caches the token. Run once. |
| 06C_spotify_backup.py | CANONICAL | Snapshots every playlist you own into .data/spotify_backups/. Run before 06D/06E so you have a rollback path. |
| 06D_spotify_sync_engine.py | CANONICAL | Mirrors <SORTED_ROOT>/<Lang>/ into a "Local Library -- Lang" playlist per language. Resumable. |
| 06E_spotify_discovery_sync.py | CANONICAL / CROWN JEWEL | Cross-references your local artists with your actual Spotify listening history and auto-generates per-genre "Essentials -- Lang -- Genre" playlists for the intersection. |

Setting up Spotify

  1. Go to the Spotify Developer Dashboard and create an app.

  2. In the app settings, add this exact Redirect URI:

    http://127.0.0.1:8888/callback
    
  3. Copy the Client ID and Client Secret into your .env:

    SPOTIFY_CLIENT_ID=your_client_id
    SPOTIFY_CLIENT_SECRET=your_client_secret
  4. Run python 06B_spotify_setup.py. Your browser opens, you approve the scopes, the script confirms and caches the token at .data/.spotify_token_cache.

  5. From then on every other 06* script will pick up the cached token silently.

If 06D or 06E return a 403 on playlist creation, your app is in Spotify's "Development Mode" and needs the user explicitly added under "Users and Access" in the Dashboard. The scripts print the exact fix message.


Utility scripts

Tools that live at the repo root, not part of a numbered phase.

| Script | Status | Purpose |
| --- | --- | --- |
| absolute_zero_sort.py | UTILITY | Final-pass classifier for Sorted/Unidentified/Audit_Needed/. Hint-driven. Deletes the Unidentified tree after moving what it can. |
| common_sense_sort.py | UTILITY | Non-destructive sibling of absolute_zero_sort. Moves what it can but leaves Unidentified/ intact so you can inspect what's left. |
| total_scrub.py | DESTRUCTIVE UTILITY | Deletes every top-level folder under MUSIC_ROOT that is not in the protected set. Only run when you are 100% done and only want the finalised library to remain. |
| format_and_rename_project.py | HISTORICAL | One-shot bootstrap that renamed the original scripts to their phase-prefixed form. Running it today just prints the phase table. |
| spotify_auth.py | LIBRARY | Shared Spotipy OAuth helper imported by every 06* script. Not runnable on its own. |
| config.py | LIBRARY | Single source of truth for every path, tuning knob, and environment variable. Import only. Running python config.py prints a config summary. |

ClawHub AI skill (Phase 7)

Sonic Phoenix is also published as an AI agent skill on ClawHub under the name ultimate-music-manager. The skill teaches an AI coding assistant (Claude Code, Codex, Copilot, or any OpenClaw-compatible agent) how to operate the full pipeline on your behalf and curate playlists from your catalog using natural language.

Phase 7 is what turns the pipeline's output from a static library into a living, queryable music system. Once the catalog is built (Phases 1-5), the AI skill lets you say things like "build me a 90s Bollywood nostalgia playlist" or "create a road trip mix, heavy on rock" and the agent has the structured metadata — artist, title, album, language, genre — to execute it.

What's in the skill

The skill lives at ultimate-music-manager/ in this repo and includes:

| File | Purpose |
| --- | --- |
| SKILL.md | Main agent instruction document: setup, config, phase-by-phase reference, troubleshooting |
| _meta.json | ClawHub registry metadata (slug + version) |
| scripts/preflight.sh | 7-point environment validator (Python 3.12, venv, deps, .env, MUSIC_ROOT, FFmpeg) |
| scripts/run-pipeline.sh | Single-command pipeline runner with --skip-shazam, --spotify, --dry-run flags |
| scripts/status.sh | Dashboard: file counts, language breakdown, data file sizes, pipeline progress |
| hooks/safety-guard.sh | PreToolUse hook that intercepts destructive scripts and requires confirmation |
| hooks/HOOK.md | Hook metadata (OpenClaw format) |
| references/data-files.md | Schema and lineage for every JSON artifact the pipeline produces |
| references/language-hints-guide.md | Full guide to creating language hint files with examples |

Installing from ClawHub

npx clawhub@latest install ultimate-music-manager

Using the helper scripts directly

You don't need ClawHub to use the scripts — they work standalone from the repo:

# Check your environment is ready
bash ultimate-music-manager/scripts/preflight.sh

# Preview what the pipeline will do
bash ultimate-music-manager/scripts/run-pipeline.sh --dry-run

# Run the full pipeline
bash ultimate-music-manager/scripts/run-pipeline.sh

# Run including Spotify sync
bash ultimate-music-manager/scripts/run-pipeline.sh --spotify

# Check status at any time
bash ultimate-music-manager/scripts/status.sh

Enabling the safety hook (Claude Code)

Add to .claude/settings.json:

{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Bash",
      "hooks": [{
        "type": "command",
        "command": "./ultimate-music-manager/hooks/safety-guard.sh"
      }]
    }]
  }
}

This intercepts attempts to run destructive scripts (05D, 05F, total_scrub, absolute_zero_sort) and injects a confirmation warning. Zero overhead on all other commands.


Data files produced

All working state lives under <DATA_DIR> (default <MUSIC_ROOT>/.data/). Every file here is regenerable — safe to delete if you want to start over.

| File | Written by | Purpose |
| --- | --- | --- |
| metadata_catalog.json | 01A | ID3 tag dump per file (pre-Shazam) |
| shazam_final_results.json | 01D | Acoustic identification for every file |
| shazam_hash_results.json | 01C | Hash-keyed Shazam cache |
| catalog.json | 02A | Master SHA-256 catalog with language classification |
| enrichment_report.json | 04I | Per-file enrichment status (art + lyrics booleans) |
| mismatch_report.json | 04I | Files where iTunes returned a mismatched track |
| final_catalog.json | 05I | Read-only master catalog, the single source of truth for Phase 6 and Phase 7 |
| confidence_report.json | 05C | Audit confidence scores |
| .spotify_token_cache | 06B | Cached OAuth token (gitignored) |
| spotify_sync_state.json | 06D | Resumable per-file sync state |
| discovery_sync_state.json | 06E | Resumable per-artist discovery state |
| spotify_backups/ | 06C | Rollback snapshot of every Spotify playlist |
| spotify_sync.log | 06D | Full sync log |
| discovery_sync.log | 06E | Discovery sync log |

Data flow

01A -> metadata_catalog.json
01D -> shazam_final_results.json, shazam_hash_results.json
02A -> catalog.json (merges metadata + shazam + hashes)
04I -> enrichment_report.json, mismatch_report.json (reads catalog, writes enriched ID3 tags to files)
05I -> final_catalog.json (merges all sources into canonical read-only output)
06D -> spotify_sync_state.json (reads final_catalog)
06E -> discovery_sync_state.json (reads final_catalog + Spotify listening history)

Troubleshooting

shazamio-core fails to install / "no matching wheel". You are not on Python 3.12. python --version inside your activated venv should say 3.12.x. Recreate the venv with python3.12 -m venv .venv (macOS / Linux) or py -3.12 -m venv .venv (Windows).

Shazam returns HTTP 429. Drop SHAZAM_CONCURRENCY in your .env to 5 or 10 and rerun 01D. The script is resumable — it skips anything already identified.

iTunes returns HTTP 403. Add more countries to rotate through in your .env, for example ITUNES_COUNTRIES=US,GB,AU,CA. 04I already retries with backoff.

[config] MUSIC_ROOT does not exist. Your .env is not being read, or MUSIC_ROOT is set to a path that doesn't exist. python config.py will tell you exactly which path it tried.

Every artist ends up under "English". You need config/language_hints/*.json files for the languages you care about. Out of the box, langdetect only sees the characters in the text, so Latin-script transliterations (transliterated Hindi, romanised Japanese, and similar cases) are classified as English unless a hint file says otherwise. Copy the templates from config/language_hints/examples/ and edit them.

Spotify 403 when creating a playlist. Your developer app is in Development Mode. Go to the Spotify Dashboard -> your app -> Users and Access -> add your Spotify account email. Then rerun.

Files are "missing" after a run. Nothing is ever deleted by the canonical pipeline. Check Duplicates_Staging/ first — that is where duplicates go to wait for your review.

JSON data files are corrupted (trailing commas, truncated). A prior script was killed mid-write. Run python 05A_repair_json.py to sanitize the data stores, then resume the pipeline from wherever you left off.

Enrichment seems stuck or slow. 04I respects iTunes rate limits with exponential backoff. If it's cycling through 429 retries, add more country codes to ITUNES_COUNTRIES in your .env to spread the load. The script is resumable — safe to kill and restart.


Design notes

  • Resumable by construction. Every long-running script writes state to disk after each unit of work and picks up where it left off on the next run. You can kill 01D, 04I, 06D, 06E at any point with Ctrl-C without losing progress.
  • Nothing is deleted without permission. The canonical pipeline (01-04, 05I, 06) only moves files. Deletions are gated behind explicit 05D, 05F, absolute_zero_sort, total_scrub which are each marked DESTRUCTIVE UTILITY in their docstrings.
  • Cross-validation over blind trust. Existing ID3 tags are not thrown away — they are cross-validated against Shazam's acoustic fingerprint. Where Shazam confirms the tags, they stay. Where it disagrees, the acoustic result wins. Where Shazam can't identify a track, the tags are sanitised and used as a fallback.
  • Single source of truth for config. Everything that's user-tunable is in config.py, which reads from environment variables (optionally via .env). No magic constants hidden in individual scripts.
  • No credentials in code. Spotify keys come from env vars only. A grep for client IDs across the repo returns zero hits — the only place they can possibly live is your local .env which is gitignored.
  • Language-agnostic. Zero hardcoded language names anywhere in the Python. All language knowledge is loaded at runtime from config/language_hints/*.json.
  • Script status is documented inline. Every .py file's top-of-file docstring begins with a Status: line (CANONICAL / HISTORICAL / UTILITY / DESTRUCTIVE UTILITY / LIBRARY) and a Run if: line. You do not need to read the code to know whether to run a script.
  • Historical scripts are preserved. The 04A-04I evolution is kept in the repo as a chronological record of every edge case the enrichment pipeline encountered. Reading them in order is the fastest way to understand the problem space.
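One common way to implement "write state to disk after each unit of work" without risking a half-written file is to write to a temporary file and rename it into place. The sketch below shows that pattern as an illustration of the resumable design; it is not a description of what the scripts actually do, and `save_state` / `load_state` are hypothetical names.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(path, state):
    """Write JSON to a temp file in the same directory, then atomically replace the target."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh, ensure_ascii=False, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # readers see either the old or the new file, never a partial one
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_state(path):
    """Return previously saved state, or an empty dict on a first run."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}
```

A script following this pattern would call `load_state` at startup, skip every key already present, and call `save_state` after each completed item.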

License

MIT License

Copyright (c) 2026 Rohit Burani

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
