Releases: atharva-again/AudioRAG
v0.15.0 - Local-Only Audio Processing
Breaking
- Removed YouTube/URL support: AudioRAG now processes only local audio files. Removed all YouTube, yt-dlp, and HTTP URL source providers.
  - Removed `YouTubeSource`, `URLSource`, `AudioSourceRouter`
  - Removed `youtube` optional dependency group
  - Removed `ydl_utils.py` and `yt-dlp` integration
  - Removed `source_path` in favor of local paths only
  - `discover_sources()` now only expands directories to audio files
- Config changes: Removed YouTube-related config options (`youtube_format`, `youtube_cookies_from_browser`, `youtube_cookie_file`, etc.)
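As a rough illustration of what expanding a directory to audio files can look like, the sketch below walks a path with `pathlib`. The function name `expand_to_audio_files` and the extension set are assumptions made for this example, not AudioRAG's actual implementation.

```python
from pathlib import Path

# Assumed extension set for illustration; the real list may differ.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def expand_to_audio_files(path: str) -> list[Path]:
    """Return the audio files behind a file or directory path (hypothetical helper)."""
    p = Path(path).expanduser()
    if p.is_file():
        # A single file is kept only if it looks like audio.
        return [p] if p.suffix.lower() in AUDIO_EXTENSIONS else []
    if p.is_dir():
        # Recurse into subdirectories and return a stable, sorted order.
        return sorted(
            f for f in p.rglob("*")
            if f.is_file() and f.suffix.lower() in AUDIO_EXTENSIONS
        )
    return []  # path does not exist
```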
Changed
- Simplified CLI: Index command now only accepts local file/directory paths
- README and docs updated: Reflects local-only operation
Fixed
- file:// URL handling: Fixed local file processing to properly handle `file://` URLs in both `LocalSource.download()` and `discover_sources()`
- Copilot review fixes: Various docstring and help text updates
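For readers unfamiliar with the pitfall, a `file://` URL has to be decoded before it can be used as a filesystem path. The sketch below shows one standard-library way to do that; `to_local_path` is a hypothetical helper, not the code used in `LocalSource`.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def to_local_path(source: str) -> Path:
    """Convert a file:// URL or a plain path into a Path (hypothetical helper)."""
    parsed = urlparse(source)
    if parsed.scheme == "file":
        # url2pathname decodes percent-escapes such as %20.
        return Path(url2pathname(parsed.path))
    # Anything else is treated as an ordinary local path.
    return Path(source)
```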
v0.14.1 - Narrow YouTube Format Exclusion
Fixed
- Copilot review follow-up: Narrowed Issue #43 fix to only exclude the `format` key during metadata-only extraction, rather than all `ydl_opts`. This preserves auth/session options (cookies, po_token, visitor_data, extractor_args) needed for age-restricted/private content discovery.
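The narrowed fix can be pictured as filtering a single key out of the options dictionary. This is a minimal sketch under that assumption; `metadata_only_opts` is a hypothetical name, and the sample keys (`format`, `cookiefile`, `quiet`) are ordinary yt-dlp options chosen for illustration.

```python
def metadata_only_opts(ydl_opts: dict) -> dict:
    """Copy yt-dlp options, dropping only `format` (hypothetical helper)."""
    # Auth/session options such as cookies are left untouched.
    return {key: value for key, value in ydl_opts.items() if key != "format"}


download_opts = {"format": "bestaudio", "cookiefile": "cookies.txt", "quiet": True}
discovery_opts = metadata_only_opts(download_opts)
```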
v0.14.0 - YouTube Discovery Fixes
Fixed
- Issue #43: `youtube_format` applied during playlist discovery: Skip yt-dlp options during metadata-only extraction to prevent format errors when discovering playlists with `youtube_format` config set.
- Issue #41: Redundant YouTube discovery calls: Cache metadata during discovery and pass to pipeline to avoid redundant yt-dlp network calls when indexing each video in a playlist.
Breaking
- `discover_sources()` returns rich objects: Now returns `list[DiscoveredSource]` with `url` and `metadata` fields instead of plain strings. This enables metadata caching.
  - Added `discover_source_urls()` convenience function for backward compatibility when only URLs are needed.
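For callers migrating from plain strings, the new return shape might look roughly like the sketch below. Only the `url` and `metadata` field names come from the notes above; the `metadata` type and the helper name `urls_only` are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DiscoveredSource:
    """Illustrative shape only: a URL plus metadata cached at discovery time."""
    url: str
    metadata: Optional[dict[str, Any]] = None  # assumed type


def urls_only(sources: list[DiscoveredSource]) -> list[str]:
    """Recover the old plain-string view from the rich objects (hypothetical)."""
    return [source.url for source in sources]
```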
v0.13.0 - YouTube Audio-Only Downloads
Added
- youtube_format config option: New `youtube_format` config to enable audio-only YouTube downloads.
  - Set `AUDIORAG_YOUTUBE_FORMAT=bestaudio` to download audio-only (saves ~95% bandwidth).
  - Works with yt-dlp format selection: `bestaudio`, `bestaudio/best`, `worstaudio`.
  - Compatible with existing `-x --audio-format mp3` post-processing.
v0.12.0 - Budget Store, Doctor CLI, .env Discovery, Index Status
Added
- Persistent BudgetStore: New `BudgetStore` protocol enables pluggable budget state management backends.
  - Added `SqliteBudgetStore` for persistent budget tracking across process restarts.
  - Added `atomic_reserve()` for atomic check-and-record in a single transaction.
- Budget sipping: New budget adjustment feature that reconciles estimated vs actual audio duration after download.
  - Prevents budget waste by reserving based on estimated duration, then "sipping" (releasing) excess after actual duration is known.
- audiorag doctor CLI: New `audiorag doctor` subcommand for diagnosing pipeline issues.
  - Verifies all required dependencies are installed.
  - Checks provider configurations.
  - Provides actionable troubleshooting advice.
- .env auto-discovery: Configuration now automatically discovers `.env` files by walking up from the current directory.
  - Supports nested project structures.
  - Falls back to default locations if not found.
- get_index_status API: New `pipeline.get_index_status(source_url)` method to check if a source is indexed without triggering re-indexing.
  - Returns indexing status: `not_indexed`, `indexed`, or `failed`.
  - Useful for checking pipeline state before queries.
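The walk-up behaviour of `.env` auto-discovery can be sketched with `pathlib` as follows. `find_dotenv` here is a hypothetical function written for this example, and the fallback to default locations is left to the caller.

```python
from pathlib import Path
from typing import Optional


def find_dotenv(start: Optional[Path] = None) -> Optional[Path]:
    """Walk up from `start` (default: cwd) and return the first .env found."""
    current = (start or Path.cwd()).resolve()
    for directory in (current, *current.parents):
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate
    return None  # caller can fall back to default locations
```

Because the search starts at the current directory and moves toward the filesystem root, the nearest `.env` wins in a nested project.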
Changed
- BudgetGovernor refactor: Split into separate modules for better maintainability.
  - `budget.py` - In-memory budget tracking
  - `budget_store_sqlite.py` - SQLite-backed persistent storage
  - `protocols/budget_store.py` - Protocol definition for custom stores
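To make the atomic reserve and sipping ideas from this release concrete, here is a minimal sketch built on the standard `sqlite3` module. The class `SketchBudgetStore`, its table layout, and its method signatures are all assumptions for illustration; they are not the `SqliteBudgetStore` shipped with AudioRAG.

```python
import sqlite3


class SketchBudgetStore:
    """Hypothetical SQLite-backed budget ledger, in seconds of audio."""

    def __init__(self, path: str, limit_seconds: float) -> None:
        self.limit = limit_seconds
        # isolation_level=None: transactions are managed explicitly below.
        self.conn = sqlite3.connect(path, isolation_level=None)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS reservations "
            "(id TEXT PRIMARY KEY, seconds REAL NOT NULL)"
        )

    def used(self) -> float:
        row = self.conn.execute(
            "SELECT COALESCE(SUM(seconds), 0) FROM reservations"
        ).fetchone()
        return row[0]

    def atomic_reserve(self, key: str, estimated: float) -> bool:
        """Check remaining budget and record the reservation in one transaction."""
        self.conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
        try:
            if self.used() + estimated > self.limit:
                self.conn.execute("ROLLBACK")
                return False
            self.conn.execute(
                "INSERT INTO reservations (id, seconds) VALUES (?, ?)",
                (key, estimated),
            )
            self.conn.execute("COMMIT")
            return True
        except Exception:
            self.conn.execute("ROLLBACK")
            raise

    def sip(self, key: str, actual: float) -> None:
        """Release the excess once the actual duration is known.

        This sketch only ever lowers a reservation; it never raises it.
        """
        self.conn.execute(
            "UPDATE reservations SET seconds = MIN(seconds, ?) WHERE id = ?",
            (actual, key),
        )
```

Doing the check and the insert inside one `BEGIN IMMEDIATE` transaction means two processes cannot both pass the check against the same remaining budget.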
v0.11.0 - YouTube Improvements
Added
- YouTube cookies-from-browser: New config option `youtube_cookies_from_browser` to extract cookies directly from browser (e.g., `chrome`, `firefox:default`, `chrome+gnomekeyring:Profile1`).
- Cookie file support: Wired up existing `youtube_cookie_file` config option (was dead code).
- CI improvements: CI now installs all optional dependencies for better test coverage.
Changed
- Consolidated ydl_opts: Created shared `audiorag.source.ydl_utils` module to consolidate yt-dlp option building (was duplicated in `pipeline.py` and `discovery.py`).
Fixed
- Type safety: Resolved pre-existing type errors in weaviate.py, assemblyai.py, groq.py, and cohere.py exposed by full dependency installation.
v0.10.0 - Transcription Resumability
Added
- Transcription resumability: Pipeline now tracks per-part transcription in SQLite, enabling resumption after partial failures without re-transcribing completed parts.
- Transcripts table: New database table stores transcription segments per audio part.
- StateManager methods: Added `store_transcript()`, `get_transcripts()`, and `get_transcribed_part_indices()` methods.
Benefits
- Saves money: If transcription fails at part 8/10, next run skips parts 1-7 and only transcribes 8-10, saving Groq/STT budget.
- Enables re-chunking: Raw transcript storage infrastructure ready for future re-chunking without re-STT.
- Resilient: Each part is persisted immediately after successful transcription.
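The resume behaviour can be sketched as a loop that persists each part as soon as it succeeds and skips parts already stored. The function `transcribe_with_resume`, the table schema, and the `transcribe` callback are assumptions for this example rather than the pipeline's real `StateManager` code.

```python
import sqlite3
from typing import Callable


def transcribe_with_resume(
    conn: sqlite3.Connection,
    source_id: str,
    parts: list[str],
    transcribe: Callable[[str], str],
) -> list[str]:
    """Transcribe only the parts that are not yet stored (hypothetical sketch)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transcripts ("
        "source_id TEXT, part_index INTEGER, text TEXT, "
        "PRIMARY KEY (source_id, part_index))"
    )
    done = {
        row[0]
        for row in conn.execute(
            "SELECT part_index FROM transcripts WHERE source_id = ?",
            (source_id,),
        )
    }
    for index, part in enumerate(parts):
        if index in done:
            continue  # already transcribed in an earlier run
        text = transcribe(part)
        conn.execute(
            "INSERT INTO transcripts VALUES (?, ?, ?)", (source_id, index, text)
        )
        conn.commit()  # persist each part immediately after success
    rows = conn.execute(
        "SELECT text FROM transcripts WHERE source_id = ? ORDER BY part_index",
        (source_id,),
    )
    return [row[0] for row in rows]
```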
Fixed
- Timestamp alignment: Stored timestamps are now adjusted with cumulative offset to ensure correct time alignment after resume.
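One way to picture the cumulative offset: each part's segments carry timestamps relative to the start of that part, so the durations of all earlier parts are added before storing. The function name and the segment dictionary layout below are assumptions for illustration.

```python
def offset_segments(
    segments: list[dict],
    part_durations: list[float],
    part_index: int,
) -> list[dict]:
    """Shift part-relative timestamps onto the full-audio timeline (sketch)."""
    # Cumulative offset = total duration of every part before this one.
    offset = sum(part_durations[:part_index])
    return [
        {**seg, "start": seg["start"] + offset, "end": seg["end"] + offset}
        for seg in segments
    ]
```

Because the offset depends only on the part index and the part durations, a resumed run computes the same value as an uninterrupted one.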
v0.9.0 - Persistent Cache & Cache Management
Added
- Persistent work_dir default: Default `work_dir` now uses platform-appropriate cache directory:
  - Linux: `~/.cache/audiorag`
  - macOS: `~/Library/Caches/audiorag`
  - Windows: `%LOCALAPPDATA%\audiorag`
- Cache management CLI: New commands to manage cached audio files:
  - `audiorag cache info` - Show cache location and size
  - `audiorag cache clear` - Clear all cached audio files
- Cache management SDK: New methods on `AudioRAGPipeline`:
  - `pipeline.clear_cache()` - Clear cache, returns count of items removed
  - `pipeline.get_cache_info()` - Get cache location, file count, and size
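The platform defaults listed above could be resolved along these lines. This is a sketch only: `default_cache_dir` is a hypothetical name, and the fallback used when `LOCALAPPDATA` is unset is an assumption.

```python
import os
import sys
from pathlib import Path


def default_cache_dir(platform: str = sys.platform) -> Path:
    """Return the per-platform cache directory listed in the notes (sketch)."""
    if platform.startswith("win"):
        # Assumed fallback if LOCALAPPDATA is not set.
        base = os.environ.get(
            "LOCALAPPDATA", str(Path.home() / "AppData" / "Local")
        )
        return Path(base) / "audiorag"
    if platform == "darwin":
        return Path.home() / "Library" / "Caches" / "audiorag"
    # Linux and other POSIX systems.
    return Path.home() / ".cache" / "audiorag"
```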
Fixed
- Type safety: Fixed type checking for `metadata.duration` attribute access in budget reservation logic.
v0.8.1 - Vector Store Source ID Fix
Fixed
- Leaky abstraction in vector stores: Vector store providers now use canonical `source_id` instead of raw `source_url`. This fixes issue #19 where backends had to parse Source IDs from URLs.
  - Added `source_id` to `StageContext` for pipeline-wide canonical ID
  - Renamed `VectorStoreProvider.delete_by_source()` to `delete_by_source_id()`
  - Updated all vector store implementations (ChromaDB, Pinecone, Weaviate, Supabase) to filter by `source_id`
  - Updated metadata to use `source_id` instead of `source_url`
Migration Note
⚠️ Existing vector stores with `source_url` metadata will need to be re-indexed for `force=True` deletion to work. Alternatively, users can manually delete via the vector store's native tools.
v0.8.0 - Auto-detect File Protocol
Added
- Audio source auto-routing: New `AudioSourceRouter` automatically detects URL protocol:
  - `file://` URLs → `LocalSource` (bypasses yt-dlp)
  - Local paths (`/home/user/audio.mp3`, `./audio.mp3`) → `LocalSource`
  - YouTube URLs → `YouTubeSource`
  - Other HTTP URLs → `URLSource`
- URL source provider: New `audio_source_provider` config option supports `url` to use URLSource directly.
- Robust YouTube URL detection: Uses proper URL parsing instead of substring matching to avoid false positives (e.g., `myyoutube.com` is not YouTube).
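The routing rules for this release can be approximated with `urllib.parse`. The sketch below is not the `AudioSourceRouter` code; `classify_source`, its string return values, and the `YOUTUBE_HOSTS` set are assumptions chosen to mirror the list above.

```python
from urllib.parse import urlparse

# Assumed host list for illustration.
YOUTUBE_HOSTS = {"youtube.com", "youtu.be"}


def classify_source(source: str) -> str:
    """Return 'local', 'youtube', or 'url' for a source string (sketch)."""
    parsed = urlparse(source)
    scheme = parsed.scheme.lower()
    if scheme == "file":
        return "local"
    if scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        # Match the exact host or a true subdomain, never a bare substring,
        # so that myyoutube.com is not mistaken for YouTube.
        if any(host == h or host.endswith("." + h) for h in YOUTUBE_HOSTS):
            return "youtube"
        return "url"
    # No recognised scheme: treat as a local filesystem path.
    return "local"
```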
Changed
- Replaced pydub with ffprobe: Duration detection in `LocalSource` and `URLSource` now uses ffprobe directly instead of pydub.
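Calling ffprobe directly for a duration typically looks something like the sketch below. The flags are standard ffprobe CLI options; the function names are hypothetical, and the exact invocation used by AudioRAG may differ.

```python
import subprocess


def ffprobe_command(path: str) -> list[str]:
    """Build an ffprobe command that prints only the container duration."""
    return [
        "ffprobe",
        "-v", "error",                       # suppress banner and warnings
        "-show_entries", "format=duration",  # ask for the duration field only
        "-of", "default=noprint_wrappers=1:nokey=1",  # bare value, no key
        path,
    ]


def probe_duration(path: str) -> float:
    """Return the duration in seconds; requires ffprobe on PATH."""
    result = subprocess.run(
        ffprobe_command(path), capture_output=True, text=True, check=True
    )
    return float(result.stdout.strip())
```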
Fixed
- Protocol conformance: `get_metadata()` return type now allows `None` to conform with implementations that don't support metadata extraction.
- URL parameter naming: Fixed parameter name mismatch (`source_url` → `url`) in `URLSource.download()` to match protocol.
Configuration
# Select audio source provider (default: youtube - auto-routing enabled)
export AUDIORAG_AUDIO_SOURCE_PROVIDER="local" # Force LocalSource
export AUDIORAG_AUDIO_SOURCE_PROVIDER="url" # Force URLSource
export AUDIORAG_AUDIO_SOURCE_PROVIDER="youtube" # YouTube + auto-routing (default)