YTScriber

Python 3.9+ | License: MIT | Code style: black | Ruff

Download YouTube transcripts and manage channel archives with a unified CLI.

Features

  • 📹 Extract videos from any YouTube channel
  • 📝 Download transcripts with metadata (title, author, duration, etc.)
  • 🧠 Summarize while downloading — AI summaries run during the delay between downloads, adding no extra time
  • 🔐 Secure key storage in the OS keychain via ytscriber auth login
  • 📄 Save as markdown files with YAML frontmatter for easy processing
  • 🔄 Track progress in CSV files to resume interrupted downloads
  • Cross-platform data directories via platformdirs
  • Unified CLI with subcommands (ytscriber download, extract, sync-all)

Installation

From PyPI

pip install ytscriber

From source (recommended for development)

git clone https://github.com/dparedesi/YTScribe.git
cd YTScribe
pip install -e .

Development installation

pip install -e ".[dev]"

Quick Start

# Extract videos from a conference channel
ytscriber extract https://www.youtube.com/@AWSEventsChannel/videos \
  --count 100 \
  --folder aws-reinvent-2025 \
  --register-channel

# Download transcripts
ytscriber download --folder aws-reinvent-2025

Usage

Extract videos from a channel

ytscriber extract <channel_url> --count <number> --folder <folder>

Examples:

# AWS re:Invent 2025
ytscriber extract https://www.youtube.com/@AWSEventsChannel/videos \
  --count 100 \
  --folder aws-reinvent-2025 \
  --register-channel

# PyCon US
ytscriber extract https://www.youtube.com/@PyConUS \
  --count 50 \
  --folder pycon-2024 \
  --register-channel

# KubeCon
ytscriber extract https://www.youtube.com/@cncf/videos \
  --count 75 \
  --folder kubecon-2024 \
  --register-channel

Options:

| Option | Description | Default |
|---|---|---|
| --count, -n | Number of latest videos to extract | 10 |
| --folder | Folder under data dir (shorthand for CSV) | - |
| --append-csv | Create or append to CSV file | - |
| --output, -o | Save video IDs to text file | - |
| --register-channel | Add channel to channels.yaml | False |
| --verbose, -v | Enable verbose output | False |

Download transcripts

ytscriber download --folder <folder>

Examples:

# Download transcripts for AWS re:Invent
ytscriber download --folder aws-reinvent-2025

# With faster processing (shorter delay)
ytscriber download --folder pycon-2024 --delay 30

# Download AND summarize each transcript during the delay window (no extra time)
ytscriber download --folder aws-reinvent-2025 --summarize

Options:

| Option | Description | Default |
|---|---|---|
| --folder | Folder under data dir (shorthand for CSV/output) | - |
| --csv | Input CSV file with video URLs | - |
| --output-dir | Directory for transcript files | outputs |
| --delay | Seconds between requests | 60 |
| --languages, -l | Language codes to try | en en-US en-GB |
| --summarize | Summarize each transcript during the delay window (needs API key) | False |
| --api-key | OpenRouter API key (overrides env var and keychain) | - |
| --verbose, -v | Enable verbose output | False |

Tip: --summarize runs the AI summary in the background during the rate-limit delay (60 seconds by default), so summaries add no time to a download run. It also works with download-all. Requires an OpenRouter key (see AI Summarization Setup).
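To illustrate the idea behind the tip above, here is a minimal sketch of overlapping a summary with the rate-limit pause. It is not YTScriber's implementation: `download_with_overlap` and the `download` and `summarize` callables are hypothetical names used only for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def download_with_overlap(video_ids, download, summarize, delay):
    """Download each video and summarize it in a worker thread during the delay.

    `download` and `summarize` are caller-supplied callables (hypothetical
    stand-ins for the real download and summarization steps).
    """
    summaries = {}
    with ThreadPoolExecutor(max_workers=1) as pool:
        for index, video_id in enumerate(video_ids):
            transcript = download(video_id)
            # Start the summary, then pause; both happen at the same time.
            future = pool.submit(summarize, transcript)
            if index < len(video_ids) - 1:
                time.sleep(delay)  # rate-limit pause before the next request
            summaries[video_id] = future.result()
    return summaries
```

In this sketch the summary only adds time when it takes longer than the delay, because `future.result()` then blocks until the summary finishes.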

Single video mode:

ytscriber download https://www.youtube.com/watch?v=VIDEO_ID --output transcript.md
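If you need the VIDEO_ID part of such a URL in your own scripts, the following sketch pulls it out with the standard library. It is an illustration only and may differ from how YTScriber parses URLs; `video_id_from_url` is a hypothetical helper, and it handles just the `watch?v=` and `youtu.be` forms.

```python
from urllib.parse import parse_qs, urlparse

# Hosts accepted by this sketch (an assumption, not an exhaustive list).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def video_id_from_url(url):
    """Return the video ID from a YouTube URL, or None if it cannot be found."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host in YOUTUBE_HOSTS:
        # Watch links carry the ID in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None
```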

Add a video to a collection

ytscriber add <youtube_url> --folder <collection>

Sync all channels

ytscriber sync-all

Download all transcripts

ytscriber download-all

# Also generate summaries during the delay window
ytscriber download-all --summarize

View or edit config

ytscriber config
ytscriber config --set defaults.delay=45

Status

ytscriber status

Output Format

Transcript files (Markdown with YAML frontmatter)

Each transcript is saved as a Markdown file with rich metadata in the YAML frontmatter:

---
video_id: i_cskqmWA3U
video_url: https://www.youtube.com/watch?v=i_cskqmWA3U
title: Personalize ChatGPT with custom instructions
author: OpenAI
published_date: 2025-05-12
length_minutes: 2.17
views: 60882
description: "With custom instructions, ChatGPT becomes more relevant..."
is_generated: True
is_translatable: True
---

You can find memory and custom instructions by clicking on your profile
in the top right, clicking on settings, and then personalization...

Files are named with the publish date for easy sorting: 2025-05-12-i_cskqmWA3U.md
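Because the frontmatter is plain `key: value` text between two `---` lines, transcripts are easy to post-process. The sketch below is a deliberately small parser for the flat layout shown above, not a full YAML parser (a library such as PyYAML is the safer choice for anything more complex). `parse_transcript` and `transcript_filename` are hypothetical helper names, and all values are returned as strings.

```python
from datetime import date


def parse_transcript(text):
    """Split a transcript file into (metadata dict, body text)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter present
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the transcript body.
            return meta, "\n".join(lines[i + 1:]).strip()
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # closing delimiter never found


def transcript_filename(published_date, video_id):
    """Build a name in the documented YYYY-MM-DD-{video_id}.md pattern."""
    return f"{published_date.isoformat()}-{video_id}.md"
```

For example, `transcript_filename(date(2025, 5, 12), "i_cskqmWA3U")` produces the file name shown above.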

AI Summarization Setup

YTScriber supports two summarization providers:

| Provider | Default Model | API Key |
|---|---|---|
| Z.AI (default) | GLM-5.1 | Get one at z.ai |
| OpenRouter | nvidia/nemotron-3-super-120b-a12b:free | Get one at openrouter.ai |

1. Get an API Key — Sign up at your chosen provider.

2. Configure — The recommended way is ytscriber auth login:

ytscriber auth login     # prompts for provider, key + model, validates both, stores securely
ytscriber auth status    # shows active provider and where the key is resolved from
ytscriber auth logout    # removes the stored key

auth login validates your key against the chosen provider and runs a test request, so a wrong key or unavailable model is caught immediately.

Alternatively, use an environment variable or a .env file:

export ZAI_API_KEY=your-key-here           # for Z.AI (default provider)
export OPENROUTER_API_KEY=sk-or-...        # for OpenRouter

3. Switch providers — Change the active provider anytime:

ytscriber auth login                       # pick a different provider at login
ytscriber config --set summarization.provider=openrouter  # or set directly
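The setup steps above mention three places a key can come from: the --api-key flag, an environment variable, and the OS keychain. The sketch below shows one plausible lookup order. Only the fact that --api-key overrides the other two comes from this README; checking the environment before the keychain is an assumption, as are the `zai` provider identifier, the `resolve_api_key` name, and the `keychain_lookup` callable.

```python
import os

# Environment variable per provider. "zai" as an identifier is an assumption;
# "openrouter" matches the summarization.provider value shown above.
ENV_VARS = {"zai": "ZAI_API_KEY", "openrouter": "OPENROUTER_API_KEY"}


def resolve_api_key(provider, cli_key=None, env=None, keychain_lookup=None):
    """Return (key, source) using an assumed order: flag, env var, keychain."""
    env = os.environ if env is None else env
    if cli_key:
        return cli_key, "--api-key"
    env_name = ENV_VARS[provider]
    if env.get(env_name):
        return env[env_name], f"env:{env_name}"
    if keychain_lookup is not None:
        # keychain_lookup is a hypothetical callable: provider -> key or None.
        stored = keychain_lookup(provider)
        if stored:
            return stored, "keychain"
    return None, "not found"
```

Run ytscriber auth status to see where the real tool resolves the key from.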

Summarize transcripts

You can summarize during download (--summarize, see above) or run summarization separately on already-downloaded transcripts:

ytscriber summarize <folder_name> [options]

Examples:

# Summarize the folder named "random"
ytscriber summarize random

# Summarize all folders
ytscriber summarize --all

# Dry run to preview changes
ytscriber summarize random --dry-run

Options:

| Option | Description | Default |
|---|---|---|
| --all | Process all folders | False |
| --dry-run | Preview changes without writing | False |
| --force | Overwrite existing summaries | False |
| --model | OpenRouter model to use | nvidia/nemotron-3-super-120b-a12b:free |

CSV tracking format

url,title,duration_minutes,view_count,description,transcript_downloaded,summary_done
https://youtube.com/watch?v=...,Talk Title,45.5,1234,Description...,success,

| Column | Description |
|---|---|
| transcript_downloaded | Status: success, error: <reason>, or empty |
| summary_done | Track if you've processed the transcript |
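Since the tracking file is ordinary CSV, you can inspect progress with the standard library. The sketch below lists the URLs whose transcript_downloaded column is still empty. `pending_urls` is a hypothetical helper, and treating only empty cells as pending (rather than also retrying rows marked error) is an assumption about what you want, not a statement of how YTScriber resumes.

```python
import csv
import io


def pending_urls(csv_text):
    """Return URLs whose transcript_downloaded column is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["url"]
        for row in reader
        if not (row.get("transcript_downloaded") or "").strip()
    ]
```

To use it on a real file, read the file first, for example `pending_urls(open("videos.csv", newline="", encoding="utf-8").read())`.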

Project Structure

YTScribe/
├── src/
│   └── ytscriber/
│       ├── __init__.py          # Package exports
│       ├── cli.py               # Command-line interface
│       ├── downloader.py        # Transcript downloading
│       ├── extractor.py         # Channel video extraction
│       ├── csv_handler.py       # CSV operations
│       ├── metadata.py          # Video metadata extraction
│       ├── models.py            # Data models
│       ├── exceptions.py        # Custom exceptions
│       ├── logging_config.py    # Logging setup
│       └── utils.py             # Utility functions
├── scripts/                     # Automation scripts
├── tests/                       # Unit tests
├── prompts/                     # AI prompts for analysis
├── examples/                    # Example shell scripts
├── pyproject.toml               # Project configuration
└── README.md

Data Organization

By default, data is stored in:

  • macOS/Windows: ~/Documents/YTScriber
  • Linux: ~/ytscriber

Example structure:

YTScriber/
├── aws-reinvent-2025/
│   ├── videos.csv
│   └── transcripts/
│       ├── 2025-12-03-abc123xyz.md
│       └── 2025-12-03-def456uvw.md
├── pycon-2024/
│   ├── videos.csv
│   └── transcripts/
└── kubecon-eu-2024/
    ├── videos.csv
    └── transcripts/
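For scripts that need to locate these files, the sketch below hard-codes the default locations listed above. It does not call platformdirs and ignores any override set through ytscriber config, so treat it as an approximation. `default_data_dir` and `transcripts_dir` are hypothetical helpers, and platforms other than Linux fall through to the Documents path, which is an assumption.

```python
import sys
from pathlib import Path


def default_data_dir(platform=None, home=None):
    """Return the documented default data directory for a platform."""
    platform = sys.platform if platform is None else platform
    home = Path.home() if home is None else Path(home)
    if platform.startswith("linux"):
        return home / "ytscriber"
    # macOS ("darwin") and Windows ("win32") use the Documents folder.
    return home / "Documents" / "YTScriber"


def transcripts_dir(folder, data_dir):
    """Return the transcripts directory for one collection folder."""
    return Path(data_dir) / folder / "transcripts"
```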

Migration (1.x to 2.0)

If you used the old transcript-* commands and a repo-local data/ folder:

  1. Move your existing data/ folder into the new data directory above.
  2. Copy channels.yaml into the same data directory if you rely on sync-all.
  3. Update commands:
    • transcript-extract -> ytscriber extract
    • transcript-download -> ytscriber download
    • transcript-add -> ytscriber add
    • transcript-summarize -> ytscriber summarize
  4. Optionally set defaults with ytscriber config --set defaults.delay=....
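If you have shell scripts that still call the old commands, the renames in step 3 can be applied mechanically. The sketch below rewrites only the command name and leaves the arguments untouched; whether every 1.x flag still exists in 2.0 is not covered here, so check the result by hand. `migrate_command` is a hypothetical helper.

```python
# Old command name -> new command, as listed in step 3 above.
RENAMES = {
    "transcript-extract": "ytscriber extract",
    "transcript-download": "ytscriber download",
    "transcript-add": "ytscriber add",
    "transcript-summarize": "ytscriber summarize",
}


def migrate_command(command):
    """Rewrite a 1.x command line to its 2.0 name, keeping arguments as-is."""
    head, sep, rest = command.strip().partition(" ")
    new_head = RENAMES.get(head)
    if new_head is None:
        return command  # not an old transcript-* command; leave unchanged
    return new_head + sep + rest
```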

Rate Limiting & Best Practices

YouTube may rate-limit or block your IP address if you make too many requests:

  1. Use reasonable delays: the default of 60 seconds between requests is safe.
  2. Resume after interruption: progress is tracked in the CSV, so rerunning the same command picks up where it left off.
  3. Start small: test with 10-20 videos before running large batches.
  4. Respect limits: if you get blocked, wait 30-60 minutes before retrying.
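When planning a batch, it helps to estimate how long the delays alone will take. The sketch below assumes one delay between each pair of consecutive requests and ignores the time spent on the downloads themselves, so the real run will be somewhat longer. `estimated_minutes` is a hypothetical helper.

```python
def estimated_minutes(video_count, delay_seconds=60):
    """Lower bound on batch duration from the delays between requests."""
    if video_count <= 1:
        return 0.0
    return (video_count - 1) * delay_seconds / 60
```

With the default 60-second delay, a 100-video batch spends at least 99 minutes waiting, which is one reason to test with a small batch first.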

Development

Setup

# Clone repository
git clone https://github.com/dparedesi/YTScribe.git
cd YTScribe

# Install with development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks (optional)
pre-commit install

Running tests

# Run all tests
pytest

# Run with coverage
pytest --cov=ytscriber

# Run specific test file
pytest tests/test_utils.py

Code quality

# Format code
black src tests

# Lint code
ruff check src tests

# Type checking
mypy src

Programmatic Usage

from ytscriber import TranscriptDownloader, ChannelExtractor

# Extract videos from a channel
extractor = ChannelExtractor()
videos = extractor.extract_videos(
    "https://www.youtube.com/@AWSEventsChannel/videos",
    max_videos=10
)

# Download transcripts
downloader = TranscriptDownloader(
    languages=["en", "en-US"],
    delay=30,
    output_dir="transcripts"
)

for video in videos:
    result = downloader.download(video.video_id, video.url)
    if result.success:
        print(f"Downloaded: {video.title}")
    else:
        print(f"Failed: {result.error_message}")

Troubleshooting

"No transcript found"

  • Video may not have captions/transcripts available
  • Try with different language codes: --languages en en-US auto

"IP blocked" or rate limiting

  • Wait 30-60 minutes before retrying
  • Increase delay: --delay 120
  • Use a different network or IP address if the block persists

"Could not extract metadata"

  • Transcript will still download, just without extra metadata
  • Check if video is accessible and not private

Transcripts saved without date prefix (macOS)

If transcript files are saved as {video_id}.md instead of YYYY-MM-DD-{video_id}.md, Python on macOS is likely missing SSL root certificates. This causes pytube (used for video metadata) to fail silently.

Diagnose:

python -c "from pytube import YouTube; print(YouTube('https://youtube.com/watch?v=dQw4w9WgXcQ').publish_date)"

If you see SSLCertVerificationError, run the certificate installer:

"/Applications/Python 3.13/Install Certificates.command"

Adjust the path for your Python version (3.12, 3.14, etc.).

Note: Homebrew Python and most Linux distributions include certificates by default. This only affects Python installed from python.org.
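To find out which existing files were affected, you can scan a transcripts folder for names that lack the date prefix. The sketch below checks file names only; `undated_transcripts` is a hypothetical helper and is not part of YTScriber.

```python
import re

# Matches the documented YYYY-MM-DD-{video_id}.md naming pattern.
DATED = re.compile(r"^\d{4}-\d{2}-\d{2}-.+\.md$")


def undated_transcripts(filenames):
    """Return the .md file names that are missing the date prefix."""
    return [
        name
        for name in filenames
        if name.endswith(".md") and not DATED.match(name)
    ]
```

You could feed it the names in a folder with `[p.name for p in pathlib.Path("transcripts").iterdir()]`.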

Script interrupted

  • Run the same command again; videos that were already downloaded are skipped
  • Progress is saved to CSV after each video

Requirements

  • Python 3.9+
  • youtube-transcript-api
  • yt-dlp
  • pytube

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Run tests and linting (pytest && ruff check .)
  4. Commit your changes (git commit -m 'Add amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Open a Pull Request
