A production-ready Python tool to extract full video transcripts, captions, comments, replies, and channel data from any YouTube video or playlist. Supports batch processing, multi-language transcripts, and exports to JSON/CSV.
Features · Quick Start · Usage · API Quota
A personal automation project exploring YouTube's data ecosystem using the official API and transcript libraries. This tool demonstrates API integration, data-pipeline design, batch processing, and structured data export, built as a hands-on learning exercise and portfolio showcase.
| Data Layer | What You Get |
|---|---|
| Transcripts | Full captions/subtitles with timestamps, word count, language detection |
| Comments | All top-level comments with author, text, likes, timestamps |
| Replies | Nested reply threads under each comment |
| Video Metadata | Title, description, tags, duration, views, likes, category, thumbnails |
| Channel Data | Subscriber count, total videos, country, keywords, channel description |
| Playlists | Automatic resolution – scrape every video in a playlist |
- Full Transcript Extraction – captions with precise timestamps, auto-generated & manual subtitle support
- Deep Comment Scraping – top-level comments AND nested replies with like counts
- Rich Video Metadata – views, likes, duration, tags, category, thumbnails
- Channel Intelligence – subscriber count, video count, country, branding keywords
- Playlist Resolution – auto-detect playlists and scrape every contained video
- Multi-Language Transcripts – specify preferred languages with automatic fallback (see the sketch after this list)
- Flexible Export – JSON, CSV, or both, structured for immediate analysis
- Batch Processing – feed a text file of URLs for hands-free bulk scraping
- Resilient Design – exponential backoff, retry logic, configurable rate limiting
- Zero-Browser Architecture – pure API + library approach: fast, lightweight, no Selenium
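The transcript layer runs standalone, with no API key or browser. A minimal sketch of language fallback with youtube-transcript-api (the 0.6.x classmethod API; `fetch_transcript` is an illustrative wrapper, not a function in this repo):

```python
# Language-fallback transcript fetch with youtube-transcript-api (0.6.x
# classmethod API; note that 1.x releases moved to an instance .fetch() method).
from youtube_transcript_api import YouTubeTranscriptApi

def fetch_transcript(video_id, languages=("en", "en-US")):
    # Returns a list of {"text", "start", "duration"} dicts; the library
    # walks the language list in order and raises if none is available.
    return YouTubeTranscriptApi.get_transcript(video_id, languages=list(languages))

segments = fetch_transcript("dQw4w9WgXcQ", languages=("es", "en"))
print(segments[0]["start"], segments[0]["text"])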
| Component | Version |
|---|---|
| Python | 3.11+ |
| google-api-python-client | ≥ 2.149.0 |
| youtube-transcript-api | ≥ 0.6.3 |
| python-dotenv | ≥ 1.1.0 |
| rich | ≥ 13.9.0 |
| pandas | ≥ 2.2.0 |
```
YouTube-Transcript-Comment-Scraper/
├── main.py              # CLI entry-point (argparse)
├── scraper.py           # Core scraping engine (transcript, comments, metadata)
├── config.py            # Centralised configuration (.env loader)
├── requirements.txt     # Pinned Python dependencies
├── .env.example         # Environment variable template
├── .gitignore           # Git ignore rules
├── README.md            # This file
├── utils/
│   ├── __init__.py
│   └── helpers.py       # URL parsing, file I/O, retry logic, display
└── output/              # Generated – JSON/CSV exports land here
    ├── videos/
    └── playlists/
```
```bash
git clone https://github.com/facingshootingstar/YouTube-Transcript-Comment-Scraper.git
cd YouTube-Transcript-Comment-Scraper

python -m venv venv

# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activate

pip install -r requirements.txt

cp .env.example .env
# Edit .env and add your YouTube Data API key
```

- Go to Google Cloud Console
- Create a new project (or select an existing one)
- Enable YouTube Data API v3
- Go to Credentials → Create API Key
- Paste the key into your `.env` file
Note: Transcripts work WITHOUT an API key. The key is only needed for comments, metadata, and playlist resolution.
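For reference, a minimal sketch of a python-dotenv loader in the style of `config.py` (variable names and defaults mirror the configuration table below, but this is not necessarily the repo's code verbatim):

```python
# Sketch of a python-dotenv config loader; names mirror the config table.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

YOUTUBE_API_KEY = os.getenv("YOUTUBE_API_KEY", "")
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "output")
OUTPUT_FORMAT = os.getenv("OUTPUT_FORMAT", "json")
MAX_COMMENTS = int(os.getenv("MAX_COMMENTS", "500"))
REQUEST_DELAY = float(os.getenv("REQUEST_DELAY", "0.5"))
```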
python main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"python main.py "https://www.youtube.com/playlist?list=PLrAXtmErZgOeiKm4sgNOknGvNjby9efdf"python main.py "URL1" "URL2" "URL3"python main.py --transcript-only "https://youtu.be/dQw4w9WgXcQ"# Create urls.txt with one URL per line
python main.py --batch urls.txt# Export as CSV
python main.py --format csv "URL"
# Export as both JSON + CSV
python main.py --format both --output ./my_data "URL"
# Limit comments
python main.py --max-comments 100 "URL"
# Prefer Spanish transcripts
python main.py --languages es,en "URL"# No comments
python main.py --no-comments "URL"
# No channel data
python main.py --no-channel "URL"{
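To see roughly what produces the exports shown below, here is a sketch of the kind of google-api-python-client calls the scraper presumably makes; the API methods are real, but the structure is illustrative rather than this repo's exact code:

```python
# Sketch: fetching metadata and top-level comments with google-api-python-client.
from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

# Video metadata: title, stats, duration, etc. (1 quota unit)
meta = youtube.videos().list(
    part="snippet,statistics,contentDetails", id="dQw4w9WgXcQ"
).execute()

# Top-level comments, up to 100 per page (1 quota unit per page)
threads = youtube.commentThreads().list(
    part="snippet", videoId="dQw4w9WgXcQ", maxResults=100, order="relevance"
).execute()
for item in threads.get("items", []):
    s = item["snippet"]["topLevelComment"]["snippet"]
    print(s["authorDisplayName"], s["likeCount"], s["textDisplay"][:60])
```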
"video_id": "dQw4w9WgXcQ",
"language": "en",
"is_generated": true,
"segment_count": 142,
"segments": [
{
"start": 0.0,
"duration": 2.4,
"timestamp": "0:00.000",
"text": "We're no strangers to love"
}
],
"full_text": "We're no strangers to love ...",
"word_count": 468
}{
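The CSV export presumably flattens these segments into rows; with pandas (already a dependency) that is a one-liner. The file path below is illustrative:

```python
# Flattening transcript segments to CSV with pandas (file paths illustrative).
import json
import pandas as pd

with open("output/videos/dQw4w9WgXcQ_transcript.json") as f:
    data = json.load(f)

pd.DataFrame(data["segments"]).to_csv("transcript.csv", index=False)
```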
"comment_id": "UgyL...",
"author": "Rick Astley Fan",
"author_channel_id": "UC...",
"text": "Never gonna give this up!",
"like_count": 1523,
"published_at": "2024-03-15T12:00:00Z",
"is_reply": false,
"parent_id": null
}All settings can be configured via .env or CLI flags:
| Variable | Default | Description |
|---|---|---|
| `YOUTUBE_API_KEY` | – | YouTube Data API v3 key |
| `OUTPUT_DIR` | `output` | Base output directory |
| `OUTPUT_FORMAT` | `json` | `json`, `csv`, or `both` |
| `MAX_COMMENTS` | `500` | Max comments per video (`0` = unlimited) |
| `MAX_REPLIES` | `50` | Max replies per top-level comment |
| `COMMENT_SORT` | `relevance` | `relevance` or `time` |
| `TRANSCRIPT_LANGS` | `en,en-US` | Preferred transcript languages |
| `REQUEST_DELAY` | `0.5` | Seconds between API calls |
| `MAX_RETRIES` | `3` | Retry attempts on failure |
| `LOG_LEVEL` | `INFO` | Logging verbosity |
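`REQUEST_DELAY` and `MAX_RETRIES` drive the resilience behaviour advertised above. A minimal sketch of the kind of exponential-backoff helper these settings imply (the real logic lives in `utils/helpers.py`; the name and signature here are assumptions, not its API):

```python
# Illustrative exponential-backoff helper; not the repo's actual function.
import time

def with_retries(call, max_retries=3, base_delay=0.5):
    """Run call(); on failure wait 0.5s, 1s, 2s, ... then give up."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```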
YouTube Data API v3 has a 10,000 units/day free quota:
| Operation | Cost |
|---|---|
| `videos.list` | 1 unit |
| `channels.list` | 1 unit |
| `commentThreads.list` | 1 unit |
| `comments.list` | 1 unit |
| `playlistItems.list` | 1 unit |
A full scrape of one video (metadata + comments + channel) costs roughly 5–50 units depending on comment volume. You can comfortably scrape 200–2,000 videos per day on the free tier.
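To see where that estimate comes from, here is the back-of-the-envelope math for one video at the default `MAX_COMMENTS=500` (fetching replies via `comments.list` would add a few more units):

```python
# Quota math for one video at the defaults.
video_meta = 1               # videos.list
channel_meta = 1             # channels.list
comment_pages = 500 // 100   # commentThreads.list pages (up to 100 threads each)

units_per_video = video_meta + channel_meta + comment_pages  # = 7 units
print(10_000 // units_per_video)  # ~1428 videos/day on the free quota
```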
⚠️ This tool is for educational, research, and legitimate personal use only.
- ✅ Academic research, content strategy, accessibility improvements
- ✅ Scraping your own channel's data for analytics
- ✅ Aggregating public data for sentiment analysis or NLP research
- ❌ Mass harvesting personal data for spam or harassment
- ❌ Violating YouTube's Terms of Service for commercial redistribution
- ❌ Building competing platforms with scraped content
Always:
- Respect YouTube's Terms of Service and API Terms
- Use reasonable delays and respect rate limits
- Handle personal data in compliance with GDPR / CCPA
The author assumes no liability for misuse of this tool.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License β see the LICENSE file for details.
Built with ❤️ by @facingshootingstar
Made for personal learning and portfolio purposes.