
# 🎬 YouTube Transcript & Comment Scraper

A production-ready Python tool to extract full video transcripts, captions, comments, replies, and channel data from any YouTube video or playlist. Supports batch processing, multi-language transcripts, and exports to JSON/CSV.

Python 3.11+ · YouTube Data API v3 · License: MIT

Features · Quick Start · Usage · API Quota


## 📌 About This Project

A personal automation project exploring YouTube's data ecosystem using the official API and transcript libraries. This tool demonstrates API integration, data pipeline design, batch processing, and structured data export, built as a hands-on learning exercise and portfolio showcase.


## 🎯 What It Does

| Data Layer | What You Get |
| --- | --- |
| Transcripts | Full captions/subtitles with timestamps, word count, language detection |
| Comments | All top-level comments with author, text, likes, timestamps |
| Replies | Nested reply threads under each comment |
| Video Metadata | Title, description, tags, duration, views, likes, category, thumbnails |
| Channel Data | Subscriber count, total videos, country, keywords, channel description |
| Playlists | Automatic resolution: scrape every video in a playlist |

## ✨ Key Features

- 🎬 **Full Transcript Extraction** – captions with precise timestamps; auto-generated and manual subtitle support
- 💬 **Deep Comment Scraping** – top-level comments AND nested replies with like counts
- 📊 **Rich Video Metadata** – views, likes, duration, tags, category, thumbnails
- 📡 **Channel Intelligence** – subscriber count, video count, country, branding keywords
- 📋 **Playlist Resolution** – auto-detect playlists and scrape every contained video
- 🌐 **Multi-Language Transcripts** – specify preferred languages with automatic fallback
- 📁 **Flexible Export** – JSON, CSV, or both, structured for immediate analysis
- 🔄 **Batch Processing** – feed a text file of URLs for hands-free bulk scraping
- 🛡️ **Resilient Design** – exponential backoff, retry logic, configurable rate limiting
- ⚡ **Zero-Browser Architecture** – pure API + library approach; fast, lightweight, no Selenium

## 📦 Tech Stack

| Component | Version |
| --- | --- |
| Python | 3.11+ |
| google-api-python-client | ≥ 2.149.0 |
| youtube-transcript-api | ≥ 0.6.3 |
| python-dotenv | ≥ 1.1.0 |
| rich | ≥ 13.9.0 |
| pandas | ≥ 2.2.0 |

## 📂 Project Structure

```
YouTube-Transcript-Comment-Scraper/
├── main.py                  # CLI entry point (argparse)
├── scraper.py               # Core scraping engine (transcripts, comments, metadata)
├── config.py                # Centralised configuration (.env loader)
├── requirements.txt         # Pinned Python dependencies
├── .env.example             # Environment variable template
├── .gitignore               # Git ignore rules
├── README.md                # This file
├── utils/
│   ├── __init__.py
│   └── helpers.py           # URL parsing, file I/O, retry logic, display
└── output/                  # Generated: JSON/CSV exports land here
    ├── videos/
    └── playlists/
```
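
The playlist auto-detection relies on URL parsing of the kind `utils/helpers.py` handles. A minimal sketch using only the standard library (the function name `extract_ids` and the exact URL forms covered are assumptions, not the repo's actual implementation):

```python
from urllib.parse import urlparse, parse_qs


def extract_ids(url):
    """Return (video_id, playlist_id) parsed from a YouTube URL; either may be None."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    video_id = None
    playlist_id = query.get("list", [None])[0]      # present on playlist URLs
    if parsed.hostname == "youtu.be":
        video_id = parsed.path.lstrip("/")          # short links: youtu.be/<id>
    elif parsed.path == "/watch":
        video_id = query.get("v", [None])[0]        # standard watch URLs
    elif parsed.path.startswith(("/shorts/", "/embed/")):
        video_id = parsed.path.split("/")[2]        # shorts and embed URLs
    return video_id, playlist_id
```

When `playlist_id` is set and `video_id` is not, the scraper can fall through to playlist resolution; when both are set, scraping just the single video is the safer default.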

## 🚀 Quick Start

### 1. Clone the repository

```sh
git clone https://github.com/facingshootingstar/YouTube-Transcript-Comment-Scraper.git
cd YouTube-Transcript-Comment-Scraper
```

### 2. Create a virtual environment

```sh
python -m venv venv
# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activate
```

### 3. Install dependencies

```sh
pip install -r requirements.txt
```

### 4. Configure environment

```sh
cp .env.example .env
# Edit .env and add your YouTube Data API key
```

### 5. Get a YouTube Data API key (free)

1. Go to the Google Cloud Console
2. Create a new project (or select an existing one)
3. Enable **YouTube Data API v3**
4. Go to Credentials → Create API Key
5. Paste the key into your `.env` file

> **Note:** Transcripts work **without** an API key. The key is only needed for comments, metadata, and playlist resolution.


## 📖 Usage

### Single Video

```sh
python main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
```

### Playlist (auto-detected)

```sh
python main.py "https://www.youtube.com/playlist?list=PLrAXtmErZgOeiKm4sgNOknGvNjby9efdf"
```

### Multiple URLs

```sh
python main.py "URL1" "URL2" "URL3"
```

### Transcript Only (no API key required)

```sh
python main.py --transcript-only "https://youtu.be/dQw4w9WgXcQ"
```

### Batch Processing

```sh
# Create urls.txt with one URL per line
python main.py --batch urls.txt
```

### Custom Output

```sh
# Export as CSV
python main.py --format csv "URL"

# Export as both JSON + CSV
python main.py --format both --output ./my_data "URL"

# Limit comments
python main.py --max-comments 100 "URL"

# Prefer Spanish transcripts
python main.py --languages es,en "URL"
```

### Skip Options

```sh
# No comments
python main.py --no-comments "URL"

# No channel data
python main.py --no-channel "URL"
```

## 📤 Output Examples

### Transcript JSON

```json
{
  "video_id": "dQw4w9WgXcQ",
  "language": "en",
  "is_generated": true,
  "segment_count": 142,
  "segments": [
    {
      "start": 0.0,
      "duration": 2.4,
      "timestamp": "0:00.000",
      "text": "We're no strangers to love"
    }
  ],
  "full_text": "We're no strangers to love ...",
  "word_count": 468
}
```
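
The `segments` array flattens directly to tabular form, which is essentially what the CSV export format produces. A standard-library sketch using the field names from the JSON above (the tool's actual CSV column layout may differ):

```python
import csv
import io

# A transcript shaped like the JSON example above, trimmed to one segment
transcript = {
    "video_id": "dQw4w9WgXcQ",
    "segments": [
        {"start": 0.0, "duration": 2.4, "timestamp": "0:00.000",
         "text": "We're no strangers to love"},
    ],
}

# Flatten segments into CSV: one row per caption line
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["start", "duration", "timestamp", "text"])
writer.writeheader()
writer.writerows(transcript["segments"])
csv_text = buf.getvalue()
```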

### Comment JSON

```json
{
  "comment_id": "UgyL...",
  "author": "Rick Astley Fan",
  "author_channel_id": "UC...",
  "text": "Never gonna give this up!",
  "like_count": 1523,
  "published_at": "2024-03-15T12:00:00Z",
  "is_reply": false,
  "parent_id": null
}
```
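
Because each comment carries `is_reply` and `parent_id`, a flat export can be regrouped into the nested threads described above. A sketch with hypothetical sample data (the thread-building is downstream analysis, not part of the tool itself):

```python
# Flat comment records as exported, minimal fields for illustration
comments = [
    {"comment_id": "c1", "text": "Never gonna give this up!",
     "is_reply": False, "parent_id": None},
    {"comment_id": "c2", "text": "Same here",
     "is_reply": True, "parent_id": "c1"},
]

# Index top-level comments, then attach each reply under its parent
threads = {c["comment_id"]: {**c, "replies": []}
           for c in comments if not c["is_reply"]}
for c in comments:
    if c["is_reply"] and c["parent_id"] in threads:
        threads[c["parent_id"]]["replies"].append(c)
```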

## 🔧 Configuration Reference

All settings can be configured via `.env` or CLI flags:

| Variable | Default | Description |
| --- | --- | --- |
| YOUTUBE_API_KEY | – | YouTube Data API v3 key |
| OUTPUT_DIR | output | Base output directory |
| OUTPUT_FORMAT | json | `json`, `csv`, or `both` |
| MAX_COMMENTS | 500 | Max comments per video (0 = unlimited) |
| MAX_REPLIES | 50 | Max replies per top-level comment |
| COMMENT_SORT | relevance | `relevance` or `time` |
| TRANSCRIPT_LANGS | en,en-US | Preferred transcript languages |
| REQUEST_DELAY | 0.5 | Seconds between API calls |
| MAX_RETRIES | 3 | Retry attempts on failure |
| LOG_LEVEL | INFO | Logging verbosity |
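
A sketch of how `config.py` might map these variables to typed defaults. This version uses only the standard library; the real module loads `.env` via python-dotenv first, and the exact names and type coercions here are assumptions:

```python
import os

# Defaults mirror the table above; values exported in the environment
# (or loaded from .env by python-dotenv) override them.
CONFIG = {
    "OUTPUT_DIR": os.getenv("OUTPUT_DIR", "output"),
    "OUTPUT_FORMAT": os.getenv("OUTPUT_FORMAT", "json"),
    "MAX_COMMENTS": int(os.getenv("MAX_COMMENTS", "500")),
    "MAX_REPLIES": int(os.getenv("MAX_REPLIES", "50")),
    "REQUEST_DELAY": float(os.getenv("REQUEST_DELAY", "0.5")),
    "MAX_RETRIES": int(os.getenv("MAX_RETRIES", "3")),
}
```

Coercing to `int`/`float` at load time keeps type errors at startup rather than mid-scrape.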

## 🧪 API Quota Notes

The YouTube Data API v3 has a free quota of 10,000 units/day:

| Operation | Cost |
| --- | --- |
| videos.list | 1 unit |
| channels.list | 1 unit |
| commentThreads.list | 1 unit |
| comments.list | 1 unit |
| playlistItems.list | 1 unit |

A full scrape of one video (metadata + comments + channel) costs roughly 5–50 units depending on comment volume. You can comfortably scrape 200–2000 videos/day on the free tier.
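
The 5–50 range follows from pagination: each `*.list` call above costs 1 unit, and the comment endpoints return at most 100 items per call. A back-of-the-envelope helper (an approximation for planning, not the tool's actual accounting):

```python
import math


def estimate_units(comment_count, reply_threads=0):
    """Rough per-video quota estimate: 1 unit each for videos.list and
    channels.list, 1 unit per commentThreads.list page (up to 100 threads
    per page), plus 1 unit per comments.list call expanding a reply thread."""
    comment_pages = math.ceil(comment_count / 100) if comment_count else 0
    return 2 + comment_pages + reply_threads


# e.g. 500 comments with 20 reply threads expanded: 2 + 5 + 20 = 27 units
```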


βš–οΈ Ethical Use & Legal Disclaimer

⚠️ This tool is for educational, research, and legitimate personal use only.

  • βœ… Academic research, content strategy, accessibility improvements
  • βœ… Scraping your own channel's data for analytics
  • βœ… Aggregating public data for sentiment analysis or NLP research
  • ❌ Mass harvesting personal data for spam or harassment
  • ❌ Violating YouTube's Terms of Service for commercial re-distribution
  • ❌ Building competing platforms with scraped content

Always:

  • Respect YouTube's Terms of Service and API Terms
  • Use reasonable delays and respect rate limits
  • Handle personal data in compliance with GDPR / CCPA

The author assumes no liability for misuse of this tool.


## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License; see the LICENSE file for details.


Built with ❤️ by @facingshootingstar

Made for personal learning and portfolio purposes.
