
# 🎬 YouTube Transcript & Comment Scraper

A production-ready Python tool to extract full video transcripts, captions, comments, replies, and channel data from any YouTube video or playlist. Supports batch processing, multi-language transcripts, and exports to JSON/CSV.

Python 3.11+ · YouTube Data API v3 · License: MIT

Features · Quick Start · Usage · API Quota


## 📌 About This Project

A personal automation project exploring YouTube's data ecosystem using the official API and transcript libraries. This tool demonstrates API integration, data pipeline design, batch processing, and structured data export, built as a hands-on learning exercise and portfolio showcase.


## 🎯 What It Does

| Data Layer | What You Get |
| --- | --- |
| Transcripts | Full captions/subtitles with timestamps, word count, language detection |
| Comments | All top-level comments with author, text, likes, timestamps |
| Replies | Nested reply threads under each comment |
| Video Metadata | Title, description, tags, duration, views, likes, category, thumbnails |
| Channel Data | Subscriber count, total videos, country, keywords, channel description |
| Playlists | Automatic resolution: scrape every video in a playlist |

## ✨ Key Features

- 🎬 **Full Transcript Extraction** – captions with precise timestamps; auto-generated and manual subtitle support
- 💬 **Deep Comment Scraping** – top-level comments AND nested replies with like counts
- 📊 **Rich Video Metadata** – views, likes, duration, tags, category, thumbnails
- 📡 **Channel Intelligence** – subscriber count, video count, country, branding keywords
- 📋 **Playlist Resolution** – auto-detect playlists and scrape every contained video
- 🌐 **Multi-Language Transcripts** – specify preferred languages with automatic fallback
- 📁 **Flexible Export** – JSON, CSV, or both, structured for immediate analysis
- 🔄 **Batch Processing** – feed a text file of URLs for hands-free bulk scraping
- 🛡️ **Resilient Design** – exponential backoff, retry logic, configurable rate limiting
- ⚡ **Zero-Browser Architecture** – pure API + library approach; fast, lightweight, no Selenium

## 📦 Tech Stack

| Component | Version |
| --- | --- |
| Python | 3.11+ |
| google-api-python-client | ≥ 2.149.0 |
| youtube-transcript-api | ≥ 0.6.3 |
| python-dotenv | ≥ 1.1.0 |
| rich | ≥ 13.9.0 |
| pandas | ≥ 2.2.0 |

## 📂 Project Structure

```
YouTube-Transcript-Comment-Scraper/
├── main.py                  # CLI entry point (argparse)
├── scraper.py               # Core scraping engine (transcripts, comments, metadata)
├── config.py                # Centralised configuration (.env loader)
├── requirements.txt         # Pinned Python dependencies
├── .env.example             # Environment variable template
├── .gitignore               # Git ignore rules
├── README.md                # This file
├── utils/
│   ├── __init__.py
│   └── helpers.py           # URL parsing, file I/O, retry logic, display
└── output/                  # Generated: JSON/CSV exports land here
    ├── videos/
    └── playlists/
```
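
The playlist auto-detection relies on URL parsing of the kind `utils/helpers.py` handles. A minimal sketch using only the standard library (the function name `extract_ids` and the exact URL forms covered are assumptions, not the repo's actual implementation):

```python
from urllib.parse import urlparse, parse_qs


def extract_ids(url):
    """Return (video_id, playlist_id) parsed from a YouTube URL; either may be None."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    video_id = None
    playlist_id = query.get("list", [None])[0]      # present on playlist URLs
    if parsed.hostname == "youtu.be":
        video_id = parsed.path.lstrip("/")          # short links: youtu.be/<id>
    elif parsed.path == "/watch":
        video_id = query.get("v", [None])[0]        # standard watch URLs
    elif parsed.path.startswith(("/shorts/", "/embed/")):
        video_id = parsed.path.split("/")[2]        # shorts and embed URLs
    return video_id, playlist_id
```

When `playlist_id` is set and `video_id` is not, the scraper can fall through to playlist resolution; when both are set, scraping just the single video is the safer default.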

## 🚀 Quick Start

### 1. Clone the repository

```sh
git clone https://github.com/facingshootingstar/YouTube-Transcript-Comment-Scraper.git
cd YouTube-Transcript-Comment-Scraper
```

### 2. Create a virtual environment

```sh
python -m venv venv
# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activate
```

### 3. Install dependencies

```sh
pip install -r requirements.txt
```

### 4. Configure environment

```sh
cp .env.example .env
# Edit .env and add your YouTube Data API key
```

### 5. Get a YouTube Data API key (free)

1. Go to the Google Cloud Console
2. Create a new project (or select an existing one)
3. Enable **YouTube Data API v3**
4. Go to Credentials → Create API Key
5. Paste the key into your `.env` file

> **Note:** Transcripts work **without** an API key. The key is only needed for comments, metadata, and playlist resolution.


## 📖 Usage

### Single Video

```sh
python main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
```

### Playlist (auto-detected)

```sh
python main.py "https://www.youtube.com/playlist?list=PLrAXtmErZgOeiKm4sgNOknGvNjby9efdf"
```

### Multiple URLs

```sh
python main.py "URL1" "URL2" "URL3"
```

### Transcript Only (no API key required)

```sh
python main.py --transcript-only "https://youtu.be/dQw4w9WgXcQ"
```

### Batch Processing

```sh
# Create urls.txt with one URL per line
python main.py --batch urls.txt
```

### Custom Output

```sh
# Export as CSV
python main.py --format csv "URL"

# Export as both JSON + CSV
python main.py --format both --output ./my_data "URL"

# Limit comments
python main.py --max-comments 100 "URL"

# Prefer Spanish transcripts
python main.py --languages es,en "URL"
```

### Skip Options

```sh
# No comments
python main.py --no-comments "URL"

# No channel data
python main.py --no-channel "URL"
```

## 📤 Output Examples

### Transcript JSON

```json
{
  "video_id": "dQw4w9WgXcQ",
  "language": "en",
  "is_generated": true,
  "segment_count": 142,
  "segments": [
    {
      "start": 0.0,
      "duration": 2.4,
      "timestamp": "0:00.000",
      "text": "We're no strangers to love"
    }
  ],
  "full_text": "We're no strangers to love ...",
  "word_count": 468
}
```
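
The `segments` array flattens directly to tabular form, which is essentially what the CSV export format produces. A standard-library sketch using the field names from the JSON above (the tool's actual CSV column layout may differ):

```python
import csv
import io

# A transcript shaped like the JSON example above, trimmed to one segment
transcript = {
    "video_id": "dQw4w9WgXcQ",
    "segments": [
        {"start": 0.0, "duration": 2.4, "timestamp": "0:00.000",
         "text": "We're no strangers to love"},
    ],
}

# Flatten segments into CSV: one row per caption line
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["start", "duration", "timestamp", "text"])
writer.writeheader()
writer.writerows(transcript["segments"])
csv_text = buf.getvalue()
```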

### Comment JSON

```json
{
  "comment_id": "UgyL...",
  "author": "Rick Astley Fan",
  "author_channel_id": "UC...",
  "text": "Never gonna give this up!",
  "like_count": 1523,
  "published_at": "2024-03-15T12:00:00Z",
  "is_reply": false,
  "parent_id": null
}
```
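
Because each comment carries `is_reply` and `parent_id`, a flat export can be regrouped into the nested threads described above. A sketch with hypothetical sample data (the thread-building is downstream analysis, not part of the tool itself):

```python
# Flat comment records as exported, minimal fields for illustration
comments = [
    {"comment_id": "c1", "text": "Never gonna give this up!",
     "is_reply": False, "parent_id": None},
    {"comment_id": "c2", "text": "Same here",
     "is_reply": True, "parent_id": "c1"},
]

# Index top-level comments, then attach each reply under its parent
threads = {c["comment_id"]: {**c, "replies": []}
           for c in comments if not c["is_reply"]}
for c in comments:
    if c["is_reply"] and c["parent_id"] in threads:
        threads[c["parent_id"]]["replies"].append(c)
```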

## 🔧 Configuration Reference

All settings can be configured via `.env` or CLI flags:

| Variable | Default | Description |
| --- | --- | --- |
| YOUTUBE_API_KEY | – | YouTube Data API v3 key |
| OUTPUT_DIR | output | Base output directory |
| OUTPUT_FORMAT | json | `json`, `csv`, or `both` |
| MAX_COMMENTS | 500 | Max comments per video (0 = unlimited) |
| MAX_REPLIES | 50 | Max replies per top-level comment |
| COMMENT_SORT | relevance | `relevance` or `time` |
| TRANSCRIPT_LANGS | en,en-US | Preferred transcript languages |
| REQUEST_DELAY | 0.5 | Seconds between API calls |
| MAX_RETRIES | 3 | Retry attempts on failure |
| LOG_LEVEL | INFO | Logging verbosity |
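
A sketch of how `config.py` might map these variables to typed defaults. This version uses only the standard library; the real module loads `.env` via python-dotenv first, and the exact names and type coercions here are assumptions:

```python
import os

# Defaults mirror the table above; values exported in the environment
# (or loaded from .env by python-dotenv) override them.
CONFIG = {
    "OUTPUT_DIR": os.getenv("OUTPUT_DIR", "output"),
    "OUTPUT_FORMAT": os.getenv("OUTPUT_FORMAT", "json"),
    "MAX_COMMENTS": int(os.getenv("MAX_COMMENTS", "500")),
    "MAX_REPLIES": int(os.getenv("MAX_REPLIES", "50")),
    "REQUEST_DELAY": float(os.getenv("REQUEST_DELAY", "0.5")),
    "MAX_RETRIES": int(os.getenv("MAX_RETRIES", "3")),
}
```

Coercing to `int`/`float` at load time keeps type errors at startup rather than mid-scrape.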

## 🧪 API Quota Notes

The YouTube Data API v3 has a free quota of 10,000 units/day:

| Operation | Cost |
| --- | --- |
| videos.list | 1 unit |
| channels.list | 1 unit |
| commentThreads.list | 1 unit |
| comments.list | 1 unit |
| playlistItems.list | 1 unit |

A full scrape of one video (metadata + comments + channel) costs roughly 5–50 units depending on comment volume. You can comfortably scrape 200–2000 videos/day on the free tier.
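
The 5–50 range follows from pagination: each `*.list` call above costs 1 unit, and the comment endpoints return at most 100 items per call. A back-of-the-envelope helper (an approximation for planning, not the tool's actual accounting):

```python
import math


def estimate_units(comment_count, reply_threads=0):
    """Rough per-video quota estimate: 1 unit each for videos.list and
    channels.list, 1 unit per commentThreads.list page (up to 100 threads
    per page), plus 1 unit per comments.list call expanding a reply thread."""
    comment_pages = math.ceil(comment_count / 100) if comment_count else 0
    return 2 + comment_pages + reply_threads


# e.g. 500 comments with 20 reply threads expanded: 2 + 5 + 20 = 27 units
```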


βš–οΈ Ethical Use & Legal Disclaimer

⚠️ This tool is for educational, research, and legitimate personal use only.

  • βœ… Academic research, content strategy, accessibility improvements
  • βœ… Scraping your own channel's data for analytics
  • βœ… Aggregating public data for sentiment analysis or NLP research
  • ❌ Mass harvesting personal data for spam or harassment
  • ❌ Violating YouTube's Terms of Service for commercial re-distribution
  • ❌ Building competing platforms with scraped content

Always:

  • Respect YouTube's Terms of Service and API Terms
  • Use reasonable delays and respect rate limits
  • Handle personal data in compliance with GDPR / CCPA

The author assumes no liability for misuse of this tool.


## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License; see the LICENSE file for details.


Built with ❤️ by @facingshootingstar

Made for personal learning and portfolio purposes.
