# Example YouTube Data Pipeline

This notebook walks through a simple data pipeline that ingests, processes, validates, versions, and stores YouTube data, as discussed in [https://bradleyboehmke.github.io/uc-bana-7085/04-dataops-build.html#requirements](https://bradleyboehmke.github.io/uc-bana-7085/04-dataops-build.html#requirements).

## Requirements

In [1]:
import os

from dataops_utils import (
    ingest_channel_video_ids,
    ingest_video_stats,
    ingest_video_transcript,
)
from dotenv import load_dotenv

In [2]:
# Load the API key from the YOUTUBE_API_KEY environment variable (or a .env file)
load_dotenv()
API_KEY = os.getenv("YOUTUBE_API_KEY")

# If the variable is not set, paste your own API key here
if API_KEY is None:
    API_KEY = "INSERT_YOUR_YOUTUBE_API_KEY"

BASE_URL = "https://www.googleapis.com/youtube/v3"
CHANNEL_ID = 'UCgUueMmSpcl-aCTt5CuCKQw'

## Data Ingestion

In [3]:
# Ingest YouTube video IDs
video_ids = ingest_channel_video_ids(API_KEY, CHANNEL_ID)

# Example of what the first record looks like
video_ids[0]

01/08/2025 04:21:37 PM: 248 video IDs have been ingested.


{'channel_id': 'UCgUueMmSpcl-aCTt5CuCKQw',
 'video_id': 'fSHh01YT0-Q',
 'datetime': '2025-01-07T18:50:32Z',
 'title': 'Tiger Woods hits the ball off the heel.'}
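
The internals of `ingest_channel_video_ids` are not shown in this notebook. As a rough sketch of how `BASE_URL` might be used, the snippet below builds a request URL for the YouTube Data API v3 `search` endpoint. The helper name `build_search_url` and the choice of parameters are assumptions for illustration, not the actual `dataops_utils` implementation.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/youtube/v3"


def build_search_url(api_key, channel_id, page_token=None, max_results=50):
    """Build a `search` request URL listing a channel's videos (hypothetical helper)."""
    params = {
        "key": api_key,
        "channelId": channel_id,
        "part": "snippet,id",
        "order": "date",
        "type": "video",
        "maxResults": max_results,  # the API allows at most 50 results per page
    }
    if page_token:
        # Pass the nextPageToken from the previous response to request the next page
        params["pageToken"] = page_token
    return f"{BASE_URL}/search?{urlencode(params)}"


url = build_search_url("INSERT_YOUR_YOUTUBE_API_KEY", "UCgUueMmSpcl-aCTt5CuCKQw")
```

Since a single page holds at most 50 results, collecting all 248 IDs would presumably require looping over pages and following `nextPageToken` until the response no longer includes one.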

In [4]:
# Ingest YouTube video statistics
video_data = ingest_video_stats(video_ids, API_KEY)

# Example of the stats collected for the first video
video_data[0]

[00:46<00:00] 248/248 | 100%|██████████  5.37it/s
01/08/2025 04:22:23 PM: Stats for 248 video IDs have been ingested.


{'channel_id': 'UCgUueMmSpcl-aCTt5CuCKQw',
 'video_id': 'fSHh01YT0-Q',
 'datetime': '2025-01-07T18:50:32Z',
 'title': 'Tiger Woods hits the ball off the heel.',
 'views': '31413',
 'likes': '1109',
 'comments': '16'}
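
Note that `views`, `likes`, and `comments` arrive as strings in the record above. A minimal sketch of how a later processing step might cast them to integers is shown below. The function name `cast_stats` and the handling of missing values are assumptions, not part of `dataops_utils`.

```python
def cast_stats(record, fields=("views", "likes", "comments")):
    """Return a copy of `record` with count fields converted from str to int."""
    cleaned = dict(record)  # shallow copy so the raw record is left untouched
    for field in fields:
        value = cleaned.get(field)
        # Assumption: a missing or empty count (e.g. comments disabled) becomes None
        cleaned[field] = int(value) if value not in (None, "") else None
    return cleaned


example = {
    "video_id": "fSHh01YT0-Q",
    "views": "31413",
    "likes": "1109",
    "comments": "16",
}
casted = cast_stats(example)
```

Working on a copy keeps the raw ingested record available for versioning, which may be useful if the processed and raw data are stored separately.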

In [5]:
# Ingest YouTube video transcripts
video_data = ingest_video_transcript(video_data)

# Example of the final raw data that includes
# video ID, title, date, stats, and transcript
video_data[0]

[03:56<00:00] 248/248 | 100%|██████████  1.05it/s


{'channel_id': 'UCgUueMmSpcl-aCTt5CuCKQw',
 'video_id': 'fSHh01YT0-Q',
 'datetime': '2025-01-07T18:50:32Z',
 'title': 'Tiger Woods hits the ball off the heel.',
 'views': '31413',
 'likes': '1109',
 'comments': '16',
 'transcript': "over the course of my career I've always hit the ball off the heel this is one of the reason why I still use my old te's O Okay so if I'm going to hit one in play that's how high you would Tee It Up if I have to hit one in play if I have to hit one out of the wind okay I like that what's wrong with that wow that's right at the pole this keeps you on top and Swinging more left and as I said I would try and hit it for me over a course of my career I've always hit ball off the heel"}
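
With the raw record complete, a basic structural check can catch incomplete rows before any downstream step. The sketch below is one possible approach, not the notebook's validation stage: the expected keys are taken from the record shown above, and the function name `find_problems` and the specific checks are assumptions.

```python
EXPECTED_KEYS = {
    "channel_id", "video_id", "datetime", "title",
    "views", "likes", "comments", "transcript",
}


def find_problems(record):
    """Return a list of human-readable problems found in one raw record."""
    problems = []
    missing = EXPECTED_KEYS - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    transcript = record.get("transcript")
    # Assumption: a usable record has a non-empty string transcript
    if not isinstance(transcript, str) or not transcript.strip():
        problems.append("transcript is empty or not a string")
    return problems


good = {key: "x" for key in EXPECTED_KEYS}
bad = {key: "x" for key in EXPECTED_KEYS if key != "transcript"}
```

Calling `find_problems(good)` returns an empty list, while `find_problems(bad)` reports both the missing key and the unusable transcript.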

## Data Processing

## Final Data Validation

## Data Versioning & Storage

## Computing Environment

In [6]:
import sys

print(f'Python version: {sys.version}', end='\n\n')

with open('dataops-requirements.txt', 'r') as file:
    for line in file:
        print(line.strip())


Python version: 3.13.1 | packaged by conda-forge | (main, Dec  5 2024, 21:18:03) [Clang 18.1.8 ]

jupyterlab==4.1.6
matplotlib==3.8.0
polars==0.20.21
python-dotenv==0.21.0
tqdm==4.63.0
youtube_transcript_api==0.6.2
