# Example YouTube Data Pipeline

This notebook walks through a simple data pipeline that ingests, processes, validates, versions, and stores YouTube data, as discussed in [https://bradleyboehmke.github.io/uc-bana-7085/04-dataops-build.html#hands-on-example-a-youtube-data-pipeline](https://bradleyboehmke.github.io/uc-bana-7085/04-dataops-build.html#hands-on-example-a-youtube-data-pipeline).

## Requirements

In [1]:
import great_expectations as gx
import os
import numpy as np
import pandas as pd
import unicodedata

from dataops_utils import (
    ingest_channel_video_ids,
    ingest_video_stats,
    ingest_video_transcript,
)
from dotenv import load_dotenv

In [2]:
# Load the API key from an environment variable (e.g. defined in a local .env file)
load_dotenv()
API_KEY = os.getenv('YOUTUBE_API_KEY')

# If the variable is not set, replace this placeholder with your own API key
if API_KEY is None:
    API_KEY = "INSERT_YOUR_YOUTUBE_API_KEY"

BASE_URL = "https://www.googleapis.com/youtube/v3"
CHANNEL_ID = 'UCgUueMmSpcl-aCTt5CuCKQw'

## Data Ingestion

In [3]:
# Ingest YouTube video IDs for the channel
video_ids = ingest_channel_video_ids(API_KEY, CHANNEL_ID)

# Example of what the first record looks like
video_ids[0]

01/12/2025 12:35:05 PM: 247 video IDs have been ingested.


{'channel_id': 'UCgUueMmSpcl-aCTt5CuCKQw',
 'video_id': 'fSHh01YT0-Q',
 'datetime': '2025-01-07T18:50:32Z',
 'title': 'Tiger Woods hits the ball off the heel.'}
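
The implementation of `ingest_channel_video_ids` lives in `dataops_utils` and is not shown in this notebook. As a rough sketch of how a record like the one above could be built, the example below assumes the helper reads pages from the YouTube Data API v3 `search` endpoint. The function name `parse_search_items` and the sample page are hypothetical and only illustrate the mapping from response fields to record keys.

```python
from typing import Any


def parse_search_items(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten one page of a `search` response into simple records.

    Hypothetical helper: the real `dataops_utils` code may differ.
    """
    records = []
    for item in response.get("items", []):
        # Search results can also contain channels and playlists,
        # which have no `videoId`; keep videos only.
        video_id = item.get("id", {}).get("videoId")
        if video_id is None:
            continue
        snippet = item.get("snippet", {})
        records.append(
            {
                "channel_id": snippet.get("channelId"),
                "video_id": video_id,
                "datetime": snippet.get("publishedAt"),
                "title": snippet.get("title"),
            }
        )
    return records


# Hand-written sample page that mirrors the record shown above
sample_page = {
    "items": [
        {
            "id": {"kind": "youtube#video", "videoId": "fSHh01YT0-Q"},
            "snippet": {
                "channelId": "UCgUueMmSpcl-aCTt5CuCKQw",
                "publishedAt": "2025-01-07T18:50:32Z",
                "title": "Tiger Woods hits the ball off the heel.",
            },
        },
        # A non-video result that should be skipped
        {"id": {"kind": "youtube#channel", "channelId": "UCgUueMmSpcl-aCTt5CuCKQw"}},
    ]
}

records = parse_search_items(sample_page)
```

Keeping the parsing separate from the HTTP call makes it possible to test the record shape without an API key or network access.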

In [4]:
# Ingest YouTube video statistics (views, likes, comments)
video_data = ingest_video_stats(video_ids, API_KEY)

# Example of the stats collected for the first video
video_data[0]

[00:42<00:00] 247/247 | 100%|██████████  5.87it/s
01/12/2025 12:35:47 PM: Stats for 247 video IDs have been ingested.


{'channel_id': 'UCgUueMmSpcl-aCTt5CuCKQw',
 'video_id': 'fSHh01YT0-Q',
 'datetime': '2025-01-07T18:50:32Z',
 'title': 'Tiger Woods hits the ball off the heel.',
 'views': '59334',
 'likes': '1881',
 'comments': '30'}
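
Note that the counts above are strings, which is how the YouTube Data API v3 returns the `statistics` object of a video resource. The sketch below shows one way such a record could be assembled, assuming `ingest_video_stats` reads from the `videos` endpoint. The helper name `merge_video_stats` is hypothetical and not part of `dataops_utils`.

```python
from typing import Any


def merge_video_stats(record: dict[str, Any], video_item: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `record` extended with view, like, and comment counts.

    Hypothetical helper. Counts are kept as strings here, matching the raw
    output above; conversion to numbers happens in the processing step.
    """
    stats = video_item.get("statistics", {})
    return {
        **record,
        "views": stats.get("viewCount"),
        # These keys can be absent (e.g. hidden likes or disabled comments),
        # so fall back to None instead of raising a KeyError.
        "likes": stats.get("likeCount"),
        "comments": stats.get("commentCount"),
    }


base_record = {"video_id": "fSHh01YT0-Q", "title": "Tiger Woods hits the ball off the heel."}
sample_item = {"statistics": {"viewCount": "59334", "likeCount": "1881", "commentCount": "30"}}

merged = merge_video_stats(base_record, sample_item)
partial = merge_video_stats(base_record, {"statistics": {"viewCount": "10"}})
```

Missing counts surface later as missing values in the DataFrame, where `dropna()` removes the affected rows.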

In [5]:
# Ingest YouTube video transcripts
video_data = ingest_video_transcript(video_data)

# Example of the final raw data that includes
# video ID, title, date, stats, and transcript
video_data[0]

[04:06<00:00] 247/247 | 100%|██████████  1.00it/s


{'channel_id': 'UCgUueMmSpcl-aCTt5CuCKQw',
 'video_id': 'fSHh01YT0-Q',
 'datetime': '2025-01-07T18:50:32Z',
 'title': 'Tiger Woods hits the ball off the heel.',
 'views': '59334',
 'likes': '1881',
 'comments': '30',
 'transcript': "over the course of my career I've always hit the ball off the heel this is one of the reason why I still use my old te's O Okay so if I'm going to hit one in play that's how high you would Tee It Up if I have to hit one in play if I have to hit one out of the wind okay I like that what's wrong with that wow that's right at the pole this keeps you on top and Swinging more left and as I said I would try and hit it for me over a course of my career I've always hit ball off the heel"}
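
The transcript above is a single string, whereas transcript sources such as `youtube_transcript_api` typically return a list of timed segments. The sketch below shows one plausible way to join segments into one string. It assumes each segment is a dict with a `text` key; the function name `flatten_transcript` is hypothetical, and the actual logic inside `ingest_video_transcript` is not shown in this notebook.

```python
from typing import Any


def flatten_transcript(segments: list[dict[str, Any]]) -> str:
    """Join the text of timed transcript segments into a single string."""
    # Strip each segment and drop empty ones so the join
    # does not produce doubled spaces.
    parts = (str(segment.get("text", "")).strip() for segment in segments)
    return " ".join(part for part in parts if part)


# Hand-written sample segments (timings are made up for illustration)
sample_segments = [
    {"text": "over the course of my career", "start": 0.0, "duration": 2.0},
    {"text": "  ", "start": 2.0, "duration": 0.5},
    {"text": "I've always hit the ball off the heel", "start": 2.5, "duration": 2.5},
]

transcript = flatten_transcript(sample_segments)
```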

## Data Processing

In [6]:
raw_data = pd.DataFrame(video_data)
raw_data.head()

|   | channel_id | video_id | datetime | title | views | likes | comments | transcript |
|---|---|---|---|---|---|---|---|---|
| 0 | UCgUueMmSpcl-aCTt5CuCKQw | fSHh01YT0-Q | 2025-01-07T18:50:32Z | Tiger Woods hits the ball off the heel. | 59334 | 1881 | 30 | over the course of my career I've always hit t... |
| 1 | UCgUueMmSpcl-aCTt5CuCKQw | erzLT7fy2r0 | 2025-01-07T17:56:07Z | Tiger Woods liked my golf swing! | 115716 | 4212 | 72 | what's wrong with that yeah that came off you ... |
| 2 | UCgUueMmSpcl-aCTt5CuCKQw | 3O08SnyZ88U | 2025-01-07T17:07:04Z | Tiger Woods teaches me how to hit it straight! | 482074 | 15750 | 124 | what did you do in your career when you had a ... |
| 3 | UCgUueMmSpcl-aCTt5CuCKQw | E6gPs8E4138 | 2025-01-07T17:00:07Z | Tiger Woods Gives Me a Golf Lesson | 1664852 | 71879 | 4097 | all right guys today's the day today is the da... |
| 4 | UCgUueMmSpcl-aCTt5CuCKQw | WktrsZC8VJI | 2024-12-26T20:00:02Z | It’s Coming to an End. (Episode 7) | 519212 | 16328 | 1193 | we are headed to our final video on my channel... |


In [7]:
# Remove rows with missing data
cleaned_data = raw_data.dropna()

# Remove duplicate rows
cleaned_data = cleaned_data.drop_duplicates()

# Convert the count columns to numeric; values that cannot be parsed become NaN
for col in ['views', 'likes', 'comments']:
    cleaned_data[col] = pd.to_numeric(cleaned_data[col], errors='coerce')

# Remove any observations that have invalid datetime values
cleaned_data['datetime'] = pd.to_datetime(cleaned_data['datetime'], errors='coerce')
cleaned_data = cleaned_data.dropna(subset=['datetime'])

# Remove any observations where the views value is more than 3 standard deviations
# below the mean
mean_views = cleaned_data['views'].mean()
std_views = cleaned_data['views'].std()
cleaned_data = cleaned_data[cleaned_data['views'] >= (mean_views - 3 * std_views)]

# Remove any observations where the transcript length is more than 3 standard deviations
# below the mean transcript length
cleaned_data['transcript_length'] = cleaned_data['transcript'].apply(lambda x: len(x) if pd.notnull(x) else 0)
mean_transcript_length = cleaned_data['transcript_length'].mean()
std_transcript_length = cleaned_data['transcript_length'].std()
cleaned_data = cleaned_data[cleaned_data['transcript_length'] >= (mean_transcript_length - 3 * std_transcript_length)]

# Clean the title and transcript columns by normalizing Unicode text and
# dropping any characters that have no ASCII equivalent
def clean_text(text):
    if isinstance(text, str):
        return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    return text

cleaned_data['title'] = cleaned_data['title'].apply(clean_text)
cleaned_data['transcript'] = cleaned_data['transcript'].apply(clean_text)

cleaned_data.head()

|   | channel_id | video_id | datetime | title | views | likes | comments | transcript | transcript_length |
|---|---|---|---|---|---|---|---|---|---|
| 0 | UCgUueMmSpcl-aCTt5CuCKQw | fSHh01YT0-Q | 2025-01-07 18:50:32+00:00 | Tiger Woods hits the ball off the heel. | 59334 | 1881 | 30 | over the course of my career I've always hit t... | 483 |
| 1 | UCgUueMmSpcl-aCTt5CuCKQw | erzLT7fy2r0 | 2025-01-07 17:56:07+00:00 | Tiger Woods liked my golf swing! | 115716 | 4212 | 72 | what's wrong with that yeah that came off you ... | 208 |
| 2 | UCgUueMmSpcl-aCTt5CuCKQw | 3O08SnyZ88U | 2025-01-07 17:07:04+00:00 | Tiger Woods teaches me how to hit it straight! | 482074 | 15750 | 124 | what did you do in your career when you had a ... | 646 |
| 3 | UCgUueMmSpcl-aCTt5CuCKQw | E6gPs8E4138 | 2025-01-07 17:00:07+00:00 | Tiger Woods Gives Me a Golf Lesson | 1664852 | 71879 | 4097 | all right guys today's the day today is the da... | 16046 |
| 4 | UCgUueMmSpcl-aCTt5CuCKQw | WktrsZC8VJI | 2024-12-26 20:00:02+00:00 | Its Coming to an End. (Episode 7) | 519212 | 16328 | 1193 | we are headed to our final video on my channel... | 56511 |
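
Two details of the cleaning cell are easy to miss, and the short check below illustrates both using only the standard library. First, view counts are heavily right-skewed, so the lower bound `mean - 3 * std` can be negative, in which case the filter removes nothing. Second, `clean_text` drops characters that have no ASCII decomposition, which is why the curly apostrophe in "It’s" disappears in row 4. The five view counts are the ones shown in the table above, so the numbers only illustrate the effect and are not the statistics of the full dataset.

```python
import statistics
import unicodedata

# View counts of the five rows displayed above (not the full dataset)
views = [59334, 115716, 482074, 1664852, 519212]

# statistics.stdev is the sample standard deviation, which matches
# the default behavior of pandas' Series.std() (ddof=1).
lower_bound = statistics.mean(views) - 3 * statistics.stdev(views)
kept = [v for v in views if v >= lower_bound]


def clean_text(text):
    """Same logic as the notebook cell: normalize, then drop non-ASCII."""
    if isinstance(text, str):
        return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return text


# U+2019 (right single quotation mark) has no ASCII decomposition, so it is dropped
cleaned_title = clean_text("It\u2019s Coming to an End. (Episode 7)")

# U+00E9 decomposes into "e" plus a combining accent, so the base letter survives
cleaned_word = clean_text("caf\u00e9")

print(f"lower bound: {lower_bound:,.0f}, rows kept: {len(kept)} of {len(views)}")
print(cleaned_title)
```

If dropping punctuation is not desired, one option is to map characters such as curly quotes to their ASCII counterparts before calling `clean_text`.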


## Final Data Validation

In [3]:
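# Note: this shell command fails in this environment (see the output below)
# because no `great_expectations` executable is found on the PATH; the data
# context is created from Python in the next cell instead.
!great_expectations init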


zsh:1: command not found: great_expectations


In [2]:
context = gx.get_context()

01/12/2025 01:01:59 PM: Could not find local file-backed GX project
01/12/2025 01:01:59 PM: Created temporary directory '/var/folders/8f/c06lv6q17tjbyjv2nkt0_s4s1sh0tg/T/tmp8zi9mhka' for ephemeral docs site
01/12/2025 01:01:59 PM: Loading 'datasources' ->
[]
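
The notebook stops after creating the data context, so no expectations are defined here. As a library-free stand-in (plain pandas, not the Great Expectations API), the sketch below shows the kind of checks a validation step might run on the cleaned data. The function name `run_basic_checks` and the specific rules are assumptions chosen for illustration, not the author's validation suite.

```python
import pandas as pd


def run_basic_checks(df: pd.DataFrame) -> dict[str, bool]:
    """Run a few illustrative data-quality checks and report pass/fail."""
    return {
        "video_id_not_null": bool(df["video_id"].notna().all()),
        "video_id_unique": bool(df["video_id"].is_unique),
        "views_non_negative": bool((df["views"] >= 0).all()),
        "datetime_is_datetime": bool(pd.api.types.is_datetime64_any_dtype(df["datetime"])),
        "transcript_not_empty": bool((df["transcript"].str.len() > 0).all()),
    }


# Small sample shaped like `cleaned_data` (first two rows shown earlier)
sample = pd.DataFrame(
    {
        "video_id": ["fSHh01YT0-Q", "erzLT7fy2r0"],
        "datetime": pd.to_datetime(["2025-01-07T18:50:32Z", "2025-01-07T17:56:07Z"]),
        "views": [59334, 115716],
        "transcript": ["over the course of my career", "what's wrong with that"],
    }
)

results = run_basic_checks(sample)
failed = [name for name, passed in results.items() if not passed]
print("All checks passed" if not failed else f"Failed checks: {failed}")
```

In the notebook, the same function could be called as `run_basic_checks(cleaned_data)` before the data is versioned and stored.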



## Data Versioning & Storage

## Computing Environment

In [4]:
import sys

print(f'Python version: {sys.version}', end='\n\n')

with open('dataops-requirements.txt', 'r') as file:
    for line in file:
        print(line.strip())


Python version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 08:28:27) [Clang 14.0.6 ]

great_expectations==1.3.1
jupyterlab==4.1.6
matplotlib==3.8.0
numpy==1.26.4
pandas<=2.2
python-dotenv==0.21.0
tqdm==4.63.0
youtube_transcript_api==0.6.2
