## Ingest Exploration

This notebook serves as the initial exploration of the YouTube API ingestion process. We'll look at how to interface with the API, what the responses look like, and explore potential options for sequencing batch requests of data.

In [1]:
import os
from datetime import datetime, timezone

import pandas as pd
from tqdm.notebook import tqdm
from dotenv import load_dotenv
from googleapiclient.discovery import build

Using Google's Python API client, we'll build a client object with a YouTube API key obtained via Google Cloud. This object will handle all of our data requests throughout the notebook.

In [2]:
load_dotenv()

API_KEY = os.getenv("YOUTUBE_API_KEY")
if not API_KEY:
    raise RuntimeError("YOUTUBE_API_KEY not set in .env")

In [3]:
youtube = build("youtube", "v3", developerKey=API_KEY)
youtube

<googleapiclient.discovery.Resource at 0x1b2ca60bc90>

We'll start by defining a simple function that pulls a YouTube channel's details based on its handle.

In [4]:
def get_channel_by_handle(handle):
    request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        forHandle=handle
    )
    return request.execute()

In [5]:
test_channel = get_channel_by_handle("@MrBeast")
test_channel

{'kind': 'youtube#channelListResponse',
 'etag': 'tPJzEPwR7eXJ4HwTrdcLcZli-po',
 'pageInfo': {'totalResults': 1, 'resultsPerPage': 5},
 'items': [{'kind': 'youtube#channel',
   'etag': 'YxE5wlRzoYX3ZZrkOxK9u2g6yO0',
   'id': 'UCX6OQ3DkcsbYNE6H8uQQuVA',
   'snippet': {'title': 'MrBeast',
    'description': "SUBSCRIBE FOR A COOKIE!\nNew MrBeast or MrBeast Gaming video every single Saturday at noon eastern time!\nAccomplishments:\n- Raised $20,000,000 To Plant 20,000,000 Trees\n- Removed 30,000,000 pounds of trash from the ocean\n- Helped 2,000 people walk again\n- Helped 1,000 blind people see\n- Helped 1,000 deaf people hear\n- Built wells in Africa\n- Built and gave away 100 houses\n- Adopted every dog in a shelter (twice)\n- Given millions to charity\n- Started my own snack company Feastables\n- Started my own software company Viewstats\n- Started Lunchly, a tasty, better-for-you lunch option\n- Gave away a private island (twice)\n- Gave away 1 million meals\n- I counted to 100k\n- Ra

Based on the shape of `test_channel`, we can see that `youtube.channels().list()`, and likely every list-style client call, returns a dictionary with a "header" of sorts containing *'kind'*, *'etag'*, and *'pageInfo'*. The actual data we want will usually be in the ***'items'*** list.
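Since every later cell reaches into that same envelope, a small helper can make the pattern explicit. This is only a minimal sketch: `extract_items` and `sample_response` are hypothetical names, and the sample dictionary is a hand-trimmed stand-in for the real response shown above, not actual API output.

```python
def extract_items(response):
    """Return the 'items' list from a list-style API response, or [] if absent."""
    if not isinstance(response, dict):
        raise TypeError("expected the API response to be a dict")
    return response.get("items", [])


# Hand-written stand-in for a real response, trimmed to the fields discussed above.
sample_response = {
    "kind": "youtube#channelListResponse",
    "etag": "example-etag",
    "pageInfo": {"totalResults": 1, "resultsPerPage": 5},
    "items": [{"kind": "youtube#channel", "id": "UCX6OQ3DkcsbYNE6H8uQQuVA"}],
}

items = extract_items(sample_response)
# Guard against an empty list instead of indexing [0] blindly.
first_channel = items[0] if items else None
```

Checking for an empty list before indexing matters in practice, because a handle that matches no channel would otherwise surface as an `IndexError` far from the request that caused it.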

In [6]:
test_channel_handles = [
    "@MrBeast", "@veritasium", "@HealthyGamerGG", "@Vanillamacee",
    "@JeffNippard", "@mitocw", "@boilerroom",
]
test_channels = {
    handle: get_channel_by_handle(handle).get('items', [])[0]
    for handle in test_channel_handles
}
test_channels["@MrBeast"]

{'kind': 'youtube#channel',
 'etag': 'YxE5wlRzoYX3ZZrkOxK9u2g6yO0',
 'id': 'UCX6OQ3DkcsbYNE6H8uQQuVA',
 'snippet': {'title': 'MrBeast',
  'description': "SUBSCRIBE FOR A COOKIE!\nNew MrBeast or MrBeast Gaming video every single Saturday at noon eastern time!\nAccomplishments:\n- Raised $20,000,000 To Plant 20,000,000 Trees\n- Removed 30,000,000 pounds of trash from the ocean\n- Helped 2,000 people walk again\n- Helped 1,000 blind people see\n- Helped 1,000 deaf people hear\n- Built wells in Africa\n- Built and gave away 100 houses\n- Adopted every dog in a shelter (twice)\n- Given millions to charity\n- Started my own snack company Feastables\n- Started my own software company Viewstats\n- Started Lunchly, a tasty, better-for-you lunch option\n- Gave away a private island (twice)\n- Gave away 1 million meals\n- I counted to 100k\n- Ran a marathon in the world's largest shoes\n- Survived 50 hours in Antarctica\n- Recreated Squid Game in real life\n- Created the largest competition show 

In [7]:
for channel in test_channels.values():
    snippet = channel.get('snippet', {})
    statistics = channel.get('statistics', {})

    title = snippet.get('title')
    country = snippet.get('country')  # Optional field; None when the channel doesn't set it

    # publishedAt is an ISO 8601 UTC timestamp ending in "Z"
    raw_published = snippet.get('publishedAt')
    published = datetime.fromisoformat(raw_published.replace("Z", "+00:00"))

    # Statistics arrive as strings, so cast them before formatting
    subcount = int(statistics.get('subscriberCount'))
    vidcount = int(statistics.get('videoCount'))

    print(
        f"{title}:\n"
        f"Created: {published:%b %d, %Y} | "
        f"Subscribers: {subcount:,} | "
        f"Videos: {vidcount} | "
        f"Country: {country}\n\n"
    )

MrBeast:
Created: Feb 20, 2012 | Subscribers: 454,000,000 | Videos: 928 | Country: US


Veritasium:
Created: Jul 21, 2010 | Subscribers: 19,700,000 | Videos: 473 | Country: US


HealthyGamerGG:
Created: Jan 17, 2019 | Subscribers: 3,150,000 | Videos: 2397 | Country: US


Vanillamace:
Created: Feb 06, 2024 | Subscribers: 1,450,000 | Videos: 83 | Country: None


Jeff Nippard:
Created: Apr 19, 2014 | Subscribers: 7,980,000 | Videos: 605 | Country: CA


MIT OpenCourseWare:
Created: Oct 11, 2005 | Subscribers: 6,060,000 | Videos: 7838 | Country: US


Boiler Room:
Created: May 10, 2012 | Subscribers: 5,030,000 | Videos: 10183 | Country: GB




Unfortunately, looking up channels by handle means we can only fetch one channel per request, which would be extremely time-consuming at scale. Moreover, we want our data to be a fairly even distribution of YouTube content so that the model generalizes well. It's important to note that our model will not represent *all* YouTube content. The vast majority of videos uploaded to YouTube have views in the single digits and are usually one-off uploads. We want our model to represent the behavior of YouTube channels that upload consistently and are aiming to grow their engagement. For this reason, we have to define the subset of YouTube videos that we would like to represent.
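One possible way around the one-request-per-handle cost, once channel IDs are known, is the `id` parameter of `channels().list()`, which accepts a comma-separated list of IDs. The sketch below only prepares those batches. `chunked` and `build_id_batches` are hypothetical helper names, and the batch size of 50 is an assumption based on the API's `maxResults` cap rather than something verified in this notebook.

```python
def chunked(values, size=50):
    """Yield consecutive slices of `values` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(values), size):
        yield values[start:start + size]


def build_id_batches(channel_ids, size=50):
    """Join each chunk into the comma-separated string the `id` parameter expects."""
    return [",".join(chunk) for chunk in chunked(list(channel_ids), size)]


# Placeholder IDs purely for illustration; real IDs look like 'UCX6OQ3DkcsbYNE6H8uQQuVA'.
example_ids = [f"UC{i:04d}" for i in range(120)]
batches = build_id_batches(example_ids)
```

Each batch string could then be passed as `youtube.channels().list(part="snippet,statistics", id=batch, maxResults=50)`, turning 120 single lookups into three requests.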

## Defining Data Constraints

We want our data to represent the behavioral patterns of YouTube content creators who are currently active and reasonably consistent. To ensure we have enough data to train our model, we'll aim to pull roughly 20,000 to 50,000 videos in total. The videos should be evenly distributed across categories and across subscriber counts. To meet these requirements, we will place the following constraints on the data we gather:

- Data must be evenly split across YouTube video categories.
- Data should be logarithmically distributed across channel subscriber counts.
- YouTube channels must have uploaded at least 10 videos within the past 2 years.
- YouTube channels must have at least 10,000 subscribers.
- YouTube channels must be from the US (for simplicity's sake).
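The channel-level constraints above could be expressed as a simple predicate. This is a rough sketch, not the final pipeline: `meets_constraints` and `subscriber_bucket` are hypothetical names, "past 2 years" is approximated as 730 days, and the order-of-magnitude bucket is just one assumed way to balance channels logarithmically by subscriber count.

```python
from datetime import datetime, timedelta, timezone

MIN_SUBSCRIBERS = 10_000
MIN_RECENT_UPLOADS = 10
RECENT_WINDOW = timedelta(days=730)  # "past 2 years", approximated as 730 days


def meets_constraints(subscribers, country, upload_dates, now=None):
    """Check the subscriber, country, and recent-upload constraints for one channel."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - RECENT_WINDOW
    recent_uploads = sum(1 for uploaded in upload_dates if uploaded >= cutoff)
    return (
        subscribers >= MIN_SUBSCRIBERS
        and country == "US"
        and recent_uploads >= MIN_RECENT_UPLOADS
    )


def subscriber_bucket(subscribers):
    """Order-of-magnitude bucket: 4 for 10,000-99,999, 5 for 100,000-999,999, ..."""
    if subscribers < 1:
        raise ValueError("subscribers must be positive")
    # Counting digits avoids floating-point edge cases at exact powers of ten.
    return len(str(int(subscribers))) - 1


# Synthetic upload history: one video roughly every 30 days before a fixed date.
reference = datetime(2025, 1, 1, tzinfo=timezone.utc)
uploads = [reference - timedelta(days=30 * i) for i in range(12)]
eligible = meets_constraints(25_000, "US", uploads, now=reference)
```

Sampling a similar number of channels from each bucket would give a roughly logarithmic spread, so that channels with tens of millions of subscribers do not crowd out those closer to the 10,000 floor.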

### Categories
YouTube has a built-in *Category* feature for video objects that defines generalized styles of content for the platform. While categories are not an entirely accurate representation of the wide range of niche genres YouTube hosts, they can help us refine our model's subset even further.

In [8]:
# Fetch and display available YouTube video categories for the US region

request = youtube.videoCategories().list(
    part='snippet',
    regionCode="US"
)

response = request.execute()
categories = response.get('items', [])
categories[:5]

[{'kind': 'youtube#videoCategory',
  'etag': 'grPOPYEUUZN3ltuDUGEWlrTR90U',
  'id': '1',
  'snippet': {'title': 'Film & Animation',
   'assignable': True,
   'channelId': 'UCBR8-60-B28hp2BmDPdntcQ'}},
 {'kind': 'youtube#videoCategory',
  'etag': 'Q0xgUf8BFM8rW3W0R9wNq809xyA',
  'id': '2',
  'snippet': {'title': 'Autos & Vehicles',
   'assignable': True,
   'channelId': 'UCBR8-60-B28hp2BmDPdntcQ'}},
 {'kind': 'youtube#videoCategory',
  'etag': 'qnpwjh5QlWM5hrnZCvHisquztC4',
  'id': '10',
  'snippet': {'title': 'Music',
   'assignable': True,
   'channelId': 'UCBR8-60-B28hp2BmDPdntcQ'}},
 {'kind': 'youtube#videoCategory',
  'etag': 'HyFIixS5BZaoBdkQdLzPdoXWipg',
  'id': '15',
  'snippet': {'title': 'Pets & Animals',
   'assignable': True,
   'channelId': 'UCBR8-60-B28hp2BmDPdntcQ'}},
 {'kind': 'youtube#videoCategory',
  'etag': 'PNU8SwXhjsF90fmkilVohofOi4I',
  'id': '17',
  'snippet': {'title': 'Sports',
   'assignable': True,
   'channelId': 'UCBR8-60-B28hp2BmDPdntcQ'}}]

In [9]:
categories = pd.json_normalize(categories)
categories['id'] = categories['id'].astype(int)
categories = (
    categories
    .drop(columns=['kind', 'etag', 'snippet.channelId'])
    .rename(columns={'snippet.title': "title", 'snippet.assignable': "assignable"})
    .set_index('id')
)
categories

id,title,assignable
1,Film & Animation,True
2,Autos & Vehicles,True
10,Music,True
15,Pets & Animals,True
17,Sports,True
18,Short Movies,False
19,Travel & Events,True
20,Gaming,True
21,Videoblogging,False
22,People & Blogs,True


The ***'assignable'*** feature plays an important role for us. This is YouTube's internal method of separating regular, user-submitted, long-form video content from the rest of the content it hosts. YouTube in its current state features movies, TV shows, and trailers, all uploaded by production companies. Additionally, YouTube features "YouTube Shorts", its subplatform for short-form content and a competitor of sorts to TikTok and Instagram Reels. Strangely, *'Videoblogging'* stands out among the non-assignable categories as the only one that represents a regular style of YouTube content. Looking into this further, however, it's likely that this category has been deprecated as YouTube shifted its focus from a video's format to its topic.

Thus, we want our dataset to only include assignable categories.

In [10]:
# Creating new 'trainable' column to indicate if our model will use the category.
categories["trainable"] = categories['assignable']
categories[categories['trainable']]

id,title,assignable,trainable
1,Film & Animation,True,True
2,Autos & Vehicles,True,True
10,Music,True,True
15,Pets & Animals,True,True
17,Sports,True,True
19,Travel & Events,True,True
20,Gaming,True,True
22,People & Blogs,True,True
23,Comedy,True,True
24,Entertainment,True,True


Among this list, a few more categories stand out as potentially too "different" from the average content on YouTube to include in our dataset. *"Music"*, *"News & Politics"*, and *"Nonprofits & Activism"* all include a large percentage of studio- or organization-produced content, and are thus likely to cause a distribution shift away from typical content-creator viewership. *"Film & Animation"* is an interesting case: most content in this category is driven by individual content creators, but upload frequency and viewership patterns are dramatically different. Engagement tends to be consistently higher in this category, but it comes at the cost of high production effort, which greatly slows the creator's upload schedule. This pattern of few uploads and high viewership is too unlike the content we would like our model to generalize to, so it will also be excluded from our dataset.

In [11]:
categories.loc[[1, 10, 25, 29], 'trainable'] = False
categories[categories['trainable']]

id,title,assignable,trainable
2,Autos & Vehicles,True,True
15,Pets & Animals,True,True
17,Sports,True,True
19,Travel & Events,True,True
20,Gaming,True,True
22,People & Blogs,True,True
23,Comedy,True,True
24,Entertainment,True,True
26,Howto & Style,True,True
27,Education,True,True


In [12]:
def get_most_popular_for_category(category_id, max_results=5, page_token=None):
    request = youtube.videos().list(
        part="snippet,contentDetails,statistics",
        chart="mostPopular",
        maxResults=max_results,
        pageToken=page_token,
        regionCode="US",
        videoCategoryId=category_id
    )
    return request.execute()
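The `page_token` argument suggests how larger pulls could be sequenced: list responses may include a `nextPageToken` that is passed back to fetch the next page. Below is a minimal sketch of that loop. `collect_pages`, `fake_fetch`, and `_FAKE_PAGES` are hypothetical, and the fake two-page data exists only so the loop can run without spending API quota.

```python
def collect_pages(fetch_page, max_pages=5):
    """Follow nextPageToken until it disappears or max_pages is reached."""
    items, token = [], None
    for _ in range(max_pages):
        response = fetch_page(token)
        items.extend(response.get("items", []))
        token = response.get("nextPageToken")
        if not token:
            break
    return items


# Fake two-page API used only to exercise the loop offline.
_FAKE_PAGES = {
    None: {"items": [{"id": "a"}, {"id": "b"}], "nextPageToken": "p2"},
    "p2": {"items": [{"id": "c"}]},
}


def fake_fetch(page_token):
    """Stand-in for a real request; returns the canned page for the given token."""
    return _FAKE_PAGES[page_token]


all_items = collect_pages(fake_fetch)
```

With the real client, the callable could be something like `lambda token: get_most_popular_for_category(17, max_results=50, page_token=token)`. The `max_pages` cap is a deliberate safety valve so an exploratory cell cannot burn through quota by accident.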

In [13]:
most_viewed_sports = get_most_popular_for_category(17)
most_viewed_sports

{'kind': 'youtube#videoListResponse',
 'etag': 'AitLZGdvNhj_O0AXUQBloqMRbpM',
 'items': [{'kind': 'youtube#video',
   'etag': 'sHKejdmZGo-T_7iVtSvxhCXMLoU',
   'id': '4KOe2h9OJO4',
   'snippet': {'publishedAt': '2025-12-09T23:09:12Z',
    'channelId': 'UC5QESDRf1F0v1Ig8KHpCVCQ',
    'title': 'The Catch That SAVED The Baseball Game…🤯⚾️',
    'description': '',
    'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/4KOe2h9OJO4/default.jpg',
      'width': 120,
      'height': 90},
     'medium': {'url': 'https://i.ytimg.com/vi/4KOe2h9OJO4/mqdefault.jpg',
      'width': 320,
      'height': 180},
     'high': {'url': 'https://i.ytimg.com/vi/4KOe2h9OJO4/hqdefault.jpg',
      'width': 480,
      'height': 360},
     'standard': {'url': 'https://i.ytimg.com/vi/4KOe2h9OJO4/sddefault.jpg',
      'width': 640,
      'height': 480},
     'maxres': {'url': 'https://i.ytimg.com/vi/4KOe2h9OJO4/maxresdefault.jpg',
      'width': 1280,
      'height': 720}},
    'channelTitle': 'Peakz

We've run into a bit of an issue here. Seemingly all of the 