# Demo: Data Extraction and Cleaning
This notebook demonstrates how to extract video data from the YouTube API using two methods:
<ol>
  <li>Scraped Channels: Scrape channels from articles that list popular cooking channels, then pass those channels to the YouTube API to collect each channel's video data.</li>
  <li>Search Based: Pass a cooking keyword directly to the YouTube API search, then collect recent videos from the highest-ranked channels for each keyword.</li>
</ol>

The first method (scraped channels) guarantees that generally agreed-upon popular cooking YouTube channels are in our dataset, whereas the second method (search based) collects videos from the channels that rank highest for cooking search terms. We believe these two methods were the best way to collect cooking videos for our dataset.

## Packages
These are the imports needed to run data extraction and cleaning.

In [24]:
#general
import pandas as pd
import numpy as np

#website scraping
import requests
from bs4 import BeautifulSoup
import re

#youtube api
import googleapiclient.discovery
from googleapiclient.discovery import build
import urllib.request
import json

#data cleaning
import isodate

#api utils
from typing import List, Set, Dict, Tuple, Optional
from sklearn import linear_model
import os
from googleapiclient.errors import HttpError


## YouTube API Setup
First, get a YouTube API key so that you can extract video data. The function below creates a YouTube API client.

In [121]:
# Individual API key to extract videos via web scraping method. 
# As of Nov 2022, you shouldn't need more than one API key if you are scraping <50 websites.

api_key = '' 

In [122]:
# API List to extract videos via search based method.
# Search based method is expensive in terms of API requests.

api_list = ['',
            '',
            ''
            ]

In [27]:
def make_client(api_key: str) -> object:

    # Creates a YouTube API client for use in subsequent requests
    # User's API key is only needed once to create the client

    yt_client = googleapiclient.discovery.build('youtube', 'v3', developerKey = api_key)

    return yt_client
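
One way to keep keys out of the notebook itself is to read them from an environment variable. The sketch below is only an illustration: the variable name `YT_API_KEYS` and the helper `load_api_keys` are assumptions, not part of the original project.

```python
import os
from typing import List

def load_api_keys(env_var: str = "YT_API_KEYS") -> List[str]:
    """Read a comma-separated list of API keys from an environment variable.

    The variable name YT_API_KEYS is a hypothetical choice for this sketch.
    """
    raw = os.environ.get(env_var, "")
    # Strip whitespace around each key and ignore empty entries
    return [key.strip() for key in raw.split(",") if key.strip()]
```

The first key could then serve as `api_key` for the scraped-channel method, and the full list as `api_list` for the search based method.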

## Method #1: Scraped Channels
This method collects cooking channels that are publicly agreed to be "popular". Most articles don't define popularity, but they appear to rank channels by total view count or total subscriber count.

### Build the Scraper

In [28]:
def scrape_channel_ids(url, id_type):

    # Scrape channel IDs or channel usernames from YouTube URLs on any given website.
    # We used roughly 45 websites that recommended top cooking channels

    # Parameters:
    # url: website URL as a string
    # id_type: the ID type you want to scrape; 'channel_id' or 'channel_username'

    page = requests.get(url) 
    soup = BeautifulSoup(page.content, 'html.parser')
    urls = soup.find_all('a', href=True)

    hrefs=[]
    for item in urls:
        hrefs.append(item.get('href'))

    # Raw strings keep \s as a regex token; the slash after "channel"
    # guarantees that split('/')[4] below always exists
    youtube_urls=[]
    for item in hrefs:
        if id_type == 'channel_id':
            youtube_urls.append(re.findall(r"https://www\.youtube\.com/channel/[^\s?]+", item))
        if id_type == 'channel_username':
            youtube_urls.append(re.findall(r"https://www\.youtube\.com/c/[^\s?]+", item))


    flat_list = [str(item.split('/')[4]) for sublist in youtube_urls for item in sublist] #just the channel ids or usernames

    return flat_list
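
To see what the URL matching does without fetching a web page, the sketch below applies a similar pattern to a few sample links. It uses a capture group instead of `split('/')`, which is a small variation on the scraper above; the helper name `extract_channel_ids`, the username link, and the example.com link are made up for illustration.

```python
import re
from typing import List

# Capture the path segment that follows /channel/ and stop at whitespace, '/' or '?'
CHANNEL_ID_PATTERN = re.compile(r"https://www\.youtube\.com/channel/([^\s/?]+)")

def extract_channel_ids(hrefs: List[str]) -> List[str]:
    """Return the channel IDs found in a list of link targets."""
    ids = []
    for href in hrefs:
        ids.extend(CHANNEL_ID_PATTERN.findall(href))
    return ids

sample_hrefs = [
    "https://www.youtube.com/channel/UCYjk_zY-iYR8YNfJmuzd70A?sub_confirmation=1",
    "https://www.youtube.com/c/SomeUsername",   # username link, not a channel ID
    "https://example.com/about",                # unrelated link
]
print(extract_channel_ids(sample_hrefs))  # ['UCYjk_zY-iYR8YNfJmuzd70A']
```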

### Scrape Websites
In this demo we scrape only 5 websites, but for the full project we scraped roughly 45 websites for channel IDs and usernames. The full list of website URLs is provided below.

In [63]:
# full list of URLs used for project scraping
urls = ['https://www.equipmentnerd.com/top-food-cooking-youtubers-list/',
        'https://www.purewow.com/food/best-cooking-channels-on-youtube',
        'https://www.thrillist.com/entertainment/nation/best-youtube-cooking-channels',
        'https://www.cinemablend.com/television/2561526/great-cooking-channels-to-subscribe-to-on-youtube',
        'https://techboomers.com/best-youtube-cooking-channels',
        'https://www.businessinsider.com/youtube-best-cooking-channels-2017-11',
        'https://www.mob.co.uk/life/best-youtube-food-channels',
        'https://www.netinfluencer.com/top-cooking-channels-on-youtube/',
        'https://www.spark-lang.com/blog/5-great-cooking-youtube-channels-for-english-learners',
        'https://www.cleaneatingkitchen.com/best-youtube-cooking-channels/',
        'https://www.finedininglovers.com/article/best-youtube-food-channels',
        'https://mashable.com/article/best-youtube-cooking-channels',
        'https://www.thekitchn.com/youtube-most-popular-cooking-channels-258119',
        'https://www.mashed.com/146555/ranking-the-most-popular-cooking-channels-on-youtube-from-best-to-worst/',
        'https://camp.com/cooking/cooking-youtube-channels-for-kids',
        'https://www.starterstory.com/cooking-youtube-channels',
        'https://www.scrolldroll.com/best-indian-cooking-channels-on-youtube/',
        'https://www.kobejones.com.au/top-5-youtube-channels-for-japanese-food-lovers/',
        'https://sprintkitchen.com/youtube-cooking-channels/',
        'https://purgula.com/kitchen/best-international-cooking-channels-on-youtube/',
        'https://brentwoodnylibrary.org/adult/history-food-some-historical-cooking-channels-youtube',
        'https://www.foodbeast.com/news/top-youtube-chefs-worth-clicking-subscribe-for/',
        'https://www.gadgetbridge.com/gadget-bridge-ace/best-youtube-channels-to-follow-for-recipes-and-home-cooking/',
        'https://foodzodiac.com/best-indian-cooking-channels-on-youtube/',
        'https://medium.com/creative-landscape-of-youtube/5-awesome-food-shows-on-youtube-that-cook-things-a-bit-differently-814b52ea7355',
        'https://techbanta.com/article/best-youtube-cooking-channels/',
        'https://www.shortform.com/blog/best-cooking-blogs-podcasts-youtube-channels/',
        'https://spoonuniversity.com/lifestyle/quick-and-easy-recipes-on-youtube',
        'https://www.foodforfitness.co.uk/best-youtube-recipes/',
        'https://blog.feedspot.com/food_youtube_channels/',
        'https://hiplatina.com/latin-cooking-channels-tutorials-videos/',
        'https://blog.feedspot.com/home_cooking_youtube_channels/',
        'https://blog.feedspot.com/italian_food_youtube_channels/',
        'https://blog.feedspot.com/japanese_food_youtube_channels/',
        'https://blog.feedspot.com/chinese_food_youtube_channels/',
        'https://blog.feedspot.com/indian_food_youtube_channels/',
        'https://blog.feedspot.com/asian_food_youtube_channels/',
        'https://www.businessinsider.com/the-best-youtube-cooking-channels-2019-10',
        'https://blog.feedspot.com/baking_youtube_channels/',
        'https://blog.feedspot.com/bbq_youtube_channels/',
        'https://blog.feedspot.com/vegetarian_youtube_channels/',
        'https://blog.feedspot.com/vegan_food_youtube_channels/',
        'https://blog.feedspot.com/gluten_free_youtube_channels/',
        'https://blog.feedspot.com/healthy_food_youtube_channels/',
        'https://blog.feedspot.com/street_food_youtube_channels/']

In [64]:
#scrape the first 5 URLs for the demo
urls = urls[0:5]

In [65]:
# get the channel IDs and channel usernames from the URLs

channel_ids_lists = []
channel_username_lists = []

for url in urls:
  channel_ids_lists.append(scrape_channel_ids(url, 'channel_id'))
  channel_username_lists.append(scrape_channel_ids(url, 'channel_username'))

all_channel_ids = [item for sublist in channel_ids_lists for item in sublist]
all_channel_usernames = [item for sublist in channel_username_lists for item in sublist]

unique_channel_ids = list(dict.fromkeys(all_channel_ids))
unique_channel_usernames = list(dict.fromkeys(all_channel_usernames))

In [66]:
print("Unique Channel IDs:",len(unique_channel_ids))
print("Unique Channel Usernames:", len(unique_channel_usernames))

Unique Channel IDs: 64
Unique Channel Usernames: 7
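
The deduplication step relies on `dict.fromkeys`, which drops repeats while keeping the order in which IDs were first scraped. A minimal illustration with made-up IDs:

```python
# Hypothetical scraped IDs; the same channel can appear in several articles
scraped = ["UC_a", "UC_b", "UC_a", "UC_c", "UC_b"]

# dict keys are unique and preserve insertion order
unique_ordered = list(dict.fromkeys(scraped))
print(unique_ordered)  # ['UC_a', 'UC_b', 'UC_c']
```

A plain `set` would also remove duplicates, but it would not keep the scrape order.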


### Video Extraction Functions for YouTube API

In [67]:
def get_channel_stats(youtube, channel_id, id_type):
    
    # Get the YouTube API response based on the IDs you scrape using scrape_channel_ids

    # Parameters:
    # youtube: result of make_client
    # channel_id: scraped ID
    # id_type: 'channel_id' or 'channel_username'

    if id_type == 'channel_id':
        request = youtube.channels().list(
            part = 'snippet,contentDetails,statistics',
            id=channel_id
        )
    elif id_type == 'channel_username':
        request = youtube.channels().list(
            part = 'snippet,contentDetails,statistics',
            forUsername=channel_id
        )
    else:
        # Fail early instead of hitting an undefined response below
        raise ValueError("id_type must be 'channel_id' or 'channel_username'")

    response = request.execute()
    
    return response['items']

def get_video_list(youtube, upload_id):

    # Get a channel's videos based on the upload id

    # Parameters:
    # youtube: result of make_client
    # upload_id:  channel_stats[0]['contentDetails']['relatedPlaylists']['uploads']

    video_list = []
    request = youtube.playlistItems().list(
        part="snippet,contentDetails",
        playlistId=upload_id,
        maxResults=50
    )
    next_page = True
    while next_page:
        response = request.execute()
        data = response['items']

        for video in data:
            video_id = video['contentDetails']['videoId']
            if video_id not in video_list:
                video_list.append(video_id)

        if 'nextPageToken' in response.keys():
            next_page = True
            request = youtube.playlistItems().list(
                part="snippet,contentDetails",
                playlistId=upload_id,
                pageToken=response['nextPageToken'],
                maxResults=50
            )
        else:
            next_page = False

    return video_list

def get_video_details(youtube, video_list):

    # Extract the columns needed for the final data frame based on the video list

    # Parameters:
    # youtube: result of make_client
    # video_list: videos from get_video_list

    stats_list=[]
    for i in range(0, len(video_list), 50):
        request= youtube.videos().list(
            part="snippet,contentDetails,statistics",
            id=video_list[i:i+50]
        )

        data = request.execute()
        for video in data['items']:
            chan_id = video['snippet']['channelId']
            vid_id = video['id']
            vid_name = video['snippet']['title']
            vid_publish_dt = video['snippet']['publishedAt']
            vid_thumb = video['snippet']['thumbnails']['default']['url']
            vid_duration = video['contentDetails']['duration']
            vid_caption = video['contentDetails']['caption']
            vid_viewcount = video['statistics'].get('viewCount',0)
            vid_likecount = video['statistics'].get('likeCount',0)
            vid_commentcount = video['statistics'].get('commentCount',0)
            data_dict=dict(chan_id=chan_id, vid_id=vid_id, vid_name=vid_name, vid_publish_dt=vid_publish_dt,
                          vid_thumb=vid_thumb,vid_duration=vid_duration,vid_caption=vid_caption,vid_viewcount=vid_viewcount,
                          vid_likecount=vid_likecount,vid_commentcount=vid_commentcount)
            stats_list.append(data_dict)

    return stats_list

def create_video_df(youtube, channel_ids, id_type):

    # Input a list of channel IDs, and this function outputs all videos for those channel IDs in the format needed for this project

    # Parameters:
    # youtube: result of make_client
    # channel_ids: list of channel IDs
    # id_type: 'channel_id' or 'channel_username' 

    channel_dfs = []
    vid_dfs = []
    for channel_id in channel_ids:
        try:
            channel_stats = get_channel_stats(youtube, channel_id, id_type)

            channel_dfs.append(pd.json_normalize(channel_stats))

            upload_id = channel_stats[0]['contentDetails']['relatedPlaylists']['uploads']
            video_list = get_video_list(youtube, upload_id)

            video_data = get_video_details(youtube, video_list)
            vid_dfs.append(pd.json_normalize(video_data))
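        except Exception as err:
            # Skip channels that fail (for example, IDs that return no results), but report them
            print(f"Skipping channel {channel_id}: {err!r}")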

    channel_df = pd.concat(channel_dfs)

    vid_df = pd.concat(vid_dfs)

    channel_df = channel_df.rename(columns={'id':'chan_id','snippet.title':'chan_name','statistics.viewCount':'chan_viewcount',
                                            'statistics.subscriberCount':'chan_subcount','snippet.publishedAt':'chan_start_dt',
                                            'snippet.thumbnails.default.url':'chan_thumb','statistics.videoCount':'chan_vidcount'})
  
    channel_df = channel_df[['chan_id','chan_name','chan_viewcount','chan_subcount','chan_start_dt','chan_thumb','chan_vidcount']]

    final_df = vid_df.merge(channel_df, how='left', on='chan_id')

    column_order = ['chan_id','chan_name','chan_viewcount','chan_subcount','chan_start_dt','chan_thumb','chan_vidcount',
                    'vid_id','vid_name','vid_publish_dt','vid_thumb','vid_duration','vid_caption','vid_viewcount','vid_likecount','vid_commentcount']
    
    return final_df[column_order]
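
`get_video_details` requests video IDs in slices of 50. The slicing pattern can be checked in isolation; the helper name `chunked` and the fake video IDs below are illustrative only.

```python
from typing import Iterator, List

def chunked(items: List[str], size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 120 fake video IDs split into request-sized batches
video_ids = [f"vid_{n}" for n in range(120)]
batch_sizes = [len(batch) for batch in chunked(video_ids)]
print(batch_sizes)  # [50, 50, 20]
```

With this batching, 120 videos need 3 `videos().list` calls rather than 120.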


### Extract Videos from YouTube API
For this demo, we'll just use the channel ID list. This will give us enough videos.

In [70]:
# YouTube API Client
youtube = make_client(api_key)

In [71]:
vids_by_channel_id = create_video_df(youtube, all_channel_ids, 'channel_id')

In [72]:
vids_by_channel_id.head()

Unnamed: 0,chan_id,chan_name,chan_viewcount,chan_subcount,chan_start_dt,chan_thumb,chan_vidcount,vid_id,vid_name,vid_publish_dt,vid_thumb,vid_duration,vid_caption,vid_viewcount,vid_likecount,vid_commentcount
0,UCYjk_zY-iYR8YNfJmuzd70A,Epic Meal Time,562726381,6890000,2010-09-29T19:34:09Z,https://yt3.ggpht.com/ytc/AMLnZu-qFeW2GIJmuBxv...,428,7zmtAqjWmOo,Do I Tip when I Eat Out? | Binge Eater Ep.10,2022-11-28T19:00:03Z,https://i.ytimg.com/vi/7zmtAqjWmOo/default.jpg,PT58M53S,False,3887,148,66
1,UCYjk_zY-iYR8YNfJmuzd70A,Epic Meal Time,562726381,6890000,2010-09-29T19:34:09Z,https://yt3.ggpht.com/ytc/AMLnZu-qFeW2GIJmuBxv...,428,WHWc-l5ZpPk,WORLD’s FIRST CANDIED APPLE PIE!!,2022-11-26T16:45:00Z,https://i.ytimg.com/vi/WHWc-l5ZpPk/default.jpg,PT6M21S,False,15724,1004,236
2,UCYjk_zY-iYR8YNfJmuzd70A,Epic Meal Time,562726381,6890000,2010-09-29T19:34:09Z,https://yt3.ggpht.com/ytc/AMLnZu-qFeW2GIJmuBxv...,428,M5I7kBjPv7o,Never Trust an Influencer | Binge Eater Ep.9,2022-11-21T21:25:58Z,https://i.ytimg.com/vi/M5I7kBjPv7o/default.jpg,PT1H19M17S,False,7508,272,58
3,UCYjk_zY-iYR8YNfJmuzd70A,Epic Meal Time,562726381,6890000,2010-09-29T19:34:09Z,https://yt3.ggpht.com/ytc/AMLnZu-qFeW2GIJmuBxv...,428,e9zp3jQy-98,"Rejecting $100,000 He Will Never Get Back | Bi...",2022-11-14T21:11:28Z,https://i.ytimg.com/vi/e9zp3jQy-98/default.jpg,PT1H23M14S,False,7891,266,75
4,UCYjk_zY-iYR8YNfJmuzd70A,Epic Meal Time,562726381,6890000,2010-09-29T19:34:09Z,https://yt3.ggpht.com/ytc/AMLnZu-qFeW2GIJmuBxv...,428,Mdn687guoB8,WATCH YOUR LANGUAGE!! #shorts,2022-11-08T16:41:00Z,https://i.ytimg.com/vi/Mdn687guoB8/default.jpg,PT54S,False,23751,1122,24


In [78]:
vids_by_channel_id.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100737 entries, 0 to 100736
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   chan_id           100737 non-null  object
 1   chan_name         100737 non-null  object
 2   chan_viewcount    100737 non-null  object
 3   chan_subcount     100737 non-null  object
 4   chan_start_dt     100737 non-null  object
 5   chan_thumb        100737 non-null  object
 6   chan_vidcount     100737 non-null  object
 7   vid_id            100737 non-null  object
 8   vid_name          100737 non-null  object
 9   vid_publish_dt    100737 non-null  object
 10  vid_thumb         100737 non-null  object
 11  vid_duration      100737 non-null  object
 12  vid_caption       100737 non-null  object
 13  vid_viewcount     100737 non-null  object
 14  vid_likecount     100737 non-null  object
 15  vid_commentcount  100737 non-null  object
dtypes: object(16)
memory usage: 13.1+ MB
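
The `info()` output shows every column stored as `object`, because the API returns counts as strings. If numeric columns are needed later, one possible conversion is sketched below. This step is not part of the original pipeline, and the two rows are made-up stand-ins (the integer `0` mimics the fallback used when a count is missing).

```python
import pandas as pd

# Tiny stand-in for the extracted data: counts arrive as strings,
# and the .get(..., 0) fallback produces a plain integer
demo = pd.DataFrame({
    "vid_id": ["vid_a", "vid_b"],
    "vid_viewcount": ["3887", "15724"],
    "vid_likecount": ["148", "1004"],
    "vid_commentcount": ["66", 0],
})

count_cols = ["vid_viewcount", "vid_likecount", "vid_commentcount"]

# Convert each count column from object to an integer dtype
demo[count_cols] = demo[count_cols].apply(pd.to_numeric)
print(demo.dtypes)
```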


## Method #2: Search Based
This method queries the YouTube API search with a keyword to find matching channels, then collects each channel's most recent uploads.

### Search Based Functions

In [80]:
def get_playlist(yt_client: object, playlist_id: str, max_vids: int = 50) -> List[str]:

    # Parameters:
    # yt_client -- YouTube API client for requests
    # playlist_id -- ID of a YouTube playlist
    # max_vids -- Number of videos to return, if available (YouTube allows at most 50 per request)

    # Returns:
    # List of video ID strings

    
    # Get initial batch of results

    results = yt_client.playlistItems().list(
        playlistId = playlist_id,
        part = 'snippet',
        maxResults = min(max_vids, 50)  # the API accepts at most 50 results per page
    ).execute()

    video_list = [ video['snippet']['resourceId']['videoId'] for video in results['items'] ]

    max_vids = max_vids - 50

    while ('nextPageToken' in results) and (max_vids > 0):

        # Continue pulling playlist results as long as there is a 'next page'

        results = yt_client.playlistItems().list(
            part = 'snippet',
            playlistId = playlist_id,
            pageToken = results['nextPageToken'],
            maxResults = min(max_vids, 50)  # the API accepts at most 50 results per page
        ).execute()

        for video in results['items']:

            video_list.append(video['snippet']['resourceId']['videoId'])

        max_vids = max_vids - 50
        
    return video_list        


def get_uploads(yt_client: object, channel_id: str, max_vids: int = 50) -> List[str]:

    # Finds the most recent uploads associated with a YouTube channel

    # Parameters:
    # yt_client -- YouTube API client for requests
    # channel_id -- ID of a YouTube channel
    # max_vids -- Number of videos to return, if available (YouTube allows at most 50 per request)

    # Returns:
    # List of video ID strings

    results = yt_client.channels().list(
        part='contentDetails',
        id = channel_id,
    ).execute()

    upload_id = results['items'][0]['contentDetails']['relatedPlaylists']['uploads']

    upload_list = get_playlist(yt_client, upload_id, max_vids = max_vids)

    return upload_list


def get_channel_from_vid(yt_client: object, vid_id: str) -> str:

    # Finds the channel associated with the input video

    # Parameters:
    # yt_client -- YouTube API client for requests
    # vid_id -- ID of a YouTube video

    # Returns:
    # ID string of channel associated with video

    results = yt_client.videos().list(
        part = 'snippet',
        id = vid_id
    ).execute()

    channel_id = results['items'][0]['snippet']['channelId']

    return channel_id
    


def extract_by_query(yt_client: object, query: str, max_channels: int = 50, max_vids: int = 100, excluded_channels: List = []) -> pd.DataFrame:

    # Performs data extraction from YouTube API using a keyword(s) query
    # Quota estimates use max_channels = 50, max_vids = 100

    # Parameters:
    # yt_client -- YouTube API client for requests
    # query -- A string of key words, presumably related to culinary topics
    # max_channels -- the number of channels to survey              
    # max_vids -- the number of videos to pull from each channel  
    # excluded_channels -- optional list of channel ids to exclude from results (excluded channels are still deducted from the max_channels total)  

    # Returns:
    # Pandas dataframe with channel and video features

    chan_cols = [ 'chan_query', 'chan_id', 'chan_name', 'chan_viewcount', 'chan_subcount', 'chan_start_dt', 'chan_thumb', 'chan_vidcount']
    vid_cols = ['vid_id', 'vid_name', 'vid_publish_dt', 'vid_thumb', 'vid_duration', 'vid_caption', 'vid_viewcount', 'vid_likecount', 'vid_commentcount']

    df = pd.DataFrame(columns = chan_cols + vid_cols)

    channel_results = yt_client.search().list(
        part = 'snippet',
        type = 'channel',
        q = query + ' cooking videos',
        maxResults = max_channels
    ).execute()

    # 100 quota for the search

    for channel in channel_results['items']:
        channel_id = channel['id']['channelId']

        if channel_id in excluded_channels:
            continue

        chan_info = yt_client.channels().list(
            part = ['snippet', 'contentDetails', 'statistics', 'topicDetails'],
            id = channel_id
        ).execute()['items'][0]

        # 1 quota x 50 channels = 50 quota

        chan_snip = chan_info['snippet']
        chan_det = chan_info['contentDetails']
        chan_stats = chan_info['statistics']

        # Building dataframe rows, starting with channel features.

        # subscriberCount is absent when a channel hides its subscriber count, so fall back to 0
        chan_values = [ query, channel_id, chan_snip['title'], int(chan_stats['viewCount']), int(chan_stats.get('subscriberCount', 0)), chan_snip['publishedAt'], chan_snip['thumbnails']['default']['url'], int(chan_stats['videoCount']) ]

        chan_uploads_id = chan_det['relatedPlaylists']['uploads']

        # Get the id values for the channel's vids
        # 2 quota (100 vids = 2 x 50) x 50 channels = 100 quota

        # Need to catch upload errors, caused (?) by channels with no videos

        try:

            vid_ids = get_playlist(yt_client, chan_uploads_id, max_vids)

        except Exception:

            print(f"Error retrieving uploads for channel {chan_snip['title']}, ID {channel_id}.")

            continue

        for vid_id in vid_ids:

            vid_info = yt_client.videos().list(
                part = ['contentDetails', 'snippet', 'statistics'],
                id = vid_id
            ).execute()['items'][0]

            # 1 quota x 50 channels x 100 videos = 5000 quota

            vid_snip = vid_info['snippet']
            vid_det = vid_info['contentDetails']
            vid_stats = vid_info['statistics']

            # If comments are turned off, the key is missing 

            if 'commentCount' in vid_stats:
                vid_comment_count = int(vid_stats['commentCount'])
            else:
                vid_comment_count = 0

            # Key for likes can be missing 

            if 'likeCount' in vid_stats:
                vid_like_count = int(vid_stats['likeCount'])
            else:
                vid_like_count = 0

            # Key for views can be missing
                
            if 'viewCount' in vid_stats:
                vid_view_count = int(vid_stats['viewCount'])
            else:
                vid_view_count = 0

            # Finish building rows, add to dataframe

            vid_values = [ vid_id, vid_snip['title'], vid_snip['publishedAt'], vid_snip['thumbnails']['default']['url'], 
                            vid_det['duration'], vid_det['caption'], vid_view_count, vid_like_count, vid_comment_count]

            current_row = len(df.index)+1

            df.loc[current_row,:] = chan_values + vid_values


    # Total quota estimate: 100 + 50 + 100 + 5000 = 5250

    return df

def extract_query(api_key:str, query:str, excluded_chanels:set, compiled_data=None):
    """extracts a dataframe of query results
    Args:
        api_key (str): api key as a string
        query (str): the query for the search
        excluded_chanels (set): a set of channel ids to exclude from the results
        compiled_data (None or pd.DataFrame, optional): data to append results to. Defaults to None.
    Returns:
        tuple(pd.DataFrame, set): updated dataframe with new query results, updated set of excluded channels
    """
    client = make_client(api_key)
    query = query.replace('/', ' ')
    print('Searching query:', query.rstrip('\n'))
    df = extract_by_query(client, query.rstrip('\n'), excluded_channels=list(excluded_chanels), max_channels=50, max_vids=50)
    if compiled_data is None:
        compiled_data = df
    else:
        compiled_data = pd.concat((compiled_data, df), axis=0, ignore_index=True)
    exclude = set(df['chan_id'].unique())
    excluded_chanels = excluded_chanels.union(exclude)
    return compiled_data, excluded_chanels

def api_gen(apis:list):
    """yields api keys one at a time
    Args:
        apis (list): list of api keys as strings
    Yields:
        str: api key
    """
    for api in apis:
        yield api

def extract_all(api_key_list:list, query_list:list, excluded_chanels: set, compiled_data=None, with_terminal=False, intermediate_save_folder=None):
    """extracts queries based on a query list and a list of api keys
    Args:
        api_key_list (list): list of api keys as strings
        query_list (list): a list of queries
        excluded_chanels (set): a set of channel ids to exclude from the search results
        compiled_data (None or pd.DataFrame, optional): data to append results to. Defaults to None.
        with_terminal (bool, optional): whether to prompt for each query so it can be modified or skipped; only for use in a terminal. Defaults to False.
        intermediate_save_folder (None, or str, optional): folder to save results after each query. Defaults to None.
    Returns:
        tuple(pd.DataFrame, set): updated dataframe with new query results, updated set of excluded channels
    """



    api_keys = api_gen(api_key_list)
    api_string = next(api_keys)
    for query in query_list:

        
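        if with_terminal:
            # get_input is an interactive helper that is not defined in this notebook;
            # keep with_terminal=False unless you supply your own implementation
            query = get_input(query.rstrip('\n').replace('/', ' '))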
            if query is None:
                continue
        
        try:
            compiled_data, excluded_chanels = extract_query(api_string, query=query, excluded_chanels=excluded_chanels, compiled_data=compiled_data)
            print('Data Shape:', compiled_data.shape, 'Excluded Channel List len:', len(excluded_chanels))
        except HttpError:
            try:
                api_string = next(api_keys)
                print('='*50)
                print('Query limit reached trying next api key')
                compiled_data, excluded_chanels = extract_query(api_string, query=query, excluded_chanels=excluded_chanels, compiled_data=compiled_data)
                print('Data Shape:', compiled_data.shape, 'Excluded Channel List len:', len(excluded_chanels))
            except StopIteration:
                print(f'Final api ran out on query: {query}, not included in data')
                return compiled_data, excluded_chanels
        
        if intermediate_save_folder is not None:
            compiled_data.to_csv(os.path.join(intermediate_save_folder, f'compiled_data_through_query_{query}.csv'), index=False)
            excluded = pd.Series(list(excluded_chanels))
            excluded.to_csv(os.path.join(intermediate_save_folder, f'excluded_channels_through_query_{query}.csv'), index=False)
            


    return compiled_data, excluded_chanels
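
The quota comments inside `extract_by_query` can be turned into a small calculator. This is a rough sketch based only on the per-call costs stated in those comments (100 units per search, 1 unit per list call); the function name `estimate_query_quota` is hypothetical.

```python
import math

SEARCH_COST = 100  # one search().list call
LIST_COST = 1      # each channels().list, playlistItems().list or videos().list call

def estimate_query_quota(max_channels: int = 50, max_vids: int = 100) -> int:
    """Estimate quota units used by one keyword query."""
    search = SEARCH_COST
    channel_lookups = LIST_COST * max_channels
    # Playlist pages hold at most 50 videos each
    playlist_pages = LIST_COST * math.ceil(max_vids / 50) * max_channels
    # extract_by_query requests video details one video at a time
    video_lookups = LIST_COST * max_channels * max_vids
    return search + channel_lookups + playlist_pages + video_lookups

print(estimate_query_quota(50, 100))  # 5250
print(estimate_query_quota(50, 50))   # 2700
```

Assuming a daily allocation of 10,000 units per key (an assumption about the default quota, not something stated in this notebook), the `max_vids=50` setting used by `extract_query` would allow roughly three full queries per key per day.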


### Get Videos

In [82]:
# For this method, we used a list of search terms related to cooking. 
# For this demo, we will use 4 example words
query_list=['biscuits','snickerdoodle','bbq','muffins']

In [91]:
# excluded channels as a set (empty when starting from scratch)
excluded_channels = set()

# data collected so far; leave data_file blank to start from scratch (data is then None)
data_file = ''
data = pd.read_csv(data_file) if data_file != '' else None

# where to save intermediate results (after each query); if blank, intermediate saves will not happen
intermediate_save_folder = ''

# extract data
finished_data, excluded_channel_set = extract_all(api_list, query_list, excluded_channels, data, with_terminal=False, intermediate_save_folder=intermediate_save_folder if intermediate_save_folder != '' else None)

Query limit reached trying next api key
Searching query: biscuits
Data Shape: (2461, 17) Excluded Channel List len: 50
Searching query: snickerdoodle
Error retrieving uploads for channel Hi, ID UCDSRBKDbyowqZQimprxLHcg.
Data Shape: (4031, 17) Excluded Channel List len: 98
Searching query: bbq
Data Shape: (6308, 17) Excluded Channel List len: 147
Searching query: muffins




Query limit reached trying next api key
Searching query: muffins
Data Shape: (8028, 17) Excluded Channel List len: 190
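
The output above shows the extraction switching to the next API key when a key's quota runs out. The offline sketch below illustrates the general rotation idea with a fake query function. It is not the `extract_all` implementation: it keeps trying keys until one works rather than retrying once, and `QuotaExceeded`, `run_with_key_rotation` and `make_fake_runner` are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

class QuotaExceeded(Exception):
    """Stand-in for googleapiclient.errors.HttpError in this offline sketch."""

def run_with_key_rotation(
    queries: List[str],
    api_keys: List[str],
    run_query: Callable[[str, str], str],
) -> Tuple[List[str], List[str]]:
    """Run each query, moving to the next key whenever the current one is exhausted."""
    keys = iter(api_keys)
    current_key = next(keys, None)
    results, skipped = [], []
    for query in queries:
        while current_key is not None:
            try:
                results.append(run_query(current_key, query))
                break
            except QuotaExceeded:
                current_key = next(keys, None)
        else:
            # No usable key is left for this query
            skipped.append(query)
    return results, skipped

def make_fake_runner(limit: int) -> Callable[[str, str], str]:
    """Build a fake query function where each key serves `limit` queries."""
    usage: Dict[str, int] = {}
    def run_query(key: str, query: str) -> str:
        usage[key] = usage.get(key, 0) + 1
        if usage[key] > limit:
            raise QuotaExceeded(key)
        return f"{query} via {key}"
    return run_query

done, skipped = run_with_key_rotation(
    ["biscuits", "snickerdoodle", "bbq", "muffins"],
    ["key_1", "key_2"],
    make_fake_runner(limit=2),
)
print(done)
print(skipped)
```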


In [92]:
finished_data.head()

Unnamed: 0,chan_query,chan_id,chan_name,chan_viewcount,chan_subcount,chan_start_dt,chan_thumb,chan_vidcount,vid_id,vid_name,vid_publish_dt,vid_thumb,vid_duration,vid_caption,vid_viewcount,vid_likecount,vid_commentcount
0,biscuits,UCtby6rJtBGgUm-2oD_E7bzw,Cooking tree 쿠킹트리,462890217,4640000,2015-05-18T04:00:01Z,https://yt3.ggpht.com/ytc/AMLnZu-raDpPaw-svdkR...,1381,aJ1RVfhLflA,흑설탕 파운드 케이크 만들기 : Dark Brown Sugar Pound Cake ...,2022-11-29T12:00:00Z,https://i.ytimg.com/vi/aJ1RVfhLflA/default.jpg,PT4M33S,True,8157,809,23
1,biscuits,UCtby6rJtBGgUm-2oD_E7bzw,Cooking tree 쿠킹트리,462890217,4640000,2015-05-18T04:00:01Z,https://yt3.ggpht.com/ytc/AMLnZu-raDpPaw-svdkR...,1381,bJXcZ7WcPtc,#83 베이킹 영상 3배속으로 몰아보기 : 3x Speed Baking Video ...,2022-11-27T04:30:10Z,https://i.ytimg.com/vi/bJXcZ7WcPtc/default.jpg,PT12M23S,False,16416,787,15
2,biscuits,UCtby6rJtBGgUm-2oD_E7bzw,Cooking tree 쿠킹트리,462890217,4640000,2015-05-18T04:00:01Z,https://yt3.ggpht.com/ytc/AMLnZu-raDpPaw-svdkR...,1381,4Xr_8g96dAI,한입 와앙~😚 베어 물고 싶은 도지마롤 * 생크림 롤케이크 만들기 : Dojima ...,2022-11-26T04:30:01Z,https://i.ytimg.com/vi/4Xr_8g96dAI/default.jpg,PT7M3S,True,35525,2246,39
3,biscuits,UCtby6rJtBGgUm-2oD_E7bzw,Cooking tree 쿠킹트리,462890217,4640000,2015-05-18T04:00:01Z,https://yt3.ggpht.com/ytc/AMLnZu-raDpPaw-svdkR...,1381,oSsTQe4Nnxc,킨더 초콜릿 우유 쿠키 만들기 : Kinder Chocolate Milk Cooki...,2022-11-24T12:00:37Z,https://i.ytimg.com/vi/oSsTQe4Nnxc/default.jpg,PT4M,True,30866,1851,50
4,biscuits,UCtby6rJtBGgUm-2oD_E7bzw,Cooking tree 쿠킹트리,462890217,4640000,2015-05-18T04:00:01Z,https://yt3.ggpht.com/ytc/AMLnZu-raDpPaw-svdkR...,1381,ZiLnr6QWr7A,사 먹는 것보다 맛있어요!👍🏻 쫄깃한 꽈배기 만들기 : Chewy Twisted D...,2022-11-22T12:00:21Z,https://i.ytimg.com/vi/ZiLnr6QWr7A/default.jpg,PT6M12S,True,34104,2052,44


In [93]:
finished_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8028 entries, 0 to 8027
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   chan_query        8028 non-null   object
 1   chan_id           8028 non-null   object
 2   chan_name         8028 non-null   object
 3   chan_viewcount    8028 non-null   object
 4   chan_subcount     8028 non-null   object
 5   chan_start_dt     8028 non-null   object
 6   chan_thumb        8028 non-null   object
 7   chan_vidcount     8028 non-null   object
 8   vid_id            8028 non-null   object
 9   vid_name          8028 non-null   object
 10  vid_publish_dt    8028 non-null   object
 11  vid_thumb         8028 non-null   object
 12  vid_duration      8028 non-null   object
 13  vid_caption       8028 non-null   object
 14  vid_viewcount     8028 non-null   object
 15  vid_likecount     8028 non-null   object
 16  vid_commentcount  8028 non-null   object
dtypes: object(17)


## Combine Datasets and Clean
We need to remove duplicate videos based on the video ID, videos that are 60 seconds or shorter, and videos that contain '#shorts' in the title. For our project, we decided to remove as many likely Shorts as possible because this video type is new and inherently different from a standard YouTube video. Future research would be needed to create a model that handles both standard long-form videos and Shorts.

### Cleaning Function

In [113]:
def remove_unqualified_videos(df):

    # data cleaning for videos
    # remove videos that are 60 seconds or shorter and any that contain '#shorts' in the title
    # also drop duplicates based on the vid_id

    #parameters: dataframe that you are looking to clean


    df = df[df['vid_duration'].notna()]
    df = df.drop_duplicates(subset='vid_id', keep="first")
    # .copy() avoids a SettingWithCopyWarning when vid_seconds is added below
    df = df[~df['vid_name'].str.contains('#shorts')].copy()
    df['vid_seconds'] = df['vid_duration'].apply(lambda x: isodate.parse_duration(x).total_seconds())
    df = df[df['vid_seconds']>60]
    return df.reset_index(drop=True)
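
To make the duration filter concrete, the sketch below converts the API's ISO 8601 duration strings to seconds using only the standard library. It is a simplified stand-in for `isodate.parse_duration` (it handles only days, hours, minutes and whole seconds), and the helper names are hypothetical. The sample durations come from the tables above.

```python
import re

# Matches durations such as PT58M53S, PT1H19M17S, PT54S or P1DT2H
DURATION_PATTERN = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def duration_to_seconds(duration: str) -> int:
    """Convert a simple ISO 8601 duration string to whole seconds."""
    match = DURATION_PATTERN.match(duration)
    if match is None:
        raise ValueError(f"Unsupported duration: {duration!r}")
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])

def is_long_form(title: str, duration: str) -> bool:
    """Apply the same two rules as the cleaning function to a single video."""
    return "#shorts" not in title and duration_to_seconds(duration) > 60

print(duration_to_seconds("PT4M33S"))                          # 273
print(is_long_form("WATCH YOUR LANGUAGE!! #shorts", "PT54S"))  # False
```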

### Clean the Combined Dataset

In [114]:
final_df = pd.concat([finished_data,vids_by_channel_id])

In [115]:
# For the rows from the scraping method, we fill the chan_query column with 'no query' or whatever is desired
final_df['chan_query'] = final_df['chan_query'].fillna('no query')

In [116]:
final_df.shape

(108765, 17)

In [117]:
cleaned_vid_df = remove_unqualified_videos(final_df)

In [118]:
cleaned_vid_df.shape

(56093, 18)
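
The two shapes printed above give the share of rows removed by cleaning in this demo run. A quick check of the arithmetic, using the row counts exactly as shown:

```python
rows_before = 108765  # final_df.shape[0]
rows_after = 56093    # cleaned_vid_df.shape[0]

dropped = rows_before - rows_after
share_dropped = dropped / rows_before
print(dropped, f"{share_dropped:.1%}")  # 52672 48.4%
```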

In [119]:
cleaned_vid_df.head()

Unnamed: 0,chan_query,chan_id,chan_name,chan_viewcount,chan_subcount,chan_start_dt,chan_thumb,chan_vidcount,vid_id,vid_name,vid_publish_dt,vid_thumb,vid_duration,vid_caption,vid_viewcount,vid_likecount,vid_commentcount,vid_seconds
0,biscuits,UCtby6rJtBGgUm-2oD_E7bzw,Cooking tree 쿠킹트리,462890217,4640000,2015-05-18T04:00:01Z,https://yt3.ggpht.com/ytc/AMLnZu-raDpPaw-svdkR...,1381,aJ1RVfhLflA,흑설탕 파운드 케이크 만들기 : Dark Brown Sugar Pound Cake ...,2022-11-29T12:00:00Z,https://i.ytimg.com/vi/aJ1RVfhLflA/default.jpg,PT4M33S,True,8157,809,23,273.0
1,biscuits,UCtby6rJtBGgUm-2oD_E7bzw,Cooking tree 쿠킹트리,462890217,4640000,2015-05-18T04:00:01Z,https://yt3.ggpht.com/ytc/AMLnZu-raDpPaw-svdkR...,1381,bJXcZ7WcPtc,#83 베이킹 영상 3배속으로 몰아보기 : 3x Speed Baking Video ...,2022-11-27T04:30:10Z,https://i.ytimg.com/vi/bJXcZ7WcPtc/default.jpg,PT12M23S,False,16416,787,15,743.0
2,biscuits,UCtby6rJtBGgUm-2oD_E7bzw,Cooking tree 쿠킹트리,462890217,4640000,2015-05-18T04:00:01Z,https://yt3.ggpht.com/ytc/AMLnZu-raDpPaw-svdkR...,1381,4Xr_8g96dAI,한입 와앙~😚 베어 물고 싶은 도지마롤 * 생크림 롤케이크 만들기 : Dojima ...,2022-11-26T04:30:01Z,https://i.ytimg.com/vi/4Xr_8g96dAI/default.jpg,PT7M3S,True,35525,2246,39,423.0
3,biscuits,UCtby6rJtBGgUm-2oD_E7bzw,Cooking tree 쿠킹트리,462890217,4640000,2015-05-18T04:00:01Z,https://yt3.ggpht.com/ytc/AMLnZu-raDpPaw-svdkR...,1381,oSsTQe4Nnxc,킨더 초콜릿 우유 쿠키 만들기 : Kinder Chocolate Milk Cooki...,2022-11-24T12:00:37Z,https://i.ytimg.com/vi/oSsTQe4Nnxc/default.jpg,PT4M,True,30866,1851,50,240.0
4,biscuits,UCtby6rJtBGgUm-2oD_E7bzw,Cooking tree 쿠킹트리,462890217,4640000,2015-05-18T04:00:01Z,https://yt3.ggpht.com/ytc/AMLnZu-raDpPaw-svdkR...,1381,ZiLnr6QWr7A,사 먹는 것보다 맛있어요!👍🏻 쫄깃한 꽈배기 만들기 : Chewy Twisted D...,2022-11-22T12:00:21Z,https://i.ytimg.com/vi/ZiLnr6QWr7A/default.jpg,PT6M12S,True,34104,2052,44,372.0


In [120]:
cleaned_vid_df.to_csv('demo.csv',index=False)

As this demo shows, nearly half of our rows (about 48%, from 108,765 down to 56,093) were dropped as duplicates and/or Shorts. During our project, we collected more than 1 million rows of unique videos, which took roughly 8 weeks. We typically saw roughly 10-20% of videos dropped because they did not meet our requirements; however, this percentage may change as YouTube encourages creators to publish more Shorts over time.