# Method 1: Web Scraping

Install necessary packages

In [None]:
%pip install beautifulsoup4 requests openpyxl pandas numpy google-api-python-client

Next, we import the packages into the notebook.
1. `pandas`: for managing data through dataframes (https://pandas.pydata.org/docs/)
2. `requests`: for connecting to webpages (needs an "agent" to emulate a web browser) (https://requests.readthedocs.io/en/latest/)
3. `beautifulsoup`: for processing HTML data collected from webpages (https://www.crummy.com/software/BeautifulSoup/)

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import random

# Declare headers for the requests agent
headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  'Connection': 'keep-alive'
}

Let's try extracting the content of this article from Rappler:

[Conflict in Team Duterte: Kaufman and Roque clash](https://www.rappler.com/newsbreak/inside-track/conflict-team-duterte-kaufman-roque-clash/)


In [2]:
# Define article based on the link
# link = 'https://www.rappler.com/newsbreak/fact-check/sara-duterte-face-book-isang-kaibigan/'
link = 'https://www.rappler.com/newsbreak/inside-track/conflict-team-duterte-kaufman-roque-clash/'

# Request the page using the `requests` library
r = requests.get(link, headers=headers)
r

<Response [200]>

Look at the webpage's content

In [3]:
# Inspect the content
r.content



Use `BeautifulSoup` to parse the webpage's HTML source code.

In [4]:
# Use BeautifulSoup to parse the HTML page
soup = BeautifulSoup(r.content, 'html.parser')
soup

<!DOCTYPE html>

<html class="no-js" lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<script>(function(c){c.add('has-js');c.remove('no-js')})(document.documentElement.classList)</script>
<script>
		(function() {
			const userAgent = navigator.userAgent;

			// Check if user agent contains mobile app identifiers
			if (userAgent.includes("RapplerMobileAndroid") ||
				userAgent.includes("RapplerMobileiOS")) {
				document.documentElement.classList.add("r6-app");
				document.body.classList.add("hide-site-header");
			}
		})();
	</script>
<meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots">
<script>window.dataLayer = window.dataLayer || []; window.dataLayer.push( {"type":"post","subtype":"post","context":{"is_front_page":false,"is_singular":true,"is_archive":false,"is_home":false,"is_search":false,"is_404":false,"is_post_type_archive":false,"is_tax":false,"is_article"

We'll extract the following information to create our corpus.
* article link
* article title
* date published
* article's text / body
* *optional*: tags, i.e., related topic

## Extract the webpage's title

In [5]:
soup.title  # with HTML tags
soup.title.text  # text within the HTML tags
title = soup.title.text.strip()  # whitespaces are removed
title

'Conflict in Team Duterte: Kaufman and Roque clash'

## Extract the webpage's date published

In [6]:
# Date published
date_published = soup.find(
  'time',  # HTML element tag
  {
      'class': 'published',
      'datetime': True
  }  # HTML attribute
)
# Actual text within the HTML element
date_published.text.strip()

# If you're after the HTML attribute, use the attribute name for the key
date_published = date_published['datetime']

date_published

'2025-08-01T10:00:00+08:00'
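The `datetime` attribute is an ISO 8601 string, so it can be turned into a proper `datetime` object if you later want to sort or filter by date. The sketch below uses only the standard library; the helper name `parse_timestamp` is an assumption made for illustration and is not part of the notebook.

```python
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp into a timezone-aware datetime."""
    # Python versions before 3.11 do not accept a trailing 'Z',
    # so rewrite it as an explicit UTC offset first.
    if value.endswith('Z'):
        value = value[:-1] + '+00:00'
    return datetime.fromisoformat(value)


published = parse_timestamp('2025-08-01T10:00:00+08:00')
# Convert Manila time (+08:00) to UTC
print(published.astimezone(timezone.utc).isoformat())
# → 2025-08-01T02:00:00+00:00
```

Keeping the original string in the dataframe and parsing only when needed is also a reasonable choice, since the string form survives the Excel export unchanged.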

## Extract the webpage's text

In [7]:
# Extract article text
# element: div
# class: post-single__content entry-content
article_text = soup.find(
  'div',  # HTML element tag
  {
    'class': 'post-single__content entry-content'
  }  # HTML attribute
)
# Returns the HTML code for the article
article_text

# Get all paragraphs (<p>) from the article_text
tagged_lines = article_text.find_all('p')  # Return list of paragraph elements

# Removes the HTML tags
text = ''
for line in tagged_lines:
  untagged_line = line.get_text()
  text += untagged_line + '\n'

# Returns the article text as 1 big string
print(text)

There’s conflict brewing in The Hague, as the lead counsel of former president Rodrigo Duterte and his loyal former spokesperson traded barbs over strategies.
It all started in June when Vice President Sara Duterte, who had just visited her father at the International Criminal Court (ICC), told supporters that former spokesperson Harry Roque had suggested “going through the Dutch laws” to get Duterte out of prison in the interim.
“One of the discussions which was one of your suggestions, Spokes, to go through the Dutch laws — which he said okay, you explore that. And sabi ko (and I said) I can do that while I’m in the Philippines,” the Vice President said on June 3.
It turns out that this suggestion entailed suing the Dutch government for what they claimed was a case of extraordinary rendition. This is a term in international law which alleges that the transfer to a foreign state of a person was illegal, and a case of abduction. The Philippine government said it was following a domesti

## Combine features into dataframe

In [8]:
# Link
# Title
# Date Published
# Article's Text
rappler = pd.DataFrame(
    columns=['title', 'link', 'date_published', 'text']
)
doc_details = [title, link, date_published, text]
rappler.loc[len(rappler)] = doc_details
rappler

Unnamed: 0,title,link,date_published,text
0,Conflict in Team Duterte: Kaufman and Roque clash,https://www.rappler.com/newsbreak/inside-track...,2025-08-01T10:00:00+08:00,"There’s conflict brewing in The Hague, as the ..."


## Export to Excel file

In [9]:
try:
  # Google Colab
  from google.colab import drive  # type: ignore
  drive.mount('/content/drive')

  file_name = '/content/drive/MyDrive/rappler.xlsx'
except ImportError:
  # VS Code / Local machine
  file_name = 'rappler.xlsx'

rappler.to_excel(file_name)
print(f'File saved to {file_name}')

File saved to rappler.xlsx


## Function `extract_article_data`

In [49]:
# Extract data from article based on link
def extract_article_data(link):
  """
  Extracts data from an article based on the provided link.

  Parameters:
    link (str): The URL of the article.

  Returns:
    list: A list containing the extracted article details in the following order:
      - title (str): The title of the article.
      - link (str): The URL of the article.
      - date (str): The date the article was published.
      - text (str): The content of the article.
  """
  # Make a GET request for the article URL
  r = requests.get(link, headers=headers)

  # Parse the HTML
  soup = BeautifulSoup(r.content, 'html.parser')

  # Retrieve doc title
  title = soup.title.text.strip()

  # Rappler-specific: adjust this selector for other sites
  # Retrieve doc date
  date = soup.find("time", {"datetime": True})['datetime']

  # Rappler-specific: adjust this selector for other sites
  # Retrieve article content
  text = ''
  tagged_lines = soup.find("div", {"class": "post-single__content entry-content"}).find_all('p')
  for line in tagged_lines:
    untagged_line = line.get_text()
    text += untagged_line + '\n'

  # Create list containing doc details, in the same order as the dataframe columns
  doc_details = [title, link, date, text]
  return doc_details

In [50]:
# Test the `extract_article_data` function
rappler.loc[len(rappler)] = extract_article_data(
  'https://www.rappler.com/philippines/duterte-lawyer-nicholas-kaufman-meeting-icc-prosecutor-karim-khan-israel/'
)
rappler

Unnamed: 0,title,link,date_published,text
0,Conflict in Team Duterte: Kaufman and Roque clash,https://www.rappler.com/newsbreak/inside-track...,2025-08-01T10:00:00+08:00,"There’s conflict brewing in The Hague, as the ..."
1,"Meeting with ICC prosecutor about Duterte, Isr...",2025-07-17T09:42:59+08:00,https://www.rappler.com/philippines/duterte-la...,"MANILA, Philippines – Nicholas Kaufman, the le..."
2,"Meeting with ICC prosecutor about Duterte, Isr...",https://www.rappler.com/philippines/duterte-la...,2025-07-17T09:42:59+08:00,"MANILA, Philippines – Nicholas Kaufman, the le..."
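The table above lists the same article more than once, which can happen when a cell that appends to the dataframe is run again. One way to guard against this is to drop repeated links before exporting. The sketch below works on plain lists so it runs on its own; `dedupe_by_link` and the sample rows are made up for illustration.

```python
def dedupe_by_link(rows, link_index=1):
    """Return rows with repeated links removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        link = row[link_index]
        if link in seen:
            continue  # this link was already collected
        seen.add(link)
        unique.append(row)
    return unique


# Made-up rows in the order [title, link, date_published, text]
rows = [
    ['Title A', 'https://example.com/a', '2025-08-01', 'text a'],
    ['Title B', 'https://example.com/b', '2025-07-17', 'text b'],
    ['Title B', 'https://example.com/b', '2025-07-17', 'text b'],
]
print(len(dedupe_by_link(rows)))
# → 2
```

On a dataframe, `rappler.drop_duplicates(subset='link')` should give a similar result, provided every row has its values in the right columns.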


## Extracting Multiple Articles

In [51]:
# mother_url = 'https://www.rappler.com/topic/philippine-offshore-gaming-operations/page/'

mother_url = "https://www.rappler.com/wp-json/rappler/v1/ontology-topics/2653059/latest-news?page="
page = 1
page_limit = 5
corpus = pd.DataFrame(columns=['title', 'link', 'date_published', 'text'])

while True:
  # Convert the page number to a string
  page_str = str(page)
  # Form the page URL
  page_url = mother_url + page_str
  print('\nWorking on ' + page_url)

  # Wait a random 1 to 5 seconds before requesting
  time.sleep(random.randint(1, 5))

  # Get the list of articles within the page
  page_r = requests.get(page_url, headers=headers)

  # Check if page request is 200
  if page_r.status_code != 200:
    print('Failed to retrieve page')
    break

  # Check if the response contains valid JSON
  try:
    page_data = page_r.json()
  except ValueError:
    print('Failed to parse JSON')
    break

  # Go through each article to extract and save to the dataframe
  for article in page_data:
    # Focus on the article permalink, skip article if no 'permalink' found
    if 'permalink' not in article:
      continue
    article_link = article['permalink']

    # Clean up article_link
    article_link = article_link.replace('\\/', '/')

    # For each article, invoke `extract_article_data`
    try:
      # Append to a dataframe
      tmp = extract_article_data(article_link)
      print(tmp)
      corpus.loc[len(corpus)] = tmp
    except Exception as error:
      # If there's an extraction error, report it and skip the article
      print(f'Skipping {article_link}: {error}')
      continue

  # Check whether you have reached the page limit
  if page >= page_limit:
    break

  # Go to the next page
  page += 1


Working on https://www.rappler.com/wp-json/rappler/v1/ontology-topics/2653059/latest-news?page=1
["Hindi Ito Marites: The Hague and Filipinos' pursuit of justice", 'https://www.rappler.com/newsbreak/podcasts-videos/video-hindi-ito-marites-hague-filipinos-pursuit-justice/', '2025-08-15T16:00:29+08:00', 'When local courts offer no refuge, or when faced with a formidable foreign adversary, Filipinos have, in recent history, turned to two institutions based in the same city: The Hague in the Netherlands.\nThe international tribunal that in July 2016 ruled in favor of the Philippines in a pleading against China, concerning the West Philippine Sea, was housed in the Permanent Court of Arbitration, whose headquarters is the Peace Palace. Now, former president Rodrigo Duterte is detained in the premises of the International Criminal Court, where he awaits trial for crimes against humanity for his brutal drug war.\nRappler editor-at-large Marites Vitug takes us on a visual tour of these places
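The paging logic above (request a page, stop on failure, stop at the page limit) can be separated from the network call, which makes it easier to try out without hitting the site. The following is a minimal sketch under that assumption: `collect_links` and the `fetch_page` callable are hypothetical names, and the pages are made-up data rather than real responses.

```python
import random
import time


def collect_links(fetch_page, page_limit=5, min_delay=0.0, max_delay=0.0):
    """Collect 'permalink' values from numbered pages.

    fetch_page(page) must return a list of dicts, or None when the page
    is unavailable. Collection stops at the first unavailable page or
    after page_limit pages.
    """
    links = []
    for page in range(1, page_limit + 1):
        # Pause for a random interval so requests are not sent back to back
        time.sleep(random.uniform(min_delay, max_delay))
        articles = fetch_page(page)
        if articles is None:
            break
        for article in articles:
            # Skip entries that have no permalink
            if 'permalink' in article:
                links.append(article['permalink'])
    return links


# Stand-in for the real endpoint: two pages of made-up data, then nothing
fake_pages = {
    1: [{'permalink': 'https://example.com/a'}, {'title': 'no link'}],
    2: [{'permalink': 'https://example.com/b'}],
}
print(collect_links(fake_pages.get))
# → ['https://example.com/a', 'https://example.com/b']
```

With the real site, `fetch_page` would wrap the `requests.get(...)` and `.json()` calls from the loop above, and `min_delay` and `max_delay` would be set to something like 1 and 5 seconds.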

In [52]:
try:
  # Google Colab
  from google.colab import drive  # type: ignore
  drive.mount('/content/drive')

  file_name = '/content/drive/MyDrive/rappler_corpus.xlsx'
except ImportError:
  # VS Code / Local machine
  file_name = 'rappler_corpus.xlsx'

corpus.to_excel(file_name)
print(f'File saved to {file_name}')
corpus

File saved to rappler_corpus.xlsx


Unnamed: 0,title,link,date_published,text
0,Hindi Ito Marites: The Hague and Filipinos' pu...,https://www.rappler.com/newsbreak/podcasts-vid...,2025-08-15T16:00:29+08:00,"When local courts offer no refuge, or when fac..."
1,FACT CHECK: No Guinness record of Duterte as ‘...,https://www.rappler.com/newsbreak/fact-check/n...,2025-08-12T10:00:00+08:00,Claim: Former president Rodrigo Duterte holds ...
2,WATCH: Tuloy ba ang pre-trial hearing ni Duter...,https://www.rappler.com/philippines/video-will...,2025-08-11T19:00:00+08:00,"MANILA, Philippines – May bagong hiling ang de..."
3,Duterte team wants ICC Prosecutor Khan disqual...,https://www.rappler.com/video/daily-wrap/augus...,2025-08-08T22:07:53+08:00,Here are today’s headlines – the latest news i...
4,Duterte team wants Prosecutor Khan disqualifie...,https://www.rappler.com/philippines/duterte-de...,2025-08-08T07:00:00+08:00,"MANILA, Philippines – Prosecutor Karim Khan, o..."
5,FACT CHECK: ICC did not sentence Duterte to re...,https://www.rappler.com/newsbreak/fact-check/i...,2025-08-04T11:30:00+08:00,Claim: The International Criminal Court (ICC) ...
6,FACT CHECK: Photo of barong-clad Duterte insid...,https://www.rappler.com/newsbreak/fact-check/p...,2025-08-02T10:30:00+08:00,Claim: A photo circulating on social media sho...
7,"Kaufman, Roque clash over defense strategy for...",https://www.rappler.com/philippines/august-1-2...,2025-08-01T22:00:52+08:00,Here are today’s headlines – the latest news i...
8,FACT CHECK: ICC deferred decision on Duterte i...,https://www.rappler.com/newsbreak/fact-check/i...,2025-08-01T18:30:00+08:00,Claim: Former president Rodrigo Duterte will b...
9,Conflict in Team Duterte: Kaufman and Roque clash,https://www.rappler.com/newsbreak/inside-track...,2025-08-01T10:00:00+08:00,"There’s conflict brewing in The Hague, as the ..."


# Method 2: Using the YouTube API

## Create YouTube Data API v3 key

Before we can scrape data from YouTube, we need to get an API key. APIs, or Application Programming Interfaces, are ways in which two applications (e.g., Google Colaboratory and YouTube) can talk to each other. You can think of it as the language through which two apps can speak so they can send and receive information from each other.

Some apps (e.g., YouTube and Google services) require an API key, or a string of characters that authenticates a user to access the app through an API.

1.   To get your API key, head to https://console.cloud.google.com/cloud-resource-manager
2.   Click on Create Project, name the project "YouTube scraping," and press "CREATE"
3.   In your Google Developers Console dashboard, click the Navigation Menu (the three lines) at the upper left corner and select "APIs & Services"
4.   When prompted, select the "YouTube scraping" project
5.   You will be directed to the project's dashboard, where you are to click "Explore & Enable APIs"
6.   Search for and select "YouTube Data API v3"
7.   Enable the API and click "Create Credentials"
8.   Select "public data" when asked what kind of data you will access
9.   Find your API key at the "Credentials" tab on the left side of your dashboard (If it's not there, just click "Create Credentials" again and select "API key")

Now, we import the pertinent packages.

`build` creates a resource object that uses the API key to communicate with YouTube.

In [22]:
import pandas as pd
from googleapiclient.discovery import build

In [23]:
api_key = 'YOUR_API_KEY'  # paste your own key here; never share or publish a real key
youtube = build('youtube', 'v3', developerKey=api_key)
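Because an API key works like a password, it is usually safer to keep it out of the notebook itself. One common approach, sketched here, is to read it from an environment variable. The variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are assumptions for illustration, not something the YouTube API requires.

```python
import os


def load_api_key(env_var='YOUTUBE_API_KEY'):
    """Read the API key from an environment variable instead of the notebook."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f'Set the {env_var} environment variable first')
    return api_key


# Demonstration only: set a dummy value so the call below succeeds
os.environ.setdefault('YOUTUBE_API_KEY', 'dummy-key-for-illustration')
api_key = load_api_key()
print(len(api_key) > 0)
# → True
```

The returned value can then be passed as `developerKey` to `build`, exactly as in the cell above.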

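Let's make it our goal to build a corpus based on YouTube comments about Alice Guo. 
Let's start with this link: https://www.youtube.com/watch?v=yfoq-0gGTLM

To get comment thread results from YouTube, we execute the code below, which calls the `commentThreads()` and `list` methods with these arguments:
*   `videoId`, which is the value after `v=` in a video link (the cells below use `xPb4FMfGbos`, a video with more comments)
*   `part`, which we set as "snippet,replies" because these are the only parts of the comment results we are interested in
*   `maxResults`, the number of comment threads returned per page (50 here)
*   `order`, which we set as "time" so that the newest comments come first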

To learn more about how to use the YouTube API to get comment thread results, refer to https://developers.google.com/youtube/v3/docs/commentThreads/list
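The video ID is the value of the `v` parameter in a watch link. As a small sketch, it can be pulled out with the standard library instead of being copied by hand; the helper name `extract_video_id` is hypothetical, and short `youtu.be` links are not handled here.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url):
    """Return the value of the 'v' query parameter of a watch URL, or None."""
    query = parse_qs(urlparse(url).query)
    values = query.get('v')
    return values[0] if values else None


print(extract_video_id('https://www.youtube.com/watch?v=yfoq-0gGTLM'))
# → yfoq-0gGTLM
```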



Now, let's look at the contents of the `video_response` object.

## Extract the YouTube comments

In [26]:
video_id = 'xPb4FMfGbos'
video_response = youtube.commentThreads().list(
  videoId=video_id, part='snippet,replies', maxResults=50,
  order='time'
).execute()
video_response

{'kind': 'youtube#commentThreadListResponse',
 'etag': 'ewHWdkDSbiv0bp1sCDqRRZQlsG4',
 'nextPageToken': 'Z2V0X25ld2VzdF9maXJzdC0tQ2dnSWdBUVZGN2ZST0JJRkNJZ2dHQUFTQlFpSklCZ0FFZ1VJblNBWUFSSUZDSWNnR0FBU0JRaW9JQmdBSWc0S0RBakg5UFMtQmhDUXd1VFVBdw==',
 'pageInfo': {'totalResults': 50, 'resultsPerPage': 50},
 'items': [{'kind': 'youtube#commentThread',
   'etag': '3QTe6t3GBHw3uoh678KrcC8yCeg',
   'id': 'UgxKqA6e6cbuxrTcN854AaABAg',
   'snippet': {'channelId': 'UCE2606prvXQc_noEqKxVJXA',
    'videoId': 'xPb4FMfGbos',
    'topLevelComment': {'kind': 'youtube#comment',
     'etag': 'yb-SxNKCYpBxFlt_X-QqnEYh2As',
     'id': 'UgxKqA6e6cbuxrTcN854AaABAg',
     'snippet': {'channelId': 'UCE2606prvXQc_noEqKxVJXA',
      'videoId': 'xPb4FMfGbos',
      'textDisplay': 'Sept 23, 2026',
      'textOriginal': 'Sept 23, 2026',
      'authorDisplayName': '@farinyeshar.697',
      'authorProfileImageUrl': 'https://yt3.ggpht.com/Ga_lioy2Mz_mMDHcIwTOqgyFKysK2XbBSWbv39XxvbyvN1d69qoEKGLEmKZT4kRvldaRwMy8EQ=s48-c-k-

We can see that the contents are quite long, but if we take the time to analyze it, we'll realize it's really just a nested dictionary with the following keys: `kind`, `etag`, `nextPageToken`, `pageInfo`, and `items`.

The values in `pageInfo` tell us how many results are included in this set--in this case, 50 comments and their corresponding replies. The 50 comments are contained as dictionaries within the `items` key.

To confirm, let's check the length of the `items`.

In [27]:
len(video_response['items'])

50

## Inspecting a single YouTube comment

Now, let's try and see what each item in the response looks like. Let's examine the 1st item in the list.

In [28]:
video_response['items'][0]

{'kind': 'youtube#commentThread',
 'etag': '3QTe6t3GBHw3uoh678KrcC8yCeg',
 'id': 'UgxKqA6e6cbuxrTcN854AaABAg',
 'snippet': {'channelId': 'UCE2606prvXQc_noEqKxVJXA',
  'videoId': 'xPb4FMfGbos',
  'topLevelComment': {'kind': 'youtube#comment',
   'etag': 'yb-SxNKCYpBxFlt_X-QqnEYh2As',
   'id': 'UgxKqA6e6cbuxrTcN854AaABAg',
   'snippet': {'channelId': 'UCE2606prvXQc_noEqKxVJXA',
    'videoId': 'xPb4FMfGbos',
    'textDisplay': 'Sept 23, 2026',
    'textOriginal': 'Sept 23, 2026',
    'authorDisplayName': '@farinyeshar.697',
    'authorProfileImageUrl': 'https://yt3.ggpht.com/Ga_lioy2Mz_mMDHcIwTOqgyFKysK2XbBSWbv39XxvbyvN1d69qoEKGLEmKZT4kRvldaRwMy8EQ=s48-c-k-c0x00ffffff-no-rj',
    'authorChannelUrl': 'http://www.youtube.com/@farinyeshar.697',
    'authorChannelId': {'value': 'UCYR1Gq2JNC9w_ur6ChxOJAg'},
    'canRate': True,
    'viewerRating': 'none',
    'likeCount': 0,
    'publishedAt': '2025-08-07T14:01:01Z',
    'updatedAt': '2025-08-07T14:01:01Z'}},
  'canReply': True,
  'totalReplyC

We can extract the following
- textDisplay
- id
- publishedAt
- textOriginal

Then add additional features
- likeCount
- repliedParentId

In [29]:
# For the original comment
print(
  f'https://www.youtube.com/watch?v={video_id}&lc={video_response["items"][0]["snippet"]["topLevelComment"]["id"]}'
)
print(video_response['items'][0]['snippet']['topLevelComment']['snippet']['textDisplay'])
print(video_response['items'][0]['snippet']['topLevelComment']['snippet']['publishedAt'])
print(video_response['items'][0]['snippet']['topLevelComment']['snippet']['textOriginal'])
print(video_response['items'][0]['snippet']['topLevelComment']['snippet']['likeCount'])
# No repliedParentId

https://www.youtube.com/watch?v=xPb4FMfGbos&lc=UgxKqA6e6cbuxrTcN854AaABAg
Sept 23, 2026
2025-08-07T14:01:01Z
Sept 23, 2026
0
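The long chains of keys above can be wrapped in a small function that turns one item into a flat record. This is only a sketch: `thread_to_row` and its output keys are names chosen for illustration, and the sample item is a trimmed copy of the first item shown earlier.

```python
def thread_to_row(item, video_id):
    """Flatten one commentThreads item into a flat dictionary."""
    top = item['snippet']['topLevelComment']
    snippet = top['snippet']
    return {
        'text_display': snippet['textDisplay'],
        'link': f"https://www.youtube.com/watch?v={video_id}&lc={top['id']}",
        'date_published': snippet['publishedAt'],
        'text': snippet['textOriginal'],
        'like_count': snippet['likeCount'],
        'reply_parent_id': None,  # top-level comments have no parent
    }


# Trimmed copy of the first item in the response, keeping only the keys used
sample_item = {
    'snippet': {
        'topLevelComment': {
            'id': 'UgxKqA6e6cbuxrTcN854AaABAg',
            'snippet': {
                'textDisplay': 'Sept 23, 2026',
                'textOriginal': 'Sept 23, 2026',
                'likeCount': 0,
                'publishedAt': '2025-08-07T14:01:01Z',
            },
        },
    },
}
row = thread_to_row(sample_item, 'xPb4FMfGbos')
print(row['link'])
# → https://www.youtube.com/watch?v=xPb4FMfGbos&lc=UgxKqA6e6cbuxrTcN854AaABAg
```

A list of such dictionaries can be passed straight to `pd.DataFrame`, which uses the keys as column names.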


## Extracting replies from comment

If `totalReplyCount` is greater than 0, we can extract the replies to the comment.
In the `video_response` object, the replies are nested under each item's `replies` key, inside its `comments` list. However, this nested list may not include all replies.

As such, we use the `comments.list` method to get all replies to a comment. 
Refer to the [comments.list](https://developers.google.com/youtube/v3/docs/comments/list) documentation for more information.

In [30]:
video_response['items'][-1]['snippet']['totalReplyCount']

2

In [31]:
len(
  video_response['items'][-1]['replies']['comments']
)

2

Using a similar process, we can also extract the two replies to this comment. We know there are two replies because of the value of the `totalReplyCount` key.

In [32]:
comment_number = -1
total_reply_count = video_response['items'][comment_number]['snippet']['totalReplyCount']

if total_reply_count > 0:
  parent_id = video_response['items'][comment_number]['snippet']['topLevelComment']['id']

  replies = youtube.comments().list(
    part='snippet', parentId=parent_id, maxResults=50
  ).execute()

  for reply in replies['items']:
    print(
      f"https://www.youtube.com/watch?v={video_id}&lc={reply['id']}"
    )
    print(reply['snippet']['textDisplay'])
    print(reply['snippet']['publishedAt'])
    print(reply['snippet']['textOriginal'])
    print(reply['snippet']['likeCount'])
    print(reply['snippet']['parentId'])
    print()

https://www.youtube.com/watch?v=xPb4FMfGbos&lc=UgzEdtXG7-BheRo3KZt4AaABAg.AFvmZZUVmmCAFvqvQY67ej
kaya nga international Court. hindi ito american court
2025-03-21T10:45:08Z
kaya nga international Court. hindi ito american court
1
UgzEdtXG7-BheRo3KZt4AaABAg

https://www.youtube.com/watch?v=xPb4FMfGbos&lc=UgzEdtXG7-BheRo3KZt4AaABAg.AFvmZZUVmmCAFvr1qSDrZd
hindi sinasamba ng ibang european contry ang english language. same sa Japan
2025-03-21T10:46:09Z
hindi sinasamba ng ibang european contry ang english language. same sa Japan
0
UgzEdtXG7-BheRo3KZt4AaABAg
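The call above returns at most one page of replies. For comments with many replies, `comments.list` responses also carry a `nextPageToken`, so the same request can be repeated until no token is left. The function below is a sketch of that idea; `fetch_all_replies` and `page_size` are hypothetical names, and `youtube` is assumed to be the client object created earlier with `build`.

```python
def fetch_all_replies(youtube, parent_id, page_size=100):
    """Return every reply to a top-level comment by following nextPageToken."""
    replies = []
    page_token = None
    while True:
        params = {
            'part': 'snippet',
            'parentId': parent_id,
            'maxResults': page_size,
        }
        # Only send pageToken once we actually have one
        if page_token:
            params['pageToken'] = page_token

        response = youtube.comments().list(**params).execute()
        replies.extend(response.get('items', []))

        # Stop when the response has no further page
        page_token = response.get('nextPageToken')
        if not page_token:
            break
    return replies
```

Calling `fetch_all_replies(youtube, parent_id)` would then replace the single `youtube.comments().list(...).execute()` call, at the cost of one API request per page of replies.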



## Automating the extraction

Now that we know how to extract individual comments from YouTube videos, we need to figure out how we can automate this process so that we can build a corpus of comments without manually getting comments and replies one by one.

To do this, we rely on the following:
1.   the consistency of the nested dictionary results structure of the YouTube API
2.   Python loops
3.   page tokens

Let's work on the first two.


First, we create a list where we will store all scraped data.

In [33]:
comments = []

Then, we use loops to iterate through the contents of `video_response`.

In [34]:
video_id = 'xPb4FMfGbos'

# iterate through items
for item in video_response['items']:

  # extract comment from each item
  comment = item['snippet']['topLevelComment']['snippet']

  # append comment to list of comments
  comments.append([
    comment['textDisplay'],
    f'https://www.youtube.com/watch?v={video_id}&lc={item["snippet"]["topLevelComment"]["id"]}',
    comment['publishedAt'],
    comment['textOriginal'],
    comment['likeCount'],
    np.nan
  ])

  # count number of replies
  total_reply_count = item['snippet']['totalReplyCount']

  # if there is at least one reply
  if total_reply_count > 0:
    parent_id = item["snippet"]["topLevelComment"]["id"]

    replies = youtube.comments().list(
      part='snippet', parentId=parent_id, maxResults=50
    ).execute()

    # iterate through the replies
    for reply in replies['items']:
      # extract text from each reply
      # append reply to list of comments
      replyBody = reply['snippet']
      comments.append([
        replyBody['textDisplay'],
        f"https://www.youtube.com/watch?v={video_id}&lc={reply['id']}",
        replyBody['publishedAt'],
        replyBody['textOriginal'],
        replyBody['likeCount'],
        replyBody['parentId']
      ])

In [35]:
youtube_corpus = pd.DataFrame(
  comments, columns=['title', 'link', 'date_published', 'text', 'like_count', 'reply_parent_id'])
youtube_corpus

Unnamed: 0,title,link,date_published,text,like_count,reply_parent_id
0,"Sept 23, 2026",https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-08-07T14:01:01Z,"Sept 23, 2026",0,
1,When Digong said to the Police who carried his...,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-08-03T20:45:50Z,When Digong said to the Police who carried his...,0,
2,ang tapang nuong hindi pa naaresto..dalian daw...,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-08-01T07:29:49Z,ang tapang nuong hindi pa naaresto..dalian daw...,0,
3,"Life in prison please. <a href=""UCkszU2WH9gy1m...",https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-07-25T22:06:40Z,Life in prison please.,0,
4,Galing mag drama ni digong,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-07-10T14:28:05Z,Galing mag drama ni digong,0,
5,Pilipino ako duterti ako si allan gaTmaitan,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-07-06T02:33:33Z,Pilipino ako duterti ako si allan gaTmaitan,0,
6,Ply gus wag yong kame ioababoy duterti melmnid...,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-07-06T02:31:23Z,Ply gus wag yong kame ioababoy duterti melmnid...,0,
7,"I think this ICC thing looks inaccurate, becau...",https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-06-28T13:24:11Z,"I think this ICC thing looks inaccurate, becau...",0,
8,"For a french judge, that judge hair is a bit u...",https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-05-11T03:15:44Z,"For a french judge, that judge hair is a bit u...",0,
9,THE PHILLIPPINNES Govermennt Is Sourounded be ...,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-05-10T04:23:27Z,THE PHILLIPPINNES Govermennt Is Sourounded be ...,0,


## Handling next page with page tokens

So far, we have only worked with a single page of results.

One request gives us a list of 50 or so comments from the video. 

However, if we check the video's comments section, we'll see that the video has more comments than that.

So how can we extract more comments? That's where `pageToken` comes in.

In [36]:
video_id = 'xPb4FMfGbos'
video_response = youtube.commentThreads().list(
  videoId=video_id, part='snippet,replies', maxResults=50,
  order='time'
).execute()

# iterate through items
# note: `comments` is not reset here, so these rows are added to the ones collected earlier
for item in video_response['items']:

  # extract comment from each item
  comment = item['snippet']['topLevelComment']['snippet']

  # append comment to list of comments
  comments.append([
    comment['textDisplay'],
    f'https://www.youtube.com/watch?v={video_id}&lc={item["snippet"]["topLevelComment"]["id"]}',
    comment['publishedAt'],
    comment['textOriginal'],
    comment['likeCount'],
    np.nan
  ])

  # count number of replies
  total_reply_count = item['snippet']['totalReplyCount']

  # if there is at least one reply
  if total_reply_count > 0:
    parent_id = item["snippet"]["topLevelComment"]["id"]

    replies = youtube.comments().list(
      part='snippet', parentId=parent_id, maxResults=50
    ).execute()

    # iterate through the replies
    for reply in replies['items']:
      # extract text from each reply
      # append reply to list of comments
      replyBody = reply['snippet']
      comments.append([
        replyBody['textDisplay'],
        f"https://www.youtube.com/watch?v={video_id}&lc={reply['id']}",
        replyBody['publishedAt'],
        replyBody['textOriginal'],
        replyBody['likeCount'],
        replyBody['parentId']
      ])

youtube_corpus = pd.DataFrame(
  comments, columns=['title', 'link', 'date_published', 'text', 'like_count', 'reply_parent_id'])
youtube_corpus

Unnamed: 0,title,link,date_published,text,like_count,reply_parent_id
0,"Sept 23, 2026",https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-08-07T14:01:01Z,"Sept 23, 2026",0,
1,When Digong said to the Police who carried his...,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-08-03T20:45:50Z,When Digong said to the Police who carried his...,0,
2,ang tapang nuong hindi pa naaresto..dalian daw...,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-08-01T07:29:49Z,ang tapang nuong hindi pa naaresto..dalian daw...,0,
3,"Life in prison please. <a href=""UCkszU2WH9gy1m...",https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-07-25T22:06:40Z,Life in prison please.,0,
4,Galing mag drama ni digong,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-07-10T14:28:05Z,Galing mag drama ni digong,0,
...,...,...,...,...,...,...
101,Sa airplain palang me medical asssis na po ah,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-03-21T13:18:13Z,Sa airplain palang me medical asssis na po ah,0,
102,The Philippines is ridiculous. To voluntarily ...,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-03-21T11:11:32Z,The Philippines is ridiculous. To voluntarily ...,0,
103,Klasing imternational atty. yan di marunong ma...,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-03-21T10:07:04Z,Klasing imternational atty. yan di marunong ma...,0,
104,kaya nga international Court. hindi ito americ...,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-03-21T10:45:08Z,kaya nga international Court. hindi ito americ...,1,UgzEdtXG7-BheRo3KZt4AaABAg


If we return to the contents of `video_response`, we'll see that one of its keys is the `nextPageToken` key. This key is like an ID that distinguishes the pages within a set of search results--in this case, the comments section of the video.

If we add a `pageToken` argument to our `youtube.commentThreads().list()` call, we get the next page of results instead of the first one.

In [37]:
# Check whether video_response has `nextPageToken`
video_response['nextPageToken']

'Z2V0X25ld2VzdF9maXJzdC0tQ2dnSWdBUVZGN2ZST0JJRkNJa2dHQUFTQlFpSUlCZ0FFZ1VJcUNBWUFCSUZDSWNnR0FBU0JRaWRJQmdCSWc0S0RBakg5UFMtQmhDUXd1VFVBdw=='

In [38]:
# Make the same request but with the `nextPageToken`
video_response_2 = youtube.commentThreads().list(
  videoId=video_id, part='snippet,replies', maxResults=50,
  order='time',
  pageToken=video_response['nextPageToken']
).execute()

In [39]:
video_response_2['items'][0]['snippet']['topLevelComment']['snippet']['textDisplay']

'Bat Si Putin at yung presedente ng israel  hindi inimbistigahan ng ICC na dami rin napatay na inosente sa pagkakaroon ng digmaan, unfair ICC'
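The pattern here (request a page, read its `nextPageToken`, request again with that token) can be written once as a generator. The sketch below is independent of the YouTube client so that it can run on made-up data; `iter_items`, `fetch_page`, and `max_items` are hypothetical names.

```python
def iter_items(fetch_page, max_items=None):
    """Yield items page by page, following nextPageToken until pages run out.

    fetch_page(token) must return a dict with an 'items' list and,
    when more pages exist, a 'nextPageToken' string.
    """
    token = None
    yielded = 0
    while True:
        response = fetch_page(token)
        for item in response.get('items', []):
            yield item
            yielded += 1
            # Stop early once the requested number of items is reached
            if max_items is not None and yielded >= max_items:
                return
        token = response.get('nextPageToken')
        if not token:
            return


# Made-up pages keyed by the token that requests them
fake_pages = {
    None: {'items': ['c1', 'c2'], 'nextPageToken': 'page-2'},
    'page-2': {'items': ['c3']},
}
print(list(iter_items(fake_pages.__getitem__)))
# → ['c1', 'c2', 'c3']
```

With the real API, `fetch_page` would wrap the `youtube.commentThreads().list(...).execute()` call and pass `pageToken` only when a token is given.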

Let's now incorporate `pageToken` into our loop.

In [40]:
# how many comments do we want
no_comments = 500

# re-initialize list of YouTube comments
comments = []
youtube_corpus = None

video_id = 'xPb4FMfGbos'  # more comments
# video_id = 'yfoq-0gGTLM' # fewer comments

# get first page of the comments
video_response = youtube.commentThreads().list(
  videoId=video_id, part='snippet,replies', maxResults=50,
  order='time', moderationStatus='published'
).execute()

while len(comments) < no_comments:
  # iterate through items
  for item in video_response['items']:
    ##### Parent Comments #####
    # extract comment from each item
    comment = item['snippet']['topLevelComment']['snippet']

    # append comment to list of comments
    comments.append([
      comment['textDisplay'],
      f'https://www.youtube.com/watch?v={video_id}&lc={item["snippet"]["topLevelComment"]["id"]}',
      comment['publishedAt'],
      comment['textOriginal'],
      comment['likeCount'],
      np.nan
    ])

    ##### Reply Comments #####
    # count number of replies
    total_reply_count = item['snippet']['totalReplyCount']

    # if there is at least one reply
    if total_reply_count > 0:
      parent_id = item["snippet"]["topLevelComment"]["id"]

      replies = youtube.comments().list(
        part='snippet',
        parentId=parent_id
      ).execute()

      # iterate through the replies
      for reply in replies['items']:
        # extract text from each reply
        # append reply to list of comments
        replyBody = reply['snippet']
        comments.append([
          replyBody['textDisplay'],
          f"https://www.youtube.com/watch?v={video_id}&lc={reply['id']}",
          replyBody['publishedAt'],
          replyBody['textOriginal'],
          replyBody['likeCount'],
          replyBody['parentId']
        ])

  ##### Next Page #####
  # print number of comments
  print(str(len(comments)) + ' comments in list')

  # if there is a next page to the comment result
  if 'nextPageToken' in video_response:
    # notify that next page has been found
    print('Next comment page found. Now extracting data. \n')

    # get the next page
    video_response = youtube.commentThreads().list(
      videoId=video_id, part='snippet,replies', maxResults=50,
      order='time', pageToken=video_response['nextPageToken'],
      moderationStatus='published'
    ).execute()
  else:
    # notify that no more pages are left
    print('No more comment pages left.')
    break

53 comments in list
Next comment page found. Now extracting data. 

110 comments in list
Next comment page found. Now extracting data. 

163 comments in list
Next comment page found. Now extracting data. 

217 comments in list
Next comment page found. Now extracting data. 

270 comments in list
Next comment page found. Now extracting data. 

322 comments in list
Next comment page found. Now extracting data. 

378 comments in list
Next comment page found. Now extracting data. 

431 comments in list
Next comment page found. Now extracting data. 

495 comments in list
Next comment page found. Now extracting data. 

550 comments in list
Next comment page found. Now extracting data. 



In [41]:
youtube_corpus = pd.DataFrame(
  comments, columns=['title', 'link', 'date_published', 'text', 'like_count', 'reply_parent_id'])
youtube_corpus

Unnamed: 0,title,link,date_published,text,like_count,reply_parent_id
0,"Sept 23, 2026",https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-08-07T14:01:01Z,"Sept 23, 2026",0,
1,When Digong said to the Police who carried his...,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-08-03T20:45:50Z,When Digong said to the Police who carried his...,0,
2,ang tapang nuong hindi pa naaresto..dalian daw...,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-08-01T07:29:49Z,ang tapang nuong hindi pa naaresto..dalian daw...,0,
3,"Life in prison please. <a href=""UCkszU2WH9gy1m...",https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-07-25T22:06:40Z,Life in prison please.,0,
4,Galing mag drama ni digong,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-07-10T14:28:05Z,Galing mag drama ni digong,0,
...,...,...,...,...,...,...
545,Bbm amd dds sucks,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-03-16T07:32:44Z,Bbm amd dds sucks,1,
546,❤❤❤,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-03-16T07:30:44Z,❤❤❤,0,
547,Diba sana si late Senator Miriam Santiago-Defe...,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-03-16T07:30:26Z,Diba sana si late Senator Miriam Santiago-Defe...,0,
548,Asan na yong sipa niya😂😂😂,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-03-16T07:29:31Z,Asan na yong sipa niya😂😂😂,0,
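
The `date_published` column holds ISO 8601 strings, and `reply_parent_id` is only filled in for replies. As a minimal sketch of how the table could be prepared for analysis, the block below parses the dates and flags replies. The `sample` frame and the `is_reply` column are illustrative names that are not part of the original notebook; the two rows reuse values shown in this notebook's outputs, with the links left shortened.

```python
import numpy as np
import pandas as pd

# Two illustrative rows in the same layout as `youtube_corpus`
# (the links stay shortened, exactly as they appear in the displayed output)
sample = pd.DataFrame(
  [
    ['Galing mag drama ni digong',
     'https://www.youtube.com/watch?v=xPb4FMfGbos&lc...',
     '2025-07-10T14:28:05Z', 'Galing mag drama ni digong', 0, np.nan],
    ['What country is this? Very interesting.',
     'https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...',
     '2024-08-27T14:16:52Z', 'What country is this? Very interesting.', 0,
     'UgwBBzeGtqWna6y4KUJ4AaABAg'],
  ],
  columns=['title', 'link', 'date_published', 'text', 'like_count', 'reply_parent_id']
)

# parse the ISO 8601 strings into timezone-aware timestamps
sample['date_published'] = pd.to_datetime(sample['date_published'], utc=True)

# a row is a reply when it carries a parent comment ID
sample['is_reply'] = sample['reply_parent_id'].notna()

print(sample[['date_published', 'is_reply']])
```

The same two lines of processing should apply unchanged to the full `youtube_corpus` DataFrame.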


## Function `extract_youtube_comments`

Now, we can define a function that extracts at least a given number of comments (top-level comments and their replies) for a particular video and returns them as a DataFrame.

In [42]:
def extract_youtube_comments(video_id, no_comments):
  """
  Extracts comments from a YouTube video.

  Args:
    video_id (str): The ID of the YouTube video.
    no_comments (int): The minimum number of comments to collect. Paging stops
      after the first page that reaches this count, so the result can hold more rows.

  Returns:
    pandas.DataFrame: A DataFrame containing the extracted comments with the following columns:
      - title (str): The display text of the comment (may contain HTML).
      - link (str): The URL of the comment.
      - date_published (str): The date and time when the comment was published.
      - text (str): The original text of the comment.
      - like_count (int): The number of likes the comment has received.
      - reply_parent_id (str or float): The ID of the parent comment if the comment is a reply, otherwise NaN.
  """
  # re-initialize list of YouTube comments
  comments = []

  # get first page of the comments
  video_response = youtube.commentThreads().list(
    videoId=video_id, part='snippet,replies', maxResults=50,
    order='time', moderationStatus='published'
  ).execute()

  while len(comments) < no_comments:
    # iterate through items
    for item in video_response['items']:
      ##### Parent Comments #####
      # extract comment from each item
      comment = item['snippet']['topLevelComment']['snippet']

      # append comment to list of comments
      comments.append([
        comment['textDisplay'],
        f'https://www.youtube.com/watch?v={video_id}&lc={item["snippet"]["topLevelComment"]["id"]}',
        comment['publishedAt'],
        comment['textOriginal'],
        comment['likeCount'],
        np.nan
      ])

      ##### Reply Comments #####
      # count number of replies
      total_reply_count = item['snippet']['totalReplyCount']

      # if there is at least one reply
      if total_reply_count > 0:
        parent_id = item["snippet"]["topLevelComment"]["id"]

        replies = youtube.comments().list(
          part='snippet',
          parentId=parent_id
        ).execute()

        # iterate through the replies
        for reply in replies['items']:
          # extract text from each reply
          # append reply to list of comments
          replyBody = reply['snippet']
          comments.append([
            replyBody['textDisplay'],
            f"https://www.youtube.com/watch?v={video_id}&lc={reply['id']}",
            replyBody['publishedAt'],
            replyBody['textOriginal'],
            replyBody['likeCount'],
            replyBody['parentId']
          ])

    ##### Next Page #####
    # print number of comments
    print(str(len(comments)) + ' comments in list')

    # if there is a next page to the comment result
    if 'nextPageToken' in video_response:
      # notify that next page has been found
      print('Next comment page found. Now extracting data. \n')

      # get the next page
      video_response = youtube.commentThreads().list(
        videoId=video_id, part='snippet,replies', maxResults=50,
        order='time', pageToken=video_response['nextPageToken'],
        moderationStatus='published'
      ).execute()
    else:
      # notify that no more pages are left
      print('No more comment pages left.\n')
      break

  return pd.DataFrame(
    comments, columns=['title', 'link', 'date_published', 'text', 'like_count', 'reply_parent_id']
  )
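
One limitation worth noting: each call to `youtube.comments().list` returns a single page of replies (the API's default page size is 20, and `maxResults` accepts up to 100), so long reply threads may be cut short. The sketch below shows one way to follow `nextPageToken` for replies as well. `fetch_all_replies` is a hypothetical helper, not part of the original notebook, and it assumes `youtube` is the same API client object used in the cells above.

```python
def fetch_all_replies(youtube, parent_id, page_size=100):
  """Collect every reply to one top-level comment (hypothetical helper)."""
  replies = []
  page_token = None

  while True:
    # build the request parameters; pageToken is only sent after the first page
    params = {'part': 'snippet', 'parentId': parent_id, 'maxResults': page_size}
    if page_token:
      params['pageToken'] = page_token

    response = youtube.comments().list(**params).execute()
    replies.extend(response.get('items', []))

    # stop when the API no longer reports a next page
    page_token = response.get('nextPageToken')
    if not page_token:
      return replies
```

Inside `extract_youtube_comments`, the `replies['items']` loop could then iterate over `fetch_all_replies(youtube, parent_id)` instead. Every extra page is another API call, so this trades quota for completeness.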

In [44]:
# video IDs (the part after `v=` in each video's URL)
video_links = [
  'xPb4FMfGbos',  # more comments
  'yfoq-0gGTLM'  # fewer comments
]

# collect one DataFrame per video, then combine them
frames = []

for video_link in video_links:
  print(f'Extracting comments from video: {video_link}')
  frames.append(extract_youtube_comments(video_link, 750))

youtube_corpus = pd.concat(frames)
youtube_corpus

Extracting comments from video: xPb4FMfGbos
53 comments in list
Next comment page found. Now extracting data. 

110 comments in list
Next comment page found. Now extracting data. 

163 comments in list
Next comment page found. Now extracting data. 

217 comments in list
Next comment page found. Now extracting data. 

270 comments in list
Next comment page found. Now extracting data. 

322 comments in list
Next comment page found. Now extracting data. 

378 comments in list
Next comment page found. Now extracting data. 

431 comments in list
Next comment page found. Now extracting data. 

495 comments in list
Next comment page found. Now extracting data. 

550 comments in list
Next comment page found. Now extracting data. 

619 comments in list
Next comment page found. Now extracting data. 

675 comments in list
Next comment page found. Now extracting data. 

731 comments in list
Next comment page found. Now extracting data. 

789 comments in list
Next comment page found. Now extracting

Unnamed: 0,title,link,date_published,text,like_count,reply_parent_id
0,"Sept 23, 2026",https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-08-07T14:01:01Z,"Sept 23, 2026",0,
1,When Digong said to the Police who carried his...,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-08-03T20:45:50Z,When Digong said to the Police who carried his...,0,
2,ang tapang nuong hindi pa naaresto..dalian daw...,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-08-01T07:29:49Z,ang tapang nuong hindi pa naaresto..dalian daw...,0,
3,"Life in prison please. <a href=""UCkszU2WH9gy1m...",https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-07-25T22:06:40Z,Life in prison please.,0,
4,Galing mag drama ni digong,https://www.youtube.com/watch?v=xPb4FMfGbos&lc...,2025-07-10T14:28:05Z,Galing mag drama ni digong,0,
...,...,...,...,...,...,...
32,Ul*l dignified. Pinapakyuhan nga si Bong Daza ...,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T11:24:46Z,Ul*l dignified. Pinapakyuhan nga si Bong Daza ...,0,UgwBBzeGtqWna6y4KUJ4AaABAg
33,"That vp is rude, entitled and arrogant like he...",https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T13:13:32Z,"That vp is rude, entitled and arrogant like he...",3,UgwBBzeGtqWna6y4KUJ4AaABAg
34,What country is this? Very interesting.,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T14:16:52Z,What country is this? Very interesting.,0,UgwBBzeGtqWna6y4KUJ4AaABAg
35,@@cvoutdoors9859palamunin,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T15:14:26Z,@@cvoutdoors9859palamunin,0,UgwBBzeGtqWna6y4KUJ4AaABAg
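
Notice that the row labels restart for the second video (the last rows are labelled 32 to 35). This happens because `pd.concat` keeps each frame's own index by default. A minimal sketch of the difference, using two small stand-in frames with made-up values:

```python
import pandas as pd

# two stand-in frames (hypothetical values) playing the role of per-video results
first = pd.DataFrame({'text': ['a', 'b'], 'like_count': [0, 1]})
second = pd.DataFrame({'text': ['c'], 'like_count': [3]})

# default behaviour: each frame keeps its own labels, so label 0 appears twice
kept = pd.concat([first, second])

# ignore_index=True relabels the combined rows from 0 to n-1
renumbered = pd.concat([first, second], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

For a frame that has already been combined, `youtube_corpus.reset_index(drop=True)` gives the same unique labels.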




We can now save the combined DataFrame of comments to an Excel file.

In [48]:
try:
  # Google Colab
  from google.colab import drive  # type: ignore
  drive.mount('/content/drive')

  file_name = '/content/drive/MyDrive/youtube_corpus.xlsx'
except ImportError:
  # VS Code / local machine (google.colab is not installed)
  file_name = 'youtube_corpus.xlsx'

youtube_corpus.to_excel(file_name)
print(f'File saved to {file_name}')

File saved to youtube_corpus.xlsx


So far, we've extracted comments for videos whose IDs we typed in by hand. If you want to collect video IDs automatically (for example, every video in a channel, a playlist, or a set of search results), you'll have to use the `channels`, `search`, or `playlistItems` resources of the YouTube Data API. You can study these more through the documentation.
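
As one possible starting point, the sketch below pages through a playlist and collects its video IDs, which could then be passed to `extract_youtube_comments`. `list_playlist_video_ids` is a hypothetical helper that is not part of the original notebook. It assumes `youtube` is the same API client object used in the cells above, and that you supply a real playlist ID.

```python
def list_playlist_video_ids(youtube, playlist_id):
  """Return the IDs of all videos in a playlist (hypothetical helper)."""
  video_ids = []
  page_token = None

  while True:
    # playlistItems.list returns at most 50 items per page
    params = {'part': 'contentDetails', 'playlistId': playlist_id, 'maxResults': 50}
    if page_token:
      params['pageToken'] = page_token

    response = youtube.playlistItems().list(**params).execute()

    # each item stores its video ID under contentDetails
    for item in response.get('items', []):
      video_ids.append(item['contentDetails']['videoId'])

    # stop when the API no longer reports a next page
    page_token = response.get('nextPageToken')
    if not page_token:
      return video_ids
```

The returned list has the same shape as the hand-written list of video IDs used earlier, so the existing extraction loop should work with it as-is.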

However, note that each Google Cloud project has a default allocation of 10,000 quota units per day. Each API call costs a certain number of units (most `list` calls cost 1 unit, while a `search` call costs 100), and once you reach the limit, further requests fail with a quota error until the quota resets the next day. Regular use of YouTube itself is not affected.
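
To get a feel for the budget, here is a rough back-of-the-envelope sketch. The per-call costs are the ones listed in the YouTube Data API quota documentation at the time of writing, so treat them as assumptions and check the current table. The call counts are made-up numbers for illustration, and `estimate_units` is a hypothetical helper.

```python
# quota cost per request (assumed from the API's quota documentation)
COST_PER_CALL = {
  'commentThreads.list': 1,
  'comments.list': 1,
  'search.list': 100,
}
DAILY_QUOTA = 10_000

def estimate_units(calls):
  """Sum the quota cost of a dict mapping method name to number of calls."""
  return sum(COST_PER_CALL[method] * count for method, count in calls.items())

# hypothetical run: 15 comment pages plus 120 reply lookups
usage = estimate_units({'commentThreads.list': 15, 'comments.list': 120})
remaining = DAILY_QUOTA - usage

print(f'{usage} units used, {remaining} units left today')
```

Because the extraction function makes one `comments.list` call for every comment that has replies, reply lookups tend to dominate the cost on busy videos.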

Other APIs you might be interested in exploring:
*   Genius API for extracting song lyrics (https://docs.genius.com/)
*   Python Reddit API Wrapper (PRAW) for extracting Reddit posts (https://praw.readthedocs.io/en/stable/)
*   Tweepy for extracting tweets (though free access to the X API is now very limited) (https://www.tweepy.org/)





# Method 3: Facepager

Facepager is an application made for automating data extraction (especially for Facebook). Follow the installation instructions on this page: https://github.com/strohne/Facepager

To learn more about how to use Facepager, navigate through its wiki page: https://github.com/strohne/Facepager/wiki/Getting-Started