# Method 1: Web Scraping

Install necessary packages

In [1]:
!pip install beautifulsoup4 requests

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com

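Next, we import the packages into the notebook. The three main ones are:
1. `pandas`: for managing data through dataframes (https://pandas.pydata.org/docs/)
2. `requests`: for connecting to webpages (we pass it browser-like headers, i.e., a "user agent," so the site treats the request like one from a web browser) (https://requests.readthedocs.io/en/latest/)
3. `beautifulsoup4` (imported as `bs4`): for processing HTML data collected from webpages (https://www.crummy.com/software/BeautifulSoup/)

We also import `numpy` (for marking missing values), plus `time` and `random` (for pausing between requests).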

In [2]:
# Import libraries
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import random

# Declare headers for the requests agent
headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  'Connection': 'keep-alive'
}

Let's try extracting the content of the article from Rappler:

[Sara Duterte face book isang kaibigan](https://www.rappler.com/newsbreak/fact-check/sara-duterte-face-book-isang-kaibigan/)


In [3]:
# Define article based on the link
link = 'https://www.rappler.com/newsbreak/fact-check/sara-duterte-face-book-isang-kaibigan/'

# Request the page using the `requests` library
r = requests.get(link, headers=headers)
r

<Response [200]>

Look at the webpage's content

In [4]:
# Inspect the content
r.content



Use `BeautifulSoup` to parse the webpage's HTML source code.

In [5]:
# Use BeautifulSoup to parse the HTML page
soup = BeautifulSoup(r.content, 'html.parser')
soup

<!DOCTYPE html>

<html class="no-js" lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<script>(function(c){c.add('has-js');c.remove('no-js')})(document.documentElement.classList)</script>
<meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots">
<script>window.dataLayer = window.dataLayer || []; window.dataLayer.push( {"type":"post","subtype":"post","context":{"is_front_page":false,"is_singular":true,"is_archive":false,"is_home":false,"is_search":false,"is_404":false,"is_post_type_archive":false,"is_tax":false,"is_article":true},"user":{"role":[]},"blog":{"url":"https:\/\/www.rappler.com","id":1},"network":{"url":"https:\/\/www.rappler.com","id":1},"post":{"ID":2759297,"slug":"sara-duterte-face-book-isang-kaibigan","published":"2024-08-22 12:11:07","modified":"2024-08-22 12:11:24","comments":0,"template":"","thumbnail":"https:\/\/www.rappler.com\/tachyon\/2024\/08\/fc-ls-

We'll extract the following information to create our corpus.
* article link
* article title
* date published
* article's text / body
* *optional*: tags, i.e., related topic

## Extract the webpage's title.

In [6]:
soup.title  # with HTML tags
soup.title.text  # text within the HTML tags
title = soup.title.text.strip()  # whitespaces are removed

## Extract the webpage's date published.

In [7]:
# Date published
date_published = soup.find(
  'time',  # HTML element tag
  {
      'class': 'published',
      'datetime': True
  }  # HTML attribute
)
# Actual text within the HTML element
date_published.text.strip()

# If you're after the HTML attribute, use the attribute name for the key
date_published = date_published['datetime']
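The `datetime` attribute is an ISO 8601 string, so it can be turned into a timezone-aware `datetime` object if we later want to sort or filter articles by date. Below is a minimal sketch, assuming the attribute always has the same shape as the value scraped from this article; `raw_date` and `parsed` are illustrative names, not part of the notebook.

```python
from datetime import datetime, timedelta

# Example value in the same format as the `datetime` attribute above
raw_date = '2024-08-22T20:11:07+08:00'

# Parse the ISO 8601 string into a timezone-aware datetime object
parsed = datetime.fromisoformat(raw_date)

print(parsed.date())       # calendar date only
print(parsed.utcoffset())  # offset from UTC as a timedelta
```

Keeping the original string in the dataframe and parsing only when needed avoids losing the timezone information.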

## Extract the webpage's text

In [8]:
# Extract article text
# element: div
# class: post-single__content entry-content
article_text = soup.find(
  'div',  # HTML element tag
  {
      'class': 'post-single__content entry-content'
  }  # HTML attribute
)
# Returns the HTML code for the article
article_text

# Get all paragraph (<p>) from the article_text
tagged_lines = article_text.find_all('p')  # Return list of paragraph elements

# Removes the HTML tags
text = ''
for line in tagged_lines:
  untagged_line = line.get_text()
  text += untagged_line + '\n'

# Returns the article text as 1 big string
print(text)

Claim: The face of Vice President Sara Duterte is not in the children’s book that she wrote titled, “Isang Kaibigan“.
Why we fact-checked this: The claim can be found in an August 21 post on social media platform X (formerly Twitter). The post says: “Yung libro na ni anino ni VPISD eh wala nga dyan” (The book that does not bear even a hint of a shadow of VPISD [Vice President Inday Sara Duterte] is not there).
The post also includes a video from the TikTok account “klc4.0” which — referring to Duterte’s book, Isang Kaibigan — says: “Yung libro ni Sara Duterte na walang mukha niya” (The book of Sara Duterte that does not carry her face).
As of writing, the post on X already has around 51 comments, 50 shares, 135 reactions, 7 bookmarks, and 23,800 video views. The source video on TikTok, meanwhile, has around 198,700 views, 3,690 reactions, 1,076 comments, 226 bookmarks, and 138 shares. 
Both the X post and the TikTok video compare Sara Duterte and Senator Risa Hontiveros, pointing out t
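The same extraction can be tried on a small HTML string, which is handy for checking selectors without requesting a page. The sketch below is only an illustration: `sample_html` is a hand-written stand-in, not Rappler's real markup, and it uses a CSS selector (`select_one`) as an alternative to passing the full class string to `find`.

```python
from bs4 import BeautifulSoup

# Small stand-in for an article page (not the real Rappler markup)
sample_html = """
<div class="post-single__content entry-content">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# The CSS selector matches a div carrying both classes, in any order
container = sample_soup.select_one('div.post-single__content.entry-content')

# Join the paragraph texts with newlines in one step
sample_text = '\n'.join(p.get_text() for p in container.find_all('p'))
print(sample_text)
```

Unlike the loop above, `'\n'.join(...)` does not leave a trailing newline at the end of the text.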

## Combine features into dataframe

In [9]:
# Link
# Title
# Date Published
# Article's Text
rappler = pd.DataFrame(
    columns=['title', 'link', 'date_published', 'text']
)
doc_details = [title, link, date_published, text]
rappler.loc[len(rappler)] = doc_details
rappler

Unnamed: 0,title,link,date_published,text
0,FACT CHECK: VP Sara Duterte's face is in her b...,https://www.rappler.com/newsbreak/fact-check/s...,2024-08-22T20:11:07+08:00,Claim: The face of Vice President Sara Duterte...


## Export to Excel file

In [10]:
try:
  # Google Colab
  from google.colab import drive
  drive.mount('/content/drive')

  file_name = '/content/drive/MyDrive/rappler.xlsx'
except ImportError:
  # VS Code / Local machine (google.colab is not installed)
  file_name = 'rappler.xlsx'

rappler.to_excel(file_name)
print(f'File saved to {file_name}')

File saved to rappler.xlsx


## Function `extract_article_data`

In [11]:
# Extract data from article based on link
def extract_article_data(link):
  """
  Extracts data from an article based on the provided link.

  Parameters:
    link (str): The URL of the article.

  Returns:
    list: A list containing the extracted article details, in the same
      order as the dataframe columns:
      - title (str): The title of the article.
      - link (str): The URL of the article.
      - date (str): The date the article was published.
      - text (str): The content of the article.
  """
  # Make a GET request for the article URL
  r = requests.get(link, headers=headers)

  # Parse the HTML
  soup = BeautifulSoup(r.content, 'html.parser')

  # Retrieve doc title
  title = soup.title.text.strip()

  # The selector below is specific to Rappler; change it for other sites
  # Retrieve doc date
  date = soup.find("time", {"datetime": True})['datetime']

  # The selector below is specific to Rappler; change it for other sites
  # Retrieve article content
  text = ''
  tagged_lines = soup.find("div", {"class": "post-single__content entry-content"}).find_all('p')
  for line in tagged_lines:
    untagged_line = line.get_text()
    text += untagged_line + '\n'

  # Create list containing doc details, ordered to match the
  # dataframe columns: title, link, date_published, text
  doc_details = [title, link, date, text]
  return doc_details

In [12]:
# Test the `extract_article_data` function
rappler.loc[len(rappler)] = extract_article_data(
  'https://www.rappler.com/philippines/hontiveros-seek-realignment-sara-duterte-book-fund-request-2025-budget/'
)
rappler

Unnamed: 0,title,link,date_published,text
0,FACT CHECK: VP Sara Duterte's face is in her b...,https://www.rappler.com/newsbreak/fact-check/s...,2024-08-22T20:11:07+08:00,Claim: The face of Vice President Sara Duterte...
1,Hontiveros to seek realignment of Sara Duterte...,https://www.rappler.com/philippines/hontiveros...,2024-08-22T13:05:55+08:00,"MANILA, Philippines – Senator Risa Hontiveros ..."
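Appending a plain list to a dataframe relies on the values being in exactly the same order as the columns, which is easy to get wrong. One way to make the mapping explicit is to build each row as a dictionary keyed by column name. This is a minimal sketch with made-up values; `make_record` is a hypothetical helper, not part of the notebook.

```python
def make_record(title, link, date_published, text):
    """Bundle article details under explicit column names."""
    return {
        'title': title,
        'link': link,
        'date_published': date_published,
        'text': text,
    }

# Illustrative values only
records = [
    make_record(
        title='Example title',
        link='https://example.com/article',
        date_published='2024-08-22T20:11:07+08:00',
        text='Example body',
    )
]
print(records[0]['link'])
```

A list of such dictionaries can be passed straight to `pd.DataFrame(records)`, which labels each column by key, so a swapped pair of values cannot silently land in the wrong column.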


## Extracting Multiple Articles

In [13]:
mother_url = 'https://www.rappler.com/topic/philippine-offshore-gaming-operations/page/'
page = 1
page_limit = 5
corpus = pd.DataFrame(columns=['title', 'link', 'date_published', 'text'])

while True:
  if page == 1:
    # Remove '/page/' (6 characters) at the end of mother_url
    page_url = mother_url[:-6]
  else:
    # Convert page number as a string
    page_str = str(page)
    # Form the article page
    page_url = mother_url + page_str
  print('Working on ' + page_url)

  # Add random time between 1 to 5 seconds before requesting
  time.sleep(random.randint(1, 5))

  # Get the list of articles within the page
  page_r = requests.get(page_url, headers=headers)
  page_soup = BeautifulSoup(page_r.content, 'html.parser')

  # Get the container of the articles
  article_container = page_soup.find('div', {'id': 'primary'})
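  if article_container is None:
    # Stop here: `continue` would retry the same page forever
    # because `page` is only incremented at the end of the loop
    print('No article container found. Stopping.')
    break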

  article_previews = article_container.find_all('article', {'class': 'post'})
  number_of_articles = len(article_previews)

  # If no article/s found, end
  if number_of_articles < 1:
    print('Extraction Finished!')
    break

  # Go through each article to extract and save to the dataframe
  for article_id in range(number_of_articles):
    # Focus on the article preview
    article = article_previews[article_id]
    # Get the clickable article title
    article_title = article.find("h2")

    # If no title, skip
    if article_title is None:
      continue

    # For each article, invoke `extract_article_data`
    try:
      # Append to a dataframe
      corpus.loc[len(corpus)] = extract_article_data(article_title.find("a")['href'])
    except Exception:
      # If there's an extraction error, skip this article
      continue

  # Check whether you have reached the page limit
  if page >= page_limit:
    break

  # Go to the next page
  page += 1

Working on https://www.rappler.com/topic/philippine-offshore-gaming-operations
Working on https://www.rappler.com/topic/philippine-offshore-gaming-operations/page/2
Working on https://www.rappler.com/topic/philippine-offshore-gaming-operations/page/3
Working on https://www.rappler.com/topic/philippine-offshore-gaming-operations/page/4
Working on https://www.rappler.com/topic/philippine-offshore-gaming-operations/page/5
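The page URL logic in the loop can also be written as a small function, which avoids slicing the string by a fixed number of characters. This is a sketch under the assumption, taken from the output above, that page 1 lives at the bare topic URL and later pages at `<topic_url>/page/<n>`; `build_page_url` is a hypothetical helper name.

```python
def build_page_url(topic_url, page):
    """Return the URL for a given page of a topic listing.

    Assumes page 1 is the bare topic URL and later pages
    follow the pattern `<topic_url>/page/<n>`.
    """
    topic_url = topic_url.rstrip('/')
    if page == 1:
        return topic_url
    return f'{topic_url}/page/{page}'

topic = 'https://www.rappler.com/topic/philippine-offshore-gaming-operations'
for n in range(1, 4):
    print(build_page_url(topic, n))
```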


In [39]:
try:
  # Google Colab
  from google.colab import drive
  drive.mount('/content/drive')

  file_name = '/content/drive/MyDrive/rappler_corpus.xlsx'
except ImportError:
  # VS Code / Local machine (google.colab is not installed)
  file_name = 'rappler_corpus.xlsx'

corpus.to_excel(file_name)
print(f'File saved to {file_name}')

File saved to rappler_corpus.xlsx


In [15]:
corpus

Unnamed: 0,title,link,date_published,text
0,DOJ on 'slow' pace of Alice Guo case: 'We cann...,https://www.rappler.com/philippines/doj-respon...,2024-08-28T10:24:17+08:00,"MANILA, Philippines – The Department of Justic..."
1,Lawyer who notarized Alice Guo's counter affid...,https://www.rappler.com/philippines/lawyer-not...,2024-08-28T07:00:00+08:00,"CLARK FREEPORT, Philippines – The lawyer who n..."
2,Alice Guo and siblings fled Philippines by boat,https://www.rappler.com/philippines/alice-guo-...,2024-08-27T13:20:33+08:00,"MANILA, Philippines – Dismissed Bamban, Tarlac..."
3,"Alice Guo's sister, Porac POGO staff to face S...",https://www.rappler.com/philippines/alice-guo-...,2024-08-27T10:03:09+08:00,"MANILA, Philippines – After an embarrassing la..."
4,"Cassandra Ong, Sheila Guo in Congress custody ...",https://www.rappler.com/video/daily-wrap/augus...,2024-08-26T21:53:00+08:00,Here are today’s headlines – the latest news i...
5,"In the Golden Triangle, a Filipino was told 'y...",https://www.rappler.com/newsbreak/in-depth/gol...,2024-08-26T17:51:50+08:00,"You cannot go home, you will die here,” are wo..."
6,House detains POGO worker Cassandra Ong,https://www.rappler.com/philippines/house-deta...,2024-08-26T14:38:08+08:00,"MANILA, Philippines – The National Bureau of I..."
7,Harry Roque's 'footprint is everywhere' in POG...,https://www.rappler.com/philippines/harry-roqu...,2024-08-24T19:02:20+08:00,"MANILA, Philippines – Former presidential spok..."
8,"Harry Roque, from party-list congressman to en...",https://www.rappler.com/newsbreak/in-depth/pro...,2024-08-24T14:25:05+08:00,"In the words of Harry Roque, the House of Repr..."
9,Shades of gray: How PH ended up with a POGO cr...,https://www.rappler.com/newsbreak/in-depth/sha...,2024-08-24T09:30:00+08:00,"It was June 30, 2016, when Rodrigo Duterte, th..."


# Method 2: Using the YouTube API

## Create YouTube Data API v3 key

Before we can scrape data from YouTube, we need to get an API key. APIs, or Application Programming Interfaces, are ways in which two applications (e.g., Google Colaboratory and YouTube) can talk to each other. You can think of it as the language through which two apps can speak so they can send and receive information from each other.

Some apps (e.g., YouTube and Google services) require an API key, or a string of characters that authenticates a user to access the app through an API.

1.   To get your API key, head to https://console.cloud.google.com/cloud-resource-manager
2.   Click on Create Project, name the project "YouTube scraping," and press "CREATE"
3.   In your Google Developers Console dashboard, click the Navigation Menu (the three lines) at the upper left corner and select "APIs & Services"
4.   When prompted, select the "YouTube scraping" project
5.   You will be directed to the project's dashboard, where you are to click "Explore & Enable APIs"
6.   Search for and select "YouTube Data API v3"
7.   Enable the API and click "Create Credentials"
8.   Select "public data" when asked what kind of data you will access
9.   Find your API key at the "Credentials" tab on the left side of your dashboard (If it's not there, just click "Create Credentials" again and select "API key")

Now, we import the pertinent packages.

`build` creates a resource object that uses the API key to communicate with YouTube.

In [16]:
import pandas as pd
from googleapiclient.discovery import build

In [17]:
api_key = 'YOUR_API_KEY'  # paste your own key here; never share or publish a real key
youtube = build('youtube', 'v3', developerKey=api_key)
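Since an API key is a credential, it is safer to keep it out of the notebook itself. One common approach is to read it from an environment variable. The sketch below assumes a variable named `YOUTUBE_API_KEY`, which is our own choice of name and not something the API requires; `load_api_key` is likewise a hypothetical helper.

```python
import os

def load_api_key(env_var='YOUTUBE_API_KEY'):
    """Read the API key from an environment variable.

    The variable name is an assumption; use whatever name you set.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f'Set the {env_var} environment variable first.')
    return key
```

With the variable set, `api_key = load_api_key()` can replace the hard-coded string before calling `build`.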

Let's make it our goal to make a corpus based on YouTube comments about Alice Guo. 
Let's start with this link: https://www.youtube.com/watch?v=yfoq-0gGTLM

To get comment thread results from YouTube, we execute the code below, which calls the `commentThreads().list()` method with the following arguments:
*   `videoId`, which we get from the link above (the value after `v=`)
*   `part`, which we set as "snippet,replies" because these are the only parts of the comment results we are interested in
*   `maxResults` and `order`, which cap each page at 50 comment threads and sort them by time

To learn more about how to use the YouTube API to get comment thread results, refer to https://developers.google.com/youtube/v3/docs/commentThreads/list



We then look at the contents of the resulting `video_response` object.

## Extract the YouTube comments

In [18]:
video_id = 'yfoq-0gGTLM'
video_response = youtube.commentThreads().list(
  videoId=video_id, part='snippet,replies', maxResults=50,
  order='time'
).execute()
video_response

{'kind': 'youtube#commentThreadListResponse',
 'etag': 'h6GG2Ymhk7BukQ66l1nfYBYSjFo',
 'pageInfo': {'totalResults': 25, 'resultsPerPage': 50},
 'items': [{'kind': 'youtube#commentThread',
   'etag': 'KwQ8MckUnNNTj0heexX_ywyh6sc',
   'id': 'UgzmI6rrrSbY2I4XZTd4AaABAg',
   'snippet': {'channelId': 'UCvRAX-ujvZ0eTMLGG2vki9w',
    'videoId': 'yfoq-0gGTLM',
    'topLevelComment': {'kind': 'youtube#comment',
     'etag': 'BNE0HFymQ4Vds3IsOY1HDEtwem8',
     'id': 'UgzmI6rrrSbY2I4XZTd4AaABAg',
     'snippet': {'channelId': 'UCvRAX-ujvZ0eTMLGG2vki9w',
      'videoId': 'yfoq-0gGTLM',
      'textDisplay': 'Ang kapalpakan ng marcos admin lumaganap na ulit ang droga kay duterte oarin isinisisi, may god i hate cocaine.',
      'textOriginal': 'Ang kapalpakan ng marcos admin lumaganap na ulit ang droga kay duterte oarin isinisisi, may god i hate cocaine.',
      'authorDisplayName': '@2littlebee1',
      'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AIdro_ndI6ZouWopQgS2uB15Ofp0k1WFDSPMH9F5_xSaG

We can see that the contents are quite long, but if we take the time to analyze it, we'll realize it's really just a nested dictionary with the following keys: `kind`, `etag`, `pageInfo`, and `items`. When more results are available on a further page, the response also carries a `nextPageToken` key.

The values in `pageInfo` tell us how many results are included in this set: in this case, 25 comment threads (`totalResults`), which is below the cap of 50 per page that we requested (`resultsPerPage`). The 25 comment threads are contained as dictionaries within the `items` key.

To confirm, let's check the length of the `items`.

In [19]:
len(video_response['items'])

25

## Inspecting a single YouTube comment

Now, let's try and see what each item in the response looks like. Let's examine the 1st item in the list.

In [20]:
video_response['items'][0]

{'kind': 'youtube#commentThread',
 'etag': 'KwQ8MckUnNNTj0heexX_ywyh6sc',
 'id': 'UgzmI6rrrSbY2I4XZTd4AaABAg',
 'snippet': {'channelId': 'UCvRAX-ujvZ0eTMLGG2vki9w',
  'videoId': 'yfoq-0gGTLM',
  'topLevelComment': {'kind': 'youtube#comment',
   'etag': 'BNE0HFymQ4Vds3IsOY1HDEtwem8',
   'id': 'UgzmI6rrrSbY2I4XZTd4AaABAg',
   'snippet': {'channelId': 'UCvRAX-ujvZ0eTMLGG2vki9w',
    'videoId': 'yfoq-0gGTLM',
    'textDisplay': 'Ang kapalpakan ng marcos admin lumaganap na ulit ang droga kay duterte oarin isinisisi, may god i hate cocaine.',
    'textOriginal': 'Ang kapalpakan ng marcos admin lumaganap na ulit ang droga kay duterte oarin isinisisi, may god i hate cocaine.',
    'authorDisplayName': '@2littlebee1',
    'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AIdro_ndI6ZouWopQgS2uB15Ofp0k1WFDSPMH9F5_xSaGgz16sk=s48-c-k-c0x00ffffff-no-rj',
    'authorChannelUrl': 'http://www.youtube.com/@2littlebee1',
    'authorChannelId': {'value': 'UCKi4zLPRj1ChJmYfyIHKvkA'},
    'canRate': True,

We can extract the following
- textDisplay
- id
- publishedAt
- textOriginal

Then add additional features
- likeCount
- parentId (for replies only; stored later as `reply_parent_id`)

In [21]:
# For the original comment
print(
  f'https://www.youtube.com/watch?v={video_id}&lc={video_response["items"][0]["snippet"]["topLevelComment"]["id"]}'
)
print(video_response['items'][0]['snippet']['topLevelComment']['snippet']['textDisplay'])
print(video_response['items'][0]['snippet']['topLevelComment']['snippet']['publishedAt'])
print(video_response['items'][0]['snippet']['topLevelComment']['snippet']['textOriginal'])
print(video_response['items'][0]['snippet']['topLevelComment']['snippet']['likeCount'])
# No repliedParentId

https://www.youtube.com/watch?v=yfoq-0gGTLM&lc=UgzmI6rrrSbY2I4XZTd4AaABAg
Ang kapalpakan ng marcos admin lumaganap na ulit ang droga kay duterte oarin isinisisi, may god i hate cocaine.
2024-08-28T11:55:23Z
Ang kapalpakan ng marcos admin lumaganap na ulit ang droga kay duterte oarin isinisisi, may god i hate cocaine.
0
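Because every item has the same nesting, the repeated indexing above can be wrapped in a small function. The sketch below runs on a hand-written dictionary that mimics the structure of one item, so it does not call the API; `thread_to_row` and `sample_item` are illustrative names.

```python
def thread_to_row(item, video_id):
    """Flatten one commentThread item into a list of fields."""
    top = item['snippet']['topLevelComment']
    snippet = top['snippet']
    return [
        snippet['textDisplay'],
        f"https://www.youtube.com/watch?v={video_id}&lc={top['id']}",
        snippet['publishedAt'],
        snippet['textOriginal'],
        snippet['likeCount'],
        None,  # top-level comments have no parent ID
    ]

# Hand-written stand-in with the same nesting as an API item
sample_item = {
    'snippet': {
        'totalReplyCount': 0,
        'topLevelComment': {
            'id': 'abc123',
            'snippet': {
                'textDisplay': 'Hello',
                'textOriginal': 'Hello',
                'publishedAt': '2024-08-28T11:55:23Z',
                'likeCount': 0,
            },
        },
    }
}
row = thread_to_row(sample_item, 'VIDEO_ID')
print(row)
```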


## Extracting replies from comment

If `totalReplyCount` is greater than 0, we can extract the replies to the comment.
In the `video_response` object, we can see that the replies are nested within the `comments` key of each item's `replies` entry. However, this embedded list does not always include all replies.

As such, we have to use the `comments.list` method to get all replies to a comment. 
Refer to [CommentsList](https://developers.google.com/youtube/v3/docs/comments/list) document for more information.

In [22]:
video_response['items'][-1]['snippet']['totalReplyCount']

8

In [23]:
len(
  video_response['items'][-1]['replies']['comments']
)

5
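The two numbers above differ: the thread reports 8 replies, but only 5 are embedded in the response. A quick check like the one sketched below can flag such threads. It uses a hand-written dictionary in place of a real item, and the function names are our own.

```python
def embedded_reply_count(item):
    """Number of replies embedded directly in a commentThread item."""
    return len(item.get('replies', {}).get('comments', []))

def has_missing_replies(item):
    """True when the thread reports more replies than are embedded."""
    return item['snippet']['totalReplyCount'] > embedded_reply_count(item)

# Stand-in item: 8 replies reported, 5 embedded (as in the cells above)
sample_thread = {
    'snippet': {'totalReplyCount': 8},
    'replies': {'comments': [{} for _ in range(5)]},
}
print(embedded_reply_count(sample_thread))
print(has_missing_replies(sample_thread))
```

Using `.get` with defaults also covers threads with no `replies` key at all, which is what the API returns for comments without replies.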

The `totalReplyCount` key tells us this comment has 8 replies, but only 5 of them are embedded in the response. To get all of them, we pass the ID of the top-level comment as `parentId` to `comments().list`.

In [24]:
comment_number = -1
total_reply_count = video_response['items'][comment_number]['snippet']['totalReplyCount']

if total_reply_count > 0:
  parent_id = video_response['items'][comment_number]['snippet']['topLevelComment']['id']

  replies = youtube.comments().list(
    part='snippet', parentId=parent_id, maxResults=50
  ).execute()

  for reply in replies['items']:
    print(
      f"https://www.youtube.com/watch?v={video_id}&lc={reply['id']}"
    )
    print(reply['snippet']['textDisplay'])
    print(reply['snippet']['publishedAt'])
    print(reply['snippet']['textOriginal'])
    print(reply['snippet']['likeCount'])
    print(reply['snippet']['parentId'])
    print()

https://www.youtube.com/watch?v=yfoq-0gGTLM&lc=UgwBBzeGtqWna6y4KUJ4AaABAg.A7dRYCHsjCfA7dRyhsr2Rf
The previous one wants to make philippines a province of china 😂
2024-08-27T11:00:55Z
The previous one wants to make philippines a province of china 😂
4
UgwBBzeGtqWna6y4KUJ4AaABAg

https://www.youtube.com/watch?v=yfoq-0gGTLM&lc=UgwBBzeGtqWna6y4KUJ4AaABAg.A7dRYCHsjCfA7dSXN3ixKS
LBM ONG TRAYDOR
2024-08-27T11:05:47Z
LBM ONG TRAYDOR
0
UgwBBzeGtqWna6y4KUJ4AaABAg

https://www.youtube.com/watch?v=yfoq-0gGTLM&lc=UgwBBzeGtqWna6y4KUJ4AaABAg.A7dRYCHsjCfA7dUIMBsnFa
Very singhot
2024-08-27T11:21:13Z
Very singhot
0
UgwBBzeGtqWna6y4KUJ4AaABAg

https://www.youtube.com/watch?v=yfoq-0gGTLM&lc=UgwBBzeGtqWna6y4KUJ4AaABAg.A7dRYCHsjCfA7dUhI_shHq
Ul*l dignified. Pinapakyuhan nga si Bong Daza sa picture nung nag aadik sila. Hindi ka ba updated? 😂 nilantad na kung sino kumanta sa kanila. Palit ulo
2024-08-27T11:24:46Z
Ul*l dignified. Pinapakyuhan nga si Bong Daza sa picture nung nag aadik sila. Hindi ka ba updated?

## Automating the extraction

Now that we know how to extract individual comments from YouTube videos, we need to figure out how we can automate this process so that we can build a corpus of comments without manually getting comments and replies one by one.

To do this, we rely on the following:
1.   the consistency of the nested dictionary results structure of the YouTube API
2.   Python loops
3.   page tokens

Let's work on the first two.


First, we create a list where we will store all scraped data.

In [25]:
comments = []

Then, we use loops to iterate through the contents of `video_response`.

In [26]:
# iterate through items
for item in video_response['items']:

  # extract comment from each item
  comment = item['snippet']['topLevelComment']['snippet']

  # append comment to list of comments
  comments.append([
    comment['textDisplay'],
    f'https://www.youtube.com/watch?v={video_id}&lc={item["snippet"]["topLevelComment"]["id"]}',
    comment['publishedAt'],
    comment['textOriginal'],
    comment['likeCount'],
    np.nan
  ])

  # count number of replies
  total_reply_count = item['snippet']['totalReplyCount']

  # if there is at least one reply
  if total_reply_count > 0:
    parent_id = item["snippet"]["topLevelComment"]["id"]

    replies = youtube.comments().list(
      part='snippet', parentId=parent_id, maxResults=50
    ).execute()

    # iterate through the replies
    for reply in replies['items']:
      # extract text from each reply
      # append reply to list of comments
      replyBody = reply['snippet']
      comments.append([
        replyBody['textDisplay'],
        f"https://www.youtube.com/watch?v={video_id}&lc={reply['id']}",
        replyBody['publishedAt'],
        replyBody['textOriginal'],
        replyBody['likeCount'],
        replyBody['parentId']
      ])

In [27]:
youtube_corpus = pd.DataFrame(
  comments, columns=['title', 'link', 'date_published', 'text', 'like_count', 'reply_parent_id'])
youtube_corpus

Unnamed: 0,title,link,date_published,text,like_count,reply_parent_id
0,Ang kapalpakan ng marcos admin lumaganap na ul...,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-28T11:55:23Z,Ang kapalpakan ng marcos admin lumaganap na ul...,0,
1,Only this type of human race can go anywhere i...,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T23:12:34Z,Only this type of human race can go anywhere i...,0,
2,Mga duterte puro angas wala naman gawa,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T22:36:52Z,Mga duterte puro angas wala naman gawa,0,
3,Now that you know she is in Jakarta what is th...,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T21:59:23Z,Now that you know she is in Jakarta what is th...,0,
4,Huliin na po!,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T21:05:23Z,Huliin na po!,0,
5,duterte never again.,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T20:16:59Z,duterte never again.,0,
6,Once Gibo is rightfully elected our next Presi...,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T19:26:58Z,Once Gibo is rightfully elected our next Presi...,0,
7,magkano bayad pag escape beauru?,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T19:12:55Z,magkano bayad pag escape beauru?,0,
8,She’s just another two faced Chinese.,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T17:30:17Z,She’s just another two faced Chinese.,0,
9,Dapat isang tanong isang sagot un to clarify a...,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T16:26:10Z,Dapat isang tanong isang sagot un to clarify a...,1,


## Handling next page with page tokens

Let's take another example with more comments.

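A single request returns at most 50 comment threads from the video, plus their replies. Note that `comments` is not reset in the cell below, so the new rows are appended after those from the first video.

However, if we check the video's comments section, we'll see that there are 128 comments in the video.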

So how can we extract more comments? That's where `pageToken` comes in.

In [28]:
video_id = '0v1XHgWyIvU'
video_response = youtube.commentThreads().list(
  videoId=video_id, part='snippet,replies', maxResults=50,
  order='time'
).execute()

# iterate through items
for item in video_response['items']:

  # extract comment from each item
  comment = item['snippet']['topLevelComment']['snippet']

  # append comment to list of comments
  comments.append([
    comment['textDisplay'],
    f'https://www.youtube.com/watch?v={video_id}&lc={item["snippet"]["topLevelComment"]["id"]}',
    comment['publishedAt'],
    comment['textOriginal'],
    comment['likeCount'],
    np.nan
  ])

  # count number of replies
  total_reply_count = item['snippet']['totalReplyCount']

  # if there is at least one reply
  if total_reply_count > 0:
    parent_id = item["snippet"]["topLevelComment"]["id"]

    replies = youtube.comments().list(
      part='snippet', parentId=parent_id, maxResults=50
    ).execute()

    # iterate through the replies
    for reply in replies['items']:
      # extract text from each reply
      # append reply to list of comments
      replyBody = reply['snippet']
      comments.append([
        replyBody['textDisplay'],
        f"https://www.youtube.com/watch?v={video_id}&lc={reply['id']}",
        replyBody['publishedAt'],
        replyBody['textOriginal'],
        replyBody['likeCount'],
        replyBody['parentId']
      ])

youtube_corpus = pd.DataFrame(
  comments, columns=['title', 'link', 'date_published', 'text', 'like_count', 'reply_parent_id'])
youtube_corpus

Unnamed: 0,title,link,date_published,text,like_count,reply_parent_id
0,Ang kapalpakan ng marcos admin lumaganap na ul...,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-28T11:55:23Z,Ang kapalpakan ng marcos admin lumaganap na ul...,0,
1,Only this type of human race can go anywhere i...,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T23:12:34Z,Only this type of human race can go anywhere i...,0,
2,Mga duterte puro angas wala naman gawa,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T22:36:52Z,Mga duterte puro angas wala naman gawa,0,
3,Now that you know she is in Jakarta what is th...,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T21:59:23Z,Now that you know she is in Jakarta what is th...,0,
4,Huliin na po!,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T21:05:23Z,Huliin na po!,0,
...,...,...,...,...,...,...
80,PI tlaga sa pilipinas! ginag*g0 na lang mga tao.,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-19T14:22:49Z,PI tlaga sa pilipinas! ginag*g0 na lang mga tao.,0,
81,SERVICE FROM THE INTERPOOL IS HIGHLY NEEDED❤😮 ...,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-19T14:22:11Z,SERVICE FROM THE INTERPOOL IS HIGHLY NEEDED❤😮 ...,1,
82,Interfool 🤣 . Interpol po yata yon…,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-23T14:17:24Z,Interfool 🤣 . Interpol po yata yon…,0,Ugz3RrSepytB8qik2WV4AaABAg
83,yung attorney nila patawan nyo ng administrati...,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-19T14:18:47Z,yung attorney nila patawan nyo ng administrati...,0,


If we return to the contents of `video_response`, we'll see that one of its keys is the `nextPageToken` key. This key is like an ID that distinguishes the pages within a set of search results--in this case, the comments section of the video.

If we add a `pageToken` argument to our `youtube.commentThreads().list()` call, we'll get a different set of results from the previous one.

In [29]:
# Check whether video_response has `nextPageToken`
video_response['nextPageToken']

'Z2V0X25ld2VzdF9maXJzdC0tQ2dnSWdBUVZGN2ZST0JJRkNJZ2dHQUFTQlFpSElCZ0FFZ1VJcUNBWUFCSUZDSWtnR0FBU0JRaWRJQmdCR0FBaURnb01DS3luamJZR0VMQ1F2SjBC'

In [30]:
# Make the same request but with the `nextPageToken`
video_response_2 = youtube.commentThreads().list(
  videoId=video_id, part='snippet,replies', maxResults=50,
  order='time',
  pageToken=video_response['nextPageToken']
).execute()

In [31]:
video_response_2['items'][0]['snippet']['topLevelComment']['snippet']['textDisplay']

'as long as you pay, gov do everything for you'

Let's now incorporate `pageToken` into our loop.

In [32]:
# how many comments do we want
no_comments = 500

# re-initialize list of YouTube comments
comments = []
youtube_corpus = None

video_id = '0v1XHgWyIvU'  # more comments
# video_id = 'yfoq-0gGTLM' # fewer comments

# get first page of the comments
video_response = youtube.commentThreads().list(
  videoId=video_id, part='snippet,replies', maxResults=50,
  order='time', moderationStatus='published'
).execute()

while len(comments) < no_comments:
  # iterate through items
  for item in video_response['items']:
    ##### Parent Comments #####
    # extract comment from each item
    comment = item['snippet']['topLevelComment']['snippet']

    # append comment to list of comments
    comments.append([
      comment['textDisplay'],
      f'https://www.youtube.com/watch?v={video_id}&lc={item["snippet"]["topLevelComment"]["id"]}',
      comment['publishedAt'],
      comment['textOriginal'],
      comment['likeCount'],
      np.nan
    ])

    ##### Reply Comments #####
    # count number of replies
    total_reply_count = item['snippet']['totalReplyCount']

    # if there is at least one reply
    if total_reply_count > 0:
      parent_id = item["snippet"]["topLevelComment"]["id"]

      replies = youtube.comments().list(
        part='snippet',
        parentId=parent_id,
        maxResults=50  # same cap as the earlier cells
      ).execute()

      # iterate through the replies
      for reply in replies['items']:
        # extract text from each reply
        # append reply to list of comments
        replyBody = reply['snippet']
        comments.append([
          replyBody['textDisplay'],
          f"https://www.youtube.com/watch?v={video_id}&lc={reply['id']}",
          replyBody['publishedAt'],
          replyBody['textOriginal'],
          replyBody['likeCount'],
          replyBody['parentId']
        ])

  ##### Next Page #####
  # print number of comments
  print(str(len(comments)) + ' comments in list')

  # if there is a next page to the comment result
  if 'nextPageToken' in video_response:
    # notify that next page has been found
    print('Next comment page found. Now extracting data. \n')

    # get the next page
    video_response = youtube.commentThreads().list(
      videoId=video_id, part='snippet,replies', maxResults=50,
      order='time', pageToken=video_response['nextPageToken'],
      moderationStatus='published'
    ).execute()
  else:
    # notify that no more pages are left
    print('No more comment pages left.')
    break

51 comments in list
Next comment page found. Now extracting data. 

102 comments in list
Next comment page found. Now extracting data. 

128 comments in list
No more comment pages left.


In [33]:
youtube_corpus = pd.DataFrame(
  comments, columns=['title', 'link', 'date_published', 'text', 'like_count', 'reply_parent_id'])
youtube_corpus

Unnamed: 0,title,link,date_published,text,like_count,reply_parent_id
0,Escape of guo is a reflection of country&#39;s...,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-27T10:40:10Z,Escape of guo is a reflection of country's st...,0,
1,BAKIT masama bang magsinungaling ? Sino bang t...,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-26T02:25:48Z,BAKIT masama bang magsinungaling ? Sino bang t...,0,
2,Ano ka ngayon Honti Virus Nga Nga ?,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-26T02:19:50Z,Ano ka ngayon Honti Virus Nga Nga ?,0,
3,NO TO DUTERTE AND CHINA. ARREST GUO LAWYER STE...,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-22T15:18:38Z,NO TO DUTERTE AND CHINA. ARREST GUO LAWYER STE...,0,
4,"Grabe nakalagpas sa immigration, ano yan hindi...",https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-22T07:58:58Z,"Grabe nakalagpas sa immigration, ano yan hindi...",0,
...,...,...,...,...,...,...
123,Magkano kaya binayad oara maka eskapo?,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-20T11:02:17Z,Magkano kaya binayad oara maka eskapo?,0,UgzidE_cOM1z3f7pnCF4AaABAg
124,Nagtatago lang yan,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-19T12:15:42Z,Nagtatago lang yan,1,
125,Iba talaga pag may Pera. Lahat ma bibili. Hay ...,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-19T12:14:12Z,Iba talaga pag may Pera. Lahat ma bibili. Hay ...,8,
126,Bakit nakalabas,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-19T12:12:46Z,Bakit nakalabas,2,
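
Notice that the `title` column (taken from `textDisplay`) contains HTML entities such as `&#39;`, while the `text` column (taken from `textOriginal`) holds the raw text. The dates are also plain strings. If you need clean display text or real datetime objects, a minimal sketch using only the standard library might look like the following. The helper names `clean_display_text` and `parse_published_at` are illustrative and not part of the YouTube API, and the timestamp format is assumed to match the one shown in the table above (no fractional seconds). Note that this only decodes entities; it does not strip HTML tags.

```python
import html
from datetime import datetime, timezone

def clean_display_text(display_text):
  """Convert HTML entities (e.g. &#39;) back into plain characters."""
  return html.unescape(display_text)

def parse_published_at(timestamp):
  """Parse a timestamp such as 2024-08-27T10:40:10Z into a UTC datetime.

  Assumes the exact format seen in the table above.
  """
  parsed = datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%SZ')
  return parsed.replace(tzinfo=timezone.utc)

print(clean_display_text('Escape of guo is a reflection of country&#39;s'))
print(parse_published_at('2024-08-27T10:40:10Z'))
```

With pandas, the same idea could be applied column-wise, for example with `youtube_corpus['title'].map(clean_display_text)`.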


## Function `extract_youtube_comments`

Now, we can define a function that extracts a certain number of comments for a particular video and returns a list of these comments.

In [34]:
def extract_youtube_comments(video_id, no_comments):
  """
  Extracts comments from a YouTube video.

  Args:
    video_id (str): The ID of the YouTube video.
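    no_comments (int): The minimum number of comments to collect before stopping.
      The current page (including its replies) is always finished, so more may be returned.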

  Returns:
    pandas.DataFrame: A DataFrame containing the extracted comments with the following columns:
      - title (str): The display text of the comment.
      - link (str): The URL of the comment.
      - date_published (str): The date and time when the comment was published.
      - text (str): The original text of the comment.
      - like_count (int): The number of likes the comment has received.
      - reply_parent_id (str or float): The ID of the parent comment if the comment is a reply, otherwise NaN.
  """
  # re-initialize list of YouTube comments
  comments = []

  # get first page of the comments
  video_response = youtube.commentThreads().list(
    videoId=video_id, part='snippet,replies', maxResults=50,
    order='time', moderationStatus='published'
  ).execute()

  while len(comments) < no_comments:
    # iterate through items
    for item in video_response['items']:
      ##### Parent Comments #####
      # extract comment from each item
      comment = item['snippet']['topLevelComment']['snippet']

      # append comment to list of comments
      comments.append([
        comment['textDisplay'],
        f'https://www.youtube.com/watch?v={video_id}&lc={item["snippet"]["topLevelComment"]["id"]}',
        comment['publishedAt'],
        comment['textOriginal'],
        comment['likeCount'],
        np.nan
      ])

      ##### Reply Comments #####
      # count number of replies
      total_reply_count = item['snippet']['totalReplyCount']

      # if there is at least one reply
      if total_reply_count > 0:
        parent_id = item["snippet"]["topLevelComment"]["id"]

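        # fetch the replies to this comment; comments().list returns at most
        # 20 replies per call by default (maxResults can be raised up to 100),
        # so longer threads also need their nextPageToken followed
        replies = youtube.comments().list(
          part='snippet',
          parentId=parent_id
        ).execute()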

        # iterate through the replies
        for reply in replies['items']:
          # extract text from each reply
          # append reply to list of comments
          replyBody = reply['snippet']
          comments.append([
            replyBody['textDisplay'],
            f"https://www.youtube.com/watch?v={video_id}&lc={reply['id']}",
            replyBody['publishedAt'],
            replyBody['textOriginal'],
            replyBody['likeCount'],
            replyBody['parentId']
          ])

    ##### Next Page #####
    # print number of comments
    print(str(len(comments)) + ' comments in list')

    # if there is a next page to the comment result
    if 'nextPageToken' in video_response:
      # notify that next page has been found
      print('Next comment page found. Now extracting data. \n')

      # get the next page
      video_response = youtube.commentThreads().list(
        videoId=video_id, part='snippet,replies', maxResults=50,
        order='time', pageToken=video_response['nextPageToken'],
        moderationStatus='published'
      ).execute()
    else:
      # notify that no more pages are left
      print('No more comment pages left.\n')
      break

  return pd.DataFrame(
    comments, columns=['title', 'link', 'date_published', 'text', 'like_count', 'reply_parent_id']
  )
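
One limitation of the function above is that it requests only a single page of replies per top-level comment, so long reply threads may be cut short. A minimal sketch of a helper that follows `nextPageToken` until all replies are collected is shown below. The name `fetch_all_replies` and the `page_size` parameter are assumptions made for illustration; the `youtube` argument is expected to be the same API client object used throughout this notebook.

```python
def fetch_all_replies(youtube, parent_id, page_size=100):
  """Collect every reply to one top-level comment, page by page."""
  replies = []
  page_token = None

  while True:
    # build the request parameters; pageToken is only sent after the first page
    params = {'part': 'snippet', 'parentId': parent_id, 'maxResults': page_size}
    if page_token:
      params['pageToken'] = page_token

    response = youtube.comments().list(**params).execute()
    replies.extend(response.get('items', []))

    # stop when the API reports no further page
    page_token = response.get('nextPageToken')
    if not page_token:
      break

  return replies
```

Inside `extract_youtube_comments`, the inner loop could then iterate over `fetch_all_replies(youtube, parent_id)` instead of `replies['items']`. Each extra page is one more API call, so this trades completeness for quota.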

In [35]:
video_links = [
  '0v1XHgWyIvU',  # more comments
  'yfoq-0gGTLM'  # fewer comments
]

aliceguo_youtube_corpus = None

for video_link in video_links:
  print(f'Extracting comments from video: {video_link}')
  if aliceguo_youtube_corpus is None:
    aliceguo_youtube_corpus = extract_youtube_comments(video_link, 750)
  else:
    aliceguo_youtube_corpus = pd.concat([
      aliceguo_youtube_corpus, extract_youtube_comments(video_link, 750)
    ])

aliceguo_youtube_corpus

Extracting comments from video: 0v1XHgWyIvU
51 comments in list
Next comment page found. Now extracting data. 

102 comments in list
Next comment page found. Now extracting data. 

128 comments in list
No more comment pages left.

Extracting comments from video: yfoq-0gGTLM
34 comments in list
No more comment pages left.



Unnamed: 0,title,link,date_published,text,like_count,reply_parent_id
0,Escape of guo is a reflection of country&#39;s...,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-27T10:40:10Z,Escape of guo is a reflection of country's st...,0,
1,BAKIT masama bang magsinungaling ? Sino bang t...,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-26T02:25:48Z,BAKIT masama bang magsinungaling ? Sino bang t...,0,
2,Ano ka ngayon Honti Virus Nga Nga ?,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-26T02:19:50Z,Ano ka ngayon Honti Virus Nga Nga ?,0,
3,NO TO DUTERTE AND CHINA. ARREST GUO LAWYER STE...,https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-22T15:18:38Z,NO TO DUTERTE AND CHINA. ARREST GUO LAWYER STE...,0,
4,"Grabe nakalagpas sa immigration, ano yan hindi...",https://www.youtube.com/watch?v=0v1XHgWyIvU&lc...,2024-08-22T07:58:58Z,"Grabe nakalagpas sa immigration, ano yan hindi...",0,
...,...,...,...,...,...,...
29,Ul*l dignified. Pinapakyuhan nga si Bong Daza ...,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T11:24:46Z,Ul*l dignified. Pinapakyuhan nga si Bong Daza ...,0,UgwBBzeGtqWna6y4KUJ4AaABAg
30,"That vp is rude, entitled and arrogant like he...",https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T13:13:32Z,"That vp is rude, entitled and arrogant like he...",2,UgwBBzeGtqWna6y4KUJ4AaABAg
31,What country is this? Very interesting.,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T14:16:52Z,What country is this? Very interesting.,0,UgwBBzeGtqWna6y4KUJ4AaABAg
32,@@cvoutdoors9859palamunin,https://www.youtube.com/watch?v=yfoq-0gGTLM&lc...,2024-08-27T15:14:26Z,@@cvoutdoors9859palamunin,0,UgwBBzeGtqWna6y4KUJ4AaABAg



We can now save this dataframe of comments to an Excel file.

In [38]:
try:
  # Google Colab
  from google.colab import drive
  drive.mount('/content/drive')

  file_name = '/content/drive/MyDrive/aliceguo_youtube.xlsx'
except ImportError:
  # VS Code / Local machine
  file_name = 'aliceguo_youtube.xlsx'

aliceguo_youtube_corpus.to_excel(file_name)
print(f'File saved to {file_name}')

File saved to aliceguo_youtube.xlsx


What we've done so far is extract comments from videos whose IDs we already know. If you want to find videos automatically (for example, every video on a channel, in a playlist, or matching a search query), you'll have to use the `channels`, `search`, or `playlistItems` resources of the YouTube Data API. You can study these in more detail through the documentation.
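
As a rough illustration of that idea, the sketch below collects the video IDs in a playlist so they can be passed to `extract_youtube_comments` one at a time. The function name `list_playlist_video_ids` and the `max_pages` safety limit are assumptions made for this example; `youtube` is assumed to be the same API client object created earlier in the notebook.

```python
def list_playlist_video_ids(youtube, playlist_id, max_pages=5):
  """Return the video IDs in a playlist, reading at most max_pages pages."""
  video_ids = []
  page_token = None

  for _ in range(max_pages):
    # request up to 50 playlist items per page (the maximum for this endpoint)
    params = {'part': 'contentDetails', 'playlistId': playlist_id, 'maxResults': 50}
    if page_token:
      params['pageToken'] = page_token

    response = youtube.playlistItems().list(**params).execute()
    for item in response.get('items', []):
      video_ids.append(item['contentDetails']['videoId'])

    # stop early when there are no more pages
    page_token = response.get('nextPageToken')
    if not page_token:
      break

  return video_ids
```

The resulting list can replace the hand-written `video_links` list used above.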

However, note that each Google Cloud project has a default allocation of 10,000 quota units per day. Each API call costs a certain number of units (a `search` request costs far more than a `commentThreads` request), and once you reach 10,000, further API requests will fail with a quota error until the quota resets the next day (at midnight Pacific Time).
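
To get a feel for how quickly the quota is consumed, here is a small back-of-the-envelope calculation. The per-call costs are taken from the YouTube Data API quota cost table as understood at the time of writing (1 unit for list calls such as `commentThreads.list`, 100 units for `search.list`) and should be checked against the current documentation. The call counts in `example_run` are hypothetical, and `estimate_quota` is an illustrative helper, not part of any library.

```python
# assumed quota cost per API call (verify against the official cost table)
QUOTA_COSTS = {
  'commentThreads.list': 1,
  'comments.list': 1,
  'playlistItems.list': 1,
  'search.list': 100,
}
DAILY_QUOTA = 10_000

def estimate_quota(calls):
  """Sum the quota units used by a dict of {method name: number of calls}."""
  return sum(QUOTA_COSTS[method] * count for method, count in calls.items())

# hypothetical run: 3 pages of comment threads plus 40 reply lookups
example_run = {'commentThreads.list': 3, 'comments.list': 40}
used = estimate_quota(example_run)
print(f'{used} of {DAILY_QUOTA} daily units used')
```

Under these assumptions, comment extraction is cheap, while a handful of `search` requests can consume a noticeable share of the daily allocation.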

Other APIs you might be interested in exploring:
*   Genius API for extracting song lyrics (https://docs.genius.com/)
*   Python Reddit API Wrapper (PRAW) for extracting Reddit posts (https://praw.readthedocs.io/en/stable/)
*   Tweepy for extracting tweets (though access to the X/Twitter API is now heavily restricted) (https://www.tweepy.org/)





# Method 3: Facepager

Facepager is an application for automating data extraction (especially from Facebook). Follow the installation instructions on this page: https://github.com/strohne/Facepager

To learn more about how to use Facepager, see the Getting Started page of its wiki: https://github.com/strohne/Facepager/wiki/Getting-Started