# Web Scraping and APIs - Requests, Selenium and BeautifulSoup

This notebook provides a quick introduction to web scraping with Python, as part of the University of Amsterdam course Computational Social Science Analysis.

We will start with a simple introduction to using APIs and web scraping. We will learn scraping using requests and Selenium. We will then build up toward a small research project that uses scraping and an API.

It has been developed by Petter Törnberg. p.tornberg@uva.nl
Version 1.0. 2024-04-04 


# Interacting with APIs 

An API, or Application Programming Interface, allows different software applications to talk to each other, sharing data and functionalities easily. Developers use APIs to access features or data from other services, enabling more complex and feature-rich applications. Essentially, APIs serve as bridges between different software, making it possible for them to interact and share resources.



We're going to start with getting data from a simple API. It's easy!

## 1. Using a simple API:  How's the weather?

To fetch data from any API or website, we can use the requests package. It hides the complexity of making HTTP requests behind a few simple functions, allowing developers to send HTTP/1.1 requests with methods such as GET, POST, and PUT.

In [None]:
!pip install requests

As an example, we will use OpenWeatherMap.

#### API Documentation: _Read The Fine Manual! (RTFM)_
Public APIs always come with documentation that describes how to use the API, and what data you can expect. 

To find the OpenWeatherMap API, you can go to:
https://openweathermap.org/api


#### Getting the current weather
We will use the current weather endpoint to get the current weather in Amsterdam.

In [46]:
import requests

api_key = "de26752686c975de6a1c38a998f50fec"
city_name = "Amsterdam"
base_url = "http://api.openweathermap.org/data/2.5/weather?"

# Complete URL for the API call
url = f"{base_url}q={city_name}&appid={api_key}"

response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    print("Here is the result from the API:")
    print(response.text)
    json_string = response.text
else:
    print("Error: Unable to get data from OpenWeatherMap API! :(")

Here is the result from the API:
{"coord":{"lon":4.8897,"lat":52.374},"weather":[{"id":800,"main":"Clear","description":"clear sky","icon":"01n"}],"base":"stations","main":{"temp":285.16,"feels_like":284.24,"temp_min":284.19,"temp_max":285.87,"pressure":1026,"humidity":70},"visibility":10000,"wind":{"speed":4.12,"deg":210},"clouds":{"all":0},"dt":1712778839,"sys":{"type":2,"id":2012552,"country":"NL","sunrise":1712724795,"sunset":1712773777},"timezone":7200,"id":2759794,"name":"Amsterdam","cod":200}
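Building the URL with an f-string works for a simple name like Amsterdam, but a city name containing spaces or special characters would need to be URL-encoded first. As a minimal sketch (the city `New York` and the placeholder key `YOUR_API_KEY` are illustrative assumptions, not part of the example above), the standard library's `urlencode` can build the query string:

```python
from urllib.parse import urlencode

base_url = "http://api.openweathermap.org/data/2.5/weather?"

# Hypothetical values, used only to illustrate the encoding
query = {"q": "New York", "appid": "YOUR_API_KEY"}

# urlencode escapes characters that are not allowed in a URL (the space becomes "+")
url = base_url + urlencode(query)
print(url)
```

requests can also do this encoding for you if you pass the dictionary as `params=` to `requests.get`, which is the approach used in the YouTube example further down.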


#### Huh, what is this strange text?
As you can see, the result we get is in a particular text format. This format is called JSON (pronounced "Jason"), which is used by most APIs - both internal and public.

JSON (JavaScript Object Notation) is a data interchange format that is easy for humans to read and write and easy for machines to parse and generate. It is primarily used to transmit data between a server and a web application, serving as an alternative to XML, and is widely used for representing structured data and exchanging information in web development.
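A raw JSON string on a single line is hard to read. One way to inspect it, sketched here on a shortened, hand-written sample rather than the real API response, is to parse it and print it again with indentation:

```python
import json

# A shortened sample in the same style as the weather response (hand-written for illustration)
raw = '{"name":"Amsterdam","main":{"temp":285.16,"humidity":70},"weather":[{"main":"Clear"}]}'

# json.loads turns the text into Python dicts, lists, numbers and strings
parsed = json.loads(raw)

# json.dumps with indent turns it back into text, one key per line
pretty = json.dumps(parsed, indent=2)
print(pretty)
```

The indented version makes the nesting visible: `main` is a dictionary inside the top-level dictionary, and `weather` is a list of dictionaries.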


### Parsing JSON

Luckily, JSON is very easy to parse using Python. We may for instance turn it into a dict. We use the json library to do so.

In [48]:
import json

data = json.loads(json_string)

# Now data is a Python dictionary containing the data from the JSON string
main = data['main']
weather = data['weather']
print(f"{city_name:-^30}")
print(f"Temperature: {main['temp']}K")
print(f"Humidity: {main['humidity']}%")
print(f"Weather: {weather[0]['main']}")
print(f"Description: {weather[0]['description']}")

----------Amsterdam-----------
Temperature: 285.16K
Humidity: 70%
Weather: Clear
Description: clear sky


#### Exercise 1: Your turn! Get the forecast!

[Go to Solution](#exercise1)

Now your task is to get the "5 day / 3 hour forecast data" from the API, to figure out how the weather in Amsterdam will be in the coming days. Read the manual!

The goal is to print the date in the following format: 
- On 2023-10-06 12:00:00 the temperature will be 15 C
- On 2023-10-06 15:00:00 the temperature will be  4 C

etc.

There are two extra challenges here. 
First, the datetime is a Unix timestamp (the number of seconds since January 1, 1970, the Unix epoch), which you will need to convert to a readable date.

Second, you will need to convert the temperature from Kelvin to Celsius.

In [2]:
#Some help: a function to convert timestamp to date-time string
from datetime import datetime, timezone

def parse_timestamp(dt):
    # Interpret the Unix timestamp as UTC (utcfromtimestamp() is deprecated since Python 3.12)
    dt_object = datetime.fromtimestamp(dt, tz=timezone.utc)
    formatted_date = dt_object.strftime('%Y-%m-%d %H:%M:%S')
    return formatted_date
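As a hint for the two conversions, here is a minimal sketch that formats a single reading. The helper names `kelvin_to_celsius` and `format_reading` are made up for this illustration, and the two input values are taken from the current-weather response shown earlier, not from the forecast. Finding the right fields in the forecast JSON is still up to you.

```python
from datetime import datetime, timezone

def kelvin_to_celsius(kelvin):
    """Convert a temperature from Kelvin to degrees Celsius."""
    return kelvin - 273.15

def format_reading(timestamp, kelvin):
    """Build one output line from a Unix timestamp and a Kelvin temperature."""
    when = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return f"On {when} the temperature will be {kelvin_to_celsius(kelvin):.0f} C"

# Values copied from the current-weather response above ("dt" and "main" -> "temp")
line = format_reading(1712778839, 285.16)
print(line)
```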

In [3]:
# SOLUTION:
import requests

api_key = "de26752686c975de6a1c38a998f50fec"
city_name = "Amsterdam"
base_url = "http://api.openweathermap.org/data/2.5/forecast?"

# Complete URL for the API call
url = f"{base_url}q={city_name}&appid={api_key}"

response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    j = response.json()
    print(j)
    # As you can see the json contains a list of timestamps and temperatures
    # Loop over the list, and for each entry, parse the timestamp (using the method above)
    # and print the dates and the temperature (tip: Kelvin - 273.15 = Celsius)
    
    #[YOUR CODE HERE]
    
else:
    print("Error: Unable to get data from OpenWeatherMap API! :(")
    print(response)

2024-04-10 12:00:00: 13C
2024-04-10 15:00:00: 13C
2024-04-10 18:00:00: 12C
2024-04-10 21:00:00: 10C
2024-04-11 00:00:00: 11C
2024-04-11 03:00:00: 11C
2024-04-11 06:00:00: 11C
2024-04-11 09:00:00: 11C
2024-04-11 12:00:00: 13C
2024-04-11 15:00:00: 14C
2024-04-11 18:00:00: 13C
2024-04-11 21:00:00: 12C
2024-04-12 00:00:00: 12C
2024-04-12 03:00:00: 12C
2024-04-12 06:00:00: 12C
2024-04-12 09:00:00: 15C
2024-04-12 12:00:00: 17C
2024-04-12 15:00:00: 16C
2024-04-12 18:00:00: 14C
2024-04-12 21:00:00: 13C
2024-04-13 00:00:00: 12C
2024-04-13 03:00:00: 12C
2024-04-13 06:00:00: 12C
2024-04-13 09:00:00: 17C
2024-04-13 12:00:00: 18C
2024-04-13 15:00:00: 19C
2024-04-13 18:00:00: 15C
2024-04-13 21:00:00: 15C
2024-04-14 00:00:00: 13C
2024-04-14 03:00:00: 11C
2024-04-14 06:00:00: 9C
2024-04-14 09:00:00: 10C
2024-04-14 12:00:00: 11C
2024-04-14 15:00:00: 11C
2024-04-14 18:00:00: 8C
2024-04-14 21:00:00: 7C
2024-04-15 00:00:00: 7C
2024-04-15 03:00:00: 7C
2024-04-15 06:00:00: 8C
2024-04-15 09:00:00: 10C


Now you have a sense of how to get data from a simple API!

## 2. Simple webscraping

Let's start by using _requests_ on a normal website instead. It's quite similar! We here use it to fetch the CSS programme website. 

In [6]:
import requests

url = "https://www.uva.nl/en/programmes/bachelors/computational-social-science/computational-social-science.html"

response = requests.get(url)

if response.status_code == 200:
    print("Here is the result:")
    print(f"{response.text[:300]} ...")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")


Here is the result:






<!doctype html>
<html class="no-js" lang="en">
<head>
    <meta charset="utf-8"/>

    <title>Bachelor's Computational Social Science - University of Amsterdam</title>
            <link rel="canonical" href="https://www.uva.nl/en/programmes/bachelors/computational-social-science/computational- ...


As you can see, the result is in HTML: the markup language that the web is built on.

To get the data that we are interested in, we need to parse the HTML. This is core to all scraping.

This is where BeautifulSoup comes in!

BeautifulSoup is a powerful library for parsing HTML.

Let's first install it and load it.


In [None]:
# Install the library if you do not already have it
!pip install beautifulsoup4

In [5]:
#Load the library
from bs4 import BeautifulSoup

### Parsing a simple example website with beautifulsoup
As you may know, HTML is hierarchically structured  - sometimes referred to as an HTML parse tree or the DOM tree. The DOM is a tree data structure that represents the hierarchical structure of an HTML document. Each node in the tree corresponds to an element (or "tag") in the HTML document, and the edges represent the nesting relationships between the elements. The root of the tree is typically the <html> tag, and it has child nodes representing the head and body of the HTML document, and those child nodes, in turn, have their own child nodes representing nested elements within them.

For example, consider a simple HTML document:


In [32]:
simplehtml = '''<html>
    <head>
        <title>My Page</title>
    </head>
    <body>
        <h1 class='mainheader'>Welcome to My Page</h1>
        <p id='theparagraph'>This is a paragraph.</p>
        <p class='paraclass'>This is a second paragraph.</p>
        <div><p>This is a third paragraph, inside a div!</p></div>
    </body>
</html>'''
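To make the tree structure concrete, the following sketch prints the nesting of tags in a small document using only Python's built-in `html.parser` module. The class `TreePrinter`, the function `outline`, and the sample string are made up for this illustration, and the sketch assumes every tag has a matching closing tag (void tags such as `<br>` would throw off the depth count).

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Collects an indented outline of the tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Each opening tag is one node; indent it by its depth in the tree
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag means we move back up one level
        self.depth -= 1

def outline(html):
    parser = TreePrinter()
    parser.feed(html)
    parser.close()
    return parser.lines

sample_html = ("<html><head><title>My Page</title></head>"
               "<body><h1>Welcome</h1><div><p>Nested</p></div></body></html>")
tree_lines = outline(sample_html)
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the DOM tree. BeautifulSoup builds a richer version of the same tree for us.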

Let's try parsing elements of this page!

In [33]:
#This turns the website into a beautifulsoup object that we can then fetch elements from
soup = BeautifulSoup(simplehtml, 'html.parser')


There are many functions in BeautifulSoup, but we will focus on soup.select(), which uses a _CSS selector_ to select elements and data.

This function uses a particular type of strings for selecting elements, and returns a list of all matching elements (if any).

In [23]:
# This means "select all elements of type 'title'"
alltitles = soup.select('title')
# We then pick the first one; since we know there is only one
firsttitle = alltitles[0]
# And we can then select the text inside it, by getting the attribute text:
print(firsttitle.text)

My Page


In [31]:
# This means "select all elements of type 'p'"
paragraphs = soup.select('p')
# We then loop over the paragraphs and print each one
for paragraph in paragraphs:
    print(paragraph.text)

This is a paragraph.
This is a second paragraph.
This is a third paragraph, inside a div!


In [34]:
# This means "select all elements of type 'p' with class 'paraclass'"
# The dot signifies class names
# We can then select the first element, and the text content, in the same line
soup.select('p.paraclass')[0].text


'This is a second paragraph.'

In [35]:
# This means "select all elements of type 'p' with id 'theparagraph'"
# The # signifies id.
soup.select('p#theparagraph')[0].text


'This is a paragraph.'
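Selectors can also be combined. A space between two selectors means "inside", so `div.box p` matches paragraphs nested anywhere within a `div` of class `box`. The sketch below uses a small made-up document (the class names `intro` and `box` are assumptions for illustration) and also shows `select_one()`, which returns the first match directly, or `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document, used only for this illustration
example_html = """
<body>
  <p class="intro">Top-level paragraph.</p>
  <div class="box"><p>Paragraph inside a div.</p></div>
</body>
"""
example_soup = BeautifulSoup(example_html, "html.parser")

# Descendant selector: only the <p> elements inside <div class="box">
nested = [p.text for p in example_soup.select("div.box p")]

# select_one returns a single element instead of a list...
first_intro = example_soup.select_one("p.intro")

# ...and None if there is no match, which is worth checking before using .text
missing = example_soup.select_one("p.does-not-exist")

print(nested, first_intro.text, missing)
```

Checking for `None` (or for an empty list from `select()`) avoids an error when a page does not contain the element you expect.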

### Exercise 2: Parse a simple website
[Go to Solution](#exercise2)

Your task is to parse the following simple website using beautifulsoup, and extract a dataframe that has the products listed, with their name, description, and price in separate columns.


In [37]:
from IPython.display import display, HTML
website_html = '''<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Simple Website Example</title>
</head>
<body>
<h1>Welcome to Our Simple Website</h1>
<p>This is a demonstration of a simple HTML website designed for parsing practice.</p>
<h2>About Us</h2>
<p>We are a team dedicated to learning web scraping with BeautifulSoup.</p>
<h3>Contact Information</h3>
<p>Email us at: <a href="mailto:info@example.com">info@example.com</a></p>
<h2>Our Products</h2>
<table border="1">
    <tr>
        <th>Product Name</th>
        <th>Description</th>
        <th>Price</th>
    </tr>
    <tr>
        <td>Product 1</td>
        <td>An essential item for beginners.</td>
        <td>$19.99</td>
    </tr>
    <tr>
        <td>Product 2</td>
        <td>A must-have for advanced users.</td>
        <td>$29.99</td>
    </tr>
    <tr>
        <td>Product 3</td>
        <td>Now with bacon-flavor!</td>
        <td>$39.99</td>
    </tr>
</table>

</body>
</html>'''
print("This is how the website looks:")
display(HTML(website_html))

This is how the website looks:


Product Name,Description,Price
Product 1,An essential item for beginners.,$19.99
Product 2,A must-have for advanced users.,$29.99
Product 3,Now with bacon-flavor!,$39.99


In [38]:
# Import pandas
import pandas as pd 

# Parse the HTML
soup = BeautifulSoup(website_html, 'html.parser')

# [YOUR SOLUTION HERE]
#----- SOLUTION -----
# Find the table containing products
product_table = soup.select('table')[0]

# Extract the rows in the table, skipping the header row
rows = product_table.select('tr')[1:]

# Extract the data for each row
products = []
for row in rows:
    #-------
    # Select all tds in the row,
    # and put the first one in a variable called product_name,
    # the second in description, and the third in price
    
    # [YOUR CODE HERE]
    #-------

    products.append([product_name, description, price])

# Convert the data into a DataFrame
df = pd.DataFrame(products, columns=['Product Name', 'Description', 'Price'])
#------ /SOLUTION -----

# Dataframe that has the products
df

Unnamed: 0,Product Name,Description,Price
0,Product 1,An essential item for beginners.,$19.99
1,Product 2,A must-have for advanced users.,$29.99
2,Product 3,Now with bacon-flavor!,$39.99


## A realistic example: Parsing the CSSci website
Let's use this to parse the CSSci website we fetched earlier.

In [40]:
import requests

url = "https://www.uva.nl/en/programmes/bachelors/computational-social-science/computational-social-science.html"

response = requests.get(url)

In [41]:
#Parse the html
soup = BeautifulSoup(response.text, 'html.parser')


So, how do we find the element we want, in a complex HTML website like UvA.nl?

One way to find the CSS selector for the element you are seeking is to use the extremely useful Chrome Developer Tools. Open Chrome. Go to the website. Go to Menu > More Tools > Developer Tools.

You can use the "Select element" tool, represented by a diagonal arrow in the upper left corner of the Developer Tools panel.

Click the element on the page that you are interested in: the main description.

You can now see that the description text is inside a _p_ of class _lead_ which is inside a _div_ with class 'c-programmepageheader'.

We can use the CSS selector to select this element:

In [42]:
description = soup.select('p.lead')

The result is a list with a single element in it: the element we are looking for. To get the text of this element, we simply call get_text():

In [43]:
print(description[0].get_text())

In this digital age, we need people who can solve societal problems using data and technology. That's why this unique programme blends social sciences and humanities with data and computer science. Are you ready to dive into issues like climate change, inequality and the impact of AI? Say goodbye to traditional learning and embrace hands-on projects with real data, to make the world a better place.


### Exercise 3: "Is Computational Social Science right for you?"
[Go to Solution](#exercise3)

Your task is to find the list of five points on the UvA CSSci website that answers the question of whether the CSS programme is right for you.

Use what you've learned to fetch this list, and print each point.


In [13]:
# [YOUR CODE HERE]


Here are the 5 points to decide if CSS is right for you:
0. envision yourself as a future change-maker 
1. are excited to use (advanced) research, data science, and programming skills to understand and solve complex problems 
2. have an interest in understanding the legal, economic, socio-cultural, and political contexts of societal challenges 
3. are proactive in taking responsibility for your learning within a dynamic, project-based curriculum 
4. want to collaborate with stakeholders to tackle real societal problems during your Bachelor's


# Below here is advanced scraping and advanced APIs

## Selenium: Scraping dynamic pages

While _requests_ is a powerful tool for getting static HTML pages, most websites these days are not static HTML. _requests_ cannot execute JavaScript, so it cannot see content that is loaded dynamically.

This is where Selenium comes in. Selenium automates a web browser, allowing it to interact with the JavaScript and dynamically loaded content on the webpage, thereby providing access to content modified or loaded by JavaScript after the initial page load. Selenium can also automate interactions with the website, such as clicking buttons, filling out forms, or navigating through pages. For these points, the requests library alone would be insufficient, as it cannot interact with webpage elements or execute user-like actions.

Selenium in other words runs a complete web browser, and automates clicking on the websites. This allows it to scrape nearly any website. But it also means that it is relatively heavy and slow, compared to a _requests_ based solution. 

*Takeaway: use requests when dealing with static websites or APIs. Use Selenium when dealing with more complex dynamic websites.*


### Installing Selenium

Installing selenium can be a bit of a challenge on its own, as it is dependent on having a chrome/chromium browser installed. Expect a bit of fiddling!

Regardless of the OS, you first need to install the Selenium python package: 





In [None]:
!pip install selenium

!pip install webdriver-manager


In [13]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://www.google.com")

In [42]:
#You should now see a web browser opening on your computer 

In [14]:
#Let's first close the cookie window by clicking the "no thanks" button
button = driver.find_element('id','W0wltc')
button.click()

In [15]:
# Find the search bar using the name of the input field
search_bar = driver.find_element("name", "q")

# Type the search term
search_bar.send_keys("university of amsterdam")

In [18]:
#Click enter!
search_bar.send_keys(Keys.RETURN)

# Wait for some time to let the results load
time.sleep(2) 

#The page will now have made the search!
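A fixed `time.sleep(2)` either wastes time or is too short on a slow connection. A common alternative is to poll until a condition is met or a timeout is reached. The helper below, `wait_until`, is a made-up name that sketches the idea in plain Python; Selenium ships its own version of this pattern as `WebDriverWait` in `selenium.webdriver.support.ui`, which is probably the better choice in real scripts.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns something truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("Condition was not met within the timeout")
        time.sleep(poll_interval)

# Stand-in for a real check such as "are there any search results on the page yet?"
attempts = {"count": 0}

def results_loaded():
    attempts["count"] += 1
    return attempts["count"] >= 3  # pretend the page is ready on the third check

wait_until(results_loaded, timeout=2.0, poll_interval=0.01)
print(f"Ready after {attempts['count']} checks")
```

With a real driver, the condition could for instance be a small function that returns the list from `driver.find_elements(...)`, since an empty list counts as "not ready yet".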

In [100]:
# Locate the titles of the search hits.
results = driver.find_elements(by=By.CSS_SELECTOR,value='a h3') #Select all h3 headings nested inside a link: these are the titles of the search results.

# Extract and print the title and URL of each hit
for result in results:
    if len(result.text)>0:
        #This is how to get the parent element in selenium. We want the <a> to get the URL.
        parent_element = result.find_element(by=By.XPATH, value='..')
        
        print(f"{result.text}. {parent_element.get_attribute('href')}")

-----
To get more results, we can scroll to the bottom of the page and wait for a moment.

In [19]:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Wait for a while to allow contents to load, if any
time.sleep(2)


In [22]:
# Running the same code as above, we now get a larger number of results!
results = driver.find_elements(by=By.CSS_SELECTOR,value='a h3') #Select all h3 headings nested inside a link: these are the titles of the search results.

# Extract and print the hits, numbering them as we go
i=0
for result in results:
    if len(result.text)>0:
        i+=1
        #This is how to get the parent element in selenium. We want the <a> to get the URL.
        parent_element = result.find_element(by=By.XPATH, value='..')
        
        print(f"{i}. {result.text}. {parent_element.get_attribute('href')}")

In [21]:
# Close the browser window
driver.quit()

### Exercise 4: Find the Google ranking of UvA's Computational Social Science. 
[Go to Solution](#exercise4)

Your task is to adapt the code to use selenium to search for 'computational social science', and to find where UvA shows up in the search ranking. 

1. Use Selenium to open google.com. Close the popup, and search for 'computational social science'

2. Your script should keep scrolling in the search results until it finds a result with an HREF that includes 'uva.nl'.

3. It should then print the number in the list of the identified link, and how many pages you had to scroll. For instance, if it is the first link found, your code should output: 'UvA was number 1 link for search result, on page 1 in the Google ranking!'
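One detail to watch: checking whether the string 'uva.nl' appears anywhere in an HREF would also match links that merely mention it, for example in a query string. A possible stricter check, sketched here with a hypothetical helper named `link_belongs_to`, compares the host name of the link instead:

```python
from urllib.parse import urlparse

def link_belongs_to(href, domain):
    """Return True if href points to the given domain or one of its subdomains."""
    if not href:
        return False  # some elements have no href at all
    host = urlparse(href).hostname or ""
    return host == domain or host.endswith("." + domain)

print(link_belongs_to("https://www.uva.nl/en/programmes", "uva.nl"))
print(link_belongs_to("https://example.com/search?q=uva.nl", "uva.nl"))
```

Whether you need this stricter check or a simple substring test is a design choice; for the exercise either can work.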


In [33]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

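# How many pages of search results to look through before giving up
how_many_pages_to_try = 5

# This function fetches the ranking of a website in a search.
# Takes: a search term string, and the URL (or part of a URL) to look for in the results
# Returns: the rank of the first matching link, and the page it was found on, e.g., "return rank, page"
# If it does not find the link in the first how_many_pages_to_try pages, "return None, None"
def find_google_ranking(search_term,url_to_look_for):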
    #-----------
    #[YOUR CODE HERE]
    #-----------

rank, page = find_google_ranking("computational social science","uva.nl")

if rank is None:
    print(f"Uva.nl was not listed in the first {how_many_pages_to_try} pages! :( We need to work on our SEO!")
else:
    print(f"UvA was number {rank} link for search result, on page {page} in the Google ranking!")
    

## More advanced API: How toxic are YouTube comments? Combining YouTube API and Perspective API
In this part of the guide, we will use the YouTube API to collect comments from videos, and the Perspective API to measure how toxic they are.

### About authentication
APIs often require users to sign up and use credentials. These are often based on "API keys", which link a call to the API to a particular user or registered application. There are several reasons for APIs requiring authentication: by requiring credentials, API providers can control access to the data or services they offer, prevent unauthorized access and abuse, and enforce rate limiting, that is, manage the load on the server by restricting the number of API calls from a single user or application within a given time frame.
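Because an API key identifies you, it is good practice to keep it out of notebooks that you share. One common approach, sketched below, is to read the key from an environment variable. The function `load_api_key` and the variable names `YOUTUBE_API_KEY` and `DEMO_API_KEY` are assumptions made for this illustration; no API requires these particular names.

```python
import os

def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Read an API key from an environment variable, failing loudly if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key

# Demo only: pretend the variable was set in the shell before starting Python
os.environ["DEMO_API_KEY"] = "not-a-real-key"
demo_key = load_api_key("DEMO_API_KEY")
print(f"Loaded a key of length {len(demo_key)}")
```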


You can sign up to the YouTube API at https://developers.google.com/youtube/v3 
Read about the process on: https://developers.google.com/youtube/v3/getting-started

Google offers a range of powerful and interesting APIs, both for data collection and analysis. Have a look and browse their offerings.



### Fetching YouTube comments
We will now use the YouTube API to fetch comments associated with a particular YouTube video.

You'll find the API documentation here: https://developers.google.com/youtube/v3/docs/commentThreads 


In [None]:
import requests
import time

# The API key is your key to the YouTube API. You will need to get your own. To do so, visit https://developers.google.com/youtube/v3/getting-started 
api_key = "YOUR_API_KEY"  # Replace this placeholder with your own key
video_id = "dQw4w9WgXcQ"  
# Replace with the ID of the video you are interested in. 
# You can find the ID by going to a video in Youtube, and getting the string after v= in the URL. For instance, i0EfLMe5FGk in https://www.youtube.com/watch?v=i0EfLMe5FGk

url = f"https://www.googleapis.com/youtube/v3/commentThreads"
params = {
    'part': 'snippet',
    'videoId': video_id,
    'maxResults': 100,  # max number of comments to fetch per page (the API allows at most 100)
    'textFormat': 'plainText',
    'key': api_key,
}

all_comments = []

maximum_pages = 3 #How many pages to get at most

for page in range(maximum_pages):
    print(f"Getting page {page}...")
    response = requests.get(url, params=params)
    if response.status_code == 200:
        result_json = response.json()
        all_comments.extend([item['snippet']['topLevelComment']['snippet']['textDisplay'] for item in result_json.get('items', [])])

        # Many APIs provide the result page by page. If there is another page, this API returns a nextPageToken, that we can
        # send to the API to get the next page in line. If there are no more comments, there will be no such token.
        if 'nextPageToken' in result_json:
            params['pageToken'] = result_json['nextPageToken']
            
            # Ensure you don't hit the quota limits by adding a delay
            time.sleep(1)
        else: #No token, so no more pages
            break
    else:
        print("Error: ", response.status_code)
        break

# Now 'all_comments' list contains all the comments from the video
print(f"Done. Fetched {len(all_comments)} comments!")


Getting page 0...
Getting page 1...
Getting page 2...
Getting page 3...
Getting page 4...
Done! Fetched 490 comments!
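The page-by-page pattern can be tried out without an API key. In the sketch below, `fake_fetch_page` and `FAKE_PAGES` are made-up stand-ins for the real API call, and `fetch_all_items` is a hypothetical helper; only the `items` and `nextPageToken` field names are taken from the YouTube example above.

```python
def fetch_all_items(fetch_page, max_pages=5):
    """Collect items page by page until there is no nextPageToken or max_pages is reached."""
    items = []
    token = None  # no token yet: ask for the first page
    for _ in range(max_pages):
        page = fetch_page(token)
        items.extend(page.get("items", []))
        token = page.get("nextPageToken")
        if token is None:  # no token means this was the last page
            break
    return items

# Made-up data standing in for three pages of API results
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c", "d"], "nextPageToken": "p3"},
    "p3": {"items": ["e"]},
}

def fake_fetch_page(token):
    return FAKE_PAGES[token]

all_items = fetch_all_items(fake_fetch_page)
first_two_pages = fetch_all_items(fake_fetch_page, max_pages=2)
print(all_items, first_two_pages)
```

Separating the loop from the function that fetches a single page makes it easy to test the paging logic first and plug in the real request afterwards.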


In [None]:
#Print the first five comments
print(all_comments[:5])

['1 BILLION views for Never Gonna Give You Up!\xa0 Amazing, crazy, wonderful! Rick ♥️', "Greetings from Japan. This melody gives me a strong sense of déjà vu. It feels like something is in the right place. It's a very good song. I definitely love this song and find it nostalgic.", '13 years still watching this masterpiece \U0001fae1\U0001fae1', 'Chính nó:)))', 'I love this song\n\n\n\n\n\nEdit:CAN I GET 178 LIKES?🗣️🔥🔥']


### Perspective Toxicity API

The Perspective API, developed by Jigsaw and Google's Counter Abuse Technology team, is a tool that leverages machine learning to score toxicity in online conversation. The API provides various models to assess different aspects of conversations, like toxicity, severe toxicity, and threat, allowing developers and service providers to automatically moderate content that is harmful, abusive, or likely to drive users away, thus fostering healthier and more respectful online interactions.

Perspective API is an example of an API that can be used to analyze your own data, rather than just fetching existing data.



Perspective API is easiest to use through the Python package offered by Google. Many APIs offer Python packages to make it easier to use the API. APIs offer packages to simplify and streamline the interaction between the end-user's code and the API’s endpoints, abstracting the intricacies of HTTP requests, response handling, and error handling.


In [None]:
#Install the package
!pip install google-api-python-client

In [None]:
from googleapiclient import discovery

In [None]:
PERSPECTIVE_API_KEY = "YOUR_API_KEY"  # Replace this placeholder with your own key

# The text string you want to analyze
message = "I fart in your general direction! Your mother was a hamster and your father smelt of elderberries!"

client = discovery.build(
  "commentanalyzer",
  "v1alpha1",
  developerKey=PERSPECTIVE_API_KEY,
  discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
  static_discovery=False,
)

analyze_request = {
  'comment': { 'text': message },
  'requestedAttributes': {'TOXICITY': {}}
}

# Don't overload the API
time.sleep(0.5)

response = client.comments().analyze(body=analyze_request).execute()

toxicity = response['attributeScores']['TOXICITY']['summaryScore']['value']

print(f"The message scores {toxicity} in toxicity")


The message scores 0.85850734 in toxicity
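If you request more than one attribute, the scores can be pulled out of the response with a small helper. The sketch below assumes that every requested attribute follows the same `attributeScores -> ATTRIBUTE -> summaryScore -> value` layout used for TOXICITY above. The function `extract_scores` is hypothetical and the numbers in `sample_response` are invented for illustration, not real API output.

```python
def extract_scores(response):
    """Return a dict mapping each attribute name to its summary score."""
    scores = {}
    for attribute, details in response.get("attributeScores", {}).items():
        scores[attribute] = details["summaryScore"]["value"]
    return scores

# Hand-written stand-in for an API response (values are invented)
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.86}},
        "THREAT": {"summaryScore": {"value": 0.12}},
    }
}

scores = extract_scores(sample_response)
print(scores)
```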


### Mini-project 1: How toxic are the YouTube comments?
[Go to Solution](#miniproject1)

Your task is to write a script that: 

1. Takes a list of YouTube video IDs and collects the first 100 comments from each video.
2. Calculates the toxicity of each comment on the videos using the Perspective API, and stores the result in a pandas DataFrame.
3. Shows how toxic the comments are on average according to the Perspective API. (Use for instance np.mean() to calculate the average toxicity.)

Select a couple of YouTube videos of your own choice, and use your code to analyze which of them has the most toxic comments. Reflect on the meaning of the findings.


In [None]:

import requests
import time
from googleapiclient import discovery
import pandas as pd

youtube_api_key = "YOUR_KEY_HERE"  # Replace with your own YouTube API key
PERSPECTIVE_API_KEY = "YOUR_KEY_HERE"  # Replace with your own Perspective API key

client = discovery.build(
  "commentanalyzer",
  "v1alpha1",
  developerKey=PERSPECTIVE_API_KEY,
  discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
  static_discovery=False,
)

#This function returns a list of comments (strings) associated to a video on Youtube
def fetch_comments_for_video(video_id, max_comments_to_fetch=100):
    # Use YouTube API to fetch all comments associated to a video
    # You will have to go through several pages. 
    #####-------------
    # [YOUR CODE HERE]
    #####--------
    print(f"Done. Fetched {len(all_comments)} comments!")
    return all_comments

#This function uses fetch_comments_for_video() to collect comments for several videos. 
# It takes a list of video ids and returns a dataframe  with the structure:
# video_id | comment
# 
def fetch_comments_for_videos(list_of_video_ids):
    l = []
    for video_id in list_of_video_ids:
        comments = fetch_comments_for_video(video_id,)
        for comment in comments:
            l.append({'video_id':video_id,'comment': comment})
                 
    return pd.DataFrame(l)

# This measures the toxicity of a single message using the Perspective API
def measure_toxicity_of_message(message):
    
    analyze_request = {
      'comment': { 'text': message },
      'requestedAttributes': {'TOXICITY': {}}
    }

    time.sleep(0.1)

    response = client.comments().analyze(body=analyze_request).execute()

    toxicity = response['attributeScores']['TOXICITY']['summaryScore']['value']

    return toxicity


In [None]:
#Trump's state of the union vs Biden's state of the union
comments = fetch_comments_for_videos(['ATFwMO9CebA','Wl6b5KnpmB4'])

Fetching comments for video ATFwMO9CebA.
Getting page 1...
Getting page 2...
Done. Fetched 199 comments!
Fetching comments for video Wl6b5KnpmB4.
Getting page 1...
Done. Fetched 100 comments!


In [None]:
#Prepare the dataframe
comments['toxicity'] = None
comments['analyzed'] = False    

In [None]:
#This is a simple way of structuring your code when scraping many pages.
i = 0
nrfailed = 0
while True:
    #Select the rows that have not been analyzed yet
    left_to_process = comments.loc[comments['analyzed']==False]
    
    if len(left_to_process)==0:
        print(f"We're done! Analysis failed for {nrfailed} of {len(comments)}.")
        break
    
    else:
        comment = left_to_process.sample(1)
        index = comment.index[0]
        message = comment.comment.values[0]

        #Keep track of progress. Every 10 measures, we print out a progress report
        i+=1
        if i%10==0:
            print(f"{len(comments.loc[comments['analyzed']==False])} comments left out of {len(comments)}...")

        try:
            #Analyze toxicity
            toxicity = measure_toxicity_of_message(message)
            comments.loc[index,'toxicity'] = toxicity

        except Exception as e:
            #The API will fail for many comments, for instance if they are too short or in the wrong language.
            nrfailed+=1
        finally:
            comments.loc[index,'analyzed'] = True

290 comments left out of 299...
280 comments left out of 299...
270 comments left out of 299...
260 comments left out of 299...
250 comments left out of 299...
240 comments left out of 299...
230 comments left out of 299...
220 comments left out of 299...
210 comments left out of 299...
200 comments left out of 299...
190 comments left out of 299...
180 comments left out of 299...
170 comments left out of 299...
160 comments left out of 299...
150 comments left out of 299...
140 comments left out of 299...
130 comments left out of 299...
120 comments left out of 299...
110 comments left out of 299...
100 comments left out of 299...
90 comments left out of 299...
80 comments left out of 299...
70 comments left out of 299...
60 comments left out of 299...
50 comments left out of 299...
40 comments left out of 299...
30 comments left out of 299...
20 comments left out of 299...
10 comments left out of 299...
We're done! Analysis failed for 18 of 299.
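The loop above counts every exception as a final failure. Some failures are temporary, though, such as hitting a rate limit, and may succeed if you wait and try again. A minimal sketch of retrying with increasing delays follows. The helper `call_with_retries` and the stand-in function `flaky_measurement` are made up for illustration; this is one possible approach, not part of the Perspective API client.

```python
import time

def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and retry, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and let the caller handle the error
            sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for an API call that fails twice before succeeding
calls = {"count": 0}

def flaky_measurement():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return 0.42

# For the demo we record the delays instead of actually sleeping
recorded_delays = []
result = call_with_retries(flaky_measurement, max_attempts=4, base_delay=0.5,
                           sleep=recorded_delays.append)
print(result, recorded_delays)
```

In the loop above, a call like `measure_toxicity_of_message(message)` could be wrapped as `call_with_retries(lambda: measure_toxicity_of_message(message))`. Comments that fail for permanent reasons, such as an unsupported language, would still end up in the `except` branch after the last attempt.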


In [None]:
comments

Unnamed: 0,video_id,comment,toxicity,analyzed
0,ATFwMO9CebA,We love you president Trump!,0.015205,True
1,ATFwMO9CebA,Take a lesson people! That's what a real presi...,0.045131,True
2,ATFwMO9CebA,🕊️🏡🕊️💝💝💝💝💝💝💝💝💝💝💝💝💝💝💝💝💝💝🦅💝💝💝💝💝💝🌧️💝💝💝💝💝💝💝💝💝💝🦅🦅🦅🦅...,0.042657,True
3,ATFwMO9CebA,This is the real state of the union,0.028149,True
4,ATFwMO9CebA,How great. 2 people who ran for vice president...,0.05849,True
...,...,...,...,...
294,Wl6b5KnpmB4,ខ្ញុំchea den ខ្ញុំមិនទាន់បានលុយទេសូមលោកជូបៃឌិ...,,True
295,Wl6b5KnpmB4,Refund your money if your airplanes delayed. W...,0.034277,True
296,Wl6b5KnpmB4,He slipped up 31:02 and was gonna say came tog...,0.071337,True
297,Wl6b5KnpmB4,Who’s the girl in the yellow dress she looks l...,0.157667,True


In [None]:
#Let's compare the toxicities.
# We look at the mean toxicity for the successfully analyzed comments:
comments.loc[~comments['toxicity'].isna()].groupby(['video_id'])['toxicity'].mean()

video_id
ATFwMO9CebA    0.214887
Wl6b5KnpmB4    0.202312
Name: toxicity, dtype: float64

## Mini-project 2: Does the YouTube algorithm radicalize?
[Go to Solution](#miniproject2)

We will now go through a more complex exercise, set up like a small research project.

Researchers have argued that the YouTube autoplay feature can lead to radicalization. The platform's recommendation system is designed to keep users engaged for as long as possible. The algorithm achieves this by suggesting content that it predicts the user will find interesting or compelling, based on their viewing history, search terms, and other interactions. 

However, critics argue that this approach can create a "filter bubble," where users are only exposed to content and perspectives similar to those they have already encountered, thereby reinforcing existing beliefs and opinions. There are concerns that this can lead to the incremental presentation of more extreme content, as users are gradually exposed to increasingly radical viewpoints in a bid to sustain engagement. This phenomenon, sometimes referred to as "algorithmic radicalization," has sparked debates about the ethical responsibilities of social media and content-sharing platforms and their role in the spread of misinformation, hate speech, and extremist ideologies. 

If this hypothesis holds, we would expect the comments on videos to become more and more toxic the further we follow the algorithm's recommendations!

In this exercise, we are going to explore this hypothesis by tracing the YouTube autoplay feature from a given starting point.

You will use the code we developed in our previous practical to fetch the comments on the video and evaluate their average toxicity. In case you did not complete that task, we will give you the solution here.

While the YouTube API gives access to some features, the "next video" feature is not accessible through the API. For this, we therefore need to scrape the interface.

#### Task: 

1. Choose a video to start from. This might for instance be a political video, where you would expect a radicalization loop to take place.

2. Write code to repeatedly go to the "next upcoming video" and store the number of steps taken, the video id and the title. (Remember to pause between each fetch, so the page has time to load.)

3. Store at least 10 steps of "next video", so that a trend can be spotted.

4. Use your code from the previous practical, where you collected comments on YouTube videos using the API, and calculated the toxicity of each comment.

5. For each video, calculate the average toxicity of the comments. 

6. Plot the trend: are the comments becoming more toxic? Do your findings fit with the YouTube autoplay radicalization hypothesis?


#### Code for analyzing how toxic text is
We will use the Perspective API to measure toxicity. It's a machine learning API that scores how uncivil a social media message is. 

Make sure that you go through this code and understand it!
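Before reading the full code, it can help to look at how a toxicity score is pulled out of a response. The sketch below uses the same nested keys as `measure_toxicity_of_message()` further down, but works on a hand-written dictionary instead of calling the API. The helper name `extract_toxicity` and the example value are our own assumptions, purely for illustration.

```python
def extract_toxicity(response):
    """Return the TOXICITY summary score from a response dict, or None if it is missing."""
    try:
        return response['attributeScores']['TOXICITY']['summaryScore']['value']
    except (KeyError, TypeError):
        # The expected keys are not there (or the response is not a dict at all)
        return None

# Hand-written stand-in for a response. The number is made up, not a real API result.
example_response = {
    'attributeScores': {
        'TOXICITY': {'summaryScore': {'value': 0.12}}
    }
}

print(extract_toxicity(example_response))
print(extract_toxicity({}))
```

Returning `None` for an unexpected response is one possible design choice. The code below instead lets the error propagate and counts the failure in the calling loop.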

In [68]:
import requests
import time
from googleapiclient import discovery
import pandas as pd

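api_key = "YOUR YOUTUBE API KEY HERE"  # YouTube Data API key, used in fetch_comments_for_video()
PERSPECTIVE_API_KEY = "YOUR PERSPECTIVE API KEY HERE"  # Perspective API key, used to build the client below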

client = discovery.build(
  "commentanalyzer",
  "v1alpha1",
  developerKey=PERSPECTIVE_API_KEY,
  discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
  static_discovery=False,
)

#This function returns a list of comments (strings) associated to a video on Youtube
def fetch_comments_for_video(video_id, max_comments_to_fetch=100):

    print(f"Fetching comments for video {video_id}.")
    
    url = f"https://www.googleapis.com/youtube/v3/commentThreads"
    params = {
        'part': 'snippet',
        'videoId': video_id,
        'maxResults': 100,  
        'textFormat': 'plainText',
        'key': api_key,
    }

    all_comments = []
    page = 0
    while(True):
        page+=1
        
        if len(all_comments)>=max_comments_to_fetch:
            break
        
        print(f"Getting page {page}...")
        response = requests.get(url, params=params)
        if response.status_code == 200:
            result_json = response.json()
            all_comments.extend([item['snippet']['topLevelComment']['snippet']['textDisplay'] for item in result_json.get('items', [])])

            # Many APIs provide the result page by page. If there is another page, this API returns a nextPageToken, that we can
            # send to the API to get the next page in line. If there are no more comments, there will be no such token.
            if 'nextPageToken' in result_json:
                params['pageToken'] = result_json['nextPageToken']

                # Ensure you don't hit the quota limits by adding a delay
                time.sleep(1)
            else: #No token means no more pages, so we're done
                break
        else:
            print("Error: ", response.status_code)
            break

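    # Now 'all_comments' contains the comments from the video.
    # The last page may overshoot the requested number, so we trim the list.
    all_comments = all_comments[:max_comments_to_fetch]
    print(f"Done. Fetched {len(all_comments)} comments!")
    return all_comments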

# This measures the toxicity of a single message using the Perspective API
def measure_toxicity_of_message(message):
    
    analyze_request = {
      'comment': { 'text': message },
      'requestedAttributes': {'TOXICITY': {}}
    }

    time.sleep(0.1)

    response = client.comments().analyze(body=analyze_request).execute()

    toxicity = response['attributeScores']['TOXICITY']['summaryScore']['value']

    return toxicity


# We first adapt our fetch_comments_for_videos() from last week so that it preserves our additional information about the video.
# This function now takes a list of dicts (such as the one produced by follow_next_video()), and fetches the comments for each video.
# It produces a dataframe where each row is a comment, and the video information is included:
# [{'step':0,'video_id': 'ATFwMO9CebA', 'title': 'President Trump 2018 State of the Union Address (C-SPAN)', 'comment':'Great speech!' }...]
def fetch_comments_for_videos(list_of_videos,max_comments_to_fetch=100):
    list_of_comments = []
    for video in list_of_videos:
        comments = fetch_comments_for_video(video['video_id'],max_comments_to_fetch)
        for comment in comments:
            list_of_comments.append(video | {'comment': comment}) #Add comment information to video information (merging dicts with | needs Python 3.9+)
                 
    return pd.DataFrame(list_of_comments)

#This function takes a dataframe with comments, and analyzes each comment using perspective.
# It returns an updated dataframe with toxicity information for each comment.
def analyze_toxicity_of_comments(comments):
    #Prepare the dataframe
    comments['toxicity'] = None
    comments['analyzed'] = False    
    #This is a simple way of structuring your code when scraping many pages.
    i = 0
    nrfailed = 0
    while(True):    
        #Fetch a random row
        left_to_process = comments.loc[comments['analyzed']==False]

        if len(left_to_process)==0:
            print(f"We're done! Analysis failed for {nrfailed} of {len(comments)}.")
            break

        else:
            comment = left_to_process.sample(1)
            index = comment.index[0]
            message = comment.comment.values[0]

            #Keep track of progress. Every 10 measures, we print out a progress report
            i+=1
            if i%10==0:
                print(f"{len(comments.loc[comments['analyzed']==False])} comments left out of {len(comments)}...")

            try:
                #Analyze toxicity
                toxicity = measure_toxicity_of_message(message)
                comments.loc[index,'toxicity'] = toxicity

            except Exception as e:
                #The API will fail for many comments, for instance if they are too short or in the wrong language.
                nrfailed+=1
            finally:
                comments.loc[index,'analyzed'] = True

    return comments
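The pagination loop in `fetch_comments_for_video()` is a pattern you will meet in many APIs: ask for a page, collect its items, and follow the token to the next page until there is none. Here is a minimal sketch of that pattern in isolation. It does not contact any API: `FAKE_PAGES`, `fake_fetch_page` and `fetch_all` are hypothetical names, and the "pages" are made-up placeholders.

```python
# Hypothetical stand-in for an API: three "pages", looked up by page token.
FAKE_PAGES = {
    None: {'items': ['a', 'b'], 'nextPageToken': 'p2'},
    'p2': {'items': ['c', 'd'], 'nextPageToken': 'p3'},
    'p3': {'items': ['e']},  # last page: no nextPageToken
}

def fake_fetch_page(page_token=None):
    """Pretend to call the API and return one page of results."""
    return FAKE_PAGES[page_token]

def fetch_all(max_items):
    """Follow nextPageToken until we run out of pages or have max_items items."""
    items = []
    token = None
    while len(items) < max_items:
        page = fake_fetch_page(token)
        items.extend(page.get('items', []))
        token = page.get('nextPageToken')
        if token is None:
            break  # no token means no more pages
    return items[:max_items]  # the last page may overshoot, so we trim

print(fetch_all(3))
print(fetch_all(100))
```

With a real API you would also pause between requests, as the code above does with `time.sleep(1)`.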

In [70]:
# In the previous exercise, we used this to compare the state of the union speeches of Trump and Biden:

#Fetch the comments: 
comments = fetch_comments_for_videos([{'president':'Trump','video_id':'ATFwMO9CebA'},{'president':'Biden','video_id':'Wl6b5KnpmB4'}],max_comments_to_fetch=50)

#Analyze comments:
comments = analyze_toxicity_of_comments(comments)

#Calculate average toxicity:
print("Average video comment toxicity:")
comments.loc[~comments['toxicity'].isna()].groupby(['president'])['toxicity'].mean()

# Who was more toxic? 

### Additional snippets of code to help you 

In [102]:
# Go to the initial video URL. You can modify this to your preferred video
driver.get('https://www.youtube.com/watch?v=dQw4w9WgXcQ')  

In [119]:
# Close cookie popup
buttons = driver.find_elements(by=By.CSS_SELECTOR,value='button')

for button in buttons:
    if 'Reject' in button.text:
        button.click()

In [133]:
#Get the title and id of the current video
# Take what comes after 'v=' and cut off any extra URL parameters (e.g. '&t=42s')
video_id = driver.current_url.split('v=')[1].split('&')[0]
title = driver.find_element(by=By.CSS_SELECTOR,value='div#title h1')
print(f"Id: {video_id}. Title: {title.text}")

Id: PvHGl3L0WUI. Title: Karine Jean-Pierre has no answer for the crisis at the border.
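Splitting the URL by hand works, but it breaks easily when the URL carries extra parameters. A sketch of a sturdier alternative uses Python's standard `urllib.parse` module to read the `v` parameter directly. The helper name `extract_video_id` is our own, and we assume the video id always sits in the `v` query parameter, as in the URLs used in this notebook.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the value of the 'v' query parameter, or None if the URL has none."""
    query = parse_qs(urlparse(url).query)
    values = query.get('v')
    return values[0] if values else None

# Extra parameters after the id no longer get in the way
print(extract_video_id('https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=42s'))
```

In the scraping code you could then call `extract_video_id(driver.current_url)`.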


In [142]:
# Click next video
nextvid = driver.find_element(by=By.CSS_SELECTOR,value='ytd-compact-video-renderer.ytd-watch-next-secondary-results-renderer a')
nextvid.click()
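After clicking, the page needs time to load. A fixed `time.sleep()` is the simplest option, but it is either too long or too short. An alternative is to check repeatedly until the thing you are waiting for shows up. Selenium ships its own tool for this (`WebDriverWait`). The sketch below is a plain Python version of the same idea, so that you can see what is going on. The helper `wait_until` and the toy condition are hypothetical and only for illustration.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Returns that truthy value, or None if timeout seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)

# Toy condition that only becomes true on the third check
attempts = {'n': 0}
def ready_on_third_try():
    attempts['n'] += 1
    return attempts['n'] >= 3

print(wait_until(ready_on_third_try, timeout=5, poll_interval=0.01))

# With Selenium (assuming the driver and By from this notebook), a condition could be:
#   lambda: driver.find_elements(By.CSS_SELECTOR, 'div#title h1')
# find_elements() returns an empty list until the title element exists.
```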

### Your code goes here:

In [None]:
# This function takes a video_id to start with, and then takes nr_steps of "next video" from that video.
# It returns a list of dicts, each containing the step number, the video_id, and the title of the video
#  e.g., [{'step':0,'video_id': 'ATFwMO9CebA', 'title': 'President Trump 2018 State of the Union Address (C-SPAN)' }...]

def follow_next_video(start_video_id,nr_steps):
    #[YOUR CODE HERE]
    pass

# This should result in the next 10 steps from the Biden state of the union video
list_of_videos = follow_next_video('Wl6b5KnpmB4',10)



In [67]:
list_of_videos
#Should look more or less like this:
# [{'step': 0,
#   'video_id': 'Wl6b5KnpmB4',
#   'title': 'President Joe Biden delivers 2023 State of the Union address to Congress — 2/7/23'},
#  {'step': 1,
#   'video_id': 'FtzvOZNyXdw',
#   'title': "Rise, fall of Sam Bankman-Fried, FTX at center of Michael Lewis' new book | 60 Minutes"},
#  {'step': 2,
#   'video_id': 'XqwGt69pDXQ',
#   'title': 'The Collapse Of FTX: Insiders Tell All | CNBC Documentary'},
#  {'step': 3,
#   'video_id': 'gqDCrdZVZnk',
#   'title': 'The world’s most dangerous arms dealer | DW Documentary'},
#  {'step': 4, 'video_id': 'G1p6rlDCxq0', 'title': 'World War One (ALL PARTS)'},
# ...

## What can you observe about the videos? 
# For instance, does the recommendation algorithm get stuck in a cycle between two or three videos? 
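One quick way to check for cycles is to count how often each video id appears in the path. Here is a minimal sketch using `collections.Counter`. The helper name `find_repeated_videos` is our own, and the example path uses placeholder ids, not real scraping results.

```python
from collections import Counter

def find_repeated_videos(list_of_videos):
    """Return {video_id: count} for every video_id that occurs more than once."""
    counts = Counter(video['video_id'] for video in list_of_videos)
    return {video_id: n for video_id, n in counts.items() if n > 1}

# Made-up path for illustration: the ids are placeholders
example_path = [
    {'step': 0, 'video_id': 'A'},
    {'step': 1, 'video_id': 'B'},
    {'step': 2, 'video_id': 'C'},
    {'step': 3, 'video_id': 'B'},
    {'step': 4, 'video_id': 'C'},
]

print(find_repeated_videos(example_path))
```

An empty result means that no video was visited twice.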

In [64]:
#Now we're going to use our old code to analyze the data!

#Fetch comments for the videos
comments = fetch_comments_for_videos(list_of_videos,max_comments_to_fetch=50)

In [65]:
#Calculate toxicity of each comment. This will take a while!
comments = analyze_toxicity_of_comments(comments)

In [57]:
#Let's plot the average toxicity at each step. Is there a clear trend?
comments.loc[~comments['toxicity'].isna()].groupby(['step'])['toxicity'].mean().plot()
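A trend can be hard to judge by eye. One rough way to summarise it is to fit a straight line through the average toxicity per step and look at its slope. The sketch below does this with `numpy.polyfit`. The function name `toxicity_trend` is our own, and the toy dataframe contains made-up numbers, not results. It assumes a dataframe with a `step` and a `toxicity` column, like the one produced above.

```python
import numpy as np
import pandas as pd

def toxicity_trend(comments):
    """Return (mean toxicity per step, slope of a least-squares line through those means)."""
    # Keep only the comments that were scored, and make sure the scores are numbers
    scored = comments.dropna(subset=['toxicity']).copy()
    scored['toxicity'] = scored['toxicity'].astype(float)

    per_step = scored.groupby('step')['toxicity'].mean()

    # Fit a line: a positive slope means toxicity rises with each step
    slope, intercept = np.polyfit(per_step.index.to_numpy(dtype=float),
                                  per_step.to_numpy(dtype=float), deg=1)
    return per_step, slope

# Toy data, made up for illustration only
toy_comments = pd.DataFrame({
    'step':     [0,   0,   1,   1,    2,   2],
    'toxicity': [0.1, 0.3, 0.3, None, 0.4, 0.4],
})

per_step, slope = toxicity_trend(toy_comments)
print(per_step)
print(f"Slope per step: {slope:.3f}")
```

With only about ten steps and one starting video, the slope is a description of your sample rather than a test of the hypothesis.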

# Solutions

### Exercise 1: Weather API
<a id='exercise1'></a>

In [None]:
# SOLUTION:
import requests

api_key = "de26752686c975de6a1c38a998f50fec"
city_name = "Amsterdam"
base_url = "http://api.openweathermap.org/data/2.5/forecast?"

# Complete URL for the API call
url = f"{base_url}q={city_name}&appid={api_key}"

response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    j = response.json()
    print(j)
    # As you can see the json contains a list of timestamps and temperatures
    # Loop over the list, and for each entry, parse the timestamp (using the method above)
    # and print the dates and the temperature (tip: Kelvin - 273 = Celsius)
    for l in j['list']:
        print(f"{parse_timestamp(l['dt'])}: {int(l['main']['temp']-273)}C") # Note that we need to turn the temperature into a number!
else:
    print("Error: Unable to get data from OpenWeatherMap API! :(")
    print(response)
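The solution relies on `parse_timestamp()`, which was defined earlier in the notebook, and on the rule of thumb Kelvin - 273 = Celsius. For reference, here is a standalone sketch of both conversions. The names `format_unix_timestamp` and `kelvin_to_celsius` are hypothetical, and we assume the `dt` field is a Unix timestamp in seconds.

```python
from datetime import datetime, timezone

def kelvin_to_celsius(kelvin):
    """Convert a temperature from Kelvin to Celsius (the exact offset is 273.15)."""
    return kelvin - 273.15

def format_unix_timestamp(ts):
    """Format a Unix timestamp (seconds since 1970-01-01 UTC) as 'YYYY-MM-DD HH:MM' in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d %H:%M')

print(format_unix_timestamp(0), f"{kelvin_to_celsius(283.15):.1f}C")
```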

### Exercise 2: Parsing HTML 

In [None]:
cols = row.select('td')
product_name = cols[0].text
description = cols[1].text
price = cols[2].text

### Exercise 3: Parsing CSS website

In [None]:
points = soup.select('div.richtext ul li')
print(f"Here are the {len(points)} points to decide if CSS is right for you:")
for i,point in enumerate(points):
    print(f"{i}. {point.text}")


### Exercise 4: Selenium

In [None]:

# SOLUTION
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

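# This function finds where a given URL ranks in Google's results for a search term.
# Takes: a search term string, and a URL (or part of one) to look for.
# Returns: the position of the first matching link and the page it was found on, or (None, None) if it was not found.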
def find_google_ranking(search_term,url_to_look_for):
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get("https://www.google.com")

    time.sleep(5) 

    #Let's first close the cookie window by clicking the "no thanks" button
    button = driver.find_element('id','W0wltc')
    button.click()

    time.sleep(3) 

    # Find the search bar using the name of the input field
    search_bar = driver.find_element("name", "q")

    # Type the search term
    search_bar.send_keys(search_term)

    # Press Enter to submit the search
    search_bar.send_keys(Keys.RETURN)

    # Wait for some time to let the results load
    time.sleep(2) 

    how_many_pages_to_try=5
    #The page will now have made the search
    # We go over pages by scrolling down
    for page in range(how_many_pages_to_try):

        # Get search results
        results = driver.find_elements(by=By.CSS_SELECTOR,value='a h3') 

        #Extract hits
        for result_nr,result in enumerate(results):
            if len(result.text)>0:

                parent_element = result.find_element(by=By.XPATH, value='..')

                #Does it contain the URL to uva?
                if parent_element.get_attribute('href') is not None and url_to_look_for in parent_element.get_attribute('href'):
                    #We found it!
                    driver.quit()
                    return result_nr+1, page+1

        #Scroll to next page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2) # Wait for a while to allow contents to load

    driver.quit()
    return None,None

rank, page = find_google_ranking("computational social science","uva.nl")

if rank is None:
    # Note: how_many_pages_to_try is local to find_google_ranking(), so it cannot be used out here
    print("Uva.nl was not listed in the pages we searched! :( We need to work on our SEO!")
else:
    print(f"UvA was link number {rank} in the search results, on page {page} of the Google ranking!")
    

### Mini project 1: How toxic are YouTube comments?
<a id='miniproject1'></a>

In [None]:
# SOLUTION

import requests
import time
from googleapiclient import discovery
import pandas as pd

youtube_api_key = [YOUR KEY HERE]
PERSPECTIVE_API_KEY = [YOUR KEY HERE]

client = discovery.build(
  "commentanalyzer",
  "v1alpha1",
  developerKey=PERSPECTIVE_API_KEY,
  discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
  static_discovery=False,
)

#This function returns a list of comments (strings) associated to a video on Youtube
def fetch_comments_for_video(video_id, max_comments_to_fetch=100):

    print(f"Fetching comments for video {video_id}.")
    
    url = f"https://www.googleapis.com/youtube/v3/commentThreads"
    params = {
        'part': 'snippet',
        'videoId': video_id,
        'maxResults': min(max_comments_to_fetch, 100),  # The API returns at most 100 comments per page
        'textFormat': 'plainText',
        'key': youtube_api_key,
    }

    all_comments = []
    page = 0
    while(True):
        page+=1
        
        if len(all_comments)>=max_comments_to_fetch:
            break
        
        print(f"Getting page {page}...")
        response = requests.get(url, params=params)
        if response.status_code == 200:
            result_json = response.json()
            all_comments.extend([item['snippet']['topLevelComment']['snippet']['textDisplay'] for item in result_json.get('items', [])])

            # Many APIs provide the result page by page. If there is another page, this API returns a nextPageToken, that we can
            # send to the API to get the next page in line. If there are no more comments, there will be no such token.
            if 'nextPageToken' in result_json:
                params['pageToken'] = result_json['nextPageToken']

                # Ensure you don't hit the quota limits by adding a delay
                time.sleep(1)
            else: #No token means no more pages, so we're done
                break
        else:
            print("Error: ", response.status_code)
            break

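    # Now 'all_comments' contains the comments from the video.
    # The last page may overshoot the requested number, so we trim the list.
    all_comments = all_comments[:max_comments_to_fetch]
    print(f"Done. Fetched {len(all_comments)} comments!")
    return all_comments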

#This function uses fetch_comments_for_video() to collect comments for several videos. 
# It takes a list of video ids and returns a dataframe  with the structure:
# video_id | comment
# 
def fetch_comments_for_videos(list_of_video_ids):
    l = []
    for video_id in list_of_video_ids:
        comments = fetch_comments_for_video(video_id,)
        for comment in comments:
            l.append({'video_id':video_id,'comment': comment})
                 
    return pd.DataFrame(l)

# This measures the toxicity of a single message using the Perspective API
def measure_toxicity_of_message(message):
    
    analyze_request = {
      'comment': { 'text': message },
      'requestedAttributes': {'TOXICITY': {}}
    }

    time.sleep(0.1)

    response = client.comments().analyze(body=analyze_request).execute()

    toxicity = response['attributeScores']['TOXICITY']['summaryScore']['value']

    return toxicity


### Mini project 2: Does YouTube Radicalize?
<a id='miniproject2'></a>

In [None]:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

import pandas as pd

# This function takes a video_id to start with, and then takes nr_steps of "next video" from that video.
# It returns a list of dicts, each containing the step number, the video_id, and the title of the video
#  e.g., [{'step':0,'video_id': 'ATFwMO9CebA', 'title': 'President Trump 2018 State of the Union Address (C-SPAN)' }...]
def follow_next_video(start_video_id,nr_steps):
    l = []
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get(f'https://www.youtube.com/watch?v={start_video_id}')  
    time.sleep(8)

    #Close popup
    buttons = driver.find_elements(by=By.CSS_SELECTOR,value='button')
    for button in buttons:
        if 'Reject' in button.text:
            button.click()
    
    time.sleep(2)
    
    #Get information for first video
    # Take what comes after 'v=' and cut off any extra URL parameters
    video_id = driver.current_url.split('v=')[1].split('&')[0]
    title = driver.find_element(by=By.CSS_SELECTOR,value='div#title h1')    
    l.append({'step':0, 'video_id': video_id, 'title': title.text})

    # Click next video repeatedly
    for step in range(nr_steps):
        # Click next video
        nextvid = driver.find_element(by=By.CSS_SELECTOR,value='ytd-compact-video-renderer.ytd-watch-next-secondary-results-renderer a')
        nextvid.click()
        time.sleep(5)
        
        #Fetch video_id and video title
        # Take what comes after 'v=' and cut off any extra URL parameters
        video_id = driver.current_url.split('v=')[1].split('&')[0]
        title = driver.find_element(by=By.CSS_SELECTOR,value='div#title h1')    
        l.append({'step':step+1, 'video_id': video_id, 'title': title.text})

    driver.quit()
    
    return l

list_of_videos = follow_next_video('Wl6b5KnpmB4',10)