# Web Scraping with Beautiful Soup

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>

### Learning Objectives
1. [Understanding how websites work (HTML basics)](#h1)
2. [Getting webpage content with Python](#h2)
3. [Extracting specific information](#h3)
4. [Collecting data from tables](#h4)
5. [Building a simple dataset for analysis](#h5)

### Why Wikipedia?
- It's free and open to use
- Has lots of data relevant to the social sciences
- Well-structured HTML that's easy to learn from
- No login or API keys required!

### Installing Our Tools

We need two main tools:
- **requests**: This lets Python fetch webpages (like opening a website in your browser)
- **beautifulsoup4**: This helps us read and understand the webpage content

In [None]:
# Install the packages we need (only need to run once)
%pip install requests beautifulsoup4

In [2]:
# Import the tools we'll use
import requests
from bs4 import BeautifulSoup
import time  # We'll use this to be polite to Wikipedia's servers

<a id='h1'></a>

## Understanding HTML

Websites are written in HTML, which uses **tags** to organize content. Think of tags like containers:

- `<p>` = paragraph of text
- `<h1>` = big heading
- `<table>` = data table
- `<a>` = link to another page

Let's look at a simple example:

In [2]:
# Here's what HTML looks like
simple_html = """
<html>
  <body>
    <h1>Welcome to Web Scraping</h1>
    <p>This is a paragraph with some text.</p>
    <p>This is another paragraph!</p>
  </body>
</html>
"""

# Let's parse it with BeautifulSoup
soup = BeautifulSoup(simple_html, 'html.parser')

# Find the heading
heading = soup.find('h1')
print("Heading text:", heading.text)

# Find all paragraphs
paragraphs = soup.find_all('p')
for p in paragraphs:
    print("Paragraph:", p.text)

Heading text: Welcome to Web Scraping
Paragraph: This is a paragraph with some text.
Paragraph: This is another paragraph!
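
The same approach works for the `<a>` tag from the list above. The sketch below uses a small made-up snippet (the HTML and the variable names are assumptions for illustration) to show how a link's text and its `href` attribute can be read:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet written for this example
link_html = """
<p>Read about <a href="/wiki/HTML">HTML</a> and
<a href="/wiki/Web_scraping">web scraping</a>.</p>
"""

link_soup = BeautifulSoup(link_html, 'html.parser')

# Each <a> tag stores its destination in the href attribute
links = [(a.text, a.get('href')) for a in link_soup.find_all('a')]
for text, href in links:
    print(f"{text} -> {href}")
```

Using `a.get('href')` rather than `a['href']` returns `None` instead of raising an error when a link has no `href`.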


## Your First Wikipedia Scrape

Let's start with something simple: getting basic information from a Wikipedia page about a city.

We'll try the page for **New York City** first, then switch to **Berkeley, California** once our request goes through.

⚠️ **Troubleshooting Tip:** If you get an error like `'NoneType' object has no attribute 'text'`, it means BeautifulSoup couldn't find the element you were looking for. This can happen if:
- The webpage structure changed
- The request was blocked or failed, so you parsed an error page instead of the article
- The class or id name is different

Always check if an element exists before trying to access its properties!
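
One way to apply this advice is a small helper that falls back to a default when `find()` returns `None`. This is a minimal sketch; the helper name `text_or_default` and the sample HTML are assumptions made for this example:

```python
from bs4 import BeautifulSoup

# Tiny hypothetical page: it has a heading but no infobox table
demo_soup = BeautifulSoup("<h1 class='firstHeading'>Example page</h1>", 'html.parser')

def text_or_default(tag, default="(not found)"):
    """Return the tag's stripped text, or a default when find() returned None."""
    return tag.get_text(strip=True) if tag is not None else default

title = text_or_default(demo_soup.find('h1', class_='firstHeading'))
missing = text_or_default(demo_soup.find('table', class_='infobox'))

print(title)
print(missing)
```

The second lookup finds nothing, so the helper returns the default instead of raising the `NoneType` error.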

In [11]:
# Step 1: Get the webpage (first attempt, without any headers)
url = "https://en.wikipedia.org/wiki/New_York_City"
response = requests.get(url)
response

<Response [403]>

Here's a list of common HTTP status codes:

- `200` OK → Everything worked, page content is in `.text` or `.content`.
- `301 / 302` Redirect → The page moved to a new URL; requests follows redirects automatically.
- `403` Forbidden → Server blocked the request (often missing headers like User-Agent).
- `404` Not Found → The page doesn’t exist (or the URL pattern is wrong).
- `429` Too Many Requests → You’re being rate-limited; slow down or add delays.
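
A scraper can react to these codes in code. The sketch below retries only on `429`, waiting longer after each attempt. The function name `fetch_with_backoff`, its parameters, and the `FakeResponse` stand-in are assumptions made for this example so that it runs without a network connection:

```python
import time

def fetch_with_backoff(get, url, max_tries=3, base_delay=1.0, sleep=time.sleep):
    """Call get(url); on a 429 response wait and retry, doubling the delay each time."""
    response = None
    for attempt in range(max_tries):
        response = get(url)
        if response.status_code != 429:
            break  # success or a different error: retrying will not help
        if attempt < max_tries - 1:
            sleep(base_delay * 2 ** attempt)
    return response

# Stand-in for a real response object, used only for this demonstration
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

fake_codes = iter([429, 429, 200])
waits = []
result = fetch_with_backoff(
    lambda url: FakeResponse(next(fake_codes)),
    "https://example.org/page",
    sleep=waits.append,  # record the delays instead of really sleeping
)
print(result.status_code, waits)
```

With the real library, `get` could be something like `lambda u: requests.get(u, headers=headers)`.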

We got a `403`. Let's look at the response body to see why!

In [12]:
response.text  # The response body explains why the request was refused

'Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.\n'

### What's a User-Agent?

A **user-agent** is like introducing yourself when you knock on someone's door. It tells the website:
- Who you are (what browser/program you're using)
- That you're a legitimate visitor, not a malicious bot

Wikipedia requires this because they want to:
- Prevent abuse from bots
- Know who's accessing their site
- Ensure people follow their guidelines

Without a user-agent, Wikipedia will refuse your request (403 error)!

In [51]:
# Set up headers to identify ourselves to Wikipedia
# This is required - Wikipedia blocks requests without a user-agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
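
As an alternative to a browser-style string, a descriptive user-agent that names your tool and gives a contact address is generally considered more polite for scripts. The sketch below is only an illustration: the tool name and email address are placeholders, and `requests.Session` is used so the header is attached to every request made through it.

```python
import requests

# Hypothetical tool name and contact address: replace with your own details
polite_headers = {
    'User-Agent': 'ScrapingWorkshopBot/0.1 (contact: you@example.org) python-requests'
}

# A Session remembers headers, so they do not have to be passed on every call
session = requests.Session()
session.headers.update(polite_headers)

print(session.headers['User-Agent'])
```

After this setup, `session.get(url)` sends the same header automatically.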

In [None]:
# Step 1: Get the webpage (with headers this time!)
url = "https://en.wikipedia.org/wiki/Berkeley,_California"
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    print("Successfully connected to Wikipedia!")
else:
    print(f"Error: Got status code {response.status_code}")

Successfully connected to Wikipedia!


Next, get the title. It's an `h1` element with the class name `firstHeading`:

In [None]:
# Step 2: Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Step 3: Get the page title
soup.find('h1', class_='firstHeading')

The first paragraph of a Wikipedia article usually contains a great summary!

In [42]:
# Get the first paragraph with text (on this page index 0 is empty, so we take index 1)
paragraphs = soup.find_all('p')
paragraphs[1].get_text()

'Berkeley (/ˈbɜːrkli/ BURK-lee) is a city on the eastern shore of San Francisco Bay in northern Alameda County, California, United States. It is named after the 18th-century Anglo-Irish bishop and philosopher George Berkeley. It borders the cities of Oakland and Emeryville to the south and the city of Albany and the unincorporated community of Kensington to the north. Its eastern border with Contra Costa County generally follows the ridge of the Berkeley Hills. The 2020 census recorded a population of 124,321.\n'
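
Hard-coding an index like `1` breaks as soon as a page has a different number of empty paragraphs. A more robust sketch is to loop until a paragraph with text turns up. The sample HTML and the function name `first_nonempty_paragraph` are assumptions for this example:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two empty paragraphs come before the real text
para_html = """
<p></p>
<p>   </p>
<p>Berkeley is a city on the eastern shore of San Francisco Bay.</p>
<p>A later paragraph.</p>
"""
para_soup = BeautifulSoup(para_html, 'html.parser')

def first_nonempty_paragraph(soup):
    """Return the text of the first <p> that contains non-whitespace text, or None."""
    for p in soup.find_all('p'):
        text = p.get_text(strip=True)
        if text:
            return text
    return None

summary = first_nonempty_paragraph(para_soup)
print(summary)
```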

### HTML Terminology (Cheat Sheet)

- **Tag**: The basic building block in HTML.  
  Example: `<table> ... </table>`

- **Element**: A tag **plus** its content.  
  Example: `<p>Hello</p>` is a paragraph element.

- **Attribute**: Extra information about a tag, written inside the opening tag.  
  Example: `<table class="wikitable">` → `class="wikitable"` is an attribute.

- **Class**: A *name* inside the `class` attribute.  
  - An element can have **many class names**, separated by spaces.  
  Example: `<table class="wikitable mw-collapsible">`  
  → Classes = `wikitable`, `mw-collapsible`

- **ID**: A unique identifier for one element.  
  Example: `<div id="main"> ... </div>`
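
These terms map directly onto what BeautifulSoup exposes. The sketch below uses a made-up one-line document (an assumption for illustration) to show where the tag name, attributes, classes, and ID end up:

```python
from bs4 import BeautifulSoup

# Hypothetical markup combining the cheat sheet examples
cheat_html = '<div id="main"><table class="wikitable mw-collapsible"><tr><td>42</td></tr></table></div>'
cheat_soup = BeautifulSoup(cheat_html, 'html.parser')

table_tag = cheat_soup.find('table')
print(table_tag.name)       # the tag name
print(table_tag.attrs)      # all attributes as a dictionary
print(table_tag['class'])   # class names come back as a list

main_div = cheat_soup.find(id='main')  # look an element up by its ID
print(main_div.name)
```

Note that `class` is returned as a list because one element can carry several class names.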


### 🥊 Practice: Find Other Elements

Now it's your turn! The Berkeley Wikipedia page has lots of interesting information. Let's practice finding different HTML elements.

**Your Challenge:** 
1. Look for the **infobox** on the right side of the page (it contains quick facts)
2. Try to extract the **Climate data** of Berkeley
3. Find any **image captions** on the page
4. Experiment with finding other elements!

**Hints:**
- The infobox usually has a class like `infobox` or `infobox geography`
- Look for `<th>` (table header) and `<td>` (table data) tags within the infobox
- Image captions are often in `<div>` tags with class `thumbcaption`

In [48]:
# Your practice code here!
# Remember: we already have 'soup' from the Berkeley Wikipedia page above

# 1. Get the infobox
infobox = soup.find('table', class_='infobox')
# Now try to find specific information within it

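# 2. Find the climate data table (a separate table, not part of the infobox)
# Note: a class_ string containing a space only matches that exact class attribute value
soup.find('table', class_='wikitable collapsible')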

# 3. Try to find image captions
# Hint: soup.find_all('div', class_='thumbcaption')

# 4. Experiment with finding other elements!
# What about links? soup.find_all('a')[:10]  # First 10 links
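
When an element has several classes, the way you search matters. The sketch below compares three lookups on a made-up table (the HTML is an assumption for illustration): a single class name, a space-separated string, and a CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical table carrying three class names
class_html = '<table class="wikitable mw-collapsible collapsible"><tr><td>x</td></tr></table>'
class_soup = BeautifulSoup(class_html, 'html.parser')

# A single class name matches any element that has that class
by_one_class = class_soup.find('table', class_='wikitable')

# A string with a space must equal the whole class attribute, so this finds nothing
by_exact_string = class_soup.find('table', class_='wikitable collapsible')

# A CSS selector matches elements that have both classes, in any order
by_selector = class_soup.select_one('table.wikitable.collapsible')

print(by_one_class is not None, by_exact_string is None, by_selector is not None)
```

If a multi-class search comes back empty, trying a single class name or `select_one()` is usually the quickest fix.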

## Extracting Data from Tables

Wikipedia has lots of tables with useful data! Let's look at a page with demographic information.

We'll scrape the **List of U.S. states and territories by population**.

In [4]:
# Get the page with US state populations
url = "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

if response.status_code == 200:
    print("Page loaded successfully!")
else:
    print(f"Error: Got status code {response.status_code}")

Page loaded successfully!


In [13]:
# Find the first table with the class 'wikitable' (usually the main data table)
table = soup.find('table', class_='wikitable')

# Let's look at the table headers first
headers_table = []
for th in table.find_all('th')[:10]:  # Just first 10 headers
    headers_table.append(th.text.strip())

print("Table headers:")
for i, header in enumerate(headers_table):
    print(f"{i}: {header}")

Table headers:
0: State or territory
1: Census population[8][9][a]
2: Change,2010–2020[9][a]
3: House seats[b]
4: Pop. perelec. vote(2020)[c]
5: Pop.perseat(2020)[a]
6: % US(2020)
7: % EC(2020)
8: July 1, 2024 (est.)
9: April 1, 2020
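
The header text above still carries footnote markers such as `[8][9][a]`. One possible cleanup, sketched here with a regular expression, removes anything in square brackets. The function name `strip_footnotes` is an assumption for this example, and the sample strings are copied from the output above:

```python
import re

raw_headers = [
    "Census population[8][9][a]",
    "Change,2010–2020[9][a]",
    "House seats[b]",
    "% US(2020)",
]

def strip_footnotes(text):
    """Remove bracketed footnote markers like [8] or [a] from a string."""
    return re.sub(r'\[[^\]]*\]', '', text).strip()

clean_headers = [strip_footnotes(h) for h in raw_headers]
print(clean_headers)
```

This also removes legitimate bracketed text, so check a few results by eye before relying on it.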


### Extracting State Population Data

Now let's get the actual data - we'll collect state names and their populations.

In [9]:
# Create lists to store our data
states = []
populations = []

# Find all rows in the table
rows = table.find_all('tr')[2:12]  # Skip header rows, get first 10 states

for row in rows:
    # Get all cells in this row
    cells = row.find_all(['td', 'th'])
    
    if len(cells) > 3:  # Make sure we have enough cells
        # State name is in the first cell of each row
        state_name = cells[0].text.strip()
        # The second cell holds the first population column
        population = cells[1].text.strip()
        
        states.append(state_name)
        populations.append(population)

# Display what we collected
print("Top 10 US States by Population:")
print("-" * 40)
for state, pop in zip(states, populations):
    print(f"{state}: {pop}")

Top 10 US States by Population:
----------------------------------------
California: 39,431,263
Texas: 31,290,831
Florida: 23,372,215
New York: 19,867,248
Pennsylvania: 13,078,751
Illinois: 12,710,158
Ohio: 11,883,304
Georgia: 11,180,878
North Carolina: 11,046,024
Michigan: 10,140,459
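
The populations above are still strings with commas, so they cannot be summed or sorted numerically yet. A minimal sketch of the conversion, using three values copied from the output (the function name `to_int` is an assumption for this example):

```python
# Values copied from the scraped output above
raw_populations = {
    "California": "39,431,263",
    "Texas": "31,290,831",
    "Florida": "23,372,215",
}

def to_int(number_text):
    """Convert a string like '39,431,263' to the integer 39431263."""
    return int(number_text.replace(',', ''))

population_numbers = {state: to_int(pop) for state, pop in raw_populations.items()}

# With real numbers we can compare values
largest = max(population_numbers, key=population_numbers.get)
print(largest, population_numbers[largest])
```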


## Building a Dataset - World Leaders

Let's create a more complex example that would be useful for political science research.
We'll scrape information about current world leaders!

In [15]:
# Get the list of current heads of state
url = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

if response.status_code == 200:
    print("Page loaded!")
else:
    print(f"Error: Got status code {response.status_code}")

Page loaded!


In [16]:
# Find tables with leader information
tables = soup.find_all('table', class_='wikitable')

# Let's focus on the first table we find
if tables:
    first_table = tables[0]
    
    # Extract first 5 rows as examples
    leaders_data = []
    rows = first_table.find_all('tr')[1:6]  # Skip header
    
    for row in rows:
        cells = row.find_all(['td', 'th'])
        if len(cells) >= 2:
            # Try to extract country and leader name
            country = cells[0].text.strip()
            leader = cells[1].text.strip()  # Safe: we already checked len(cells) >= 2
            
            # Clean up the text (remove extra whitespace and references like [1])
            country = country.split('[')[0].strip()
            leader = leader.split('[')[0].strip()
            
            leaders_data.append({
                'country': country,
                'leader': leader
            })
    
    print("Sample of World Leaders:")
    print("-" * 50)
    for data in leaders_data:
        print(f"Country: {data['country'][:30]}")
        print(f"Leader: {data['leader'][:50]}")
        print("-" * 50)

Sample of World Leaders:
--------------------------------------------------
Country: Afghanistan
Leader: Supreme Leader – Hibatullah Akhundzada
--------------------------------------------------
Country: Albania
Leader: President – Bajram Begaj
--------------------------------------------------
Country: Algeria
Leader: President – Abdelmadjid Tebboune
--------------------------------------------------
Country: Andorra
Leader: Episcopal Co-Prince – Josep-Lluís Serrano Pentinat
--------------------------------------------------
Country: Angola
Leader: President – João Lourenço
--------------------------------------------------
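
Each leader string above combines a role and a name, separated by an en dash. For analysis it can help to store them in separate fields. A small sketch, assuming the separator is always `' – '` as in the output above (the function name `split_role_and_name` is made up for this example):

```python
# Strings copied from the scraped output above
leader_cells = [
    "Supreme Leader – Hibatullah Akhundzada",
    "President – Bajram Begaj",
    "President – João Lourenço",
]

def split_role_and_name(cell):
    """Split 'Role – Name' into (role, name); return (None, cell) if there is no separator."""
    role, separator, name = cell.partition(' – ')
    if not separator:
        return None, cell.strip()
    return role.strip(), name.strip()

leader_pairs = [split_role_and_name(cell) for cell in leader_cells]
for role, name in leader_pairs:
    print(f"{name} ({role})")
```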


## A Practical Example - Historical Events Timeline

Let's scrape events from a specific year.
We'll look at major events from 2020.

In [18]:
# Get the 2020 events page
url = "https://en.wikipedia.org/wiki/2020"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
# Find the Events heading and grab the first list that follows it
events_h2 = soup.find('h2', id='Events')
first_ul = events_h2.parent.find_next_sibling('ul')

# Get the first 5 top-level items (a date with several events counts as one item)
for li in first_ul.find_all('li', recursive=False)[:5]:
    print(li.text)

January 1
Croatia begins its term in the presidency of the European Union.[6]
Flash floods struck Jakarta, Indonesia, killing 66 people in the worst flooding in over a decade.[7]
January 2 – The Royal Australian Air Force and Navy are deployed to New South Wales and Victoria to assist mass evacuation efforts amidst the 2019–20 Australian bushfire season.[8][9]
January 3 –  A United States drone strike at Baghdad International Airport kills ten people, including the intended target, an Iranian general Qasem Soleimani and Iraqi paramilitary leader Abu Mahdi al-Muhandis.[10]
January 5
Second Libyan Civil War: President Recep Tayyip Erdoğan announces the deployment of Turkish troops to Libya on behalf of the United Nations-backed Government of National Accord.[11]
2019–20 Croatian presidential election: The second round of voting is held, and Zoran Milanović of the Social Democratic Party of Croatia defeats incumbent president Kolinda Grabar-Kitarović.[12]
January 8
Iran launches ballistic

## Saving Your Data

Once you've scraped data, you'll want to save it for analysis!
Let's save the 2020 events we scraped to a CSV file.

In [46]:
import csv

# Extract events
data = []
for li in first_ul.find_all('li', recursive=False)[:20]:
    text = li.text.split('[')[0]
    if ' – ' in text:
        date, event = text.split(' – ', 1)
        data.append([date, event])

# preview
data[0]

['January 2',
 'The Royal Australian Air Force and Navy are deployed to New South Wales and Victoria to assist mass evacuation efforts amidst the 2019–20 Australian bushfire season.']
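
The loop above keeps only items that contain `' – '`, so dates with several events in a nested list (such as January 1) are skipped. The sketch below shows one way to keep them. It runs on simplified, hypothetical markup modeled on the printed output, and it assumes the date appears on its own line before the nested list, so the real page may need adjustments:

```python
import re
from bs4 import BeautifulSoup

# Simplified, hypothetical markup modeled on the output above
events_html = """
<ul>
  <li>January 1
    <ul>
      <li>Croatia begins its term in the presidency of the European Union.[6]</li>
      <li>Flash floods struck Jakarta, Indonesia.[7]</li>
    </ul>
  </li>
  <li>January 2 – The Royal Australian Air Force and Navy are deployed.[8][9]</li>
</ul>
"""

def strip_refs(text):
    """Remove numeric reference markers like [6] and surrounding whitespace."""
    return re.sub(r'\[\d+\]', '', text).strip()

def extract_events(ul):
    """Return [date, event] rows, expanding dates that hold a nested list of events."""
    rows = []
    for li in ul.find_all('li', recursive=False):
        nested = li.find('ul')
        if nested is not None:
            # The date is the first line of text, before the nested list
            date = li.get_text().strip().split('\n', 1)[0].strip()
            for sub in nested.find_all('li', recursive=False):
                rows.append([date, strip_refs(sub.get_text())])
        else:
            text = strip_refs(li.get_text())
            if ' – ' in text:
                date, event = text.split(' – ', 1)
                rows.append([date.strip(), event.strip()])
    return rows

demo_ul = BeautifulSoup(events_html, 'html.parser').find('ul')
event_rows = extract_events(demo_ul)
for row in event_rows:
    print(row)
```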

In [None]:
# Write CSV
with open('events.csv', 'w', newline='', encoding='utf-8') as f:  # utf-8 keeps non-ASCII characters intact
    writer = csv.writer(f)
    writer.writerow(['Date', 'Event'])
    writer.writerows(data)
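
When the scraped data is a list of dictionaries, like `leaders_data` earlier, `csv.DictWriter` can write it directly. The sketch below writes two rows copied from the earlier output to a temporary file and reads them back to confirm nothing was lost. The helper names and the temporary path are assumptions for this example:

```python
import csv
import os
import tempfile

# Two rows copied from the world leaders output earlier in the notebook
leaders = [
    {'country': 'Albania', 'leader': 'President – Bajram Begaj'},
    {'country': 'Angola', 'leader': 'President – João Lourenço'},
]

def save_dicts(path, rows, fieldnames):
    """Write a list of dictionaries to a CSV file with a header row."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

def load_dicts(path):
    """Read a CSV file back into a list of dictionaries."""
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))

# Write to a temporary folder so the example does not clutter the notebook directory
csv_path = os.path.join(tempfile.mkdtemp(), 'leaders.csv')
save_dicts(csv_path, leaders, ['country', 'leader'])
loaded = load_dicts(csv_path)
print(loaded)
```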

## Important: Web Scraping Ethics and Best Practices

### Always Remember:

1. **Check the website's rules**: Look for a `robots.txt` file (e.g., https://en.wikipedia.org/robots.txt)
2. **Be polite**: Don't make too many requests too quickly
3. **Give credit**: Always cite Wikipedia as your data source
4. **Respect copyrights**: Wikipedia text is under a Creative Commons Attribution-ShareAlike license, which requires attribution
5. **Use APIs when available**: Wikipedia has an API that's better for large-scale data collection
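
Python's standard library can read `robots.txt` rules for you. The sketch below parses a small set of made-up rules (the rules, the bot name, and the `example.org` URLs are all assumptions for illustration) and checks whether two URLs may be fetched:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, written for this example only
sample_rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

allowed = parser.can_fetch('WorkshopBot', 'https://example.org/wiki/Sociology')
blocked = parser.can_fetch('WorkshopBot', 'https://example.org/private/data')
delay = parser.crawl_delay('WorkshopBot')

print(allowed, blocked, delay)
```

For a real site, `parser.set_url(...)` followed by `parser.read()` downloads the live file instead of parsing a string.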

### Being a Good Web Scraper:

In [None]:
urls = [
    "https://en.wikipedia.org/wiki/Sociology",
    "https://en.wikipedia.org/wiki/Psychology",
    "https://en.wikipedia.org/wiki/Political_science"
]

for url in urls:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Wikipedia uses h1 for title
    title = soup.find('h1').text
    
    print(f"Scraped: {title}")
    time.sleep(1)  # Be polite!

print("\nDone!")

Scraped: Sociology
Scraped: Psychology
Scraped: Political science

Done!


### Exercise 1: Build a Dataset of Universities
Scrape the list of Ivy League universities and their founding years.

In [None]:
# Your code here!
# Start with: url = "https://en.wikipedia.org/wiki/Ivy_League"


### Exercise 2: Create a Timeline
Pick a historical event page and extract a timeline of key dates.

In [None]:
# Your code here!
# Try: "https://en.wikipedia.org/wiki/COVID-19_pandemic"
# or: "https://en.wikipedia.org/wiki/Arab_Spring"


## Conclusion and Next Steps

🎉 **Congratulations!** You've learned the basics of web scraping!

### What You've Learned:
- How to fetch web pages with Python
- How to parse HTML and extract specific information
- How to work with tables and lists
- How to save data for later analysis
- Ethical web scraping practices

### Where to Go From Here:
1. **Practice with different Wikipedia pages** - try scraping data relevant to your research
2. **Learn about APIs** - many sites offer cleaner ways to get data
3. **Explore pandas** - a Python library great for analyzing scraped data
4. **Try more complex scraping** - multiple pages, following links, etc.

### Resources:
- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Wikipedia API](https://www.mediawiki.org/wiki/API:Main_page)
- [Web Scraping Ethics](https://blog.apify.com/is-web-scraping-legal/)

Remember: Web scraping is a powerful tool for social science research. Use it wisely and ethically.