Web Scraping Workshop

Learn the fundamentals of web scraping with Python

Workshop Overview

This workshop teaches web scraping from the ground up, starting with basic HTTP requests and progressing to advanced HTML parsing and data extraction techniques. Students will build a complete Wikipedia scraper that handles real-world challenges like disambiguation pages, search results, and structured data extraction.

Learning Objectives

By the end of this workshop, you will be able to:

Make HTTP requests to APIs and web pages
Parse HTML content using BeautifulSoup
Extract structured data using regular expressions
Handle edge cases like redirects and error responses
Build a complete web scraping application with user interaction

Prerequisites

Basic Python knowledge (functions, loops, conditionals)
Familiarity with command line usage
Text editor or IDE (VS Code recommended)

Setup Instructions

Clone or download this workshop repository

Create a virtual environment:

python3 -m venv .venv
. .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install dependencies:

python3 -m pip install -r requirements.txt
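
Once the install finishes, a quick import check can confirm that the environment is usable. The sketch below is not part of the workshop repository: `missing_packages` is a hypothetical helper, and the package list assumes the two libraries named under Dependencies. Note that `beautifulsoup4` is imported under the name `bs4`.

```python
import importlib.util

# Import names for the workshop dependencies (assumed from the Dependencies list).
# The beautifulsoup4 package is imported as "bs4".
REQUIRED = ["requests", "bs4"]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All workshop dependencies are importable.")
```

If anything is reported missing, check that the virtual environment is activated before re-running the install command.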

Workshop Structure

Examples

1. Basic HTTP Requests (`examples/01_basic_requests.py`)

Making GET requests to APIs
Handling URL parameters
Introduction to raw HTML scraping
Basic error handling and status codes

Run: python3 examples/01_basic_requests.py
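
As a rough illustration of the topics above (not the contents of `examples/01_basic_requests.py`), the sketch below shows how `requests` attaches URL parameters and how errors and status codes might be handled. The helper names `build_url` and `fetch` and the `example.com` address are assumptions made for illustration.

```python
import requests


def build_url(base_url, params):
    """Return the URL requests would send, without touching the network."""
    prepared = requests.Request("GET", base_url, params=params).prepare()
    return prepared.url


def fetch(url, params=None, timeout=10):
    """GET a URL and return the response, or None on any request error."""
    try:
        response = requests.get(url, params=params, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs, and HTTP errors.
        print(f"Request failed: {exc}")
        return None
    return response


if __name__ == "__main__":
    # Parameters are URL-encoded and appended as a query string.
    print(build_url("https://example.com/search", {"q": "python", "page": 2}))
```

Passing parameters as a dictionary avoids hand-building query strings, and setting a `timeout` keeps a slow server from hanging the script.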

2. HTML Parsing with BeautifulSoup (`examples/02_beautifulsoup_basics.py`)

Why regex isn't sufficient for HTML parsing
BeautifulSoup basics: finding elements by tags, classes, and IDs
Extracting text and attributes
Comparing regex vs. proper HTML parsing

Run: python3 examples/02_beautifulsoup_basics.py
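
The comparison between regex and a real parser can be sketched on a small inline document. This is an illustrative example with made-up HTML, not the code in `examples/02_beautifulsoup_basics.py`. It finds elements by tag, class, and ID, extracts text and attributes, and then shows a naive regex missing a link whose attributes are split across lines.

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML for illustration; the second link wraps onto a new line.
HTML = """
<div id="main">
  <h1 class="title">Example Page</h1>
  <p class="intro">First <b>paragraph</b>.</p>
  <a class="ref" href="/wiki/Python">Python</a>
  <a class="ref"
     href="/wiki/Regex">Regex</a>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Find by tag, by class, and by ID.
title = soup.find("h1").get_text(strip=True)
intro = soup.find("p", class_="intro").get_text()
main_div = soup.find(id="main")

# Extract an attribute from every matching element.
links = [a["href"] for a in soup.find_all("a", class_="ref")]

# A naive regex that assumes the attributes sit on one line, one space apart.
naive_links = re.findall(r'<a class="ref" href="([^"]+)">', HTML)

if __name__ == "__main__":
    print(title, "|", intro)
    print("BeautifulSoup links:", links)
    print("Naive regex links:  ", naive_links)
```

The parser returns both links because it works on the document tree, while the regex depends on the exact spacing of the source text and silently drops the second one.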

3. Regular Expressions for Data Formatting (`examples/03_regex_formatting.py`)

Regex fundamentals and syntax
Pattern matching for data extraction
Text cleaning and normalization
Practical regex patterns for common data types

Run: python3 examples/03_regex_formatting.py
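
A few of these ideas can be sketched with the standard `re` module. The functions and patterns below are illustrative assumptions, not the contents of `examples/03_regex_formatting.py`, and the email pattern is deliberately simple rather than standards-complete.

```python
import re

# Simplified email pattern: local part, "@", then dot-separated domain labels.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def clean_text(text):
    """Remove citation markers like [1] and collapse runs of whitespace."""
    text = re.sub(r"\[\d+\]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def to_iso_dates(text):
    """Rewrite DD/MM/YYYY dates as YYYY-MM-DD using capture groups."""
    return re.sub(r"\b(\d{2})/(\d{2})/(\d{4})\b", r"\3-\2-\1", text)


if __name__ == "__main__":
    print(clean_text("Python[1]  was\nreleased in 1991.[2] "))
    print(to_iso_dates("Due 05/11/2024."))
    print(EMAIL.findall("Contact a.b@example.com or x@y.org."))
```

Cleaning the text before matching usually makes the extraction patterns simpler, since they no longer need to account for stray markers or irregular spacing.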

Capstone Project: Wikipedia Scraper

Build a complete Wikipedia article scraper that demonstrates all learned concepts.

Features:

Search Wikipedia articles by topic using Wikipedia's official API
Scrape article content from Wikipedia pages
Parse content using BeautifulSoup
Handle disambiguation pages with user selection
Extract key facts using regex patterns (dates, money, measurements, quotes, locations)
Interactive interface with pagination and user commands
Robust error handling for network issues and edge cases
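
Pagination in an interactive interface usually comes down to slicing a list by page number. The sketch below shows one possible approach; `paginate` is a hypothetical helper, and the page size of 5 is an assumption rather than what the reference solution uses.

```python
def paginate(items, page, page_size=5):
    """Return the items on a 1-based page, plus the total number of pages."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    total_pages = max(1, -(-len(items) // page_size))  # ceiling division
    page = min(max(page, 1), total_pages)  # clamp out-of-range page numbers
    start = (page - 1) * page_size
    return items[start:start + page_size], total_pages


if __name__ == "__main__":
    paragraphs = [f"Paragraph {n}" for n in range(1, 13)]
    current, total = paginate(paragraphs, page=3)
    print(f"Page 3 of {total}: {current}")
```

Clamping the page number means a "next" command on the last page simply shows the last page again instead of raising an error.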

Files:

solutions/capstone/wikipedia_scraper.py - Complete solution for reference
capstone/template.wikipedia_scraper.py - Student template with TODO instructions

Template Structure

The template provides a structured approach to building the scraper:

Basic Functions (Complete First):

get_response() - Make HTTP requests to Wikipedia
get_search_results() - Use Wikipedia API for search
extract_page_paragraphs() - Extract article content using BeautifulSoup
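
To give a feel for what `get_search_results()` has to deal with, here is a hedged sketch of building a search query and reading the response. It assumes the MediaWiki Action API at `https://en.wikipedia.org/w/api.php` with `action=query` and `list=search`; the template may use a different endpoint or function signature, and `build_search_params` and `parse_search_results` are hypothetical helpers.

```python
# Assumed endpoint: the MediaWiki Action API on English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_params(query, limit=10):
    """Build query parameters for a full-text search request."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }


def parse_search_results(payload):
    """Return article titles from a decoded search response (a dict)."""
    hits = payload.get("query", {}).get("search", [])
    return [hit["title"] for hit in hits if "title" in hit]


# With the requests library, the two helpers could be combined as:
#   response = requests.get(API_URL, params=build_search_params("Python"), timeout=10)
#   titles = parse_search_results(response.json())

if __name__ == "__main__":
    # Hand-written sample in the shape the search API returns.
    sample = {"query": {"search": [{"title": "Python (programming language)"},
                                   {"title": "Monty Python"}]}}
    print(parse_search_results(sample))
```

Using `.get()` with defaults means an unexpected or empty response produces an empty list rather than a `KeyError`.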

Advanced Functions (Complete After Basic Works):

is_disambiguation_page() - Detect disambiguation pages
extract_disambiguation_links() - Extract topic links
extract_key_facts() - Extract structured data using regex patterns
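
One possible approach to the two disambiguation functions is sketched below on made-up HTML. It assumes that disambiguation pages carry an element with the ID `disambigbox` or open with a "may refer to" paragraph, which is a heuristic about Wikipedia's markup and may not hold for every page. The signatures are hypothetical and may differ from the template.

```python
from bs4 import BeautifulSoup

# Made-up HTML shaped like a disambiguation page.
SAMPLE = """
<div id="mw-content-text">
  <p><b>Mercury</b> may refer to:</p>
  <ul>
    <li><a href="/wiki/Mercury_(planet)">Mercury (planet)</a>, a planet</li>
    <li><a href="/wiki/Mercury_(element)">Mercury (element)</a>, a metal</li>
    <li><a href="/wiki/Help:Disambiguation">Help page</a></li>
  </ul>
  <div id="disambigbox">This disambiguation page lists articles with similar titles.</div>
</div>
"""


def is_disambiguation_page(soup):
    """Heuristic check: look for the disambiguation box or the usual intro text."""
    if soup.find(id="disambigbox") is not None:
        return True
    first_paragraph = soup.find("p")
    return first_paragraph is not None and "may refer to" in first_paragraph.get_text()


def extract_disambiguation_links(soup):
    """Return (title, href) pairs for the first article link in each list item."""
    links = []
    for item in soup.find_all("li"):
        anchor = item.find("a", href=True)
        if anchor is None:
            continue
        href = anchor["href"]
        # Keep article links only; a colon marks namespaces such as Help: or File:.
        if href.startswith("/wiki/") and ":" not in href:
            links.append((anchor.get_text(strip=True), href))
    return links


if __name__ == "__main__":
    page = BeautifulSoup(SAMPLE, "html.parser")
    print(is_disambiguation_page(page))
    print(extract_disambiguation_links(page))
```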

Getting Started

Work through examples in order:

python3 examples/01_basic_requests.py
python3 examples/02_beautifulsoup_basics.py
python3 examples/03_regex_formatting.py

Start the capstone project:

cp capstone/template.wikipedia_scraper.py capstone/wikipedia_scraper.py
chmod u+x capstone/wikipedia_scraper.py

Test your progress:
```
./capstone/wikipedia_scraper.py
```

Compare with the solution:

./solutions/capstone/wikipedia_scraper.py

🧪 Testing Your Implementation

Basic Functions Test:

After completing get_response(): Try searching any term - should handle HTTP requests properly
After completing get_search_results(): Try searching "Python" - should show search results from Wikipedia API
After completing extract_page_paragraphs(): Try searching "Python (programming language)" - should display article title and first paragraph
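
For the last check, the title and first paragraph have to be pulled out of the page HTML. A minimal sketch follows, assuming the article body lives in an element with the ID `mw-content-text` and the title in the first `h1`; both are assumptions about Wikipedia's markup, and the return shape is hypothetical rather than what the template expects.

```python
from bs4 import BeautifulSoup

# Made-up HTML shaped like a Wikipedia article.
SAMPLE = """
<html><body>
  <h1 id="firstHeading">Python (programming language)</h1>
  <div id="mw-content-text">
    <p class="mw-empty-elt"></p>
    <p>Python is a high-level language.<sup>[1]</sup></p>
    <p>It emphasizes   readability.</p>
  </div>
</body></html>
"""


def extract_page_paragraphs(html):
    """Return (title, paragraphs) with citations removed and whitespace tidied."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")
    title = heading.get_text(strip=True) if heading is not None else ""

    container = soup.find(id="mw-content-text")
    if container is None:
        container = soup  # fall back to the whole document

    paragraphs = []
    for p in container.find_all("p"):
        for sup in p.find_all("sup"):
            sup.decompose()  # drop citation markers such as [1]
        text = " ".join(p.get_text().split())
        if text:  # skip empty placeholder paragraphs
            paragraphs.append(text)
    return title, paragraphs


if __name__ == "__main__":
    page_title, page_paragraphs = extract_page_paragraphs(SAMPLE)
    print(page_title)
    print(page_paragraphs[0])
```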

Advanced Functions Test:

Try searching "Python" - should be detected as a disambiguation page and offer multiple options to choose from
Try any article - should show extracted facts (dates, money, measurements, quotes, locations)
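
The fact extraction can be approached with one compiled pattern per category. The sketch below is an assumption about how `extract_key_facts()` might look, not the reference solution. The patterns are intentionally narrow, and locations are left out because they are hard to capture reliably with regex alone.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Illustrative patterns only; real articles will need broader coverage.
PATTERNS = {
    "dates": re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b"),
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?: (?:million|billion))?"),
    "measurements": re.compile(r"\b\d+(?:\.\d+)? ?(?:km|kg|cm|mm|m)\b"),
    "quotes": re.compile(r'"([^"]{3,})"'),
}


def extract_key_facts(text):
    """Return a dict mapping each category name to the list of matches found."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


if __name__ == "__main__":
    sample = (
        "Founded on 4 July 1976, the firm raised $2.5 million. "
        'The bridge is 1.2 km long. She said "keep it simple" often.'
    )
    for category, matches in extract_key_facts(sample).items():
        print(f"{category}: {matches}")
```

Keeping the patterns in a dictionary makes it easy to add a category later without touching the extraction function.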

Dependencies

requests - HTTP library for making web requests
beautifulsoup4 - HTML parsing and navigation

Tips for Success

Complete examples in order - each builds on the previous
Read the TODO comments carefully - they provide step-by-step guidance
Test frequently - verify each function works before moving on
Use print statements for debugging - see what data you're getting
Check the solution if stuck, but try implementing first
Experiment - try different Wikipedia topics to test edge cases

devsoc-unsw/webscraping-workshop
