Learn the fundamentals of web scraping with Python
This workshop teaches web scraping from the ground up, starting with basic HTTP requests and progressing to advanced HTML parsing and data extraction techniques. Students will build a complete Wikipedia scraper that handles real-world challenges like disambiguation pages, search results, and structured data extraction.
By the end of this workshop, you will be able to:
- Make HTTP requests to APIs and web pages
- Parse HTML content using BeautifulSoup
- Extract structured data using regular expressions
- Handle edge cases like redirects and error responses
- Build a complete web scraping application with user interaction
Prerequisites:
- Basic Python knowledge (functions, loops, conditionals)
- Familiarity with the command line
- A text editor or IDE (VS Code recommended)
Setup:
- Clone or download this workshop repository
- Create and activate a virtual environment:
  python3 -m venv .venv
  . .venv/bin/activate   # On Windows: .venv\Scripts\activate
- Install the dependencies:
  python3 -m pip install -r requirements.txt
Example 1: Basic HTTP Requests
- Making GET requests to APIs
- Handling URL parameters
- Introduction to raw HTML scraping
- Basic error handling and status codes
Run: python3 examples/01_basic_requests.py
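The core pattern from this example looks roughly like the sketch below: a GET request with URL parameters and a status-code check. The endpoint, parameters, and the `fetch()` helper name are only illustrative, not necessarily what the example script uses.

```python
# Minimal sketch: GET request with URL parameters and basic error handling.
import requests

def fetch(url, params=None):
    """Return the response for a GET request, or None if it failed."""
    response = requests.get(url, params=params, timeout=10)
    if response.status_code != 200:
        print(f"Request failed with status {response.status_code}")
        return None
    return response

if __name__ == "__main__":
    resp = fetch(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "list": "search",
                "srsearch": "web scraping", "format": "json"},
    )
    if resp is not None:
        print(resp.json()["query"]["search"][0]["title"])
```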
Example 2: HTML Parsing with BeautifulSoup
- Why regex alone isn't sufficient for HTML parsing
- BeautifulSoup basics: finding elements by tag, class, and ID
- Extracting text and attributes
- Comparing regex with proper HTML parsing
Run: python3 examples/02_beautifulsoup_basics.py
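A small, self-contained illustration of those operations, finding elements by tag, class, and ID and then reading text and attributes; the HTML snippet is invented for demonstration:

```python
# Finding elements by tag, class, and id, then reading text and attributes.
from bs4 import BeautifulSoup

html = """
<div id="content">
  <h1 class="title">Example Article</h1>
  <p class="intro">First paragraph with a <a href="/wiki/Link">link</a>.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.find("h1", class_="title")   # by tag and class
content = soup.find(id="content")         # by id
link = content.find("a")                  # nested tag lookup

print(title.get_text())   # Example Article
print(link["href"])       # /wiki/Link
print(link.get_text())    # link
```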
Example 3: Regular Expressions and Text Formatting
- Regex fundamentals and syntax
- Pattern matching for data extraction
- Text cleaning and normalization
- Practical regex patterns for common data types
Run: python3 examples/03_regex_formatting.py
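As a taste of what this example covers, here is a sketch of matching a couple of common data types and normalizing whitespace; the patterns in the example script may be more thorough:

```python
# Normalizing whitespace and extracting common data types with regex.
import re

text = "Founded   in 1991, the project raised $2.5 million by  March 2003."

clean = re.sub(r"\s+", " ", text).strip()          # collapse runs of whitespace

years = re.findall(r"\b(?:19|20)\d{2}\b", clean)   # four-digit years
money = re.findall(r"\$\d+(?:\.\d+)?(?:\s(?:million|billion))?\b", clean)

print(clean)
print(years)   # ['1991', '2003']
print(money)   # ['$2.5 million']
```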
Capstone project: build a complete Wikipedia article scraper that demonstrates all of the concepts covered above. The finished scraper will (a rough sketch of part of it appears after the feature list):
- Search Wikipedia articles by topic using the official Wikipedia API
- Scrape article content from Wikipedia pages
- Parse content using BeautifulSoup
- Handle disambiguation pages with user selection
- Extract key facts using regex patterns (dates, money, measurements, quotes, locations)
- Interactive interface with pagination and user commands
- Robust error handling for network issues and edge cases
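The sketch below stitches two of those pieces together: fetching an article page and pulling out its first non-empty paragraph. The function name and flow are illustrative rather than the template's exact API; on Wikipedia article pages the body text sits inside the element with id `mw-content-text`.

```python
# Rough sketch: fetch a Wikipedia article page and extract its first paragraph.
import requests
from bs4 import BeautifulSoup

def first_paragraph(title):
    url = "https://en.wikipedia.org/wiki/" + title.replace(" ", "_")
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    body = soup.find(id="mw-content-text")   # main article body on Wikipedia pages
    for p in body.find_all("p"):
        text = p.get_text().strip()
        if text:                             # skip empty spacer paragraphs
            return text
    return None

if __name__ == "__main__":
    print(first_paragraph("Web scraping"))
```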
Files:
- solutions/capstone/wikipedia_scraper.py - complete solution for reference
- capstone/template.wikipedia_scraper.py - student template with TODO instructions
The template provides a structured approach to building the scraper:
Basic Functions (Complete First):
1. get_response() - Make HTTP requests to Wikipedia
2. get_search_results() - Use the Wikipedia API for search
3. extract_page_paragraphs() - Extract article content using BeautifulSoup
Advanced Functions (Complete After Basic Works):
4. is_disambiguation_page() - Detect disambiguation pages
5. extract_disambiguation_links() - Extract topic links
6. extract_key_facts() - Extract structured data using regex patterns
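One possible shape for the two disambiguation helpers, offered only as a hedged sketch: a simple heuristic based on the "may refer to" phrasing that disambiguation pages typically open with, plus a link collector. The real template functions may take different arguments and use a more robust check.

```python
# Hedged sketch of the disambiguation helpers; the template's real signatures
# and detection logic may differ.
from bs4 import BeautifulSoup

def is_disambiguation_page(soup):
    """Heuristic: disambiguation pages usually open with '... may refer to:'."""
    first_p = soup.find("p")
    return bool(first_p and "may refer to" in first_p.get_text())

def extract_disambiguation_links(soup):
    """Collect in-wiki links from list items as candidate topics."""
    candidates = []
    for li in soup.find_all("li"):
        a = li.find("a", href=True)
        if a and a["href"].startswith("/wiki/"):
            candidates.append((a.get_text(), "https://en.wikipedia.org" + a["href"]))
    return candidates
```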
Getting started:
- Work through the examples in order:
  python3 examples/01_basic_requests.py
  python3 examples/02_beautifulsoup_basics.py
  python3 examples/03_regex_formatting.py
- Start the capstone project:
  cp capstone/template.wikipedia_scraper.py capstone/wikipedia_scraper.py
  chmod u+x capstone/wikipedia_scraper.py
- Test your progress:
  ./capstone/wikipedia_scraper.py
- Compare with the solution:
  ./solutions/capstone/wikipedia_scraper.py
Testing your progress:
- After completing get_response(): try searching any term - it should handle HTTP requests properly
- After completing get_search_results(): try searching "Python" - it should show search results from the Wikipedia API
- After completing extract_page_paragraphs(): try searching "Python (programming language)" - it should display the article title and first paragraph
- Try searching: Python (a disambiguation page with multiple options)
- Any article should show extracted key facts (dates, money, measurements, quotes, locations)
Dependencies:
- requests - HTTP library for making web requests
- beautifulsoup4 - HTML parsing and navigation
Tips:
- Complete the examples in order - each builds on the previous one
- Read the TODO comments carefully - they provide step-by-step guidance
- Test frequently - verify each function works before moving on
- Use print statements for debugging - see what data you're getting
- Check the solution if you get stuck, but try implementing it yourself first
- Experiment - try different Wikipedia topics to test edge cases