abbasbill/web_scraper
Building a Wikipedia Web Scraper: A Technical Deep Dive

Introduction

Web scraping is an essential technique for data collection, research, and analysis on the modern web. This article explores a practical implementation of a web scraper designed to traverse Wikipedia articles systematically. The scraper demonstrates key concepts in web scraping, including HTTP requests, HTML parsing, depth-limited recursion, and data persistence.

Architecture Overview

The Wikipedia scraper operates as a depth-first traversal engine, starting from a seed article and following hyperlinks to related articles up to a configurable depth limit. This approach allows for comprehensive data collection while preventing infinite loops and excessive resource consumption.

Core Components

  1. HTTP Request Management: Handles network communication with proper error handling
  2. HTML Parsing: Extracts relevant data from web pages
  3. Link Discovery: Identifies and filters valid Wikipedia article links
  4. Recursive Traversal: Explores linked articles in a controlled manner
  5. Data Persistence: Stores results in CSV format for analysis
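For orientation, the five components combine into a single recursive function. The following is a condensed sketch assembled from the fragments discussed in the sections below, not the verbatim repository code (see scraper.py for the actual implementation):

```python
import random
import requests
from bs4 import BeautifulSoup as bs

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
results = []

def scrapeWikiArticle(url, depth=0, max_depth=5):
    # 4. Recursive traversal: stop once the depth limit is exceeded
    if depth > max_depth:
        print("Reached max depth. Stopping.")
        return
    # 1. HTTP request management with error handling
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return
    # 2. HTML parsing: extract the article title, with a fallback
    soup = bs(response.content, 'html.parser')
    title = soup.find(id='firstHeading') or soup.find('h1')
    # 5. Data collection: record url, title, and depth
    results.append({
        'url': url,
        'title': title.get_text(strip=True) if title else '',
        'depth': depth,
    })
    # 3. Link discovery: follow the first valid article link
    content = soup.find('div', {'class': 'mw-parser-output'}) or soup.find(id="bodyContent")
    if not content:
        return
    allLinks = content.find_all('a')
    random.shuffle(allLinks)
    for link in allLinks:
        href = link.get('href', '')
        if href.startswith("/wiki/") and ":" not in href:
            scrapeWikiArticle("https://en.wikipedia.org" + href, depth + 1, max_depth)
            break
```

Each section that follows examines one of these pieces in detail.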

Technical Implementation

HTTP Headers and User-Agent Handling

```python
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
```

A crucial aspect of web scraping is respecting server expectations. Wikipedia, like many modern websites, expects requests to include a proper User-Agent header, which identifies the client making the request and helps servers differentiate between browsers and automated tools. The scraper sends a browser-style User-Agent string so that Wikipedia's server-side logic treats it like an ordinary browser request.
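The contrast is easy to inspect: without an explicit header, requests identifies itself as the library rather than a browser, which some servers throttle or block.

```python
import requests

# requests' built-in default headers advertise the library itself,
# e.g. 'python-requests/2.31.0' (exact version varies by install)
default_ua = requests.utils.default_headers()['User-Agent']
print(default_ua)
```

Overriding this default with a browser-style string, as the scraper does, is what keeps the responses consistent with what a human visitor would see.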

Request Handling with Error Recovery

```python
try:
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
except requests.RequestException as e:
    print(f"Request failed: {e}")
    return
```

The implementation demonstrates defensive programming practices:

  • Timeout Configuration: A 10-second timeout prevents indefinite hanging on slow connections
  • Exception Handling: Catches network errors gracefully without crashing
  • HTTP Status Validation: The raise_for_status() method ensures successful responses

HTML Parsing with BeautifulSoup

```python
from bs4 import BeautifulSoup as bs

soup = bs(response.content, 'html.parser')
title = soup.find(id='firstHeading')
if not title:
    title = soup.find('h1')
```

The scraper employs a fallback strategy for finding article titles. Wikipedia's primary article heading uses the ID firstHeading, but the parser includes a fallback to generic h1 tags for robustness. This defensive approach handles potential variations in page structure across different Wikipedia versions or edge cases.

Content Area Detection

```python
content = soup.find('div', {'class': 'mw-parser-output'})
if not content:
    content = soup.find(id="bodyContent")
```

Similar to title extraction, the scraper attempts multiple selectors to locate the main content area. The mw-parser-output class represents the current standard for Wikipedia's article content, with bodyContent as a legacy fallback. This dual approach ensures compatibility across different Wikipedia implementations and historical page structures.

Link Filtering Strategy

```python
allLinks = content.find_all('a')
random.shuffle(allLinks)

for link in allLinks:
    if not link.has_attr('href'):
        continue
    href = link['href']
    if href.startswith("/wiki/") and ":" not in href:
        next_url = "https://en.wikipedia.org" + href
        scrapeWikiArticle(next_url, depth+1, max_depth)
        break
```

This section implements intelligent link filtering:

  • Attribute Validation: Checks for the presence of href before accessing it
  • Wikipedia-Specific Filtering:
    • startswith("/wiki/") ensures only article links are followed
    • ":" not in href excludes special pages (Wikipedia namespace, talk pages, templates)
  • Randomization: random.shuffle() introduces variety in traversal paths, preventing predictable crawl patterns
  • Early Termination: The break statement follows only the first valid link per article, limiting breadth and preventing exponential growth
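The two string checks can be isolated into a small predicate to see which hrefs survive the filter (example links chosen for illustration):

```python
def is_article_link(href):
    # Keep only internal article paths; a colon marks a namespaced
    # page such as File:, Special:, Talk:, or Template:
    return href.startswith("/wiki/") and ":" not in href

print(is_article_link("/wiki/Web_scraping"))      # True
print(is_article_link("/wiki/Special:Random"))    # False -- namespaced special page
print(is_article_link("/wiki/File:Example.jpg"))  # False -- media page
print(is_article_link("#cite_note-1"))            # False -- in-page anchor
```

The colon heuristic is coarse but effective: every non-article namespace on English Wikipedia carries a colon-delimited prefix in its URL path.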

Depth-Limited Recursion

```python
def scrapeWikiArticle(url, depth=0, max_depth=5):
    if depth > max_depth:
        print("Reached max depth. Stopping.")
        return
```

A critical design decision is the depth limit. Without it, the scraper could theoretically traverse millions of interconnected articles. The default max_depth=5 provides a reasonable balance between data collection scope and execution time. Each recursive call increments the depth counter, ensuring termination.

Data Collection and Persistence

```python
results.append({'url': url, 'title': title_text, 'depth': depth})
```

The scraper maintains an in-memory list of results, storing three key pieces of information:

  • URL: The full article address for reference and verification
  • Title: The human-readable article name for contextual understanding
  • Depth: The recursion level, useful for analyzing article relationships

CSV Export

```python
with open('scraper_results.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['url', 'title', 'depth']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(results)
```

Results are persisted using Python's csv.DictWriter, providing structured output for analysis. Key implementation details:

  • newline='': Ensures proper line ending handling across platforms (Windows/Unix)
  • encoding='utf-8': Handles special characters in Wikipedia article titles
  • Header Row: Includes field names for tool compatibility
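Reading the file back is symmetric via csv.DictReader. A self-contained round-trip sketch, with one sample row standing in for real scrape output:

```python
import csv

# Sample row standing in for actual scrape results
rows = [{'url': 'https://en.wikipedia.org/wiki/Web_scraping',
         'title': 'Web scraping', 'depth': 0}]

with open('scraper_results.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['url', 'title', 'depth'])
    writer.writeheader()
    writer.writerows(rows)

with open('scraper_results.csv', newline='', encoding='utf-8') as csvfile:
    loaded = list(csv.DictReader(csvfile))

print(loaded[0]['title'])  # Web scraping
print(loaded[0]['depth'])  # '0' -- note: DictReader returns every field as a string
```

One caveat worth noting: CSV has no type information, so the integer depth comes back as the string '0' and must be cast before numeric analysis.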

Performance Considerations

Complexity Analysis

Time Complexity: a depth-first crawl generally costs $O(b^d)$, where $b$ is the branching factor and $d$ is the maximum depth. Because the early break fixes $b = 1$, this scraper visits at most $d + 1$ articles, so runtime is effectively $O(d)$.

Space Complexity: $O(d)$ for the recursion stack plus $O(n)$ for storing results, where $n$ is the total number of articles scraped.

Network Efficiency

The 10-second timeout and sequential processing prevent overwhelming Wikipedia's servers. This respectful approach maintains good standing with the target website's acceptable use policies.

Potential Improvements

  1. Rate Limiting: Add delays between requests to further reduce server load
  2. Caching: Implement memoization to avoid re-scraping identical URLs
  3. Parallel Processing: Use threading or asyncio for concurrent requests
  4. Advanced Filtering: Distinguish between article types or topics
  5. Data Enrichment: Extract article summaries or metadata beyond titles
  6. Configuration: Externalize parameters (max_depth, start_url) to configuration files
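The first two improvements are small enough to sketch directly. The helper below is hypothetical (not part of the repository) and combines a politeness delay with a visited set so repeated URLs are skipped:

```python
import time

visited = set()        # improvement 2: remember URLs already scraped
REQUEST_DELAY = 1.0    # improvement 1: seconds to pause between requests

def should_scrape(url, delay=REQUEST_DELAY):
    """Hypothetical gate to call before each request: returns False for
    duplicates, otherwise records the URL and applies a politeness delay."""
    if url in visited:
        return False
    visited.add(url)
    time.sleep(delay)  # rate limiting before the actual request
    return True

print(should_scrape("https://en.wikipedia.org/wiki/Web_scraping", delay=0))  # True
print(should_scrape("https://en.wikipedia.org/wiki/Web_scraping", delay=0))  # False
```

Wiring such a check into scrapeWikiArticle before the requests.get call would address duplicates and server load at once; the remaining improvements (concurrency, filtering, enrichment, configuration) require more substantial restructuring.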

Conclusion

This Wikipedia scraper exemplifies practical web scraping principles: respectful server interaction, defensive programming, structured data handling, and controlled recursion. While relatively simple, it demonstrates the fundamental techniques required for more sophisticated data collection systems. Understanding these building blocks is essential for developing robust scrapers that balance efficiency, reliability, and ethical considerations.

Code Repository

The complete implementation is available in the scraper.py file, with results exported to scraper_results.csv for further analysis.
