Web scraping is an essential technique for data collection, research, and analysis on the modern web. This article explores a practical implementation of a web scraper designed to traverse Wikipedia articles systematically. The scraper demonstrates key concepts in web scraping, including HTTP requests, HTML parsing, depth-limited recursion, and data persistence.
The Wikipedia scraper operates as a depth-first traversal engine, starting from a seed article and following hyperlinks to related articles up to a configurable depth limit. This approach allows for comprehensive data collection while preventing infinite loops and excessive resource consumption.
- HTTP Request Management: Handles network communication with proper error handling
- HTML Parsing: Extracts relevant data from web pages
- Link Discovery: Identifies and filters valid Wikipedia article links
- Recursive Traversal: Explores linked articles in a controlled manner
- Data Persistence: Stores results in CSV format for analysis
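The snippets that follow share a few imports and module-level names. Below is a minimal sketch of one plausible layout; the exact structure of scraper.py and the seed article in the `__main__` block are assumptions for illustration, not details taken from the original file:

```python
import csv
import random

import requests
from bs4 import BeautifulSoup as bs  # the snippets below refer to BeautifulSoup as `bs`

results = []  # in-memory rows collected during the crawl, written to CSV at the end


def scrapeWikiArticle(url, depth=0, max_depth=5):
    """Fetch one article, record it, and recurse into a single linked article."""


if __name__ == '__main__':
    # Seed article chosen only for illustration.
    scrapeWikiArticle("https://en.wikipedia.org/wiki/Web_scraping")
```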
```python
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
```

A crucial aspect of web scraping is respecting server expectations. Wikipedia, like many modern websites, expects requests to include a proper User-Agent header. This header identifies the client making the request and helps servers differentiate between browsers and automated tools. The scraper sends a browser-style user agent string to ensure compatibility with Wikipedia's server-side logic.
```python
try:
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
except requests.RequestException as e:
    print(f"Request failed: {e}")
    return
```

The implementation demonstrates defensive programming practices:
- Timeout Configuration: A 10-second timeout prevents indefinite hanging on slow connections
- Exception Handling: Catches network errors gracefully without crashing
- HTTP Status Validation: The `raise_for_status()` method ensures successful responses
```python
soup = bs(response.content, 'html.parser')
title = soup.find(id='firstHeading')
if not title:
    title = soup.find('h1')
```

The scraper employs a fallback strategy for finding article titles. Wikipedia's primary article heading uses the ID `firstHeading`, but the parser includes a fallback to generic `h1` tags for robustness. This defensive approach handles potential variations in page structure across different Wikipedia versions or edge cases.
```python
content = soup.find('div', {'class': 'mw-parser-output'})
if not content:
    content = soup.find(id="bodyContent")
```

Similar to title extraction, the scraper attempts multiple selectors to locate the main content area. The `mw-parser-output` class represents the current standard for Wikipedia's article content, with `bodyContent` as a legacy fallback. This dual approach ensures compatibility across different Wikipedia implementations and historical page structures.
```python
allLinks = content.find_all('a')
random.shuffle(allLinks)

for link in allLinks:
    if not link.has_attr('href'):
        continue
    href = link['href']
    if href.startswith("/wiki/") and ":" not in href:
        next_url = "https://en.wikipedia.org" + href
        scrapeWikiArticle(next_url, depth+1, max_depth)
        break
```

This section implements intelligent link filtering:
- Attribute Validation: Checks for the presence of `href` before accessing it
- Wikipedia-Specific Filtering: `startswith("/wiki/")` ensures only article links are followed, while `":" not in href` excludes special pages (Wikipedia namespace, talk pages, templates); both checks are collected into a single predicate in the sketch after this list
- Randomization: `random.shuffle()` introduces variety in traversal paths, preventing predictable crawl patterns
- Early Termination: The `break` statement follows only the first valid link per article, limiting breadth and preventing exponential growth
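To make the filtering rules easier to test in isolation, the two string checks could be pulled into a small helper. This is a refactoring sketch; the name `is_article_link` is not part of the original code:

```python
def is_article_link(href):
    """Return True for internal links to regular Wikipedia articles.

    The ":" check excludes namespaced pages such as "File:", "Category:",
    "Help:", and "Talk:" links, which are not articles.
    """
    return href.startswith("/wiki/") and ":" not in href
```

The loop shown earlier could then test `if is_article_link(href):` without changing behavior.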
```python
def scrapeWikiArticle(url, depth=0, max_depth=5):
    if depth > max_depth:
        print("Reached max depth. Stopping.")
        return
```

A critical design decision is the depth limit. Without it, the scraper could theoretically traverse millions of interconnected articles. The default `max_depth=5` provides a reasonable balance between data collection scope and execution time. Each recursive call increments the depth counter, ensuring termination.
```python
results.append({'url': url, 'title': title_text, 'depth': depth})
```

The scraper maintains an in-memory list of results, storing three key pieces of information:
- URL: The full article address for reference and verification
- Title: The human-readable article name for contextual understanding
- Depth: The recursion level, useful for analyzing article relationships
```python
with open('scraper_results.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['url', 'title', 'depth']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(results)
```

Results are persisted using Python's `csv.DictWriter`, providing structured output for analysis. Key implementation details:
- `newline=''`: Ensures proper line-ending handling across platforms (Windows/Unix)
- `encoding='utf-8'`: Handles special characters in Wikipedia article titles
- Header Row: Includes field names for tool compatibility
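For downstream analysis, the exported file can be read back with the standard library. A minimal sketch; the depth-ordered printout is only an example of what analysis might look like, not part of the scraper itself:

```python
import csv

# Load the exported results back into memory.
with open('scraper_results.csv', newline='', encoding='utf-8') as csvfile:
    rows = list(csv.DictReader(csvfile))

# Example: print the traversal chain in depth order.
for row in sorted(rows, key=lambda r: int(r['depth'])):
    print(f"depth {row['depth']}: {row['title']} -> {row['url']}")
```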
- Time Complexity: Because only one valid link is followed per article, the traversal is a single chain of at most `max_depth + 1` articles, so the number of HTTP requests grows linearly with the depth limit.
- Space Complexity: The results list and the recursion stack each hold one entry per visited article, so memory usage also grows linearly with `max_depth`.
Sequential processing, with only one request in flight at a time, keeps the load on Wikipedia's servers low, while the 10-second timeout prevents the scraper itself from hanging on slow responses. This respectful approach helps the scraper stay within the target website's acceptable use policies.
- Rate Limiting: Add delays between requests to further reduce server load
- Caching: Implement memoization to avoid re-scraping identical URLs (see the sketch after this list, which combines caching with rate limiting)
- Parallel Processing: Use threading or asyncio for concurrent requests
- Advanced Filtering: Distinguish between article types or topics
- Data Enrichment: Extract article summaries or metadata beyond titles
- Configuration: Externalize parameters (max_depth, start_url) to configuration files
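As a starting point for the first two improvements, here is a minimal sketch of how a per-request delay and a visited-URL set could wrap the existing request logic. The names `polite_get`, `REQUEST_DELAY`, and `visited`, as well as the one-second delay, are assumptions for illustration rather than part of the original code:

```python
import time

import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
REQUEST_DELAY = 1.0   # seconds to wait before each request; the value is an assumption
visited = set()       # URLs already fetched during this run


def polite_get(url):
    """Fetch a URL at most once per run, pausing before each request."""
    if url in visited:
        return None                      # caching: skip URLs already scraped
    visited.add(url)
    time.sleep(REQUEST_DELAY)            # rate limiting: spread requests out over time
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response
```

scrapeWikiArticle could then call `polite_get` in place of the direct `requests.get` call, leaving the parsing logic unchanged; a `None` return would simply end that branch of the traversal.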
This Wikipedia scraper exemplifies practical web scraping principles: respectful server interaction, defensive programming, structured data handling, and controlled recursion. While relatively simple, it demonstrates the fundamental techniques required for more sophisticated data collection systems. Understanding these building blocks is essential for developing robust scrapers that balance efficiency, reliability, and ethical considerations.
The complete implementation is available in the scraper.py file, with results exported to scraper_results.csv for further analysis.