A comprehensive web scraping and content extraction toolkit with API support.
The scraper_cleaner project is a Python-based web scraping solution that provides both command-line and API-based interfaces for extracting structured content from websites. It uses advanced libraries like Trafilatura for high-quality article extraction and BeautifulSoup for HTML parsing.
- Trafilatura-based extraction: Uses the powerful Trafilatura library for extracting article content, metadata, and structured data
- Multiple output formats: Supports JSON, Markdown, and plain text outputs
- Comprehensive metadata extraction: Extracts titles, authors, dates, categories, tags, and more
- Robust error handling: Graceful handling of network issues, parsing errors, and edge cases
- FastAPI-based REST API: Provides a modern, high-performance API for programmatic access
- CORS support: Configured for cross-origin requests
- Structured responses: Returns consistent JSON responses with success/failure indicators
- Health monitoring: Built-in health check endpoint
- Interactive scraping: Command-line interface for manual URL scraping
- Batch processing: Support for processing multiple URLs
- Data organization: Automatic file naming and directory structure
scraper_cleaner/
├── api/
│ └── main.py # FastAPI application with REST endpoints
├── artifacts/ # Documentation and status files
├── data/ # Output directory for scraped content
├── main.py # Original scraping script (Scroll.in specific)
├── trafilatura_scraper.py # Core scraping library
├── pyproject.toml # Project dependencies and configuration
├── README.md # This documentation
└── .gitignore # Git ignore rules
trafilatura_scraper.py is the main scraping module and provides:
- scrape_article_with_trafilatura(url): Extracts structured article data and clean text
- slugify(text): Converts text to URL-friendly slugs
- format_article_markdown(data, text): Formats article data as Markdown
- setup_logging(): Configures logging for the application
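The exact signatures and return values are not documented here, so the following is only a rough sketch of how these helpers might be combined; the (data, text) return shape, the dictionary keys, and the data/ output layout are assumptions, not the module's confirmed behavior.

```python
import json
from pathlib import Path

from trafilatura_scraper import (
    format_article_markdown,
    scrape_article_with_trafilatura,
    setup_logging,
    slugify,
)

setup_logging()

url = "https://example.com/article"

# Assumption: the scraper returns structured metadata plus the cleaned article
# text, matching the (data, text) pair that format_article_markdown expects.
data, text = scrape_article_with_trafilatura(url)

# Assumption: output lands in data/ and is named after the slugified title,
# mirroring the "automatic file naming" feature described above.
out_dir = Path("data")
out_dir.mkdir(exist_ok=True)
slug = slugify(data.get("title", "untitled"))

(out_dir / f"{slug}.json").write_text(json.dumps(data, indent=2, ensure_ascii=False))
(out_dir / f"{slug}.md").write_text(format_article_markdown(data, text))
```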
api/main.py is a FastAPI application providing:
- GET /: Root endpoint with service information
- GET /health: Health check endpoint
- POST /scrape: Scrape articles from URLs with configurable options
- POST /batch-scrape: Scrape multiple URLs in a single request
- POST /token: Obtain an access token for authenticated requests
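As an illustration of the request/response contract (documented in full under the API examples below), a /scrape endpoint of this shape could be declared with Pydantic models roughly as follows; the class names and defaults here are hypothetical, not the actual code in api/main.py:

```python
# Illustrative sketch only: field names follow the request/response bodies
# documented below, not the real models in api/main.py.
from typing import Any, Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ScrapeRequest(BaseModel):
    url: str
    include_raw_text: bool = True
    include_metadata: bool = True


class ScrapeResponse(BaseModel):
    success: bool
    data: Optional[dict[str, Any]] = None
    error: Optional[str] = None


@app.post("/scrape", response_model=ScrapeResponse)
def scrape(request: ScrapeRequest) -> ScrapeResponse:
    # The real endpoint delegates to the Trafilatura-based scraper;
    # this stub only shows the request/response contract.
    return ScrapeResponse(success=True, data={"url": request.url}, error=None)
```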
main.py is a specialized scraper for the Scroll.in website with:
- HTML parsing using BeautifulSoup
- Structured data extraction
- Content cleaning and formatting
- Multiple output formats (JSON, Markdown, HTML)
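For orientation, a BeautifulSoup-based extraction pass generally looks like the sketch below; the URL and selectors are placeholders and do not reflect Scroll.in's real markup or the logic in main.py:

```python
# Hypothetical sketch of BeautifulSoup-based extraction; the selectors below
# are placeholders and not taken from main.py.
import json

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/article", timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

article = {
    "title": soup.title.get_text(strip=True) if soup.title else None,
    "headings": [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])],
    "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")],
}

print(json.dumps(article, indent=2, ensure_ascii=False))
```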
# Clone the repository
git clone https://github.com/your-repo/scraper_cleaner.git
cd scraper_cleaner
# Install dependencies
pip install -r requirements.txt
# or using uv
uv pip install -r requirements.txt

# Run the main scraper (interactive)
python trafilatura_scraper.py
# Run the Scroll.in specific scraper
python main.py

# Start the API server
python api/main.py
# The API will be available at http://localhost:8001

POST /scrape
{
"url": "https://example.com/article",
"include_raw_text": true,
"include_metadata": true
}

Response:
{
"success": true,
"data": {
"url": "https://example.com/article",
"title": "Article Title",
"author": "Author Name",
"date": "2023-01-01",
"text": "Article content...",
"raw_text": "Raw article text...",
"metadata": {...}
},
"error": null
}

Basic Scraping Request:
curl -X POST "http://localhost:8001/scrape" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/article",
"include_raw_text": true,
"include_metadata": true
}'

Batch Scraping Request:
curl -X POST "http://localhost:8001/batch-scrape" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://example.com/article1",
"https://example.com/article2"
],
"include_raw_text": true,
"include_metadata": true
}'

Authentication and Token Usage:
# Get authentication token
curl -X POST "http://localhost:8001/token" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "username=testuser&password=testpassword"
# Use token for authenticated requests
curl -X POST "http://localhost:8001/scrape" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
-d '{
"url": "https://example.com/article",
"include_raw_text": true,
"include_metadata": true
}'

Health Check:
curl -X GET "http://localhost:8001/health"

Root Endpoint:
curl -X GET "http://localhost:8001/"
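The same requests can be issued programmatically. The snippet below mirrors the basic curl example using the requests library (already a dependency); the success/error handling follows the response schema shown above:

```python
# Programmatic equivalent of the basic curl example above; endpoint paths and
# fields mirror the documented API.
import requests

BASE_URL = "http://localhost:8001"

payload = {
    "url": "https://example.com/article",
    "include_raw_text": True,
    "include_metadata": True,
}

response = requests.post(f"{BASE_URL}/scrape", json=payload, timeout=60)
response.raise_for_status()

result = response.json()
if result.get("success"):
    print(result["data"]["title"])
else:
    print("Scrape failed:", result.get("error"))
```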
Requirements:
- Python 3.12+
- Trafilatura 2.0.0+
- FastAPI 0.111.0+
- Uvicorn 0.30.1+
- Requests 2.32.5+
- BeautifulSoup 4.14.3+
- Type Safety: Fixed null reference issues in api/main.py where spec could be None
- Error Handling: Added proper null checks for module loading
- Code Quality: Improved type annotations and error messages
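The module-loading fix above amounts to guarding against a missing import spec. A minimal sketch of that kind of check, assuming importlib-based dynamic loading (the real code in api/main.py may differ):

```python
# Sketch of a None-safe dynamic module load; assumes importlib-based loading
# and is not the exact code from api/main.py.
import importlib.util
from types import ModuleType


def load_module(name: str, path: str) -> ModuleType:
    spec = importlib.util.spec_from_file_location(name, path)
    # spec (or spec.loader) can be None for an unresolvable path, so check
    # before use instead of dereferencing blindly.
    if spec is None or spec.loader is None:
        raise ImportError(f"Could not load module {name!r} from {path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```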
# Run tests (parallel execution by default)
.venv/bin/pytest tests/
# Run with hot reload (development)
uvicorn api.main:app --reload --port 8001

The project now includes pytest-xdist for parallel test execution, significantly improving test suite performance.
By default, tests run in parallel using auto-detected CPU cores (10 workers in this environment):
# Run all tests in parallel (default behavior)
.venv/bin/pytest tests/
# Run with a specific number of workers
.venv/bin/pytest tests/ -n 4
# Run tests sequentially (if needed)
.venv/bin/pytest tests/ -n 0

Slow tests are automatically excluded from the default run:
# Run only fast tests (default)
.venv/bin/pytest tests/ -m "not slow"
# Run slow tests separately
.venv/bin/pytest tests/ -m slow
# Run all tests including slow ones
.venv/bin/pytest tests/ -m ""

- Before optimization: ~60 seconds for full test suite
- After parallel execution: ~11 seconds for fast tests (26 tests)
- Slow tests: ~11 seconds for 5 slow tests
- Total improvement: ~80% reduction in test execution time
# Quick development feedback (fast tests only, parallel)
.venv/bin/pytest tests/ -m "not slow"
# Full test suite (parallel)
.venv/bin/pytest tests/ -m ""
# Specific test file
.venv/bin/pytest tests/test_api_integration.py
# With verbose output
.venv/bin/pytest tests/ -v
# Show test durations
.venv/bin/pytest tests/ --durations=10

The pytest configuration is defined in pytest.ini:
[pytest]
markers =
    slow: marks tests as slow (deselect with '-m "not slow"')
    integration: marks integration tests
    unit: marks unit tests
# pytest-xdist configuration for parallel execution
addopts = -n auto -m "not slow"
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*

- Test Isolation: All tests are designed to be parallel-safe with proper mocking and fixtures
- Slow Test Marking: Use the @pytest.mark.slow decorator for tests that take >1 second (illustrated in the sketch after this list)
- Resource Management: Tests avoid shared state and use fixtures for setup/teardown
- Mocking: External dependencies are properly mocked to ensure fast, reliable tests
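A compact illustration of these conventions; the test names and the mocked call are hypothetical and not taken from the project's test suite:

```python
# Hypothetical test sketch showing the slow marker and mocking conventions
# described above; names are illustrative, not from the actual test suite.
import time
from unittest.mock import patch

import pytest
import requests


@pytest.mark.slow
def test_full_scrape_roundtrip():
    # Marked slow: excluded from the default run via -m "not slow" in addopts.
    time.sleep(2)
    assert True


def test_scrape_uses_mocked_network():
    # External HTTP access is mocked so the test stays fast and parallel-safe.
    with patch("requests.get") as mock_get:
        mock_get.return_value.status_code = 200
        mock_get.return_value.text = "<html><title>Stub</title></html>"

        response = requests.get("https://example.com/article", timeout=30)

        assert response.status_code == 200
        assert "Stub" in response.text
```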
This project is licensed under the MIT License.