scraper_cleaner

A comprehensive web scraping and content extraction toolkit with API support.

Overview

The scraper_cleaner project is a Python-based web scraping solution that provides both command-line and API-based interfaces for extracting structured content from websites. It uses Trafilatura for high-quality article extraction and BeautifulSoup for HTML parsing.

Features

1. Core Scraping Functionality

  • Trafilatura-based extraction: Uses the Trafilatura library to extract article content, metadata, and structured data (see the sketch after this list)
  • Multiple output formats: Supports JSON, Markdown, and plain text outputs
  • Comprehensive metadata extraction: Extracts titles, authors, dates, categories, tags, and more
  • Robust error handling: Graceful handling of network issues, parsing errors, and edge cases
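
For orientation, here is a minimal sketch of this kind of extraction using the Trafilatura API directly; it is illustrative only and not the project's own code (the URL is a placeholder):

# Minimal sketch of Trafilatura-based extraction (not the project's exact code)
import json
import trafilatura

url = "https://example.com/article"
downloaded = trafilatura.fetch_url(url)          # fetch raw HTML
if downloaded is None:
    raise RuntimeError(f"Could not download {url}")

# JSON output with metadata (title, author, date, tags, ...)
as_json = trafilatura.extract(downloaded, output_format="json", with_metadata=True)
record = json.loads(as_json) if as_json else {}

# Plain-text and Markdown variants of the same content
as_text = trafilatura.extract(downloaded, output_format="txt")
as_markdown = trafilatura.extract(downloaded, output_format="markdown")

print(record.get("title"), record.get("author"), record.get("date"))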

2. API Interface

  • FastAPI-based REST API: Provides a modern, high-performance API for programmatic access
  • CORS support: Configured for cross-origin requests
  • Structured responses: Returns consistent JSON responses with success/failure indicators
  • Health monitoring: Built-in health check endpoint

3. Command Line Tools

  • Interactive scraping: Command-line interface for manual URL scraping
  • Batch processing: Support for processing multiple URLs
  • Data organization: Automatic file naming and directory structure

Project Structure

scraper_cleaner/
├── api/
│   └── main.py          # FastAPI application with REST endpoints
├── artifacts/           # Documentation and status files
├── data/                # Output directory for scraped content
├── main.py              # Original scraping script (Scroll.in specific)
├── trafilatura_scraper.py # Core scraping library
├── pyproject.toml       # Project dependencies and configuration
├── README.md            # This documentation
└── .gitignore           # Git ignore rules

Components

1. trafilatura_scraper.py - Core Scraping Engine

The main scraping module provides the following functions (a usage sketch follows the list):

  • scrape_article_with_trafilatura(url): Extracts structured article data and clean text
  • slugify(text): Converts text to URL-friendly slugs
  • format_article_markdown(data, text): Formats article data as Markdown
  • setup_logging(): Configures logging for the application
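
A hypothetical usage sketch of these functions; the return shape of scrape_article_with_trafilatura (assumed here to be a (data, text) pair with data as a dict) is an assumption and may differ from the actual implementation:

# Hypothetical usage sketch; the (data, text) return shape is an assumption
from pathlib import Path

from trafilatura_scraper import (
    setup_logging,
    scrape_article_with_trafilatura,
    slugify,
    format_article_markdown,
)

setup_logging()

url = "https://example.com/article"
data, text = scrape_article_with_trafilatura(url)   # structured data + clean text (assumed)

markdown = format_article_markdown(data, text)
out_path = Path("data") / f"{slugify(data.get('title', 'article'))}.md"
out_path.write_text(markdown, encoding="utf-8")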

2. api/main.py - REST API Interface

A FastAPI application providing the following endpoints (a minimal sketch follows the list):

  • GET /: Root endpoint with service information
  • GET /health: Health check endpoint
  • POST /scrape: Scrape articles from URLs with configurable options
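
A minimal sketch of such an application; the model fields follow the request example under API Endpoints below, but the exact field names, defaults, and wiring in api/main.py may differ:

# Minimal FastAPI sketch of the documented endpoints (field names are assumptions)
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

app = FastAPI(title="scraper_cleaner API")
app.add_middleware(
    CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"]
)

class ScrapeRequest(BaseModel):
    url: str
    include_raw_text: bool = True
    include_metadata: bool = True

@app.get("/")
def root():
    return {"service": "scraper_cleaner", "docs": "/docs"}

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/scrape")
def scrape(req: ScrapeRequest):
    # Delegate to the core scraping engine (trafilatura_scraper.py above)
    return {"success": True, "data": {"url": req.url}, "error": None}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8001)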

3. main.py - Original Scraping Script

A specialized scraper for the Scroll.in website, featuring (a generic sketch follows the list):

  • HTML parsing using BeautifulSoup
  • Structured data extraction
  • Content cleaning and formatting
  • Multiple output formats (JSON, Markdown, HTML)
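
A generic sketch of this BeautifulSoup approach; the URL and CSS selectors are illustrative placeholders, not Scroll.in's actual markup:

# Generic BeautifulSoup sketch; selectors are illustrative, not Scroll.in's real markup
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/article", timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
title = soup.select_one("h1")
paragraphs = [p.get_text(strip=True) for p in soup.select("article p")]

print(title.get_text(strip=True) if title else "No title found")
print("\n\n".join(paragraphs))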

Installation

# Clone the repository
git clone https://github.com/your-repo/scraper_cleaner.git
cd scraper_cleaner

# Install dependencies
pip install -r requirements.txt
# or using uv
uv pip install -r requirements.txt

Usage

Command Line Interface

# Run the main scraper (interactive)
python trafilatura_scraper.py

# Run the Scroll.in specific scraper
python main.py

API Server

# Start the API server
python api/main.py

# The API will be available at http://localhost:8001

API Endpoints

POST /scrape

{
  "url": "https://example.com/article",
  "include_raw_text": true,
  "include_metadata": true
}

Response:

{
  "success": true,
  "data": {
    "url": "https://example.com/article",
    "title": "Article Title",
    "author": "Author Name",
    "date": "2023-01-01",
    "text": "Article content...",
    "raw_text": "Raw article text...",
    "metadata": {...}
  },
  "error": null
}
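
The same request issued from Python with the requests library (assuming the API server from above is running on localhost:8001):

# Calling POST /scrape from Python; mirrors the JSON request/response shown above
import requests

payload = {
    "url": "https://example.com/article",
    "include_raw_text": True,
    "include_metadata": True,
}
resp = requests.post("http://localhost:8001/scrape", json=payload, timeout=60)
resp.raise_for_status()

result = resp.json()
if result.get("success"):
    print(result["data"]["title"])
else:
    print("Scrape failed:", result.get("error"))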

cURL Usage Examples

Basic Scraping Request:

curl -X POST "http://localhost:8001/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "include_raw_text": true,
    "include_metadata": true
  }'

Batch Scraping Request:

curl -X POST "http://localhost:8001/batch-scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/article1",
      "https://example.com/article2"
    ],
    "include_raw_text": true,
    "include_metadata": true
  }'

Authentication and Token Usage:

# Get authentication token
curl -X POST "http://localhost:8001/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "username=testuser&password=testpassword"

# Use token for authenticated requests
curl -X POST "http://localhost:8001/scrape" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
  -d '{
    "url": "https://example.com/article",
    "include_raw_text": true,
    "include_metadata": true
  }'
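
The same token flow from Python; the access_token field name assumes a standard OAuth2 password-flow response and may differ from the actual /token payload:

# Token flow in Python, mirroring the cURL example above
import requests

BASE = "http://localhost:8001"

# 1. Obtain a token (form-encoded, as in the cURL example)
token_resp = requests.post(
    f"{BASE}/token",
    data={"username": "testuser", "password": "testpassword"},
    timeout=30,
)
token_resp.raise_for_status()
access_token = token_resp.json()["access_token"]  # assumed OAuth2-style response field

# 2. Use the token on an authenticated scrape request
scrape_resp = requests.post(
    f"{BASE}/scrape",
    headers={"Authorization": f"Bearer {access_token}"},
    json={
        "url": "https://example.com/article",
        "include_raw_text": True,
        "include_metadata": True,
    },
    timeout=60,
)
print(scrape_resp.json())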

Health Check:

curl -X GET "http://localhost:8001/health"

Root Endpoint:

curl -X GET "http://localhost:8001/"

Dependencies

  • Python 3.12+
  • Trafilatura 2.0.0+
  • FastAPI 0.111.0+
  • Uvicorn 0.30.1+
  • Requests 2.32.5+
  • BeautifulSoup 4.14.3+

Recent Fixes

  • Type Safety: Fixed null reference issues in api/main.py where spec could be None
  • Error Handling: Added proper null checks for module loading
  • Code Quality: Improved type annotations and error messages
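
A sketch of the kind of null check described above when loading a module from a file path with importlib; the actual fix in api/main.py may differ:

# Illustrative null check when loading a module from a file path;
# spec and spec.loader can both be None, which is the case the fix guards against
import importlib.util

spec = importlib.util.spec_from_file_location("trafilatura_scraper", "trafilatura_scraper.py")
if spec is None or spec.loader is None:
    raise ImportError("Could not load trafilatura_scraper.py")

module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)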

Development

# Run tests (parallel execution by default)
.venv/bin/pytest tests/

# Run with hot reload (development)
uvicorn api.main:app --reload --port 8001

Testing with pytest-xdist

The project includes pytest-xdist for parallel test execution, which significantly improves test suite performance.

Parallel Test Execution

By default, tests run in parallel using auto-detected CPU cores (for example, 10 workers on a 10-core development machine):

# Run all tests in parallel (default behavior)
.venv/bin/pytest tests/

# Run specific number of workers
.venv/bin/pytest tests/ -n 4

# Run tests sequentially (if needed)
.venv/bin/pytest tests/ -n 0

Slow Test Management

Slow tests are automatically excluded from the default test run:

# Run only fast tests (default)
.venv/bin/pytest tests/ -m "not slow"

# Run slow tests separately
.venv/bin/pytest tests/ -m slow

# Run all tests including slow ones
.venv/bin/pytest tests/ -m ""

Performance Results

  • Before optimization: ~60 seconds for full test suite
  • After parallel execution: ~11 seconds for fast tests (26 tests)
  • Slow tests: ~11 seconds for 5 slow tests
  • Total improvement: ~80% reduction in test execution time

Test Execution Examples

# Quick development feedback (fast tests only, parallel)
.venv/bin/pytest tests/ -m "not slow"

# Full test suite (parallel)
.venv/bin/pytest tests/ -m ""

# Specific test file
.venv/bin/pytest tests/test_api_integration.py

# With verbose output
.venv/bin/pytest tests/ -v

# Show test durations
.venv/bin/pytest tests/ --durations=10

Configuration

The pytest configuration is defined in pytest.ini:

[pytest]
markers =
    slow: marks tests as slow (deselect with '-m "not slow"')
    integration: marks integration tests
    unit: marks unit tests

# pytest-xdist configuration for parallel execution
addopts = -n auto -m "not slow"
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*

Best Practices

  1. Test Isolation: All tests are designed to be parallel-safe with proper mocking and fixtures
  2. Slow Test Marking: Use the @pytest.mark.slow decorator for tests that take >1 second (see the sketch after this list)
  3. Resource Management: Tests avoid shared state and use fixtures for setup/teardown
  4. Mocking: External dependencies are properly mocked to ensure fast, reliable tests
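
A short illustration of practices 2 and 4; the tests and names below are hypothetical, not taken from the project's test suite:

# Hypothetical tests illustrating slow-marking (practice 2) and mocking (practice 4)
import pytest
import trafilatura

@pytest.mark.slow
def test_live_scrape():
    # Touches the real network, so it is excluded from the default run (-m "not slow")
    assert trafilatura.fetch_url("https://example.com") is not None

def test_scrape_without_network(monkeypatch):
    # The download is replaced with canned HTML so the test stays fast and parallel-safe
    fake_html = "<html><body><p>canned content</p></body></html>"
    monkeypatch.setattr(trafilatura, "fetch_url", lambda url: fake_html)
    assert trafilatura.fetch_url("https://example.com/article") == fake_html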

License

MIT License
