
XScrapper

Automated newsletter generator from X (Twitter) hashtags. Fetches tweets, processes them with AI, and sends email newsletters.

Features

  • Automated Scraping: Extract tweets from X.com based on configurable hashtag groups
  • AI Processing: Filter and summarize tweets with an LLM via OpenRouter
  • Email Delivery: Send newsletters via Resend API
  • Scheduling: Configure automatic execution times

Requirements

  • Python 3.11+
  • Playwright (browser automation)
  • API keys (see Configuration)

Installation

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Linux/Mac
.venv\Scripts\activate.ps1     # Windows

# Install dependencies
pip install -e ".[dev]"

# Install Playwright browsers
playwright install chromium

Configuration

Create a .env file with the following variables:

# OpenRouter (AI processing)
OPENROUTER_API_KEY=your_openrouter_api_key

# Resend (Email delivery)
RESEND_API_KEY=your_resend_api_key
EMAIL_FROM=your_email@example.com
EMAIL_TO=recipient@example.com
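At runtime these variables must end up in the process environment. The project's actual loader isn't shown here (python-dotenv is a common choice); as an illustration, a minimal stdlib-only sketch, where the `load_env` helper is hypothetical and not part of the codebase:

```python
import os
from pathlib import Path


def load_env(path: str = ".env") -> None:
    """Read KEY=VALUE lines from a .env file into os.environ.

    Blank lines and '#' comments are skipped; variables already set
    in the environment are not overwritten.
    """
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```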

Cookies Configuration

To scrape X.com, you need to export your browser cookies:

  1. Log in to X.com in your browser (Chrome/Firefox)
  2. Install a cookie-export extension (e.g., "Cookie-Editor")
  3. Export cookies for x.com in JSON format (Netscape/cookies.txt format will not work here)
  4. Save the export as cookies.json in the project root

Example cookies.json format:

[
  {
    "domain": ".x.com",
    "name": "auth_token",
    "value": "your_token_here",
    "path": "/",
    "secure": true,
    "sameSite": "Lax"
  }
]

Note: Cookies expire periodically. Re-export if you encounter login walls.
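Entries in this format map directly onto what Playwright's `BrowserContext.add_cookies` accepts. A sketch of loading and sanity-checking the file before handing it to Playwright (the `load_cookies` helper is illustrative, not necessarily how the project does it):

```python
import json
from pathlib import Path

# Fields BrowserContext.add_cookies needs on every entry
REQUIRED_KEYS = {"name", "value", "domain", "path"}


def load_cookies(path: str = "cookies.json") -> list[dict]:
    """Load cookies.json and verify each entry has the required fields."""
    cookies = json.loads(Path(path).read_text())
    for cookie in cookies:
        missing = REQUIRED_KEYS - cookie.keys()
        if missing:
            raise ValueError(f"cookie {cookie.get('name')!r} is missing {missing}")
    return cookies


# Usage with Playwright (sketch):
# context = browser.new_context()
# context.add_cookies(load_cookies())
```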

Hashtag Groups

Edit config/hashtags.yaml to configure hashtag groups and scraping parameters:

groups:
  - name: "AI & Data"
    hashtags:
      - "#AI"
      - "#MachineLearning"

scraper:
  min_tweets: 50
  min_interactions: 10
  wait_between_requests_ms: 7000

scheduler:
  hours:
    - 8
    - 13
    - 17
  timezone: "America/Lima"
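Reading this file back is a one-liner with PyYAML; this is an assumption (the project may use a different parser), and the `load_config` helper is illustrative:

```python
import yaml  # PyYAML -- assumed; the project may use another YAML library


def load_config(path: str = "config/hashtags.yaml") -> dict:
    """Parse the hashtag-groups config into a plain dict."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)


# config = load_config()
# for group in config["groups"]:
#     print(group["name"], group["hashtags"])
```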

Usage

Run Immediately

python main.py --run-now

Run Scheduled

python main.py --schedule

Options

Option          Description
--run-now       Execute pipeline immediately
--schedule      Start scheduler with configured times
--config        Path to configuration file (default: config/hashtags.yaml)
--headless      Run browser in headless mode (default: True)
--no-headless   Run browser in visible mode for debugging
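The flags above can be modeled with stdlib argparse; on Python 3.11, `argparse.BooleanOptionalAction` generates the paired `--headless`/`--no-headless` flags automatically. A sketch, not the project's actual parser:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--run-now", action="store_true",
                        help="Execute pipeline immediately")
    parser.add_argument("--schedule", action="store_true",
                        help="Start scheduler with configured times")
    parser.add_argument("--config", default="config/hashtags.yaml",
                        help="Path to configuration file")
    # BooleanOptionalAction creates both --headless and --no-headless
    parser.add_argument("--headless", default=True,
                        action=argparse.BooleanOptionalAction,
                        help="Run browser in headless mode")
    return parser
```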

Project Structure

XScrapper/
├── main.py              # Entry point
├── config/
│   └── hashtags.yaml    # Hashtag groups configuration
├── src/
│   ├── scraper.py       # X.com scraping module
│   ├── ai_processor.py  # AI processing module
│   ├── email_sender.py  # Email delivery module
│   └── scheduler.py     # Scheduling module
├── tests/               # Test suite
└── output/              # Raw tweet exports

Development

Run Tests

pytest

Run Tests with Coverage

pytest --cov=src --cov-report=term-missing

Type Checking

mypy src/

Linting

ruff check src/ tests/

License

MIT
