
dariopy/facturero-gmail

Gmail Attachment Downloader

Language: English | Español

Python 3.10+ utility that authenticates with the Gmail API (OAuth 2.0 Desktop flow) to fetch PDF invoice attachments from whitelisted senders over a configurable date range.

Features

  • OAuth2 flow using gmail.readonly scope with token caching.
  • Filters emails by sender list and date window before downloading.
  • Saves PDF attachments grouped under data/downloads/<sender>/ and skips already processed files via a state cache.
  • Structured logging for observability.
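As a rough illustration of the PDF-attachment step, the sketch below walks a message payload shaped like the `payload` field of a Gmail API message resource and yields the parts that look like PDF attachments. The function name `iter_pdf_parts` and the sample payload are hypothetical; the project's actual matching rules may differ.

```python
from typing import Iterator


def iter_pdf_parts(payload: dict) -> Iterator[dict]:
    """Yield every MIME part that looks like a PDF attachment.

    Assumes the Gmail API message shape: nested `parts`, each with
    `filename`, `mimeType`, and a `body` that may carry `attachmentId`.
    """
    filename = payload.get("filename") or ""
    body = payload.get("body") or {}
    is_pdf = (
        payload.get("mimeType") == "application/pdf"
        or filename.lower().endswith(".pdf")
    )
    if filename and is_pdf and body.get("attachmentId"):
        yield payload
    # Multipart messages nest their children under "parts".
    for child in payload.get("parts") or []:
        yield from iter_pdf_parts(child)


# Hypothetical payload with one text part and two PDF attachments.
message_payload = {
    "mimeType": "multipart/mixed",
    "parts": [
        {"mimeType": "text/plain", "filename": "", "body": {"size": 12}},
        {
            "mimeType": "application/pdf",
            "filename": "factura-001.pdf",
            "body": {"attachmentId": "att-1", "size": 2048},
        },
        {
            "mimeType": "application/octet-stream",
            "filename": "Factura-002.PDF",
            "body": {"attachmentId": "att-2", "size": 4096},
        },
    ],
}
names = [part["filename"] for part in iter_pdf_parts(message_payload)]
print(names)
```

Checking the file extension as well as the MIME type is a design choice here: some senders attach PDFs with a generic `application/octet-stream` type.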

Prerequisites

  1. Google Cloud project with Gmail API enabled.
  2. Create OAuth 2.0 credentials of type Desktop App and download credentials.json.
  3. Python 3.10 or later plus the ability to create a virtual environment.
  4. Gmail account with access to the target inbox.

Outside-the-IDE checklist

  1. Enable the Gmail API in Google Cloud Console for your project.
  2. Configure the OAuth consent screen (Internal or External) and add the Gmail account as a test user if required (each contributor should use their own project/credentials).
  3. Create the Desktop App OAuth client, download credentials.json, and place it inside credentials/ locally (never commit it).
  4. Store client ID/secret in a password manager for safekeeping.
  5. Ensure the workstation can open a browser window for OAuth consent on first run.
  6. (Optional) Decide on a log location, e.g., LOG_FILE=./data/logs/run.jsonl, for long-term summaries.

Step-by-step setup

  1. Clone & enter the repo
    git clone <repo-url>
    cd facturero-gmail
  2. Create a virtual environment & install deps
    python -m venv .venv
    .\.venv\Scripts\activate        # PowerShell
    pip install -r requirements.txt
  3. Copy & fill environment variables
    copy .env.example .env
    • Edit .env to set sender list, date range, download/state paths, optional keywords, dry-run flag, and LOG_FILE (if you want JSON logs).
  4. Drop OAuth credentials
    • Place the downloaded credentials.json in credentials/ (path should match CREDENTIALS_PATH).
  5. First run (authorization)
    • Execute python -m src.main. A browser window prompts for Gmail consent. After approval, token.json is saved for future runs.
  6. Verify outputs
    • Attachments land in DOWNLOAD_DIR/<sender>/.
    • Processed IDs persist in STATE_PATH so reruns are idempotent.
    • Optional JSON logs will append to LOG_FILE with rich metadata (including summary counts).

Quickstart

python -m venv .venv
.\.venv\Scripts\activate            # Windows PowerShell
pip install -r requirements.txt
copy .env.example .env
# Fill .env with senders, dates, token paths, etc.
python -m src.main           # add --dry to preview without downloading

The first run opens a browser window for OAuth consent and saves token.json to the configured location (TOKEN_PATH). Subsequent runs reuse the cached refresh token.
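The token-caching behaviour described above could be implemented roughly as follows. This is a sketch, not the project's code: it assumes the google-auth and google-auth-oauthlib packages (which may or may not match requirements.txt), and the function name `load_credentials` is hypothetical.

```python
from pathlib import Path

# Read-only scope named in the Features list.
SCOPES = ["https://www.googleapis.com/auth/gmail.readonly"]


def load_credentials(credentials_path: Path, token_path: Path):
    """Return Gmail credentials, opening the browser only when needed."""
    # Imported lazily so this module still loads where the Google client
    # libraries (an assumption of this sketch) are not installed.
    from google.auth.transport.requests import Request
    from google.oauth2.credentials import Credentials
    from google_auth_oauthlib.flow import InstalledAppFlow

    creds = None
    if token_path.exists():
        creds = Credentials.from_authorized_user_file(str(token_path), SCOPES)
    if creds and creds.valid:
        return creds
    if creds and creds.expired and creds.refresh_token:
        # Silent refresh using the cached refresh token; no browser.
        creds.refresh(Request())
    else:
        # Desktop-app flow: opens the consent page in a local browser.
        flow = InstalledAppFlow.from_client_secrets_file(
            str(credentials_path), SCOPES
        )
        creds = flow.run_local_server(port=0)
    # Persist the token so later runs skip the consent step.
    token_path.parent.mkdir(parents=True, exist_ok=True)
    token_path.write_text(creds.to_json(), encoding="utf-8")
    return creds
```

Deleting the cached token file would force the consent flow to run again on the next call.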

Configuration

Key settings live in .env (see .env.example):

  • GMAIL_SENDERS: comma-separated list of allowed From addresses.
  • GMAIL_START_DATE / GMAIL_END_DATE: ISO dates (YYYY-MM-DD), both inclusive.
  • DOWNLOAD_DIR, CREDENTIALS_PATH, TOKEN_PATH, STATE_PATH: locations of the downloads, the OAuth client file, the cached token, and the idempotency state.
  • DRY (0/1): optional dry-run toggle (overridden by --dry).
  • LOG_FILE: path for JSON log output (includes summary counts and extras).
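One way these settings might be parsed and validated is sketched below. The `Settings` dataclass and `load_settings` function are hypothetical names, and the validation rules are assumptions. Only the variable names come from the list above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    senders: tuple[str, ...]
    start_date: date
    end_date: date
    dry_run: bool


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from an env-like mapping (e.g. os.environ)."""
    # Split the comma-separated sender list and drop empty entries.
    senders = tuple(
        s.strip() for s in env.get("GMAIL_SENDERS", "").split(",") if s.strip()
    )
    if not senders:
        raise ValueError("GMAIL_SENDERS must list at least one address")
    start = date.fromisoformat(env["GMAIL_START_DATE"])
    end = date.fromisoformat(env["GMAIL_END_DATE"])
    if end < start:
        raise ValueError("GMAIL_END_DATE is before GMAIL_START_DATE")
    return Settings(senders, start, end, env.get("DRY", "0") == "1")


settings = load_settings(
    {
        "GMAIL_SENDERS": "a@example.com, b@example.com",
        "GMAIL_START_DATE": "2024-03-01",
        "GMAIL_END_DATE": "2024-03-31",
        "DRY": "1",
    }
)
print(settings)
```

Taking a mapping instead of reading `os.environ` directly keeps the parser easy to test with plain dictionaries.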

Optional command-line overrides:

python -m src.main --start-date 2024-03-01 --end-date 2024-03-31 --max-results 200 --dry --dotenv config/.env

--dotenv allows pointing at an alternate env file; CLI dates override .env values for ad hoc runs. --dry lists candidate attachments without downloading or updating state (useful for verification).
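The flags shown above could be declared with argparse along these lines. The flag names come from the example command; the defaults and the helper `effective_start` are assumptions made for illustration.

```python
import argparse
from datetime import date
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="python -m src.main")
    # Dates default to None so the .env values can be used as a fallback.
    parser.add_argument("--start-date", type=date.fromisoformat, default=None)
    parser.add_argument("--end-date", type=date.fromisoformat, default=None)
    parser.add_argument("--max-results", type=int, default=None)
    parser.add_argument("--dry", action="store_true")
    parser.add_argument("--dotenv", default=".env")
    return parser


def effective_start(cli_value: Optional[date], env_value: str) -> date:
    """A CLI date wins over the .env value when both are present."""
    return cli_value if cli_value is not None else date.fromisoformat(env_value)


args = build_parser().parse_args(
    [
        "--start-date", "2024-03-01",
        "--end-date", "2024-03-31",
        "--max-results", "200",
        "--dry",
        "--dotenv", "config/.env",
    ]
)
print(args.start_date, args.end_date, args.max_results, args.dry, args.dotenv)
```

With this layout, the dry-run decision would be `args.dry or env DRY == "1"`, so the flag can switch a dry run on even when .env leaves it off.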

Running the downloader

  1. Dry run (preview only)
    python -m src.main --dry
    • Logs show [DRY] Attachment candidate entries with sender + filename.
    • STATE_PATH and downloads remain untouched.
  2. Full download
    python -m src.main
    • New PDFs are saved, and processed IDs recorded to prevent duplicates.
    • Summary log entry includes total messages scanned, attachments matched, downloads, skipped, and dry-run flag.
  3. Monitoring logs
    • Console displays human-friendly messages.
    • If LOG_FILE is set, each record is written as a JSON line (timestamp, logger, message, extra data). Example summary entry:
      {"timestamp": "2026-02-05 09:26:56", "level": "INFO", "logger": "gmail_downloader.downloader", "message": "Download summary", "messages": 120, "attachments": 37, "downloaded": 35, "skipped": 2, "previewed": 0, "dry_run": false}
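A JSON-lines formatter producing entries like the one above could look like this sketch. The class name `JsonLineFormatter` is hypothetical and the real project may build its records differently. The demo writes to an in-memory stream, where the tool would presumably use a `logging.FileHandler` pointed at LOG_FILE.

```python
import io
import json
import logging

# Attributes every LogRecord carries; anything else arrived via `extra=`.
_STANDARD = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
    "message",
    "asctime",
}


class JsonLineFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%d %H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy the caller-supplied extras (summary counts and so on).
        entry.update(
            {k: v for k, v in vars(record).items() if k not in _STANDARD}
        )
        return json.dumps(entry, default=str)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonLineFormatter())
logger = logging.getLogger("gmail_downloader.downloader")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False
logger.info(
    "Download summary",
    extra={"messages": 120, "attachments": 37, "downloaded": 35,
           "skipped": 2, "previewed": 0, "dry_run": False},
)
record = json.loads(stream.getvalue())
print(record["message"], record["downloaded"])
```

Because each record is a single JSON object per line, the log file can be filtered afterwards with any line-oriented JSON tool.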

Gmail keywords & filters

  • Sender filters are OR-combined: (from:a@example.com OR from:b@example.com).
  • Keywords are AND-combined; each keyword must appear in the message.
  • Keywords with spaces are automatically quoted for exact matching ("factura super").
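  • The date range uses Gmail's after: / before: operators. The start date is included; because before: is exclusive, the query uses the day after the end date so that the end date itself is included.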
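The rules in this list can be combined into a single search string. The sketch below is one possible reading of them; `build_query` is a hypothetical name and the project's real query may include further terms.

```python
from datetime import date, timedelta
from typing import Sequence


def build_query(
    senders: Sequence[str],
    start: date,
    end: date,
    keywords: Sequence[str] = (),
) -> str:
    """Compose a Gmail search query from senders, keywords and dates."""
    # Senders are OR-combined inside one parenthesised group.
    terms = ["(" + " OR ".join(f"from:{s}" for s in senders) + ")"]
    # Keywords are AND-combined (space-separated); multi-word ones are quoted.
    for keyword in keywords:
        terms.append(f'"{keyword}"' if " " in keyword else keyword)
    # before: is exclusive, so shift the end date forward by one day.
    day_after_end = end + timedelta(days=1)
    terms.append(f"after:{start:%Y/%m/%d}")
    terms.append(f"before:{day_after_end:%Y/%m/%d}")
    return " ".join(terms)


query = build_query(
    ["a@example.com", "b@example.com"],
    date(2024, 3, 1),
    date(2024, 3, 31),
    keywords=["factura super", "marzo"],
)
print(query)
```

For the sample inputs this yields `(from:a@example.com OR from:b@example.com) "factura super" marzo after:2024/03/01 before:2024/04/01`.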

Repository Hygiene

  • Secrets (credentials.json, .env, tokens) are gitignored—keep them local.
  • Recommended flow: feature branches (e.g., feature/gmail-downloader), PR review, squash merge.

Operational Notes

  • Downloads are stored under data/downloads/<sanitized_sender>/file.pdf.
  • Idempotency state lives at STATE_PATH (JSON map of message+attachment IDs). Delete this file to force re-downloads.
  • Logs are emitted to stdout; adjust LOG_LEVEL to DEBUG for troubleshooting.
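The sender folders and the idempotency state can be pictured with the sketch below. The sanitising rule, the `message_id:attachment_id` key format, and the class name `ProcessedState` are all assumptions; the notes above only say that the state is a JSON map of message and attachment IDs.

```python
import json
import re
import tempfile
from pathlib import Path


def sanitize_sender(sender: str) -> str:
    """Map an address to a safe folder name (one hypothetical rule)."""
    return re.sub(r"[^a-z0-9._-]+", "_", sender.lower()).strip("_")


class ProcessedState:
    """JSON-backed record of attachments that were already saved."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.seen = (
            json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
        )

    @staticmethod
    def key(message_id: str, attachment_id: str) -> str:
        return f"{message_id}:{attachment_id}"

    def is_done(self, message_id: str, attachment_id: str) -> bool:
        return self.key(message_id, attachment_id) in self.seen

    def mark_done(self, message_id: str, attachment_id: str, saved_to: str) -> None:
        self.seen[self.key(message_id, attachment_id)] = saved_to
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.seen, indent=2), encoding="utf-8")


folder = sanitize_sender("Billing@Example.com")
with tempfile.TemporaryDirectory() as tmp:
    state_path = Path(tmp) / "state.json"
    state = ProcessedState(state_path)
    before = state.is_done("msg-1", "att-1")
    state.mark_done("msg-1", "att-1", f"data/downloads/{folder}/file.pdf")
    # A fresh instance reads the file back, as a second run would.
    after = ProcessedState(state_path).is_done("msg-1", "att-1")
    # Deleting the state file forces a re-download on the next run.
    state_path.unlink()
    after_delete = ProcessedState(state_path).is_done("msg-1", "att-1")
print(folder, before, after, after_delete)
```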

Future Enhancements

  • Add unit tests/mocks for Gmail interactions.
  • Support alternate storage backends (e.g., S3) or structured invoice parsing.

About

Script to download invoices (and other attachments) from your own Gmail account
