Python 3.10+ utility that authenticates with the Gmail API (OAuth 2.0 Desktop flow) to fetch PDF invoice attachments from whitelisted senders over a configurable date range.
- OAuth 2.0 flow using the gmail.readonly scope, with token caching.
- Filters emails by sender list and date window before downloading.
- Saves PDF attachments grouped under data/downloads/<sender>/ and skips already processed files via a state cache.
- Structured logging for observability.
- Google Cloud project with Gmail API enabled.
- OAuth 2.0 credentials of type Desktop App, downloaded as credentials.json.
- Python 3.10 or later plus the ability to create a virtual environment.
- Gmail account with access to the target inbox.
- Enable the Gmail API in Google Cloud Console for your project.
- Configure the OAuth consent screen (Internal or External) and add the Gmail account as a test user if required (each contributor should use their own project/credentials).
- Create the Desktop App OAuth client, download credentials.json, and place it inside credentials/ locally (never commit it).
- Store client ID/secret in a password manager for safekeeping.
- Ensure the workstation can open a browser window for OAuth consent on first run.
- (Optional) Decide on a log location, e.g., LOG_FILE=./data/logs/run.jsonl, for long-term summaries.
- Clone & enter the repo
  git clone <repo-url>
  cd facturero
- Create a virtual environment & install deps
  python -m venv .venv
  .\.venv\Scripts\activate # PowerShell
  pip install -r requirements.txt
- Copy & fill environment variables
  copy .env.example .env
  - Edit .env to set the sender list, date range, download/state paths, optional keywords, dry-run flag, and LOG_FILE (if you want JSON logs).
- Drop OAuth credentials
  - Place the downloaded credentials.json in credentials/ (the path should match CREDENTIALS_PATH).
- First run (authorization)
  - Execute python -m src.main. A browser window prompts for Gmail consent. After approval, token.json is saved for future runs.
- Verify outputs
  - Attachments land in DOWNLOAD_DIR/<sender>/.
  - Processed IDs persist in STATE_PATH so reruns are idempotent.
  - Optional JSON logs are appended to LOG_FILE with rich metadata (including summary counts).
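The idempotent reruns mentioned in the last step can be pictured as a small JSON-backed map of processed keys. The following is only a sketch under stated assumptions: the function names and the "message_id:attachment_id" key format are hypothetical and may not match what the tool actually stores at STATE_PATH.

```python
import json
from pathlib import Path
from typing import Dict


def load_state(state_path: Path) -> Dict[str, bool]:
    """Return the processed-ID map, or an empty map if no state file exists yet."""
    if not state_path.exists():
        return {}
    return json.loads(state_path.read_text(encoding="utf-8"))


def save_state(state_path: Path, state: Dict[str, bool]) -> None:
    """Write to a temp file first, then replace, so a crash cannot leave a half-written file."""
    state_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = state_path.with_suffix(state_path.suffix + ".tmp")
    tmp_path.write_text(json.dumps(state, indent=2, sort_keys=True), encoding="utf-8")
    tmp_path.replace(state_path)


def make_key(message_id: str, attachment_id: str) -> str:
    # Hypothetical key format; the real tool may combine the two IDs differently.
    return f"{message_id}:{attachment_id}"


def should_download(state: Dict[str, bool], message_id: str, attachment_id: str) -> bool:
    """True when this message/attachment pair has not been recorded yet."""
    return make_key(message_id, attachment_id) not in state
```

With this shape, deleting the state file simply makes load_state return an empty map, so every attachment is treated as new on the next run.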
python -m venv .venv
.\.venv\Scripts\activate # Windows PowerShell
pip install -r requirements.txt
copy .env.example .env
# Fill .env with senders, dates, token paths, etc.
python -m src.main # add --dry to preview without downloading
The first run opens a browser window for OAuth consent and persists token.json in the configured location. Subsequent runs reuse the refresh token.
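The first-run consent and later token reuse described here follow the usual Desktop-app pattern from the google-auth and google-auth-oauthlib libraries. The sketch below is an assumption about how such a flow is typically wired, not the project's actual code; the function name get_credentials is hypothetical, and the Google libraries are imported inside the function so the snippet can be read and loaded without them installed.

```python
from pathlib import Path

# Read-only Gmail scope, as used by this tool.
SCOPES = ["https://www.googleapis.com/auth/gmail.readonly"]


def get_credentials(credentials_path: Path, token_path: Path):
    """Return valid credentials, opening the browser consent flow only when needed."""
    # Lazy imports: requires google-auth and google-auth-oauthlib at call time.
    from google.auth.transport.requests import Request
    from google.oauth2.credentials import Credentials
    from google_auth_oauthlib.flow import InstalledAppFlow

    creds = None
    if token_path.exists():
        # Reuse the cached token from a previous run.
        creds = Credentials.from_authorized_user_file(str(token_path), SCOPES)

    if creds and creds.valid:
        return creds

    if creds and creds.expired and creds.refresh_token:
        # Silent refresh: no browser window needed.
        creds.refresh(Request())
    else:
        # First run (or unusable token): open the browser for consent.
        flow = InstalledAppFlow.from_client_secrets_file(str(credentials_path), SCOPES)
        creds = flow.run_local_server(port=0)

    # Persist the new or refreshed token for subsequent runs.
    token_path.parent.mkdir(parents=True, exist_ok=True)
    token_path.write_text(creds.to_json(), encoding="utf-8")
    return creds
```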
Key settings live in .env (see .env.example):
- GMAIL_SENDERS: comma-separated list of allowed From addresses.
- GMAIL_START_DATE / GMAIL_END_DATE (ISO date, inclusive).
- DOWNLOAD_DIR, CREDENTIALS_PATH, TOKEN_PATH, STATE_PATH.
- DRY (0/1): optional dry-run toggle (overridden by --dry).
- LOG_FILE: path for JSON log output (includes summary counts and extras).
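To make the expected value formats concrete, here is a minimal sketch of how these variables could be parsed once the .env file has been loaded into the environment. The Settings class and load_settings function are hypothetical names for illustration, and the validation rules are assumptions rather than the tool's documented behaviour.

```python
import os
from dataclasses import dataclass
from datetime import date
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class Settings:
    senders: List[str]
    start_date: date
    end_date: date
    dry: bool
    log_file: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Parse the subset of settings shown here from an environment-like mapping."""
    # Comma-separated list; surrounding whitespace and empty entries are dropped.
    senders = [s.strip() for s in env.get("GMAIL_SENDERS", "").split(",") if s.strip()]
    if not senders:
        raise ValueError("GMAIL_SENDERS must list at least one address")

    # ISO dates such as 2024-03-01; both ends are inclusive.
    start = date.fromisoformat(env["GMAIL_START_DATE"])
    end = date.fromisoformat(env["GMAIL_END_DATE"])
    if end < start:
        raise ValueError("GMAIL_END_DATE must not be earlier than GMAIL_START_DATE")

    return Settings(
        senders=senders,
        start_date=start,
        end_date=end,
        dry=env.get("DRY", "0") == "1",
        log_file=env.get("LOG_FILE") or None,
    )
```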
Optional command-line overrides:
python -m src.main --start-date 2024-03-01 --end-date 2024-03-31 --max-results 200 --dry --dotenv config/.env
--dotenv allows pointing at an alternate env file; CLI dates override .env values for ad hoc runs.
--dry lists candidate attachments without downloading or updating state (useful for verification).
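The override behaviour can be sketched with argparse. The flag names come from the command shown earlier; the helper names (build_parser, resolve, effective_dry) and the defaults are assumptions made for illustration only.

```python
import argparse
from datetime import date


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="python -m src.main")
    # None means "not given on the command line", so the .env value applies.
    parser.add_argument("--start-date", type=date.fromisoformat, default=None)
    parser.add_argument("--end-date", type=date.fromisoformat, default=None)
    parser.add_argument("--max-results", type=int, default=None)
    parser.add_argument("--dry", action="store_true")
    # Assumed default; the real tool may look for the env file elsewhere.
    parser.add_argument("--dotenv", default=".env")
    return parser


def resolve(cli_value, env_value):
    """A CLI value wins when given; otherwise fall back to the .env value."""
    return cli_value if cli_value is not None else env_value


def effective_dry(cli_dry: bool, env_dry: bool) -> bool:
    """Assumed precedence: --dry forces a dry run, otherwise DRY from .env decides."""
    return cli_dry or env_dry
```

For example, parsing `--start-date 2024-03-01` yields a date object that resolve prefers over whatever GMAIL_START_DATE holds, while an omitted flag leaves the .env value in effect.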
- Dry run (preview only)
  python -m src.main --dry
  - Logs show [DRY] Attachment candidate entries with sender + filename. STATE_PATH and downloads remain untouched.
- Full download
  python -m src.main
  - New PDFs are saved, and processed IDs are recorded to prevent duplicates.
  - The summary log entry includes total messages scanned, attachments matched, downloads, skipped, and the dry-run flag.
- Monitoring logs
  - Console displays human-friendly messages.
  - If LOG_FILE is set, each record is written as a JSON line (timestamp, logger, message, extra data). Example summary entry:
    {"timestamp": "2026-02-05 09:26:56", "level": "INFO", "logger": "gmail_downloader.downloader", "message": "Download summary", "messages": 120, "attachments": 37, "downloaded": 35, "skipped": 2, "previewed": 0, "dry_run": false}
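A JSON-lines record of the shape shown in the example entry can be produced with a custom logging.Formatter. This is a minimal sketch using only the standard library; the class name JsonLineFormatter is hypothetical, and the project's real formatter may differ.

```python
import json
import logging

# Attributes present on every LogRecord; anything else was passed via `extra=`.
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {
    "message",
    "asctime",
}


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%d %H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy fields supplied through `extra=` (e.g. downloaded, skipped).
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                entry[key] = value
        # default=str keeps non-JSON types (paths, dates) from raising.
        return json.dumps(entry, default=str)
```

Attaching this formatter to a logging.FileHandler pointed at LOG_FILE, and logging with `extra={"downloaded": 35, "skipped": 2}`, would append one line per record with those counts as top-level keys.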
- Sender filters are OR-combined: (from:a@example.com OR from:b@example.com).
- Keywords are AND-combined; each keyword must appear in the message.
- Keywords with spaces are automatically quoted for exact matching ("factura super").
- Date range uses Gmail after:/before: operators (inclusive of start, exclusive of end+1 day).
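Put together, the rules above amount to building a single Gmail search string. The sketch below shows one way to do it; the function name build_query and the order of the clauses are assumptions, while the from:, after:, and before: operators and the YYYY/MM/DD date form are standard Gmail search syntax.

```python
from datetime import date, timedelta
from typing import Iterable, Sequence


def build_query(
    senders: Sequence[str],
    start: date,
    end: date,
    keywords: Iterable[str] = (),
) -> str:
    """Build a Gmail search query from senders, an inclusive date range, and keywords."""
    # Senders are OR-combined inside one parenthesised group.
    sender_clause = "(" + " OR ".join(f"from:{s}" for s in senders) + ")"

    # before: is exclusive, so add one day to keep the end date inclusive.
    day_after_end = end + timedelta(days=1)
    parts = [
        sender_clause,
        f"after:{start.strftime('%Y/%m/%d')}",
        f"before:{day_after_end.strftime('%Y/%m/%d')}",
    ]

    # Space-separated terms are AND-combined; multi-word keywords are quoted.
    for keyword in keywords:
        parts.append(f'"{keyword}"' if " " in keyword else keyword)

    return " ".join(parts)
```

For March 2024 with two senders and the keywords factura and factura super, this yields `(from:a@example.com OR from:b@example.com) after:2024/03/01 before:2024/04/01 factura "factura super"`.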
- Secrets (credentials.json, .env, tokens) are gitignored; keep them local.
- Recommended flow: feature branches (e.g., feature/gmail-downloader), PR review, squash merge.
- Downloads are stored under data/downloads/<sanitized_sender>/file.pdf.
- Idempotency state lives at STATE_PATH (JSON map of message+attachment IDs). Delete this file to force re-downloads.
- Logs are emitted to stdout; adjust LOG_LEVEL to DEBUG for troubleshooting.
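As an illustration of how a PDF ends up under a per-sender folder, the sketch below walks a Gmail message payload (the parts, filename, and body.attachmentId fields of the Gmail API message resource), derives a destination path, and decodes the base64url attachment body. All function names are hypothetical, and the sanitising rule in particular is an assumption; the tool's actual rule for <sanitized_sender> may differ.

```python
import base64
import re
from pathlib import Path
from typing import Iterator, Tuple


def iter_pdf_parts(payload: dict) -> Iterator[Tuple[str, str]]:
    """Yield (filename, attachment_id) for each PDF part, including nested multiparts."""
    for part in payload.get("parts", []):
        filename = part.get("filename", "")
        attachment_id = part.get("body", {}).get("attachmentId")
        if filename.lower().endswith(".pdf") and attachment_id:
            yield filename, attachment_id
        # Multipart containers carry their own "parts" list.
        yield from iter_pdf_parts(part)


def sanitize_sender(sender: str) -> str:
    # Assumed rule: lowercase, and replace anything outside [A-Za-z0-9._-] with "_".
    return re.sub(r"[^A-Za-z0-9._-]+", "_", sender.strip().lower()).strip("_")


def destination(download_dir: Path, sender: str, filename: str) -> Path:
    # Path(...).name drops directory components an attachment name might smuggle in.
    return download_dir / sanitize_sender(sender) / Path(filename).name


def decode_attachment(data: str) -> bytes:
    """Decode a base64url attachment body, tolerating missing padding."""
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))
```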
- Add unit tests/mocks for Gmail interactions.
- Introduce dry-run mode to list candidate attachments without downloading.
- Support alternate storage backends (e.g., S3) or structured invoice parsing.