-
Notifications
You must be signed in to change notification settings - Fork 1
Project
Version: 2.4 — updated 2026-03-14
Spendify is a personal financial ledger that aggregates heterogeneous bank statements (CSV / XLSX from Italian banks) into a single chronological ledger. The system automatically eliminates double-counting caused by:
- Periodic credit card charges on the current account (RF-03 reconciliation)
- Internal transfers between accounts owned by the same holder (RF-04)
Processing is offline-first: the default LLM backend is local Ollama; OpenAI and Claude are supported as opt-in with mandatory PII sanitization.
| Layer | Technology |
|---|---|
| UI | Streamlit ≥ 1.45 |
| Pipeline | Python 3.13, pandas 2.x |
| ORM / DB | SQLAlchemy 2.x + SQLite |
| Schema validation | Pydantic v2 |
| LLM (local) | Ollama + gemma3:12b (default) |
| LLM (remote) | OpenAI SDK, Anthropic SDK |
| Charts | Plotly |
| HTML Export | Jinja2 |
| Tests | pytest, SQLite in-memory |
| Package manager | uv |
┌─────────────────────────────────────────────────┐
│ Presentation │
│ ├─ Streamlit UI (ui/, app.py) :8501 │
│ └─ FastAPI REST (api/) :8000 │
├─────────────────────────────────────────────────┤
│ Service layer (services/) │
│ TransactionService · RuleService │
│ SettingsService · CategoryService │
│ ImportService │
├─────────────────────────────────────────────────┤
│ Business logic (core/) │
│ classifier · normalizer · categorizer │
│ sanitizer · description_cleaner · orchestrator │
├─────────────────────────────────────────────────┤
│ Persistence (db/) │
│ models (SQLAlchemy ORM) · repository (CRUD) │
│ SQLite: ledger.db │
└─────────────────────────────────────────────────┘
The UI and API are fully independent and interchangeable: both rely on the service layer, which has no knowledge of HTTP or Streamlit.
CSV/XLSX File
│
▼
core/classifier.py ──── Flow 1: schema already in DB (SHA-256 column fingerprint)
│ Flow 2: LLM infers schema from sanitized sample
▼
core/normalizer.py
├─ Encoding detection (chardet)
├─ Delimiter / header detection
├─ parse_amount() → Decimal (never float)
├─ SHA-256 tx_id (idempotent dedup)
├─ invert_sign (card sign correction)
├─ RF-03 card–current account reconciliation
└─ RF-04 internal transfer detection
│
core/description_cleaner.py
└─ LLM: noise removal, text standardization
│
core/categorizer.py
├─ Step 0: user rules
├─ Step 1: static regex
├─ Step 2: ML stub (future)
├─ Step 3: LLM structured output (constrained subcategory enum)
└─ Step 4: fallback "Altro"
│
db/repository.py → persist_import_result()
Each imported file is "signed" by a SHA-256 of the normalized column headers. If the signature is already in document_schema, the saved schema is used (Flow 1, zero LLM cost). Otherwise the LLM infers the schema (Flow 2) and the approved template is saved for subsequent imports.
Processing runs in a background thread. Progress is saved in the DB (import_job table) and polled every 2 seconds by any open browser session. This allows opening a second browser and viewing the progress of an import started elsewhere.
Significant progress points:
-
0%— start -
15%— schema identified / approved -
25%— normalization completed -
38%— description cleaning completed -
40%→100%— batch categorization (≈1 min per 20 transactions with local LLM)
| Table | Description |
|---|---|
transaction |
Each row of the ledger. Key: id (SHA-256 24 char) |
import_batch |
Metadata for each import (file, schema, counts) |
document_schema |
Schema template for Flow 1 (fingerprint → configuration) |
reconciliation_link |
Reconciled card–current account pairs (RF-03) |
internal_transfer_link |
Internal transfer pairs (RF-04) |
category_rule |
Deterministic categorization rules |
user_settings |
User preferences (key/value store) |
import_job |
Current state of the import job |
taxonomy_category |
Taxonomy categories (2 levels) |
taxonomy_subcategory |
Taxonomy subcategories |
| Column | Type | Notes |
|---|---|---|
id |
VARCHAR(24) PK | Truncated SHA-256 of (source_file, date, amount, description) |
date |
VARCHAR(10) | ISO 8601: YYYY-MM-DD
|
amount |
Numeric(18,4) | Always Decimal, never float; negative = expense |
tx_type |
VARCHAR |
expense / income / internal_out / internal_in
|
description |
TEXT | Description cleaned by the LLM |
raw_description |
TEXT | Original description from the file |
category |
VARCHAR | Taxonomy category |
subcategory |
VARCHAR | Taxonomy subcategory |
context |
VARCHAR(64) | Life context (nullable, orthogonal to category) |
account_label |
VARCHAR | Stable account identifier (from user_settings) |
to_review |
BOOLEAN | True if LLM failed or ambiguous |
source_identifier |
VARCHAR | SHA-256 of columns (schema fingerprint) |
| Key | Default | Description |
|---|---|---|
date_display_format |
%d/%m/%Y |
Date format in the UI |
amount_decimal_sep |
, |
Decimal separator |
amount_thousands_sep |
. |
Thousands separator |
description_language |
it |
Language used in LLM prompts |
giroconto_mode |
neutral |
neutral or exclude
|
llm_backend |
local_ollama |
Active LLM backend |
ollama_base_url |
http://localhost:11434 |
Ollama server URL |
ollama_model |
gemma3:12b |
Ollama model |
openai_api_key |
— | OpenAI key |
openai_model |
gpt-4o-mini |
OpenAI model |
anthropic_api_key |
— | Anthropic key |
anthropic_model |
claude-3-5-haiku-20241022 |
Claude model |
owner_names |
— | Account holder names (CSV) for PII redaction and internal transfers |
use_owner_names_giroconto |
false |
Use holder names to detect internal transfers |
contexts |
["Quotidianità","Lavoro","Vacanza"] |
Life contexts (JSON array) |
import_test_mode |
false |
Import only the first 20 rows |
Three deterministic steps, all without LLM:
Step 1 — Keyword regex
The description is compared against patterns configured per schema (internal_transfer_patterns). Positive match → tx_type = internal_out/in with high confidence.
Step 2 — Amount matching + time window
Among transactions with different account_label, a pair with the same absolute amount within ±3 days is searched for. Positive match → link in internal_transfer_link.
Step 3 — Owner name permutations
_build_owner_name_regex() in core/sanitizer.py builds a regex that intercepts all permutations of the holder name tokens. This avoids false negatives when the surname/name order varies across files from different banks.
Cross-account re-execution
Available from the Review page via _rerun_transfer_detection() in ui/review_page.py. Loads all non-internal-transfer transactions, aggregates patterns from all schemas (get_all_transfer_keyword_patterns), and re-runs the three steps updating only the rows where tx_type has changed.
3-phase algorithm to eliminate double-counting of periodic card charges:
- Time window ±45 days
- Contiguous sliding window (gap ≤ 5 days, O(n²))
- Subset sum at boundary (k=10 transactions, ≈10⁶ operations)
Reconciled pairs are recorded in reconciliation_link. Reconciled card transactions are excluded from the net balance.
Transaction
│
├─ Step 0: match on category_rule (highest priority)
│ subcategory → parent category via TaxonomyConfig
│
├─ Step 1: static regex in core/categorizer.py
│
├─ Step 2: [ML stub — future]
│
├─ Step 3: LLM with constrained enum
│ prompt: categorizer.json
│ output: subcategory chosen from valid enum
│ TaxonomyConfig.find_category_for_subcategory() resolves the category
│
└─ Step 4: fallback → "Altro" / "Altro entrate"
to_review = True
Orthogonal dimension to the taxonomy. Each transaction can have at most one context (context VARCHAR(64)). Configurable by the user (add/rename/delete) from the Impostazioni page.
Assignment: from the Ledger page, "🌍 Assegna contesto" panel:
- Manual selection from dropdown menu
- "Apply to similar transactions" option:
get_similar_transactions()indb/repository.pyuses Jaccard token similarity (threshold 0.35) to find transactions with a similar description
Filter: the ledger can be filtered by specific context, "all", or "none" (NULL).
core/description_cleaner.py calls the LLM to remove noise (internal bank codes, operation IDs, partial IBANs) and standardize the text. The result is saved in description; the original remains in raw_description.
If the LLM fails, description remains equal to raw_description. The "🔄 Rielabora con LLM" button on the Review page uses this condition as a filter to identify transactions to reprocess.
Before any call to a remote backend:
| Pattern | Replacement |
|---|---|
| IBAN | <ACCOUNT_ID> |
| PAN (card) | <CARD_ID> |
| Tax code | <FISCAL_ID> |
| Holder names | <OWNER> |
assert_sanitized() verifies the absence of detectable patterns and blocks the call if any are found.
Migrations are idempotent and run automatically at startup in db/models.py → create_tables():
| Function | Addition |
|---|---|
_migrate_add_user_settings() |
user_settings table (key/value store) |
_migrate_add_import_job() |
import_job table |
_migrate_add_raw_description() |
raw_description column on transaction
|
_migrate_add_account_label() |
account_label column on transaction
|
_migrate_add_context() |
context column on transaction
|
9 Streamlit pages managed by app.py + ui/sidebar.py:
📥 Import upload_page.py
📋 Ledger registry_page.py
✏️ Modifiche massive bulk_edit_page.py
📊 Analytics analysis_page.py
🔍 Review review_page.py
📏 Regole rules_page.py
🗂️ Tassonomia taxonomy_page.py
⚙️ Impostazioni settings_page.py
✅ Check List checklist_page.py
- Multi-file upload (CSV / XLSX)
- Live progress bar (DB polling every 2s)
- Visible from any browser that has the app open
- Summary at completion: imported transactions, reconciled, internal transfers found, flow used
- Bulk operations on a reference transaction: internal transfer toggle, context assignment (Jaccard similarity ≥ 35%), category/subcategory correction + rule saving
- Bulk deletion by filter: combinable filters (date, account, type, description, category); at least one filter required; preview of first 10 rows; confirmation by typing
ELIMINA; irreversible deletion - Cross-account duplicate detection: pivot table to identify transactions present on multiple accounts
- Filters: date range, transaction type, description (full-text on description + raw_description), category, context, review flag
- Click on a row → instant selection with details in sidebar
- "🌍 Assegna contesto" panel with similarity suggestions
- Internal transfer toggle (single + bulk by description)
- Separate Income/Expense columns, right-aligned
- Metrics: net balance, total income, total expenses
- CSV / XLSX download of the filtered ledger
- Only transactions with
to_review=True - Internal transfer toggle + bulk-apply
- Category/subcategory correction with optional saving as a rule
- "🔄 Rielabora con LLM": re-runs cleaning + categorization on uncleaned transactions
- "🔁 Riesegui rilevamento giroconti": re-runs RF-04 globally
- Date format and amount separators (with live preview)
- Description language (used in LLM prompts)
- Internal transfer mode (neutral / exclude)
- Holder names + toggle for use in internal transfer detection
- Life contexts (editable list: add/rename/delete)
- Import test mode (first 20 rows only)
- Bank account list (add/delete, used as stable
account_labelfor dedup) - LLM backend: Ollama / OpenAI / Claude + model + API keys
- Month × account pivot table with the number of transactions for each combination
- Rows in descending order: current month at the top, then going back in time
- The current month always appears (even if it has no transactions yet)
- Columns: all accounts defined in
account+ anyaccount_labelfrom transactions not yet formalized - Empty cell (0 tx): — symbol in light grey; cell with tx: number with proportional coloring (light blue → dark blue)
- Three KPIs at the top: total transactions, monitored accounts, months with data
- Filters: account selection, last N months, hide months without transactions
- CSV download of the filtered table
| Type | Behavior |
|---|---|
contains |
Pattern anywhere in the description (case-insensitive) |
exact |
Description equal to the pattern (case-insensitive) |
regex |
Full Python regex |
Same (pattern, match_type) pair → in-place update of category/priority (no duplicates).
Rules are applied to all existing transactions on save, not just to future imports. The count of updated transactions is shown to the user.
Run all rules (bulk): the "to_review=True). Useful after creating multiple rules in different sessions or after importing historical data without an active LLM.
Two DB tables: taxonomy_category and taxonomy_subcategory. Initial seeding from taxonomy.yaml.
TaxonomyConfig.find_category_for_subcategory() resolves the parent category from any valid subcategory. The LLM and rules can specify only the subcategory and the category is resolved automatically.
Expenses (15): Casa · Alimentari · Ristorazione · Trasporti · Salute · Istruzione · Abbigliamento · Comunicazioni · Svago e tempo libero · Animali domestici · Finanza e assicurazioni · Cura personale · Tasse e tributi · Regali e donazioni · Altro
Income (7): Lavoro dipendente · Lavoro autonomo · Rendite finanziarie · Rendite immobiliari · Trasferimenti e rimborsi · Prestazioni sociali · Altro entrate
- Local-first: Ollama is the default backend, no data leaves the process
-
Mandatory sanitization:
assert_sanitized()blocks remote calls if PII is found - Holder names: configurable from the UI, removed from all descriptions before remote LLM calls and used to detect internal transfers
- No LangChain: OpenAI, Anthropic SDKs and requests used directly, minimal attack surface
-
Excel and numeric locale: Excel numeric cells lose the original format (e.g.,
2,50becomes2.5). Theraw_amountfield for Excel files will show"2.5"— a limitation of the Excel format, not a bug. - Cross-account internal transfers: detected only if both files have already been imported. Solution: "Riesegui rilevamento giroconti" button on the Review page.
- Async LLM: categorization happens in the background. With local Ollama and gemma3:12b, each batch of 20 transactions takes about 1 minute.
The FastAPI layer (api/) exposes the same ledger operations over HTTP/JSON, without touching the Streamlit UI.
Start: uv run uvicorn api.main:app --host 0.0.0.0 --port 8000
Interactive docs: http://localhost:8000/docs (Swagger UI)
| Method | Path | Description |
|---|---|---|
| GET | /health |
Liveness check |
| GET | /transactions |
List with filters (date, category, account, to_review) |
| PATCH | /transactions/{id}/category |
Update category and subcategory |
| PATCH | /transactions/{id}/context |
Update life context |
| POST | /transactions/{id}/toggle-giroconto |
Toggle internal transfer flag |
| DELETE | /transactions |
Bulk delete by filter (at least 1 filter required) |
| GET/POST/PATCH/DELETE | /rules/category |
Category rule CRUD |
| POST | /rules/category/apply-to-review |
Apply rules to pending review |
| POST | /rules/category/apply-to-all |
Apply rules to all transactions |
| GET/POST/DELETE | /rules/description |
Description rule CRUD |
| GET | /settings |
All settings (API keys redacted) |
| GET/PUT | /settings/{key} |
Read/write a single setting |
| GET/POST/DELETE | /accounts |
Account CRUD |
| GET/POST/PATCH/DELETE | /taxonomy/categories |
Taxonomy category CRUD |
| GET | /import/jobs/latest |
Latest import job status |
- API keys (
openai_api_key,anthropic_api_key) are always redacted (***) in GET responses - The same keys cannot be updated via API (403 Forbidden) — only from the Settings UI
- CORS configured for
localhost:8501(Streamlit) by default
In Docker Compose, the api service shares the spendify_data volume (same ledger.db) with the Streamlit service.
# Full suite
uv run python -m pytest tests/ -v
# With coverage
uv run python -m pytest tests/ --cov=core --cov=db --cov-report=term-missingAll tests use SQLite in-memory — no files, no external services.
| File | Coverage |
|---|---|
test_normalizer.py |
parse_amount, SHA-256, encoding |
test_backends.py |
Factory, validation, mock Ollama |
test_categorizer.py |
4-step cascade, taxonomy resolution |
test_repository_rules.py |
Rule upsert, pattern matching, internal transfer toggle, bulk ops |