Project

Spendify — Project Document

Version: 2.4 — updated 2026-03-14

1. Objective

Spendify is a personal financial ledger that aggregates heterogeneous bank statements (CSV / XLSX from Italian banks) into a single chronological ledger. The system automatically eliminates double-counting caused by:

Periodic credit card charges on the current account (RF-03 reconciliation)
Internal transfers between accounts owned by the same holder (RF-04)

Processing is offline-first: the default LLM backend is local Ollama; OpenAI and Claude are supported as opt-in with mandatory PII sanitization.

2. Technology Stack

Layer	Technology
UI	Streamlit ≥ 1.45
Pipeline	Python 3.13, pandas 2.x
ORM / DB	SQLAlchemy 2.x + SQLite
Schema validation	Pydantic v2
LLM (local)	Ollama + gemma3:12b (default)
LLM (remote)	OpenAI SDK, Anthropic SDK
Charts	Plotly
HTML Export	Jinja2
Tests	pytest, SQLite in-memory
Package manager	uv

3. System Architecture

3.0 Architectural layers

┌─────────────────────────────────────────────────┐
│  Presentation                                   │
│  ├─ Streamlit UI  (ui/, app.py)   :8501         │
│  └─ FastAPI REST  (api/)          :8000         │
├─────────────────────────────────────────────────┤
│  Service layer   (services/)                    │
│  TransactionService · RuleService               │
│  SettingsService · CategoryService              │
│  ImportService                                  │
├─────────────────────────────────────────────────┤
│  Business logic  (core/)                        │
│  classifier · normalizer · categorizer          │
│  sanitizer · description_cleaner · orchestrator │
├─────────────────────────────────────────────────┤
│  Persistence     (db/)                          │
│  models (SQLAlchemy ORM) · repository (CRUD)    │
│  SQLite: ledger.db                              │
└─────────────────────────────────────────────────┘

The UI and API are fully independent and interchangeable: both rely on the service layer, which has no knowledge of HTTP or Streamlit.

3.1 Import Pipeline

CSV/XLSX File
     │
     ▼
core/classifier.py   ──── Flow 1: schema already in DB (SHA-256 column fingerprint)
     │                    Flow 2: LLM infers schema from sanitized sample
     ▼
core/normalizer.py
  ├─ Encoding detection (chardet)
  ├─ Delimiter / header detection
  ├─ parse_amount() → Decimal (never float)
  ├─ SHA-256 tx_id (idempotent dedup)
  ├─ invert_sign (card sign correction)
  ├─ RF-03 card–current account reconciliation
  └─ RF-04 internal transfer detection
     │
core/description_cleaner.py
  └─ LLM: noise removal, text standardization
     │
core/categorizer.py
  ├─ Step 0: user rules
  ├─ Step 1: static regex
  ├─ Step 2: ML stub (future)
  ├─ Step 3: LLM structured output (constrained subcategory enum)
  └─ Step 4: fallback "Altro"
     │
db/repository.py → persist_import_result()

3.2 Schema Identification (Flow 1 vs Flow 2)

Each imported file is "signed" by a SHA-256 of the normalized column headers. If the signature is already in document_schema, the saved schema is used (Flow 1, zero LLM cost). Otherwise the LLM infers the schema (Flow 2) and the approved template is saved for subsequent imports.

3.3 Import Job

Processing runs in a background thread. Progress is saved in the DB (import_job table) and polled every 2 seconds by any open browser session. This allows opening a second browser and viewing the progress of an import started elsewhere.

Significant progress points:

0% — start
15% — schema identified / approved
25% — normalization completed
38% — description cleaning completed
40%→100% — batch categorization (≈1 min per 20 transactions with local LLM)

4. Data Model

4.1 Main Tables

Table	Description
`transaction`	Each row of the ledger. Key: `id` (SHA-256 24 char)
`import_batch`	Metadata for each import (file, schema, counts)
`document_schema`	Schema template for Flow 1 (fingerprint → configuration)
`reconciliation_link`	Reconciled card–current account pairs (RF-03)
`internal_transfer_link`	Internal transfer pairs (RF-04)
`category_rule`	Deterministic categorization rules
`user_settings`	User preferences (key/value store)
`import_job`	Current state of the import job
`taxonomy_category`	Taxonomy categories (2 levels)
`taxonomy_subcategory`	Taxonomy subcategories

4.2 Transaction — Key Columns

Column	Type	Notes
`id`	VARCHAR(24) PK	Truncated SHA-256 of (source_file, date, amount, description)
`date`	VARCHAR(10)	ISO 8601: `YYYY-MM-DD`
`amount`	Numeric(18,4)	Always Decimal, never float; negative = expense
`tx_type`	VARCHAR	`expense` / `income` / `internal_out` / `internal_in`
`description`	TEXT	Description cleaned by the LLM
`raw_description`	TEXT	Original description from the file
`category`	VARCHAR	Taxonomy category
`subcategory`	VARCHAR	Taxonomy subcategory
`context`	VARCHAR(64)	Life context (nullable, orthogonal to category)
`account_label`	VARCHAR	Stable account identifier (from user_settings)
`to_review`	BOOLEAN	True if LLM failed or ambiguous
`source_identifier`	VARCHAR	SHA-256 of columns (schema fingerprint)

4.3 UserSettings — Relevant Keys

Key	Default	Description
`date_display_format`	`%d/%m/%Y`	Date format in the UI
`amount_decimal_sep`	`,`	Decimal separator
`amount_thousands_sep`	`.`	Thousands separator
`description_language`	`it`	Language used in LLM prompts
`giroconto_mode`	`neutral`	`neutral` or `exclude`
`llm_backend`	`local_ollama`	Active LLM backend
`ollama_base_url`	`http://localhost:11434`	Ollama server URL
`ollama_model`	`gemma3:12b`	Ollama model
`openai_api_key`	—	OpenAI key
`openai_model`	`gpt-4o-mini`	OpenAI model
`anthropic_api_key`	—	Anthropic key
`anthropic_model`	`claude-3-5-haiku-20241022`	Claude model
`owner_names`	—	Account holder names (CSV) for PII redaction and internal transfers
`use_owner_names_giroconto`	`false`	Use holder names to detect internal transfers
`contexts`	`["Quotidianità","Lavoro","Vacanza"]`	Life contexts (JSON array)
`import_test_mode`	`false`	Import only the first 20 rows

5. Main Features

5.1 Internal Transfer Detection (RF-04)

Three deterministic steps, all without LLM:

Step 1 — Keyword regex The description is compared against patterns configured per schema (internal_transfer_patterns). Positive match → tx_type = internal_out/in with high confidence.

Step 2 — Amount matching + time window Among transactions with different account_label, a pair with the same absolute amount within ±3 days is searched for. Positive match → link in internal_transfer_link.

Step 3 — Owner name permutations _build_owner_name_regex() in core/sanitizer.py builds a regex that intercepts all permutations of the holder name tokens. This avoids false negatives when the surname/name order varies across files from different banks.

Cross-account re-execution Available from the Review page via _rerun_transfer_detection() in ui/review_page.py. Loads all non-internal-transfer transactions, aggregates patterns from all schemas (get_all_transfer_keyword_patterns), and re-runs the three steps updating only the rows where tx_type has changed.

5.2 Card–Account Reconciliation (RF-03)

3-phase algorithm to eliminate double-counting of periodic card charges:

Time window ±45 days
Contiguous sliding window (gap ≤ 5 days, O(n²))
Subset sum at boundary (k=10 transactions, ≈10⁶ operations)

Reconciled pairs are recorded in reconciliation_link. Reconciled card transactions are excluded from the net balance.

5.3 Cascading Categorization (RF-05)

Transaction
    │
    ├─ Step 0: match on category_rule (highest priority)
    │          subcategory → parent category via TaxonomyConfig
    │
    ├─ Step 1: static regex in core/categorizer.py
    │
    ├─ Step 2: [ML stub — future]
    │
    ├─ Step 3: LLM with constrained enum
    │          prompt: categorizer.json
    │          output: subcategory chosen from valid enum
    │          TaxonomyConfig.find_category_for_subcategory() resolves the category
    │
    └─ Step 4: fallback → "Altro" / "Altro entrate"
                          to_review = True

5.4 Life Contexts

Orthogonal dimension to the taxonomy. Each transaction can have at most one context (context VARCHAR(64)). Configurable by the user (add/rename/delete) from the Impostazioni page.

Assignment: from the Ledger page, "🌍 Assegna contesto" panel:

Manual selection from dropdown menu
"Apply to similar transactions" option: get_similar_transactions() in db/repository.py uses Jaccard token similarity (threshold 0.35) to find transactions with a similar description

Filter: the ledger can be filtered by specific context, "all", or "none" (NULL).

5.5 Description Cleaning

core/description_cleaner.py calls the LLM to remove noise (internal bank codes, operation IDs, partial IBANs) and standardize the text. The result is saved in description; the original remains in raw_description.

If the LLM fails, description remains equal to raw_description. The "🔄 Rielabora con LLM" button on the Review page uses this condition as a filter to identify transactions to reprocess.

5.6 PII Sanitization (RF-10)

Before any call to a remote backend:

Pattern	Replacement
IBAN	`<ACCOUNT_ID>`
PAN (card)	`<CARD_ID>`
Tax code	`<FISCAL_ID>`
Holder names	`<OWNER>`

assert_sanitized() verifies the absence of detectable patterns and blocks the call if any are found.

6. DB Migrations

Migrations are idempotent and run automatically at startup in db/models.py → create_tables():

Function	Addition
`_migrate_add_user_settings()`	`user_settings` table (key/value store)
`_migrate_add_import_job()`	`import_job` table
`_migrate_add_raw_description()`	`raw_description` column on `transaction`
`_migrate_add_account_label()`	`account_label` column on `transaction`
`_migrate_add_context()`	`context` column on `transaction`

7. User Interface

7.1 Navigation

9 Streamlit pages managed by app.py + ui/sidebar.py:

📥 Import              upload_page.py
📋 Ledger              registry_page.py
✏️ Modifiche massive   bulk_edit_page.py
📊 Analytics           analysis_page.py
🔍 Review              review_page.py
📏 Regole              rules_page.py
🗂️ Tassonomia          taxonomy_page.py
⚙️ Impostazioni        settings_page.py
✅ Check List          checklist_page.py

7.2 Import Page

Multi-file upload (CSV / XLSX)
Live progress bar (DB polling every 2s)
Visible from any browser that has the app open
Summary at completion: imported transactions, reconciled, internal transfers found, flow used

7.3 Modifiche massive Page

Bulk operations on a reference transaction: internal transfer toggle, context assignment (Jaccard similarity ≥ 35%), category/subcategory correction + rule saving
Bulk deletion by filter: combinable filters (date, account, type, description, category); at least one filter required; preview of first 10 rows; confirmation by typing ELIMINA; irreversible deletion
Cross-account duplicate detection: pivot table to identify transactions present on multiple accounts

7.4 Ledger Page

Filters: date range, transaction type, description (full-text on description + raw_description), category, context, review flag
Click on a row → instant selection with details in sidebar
"🌍 Assegna contesto" panel with similarity suggestions
Internal transfer toggle (single + bulk by description)
Separate Income/Expense columns, right-aligned
Metrics: net balance, total income, total expenses
CSV / XLSX download of the filtered ledger

7.5 Review Page

Only transactions with to_review=True
Internal transfer toggle + bulk-apply
Category/subcategory correction with optional saving as a rule
"🔄 Rielabora con LLM": re-runs cleaning + categorization on uncleaned transactions
"🔁 Riesegui rilevamento giroconti": re-runs RF-04 globally

7.6 Impostazioni Page

Date format and amount separators (with live preview)
Description language (used in LLM prompts)
Internal transfer mode (neutral / exclude)
Holder names + toggle for use in internal transfer detection
Life contexts (editable list: add/rename/delete)
Import test mode (first 20 rows only)
Bank account list (add/delete, used as stable account_label for dedup)
LLM backend: Ollama / OpenAI / Claude + model + API keys

7.7 Check List Page

Month × account pivot table with the number of transactions for each combination
Rows in descending order: current month at the top, then going back in time
The current month always appears (even if it has no transactions yet)
Columns: all accounts defined in account + any account_label from transactions not yet formalized
Empty cell (0 tx): — symbol in light grey; cell with tx: number with proportional coloring (light blue → dark blue)
Three KPIs at the top: total transactions, monitored accounts, months with data
Filters: account selection, last N months, hide months without transactions
CSV download of the filtered table

8. Categorization Rules

8.1 Match Types

Type	Behavior
`contains`	Pattern anywhere in the description (case-insensitive)
`exact`	Description equal to the pattern (case-insensitive)
`regex`	Full Python regex

8.2 Upsert Semantics

Same (pattern, match_type) pair → in-place update of category/priority (no duplicates).

8.3 Retroactive Application

Rules are applied to all existing transactions on save, not just to future imports. The count of updated transactions is shown to the user.

Run all rules (bulk): the "▶️ Esegui tutte le regole" button on the Regole page applies all active rules to every transaction in the ledger at once (not only to_review=True). Useful after creating multiple rules in different sessions or after importing historical data without an active LLM.

9. Taxonomy

9.1 Storage

Two DB tables: taxonomy_category and taxonomy_subcategory. Initial seeding from taxonomy.yaml.

9.2 Subcategory as Source of Truth

TaxonomyConfig.find_category_for_subcategory() resolves the parent category from any valid subcategory. The LLM and rules can specify only the subcategory and the category is resolved automatically.

9.3 Default Categories

Expenses (15): Casa · Alimentari · Ristorazione · Trasporti · Salute · Istruzione · Abbigliamento · Comunicazioni · Svago e tempo libero · Animali domestici · Finanza e assicurazioni · Cura personale · Tasse e tributi · Regali e donazioni · Altro

Income (7): Lavoro dipendente · Lavoro autonomo · Rendite finanziarie · Rendite immobiliari · Trasferimenti e rimborsi · Prestazioni sociali · Altro entrate

10. Privacy and Security

Local-first: Ollama is the default backend, no data leaves the process
Mandatory sanitization: assert_sanitized() blocks remote calls if PII is found
Holder names: configurable from the UI, removed from all descriptions before remote LLM calls and used to detect internal transfers
No LangChain: OpenAI, Anthropic SDKs and requests used directly, minimal attack surface

11. Known Limitations

Excel and numeric locale: Excel numeric cells lose the original format (e.g., 2,50 becomes 2.5). The raw_amount field for Excel files will show "2.5" — a limitation of the Excel format, not a bug.
Cross-account internal transfers: detected only if both files have already been imported. Solution: "Riesegui rilevamento giroconti" button on the Review page.
Async LLM: categorization happens in the background. With local Ollama and gemma3:12b, each batch of 20 transactions takes about 1 minute.

12. REST API

The FastAPI layer (api/) exposes the same ledger operations over HTTP/JSON, without touching the Streamlit UI.

Start: uv run uvicorn api.main:app --host 0.0.0.0 --port 8000 Interactive docs: http://localhost:8000/docs (Swagger UI)

Main endpoints

Method	Path	Description
GET	`/health`	Liveness check
GET	`/transactions`	List with filters (date, category, account, to_review)
PATCH	`/transactions/{id}/category`	Update category and subcategory
PATCH	`/transactions/{id}/context`	Update life context
POST	`/transactions/{id}/toggle-giroconto`	Toggle internal transfer flag
DELETE	`/transactions`	Bulk delete by filter (at least 1 filter required)
GET/POST/PATCH/DELETE	`/rules/category`	Category rule CRUD
POST	`/rules/category/apply-to-review`	Apply rules to pending review
POST	`/rules/category/apply-to-all`	Apply rules to all transactions
GET/POST/DELETE	`/rules/description`	Description rule CRUD
GET	`/settings`	All settings (API keys redacted)
GET/PUT	`/settings/{key}`	Read/write a single setting
GET/POST/DELETE	`/accounts`	Account CRUD
GET/POST/PATCH/DELETE	`/taxonomy/categories`	Taxonomy category CRUD
GET	`/import/jobs/latest`	Latest import job status

Security

API keys (openai_api_key, anthropic_api_key) are always redacted (***) in GET responses
The same keys cannot be updated via API (403 Forbidden) — only from the Settings UI
CORS configured for localhost:8501 (Streamlit) by default

Docker

In Docker Compose, the api service shares the spendify_data volume (same ledger.db) with the Streamlit service.

13. Tests

# Full suite
uv run python -m pytest tests/ -v

# With coverage
uv run python -m pytest tests/ --cov=core --cov=db --cov-report=term-missing

All tests use SQLite in-memory — no files, no external services.

File	Coverage
`test_normalizer.py`	`parse_amount`, SHA-256, encoding
`test_backends.py`	Factory, validation, mock Ollama
`test_categorizer.py`	4-step cascade, taxonomy resolution
`test_repository_rules.py`	Rule upsert, pattern matching, internal transfer toggle, bulk ops

Project

Spendify — Project Document

1. Objective

2. Technology Stack

3. System Architecture

3.0 Architectural layers

3.1 Import Pipeline

3.2 Schema Identification (Flow 1 vs Flow 2)

3.3 Import Job

4. Data Model

4.1 Main Tables

4.2 Transaction — Key Columns

4.3 UserSettings — Relevant Keys

5. Main Features

5.1 Internal Transfer Detection (RF-04)

5.2 Card–Account Reconciliation (RF-03)

5.3 Cascading Categorization (RF-05)

5.4 Life Contexts

5.5 Description Cleaning

5.6 PII Sanitization (RF-10)

6. DB Migrations

7. User Interface

7.1 Navigation

7.2 Import Page

7.3 Modifiche massive Page

7.4 Ledger Page

7.5 Review Page

7.6 Impostazioni Page

7.7 Check List Page

8. Categorization Rules

8.1 Match Types

8.2 Upsert Semantics

8.3 Retroactive Application

9. Taxonomy

9.1 Storage

9.2 Subcategory as Source of Truth

9.3 Default Categories

10. Privacy and Security

11. Known Limitations

12. REST API

Main endpoints

Security

Docker

13. Tests

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally