Deterministic Rules
Complete inventory of all rules, algorithms and transformations that are non-LLM and implemented in the system. For each tool: where it lives, what it does, the hardcoded rules and the point of application in the pipeline.
FILE (CSV / XLSX)
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. FORMAT DETECTION │ ◄─ DETERMINISTIC
│ detect_encoding · detect_delimiter │
│ detect_header_row · detect_best_sheet │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1b. PRE-PROCESSING Phase 0 │ ◄─ DETERMINISTIC
│ detect_and_strip_preheader_rows │
│ drop_low_variability_columns │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. DOCUMENT CLASSIFICATION — Phase 0 │ ◄─ DETERMINISTIC
│ column synonyms · sign inspection │
└────────────────────────────┬────────────────────────────────┘
│ LLM for ambiguous fields (Phase 1)
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. NORMALISATION │ ◄─ DETERMINISTIC
│ parse_date_safe · parse_amount · apply_sign_convention │
│ normalize_description · compute_transaction_id (SHA-256)│
│ _infer_tx_type · remove_card_balance_row │
└────────────────────────────┬────────────────────────────────┘
│ ID calculated here from raw values
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. DEDUP CHECK │ ◄─ DETERMINISTIC
│ get_existing_tx_ids (repository.py) │
│ → abort if all already in DB, zero wasted LLM calls │
└────────────────────────────┬────────────────────────────────┘
│ only new txs proceed
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. DESCRIPTION CLEANING │
│ PRIVACY / PII REDACTION ◄─ DETERMINISTIC │
│ redact_pii · restore_owner_placeholders │
│ (applied BEFORE and AFTER every LLM call) │
│ ◄─ LLM (counterparty extraction)│
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 6. INTERNAL TRANSFER DETECTION [RF-04] │ ◄─ DETERMINISTIC
│ detect_internal_transfers │
│ Phase 1: amount+date matching │
│ Phase 2: owner name matching │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 7. CARD RECONCILIATION [RF-03] │ ◄─ DETERMINISTIC
│ find_card_settlement_matches │
│ sliding window · subset sum │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 8. CATEGORISATION — Levels 0 and 1 │ ◄─ DETERMINISTIC
│ Lv. 0: user rules (CategoryRule.matches) │
│ Lv. 1: static keyword rules │
└────────────────────────────┬────────────────────────────────┘
│ LLM only if no rule matches
▼
┌─────────────────────────────────────────────────────────────┐
│ 9. DB PERSISTENCE │ ◄─ DETERMINISTIC
│ idempotent upsert · SHA-256 for file and transaction │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 10. REVIEW — auto-apply rules │ ◄─ DETERMINISTIC
│ apply_rules_to_review_transactions (to_review=True) │
│ apply_all_rules_to_all_transactions (all txs) │
│ bulk description rules · DescriptionRule │
└─────────────────────────────────────────────────────────────┘
Module: core/normalizer.py
When: stage 1, before any parsing
| Function | Hardcoded rule |
|---|---|
| detect_encoding(raw_bytes) | chardet → normalises alias (ascii → utf-8) |
| detect_delimiter(content) | counts frequency of `,` `;` `\t` `\|` → most frequent wins |
| detect_header_row(lines) | first row with ≥ 2 non-numeric fields; numeric pattern: `^[\d\.\,\-\+\s€$£%]+$` |
| detect_best_sheet(workbook) | excludes sheets named `summary\|totale\|riepilogo`; score = rows + (numeric columns × 10) |
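The delimiter rule above can be sketched in a few lines. This is a minimal illustration, not the code in core/normalizer.py: the constant name `CANDIDATE_DELIMITERS` and the tie-breaking behaviour (first candidate wins on equal counts) are assumptions.

```python
# Hypothetical sketch of the "most frequent delimiter wins" rule.
CANDIDATE_DELIMITERS = (",", ";", "\t", "|")


def detect_delimiter(content: str) -> str:
    """Return the candidate delimiter that occurs most often in content.

    On a tie, max() keeps the first candidate in CANDIDATE_DELIMITERS
    (an assumption; the source does not state a tie-breaking rule).
    """
    return max(CANDIDATE_DELIMITERS, key=content.count)


sample = "Data;Descrizione;Importo\n01/02/2024;Spesa;-12,50\n"
print(repr(detect_delimiter(sample)))
```

Counting over the whole content is the simplest reading of the rule; a production version might sample only the first lines.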
Module: core/classifier.py
When: stage 2 (Flow 2), only if source has no schema in DB
Resolves column fields without LLM via synonyms:
| Field | Recognised synonyms |
|---|---|
| date_col | data, date, data operazione, booking date, buchungsdatum, … |
| amount_col | importo, amount, betrag, montant, somme, … |
| debit_col | dare, addebiti, uscite, debit, ausgaben, … |
| credit_col | avere, accrediti, entrate, credit, einnahmen, … |
| description_col | descrizione, causale, memo, payee, bezeichnung, libellé, … |
Sign inspection (Phase 0.5):
If the amount_col semantics are "neutral", the actual data is read: if any value is < 0, invert_sign=False is certain and no LLM call is needed.
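A rough sketch of how synonym resolution and sign inspection could fit together. The function names `resolve_columns` and `inspect_sign` are hypothetical, and the synonym sets contain only the entries listed in the table above (the real lists are longer).

```python
from typing import Dict, Iterable, List, Optional

# Partial synonym sets: only the values shown in the table above.
SYNONYMS: Dict[str, set] = {
    "date_col": {"data", "date", "data operazione", "booking date", "buchungsdatum"},
    "amount_col": {"importo", "amount", "betrag", "montant", "somme"},
    "debit_col": {"dare", "addebiti", "uscite", "debit", "ausgaben"},
    "credit_col": {"avere", "accrediti", "entrate", "credit", "einnahmen"},
    "description_col": {"descrizione", "causale", "memo", "payee", "bezeichnung", "libellé"},
}


def resolve_columns(headers: List[str]) -> Dict[str, str]:
    """Map schema fields to header names by case-insensitive synonym lookup."""
    resolved: Dict[str, str] = {}
    for header in headers:
        key = header.casefold().strip()
        for field, names in SYNONYMS.items():
            if key in names and field not in resolved:
                resolved[field] = header
    return resolved


def inspect_sign(values: Iterable[float]) -> Optional[bool]:
    """Return invert_sign=False if any value is negative, else None (undecided)."""
    return False if any(v < 0 for v in values) else None


print(resolve_columns(["Data", "Descrizione", "Importo", "Saldo"]))
print(inspect_sign([12.5, -3.0, 40.0]))
```

Fields left unresolved here (or a sign result of `None`) would be the cases handed to the LLM in Phase 1.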
Module: core/normalizer.py, core/orchestrator.py
When: stage 3, after schema classification
parse_date_safe(value, format)
- Tries the schema format
- Fallback to common formats (in order): %d/%m/%Y · %d-%m-%Y · %d/%m/%y · %d-%m-%y · %Y-%m-%d · %Y/%m/%d · %m/%d/%Y · %m/%d/%y
- Returns None if everything fails (row discarded)
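The fallback chain can be sketched with `datetime.strptime`. This is an illustrative version under the assumption that the input is a string and that the schema format, when present, is a strptime pattern. The parameter is named `fmt` here to avoid shadowing the built-in `format`.

```python
from datetime import date, datetime
from typing import Optional

# Fallback formats in the order listed above (day-first before month-first).
FALLBACK_FORMATS = (
    "%d/%m/%Y", "%d-%m-%Y", "%d/%m/%y", "%d-%m-%y",
    "%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%m/%d/%y",
)


def parse_date_safe(value: str, fmt: Optional[str] = None) -> Optional[date]:
    """Try the schema format first, then each fallback; None if all fail."""
    candidates = ([fmt] if fmt else []) + list(FALLBACK_FORMATS)
    for candidate in candidates:
        try:
            return datetime.strptime(value.strip(), candidate).date()
        except ValueError:
            continue
    return None  # caller discards the row


print(parse_date_safe("31/12/2023"))
print(parse_date_safe("12/31/2023"))
print(parse_date_safe("not a date"))
```

Because day-first formats come first, an ambiguous value such as 01/02/2023 resolves to 1 February unless the schema format says otherwise.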
parse_amount(value)
Strip symbols: € $ £ (spaces)
Separator heuristic:
"1.234,56" → dot = thousands, comma = decimal → 1234.56
"1,234.56" → comma = thousands, dot = decimal → 1234.56
"1234,56" → comma only with ≤ 2 decimal digits → 1234.56
"1234.56" → dot only with ≤ 2 decimal digits → 1234.56
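One possible reading of the separator heuristic, as a sketch. Two points are assumptions rather than documented behaviour: a single separator followed by more than two digits is treated as a thousands separator, and unparseable input returns None.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(value: str) -> Optional[Decimal]:
    """Parse a localized amount string into a Decimal."""
    text = value
    for symbol in ("€", "$", "£", " ", "\u00a0"):
        text = text.replace(symbol, "")
    if "," in text and "." in text:
        # Both present: whichever comes last is the decimal separator.
        if text.rfind(",") > text.rfind("."):
            text = text.replace(".", "").replace(",", ".")
        else:
            text = text.replace(",", "")
    elif "," in text or "." in text:
        sep = "," if "," in text else "."
        head, _, tail = text.rpartition(sep)
        if text.count(sep) == 1 and len(tail) <= 2:
            text = f"{head}.{tail}"       # decimal separator
        else:
            text = text.replace(sep, "")  # assumed thousands separator
    try:
        return Decimal(text)
    except InvalidOperation:
        return None


for raw in ("1.234,56", "1,234.56", "1234,56", "1234.56", "€ -12,5"):
    print(raw, "->", parse_amount(raw))
```

Using `Decimal` rather than `float` keeps comparisons against the 0.01 € tolerance exact.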
apply_sign_convention(row, convention)
| Convention | Rule |
|---|---|
| signed_single | uses amount_col as-is |
| debit_positive | credit − debit (both positive in CSV) |
| credit_negative | credit as-is positive; debit negated |
After: if invert_sign=True (typical for cards) → multiply by −1.
normalize_description(text)
unicodedata.normalize("NFC", text).casefold().strip()
Ensures stable case-insensitive comparisons; never modifies raw_description.
compute_transaction_id(account_label, date, amount, description)
SHA-256[:24] of the string: {account_label}|{ISO date}|{amount}|{raw_description}
Used on raw values → stable across normalisation versions.
compute_file_hash(raw_bytes)
Full SHA-256 of the file → import-level dedup.
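Both hashes can be sketched with `hashlib`. Assumptions in this sketch: the 24-character truncation applies to the hexadecimal digest, the string is UTF-8 encoded, and the amount is formatted with `str()`; the source does not specify these details.

```python
import hashlib
from datetime import date
from decimal import Decimal


def compute_transaction_id(account_label: str, tx_date: date,
                           amount: Decimal, raw_description: str) -> str:
    """First 24 hex characters of SHA-256 over the raw, pipe-joined values."""
    payload = f"{account_label}|{tx_date.isoformat()}|{amount}|{raw_description}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:24]


def compute_file_hash(raw_bytes: bytes) -> str:
    """Full SHA-256 hex digest of the file content."""
    return hashlib.sha256(raw_bytes).hexdigest()


tx_id = compute_transaction_id("checking", date(2024, 1, 15),
                               Decimal("-42.50"), "POS ESSELUNGA MILANO")
print(tx_id, len(tx_id))
```

Hashing the raw description rather than the normalised one is what keeps the ID stable when normalize_description changes.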
_infer_tx_type(amount, doc_type, description, internal_patterns)
1. description matches internal_patterns (list from DB) → internal_out / internal_in
2. doc_type in {credit_card, debit_card, prepaid_card} → card_tx
3. amount ≥ 0 → income
4. amount < 0 → expense
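The four rules form a simple priority chain. In this sketch the internal patterns are assumed to be regular expressions matched case-insensitively, and the direction of an internal transfer is assumed to follow the sign of the amount; neither detail is stated in the source.

```python
import re
from typing import Iterable

CARD_DOC_TYPES = {"credit_card", "debit_card", "prepaid_card"}


def infer_tx_type(amount: float, doc_type: str, description: str,
                  internal_patterns: Iterable[str]) -> str:
    """Apply the rules in order; the first one that applies wins."""
    # 1. Internal transfer keywords (assumed: regex, case-insensitive).
    if any(re.search(p, description, re.IGNORECASE) for p in internal_patterns):
        return "internal_out" if amount < 0 else "internal_in"
    # 2. Any card document yields card transactions.
    if doc_type in CARD_DOC_TYPES:
        return "card_tx"
    # 3 and 4. Otherwise fall back to the sign of the amount.
    return "income" if amount >= 0 else "expense"


print(infer_tx_type(-50.0, "bank_account", "GIROCONTO verso risparmio", ["giroconto"]))
print(infer_tx_type(-20.0, "credit_card", "AMAZON", ["giroconto"]))
```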
remove_card_balance_row(txs, epsilon, owner_label)
Detects the row whose |amount| ≈ Σ|other amounts| (within epsilon 0.01 €).
With owner_label → renames the description (internal transfer detection captures it).
Without owner_label → removes the row (avoids double counting).
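A sketch of the balance-row check, with transactions modelled as plain dicts. The dict layout and the renamed description (the owner label appended to the original text) are hypothetical; the source only says the description is renamed so that internal transfer detection picks it up.

```python
from decimal import Decimal
from typing import Dict, List, Optional


def remove_card_balance_row(txs: List[Dict], epsilon: Decimal = Decimal("0.01"),
                            owner_label: Optional[str] = None) -> List[Dict]:
    """Find the row whose |amount| matches the sum of all other |amounts|."""
    total = sum((abs(tx["amount"]) for tx in txs), Decimal("0"))
    for index, tx in enumerate(txs):
        others = total - abs(tx["amount"])
        if others > 0 and abs(abs(tx["amount"]) - others) <= epsilon:
            if owner_label is None:
                # No owner: drop the row to avoid double counting.
                return txs[:index] + txs[index + 1:]
            # Hypothetical rename: append the owner label to the description.
            renamed = {**tx, "description": f"{tx['description']} {owner_label}"}
            return txs[:index] + [renamed] + txs[index + 1:]
    return list(txs)


rows = [
    {"amount": Decimal("-10.00"), "description": "BAR"},
    {"amount": Decimal("-20.50"), "description": "LIBRERIA"},
    {"amount": Decimal("30.50"), "description": "SALDO"},
]
print([r["description"] for r in remove_card_balance_row(rows)])
```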
Module: db/repository.py → get_existing_tx_ids()
When: stage 4, after normalisation and before description cleaning (LLM)
Why: the SHA-256 ID is calculated at step 3 from raw values → duplicates can be discarded without wasting LLM tokens
existing_ids = SELECT id FROM transaction WHERE id IN (all_ids_in_batch)
→ filters already-present txs
→ if all present → abort early (file already imported)
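The dedup query can be illustrated with the standard-library `sqlite3` module. The real repository layer and schema are not shown in the source, so the table layout, the `filter_new` helper and the dict-shaped transactions are assumptions.

```python
import sqlite3
from typing import Dict, List, Set


def get_existing_tx_ids(conn: sqlite3.Connection, ids: List[str]) -> Set[str]:
    """Return the subset of ids already stored in the transaction table."""
    if not ids:
        return set()
    placeholders = ",".join("?" for _ in ids)
    # "transaction" is quoted because it is an SQL keyword.
    query = f'SELECT id FROM "transaction" WHERE id IN ({placeholders})'
    return {row[0] for row in conn.execute(query, ids)}


def filter_new(conn: sqlite3.Connection, txs: List[Dict]) -> List[Dict]:
    """Keep only transactions whose ID is not in the DB yet."""
    existing = get_existing_tx_ids(conn, [tx["id"] for tx in txs])
    return [tx for tx in txs if tx["id"] not in existing]


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "transaction" (id TEXT PRIMARY KEY)')
conn.execute('INSERT INTO "transaction" (id) VALUES (?)', ("aaa",))
new_txs = filter_new(conn, [{"id": "aaa"}, {"id": "bbb"}])
print(new_txs)  # an empty list here would mean: abort, file already imported
```

For very large batches the ID list would need chunking, since databases cap the number of bound parameters per statement.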
Module: core/sanitizer.py
When: BEFORE every LLM call (description cleaning + categorisation); AFTER for owner name restoration
| Pattern | Regex | Replaced with |
|---|---|---|
| IBAN | `[A-Z]{2}\d{2}[A-Z0-9]{4,30}` | `<ACCOUNT_ID>` |
| PAN / card (13-19 digits) | `\d{4}[\s\-]?\d{4}[\s\-]?\d{4}[\s\-]?\d{1,7}` | `<CARD_ID>` |
| Masked card | `[\*X]{4}[\s\-]?\d{4}` | `<CARD_ID>` |
| Transaction codes | `(CAU\|NDS\|TRN\|CRO\|RIF\|ID TRANSAZIONE)\s*[\d\-]+` | `<TX_CODE>` |
| IT tax code | `[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]` | `<FISCAL_ID>` |
| Additional user patterns | configurable | `<REDACTED>` |
Real names are replaced with plausible but fake names (the LLM can still recognise them as persons and extract them correctly). After the LLM response, restore_owner_placeholders() puts the real names back.
| Language | Fictitious name pool |
|---|---|
| IT | Carlo Brambilla, Marta Pellegrino, Alberto Marini, Giovanna Ferrara, … |
| EN | James Fletcher, Helen Norris, David Lawson, Susan Palmer, … |
| DE | Klaus Hartmann, Monika Braun, Stefan Richter, Ingrid Weber, … |
| FR | Pierre Dumont, Claire Lebrun, Michel Garnier, Sophie Renard, … |
| ES | Carlos Navarro, Elena Vega, Miguel Torres, Isabel Molina, … |
Final guard: assert_sanitized(text) → raises ValueError if IBAN or PAN are still present.
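A minimal sketch of the regex redaction pass and the final guard, using the patterns from the table. Owner-name substitution and user-defined patterns are left out, and the application order (table order) is an assumption.

```python
import re

# (compiled pattern, placeholder), applied in table order (assumed).
PII_PATTERNS = [
    (re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{4,30}"), "<ACCOUNT_ID>"),
    (re.compile(r"\d{4}[\s\-]?\d{4}[\s\-]?\d{4}[\s\-]?\d{1,7}"), "<CARD_ID>"),
    (re.compile(r"[\*X]{4}[\s\-]?\d{4}"), "<CARD_ID>"),
    (re.compile(r"(CAU|NDS|TRN|CRO|RIF|ID TRANSAZIONE)\s*[\d\-]+"), "<TX_CODE>"),
    (re.compile(r"[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]"), "<FISCAL_ID>"),
]


def redact_pii(text: str) -> str:
    """Replace every match of each pattern with its placeholder."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def assert_sanitized(text: str) -> None:
    """Raise ValueError if an IBAN or PAN pattern is still present."""
    for pattern, _ in PII_PATTERNS[:2]:
        if pattern.search(text):
            raise ValueError("unredacted account or card identifier found")


clean = redact_pii("Bonifico IT60X0542811101000000123456 CRO 12345678901")
assert_sanitized(clean)
print(clean)
```

Because the patterns overlap (an IBAN-shaped substring can occur inside a tax code), the placeholder a value receives depends on which pattern runs first; the order used in core/sanitizer.py is not stated.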
Module: core/normalizer.py → detect_internal_transfers()
When: stage 6, after dedup
Phase 1 (amount + date matching):
For every pair (i, j) with account_label_i ≠ account_label_j:
  amount_match = |amount_i + amount_j| ≤ epsilon
  date_match = |date_i − date_j| ≤ delta_days
  If both verified:
    high_symmetry = |amount_i + amount_j| ≤ epsilon_strict AND |date_i − date_j| ≤ delta_days_strict
    Confidence:
      HIGH → keyword from internal_patterns list found in description
      MEDIUM → high_symmetry without keyword
    If require_keyword_confirmation=True AND confidence=MEDIUM:
      → marks transfer_pair_id, does NOT update tx_type (goes to review)
    Otherwise:
      → tx_type: internal_out (outgoing) / internal_in (incoming)
Phase 2 (owner name matching):
For every tx not yet paired:
  If description contains an owner name
  (regex with all permutations of the name tokens):
    → tx_type = internal_out / internal_in
    → transfer_confidence = HIGH
| Parameter | Default |
|---|---|
| epsilon | 0.01 € |
| epsilon_strict | 0.005 € |
| delta_days | 5 days |
| delta_days_strict | 1 day |
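The amount + date matching phase can be sketched as a pairwise scan with the defaults from the table. The `Tx` dataclass and the function name `match_internal_pairs` are hypothetical. The source does not say what happens to a pair that matches on amount and date but has neither a keyword nor high symmetry; this sketch leaves such pairs unmatched, which is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Iterable, List, Tuple


@dataclass
class Tx:  # hypothetical minimal transaction shape
    id: str
    account_label: str
    tx_date: date
    amount: Decimal
    description: str


def match_internal_pairs(txs: List[Tx], internal_keywords: Iterable[str],
                         epsilon=Decimal("0.01"), epsilon_strict=Decimal("0.005"),
                         delta_days=5, delta_days_strict=1) -> List[Tuple[str, str, str]]:
    """Return (id_a, id_b, confidence) for opposite-sign pairs across accounts."""
    keywords = [k.casefold() for k in internal_keywords]
    pairs, used = [], set()
    for i, a in enumerate(txs):
        for b in txs[i + 1:]:
            if a.id in used or b.id in used or a.account_label == b.account_label:
                continue
            amount_gap = abs(a.amount + b.amount)
            day_gap = abs((a.tx_date - b.tx_date).days)
            if amount_gap > epsilon or day_gap > delta_days:
                continue
            text = f"{a.description} {b.description}".casefold()
            if any(k in text for k in keywords):
                confidence = "HIGH"
            elif amount_gap <= epsilon_strict and day_gap <= delta_days_strict:
                confidence = "MEDIUM"
            else:
                continue  # assumption: no keyword and no high symmetry, no match
            used.update((a.id, b.id))
            pairs.append((a.id, b.id, confidence))
    return pairs


out_tx = Tx("1", "checking", date(2024, 1, 10), Decimal("-500.00"), "GIROCONTO")
in_tx = Tx("2", "savings", date(2024, 1, 11), Decimal("500.00"), "accredito")
print(match_internal_pairs([out_tx, in_tx], ["giroconto"]))
```

The scan is quadratic in the number of transactions, which is usually acceptable for a single import batch.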
Module: core/normalizer.py → find_card_settlement_matches()
When: stage 7, matches card_settlement (current account) with card_tx (card)
Time window: card_tx in [debit_date − 45 days, debit_date + 7 days]
Sliding window: for every contiguous subset [i..j]:
  verify: gap between consecutive txs ≤ max_gap_days (5 days)
  sum = Σ |amount[i..j]|
  If |sum − debit_amount| ≤ epsilon → MATCH ✓
Subset sum (boundary): takes k=10 txs before + k=10 after the debit date (max 20 txs)
  Exhaustive search: all subsets → 2^20 ≈ 1M combinations (safe)
  First combination that sums to the amount → MATCH ✓
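The contiguous (sliding window) search can be sketched as follows. Assumptions: card transactions are given as (date, amount) tuples already filtered to the time window and sorted by date, and the settlement amount is compared by absolute value. The subset-sum fallback is not shown.

```python
from datetime import date
from decimal import Decimal
from typing import List, Optional, Tuple


def find_contiguous_match(card_txs: List[Tuple[date, Decimal]], debit_amount: Decimal,
                          epsilon: Decimal = Decimal("0.01"),
                          max_gap_days: int = 5) -> Optional[Tuple[int, int]]:
    """Return (i, j) of the first contiguous run whose |amounts| sum to the debit."""
    target = abs(debit_amount)
    for i in range(len(card_txs)):
        total = Decimal("0")
        for j in range(i, len(card_txs)):
            # A gap larger than max_gap_days ends every run starting at i.
            if j > i and (card_txs[j][0] - card_txs[j - 1][0]).days > max_gap_days:
                break
            total += abs(card_txs[j][1])
            if abs(total - target) <= epsilon:
                return i, j
    return None


txs = [
    (date(2024, 1, 1), Decimal("-10.00")),
    (date(2024, 1, 3), Decimal("-20.50")),
    (date(2024, 1, 4), Decimal("-5.25")),
    (date(2024, 1, 20), Decimal("-7.00")),
]
print(find_contiguous_match(txs, Decimal("-25.75")))
```

The window search is O(n²) and covers the common case of a statement period settled in one debit; the exhaustive subset search handles the rest.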
Module: core/categorizer.py
When: stage 8, before LLM (levels 0 and 1)
User rules (Lv. 0) are saved in the DB and sorted by descending priority. First match wins.
CategoryRule.matches(description, doc_type):
| Type | Logic |
|---|---|
| exact | description.casefold() == pattern.casefold() |
| contains | pattern.casefold() IN description.casefold() |
| regex | re.search(pattern, description, IGNORECASE) |
If doc_type specified in the rule → must match the transaction's doc_type.
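The three match types and the doc_type filter can be sketched as a small dataclass. The field names (`match_type`, `pattern`, `category`, `priority`, `doc_type`) and the `first_match` helper are assumptions; only the matching logic comes from the table above.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CategoryRule:  # hypothetical field layout
    match_type: str  # "exact" | "contains" | "regex"
    pattern: str
    category: str
    priority: int = 0
    doc_type: Optional[str] = None

    def matches(self, description: str, doc_type: Optional[str] = None) -> bool:
        # A rule bound to a doc_type only applies to that doc_type.
        if self.doc_type is not None and self.doc_type != doc_type:
            return False
        if self.match_type == "exact":
            return description.casefold() == self.pattern.casefold()
        if self.match_type == "contains":
            return self.pattern.casefold() in description.casefold()
        if self.match_type == "regex":
            return re.search(self.pattern, description, re.IGNORECASE) is not None
        return False


def first_match(rules: List[CategoryRule], description: str,
                doc_type: Optional[str] = None) -> Optional[CategoryRule]:
    """Evaluate rules by descending priority; the first match wins."""
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if rule.matches(description, doc_type):
            return rule
    return None


rules = [
    CategoryRule("contains", "esselunga", "Food", priority=1),
    CategoryRule("regex", r"^pos\s+esselunga", "Groceries (POS)", priority=5),
]
print(first_match(rules, "POS ESSELUNGA MILANO").category)
```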
Static keyword rules (Lv. 1) are hardcoded and direction-aware (expenses and income are kept separate):
EXPENSES:
| Pattern (regex, case-insensitive) | Category | Subcategory |
|---|---|---|
| `conad\|coop\|esselunga\|lidl\|carrefour\|eurospin\|aldi\|penny\|pam` | Food | Grocery shopping |
| `farmacia\|pharma` | Health | Medicines |
| `eni\|shell\|q8\|tamoil\|ip\|api\|agip` | Transport | Fuel |
| `telepass\|autostrad` | Transport | Parking / ZTL |
| `trenitalia\|italo\|frecciarossa\|frecciargento` | Transport | Public transport |
| `enel\|iren\|a2a\|hera\|eni gas` | Home | Electricity |
| `netflix\|spotify\|amazon prime\|disney+\|apple tv` | Leisure | Streaming / digital subscriptions |
| `commissione\|canone conto\|spese tenuta` | Finance | Bank fees |
INCOME:
| Pattern | Category | Subcategory |
|---|---|---|
| `stipendio\|salary\|busta paga` | Employment | Salary |
| `pensione\|inps rendita` | Social benefits | Pension / annuity |
Module: db/repository.py
When: stage 9, everything idempotent
| Function | Idempotency rule |
|---|---|
| upsert_transaction(tx) | if tx.id exists → skip |
| create_import_batch(sha256) | if sha256 exists → return existing |
| upsert_document_schema(schema) | if source_identifier exists → update |
| create_reconciliation_link(sid, did) | if pair (sid, did) exists → skip |
| create_transfer_link(out_id, in_id) | if pair exists → skip |
| update_transaction_category() | always sets: confidence=high, source=manual, to_review=False |
Module: db/repository.py, ui/review_page.py
apply_rules_to_review_transactions(session, user_rules)
On every load of the Review page:
  For each tx with to_review=True:
    For each rule (sorted by priority DESC):
      If rule.matches(tx.description, tx.doc_type):
        → update category, source=rule, to_review=False
        → move to next tx
apply_all_rules_to_all_transactions(session, user_rules)
Applies all rules to ALL transactions (not only to_review=True):
  Rules sorted by priority DESC
  For each tx:
    For each rule:
      If rule.matches(tx.description, tx.doc_type):
        → update category, subcategory, source=rule, confidence=high
        → if tx.to_review=True → set to_review=False (n_cleared++)
        → move to next tx (first match wins)
  Returns (n_matched, n_cleared_review)
Requires confirmation via checkbox before execution.
Bulk description rules (DescriptionRule) are saved in the DB (description_rule). The pattern is matched against raw_description:
| Type | Logic |
|---|---|
| exact | raw_description.lower() == pattern.lower() |
| contains | pattern.lower() IN raw_description.lower() |
| regex | re.search(pattern, raw_description, IGNORECASE) |
Application: updates description → re-categorises with LLM.
Module: ui/analytics_page.py
EXCLUDED = {"internal_out", "internal_in", "card_settlement", "aggregate_debit"}
Thresholds applied for each category against the reference household benchmark:
| Signal | Condition | Icon |
|---|---|---|
| Abnormally high spending | spending > 1.5 × benchmark | 🔴 |
| Abnormally low spending | spending < 0.5 × benchmark | 🔵 |
| Normal spending | between 0.5× and 1.5× | 🟢 |
| Absent | no spending in category | ⚪ |
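The threshold table maps directly to a small classifier. The function name `spending_signal` is hypothetical, and treating the exact boundaries (0.5× and 1.5×) as normal follows the strict inequalities in the table.

```python
def spending_signal(spending: float, benchmark: float) -> str:
    """Classify a category's spending against the household benchmark."""
    if spending == 0:
        return "⚪"  # absent: no spending in the category
    if spending > 1.5 * benchmark:
        return "🔴"  # abnormally high
    if spending < 0.5 * benchmark:
        return "🔵"  # abnormally low
    return "🟢"      # normal: between 0.5x and 1.5x inclusive


for amount in (0, 40, 100, 160):
    print(amount, spending_signal(amount, benchmark=100))
```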
| Stage | Tool | Module | LLM? |
|---|---|---|---|
| 1. File format | detect_encoding / detect_delimiter / detect_header_row / detect_best_sheet | normalizer.py | ✗ |
| 1b. Pre-processing | detect_and_strip_preheader_rows / drop_low_variability_columns | normalizer.py | ✗ |
| 2. Schema — Phase 0 | column synonyms, sign inspection | classifier.py | ✗ |
| 2. Schema — Phase 1 | doc_type classification, date_format, sign_convention | classifier.py | ✓ LLM |
| 3. Normalisation | parse_date_safe / parse_amount / apply_sign_convention / normalize_description / compute_transaction_id / _infer_tx_type / remove_card_balance_row | normalizer.py + orchestrator.py | ✗ |
| 4. Dedup | get_existing_tx_ids | repository.py | ✗ |
| 5. Privacy | redact_pii / restore_owner_placeholders | sanitizer.py | ✗ |
| 5. Description cleaning | clean_descriptions_batch | description_cleaner.py | ✓ LLM |
| 6. Internal transfers | detect_internal_transfers (Phase 1 + Phase 2) | normalizer.py | ✗ |
| 7. Card reconciliation | find_card_settlement_matches (3 phases) | normalizer.py | ✗ |
| 8. Categorisation Lv. 0 | CategoryRule.matches (user rules) | categorizer.py | ✗ |
| 8. Categorisation Lv. 1 | _apply_static_rules (hardcoded keywords) | categorizer.py | ✗ |
| 8. Categorisation Lv. 3 | categorize_batch (LLM) | categorizer.py | ✓ LLM |
| 9. Persistence | upsert_transaction / persist_import_result | repository.py | ✗ |
| 10. Auto-rules | apply_rules_to_review_transactions | repository.py | ✗ |
| 10. Run all rules | apply_all_rules_to_all_transactions | repository.py | ✗ |
| 10. Bulk descriptions | DescriptionRule + _apply_description_rule_bulk | repository.py + review_page.py | ✓ LLM (re-cat.) |
| Analytics | EXCLUDED / ISTAT benchmark 0.5×–1.5× | analytics_page.py | ✗ |
All defaults are in ProcessingConfig (core/orchestrator.py):
| Parameter | Default | Used by |
|---|---|---|
| tolerance | 0.01 € | internal transfer detection, card reconciliation |
| tolerance_strict | 0.005 € | high-symmetry internal transfers |
| settlement_days | 5 days | internal transfer matching window |
| settlement_days_strict | 1 day | strict internal transfer window |
| window_days | 45 days | card reconciliation time window |
| max_gap_days | 5 days | card sliding window |
| boundary_pre_post | 10 txs | reconciliation subset sum |
| confidence_threshold | 0.80 | LLM threshold → to_review |
| require_keyword_confirmation | True | medium internal transfers → to_review if no keyword |
| batch_size (descriptions) | 30 tx/call | clean_descriptions_batch |
| batch_size (categories) | 20 tx/call | categorize_batch |