Deterministic Rules
Complete inventory of the non-LLM rules, algorithms and transformations implemented in the system. For each tool it lists where it lives, what it does, its hardcoded rules and the point in the pipeline where it is applied.
                      FILE (CSV / XLSX)
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  1. FORMAT DETECTION                                        │ ◄─ DETERMINISTIC
│    detect_encoding · detect_delimiter                       │
│    detect_header_row · detect_best_sheet                    │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  1b. PRE-PROCESSING - Phase 0                               │ ◄─ DETERMINISTIC
│    detect_and_strip_preheader_rows                          │
│    drop_low_variability_columns                             │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  2. DOCUMENT CLASSIFICATION - Phase 0                       │ ◄─ DETERMINISTIC
│    column synonyms · sign inspection                        │
└─────────────────────────────┬───────────────────────────────┘
                              │ LLM for ambiguous fields (Phase 1)
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  3. NORMALISATION                                           │ ◄─ DETERMINISTIC
│    parse_date_safe · parse_amount · apply_sign_convention   │
│    normalize_description · compute_transaction_id (SHA-256) │
│    _infer_tx_type · remove_card_balance_row                 │
└─────────────────────────────┬───────────────────────────────┘
                              │ ID calculated here from raw values
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  4. DEDUP CHECK                                             │ ◄─ DETERMINISTIC
│    get_existing_tx_ids (repository.py)                      │
│    → abort if all already in DB, zero wasted LLM calls      │
└─────────────────────────────┬───────────────────────────────┘
                              │ only new txs proceed
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  5. DESCRIPTION CLEANING                                    │
│     PRIVACY / PII REDACTION ◄─ DETERMINISTIC                │
│    redact_pii · restore_owner_placeholders                  │
│    (applied BEFORE and AFTER every LLM call)                │
│                             ◄─ LLM (counterparty extraction)│
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  6. INTERNAL TRANSFER DETECTION [RF-04]                     │ ◄─ DETERMINISTIC
│    detect_internal_transfers                                │
│    Phase 1: amount+date matching                            │
│    Phase 2: owner name matching                             │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  7. CARD RECONCILIATION [RF-03]                             │ ◄─ DETERMINISTIC
│    find_card_settlement_matches                             │
│    sliding window · subset sum                              │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  8. CATEGORISATION - Levels 0 and 1                         │ ◄─ DETERMINISTIC
│    Lv. 0: user rules (CategoryRule.matches)                 │
│    Lv. 1: static keyword rules                              │
└─────────────────────────────┬───────────────────────────────┘
                              │ LLM only if no rule matches
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  9. DB PERSISTENCE                                          │ ◄─ DETERMINISTIC
│    idempotent upsert · SHA-256 for file and transaction     │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  10. REVIEW - auto-apply rules                              │ ◄─ DETERMINISTIC
│    apply_rules_to_review_transactions (to_review=True)      │
│    apply_all_rules_to_all_transactions (all txs)            │
│    bulk description rules · DescriptionRule                 │
└─────────────────────────────────────────────────────────────┘
Module: core/normalizer.py
When: stage 1, before any parsing
| Function | Hardcoded rule |
|---|---|
| `detect_encoding(raw_bytes)` | chardet → normalises aliases (ascii → utf-8) |
| `detect_delimiter(content)` | counts the frequency of `,` `;` `\t` `\|` → most frequent wins |
| `detect_header_row(lines)` | first row with ≥ 2 non-numeric fields; numeric pattern: `^[\d\.\,\-\+\s€$£%]+$` |
| `detect_best_sheet(workbook)` | excludes sheets named `summary\|totale\|riepilogo`; score = rows + (numeric columns × 10) |
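The delimiter and header-row rules can be sketched as follows. This is a minimal illustration, not the project's code: the `delimiter` argument of `detect_header_row` and the decision to ignore empty fields are assumptions made here so the example is self-contained.

```python
from __future__ import annotations

import re

CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]
# Numeric pattern quoted in the table above.
NUMERIC_FIELD = re.compile(r"^[\d\.\,\-\+\s€$£%]+$")


def detect_delimiter(content: str) -> str:
    """Return the candidate delimiter that occurs most often in the text."""
    return max(CANDIDATE_DELIMITERS, key=content.count)


def detect_header_row(lines: list[str], delimiter: str) -> int | None:
    """Index of the first row with at least two non-numeric fields.

    Assumption: empty fields do not count as non-numeric.
    """
    for index, line in enumerate(lines):
        fields = [field.strip() for field in line.split(delimiter)]
        non_numeric = [f for f in fields if f and not NUMERIC_FIELD.match(f)]
        if len(non_numeric) >= 2:
            return index
    return None


sample = "Bank export;;\nDate;Description;Amount\n01/02/2024;COOP MILANO;-12,50\n"
delimiter = detect_delimiter(sample)
header_index = detect_header_row(sample.splitlines(), delimiter)
```

In this sample the first line has a single text field, so the header is found on the second line.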
Module: core/classifier.py
When: stage 2 (Flow 2), only if source has no schema in DB
Resolves column fields without LLM via synonyms:
| Field | Recognised synonyms |
|---|---|
| `date_col` | data, date, data operazione, booking date, buchungsdatum, … |
| `amount_col` | importo, amount, betrag, montant, somme, … |
| `debit_col` | dare, addebiti, uscite, debit, ausgaben, … |
| `credit_col` | avere, accrediti, entrate, credit, einnahmen, … |
| `description_col` | descrizione, causale, memo, payee, bezeichnung, libellé, … |
Sign inspection (Phase 0.5):
If the semantics of amount_col are "neutral", the classifier reads the actual data: if any value is < 0, then invert_sign=False is certain and no LLM call is needed.
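A rough sketch of the synonym lookup and the sign inspection is shown below. The synonym sets are abridged from the table above (the real lists are longer), and the function names `resolve_columns` and `sign_is_certain` are hypothetical.

```python
from __future__ import annotations

# Abridged synonym sets copied from the table; the elided entries are not guessed.
COLUMN_SYNONYMS = {
    "date_col": {"data", "date", "data operazione", "booking date", "buchungsdatum"},
    "amount_col": {"importo", "amount", "betrag", "montant", "somme"},
    "debit_col": {"dare", "addebiti", "uscite", "debit", "ausgaben"},
    "credit_col": {"avere", "accrediti", "entrate", "credit", "einnahmen"},
    "description_col": {"descrizione", "causale", "memo", "payee", "bezeichnung", "libellé"},
}


def resolve_columns(headers: list[str]) -> dict[str, str]:
    """Map schema fields to header names via case-insensitive synonym hits."""
    resolved: dict[str, str] = {}
    for header in headers:
        key = header.strip().casefold()
        for field, synonyms in COLUMN_SYNONYMS.items():
            if key in synonyms and field not in resolved:
                resolved[field] = header
    return resolved


def sign_is_certain(amounts: list[float]) -> bool:
    """Phase 0.5: one negative value shows the column is already signed."""
    return any(amount < 0 for amount in amounts)


columns = resolve_columns(["Data", "Descrizione", "Importo"])
```

Fields that stay unresolved after this step are the ones that would be passed to the LLM in Phase 1.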
Module: core/normalizer.py, core/orchestrator.py
When: stage 3, after schema classification
parse_date_safe(value, format)
- Tries the schema format
- Falls back to common formats (in order): %d/%m/%Y · %d-%m-%Y · %d/%m/%y · %d-%m-%y · %Y-%m-%d · %Y/%m/%d · %m/%d/%Y · %m/%d/%y
- Returns None if everything fails (row discarded)
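The fallback chain can be sketched with `datetime.strptime`. The format list is the one given above; the parameter name `fmt` and the whitespace stripping are assumptions of this example.

```python
from __future__ import annotations

from datetime import date, datetime

FALLBACK_FORMATS = [
    "%d/%m/%Y", "%d-%m-%Y", "%d/%m/%y", "%d-%m-%y",
    "%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%m/%d/%y",
]


def parse_date_safe(value: str, fmt: str | None = None) -> date | None:
    """Try the schema format first, then each fallback; None if all fail."""
    candidates = ([fmt] if fmt else []) + FALLBACK_FORMATS
    for candidate in candidates:
        try:
            return datetime.strptime(value.strip(), candidate).date()
        except ValueError:
            continue
    return None
```

Because day-first formats come before month-first ones, an ambiguous value such as `03/04/2024` resolves to 3 April unless the schema format says otherwise.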
parse_amount(value)
Strip symbols: € $ £ (spaces)
Separator heuristic:
"1.234,56" → dot = thousands, comma = decimal → 1234.56
"1,234.56" → comma = thousands, dot = decimal → 1234.56
"1234,56" → comma only with ≤ 2 decimal digits → 1234.56
"1234.56" → dot only with ≤ 2 decimal digits → 1234.56
apply_sign_convention(row, convention)
| Convention | Rule |
|---|---|
| `signed_single` | uses amount_col as-is |
| `debit_positive` | credit − debit (both positive in CSV) |
| `credit_negative` | credit as-is positive; debit negated |

After: if invert_sign=True (typical for cards) → multiply by −1.
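The separator heuristic of parse_amount and the sign step can be sketched together. Several points are assumptions of this example: amounts are plain floats, a single separator followed by more than two digits is treated as a thousands separator, rows are dicts keyed `amount` / `debit` / `credit`, and `invert_sign` is passed as an argument. The `credit_negative` convention is left out because the table does not give enough detail to tell its input apart from `debit_positive`.

```python
from __future__ import annotations

import re


def parse_amount(value: str) -> float | None:
    """Normalise European and US number formats to a float."""
    text = re.sub(r"[€$£\s]", "", value)
    if "," in text and "." in text:
        # The separator that appears last is the decimal one.
        if text.rfind(",") > text.rfind("."):
            text = text.replace(".", "").replace(",", ".")
        else:
            text = text.replace(",", "")
    elif "," in text or "." in text:
        sep = "," if "," in text else "."
        head, _, tail = text.rpartition(sep)
        if text.count(sep) == 1 and len(tail) <= 2:
            text = head + "." + tail       # <= 2 trailing digits: decimal
        else:
            text = text.replace(sep, "")   # assumption: thousands separator
    try:
        return float(text)
    except ValueError:
        return None


def apply_sign_convention(row: dict, convention: str, invert_sign: bool = False) -> float:
    """Collapse the row into one signed amount (negative = money out)."""
    if convention == "signed_single":
        amount = row["amount"]
    elif convention == "debit_positive":
        amount = row.get("credit", 0.0) - row.get("debit", 0.0)
    else:
        # credit_negative is intentionally not sketched here.
        raise NotImplementedError(convention)
    return -amount if invert_sign else amount


debit = parse_amount("€ 1.234,56")
signed = apply_sign_convention({"debit": debit, "credit": 0.0}, "debit_positive")
```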
normalize_description(text)
unicodedata.normalize("NFC", text).casefold().strip()
Ensures stable case-insensitive comparisons; never modifies raw_description.
compute_transaction_id(account_label, date, amount, description)
SHA-256[:24] of the string: {account_label}|{ISO date}|{amount}|{raw_description}
Used on raw values → stable across normalisation versions.
compute_file_hash(raw_bytes)
Full SHA-256 of the file → import-level dedup.
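A minimal sketch of both hashes using `hashlib`. It assumes that `[:24]` means the first 24 hexadecimal characters of the digest and that the amount is rendered with Python's default float formatting; neither detail is stated in the text above.

```python
import hashlib
from datetime import date


def compute_transaction_id(account_label: str, tx_date: date,
                           amount: float, raw_description: str) -> str:
    """Truncated SHA-256 over the pipe-joined raw values."""
    key = f"{account_label}|{tx_date.isoformat()}|{amount}|{raw_description}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:24]


def compute_file_hash(raw_bytes: bytes) -> str:
    """Full SHA-256 of the uploaded file, used for import-level dedup."""
    return hashlib.sha256(raw_bytes).hexdigest()


tx_id = compute_transaction_id("checking", date(2024, 1, 10), -12.5, "POS COOP MILANO")
same_id = compute_transaction_id("checking", date(2024, 1, 10), -12.5, "POS COOP MILANO")
other_id = compute_transaction_id("checking", date(2024, 1, 10), -12.5, "POS COOP ROMA")
```

Hashing the raw description rather than the normalised one is what keeps the ID stable when the normalisation logic changes.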
_infer_tx_type(amount, doc_type, description, internal_patterns)
1. description matches internal_patterns (list from DB) → internal_out / internal_in
2. doc_type in {credit_card, debit_card, prepaid_card} → card_tx
3. amount ≥ 0 → income
4. amount < 0 → expense
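The four rules translate into an ordered chain of checks. This sketch assumes that the internal patterns are regular expressions matched case-insensitively and that a negative amount means an outgoing transfer; the `bank_account` doc type in the usage lines is hypothetical.

```python
from __future__ import annotations

import re

CARD_DOC_TYPES = {"credit_card", "debit_card", "prepaid_card"}


def infer_tx_type(amount: float, doc_type: str, description: str,
                  internal_patterns: list[str]) -> str:
    """Apply the rules in order; the first one that fires decides the type."""
    if any(re.search(p, description, re.IGNORECASE) for p in internal_patterns):
        # Assumption: the sign gives the direction of the internal transfer.
        return "internal_out" if amount < 0 else "internal_in"
    if doc_type in CARD_DOC_TYPES:
        return "card_tx"
    return "income" if amount >= 0 else "expense"


transfer = infer_tx_type(-50.0, "bank_account", "GIROCONTO verso deposito", ["giroconto"])
card = infer_tx_type(-20.0, "credit_card", "COOP", [])
```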
remove_card_balance_row(txs, epsilon, owner_label)
Detects the row whose |amount| ≈ Σ|other amounts| (within epsilon 0.01 €).
With owner_label → renames the description (internal transfer detection captures it).
Without owner_label → removes the row (avoids double counting).
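The balance-row test can be sketched as below. Transactions are modelled as plain dicts, and the replacement description used when `owner_label` is given is a hypothetical format; the source only says the row is renamed so that internal transfer detection picks it up.

```python
from __future__ import annotations


def remove_card_balance_row(txs: list[dict], epsilon: float = 0.01,
                            owner_label: str | None = None) -> list[dict]:
    """Drop or rename the row whose |amount| equals the sum of all the others."""
    if len(txs) < 2:
        return txs
    total = sum(abs(tx["amount"]) for tx in txs)
    for index, tx in enumerate(txs):
        own = abs(tx["amount"])
        if abs(own - (total - own)) <= epsilon:
            if owner_label is None:
                return txs[:index] + txs[index + 1:]
            # Hypothetical description format, chosen only for this sketch.
            renamed = {**tx, "description": f"card balance {owner_label}"}
            return txs[:index] + [renamed] + txs[index + 1:]
    return txs


statement = [
    {"amount": -10.0, "description": "A"},
    {"amount": -20.5, "description": "B"},
    {"amount": -30.5, "description": "SALDO"},
]
without_owner = remove_card_balance_row(statement)
with_owner = remove_card_balance_row(statement, owner_label="Mario Rossi")
```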
Module: db/repository.py → get_existing_tx_ids()
When: stage 4, after normalisation and before description cleaning (LLM)
Why: the SHA-256 ID is calculated at step 3 from raw values → duplicates can be discarded without wasting LLM tokens
existing_ids = SELECT id FROM transaction WHERE id IN (all_ids_in_batch)
→ filters already-present txs
→ if all present → abort early (file already imported)
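The dedup check can be sketched with the standard-library `sqlite3` module as a stand-in for the project's repository layer, which may well use a different database API. The one-column table is an assumption; note that `transaction` is a reserved word in SQLite and has to be quoted.

```python
import sqlite3


def get_existing_tx_ids(conn: sqlite3.Connection, batch_ids: list[str]) -> set[str]:
    """Return the subset of batch_ids that is already stored."""
    if not batch_ids:
        return set()
    placeholders = ",".join("?" for _ in batch_ids)
    query = f'SELECT id FROM "transaction" WHERE id IN ({placeholders})'
    return {row[0] for row in conn.execute(query, batch_ids)}


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "transaction" (id TEXT PRIMARY KEY)')
conn.executemany('INSERT INTO "transaction" (id) VALUES (?)', [("a",), ("b",)])

batch = ["a", "c"]
existing = get_existing_tx_ids(conn, batch)
new_ids = [tx_id for tx_id in batch if tx_id not in existing]
already_imported = not new_ids  # True would mean: abort before any LLM call
```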
Module: core/sanitizer.py
When: BEFORE every LLM call (description cleaning + categorisation); AFTER for owner name restoration
| Pattern | Regex | Replaced with |
|---|---|---|
| IBAN | `[A-Z]{2}\d{2}[A-Z0-9]{4,30}` | `<ACCOUNT_ID>` |
| PAN / card (13-19 digits) | `\d{4}[\s\-]?\d{4}[\s\-]?\d{4}[\s\-]?\d{1,7}` | `<CARD_ID>` |
| Masked card | `[\*X]{4}[\s\-]?\d{4}` | `<CARD_ID>` |
| Transaction codes | `(CAU\|NDS\|TRN\|CRO\|RIF\|ID TRANSAZIONE)\s*[\d\-]+` | `<TX_CODE>` |
| IT tax code | `[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]` | `<FISCAL_ID>` |
| Additional user patterns | configurable | `<REDACTED>` |
Real names are replaced with plausible but fake names (the LLM can still recognise them as persons and extract them correctly). After the LLM response, restore_owner_placeholders() puts the real names back.
| Language | Fictitious name pool |
|---|---|
| IT | Carlo Brambilla, Marta Pellegrino, Alberto Marini, Giovanna Ferrara, … |
| EN | James Fletcher, Helen Norris, David Lawson, Susan Palmer, … |
| DE | Klaus Hartmann, Monika Braun, Stefan Richter, Ingrid Weber, … |
| FR | Pierre Dumont, Claire Lebrun, Michel Garnier, Sophie Renard, … |
| ES | Carlos Navarro, Elena Vega, Miguel Torres, Isabel Molina, … |

Final guard: assert_sanitized(text) → raises ValueError if IBAN or PAN are still present.
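The pattern-based part of the redaction and the final guard can be sketched as below, using the regexes from the pattern table. The order in which the patterns are applied is an assumption, and the owner-name replacement step is not covered.

```python
import re

# (compiled regex, placeholder) pairs; IBAN and PAN come first on purpose.
PII_PATTERNS = [
    (re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{4,30}"), "<ACCOUNT_ID>"),
    (re.compile(r"\d{4}[\s\-]?\d{4}[\s\-]?\d{4}[\s\-]?\d{1,7}"), "<CARD_ID>"),
    (re.compile(r"[\*X]{4}[\s\-]?\d{4}"), "<CARD_ID>"),
    (re.compile(r"(CAU|NDS|TRN|CRO|RIF|ID TRANSAZIONE)\s*[\d\-]+"), "<TX_CODE>"),
    (re.compile(r"[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]"), "<FISCAL_ID>"),
]


def redact_pii(text: str) -> str:
    """Replace every match of every pattern with its placeholder."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def assert_sanitized(text: str) -> None:
    """Final guard: refuse text that still contains an IBAN or a PAN."""
    for pattern, _ in PII_PATTERNS[:2]:
        if pattern.search(text):
            raise ValueError("unredacted IBAN or PAN in text")


raw = "BONIFICO IT60X0542811101000000123456 CRO 12345678901 CARTA ****1234"
clean = redact_pii(raw)
assert_sanitized(clean)  # passes silently on redacted text
```

Running the guard on the unredacted string instead would raise `ValueError`, which is the behaviour that keeps raw account numbers out of any LLM prompt.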
Module: core/normalizer.py → detect_internal_transfers()
When: stage 6, after dedup
Phase 1: amount+date matching
For every pair (i, j) with account_label_i ≠ account_label_j:
  amount_match = |amount_i + amount_j| ≤ epsilon
  date_match = |date_i − date_j| ≤ delta_days
  If both verified:
    high_symmetry = amount difference ≤ epsilon_strict AND date difference ≤ delta_days_strict
    Confidence:
      HIGH → keyword from internal_patterns list found in description
      MEDIUM → high_symmetry without keyword
    If require_keyword_confirmation=True AND confidence=MEDIUM:
      → marks transfer_pair_id, does NOT update tx_type (goes to review)
    Otherwise:
      → tx_type: internal_out (outgoing) / internal_in (incoming)
Phase 2: owner name matching
For every tx not yet paired:
  If description contains an owner name
  (regex with all permutations of the name tokens):
    → tx_type = internal_out / internal_in
    → transfer_confidence = HIGH

| Parameter | Default |
|---|---|
| `epsilon` | 0.01 € |
| `epsilon_strict` | 0.005 € |
| `delta_days` | 5 days |
| `delta_days_strict` | 1 day |
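The amount+date pairing can be sketched as a nested loop over transactions. The `Tx` dataclass, the function name `match_transfer_pairs`, the substring keyword test and the choice to leave a pair unmatched when it is neither keyword-confirmed nor highly symmetric are all assumptions of this example; owner-name matching is not covered.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class Tx:
    id: str
    account_label: str
    date: date
    amount: float
    description: str


def match_transfer_pairs(txs: list[Tx], internal_patterns: list[str],
                         epsilon: float = 0.01, epsilon_strict: float = 0.005,
                         delta_days: int = 5, delta_days_strict: int = 1):
    """Return (out_id, in_id, confidence) for opposite txs on different accounts."""
    pairs, used = [], set()
    for i, a in enumerate(txs):
        for b in txs[i + 1:]:
            if a.id in used or b.id in used or a.account_label == b.account_label:
                continue
            amount_diff = abs(a.amount + b.amount)
            day_diff = abs((a.date - b.date).days)
            if amount_diff > epsilon or day_diff > delta_days:
                continue
            text = f"{a.description} {b.description}".casefold()
            if any(k.casefold() in text for k in internal_patterns):
                confidence = "HIGH"
            elif amount_diff <= epsilon_strict and day_diff <= delta_days_strict:
                confidence = "MEDIUM"
            else:
                continue  # assumption: neither keyword nor high symmetry
            used.update({a.id, b.id})
            out_tx, in_tx = (a, b) if a.amount < 0 else (b, a)
            pairs.append((out_tx.id, in_tx.id, confidence))
    return pairs


txs = [
    Tx("1", "checking", date(2024, 1, 10), -500.0, "GIROCONTO"),
    Tx("2", "savings", date(2024, 1, 11), 500.0, "accredito"),
    Tx("3", "checking", date(2024, 1, 20), -80.0, "COOP"),
    Tx("4", "savings", date(2024, 1, 20), 80.0, "versamento"),
]
pairs = match_transfer_pairs(txs, ["giroconto"])
```

With require_keyword_confirmation=True, the MEDIUM pair in this example would only receive a transfer_pair_id and be sent to review.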
Module: core/normalizer.py → find_card_settlement_matches()
When: stage 7, matches card_settlement (current account) with card_tx (card)
card_tx in [debit_date − 45 days, debit_date + 7 days]
For every contiguous subset [i..j]:
  verify: gap between consecutive txs ≤ max_gap_days (5 days)
  sum = Σ |amount[i..j]|
  If |sum − debit_amount| ≤ epsilon → MATCH ✓
Takes k=10 txs before + k=10 after the debit date (max 20 txs)
Exhaustive search: all subsets → 2^20 ≈ 1M combinations (safe)
First combination that sums to the amount → MATCH ✓
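The two search strategies can be sketched as follows. Both function names are hypothetical, card transactions are modelled as `(date, amount)` tuples sorted by date, and the subset search enumerates smaller subsets first, which is one possible reading of "first combination".

```python
from __future__ import annotations

from datetime import date
from itertools import combinations


def contiguous_match(txs: list[tuple[date, float]], debit_amount: float,
                     epsilon: float = 0.01, max_gap_days: int = 5):
    """Sliding window: first contiguous run [i..j] whose |amounts| sum to the debit."""
    for i in range(len(txs)):
        running = 0.0
        for j in range(i, len(txs)):
            if j > i and (txs[j][0] - txs[j - 1][0]).days > max_gap_days:
                break  # gap between consecutive txs is too large
            running += abs(txs[j][1])
            if abs(running - debit_amount) <= epsilon:
                return i, j
            if running > debit_amount + epsilon:
                break  # absolute amounts only grow, so stop extending
    return None


def subset_sum_match(amounts: list[float], debit_amount: float,
                     epsilon: float = 0.01, max_items: int = 20):
    """Exhaustive subset search over at most max_items candidates."""
    candidates = list(enumerate(amounts))[:max_items]
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            if abs(sum(abs(a) for _, a in combo) - debit_amount) <= epsilon:
                return [index for index, _ in combo]
    return None


card_txs = [
    (date(2024, 1, 2), -10.0), (date(2024, 1, 4), -20.0),
    (date(2024, 1, 5), -30.0), (date(2024, 1, 20), -40.0),
]
window = contiguous_match(card_txs, 50.0)
subset = subset_sum_match([amount for _, amount in card_txs], 50.0)
```

Capping the candidates at 20 keeps the exhaustive search at roughly one million subsets in the worst case, which matches the bound quoted above.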
Module: core/categorizer.py
When: stage 8, before LLM (levels 0 and 1)
Level 0, user rules: saved in DB and sorted by descending priority. First match wins.
CategoryRule.matches(description, doc_type):
| Type | Logic |
|---|---|
| `exact` | `description.casefold() == pattern.casefold()` |
| `contains` | `pattern.casefold() in description.casefold()` |
| `regex` | `re.search(pattern, description, IGNORECASE)` |

If doc_type is specified in the rule → it must match the transaction's doc_type.
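A compact sketch of the three match types and the first-match-wins ordering. The field names of the dataclass and the helper `first_matching_rule` are assumptions; only the `matches(description, doc_type)` signature comes from the text above.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class CategoryRule:
    pattern: str
    match_type: str            # "exact", "contains" or "regex"
    category: str
    priority: int = 0
    doc_type: str | None = None

    def matches(self, description: str, doc_type: str) -> bool:
        if self.doc_type is not None and self.doc_type != doc_type:
            return False
        if self.match_type == "exact":
            return description.casefold() == self.pattern.casefold()
        if self.match_type == "contains":
            return self.pattern.casefold() in description.casefold()
        if self.match_type == "regex":
            return re.search(self.pattern, description, re.IGNORECASE) is not None
        return False


def first_matching_rule(rules: list[CategoryRule], description: str,
                        doc_type: str) -> CategoryRule | None:
    """Scan rules by descending priority and stop at the first match."""
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if rule.matches(description, doc_type):
            return rule
    return None


rules = [
    CategoryRule("coop", "contains", "Food", priority=1),
    CategoryRule(r"^coop\s+bar", "regex", "Leisure", priority=5),
]
winner = first_matching_rule(rules, "COOP BAR CENTRALE", "bank_account")
```

Both rules match the sample description, and the higher-priority regex rule wins.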
Level 1, static keyword rules: hardcoded and direction-aware (expenses and income are kept separate).
EXPENSES:
| Pattern (regex, case-insensitive) | Category | Subcategory |
|---|---|---|
| `conad\|coop\|esselunga\|lidl\|carrefour\|eurospin\|aldi\|penny\|pam` | Food | Grocery shopping |
| `farmacia\|pharma` | Health | Medicines |
| `eni\|shell\|q8\|tamoil\|ip\|api\|agip` | Transport | Fuel |
| `telepass\|autostrad` | Transport | Parking / ZTL |
| `trenitalia\|italo\|frecciarossa\|frecciargento` | Transport | Public transport |
| `enel\|iren\|a2a\|hera\|eni gas` | Home | Electricity |
| `netflix\|spotify\|amazon prime\|disney+\|apple tv` | Leisure | Streaming / digital subscriptions |
| `commissione\|canone conto\|spese tenuta` | Finance | Bank fees |

INCOME:
| Pattern | Category | Subcategory |
|---|---|---|
| `stipendio\|salary\|busta paga` | Employment | Salary |
| `pensione\|inps rendita` | Social benefits | Pension / annuity |
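A sketch of a direction-aware keyword lookup using a few rows from the tables above. Choosing the rule set from the sign of the amount is an assumption, and this example escapes the `+` in `disney\+`, since an unescaped `+` acts as a regex quantifier.

```python
from __future__ import annotations

import re

# Abridged copies of the tables above.
STATIC_EXPENSE_RULES = [
    (r"conad|coop|esselunga|lidl|carrefour|eurospin|aldi|penny|pam",
     "Food", "Grocery shopping"),
    (r"farmacia|pharma", "Health", "Medicines"),
    (r"netflix|spotify|amazon prime|disney\+|apple tv",
     "Leisure", "Streaming / digital subscriptions"),
]
STATIC_INCOME_RULES = [
    (r"stipendio|salary|busta paga", "Employment", "Salary"),
    (r"pensione|inps rendita", "Social benefits", "Pension / annuity"),
]


def apply_static_rules(description: str, amount: float) -> tuple[str, str] | None:
    """Return (category, subcategory) for the first matching keyword rule."""
    rules = STATIC_INCOME_RULES if amount >= 0 else STATIC_EXPENSE_RULES
    for pattern, category, subcategory in rules:
        if re.search(pattern, description, re.IGNORECASE):
            return category, subcategory
    return None


grocery = apply_static_rules("PAGAMENTO POS ESSELUNGA MILANO", -42.10)
salary = apply_static_rules("STIPENDIO GENNAIO", 1800.0)
```

A `None` result is the case in which the transaction would fall through to the LLM.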
Module: db/repository.py
When: stage 9, everything idempotent
| Function | Idempotency rule |
|---|---|
| `upsert_transaction(tx)` | if tx.id exists → skip |
| `create_import_batch(sha256)` | if sha256 exists → return existing |
| `upsert_document_schema(schema)` | if source_identifier exists → update |
| `create_reconciliation_link(sid, did)` | if pair (sid, did) exists → skip |
| `create_transfer_link(out_id, in_id)` | if pair exists → skip |
| `update_transaction_category()` | always sets: confidence=high, source=manual, to_review=False |
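The "if it exists, skip" rule can be sketched with `sqlite3` as a stand-in for the real repository layer. The two-column table and the boolean return value are assumptions of this example.

```python
import sqlite3


def upsert_transaction(conn: sqlite3.Connection, tx_id: str, amount: float) -> bool:
    """Insert the transaction unless its id is already stored.

    Returns True when a row was written and False when it was skipped.
    """
    exists = conn.execute(
        'SELECT 1 FROM "transaction" WHERE id = ?', (tx_id,)
    ).fetchone()
    if exists:
        return False
    conn.execute(
        'INSERT INTO "transaction" (id, amount) VALUES (?, ?)', (tx_id, amount)
    )
    return True


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "transaction" (id TEXT PRIMARY KEY, amount REAL)')
first = upsert_transaction(conn, "abc", -12.5)
second = upsert_transaction(conn, "abc", -12.5)
row_count = conn.execute('SELECT COUNT(*) FROM "transaction"').fetchone()[0]
```

Importing the same file twice therefore leaves the table unchanged, which is what makes stage 9 safe to re-run.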
Module: db/repository.py, ui/review_page.py
apply_rules_to_review_transactions(session, user_rules)
On every load of the Review page:
  For each tx with to_review=True:
    For each rule (sorted by priority DESC):
      If rule.matches(tx.description, tx.doc_type):
        → update category, source=rule, to_review=False
        → move to next tx
apply_all_rules_to_all_transactions(session, user_rules)
Applies all rules to ALL transactions (not only to_review=True):
  Rules sorted by priority DESC
  For each tx:
    For each rule:
      If rule.matches(tx.description, tx.doc_type):
        → update category, subcategory, source=rule, confidence=high
        → if tx.to_review=True → set to_review=False (n_cleared++)
        → move to next tx (first match wins)
Returns (n_matched, n_cleared_review)
Running apply_all_rules_to_all_transactions requires confirmation via checkbox before execution.
DescriptionRule (bulk description rules): saved in DB (description_rule). The pattern is matched against raw_description:
| Type | Logic |
|---|---|
| `exact` | `raw_description.lower() == pattern.lower()` |
| `contains` | `pattern.lower() in raw_description.lower()` |
| `regex` | `re.search(pattern, raw_description, IGNORECASE)` |

Application: updates description → re-categorises with LLM.
Module: ui/analytics_page.py
EXCLUDED = {"internal_out", "internal_in", "card_settlement", "aggregate_debit"}
Thresholds applied for each category against the reference household benchmark:
| Signal | Condition | Icon |
|---|---|---|
| Abnormally high spending | spending > 1.5 × benchmark | 🔴 |
| Abnormally low spending | spending < 0.5 × benchmark | 🔵 |
| Normal spending | between 0.5× and 1.5× | 🟢 |
| Absent | no spending in category | ⚪ |
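The threshold table maps onto a small classification function. The function name, the string labels and the treatment of the exact boundaries (0.5× and 1.5× count as normal) are assumptions of this sketch.

```python
EXCLUDED = {"internal_out", "internal_in", "card_settlement", "aggregate_debit"}


def spending_signal(spending: float, benchmark: float) -> str:
    """Classify category spending against a positive benchmark value."""
    if spending == 0:
        return "absent"
    if spending > 1.5 * benchmark:
        return "high"
    if spending < 0.5 * benchmark:
        return "low"
    return "normal"


def included_in_analytics(tx_type: str) -> bool:
    """Transfers and settlements are left out of the spending totals."""
    return tx_type not in EXCLUDED


signals = [spending_signal(s, 100.0) for s in (0.0, 40.0, 100.0, 160.0)]
```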
| Stage | Tool | Module | LLM? |
|---|---|---|---|
| 1. File format | detect_encoding / detect_delimiter / detect_header_row / detect_best_sheet | normalizer.py | ❌ |
| 1b. Pre-processing | detect_and_strip_preheader_rows / drop_low_variability_columns | normalizer.py | ❌ |
| 2. Schema - Phase 0 | column synonyms, sign inspection | classifier.py | ❌ |
| 2. Schema - Phase 1 | doc_type classification, date_format, sign_convention | classifier.py | ✅ LLM |
| 3. Normalisation | parse_date_safe / parse_amount / apply_sign_convention / normalize_description / compute_transaction_id / _infer_tx_type / remove_card_balance_row | normalizer.py + orchestrator.py | ❌ |
| 4. Dedup | get_existing_tx_ids | repository.py | ❌ |
| 5. Privacy | redact_pii / restore_owner_placeholders | sanitizer.py | ❌ |
| 5. Description cleaning | clean_descriptions_batch | description_cleaner.py | ✅ LLM |
| 6. Internal transfers | detect_internal_transfers (Phase 1 + Phase 2) | normalizer.py | ❌ |
| 7. Card reconciliation | find_card_settlement_matches (3 phases) | normalizer.py | ❌ |
| 8. Categorisation Lv. 0 | CategoryRule.matches (user rules) | categorizer.py | ❌ |
| 8. Categorisation Lv. 1 | _apply_static_rules (hardcoded keywords) | categorizer.py | ❌ |
| 8. Categorisation Lv. 3 | categorize_batch (LLM) | categorizer.py | ✅ LLM |
| 9. Persistence | upsert_transaction / persist_import_result | repository.py | ❌ |
| 10. Auto-rules | apply_rules_to_review_transactions | repository.py | ❌ |
| 10. Run all rules | apply_all_rules_to_all_transactions | repository.py | ❌ |
| 10. Bulk descriptions | DescriptionRule + _apply_description_rule_bulk | repository.py + review_page.py | ✅ LLM (re-cat.) |
| Analytics | EXCLUDED / ISTAT benchmark 0.5×–1.5× | analytics_page.py | ❌ |
All defaults are in ProcessingConfig (core/orchestrator.py):
| Parameter | Default | Used by |
|---|---|---|
| `tolerance` | 0.01 € | internal transfer detection, card reconciliation |
| `tolerance_strict` | 0.005 € | high-symmetry internal transfers |
| `settlement_days` | 5 days | internal transfer matching window |
| `settlement_days_strict` | 1 day | strict internal transfer window |
| `window_days` | 45 days | card reconciliation time window |
| `max_gap_days` | 5 days | card sliding window |
| `boundary_pre_post` | 10 txs | reconciliation subset sum |
| `confidence_threshold` | 0.80 | LLM threshold → to_review |
| `require_keyword_confirmation` | True | medium internal transfers → to_review if no keyword |
| `batch_size` (descriptions) | 30 tx/call | clean_descriptions_batch |
| `batch_size` (categories) | 20 tx/call | categorize_batch |