Deterministic Rules

github-actions[bot] edited this page Mar 17, 2026 · 1 revision

Spendify: Deterministic Tools

Complete inventory of the non-LLM rules, algorithms and transformations implemented in the system. For each tool it lists where it lives, what it does, its hardcoded rules and where it is applied in the pipeline.


Pipeline map

FILE (CSV / XLSX)
        │
        ▼
┌──────────────────────────────────────────────────────────────┐
│  1. FORMAT DETECTION                                         │ ◄─ DETERMINISTIC
│     detect_encoding · detect_delimiter                       │
│     detect_header_row · detect_best_sheet                    │
└────────────────────────────┬─────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│  1b. PRE-PROCESSING Phase 0                                  │ ◄─ DETERMINISTIC
│     detect_and_strip_preheader_rows                          │
│     drop_low_variability_columns                             │
└────────────────────────────┬─────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│  2. DOCUMENT CLASSIFICATION - Phase 0                        │ ◄─ DETERMINISTIC
│     column synonyms · sign inspection                        │
└────────────────────────────┬─────────────────────────────────┘
                             │ LLM for ambiguous fields (Phase 1)
                             ▼
┌──────────────────────────────────────────────────────────────┐
│  3. NORMALISATION                                            │ ◄─ DETERMINISTIC
│     parse_date_safe · parse_amount · apply_sign_convention   │
│     normalize_description · compute_transaction_id (SHA-256) │
│     _infer_tx_type · remove_card_balance_row                 │
└────────────────────────────┬─────────────────────────────────┘
                             │  ID calculated here from raw values
                             ▼
┌──────────────────────────────────────────────────────────────┐
│  4. DEDUP CHECK                                              │ ◄─ DETERMINISTIC
│     get_existing_tx_ids (repository.py)                      │
│     → abort if all already in DB, zero wasted LLM calls      │
└────────────────────────────┬─────────────────────────────────┘
                             │  only new txs proceed
                             ▼
┌──────────────────────────────────────────────────────────────┐
│  5. DESCRIPTION CLEANING                                     │
│     PRIVACY / PII REDACTION  ◄─ DETERMINISTIC                │
│     redact_pii · restore_owner_placeholders                  │
│     (applied BEFORE and AFTER every LLM call)                │
│                              ◄─ LLM (counterparty extraction)│
└────────────────────────────┬─────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│  6. INTERNAL TRANSFER DETECTION [RF-04]                      │ ◄─ DETERMINISTIC
│     detect_internal_transfers                                │
│     Phase 1: amount+date matching                            │
│     Phase 2: owner name matching                             │
└────────────────────────────┬─────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│  7. CARD RECONCILIATION [RF-03]                              │ ◄─ DETERMINISTIC
│     find_card_settlement_matches                             │
│     sliding window · subset sum                              │
└────────────────────────────┬─────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│  8. CATEGORISATION - Levels 0 and 1                          │ ◄─ DETERMINISTIC
│     Lv. 0: user rules (CategoryRule.matches)                 │
│     Lv. 1: static keyword rules                              │
└────────────────────────────┬─────────────────────────────────┘
                             │ LLM only if no rule matches
                             ▼
┌──────────────────────────────────────────────────────────────┐
│  9. DB PERSISTENCE                                           │ ◄─ DETERMINISTIC
│     idempotent upsert · SHA-256 for file and transaction     │
└────────────────────────────┬─────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│  10. REVIEW - auto-apply rules                               │ ◄─ DETERMINISTIC
│      apply_rules_to_review_transactions  (to_review=True)    │
│      apply_all_rules_to_all_transactions (all txs)           │
│      bulk description rules · DescriptionRule                │
└──────────────────────────────────────────────────────────────┘

1 - File format detection

Module: core/normalizer.py
When: stage 1, before any parsing

Function                      Hardcoded rule
detect_encoding(raw_bytes)    chardet → normalises alias (ascii → utf-8)
detect_delimiter(content)     counts frequency of , ; \t | → most frequent wins
detect_header_row(lines)      first row with ≥ 2 non-numeric fields; numeric pattern: ^[\d\.\,\-\+\s€$£%]+$
detect_best_sheet(workbook)   excludes sheets named summary|totale|riepilogo; score = rows + (numeric columns × 10)
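
The two CSV heuristics in the table can be sketched as follows. This is a minimal illustration, not the code in core/normalizer.py: the extra `delimiter` argument of `detect_header_row` and the tie-breaking order of the candidates are assumptions.

```python
import re
from typing import Optional

# Numeric-only field pattern quoted in the table above.
NUMERIC_FIELD = re.compile(r"^[\d\.\,\-\+\s€$£%]+$")
CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]


def detect_delimiter(content: str) -> str:
    """Return the candidate delimiter that occurs most often in the content."""
    # On a tie, max() keeps the first candidate in the list (assumption).
    return max(CANDIDATE_DELIMITERS, key=content.count)


def detect_header_row(lines: list[str], delimiter: str) -> Optional[int]:
    """Return the index of the first row with at least 2 non-numeric fields."""
    for index, line in enumerate(lines):
        fields = [field.strip() for field in line.split(delimiter)]
        non_numeric = [f for f in fields if f and not NUMERIC_FIELD.match(f)]
        if len(non_numeric) >= 2:
            return index
    return None
```

A title line such as "Estratto conto 2024" contains a single non-numeric field, so it is skipped and the real header row is found further down.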

2 - Document classification - Phase 0

Module: core/classifier.py
When: stage 2 (Flow 2), only if the source has no schema in the DB

Resolves column fields without the LLM, via synonyms:

Field             Recognised synonyms
date_col          data, date, data operazione, booking date, buchungsdatum, …
amount_col        importo, amount, betrag, montant, somme, …
debit_col         dare, addebiti, uscite, debit, ausgaben, …
credit_col        avere, accrediti, entrate, credit, einnahmen, …
description_col   descrizione, causale, memo, payee, bezeichnung, libellé, …

Sign inspection (Phase 0.5): if the semantics of amount_col are "neutral", the actual data is read; if any value is < 0, invert_sign=False is certain and no LLM call is needed.
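
A minimal sketch of the synonym lookup and the sign inspection, assuming exact case-insensitive matching on the header text. The names `COLUMN_SYNONYMS`, `resolve_columns` and `sign_is_certain` are hypothetical, and the sets only contain the synonyms listed above (the real lists are longer).

```python
# Excerpt of the synonym lists from the table above.
COLUMN_SYNONYMS = {
    "date_col": {"data", "date", "data operazione", "booking date", "buchungsdatum"},
    "amount_col": {"importo", "amount", "betrag", "montant", "somme"},
    "debit_col": {"dare", "addebiti", "uscite", "debit", "ausgaben"},
    "credit_col": {"avere", "accrediti", "entrate", "credit", "einnahmen"},
    "description_col": {"descrizione", "causale", "memo", "payee", "bezeichnung", "libellé"},
}


def resolve_columns(headers):
    """Map schema fields to header names; the first matching header wins."""
    resolved = {}
    for header in headers:
        key = header.strip().casefold()
        for field, synonyms in COLUMN_SYNONYMS.items():
            if key in synonyms and field not in resolved:
                resolved[field] = header
    return resolved


def sign_is_certain(amounts):
    """Phase 0.5: one negative value shows the column is already signed."""
    return any(value < 0 for value in amounts)
```

Headers that match no synonym (for example a balance column) are simply left unresolved, which is where the LLM phase would take over.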


3 - Normalisation

Module: core/normalizer.py, core/orchestrator.py
When: stage 3, after schema classification

3a - Date parsing

parse_date_safe(value, format)

  1. Tries the schema format
  2. Falls back to common formats (in order): %d/%m/%Y · %d-%m-%Y · %d/%m/%y · %d-%m-%y · %Y-%m-%d · %Y/%m/%d · %m/%d/%Y · %m/%d/%y
  3. Returns None if everything fails (row discarded)
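
The three steps can be sketched with the standard library alone. This is an illustration under the assumption that the schema format is an optional strptime pattern; the parameter name `fmt` is hypothetical.

```python
from datetime import date, datetime
from typing import Optional

# Fallback formats in the order given above (day-first formats come first).
FALLBACK_FORMATS = [
    "%d/%m/%Y", "%d-%m-%Y", "%d/%m/%y", "%d-%m-%y",
    "%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%m/%d/%y",
]


def parse_date_safe(value: str, fmt: Optional[str] = None) -> Optional[date]:
    """Try the schema format first, then each fallback; None if all fail."""
    candidates = ([fmt] if fmt else []) + FALLBACK_FORMATS
    for candidate in candidates:
        try:
            return datetime.strptime(value.strip(), candidate).date()
        except ValueError:
            continue
    return None
```

Because of the ordering, an ambiguous value such as 03/04/2024 is read day-first (3 April) unless the schema format says otherwise.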

3b - Amount parsing

parse_amount(value)

Strip symbols: €  $  £  (spaces)

Separator heuristic:
  "1.234,56"  → dot = thousands, comma = decimal → 1234.56
  "1,234.56"  → comma = thousands, dot = decimal → 1234.56
  "1234,56"   → comma only with ≤ 2 decimal digits → 1234.56
  "1234.56"   → dot only with ≤ 2 decimal digits   → 1234.56

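A sketch of the separator heuristic, returning a Decimal. How a lone separator followed by more than two digits is handled (for example "1.234") is not stated above; treating it as a thousands separator is an assumption of this example.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(value: str) -> Optional[Decimal]:
    """Parse a localised amount string; return None if it is not a number."""
    text = value
    for symbol in ("€", "$", "£", " ", "\u00a0"):
        text = text.replace(symbol, "")
    if "," in text and "." in text:
        # The separator that appears last is the decimal one.
        if text.rfind(",") > text.rfind("."):
            text = text.replace(".", "").replace(",", ".")
        else:
            text = text.replace(",", "")
    elif "," in text or "." in text:
        sep = "," if "," in text else "."
        head, _, tail = text.rpartition(sep)
        if len(tail) <= 2:
            text = head.replace(sep, "") + "." + tail
        else:
            # Assumption: more than 2 trailing digits means thousands.
            text = text.replace(sep, "")
    try:
        return Decimal(text)
    except InvalidOperation:
        return None
```

Using Decimal rather than float keeps the parsed value identical to the printed one, which matters when the amount is later hashed into an identifier.
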
3c - Sign convention

apply_sign_convention(row, convention)

Convention        Rule
signed_single     uses amount_col as-is
debit_positive    credit − debit (both positive in CSV)
credit_negative   credit as-is positive; debit negated

After: if invert_sign=True (typical for cards) → multiply by −1.

3d - Description normalisation

normalize_description(text)
unicodedata.normalize("NFC", text).casefold().strip()
Ensures stable case-insensitive comparisons; never modifies raw_description.
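
The expression above, wrapped in a function for illustration:

```python
import unicodedata


def normalize_description(text: str) -> str:
    """NFC-compose, casefold and trim a description for comparisons."""
    return unicodedata.normalize("NFC", text).casefold().strip()
```

NFC merges a base letter and a combining accent into one code point, and casefold() is more aggressive than lower(): "Straße" and "STRASSE" both normalise to "strasse".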

3e - Transaction identifier (idempotency key)

compute_transaction_id(account_label, date, amount, description)
SHA-256[:24] of the string: {account_label}|{ISO date}|{amount}|{raw_description}
Computed on raw values → stable across normalisation versions.

compute_file_hash(raw_bytes)
Full SHA-256 of the file → import-level dedup.
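
A sketch of both hashes with hashlib. Reading "SHA-256[:24]" as the first 24 hexadecimal characters of the digest, and encoding the key as UTF-8, are assumptions.

```python
import hashlib
from datetime import date
from decimal import Decimal


def compute_transaction_id(account_label: str, tx_date: date,
                           amount: Decimal, raw_description: str) -> str:
    """Hash the raw fields into a short, stable idempotency key."""
    key = f"{account_label}|{tx_date.isoformat()}|{amount}|{raw_description}"
    # Assumption: the ID is the first 24 hex characters of the digest.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:24]


def compute_file_hash(raw_bytes: bytes) -> str:
    """Full SHA-256 hex digest of the uploaded file."""
    return hashlib.sha256(raw_bytes).hexdigest()
```

The same four raw values always yield the same ID, so re-importing a file produces identifiers that already exist in the database.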

3f - Transaction type inference

_infer_tx_type(amount, doc_type, description, internal_patterns)

1. description matches internal_patterns (list from DB) → internal_out / internal_in
2. doc_type in {credit_card, debit_card, prepaid_card}  → card_tx
3. amount ≥ 0                                           → income
4. amount < 0                                           → expense
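
The decision order can be sketched as below. Treating internal_patterns as case-insensitive regexes and choosing internal_out for negative amounts are assumptions; the text does not spell out either detail.

```python
import re

CARD_DOC_TYPES = {"credit_card", "debit_card", "prepaid_card"}


def infer_tx_type(amount, doc_type, description, internal_patterns):
    """Apply the four rules in order; the first one that applies wins."""
    for pattern in internal_patterns:
        if re.search(pattern, description, re.IGNORECASE):
            # Assumption: the sign decides the direction of the transfer.
            return "internal_out" if amount < 0 else "internal_in"
    if doc_type in CARD_DOC_TYPES:
        return "card_tx"
    return "income" if amount >= 0 else "expense"
```

Since the internal-pattern check comes first, a card transaction whose description matches a pattern is still classified as internal.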

3g - Card balance row removal

remove_card_balance_row(txs, epsilon, owner_label)
Detects the row whose |amount| ≈ Σ|other amounts| (within epsilon 0.01 €).
With owner_label → renames the description (internal transfer detection captures it).
Without owner_label → removes the row (avoids double counting).
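
The detection step alone can be sketched as follows; the helper name `find_balance_row` is hypothetical, and the renaming or removal that follows is left out.

```python
from typing import Optional, Sequence


def find_balance_row(amounts: Sequence[float], epsilon: float = 0.01) -> Optional[int]:
    """Return the index of the row whose |amount| equals the sum of the others."""
    total = sum(abs(amount) for amount in amounts)
    for index, amount in enumerate(amounts):
        others = total - abs(amount)
        if abs(abs(amount) - others) <= epsilon:
            return index
    return None
```

For a statement with purchases of 10.00, 20.50 and 4.50 plus a 35.00 total row, the total row is the one flagged.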


4 - Dedup check

Module: db/repository.py → get_existing_tx_ids()
When: stage 4, after normalisation and before description cleaning (LLM)
Why: the SHA-256 ID is calculated at stage 3 from raw values → duplicates can be discarded without wasting LLM tokens

existing_ids = SELECT id FROM transaction WHERE id IN (all_ids_in_batch)
→ filters out txs already present
→ if all are present → abort early (file already imported)

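A self-contained sketch of the lookup using sqlite3. The real repository layer may use a different driver or an ORM, and `filter_new` is a hypothetical helper.

```python
import sqlite3


def get_existing_tx_ids(conn, batch_ids):
    """Return the subset of batch_ids already stored in the transaction table."""
    ids = list(batch_ids)
    if not ids:
        return set()
    placeholders = ", ".join("?" for _ in ids)
    # "transaction" is an SQL keyword, so the table name is quoted.
    query = f'SELECT id FROM "transaction" WHERE id IN ({placeholders})'
    return {row[0] for row in conn.execute(query, ids)}


def filter_new(conn, txs):
    """Keep only transactions whose id is not in the database yet."""
    existing = get_existing_tx_ids(conn, [tx["id"] for tx in txs])
    return [tx for tx in txs if tx["id"] not in existing]
```

An empty result from `filter_new` corresponds to the early abort: every transaction in the file is already stored. Very large batches would need chunking, since SQLite limits the number of bound parameters per statement.
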
5 - Privacy / PII Redaction

Module: core/sanitizer.py
When: BEFORE every LLM call (description cleaning + categorisation); AFTER for owner name restoration

Redaction rules

Pattern                     Regex                                            Replaced with
IBAN                        [A-Z]{2}\d{2}[A-Z0-9]{4,30}                      <ACCOUNT_ID>
PAN / card (13-19 digits)   \d{4}[\s\-]?\d{4}[\s\-]?\d{4}[\s\-]?\d{1,7}      <CARD_ID>
Masked card                 [\*X]{4}[\s\-]?\d{4}                             <CARD_ID>
Transaction codes           (CAU|NDS|TRN|CRO|RIF|ID TRANSAZIONE)\s*[\d\-]+   <TX_CODE>
IT tax code                 [A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]           <FISCAL_ID>
Additional user patterns    configurable                                     <REDACTED>

Owner names → fictitious names (for LLM)

Real names are replaced with plausible but fake names (the LLM can still recognise them as persons and extract them correctly). After the LLM response, restore_owner_placeholders() puts the real names back.

Language   Fictitious name pool
IT         Carlo Brambilla, Marta Pellegrino, Alberto Marini, Giovanna Ferrara, …
EN         James Fletcher, Helen Norris, David Lawson, Susan Palmer, …
DE         Klaus Hartmann, Monika Braun, Stefan Richter, Ingrid Weber, …
FR         Pierre Dumont, Claire Lebrun, Michel Garnier, Sophie Renard, …
ES         Carlos Navarro, Elena Vega, Miguel Torres, Isabel Molina, …

Final guard: assert_sanitized(text) → raises ValueError if IBAN or PAN are still present.
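
A sketch of the regex redaction and the final guard, applying the rules in table order. The \b word boundaries are an addition of this example (without them the IBAN pattern can match inside an Italian tax code); user patterns and owner-name replacement are omitted.

```python
import re

# Regexes from the redaction table; the \b anchors are an assumption.
IBAN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{4,30}\b")
PAN = re.compile(r"\b\d{4}[\s\-]?\d{4}[\s\-]?\d{4}[\s\-]?\d{1,7}\b")
MASKED_CARD = re.compile(r"[\*X]{4}[\s\-]?\d{4}")
TX_CODE = re.compile(r"\b(?:CAU|NDS|TRN|CRO|RIF|ID TRANSAZIONE)\s*[\d\-]+")
FISCAL_ID = re.compile(r"\b[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]\b")

RULES = [
    (IBAN, "<ACCOUNT_ID>"),
    (PAN, "<CARD_ID>"),
    (MASKED_CARD, "<CARD_ID>"),
    (TX_CODE, "<TX_CODE>"),
    (FISCAL_ID, "<FISCAL_ID>"),
]


def redact_pii(text: str) -> str:
    """Replace every match of every rule with its placeholder."""
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text


def assert_sanitized(text: str) -> None:
    """Final guard: refuse text that still contains an IBAN or a PAN."""
    if IBAN.search(text) or PAN.search(text):
        raise ValueError("text still contains an IBAN or PAN")
```

Calling `assert_sanitized(redact_pii(description))` right before building the prompt makes a missed pattern fail loudly instead of leaking to the LLM.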


6 - Internal transfer detection [RF-04]

Module: core/normalizer.py → detect_internal_transfers()
When: stage 6, after dedup

Phase 1 - Amount + date matching

For every pair (i, j) with account_label_i ≠ account_label_j:

  amount_match = |amount_i + amount_j| ≤ epsilon
  date_match   = |date_i − date_j| ≤ delta_days

  If both verified:
    high_symmetry = |amount_i + amount_j| ≤ epsilon_strict AND |date_i − date_j| ≤ delta_days_strict

    Confidence:
      HIGH   → keyword from internal_patterns list found in description
      MEDIUM → high_symmetry without keyword

    If require_keyword_confirmation=True AND confidence=MEDIUM:
      → marks transfer_pair_id, does NOT update tx_type (goes to review)
    Otherwise:
      → tx_type: internal_out (outgoing) / internal_in (incoming)

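The pair test of Phase 1 can be sketched as below, using the defaults listed under "Key parameters". The `Tx` dataclass and the function name are hypothetical. Returning None for a pair that matches loosely but has neither a keyword nor high symmetry is an assumption, since that case is not described.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass
class Tx:
    account_label: str
    tx_date: date
    amount: float
    description: str


def transfer_confidence(a: Tx, b: Tx, keywords: Sequence[str],
                        epsilon: float = 0.01, epsilon_strict: float = 0.005,
                        delta_days: int = 5, delta_days_strict: int = 1) -> Optional[str]:
    """Return "HIGH", "MEDIUM" or None for a candidate pair of transactions."""
    if a.account_label == b.account_label:
        return None
    amount_gap = abs(a.amount + b.amount)  # opposite signs cancel out
    day_gap = abs((a.tx_date - b.tx_date).days)
    if amount_gap > epsilon or day_gap > delta_days:
        return None
    text = f"{a.description} {b.description}".casefold()
    if any(keyword.casefold() in text for keyword in keywords):
        return "HIGH"
    if amount_gap <= epsilon_strict and day_gap <= delta_days_strict:
        return "MEDIUM"
    return None  # assumption: a loose match without keyword is not a transfer
```

A caller would then apply require_keyword_confirmation: with the flag set, a MEDIUM pair only receives a transfer_pair_id and is left for review.
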
Phase 2 - Owner name matching

For every tx not yet paired:
  If description contains an owner name
  (regex with all permutations of the name tokens):
    → tx_type = internal_out / internal_in
    → transfer_confidence = HIGH

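One way to build a regex over all permutations of the name tokens is sketched here; the function name, the whitespace handling and the word boundaries are assumptions. The example uses a name from the fictitious pool above.

```python
import re
from itertools import permutations


def owner_name_regex(full_name: str) -> "re.Pattern[str]":
    """Build a case-insensitive regex matching the name tokens in any order."""
    tokens = full_name.split()
    alternatives = [
        r"\s+".join(re.escape(token) for token in ordering)
        for ordering in permutations(tokens)
    ]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


pattern = owner_name_regex("Carlo Brambilla")
print(bool(pattern.search("BONIFICO A BRAMBILLA CARLO")))
```

The number of alternatives grows factorially with the token count (2 for two tokens, 24 for four), which stays manageable for ordinary personal names.
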
Key parameters

Parameter           Default
epsilon             0.01 €
epsilon_strict      0.005 €
delta_days          5 days
delta_days_strict   1 day

7 - Card reconciliation [RF-03]

Module: core/normalizer.py → find_card_settlement_matches()
When: stage 7, matches card_settlement (current account) with card_tx (card)

Phase 1 - Time window

card_tx in [debit_date − 45 days, debit_date + 7 days]

Phase 2 - Sliding window (contiguous subsets)

For every contiguous subset [i..j]:
  verify: gap between consecutive txs ≤ max_gap_days (5 days)
  sum = Σ |amount[i..j]|
  If |sum − debit_amount| ≤ epsilon → MATCH ✓
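
A sketch of the sliding-window search over card transactions already filtered to the time window and sorted by date. Representing transactions as (date, amount) tuples and returning the first matching index pair are assumptions.

```python
from datetime import date
from typing import Optional, Sequence, Tuple


def find_contiguous_match(txs: Sequence[Tuple[date, float]], debit_amount: float,
                          epsilon: float = 0.01,
                          max_gap_days: int = 5) -> Optional[Tuple[int, int]]:
    """Return (i, j) of the first contiguous run whose |amounts| sum to the debit."""
    target = abs(debit_amount)
    for i in range(len(txs)):
        total = 0.0
        for j in range(i, len(txs)):
            # A gap larger than max_gap_days ends the run starting at i.
            if j > i and (txs[j][0] - txs[j - 1][0]).days > max_gap_days:
                break
            total += abs(txs[j][1])
            if abs(total - target) <= epsilon:
                return (i, j)
    return None
```

The search is quadratic in the number of card transactions in the window, which is why it runs before the exhaustive fallback.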

Phase 3 - Boundary subset sum (fallback)

Takes k=10 txs before + k=10 after the debit date (max 20 txs)
Exhaustive search: all subsets → 2^20 ≈ 1M combinations (safe)
First combination that sums to the amount → MATCH ✓
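
The exhaustive fallback can be sketched with itertools. Enumerating smaller subsets first is an assumption; the text only says that the first combination found wins.

```python
from itertools import combinations
from typing import Optional, Sequence, Tuple


def find_subset_match(amounts: Sequence[float], target: float,
                      epsilon: float = 0.01) -> Optional[Tuple[int, ...]]:
    """Return the indices of the first subset whose |amounts| sum to |target|."""
    wanted = abs(target)
    indices = range(len(amounts))
    for size in range(1, len(amounts) + 1):
        for subset in combinations(indices, size):
            total = sum(abs(amounts[i]) for i in subset)
            if abs(total - wanted) <= epsilon:
                return subset
    return None
```

With at most 20 candidates there are 2^20 = 1,048,576 subsets (including the empty one), so the brute-force search stays bounded.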

8 - Categorisation - deterministic levels

Module: core/categorizer.py
When: stage 8, before LLM (levels 0 and 1)

Level 0 - User rules (CategoryRule)

Saved in DB, sorted by descending priority. First match wins.

CategoryRule.matches(description, doc_type):

Type       Logic
exact      description.casefold() == pattern.casefold()
contains   pattern.casefold() IN description.casefold()
regex      re.search(pattern, description, IGNORECASE)

If doc_type specified in the rule → must match the transaction's doc_type.
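
A sketch of a rule object with the three match types and the priority ordering. The field names, the `first_match` helper and the doc_type value "bank_account" used in the usage note are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class CategoryRule:
    pattern: str
    match_type: str  # "exact", "contains" or "regex"
    category: str
    priority: int = 0
    doc_type: Optional[str] = None  # None means "any document type"

    def matches(self, description: str, doc_type: str) -> bool:
        if self.doc_type is not None and self.doc_type != doc_type:
            return False
        if self.match_type == "exact":
            return description.casefold() == self.pattern.casefold()
        if self.match_type == "contains":
            return self.pattern.casefold() in description.casefold()
        if self.match_type == "regex":
            return re.search(self.pattern, description, re.IGNORECASE) is not None
        return False


def first_match(rules: Iterable[CategoryRule], description: str,
                doc_type: str) -> Optional[CategoryRule]:
    """Evaluate rules by descending priority; the first match wins."""
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if rule.matches(description, doc_type):
            return rule
    return None
```

With a priority-5 regex rule restricted to debit_card and a priority-1 contains rule for "esselunga", the description "POS ESSELUNGA MILANO" resolves to the regex rule on a debit card statement and to the contains rule on a "bank_account" statement.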

Level 1 - Static keyword rules

Hardcoded in the code, direction-aware (expenses/income separated):

EXPENSES:

Pattern (regex, case-insensitive)                             Category    Subcategory
conad|coop|esselunga|lidl|carrefour|eurospin|aldi|penny|pam   Food        Grocery shopping
farmacia|pharma                                               Health      Medicines
eni|shell|q8|tamoil|ip|api|agip                               Transport   Fuel
telepass|autostrad                                            Transport   Parking / ZTL
trenitalia|italo|frecciarossa|frecciargento                   Transport   Public transport
enel|iren|a2a|hera|eni gas                                    Home        Electricity
netflix|spotify|amazon prime|disney+|apple tv                 Leisure     Streaming / digital subscriptions
commissione|canone conto|spese tenuta                         Finance     Bank fees

INCOME:

Pattern                       Category          Subcategory
stipendio|salary|busta paga   Employment        Salary
pensione|inps rendita         Social benefits   Pension / annuity
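
A sketch of direction-aware keyword matching over an excerpt of the tables above. Wrapping each pattern in \b word boundaries is an assumption of this example (it stops short tokens such as "ip" from matching inside longer words); whether the real rules do this is not stated.

```python
import re

# Excerpt of the static rules; the full tables contain more entries.
STATIC_RULES = {
    "expense": [
        (r"conad|coop|esselunga|lidl", "Food", "Grocery shopping"),
        (r"farmacia|pharma", "Health", "Medicines"),
        (r"netflix|spotify", "Leisure", "Streaming / digital subscriptions"),
    ],
    "income": [
        (r"stipendio|salary|busta paga", "Employment", "Salary"),
    ],
}


def apply_static_rules(description: str, amount: float):
    """Return (category, subcategory) from the first matching rule, else None."""
    direction = "expense" if amount < 0 else "income"
    for pattern, category, subcategory in STATIC_RULES[direction]:
        # Assumption: keywords must appear as whole words.
        if re.search(rf"\b(?:{pattern})\b", description, re.IGNORECASE):
            return category, subcategory
    return None
```

A None result is the point at which the transaction would be handed to the LLM categoriser.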

9 - DB Persistence

Module: db/repository.py
When: stage 9, everything idempotent

Function                               Idempotency rule
upsert_transaction(tx)                 if tx.id exists → skip
create_import_batch(sha256)            if sha256 exists → return existing
upsert_document_schema(schema)         if source_identifier exists → update
create_reconciliation_link(sid, did)   if pair (sid, did) exists → skip
create_transfer_link(out_id, in_id)    if pair exists → skip
update_transaction_category()          always sets: confidence=high, source=manual, to_review=False

10 - Manual review - deterministic tools

Module: db/repository.py, ui/review_page.py

Auto-apply rules (Review page)

apply_rules_to_review_transactions(session, user_rules)
Runs on every load of the Review page:

For each tx with to_review=True:
  For each rule (sorted by priority DESC):
    If rule.matches(tx.description, tx.doc_type):
      → update category, source=rule, to_review=False
      → move to next tx

Run all rules (Rules page)

apply_all_rules_to_all_transactions(session, user_rules)
Triggered by the "▶️ Run all rules" button on the Rules page:

Applies all rules to ALL transactions (not only to_review=True):
  Rules sorted by priority DESC
  For each tx:
    For each rule:
      If rule.matches(tx.description, tx.doc_type):
        → update category, subcategory, source=rule, confidence=high
        → if tx.to_review=True → set to_review=False (n_cleared++)
        → move to next tx (first match wins)
  Returns (n_matched, n_cleared_review)

Requires confirmation via checkbox before execution.

DescriptionRule - bulk description correction rules

Saved in DB (description_rule). Pattern on raw_description:

Type       Logic
exact      raw_description.lower() == pattern.lower()
contains   pattern.lower() IN raw_description.lower()
regex      re.search(pattern, raw_description, IGNORECASE)

Application: updates description → re-categorises with LLM.


11 - Analytics - thresholds and filters

Module: ui/analytics_page.py

Types excluded from charts

EXCLUDED = {"internal_out", "internal_in", "card_settlement", "aggregate_debit"}

Spending benchmarks (ISTAT comparison)

Thresholds applied for each category against the reference household benchmark:

Signal                     Condition                        Icon
Abnormally high spending   spending > 1.5 × benchmark       🔴
Abnormally low spending    spending < 0.5 × benchmark       🔵
Normal                     spending between 0.5× and 1.5×   🟢
Absent                     no spending in category          ⚪
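
The four signals reduce to a ratio check, sketched here with plain labels instead of icons. The function name is hypothetical and a positive benchmark is assumed.

```python
def benchmark_signal(spending: float, benchmark: float) -> str:
    """Classify a category's spending against its reference benchmark."""
    if spending == 0:
        return "absent"
    ratio = spending / benchmark  # assumes benchmark > 0
    if ratio > 1.5:
        return "high"
    if ratio < 0.5:
        return "low"
    return "normal"
```

Values exactly at 0.5× or 1.5× fall in the normal band, because both thresholds are strict inequalities.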

Summary - All tools by pipeline stage

Stage                     Tool                                                                        Module                           LLM?
1. File format            detect_encoding / detect_delimiter / detect_header_row / detect_best_sheet  normalizer.py                    ✗
1b. Pre-processing        detect_and_strip_preheader_rows / drop_low_variability_columns              normalizer.py                    ✗
2. Schema - Phase 0       column synonyms, sign inspection                                            classifier.py                    ✗
2. Schema - Phase 1       doc_type classification, date_format, sign_convention                       classifier.py                    ✓ LLM
3. Normalisation          parse_date_safe / parse_amount / apply_sign_convention /                    normalizer.py + orchestrator.py  ✗
                          normalize_description / compute_transaction_id /
                          _infer_tx_type / remove_card_balance_row
4. Dedup                  get_existing_tx_ids                                                         repository.py                    ✗
5. Privacy                redact_pii / restore_owner_placeholders                                     sanitizer.py                     ✗
5. Description cleaning   clean_descriptions_batch                                                    description_cleaner.py           ✓ LLM
6. Internal transfers     detect_internal_transfers (Phase 1 + Phase 2)                               normalizer.py                    ✗
7. Card reconciliation    find_card_settlement_matches (3 phases)                                     normalizer.py                    ✗
8. Categorisation Lv. 0   CategoryRule.matches (user rules)                                           categorizer.py                   ✗
8. Categorisation Lv. 1   _apply_static_rules (hardcoded keywords)                                    categorizer.py                   ✗
8. Categorisation Lv. 3   categorize_batch (LLM)                                                      categorizer.py                   ✓ LLM
9. Persistence            upsert_transaction / persist_import_result                                  repository.py                    ✗
10. Auto-rules            apply_rules_to_review_transactions                                          repository.py                    ✗
10. Run all rules         apply_all_rules_to_all_transactions                                         repository.py                    ✗
10. Bulk descriptions     DescriptionRule + _apply_description_rule_bulk                              repository.py + review_page.py   ✓ LLM (re-cat.)
Analytics                 EXCLUDED / ISTAT benchmark 0.5×–1.5×                                        analytics_page.py                ✗

Global configuration parameters

All defaults are in ProcessingConfig (core/orchestrator.py):

Parameter                      Default      Used by
tolerance                      0.01 €       internal transfer detection, card reconciliation
tolerance_strict               0.005 €      high-symmetry internal transfers
settlement_days                5 days       internal transfer matching window
settlement_days_strict         1 day        strict internal transfer window
window_days                    45 days      card reconciliation time window
max_gap_days                   5 days       card sliding window
boundary_pre_post              10 txs       reconciliation subset sum
confidence_threshold           0.80         LLM threshold → to_review
require_keyword_confirmation   True         medium internal transfers → to_review if no keyword
batch_size (descriptions)      30 tx/call   clean_descriptions_batch
batch_size (categories)        20 tx/call   categorize_batch
