-
Notifications
You must be signed in to change notification settings - Fork 1
Deterministic Tools
Complete inventory of all algorithms, rules and transformations that do not use LLM. Each entry indicates: where it lives in the code, which pipeline stage it is applied at, whether it is automatic or requires user action.
FILE (CSV / XLSX)
│
▼ ── STAGE 1: LOADING ──────────────────────────────────────────────
│ detect_encoding (normalizer.py)
│ detect_delimiter (normalizer.py) [CSV only]
│ detect_header_row (normalizer.py) [CSV only]
│ detect_best_sheet (normalizer.py) [Excel only]
│
▼ ── STAGE 1b: PRE-PROCESSING (Phase 0) ────────────────────────────
│ detect_and_strip_preheader_rows (normalizer.py) [CSV + Excel]
│ drop_low_variability_columns (normalizer.py) [CSV + Excel]
│
▼ ── STAGE 2: SCHEMA / CLASSIFICATION ─────────────────────────────
│ compute_columns_key (normalizer.py) [schema cache]
│ compute_file_hash (normalizer.py) [import idempotency]
│ [PHASE 0 classifier.py: deterministic column synonyms]
│
▼ ── STAGE 3: NORMALISATION ──────────────────────────────────────
│ parse_date_safe (normalizer.py)
│ parse_amount (normalizer.py)
│ apply_sign_convention (normalizer.py)
│ normalize_description (normalizer.py)
│ compute_transaction_id (normalizer.py) [SHA-256 dedup]
│ [intra-file dedup: aggregation of identical rows]
│ remove_card_balance_row (normalizer.py)
│
▼ ── STAGE 4: DEDUP CHECK ────────────────────────────────────────────
│ get_existing_tx_ids (repository.py)
│ [IDs already calculated at step 3 → no LLM call on already-known txs]
│
▼ ── STAGE 5: DESCRIPTION CLEANING ──────────────────────────────────
│ redact_pii (sanitizer.py) [before LLM]
│ restore_owner_aliases (sanitizer.py) [after LLM]
│ [LLM output filter: discard "null","none","nan",…]
│
▼ ── STAGE 6: INTERNAL TRANSFER DETECTION ────────────────────────────
│ detect_internal_transfers (normalizer.py)
│ ├─ Phase 1: amount + date matching
│ └─ Phase 2: owner name match
│
▼ ── STAGE 7: CARD RECONCILIATION ───────────────────────────────────
│ find_card_settlement_matches (normalizer.py)
│ ├─ Phase 1: time window
│ ├─ Phase 2: sliding window (contiguous)
│ └─ Phase 3: boundary subset sum
│
▼ ── STAGE 8: CATEGORISATION (cascade) ──────────────────────────────
│ _try_deterministic (categorizer.py)
│ ├─ Level 0: user rules (CategoryRule.matches)
│ └─ Level 1: static rules (_STATIC_RULES)
│ [if no match → LLM]
│ [after LLM: deterministic taxonomy validation]
│
▼ ── STAGE 9: DB PERSISTENCE ──────────────────────────────────────
│ [idempotent upsert for every tx, link, schema]
│
▼ ── STAGE 10: MANUAL REVIEW ──────────────────────────────────────
apply_rules_to_review_transactions (repository.py) [auto on Review page load]
apply_all_rules_to_all_transactions (repository.py) [button "Run all rules"]
_apply_description_rule_bulk (review_page.py) [on user request]
_rerun_transfer_detection (review_page.py) [on user request]
Function: detects the file encoding via chardet
Input: raw bytes of the file
Output: encoding string (e.g. "utf-8", "latin-1")
Notes: ascii is normalised to utf-8; default utf-8 if chardet fails
Function: counts the frequency of candidate characters and returns the most frequent
Input: CSV text content
Output: one of , ; \t |
Candidates (hardcoded): [",", ";", "\t", "|"]
Function: finds the first row with ≥ 2 non-numeric and non-empty fields
Input: list of text rows
Output: row index (0 if not found)
Numeric pattern (hardcoded): ^[\d\.\,\-\+\s€$£%]+$
Function: selects the Excel sheet with the most useful data Input: Excel workbook object Output: sheet name Logic:
- Excludes sheets with names matching
summary|totale|riepilogo(case-insensitive) - Score =
n_rows + n_numeric_columns × 10 - Threshold: > 50% of rows must have numeric values for a column to be considered numeric
Function: removes sparse metadata rows present before the actual transaction table header
Input: raw DataFrame + filename (for logging)
Output: (cleaned DataFrame, n_rows_removed)
Algorithm:
- Reconstructs the header row consumed by pandas as row 0 (column names
Unnamed: Nare treated as empty cells) - Computes non-null density per row (number of non-null cells / number of columns)
- Computes the median of densities
- Counts contiguous rows from the start with density <
median × 0.5 - Safety limits: max 20 rows absolute AND max 10% of total → otherwise
ValueError - Reassigns column names from the first non-sparse row
Constants (hardcoded):
| Constant | Value | Description |
|---|---|---|
_PREHEADER_MAX_ROWS |
20 | Max absolute sparse rows before error |
_PREHEADER_MAX_RATIO |
0.10 | Max % sparse rows relative to total |
_PREHEADER_DENSITY_THRESHOLD |
0.5 | Median multiplier for sparse threshold |
Edge cases:
- DataFrame with < 4 rows → returns unchanged
- 0 sparse rows → returns unchanged (fast path)
Rationale: statistical and language-agnostic approach — does not use banking term dictionaries
Function: removes columns with nearly constant values (e.g. "Account holder name", "Card number" in AmEx files)
Input: DataFrame + filename (for logging)
Output: (cleaned DataFrame, removed_columns_list)
Algorithm: for each column computes nunique(col) / n_rows; if < threshold → candidate for removal
Protection: never drops below 2 columns (always preserves a minimum workable set)
Constants (hardcoded):
| Constant | Value | Description |
|---|---|---|
_LOW_VARIABILITY_RATIO |
0.015 | Threshold: < 1.5% unique/nrows → metadata column |
Edge cases:
- DataFrame with < 2 rows → returns unchanged
- DataFrame with ≤ 2 columns → returns unchanged
Function: generates a cache key for the DocumentSchema based on column names
Input: pandas DataFrame
Output: "cols:" + SHA-256[:16]
Use: same bank layout recognised across different files (e.g. CARD_2025.xlsx → CARD_2026.xlsx)
Function: SHA-256 of the raw file Input: file bytes Output: hex string Use: import-level idempotency (same file not re-imported)
Function: deterministic mapping of column names to canonical fields Recognised synonyms (examples):
| Canonical field | Synonyms (excerpt) |
|---|---|
date_col |
data, date, data operazione, buchungsdatum, fecha, date valeur |
amount_col |
importo, amount, betrag, montant, importe |
debit_col |
dare, addebiti, uscite, debit, ausgaben, débits |
credit_col |
avere, accrediti, entrate, credit, eingaben, crédits |
description_col |
descrizione, causale, memo, payee, verwendungszweck, libellé |
Phase 0 always overrides Phase 1 (LLM) — certain results from deterministic matching are not overwritten
Function: date parsing with fallback to common formats
Input: raw string, primary format from schema
Output: date object or None (row discarded if None)
Fallback formats (in order):
%d/%m/%Y %d-%m-%Y %d/%m/%y %d-%m-%y
%Y-%m-%d %Y/%m/%d %m/%d/%Y %m/%d/%y
Function: converts any amount representation to Decimal
Input: str | float | int | Decimal
Output: Decimal or None
Removed symbols (hardcoded): € $ £ and spaces
Separator detection:
| Format | Example | Rule |
|---|---|---|
| European | 1.234,56 |
. = thousands, , = decimal |
| American | 1,234.56 |
, = thousands, . = decimal |
| Comma only | 1234,56 |
, = decimal if ≤ 2 fractional digits |
| Dot only | 1234.56 |
. = decimal |
Function: produces the correctly-signed amount according to the file's convention
Input: row, columns, SignConvention enum
Conventions:
| Convention | Logic |
|---|---|
signed_single |
amount column already signed |
debit_positive |
credit − debit (both positive in CSV) |
credit_negative |
credit as-is, `− |
Function: text normalisation for stable case-insensitive comparison
Input: string
Output: NFC_unicode(text).casefold().strip()
Function: stable deduplication key for each transaction
Input: account_label | source_file, raw date, raw amount, raw description
Output: SHA-256[:24] of the string "key|raw_date|raw_amount|raw_desc"
Why raw values: stable across normalisation algorithm updates
Function: aggregates identical rows (same account_label + date + amount + description) Output: sums amounts, recomputes hash Use: avoids double counting if the same tx appears multiple times in the bank export
Function: detects and removes the "total balance" row present in some card files
Detection condition: ||amount_i| − Σ|other amounts|| ≤ epsilon
Requires: ≥ 3 transactions in the file
Behaviour:
- With
owner_name_label→ renames the description with the owner's name (internal transfer detection then captures it) - Without
owner_name_label→ removes the row (legacy: avoids double counting)
Function: DB query to find which IDs in the batch already exist Input: list of transaction IDs (calculated at step 3 from raw values) Output: set of already-present IDs Use: filters already-imported txs; if all present → immediate abort, zero LLM calls
This step was moved before description cleaning to avoid wasting LLM tokens on transactions that would be discarded anyway because they are already in the DB.
Function: replaces sensitive data with placeholders or fictitious names before every LLM call
Input: text, SanitizationConfig
Output: redacted text
Hardcoded patterns (precompiled regex):
| Data | Pattern | Placeholder |
|---|---|---|
| IBAN | \b[A-Z]{2}\d{2}[A-Z0-9]{4,30}\b |
<ACCOUNT_ID> |
| PAN / card (13-19 digits) | \b\d{4}[\s\-]?\d{4}[\s\-]?\d{4}[\s\-]?\d{1,7}\b |
<CARD_ID> |
| Masked card | [\*X]{4}[\s\-]?\d{4} |
<CARD_ID> |
| Bank codes (CAU, NDS, CRO, RIF…) | \b(CAU|NDS|CRO|RIF|TRN|ID\s*TRANSAZIONE)\s*[\d\-]+ |
<TX_CODE> |
| IT tax code | \b[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]\b |
<FISCAL_ID> |
| Additional user patterns | from SanitizationConfig.extra_patterns
|
<REDACTED> |
Owner names → fictitious names (pool by language):
| Language | Pool (6 names) |
|---|---|
| IT | Carlo Brambilla, Marta Pellegrino, Alberto Marini, Giovanna Ferrara, Luca Montanari, Silvia Cattaneo |
| EN | James Fletcher, Helen Norris, Robert Ashworth, Patricia Holt, Edward Tilman, Susan Delaney |
| FR | Pierre Dumont, Claire Lebrun, François Martel, Isabelle Renaud, Gilles Fontaine, Nathalie Girard |
| DE | Klaus Hartmann, Monika Braun, Werner Schulze, Ingrid Bauer, Dieter Hoffmann, Renate Fischer |
| ES | Carlos Navarro, Elena Vega, Javier Romero, Isabel Fuentes, Andrés Molina, Lucía Castillo |
All permutations of tokens are captured ("Luigi Corsaro" and "Corsaro Luigi")
Function: reverse operation — replaces fictitious names with the real owner names When: after every LLM response Why: internal transfer detection works on real names
Function: discards unusable LLM responses
Discarded values (hardcoded): "null", "none", "n/a", "na", "nan", "-", "—"
Additional check: discards if result == original description (no change)
Function: identifies pairs of transactions that cancel each other out (internal transfer between accounts) Run: automatically on import + on user request from the Review page
Phase 1 — Matching by amount + date
For every pair (i, j) with i.account_label ≠ j.account_label:
amount_match = |amount_i + amount_j| ≤ epsilon (default 0.01 €)
date_match = |date_i − date_j| ≤ delta_days (default 5 days)
high_symmetry = |amount_i + amount_j| ≤ epsilon_strict (default 0.005 €)
AND |date_i − date_j| ≤ delta_days_strict (default 1 day)
Confidence:
HIGH → keyword present in description (list from user DB)
MEDIUM → high_symmetry without keyword
With require_keyword_confirmation=True AND confidence=MEDIUM:
→ sets transfer_pair_id but does NOT update tx_type (stays in review queue)
Otherwise:
→ tx_type = internal_out (outgoing) / internal_in (incoming)
Phase 2 — Owner name match (for txs not yet paired)
If the description contains an owner name
(regex with all permutations of tokens):
→ tx_type = internal_out / internal_in
→ transfer_confidence = HIGH
Configurable parameters:
| Parameter | Default | Configurable |
|---|---|---|
epsilon |
0.01 € | Yes |
epsilon_strict |
0.005 € | Yes |
delta_days |
5 days | Yes |
delta_days_strict |
1 day | Yes |
keyword_patterns |
from DB | Yes (Settings page) |
owner_names |
from DB | Yes (Settings page) |
require_keyword_confirmation |
True |
Yes |
Function: matches card_settlement rows (from the current account) to individual card_tx entries (from the card)
Run: automatically on import when both files are present
Phase 1 — Time window filter
Considers only card_tx in [debit_date − 45 days, debit_date + 7 days]
Phase 2 — Sliding window (contiguous subsets)
For every contiguous subset [i..j]:
verify gap between consecutive txs ≤ max_gap_days (default 5 days)
sum = Σ |amount[i..j]|
If |sum − debit_amount| ≤ epsilon → MATCH ✓ (method="sliding_window")
Phase 3 — Boundary subset sum (fallback)
Takes k=10 txs before + k=10 txs after the debit date (total ≤ 20)
Exhaustive search over all subsets: O(2^20) ≈ 1M → safe
If a subset sums to the amount → MATCH ✓ (method="subset_sum")
Configurable parameters:
| Parameter | Default | Configurable |
|---|---|---|
epsilon |
0.01 € | Yes |
window_days |
45 days | Yes |
max_gap_days |
5 days | Yes |
boundary_k |
10 | Yes |
Function: applies the deterministic cascade (Levels 0 and 1) before calling LLM
Output: CategorizationResult or None (→ proceeds to LLM)
Object: CategoryRule in core/categorizer.py
Key fields: pattern, match_type, category, subcategory, doc_type, priority
Matching logic (case-insensitive, casefold):
match_type |
Condition |
|---|---|
exact |
description == pattern |
contains |
pattern in description |
regex |
re.search(pattern, description) |
Optional filter on doc_type (e.g. only credit_card)
Order: descending priority — the first matching rule wins
Result: confidence=HIGH, source=rule, to_review=False
10 hardcoded rules, ordered by transaction type and direction. All case-insensitive, applied only to the correct direction (expense/income).
| # | Pattern | Category | Subcategory | Direction |
|---|---|---|---|---|
| 1 | conad|coop|esselunga|lidl|carrefour|eurospin|aldi|penny|pam\b |
Food | Grocery shopping | Expense |
| 2 | farmacia|pharma |
Health | Medicines | Expense |
| 3 | eni\b|shell|q8|tamoil|ip\b|api\b|agip |
Transport | Fuel | Expense |
| 4 | telepass|autostrad |
Transport | Parking and ZTL | Expense |
| 5 | trenitalia|italo|frecciarossa|frecciargento |
Transport | Public transport | Expense |
| 6 | enel\b|iren\b|a2a\b|hera\b|eni gas |
Home | Electricity | Expense |
| 7 | netflix|spotify|amazon prime|disney\+|apple tv |
Leisure and free time | Streaming / digital subscriptions | Expense |
| 8 | commissione|canone conto|spese tenuta |
Finance and insurance | Bank fees | Expense |
| 9 | stipendio|salary|busta paga |
Employment | Salary | Income |
| 10 | pensione|inps rendita |
Social benefits | Pension / annuity | Income |
Result if match: confidence=HIGH, source=rule, to_review=False
After every LLM response, before accepting the categorisation:
1. Is the (category, subcategory) pair valid in the taxonomy?
NO → look for the subcategory's parent category (find_category_for_subcategory)
IF not found → use the first valid subcategory for that category
SET confidence=LOW, to_review=True
2. Is the category in the correct direction (expense/income)?
NO → fallback (Other / Unclassified expenses), to_review=True
3. LLM confidence ≥ threshold (0.80)?
NO → to_review=True
Function: applies user rules to all txs with to_review=True
When: automatically on every load of the Review page
Logic: descending priority, first match wins
Effect: to_review=False, category_source=rule
Function: applies all user rules to all transactions (not only to_review=True)
When: on user request via "category, subcategory, category_source=rule, category_confidence=high; if to_review=True → sets False
Effect: returns (n_matched, n_cleared_review) — transactions updated and transactions removed from the review queue
Function: updates the description of all txs whose raw_description matches the pattern
When: on user request ("Apply in bulk" button)
Deterministic steps:
-
get_transactions_by_raw_pattern(session, pattern, match_type)→ list of txs - Updates
tx.description = new_descriptionfor each tx - Starts LLM re-categorisation (non-deterministic)
Rule saving (DescriptionRule): idempotent on (raw_pattern, match_type)
Function: re-runs detect_internal_transfers on all non-internal-transfer txs in the DB
When: on user request ("Re-run transfer detection" button)
Typical use: after importing files from multiple different accounts
| # | Tool | Module | Stage | Auto? | Configurable? |
|---|---|---|---|---|---|
| 1 | detect_encoding |
normalizer.py | 1 – Loading | ✓ | No |
| 2 | detect_delimiter |
normalizer.py | 1 – Loading | ✓ | No |
| 3 | detect_header_row |
normalizer.py | 1 – Loading | ✓ | No |
| 4 | detect_best_sheet |
normalizer.py | 1 – Loading | ✓ | No |
| 4b | detect_and_strip_preheader_rows |
normalizer.py | 1b – Pre-processing | ✓ | No |
| 4c | drop_low_variability_columns |
normalizer.py | 1b – Pre-processing | ✓ | No |
| 5 | compute_file_hash |
normalizer.py | 2 – Schema | ✓ | No |
| 6 | compute_columns_key |
normalizer.py | 2 – Schema | ✓ | No |
| 7 | Classifier Phase 0 (synonyms) | classifier.py | 2 – Classification | ✓ | No |
| 8 | parse_date_safe |
normalizer.py | 3 – Normalisation | ✓ | No |
| 9 | parse_amount |
normalizer.py | 3 – Normalisation | ✓ | No |
| 10 | apply_sign_convention |
normalizer.py | 3 – Normalisation | ✓ | Partial |
| 11 | normalize_description |
normalizer.py | 3 – Normalisation | ✓ | No |
| 12 | compute_transaction_id |
normalizer.py | 3 – Normalisation | ✓ | No |
| 13 | Intra-file dedup | normalizer.py | 3 – Normalisation | ✓ | No |
| 14 | remove_card_balance_row |
normalizer.py | 3 – Normalisation | ✓ | epsilon |
| 15 | get_existing_tx_ids |
repository.py | 4 – Dedup check | ✓ | No |
| 16 | redact_pii |
sanitizer.py | 5 – Desc. cleaning | ✓ | Owner names, extra patterns |
| 17 | restore_owner_aliases |
sanitizer.py | 5 – Desc. cleaning | ✓ | No |
| 18 | LLM output filter | description_cleaner.py | 5 – Desc. cleaning | ✓ | No |
| 19 | detect_internal_transfers |
normalizer.py | 6 – Internal transfers | ✓ + manual | Yes (all) |
| 20 | find_card_settlement_matches |
normalizer.py | 7 – Reconciliation | ✓ | Yes |
| 21 | User rules (CategoryRule) |
categorizer.py | 8 – Categorisation | ✓ | Yes (user) |
| 22 | Static rules (_STATIC_RULES) |
categorizer.py | 8 – Categorisation | ✓ | No |
| 23 | Post-LLM taxonomy validation | categorizer.py | 8 – Categorisation | ✓ | No |
| 24 | apply_rules_to_review_transactions |
repository.py | 10 – Review | ✓ (on load) | No |
| 25 | apply_all_rules_to_all_transactions |
repository.py | 10 – Review | Manual (button) | No |
| 26 | _apply_description_rule_bulk |
review_page.py | 10 – Review | Manual | Yes (pattern) |
| 27 | _rerun_transfer_detection |
review_page.py | 10 – Review | Manual | No |
| 28 | get_transactions_by_rule_pattern |
repository.py | 10 – Review | On demand | No |
| 29 | get_transactions_by_raw_pattern |
repository.py | 10 – Review | On demand | No |