A personal finance dashboard that consolidates movements from multiple bank accounts into a single interactive view — with automatic categorisation, persistent user overrides, and ML-powered next-month predictions. The dashboard eliminates the manual work of reconciling exports from three different banks with incompatible formats: a unified loader detects columns automatically, assigns spending categories via keyword rules, and lets you override any classification with a single click that survives app restarts. A lightweight linear-regression model then extrapolates next-month income and expenses per category, giving a forward-looking view alongside the historical charts.
Main technologies: Python · Streamlit · Pandas · Plotly Express · NumPy · openpyxl / xlrd
Monthly cost: $0. The dashboard runs entirely on your local machine with no cloud services, no APIs, and no paid subscriptions. All bank data stays local.
- Screenshots
- Features
- Quick start
- Project structure
- Data processing pipeline
- Architecture & data flow
- Libraries
- Machine Learning model
- Adding your own bank data
- Data privacy
- Auditing
- Multi-bank support — loads data from three banks with different file formats:
- Bank1: one
.xlsxfile per year - Bank2: single multi-column
.xlsxexport (25 columns, split income/expense) - Bank3: single
.csvexport
- Bank1: one
- Automatic format detection — column names are detected by keyword so new exports work without code changes
- Per-bank tabs with:
- Movements table with an editable Category dropdown per row
- Balance over time, expense pie chart, monthly cash flow
- Next-month predictions — predicted income and expenses by category using a linear-trend ML model trained on the last 24 months of data
- Overview tab — combined balance timeline, monthly net flow, expense & income breakdowns by year
- Persistent category overrides — manual category assignments are saved to disk and survive restarts
- Category manager — add or remove spending categories from the UI
# 1. Install dependencies
pip install streamlit pandas plotly openpyxl xlrd numpy
# 2. Generate dummy data (or drop your real files into Bank1/, Bank2/, Bank3/)
python generate_dummy_data.py
# 3. Run the dashboard
streamlit run app.pyOpen http://localhost:8501 in your browser.
├── app.py # Streamlit dashboard
├── data_loader.py # Data loading & normalisation for all three banks
├── predictor.py # ML next-month predictions per category (linear trend)
├── generate_dummy_data.py # Generates realistic sample data for demo
├── categories.json # Editable list of spending categories
├── category_overrides.json # User-assigned category overrides (auto-created, gitignored)
├── Bank1/ # Bank1 data: one xlsx per year
├── Bank2/ # Bank2 data: one xlsx covering multiple years
└── Bank3/ # Bank3 data: all.csv
Each bank exports data in a different format. A dedicated loader handles each one:
| Loader | Source | Format details |
|---|---|---|
load_bank1() |
Bank1/*.xlsx |
One file per year. Columns detected by keyword (fecha, concepto, importe, saldo). Date format varies: MM/DD/YYYY for 2020, DD/MM/YYYY for all later files. |
load_bank2() |
Bank2/*.xlsx |
Single file spanning multiple years. Two layouts are handled: a 25-column modern format with separate income/expense columns, and an older 6-column format. The header row is located dynamically by scanning for keyword markers. |
load_bank3() |
Bank3/all.csv |
CSV export. Only State == "COMPLETED" rows are kept. The Type column (Topup, ATM, Card Refund, Exchange) is used to assign categories directly, bypassing keyword matching. |
After loading, all three DataFrames are concatenated into a single unified schema:
| Column | Type | Description |
|---|---|---|
date |
datetime64 |
Transaction date, timezone-naive |
concept |
str |
Merchant name or transaction description |
amount |
float |
EUR amount — positive for income, negative for expense |
balance |
float |
Account balance after the transaction |
bank |
str |
Source bank identifier (Bank1, Bank2, Bank3) |
category |
str |
Spending category |
Every transaction concept is matched against an ordered list of keyword rules. The first matching rule wins and assigns the category; unmatched transactions fall back to "other". The 30 built-in categories cover everything from supermarket and restaurant to investment, loan, and rental income.
A transaction ID (tx_id) is computed as an MD5 hash of bank | date | concept | amount. This stable identifier lets user-assigned category overrides survive across re-imports: overrides are persisted in category_overrides.json and re-applied on top of the auto-categorisation every time the app loads.
apply_filters() restricts the DataFrame to the sidebar date range and, optionally, hides own inter-bank transfers. The filtered data is then aggregated in memory for each chart:
- Balance over time — last balance per calendar day (step-function line chart)
- Expenses by category — absolute sum of negative amounts grouped by category (donut)
- Monthly cash flow — net sum of all amounts grouped by calendar month (bar chart)
- Overview breakdowns — same aggregations across all banks simultaneously
predict_next_month(df) is called once per bank tab with the bank's full unfiltered history and returns predicted income and expense totals for each category for the coming month. See the Machine Learning model section for full details.
flowchart TD
subgraph FILES["📁 Input files"]
F1["Bank1/*.xlsx\none file per year"]
F2["Bank2/2024-2026.xlsx\n25-column format"]
F3["Bank3/all.csv\nCSV export"]
F4["categories.json"]
F5["category_overrides.json"]
end
subgraph LOADER["🔄 data_loader.py"]
L1["load_bank1()\ncolumn auto-detect\ndate-format handling"]
L2["load_bank2()\nheader-row scan\nincome − expense cols"]
L3["load_bank3()\nState==COMPLETED\nType → category map"]
CAT["_categorize(concept)\nkeyword scan → 30 cats"]
CONCAT["load_all()\npd.concat + type coercion\nunified schema"]
end
subgraph APP["🖥️ app.py — Streamlit"]
OVER["apply_overrides()\ntx_id MD5 lookup"]
FILT["apply_filters()\ndate range · hide transfers"]
subgraph VIZ["📊 Plotly charts"]
V1["Balance timeline"]
V2["Expenses donut"]
V3["Monthly cash flow"]
V4["Overview — all banks"]
end
EDIT["data_editor\nuser edits → overrides.json"]
end
subgraph PRED["🤖 predictor.py"]
P1["Monthly totals\nper category × sign"]
P2["_extrapolate\nnp.polyfit deg=1\nor mean if < 4 pts"]
P3["Predicted income\n& expense charts"]
end
F1 --> L1
F2 --> L2
F3 --> L3
L1 --> CAT
L2 --> CAT
L3 --> CAT
CAT --> CONCAT
F4 --> APP
CONCAT --> OVER
F5 --> OVER
OVER --> FILT
FILT --> V1
FILT --> V2
FILT --> V3
FILT --> V4
OVER --> P1
P1 --> P2
P2 --> P3
EDIT --> F5
| Library | Role in this project |
|---|---|
| Streamlit | Turns a plain Python script into an interactive web app. Provides tabs, sidebar filters, the editable data table, and the reactive re-run model — no HTML or JavaScript needed. |
| Pandas | Core data manipulation layer. Used for reading Excel/CSV files, concatenating multi-bank DataFrames, groupby aggregations for charts, period-based resampling, and applying category overrides via row-level lookups. |
| Plotly Express | Interactive chart library. All charts (line, bar, donut/pie, horizontal bar) are rendered with Plotly and embedded in Streamlit. Hover tooltips and zoom come for free. |
| NumPy | Numerical backbone of the prediction module. np.polyfit fits the least-squares linear model; np.arange generates the time index. No heavy ML framework is needed. |
| openpyxl | Reads modern .xlsx files (Bank1 yearly exports, Bank2 multi-year export). Used transparently by pd.read_excel. |
| xlrd | Reads legacy .xls files. Some bank exports still use the older binary format; xlrd handles those transparently alongside openpyxl. |
| hashlib (stdlib) | Generates the stable MD5 tx_id per transaction so that user category overrides survive data re-imports without a database. |
| pathlib (stdlib) | Cross-platform file path handling for locating bank data folders and JSON config files relative to the script location. |
| json (stdlib) | Reads and writes categories.json and category_overrides.json — the two persistence files that survive app restarts. |
For each (bank, category, sign) triplet — e.g. Bank1 / supermarket / expense — the predictor:
- Collects all completed calendar months in the last
LOOKBACK_MONTHS = 24months where that category had at least one transaction. - Computes the monthly total for that category (income and expenses are treated separately).
- Fits a degree-1 polynomial (straight line) to those totals using
np.polyfit, which minimises the sum of squared residuals — equivalent to Ordinary Least Squares regression with a single time-index feature. - Extrapolates one step ahead: the predicted value is
slope × (n + 1) + intercept, clamped to zero to prevent negative predictions. - If fewer than
TREND_MIN = 4months of data exist for the category, the model falls back to the arithmetic mean — a sensible prior when there is too little evidence for a trend.
| Property | Benefit for this use case |
|---|---|
| No external ML dependency | Only NumPy is required — already a transitive dependency of Pandas. |
| Interpretable | The slope directly expresses whether spending is growing or shrinking in euros per month. |
| Low data requirement | Works with as few as 4 monthly observations — important for categories like flight or hotel that appear rarely. |
| Robust to irregular cadence | Because the input is a pre-aggregated monthly total rather than raw daily amounts, daily noise is averaged out before fitting. |
| No overfitting risk | A single-parameter linear model cannot overfit a short time series, which is the dominant risk with personal finance data (small N, high variance). |
More complex models (ARIMA, Prophet, LSTM) would require substantially more historical data per category, hyperparameter tuning, and additional dependencies, with limited accuracy gain on a dataset of this size.
| Parameter | Value | Effect |
|---|---|---|
LOOKBACK_MONTHS |
24 |
Training window. Keeps predictions recent; drops data older than 2 years. |
MIN_MONTHS |
2 |
Minimum monthly data points to produce any prediction for a category. |
TREND_MIN |
4 |
Minimum points to fit a trend line; below this the mean is used instead. |
deg |
1 |
Polynomial degree passed to np.polyfit — enforces a straight-line fit. |
Replace the files in Bank1/, Bank2/, Bank3/ with your real exports.
The loaders detect column names automatically, so minor format variations (different column names, date formats) are handled without code changes.
To add a fourth bank, add a new loader function in data_loader.py following the same pattern and include it in load_all().
The Bank1/, Bank2/, Bank3/ folders in this repository contain synthetic dummy data generated by generate_dummy_data.py. No real financial data is included.
This section provides a structured checklist for review by an IT expert and a personal-finance subject-matter expert.
- Cost & resource minimization — The project runs entirely locally with no cloud services, APIs, or paid libraries. Monthly cost is $0.
- IT architecture — Clean three-file separation (
data_loader.py,predictor.py,app.py). Streamlit is appropriate for a personal, single-user dashboard. No over-engineering for the stated use case. - Code efficiency — MD5 is used for
tx_idgeneration; adequate for personal finance but not collision-resistant at scale. The full dataset reloads on every Streamlit interaction (no persistent server state), which is acceptable for a few thousand transactions. - Cybersecurity — Real bank files are handled locally and never uploaded. Dummy data is committed to the repo. The Streamlit dashboard has no authentication — anyone with localhost access can view all financial data.
category_overrides.jsonis gitignored. - Readability & maintainability — The 30-category keyword list in
_categorize()may become hard to maintain as spending habits change. The loader's keyword-based column detection is a practical approach but fragile if bank export formats change significantly. - ML / predictive model adequacy — OLS linear regression (
np.polyfit, degree 1) is simple, interpretable, and well-suited to a personal dataset of this size. The graceful fallback to arithmetic mean for categories with fewer than 4 data points prevents overfitting on sparse data. - Data privacy — Financial data remains strictly local. No data is transmitted to external services. The dummy data generator ensures the repo contains no real transaction information.
- Other — No input validation on bank export file formats; future format changes by banks may silently produce wrong results. The 24-month lookback window (
LOOKBACK_MONTHS) may exclude relevant historical context for infrequent spending categories.
| Audit Item | Claude's Assessment | Human Expert Assessment |
|---|---|---|
| Cost & resource minimization | $0/month. No cloud dependencies. Optimal for a personal-use tool. | |
| IT architecture | Clean separation of concerns. Streamlit is appropriate. No unnecessary complexity. | |
| Code efficiency | MD5 tx_id is adequate. Full reload per interaction is acceptable at this data volume. | |
| Cybersecurity | No authentication on dashboard. Real data stays local. No cloud transmission. | |
| Readability & maintainability | 30-category keyword list may drift from reality. Bank format detection is pragmatic but fragile. | |
| ML / predictive model adequacy | Linear regression with mean fallback is the right level of complexity for personal finance. | |
| Data privacy | Financial data never leaves the local machine. Repo contains only synthetic data. | |
| Other | No format-change detection for bank exports. Infrequent categories may have insufficient history for trend detection. |



