A tiny benchmark for continual-learning architectures, plus a production-validation suite on real-world intent classification across four datasets.
Two components:
- Synthetic benchmark — symbolic, closed-world vocabulary where pretrained shortcuts are impossible. 100 invented-word facts across 10 lessons, three question types (recall, composition, negation), full forgetting curves.
- BANKING77 extended series (50+ experiments) — 77-class intent classification with frozen pretrained encoders + Mahalanobis prototype. Validated on CLINC150 (150 classes), HWU64 (64 classes), AG News (4-class topic).
| Configuration | Official Split | 5-fold CV (honest) | Forgetting |
|---|---|---|---|
| Naive MLP (catastrophic baseline) | 32.5% | — | severe |
| Replay buffer (cap=200) | 68.9% | — | moderate |
| Frozen MiniLM + Mahalanobis | 90.5% | — | zero |
| Frozen mpnet + Mahalanobis | 92.8% | — | zero |
| 3-encoder concat + Mahalanobis | 94.22% | 93.33% ± 0.58pp | zero |
| Class merge (top-30 pairs → 49 groups) | 96.10% | — | zero (but redefines task) |
| Multi-label top-2 (T=0.05) | 97.73% strict | — | zero |
Important honesty note: the official BANKING77 test set is systematically 4.36pp easier than the training set (verified by train/test role swap). The honest cross-validation number is 93.33%, not 94.22%. Report both.
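The prototype pipeline behind the table rows above can be sketched in a few lines: one mean per class plus a single shared, shrinkage-regularised covariance, classifying by nearest squared Mahalanobis distance. A minimal sketch, assuming random Gaussian blobs stand in for frozen-encoder embeddings; the repo's actual implementation may differ in detail:

```python
import numpy as np

class MahalanobisPrototype:
    """Nearest-class-mean classifier under one shared, shrunk covariance.
    Sketch only: in the repo the inputs are frozen sentence-encoder embeddings."""

    def __init__(self, shrinkage=1e-4):
        self.shrinkage = shrinkage
        self.means = {}

    def fit(self, X, y):
        for c in np.unique(y):
            self.means[c] = X[y == c].mean(axis=0)
        # pooled within-class covariance, shrunk toward the identity
        centered = np.vstack([X[y == c] - self.means[c] for c in self.means])
        cov = centered.T @ centered / len(X)
        self.precision = np.linalg.inv(cov + self.shrinkage * np.eye(X.shape[1]))
        return self

    def predict(self, X):
        classes = list(self.means)
        # squared Mahalanobis distance to every class mean
        d = np.stack([np.einsum('nd,de,ne->n', X - self.means[c],
                                self.precision, X - self.means[c])
                      for c in classes], axis=1)
        return np.array(classes)[d.argmin(axis=1)]

# toy demo: three well-separated Gaussian "classes"
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (50, 8)) for m in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 50)
acc = (MahalanobisPrototype().fit(X, y).predict(X) == y).mean()
```

The shrinkage term (1e-4, the value the tuning sweep below found optimal) keeps the pooled covariance invertible when the embedding dimension is large relative to per-class sample counts.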
| Dataset | Classes | Catastrophic | Replay | Prototype (best) |
|---|---|---|---|---|
| BANKING77 | 77 | 32.5% | 68.9% | 94.22% |
| CLINC150 | 150 | 60.9% | 80.1% | 95.56% |
| HWU64 | 64 | 62.8% | 83.3% | 90.61% |
| AG News (topic) | 4 | 66.7% | 82.0% | 90.22% |
The ranking prototype >> replay >> catastrophic holds across all four datasets and three task types. The gap narrows as class count drops: on AG News (4 classes) the prototype leads replay by only ~8pp.
| Config | Cold start | E2E@1 | QPS@128 | Memory |
|---|---|---|---|---|
| MiniLM (384d) | 54ms | 24ms | 455 | 87MB + 0.7MB |
| mpnet (768d) | 903ms | 153ms | 32 | 418MB + 2.5MB |
| 3-enc concat (2176d) | 12.3s | 627ms | 9 | 700MB + 18MB |
Adding a new class takes 530ms for 100 examples. Mahalanobis prediction itself is ~1ms; the encoder dominates end-to-end latency.
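The 530ms add-a-class figure follows from the math: each class contributes only its count, feature sum, and outer-product sum, so adding a class is O(class size) and never revisits old data. A hedged sketch of that bookkeeping (the repo's exact update rule may differ); note the predictions come out identical regardless of arrival order:

```python
import numpy as np

class StreamingMahalanobis:
    """Prototype classifier rebuilt from streaming sufficient statistics.
    Sketch only: the repo's exact bookkeeping may differ."""

    def __init__(self, dim, shrinkage=1e-4):
        self.dim, self.shrinkage = dim, shrinkage
        self.n, self.s = {}, {}          # per-class count and feature sum
        self.xx = np.zeros((dim, dim))   # running sum of x x^T over all data
        self.total = 0

    def add_class(self, c, X):
        # O(|X|) update; old data is never touched, so class order cannot matter
        self.n[c], self.s[c] = len(X), X.sum(axis=0)
        self.xx += X.T @ X
        self.total += len(X)

    def predict(self, X):
        # pooled within-class scatter = sum x x^T - sum_c n_c mu_c mu_c^T
        within = self.xx - sum(np.outer(self.s[c], self.s[c]) / self.n[c]
                               for c in self.n)
        P = np.linalg.inv(within / self.total + self.shrinkage * np.eye(self.dim))
        classes = list(self.n)
        mus = [self.s[c] / self.n[c] for c in classes]
        d = np.stack([np.einsum('nd,de,ne->n', X - m, P, X - m) for m in mus],
                     axis=1)
        return np.array(classes)[d.argmin(axis=1)]

# same three classes added in two different orders -> identical predictions
rng = np.random.default_rng(1)
data = {c: rng.normal(float(c), 0.3, (50, 8)) for c in range(3)}
a, b = StreamingMahalanobis(8), StreamingMahalanobis(8)
for c in (0, 1, 2):
    a.add_class(c, data[c])
for c in (2, 0, 1):
    b.add_class(c, data[c])
Xte = np.vstack([data[c] for c in range(3)])
order_invariant = bool((a.predict(Xte) == b.predict(Xte)).all())
acc = (a.predict(Xte) == np.repeat(np.arange(3), 50)).mean()
```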
| Scorer | AUROC | ID accept @ 90% OOD reject |
|---|---|---|
| Global distance (baseline) | 0.78 | 52% |
| Margin (d2 - d1) | 0.85 | 52% (but at lower threshold) |
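The margin scorer in the table replaces "distance to the nearest class" with "gap between nearest and runner-up": in-distribution queries have a clear winner (large margin), while OOD queries sit roughly equidistant from everything (small margin). A toy sketch with an identity metric standing in for Mahalanobis distances, using scikit-learn's `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
protos = np.array([[0.0, 0.0], [4.0, 0.0]])        # two class prototypes
x_id = np.vstack([p + rng.normal(0, 0.3, (100, 2)) for p in protos])
x_ood = np.array([2.0, 4.0]) + rng.normal(0, 0.3, (100, 2))  # equidistant cloud

def dists(X):
    # squared distance to each prototype (identity metric for the sketch)
    return ((X[:, None, :] - protos[None]) ** 2).sum(-1)

d = np.sort(dists(np.vstack([x_id, x_ood])), axis=1)
is_ood = np.r_[np.zeros(200), np.ones(100)]

auc_global = roc_auc_score(is_ood, d[:, 0])               # baseline: nearest distance
auc_margin = roc_auc_score(is_ood, -(d[:, 1] - d[:, 0]))  # small margin = ambiguous = OOD
```

On this toy both scorers separate cleanly; the table's point is that on real embeddings the margin survives cases where an OOD query lands moderately close to one prototype.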
What moved the needle:
- Better encoders: MiniLM → mpnet (+2.24pp); adding e5-large (+0.94pp)
- Shrinkage tuning: 1e-4 optimal (+0.13pp)
- ICDA embedding augmentation: +0.36pp (vanishes with better encoders)
- Margin-based OOD scoring: AUROC 0.78 → 0.85
- Multi-label via distance thresholds: 94% top-1 → 97.7% top-2
- Class merging on ambiguous pairs: 94.22% → 96.10% (at 49 groups)
What didn't work (each tested and falsified):
- LoRA fine-tuning: -24.5pp even at the best of 9 configs
- Multi-prototype Mahalanobis (k=3): -0.7pp
- Hard-margin replay: -3.6pp vs random reservoir
- Label noise cleanup: -0.2pp (volume > purity)
- PCA/LDA projection: -3.1pp
- Hierarchical classification: -1.8pp
- Weighted encoder fusion: -2.2pp
- kNN-Mahalanobis (100 neighbors): -2.6pp
- Cross-encoder reranking: -21pp (STS objective anti-correlates with intent)
- LLM cascade (Qwen 2.5 3B): -3.8pp
- Fine-tuned mpnet contrastive (oracle): +0.03pp on concat
- Bigger encoder bge-large alone: -0.6pp
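The multi-label entry in the list above can be sketched as a temperature softmax over negative distances, keeping the runner-up label only when it receives non-trivial mass. T=0.05 is the table's temperature; the `p_min` cutoff below is illustrative, not the repo's tuned rule:

```python
import numpy as np

def multilabel_predict(dists, T=0.05, k=2, p_min=0.2):
    """Turn per-class distances into up to k labels.
    Softmax over negative distances with temperature T; keep the
    runner-up only if its probability clears p_min (illustrative value)."""
    z = -np.asarray(dists, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)            # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    order = np.argsort(-p, axis=-1)[..., :k]
    labels = []
    for row_p, row_o in zip(p, order):
        keep = [row_o[0]] + [c for c in row_o[1:] if row_p[c] >= p_min]
        labels.append(keep)
    return labels

# clear winner -> one label; near-tie -> two labels
labels = multilabel_predict([[0.10, 0.90, 1.50],
                             [0.10, 0.12, 1.50]])
```

In the first row the runner-up is 0.8 distance units behind, so its softmax mass is negligible; in the second row the top two are 0.02 apart and both survive.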
A late-cycle audit revealed we never tested vanilla LinearSVC on the 2176-d concat features. Result: LinearSVC matches or slightly beats Mahalanobis under joint training (94.29% official / 94.36% CV vs 94.22% / 93.33%).
The honest narrative is narrower than "Mahalanobis is magic":
| Training mode | Classifier | Accuracy |
|---|---|---|
| Joint (all 77 classes at once) | LinearSVC | 94.36% CV |
| Joint (all 77 classes at once) | Mahalanobis | 93.33% CV |
| Class-incremental | Incremental OvR LogReg | 90.52% |
| Class-incremental | Mahalanobis | 94.22% |
Mahalanobis doesn't beat everything. It holds up under class-incremental constraints while naive linear classifiers lose 3.84pp. The mathematical invariances (order, granularity, streaming sufficient statistics) are the actual contribution. Under joint training, Mahalanobis and LinearSVC both hit the frozen-feature ceiling; it's not about discrimination, it's about CIL deployability.
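The joint-vs-incremental gap can be reproduced in miniature: a one-vs-rest linear scorer trained class-incrementally never sees future classes as negatives, and its per-class scores are not calibrated against each other; a prototype classifier is immune by construction because each class's statistics are computed independently. A toy sketch, with Euclidean nearest-mean standing in for Mahalanobis and all values illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim, per = 16, 60
centers = rng.normal(0, 2.0, (6, dim))
Xs = [c + rng.normal(0, 0.6, (per, dim)) for c in centers]

# Class-incremental OvR: when class c arrives, fit a binary scorer using
# only previously seen data as negatives (placeholder noise for class 0).
scorers, seen = [], []
for Xc in Xs:
    negs = np.vstack(seen) if seen else rng.normal(0, 2.0, (per, dim))
    Xb = np.vstack([Xc, negs])
    yb = np.r_[np.ones(len(Xc)), np.zeros(len(negs))]
    scorers.append(LogisticRegression(max_iter=1000).fit(Xb, yb))
    seen.append(Xc)

Xte = np.vstack(Xs)
yte = np.repeat(np.arange(6), per)
ovr = np.stack([s.decision_function(Xte) for s in scorers], axis=1).argmax(1)

# Prototype classifier: per-class means, order- and stream-invariant
mus = np.stack([X.mean(0) for X in Xs])
proto = ((Xte[:, None, :] - mus[None]) ** 2).sum(-1).argmin(1)

acc_ovr = (ovr == yte).mean()
acc_proto = (proto == yte).mean()
```

The prototype accuracy is unaffected by the streaming setup; the incremental OvR scorer typically lags because its per-class decision functions were fit against different negative sets.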
Published BANKING77 SOTA (~95-96%) requires full gradient fine-tuning of the encoder. Our approach at 93.33% CV uses zero gradient training, has zero forgetting, is order-invariant, and deploys in 5MB. For production CIL where new intents must be added continuously without forgetting, our approach dominates; for pure accuracy without CIL constraints, fine-tuning still wins.
```bash
pip install -r requirements.txt

# Synthetic benchmark
python run.py train --arch catastrophic_mlp --seed 42
python run.py leaderboard

# BANKING77 experiments
PYTHONPATH=. python -m experiments.banking77                 # core 5-architecture baseline
PYTHONPATH=. python -m experiments.banking77_3enc            # 3-encoder concat (94.22%)
PYTHONPATH=. python -m experiments.banking77_variance_audit  # honest error bars
PYTHONPATH=. python -m experiments.banking77_latency         # production latency
PYTHONPATH=. python -m experiments.clinc150                  # CLINC150 replication
PYTHONPATH=. python -m experiments.ag_news                   # non-intent cross-task
```

Repository layout:

```
tasks.py                        Fact / Lesson / Question / Answer / Dataset dataclasses
data.py                         synthetic dataset generator
evaluation.py                   class-incremental harness
recorder.py                     append-only event log
train.py                        drive one architecture end-to-end
architectures/                  20+ LearningSystem implementations
experiments/                    BANKING77 + CLINC150 + HWU64 + AG News
  banking77.py                  core + class-incremental eval
  banking77_3enc.py             3-encoder concat (94.22% baseline)
  banking77_variance_audit.py   honest error bars
  banking77_latency.py          production latency/memory
  banking77_merge_deep.py       class merging to 96.10%
  banking77_multilabel.py       top-k distance thresholds
  banking77_ood_fix.py          margin-based OOD (AUROC 0.85)
  banking77_lora_sweep.py       fair LoRA falsification
  clinc150.py                   CLINC150 cross-dataset
  hwu64.py                      HWU64 cross-dataset
  ag_news.py                    non-intent cross-task
  ...                           40+ experiment scripts total
paper/paper.md                  full research paper draft
docs/                           GitHub Pages site
runs/                           immutable synthetic-benchmark run directories
```
MIT License. See LICENSE.
- GitHub Pages: https://danlex.github.io/nanolearn/
- Custom domain (DNS pending): https://nanolearn.alexandrudan.com