A small fine-tuned language model (Qwen3.5-4B + LoRA) with a FAISS+BM25 retrieval layer, deployed locally via Ollama. Helps users write microdata.no scripts and look up SSB variable metadata.
The RAG index is pre-built and shipped in this repo. The model weights live on Hugging Face. No training, no scraping, no Docker required.
```bash
git clone https://github.com/forlop/microdata-no-copilot
cd microdata-no-copilot
pip install -r requirements.txt streamlit

# Install Ollama (one-time, OS-specific):
# Linux/WSL: curl -fsSL https://ollama.com/install.sh | sh
# macOS: brew install ollama (or download from ollama.com)
# Windows: download OllamaSetup.exe from ollama.com

ollama pull hf.co/forlop/microdata-copilot-v2:Q4_K_M
ollama create microdata-copilot -f deploy/Modelfile
streamlit run rag/app.py
```

The pull grabs the GGUF (~2.7 GB) from Hugging Face; `ollama create` then applies the SYSTEM prompt, refusal few-shots, and stop-token parameters from `deploy/Modelfile` — without this step the model bleeds `<|endoftext|>` tokens and loops.
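For orientation, the Modelfile wires those pieces together roughly like this. This is an illustrative sketch only: the `FROM` target matches the pull above, but the stop tokens, SYSTEM text, and refusal few-shot here are placeholders; the authoritative version is `deploy/Modelfile`.

```
# Illustrative Modelfile sketch -- deploy/Modelfile is the authoritative version.
FROM hf.co/forlop/microdata-copilot-v2:Q4_K_M

# Stop tokens keep generation from bleeding past end-of-turn (placeholder values).
PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|im_end|>"

# Placeholder system prompt.
SYSTEM """You are a microdata.no scripting assistant. Answer questions about
microdata.no scripts and SSB variable metadata; decline anything else."""

# Refusal few-shot (placeholder) so the model declines out-of-scope requests.
MESSAGE user Write me a poem about winter.
MESSAGE assistant I can only help with microdata.no scripts and SSB variable metadata.
```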
Streamlit prints a local URL (http://localhost:8501); open it in your browser and ask a microdata.no question. On CPU expect ~10-15 s per response; on a recent GPU, ~1-2 s.
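You can also hit the served model directly over Ollama's standard REST API, bypassing Streamlit; note this skips the retrieval layer in `rag/`, so no RAG context is injected. A minimal sketch (the question is just an example):

```python
# Query the model served by Ollama directly (no Streamlit, no RAG context).
# Assumes `ollama create microdata-copilot ...` has already been run.
import json
import urllib.request

payload = {
    "model": "microdata-copilot",
    "prompt": "How do I look up metadata for an SSB income variable?",
    "stream": False,  # one JSON object instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```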
Model weights: huggingface.co/forlop/microdata-copilot-v2 (q4_k_m GGUF, 2.7 GB).
| Document | What it is | For whom |
|---|---|---|
| `TECHNICAL_NOTE.md` | Comprehensive technical record — architecture, design choices, evaluation, lessons, deployment story. Has a reader's preamble with a glossary for non-ML-expert readers. | SSB partners, future maintainers, anyone evaluating the project |
| `IMPLEMENTATION_PLAN.md` | The original forward-looking plan (mostly historical now) | Reference |
| `*/README.md` | Per-phase technical detail (`scrape/`, `cards/`, `train/`, `eval/`, `rag/`, `deploy/`) | Reproducers, contributors |
If you read only one file, make it `TECHNICAL_NOTE.md`; its §F has a reading guide that points to the right sections by audience and goal.
| Folder | Purpose |
|---|---|
| `scrape/` | Variable / example / manual scrapers (re-runnable, version-aware) |
| `cards/` | Training-card generation from scraped sources |
| `train/` | Unsloth QLoRA training + merge + GGUF export |
| `eval/` | Eval sets (v1 iteration + v2 held-out + adversarial) + three scorers (substring, LLM-judge, syntax-validator) |
| `rag/` | FAISS + BM25 index build + retrieval + Ollama wrapper (see the sketch after this table) |
| `deploy/` | Ollama `Modelfile` |
| `configs/` | YAML configs for scrape and train |
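The `rag/` layer pairs dense (FAISS) and lexical (BM25) retrieval over the cards. A minimal sketch of that hybrid pattern, assuming sentence-transformers embeddings, the `rank_bm25` package, a 50/50 score blend, and toy documents (none of which is guaranteed to match the actual `rag/` code):

```python
# Hybrid dense + lexical retrieval sketch. Library choices (sentence-transformers,
# rank_bm25), the 50/50 blend, and the toy documents are assumptions, not the
# repo's exact setup.
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "WLONN: annual wage income in NOK (variable card)",     # toy card
    "KJOENN: sex of person, categorical (variable card)",   # toy card
    "Manual section: importing variables into a dataset",   # toy manual chunk
]

# Dense index: normalized embeddings, so inner product equals cosine similarity.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = encoder.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

# Lexical index over lowercase whitespace tokens.
bm25 = BM25Okapi([d.lower().split() for d in docs])

def minmax(x: np.ndarray) -> np.ndarray:
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True).astype("float32")
    dense_scores, ids = index.search(q, len(docs))  # rank the whole toy corpus
    dense = np.zeros(len(docs))
    dense[ids[0]] = dense_scores[0]
    lexical = np.asarray(bm25.get_scores(query.lower().split()))
    # Min-max normalize both score sets so the two scales are comparable.
    fused = 0.5 * minmax(dense) + 0.5 * minmax(lexical)
    return [docs[i] for i in np.argsort(fused)[::-1][:k]]

print(retrieve("wage income variable"))
```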
| Path | Holds |
|---|---|
| `D:\Work\microdata_LoRA\repo\v2\` | This codebase (~3,500 lines Python + Markdown) |
| `D:\Work\microdata_LoRA\data_raw\` | Scraped JSON, manual text, PDF (~12 MB) |
| `D:\Work\microdata_LoRA\data_processed\` | Cards JSONL, FAISS index, embeddings (~80 MB) |
| `D:\Work\microdata_LoRA\models\` | LoRA adapters, merged safetensors, GGUFs (~12 GB total) |
| `D:\Work\microdata_LoRA\logs\` | Training + eval logs |
| Dropbox `microdata_LoRA\archive_v1\` | Frozen v1 reference (read-only) |
- v2.0 LoRA adapter trained on Qwen3.5-4B base, quantized to q4_k_m GGUF (2.7 GB)
- Served via Ollama as `microdata-copilot` on `localhost:11434`
- FAISS + BM25 indexes built over 729 variables + ~100 manual sections + 40 examples
- Internal iteration eval: 82.6% pass rate (lenient substring scoring; sketched below)
- Strict held-out + LLM-judge eval: 53.8% pass rate — this is the honest measurement
- 100% jailbreak resistance, 80% RAG-class on held-out, 0% on stale-fact probes (calibration weakness)
See TECHNICAL_NOTE.md §17 and §21 for the full results breakdown.
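For intuition on why the two pass rates differ: lenient substring scoring counts an answer as a pass if it contains the expected key strings. A minimal sketch of that idea (the actual `eval/` scorers, including the LLM judge and syntax validator, are more involved):

```python
# Lenient substring scorer sketch; the actual eval/ scorers are more involved.
def substring_pass(answer: str, expected: list[str]) -> bool:
    """Pass if every expected key string appears in the answer (case-insensitive)."""
    answer_lower = answer.lower()
    return all(s.lower() in answer_lower for s in expected)

# A check like this rewards answers that merely mention the right identifiers,
# which inflates pass rates relative to the strict held-out + LLM-judge eval.
print(substring_pass("You probably want the WLONN variable.", ["wlonn"]))  # True
```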
End-to-end on a 16 GB GPU machine: ~3 hours of compute + ~30 min human attention. See per-phase READMEs:
- `scrape/README.md` — Phase 1 (~25 min)
- `cards/` (no README; see `generate_cards_v22.py` source comments) — Phase 2 (~1 min)
- `train/README.md` — Phase 3 (~1.5 h train + ~5 min export)
- `eval/README.md` — Phase 4
- `rag/README.md` — Phase 5 (~2 min after Phase 1)
- `deploy/README.md` — Phase 6 (~1 min)