
microdata.no copilot — v2

A small fine-tuned language model (Qwen3.5-4B + LoRA) with a FAISS+BM25 retrieval layer, deployed locally via Ollama. Helps users write microdata.no scripts and look up SSB variable metadata.

Try it — 5-step demo (CPU or GPU, ~10 min)

The RAG index is pre-built and shipped in this repo. The model weights live on Hugging Face. No training, no scraping, no Docker required.

git clone https://github.com/forlop/microdata-no-copilot
cd microdata-no-copilot
pip install -r requirements.txt streamlit

# Install Ollama (one-time, OS-specific):
#   Linux/WSL:  curl -fsSL https://ollama.com/install.sh | sh
#   macOS:      brew install ollama   (or download from ollama.com)
#   Windows:    download OllamaSetup.exe from ollama.com

ollama pull hf.co/forlop/microdata-copilot-v2:Q4_K_M
ollama create microdata-copilot -f deploy/Modelfile
streamlit run rag/app.py

The pull downloads the ~2.7 GB GGUF from Hugging Face; ollama create then applies the SYSTEM prompt, refusal few-shots, and stop-token parameters from deploy/Modelfile. Without this step the model emits raw <|endoftext|> tokens and loops.
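As a rough illustration of what the Modelfile's stop parameters do: Ollama truncates generation at the first stop sequence, so the special tokens never reach the user. This is a sketch only; the stop strings below are assumed, and the real truncation happens inside the Ollama server, not in client code.

```python
# Assumed stop sequences; the actual values live in deploy/Modelfile.
STOP_SEQUENCES = ["<|endoftext|>", "<|im_end|>"]

def apply_stops(raw_output: str, stops=STOP_SEQUENCES) -> str:
    """Cut the generation at the earliest stop sequence, if any."""
    cut = len(raw_output)
    for s in stops:
        idx = raw_output.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return raw_output[:cut].rstrip()

print(apply_stops("Use db.create_dataset(...)<|endoftext|>Use db."))
# -> Use db.create_dataset(...)
```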

Streamlit prints a local URL (http://localhost:8501); open it in your browser and ask a microdata.no question. On CPU expect ~10-15 s per response; on a recent GPU, ~1-2 s.
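The Streamlit app is a thin client; you can also query the served model directly over Ollama's local HTTP API on port 11434. The endpoint and field names below follow Ollama's /api/generate API; the prompt is a made-up example.

```python
import json
import urllib.request

def build_request(prompt: str, model: str = "microdata-copilot") -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str) -> str:
    """Send the prompt to the locally served model. Requires Ollama running."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

payload = build_request("How do I import the INNTEKT variable?")
print(payload["model"])  # -> microdata-copilot
```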

Model weights: huggingface.co/forlop/microdata-copilot-v2 (q4_k_m GGUF, 2.7 GB).

Start here

| Document | What it is | For whom |
|---|---|---|
| TECHNICAL_NOTE.md | Comprehensive technical record: architecture, design choices, evaluation, lessons, deployment story. Includes a reader's preamble with a glossary for non-ML-expert readers. | SSB partners, future maintainers, anyone evaluating the project |
| IMPLEMENTATION_PLAN.md | The original forward-looking plan (mostly historical now) | Reference |
| */README.md | Per-phase technical detail (scrape/, cards/, train/, eval/, rag/, deploy/) | Reproducers, contributors |

If you read only one file: TECHNICAL_NOTE.md — its §F has a reading-guide that points to the right sections by audience and goal.

Quick map of the codebase

| Folder | Purpose |
|---|---|
| scrape/ | Variable / example / manual scrapers (re-runnable, version-aware) |
| cards/ | Training-card generation from scraped sources |
| train/ | Unsloth QLoRA training + merge + GGUF export |
| eval/ | Eval sets (v1 iteration + v2 held-out + adversarial) + three scorers (substring, LLM-judge, syntax-validator) |
| rag/ | FAISS + BM25 index build + retrieval + Ollama wrapper |
| deploy/ | Ollama Modelfile |
| configs/ | YAML configs for scrape and train |
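The rag/ layer fuses dense (FAISS) and sparse (BM25) scores, which live on different scales. The sketch below shows the general hybrid pattern under an assumed design (min-max normalization plus a weighted sum); the document ids and scores are made up, and the repo's actual fusion may differ.

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Rescale a score dict to [0, 1] so dense and sparse are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}

def hybrid_rank(dense: dict[str, float], sparse: dict[str, float],
                alpha: float = 0.5) -> list[str]:
    """Rank doc ids by alpha * dense + (1 - alpha) * sparse (normalized)."""
    d, s = minmax(dense), minmax(sparse)
    ids = set(d) | set(s)
    fused = {i: alpha * d.get(i, 0.0) + (1 - alpha) * s.get(i, 0.0)
             for i in ids}
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical retrieval results for one query:
dense = {"var:INNTEKT": 0.82, "man:import": 0.55, "ex:panel": 0.40}
sparse = {"man:import": 11.2, "var:INNTEKT": 7.5, "var:ALDER": 3.1}
print(hybrid_rank(dense, sparse)[0])  # -> var:INNTEKT
```

Documents missing from one ranking simply contribute 0 from that side, so a hit strong in both signals outranks a hit strong in only one.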

Where data and models live (not in this repo — too big for sync)

| Path | Holds |
|---|---|
| D:\Work\microdata_LoRA\repo\v2\ | This codebase (~3,500 lines Python + Markdown) |
| D:\Work\microdata_LoRA\data_raw\ | Scraped JSON, manual text, PDF (~12 MB) |
| D:\Work\microdata_LoRA\data_processed\ | Cards JSONL, FAISS index, embeddings (~80 MB) |
| D:\Work\microdata_LoRA\models\ | LoRA adapters, merged safetensors, GGUFs (~12 GB total) |
| D:\Work\microdata_LoRA\logs\ | Training + eval logs |
| Dropbox microdata_LoRA\archive_v1\ | Frozen v1 reference (read-only) |

Current status (deployed)

  • v2.0 LoRA adapter trained on Qwen3.5-4B base, quantized to q4_k_m GGUF (2.7 GB)
  • Served via Ollama as microdata-copilot on localhost:11434
  • FAISS + BM25 indexes built over 729 variables + ~100 manual sections + 40 examples
  • Internal iteration eval: 82.6% pass rate (lenient substring scoring)
  • Strict held-out + LLM-judge eval: 53.8% pass rate — this is the honest measurement
  • 100% jailbreak resistance; 80% on RAG-class held-out questions; 0% on stale-fact probes (a calibration weakness)
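Much of the gap between the 82.6% lenient and 53.8% strict numbers comes from the scorer, not the model. A lenient substring check of the shape sketched below (an illustration, not the repo's actual scorer) passes any answer that contains the expected fragments, however verbose or hedged the rest of the answer is:

```python
def substring_pass(answer: str, expected_fragments: list[str]) -> bool:
    """Lenient check: pass if every expected fragment appears verbatim
    (case-insensitive) anywhere in the answer."""
    a = answer.lower()
    return all(f.lower() in a for f in expected_fragments)

# A rambling, partly wrong answer can still pass the lenient check:
print(substring_pass(
    "You could maybe use create_dataset, or possibly something else.",
    ["create_dataset"]))  # -> True
```

A strict LLM-judge scorer instead grades the whole answer for correctness, which is why it is the more honest measurement.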

See TECHNICAL_NOTE.md §17 and §21 for the full results breakdown.

Reproducing this from scratch

End-to-end on a 16 GB GPU machine: ~3 hours of compute + ~30 min human attention. See the per-phase READMEs in scrape/, cards/, train/, eval/, rag/, and deploy/ for details.

About

Locally-deployed AI copilot for microdata.no (SSB). Fine-tuned Qwen3.5-4B + LoRA + FAISS/BM25 RAG, served via Ollama. Research preview — see TECHNICAL_NOTE.md.
