ConvTag is a conversation labeling and active-learning platform for AI teams. It captures every exchange between users and an AI assistant, automatically predicts what type of interaction it was, lets humans correct those predictions, and retrains the classifier on that feedback — closing the loop continuously.
```
convtag/
├── tagger/          FastAPI backend — storage, LLM, pipeline, training, export
├── chat/            Next.js frontend — chat, review, analytics, settings
├── data/            SQLite database (auto-created) + settings.json + model files
├── tests/           End-to-end backend tests
├── seed_data.csv    Optional bootstrap training data (text, tag)
└── .env.local       Local environment config (copy from .env.example)
```
The core loop:

- A user message creates conversation context.
- The assistant (OpenAI or Anthropic) produces a reply.
- That reply is automatically tagged by the pipeline.
- A reviewer confirms or corrects the label in the Queue.
- Reviewed labels become training examples.
- One click retrains the embedding classifier — the new model is picked up immediately.
The tagging pipeline runs five stages in order; the first stage to accept wins:
| Stage | What it does |
|---|---|
| Rule check | Keyword + regex patterns from rules.yaml — instant, no model |
| Embedding classifier | LogisticRegression on text embeddings (TF-IDF or OpenAI) |
| Heuristic | Label-keyword matching from labels.py |
| LLM classifier | Zero-shot classification via your configured LLM |
| Fallback | Returns unknown with low confidence |
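Conceptually, the first-accept control flow looks like the sketch below. This is an illustration, not the actual code in tagger.py: the stage bodies are stubs, and the example rule and confidence values are assumptions.

```python
from typing import Callable, Optional, Tuple

Result = Optional[Tuple[str, float]]  # (label, confidence), or None to pass

# Stub stages for illustration only; the real logic lives in tagger.py,
# rules.yaml, and labels.py.
def rule_check(text: str) -> Result:
    if "traceback" in text.lower():   # e.g. a rules.yaml keyword hit
        return ("debugging", 0.95)
    return None

def embedding_classifier(text: str) -> Result:
    return None  # would run LogisticRegression over the text embedding

def heuristic(text: str) -> Result:
    return None  # would do label-keyword matching from labels.py

def llm_classifier(text: str) -> Result:
    return None  # would make a zero-shot call to the configured LLM

STAGES: list[Callable[[str], Result]] = [
    rule_check,
    embedding_classifier,
    heuristic,
    llm_classifier,
]

def tag(text: str) -> Tuple[str, float]:
    for stage in STAGES:
        result = stage(text)
        if result is not None:   # first stage to accept wins
            return result
    return ("unknown", 0.1)      # fallback: unknown with low confidence

print(tag("Here is the traceback from my crash"))  # -> ('debugging', 0.95)
```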
Built-in labels fall into four groups:

- Task execution: `code`, `debugging`, `math`, `data_analysis`, `task_completion`, `instruction`, `planning`, `translation`
- Knowledge: `factual_qa`, `explanation`, `comparison`, `summarization`, `reasoning`
- Creative & social: `creative`, `opinion`, `roleplay`, `conversation`
- Meta: `clarification`, `refusal`, `safety`
Custom labels can be added in Settings without touching code.
| Page | What it shows |
|---|---|
| Queue | Low-confidence outputs sorted for human review; keyboard shortcuts (j/k/1–9) |
| Playground | Live conversation with the assistant; every reply is auto-tagged; load sample data |
| Import | Bulk-ingest turns from JSONL, CSV, or JSON file/paste |
| Sessions | All past sessions with label summaries |
| Session detail | Turn-by-turn review with pipeline trace |
| Training | Retrain the embedding classifier; readiness card; live metrics; version history |
| Analytics | Coverage metrics, label distribution, confidence histogram, activity timeseries |
| Export | Download JSONL or RLHF-format data with label and confidence filters |
| Settings | LLM provider, embeddings, pipeline toggles, thresholds, labels, label activity |
```powershell
Copy-Item .env.example .env.local
```

Edit `.env.local` and set at least one of:

```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```

Start the tagger backend:

```powershell
cd tagger
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
python -m uvicorn app:app --reload --port 8000
```

In a second terminal, start the chat frontend:

```powershell
cd chat
npm install
npm run dev
```

Open http://localhost:3002.
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | — | Required for OpenAI LLM or embeddings |
| `ANTHROPIC_API_KEY` | — | Required for Anthropic LLM |
| `LLM_PROVIDER` | `openai` | `openai` or `anthropic` |
| `LLM_MODEL` | `gpt-4.1-mini` | Model name for chat replies |
| `EMBEDDING_PROVIDER` | `tfidf` | `openai` or `tfidf` (local fallback) |
| `EMBEDDING_MODEL` | `text-embedding-3-small` | OpenAI embedding model |
| `LLM_CLASSIFIER_ENABLED` | `true` | Enable/disable the LLM classifier stage |
| `TAGGER_API_KEY` | — | Optional API key for the tagger service |
| `TAGGER_DATABASE_PATH` | `data/convtag.db` | SQLite database path |
| `TAGGER_MODEL_PATH` | `tagger/model.joblib` | Active classifier model path |
| `TAGGER_MODEL_DIR` | `tagger/models/` | Versioned model storage directory |
| `TAGGER_URL` | `http://127.0.0.1:8000` | Tagger URL used by the Next.js proxy |
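For instance, a minimal `.env.local` that uses OpenAI for both chat and embeddings might look like this (the key value is elided; variable names and values come from the table above):

```
OPENAI_API_KEY=sk-...
LLM_PROVIDER=openai
LLM_MODEL=gpt-4.1-mini
EMBEDDING_PROVIDER=openai
EMBEDDING_MODEL=text-embedding-3-small
```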
All variables can be changed in `.env.local`. Provider and API key changes require a tagger restart. Pipeline toggles, thresholds, and labels can be changed live from the Settings page without restarting.
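For scripted changes, the settings endpoints listed below can presumably be called directly; the field name in this sketch is a guess, not a documented schema.

```python
import requests

BASE = "http://127.0.0.1:8000"

# Read the current settings; inspect the response before writing anything back,
# since the schema is not documented here.
current = requests.get(f"{BASE}/api/settings").json()
print(current)

# Hypothetical toggle: the field name below is a guess, not a documented key.
requests.post(f"{BASE}/api/settings", json={"llm_classifier_enabled": False})
```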
Training runs in a background thread on the tagger and completes regardless of whether the browser stays open. The Training page polls for new versions and updates automatically when training finishes.
A held-out test split requires 20+ reviewed examples. With fewer, the model still trains but per-label metrics are not available.
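A headless retrain can be triggered through the API as well; the sketch below mirrors what the Training page does, but assumes `/api/model/versions` returns a JSON list, which is a guess about the response shape.

```python
import time
import requests

BASE = "http://127.0.0.1:8000"

# Kick off a retrain; it runs in a background thread on the tagger.
requests.post(f"{BASE}/api/model/train")

# Poll for a new model version, like the Training page does.
before = len(requests.get(f"{BASE}/api/model/versions").json())
for _ in range(60):  # give up after ~2 minutes
    time.sleep(2)
    if len(requests.get(f"{BASE}/api/model/versions").json()) > before:
        print("new model version is live")
        break
```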
To bootstrap from a CSV before any chat data exists:
```powershell
.\tagger\.venv\Scripts\Activate.ps1
python tagger\trainer.py seed_data.csv
```

The CSV must have `text` and `tag` columns.
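For example, a minimal `seed_data.csv` might look like this (the rows are illustrative; the tag values come from the built-in label set above):

```csv
text,tag
"Write a Python function that reverses a string.",code
"What year did the Berlin Wall fall?",factual_qa
"Summarize this paragraph in one sentence.",summarization
```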
Two export formats are available:

- JSONL: one JSON object per reviewed label, with fields `text`, `label`, `source`, `session_id`, and `turn_index`.
- RLHF: paired turns with fields `prompt`, `chosen`, `rejected`, and `label`, for reward model training.
The tagger exposes the following HTTP endpoints:

```
GET    /health
POST   /start_session
POST   /message
GET    /api/sessions
GET    /api/session/{session_id}
GET    /api/summary
GET    /api/labels/uncertain
GET    /api/labels/{label_id}
PATCH  /api/labels/batch
PATCH  /api/labels/{label_id}
POST   /api/ingest
POST   /api/ingest/sample
GET    /api/training/examples
POST   /api/model/train
GET    /api/model/versions
GET    /api/analytics/coverage
GET    /api/analytics/label_stats
GET    /api/analytics/timeseries
GET    /api/analytics/confidence
GET    /api/export/jsonl
GET    /api/export/rlhf
GET    /api/settings
POST   /api/settings
```
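A minimal end-to-end pass over these endpoints might look like the sketch below. The paths are the ones listed above, but every request and response field name (`session_id`, `text`, `id`, `label`) is an assumption, since the payload schemas are not documented here.

```python
import requests

BASE = "http://127.0.0.1:8000"

# Start a session and send one message; the reply is auto-tagged by the pipeline.
session = requests.post(f"{BASE}/start_session").json()
reply = requests.post(
    f"{BASE}/message",
    json={"session_id": session.get("session_id"),  # assumed field name
          "text": "Why is the sky blue?"},
).json()
print(reply)

# Pull low-confidence labels for review, then confirm/correct the first one.
uncertain = requests.get(f"{BASE}/api/labels/uncertain").json()
if uncertain:
    label_id = uncertain[0].get("id")  # assumed field name
    requests.patch(f"{BASE}/api/labels/{label_id}",
                   json={"label": "explanation"})  # assumed payload shape
```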
To compile-check the backend, run the test suite, and build the frontend:

```powershell
python -m py_compile tagger\app.py tagger\classifier.py tagger\config.py tagger\llm_agent.py tagger\pii.py tagger\storage.py tagger\tagger.py tagger\trainer.py
python -m unittest tests.test_system
cd chat && npm run build
```