A WhatsApp message-writing bot for Indian local-commerce merchants — dentists, salons, restaurants, gyms, pharmacies. Given who the merchant is and what just happened (a research paper, a supply alert, a perf dip, a recall, etc.), it composes the message the merchant should send their customers.
🚀 Live on Hugging Face Spaces: huggingface.co/spaces/Arjunnnnn123/vera-bot
🌐 Live API base: https://Arjunnnnn123-vera-bot.hf.space
Quick checks: /v1/healthz · /v1/metadata
Submission: magicpin AI Challenge, April 2026.
You feed the bot 4 things:
- Category — e.g. "dentists in India" (compliance rules, taboo words, what to never claim)
- Merchant — e.g. "Dr Meera, solo practitioner in South Delhi, Hindi+English"
- Trigger — e.g. "new ICMR research published on tooth-pulp regeneration"
- Customer (optional) — e.g. "Priya, last visit 8 months ago, due for a cleaning"
It returns a WhatsApp message the merchant can send out — short, on-brand, grounded in the facts above, with a real CTA. No invented prices, no fake stats, no spammy "20% off!" copy.
It also handles incoming replies — auto-reply detection, opt-out, hostile messages, intent-commit transitions.
The repo contains both the starter material magicpin handed to candidates and my actual submission. Here's the split — useful for anyone reviewing this as a portfolio piece.
These files came from the challenge team and were not modified. I only moved them into docs/ for readability.
| Path | What it is |
|---|---|
| docs/challenge-brief.md | The full spec: what to build, the 4-context framework, the rubric |
| docs/challenge-testing-brief.md | The technical contract: HTTP API shapes, judge harness behavior |
| docs/engagement-design.md | magicpin's message-design principles (compulsion levers, etc.) |
| docs/engagement-research.md | Research notes that fed the design doc |
| docs/examples/api-call-examples.md | Exact HTTP calls the judge sends + expected responses |
| docs/examples/case-studies.md | 10 winning example messages used as anchors |
| dataset/categories/*.json | 5 category contexts (dentists, salons, gyms, restaurants, pharmacies) |
| dataset/merchants_seed.json | Seed merchants (2 per category) |
| dataset/customers_seed.json | Seed customer profiles |
| dataset/triggers_seed.json | Seed trigger payloads (one per kind) |
| dataset/generate_dataset.py | Deterministic seed → full-dataset expander (same output for every candidate) |
| expanded/ | Output of running the expander (generated data, not authored by me) |
| judge_simulator.py | The testing harness magicpin uses to score bots |
This is the work being judged.
| Path | What it does |
|---|---|
| bot/main.py | FastAPI app + the 5 required endpoints + lifespan handler |
| bot/compose.py | Main composition pipeline: compose_for_trigger() |
| bot/reply.py | /v1/reply handler: heuristic fast-paths + LLM |
| bot/llm.py | Async multi-provider LLM client (Anthropic / OpenAI / Gemini / Groq / DeepSeek) with retry, fallback chain, JSON-mode |
| bot/prompts.py | Composer system prompt + 22 per-trigger-kind playbooks + reply system prompt |
| bot/distill.py | Compresses raw context dicts into compact prompt-friendly text |
| bot/validators.py | Hard validators: no URLs, no fake prices, anti-repetition, send_as match, etc. |
| bot/store.py | Versioned, idempotent in-memory context store + per-merchant state + conversations |
| bot/schemas.py | Pydantic request/response models for the 5 endpoints |
| hf_upload/ | The exact bundle deployed to Hugging Face Spaces (mirror of bot/ + deploy config) |
| Dockerfile | Container definition for HF Spaces / any Docker host |
| requirements.txt | Python dependencies |
| smoke_test_kinds.py | My own smoke test that composes every trigger kind and prints a quality table |
| docs/space_README.md | HF Spaces frontmatter file (title, sdk: docker, app_port: 7860) |
| .env.example | Template for local env vars |
| .gitignore | Excludes .claude/, .env, __pycache__/, etc. |
| README.md | This file |
In short: the bot/ directory and the deployment around it are entirely mine. The challenge team provided the spec, the seed data, and the judge harness — I built the system that scores well against them.
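The retry and fallback chain listed for bot/llm.py can be pictured with a small sketch. This is an illustration only, not the code in the repo: `complete_with_fallback`, the `Provider` type, and the retry and backoff parameters are all hypothetical names and values, and each real provider client would need to be wrapped into the same async shape.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

# A provider is any async callable that takes a prompt and returns text.
# Real provider clients would be wrapped to fit this (assumed) shape.
Provider = Callable[[str], Awaitable[str]]


async def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Provider]],
    retries_per_provider: int = 2,   # hypothetical default
    base_delay: float = 0.5,         # hypothetical default, in seconds
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, text) from the first success."""
    last_error: Optional[Exception] = None
    for name, call in providers:
        for attempt in range(retries_per_provider):
            try:
                return name, await call(prompt)
            except Exception as exc:  # a real client would catch narrower error types
                last_error = exc
                if attempt + 1 < retries_per_provider:
                    # Exponential backoff before retrying the same provider.
                    await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all providers failed") from last_error
```

Ordering the list as primary first, fallback second means the fallback is only paid for when the primary has exhausted its retries.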
[ HTTP request from judge ]
│
▼
┌────────────────┐
│ FastAPI app │ bot/main.py — 5 endpoints
└───────┬────────┘
│
┌───────▼────────┐
│ In-memory │ bot/store.py — versioned, idempotent
│ context store │
└───────┬────────┘
│
┌───────▼────────┐
│ Composer │ bot/compose.py
│ ─────────── │ 1. suppression / DND / not-interested check
│ │ 2. distill 4 contexts → ~1.5KB prompt
│ │ 3. add per-trigger-kind playbook
│ │ 4. LLM call (Gemini 2.5 Flash, temp=0)
│ │ 5. validators — no URLs, no fake prices, …
│ │ 6. one re-prompt on fail, then surgical repair
└───────┬────────┘
│
┌───────▼────────┐
│ Action │ body, cta, send_as, suppression_key
│ out │
└────────────────┘
Reply path runs through bot/reply.py — heuristic fast-paths for auto-reply (3 strikes → end), opt-out, hostile (apology + send), intent-commit (action mode), then LLM for everything else.
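A minimal sketch of what such heuristic fast-paths could look like, checked in the same order as listed above. The regex patterns, route names, and `ConversationState` class are assumptions made for illustration; the real lists in bot/reply.py are not shown in this README and may differ.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only.
AUTO_REPLY = re.compile(
    r"thank you for contacting|we will get back to you|automated (reply|response)", re.I
)
OPT_OUT = re.compile(r"\b(stop|unsubscribe|opt out|do not message)\b", re.I)
HOSTILE = re.compile(r"\b(spam|scam|useless|shut up)\b", re.I)
COMMIT = re.compile(r"\b(yes|ok(ay)?|go ahead|send it|confirm(ed)?)\b", re.I)

MAX_AUTO_REPLIES = 3  # the "3 strikes" from the description above


@dataclass
class ConversationState:
    auto_reply_strikes: int = 0


def route_reply(text: str, state: ConversationState) -> str:
    """Return a route name; 'llm' means no fast-path matched."""
    if AUTO_REPLY.search(text):
        state.auto_reply_strikes += 1
        # End the conversation once the strike limit is reached.
        return "end" if state.auto_reply_strikes >= MAX_AUTO_REPLIES else "wait"
    if OPT_OUT.search(text):
        return "opt_out"
    if HOSTILE.search(text):
        return "apologise"
    if COMMIT.search(text):
        return "action"
    return "llm"
```

Keeping the strike counter in a state object that outlives a single conversation is one way the 3-strike exit could keep working when the harness rotates conversation IDs.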
.
├── README.md                       ← you are here
├── requirements.txt                ← Python deps
├── Dockerfile                      ← container for HF Spaces / any Docker host
├── .env.example                    ← template for local env vars (no real keys)
│
├── bot/                            ← the actual application
│   ├── main.py                     FastAPI app + 5 endpoints
│   ├── compose.py                  compose_for_trigger(): main pipeline
│   ├── reply.py                    /v1/reply handler
│   ├── llm.py                      async multi-provider client (Anthropic / OpenAI / Gemini / Groq / DeepSeek)
│   ├── prompts.py                  system prompt + per-kind playbooks + reply prompt
│   ├── distill.py                  compresses raw contexts into prompt-friendly text
│   ├── validators.py               hard validators with feedback strings
│   ├── store.py                    versioned context store + per-merchant state
│   └── schemas.py                  Pydantic request/response models
│
├── dataset/                        ← seed data the local simulator loads
│   ├── categories/                 5 category JSONs (dentists, salons, gyms, …)
│   ├── merchants_seed.json         10+ merchants
│   ├── customers_seed.json         customers per merchant
│   └── triggers_seed.json          trigger payloads (one per kind)
│
├── expanded/                       ← bigger generated dataset for stress testing
│
├── docs/                           ← challenge briefs and reference material
│   ├── challenge-brief.md          what to build
│   ├── challenge-testing-brief.md  how it's tested
│   ├── engagement-design.md        message-design principles
│   ├── engagement-research.md      research notes
│   ├── space_README.md             HF Spaces frontmatter copy
│   └── examples/
│       ├── api-call-examples.md    exact HTTP calls the judge sends
│       └── case-studies.md         10 winning examples
│
├── hf_upload/                      ← exact bundle deployed to HF Spaces (don't edit)
│
├── judge_simulator.py              ← local end-to-end test harness (from magicpin)
└── smoke_test_kinds.py             ← composes every trigger kind, prints quality table
| Method | Path | What it does |
|---|---|---|
| GET/HEAD | /v1/healthz | Liveness + context counts (UptimeRobot OK) |
| GET | /v1/metadata | Team, model, approach summary |
| POST | /v1/context | Idempotent, versioned context push |
| POST | /v1/tick | Run all available triggers → up to 20 actions |
| POST | /v1/reply | Reply handler (auto-reply, opt-out, hostile, intent) |
| POST | /v1/teardown | Optional: wipe state at end of test |
Full request/response shapes are in docs/examples/api-call-examples.md.
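To make "idempotent, versioned" concrete, here is a rough sketch of how an in-memory store behind /v1/context might treat repeated or out-of-order pushes. The key layout `(scope, context_id)`, the integer `version`, and the status strings are assumptions for illustration; the actual contract is the one in docs/examples/api-call-examples.md.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ContextStore:
    """In-memory store keyed by (scope, context_id); keeps only the newest version."""

    # (scope, context_id) -> (version, payload); field names are hypothetical
    _items: dict[tuple[str, str], tuple[int, dict[str, Any]]] = field(default_factory=dict)

    def push(self, scope: str, context_id: str, version: int, payload: dict[str, Any]) -> str:
        key = (scope, context_id)
        current = self._items.get(key)
        if current is not None and version <= current[0]:
            # Replays and out-of-order pushes leave the stored context untouched.
            return "unchanged" if version == current[0] else "stale"
        self._items[key] = (version, payload)
        return "accepted"

    def get(self, scope: str, context_id: str) -> Optional[dict[str, Any]]:
        entry = self._items.get((scope, context_id))
        return entry[1] if entry else None
```

With this shape, pushing the same version twice is a no-op, so a harness that retries a request cannot corrupt state.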
    pip install -r requirements.txt
    # pick any one provider:
    export LLM_PROVIDER=gemini
    export LLM_API_KEY=AIza... # your key
    export LLM_MODEL=gemini-2.5-flash # or claude-sonnet-4-5 / gpt-4o / llama-3.3-70b-versatile
    uvicorn bot.main:app --host 0.0.0.0 --port 8080

Then point the simulator at it:

    export LLM_API_KEY=... # judge LLM key (same provider env var)
    python judge_simulator.py # edit BOT_URL/LLM_PROVIDER inside the file first

Or hit it directly:

    curl http://localhost:8080/v1/healthz
    curl http://localhost:8080/v1/metadata

    docker build -t vera-bot .
    docker run -p 8080:8080 \
      -e LLM_PROVIDER=gemini \
      -e LLM_API_KEY=$LLM_API_KEY \
      -e LLM_MODEL=gemini-2.5-flash \
      vera-bot

The HF Spaces deployment uses the same Dockerfile, on port 7860.
The challenge brief calls out 4 weaknesses of magicpin's production "Vera" — this submission specifically fixes each:
- Auto-reply pollution → strict detection + 3-strike exit (works under both same-conv and rotating-conv judge harnesses).
- Intent-handoff failures → explicit ACTION-MODE routing + regex post-check that strips qualifying language ("would you", "do you", "what if") if the model regresses.
- Generic discount copy → validator rejects "X% off" when a real catalog title exists; playbooks force concrete service names.
- Low engagement frequency → coverage of curiosity / digest / trend / planning families, not just functional reminders.
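As an illustration of the regex post-check mentioned for intent handoff, the sketch below drops any sentence that contains one of the three quoted phrases. Treating "strips" as "remove the whole sentence" is an assumption, as are the function name and the sentence-splitting rule; the actual post-check may work differently.

```python
import re

# The three phrases quoted in the list above.
QUALIFIERS = re.compile(r"\b(would you|do you|what if)\b", re.IGNORECASE)
# Naive sentence splitter: break on whitespace that follows ., ! or ?
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def strip_qualifying(body: str) -> str:
    """Drop sentences that contain qualifying language once the merchant has committed."""
    text = body.strip()
    kept = [s for s in SENTENCE_SPLIT.split(text) if not QUALIFIERS.search(s)]
    # Never return an empty message: keep the original if every sentence was dropped.
    return " ".join(kept) if kept else text
```

For example, "Done, drafting the recall notice now. Would you like me to add a discount?" is reduced to its first sentence, so the reply acts on the commitment instead of reopening the question.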
Hard validators that block weak outputs before send:
- ❌ no URLs in the body
- ❌ no invented prices, no invented metrics
- ❌ no generic "X% off" copy when a real service-at-price is available
- ❌ no category taboo words (e.g. "guaranteed cure" for dentists)
- ❌ no repetition vs the last message sent to this merchant
- ✅ must contain a numeric anchor (price, %, count, or window)
- ✅ send_as must match the trigger's audience scope
If a validator fails, the composer re-prompts the LLM once with the failure list, then falls back to surgical repair (strip URL, force send_as, normalise CTA).
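The validate, re-prompt once, then repair flow could be sketched roughly as follows. Only two of the validators listed above are shown, and only the URL strip from the repair step. Function names and feedback strings are illustrative, not taken from bot/validators.py, and `generate` is a stand-in for the LLM call that receives the failure list from the previous attempt.

```python
import re
from typing import Callable, List, Optional

URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def check_no_url(body: str) -> Optional[str]:
    """Return a feedback string on failure, None on pass."""
    return "remove the URL from the body" if URL_RE.search(body) else None


def check_numeric_anchor(body: str) -> Optional[str]:
    return None if re.search(r"\d", body) else "add a numeric anchor (price, %, count, or window)"


VALIDATORS = [check_no_url, check_numeric_anchor]


def validate(body: str) -> List[str]:
    """Run every validator and collect the feedback strings of those that failed."""
    results = (check(body) for check in VALIDATORS)
    return [msg for msg in results if msg is not None]


def surgical_repair(body: str) -> str:
    """Deterministic last resort: strip URLs and tidy the whitespace left behind."""
    without_urls = URL_RE.sub("", body)
    return re.sub(r"\s{2,}", " ", without_urls).strip()


def compose_with_validation(generate: Callable[[List[str]], str]) -> str:
    body = generate([])                # first attempt, no feedback
    failures = validate(body)
    if failures:
        body = generate(failures)      # exactly one re-prompt with the failure list
        if validate(body):
            body = surgical_repair(body)
    return body
```

Returning feedback strings rather than booleans lets the same validators drive both the gate and the re-prompt text.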
- In-memory store instead of Redis. Test window is 60 simulated minutes — spec allows it, restart-between-calls is forbidden anyway, saves a deploy dependency.
- One re-prompt on validation failure (not two). Keeps p95 < 8s; second-failure cases get surgical repair. ~90% pass on the first attempt.
- Heuristic intent classifier (not an LLM hop) for auto-reply / opt-out / intent-commit. These need to be fast and deterministic; an LLM classifier is overkill and burns the budget the composer needs.
- Per-kind playbooks instead of per-kind few-shots. Few-shots would balloon token cost across many calls; playbooks give the model the framing without the bloat.
- Confidence band on peer_stats.scope: "south-Delhi solo practices" is excellent; many composes would benefit from city-level matching.
- A signed list of allowed social-proof phrases derived from the merchant's own customer aggregate, so the composer can use them without it reading as fabrication.
- A language_register per merchant distinct from languages (e.g. peer_clinical_en vs peer_clinical_hi_en_mix).
Hugging Face's free Spaces tier puts a Space to sleep after ~48 hours of inactivity, and a cold start can take 30–60 seconds. The judge harness calls happen at unpredictable moments and the spec asks for p95 < 8 seconds, so a cold start would tank the score.
To prevent that, I set up a free UptimeRobot monitor that pings the bot every 5 minutes:
- Monitor type: HTTP(s) — Keyword
- URL: https://Arjunnnnn123-vera-bot.hf.space/v1/healthz
- HTTP method: GET (UptimeRobot's free tier sometimes defaults to HEAD; I made /v1/healthz accept both with @app.api_route("/v1/healthz", methods=["GET", "HEAD"]) in bot/main.py)
- Expected keyword: "status":"ok" (returned by the healthz endpoint)
- Interval: every 5 minutes
- Effect: keeps the container warm; cold-start risk during judging window drops to near-zero
This also means I get an email if the Space ever goes down — a free safety net for the judging window.
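The same keyword check can be reproduced locally before trusting the monitor. The sketch below assumes the monitor does a plain substring match on the response body; the function names and the default timeout are hypothetical.

```python
import urllib.request

DEFAULT_KEYWORD = '"status":"ok"'  # the keyword configured in the monitor above


def body_is_healthy(status_code: int, body: str, keyword: str = DEFAULT_KEYWORD) -> bool:
    """A response counts as healthy if it is a 200 and contains the keyword verbatim."""
    return status_code == 200 and keyword in body


def keyword_check(url: str, keyword: str = DEFAULT_KEYWORD, timeout: float = 10.0) -> bool:
    """GET the URL and apply the keyword rule; connection problems count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return body_is_healthy(resp.status, body, keyword)
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False
```

Because the match is a plain substring test, a body serialised as `{"status": "ok"}` (with a space after the colon) would not match the keyword, which is worth verifying against the live endpoint.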
- Python 3.11, FastAPI, Pydantic 2, httpx (async)
- LLM: Gemini 2.5 Flash (primary, free tier 250K TPD), with automatic fallback to Groq Llama-3.3-70B
- Deploy: Hugging Face Spaces (Docker SDK, port 7860)
- Uptime: UptimeRobot 5-minute ping on /v1/healthz (see section above)
- Hugging Face Space: https://huggingface.co/spaces/Arjunnnnn123/vera-bot
- Live API base: https://Arjunnnnn123-vera-bot.hf.space
- Source code: https://github.com/arrjunn/vera-bot
- Author: Arjun Varshney