Vera bot — magicpin AI Challenge

A WhatsApp message-writing bot for Indian local-commerce merchants: dentists, salons, restaurants, gyms, pharmacies. Given who the merchant is and what just happened (a research paper, a supply alert, a performance dip, a recall, etc.), it composes the message the merchant should send to their customers.

🚀 Live on Hugging Face Spaces: huggingface.co/spaces/Arjunnnnn123/vera-bot
🌐 Live API base: https://Arjunnnnn123-vera-bot.hf.space
Quick checks: /v1/healthz · /v1/metadata

Submission: magicpin AI Challenge, April 2026.

What it does, in 30 seconds

You feed the bot 4 things:

Category — e.g. "dentists in India" (compliance rules, taboo words, what to never claim)
Merchant — e.g. "Dr Meera, solo practitioner in South Delhi, Hindi+English"
Trigger — e.g. "new ICMR research published on tooth-pulp regeneration"
Customer (optional) — e.g. "Priya, last visit 8 months ago, due for a cleaning"

It returns a WhatsApp message the merchant can send out — short, on-brand, grounded in the facts above, with a real CTA. No invented prices, no fake stats, no spammy "20% off!" copy.

It also handles incoming replies — auto-reply detection, opt-out, hostile messages, intent-commit transitions.

What was provided vs what I built

The repo contains both the starter material magicpin handed to candidates and my actual submission. Here's the split — useful for anyone reviewing this as a portfolio piece.

📦 Provided by magicpin (starter material — same for every candidate)

These files came from the challenge team and were not modified. I only moved them into docs/ for readability.

| Path | What it is |
|---|---|
| `docs/challenge-brief.md` | The full spec: what to build, the 4-context framework, the rubric |
| `docs/challenge-testing-brief.md` | The technical contract: HTTP API shapes, judge harness behavior |
| `docs/engagement-design.md` | magicpin's message-design principles (compulsion levers, etc.) |
| `docs/engagement-research.md` | Research notes that fed the design doc |
| `docs/examples/api-call-examples.md` | Exact HTTP calls the judge sends + expected responses |
| `docs/examples/case-studies.md` | 10 winning example messages used as anchors |
| `dataset/categories/*.json` | 5 category contexts (dentists, salons, gyms, restaurants, pharmacies) |
| `dataset/merchants_seed.json` | Seed merchants (2 per category) |
| `dataset/customers_seed.json` | Seed customer profiles |
| `dataset/triggers_seed.json` | Seed trigger payloads (one per kind) |
| `dataset/generate_dataset.py` | Deterministic seed → full-dataset expander (same output for every candidate) |
| `expanded/` | Output of running the expander (generated data, not authored by me) |
| `judge_simulator.py` | The testing harness magicpin uses to score bots |

🛠️ Built by me (Arjun) — the actual submission

This is the work being judged.

| Path | What it does |
|---|---|
| `bot/main.py` | FastAPI app + the 5 required endpoints + lifespan handler |
| `bot/compose.py` | Main composition pipeline: `compose_for_trigger()` |
| `bot/reply.py` | `/v1/reply` handler: heuristic fast-paths + LLM |
| `bot/llm.py` | Async multi-provider LLM client (Anthropic / OpenAI / Gemini / Groq / DeepSeek) with retry, fallback chain, JSON-mode |
| `bot/prompts.py` | Composer system prompt + 22 per-trigger-kind playbooks + reply system prompt |
| `bot/distill.py` | Compresses raw context dicts into compact prompt-friendly text |
| `bot/validators.py` | Hard validators: no URLs, no fake prices, anti-repetition, send_as match, etc. |
| `bot/store.py` | Versioned, idempotent in-memory context store + per-merchant state + conversations |
| `bot/schemas.py` | Pydantic request/response models for the 5 endpoints |
| `hf_upload/` | The exact bundle deployed to Hugging Face Spaces (mirror of `bot/` + deploy config) |
| `Dockerfile` | Container definition for HF Spaces / any Docker host |
| `requirements.txt` | Python dependencies |
| `smoke_test_kinds.py` | My own smoke test that composes every trigger kind and prints a quality table |
| `docs/space_README.md` | HF Spaces frontmatter file (title, sdk: docker, app_port: 7860) |
| `.env.example` | Template for local env vars |
| `.gitignore` | Excludes `.claude/`, `.env`, `__pycache__/`, etc. |
| `README.md` | This file |

In short: the bot/ directory and the deployment around it are entirely mine. The challenge team provided the spec, the seed data, and the judge harness — I built the system that scores well against them.

How it's built

[ HTTP request from judge ]
            │
            ▼
   ┌────────────────┐
   │  FastAPI app   │  bot/main.py — 5 endpoints
   └───────┬────────┘
           │
   ┌───────▼────────┐
   │  In-memory     │  bot/store.py — versioned, idempotent
   │  context store │
   └───────┬────────┘
           │
   ┌───────▼────────┐
   │  Composer      │  bot/compose.py
   │  ───────────   │     1. suppression / DND / not-interested check
   │                │     2. distill 4 contexts → ~1.5KB prompt
   │                │     3. add per-trigger-kind playbook
   │                │     4. LLM call (Gemini 2.5 Flash, temp=0)
   │                │     5. validators — no URLs, no fake prices, …
   │                │     6. one re-prompt on fail, then surgical repair
   └───────┬────────┘
           │
   ┌───────▼────────┐
   │  Action        │  body, cta, send_as, suppression_key
   │  out           │
   └────────────────┘
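
Step 2 of the composer (distilling contexts into a small prompt) could look roughly like the sketch below. This is not the code in bot/distill.py: the function names, the field names passed in `keep`, and the 1,500-byte budget constant are all assumptions made for illustration, loosely based on the "~1.5KB prompt" figure in the diagram.

```python
import json

# Hypothetical budget, loosely matching the "~1.5KB prompt" figure above.
MAX_PROMPT_BYTES = 1500


def distill_context(label: str, ctx: dict, keep: list[str]) -> str:
    """Flatten selected keys of one context dict into a single compact line."""
    parts = []
    for key in keep:
        value = ctx.get(key)
        if value in (None, "", [], {}):
            continue  # skip empty fields so they cost no tokens
        if isinstance(value, (list, dict)):
            # Compact JSON (no spaces) for nested values
            value = json.dumps(value, ensure_ascii=False, separators=(",", ":"))
        parts.append(f"{key}={value}")
    return f"[{label}] " + "; ".join(parts)


def build_prompt(sections: list[str], max_bytes: int = MAX_PROMPT_BYTES) -> str:
    """Join distilled sections, stopping before the byte budget is exceeded."""
    kept: list[str] = []
    used = 0
    for section in sections:
        size = len(section.encode("utf-8")) + 1  # +1 for the newline separator
        if used + size > max_bytes:
            break
        kept.append(section)
        used += size
    return "\n".join(kept)
```

Ordering the sections by importance (category rules first, optional customer last) means the budget cut, when it happens, removes the least critical context.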

The reply path runs through bot/reply.py. It applies heuristic fast-paths in order: auto-reply (3 strikes → end), opt-out, hostile (apology + send), and intent-commit (action mode). Everything else goes to the LLM.
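
The ordering of those fast-paths can be sketched as a small router. The regex patterns, the route labels, and the choice to count strikes per merchant are assumptions for illustration; the actual patterns in bot/reply.py are not shown in this README.

```python
import re
from collections import defaultdict

# Illustrative patterns only; the real lists are not shown in this README.
AUTO_REPLY = re.compile(r"thank you for contacting|we will get back|auto[- ]?reply", re.I)
OPT_OUT = re.compile(r"\b(stop|unsubscribe|do not message|don't message)\b", re.I)
HOSTILE = re.compile(r"\b(spam|scam|fraud|harass\w*)\b", re.I)
COMMIT = re.compile(r"\b(yes|ok|okay|go ahead|send it|let's do it)\b", re.I)

MAX_AUTO_REPLIES = 3


class ReplyRouter:
    def __init__(self) -> None:
        # Keyed by merchant rather than conversation so the count survives
        # a harness that rotates conversation IDs (an assumption).
        self.auto_strikes: dict[str, int] = defaultdict(int)

    def route(self, merchant_id: str, text: str) -> str:
        """Return a route label; checks run in the order listed above."""
        if AUTO_REPLY.search(text):
            self.auto_strikes[merchant_id] += 1
            if self.auto_strikes[merchant_id] >= MAX_AUTO_REPLIES:
                return "end"
            return "wait"
        if OPT_OUT.search(text):
            return "opt_out"
        if HOSTILE.search(text):
            return "apologise"
        if COMMIT.search(text):
            return "action_mode"
        return "llm"  # no heuristic matched: hand off to the model
```

Checking auto-replies first matters because canned business greetings often contain words like "ok" or "yes" that would otherwise trip the intent-commit pattern.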

Project layout

.
├── README.md              ← you are here
├── requirements.txt       ← Python deps
├── Dockerfile             ← container for HF Spaces / any Docker host
├── .env.example           ← template for local env vars (no real keys)
│
├── bot/                   ← the actual application
│   ├── main.py              FastAPI app + 5 endpoints
│   ├── compose.py           compose_for_trigger() — main pipeline
│   ├── reply.py             /v1/reply handler
│   ├── llm.py               async multi-provider client (Anthropic / OpenAI / Gemini / Groq / DeepSeek)
│   ├── prompts.py           system prompt + per-kind playbooks + reply prompt
│   ├── distill.py           compresses raw contexts into prompt-friendly text
│   ├── validators.py        hard validators with feedback strings
│   ├── store.py             versioned context store + per-merchant state
│   └── schemas.py           Pydantic request/response models
│
├── dataset/               ← seed data the local simulator loads
│   ├── categories/          5 category JSONs (dentists, salons, gyms, …)
│   ├── merchants_seed.json  10+ merchants
│   ├── customers_seed.json  customers per merchant
│   └── triggers_seed.json   trigger payloads (one per kind)
│
├── expanded/              ← bigger generated dataset for stress testing
│
├── docs/                  ← challenge briefs and reference material
│   ├── challenge-brief.md            what to build
│   ├── challenge-testing-brief.md    how it's tested
│   ├── engagement-design.md          message-design principles
│   ├── engagement-research.md        research notes
│   ├── space_README.md               HF Spaces frontmatter copy
│   └── examples/
│       ├── api-call-examples.md      exact HTTP calls the judge sends
│       └── case-studies.md           10 winning examples
│
├── hf_upload/             ← exact bundle deployed to HF Spaces (don't edit)
│
├── judge_simulator.py     ← local end-to-end test harness (from magicpin)
└── smoke_test_kinds.py    ← composes every trigger kind, prints quality table

The 5 endpoints

| Method | Path | What it does |
|---|---|---|
| GET/HEAD | `/v1/healthz` | Liveness + context counts (UptimeRobot OK) |
| GET | `/v1/metadata` | Team, model, approach summary |
| POST | `/v1/context` | Idempotent, versioned context push |
| POST | `/v1/tick` | Run all available triggers → up to 20 actions |
| POST | `/v1/reply` | Reply handler (auto-reply, opt-out, hostile, intent) |
| POST | `/v1/teardown` | Optional: wipe state at end of test |

Full request/response shapes are in docs/examples/api-call-examples.md.

Run it locally

pip install -r requirements.txt

# pick any one provider:
export LLM_PROVIDER=gemini
export LLM_API_KEY=AIza...                    # your key
export LLM_MODEL=gemini-2.5-flash             # or claude-sonnet-4-5 / gpt-4o / llama-3.3-70b-versatile

uvicorn bot.main:app --host 0.0.0.0 --port 8080

Then point the simulator at it:

export LLM_API_KEY=...           # judge LLM key (same provider env var)
python judge_simulator.py        # edit BOT_URL/LLM_PROVIDER inside the file first

Or hit it directly:

curl http://localhost:8080/v1/healthz
curl http://localhost:8080/v1/metadata

Run it in Docker

docker build -t vera-bot .
docker run -p 8080:8080 \
  -e LLM_PROVIDER=gemini \
  -e LLM_API_KEY=$LLM_API_KEY \
  -e LLM_MODEL=gemini-2.5-flash \
  vera-bot

The HF Spaces deployment uses the same Dockerfile, on port 7860.

Why this should score well

The challenge brief calls out 4 weaknesses of magicpin's production "Vera". This submission targets each one:

Auto-reply pollution → strict detection + 3-strike exit (works under both same-conv and rotating-conv judge harnesses).
Intent-handoff failures → explicit ACTION-MODE routing + regex post-check that strips qualifying language ("would you", "do you", "what if") if the model regresses.
Generic discount copy → validator rejects "X% off" when a real catalog title exists; playbooks force concrete service names.
Low engagement frequency → coverage of curiosity / digest / trend / planning families, not just functional reminders.
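
The regex post-check for intent handoff could be sketched as follows. It uses only the three phrases quoted above; the sentence-splitting rule and the function name are assumptions, and the real check in the bot may work differently.

```python
import re

# The three phrases quoted above; the real list may be longer.
QUALIFYING = re.compile(r"^\s*(would you|do you|what if)\b", re.I)


def strip_qualifying(body: str) -> str:
    """Drop sentences that open with a qualifying question once intent is committed."""
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.?!])\s+", body.strip())
    kept = [s for s in sentences if s and not QUALIFYING.match(s)]
    return " ".join(kept)
```

If every sentence is stripped the function returns an empty string, so a caller would need to treat that as a validation failure rather than send an empty body.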

Hard validators that block weak outputs before send:

❌ no URLs in the body
❌ no invented prices, no invented metrics
❌ no generic "X% off" copy when a real service-at-price is available
❌ no category taboo words (e.g. "guaranteed cure" for dentists)
❌ no repetition vs the last message sent to this merchant
✅ must contain a numeric anchor (price, %, count, or window)
✅ send_as must match the trigger's audience scope
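
A few of these rules are simple enough to sketch with regexes. The function below is an illustrative approximation, not the contents of bot/validators.py: the parameter names and failure strings are assumptions, and the checks that need more context (invented prices or metrics, repetition against the last message) are left out.

```python
import re

URL_RE = re.compile(r"https?://|www\.", re.I)
PERCENT_OFF_RE = re.compile(r"\b\d{1,3}\s?%\s*off\b", re.I)
DIGIT_RE = re.compile(r"\d")


def validate(body: str, *, taboo: list[str], has_catalog_item: bool,
             send_as: str, expected_send_as: str) -> list[str]:
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    if URL_RE.search(body):
        failures.append("remove URLs from the body")
    if has_catalog_item and PERCENT_OFF_RE.search(body):
        failures.append("replace generic % off with the named service and price")
    lowered = body.lower()
    for word in taboo:
        if word.lower() in lowered:
            failures.append(f"remove taboo phrase: {word}")
    if not DIGIT_RE.search(body):
        failures.append("add a numeric anchor (price, %, count, or window)")
    if send_as != expected_send_as:
        failures.append(f"send_as must be {expected_send_as}")
    return failures
```

Returning failures as plain instructions means the same strings can be fed straight back to the model if a re-prompt is needed.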

If a validator fails, the composer re-prompts the LLM once with the failure list, then falls back to surgical repair (strip URL, force send_as, normalise CTA).
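
The control flow of that retry policy can be sketched as below. It is synchronous for brevity (the real client is described as async), the `generate` and `check` callables are hypothetical stand-ins for the LLM call and the validators, and the repair step covers only URL stripping and `send_as`; CTA normalisation is omitted.

```python
import re
from typing import Callable

URL_RE = re.compile(r"(https?://|www\.)\S+", re.I)


def surgical_repair(action: dict, expected_send_as: str) -> dict:
    """Deterministic last-resort fixes that need no further LLM call."""
    fixed = dict(action)
    body = URL_RE.sub("", fixed["body"])
    fixed["body"] = re.sub(r"\s{2,}", " ", body).strip()  # tidy leftover gaps
    fixed["send_as"] = expected_send_as
    return fixed


def compose_with_retry(
    generate: Callable[[list[str]], dict],
    check: Callable[[dict], list[str]],
    expected_send_as: str,
) -> dict:
    """First attempt, at most one re-prompt carrying the failure list, then repair."""
    action = generate([])
    failures = check(action)
    if not failures:
        return action
    action = generate(failures)  # the single re-prompt
    if not check(action):
        return action
    return surgical_repair(action, expected_send_as)
```

Capping the loop at two model calls bounds worst-case latency, which is the point of repairing deterministically instead of prompting a third time.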

Tradeoffs I made

In-memory store instead of Redis. The test window is 60 simulated minutes, the spec allows in-memory state, and restarts between calls are forbidden anyway, so this saves a deploy dependency.
One re-prompt on validation failure (not two). This keeps p95 latency under 8 s; outputs that fail a second time get surgical repair. About 90% pass on the first attempt.
Heuristic intent classifier (not an LLM hop) for auto-reply / opt-out / intent-commit. These need to be fast and deterministic; an LLM classifier is overkill and burns the budget the composer needs.
Per-kind playbooks instead of per-kind few-shots. Few-shots would balloon token cost across many calls; playbooks give the model the framing without the bloat.

What additional context would have helped

A confidence band on peer_stats.scope. A scope like "south-Delhi solo practices" is excellent, but many compositions would benefit from city-level matching.
A signed list of allowed social-proof phrases derived from the merchant's own customer aggregate, so the composer can use them without it reading as fabrication.
A language_register per merchant distinct from languages (e.g. peer_clinical_en vs peer_clinical_hi_en_mix).

Keeping the Space alive — UptimeRobot

Hugging Face's free Spaces tier puts a Space to sleep after ~48 hours of inactivity, and a cold start can take 30–60 seconds. Judge harness calls arrive at unpredictable times and the spec asks for p95 < 8 seconds, so a cold start would tank the score.

To prevent that, I set up a free UptimeRobot monitor that pings the bot every 5 minutes:

Monitor type: HTTP(s) — Keyword
URL: https://Arjunnnnn123-vera-bot.hf.space/v1/healthz
HTTP method: GET (UptimeRobot's free tier sometimes defaults to HEAD; I made /v1/healthz accept both with @app.api_route("/v1/healthz", methods=["GET", "HEAD"]) in bot/main.py)
Expected keyword: "status":"ok" (returned by the healthz endpoint)
Interval: every 5 minutes
Effect: keeps the container warm; cold-start risk during the judging window drops to near-zero

This also means I get an email if the Space ever goes down — a free safety net for the judging window.
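
For a local equivalent of the monitor, the standard-library sketch below applies the same keyword test. The function names are hypothetical, and it assumes the endpoint serialises its JSON without a space after the colon, which is what the expected keyword above implies.

```python
import urllib.request

HEALTH_URL = "https://Arjunnnnn123-vera-bot.hf.space/v1/healthz"
KEYWORD = '"status":"ok"'


def keyword_present(body: str, keyword: str = KEYWORD) -> bool:
    """Same test a keyword monitor applies: plain substring match on the body."""
    return keyword in body


def ping(url: str = HEALTH_URL, timeout: float = 10.0) -> bool:
    """GET the health endpoint and report whether it looks healthy."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        return resp.status == 200 and keyword_present(body)
```

The match is whitespace-sensitive: a body of `{"status": "ok"}` with a space would fail the check, so the keyword has to mirror the exact serialisation the endpoint emits.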

Tech stack

Python 3.11, FastAPI, Pydantic 2, httpx (async)
LLM: Gemini 2.5 Flash (primary, free tier 250K TPD), with automatic fallback to Groq Llama-3.3-70B
Deploy: Hugging Face Spaces (Docker SDK, port 7860)
Uptime: UptimeRobot 5-minute ping on /v1/healthz (see section above)
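
The primary-plus-fallback arrangement can be sketched as an ordered provider chain with a small retry budget per provider. This is an illustration under assumptions, not the code in bot/llm.py: the `Provider` type, the retry count, and the backoff values are hypothetical, and the provider callables stand in for real API clients.

```python
import asyncio
from typing import Awaitable, Callable

Provider = Callable[[str], Awaitable[str]]


async def call_with_fallback(
    prompt: str,
    providers: list[tuple[str, Provider]],
    retries: int = 1,
    backoff_s: float = 0.5,
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, text) from the first success."""
    last_error: Exception | None = None
    for name, call in providers:
        for attempt in range(retries + 1):
            try:
                return name, await call(prompt)
            except Exception as exc:  # a real client would catch narrower errors
                last_error = exc
                if attempt < retries:
                    # Exponential backoff before retrying the same provider
                    await asyncio.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error
```

Returning the provider name alongside the text makes it easy to log how often the fallback is actually used.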

Submission

Hugging Face Space: https://huggingface.co/spaces/Arjunnnnn123/vera-bot
Live API base: https://Arjunnnnn123-vera-bot.hf.space
Source code: https://github.com/arrjunn/vera-bot
Author: Arjun Varshney
