Vera bot — magicpin AI Challenge

A WhatsApp message-writing bot for Indian local-commerce merchants — dentists, salons, restaurants, gyms, pharmacies. Given who the merchant is and what just happened (a research paper, a supply alert, a perf dip, a recall, etc.), it composes the message the merchant should send their customers.

🚀 Live on Hugging Face Spaces: huggingface.co/spaces/Arjunnnnn123/vera-bot
🌐 Live API base: https://Arjunnnnn123-vera-bot.hf.space
Quick checks: /v1/healthz · /v1/metadata

Submission: magicpin AI Challenge, April 2026.


What it does, in 30 seconds

You feed the bot 4 things:

  1. Category — e.g. "dentists in India" (compliance rules, taboo words, what to never claim)
  2. Merchant — e.g. "Dr Meera, solo practitioner in South Delhi, Hindi+English"
  3. Trigger — e.g. "new ICMR research published on tooth-pulp regeneration"
  4. Customer (optional) — e.g. "Priya, last visit 8 months ago, due for a cleaning"

It returns a WhatsApp message the merchant can send out — short, on-brand, grounded in the facts above, with a real CTA. No invented prices, no fake stats, no spammy "20% off!" copy.

It also handles incoming replies — auto-reply detection, opt-out, hostile messages, intent-commit transitions.


What was provided vs what I built

The repo contains both the starter material magicpin handed to candidates and my actual submission. Here's the split — useful for anyone reviewing this as a portfolio piece.

📦 Provided by magicpin (starter material — same for every candidate)

These files came from the challenge team and were not modified. I only moved the briefs and examples into docs/ for readability.

| Path | What it is |
|---|---|
| docs/challenge-brief.md | The full spec: what to build, the 4-context framework, the rubric |
| docs/challenge-testing-brief.md | The technical contract: HTTP API shapes, judge harness behavior |
| docs/engagement-design.md | magicpin's message-design principles (compulsion levers, etc.) |
| docs/engagement-research.md | Research notes that fed the design doc |
| docs/examples/api-call-examples.md | Exact HTTP calls the judge sends + expected responses |
| docs/examples/case-studies.md | 10 winning example messages used as anchors |
| dataset/categories/*.json | 5 category contexts (dentists, salons, gyms, restaurants, pharmacies) |
| dataset/merchants_seed.json | Seed merchants (2 per category) |
| dataset/customers_seed.json | Seed customer profiles |
| dataset/triggers_seed.json | Seed trigger payloads (one per kind) |
| dataset/generate_dataset.py | Deterministic seed → full-dataset expander (same output for every candidate) |
| expanded/ | Output of running the expander; generated data, not authored by me |
| judge_simulator.py | The testing harness magicpin uses to score bots |

🛠️ Built by me (Arjun) — the actual submission

This is the work being judged.

| Path | What it does |
|---|---|
| bot/main.py | FastAPI app + the 5 required endpoints + lifespan handler |
| bot/compose.py | Main composition pipeline: compose_for_trigger() |
| bot/reply.py | /v1/reply handler: heuristic fast-paths + LLM |
| bot/llm.py | Async multi-provider LLM client (Anthropic / OpenAI / Gemini / Groq / DeepSeek) with retry, fallback chain, JSON-mode |
| bot/prompts.py | Composer system prompt + 22 per-trigger-kind playbooks + reply system prompt |
| bot/distill.py | Compresses raw context dicts into compact prompt-friendly text |
| bot/validators.py | Hard validators: no URLs, no fake prices, anti-repetition, send_as match, etc. |
| bot/store.py | Versioned, idempotent in-memory context store + per-merchant state + conversations |
| bot/schemas.py | Pydantic request/response models for the 5 endpoints |
| hf_upload/ | The exact bundle deployed to Hugging Face Spaces (mirror of bot/ + deploy config) |
| Dockerfile | Container definition for HF Spaces / any Docker host |
| requirements.txt | Python dependencies |
| smoke_test_kinds.py | My own smoke test that composes every trigger kind and prints a quality table |
| docs/space_README.md | HF Spaces frontmatter file (title, sdk: docker, app_port: 7860) |
| .env.example | Template for local env vars |
| .gitignore | Excludes .claude/, .env, __pycache__/, etc. |
| README.md | This file |

In short: the bot/ directory and the deployment around it are entirely mine. The challenge team provided the spec, the seed data, and the judge harness — I built the system that scores well against them.


How it's built

[ HTTP request from judge ]
            │
            ▼
   ┌────────────────┐
   │  FastAPI app   │  bot/main.py — 5 endpoints
   └───────┬────────┘
           │
   ┌───────▼────────┐
   │  In-memory     │  bot/store.py — versioned, idempotent
   │  context store │
   └───────┬────────┘
           │
   ┌───────▼────────┐
   │  Composer      │  bot/compose.py
   │  ───────────   │     1. suppression / DND / not-interested check
   │                │     2. distill 4 contexts → ~1.5KB prompt
   │                │     3. add per-trigger-kind playbook
   │                │     4. LLM call (Gemini 2.5 Flash, temp=0)
   │                │     5. validators — no URLs, no fake prices, …
   │                │     6. one re-prompt on fail, then surgical repair
   └───────┬────────┘
           │
   ┌───────▼────────┐
   │  Action        │  body, cta, send_as, suppression_key
   │  out           │
   └────────────────┘
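
Step 2 of the composer (distilling the four contexts into a compact prompt) could look roughly like the sketch below. This is not the code in bot/distill.py; the function names, the section tags, and the 1500-byte budget are assumptions chosen to match the "~1.5KB" figure in the diagram.

```python
def flatten(prefix, value, out):
    """Flatten nested dicts/lists into 'dotted.key: value' lines, skipping empties."""
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(f"{prefix}.{key}" if prefix else str(key), child, out)
    elif isinstance(value, list):
        if value and all(not isinstance(v, (dict, list)) for v in value):
            # A list of scalars fits on one line.
            out.append(f"{prefix}: {', '.join(str(v) for v in value)}")
        else:
            for index, child in enumerate(value):
                flatten(f"{prefix}[{index}]", child, out)
    elif value is not None and value != "":
        out.append(f"{prefix}: {value}")


def distill(contexts, budget_bytes=1500):
    """Render the four contexts as tagged sections, truncated to a byte budget.

    `contexts` is a hypothetical dict keyed by "category", "merchant",
    "trigger" and (optionally) "customer".
    """
    sections = []
    for name in ("category", "merchant", "trigger", "customer"):
        ctx = contexts.get(name)
        if not ctx:
            continue  # the customer context is optional
        lines = []
        flatten("", ctx, lines)
        sections.append(f"[{name}]\n" + "\n".join(lines))
    text = "\n".join(sections)
    encoded = text.encode("utf-8")
    if len(encoded) > budget_bytes:
        # Cut on a byte boundary and drop any half-cut multi-byte character.
        text = encoded[:budget_bytes].decode("utf-8", errors="ignore")
    return text
```

A hard byte cut is the crudest possible budget policy; a real distiller would more likely rank fields by relevance to the trigger kind before trimming.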

Reply path runs through bot/reply.py — heuristic fast-paths for auto-reply (3 strikes → end), opt-out, hostile (apology + send), intent-commit (action mode), then LLM for everything else.
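
The ordering of those fast-paths can be sketched as a small router. Everything here is illustrative rather than the contents of bot/reply.py: the regex patterns, the route labels, and the choice to count auto-reply strikes per merchant (so the count survives a harness that rotates conversation ids) are all assumptions.

```python
import re
from collections import defaultdict

# Illustrative patterns only; the real lists are presumably longer.
AUTO_REPLY = re.compile(
    r"thank you for contacting|we will get back to you|auto[- ]?reply|out of office",
    re.I,
)
OPT_OUT = re.compile(r"\b(stop|unsubscribe|not interested|do not message)\b", re.I)
HOSTILE = re.compile(r"\b(spam|scam|harass|report you)\b", re.I)
COMMIT = re.compile(r"\b(yes|ok(?:ay)?|go ahead|send it|book it)\b", re.I)


class ReplyRouter:
    """Decide how to handle an incoming reply before any LLM call."""

    def __init__(self, max_auto_replies=3):
        self.max_auto_replies = max_auto_replies
        # Keyed by merchant rather than conversation (an assumption).
        self.auto_counts = defaultdict(int)

    def route(self, merchant_id, text):
        if AUTO_REPLY.search(text):
            self.auto_counts[merchant_id] += 1
            if self.auto_counts[merchant_id] >= self.max_auto_replies:
                return "end"          # third strike: stop the conversation
            return "wait"
        if OPT_OUT.search(text):
            return "opt_out"
        if HOSTILE.search(text):
            return "apologise"
        if COMMIT.search(text):
            return "action"           # intent committed: switch to action mode
        return "llm"                  # everything else goes to the model
```

Opt-out is checked before intent-commit so that a message like "ok, stop" is never read as consent.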


Project layout

.
├── README.md              ← you are here
├── requirements.txt       ← Python deps
├── Dockerfile             ← container for HF Spaces / any Docker host
├── .env.example           ← template for local env vars (no real keys)
│
├── bot/                   ← the actual application
│   ├── main.py              FastAPI app + 5 endpoints
│   ├── compose.py           compose_for_trigger() — main pipeline
│   ├── reply.py             /v1/reply handler
│   ├── llm.py               async multi-provider client (Anthropic / OpenAI / Gemini / Groq / DeepSeek)
│   ├── prompts.py           system prompt + per-kind playbooks + reply prompt
│   ├── distill.py           compresses raw contexts into prompt-friendly text
│   ├── validators.py        hard validators with feedback strings
│   ├── store.py             versioned context store + per-merchant state
│   └── schemas.py           Pydantic request/response models
│
├── dataset/               ← seed data the local simulator loads
│   ├── categories/          5 category JSONs (dentists, salons, gyms, …)
│   ├── merchants_seed.json  10+ merchants
│   ├── customers_seed.json  customers per merchant
│   └── triggers_seed.json   trigger payloads (one per kind)
│
├── expanded/              ← bigger generated dataset for stress testing
│
├── docs/                  ← challenge briefs and reference material
│   ├── challenge-brief.md            what to build
│   ├── challenge-testing-brief.md    how it's tested
│   ├── engagement-design.md          message-design principles
│   ├── engagement-research.md        research notes
│   ├── space_README.md               HF Spaces frontmatter copy
│   └── examples/
│       ├── api-call-examples.md      exact HTTP calls the judge sends
│       └── case-studies.md           10 winning examples
│
├── hf_upload/             ← exact bundle deployed to HF Spaces (don't edit)
│
├── judge_simulator.py     ← local end-to-end test harness (from magicpin)
└── smoke_test_kinds.py    ← composes every trigger kind, prints quality table

The 5 endpoints (plus an optional teardown)

| Method | Path | What it does |
|---|---|---|
| GET/HEAD | /v1/healthz | Liveness + context counts (UptimeRobot OK) |
| GET | /v1/metadata | Team, model, approach summary |
| POST | /v1/context | Idempotent, versioned context push |
| POST | /v1/tick | Run all available triggers → up to 20 actions |
| POST | /v1/reply | Reply handler (auto-reply, opt-out, hostile, intent) |
| POST | /v1/teardown | Optional: wipe state at end of test |

Full request/response shapes are in docs/examples/api-call-examples.md.
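
To make "idempotent, versioned context push" concrete, here is a minimal sketch of the store semantics behind /v1/context. It is an illustration under assumptions, not bot/store.py: the class, the method names, and the response fields (accepted, reason, current_version) are hypothetical, and the authoritative shapes are the ones in the examples doc.

```python
class ContextStore:
    """In-memory store where a push only wins if its version is newer."""

    def __init__(self):
        self._items = {}  # (scope, context_id) -> (version, payload)

    def push(self, scope, context_id, version, payload):
        key = (scope, context_id)
        current = self._items.get(key)
        if current is not None and version <= current[0]:
            # Replaying the same or an older version is a no-op, so a
            # retried request cannot clobber newer data.
            return {
                "accepted": False,
                "reason": "stale_version",
                "current_version": current[0],
            }
        self._items[key] = (version, payload)
        return {"accepted": True, "current_version": version}

    def get(self, scope, context_id):
        entry = self._items.get((scope, context_id))
        return entry[1] if entry else None

    def counts(self):
        """Per-scope counts, the kind of figure /v1/healthz reports."""
        totals = {}
        for scope, _ in self._items:
            totals[scope] = totals.get(scope, 0) + 1
        return totals

    def clear(self):
        """Wipe everything, as an optional teardown would."""
        self._items.clear()
```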


Run it locally

pip install -r requirements.txt

# pick any one provider:
export LLM_PROVIDER=gemini
export LLM_API_KEY=AIza...                    # your key
export LLM_MODEL=gemini-2.5-flash             # or claude-sonnet-4-5 / gpt-4o / llama-3.3-70b-versatile

uvicorn bot.main:app --host 0.0.0.0 --port 8080

Then point the simulator at it:

export LLM_API_KEY=...           # judge LLM key (same provider env var)
python judge_simulator.py        # edit BOT_URL/LLM_PROVIDER inside the file first

Or hit it directly:

curl http://localhost:8080/v1/healthz
curl http://localhost:8080/v1/metadata

Run it in Docker

docker build -t vera-bot .
docker run -p 8080:8080 \
  -e LLM_PROVIDER=gemini \
  -e LLM_API_KEY=$LLM_API_KEY \
  -e LLM_MODEL=gemini-2.5-flash \
  vera-bot

The HF Spaces deployment uses the same Dockerfile, on port 7860.


Why this should score well

The challenge brief calls out 4 weaknesses of magicpin's production "Vera" — this submission specifically fixes each:

  1. Auto-reply pollution → strict detection + 3-strike exit (works under both same-conv and rotating-conv judge harnesses).
  2. Intent-handoff failures → explicit ACTION-MODE routing + regex post-check that strips qualifying language ("would you", "do you", "what if") if the model regresses.
  3. Generic discount copy → validator rejects "X% off" when a real catalog title exists; playbooks force concrete service names.
  4. Low engagement frequency → coverage of curiosity / digest / trend / planning families, not just functional reminders.
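
The regex post-check from point 2 might look something like this. It is a sketch, not the submission's code: the three phrases come from the text above, while the sentence splitter and the decision to drop the whole offending sentence are assumptions.

```python
import re

# The three qualifying phrases named above.
QUALIFIER = re.compile(r"\b(would you|do you|what if)\b", re.I)
# Naive splitter: break after ., ! or ? followed by whitespace.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def strip_qualifying(body):
    """Drop sentences that re-ask for permission after intent was committed."""
    sentences = SENTENCE_SPLIT.split(body.strip())
    kept = [s for s in sentences if not QUALIFIER.search(s)]
    # If every sentence qualifies, return the original rather than an empty body.
    return " ".join(kept) if kept else body.strip()
```

For example, "Drafting the reminder now. Would you like me to send it? It goes out at 6 pm." comes back as "Drafting the reminder now. It goes out at 6 pm."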

Hard validators that block weak outputs before send:

  • ❌ no URLs in the body
  • ❌ no invented prices, no invented metrics
  • ❌ no generic "X% off" copy when a real service-at-price is available
  • ❌ no category taboo words (e.g. "guaranteed cure" for dentists)
  • ❌ no repetition vs the last message sent to this merchant
  • ✅ must contain a numeric anchor (price, %, count, or window)
  • ✅ send_as must match the trigger's audience scope

If a validator fails, the composer re-prompts the LLM once with the failure list, then falls back to surgical repair (strip URL, force send_as, normalise CTA).
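
The validate, re-prompt once, then repair loop can be sketched as below. The function names, the failure strings, and the generate callback are hypothetical stand-ins for whatever bot/compose.py and bot/validators.py actually do, and only a subset of the validators listed above is shown.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+", re.I)
PERCENT_OFF = re.compile(r"\b\d+\s*%\s*off\b", re.I)
DIGIT = re.compile(r"\d")


def validate(msg, expected_send_as, taboo_words=(), has_catalog_price=False):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    body = msg["body"]
    if URL.search(body):
        failures.append("remove URLs from the body")
    if has_catalog_price and PERCENT_OFF.search(body):
        failures.append("replace generic 'X% off' with the real service and price")
    if any(word.lower() in body.lower() for word in taboo_words):
        failures.append("remove category taboo words")
    if not DIGIT.search(body):
        failures.append("include a numeric anchor")
    if msg.get("send_as") != expected_send_as:
        failures.append(f"send_as must be {expected_send_as}")
    return failures


def repair(msg, expected_send_as):
    """Surgical repair: fix only what can be fixed without the model."""
    fixed = dict(msg)
    fixed["body"] = re.sub(r"\s{2,}", " ", URL.sub("", msg["body"])).strip()
    fixed["send_as"] = expected_send_as
    return fixed


def compose_with_checks(generate, expected_send_as, **rules):
    """generate(failures) is a hypothetical callback wrapping the LLM call."""
    msg = generate(None)
    failures = validate(msg, expected_send_as, **rules)
    if failures:
        msg = generate(failures)                      # the single re-prompt
        if validate(msg, expected_send_as, **rules):
            msg = repair(msg, expected_send_as)       # deterministic fallback
    return msg
```

Passing the failure list back into the prompt gives the model a concrete target for its one retry, and the repair step keeps the worst case bounded at two LLM calls.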


Tradeoffs I made

  • In-memory store instead of Redis. The test window is 60 simulated minutes, the spec allows in-memory state, restarting between calls is forbidden anyway, and it saves a deploy dependency.
  • One re-prompt on validation failure (not two). Keeps p95 < 8s; second-failure cases get surgical repair. ~90% pass on the first attempt.
  • Heuristic intent classifier (not an LLM hop) for auto-reply / opt-out / intent-commit. These need to be fast and deterministic; an LLM classifier is overkill and burns the budget the composer needs.
  • Per-kind playbooks instead of per-kind few-shots. Few-shots would balloon token cost across many calls; playbooks give the model the framing without the bloat.

What additional context would have helped

  • A confidence band on peer_stats.scope. A scope like "south-Delhi solo practices" is excellent; many compositions would benefit from city-level matching.
  • A signed list of allowed social-proof phrases derived from the merchant's own customer aggregate, so the composer can use them without it reading as fabrication.
  • A language_register per merchant distinct from languages (e.g. peer_clinical_en vs peer_clinical_hi_en_mix).

Keeping the Space alive — UptimeRobot

Hugging Face's free Spaces tier puts a Space to sleep after ~48 hours of inactivity, and a cold start can take 30–60 seconds. The judge harness calls happen at unpredictable moments and the spec asks for p95 < 8 seconds, so a cold start would tank the score.

To prevent that, I set up a free UptimeRobot monitor that pings the bot every 5 minutes:

  • Monitor type: HTTP(s) — Keyword
  • URL: https://Arjunnnnn123-vera-bot.hf.space/v1/healthz
  • HTTP method: GET (UptimeRobot's free tier sometimes defaults to HEAD; I made /v1/healthz accept both with @app.api_route("/v1/healthz", methods=["GET", "HEAD"]) in bot/main.py)
  • Expected keyword: "status":"ok" (returned by the healthz endpoint)
  • Interval: every 5 minutes
  • Effect: keeps the container warm; cold-start risk during judging window drops to near-zero

This also means I get an email if the Space ever goes down — a free safety net for the judging window.
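
For checking the same thing by hand, the keyword monitor's logic can be reproduced with the standard library. This is a local stand-in written for illustration, not UptimeRobot's implementation; the function names are made up, and the match is a plain substring test, so a body serialized with a space after the colon would not count as healthy.

```python
import urllib.request

HEALTH_URL = "https://Arjunnnnn123-vera-bot.hf.space/v1/healthz"
KEYWORD = '"status":"ok"'


def body_is_healthy(body, keyword=KEYWORD):
    """Keyword check: the response body must contain the exact keyword."""
    return keyword in body


def ping(url=HEALTH_URL, timeout=10.0):
    """GET the health endpoint; True only for HTTP 200 with the keyword present."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and body_is_healthy(body)
    except OSError:
        # Covers URLError, HTTPError and timeouts: treat all of them as "down".
        return False


if __name__ == "__main__":
    print("healthy" if ping() else "down")
```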


Tech stack

  • Python 3.11, FastAPI, Pydantic 2, httpx (async)
  • LLM: Gemini 2.5 Flash (primary, free tier 250K TPD), with automatic fallback to Groq Llama-3.3-70B
  • Deploy: Hugging Face Spaces (Docker SDK, port 7860)
  • Uptime: UptimeRobot 5-minute ping on /v1/healthz (see section above)
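
The primary-plus-fallback arrangement can be sketched as a small async chain. This is a minimal illustration, not bot/llm.py: ProviderError, the (name, call) pairs, and the retry/backoff parameters are all hypothetical, and the provider callables stand in for real HTTP calls.

```python
import asyncio


class ProviderError(Exception):
    """Raised by a provider call on rate limits, timeouts, or bad output."""


async def call_with_fallback(providers, prompt, retries=2, base_delay=0.5):
    """Try each (name, async_call) in order, retrying with exponential backoff.

    Returns (provider_name, text) from the first call that succeeds.
    """
    errors = []
    for name, call in providers:
        for attempt in range(retries):
            try:
                return name, await call(prompt)
            except ProviderError as exc:
                errors.append(f"{name}#{attempt + 1}: {exc}")
                if attempt + 1 < retries:
                    await asyncio.sleep(base_delay * 2 ** attempt)
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

Usage would be along the lines of asyncio.run(call_with_fallback([("gemini", call_gemini), ("groq", call_groq)], prompt)), where call_gemini and call_groq are placeholder coroutines.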
