A Python AI workflow that handles a simulated customer-support conversation end-to-end for Bloom Aesthetics Clinic, a fictional small aesthetics business. Built for the Closira AI Engineering Intern assignment.
The workflow runs a conversation through four stages:
- FAQ Answering — answers customer questions using a provided SOP only.
- Lead Qualification — asks structured questions to qualify the lead.
- Escalation Detection — detects when a human is needed and logs why.
- Conversation Summary — produces a structured end-of-session summary.
It works with the Anthropic Claude API, the OpenAI API, or a local Ollama
model — it auto-detects the provider. Provide a hosted API key and it uses
that; provide nothing and it falls back to a free local model
(qwen2.5:1.5b) via Ollama. Reviewers only need to drop in an API key.
- Python 3.10 or newer
- One of: an Anthropic Claude API key, an OpenAI API key, or a local Ollama install (no key, free)
cd closira-ai-workflow
python -m pip install -r requirements.txt
The workflow auto-detects the provider in this priority order: Anthropic key → OpenAI key → local Ollama fallback.
cp .env.example .env # Windows: copy .env.example .env
Reviewers / recruiters: use a hosted model (recommended).
Open .env and fill in one key:
ANTHROPIC_API_KEY=sk-ant-... # or
OPENAI_API_KEY=sk-...
That's it: the key is picked up automatically. To force a provider, set
LLM_PROVIDER=anthropic, openai, or ollama.
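The priority order described above can be sketched as a small pure function. This is an illustrative sketch, not the project's actual code: the name `detect_provider` and the treatment of blank values as "no key" are assumptions, while the environment variable names come from this README.

```python
import os

VALID_PROVIDERS = ("anthropic", "openai", "ollama")


def detect_provider(env=None):
    """Pick a provider: explicit LLM_PROVIDER first, then keys, then Ollama.

    `detect_provider` is a hypothetical helper name used for illustration.
    """
    env = os.environ if env is None else env

    # An explicit override wins over any key that happens to be set.
    forced = (env.get("LLM_PROVIDER") or "").strip().lower()
    if forced:
        if forced not in VALID_PROVIDERS:
            raise ValueError(f"Unknown LLM_PROVIDER: {forced!r}")
        return forced

    # Assumption: blank values (as in an unedited .env) count as "no key".
    if (env.get("ANTHROPIC_API_KEY") or "").strip():
        return "anthropic"
    if (env.get("OPENAI_API_KEY") or "").strip():
        return "openai"

    # Nothing configured: fall back to the free local model.
    return "ollama"
```

Passing a plain dict instead of `os.environ` keeps the function easy to test, for example `detect_provider({"OPENAI_API_KEY": "sk-test"})`.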
Cost note: both hosted providers are paid APIs. New accounts may include free trial credit, which is far more than enough for every demo here (each full session costs a fraction of a cent).
Local development — use the free Ollama fallback (no key).
Leave the keys in .env blank. With no key set, the workflow runs entirely
locally on Ollama. One-time setup:
# 1. Install Ollama from https://ollama.com
# 2. Pull the model used by this project
ollama pull qwen2.5:1.5b
# 3. Make sure the Ollama server is running (it usually starts automatically)
ollama serve
Then run the workflow normally; it will detect Ollama and use it. The model
and endpoint are configurable via OLLAMA_MODEL and OLLAMA_BASE_URL in
.env.
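As a rough sketch of how those two settings might be resolved and used: the helper names below are hypothetical, and the default base URL is an assumption based on Ollama's usual local OpenAI-compatible endpoint rather than something this README confirms. The default model name is the one given above.

```python
import os

# The model name comes from this README. The base URL is an ASSUMED default
# (Ollama's usual local OpenAI-compatible endpoint), not confirmed here.
DEFAULT_OLLAMA_MODEL = "qwen2.5:1.5b"
DEFAULT_OLLAMA_BASE_URL = "http://localhost:11434/v1"


def ollama_settings(env=None):
    """Resolve the Ollama model and endpoint, falling back to defaults."""
    env = os.environ if env is None else env
    model = (env.get("OLLAMA_MODEL") or DEFAULT_OLLAMA_MODEL).strip()
    base_url = (env.get("OLLAMA_BASE_URL") or DEFAULT_OLLAMA_BASE_URL).strip()
    return {"model": model, "base_url": base_url.rstrip("/")}


def make_ollama_client(settings):
    """Build an OpenAI SDK client pointed at the local Ollama server."""
    # Imported lazily so ollama_settings() works without the SDK installed.
    from openai import OpenAI

    # Ollama ignores the key, but the SDK requires a non-empty value.
    return OpenAI(base_url=settings["base_url"], api_key="ollama")
```

Keeping the settings lookup separate from client construction means the configuration logic can be checked without a running Ollama server.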
Why a small model for local dev:
qwen2.5:1.5b is tiny and fast, so it is great for iterating on the workflow logic offline at zero cost. It is less reliable at strict JSON output and nuanced escalation than a hosted model, so a hosted key is recommended for grading/evaluation. The code requests native JSON mode from Ollama and defensively repairs malformed JSON to keep the small model usable.
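A minimal sketch of what defensive JSON repair can look like is shown here. The function name `parse_model_json` and the specific repairs (dropping surrounding chatter or code fences, removing trailing commas) are assumptions for illustration; the project's real repair logic may differ.

```python
import json
import re


def parse_model_json(text):
    """Best-effort parse of a model reply that should hold one JSON object.

    Returns a dict on success, or None if nothing usable is found.
    (`parse_model_json` is a hypothetical name, not the project's API.)
    """
    candidates = [text]

    # Keep only the outermost {...} span, which drops code fences or
    # conversational text that a small model may wrap around the JSON.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        span = text[start : end + 1]
        candidates.append(span)
        # Remove trailing commas before a closing brace or bracket.
        # Caveat: this naive regex would also touch ",}" inside a string.
        candidates.append(re.sub(r",\s*([}\]])", r"\1", span))

    # Try the least-modified candidate first.
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

Returning `None` instead of raising lets the caller treat an unparseable reply as a failed call and hand off rather than guess.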
Interactive mode, where you type as the customer:
python -m src.main
Scripted mode, which replays a saved customer script (used to reproduce the test transcripts):
python -m src.main --script test_transcripts/scripts/01_in_sop.txt
Type exit or quit at any time to end a session. When the session ends,
the structured conversation summary prints to the screen and the full
transcript is saved to logs/transcripts/.
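The way both modes can share one input loop is sketched below. This assumes a script file holds one customer message per line and that blank lines are skipped; neither detail is confirmed by this README, and `customer_turns` is a hypothetical name.

```python
EXIT_WORDS = {"exit", "quit"}


def customer_turns(lines):
    """Yield customer messages in order, stopping at 'exit' or 'quit'.

    `lines` can be any iterable of strings: an open script file in
    scripted mode, or a generator wrapping input() in interactive mode.
    """
    for raw in lines:
        message = raw.strip()
        if not message:
            continue  # ASSUMPTION: blank lines are ignored
        if message.lower() in EXIT_WORDS:
            return  # end the session; later lines are never read
        yield message
```

For scripted mode this could be driven with `customer_turns(open(path, encoding="utf-8"))`, so the rest of the workflow does not need to know where the messages come from.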
closira-ai-workflow/
├── README.md This file
├── prompt_design.md Full system prompt + design reasoning
├── requirements.txt Dependencies
├── .env.example Configuration template
├── VIDEO_SCRIPT.md Script for the 2-5 min walkthrough video
│
├── data/
│ └── sop.json The SOP — the AI's ONLY source of truth
│
├── src/
│ ├── main.py CLI entry point
│ ├── config.py Settings (provider, thresholds) from env
│ ├── llm_client.py Provider-agnostic LLM client (Claude/OpenAI)
│ ├── sop.py Loads + renders the SOP
│ ├── prompts.py Every prompt used by the workflow
│ ├── conversation.py Orchestrator — wires the four stages together
│ ├── logger.py Append-only escalation logging
│ └── stages/
│ ├── faq.py Stage 1 — FAQ answering
│ ├── qualification.py Stage 2 — lead qualification
│ ├── escalation.py Stage 3 — escalation detection
│ └── summary.py Stage 4 — conversation summary
│
├── test_transcripts/
│ ├── 01_in_sop_question.md One transcript per expected behaviour
│ ├── 02_out_of_scope_question.md
│ ├── 03_escalation_trigger.md
│ ├── 04_lead_qualification.md
│ ├── 05_conversation_summary.md
│ └── scripts/ Customer scripts to regenerate them
│
└── logs/
├── escalations.log One JSON line per escalation (created at run)
└── transcripts/ Saved transcript + summary per session
Each customer message flows through the workflow like this:
customer message
│
▼
[ Stage 3: Escalation check ] ── runs on EVERY message, before answering
│ complaint / anger / medical / pricing negotiation / human request?
│ │ yes ──────────────► hand off to a human, end session
│ │ no
▼
[ Stage 1: FAQ answering ] ── answers from data/sop.json ONLY
│ returns: reply, answered_from_sop, confidence, escalate
│ │ low confidence / out of scope ──► flag + log, hand off
│ │ answered
▼
"Anything else?" ── customer has more questions? loop back to Stage 1
│ no
▼
[ Stage 2: Lead qualification ] ── 3 structured questions
│
▼
[ Stage 4: Conversation summary ] ── structured summary, session ends
The orchestrator (src/conversation.py) owns all state and stage
transitions, so the flow is deterministic and easy to trace. The four stages
are cleanly separated — each lives in its own module under src/stages/.
Full reasoning behind the prompts, grounding strategy, and escalation logic
is in prompt_design.md.
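The per-message routing in the diagram can be sketched as a single transition function. This is a simplified illustration under stated assumptions: the state names, the `run_turn` function, the hand-off wording, and the three injected callables are all hypothetical, and the qualification and summary stages are left out.

```python
def run_turn(state, message, escalation_check, faq_answer, wants_more):
    """Route one customer message and return (next_state, reply).

    state            "faq" or "awaiting_more" (hypothetical state names)
    escalation_check message -> bool, stands in for Stage 3
    faq_answer       message -> dict with "reply" and "escalate" keys
    wants_more       message -> bool, the "anything else?" check
    """
    # Stage 3 runs on every message, before any answer is generated.
    if escalation_check(message):
        return "handoff", "I'm passing you to a member of our team."

    # After "Anything else?", a "no" moves the session to qualification.
    if state == "awaiting_more" and not wants_more(message):
        return "qualification", None

    # Stage 1: answer from the SOP, or hand off if the stage flags it.
    result = faq_answer(message)
    if result["escalate"]:
        return "handoff", result["reply"]
    return "awaiting_more", result["reply"] + " Anything else?"
```

Because the stages are passed in as plain callables, the transitions can be exercised with stubs and no model calls.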
The AI's entire knowledge base is data/sop.json — an
extended version of the assignment's sample SOP for Bloom Aesthetics
Clinic. It contains the business details, hours, six services with pricing,
booking and cancellation policy, general clinic policies, six FAQs, and seven
escalation rules. The AI may answer only from this file; anything outside
it triggers an honest "I don't have that information" and an escalation.
To use a different business, edit data/sop.json — no code changes needed.
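One way to keep the code independent of any particular business is to render the SOP generically instead of hard-coding its keys. The sketch below is an assumption about how that could work, not a description of src/sop.py; `render_sop` and `load_sop_text` are hypothetical names, and no specific keys of data/sop.json are assumed.

```python
import json


def render_sop(node, indent=0):
    """Flatten nested JSON into indented text lines for a system prompt."""
    pad = "  " * indent
    lines = []
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, (dict, list)):
                lines.append(f"{pad}{key}:")
                lines.extend(render_sop(value, indent + 1))
            else:
                lines.append(f"{pad}{key}: {value}")
    elif isinstance(node, list):
        for item in node:
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}-")
                lines.extend(render_sop(item, indent + 1))
            else:
                lines.append(f"{pad}- {item}")
    else:
        lines.append(f"{pad}{node}")
    return lines


def load_sop_text(path="data/sop.json"):
    """Read the SOP file and return it as prompt-ready text."""
    with open(path, encoding="utf-8") as handle:
        return "\n".join(render_sop(json.load(handle)))
```

With this approach, adding a new section to the JSON file changes the rendered prompt without touching any Python.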
- Grounding: the SOP is the only source of truth; the system prompt forbids stating, guessing, estimating, or inferring anything not in it.
- Structured output: every stage returns JSON, so the Python layer makes escalation decisions on explicit fields — not on interpreting prose.
- Confidence threshold: answers below CONFIDENCE_THRESHOLD (default 0.6) are escalated automatically, even if the model didn't ask for it.
- Dedicated escalation classifier: safety detection is a separate model call that runs before answering; it is not left to the answering model.
- Fail-safe defaults: if the escalation classifier errors, the workflow escalates anyway; if the FAQ call errors, it hands off instead of guessing.
- Audit log: every escalation is written to logs/escalations.log with a timestamp, reason, rationale, and which path raised it.
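The threshold, fail-safe, and audit-log decisions above could fit together roughly as follows. This is a sketch under assumptions: the function names, the reason strings, and the JSON field names in the log record are hypothetical. The FAQ result fields (`answered_from_sop`, `confidence`, `escalate`) and the 0.6 default come from this README.

```python
import json
from datetime import datetime, timezone

CONFIDENCE_THRESHOLD = 0.6  # default named in this README


def faq_needs_escalation(result, threshold=CONFIDENCE_THRESHOLD):
    """Return a reason string if the FAQ result must escalate, else None."""
    if result is None:
        return "faq_error"  # fail-safe: hand off rather than guess
    if result.get("escalate"):
        return "model_requested"
    if not result.get("answered_from_sop", False):
        return "out_of_scope"
    try:
        confidence = float(result.get("confidence", 0.0))
    except (TypeError, ValueError):
        return "low_confidence"  # unreadable confidence is treated as low
    if confidence < threshold:
        return "low_confidence"  # escalates even if the model did not ask
    return None


def escalation_record(reason, rationale, source, now=None):
    """Build one JSON line for the append-only log (field names assumed)."""
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": now.isoformat(),
            "reason": reason,
            "rationale": rationale,
            "source": source,  # which path raised it, e.g. "classifier" or "faq"
        }
    )


def log_escalation(line, path="logs/escalations.log"):
    """Append one record; mode 'a' never rewrites earlier entries."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
```

Making the decision in Python on explicit fields means the threshold applies uniformly, regardless of how the model phrases its reply.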
| Package | Purpose |
|---|---|
| anthropic | Anthropic Claude API client (used if Claude is selected) |
| openai | OpenAI API client, also used for the Ollama fallback, since Ollama exposes an OpenAI-compatible API |
| python-dotenv | Loads the .env file (optional, since env vars also work) |
You only need the SDK for the provider you actually use. The Ollama fallback
reuses the openai package (no extra dependency) and additionally requires
Ollama installed locally — see Local development above.
- Multiple model calls per turn. Each customer message triggers a separate escalation-classifier call plus the FAQ call (and occasionally a small intent call). This is a deliberate trade of latency/token cost for reliability — keeping safety detection separate from answer generation makes it far more dependable. A production system could consolidate calls once behaviour is validated.
- Fixed qualification questions. The three qualification questions are the same every session. This keeps lead data consistent and comparable, but it is not adaptive — the AI may ask something the conversation already revealed.
- Heuristic "anything else?" detection. The FAQ→qualification transition uses a keyword heuristic with a model fallback; an unusually phrased reply could occasionally be misclassified.
- Confidence is a model self-assessment. The 0.6 threshold is a sensible default but should be tuned against real transcripts for a given business.
- Single SOP, English, no real booking. The workflow operates on one SOP in UK English and cannot perform real bookings — it only explains how to book. Multi-tenant, multi-language, and live booking integration are out of scope for this assignment.
- Local Ollama fallback is for development, not grading. qwen2.5:1.5b is small; it is fast and free for offline iteration but weaker at strict JSON output and nuanced escalation. Evaluation is best done against a hosted model (Anthropic or OpenAI). The provider abstraction means the same workflow code runs unchanged on all three providers, although output quality differs by model.
- Test transcripts are representative samples. The files in test_transcripts/ show realistic output; exact wording will vary slightly per model run. Regenerate any of them with --script (see above).
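To make the "anything else?" limitation in the list above concrete, here is a minimal sketch of a keyword heuristic with a model fallback. The keyword lists, the function name `wants_more`, and the `ask_model` callback are assumptions for illustration; the project's actual heuristic may use different rules.

```python
import re

# Hypothetical keyword lists; a real deployment would tune these.
NO_PATTERN = re.compile(
    r"\b(no|nope|nothing|that'?s all|that is all|i'?m good|all good)\b",
    re.IGNORECASE,
)
YES_PATTERN = re.compile(
    r"\b(yes|yeah|yep|actually|also|another|one more)\b",
    re.IGNORECASE,
)


def wants_more(reply, ask_model=None):
    """Decide whether the customer has another question.

    `ask_model` is an optional callable (text -> bool) standing in for
    the model fallback used when the keywords are ambiguous.
    """
    text = reply.strip()
    if "?" in text:
        return True  # the reply is itself a new question

    no_hit = bool(NO_PATTERN.search(text))
    yes_hit = bool(YES_PATTERN.search(text))
    if no_hit != yes_hit:
        return yes_hit  # exactly one keyword family matched

    # Ambiguous (both or neither matched): defer to the model if available.
    if ask_model is not None:
        return bool(ask_model(text))
    return True  # with no fallback, assume the customer has more to say
```

A reply such as "No, actually there is one thing" matches both keyword families, which is exactly the kind of phrasing that needs the model fallback.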
Each expected behaviour has a customer script in test_transcripts/scripts/:
python -m src.main --script test_transcripts/scripts/01_in_sop.txt
python -m src.main --script test_transcripts/scripts/02_out_of_scope.txt
python -m src.main --script test_transcripts/scripts/03_escalation_trigger.txt
python -m src.main --script test_transcripts/scripts/04_lead_qualification.txt
python -m src.main --script test_transcripts/scripts/05_conversation_summary.txt
The committed .md transcripts in test_transcripts/ are annotated sample
runs explaining what each one demonstrates.