
AI voice assistant that answers your phone calls with a human-like persona, holds natural multilingual conversations (GPT-4o + ElevenLabs), and keeps you in the loop via Signal with real-time mid-call instructions.


dzaczek/ava


AVA – AI Voice Assistant

AVA answers your calls when you can't, holds a natural conversation with a human-like persona, and keeps you in the loop via Signal. You can send live instructions mid-call from your phone.


Architecture Overview

graph TB
    subgraph External["EXTERNAL SERVICES"]
        Twilio["Twilio<br/>Voice / PSTN<br/>STT (Gather)<br/>Webhooks"]
        OpenAI["OpenAI<br/>GPT-4o (conversation)<br/>TTS (fallback)"]
        ElevenLabs["ElevenLabs<br/>TTS (primary voice)<br/>eleven_multilingual_v2"]
    end

    subgraph Docker["DOCKER HOST (your server)"]
        subgraph Ingress["INGRESS (choose one)"]
            Caddy["Caddy :443/:80<br/>Let's Encrypt<br/>auto HTTPS"]
            Cloudflared["Cloudflare Tunnel<br/>outbound, no open ports"]
        end

        subgraph AVA["AVA (FastAPI :8000)"]
            Main["main.py<br/>Call routing<br/>Twilio hooks<br/>Rate limiter<br/>Audio serve<br/>Diagnostics"]
            Conv["conversation.py<br/>GPT-4o loop<br/>Streaming<br/>Meta parsing<br/>Summarizer"]
            TTS["tts.py<br/>ElevenLabs → OpenAI<br/>→ Polly (fallback)<br/>Cache (MD5)<br/>Circuit breaker"]
            Owner["owner_channel.py<br/>Signal notify<br/>Signal poll (3s)<br/>Slash commands<br/>Instructions"]
            Contact["contact_lookup.py<br/>contacts.json<br/>Twilio CNAM<br/>E.164 normalize<br/>Lang from prefix"]
            I18n["i18n.py<br/>8+ languages<br/>Signal templates<br/>Polly voices<br/>Twilio codes"]
        end

        SignalCLI["signal-cli :8080<br/>REST API<br/>Native mode<br/>Self-hosted"]

        subgraph Volumes["Persistent Volumes"]
            TTSCache["tts_cache (MP3s)"]
            CallData["/data/calls/ (JSON)"]
            Contacts["/data/contacts.json"]
            SignalData["signal_data"]
        end
    end

    OwnerPhone["Owner's Phone<br/>(Signal app)"]

    Twilio -->|"HTTPS webhooks"| Caddy
    Twilio -->|"HTTPS webhooks"| Cloudflared
    Caddy -->|"ava-net"| Main
    Cloudflared -->|"ava-net"| Main

    Main <--> Conv
    Main <--> TTS
    Main <--> Owner
    Main <--> Contact
    Conv <--> I18n
    Main <--> I18n

    Conv -->|"HTTPS"| OpenAI
    TTS -->|"HTTPS"| ElevenLabs
    TTS -->|"HTTPS"| OpenAI

    Owner -->|"HTTP (ava-net)"| SignalCLI
    SignalCLI <-->|"Signal protocol"| OwnerPhone

    TTS --> TTSCache
    Main --> CallData
    Contact --> Contacts
    SignalCLI --> SignalData

    style External fill:#f9f0ff,stroke:#7c3aed
    style Docker fill:#f0f9ff,stroke:#2563eb
    style AVA fill:#ecfdf5,stroke:#059669
    style Ingress fill:#fef3c7,stroke:#d97706
    style Volumes fill:#fef2f2,stroke:#dc2626

Call Flow (detailed sequence)

sequenceDiagram
    participant Caller as Caller's Phone
    participant Twilio as Twilio (PSTN + STT)
    participant AVA as AVA Server
    participant GPT as OpenAI GPT-4o
    participant TTS as ElevenLabs / OpenAI TTS
    participant Signal as Owner (Signal)

    Caller->>Twilio: Dials owner (call forwarded)
    Twilio->>AVA: POST /twilio/incoming<br/>(CallSid, From, To)

    Note over AVA: Contact lookup (local/CNAM)<br/>Detect lang from phone prefix<br/>(+41 → de-CH, +48 → pl-PL)

    AVA-->>Signal: 📞 Incoming call notification
    AVA->>TTS: Generate greeting TTS
    TTS-->>AVA: MP3 audio URL

    AVA->>Twilio: TwiML: Gather + Play<br/>speech_timeout=2s<br/>language=de-CH, enhanced=true
    Twilio->>Caller: Plays greeting audio

    loop Max 10 exchanges
        Caller->>Twilio: Speaks
        Twilio->>AVA: POST /process_speech<br/>(SpeechResult, Confidence)

        Note over AVA: langdetect on text<br/>Pop Signal instructions

        opt Owner sent instruction
            Signal-->>AVA: "tell him I'll call back"
            Note over AVA: Inject [RELAY_TO_CALLER: ...]<br/>into GPT user message
        end

        AVA->>GPT: Stream GPT-4o (user text + instructions)
        GPT-->>AVA: Sentence chunks (streaming)

        Note over AVA: TTS pipeline: start TTS on<br/>1st sentence while GPT<br/>still generates the rest

        AVA->>TTS: TTS sentence 1 (parallel)
        TTS-->>AVA: MP3 URL
        AVA->>TTS: TTS remaining sentences

        Note over AVA: Parse meta JSON<br/>end_call, urgency, topic,<br/>caller_name, lang

        opt GPT switched language
            Note over AVA: Update STT language<br/>for next Gather<br/>e.g. de-CH → pl-PL
        end

        AVA->>Twilio: TwiML: Gather + Play<br/>(updated STT language)
        Twilio->>Caller: Plays response audio

        opt Every 4 transcript entries
            AVA-->>Signal: 📞 Live update<br/>(topic, last 6 lines)
        end
    end

    Note over AVA: end_call=true OR<br/>END_CALL_NOW from owner

    AVA->>Twilio: TwiML: Play + Hangup
    Twilio->>Caller: Goodbye + disconnect

    Twilio->>AVA: POST /twilio/status<br/>CallStatus=completed

    AVA->>GPT: Summarize full transcript
    GPT-->>AVA: Summary text
    AVA-->>Signal: 📋 Call summary + priority
    AVA-->>Signal: 📝 Full transcript

    Note over AVA: Save JSON to /data/calls/<br/>Cleanup after 90s delay
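The "start TTS on the 1st sentence while GPT still generates the rest" step in the sequence above depends on cutting the token stream into sentences as they complete. The following is a minimal sketch of that idea, not AVA's actual code: the function name `iter_sentences` and the punctuation-based boundary rule are assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()


# Hypothetical stream of chunks as they might arrive from a streaming completion.
stream = ["Hello, I'm ", "Maya. How can ", "I help you", " today?"]
sentences = list(iter_sentences(stream))
```

With a generator like this, the first yielded sentence can be handed to the TTS provider while later chunks are still arriving.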

Timeouts & Limits

| Parameter | Value | Location | Description |
|---|---|---|---|
| speech_timeout | 2 s | main.py (all 4 Gather calls) | Silence after speech ends before Twilio fires callback |
| enhanced | true | main.py (Gather) | Use enhanced STT model for better accuracy |
| GPT max_tokens | 350 | conversation.py | Max response length per turn |
| GPT temperature | 0.75 | conversation.py | Creativity level for responses |
| Summary max_tokens | 400 | conversation.py | Max summary length |
| Summary temperature | 0.2 | conversation.py | Low creativity for factual summaries |
| Context window | last 20 messages | conversation.py | Sliding window of conversation history |
| Hard turn limit | 10 exchanges | conversation.py | AVA wraps up after 10 user turns |
| Wrap-up warning | 8+ exchanges | conversation.py | System prompt warns AVA to end soon |
| ElevenLabs timeout | 15 s | tts.py (httpx) | HTTP timeout for TTS API |
| ElevenLabs circuit breaker | 10 min | tts.py | Disable after 401/403/429, auto-reset |
| Signal poll interval | 3 s | main.py / owner_channel.py | How often AVA checks for new Signal messages |
| Signal HTTP timeout | 10 s | owner_channel.py (httpx) | Timeout for Signal API calls |
| CNAM lookup timeout | 5 s | contact_lookup.py (httpx) | Twilio CNAM API timeout |
| Rate limiter | 30 req/min per IP | main.py | Sliding window, auto-cleanup every 5 min |
| Rate limiter cleanup | 5 min | main.py | Stale entry eviction interval |
| Call state cleanup | 90 s after end | main.py | Delayed cleanup of in-memory call state |
| TTS cache | no expiry | tts.py | MD5(lang:text) keyed, persists in Docker volume |
| Seen Signal timestamps | 500 entries | owner_channel.py | Deque for deduplication |
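
The rate-limiter rows above (30 requests per minute per IP, sliding window, periodic eviction of stale entries) can be sketched as follows. This is an illustration under assumptions rather than the code in main.py: the class name `SlidingWindowLimiter` and its methods are hypothetical, and the `now` argument exists only to make the behaviour easy to test.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds for each IP."""

    def __init__(self, limit: int = 30, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # the caller would answer with HTTP 429
        hits.append(now)
        return True

    def cleanup(self, now: Optional[float] = None) -> None:
        """Evict IPs with no recent hits (intended to run every 5 minutes)."""
        now = time.monotonic() if now is None else now
        stale = [ip for ip, hits in self._hits.items()
                 if not hits or now - hits[-1] >= self.window_s]
        for ip in stale:
            del self._hits[ip]
```

A deque per IP keeps both the check and the eviction cheap, because old timestamps are only ever removed from the left end.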

Language Detection & Switching

flowchart TD
    Start([CALL START]) --> Prefix["Phone prefix detection<br/>+41 → de-CH<br/>+48 → pl-PL<br/>+44 → en-GB<br/>(52 prefixes)"]

    Prefix --> ContactCheck{Contact has<br/>lang override?}
    ContactCheck -->|Yes| ContactLang["Use contact language<br/>contacts.json<br/>e.g. {lang: pl}"]
    ContactCheck -->|No| PrefixLang["Use prefix language"]

    ContactLang --> Gather
    PrefixLang --> Gather

    Gather["Twilio STT Gather<br/>language = detected locale<br/>speech_timeout = 2s<br/>enhanced = true"]

    Gather --> Speech["SpeechResult (text)"]
    Speech --> Detect["langdetect on text<br/>(if 3+ words)<br/>e.g. Dzień dobry → pl"]

    Detect --> GPT["GPT-4o processes text<br/>Responds in caller's language<br/>Returns meta with lang: pl"]

    GPT --> Switch{GPT lang ≠<br/>current STT?}
    Switch -->|Yes| Update["Switch STT language<br/>for NEXT Gather<br/>e.g. de-CH → pl-PL"]
    Switch -->|No| Keep["Keep current STT language"]

    Update --> Gather
    Keep --> Gather

    style Start fill:#059669,color:#fff
    style Gather fill:#2563eb,color:#fff
    style GPT fill:#7c3aed,color:#fff
    style Switch fill:#d97706,color:#fff

Important limitation: Twilio STT only supports one language per Gather. If the caller speaks Polish but STT is set to German, the transcript will be garbled. The language switch only takes effect on the next turn.
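
The selection and switching logic in the flowchart can be sketched in a few lines. This is a hedged illustration, not the contents of contact_lookup.py: the function names, the three-entry prefix table (the real one has 52 prefixes), and the mapping from two-letter codes to locales are all assumptions.

```python
from typing import Optional

# Illustrative subset of the prefix table.
PREFIX_TO_LOCALE = {"+41": "de-CH", "+48": "pl-PL", "+44": "en-GB"}
# Assumed mapping from GPT's two-letter `lang` to a Twilio STT locale.
LANG_TO_LOCALE = {"de": "de-CH", "pl": "pl-PL", "en": "en-GB"}
DEFAULT_STT_LANG = "en-US"


def initial_locale(phone: str, contact_lang: Optional[str] = None) -> str:
    """Pick the STT locale for the first Gather of a call."""
    if contact_lang:  # a contact-level override beats the prefix
        return LANG_TO_LOCALE.get(contact_lang, DEFAULT_STT_LANG)
    # Try longer prefixes first so a more specific entry wins.
    for prefix in sorted(PREFIX_TO_LOCALE, key=len, reverse=True):
        if phone.startswith(prefix):
            return PREFIX_TO_LOCALE[prefix]
    return DEFAULT_STT_LANG


def next_locale(current: str, gpt_lang: Optional[str]) -> str:
    """Locale for the NEXT Gather; the current turn is already transcribed."""
    target = LANG_TO_LOCALE.get(gpt_lang or "")
    if target and target.split("-")[0] != current.split("-")[0]:
        return target
    return current
```

Because `next_locale` only affects the following Gather, the turn in which the caller first changes language is still transcribed with the old locale, which is the limitation described above.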


TTS Provider Chain

flowchart TD
    Input["Text to speak"] --> Cache{Disk cache hit?<br/>key = MD5 lang:text}

    Cache -->|Yes| Serve["Return cached URL<br/>PUBLIC_URL/audio/hash.mp3"]
    Cache -->|No| ELCheck{ElevenLabs<br/>available?<br/>API key set?<br/>Circuit breaker OK?}

    ELCheck -->|Yes| EL["ElevenLabs API<br/>voice_id (env)<br/>model_id (env)<br/>timeout: 15s"]
    ELCheck -->|No| OpenAI

    EL -->|Success| Save["Save to cache<br/>Return URL"]
    EL -->|Fail| OpenAI["OpenAI TTS<br/>model: tts-1<br/>voice: OPENAI_TTS_VOICE<br/>(default: nova)"]

    OpenAI -->|Success| Save
    OpenAI -->|Fail| Polly["Twilio Say (Polly)<br/>Last resort<br/>Built-in voice"]

    EL -->|"401/403/429"| CB["Circuit Breaker<br/>Disable ElevenLabs<br/>for 10 minutes"]
    CB --> OpenAI

    Save --> Done([Audio URL returned])
    Polly --> Done2([TwiML Say fallback])

    style Input fill:#2563eb,color:#fff
    style EL fill:#7c3aed,color:#fff
    style OpenAI fill:#059669,color:#fff
    style Polly fill:#dc2626,color:#fff
    style CB fill:#d97706,color:#fff
    style Done fill:#059669,color:#fff
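The three building blocks of this chain (an MD5 cache key, a time-based circuit breaker, and ordered fallback) can be sketched as below. This is a simplified illustration and not the code in tts.py: `cache_key`, `CircuitBreaker`, `synthesize`, and the provider callables are hypothetical names, and real providers would make HTTP calls instead of returning bytes directly.

```python
import hashlib
import time
from typing import Callable, Optional


def cache_key(lang: str, text: str) -> str:
    """32-character hex digest of 'lang:text', usable as the MP3 file stem."""
    return hashlib.md5(f"{lang}:{text}".encode("utf-8")).hexdigest()


class CircuitBreaker:
    """Disable a provider for `cooldown_s` seconds after it is tripped."""

    def __init__(self, cooldown_s: float = 600.0) -> None:
        self.cooldown_s = cooldown_s
        self._opened_at: Optional[float] = None

    def trip(self, now: Optional[float] = None) -> None:
        self._opened_at = time.monotonic() if now is None else now

    def available(self, now: Optional[float] = None) -> bool:
        if self._opened_at is None:
            return True
        now = time.monotonic() if now is None else now
        if now - self._opened_at >= self.cooldown_s:
            self._opened_at = None  # auto-reset once the cooldown has passed
            return True
        return False


# A provider takes (lang, text) and returns MP3 bytes, or None on failure.
Provider = Callable[[str, str], Optional[bytes]]


def synthesize(lang: str, text: str, providers: list[Provider]) -> Optional[bytes]:
    """Try providers in order; None means 'fall back to Twilio Say (Polly)'."""
    for provider in providers:
        try:
            audio = provider(lang, text)
        except Exception:
            audio = None  # treat any provider error as a miss
        if audio:
            return audio
    return None
```

In this sketch the caller would consult `CircuitBreaker.available()` before adding ElevenLabs to the provider list, and call `trip()` when it sees a 401, 403, or 429 response.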

Signal Communication Flow

sequenceDiagram
    participant Owner as Owner's Signal
    participant CLI as signal-cli REST API
    participant AVA as AVA Server

    loop Every 3 seconds
        AVA->>CLI: GET /v1/receive
        CLI-->>AVA: [] (no messages)
    end

    Note over AVA: INCOMING CALL

    AVA->>CLI: POST /v2/send
    CLI->>Owner: 📞 Incoming call<br/>From: Jan (+48...)<br/>🌐 pl-PL

    Owner->>CLI: "tell him I'll call back"
    AVA->>CLI: GET /v1/receive
    CLI-->>AVA: [message data]
    Note over AVA: Queue instruction<br/>for active call

    AVA->>CLI: POST /v2/send
    CLI->>Owner: ✅ AVA will tell the caller

    Note over AVA: Next speech turn:<br/>inject instruction<br/>into GPT context

    Note over AVA: After 4 transcript entries

    AVA->>CLI: POST /v2/send
    CLI->>Owner: 📞 Call in progress<br/>🟡 Topic: invoice dispute<br/>Last 6 lines of transcript

    Note over AVA: CALL ENDS

    AVA->>CLI: POST /v2/send
    CLI->>Owner: 📋 Call summary<br/>Priority + AI summary

    AVA->>CLI: POST /v2/send
    CLI->>Owner: 📝 Full transcript
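Because AVA polls for messages every 3 seconds, the same message can be seen more than once, and messages from anyone other than the owner must be ignored. The sketch below shows one way to combine the sender filter with a bounded deduplication deque (500 entries, as listed under Timeouts & Limits). The class name `SignalInbox` and the `accept` signature are assumptions, and the signal-cli JSON parsing is deliberately left out.

```python
from collections import deque
from typing import Optional


class SignalInbox:
    """Filter polled messages: owner only, each timestamp processed once."""

    def __init__(self, owner_number: str, max_seen: int = 500) -> None:
        self.owner_number = owner_number
        # A bounded deque forgets the oldest timestamps automatically.
        self._seen = deque(maxlen=max_seen)

    def accept(self, sender: str, timestamp: int, text: str) -> Optional[str]:
        """Return the instruction text, or None if the message must be skipped."""
        if sender != self.owner_number:
            return None  # not from SIGNAL_RECIPIENT: ignore
        if timestamp in self._seen:
            return None  # already handled in an earlier poll
        self._seen.append(timestamp)
        return text.strip() or None
```

The linear `in` check is acceptable here because the deque never holds more than a few hundred integers.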

Slash commands (no active call needed)

| Command | Description |
|---|---|
| /ping | Alive check + timestamp |
| /status | Uptime, active calls, public URL |
| /stats | Call count, memory, TTS cache size |
| /calls | Last 5 call records with topics |
| /restart | Restart AVA (requires /restart confirm) |
| /help | Command list |

Owner Instruction Injection

flowchart LR
    subgraph Signal["Owner sends via Signal"]
        A["tell him I'll call at 3"]
        B["ask for order number"]
        C["be more formal"]
        D["end"]
    end

    subgraph GPT["AVA injects into GPT context"]
        A2["[RELAY_TO_CALLER: I'll call at 3]"]
        B2["[ASK_CALLER: order number]"]
        C2["[OWNER_INSTRUCTION: be more formal]"]
        D2["END_CALL_NOW + force_end flag"]
    end

    A --> A2
    B --> B2
    C --> C2
    D --> D2

    GPT --> Response["GPT acts on markers<br/>naturally within response"]

    style Signal fill:#f0f9ff,stroke:#2563eb
    style GPT fill:#ecfdf5,stroke:#059669
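The mapping from owner messages to markers can be sketched as a small classifier. The markers themselves come from the diagram above, but the keyword rules used here (`tell him`, `ask for`, and the end words) are an assumed heuristic for illustration only; the actual matching in owner_channel.py may differ.

```python
# End words taken from the "During a call" command list: end / stop / koniec.
END_WORDS = {"end", "stop", "koniec"}


def to_marker(message: str) -> tuple[str, bool]:
    """Return (marker text for the GPT user message, force_end flag)."""
    text = message.strip()
    lowered = text.lower()
    if lowered in END_WORDS:
        return "END_CALL_NOW", True
    # Hypothetical prefixes that signal "relay this to the caller".
    for prefix in ("tell him ", "tell her ", "tell them "):
        if lowered.startswith(prefix):
            return f"[RELAY_TO_CALLER: {text[len(prefix):]}]", False
    if lowered.startswith("ask for "):
        return f"[ASK_CALLER: {text[len('ask for '):]}]", False
    # Anything else is passed through as a generic instruction.
    return f"[OWNER_INSTRUCTION: {text}]", False
```

For example, `to_marker("be more formal")` yields the generic `[OWNER_INSTRUCTION: be more formal]` marker with the force-end flag unset.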

GPT Response Meta Block

Every GPT response ends with an invisible metadata block:

Hello, I'm Maya, Jacek's assistant. How can I help you today?

<meta>{"end_call": false, "urgency": "low", "topic": "general inquiry",
 "caller_name": "Jan", "lang": "en"}</meta>

| Field | Purpose |
|---|---|
| end_call | true → AVA hangs up after this response |
| urgency | low / medium / high → emoji in Signal summary |
| topic | Short English description for Signal notifications |
| caller_name | First name if mentioned by caller |
| lang | Two-letter code (pl, en, de) → used to switch STT language |
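
Separating the spoken text from the trailing meta block can be done with a regular expression and `json.loads`. The sketch below is one possible approach under assumptions, not the parser in conversation.py: the function name `split_meta` and the fallback defaults are hypothetical.

```python
import json
import re

# Match a <meta>...</meta> block at the very end of the response.
_META_RE = re.compile(r"<meta>(.*?)</meta>\s*$", re.DOTALL)
# Assumed defaults used when the block is missing or malformed.
_DEFAULT_META = {"end_call": False, "urgency": "low", "topic": "",
                 "caller_name": "", "lang": ""}


def split_meta(response: str) -> tuple[str, dict]:
    """Return (text to speak, metadata dict) for one GPT response."""
    match = _META_RE.search(response)
    if not match:
        return response.strip(), dict(_DEFAULT_META)
    spoken = response[: match.start()].strip()
    try:
        parsed = json.loads(match.group(1))
    except json.JSONDecodeError:
        parsed = {}  # malformed JSON: keep the defaults
    meta = dict(_DEFAULT_META)
    if isinstance(parsed, dict):
        meta.update(parsed)
    return spoken, meta


reply = (
    "Hello, I'm Maya, Jacek's assistant. How can I help you today?\n\n"
    '<meta>{"end_call": false, "urgency": "low", "topic": "general inquiry",\n'
    ' "caller_name": "Jan", "lang": "en"}</meta>'
)
spoken, meta = split_meta(reply)
```

Falling back to defaults instead of raising keeps the call going even if the model occasionally emits a broken block.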

Docker Compose Services

graph LR
    subgraph compose["docker-compose.yml"]
        ava["ava<br/>FastAPI :8000<br/>Python 3.11"]
        signal["signal-cli<br/>REST API :8080<br/>Native mode"]
        caddy["caddy<br/>:80 / :443<br/>Let's Encrypt"]
        tunnel["cloudflared<br/>Cloudflare Tunnel<br/>outbound only"]
    end

    ava -->|depends_on| signal
    caddy -->|depends_on| ava
    tunnel -->|depends_on| ava

    caddy -.-|"profile: caddy"| note1["Open ports 80/443"]
    tunnel -.-|"profile: tunnel"| note2["No open ports"]

    style ava fill:#059669,color:#fff
    style signal fill:#2563eb,color:#fff
    style caddy fill:#d97706,color:#fff
    style tunnel fill:#7c3aed,color:#fff

Environment Variables (complete reference)

| Variable | Default | Description |
|---|---|---|
| **Twilio** | | |
| TWILIO_ACCOUNT_SID | (required) | Twilio account identifier |
| TWILIO_AUTH_TOKEN | (required) | Auth token, also validates webhook signatures |
| TWILIO_PHONE_NUMBER | (required) | Your Twilio virtual number |
| **Signal** | | |
| SIGNAL_CLI_URL | http://signal-cli:8080 | Internal signal-cli API address |
| SIGNAL_SENDER_NUMBER | (required) | Bot's Signal number |
| SIGNAL_RECIPIENT | (required) | Your personal Signal number |
| SIGNAL_LANG | en | Signal notification language (en / pl) |
| **OpenAI** | | |
| OPENAI_API_KEY | (required) | OpenAI API key |
| OPENAI_MODEL | gpt-4o | GPT model for conversation |
| **ElevenLabs** | | |
| ELEVENLABS_API_KEY | (empty) | Leave blank to skip ElevenLabs |
| ELEVENLABS_VOICE_ID | WAhoMTNdLdMoq1j3wf3I | Single multilingual voice ID |
| ELEVENLABS_MODEL | eleven_multilingual_v2 | TTS model |
| **OpenAI TTS** | | |
| OPENAI_TTS_VOICE | nova | Fallback voice (alloy/echo/fable/onyx/nova/shimmer) |
| **Language** | | |
| DEFAULT_STT_LANG | en-US | Fallback STT language |
| **Persona** | | |
| OWNER_CONTEXT | (built-in default) | Assistant persona + owner info (single line, private) |
| **Infrastructure** | | |
| PUBLIC_URL | (required) | HTTPS URL for Twilio to fetch TTS audio |
| COMPOSE_PROFILES | caddy | caddy or tunnel |
| DOMAIN | localhost | Domain for Caddy SSL (caddy profile only) |
| CLOUDFLARE_TUNNEL_TOKEN | (empty) | Tunnel token (tunnel profile only) |
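
A startup check that fails fast on missing required variables and fills in the documented defaults might look like the sketch below. The variable names and defaults are taken from the table above; the function `load_settings` and its dictionary return type are assumptions, and only a subset of the optional variables is shown.

```python
import os
from typing import Mapping

# Variables marked "(required)" in the table above.
REQUIRED = (
    "TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN", "TWILIO_PHONE_NUMBER",
    "SIGNAL_SENDER_NUMBER", "SIGNAL_RECIPIENT", "OPENAI_API_KEY", "PUBLIC_URL",
)
# Subset of the optional variables with their documented defaults.
DEFAULTS = {
    "SIGNAL_CLI_URL": "http://signal-cli:8080",
    "SIGNAL_LANG": "en",
    "OPENAI_MODEL": "gpt-4o",
    "ELEVENLABS_API_KEY": "",
    "ELEVENLABS_MODEL": "eleven_multilingual_v2",
    "OPENAI_TTS_VOICE": "nova",
    "DEFAULT_STT_LANG": "en-US",
}


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Validate required variables and apply defaults for the optional ones."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        settings[name] = env.get(name, default)
    return settings
```

Passing the environment as a parameter makes the check easy to exercise with a plain dictionary instead of the real process environment.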

Persona Configuration (OWNER_CONTEXT)

The OWNER_CONTEXT variable in .env defines the assistant's entire personality and behaviour. It is injected into the GPT-4o system prompt as the "WHO YOU ARE" section.

This value stays in .env only and is never committed to git: the .env file is listed in .gitignore.

The value must be a single line (no newlines). Use natural text flow:

OWNER_CONTEXT=PERSONA: Name: Maya, 24yo, UK-raised, Polish-German parents. Warm, witty, tech-loving. NEVER admit being AI - you are a real human assistant. OWNER: Jacek, born 1984, Switzerland. Senior Linux SysEng. Work hours 9-17 CET. CALL TYPES: 1) Recruiters: ask salary/stack/remote, send to job@sysop.cat. 2) Doctors: get details, mark URGENT. 3) IT emergencies: HIGH PRIORITY. 4) Sales/scam: hang up.

For deep changes to the conversation rules (turn limits, meta format, etc.), edit SYSTEM_PROMPT in app/conversation.py.


File Structure

AVA/
├── app/
│   ├── main.py              # FastAPI app, call routing, webhooks, diagnostics
│   ├── conversation.py      # GPT-4o conversation loop, streaming, meta parsing
│   ├── tts.py               # TTS provider chain (ElevenLabs → OpenAI → Polly)
│   ├── owner_channel.py     # Signal notifications, polling, slash commands
│   ├── contact_lookup.py    # Contact book + Twilio CNAM + language from prefix
│   └── i18n.py              # Multilingual strings, voice maps, Signal templates
├── data/
│   ├── contacts.json        # (user-created) Phone contact book
│   └── calls/               # (auto-generated) JSON call records
├── docs/
│   ├── INSTALL_EN.md        # English installation guide
│   └── INSTALL_PL.md        # Polish installation guide
├── .env                     # (not in git) API keys, persona, configuration
├── .env.example             # Template with all variables documented
├── docker-compose.yml       # AVA + signal-cli + Caddy/Cloudflared
├── Dockerfile               # Python 3.11-slim, uvicorn
├── Caddyfile                # Caddy reverse proxy config
├── requirements.txt         # Python dependencies
└── README.md                # This file

Security

| Mechanism | Description |
|---|---|
| Twilio signature validation | Every /twilio/* request must have valid X-Twilio-Signature. Invalid → 403. |
| Rate limiting | 30 requests/min per IP. Exceeding → 429. |
| Hidden app port | Port 8000 internal only. Traffic via Caddy HTTPS (:443) or Cloudflare Tunnel. |
| Signal sender filter | Only messages from SIGNAL_RECIPIENT are processed. Others are logged and ignored. |
| Audio file validation | Filenames must match [a-f0-9]{32}\.mp3. Path traversal blocked. |
| Security headers | Caddy adds HSTS, X-Frame-Options DENY, X-Content-Type-Options nosniff. |
| Disabled API docs | /docs, /redoc, /openapi.json endpoints are off. |
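
Two of these checks are easy to show with the standard library alone. The sketch below follows Twilio's documented scheme for form-encoded POST webhooks (HMAC-SHA1 over the full URL plus the sorted parameter names and values, Base64-encoded) and the filename pattern from the table. The function names are hypothetical, and AVA itself may rely on the `RequestValidator` class from the Twilio helper library instead.

```python
import base64
import hashlib
import hmac
import re
from typing import Mapping

# Pattern from the table above: 32 lowercase hex characters plus ".mp3".
_AUDIO_NAME = re.compile(r"[a-f0-9]{32}\.mp3")


def is_safe_audio_name(name: str) -> bool:
    """Accept only cache file names; anything with a path component fails."""
    return _AUDIO_NAME.fullmatch(name) is not None


def twilio_signature(auth_token: str, url: str, params: Mapping[str, str]) -> str:
    """Compute the expected X-Twilio-Signature for a form-encoded POST."""
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(auth_token.encode(), payload.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


def is_valid_request(auth_token: str, url: str, params: Mapping[str, str],
                     header_signature: str) -> bool:
    """Constant-time comparison; a False result maps to HTTP 403."""
    expected = twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected, header_signature)
```

The URL passed in must be the public one Twilio called (the PUBLIC_URL host rather than the internal :8000 address), otherwise the signatures will not match behind a reverse proxy.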

Cost Estimate

| Service | Rate | Typical 2-min call |
|---|---|---|
| Twilio Voice | $0.013/min | ~$0.03 |
| Twilio STT (enhanced) | $0.02/15s | ~$0.16 |
| OpenAI GPT-4o | ~$0.01/1k tokens | ~$0.005 |
| ElevenLabs | from $5/month | (30k chars free tier) |
| Twilio CNAM Lookup | $0.01/query | $0.01 (unknown numbers only) |

Typical call: ~$0.20 to $0.25
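
The per-call figures can be reproduced with a few lines of arithmetic. This sketch uses only the rates listed above and assumes that STT is billed for the full call duration. It leaves out ElevenLabs, which is billed by subscription rather than per call.

```python
CALL_MINUTES = 2

twilio_voice = 0.013 * CALL_MINUTES           # $0.013 per minute
twilio_stt = 0.02 * (CALL_MINUTES * 60 / 15)  # $0.02 per 15 s of enhanced STT
gpt_4o = 0.005                                # per-call estimate from the table
cnam_lookup = 0.01                            # charged for unknown numbers only

total_known = twilio_voice + twilio_stt + gpt_4o
total_unknown = total_known + cnam_lookup

print(f"known caller:   ${total_known:.3f}")
print(f"unknown caller: ${total_unknown:.3f}")
```

Speech recognition dominates the per-call cost, so shortening calls (for example via the turn limit) has the largest effect on the total.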


Signal Commands

During a call

| Message | What happens |
|---|---|
| tell him I'll call back tomorrow at 10 | AVA naturally relays this to the caller |
| ask for the order number | AVA asks the caller |
| end / stop / koniec | AVA wraps up the call gracefully |
| status or ? | Confirms whether a call is active |
| Any other text | Forwarded as a generic instruction |

Setup

See the detailed installation guides: docs/INSTALL_EN.md (English) and docs/INSTALL_PL.md (Polish).

Quick start

cp .env.example .env
# Edit .env: fill in API keys, OWNER_CONTEXT, PUBLIC_URL
mkdir -p data/calls
docker compose up -d
curl https://your-domain.com/health

Troubleshooting

# Twilio can't reach the webhook?
curl -I https://your-domain.com/health

# TTS audio not playing?
docker compose logs ava | grep -i tts

# Signal not sending?
docker compose logs ava-signal-cli
curl http://localhost:8080/v1/accounts

# Check active calls
# Send "status" or "/status" to the Signal bot

# Clear TTS cache (after voice change)
docker exec ava sh -c 'rm -f /tmp/tts_cache/*.mp3'

# View recent call logs
ls -lt data/calls/ | head
