A real-time, multilingual, speech-to-speech AI assistant that listens, understands, and responds with natural voice output. Speech recognition and LLM reasoning run locally; speech synthesis uses the Edge TTS online service.
This project combines:
- Automatic Speech Recognition (ASR) using Whisper & specialized models
- Local LLM reasoning using Ollama
- High-quality Text-to-Speech (TTS) using Microsoft Edge TTS (free, no API key needed)
Supports 20+ languages, including English, Japanese, Chinese, Spanish, French, Arabic, Korean, and more.
- Uses OpenAI Whisper and specialized models
- Japanese uses kotoba-tech/kotoba-whisper-v1.0 for superior accuracy
- Supports automatic normalization, resampling, and long audio segments
- Integrates seamlessly with Ollama’s local LLMs
- Supports any model (Llama, Qwen, Mistral, Phi, etc.)
- Keeps conversation history for natural dialogue
- Uses Microsoft Edge Neural Voices
- Completely free, high-quality, and supports 90+ voices
- No API key required (synthesis needs an internet connection to reach the Edge TTS service)
Preconfigured for high-quality voices + models:
English, Japanese, Chinese, Spanish, French, German,
Italian, Portuguese, Korean, Russian, Arabic, Hindi,
Polish, Turkish, Dutch, Czech, Hungarian, Swedish,
Norwegian, Finnish
Install the Python dependencies:

    pip install torch transformers sounddevice scipy numpy requests edge-tts

(Optional for MP3 → WAV conversion and playback)

    pip install pydub pygame

Download Ollama from: https://ollama.com
Start server:

    ollama serve

Pull the model you want to use (example):

    ollama pull llama3.2:3b

Inside the script:

    LANGUAGE = "japanese"
    OLLAMA_MODEL = "gpt-oss:120b-cloud"
    CUSTOM_VOICE = None

You can switch:
- LANGUAGE → any key from SUPPORTED_LANGUAGES
- OLLAMA_MODEL → any model installed in Ollama
- CUSTOM_VOICE → any Edge TTS voice name (optional)
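
A minimal sketch of how these three settings might be resolved into concrete model and voice choices. The `resolve_settings` helper and the keys of the dictionary it returns are hypothetical, not taken from the script; the dictionary is a trimmed copy that holds only the Japanese entry shown in this README.

```python
# Trimmed copy of the README's dictionary; the real one covers every supported language.
SUPPORTED_LANGUAGES = {
    "japanese": {
        "whisper": "japanese",
        "voice": "ja-JP-NanamiNeural",
        "asr_model": "kotoba-tech/kotoba-whisper-v1.0",
    },
}


def resolve_settings(language, ollama_model, custom_voice=None):
    """Hypothetical helper: turn the three script settings into concrete choices."""
    key = language.strip().lower()
    if key not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {language!r}; choose one of {sorted(SUPPORTED_LANGUAGES)}"
        )
    cfg = SUPPORTED_LANGUAGES[key]
    return {
        "whisper_language": cfg["whisper"],
        "asr_model": cfg["asr_model"],
        # CUSTOM_VOICE, when set, overrides the language's default voice.
        "voice": custom_voice or cfg["voice"],
        "ollama_model": ollama_model,
    }
```

With `CUSTOM_VOICE = None` the language's default voice is used; an unknown `LANGUAGE` fails early with a readable error instead of a `KeyError` deep inside the pipeline.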
List voices for a language:

    SpeechPipelineEdgeTTS.print_available_voices(language_filter="ja")

Run the script:

    python app.py

The program will:
- Record 5 seconds of your speech
- Transcribe it using Whisper
- Send text to the Ollama LLM
- Convert the reply to natural speech
- Play the audio output
You can continue chatting in a loop.
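
The five steps above can be sketched as a single turn function. This is an illustration only: `run_turn` and its four callables (`record`, `transcribe`, `chat`, `speak`) are hypothetical names standing in for the pipeline's own methods, passed in as arguments so the flow can be read and tested without a microphone or a running Ollama server.

```python
from typing import Callable, Dict, List, Optional


def run_turn(
    record: Callable[[], object],
    transcribe: Callable[[object], str],
    chat: Callable[[List[Dict[str, str]], str], str],
    speak: Callable[[str], None],
    history: List[Dict[str, str]],
) -> Optional[str]:
    """Run one record -> transcribe -> LLM -> speak cycle (hypothetical sketch)."""
    audio = record()                      # 1. capture speech
    text = transcribe(audio).strip()      # 2. speech to text
    if not text:
        return None                       # nothing recognised, skip this turn
    reply = chat(history, text)           # 3. ask the LLM, given prior context
    history.append({"role": "user", "content": text})
    history.append({"role": "assistant", "content": reply})
    speak(reply)                          # 4-5. synthesize and play the reply
    return reply
```

Calling `run_turn` inside a `while True:` loop, and breaking on `KeyboardInterrupt`, would give the continuous chat behaviour described above.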
├── SpeechPipelineEdgeTTS
│   ├── ASR (Whisper / Kotoba)
│   ├── Ollama LLM Chat Interface
│   ├── Edge TTS Voice Synthesis
│   ├── Microphone + Audio Playback
│   └── Conversation Memory Handling
└── README.md
Audio is captured from the microphone with sounddevice.
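
A hedged sketch of the capture step, together with the normalization and resampling mentioned in the feature list. The function names, the 0.95 target peak, and the 16 kHz mono format are assumptions (16 kHz is what Whisper models expect as input); `sounddevice` and `scipy` are imported lazily so the pure helpers can be used without an audio device.

```python
from math import gcd

import numpy as np

SAMPLE_RATE = 16_000  # assumption: Whisper-style models take 16 kHz mono input


def record_audio(duration_s: float = 5.0, sample_rate: int = SAMPLE_RATE) -> np.ndarray:
    """Record mono float32 audio from the default microphone."""
    import sounddevice as sd  # lazy import: needs PortAudio and an input device

    frames = int(duration_s * sample_rate)
    audio = sd.rec(frames, samplerate=sample_rate, channels=1, dtype="float32")
    sd.wait()  # block until the recording has finished
    return audio[:, 0]  # drop the channel axis


def normalize_peak(audio: np.ndarray, target_peak: float = 0.95) -> np.ndarray:
    """Scale audio so its loudest sample sits at target_peak; silence is left as is."""
    peak = float(np.max(np.abs(audio))) if audio.size else 0.0
    if peak == 0.0:
        return audio.astype(np.float32)
    return (audio * (target_peak / peak)).astype(np.float32)


def resample(audio: np.ndarray, orig_rate: int, target_rate: int = SAMPLE_RATE) -> np.ndarray:
    """Resample to target_rate with a polyphase filter; no-op if rates already match."""
    if orig_rate == target_rate:
        return audio
    from scipy.signal import resample_poly  # lazy import

    g = gcd(orig_rate, target_rate)
    return resample_poly(audio, target_rate // g, orig_rate // g).astype(np.float32)
```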
The recording is transcribed by a Whisper-based model chosen for the selected language.
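
One way the transcription step could look with the Hugging Face `transformers` ASR pipeline. This is a sketch, not the script's actual code: `pick_asr_model`, `transcribe`, and the choice of `openai/whisper-small` as a fallback are assumptions, and a real implementation would build the pipeline once and reuse it rather than reloading it on every call.

```python
DEFAULT_ASR_MODEL = "openai/whisper-small"  # assumption: fallback when no specialized model is set


def pick_asr_model(language_cfg: dict) -> str:
    """Return the language's specialized ASR model, or the assumed default."""
    return language_cfg.get("asr_model") or DEFAULT_ASR_MODEL


def transcribe(audio, language_cfg: dict, sample_rate: int = 16_000) -> str:
    """Transcribe a mono float32 NumPy array; language_cfg is one SUPPORTED_LANGUAGES entry."""
    from transformers import pipeline  # lazy import: slow to load, downloads weights on first use

    asr = pipeline(
        "automatic-speech-recognition",
        model=pick_asr_model(language_cfg),
        chunk_length_s=30,  # split long recordings into 30-second chunks
    )
    result = asr(
        {"raw": audio, "sampling_rate": sample_rate},
        generate_kwargs={"language": language_cfg["whisper"], "task": "transcribe"},
    )
    return result["text"].strip()
```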
The transcribed text is sent to Ollama with a configurable model, temperature, and conversation memory.
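
A minimal sketch of the Ollama call, assuming the server listens on its default address `http://localhost:11434` and using the non-streaming `/api/chat` endpoint. The helper names and the default temperature of 0.7 are illustrative; the standard library's `urllib` is used here to keep the example dependency-free, and `requests` (which the project installs) would work equally well.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_chat_payload(model, history, user_text, temperature=0.7):
    """Build the request body: prior messages plus the new user message."""
    messages = list(history) + [{"role": "user", "content": user_text}]
    return {
        "model": model,
        "messages": messages,
        "stream": False,  # ask for one complete JSON reply instead of a token stream
        "options": {"temperature": temperature},
    }


def chat(model, history, user_text, temperature=0.7, timeout=120):
    """Send the conversation to Ollama and return the assistant's reply text."""
    payload = build_chat_payload(model, history, user_text, temperature)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["message"]["content"]
```

Building the payload in a separate function keeps the history untouched until a reply has actually arrived, so a failed request does not leave a dangling user message in memory.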
The LLM reply is converted to speech with Edge TTS: the MP3 is saved, converted to WAV, and played.
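
The save, convert, and play sequence might look like the sketch below. The function names and the `reply.mp3` file name are hypothetical. It assumes `pydub` can find ffmpeg for MP3 decoding, and it plays the WAV through `sounddevice`, whereas the script may use pygame instead. Third-party imports are kept inside the functions so the path helper works on its own.

```python
import asyncio
from pathlib import Path


def wav_path_for(mp3_path: str) -> str:
    """Return the WAV path that sits next to the given MP3 file."""
    return str(Path(mp3_path).with_suffix(".wav"))


async def synthesize_mp3(text: str, voice: str, mp3_path: str) -> None:
    """Ask the Edge TTS service for speech and save it as MP3 (needs internet)."""
    import edge_tts

    await edge_tts.Communicate(text, voice).save(mp3_path)


def mp3_to_wav(mp3_path: str) -> str:
    """Convert the MP3 to WAV with pydub (requires ffmpeg on the PATH)."""
    from pydub import AudioSegment

    wav_path = wav_path_for(mp3_path)
    AudioSegment.from_mp3(mp3_path).export(wav_path, format="wav")
    return wav_path


def play_wav(wav_path: str) -> None:
    """Play a WAV file through the default output device and wait until it ends."""
    import sounddevice as sd
    from scipy.io import wavfile

    rate, data = wavfile.read(wav_path)
    sd.play(data, rate)
    sd.wait()


def speak(text: str, voice: str = "ja-JP-NanamiNeural", mp3_path: str = "reply.mp3") -> None:
    """Full TTS step: synthesize, convert, play."""
    asyncio.run(synthesize_mp3(text, voice, mp3_path))
    play_wav(mp3_to_wav(mp3_path))
```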
Each language maps to:
- Whisper language mode
- Specialized ASR model
- Best Edge TTS neural voice
You can customize these through the dictionary:

    SUPPORTED_LANGUAGES = {
        'japanese': {
            'whisper': 'japanese',
            'voice': 'ja-JP-NanamiNeural',
            'asr_model': 'kotoba-tech/kotoba-whisper-v1.0'
        }
    }

Each interaction is stored:

    self.conversation_history.append({"role": "user", "content": text})
    self.conversation_history.append({"role": "assistant", "content": ai_response})

Old messages are cleared automatically so the history does not grow without bound.
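
A sketch of how such automatic clearing could work, assuming a simple cap on the number of stored messages. The `ConversationMemory` class, its `add_exchange` method, and the limit of 20 messages are hypothetical; the script's actual threshold and trimming rule may differ.

```python
class ConversationMemory:
    """Keeps the most recent messages and drops the oldest ones past a limit."""

    def __init__(self, max_messages: int = 20):  # assumption: 20 messages = 10 exchanges
        self.max_messages = max_messages
        self.conversation_history = []

    def add_exchange(self, text: str, ai_response: str) -> None:
        """Store one user/assistant pair, then trim from the front if over the limit."""
        self.conversation_history.append({"role": "user", "content": text})
        self.conversation_history.append({"role": "assistant", "content": ai_response})
        overflow = len(self.conversation_history) - self.max_messages
        if overflow > 0:
            del self.conversation_history[:overflow]

    def reset_conversation(self) -> None:
        """Forget everything, as a manual reset would."""
        self.conversation_history.clear()
```

Using an even limit keeps user and assistant messages paired, so the trimmed history always starts with a user turn.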
Reset manually:

    pipeline.reset_conversation()

Use any Edge TTS voice:

    CUSTOM_VOICE = "ja-JP-KeitaNeural"

Find all voices:

    SpeechPipelineEdgeTTS.print_available_voices()

Make sure Ollama is running:

    ollama serve

Install the playback fallback:

    pip install pygame

Switch to a smaller ASR model:

    'asr_model': 'openai/whisper-small'

- Streaming ASR + Streaming TTS
- Realtime echo cancellation
- Web UI (Gradio / FastAPI)
- Hotword activation (“Hey Assistant…”)
MIT License
Duke Kojo Kongo (CodeJoe)
Data Scientist • AI Engineer • Builder of Intelligent Systems
