Real-time speech-to-text for classrooms, powered by Deepgram Nova-3 and Flask.
- Overview
- Features
- Demo
- Tech Stack
- Architecture
- Project Structure
- Working Principle
- Getting Started
- Environment Variables
- API & Socket Events
- Known Limitations & Roadmap
- Contributing
- Support
- License
NexLearn is a web application that converts live speech into text in real time, built specifically for the classroom. A teacher speaks, and students see the words appear on screen within seconds. No file uploads, no post-processing delays, no third-party apps to install.
Under the hood, audio captured by the browser's MediaRecorder API is streamed in WebM chunks to a Flask/Socket.IO backend, which forwards them over a persistent WebSocket to Deepgram's Nova-3 model. Deepgram streams transcription results back as they are produced, and the server emits them to the client live.
Built as part of an EdTech project exploring how AI can reduce accessibility barriers in education.
| Feature | Description |
|---|---|
| 🎙 Live Transcription | Audio streams from browser to Deepgram via WebSocket — words appear within seconds |
| ⚡ Interim + Final Results | Interim results show words as detected; finals lock in with punctuation and smart formatting |
| 🌍 Auto Language Detection | Deepgram detects spoken language automatically and displays it as a live badge |
| ⏸ Pause / Resume | Pause and resume recording mid-session without losing any transcribed text |
| ⏱ Two-Phase Timer | A connecting clock tracks handshake time; a separate recording timer starts from zero once live |
| 📋 Summary Generation | Generate a concise summary of the transcribed text on demand |
| 💾 One-Click Download | Export full transcription or summary as .txt directly from the browser |
| 👥 Multi-Client Sessions | Each browser tab is an isolated session — multiple users can record simultaneously |
| 🌌 Animated UI | Starfield background, audio visualizer bars, glowing cards, and smooth CSS transitions |
| Layer | Technology | Purpose |
|---|---|---|
| Backend | Python 3.10+, Flask | HTTP server and routing |
| Real-time | Flask-SocketIO, Gevent | Bidirectional WebSocket events |
| Transcription | Deepgram Nova-3 | Live speech-to-text AI model |
| Audio Capture | Browser MediaRecorder API | WebM/Opus audio stream from mic |
| Summarization | Gemini 2.5 Flash | On-demand summary generation from the transcript |
| Frontend | Vanilla JS, Socket.IO client | UI logic and socket communication |
| Styling | CSS3 with custom properties | Animations, theming, responsive layout |
Browser (Client)
│
│ MediaRecorder → WebM chunks (every 2s)
│ Socket.IO (websocket transport)
│
▼
Flask Server (app.py)
│
│ Per-session store (in-memory dict)
│ Background thread per recording session
│
▼
DeepgramSession (features.py)
│
│ asyncio event loop in dedicated thread
│ Persistent WebSocket (wss://api.deepgram.com)
│ Nova-3 model, WebM/Opus auto-detect
│
▼
Deepgram API
│
│ Streams back interim + final transcripts
│
▼
Flask Server → Socket.IO emit → Browser UI
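The "asyncio event loop in dedicated thread" step in the diagram can be sketched as follows. This is a minimal illustration of the general pattern, not the actual code in features.py: the names `LoopThread` and `forward_chunk` are hypothetical, and the coroutine only stands in for sending a chunk over the upstream WebSocket.

```python
import asyncio
import threading


class LoopThread:
    """Runs an asyncio event loop in a background thread (hypothetical helper)."""

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        # Bind the loop to this thread and keep it alive until stop() is called.
        asyncio.set_event_loop(self.loop)
        self.loop.run_forever()

    def submit(self, coro):
        """Schedule a coroutine from any thread; returns a concurrent.futures.Future."""
        return asyncio.run_coroutine_threadsafe(coro, self.loop)

    def stop(self) -> None:
        # Ask the loop to stop from outside its thread, then clean up.
        self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join()
        self.loop.close()


async def forward_chunk(chunk: bytes) -> int:
    # Placeholder for "send this chunk over the persistent WebSocket".
    await asyncio.sleep(0)
    return len(chunk)


if __name__ == "__main__":
    worker = LoopThread()
    future = worker.submit(forward_chunk(b"abcd"))
    print(future.result(timeout=5))
    worker.stop()
```

The synchronous Socket.IO handler stays free of `await`: it hands each chunk to the loop thread with `submit` and returns immediately.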
Key design decision — EBML header prepending:
The browser's MediaRecorder only includes the WebM container header in the first chunk. Deepgram requires a valid WebM stream for every chunk it receives. The server saves the first chunk's header and prepends it to all subsequent chunks before forwarding — this is what makes streaming work reliably without ffmpeg.
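A minimal sketch of the header-prepending idea, assuming the header is everything in the first chunk that precedes the first Cluster element. The class name `HeaderPrepender` is hypothetical and this is not necessarily how app.py implements it. The two byte strings are the standard EBML and Cluster element IDs from the Matroska/WebM format; locating the Cluster by a plain byte search is a simplification.

```python
from typing import Optional

EBML_MAGIC = b"\x1a\x45\xdf\xa3"  # EBML element ID that opens a WebM stream
CLUSTER_ID = b"\x1f\x43\xb6\x75"  # Cluster element ID; media data starts here


class HeaderPrepender:
    """Remembers the WebM header from the first chunk and adds it to later ones."""

    def __init__(self) -> None:
        self.header: Optional[bytes] = None

    def process(self, chunk: bytes) -> bytes:
        if self.header is None:
            if not chunk.startswith(EBML_MAGIC):
                raise ValueError("first chunk does not start with an EBML header")
            cut = chunk.find(CLUSTER_ID)
            # Assumption: if no Cluster is found, treat the whole chunk as header.
            self.header = chunk if cut == -1 else chunk[:cut]
            return chunk  # the first chunk is already a valid stream
        if chunk.startswith(EBML_MAGIC):
            return chunk  # already carries its own header
        return self.header + chunk
```

One `HeaderPrepender` would live per recording session, so a new recording starts with `header` unset and captures a fresh header.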
NexLearn/
├── app.py
├── features.py
├── template/
│   └── static/
│       ├── css/
│       │   └── style.css
│       └── js/
│           └── script.js
├── templates/
│   └── index.html
├── test.py
├── requirements.txt
├── .env
├── .gitignore
└── README.md
- Python 3.10 or higher
- A Deepgram account — the free tier includes enough credits to get started
- A modern browser (Chrome or Edge recommended for best `MediaRecorder` support)
Clone the repository:

    git clone https://github.com/fachiny17/NexLearn.git
    cd NexLearn

Create and activate a virtual environment:

    # macOS / Linux
    python3 -m venv venv
    source venv/bin/activate

    # Windows
    python -m venv venv
    venv\Scripts\activate

Install the dependencies:

    pip install -r requirements.txt

Create your environment file:

    cp .env.example .env

Edit .env and add your Deepgram API key:

    DEEPGRAM_API_KEY=your_deepgram_api_key_here

Get your key at console.deepgram.com → API Keys → Create a New Key.

Start the server:

    python3 app.py

Open http://localhost:5000 in your browser. Allow microphone access when prompted and click Start Recording.

To test from the terminal instead of the browser:

    python3 test.py

Speak into your system microphone directly. Press Ctrl+C to stop. This is useful for verifying your API key and network connection independently of the web UI.
| Variable | Required | Default | Description |
|---|---|---|---|
| `DEEPGRAM_API_KEY` | ✅ Yes | none | API key from console.deepgram.com |
| `PORT` | ❌ No | `5000` | Port the server listens on |
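As an illustration of how these two variables might be read at startup, here is a small sketch. The function name `load_config` and the returned dictionary keys are hypothetical; the real app.py may load its settings differently (for example through a .env loader).

```python
import os
from typing import Mapping


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read the documented variables, failing fast if the required one is missing."""
    api_key = env.get("DEEPGRAM_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("DEEPGRAM_API_KEY is not set")
    # PORT is optional and falls back to the documented default of 5000.
    return {"deepgram_api_key": api_key, "port": int(env.get("PORT", "5000"))}
```

Failing at startup with a clear message is usually easier to debug than a rejected Deepgram handshake later in the session.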
NexLearn communicates entirely over Socket.IO. Here is the full event reference:
Client → server events:
| Event | Payload | Description |
|---|---|---|
| `start_recording` | none | Initiates a Deepgram session for this client |
| `stop_recording` | none | Closes the Deepgram session and returns final text |
| `pause_recording` | none | Pauses audio forwarding |
| `resume_recording` | none | Resumes audio forwarding |
| `audio_chunk` | `bytes` | Raw WebM audio chunk from MediaRecorder |
| `generate_summary` | none | Triggers summary generation from current transcript |
| `download_transcription` | none | Requests transcription text for client-side download |
| `download_summary` | none | Requests summary text for client-side download |
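To make the recording events concrete, the sketch below models a per-session store in which pause and resume gate audio forwarding. It is a simplified stand-in, not the project's code: `Session`, `SessionStore`, and the `forwarded` list (which replaces the real Deepgram connection) are hypothetical names, and only the recording-related events are modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    recording: bool = False
    paused: bool = False
    forwarded: list = field(default_factory=list)  # stands in for the Deepgram link


class SessionStore:
    """In-memory store keyed by Socket.IO session ID (hypothetical sketch)."""

    def __init__(self) -> None:
        self._sessions = {}

    def get(self, sid: str) -> Session:
        return self._sessions.setdefault(sid, Session())

    def handle(self, sid: str, event: str, payload: bytes = b"") -> bool:
        """Apply one client event; returns False when an audio chunk is dropped."""
        session = self.get(sid)
        if event == "start_recording":
            session.recording, session.paused = True, False
        elif event == "pause_recording":
            session.paused = True
        elif event == "resume_recording":
            session.paused = False
        elif event == "stop_recording":
            session.recording = False
        elif event == "audio_chunk":
            if not session.recording or session.paused:
                return False  # chunk is ignored while paused or stopped
            session.forwarded.append(payload)
        else:
            raise ValueError(f"unsupported event: {event}")
        return True
```

Because each session ID maps to its own `Session` object, two browser tabs never see each other's state, which matches the isolated multi-client behavior listed under Features.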
Server → client events:
| Event | Payload | Description |
|---|---|---|
| `connected` | `{ session_id }` | Confirms socket connection |
| `recording_started` | `{ status: 'success' \| 'error' }` | Deepgram handshake result |
| `recording_stopped` | `{ full_text, language }` | Final transcript on stop |
| `transcription_update` | `{ text, full_text, language, is_final }` | Live transcript update |
| `recording_paused` | `{ status }` | Pause confirmed |
| `recording_resumed` | `{ status }` | Resume confirmed |
| `summary_result` | `{ success, summary?, error? }` | Summary result |
| `download_data` | `{ success, content, filename }` | File content for download |
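The `transcription_update` payload combines interim and final results. One way such a payload could be assembled is sketched below. `TranscriptBuffer` is a hypothetical name, and the choice to include the current interim text at the end of `full_text` is an assumption; the real server may build `full_text` from final results only.

```python
class TranscriptBuffer:
    """Accumulates final results and builds transcription_update payloads."""

    def __init__(self) -> None:
        self._finals = []
        self.language = None

    @property
    def full_text(self) -> str:
        # Only finalized segments count toward the stored transcript.
        return " ".join(self._finals)

    def update(self, text: str, is_final: bool, language=None) -> dict:
        if language:
            self.language = language
        cleaned = text.strip()
        if is_final and cleaned:
            self._finals.append(cleaned)
        parts = list(self._finals)
        if not is_final and cleaned:
            parts.append(cleaned)  # assumption: show interim text after the finals
        return {
            "text": text,
            "full_text": " ".join(parts),
            "language": self.language,
            "is_final": is_final,
        }
```

Interim results never enter `_finals`, so a later final result replaces them rather than duplicating their words in the stored transcript.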
Known limitations:
- In-memory sessions: All session data lives in a Python dict. Restarting the server wipes all transcriptions. A database (Redis, PostgreSQL) would be needed for persistence.
- Single language per session: The language is set at session start. Switching mid-recording is not currently supported.
- Free tier cold starts: Free Render instances spin down after inactivity. The first request after a dormant period can take up to 60 seconds.

Roadmap:
- Persistent storage: save transcriptions to a database
- User accounts and session history
- Export to PDF and DOCX
- Speaker diarization — identify who is speaking
- Real-time collaborative view for students
- Keyword highlighting and topic extraction
- Translation to other languages
Contributions are welcome and appreciated. Here is how to get involved:
To report a bug, open an issue on GitHub with:
- A clear description of the bug
- Steps to reproduce it
- What you expected vs what actually happened
- Your OS, browser, and Python version
To suggest a feature, open an issue with the `enhancement` label. Describe the feature and the problem it solves.
To submit a pull request:
- Fork the repository
- Create a feature branch: `git checkout -b feat/your-feature-name`
- Make your changes and commit using Conventional Commits: `git commit -m "feat: add PDF export for transcriptions"`
- Push to your fork: `git push origin feat/your-feature-name`
- Open a Pull Request against `main`
Please keep PRs focused — one feature or fix per PR makes review much faster.
If you run into issues or have questions:
- 🐛 GitHub Issues — Open an issue for bugs and feature requests
- 📝 Medium Article — Read the full build walkthrough for a deep dive into how NexLearn was built
- 📺 YouTube — Watch the demo to see the app in action
If NexLearn helped you or your project, consider giving the repo a ⭐ — it helps others find it.
This project is licensed under the MIT License — free to use, modify, and distribute with attribution.

