A real-time sign language recognition and text-to-speech system that converts sign language gestures into spoken audio with emotion-based voice synthesis. Built for presentations and accessibility.
- Real-time Sign Language Recognition: Recognizes both static and dynamic sign language gestures using MediaPipe hand tracking
- Emotion-Based Text-to-Speech: Converts recognized text to speech with emotion-aware voice synthesis using Google Cloud Text-to-Speech (Gemini-TTS)
- Facial Emotion Detection: Detects facial emotions in real-time using MediaPipe Face Landmarker to enhance TTS with appropriate emotional tone
- Stream Cleaning: Intelligent text cleaning pipeline that removes duplicates and trailing noise from recognized gestures
- Presentation Mode: Upload and display PDF presentations while signing
- Speaker Profiles: Customize the voice, speaking rate, pitch, and volume
- Live Captions: Real-time word-by-word caption display as you sign
Backend:
- FastAPI WebSocket server for real-time communication
- Gesture Recognition: SVM classifier for static signs, DTW (Dynamic Time Warping) for dynamic signs (see the sketch after this list)
- Stream Cleaner: Removes consecutive duplicates, near-duplicates, and trailing noise
- TTS Service: Google Cloud Text-to-Speech with Gemini-TTS models for emotion-based synthesis
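
The dynamic-sign matching named above can be pictured as a standard DTW distance between two sequences of per-frame hand features. The sketch below is a minimal NumPy illustration, not the code in `backend/recognizer.py`, which may normalize features and constrain the search differently:

```python
import numpy as np

def dtw_distance(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """Classic O(n*m) dynamic time warping between two gesture
    recordings, where each row is a per-frame feature vector."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # skip a frame in seq_a
                                 cost[i, j - 1],      # skip a frame in seq_b
                                 cost[i - 1, j - 1])  # match the two frames
    return cost[n, m]
```

A dynamic sign is then recognized by comparing the live sequence against each saved template and picking the smallest distance.
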
Frontend:
- React + Vite application
- MediaPipe for hand tracking and facial emotion detection
- WebSocket client for real-time backend communication
- PDF.js for presentation viewing
- Python 3.8+
- Node.js 18+ and npm
- Google Cloud Account with Text-to-Speech API enabled
- Webcam for sign language recognition and emotion detection
- Chrome browser (recommended for best webcam support)
```bash
git clone <repository-url>
cd Cheesehacks2026
cd backend
pip install -r requirements.txt
```
Next, set up Google Cloud Text-to-Speech:

1. Create a Google Cloud project (if you don't have one):
   - Go to the Google Cloud Console
   - Create a new project or select an existing one
2. Enable the Text-to-Speech API:
   - Navigate to "APIs & Services" > "Library"
   - Search for "Cloud Text-to-Speech API"
   - Click "Enable"
3. Create a service account:
   - Go to "APIs & Services" > "Credentials"
   - Click "Create Credentials" > "Service Account"
   - Fill in the service account details
   - Grant the "Cloud Text-to-Speech API User" role
   - Click "Done"
4. Download a service account key:
   - Click on the created service account
   - Go to the "Keys" tab
   - Click "Add Key" > "Create new key"
   - Choose "JSON" format
   - Download the JSON file
5. Place the credentials file:
   - Rename the downloaded JSON file to `signspeak-488902-b29067f64881.json`
   - Place it in the project root directory (same level as `backend/` and `frontend/`)
   - Alternatively, set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable:

     ```bash
     export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/credentials.json"
     ```
```bash
cd backend
python main.py
```

The backend server will start on http://localhost:8000.
```bash
cd frontend
npm install
npm run dev
```

The frontend will start on http://localhost:5173.
- Open your browser and navigate to http://localhost:5173
- Allow camera access when prompted
- The application should connect to the backend automatically (check the connection status in the header)
```
Cheesehacks2026/
├── backend/
│   ├── main.py               # FastAPI WebSocket server
│   ├── recognizer.py         # Gesture recognition (SVM + DTW)
│   ├── cleaner.py            # Text stream cleaning pipeline
│   ├── features.py           # Feature extraction for gestures
│   ├── tts_service.py        # Google Cloud TTS service
│   ├── templates.json        # Saved sign templates
│   └── requirements.txt      # Python dependencies
├── frontend/
│   ├── src/
│   │   ├── App.jsx           # Main application component
│   │   ├── components/       # React components
│   │   │   ├── Calibration.jsx
│   │   │   ├── DashboardHome.jsx
│   │   │   ├── FacialEmotionDetection.jsx
│   │   │   ├── PresentationsPage.jsx
│   │   │   └── SpeakerProfilesPage.jsx
│   │   └── context/          # React contexts
│   │       └── FacialEmotionContext.jsx
│   ├── package.json
│   └── vite.config.js
├── emotion-app/              # Standalone emotion detection app
├── signspeak-488902-b29067f64881.json  # Google Cloud credentials (not in git)
└── README.md                 # This file
```
- FastAPI: Modern Python web framework
- MediaPipe: Hand tracking and gesture recognition
- scikit-learn: Machine learning (SVM classifier)
- Google Cloud Text-to-Speech: Emotion-based voice synthesis
- NumPy: Numerical computations
- WebSockets: Real-time bidirectional communication
- React 19: UI framework
- Vite: Build tool and dev server
- MediaPipe Tasks Vision: Client-side hand and face tracking
- PDF.js: PDF rendering for presentations
To record custom signs:

1. Static Signs (single gesture):
   - Click the "Saved Signs" button
   - Enter a name for your sign
   - Click "Record Static"
   - Perform the gesture and hold it for a moment
   - The system will record multiple samples for better recognition
2. Dynamic Signs (motion-based):
   - Enter a name for your sign
   - Click "Record Dynamic"
   - Perform the motion gesture
   - Click "Stop Recording" when done
- Click the "Present" tab
- Load a PDF presentation (optional) using "Load PDF"
- Start signing - your gestures will be recognized in real-time
- Press the `E` key to flush the current sentence and trigger TTS
- The system will:
  - Clean the recognized text (removing duplicates and noise; see the sketch after this list)
  - Detect your facial emotion
  - Synthesize speech with the appropriate emotional tone
  - Play the audio through your speakers
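
The cleaning step can be pictured as collapsing exact and near-duplicate tokens before synthesis. This is a minimal sketch; the similarity metric and the 0.85 threshold are illustrative assumptions, not the values in `backend/cleaner.py`, and trailing-noise removal is omitted:

```python
from difflib import SequenceMatcher

def clean_stream(tokens, near_dup_ratio=0.85):
    """Drop tokens that exactly or nearly repeat the previous kept token.
    near_dup_ratio is a hypothetical threshold for this sketch."""
    cleaned = []
    for tok in tokens:
        if cleaned and SequenceMatcher(None, tok.lower(), cleaned[-1].lower()).ratio() >= near_dup_ratio:
            continue  # consecutive duplicate or near-duplicate
        cleaned.append(tok)
    return cleaned

print(clean_stream(["hello", "hello", "helo", "world"]))  # ['hello', 'world']
```
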
- Go to the "Calibration" tab
- Follow the on-screen instructions
- The system will calibrate facial emotion detection for your face
In `backend/main.py`:
- `VOTE_WINDOW`: number of frames for gesture voting (default: 10)
- `VOTE_THRESHOLD`: agreement threshold for recognition (default: 0.60)
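
A sketch of how these two parameters could drive per-frame voting (illustrative only; the actual loop in `backend/main.py` may differ):

```python
from collections import Counter, deque
from typing import Optional

VOTE_WINDOW = 10       # frames considered per vote
VOTE_THRESHOLD = 0.60  # fraction of the window that must agree

recent = deque(maxlen=VOTE_WINDOW)

def vote(frame_label: str) -> Optional[str]:
    """Return a gesture only once it dominates a full window of frames."""
    recent.append(frame_label)
    if len(recent) < VOTE_WINDOW:
        return None  # not enough evidence yet
    winner, count = Counter(recent).most_common(1)[0]
    return winner if count / VOTE_WINDOW >= VOTE_THRESHOLD else None
```

A larger window smooths out misclassified frames at the cost of recognition latency.
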
The system uses LINEAR16 (uncompressed WAV) for the highest audio quality. This can be changed in `backend/tts_service.py`:
- `LINEAR16`: best quality, larger files
- `OGG_OPUS`: good quality, smaller files
- `MP3`: standard quality, smaller files
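
For reference, the encoding is selected through the `AudioConfig` of the `google-cloud-texttospeech` client. This standalone snippet shows the shape of that call (the text and output path are placeholders; it is not a copy of `tts_service.py`):

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Hello from SignSpeak"),
    voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16  # or OGG_OPUS / MP3
    ),
)
with open("output.wav", "wb") as f:
    f.write(response.audio_content)  # LINEAR16 responses include a WAV header
```
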
Emotion thresholds can be customized in `frontend/src/components/FacialEmotionDetection.jsx`:
- `STABILITY_FRAMES`: frames required before an emotion change is accepted (default: 10)
- Individual emotion thresholds for: confident, warm, excited, emphatic, serious, passionate, reflective
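
The stability check amounts to debouncing the raw per-frame prediction. Sketched here in Python for brevity; the real logic lives in the JSX component above, and the "neutral" starting state is an assumption of this sketch:

```python
STABILITY_FRAMES = 10  # frames a new emotion must persist before it is accepted

class EmotionStabilizer:
    def __init__(self):
        self.current = "neutral"  # hypothetical initial state
        self.candidate = None
        self.streak = 0

    def update(self, raw_emotion: str) -> str:
        if raw_emotion == self.current:
            self.candidate, self.streak = None, 0   # nothing to debounce
        elif raw_emotion == self.candidate:
            self.streak += 1
            if self.streak >= STABILITY_FRAMES:     # held long enough: switch
                self.current, self.candidate, self.streak = raw_emotion, None, 0
        else:
            self.candidate, self.streak = raw_emotion, 1  # new contender
        return self.current
```
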
TTS Service Not Initializing:
- Verify the `GOOGLE_APPLICATION_CREDENTIALS` environment variable is set
- Check that the credentials file exists and is valid
- Ensure Text-to-Speech API is enabled in Google Cloud Console
WebSocket Connection Failed:
- Verify backend is running on port 8000
- Check firewall settings
- Ensure CORS is properly configured
Camera Not Working:
- Use Chrome browser (best support)
- Ensure camera permissions are granted
- Check that no other application is using the camera
WebSocket Disconnected:
- Verify backend server is running
- Check browser console for connection errors
- Ensure the backend URL in `App.jsx` matches your backend server
No Audio Playing:
- Check browser audio permissions
- Verify TTS service is initialized in backend
- Check browser console for audio errors
- Ensure audio format is supported by your browser
To run the servers during development:

Backend:

```bash
cd backend
python main.py
```

Frontend:

```bash
cd frontend
npm run dev
```

To build the frontend for production:

```bash
cd frontend
npm run build
```

The production build will be in `frontend/dist/`.
[Add your license here]
[Add contribution guidelines here]
- MediaPipe for hand and face tracking
- Google Cloud Text-to-Speech for voice synthesis
- FastAPI for the backend framework
- React and Vite for the frontend