American Sign Language (ASL) recognition system using MediaPipe landmarks and deep learning.
```
asl_model/
├── configs/                    # Configuration files
│   ├── config.yaml             # Main configuration
│   └── label_map.json          # Label normalization rules
├── data/                       # Raw datasets
│   ├── kaggle_asl1/            # Kaggle ASL Alphabet dataset
│   ├── kaggle_asl2/            # Kaggle ASL Dataset
│   ├── kaggle_asl_combined/    # Combined Kaggle images (A-Z)
│   ├── microsoft_asl/          # MS-ASL videos and metadata
│   └── personal/               # Personal recordings (optional)
├── artifacts/                  # Generated artifacts
│   ├── landmarks/              # Raw MediaPipe landmarks [T, 543, 4]
│   ├── features/               # Preprocessed features [T, 75, 4]
│   ├── manifests/              # Dataset manifests (CSV)
│   ├── models/                 # Trained model checkpoints
│   │   └── lstm_attention_20251020_151032/  # Best model (86.3%)
│   └── logs/                   # TensorBoard training logs
├── src/                        # Core library code
│   ├── data/                   # Data processing modules
│   │   ├── __init__.py
│   │   └── dataloader.py       # PyTorch dataloader with windowing
│   ├── models/                 # Model architectures
│   │   ├── __init__.py
│   │   └── lstm_model.py       # BiLSTM + Attention
│   └── utils/                  # Utility functions
│       └── __init__.py
├── scripts/                    # Executable scripts
│   ├── 1_data_preparation/     # Download and organize data
│   │   ├── combine_kaggle_asl.py
│   │   ├── build_manifest.py
│   │   ├── assign_splits.py
│   │   └── msasl_*.py          # MS-ASL pipeline scripts
│   ├── 2_preprocessing/        # Extract and preprocess features
│   │   ├── extract_landmarks.py
│   │   ├── preprocess_features.py
│   │   └── filter_valid_features.py
│   ├── 3_training/             # Train models
│   │   └── train_baseline.py
│   └── 4_evaluation/           # Evaluate and visualize
│       ├── analyze_errors.py
│       ├── quick_stats.py
│       └── quick_viz.py
├── tests/                      # Test scripts
│   ├── test_dataloader_with_splits.py
│   ├── test_model.py
│   └── README.md
├── bash_scripts/               # Bash shell scripts
│   ├── download_kaggle_datasets.sh
│   ├── check_feature_validity.sh
│   ├── check_feature_validity_unfiltered.sh
│   ├── train_ensemble.sh
│   └── README.md
├── plans/                      # Project planning documents
├── requirements.txt            # Python dependencies
├── README.md                   # This file
├── TRAINING_GUIDE.md           # Training documentation
├── KAGGLE_DATASETS_INFO.md     # Kaggle dataset details
├── MSASL_PIPELINE.md           # MS-ASL pipeline guide
└── PROJECT_STRUCTURE.md        # Detailed structure docs
```
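The landmark artifacts are dense arrays: 543 MediaPipe Holistic landmarks per frame (33 pose + 468 face + 21 per hand) with x/y/z plus a confidence channel, reduced to 75 landmarks after preprocessing. A hypothetical peek at one sample, assuming the arrays are stored as `.npy` files (the on-disk format and file name are assumptions, not documented behavior):

```python
# Hypothetical inspection of saved artifacts; .npy storage and the sample
# file name are assumptions for illustration.
import numpy as np

lm = np.load("artifacts/landmarks/sample_0001.npy")   # raw landmarks, [T, 543, 4]
feat = np.load("artifacts/features/sample_0001.npy")  # preprocessed, [T, 75, 4]
print(lm.shape, feat.shape)
```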
```bash
git clone https://github.com/davidgit3000/asl_recognition.git
cd asl_recognition
```

Mac/Linux:

```bash
# Create virtual environment
python3.11 -m venv .venv311
source .venv311/bin/activate

# Install dependencies
pip install -r requirements.txt
```

Windows:

```bash
# Create virtual environment
python -m venv .venv311
.venv311\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Kaggle Datasets (required):

Option A - Using script (Mac/Linux):

```bash
# Setup Kaggle API credentials first (~/.kaggle/kaggle.json)
bash bash_scripts/download_kaggle_datasets.sh
```

Option B - Manual download:

- Download ASL Alphabet → extract to data/kaggle_asl1/
- Download ASL Dataset → extract to data/kaggle_asl2/

MS-ASL Videos (optional):

- Download from the MS-ASL Dataset
- Then use the automated scripts (see scripts/1_data_preparation/README.md)
```bash
# Combine Kaggle datasets
python scripts/1_data_preparation/combine_kaggle_asl.py

# Build manifest and assign splits
python scripts/1_data_preparation/build_manifest.py
python scripts/1_data_preparation/assign_splits.py
```

```bash
# Extract MediaPipe landmarks
python scripts/2_preprocessing/extract_landmarks.py

# Preprocess features for training
python scripts/2_preprocessing/preprocess_features.py
```

Verify the dataloader on the assigned splits:

```bash
python tests/test_dataloader_with_splits.py
```

Dataset statistics:

- Kaggle ASL Alphabet: ~78,000 images (A-Z letters)
- Kaggle ASL Dataset: ~9,000 images (A-Z letters)
- MS-ASL: ~190 videos (20 common words)
- Total Raw: 87,200 samples
- Valid Features: 68,671 samples (86.1% extraction success rate)
- Classes: 26 (A-Z letters only)
- Splits: 70% train (48,074), 15% val (10,278), 15% test (10,319)
- Removed: 11,144 zero features + 188 MS-ASL samples (class imbalance)
- Feature extraction success: 86.1% (dual-detection pipeline)
- Class balance: 1,847-3,008 samples per class (well-balanced)
- Most challenging classes: M and N (39.8% detection failure due to the closed-fist handshape)
- Data Collection → Download Kaggle + MS-ASL datasets ✅
- Manifest Building → Create unified CSV with metadata ✅
- Landmark Extraction → Dual-detection (MediaPipe Holistic + Hands fallback; see the sketch after this list) ✅
- Feature Preprocessing → Normalize, smooth, reduce to 75 landmarks ✅
- Quality Filtering → Remove zero features and low-count classes ✅
- Train/Val/Test Split → Stratified 70/15/15 split ✅
- Dataloader → PyTorch DataLoader with windowing ✅
- Model Training → BiLSTM + Attention (86.3% test accuracy) ✅
- Evaluation → Error analysis and confusion matrix 🔄
- Inference Pipeline → Real-time webcam demo 🔄
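A minimal sketch of the dual-detection idea: try MediaPipe Holistic first and fall back to MediaPipe Hands when the hand landmarks are missing. The function name and the zero-fill fallback are illustrative, not the project's exact extract_landmarks.py:

```python
# Dual-detection sketch: Holistic first, Hands as fallback (illustrative).
import cv2
import mediapipe as mp
import numpy as np

mp_holistic = mp.solutions.holistic
mp_hands = mp.solutions.hands

def extract_hand_landmarks(frame_bgr):
    """Return a (21, 3) array of hand landmarks, or zeros on failure."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)

    # Primary detector: Holistic (pose + face + both hands).
    with mp_holistic.Holistic(static_image_mode=True) as holistic:
        result = holistic.process(rgb)
        if result.right_hand_landmarks is not None:
            lm = result.right_hand_landmarks.landmark
            return np.array([[p.x, p.y, p.z] for p in lm])

    # Fallback detector: Hands often succeeds where Holistic misses a hand
    # (e.g. cropped images with no visible pose context).
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(rgb)
        if result.multi_hand_landmarks:
            lm = result.multi_hand_landmarks[0].landmark
            return np.array([[p.x, p.y, p.z] for p in lm])

    return np.zeros((21, 3))  # counted as a "zero feature" and filtered later
```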
- ✅ Dual-detection pipeline: MediaPipe Holistic + Hands fallback (86.1% success)
- ✅ Temporal smoothing: Savitzky-Golay filter (window=5, polynomial=2)
- ✅ Normalization: Centered on torso, scaled by shoulder width
- ✅ Quality filtering: Remove zero features and low-count classes
- ✅ Windowed sequences: 32 frames with configurable stride (see the preprocessing sketch after this list)
- ✅ Data augmentation: Rotation, scale, translation (training only)
- ✅ Class balancing: Weighted loss for imbalanced data
- ✅ Stratified splits: 70/15/15 train/val/test
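A condensed sketch of the normalization, smoothing, and windowing stages above. The shoulder landmark indices and the [T, 75, 4] array layout are assumptions for illustration; preprocess_features.py may differ:

```python
# Illustrative preprocessing: torso-centered normalization, Savitzky-Golay
# smoothing (window=5, polyorder=2), and 32-frame windowing.
import numpy as np
from scipy.signal import savgol_filter

def preprocess(landmarks, win=32, stride=16):
    xyz = landmarks[..., :3]           # drop the confidence channel

    # Center on the torso (shoulder midpoint), scale by shoulder width.
    l_sh, r_sh = xyz[:, 0], xyz[:, 1]  # assumed shoulder indices
    center = (l_sh + r_sh) / 2.0
    scale = np.linalg.norm(l_sh - r_sh, axis=-1, keepdims=True) + 1e-6
    xyz = (xyz - center[:, None]) / scale[:, None]

    # Temporal smoothing along the frame axis.
    if xyz.shape[0] >= 5:
        xyz = savgol_filter(xyz, window_length=5, polyorder=2, axis=0)

    # Fixed-length 32-frame windows with a configurable stride.
    starts = range(0, max(xyz.shape[0] - win, 0) + 1, stride)
    windows = [xyz[s:s + win] for s in starts]
    return np.stack(windows) if windows else xyz[None]
```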
- ✅ BiLSTM + Attention: 3 layers, 512 hidden units, 17M parameters (see the sketch after this list)
- ✅ Regularization: Dropout (0.25), weight decay (1e-5), label smoothing (0.1)
- ✅ Optimization: Adam optimizer with ReduceLROnPlateau scheduler
- ✅ Early stopping: Patience=10 epochs
- ✅ Test Accuracy: 86.29% (22.7× better than random)
- ✅ Validation Accuracy: 86.79%
- ✅ Generalization: Val ≈ Test (no overfitting)
- ✅ Training Time: ~6 hours (100 epochs on Apple M4)
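A compact PyTorch sketch matching the stated hyperparameters; the input flattening and attention form are assumptions, and the actual src/models/lstm_model.py may differ in detail:

```python
# BiLSTM + additive attention classifier: 3 layers, 512 hidden, 26 classes.
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    def __init__(self, input_dim=75 * 3, hidden=512, layers=3, classes=26, p=0.25):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True, dropout=p)
        self.attn = nn.Linear(2 * hidden, 1)   # per-frame attention score
        self.head = nn.Sequential(nn.Dropout(p), nn.Linear(2 * hidden, classes))

    def forward(self, x):                       # x: [B, T, input_dim]
        h, _ = self.lstm(x)                     # h: [B, T, 2*hidden]
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        ctx = (w * h).sum(dim=1)                # weighted temporal pooling
        return self.head(ctx)                   # logits: [B, 26]

# Training setup as described: Adam + ReduceLROnPlateau + label smoothing.
# (Early stopping with patience=10 would live in the training loop.)
model = BiLSTMAttention()
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)
```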
- KAGGLE_DATASETS_INFO.md - Kaggle dataset details
- MSASL_PIPELINE.md - MS-ASL download pipeline
- scripts/README.md - Scripts documentation
- scripts/*/README.md - Detailed docs for each stage
- ✅ Data collection and preprocessing (68,671 samples, 26 classes)
- ✅ Dual-detection landmark extraction (86.1% success rate)
- ✅ Feature engineering and quality filtering
- ✅ BiLSTM + Attention model (86.29% test accuracy)
- ✅ Training pipeline with early stopping and LR scheduling
- 🔄 Error analysis and confusion matrix visualization
- 🔄 Per-class performance evaluation
- CNN-based models (for spatial feature extraction)
  - 2D CNN on landmark heatmaps
  - 3D CNN for spatiotemporal features
  - ResNet/EfficientNet backbones
- Hybrid models (combining spatial + temporal)
  - CNN + LSTM (extract spatial features, then temporal modeling)
  - CNN + Transformer (attention over CNN features)
  - Two-stream networks (appearance + motion)
- Transformer-based models
  - Vision Transformer (ViT) for landmark sequences
  - Temporal Transformer with positional encoding
  - BERT-style pre-training on ASL data
- Ensemble methods
  - Train 5 diverse models (different architectures/hyperparameters)
  - Probability averaging or voting (sketched below)
  - Expected: +2-4% accuracy boost
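A probability-averaging sketch for such an ensemble; `models` is an assumed list of trained nn.Module checkpoints loaded elsewhere:

```python
# Ensemble by averaging softmax probabilities across models (illustrative).
import torch

@torch.no_grad()
def ensemble_predict(models, x):
    """Average class probabilities over models; return predictions and probs."""
    probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models]).mean(dim=0)
    return probs.argmax(dim=-1), probs
```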
- Real-time inference pipeline (sketched below)
  - Webcam integration with MediaPipe
  - Sliding window prediction (30 FPS)
  - Temporal smoothing for stable predictions
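One way the sliding-window loop could look; `extract_features` and `model` are placeholders for the landmark pipeline and trained network, not existing project code:

```python
# Sliding-window webcam inference sketch: buffer the last 32 frames, predict
# each step, and stabilize output with a majority vote over recent predictions.
from collections import Counter, deque

import cv2
import numpy as np
import torch

window = deque(maxlen=32)   # rolling 32-frame feature buffer
recent = deque(maxlen=10)   # recent predictions for temporal smoothing

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    window.append(extract_features(frame))  # placeholder per-frame extractor
    if len(window) == window.maxlen:
        x = torch.from_numpy(np.stack(window)).float().unsqueeze(0)  # [1, 32, D]
        recent.append(model(x).argmax(dim=-1).item())
        stable = Counter(recent).most_common(1)[0][0]  # smoothed prediction
cap.release()
```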
- Demo application
  - GUI with live video feed
  - Top-3 predictions with confidence scores
  - Recording capability for new samples
- Model optimization
  - ONNX export for cross-platform deployment (sketched below)
  - Quantization for faster inference
  - Mobile deployment (TensorFlow Lite)
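An export sketch via torch.onnx.export, assuming the 32-frame flattened-feature windows described above (dimensions and file name are illustrative):

```python
# Export the trained model to ONNX with a dynamic batch dimension.
import torch

dummy = torch.randn(1, 32, 225)  # [batch, frames, features] (illustrative dims)
torch.onnx.export(model, dummy, "asl_bilstm.onnx",
                  input_names=["frames"], output_names=["logits"],
                  dynamic_axes={"frames": {0: "batch"}})
```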
- Word-level recognition
  - Collect more MS-ASL data (500+ samples per word)
  - Sequence-to-sequence models
  - Sentence-level ASL translation
- Transfer learning
  - Fine-tune on personal signing style
  - Few-shot learning for new signs
- Multi-modal learning
  - Combine landmarks + raw video
  - Audio integration (for signed songs)
| Model | Architecture | Params | Expected Acc | Training Time | Notes |
|---|---|---|---|---|---|
| BiLSTM + Attention | 3-layer BiLSTM | 17M | 86.3% ✅ | 6h | Current best |
| CNN + LSTM | ResNet18 + 2-layer LSTM | ~15M | 87-89% | 8h | Spatial + temporal |
| 3D CNN | 3D ResNet | ~30M | 85-88% | 10h | End-to-end spatiotemporal |
| Transformer | 6-layer encoder | ~20M | 88-91% | 12h | Pure attention |
| Ensemble (5 models) | Mixed | ~80M | 89-92% | 30h | Best performance |
- ✅ Baseline (Random): 3.8% (1/26 classes)
- ✅ Current (BiLSTM+Attention): 86.3%
- 🎯 Target (Ensemble): 90%+
- 📊 SOTA (Published research): 92-95%
Educational project for CS 4620.