This project presents a comprehensive, modular framework for recognizing spoken digits (0-9) from audio recordings. It implements and rigorously compares two powerful, yet fundamentally different, approaches to sequence modeling: classical Hidden Markov Models (HMMs) with Gaussian Mixture Models (GMMs) and modern Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory (LSTM) networks.
The entire system is implemented in Python, following best practices for code modularity, reproducibility, and clear separation of concerns. It includes complete pipelines for data preprocessing, feature extraction (MFCCs), hyperparameter tuning, and final evaluation, providing a thorough, end-to-end solution for a classic pattern recognition problem.
Recognizing spoken words, even isolated digits, is a foundational task in speech processing. The challenge is to build a system that can:
- Process Raw Audio into a meaningful, numerical representation.
- Model Temporal Dynamics of speech, as the sound of a digit evolves over time.
- Accurately Classify digits despite variations in pronunciation, speed, and speaker characteristics.
- Compare and Contrast a classical statistical approach (HMMs) with a deep learning approach (LSTMs) on the same standardized dataset.
This project utilizes the Free Spoken Digit Dataset (FSDD), a public collection of audio recordings for spoken digits.
- Content: 3,000 recordings of digits (0-9) from 6 different speakers.
- Format: 8 kHz `.wav` files.
- Attribution: The dataset is open-source and available on GitHub.
The project includes a script that automatically downloads and prepares this dataset, ensuring a smooth and reproducible setup process. A smaller, secondary dataset (digits/) is also included for the initial preparatory analysis.
The project is organized into a modular structure designed for clarity, experimentation, and maintainability.
- `run_01_preparatory_analysis.py`: Conducts the initial exploratory analysis on the small digit dataset, including feature visualization (histograms, heatmaps, PCA) and a baseline evaluation of classical ML classifiers (e.g., SVM, Random Forest).
- `run_02_pytorch_intro.py`: A self-contained script demonstrating the fundamentals of RNNs in PyTorch by training models to map sine waves to cosine waves.
- `run_03_hmm_classification.py`: The primary script for executing the end-to-end HMM pipeline, including hyperparameter tuning and final evaluation on the FSDD.
- `run_04_lstm_classification.py`: The primary script for executing the end-to-end LSTM pipeline, including model training, evaluation, and a comparison of performance with and without sequence packing.
- `data/`: The small, initial `digits` dataset.
- `outputs/`: All generated artifacts (plots, confusion matrices), organized by experiment.
- `src/`: The core logic of the project, structured as a Python library (`patrec`).
  - `preparatory/`: Modules for the initial data analysis.
  - `pytorch_intro/`: Modules for the self-contained PyTorch RNN demonstration.
  - `main_assignment/`: The core modules for the main classification task.
    - `common/`: Shared utilities for data preparation (FSDD) and plotting.
    - `hmm/`: The complete pipeline for GMM-HMM model creation, training, and evaluation.
    - `lstm/`: The complete pipeline for the PyTorch LSTM model, including the `Dataset` class, model architecture, and training logic.
This project directly compares a classical statistical model with a modern deep learning model.
The first model is a Gaussian Mixture Model - Hidden Markov Model (GMM-HMM), a cornerstone of traditional automatic speech recognition.
- Architecture: A separate HMM is trained for each digit (0-9). Each HMM is a left-to-right model, meaning sequences can only progress forward through states, capturing the temporal nature of speech.
- Emission Probabilities: The probability of observing a particular audio feature (an MFCC vector) in a given state is modeled by a Gaussian Mixture Model (GMM). This allows each state to represent complex, multi-modal feature distributions.
- Training: The models are trained using the Baum-Welch algorithm (a form of Expectation-Maximization) to learn the GMM parameters and transition probabilities.
- Inference: To classify a new recording, the system calculates the likelihood of the audio sequence being generated by each of the 10 HMMs. The digit corresponding to the HMM with the highest likelihood is chosen as the prediction.
- Tuning: A systematic grid search is performed to find the optimal number of HMM states and GMM components.
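The inference rule described above reduces to an argmax over per-model scores. Here is a library-agnostic sketch of that decision step; the scores below are hypothetical stand-ins for the log-likelihoods each digit's HMM would assign to one recording.

```python
import numpy as np

def classify(log_likelihoods):
    """Pick the digit whose HMM scored the sequence highest.

    log_likelihoods: array of 10 values, one per digit model, e.g. the
    result of scoring the same MFCC sequence against each trained HMM.
    """
    return int(np.argmax(log_likelihoods))

# Hypothetical scores for one recording: the model for digit 7 wins.
scores = np.array([-512.3, -498.1, -530.7, -505.9, -520.4,
                   -515.2, -509.8, -471.6, -525.0, -518.3])
print(classify(scores))  # → 7
```

Working in log space keeps the comparison numerically stable, since raw sequence likelihoods underflow quickly for long feature sequences.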
The second model is a Long Short-Term Memory (LSTM) network, a type of Recurrent Neural Network (RNN) specifically designed to handle long-range dependencies in sequence data.
- Architecture: A single, powerful LSTM network is trained to classify all 10 digits. The architecture consists of multiple stacked, bidirectional LSTM layers followed by a final classification layer. Bidirectionality allows the model to process the audio sequence in both forward and backward directions, capturing richer contextual information.
- Efficiency: The implementation leverages `pack_padded_sequence`, a PyTorch utility that allows the RNN to skip computations over padded elements in a batch of variable-length sequences. This dramatically improves training speed without sacrificing accuracy.
- Training: The model is trained using the AdamW optimizer and Cross-Entropy Loss. Early stopping is employed to monitor validation loss and prevent overfitting, ensuring the best model is saved.
- Regularization: Dropout is applied between LSTM layers and before the final layer to improve generalization.
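The padding-and-packing step can be sketched as follows, assuming PyTorch is installed. The tensor sizes and hyperparameters (13 features, hidden size 32, two layers) are illustrative, not the project's actual configuration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Three variable-length "MFCC" sequences, each of shape (time, n_features).
seqs = [torch.randn(t, 13) for t in (50, 30, 20)]
lengths = torch.tensor([s.shape[0] for s in seqs])

# Pad to a common length, then pack so the LSTM skips the padding.
padded = pad_sequence(seqs, batch_first=True)       # (3, 50, 13)
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=True)  # sorted longest-first

lstm = torch.nn.LSTM(input_size=13, hidden_size=32,
                     num_layers=2, batch_first=True,
                     bidirectional=True, dropout=0.2)
out_packed, (h_n, c_n) = lstm(packed)
# h_n: (num_layers * 2 directions, batch, hidden); the last layer's two
# directions are typically concatenated to feed the classification head.
print(h_n.shape)  # torch.Size([4, 3, 32])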
- Python 3.10+
- Git (for automatic dataset cloning)
1. Clone the repository:

   ```bash
   git clone https://github.com/antonisbaro/digitron
   cd digitron
   ```

2. Set up a virtual environment (recommended):

   ```bash
   python -m venv .venv
   # On Windows: .venv\Scripts\activate
   # On macOS/Linux: source .venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
The project is run via Python scripts from the command line. It is recommended to run them in their numbered order.
1. Run the Preparatory Analysis:

   ```bash
   python run_01_preparatory_analysis.py
   ```

2. Run the PyTorch Introduction:

   ```bash
   python run_02_pytorch_intro.py
   ```

3. Run the HMM Experiment (this will automatically download the FSDD dataset):

   ```bash
   python run_03_hmm_classification.py
   ```

4. Run the LSTM Experiment (this will automatically download the FSDD dataset):

   ```bash
   python run_04_lstm_classification.py
   ```
All results, plots, and model checkpoints will be saved to the `outputs/` directory.
- Language: Python
- Core Libraries: Pandas, NumPy, Scikit-learn
- Deep Learning: PyTorch
- Statistical Modeling: Pomegranate
- Audio Processing: Librosa
- Plotting: Matplotlib, Seaborn
- Utilities: TQDM