Speech recognition in Python using TensorFlow 2 and the Keras high-level API.
Convolutional neural networks (CNNs) were invented to classify translation-invariant data such as images. Research has shown that if sound is converted into its spectrogram (or better, its log-Mel spectrogram), a CNN can be trained on those features to build a speech recognition model.
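For illustration, here is a minimal sketch (not this repo's exact code) of turning a WAV file into a log-Mel spectrogram with tf.signal; the frame sizes and the number of Mel bins are assumptions:

```python
import tensorflow as tf

def log_mel_spectrogram(wav_path, sample_rate=16000):
    # Decode a mono 16 kHz WAV file into a 1-D waveform tensor.
    audio_bytes = tf.io.read_file(wav_path)
    waveform, _ = tf.audio.decode_wav(audio_bytes, desired_channels=1)
    waveform = tf.squeeze(waveform, axis=-1)

    # Short-time Fourier transform: 25 ms windows, 10 ms hop at 16 kHz.
    stft = tf.signal.stft(waveform, frame_length=400, frame_step=160,
                          fft_length=512)
    spectrogram = tf.abs(stft)

    # Project the linear-frequency bins onto a Mel filter bank, then log.
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=64,
        num_spectrogram_bins=spectrogram.shape[-1],
        sample_rate=sample_rate)
    mel_spectrogram = tf.matmul(spectrogram, mel_matrix)
    return tf.math.log(mel_spectrogram + 1e-6)
```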
- Clone the repository:
cd /path/to/these/files/
git clone https://github.com/tyukesz/command-words-recognition-keras
- Check the constants.py file for parameters such as the following (a hypothetical sketch follows this list):
- DATASET_URL // URL where the zipped sound files are stored
- SAMPLE_RATE // recommended: 16 kHz
- EPOCHS // number of training epochs
- BATCH_SIZE // use a power of 2: 64, 128, 256, 512, ...
- TESTING and VALIDATION percentages // recommended: 15% each
- WANTED_WORDS // list of command words: ['yes', 'no', 'up', 'down', 'on', 'off', ...]
- VERBOSITY // 0 = silent, 1 = progress bar
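As a rough illustration, constants.py might look like the following; the names come from the list above (the TESTING/VALIDATION variable names are guesses), and every value shown is only an example:

```python
# Hypothetical excerpt of constants.py -- values are examples, not defaults.
DATASET_URL = 'https://example.com/speech_commands.zip'  # placeholder URL
SAMPLE_RATE = 16000            # 16 kHz, as recommended above
EPOCHS = 30
BATCH_SIZE = 512               # a power of 2
TESTING_PERCENTAGE = 15        # assumed variable name
VALIDATION_PERCENTAGE = 15     # assumed variable name
WANTED_WORDS = ['yes', 'no', 'up', 'down', 'on', 'off']
VERBOSITY = 1                  # 0 = silent, 1 = progress bar
```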
- Install the requirements:
pip install -r requirements.txt
- Run training on CPU:
python train.py
- If you are running training for the first time, it is recommended to use the '--force_extract=True' argument:
python train.py --force_extract=True
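For orientation, the flag handling in train.py could look roughly like this; only the flag names are taken from this README, the argparse wiring and the fit() call are assumptions:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--force_extract', default='False',
                    help='Pass True to (re)compute MFCCs and cache them in MFCCS_DIR')
parser.add_argument('--load_model', default=None,
                    help='Name of a saved model to continue training from')
args = parser.parse_args()

# argparse delivers flag values as strings, so '--force_extract=True' is
# compared explicitly rather than cast with bool().
force_extract = args.force_extract.lower() == 'true'

# ... build the datasets and the model (see the sketches below), then e.g.:
# model.fit(train_data, validation_data=val_data,
#           epochs=EPOCHS, batch_size=BATCH_SIZE, verbose=VERBOSITY)
```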
- If your SOUNDS_DIR is empty, the script downloads and extracts the sound files from the provided DATASET_URL.
- Forcing MFCC feature extraction (force_extract=True) saves the sound features in MFCCS_DIR as tensors.
- If the features are already extracted, training starts faster, since there is no need to recompute the MFCCs; they are simply loaded as tensors (see the caching sketch below).
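A sketch of that caching idea (assumed helper names, not the repo's actual functions): compute an utterance's MFCCs once, serialize them to MFCCS_DIR, and load the cached tensor on later runs.

```python
import os
import tensorflow as tf

def cached_mfcc(wav_path, mfccs_dir):
    cache_path = os.path.join(mfccs_dir, os.path.basename(wav_path) + '.bin')
    if os.path.exists(cache_path):
        # Cache hit: load the serialized tensor instead of recomputing.
        return tf.io.parse_tensor(tf.io.read_file(cache_path),
                                  out_type=tf.float32)
    # Cache miss: compute the log-Mel spectrogram (see the earlier sketch),
    # take the first 13 MFCCs per frame, and persist them for the next run.
    log_mel = log_mel_spectrogram(wav_path)
    mfcc = tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :13]
    tf.io.write_file(cache_path, tf.io.serialize_tensor(mfcc))
    return mfcc
```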
- (Optional) You can load a pretrained model for transfer learning:
python train.py --load_model=name_of_your_model
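One plausible shape for that transfer-learning step (the model path, file format, and frozen-layer choice are assumptions, not the repo's actual code): load a saved Keras model, freeze the learned feature extractor, and attach a fresh softmax head for a new word list.

```python
import tensorflow as tf

base = tf.keras.models.load_model('models/name_of_your_model.h5')  # assumed path
for layer in base.layers:
    layer.trainable = False  # keep the learned acoustic features fixed

# Reuse everything up to the last hidden layer and add a new classifier
# head sized for the new WANTED_WORDS list (6 classes here as an example).
features = base.layers[-2].output
outputs = tf.keras.layers.Dense(6, activation='softmax')(features)
model = tf.keras.Model(inputs=base.input, outputs=outputs)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```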
- (Optional) Test your model prediction with WAV file:
python predict_wav.py --load_model=name_of_your_model --wav_path=/path/to/yes.wav --num_of_predictions=2
- The above command will output something like:
- yes (97%)
- left (0.84%)
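Under the hood, the prediction step plausibly amounts to something like this (assumed names, not the script's actual code): run the model on one utterance's MFCC features and print the top num_of_predictions labels with their probabilities.

```python
import tensorflow as tf

WANTED_WORDS = ['yes', 'no', 'up', 'down', 'on', 'off']  # example word list

def print_top_predictions(model, mfcc, num_of_predictions=2):
    """mfcc: one utterance's feature tensor, e.g. from cached_mfcc above."""
    batch = tf.expand_dims(mfcc, axis=0)             # add the batch dimension
    probs = model.predict(batch)[0]                  # softmax over WANTED_WORDS
    top = tf.argsort(probs, direction='DESCENDING')[:num_of_predictions]
    for i in top.numpy():
        print(f'{WANTED_WORDS[i]} ({probs[i] * 100:.2f}%)')
```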
The project uses a sequential Keras model with the following layers (the first dimension of each output shape is the batch size):
Layer (type) | Output Shape | Params |
---|---|---|
conv2d (Conv2D) | (512, 64, 32, 16) | 416 |
max_pooling2d (MaxPooling2D) | (512, 32, 16, 16) | 0 |
conv2d_1 (Conv2D) | (512, 32, 16, 32) | 4640 |
max_pooling2d_1 (MaxPooling2D) | (512, 16, 8, 32) | 0 |
dropout (Dropout) | (512, 16, 8, 32) | 0 |
batch_normalization_v2 (BatchNormalization) | (512, 16, 8, 32) | 128 |
conv2d_2 (Conv2D) | (512, 8, 4, 64) | 18496 |
conv2d_3 (Conv2D) | (512, 8, 4, 128) | 73856 |
max_pooling2d_2 (MaxPooling2D) | (512, 4, 2, 128) | 0 |
dropout_1 (Dropout) | (512, 4, 2, 128) | 0 |
flatten (Flatten) | (512, 1024) | 0 |
dropout_2 (Dropout) | (512, 1024) | 0 |
dense (Dense) | (512, 256) | 262400 |
dense_1 (Dense) | (512, 6) | 1542 |

Total params: 361,478
Trainable params: 361,414
Non-trainable params: 64
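The table can be reproduced with a Sequential definition along these lines; the kernel sizes, strides, and dropout rates are inferred from the printed shapes and parameter counts, so treat this as an approximation rather than the repo's actual code:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    # Fixed batch size of 512 so model.summary() matches the table above.
    layers.Conv2D(16, (5, 5), padding='same', activation='relu',
                  batch_input_shape=(512, 64, 32, 1)),            # 416 params
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), padding='same', activation='relu'),  # 4640
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),                          # rate is an assumption
    layers.BatchNormalization(),                   # 128 params, 64 frozen
    layers.Conv2D(64, (3, 3), strides=2, padding='same',
                  activation='relu'),              # 18496, halves H and W
    layers.Conv2D(128, (3, 3), padding='same', activation='relu'),  # 73856
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),                          # rate is an assumption
    layers.Flatten(),                              # 4 * 2 * 128 = 1024
    layers.Dropout(0.5),                           # rate is an assumption
    layers.Dense(256, activation='relu'),          # 262400
    layers.Dense(6, activation='softmax'),         # 1542, one per word
])
model.summary()
```

With a (64, 32, 1) input per example, this definition yields the same output shapes and the same 361,478 total parameters as the table.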
This project is licensed under the MIT License - see the LICENSE.md file for details.