![alt text](kaldi.png "Title")
# Kaldi for Dysarthric Speech Recognition
- Part 1: Installation
- Part 2: Data Preparation
- **Part 3: Training & Evaluation**

### The general workflow is as follows:
1. Prepare data
2. Build an n-gram language model
3. Extract acoustic features (MFCC, Fbank, etc.)
4. Train a monophone acoustic model (context independent)
5. Train a triphone acoustic model (context dependent)
6. Decode and evaluate the acoustic model (AM) via word error rate (WER)

### All of the above steps are written in a `run.sh` recipe, which runs Kaldi-provided scripts
- The provided run script is divided into four stages: language model building, feature extraction, acoustic model training, and decoding
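
How the provided `run.sh` switches between its four stages is not shown here. A common pattern in shell recipes is a stage variable that skips the stages already completed. The sketch below illustrates that pattern only: the variable name `stage`, the function names, and the placeholder bodies are all assumptions, not code from the actual recipe.

```shell
#!/usr/bin/env bash
# Sketch only: the stage functions are placeholders that just print a label.
# In a real recipe each body would call the Kaldi scripts for that stage.

prepare_lm()       { echo "stage 1: language model preparation"; }
extract_features() { echo "stage 2: MFCC feature extraction"; }
train_am()         { echo "stage 3: acoustic model training"; }
decode_and_score() { echo "stage 4: decoding and scoring"; }

# Run every stage whose number is >= the requested starting stage.
run_from_stage() {
  local start=$1
  if [ "$start" -le 1 ]; then prepare_lm; fi
  if [ "$start" -le 2 ]; then extract_features; fi
  if [ "$start" -le 3 ]; then train_am; fi
  if [ "$start" -le 4 ]; then decode_and_score; fi
}

# "stage" is an assumed variable name; default to running everything.
run_from_stage "${stage:-1}"
```

With this layout, setting `stage=3` before running the script would skip language model preparation and feature extraction and resume at acoustic model training.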

#### Example code from run.sh
- Building the language model

- Extracting MFCCs

- Training the acoustic model (monophone and triphone)

- Decoding and scoring the AM

#### Stage 1: Language model preparation

In [1]:
!cd ./kaldi/egs/torgo/ && ./run.sh 


===== PREPARING Language DATA =====


=== Building a language model ...

=== Preparing the dictionary ...

--- Downloading CMU dictionary ...
A    data/local/dict/cmudict/scripts
A    data/local/dict/cmudict/sphinxdict
A    data/local/dict/cmudict/00README_FIRST.txt
A    data/local/dict/cmudict/cmudict-0.7b
A    data/local/dict/cmudict/scripts/test_cmudict.pl
A    data/local/dict/cmudict/scripts/README.txt
A    data/local/dict/cmudict/scripts/make_baseform.pl
A    data/local/dict/cmudict/scripts/CompileDictionary.sh
A    data/local/dict/cmudict/scripts/sort_cmudict.pl
A    data/local/dict/cmudict/scripts/test_dict.pl
A    data/local/dict/cmudict/cmudict.0.6d
A    data/local/dict/cmudict/cmudict.0.7a
A    data/local/dict/cmudict/cmudict-0.7b.phones
A    data/local/dict/cmudict/cmudict-0.7b.symbols
A    data/local/dict/cmudict/sphinxdict/cmudict.0.7a_SPHINX_40
A    data/local/dict/cmudict/sphinxdict/cmudict_SPHINX_40
A    data/local/dict/cmudict/sphinxdict/README.txt
A    data/local/dic

#### Stage 2: MFCC feature extraction

In [2]:
!cd ./kaldi/egs/torgo/ && ./run.sh 


===== PREPARING ACOUSTIC DATA =====


===== FEATURES EXTRACTION =====

utils/validate_data_dir.sh: no such file data/train/feats.scp (if this is by design, specify --no-feats)
fix_data_dir.sh: kept all 15495 utterances.
fix_data_dir.sh: old files are kept in data/train/.backup
   Search for the word 'bold' in http://kaldi-asr.org/doc/data_prep.html
   for more information.
utils/validate_data_dir.sh: no such file data/test/feats.scp (if this is by design, specify --no-feats)
fix_data_dir.sh: kept all 1087 utterances.
fix_data_dir.sh: old files are kept in data/test/.backup
steps/make_mfcc.sh --nj 14 --cmd run.pl data/train exp/make_mfcc/train mfcc
utils/validate_data_dir.sh: Successfully validated data-directory data/train
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
steps/make_mfcc.sh: Succeeded creating MFCC features for train
steps/make_mfcc.sh --nj 1 --cmd run.pl data/test exp/make_mfcc/test mfcc
   Search for the word 'bold' in http:

#### Stage 3: Acoustic model training

In [3]:
!cd ./kaldi/egs/torgo/ && ./run.sh 

steps/train_mono.sh --nj 14 --cmd run.pl --cmvn-opts  /home/abner/kaldi/egs/torgo/data/train /home/abner/kaldi/egs/torgo/data/lang /home/abner/kaldi/egs/torgo/exp/train/mono
steps/train_mono.sh: Initializing monophone system.
steps/train_mono.sh: Compiling training graphs
steps/train_mono.sh: Aligning data equally (pass 0)
steps/train_mono.sh: Pass 1
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 2
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 3
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 4
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 5
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 6
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 7
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 8
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 9
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 10
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 11
steps/train_mon

#### Stage 4: Decoding and scoring

In [4]:
!cd ./kaldi/egs/torgo/ && ./run.sh 

tree-info /home/abner/kaldi/egs/torgo/exp/train/tri4/tree 
tree-info /home/abner/kaldi/egs/torgo/exp/train/tri4/tree 
fsttablecompose /home/abner/kaldi/egs/torgo/data/lang_test/L_disambig.fst /home/abner/kaldi/egs/torgo/data/lang_test/G.fst 
fstminimizeencoded 
fstdeterminizestar --use-log=true 
fstpushspecial 
fstisstochastic /home/abner/kaldi/egs/torgo/data/lang_test/tmp/LG.fst 
-0.0671434 -0.0676291
[info]: LG not stochastic.
fstcomposecontext --context-size=3 --central-position=1 --read-disambig-syms=/home/abner/kaldi/egs/torgo/data/lang_test/phones/disambig.int --write-disambig-syms=/home/abner/kaldi/egs/torgo/data/lang_test/tmp/disambig_ilabels_3_1.int /home/abner/kaldi/egs/torgo/data/lang_test/tmp/ilabels_3_1.63729 /home/abner/kaldi/egs/torgo/data/lang_test/tmp/LG.fst 
fstisstochastic /home/abner/kaldi/egs/torgo/data/lang_test/tmp/CLG_3_1.fst 
0 -0.0676291
[info]: CLG not stochastic.
make-h-transducer --disambig-syms-out=/home/abner/kaldi/egs/torgo/exp/train/tri4/graph/disambig_

In [5]:
!cat /home/abner/kaldi/egs/torgo/exp/train/tri4/decode_test/scoring_kaldi/best_wer

%WER 39.12 [ 1089 / 2784, 279 ins, 112 del, 698 sub ] /home/abner/kaldi/egs/torgo/exp/train/tri4/decode_test/wer_17_0.0


#### Our model has a WER of 39.12%
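
The reported figure can be checked by hand. WER is the number of insertions, deletions, and substitutions divided by the number of reference words: (279 + 112 + 698) / 2784 = 1089 / 2784 ≈ 39.12%. The short sketch below recomputes it from the `best_wer` line; the helper name `wer_from_line` is made up for illustration and is not part of Kaldi.

```shell
# Line copied from the best_wer output above (path omitted).
wer_line='%WER 39.12 [ 1089 / 2784, 279 ins, 112 del, 698 sub ]'

# Recompute WER from the error counts: fields 7, 9 and 11 are the insertion,
# deletion and substitution counts; field 6 is the reference word count
# ("2784," converts to 2784 when used as a number in awk).
wer_from_line() {
  printf '%s\n' "$1" | awk '{ printf "%.2f\n", 100 * ($7 + $9 + $11) / ($6 + 0) }'
}

wer_from_line "$wer_line"   # prints 39.12
```

Because insertions count as errors, WER can exceed 100% when the decoder outputs many extra words.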

### The next step is to improve recognition accuracy, i.e. to lower the WER as much as possible
Things to consider:
- Data augmentation
    - See [speed perturbation](https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/utils/data/perturb_data_dir_speed_3way.sh)
    - Recommended articles [[1]](https://188.166.204.102/archive/Interspeech_2018/pdfs/1751.pdf), [[2]](https://ieeexplore.ieee.org/abstract/document/8683091), [[3]](https://ieeexplore.ieee.org/abstract/document/8462290), [[4]](https://isca-speech.org/archive/Interspeech_2020/pdfs/1161.pdf)
- Increasing the amount of training data
    - See the [UASpeech database](http://www.isle.illinois.edu/sst/data/UASpeech/) (speech database with dysarthric speakers)
    - [LibriSpeech](https://www.openslr.org/12) (large 1000-hour corpus)
- Adjusting Kaldi's internal default parameters
    - See [Improving Acoustic Models in TORGO Dysarthric Speech Database](https://ieeexplore.ieee.org/abstract/document/8283503)
    - e.g. MFCC dimensions, decoding beam size, etc.
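
As one way to explore the decoding beam, the sketch below prints a `steps/decode.sh` command for each candidate beam value rather than running it, so the commands can be reviewed first. The model, graph, and test paths follow the log output above, written relative to the recipe directory. The beam values, the `decode_test_beam*` directory names, and the `decode_cmd` helper are illustrative assumptions, not settings from the original recipe.

```shell
# Dry run: print one steps/decode.sh command per candidate beam.
# Paths follow the log output above; the beam values and the
# decode_test_beam* directory names are illustrative assumptions.
model=exp/train/tri4

decode_cmd() {
  local beam=$1
  echo "steps/decode.sh --nj 1 --cmd run.pl --beam $beam" \
       "$model/graph data/test $model/decode_test_beam$beam"
}

for beam in 10 13 16; do
  decode_cmd "$beam"
done
```

Writing each run to its own decode directory keeps the `best_wer` files separate, so the WER for each beam can be compared afterwards.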