<a href="https://colab.research.google.com/github/gaurikapoplai21/CCBD-Project/blob/master/e2e_ASR_homework_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EN. 601.467/667 Introduction to Human Language Technology
# End-to-End Speech Recognition Toolkit
**Deadline: October 16, 2019**

- We use [ESPnet](https://github.com/espnet/espnet)
- This homework is designed to give you a hands-on overview of end-to-end speech recognition. Some of the content is not covered in the lecture due to time limitations.
- Adapted from the [Interspeech 2019 tutorial materials](https://github.com/espnet/interspeech2019-tutorial).
  - Special thanks to Shigeki Karita and Tomoki Hayashi.


## Google Colaboratory

Before getting started, get familiar with Google Colaboratory:
https://colab.research.google.com/notebooks/welcome.ipynb

- Online Jupyter notebook environment
    - Can run Python code
    - Can also run Linux commands prefixed with the `!` mark
    - Can use a single GPU (K80)
- What you need to use
    - Internet connection
    - Google account
    - Chrome browser (recommended)

This is a neat Python environment that works in the cloud and does not require you to
set up anything on your personal machine
(it also has some built-in IDE features that make writing code easier).
Moreover, it allows you to copy any existing Colaboratory file, alter it, and share it
with other people. In this homework, we will ask you to copy the current Colaboratory notebook,
complete all the tasks, and share your Colaboratory notebook with us so
that we can grade it.

## Submission

Before you start working on this homework, do the following steps:

1. Select __File > Save a copy in Drive...__. This will give you your own copy that you can change.
2. Follow all the steps in this Colaboratory file and write/change/uncomment code as necessary.
3. Do not forget to occasionally select __File > Save__ to save your progress.
4. After all changes are done and your progress is saved, press the __Share__ button (top right corner of the page), press __get shareable link__, and make sure the option __Anyone with the link can view__ is selected.
5. Paste the link into your submission PDF file so that we can view and grade it.






# 1. Overview

ESPnet provides **bash recipes and a Python library** for speech processing.


## 1.1 Python library overview

- Python 3.6+
- Uses the following neural network libraries
  - PyTorch
  - Chainer


## 1.2 Bash recipe overview

ESPnet supports more than 34 ASR and other speech processing tasks, including:

- Multilingual ASR: en, zh, ja, etc.
- Noise robust and far-field ASR
- Multichannel ASR: joint training with speech enhancement
- Speech Translation: transfer learning from ASR and MT

For more details:
https://github.com/espnet/espnet/tree/master/egs


## 1.3 ASR Performance

On free corpora, ESPnet achieved:

- Aishell (zh): CER test: 6.7%
- Common Voice (en): WER test: 2.3%
- LibriSpeech (en): WER test-clean: 2.6%, test-other 5.7%
- TED-LIUM2 (en): WER test: 8.1%

**Pretrained models are available**

https://github.com/espnet/espnet#asr-results


# 2. Installation

ESPnet depends on the Kaldi ASR toolkit and Warp-CTC.


## 2.1 Installation (on Google colab)

In Google Colab, we can use pre-compiled binaries for a faster startup (about 3 minutes):

In [None]:
# OS setup
!cat /etc/os-release
!apt-get install -qq bc tree sox

# espnet setup
!git clone --depth 5 https://github.com/espnet/espnet
!pip install -q torch==1.1
!cd espnet; pip install -q -e .

# download pre-compiled warp-ctc and kaldi tools
!espnet/utils/download_from_google_drive.sh \
    "https://drive.google.com/open?id=13Y4tSygc8WtqzvAVGK_vRV9GlV7TRC0w" espnet/tools tar.gz > /dev/null
!cd espnet/tools/warp-ctc/pytorch_binding && \
    pip install -U dist/warpctc_pytorch-0.1.1-cp36-cp36m-linux_x86_64.whl

# make dummy activate
!mkdir -p espnet/tools/venv/bin && touch espnet/tools/venv/bin/activate
!echo "setup done."

# 3. AN4 ASR experiments based on ESPnet

- `espnet/egs/*/asr1/run.sh` is an out-of-the-box recipe
- It reproduces our reported results

![image.png](https://github.com/espnet/interspeech2019-tutorial/blob/master/notebooks/interspeech2019_asr/figs/stages.png?raw=1)

- **stage -1**: Download data if the data is available online.
- **stage 0**: Prepare data to make kaldi-style data directory.
- **stage 1**: Extract feature vector, calculate statistics, and normalize.
- **stage 2**: Prepare a dictionary and make json files for training.
- **stage 3**: Train the language model network.
- **stage 4**: Train the E2E-ASR network.
- **stage 5**: Decode speech data using the trained networks and perform scoring (provide the character error rate (CER), word error rate (WER), etc.).


## 3.1 Kaldi-style directory structure

We always organize each recipe as `egs/xxx/asr1/run.sh`.

The most important files and directories:
- `run.sh`: Main script of the recipe.
- `cmd.sh`: Command configuration script that controls how each job is run.
- `path.sh`: Path configuration script. Normally, we do not need to touch it.
- `conf/`: Directory containing configuration files.
- `local/`: Directory containing recipe-specific scripts, e.g. data preparation.
- `steps/` and `utils/`: Directories containing Kaldi tools.


In [None]:
# move to the recipe directory
import os
os.chdir("/content/espnet/egs/an4/asr1")

# check files
!tree -L 1


## 3.2 Data preparation (Stage 0 - 2)

You can control which stages are executed with the `--stage` and `--stop-stage` options. For example, if you add `--stop-stage 2`, the script stops before neural network training.

These stages perform FBANK speech feature extraction, normalization, and text formatting.
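The way `--stage` and `--stop-stage` select stages can be sketched in a few lines of Python. This is only an illustration, not code from ESPnet: the function name `stages_to_run` is hypothetical, and the default bounds (`-1` and `100`) are assumptions about the recipe's defaults. It assumes a stage runs when its number lies between the two options, inclusive.

```python
def stages_to_run(stage=-1, stop_stage=100, all_stages=range(-1, 6)):
    """Return the stage numbers that fall inside [stage, stop_stage].

    The defaults are assumed values for illustration; check run.sh for the real ones.
    """
    return [n for n in all_stages if stage <= n <= stop_stage]


print(stages_to_run(stop_stage=2))           # download, data preparation, features, dictionary
print(stages_to_run(stage=3, stop_stage=3))  # LM training only
print(stages_to_run(stage=5))                # decoding and scoring only
```

Under these assumptions, `--stop-stage 2` covers stages -1 through 2, which is why no network is trained yet.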

![image.png](https://github.com/espnet/interspeech2019-tutorial/blob/master/notebooks/interspeech2019_asr/figs/stages_prep.png?raw=1)

In [None]:
# run stages -1 through 2 (download and data preparation), then stop
!./run.sh --stop-stage 2

Now we are ready to start training!

## 3.3 NN Training (Stage 3 - 4)

You can configure NN training with `conf/train_xxx.yaml`

![image.png](https://github.com/espnet/interspeech2019-tutorial/blob/master/notebooks/interspeech2019_asr/figs/stages_nn.png?raw=1)

### 3.3.1 Run LM

First, we train an RNNLM on the AN4 data.

In [None]:
# this takes about 2 minutes
!./run.sh  --ngpu 1 --stage 3 --stop-stage 3

After we finish the LM training, we can check the perplexity from the log file (`train.log`)

In [None]:
# check the LM perplexity
!cat exp/train_rnnlm_pytorch_lm_word100/train.log | grep perplexity

Please check the `test perplexity` with the above command. The perplexity value should be around 14.
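To interpret this number, recall that perplexity is the exponential of the average negative log-likelihood per token. The sketch below is a generic illustration with made-up probabilities; the function name `perplexity` is hypothetical, and ESPnet's own computation may differ in details such as how end-of-sentence tokens are counted.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities."""
    avg_nll = -sum(log_probs) / len(log_probs)
    return math.exp(avg_nll)


# Toy case: the model assigns probability 1/14 to each of 4 tokens.
toy_log_probs = [math.log(1 / 14)] * 4
print(round(perplexity(toy_log_probs), 2))
```

In this toy case the perplexity is 14, i.e. the model is on average as uncertain as if it were choosing uniformly among 14 words at each position.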

### 3.3.2 ASR NN training

Next, we train the ASR network on the AN4 data.

### ASR training config: RNN with Attention

- You can check the default config with the following command
- (optional) Complete list of common options: https://espnet.github.io/espnet/apis/espnet_bin.html#asr-train-py
- (optional) Complete list of model-specific options: https://espnet.github.io/espnet/_modules/espnet/nets/pytorch_backend/e2e_asr.html#E2E.add_arguments



In [None]:
!cat /content/espnet/egs/an4/asr1/conf/train_mtlalpha0.5.yaml

### Perform ASR NN training


In [None]:
# WARNING: This code takes 5-6 minutes!
!./run.sh  --ngpu 1 --stage 4 --stop-stage 4 --train-config conf/train_mtlalpha0.5.yaml

The training log file is stored in `exp/train_nodev_pytorch_train_mtlalpha0.5/train.log`.

Let's first look at the validation results from the beginning of training with the following command:

In [None]:
!grep -e groundtruth -e prediction exp/train_nodev_pytorch_train_mtlalpha0.5/train.log \
  | sed -e 's/<eos>//g' -e 's/<space>/ /g' | head -n 20

By comparing the groundtruth and prediction lines, you can see that the model does not yet produce meaningful results.

Then, let's check the validation results from the end of training with the following command:

In [None]:
!grep -e groundtruth -e prediction exp/train_nodev_pytorch_train_mtlalpha0.5/train.log \
  | sed -e 's/<eos>//g' -e 's/<space>/ /g' | tail -n 20

Compared with the initial result, the prediction is close to the groundtruth. Training seems to be going well.
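The `sed` filter in the two commands above only makes the log lines easier to read: it removes the `<eos>` symbol and turns each `<space>` symbol into a real space. A Python equivalent of that filter is sketched below; the function name `clean_tokens` and the example string are hypothetical and only mimic the style of the log output.

```python
def clean_tokens(line):
    """Mirror the sed filter: drop <eos> and replace <space> with a space."""
    return line.replace("<eos>", "").replace("<space>", " ")


# Hypothetical token string, not taken from an actual log.
example = "YES<space>FIFTY<space>ONE<eos>"
print(clean_tokens(example))
```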

You can also check the training and validation accuracies from `exp/train_nodev_pytorch_train_mtlalpha0.5/results/acc.png`


In [None]:
from IPython.display import Image, display_png

# directory where the training plots are stored
expdir = "exp/train_nodev_pytorch_train_mtlalpha0.5/results/"
for name in ["acc.png"]:
    print(name)
    display_png(Image(expdir + name, width=500))

Please confirm that both the training and validation accuracies improve with more epochs and eventually converge.

## 3.4 Decoding and evaluation (Stage 5)

The last stage of the ASR recipe decodes the speech data with the trained networks and scores the results.

![image.png](https://github.com/espnet/interspeech2019-tutorial/blob/master/notebooks/interspeech2019_asr/figs/stages_eval.png?raw=1)

### Decoding config
- You can check the default config with the following command
- (optional) Complete list of common options: https://espnet.github.io/espnet/apis/espnet_bin.html#asr-recog-py

In [None]:
!cat conf/decode_ctcweight0.5.yaml

### Language model weight
- `lm-weight:` language model (LM) weight $\lambda$

- without LM

  $\hat{W} = \arg \max _{W} \log p(W|O) = \arg \max _{W}  \log p_{\text{e2e}}(W|O)$


- with LM

  $\hat{W} = \arg \max _{W} \log p(W|O) = \arg \max _{W}  (\log p_{\text{e2e}}(W|O) + \lambda \log p_{\text{lm}}(W))$
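The effect of $\lambda$ can be illustrated with a toy rescoring example. This is a minimal sketch of the formula above, not ESPnet's decoder: the function names, the two hypotheses, and all scores are made up, and a real decoder applies the combination inside beam search rather than on a finished list.

```python
def fused_score(logp_e2e, logp_lm, lm_weight):
    """log p_e2e(W|O) + lambda * log p_lm(W)."""
    return logp_e2e + lm_weight * logp_lm


def best_hypothesis(hyps, lm_weight):
    """Return the text of the hypothesis with the highest fused score."""
    best = max(hyps, key=lambda h: fused_score(h["logp_e2e"], h["logp_lm"], lm_weight))
    return best["text"]


# Made-up scores: the E2E model slightly prefers the misspelled hypothesis,
# while the LM strongly prefers the well-formed one.
hyps = [
    {"text": "HELLO WORD", "logp_e2e": -1.0, "logp_lm": -9.0},
    {"text": "HELLO WORLD", "logp_e2e": -1.5, "logp_lm": -3.0},
]

for lam in (0.0, 0.1, 0.2, 0.3):
    print(lam, best_hypothesis(hyps, lam))
```

With $\lambda = 0$ the choice depends on the E2E score alone; as $\lambda$ grows, the LM score can change which hypothesis wins.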



### Run ASR decoding

In [None]:
# WARNING: This code takes 6 minutes!
# Only recognize the test set
!sed -i.bak -e's/recog_set="train_dev test"/recog_set="test"/' run.sh
# run the actual recognition script
!./run.sh --stage 5 --decode-config conf/decode_ctcweight0.5.yaml --train-config conf/train_mtlalpha0.5.yaml

You can get the Character Error Rate (CER) by checking the `Err` column in the last line of the output above.
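For reference, an error rate of this kind is the edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the reference length. The sketch below is a generic illustration with a made-up sentence pair; the function names are hypothetical, and the recipe's own scoring tool may handle spaces and special tokens differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


ref = "GO TO STOP"   # made-up reference
hyp = "GO TWO STOP"  # made-up recognition result
print("CER:", error_rate(list(ref), list(hyp)))        # characters as tokens
print("WER:", error_rate(ref.split(), hyp.split()))    # words as tokens
```

The same function gives CER or WER depending on whether the sequences are split into characters or words.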

# 4. Compare the results with and without a language model (main homework)

1. Modify the decoding config file with the `sed` command to set `lm-weight` to each of `{0.0, 0.1, 0.2, 0.3}`, and name each new config file appropriately.
2. Perform recognition with each config.

The following is an example of how to change the language model weight:



In [None]:
!sed -e 's/lm-weight: 1.0/lm-weight: 0.1/' conf/decode_ctcweight0.5.yaml | tee conf/decode_ctcweight0.5_lmweight0.1.yaml

In [None]:
!./run.sh --stage 5 --decode-config conf/decode_ctcweight0.5_lmweight0.1.yaml --train-config conf/train_mtlalpha0.5.yaml
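If you prefer Python to `sed`, the same substitution can be sketched as below. This is only an illustration: the function name `set_lm_weight` is hypothetical, the example config text is a made-up excerpt rather than the real file, and the sketch assumes the config contains exactly one line starting with `lm-weight:`. It only prints the results; reading and writing the actual files is left to you.

```python
import re


def set_lm_weight(config_text, weight):
    """Return a copy of config_text with the value on its lm-weight line replaced."""
    new_text, count = re.subn(
        r"^(lm-weight:[ \t]*).*$",
        lambda m: m.group(1) + str(weight),
        config_text,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError("expected exactly one lm-weight line, found %d" % count)
    return new_text


# Hypothetical excerpt of a decoding config, for illustration only.
example_config = "ctc-weight: 0.5\nlm-weight: 1.0\n"

for w in (0.0, 0.1, 0.2, 0.3):
    # File name pattern follows the sed example above.
    print("conf/decode_ctcweight0.5_lmweight%s.yaml" % w)
    print(set_lm_weight(example_config, w))
```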

**Important Note**:
- Unfortunately, the results may change across different Google Colab images. Please finish all of the above experiments within half a day; otherwise, Colab may assign a different image and your results will not be consistent.