# ESPnet2-ASR realtime demonstration

This notebook demonstrates real-time end-to-end ASR (E2E-ASR) with ESPnet2-ASR.

- ESPnet2-ASR: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/asr1

Author: Jiatong Shi ([@ftshijt](https://github.com/ftshijt))

In [None]:
# NOTE: pip may report incompatibility errors caused by preinstalled libraries; these can be ignored
!pip install -q espnet==0.9.9
!pip install -q espnet_model_zoo
!pip install -q pyopenjtalk

## ASR model demo

### Model Selection

Please select a model from the table in [espnet_model_zoo](https://github.com/espnet/espnet_model_zoo/blob/master/espnet_model_zoo/table.csv).

In this demonstration, we show English, Japanese, Spanish, and Mandarin ASR models, in that order.

In [None]:
#@title Choose English ASR model { run: "auto" }

lang = 'en'
fs = 16000 #@param {type:"integer"}
tag = 'Shinji Watanabe/spgispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000_valid.acc.ave' #@param ["Shinji Watanabe/spgispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000_valid.acc.ave", "kamo-naoyuki/librispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000_scheduler_confwarmup_steps40000_optim_conflr0.0025_sp_valid.acc.ave"] {type:"string"}

In [None]:
#@title Choose Japanese ASR model { run: "auto" }

lang = 'ja'
fs = 16000 #@param {type:"integer"}
tag = 'Shinji Watanabe/laborotv_asr_train_asr_conformer2_latest33_raw_char_sp_valid.acc.ave' #@param ["Shinji Watanabe/laborotv_asr_train_asr_conformer2_latest33_raw_char_sp_valid.acc.ave"] {type:"string"}

In [None]:
#@title Choose Spanish ASR model { run: "auto" }

lang = 'es'
fs = 16000 #@param {type:"integer"}
tag = 'ftshijt/mls_asr_transformer_valid.acc.best' #@param ["ftshijt/mls_asr_transformer_valid.acc.best"] {type:"string"}

In [None]:
#@title Choose Mandarin ASR model { run: "auto" }

lang = 'zh'
fs = 16000 #@param {type:"integer"}
tag = 'Emiru Tsunoo/aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave' #@param ["Emiru Tsunoo/aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave"] {type:"string"}
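
Every selection cell assigns the same three variables (`lang`, `fs`, `tag`), so whichever cell ran last decides which model is loaded. One way to make the choice explicit is a small lookup table, sketched below. `MODELS` and `select_model` are hypothetical names introduced here for illustration; the tags are the default values from the cells above, and 16000 is their shared default sampling rate.

```python
# Hypothetical lookup table; the tags are the defaults used in this notebook.
MODELS = {
    "en": "Shinji Watanabe/spgispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000_valid.acc.ave",
    "ja": "Shinji Watanabe/laborotv_asr_train_asr_conformer2_latest33_raw_char_sp_valid.acc.ave",
    "es": "ftshijt/mls_asr_transformer_valid.acc.best",
    "zh": "Emiru Tsunoo/aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave",
}


def select_model(language, sample_rate=16000):
    """Return (lang, fs, tag) for one language code, failing loudly on unknown codes."""
    if language not in MODELS:
        raise KeyError(f"no model registered for {language!r}; choose from {sorted(MODELS)}")
    return language, sample_rate, MODELS[language]


# Pick exactly one language instead of relying on cell execution order.
lang, fs, tag = select_model("en")
```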

### Model Setup

In [None]:
import time
import torch
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text


d = ModelDownloader()
# Downloading and building the model may take a while
speech2text = Speech2Text(
    **d.download_and_unpack(tag),
    device="cuda" if torch.cuda.is_available() else "cpu",  # fall back to CPU when no GPU is available
    minlenratio=0.0,
    maxlenratio=0.0,
    ctc_weight=0.3,
    beam_size=10,
    batch_size=0,
    nbest=1
)
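
The setup cell imports `time` without using it. One plausible use in a real-time demo is to measure the real-time factor (RTF): decoding time divided by audio duration, where a value below 1 means decoding runs faster than the audio plays. The sketch below assumes that definition; `timed` and `real_time_factor` are hypothetical helper names, not part of ESPnet.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def real_time_factor(processing_seconds, num_samples, sample_rate):
    """Return processing time divided by the duration of the audio."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("audio must contain at least one sample")
    return processing_seconds / audio_seconds
```

Once a recording has been loaded as in the cells below, this could be used as `nbests, elapsed = timed(speech2text, speech)` followed by `real_time_factor(elapsed, len(speech), fs)`. On a GPU the first call may include one-off warm-up cost, so timing a second call is likely to be more representative.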



### Recognize our example recordings

In [None]:
!git clone https://github.com/ftshijt/ESPNet_asr_egs.git

import pandas as pd
import soundfile
from IPython.display import display, Audio

# egs.csv lists the language, path, sampling rate, and reference text of each example
egs = pd.read_csv("ESPNet_asr_egs/egs.csv")
print(lang)
for index, row in egs.iterrows():
  # decode only the examples recorded in the selected language
  if row["lang"] == lang:
    speech, rate = soundfile.read("ESPNet_asr_egs/" + row["path"])
    assert fs == int(row["sr"]), "mismatch in sampling rate"
    nbests = speech2text(speech)

    # the first element of the best hypothesis is the recognized text
    text, *_ = nbests[0]
    print(f"Input Speech: ESPNet_asr_egs/{row['path']}")
    # let us listen to samples
    display(Audio(speech, rate=rate))
    print(f"Reference text: {row['text']}")
    print(f"ASR hypothesis: {text}")
    print("*" * 50)


fatal: destination path 'ESPNet_asr_egs' already exists and is not an empty directory.
en
Input Speech: ESPNet_asr_egs/en/1.wav


Reference text: HE SAT UP ABRUPTLY.
ASR hypothesis: He's set up a breast men.
**************************************************
Input Speech: ESPNet_asr_egs/en/2.wav


Reference text: HIS SOUL MUST BE TOO PRIMITIVE TO UNDERSTAND THOSE THINGS, HE THOUGHT.
ASR hypothesis: His sole must be too prominent to understand those things. He thought
**************************************************
Input Speech: ESPNet_asr_egs/en/3.wav


Reference text: THE ONE TIME PAD IS AN ULTIMATE ENCRYPTION IF APPLIED CORRECTLY.
ASR hypothesis: The one-time pad is in the ultimate encryption, if applied correctly.
**************************************************
Input Speech: ESPNet_asr_egs/en/4.wav


Reference text: SHE'LL BE ALL RIGHT.
ASR hypothesis: So it should be a right
**************************************************
Input Speech: ESPNet_asr_egs/en/5.wav


Reference text: THE SHEEP HAD TAUGHT HIM THAT.
ASR hypothesis: The ship at the top in that
**************************************************
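
The outputs above compare reference and hypothesis by eye. A rough way to quantify the gap is the word error rate (WER): the word-level edit distance divided by the number of reference words. The sketch below is an illustration, not the scoring used by ESPnet recipes. The `normalize` and `word_error_rate` helpers are hypothetical, and the normalization (lowercasing, hyphens to spaces, dropping other punctuation) is an assumption that only suits English; for Japanese or Mandarin a character-level rate is the more usual choice.

```python
import re


def normalize(text):
    """Lowercase, turn hyphens into spaces, drop other punctuation, split into words."""
    text = text.lower().replace("-", " ")
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return text.split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "THE ONE TIME PAD IS AN ULTIMATE ENCRYPTION IF APPLIED CORRECTLY.",
    "The one-time pad is in the ultimate encryption, if applied correctly.",
)
print(f"WER: {wer:.2f}")
```

For the third example this counts one substitution ("an" to "in") and one insertion ("the") against 11 reference words, giving a WER of 2/11. Because insertions are counted, WER can exceed 1 when the hypothesis is much longer than the reference.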


### Recognize your own recordings

In [None]:
from google.colab import files
from IPython.display import display, Audio
import soundfile

uploaded = files.upload()

for file_name in uploaded.keys():
  speech, rate = soundfile.read(file_name)
  assert rate == fs, "mismatch in sampling rate"
  nbests = speech2text(speech)
  text, *_ = nbests[0]

  print(f"Input Speech: {file_name}")
  display(Audio(speech, rate=rate))
  print(f"ASR hypothesis: {text}")
  print("*" * 50)

Saving 1.wav to 1.wav


Input Speech: 1.wav


ASR hypothesis: He's set up a breast men.
**************************************************
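
The upload cell stops with an assertion error when a recording's sampling rate differs from `fs`. A quick check before uploading can catch this. The sketch below uses Python's standard `wave` module, which reads only uncompressed PCM WAV files (`soundfile` accepts more formats). The helper names are hypothetical, and the single-channel check is an assumption based on the mono example recordings rather than a documented requirement of `Speech2Text`.

```python
import wave


def describe_wav(path):
    """Read the sampling rate, channel count, and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "seconds": wav.getnframes() / rate,
        }


def find_problems(path, expected_rate=16000):
    """Return human-readable reasons the file may not suit the selected model."""
    info = describe_wav(path)
    problems = []
    if info["sample_rate"] != expected_rate:
        problems.append(
            f"sampling rate is {info['sample_rate']} Hz, expected {expected_rate} Hz"
        )
    if info["channels"] != 1:
        # Assumption: the model is fed single-channel audio.
        problems.append(f"file has {info['channels']} channels, expected 1")
    return problems
```

An empty list from `find_problems("1.wav", expected_rate=fs)` suggests the file can be passed to the upload cell as is; otherwise, resample or downmix the recording first.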
