## Dependencies

Installation takes approximately 5-6 minutes.

In [None]:
%%time
%%bash

git clone https://github.com/NVIDIA/NeMo
cd NeMo
git checkout v1.20.0
for f in requirements/requirements*.txt; do pip3 install --disable-pip-version-check --no-cache-dir -r "$f"; done
pip install -e .
pip install huggingface_hub==0.23.2
pip install gdown
pip install jiwer

# restart session



---

## Hybrid CTC & RNN-t model


* Hybrid RNNT-CTC models are a group of models with both RNNT and CTC decoders. Training a unified model speeds up convergence for the CTC models and lets the user work with a single model that acts as both a CTC and an RNNT model. This category can be used with any of the ASR models.[1]

* So we get speed from the CTC decoder and quality from the RNN-t decoder. This is especially useful for production systems that need to show partial predictions on screen while people are talking and then produce a final prediction. The partial requests are usually handled by the fast CTC decoder, and the final prediction is made by the RNN-t decoder.
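
The two-pass pattern from the second bullet can be sketched with plain callables. This is a minimal illustration, not NeMo API: `two_pass_transcribe`, `fast_decode`, and `final_decode` are hypothetical names, where `fast_decode` stands in for a cheap CTC pass over the audio received so far and `final_decode` for the RNN-t pass over the full utterance.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def two_pass_transcribe(
    chunks: Iterable[List[float]],
    fast_decode: Callable[[List[float]], str],   # assumed: cheap CTC-style pass
    final_decode: Callable[[List[float]], str],  # assumed: slower RNN-t-style pass
) -> Iterator[Tuple[str, str]]:
    """Yield ("partial", text) after every chunk and ("final", text) at the end."""
    received: List[float] = []
    for chunk in chunks:
        received.extend(chunk)
        # Partial hypothesis for on-screen display while audio is still arriving.
        yield "partial", fast_decode(received)
    # One final, higher-quality hypothesis once the utterance is complete.
    yield "final", final_decode(received)


# Toy usage: the "decoders" only report how many samples they have seen.
events = list(
    two_pass_transcribe(
        chunks=[[0.1, 0.2], [0.3]],
        fast_decode=lambda audio: f"fast:{len(audio)}",
        final_decode=lambda audio: f"final:{len(audio)}",
    )
)
print(events)
```

In a real system both callables would share the same encoder output, which is the main saving of a hybrid model.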

\

<img alt="hybrid" src="https://drive.google.com/uc?id=1e8oe4CfBf8UmvWdm--DK_q126EK9A9tg" width=400>

\

* More about hybrid models:
  * [[1] NeMo docs](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#hybrid-transducer-ctc)
  * [[2] RNNT + LAS](https://arxiv.org/pdf/1908.10992)
  * [[3] CTC + LAS](https://arxiv.org/pdf/1609.06773)
  * [[4] Hybrid Rescoring 1](https://arxiv.org/pdf/2008.13093)
  * [[5] Hybrid Rescoring 2](https://arxiv.org/pdf/2101.11577)




---



In [1]:
import re
import typing as tp

import torch
import torch.nn as nn
import torchaudio
import soundfile as sf
from jiwer import wer
from tqdm import tqdm
import IPython.display as dsp
from sentencepiece import SentencePieceProcessor

from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel

BLANK_IND: int = 1024


def clear(text: str):
  return re.sub(r'[^A-Za-z +]', '', text.lower())

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

model = EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="stt_en_fastconformer_hybrid_large_pc"
).to(device).eval()

TOKENIZER: SentencePieceProcessor = model.tokenizer.tokenizer

dsp.clear_output()

#### Audio

In [None]:
! gdown https://drive.google.com/uc?id=1eNQt0R7Dm71utkLuRhc9wjTjdWoYrwsk -O best-song-ever.wav

In [None]:
dsp.display(dsp.Audio('best-song-ever.wav'))

In [None]:
transcription = clear((
    "Never gonna give you up, never gonna let you down "
    "Never gonna run around and desert you "
    "Never gonna make you cry, never gonna say goodbye "
    "Never gonna tell a lie and hurt you"
))
transcription

### RNN-t inference

**RNN-t** modules:

<img alt="rnnt" src="https://www.mdpi.com/symmetry/symmetry-11-01018/article_deploy/html/images/symmetry-11-01018-g004.png" width=400>


The encoder can be arbitrary, for example an RNN, a DeepSpeech 2 encoder or a Conformer encoder. It can be streamable or non-streamable, and the whole model is then streamable or non-streamable respectively.

The inference stage looks like this:

<img alt="rnnt" src="https://drive.google.com/uc?id=1EoSRLSSIg2fSge0yVKakKnbcgWUlCVJJ" width=700>

The prediction network consists of two required parts: an embedder and an RNN.
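
As a rough sketch of these two parts, the snippet below wires an `nn.Embedding` into an `nn.LSTM` with toy sizes (the sizes and variable names are assumptions for illustration, not the pretrained model's configuration). It also shows that feeding labels one at a time while carrying the LSTM state reproduces the full-sequence output, which is what makes step-by-step decoding possible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen only for illustration.
vocab_with_blank, emb_dim, hidden = 11, 8, 8

embed = nn.Embedding(vocab_with_blank, emb_dim, padding_idx=vocab_with_blank - 1)
lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden, num_layers=1)  # time-major by default

y = torch.tensor([[3, 7, 1]])        # (bs=1, seq_len=3) label ids

# Full-sequence pass: embed, switch to (seq_len, bs, dim) for the LSTM, switch back.
emb = embed(y).transpose(0, 1)       # (seq_len, bs, emb_dim)
g, state = lstm(emb, None)           # g: (seq_len, bs, hidden)
g = g.transpose(0, 1)                # (bs, seq_len, hidden)

# Step-by-step pass: one label at a time, carrying the LSTM state forward.
step_state, outs = None, []
for u in range(y.size(1)):
    step_emb = embed(y[:, u:u + 1]).transpose(0, 1)   # (1, bs, emb_dim)
    out, step_state = lstm(step_emb, step_state)
    outs.append(out.transpose(0, 1))                  # (bs, 1, hidden)
g_step = torch.cat(outs, dim=1)                       # (bs, seq_len, hidden)

print(g.shape, torch.allclose(g, g_step, atol=1e-5))
```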

<img alt="rnnt" src="https://drive.google.com/uc?id=1SaMiv5F3bDRngNS6ot-TBi12Xo6IFOsX" width=700>

The joint network can have arbitrary complexity and architecture; in the simplest case it is a small feed-forward network.
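
A minimal sketch of such a joint network with toy sizes follows. It assumes the two projections are summed with broadcasting (consistent with a final layer that takes `hidden_size` inputs) rather than concatenated; all sizes and variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, U = 1, 5, 3             # batch, acoustic frames, label positions
H1, H2, H, V = 6, 4, 8, 10    # toy encoder, prediction, joint and vocab sizes

enc_proj = nn.Linear(H1, H)
pred_proj = nn.Linear(H2, H)
joint_net = nn.Sequential(nn.ReLU(), nn.Linear(H, V + 1))  # +1 for blank

f = torch.randn(B, H1, T)     # acoustic context
g = torch.randn(B, H2, U)     # language context

# Project both contexts into the same space and add a broadcast axis to each.
f_proj = enc_proj(f.transpose(1, 2)).unsqueeze(2)    # (B, T, 1, H)
g_proj = pred_proj(g.transpose(1, 2)).unsqueeze(1)   # (B, 1, U, H)

joint_inp = f_proj + g_proj                          # broadcasts to (B, T, U, H)
logits = joint_net(joint_inp)                        # (B, T, U, V + 1)
print(logits.shape)
```

Every (frame, label position) pair gets its own distribution over the vocabulary plus blank, which is why the output has both a `T` and a `U` axis.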

<img alt="rnnt" src="https://drive.google.com/uc?id=11qccpDLBuAEXvsdkOIB9UZbVXqwD4zJC" width=700>





In [None]:
# Read wav
wav, sr = torchaudio.load('best-song-ever.wav')
wav = wav.to(device)
assert sr == 16_000, sr

# Get mel spectrogram
input_signal_length = torch.tensor([wav.size(-1)], dtype=torch.int32, device=device)
spectrogram, spec_length = model.preprocessor.forward(
    input_signal=wav,
    length=input_signal_length,
)

# Get encoded acoustic embeddings
acoustic_embs, acoustic_embs_length = model.encoder.forward(
    audio_signal=spectrogram, length=spec_length
)

In [None]:
acoustic_embs.size()

#### CTC inference

Let's use the `ctc_decode` function from the previous seminar and make a prediction by argmax.
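
For reference, greedy CTC post-processing merges consecutive repeated ids and then drops blanks. The sketch below is one possible way to do this; `ctc_greedy_collapse` is a hypothetical name, and the blank id is assumed to be the `BLANK_IND` value defined earlier in this notebook.

```python
from itertools import groupby
from typing import List

BLANK_IND = 1024  # assumed: same blank id as in the notebook


def ctc_greedy_collapse(inds: List[int], blank: int = BLANK_IND) -> List[int]:
    """Merge consecutive repeats, then remove blank tokens."""
    merged = [token for token, _ in groupby(inds)]
    return [token for token in merged if token != blank]


# Repeats separated by a blank are kept as two separate tokens.
print(ctc_greedy_collapse([1024, 5, 5, 1024, 5, 7, 7, 1024]))  # [5, 5, 7]
```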

In [None]:
def ctc_decode(inds: list):
    # your code here
    raise NotImplementedError()

In [None]:
logits = model.ctc_decoder.forward(encoder_output=acoustic_embs)

inds = logits.argmax(-1).tolist()[0]
inds = ctc_decode(inds)

ctc_hypothesis = model.tokenizer.tokenizer.decode_ids(inds)
ctc_hypothesis = clear(ctc_hypothesis)
ctc_hypothesis

In [None]:
wer(transcription, ctc_hypothesis)

#### RNN-t inference

Use the `PredictionNetwork` and `JointNetwork` modules for RNN-t decoding. It is sometimes useful to limit the number of tokens emitted per frame; try this in your code with the `MAX_SYMBOLS_PER_FRAME: int` variable.

In [None]:
class PredictionNetwork(nn.Module):
  def __init__(
      self,
      input_size: int,
      hidden_size: int,
      num_layers: int,
      dropout: float,
      num_embeddings: int,
      embedding_dim: int,
      padding_idx=None,
    ):
    super().__init__()

    # https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
    self.embed = nn.Embedding(
        num_embeddings=num_embeddings,
        embedding_dim=embedding_dim,
        padding_idx=padding_idx,
    )
    # https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
    self.lstm = nn.LSTM(
        input_size=input_size,
        hidden_size=hidden_size,
        num_layers=num_layers,
        dropout=dropout,
    )
    self.dropout = nn.Dropout(dropout) if dropout else None

  def forward(
      self,
      y: torch.Tensor,
      state: tp.Optional[tp.Tuple[torch.Tensor, ...]] = None,
    ) -> tp.Tuple[torch.Tensor, tp.Tuple[torch.Tensor, ...]]:
    """
    input:
      y (bs, seq_len): label ids from the tokenizer
      state: LSTM state, can be None at the first step (see torch docs)
    output:
      g (bs, seq_len, hid_dim): language context
      state: LSTM state
    """
    # Get embeddings for labels
    # your code goes here

    # Process them with the LSTM
    # (note: nn.LSTM expects (seq_len, bs, dim) unless batch_first=True)
    # your code goes here
    raise NotImplementedError()


In [None]:
prediction_network = PredictionNetwork(
    input_size=640,
    hidden_size=640,
    num_layers=1,
    dropout=0.2,
    num_embeddings=1025,
    embedding_dim=640,
    padding_idx=BLANK_IND,
).to(device).eval()

prediction_network.embed.load_state_dict(
    model.decoder.prediction.embed.state_dict()
)
prediction_network.lstm.load_state_dict(
    model.decoder.prediction.dec_rnn.lstm.state_dict()
)

In [None]:
class JointNetwork(nn.Module):
  def __init__(
      self,
      pred_emb_size: int,
      enc_emb_size: int,
      hidden_size: int,
      dropout: float,
      vocab_size: int,
    ):
    super().__init__()

    self.pred_proj = nn.Linear(
        pred_emb_size, hidden_size
    )
    self.enc_proj = nn.Linear(
        enc_emb_size, hidden_size
    )
    self.joint_net = nn.Sequential(
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden_size, vocab_size + 1)
    )

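  def forward(
      self,
      encoder_outputs: torch.Tensor,
      decoder_outputs: torch.Tensor,
      return_inp: bool = False,
    ) -> tp.Union[torch.Tensor, tp.Tuple[torch.Tensor, torch.Tensor]]:
    """
    input:
      encoder outputs (B, H1, T): acoustic context
      decoder outputs (B, H2, U): language context
      return_inp: if True, also return the tensor fed into `joint_net`
    output:
      joint activation (B, T, U, V+1),
      or (joint input, joint activation) if return_inp is True
    """
    # Project the outputs of the encoder/decoder into the latent space and sum them
    # (broadcast over T and U; `joint_net` expects `hidden_size` features)
    # your code goes here

    # Project the resulting state into the vocab distribution space
    # your code goes here
    raise NotImplementedError()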


In [None]:
joint_network = JointNetwork(
    pred_emb_size=640,
    enc_emb_size=512,
    hidden_size=640,
    dropout=0.2,
    vocab_size=1024,
).to(device).eval()

joint_network.pred_proj.load_state_dict(
    model.joint.pred.state_dict()
)
joint_network.enc_proj.load_state_dict(
    model.joint.enc.state_dict()
)
joint_network.joint_net.load_state_dict(
    model.joint.joint_net.state_dict()
)

Write the `rnnt_decoder_inference` function:

<img alt="rnnt" src="https://drive.google.com/uc?id=1EoSRLSSIg2fSge0yVKakKnbcgWUlCVJJ" width=700>


In [None]:
MAX_SYMBOLS_PER_FRAME: int = 100

In [None]:
@torch.inference_mode()
def rnnt_decoder_inference(
    prediction_network: nn.Module,
    joint_network: nn.Module,
    f: torch.Tensor,  # acoustic context
) -> tp.List[int]:
    """
    f - torch.tensor (B, H1, T): acoustic context
    """
    bs, _, T = f.size()
    assert bs == 1, bs

    y_cur = torch.tensor([[BLANK_IND]], dtype=torch.long, device=device)
    prediction_network_state = None

    for time_step in tqdm(range(T)):
        # your code here
        pass

    raise NotImplementedError()


In [None]:
decoded_output = rnnt_decoder_inference(
    prediction_network=prediction_network,
    joint_network=joint_network,
    f=acoustic_embs,
)

In [None]:
rnnt_hypothesis = clear(TOKENIZER.decode_ids(decoded_output))
rnnt_hypothesis

In [None]:
transcription

In [None]:
wer(transcription, rnnt_hypothesis)

### RNN-t training step

In [None]:
transcription

In [None]:
transcription_ids = TOKENIZER.encode(transcription)
transcription_ids = torch.tensor(transcription_ids, dtype=torch.long, device=device).unsqueeze(0)
transcription_ids

In [None]:
# Read wav
wav, sr = torchaudio.load('best-song-ever.wav')
wav = wav.to(device)
assert sr == 16_000, sr

# Get mel spectrogram
input_signal_length = torch.tensor([wav.size(-1)], dtype=torch.int32, device=device)
spectrogram, spec_length = model.preprocessor.forward(
    input_signal=wav,
    length=input_signal_length,
)

# Get encoded acoustic embeddings
acoustic_embs, acoustic_embs_length = model.encoder.forward(
    audio_signal=spectrogram, length=spec_length
)

In [None]:
cur_token_emb, hidden_state = prediction_network(
    y=transcription_ids,
    state=None,
)
# Acoustic context (B, H1, T) goes to `encoder_outputs`,
# language context (B, H2, U) goes to `decoder_outputs`.
inp, vocab_distribution = joint_network(
    encoder_outputs=acoustic_embs,
    decoder_outputs=cur_token_emb.mT,
    return_inp=True,
)

In [None]:
inp.size()

In [None]:
vocab_distribution.size()



---



For further reading:
  * [Sequence-to-sequence learning with Transducers](https://lorenlugosch.github.io/posts/2020/11/transducer/)
  * RNN-t optimizations:
    * [Multi-Blank Transducers for Speech Recognition, Hainan Xu et al., NVIDIA, 2024](https://arxiv.org/pdf/2211.03541v2)
    * [Efficient Sequence Transduction by Jointly Predicting Tokens and Durations, Hainan Xu et al., NVIDIA, 2023](https://arxiv.org/abs/2304.06795)
    * [FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization, Jiahui Yu et al., Google, 2021](https://arxiv.org/abs/2010.11148)
    * [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition, Dima Rekesh et al., NVIDIA, 2023](https://arxiv.org/abs/2305.05084)
    * [Rnn-Transducer with Stateless Prediction Network, Mohammadreza Ghodsi et al., 2020](https://ieeexplore.ieee.org/document/9054419)
