# Introduction

## .nemo file (unpacked)

Assuming you have run and read `test-nemo-mwe.py`, you will have seen that the model is loaded from `stt_en_fastconformer_ctc_large.nemo`. That `.nemo` file is a bundle of several files. Let's unpack its contents into `tmp/nemo_unpacked`.

In [12]:
%%bash
gdown 10aal_afHODYOA6qh5IQX8IGnWbfU3Nhd -O tmp/
tar -xf tmp/nemo_unpacked.tgz -C tmp/

Downloading...
From (uriginal): https://drive.google.com/uc?id=10aal_afHODYOA6qh5IQX8IGnWbfU3Nhd
From (redirected): https://drive.google.com/uc?id=10aal_afHODYOA6qh5IQX8IGnWbfU3Nhd&confirm=t&uuid=c8f0cfa0-dab8-42e5-b6b3-fed3ababfd9c
To: /home/naysan/git-repos/fastconformer_standalone/tmp/nemo_unpacked.tgz
100%|██████████| 463M/463M [00:04<00:00, 99.6MB/s] 


The file list below shows a model checkpoint (`model_weights.ckpt`), the model configuration (`model_config.yaml`) and the tokenizer files (`*_tokenizer.model`, `*_tokenizer.vocab`, `*_vocab.txt`).

In [16]:
!ls -lh tmp/nemo_unpacked/

total 442M
-rwxrwxr-x 1 naysan naysan 255K Sep  5 15:26 0124d42f914e45e98c214bd5afd17f55_tokenizer.model
-rwxrwxr-x 1 naysan naysan 5.9K Sep  5 15:26 129be3e4b71e449e86261ee42b6849fa_vocab.txt
-rwxrwxr-x 1 naysan naysan  17K Sep  5 15:26 7561592dab144ebaaade5d1244a9ffb0_tokenizer.vocab
-rw-r--r-- 1 naysan naysan  15K Sep  5 15:26 model_config.yaml
-rw-r--r-- 1 naysan naysan 442M Sep  5 15:26 model_weights.ckpt
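
The same listing can be produced from Python without unpacking anything. The sketch below assumes the bundle is a tar archive (as the `tar -xf` step above suggests for the unpacked copy); `list_archive_members` is a hypothetical helper name, not part of NeMo.

```python
import tarfile


def list_archive_members(path):
    """Return (name, size_in_bytes) for every regular file in a tar archive."""
    # "r:*" lets tarfile auto-detect the compression (plain tar, gzip, ...).
    with tarfile.open(path, mode="r:*") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]
```

For example, `list_archive_members("tmp/nemo_unpacked.tgz")` should return one entry per file shown in the listing above.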


## Load pre-trained model

Let's load the pre-trained model from the `.nemo` file.

In [3]:
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from("tmp/stt_en_fastconformer_ctc_large.nemo")

[NeMo I 2023-09-13 18:11:46 mixins:170] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2023-09-13 18:11:47 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 1
    shuffle: true
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    trim_silence: false
    max_duration: 20
    min_duration: 0.1
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    bucketing_strategy: fully_randomized
    bucketing_batch_size: null
    
[NeMo W 2023-09-13 18:11:47 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 32
    shuffle: false
    num_workers: 8
    pin_m

[NeMo I 2023-09-13 18:11:47 features:289] PADDING: 0
[NeMo I 2023-09-13 18:11:47 save_restore_connector:249] Model EncDecCTCModelBPE was successfully restored from /home/naysan/git-repos/fastconformer_standalone/tmp/stt_en_fastconformer_ctc_large.nemo.


## Model structure

We can see from the PyTorch model print-out below that it is composed of several sub-components:
- preprocessor
- encoder
    - pre_encode
    - pos_enc
    - layers
- decoder

In [4]:
asr_model

EncDecCTCModelBPE(
  (preprocessor): AudioToMelSpectrogramPreprocessor(
    (featurizer): FilterbankFeatures()
  )
  (encoder): ConformerEncoder(
    (pre_encode): ConvSubsampling(
      (out): Linear(in_features=2560, out_features=512, bias=True)
      (conv): Sequential(
        (0): Conv2d(1, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
        (1): ReLU(inplace=True)
        (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=256)
        (3): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
        (4): ReLU(inplace=True)
        (5): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=256)
        (6): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
        (7): ReLU(inplace=True)
      )
    )
    (pos_enc): RelPositionalEncoding(
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layers): ModuleList(
      (0-17): 18 x ConformerLayer(
        (norm_feed_forward1): LayerNorm((512,), eps=1e-05, elementwis

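## Call each component one at a time

First we load a short test recording, then pass it through each component of the NeMo-loaded model in turn.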

In [10]:
import torch
import torchaudio

samples, sample_rate = torchaudio.load("tmp/cat.wav")

audio_samples = samples
audio_lens    = torch.tensor(samples.shape[1]).unsqueeze(0)

### Feature extractor

In [11]:
type(asr_model.preprocessor.featurizer)

nemo.collections.asr.parts.preprocessing.features.FilterbankFeatures

In [12]:
fbank_feats, fbank_lens = asr_model.preprocessor.featurizer(audio_samples.to('cuda'), audio_lens.to('cuda'))

### Pre-encoder

In [13]:
preenc_feats, preenc_lens = asr_model.encoder.pre_encode(
    fbank_feats.transpose(1,2),
    fbank_lens
)
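
The shapes in the `pre_encode` print-out above can be checked by hand. Each of the three stride-2 convolutions (kernel 3, padding 1) roughly halves both the time axis and the frequency axis. The sketch below applies the standard convolution output-size formula; it uses the 80 mel features from the preprocessor config and, as an example length, the 53 filterbank frames reported for `cat.wav` further down in this notebook. The helper names are made up for illustration.

```python
def conv_out_len(length, kernel_size=3, stride=2, padding=1):
    """Output size of one convolution along one axis (floor mode)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def subsampled_len(length, num_convs=3):
    """Apply the stride-2 formula once per strided convolution."""
    for _ in range(num_convs):
        length = conv_out_len(length)
    return length


time_frames = subsampled_len(53)  # 53 -> 27 -> 14 -> 7
freq_bins = subsampled_len(80)    # 80 -> 40 -> 20 -> 10
flattened = 256 * freq_bins       # conv channels x remaining frequency bins

print(time_frames, freq_bins, flattened)  # 7 10 2560
```

The value 2560 matches `in_features=2560` of the `out` linear layer, which projects the flattened convolution output down to `d_model = 512`.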

In [14]:
type(asr_model.encoder.pos_enc)

nemo.collections.asr.parts.submodules.multi_head_attention.RelPositionalEncoding

In [15]:
preenc_feats, pos_emb = asr_model.encoder.pos_enc(preenc_feats)

### Encoder

In [19]:
type(asr_model.encoder)

nemo.collections.asr.modules.conformer_encoder.ConformerEncoder

#### Pre-processing

In [16]:
pad_mask, att_mask = asr_model.encoder._create_masks(
    att_context_size=asr_model.encoder.att_context_size,
    padding_length=preenc_lens,
    max_audio_length=preenc_feats.size(1),
    offset=None,
    device=preenc_feats.device,
)
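
As a rough picture of what a padding mask encodes, here is a pure-Python sketch. It assumes the convention that `True` marks padded positions; that convention is an assumption here and should be checked against the tensors returned by `_create_masks`. `padding_mask` is a hypothetical helper, not NeMo code.

```python
def padding_mask(lengths, max_len):
    """One row per sequence; True marks positions at or beyond the valid length."""
    return [[pos >= n for pos in range(max_len)] for n in lengths]


# Two sequences padded to a common length of 5: the first has 3 valid frames.
mask = padding_mask([3, 5], max_len=5)
print(mask[0])  # [False, False, False, True, True]
print(mask[1])  # [False, False, False, False, False]
```

With a single utterance, as in this notebook, nothing is padded, so under this convention every entry would be `False`.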

#### Pass through encoder layers

In [20]:
type(asr_model.encoder.layers[0])

nemo.collections.asr.parts.submodules.conformer_modules.ConformerLayer

In [96]:
audio_signal = preenc_feats

# Run the features through the 18 Conformer layers in sequence.
# The caches are only needed for streaming inference, so we pass None.
for layer in asr_model.encoder.layers:
    audio_signal = layer(
        x=audio_signal,
        att_mask=att_mask,
        pos_emb=pos_emb,
        pad_mask=pad_mask,
        cache_last_channel=None,
        cache_last_time=None,
    )

### Decoder

In [21]:
type(asr_model.decoder)

nemo.collections.asr.modules.conv_asr.ConvASRDecoder

In [117]:
decoder_output = asr_model.decoder.decoder_layers(
    audio_signal.transpose(1,2)
).transpose(1, 2)

log_probs = torch.nn.functional.log_softmax(decoder_output, dim=-1)

#### Get prediction

In [120]:
greedy_predictions = log_probs.argmax(dim=-1, keepdim=False)
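
The per-frame argmax is not yet a transcript: CTC decoding still has to merge consecutive repeats and drop the blank symbol. Below is a minimal pure-Python sketch of that collapse. The token ids are made up, and `BLANK = 1024` is an assumption (the tokenizer has 1024 tokens, and the sketch assumes the blank sits at the next index). It is not NeMo's implementation.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id):
    """Merge consecutive duplicate ids, then remove the blank id."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]


BLANK = 1024  # assumed blank index, for illustration only
frames = [1024, 7, 7, 1024, 7, 42, 42, 1024]  # hypothetical per-frame argmax ids

print(ctc_greedy_collapse(frames, BLANK))  # [7, 7, 42]
```

Note that the blank between the two runs of `7` keeps them as separate tokens, which is how CTC represents a repeated token.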

In [124]:
current_hypotheses, all_hyp = asr_model.decoding.ctc_decoder_predictions_tensor(
    log_probs,
    decoder_lengths=preenc_lens,
    return_hypotheses=False
)

In [125]:
current_hypotheses

['cat']

## Try to manually configure components (see if we can get same result)

### Get config

In [32]:
import yaml
with open('tmp/nemo_unpacked/model_config.yaml', 'r') as file:
    model_config = yaml.safe_load(file)

def isolate_target_key(config_dict):
    return {
        'target' : config_dict['_target_'],
        'config': dict([ (k,v) for (k,v) in config_dict.items() if k != '_target_' ])
    }
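
The `target` string is a dotted import path, so it could also be resolved programmatically instead of importing each class by hand. The sketch below shows one way to do that with `importlib`. `resolve_target` and `instantiate` are hypothetical helpers, and a standard-library class stands in for the NeMo modules so that the example runs on its own.

```python
import importlib
from fractions import Fraction


def resolve_target(dotted_path):
    """Turn 'package.module.ClassName' into the class object it names."""
    module_name, _, attr_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


def instantiate(spec):
    """Build an object from a {'target': ..., 'config': {...}} dictionary."""
    cls = resolve_target(spec["target"])
    return cls(**spec["config"])


# Stand-in spec with the same shape as the output of isolate_target_key().
spec = {"target": "fractions.Fraction", "config": {"numerator": 3, "denominator": 4}}
obj = instantiate(spec)
print(obj)  # 3/4
```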

### Get model weights

In [90]:
from collections import OrderedDict

weights = torch.load("tmp/nemo_unpacked/model_weights.ckpt")

list(weights.keys())[:10]

encoder_weights = OrderedDict([ (k.replace('encoder.', ''),v) for (k,v) in weights.items() if k.startswith('encoder') ])
decoder_weights = OrderedDict([ (k.replace('decoder.', ''),v) for (k,v) in weights.items() if k.startswith('decoder') ])
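
One caveat with `str.replace` is that it removes every occurrence of the substring, not only the leading one. That happens to be harmless for this checkpoint, but a prefix-only variant is a little safer. A minimal sketch follows; `strip_prefix` is a hypothetical helper and the toy keys and values are placeholders rather than real tensors.

```python
from collections import OrderedDict


def strip_prefix(state_dict, prefix):
    """Keep only the keys starting with `prefix`, with that prefix removed."""
    return OrderedDict(
        (key[len(prefix):], value)
        for key, value in state_dict.items()
        if key.startswith(prefix)
    )


toy_weights = OrderedDict([
    ("encoder.pre_encode.out.weight", 1),
    ("encoder.layers.0.norm_feed_forward1.weight", 2),
    ("decoder.decoder_layers.0.weight", 3),
])

toy_encoder = strip_prefix(toy_weights, "encoder.")
print(list(toy_encoder))
# ['pre_encode.out.weight', 'layers.0.norm_feed_forward1.weight']
```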

### Feature extractor

In [49]:
fext = isolate_target_key(model_config['preprocessor'])

fext

{'target': 'nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor',
 'config': {'sample_rate': 16000,
  'normalize': 'per_feature',
  'window_size': 0.025,
  'window_stride': 0.01,
  'window': 'hann',
  'features': 80,
  'n_fft': 512,
  'log': True,
  'frame_splicing': 1,
  'dither': 1e-05,
  'pad_to': 0,
  'pad_value': 0.0}}
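
The window settings in this config are given in seconds. Converting them to samples makes the time resolution easier to reason about. The sketch below does that arithmetic. `seconds_to_samples` is a hypothetical helper, and the final line uses the encoder's `subsampling_factor` of 8 from `model_config.yaml`.

```python
def seconds_to_samples(seconds, sample_rate):
    """Convert a duration in seconds to a whole number of samples."""
    return int(round(seconds * sample_rate))


SAMPLE_RATE = 16000

win_length = seconds_to_samples(0.025, SAMPLE_RATE)  # window_size
hop_length = seconds_to_samples(0.01, SAMPLE_RATE)   # window_stride

frame_ms = 1000 * hop_length / SAMPLE_RATE  # spacing of filterbank frames
encoder_frame_ms = frame_ms * 8             # after 8x subsampling in the encoder

print(win_length, hop_length, frame_ms, encoder_frame_ms)  # 400 160 10.0 80.0
```

So each filterbank frame covers 400 samples and advances by 160, and each encoder output frame corresponds to roughly 80 ms of audio.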

In [50]:
from nemo.collections.asr.modules import AudioToMelSpectrogramPreprocessor

fext = AudioToMelSpectrogramPreprocessor(**fext['config']).to('cuda')

[NeMo I 2023-09-13 18:38:41 features:289] PADDING: 0


In [53]:
manual_fbank_feats, manual_fbank_lens = fext.featurizer(audio_samples.to('cuda'), audio_lens.to('cuda'))

(manual_fbank_feats, manual_fbank_lens)

(tensor([[[-0.8675, -0.2694, -0.1895,  ..., -0.1655, -0.8155, -1.7735],
          [-1.6972, -1.2183, -1.0287,  ..., -0.7180, -0.8771, -1.5676],
          [-1.7146, -1.3714, -1.3406,  ..., -1.0109, -1.5163, -0.8815],
          ...,
          [-1.6872, -1.1314, -0.7925,  ..., -1.0016, -0.9704, -1.1705],
          [-1.7708, -1.0728, -0.9095,  ..., -1.0243, -0.9215, -1.0464],
          [-1.3496, -1.1106, -0.3422,  ..., -0.1715, -0.7981, -0.8456]]],
        device='cuda:0'),
 tensor([53], device='cuda:0'))

### Encoder

In [105]:
enc = isolate_target_key(model_config['encoder'])

enc

{'target': 'nemo.collections.asr.modules.ConformerEncoder',
 'config': {'feat_in': 80,
  'feat_out': -1,
  'n_layers': 18,
  'd_model': 512,
  'subsampling': 'dw_striding',
  'subsampling_factor': 8,
  'subsampling_conv_channels': 256,
  'causal_downsampling': False,
  'ff_expansion_factor': 4,
  'self_attention_model': 'rel_pos',
  'n_heads': 8,
  'att_context_size': [-1, -1],
  'att_context_style': 'regular',
  'xscaling': True,
  'untie_biases': True,
  'pos_emb_max_len': 5000,
  'conv_kernel_size': 9,
  'conv_norm_type': 'batch_norm',
  'conv_context_size': None,
  'dropout': 0.1,
  'dropout_pre_encoder': 0.1,
  'dropout_emb': 0.0,
  'dropout_att': 0.1}}

In [106]:
from nemo.collections.asr.modules import ConformerEncoder

enc = ConformerEncoder(**enc['config']).to('cuda')
enc.load_state_dict(encoder_weights)

<All keys matched successfully>

In [107]:
# Use the manually configured encoder (enc), not asr_model.encoder
manual_preenc_feats, manual_preenc_lens = enc.pre_encode(
    manual_fbank_feats.transpose(1,2),
    manual_fbank_lens
)

In [108]:
manual_preenc_feats, manual_pos_emb = enc.pos_enc(manual_preenc_feats)

In [109]:
manual_pad_mask, manual_att_mask = enc._create_masks(
    att_context_size=enc.att_context_size,
    padding_length=manual_preenc_lens,
    max_audio_length=manual_preenc_feats.size(1),
    offset=None,
    device=manual_preenc_feats.device,
)

In [110]:
audio_signal = manual_preenc_feats

# Same layer-by-layer pass as before, now through the manually built encoder.
for layer in enc.layers:
    audio_signal = layer(
        x=audio_signal,
        att_mask=manual_att_mask,
        pos_emb=manual_pos_emb,
        pad_mask=manual_pad_mask,
        cache_last_channel=None,
        cache_last_time=None,
    )

### Decoder

In [111]:
dec = isolate_target_key(model_config['decoder'])

dec['target']

'nemo.collections.asr.modules.ConvASRDecoder'

In [112]:
from nemo.collections.asr.modules import ConvASRDecoder

dec = ConvASRDecoder(**dec['config']).to('cuda')
dec.load_state_dict(decoder_weights)

<All keys matched successfully>

In [113]:
manual_decoder_output = dec.decoder_layers(
    audio_signal.transpose(1,2)
).transpose(1, 2)

manual_log_probs = torch.nn.functional.log_softmax(manual_decoder_output, dim=-1)

In [114]:
manual_greedy_predictions = manual_log_probs.argmax(dim=-1, keepdim=False)

In [115]:
# Just use same decoder as NeMo-configured model above for convenience
manual_current_hypotheses, manual_all_hyp = asr_model.decoding.ctc_decoder_predictions_tensor(
    manual_log_probs,
    decoder_lengths=manual_preenc_lens,
    return_hypotheses=False
)

manual_current_hypotheses

['ca']

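Close enough! (We would expect something very different, or plain junk, if the model were not working.) A likely reason for the missing final letter is that `fext`, `enc` and `dec` were never switched to evaluation mode: freshly constructed PyTorch modules start in training mode, so dropout is still active. Calling `.eval()` on each of them before the forward pass is the first thing to try.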