# Getting Started: Sample Conversational AI application
This notebook shows how to use NVIDIA NeMo (https://github.com/NVIDIA/NeMo) to build a toy demo that translates an English audio file into a Spanish one.

The demo shows how to:

* Instantiate pre-trained NeMo models from NVIDIA NGC as AI Studio assets.
* Transcribe audio with an English speech recognition model.
* Translate text to Spanish with a machine translation model.
* Generate audio with text-to-speech models fine-tuned on Spanish speech.
* Deploy these models locally using MLflow and AI Studio deployments.

## Import all necessary packages

In [1]:
# Import NeMo and its ASR, NLP and TTS collections
import nemo
# Import Speech Recognition collection
import nemo.collections.asr as nemo_asr
# Import Natural Language Processing collection
import nemo.collections.nlp as nemo_nlp
# Import Speech Synthesis collection
import nemo.collections.tts as nemo_tts

# We'll use this to listen to audio
import IPython

import soundfile
import uuid
import io
import base64
import json
import logging

import mlflow
import os
import shutil
from mlflow.types.schema import Schema, ColSpec
from mlflow.types import ParamSchema, ParamSpec
from mlflow.models import ModelSignature

## Loading from local saved models

Instead of downloading the models from NGC in code, we load models that were previously downloaded through the AI Studio assets manager.
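
Before restoring the models, it can help to confirm that the expected `.nemo` files are actually present in the assets folder. The sketch below is not part of the original notebook: `find_missing_assets` is a hypothetical helper, and the paths are simply copied from the loading cell that follows, so adjust them to your own asset layout.

```python
import os

# Paths copied from the loading cell; they are specific to this workspace.
ASSET_PATHS = {
    "asr": "/home/jovyan/datafabric/STT_En_Citrinet_1024_Gamma_0_25/stt_en_citrinet_1024_gamma_0_25.nemo",
    "nmt": "/home/jovyan/datafabric/NMT_En_Es_Transformer12x2/nmt_en_es_transformer12x2.nemo",
    "fastpitch": "/home/jovyan/datafabric/TTS_Es_Multispeaker_FastPitch_HiFiGAN/tts_es_fastpitch_multispeaker.nemo",
    "hifigan": "/home/jovyan/datafabric/TTS_Es_Multispeaker_FastPitch_HiFiGAN/tts_es_hifigan_ft_fastpitch_multispeaker.nemo",
}


def find_missing_assets(paths):
    """Return the sorted names of assets whose file does not exist (hypothetical helper)."""
    return sorted(name for name, path in paths.items() if not os.path.isfile(path))


missing = find_missing_assets(ASSET_PATHS)
if missing:
    print("Missing model assets:", ", ".join(missing))
else:
    print("All model assets found.")
```

Running this first gives a short, readable message instead of a long stack trace from `restore_from` when an asset has not been downloaded yet.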

In [2]:
# Speech Recognition model - English Citrinet-1024 (gamma 0.25), restored from a local .nemo file
asr_model = nemo_asr.models.EncDecCTCModel.restore_from("/home/jovyan/datafabric/STT_En_Citrinet_1024_Gamma_0_25/stt_en_citrinet_1024_gamma_0_25.nemo")

# Neural Machine Translation model
nmt_model = nemo_nlp.models.MTEncDecModel.restore_from("/home/jovyan/datafabric/NMT_En_Es_Transformer12x2/nmt_en_es_transformer12x2.nemo")

# Spectrogram generator which takes text as an input and produces spectrogram
spectrogram_generator = nemo_tts.models.FastPitchModel.restore_from("/home/jovyan/datafabric/TTS_Es_Multispeaker_FastPitch_HiFiGAN/tts_es_fastpitch_multispeaker.nemo")

# Vocoder model which takes spectrogram and produces actual audio
vocoder = nemo_tts.models.HifiGanModel.restore_from("/home/jovyan/datafabric/TTS_Es_Multispeaker_FastPitch_HiFiGAN/tts_es_hifigan_ft_fastpitch_multispeaker.nemo")

[NeMo I 2024-12-04 14:44:58 mixins:170] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2024-12-04 14:44:59 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 32
    trim_silence: false
    max_duration: 20.0
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    use_start_end_token: false
    
[NeMo W 2024-12-04 14:44:59 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 32
    shuffle: false
    use_start_end_token: false
    
[NeMo W 2024-12-04 14:44:59 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a v

[NeMo I 2024-12-04 14:44:59 features:289] PADDING: 16
[NeMo I 2024-12-04 14:45:08 save_restore_connector:249] Model EncDecCTCModelBPE was successfully restored from /home/jovyan/datafabric/STT_En_Citrinet_1024_Gamma_0_25/stt_en_citrinet_1024_gamma_0_25.nemo.
[NeMo I 2024-12-04 14:46:51 tokenizer_utils:179] Getting YouTokenToMeTokenizer with model: /tmp/tmp_mqnhy5j/tokenizer.32000.BPE.model with r2l: False.
[NeMo I 2024-12-04 14:46:51 tokenizer_utils:179] Getting YouTokenToMeTokenizer with model: /tmp/tmp_mqnhy5j/tokenizer.32000.BPE.model with r2l: False.


[NeMo W 2024-12-04 14:46:51 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    src_file_name: /raid/batches.tokens.16000.pkl
    tgt_file_name: /raid/batches.tokens.16000.pkl
    tokens_in_batch: 16000
    clean: true
    max_seq_length: 512
    cache_ids: false
    cache_data_per_node: false
    use_cache: false
    shuffle: true
    num_samples: -1
    drop_last: false
    pin_memory: false
    num_workers: 8
    load_from_cached_dataset: true
    reverse_lang_direction: false
    load_from_tarred_dataset: false
    metadata_path: null
    tar_shuffle_n: 100
    
[NeMo W 2024-12-04 14:46:51 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config

[NeMo I 2024-12-04 14:47:07 nlp_overrides:752] Model MTEncDecModel was successfully restored from /home/jovyan/datafabric/NMT_En_Es_Transformer12x2/nmt_en_es_transformer12x2.nemo.


[NeMo W 2024-12-04 14:47:17 deprecated:63] Function ``g2p_backward_compatible_support`` is deprecated. But it will not be removed until a further notice. G2P object root directory `nemo_text_processing.g2p` has been replaced with `nemo.collections.tts.g2p`. Please use the latter instead as of NeMo 1.18.0.
[NeMo W 2024-12-04 14:47:17 experimental:26] `<class 'nemo.collections.tts.g2p.models.i18n_ipa.IpaG2p'>` is experimental and not ready for production yet. Use at your own risk.
[NeMo W 2024-12-04 14:47:18 i18n_ipa:124] apply_to_oov_word=None, This means that some of words will remain unchanged if they are not handled by any of the rules in self.parse_one_word(). This may be intended if phonemes and chars are both valid inputs, otherwise, you may see unexpected deletions in your input.
[NeMo W 2024-12-04 14:47:18 experimental:26] `<class 'nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.IPATokenizer'>` is experimental and not ready for production yet. Use at your own ri

[NeMo I 2024-12-04 14:47:18 features:289] PADDING: 1
[NeMo I 2024-12-04 14:47:20 save_restore_connector:249] Model FastPitchModel was successfully restored from /home/jovyan/datafabric/TTS_Es_Multispeaker_FastPitch_HiFiGAN/tts_es_fastpitch_multispeaker.nemo.


[NeMo W 2024-12-04 14:47:39 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.torch.data.VocoderDataset
      manifest_filepath: /home/rlangman/Data/openslr/spanish/ipa/train_hifi_gta_manifest.json
      sample_rate: 44100
      n_segments: 16384
      max_duration: null
      min_duration: 0.75
      load_precomputed_mel: true
      hop_length: 512
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 16
      num_workers: 4
      pin_memory: true
    
[NeMo W 2024-12-04 14:47:39 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo

[NeMo I 2024-12-04 14:47:39 features:289] PADDING: 0
[NeMo I 2024-12-04 14:47:39 features:297] STFT using exact pad
[NeMo I 2024-12-04 14:47:39 features:289] PADDING: 0
[NeMo I 2024-12-04 14:47:39 features:297] STFT using exact pad


    


[NeMo I 2024-12-04 14:47:41 save_restore_connector:249] Model HifiGanModel was successfully restored from /home/jovyan/datafabric/TTS_Es_Multispeaker_FastPitch_HiFiGAN/tts_es_hifigan_ft_fastpitch_multispeaker.nemo.


In [3]:
audio_sample = 'ForrestGump.mp3'
# To listen to it, click on the play button below
IPython.display.Audio(audio_sample)

In [4]:
cuda_asr_model = asr_model.cuda()
transcribed_text = cuda_asr_model.transcribe([audio_sample])
print(transcribed_text)

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

["my mom always said life like a box of chocolates never know what you're going to get"]


In [5]:
cuda_nmt_model = nmt_model.cuda()
translated_text = cuda_nmt_model.translate(transcribed_text)

print(translated_text)

['mi mamá siempre dijo la vida como una caja de chocolates nunca saben lo que vas a conseguir']


In [6]:
cuda_spectrogram_generator = spectrogram_generator.cuda()
cuda_vocoder = vocoder.cuda()


parsed = cuda_spectrogram_generator.parse(translated_text[0])
spectrogram = cuda_spectrogram_generator.generate_spectrogram(tokens=parsed, speaker=2)
audio = cuda_vocoder.convert_spectrogram_to_audio(spec=spectrogram)
IPython.display.Audio(audio.to('cpu').detach().numpy(), rate=44100)


[NeMo W 2024-12-04 14:48:20 fastpitch:291] parse() is meant to be called in eval mode.
[NeMo W 2024-12-04 14:48:20 fastpitch:368] generate_spectrogram() is meant to be called in eval mode.


In [2]:
class NemoTranslationModel(mlflow.pyfunc.PythonModel):
    def transcribe_audio(self, inputs):
        """
        Decodes base64-serialized audio, saves it as a temporary WAV file,
        and transcribes it with the ASR model.
        The audio is assumed to be in WAV format for simplicity.
        """
        serialized_audio = inputs['source_serialized_audio'][0]
        audio_buffer = io.BytesIO(base64.b64decode(serialized_audio))
        audio, self.framerate = soundfile.read(audio_buffer)
        if len(audio.shape) > 1 and audio.shape[1] > 1:
            audio = audio[:, 0]  # Keep only the first channel
        wave_file = "/phoenix/mlflow/tmp/{}.wav".format(self.file_id)
        soundfile.write(wave_file, audio, self.framerate)
        text = self.asr_model.cuda().transcribe([wave_file])
        return text

    def text_to_audio(self, text):
        """
        Generates audio from text with the FastPitch spectrogram generator
        and the HiFi-GAN vocoder.
        """
        parsed = self.spectrogram_generator.cuda().parse(text)
        spectrogram = self.spectrogram_generator.cuda().generate_spectrogram(tokens=parsed, speaker=2)
        audio = self.vocoder.cuda().convert_spectrogram_to_audio(spec=spectrogram)
        return audio.to('cpu').detach().numpy()

    def serialize_audio(self, audio_np):
        """
        Serializes an audio NumPy array to a base64 string representing a WAV file.
        """
        # 44100 Hz is the sample rate of the HiFi-GAN vocoder output
        wave_file = "/phoenix/mlflow/tmp/out_{}.wav".format(self.file_id)
        soundfile.write(wave_file, audio_np, samplerate=44100, format='WAV')

        with io.BytesIO() as audio_buffer:
            soundfile.write(audio_buffer, audio_np, samplerate=44100, format='WAV')
            audio_buffer.seek(0)
            audio_output = audio_buffer.read()
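        # Encode the WAV bytes as base64 text to match the string column in the output schema
        return base64.b64encode(audio_output).decode('utf-8')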

    def load_context(self, context):
        import nemo.collections.asr as nemo_asr
        import nemo.collections.nlp as nemo_nlp
        import nemo.collections.tts as nemo_tts
        model_name=context.artifacts["model"]
        self.asr_model = nemo_asr.models.EncDecCTCModel.restore_from("{}/enc_dec_CTC.nemo".format(model_name))
        self.nmt_model = nemo_nlp.models.MTEncDecModel.restore_from("{}/MT_enc_dec.nemo".format(model_name))
        self.spectrogram_generator = nemo_tts.models.FastPitchModel.restore_from("{}/fast_pitch.nemo".format(model_name))
        self.vocoder = nemo_tts.models.HifiGanModel.restore_from("{}/hifi_gan.nemo".format(model_name))
        self.framerate = 44100  # Default; overwritten when input audio is read
        os.makedirs("/phoenix/mlflow/tmp", exist_ok=True)
         
    def predict(self, context, model_input, params):
        source_text = ""
        self.file_id = uuid.uuid1()
        if params["use_audio"]:
            source_text = self.transcribe_audio(model_input)[0]
        else:
            source_text = model_input['source_text'][0]
        translated_text = self.nmt_model.cuda().translate([source_text])[0]
        translated_audio = ""
        if params["use_audio"]:
            audio = self.text_to_audio(translated_text)
            translated_audio = self.serialize_audio(audio[0])
        return {"original_text": source_text, "translated_text": translated_text, "translated_serialized_audio": translated_audio}

    @classmethod
    def log_model(cls, model_name, nemo_models, demo_folder):  # e.g. ('nemo_en_es', model_set, 'demo')
        import os, shutil
        input_schema = Schema(
            [
                ColSpec("string", "source_text"),
                ColSpec("string", "source_serialized_audio"),
            ]
        )
        output_schema = Schema(
            [
                ColSpec("string", "original_text"),
                ColSpec("string", "translated_text"),
                ColSpec("string", "translated_serialized_audio"),
            ]
        )
        
        params_schema = ParamSchema(
            [
                ParamSpec("use_audio", "boolean", False)
            ]
        )
      
        signature = ModelSignature(inputs=input_schema, outputs=output_schema, params=params_schema)

        if not os.path.isdir(model_name):
            os.mkdir(model_name)
        if "enc_dec_CTC" in nemo_models:
            shutil.copyfile(nemo_models["enc_dec_CTC"], "{}/enc_dec_CTC.nemo".format(model_name))
        if "MT_enc_dec" in nemo_models:            
            shutil.copyfile(nemo_models["MT_enc_dec"], "{}/MT_enc_dec.nemo".format(model_name))
        if "fast_pitch" in nemo_models:            
            shutil.copyfile(nemo_models["fast_pitch"], "{}/fast_pitch.nemo".format(model_name))
        if "hifi_gan" in nemo_models:            
            shutil.copyfile(nemo_models["hifi_gan"], "{}/hifi_gan.nemo".format(model_name))
        mlflow.pyfunc.log_model(
            model_name,
            python_model=cls(),
            artifacts={"model": model_name, "demo": demo_folder},
            signature=signature
        )            

        shutil.rmtree(model_name)
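
The signature above expects two string columns (`source_text`, `source_serialized_audio`) and one boolean parameter (`use_audio`). As a rough sketch of what a client might send, the snippet below builds such a payload using only the standard library. `build_request` is a hypothetical helper, and the outer `dataframe_records` / `params` layout is an assumption about the serving endpoint, not something the notebook confirms. A second of generated silence stands in for a real recording.

```python
import base64
import io
import json
import wave


def build_request(wav_bytes=None, text=""):
    """Build a JSON-serializable request body (hypothetical helper).

    Column names and `use_audio` come from the model signature; the outer
    `dataframe_records` / `params` layout is an assumed endpoint format.
    """
    use_audio = wav_bytes is not None
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii") if use_audio else ""
    return {
        "dataframe_records": [
            {"source_text": text, "source_serialized_audio": audio_b64}
        ],
        "params": {"use_audio": use_audio},
    }


# One second of 16 kHz mono silence as a stand-in for a real recording.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

audio_request = build_request(wav_bytes=buffer.getvalue())
text_request = build_request(text="life is like a box of chocolates")
print(json.dumps(text_request, indent=2))
```

With `use_audio` set to false the model skips both transcription and speech synthesis, so a text-only request is a cheap way to check that the translation step works.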


In [3]:
mlflow.set_experiment(experiment_name='NeMo_translation')

with mlflow.start_run(run_name='NeMo_en_es_translation') as run:
    model_set = {
        "enc_dec_CTC": "/home/jovyan/datafabric/STT_En_Citrinet_1024_Gamma_0_25/stt_en_citrinet_1024_gamma_0_25.nemo",
        "MT_enc_dec": "/home/jovyan/datafabric/NMT_En_Es_Transformer12x2/nmt_en_es_transformer12x2.nemo",
        "fast_pitch": "/home/jovyan/datafabric/TTS_Es_Multispeaker_FastPitch_HiFiGAN/tts_es_fastpitch_multispeaker.nemo",
        "hifi_gan": "/home/jovyan/datafabric/TTS_Es_Multispeaker_FastPitch_HiFiGAN/tts_es_hifigan_ft_fastpitch_multispeaker.nemo",
    }


    NemoTranslationModel.log_model(model_name='nemo_en_es', nemo_models=model_set, demo_folder="demo")
    mlflow.register_model(model_uri=f"runs:/{run.info.run_id}/nemo_en_es", name="nemo_en_es")

    


2024/12/04 14:54:55 INFO mlflow.tracking.fluent: Experiment with name 'NeMo_translation' does not exist. Creating a new experiment.


Downloading artifacts:   0%|          | 0/4 [00:00<?, ?it/s]

Downloading artifacts:   0%|          | 0/81 [00:00<?, ?it/s]

Successfully registered model 'nemo_en_es'.
Created version '1' of model 'nemo_en_es'.
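
Once the registered model is served, its prediction is a dictionary with the keys `original_text`, `translated_text` and `translated_serialized_audio`. The sketch below shows one way a client could turn the audio field back into a WAV stream. It assumes the field holds a base64-encoded WAV file, as the `serialize_audio` docstring describes. `decode_translated_audio` is a hypothetical helper, and the example prediction is built by hand (with silence instead of speech) so that the snippet runs without the model.

```python
import base64
import io
import wave


def decode_translated_audio(prediction):
    """Return WAV bytes from a prediction dict, or None if it has no audio (hypothetical helper)."""
    encoded = prediction.get("translated_serialized_audio", "")
    if not encoded:
        return None
    return base64.b64decode(encoded)


# Hand-made stand-in for a model response: 0.1 s of 44.1 kHz mono silence.
fake_wav = io.BytesIO()
with wave.open(fake_wav, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit samples
    wav.setframerate(44100)
    wav.writeframes(b"\x00\x00" * 4410)

example_prediction = {
    "original_text": "my mom always said life like a box of chocolates never know what you're going to get",
    "translated_text": "mi mamá siempre dijo la vida como una caja de chocolates nunca saben lo que vas a conseguir",
    "translated_serialized_audio": base64.b64encode(fake_wav.getvalue()).decode("ascii"),
}

wav_bytes = decode_translated_audio(example_prediction)
with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
    print("sample rate:", wav.getframerate(), "frames:", wav.getnframes())
```

For text-only requests the audio field is an empty string, so the helper returns `None` rather than raising.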
