<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Build Fast with AI](https://img.shields.io/badge/BuildFastWithAI-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://www.buildfastwithai.com/genai-course)
[![EduChain GitHub](https://img.shields.io/github/stars/satvik314/educhain?style=for-the-badge&logo=github&color=gold)](https://github.com/satvik314/educhain)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17lYgYeudEdfhMKEHo-jK3rLRz_OHEp9c?usp=sharing)

# NeMo: SpeechAI Workbench 🎙️ 🚀


NeMo is an open-source framework developed by NVIDIA to simplify the creation, training,
and deployment of large-scale AI models, including Large Language Models (LLMs),
Speech AI, and Multimodal applications. 🎙️🤖🖼️


- 🛠️ Modular Design: Supports LLMs, Speech AI, and Vision models through pre-built components.
- ⚡ Scalability: Designed for multi-GPU and distributed training on large datasets.
- 📚 Pretrained Models: Provides state-of-the-art models for ASR, TTS, NLP, and multimodal AI.
- 🔧 Customization: Fine-tune models with domain-specific data for better performance.
- ☁️ Cloud & On-Prem Support: Deploy on NVIDIA GPUs, cloud platforms, or edge devices.


### **Setup and Installation**

In [None]:
!pip install wget
!apt-get install -y sox libsndfile1 ffmpeg
!pip install text-unidecode

In [None]:
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]


In [None]:
!pip install megatron-core

### **Check NeMo Version**

In [None]:
import nemo
nemo.__version__

'2.2.0rc2'
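
The version reported above, `2.2.0rc2`, is a release candidate, so later cells may behave differently on a final release. The following is a minimal sketch of how the version string could be split into comparable parts. The helper names `parse_version` and `is_prerelease` are hypothetical, and the pattern assumes a plain `MAJOR.MINOR.PATCH` prefix followed by an optional tag.

```python
import re


def parse_version(version):
    """Split a string such as '2.2.0rc2' into (major, minor, patch, tag)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(.*)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor, patch, tag = match.groups()
    return int(major), int(minor), int(patch), tag


def is_prerelease(version):
    """True when anything (for example 'rc2') follows the numeric part."""
    return parse_version(version)[3] != ""


print(parse_version("2.2.0rc2"))  # (2, 2, 0, 'rc2')
print(is_prerelease("2.2.0rc2"))  # True
```

In the notebook, the same helpers could be called as `parse_version(nemo.__version__)`.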

### **Import NeMo Modules**

In [None]:
import nemo
import nemo.collections.asr as nemo_asr
import nemo.collections.nlp as nemo_nlp
import nemo.collections.tts as nemo_tts
import IPython

### **List Available ASR Models in NeMo**

In [None]:
nemo_asr.models.EncDecCTCModel.list_available_models()
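
The listing returned by `list_available_models()` can be long, so it may help to filter it by name. The sketch below assumes each entry exposes a `pretrained_model_name` attribute. The helper `names_with_prefix` is hypothetical, and the stand-in entries reuse only the two checkpoint names that appear in this notebook, so the code runs without NeMo installed.

```python
from types import SimpleNamespace


def names_with_prefix(model_infos, prefix):
    """Return the sorted checkpoint names that start with `prefix`."""
    return sorted(
        info.pretrained_model_name
        for info in model_infos
        if info.pretrained_model_name.startswith(prefix)
    )


# Stand-in entries (assumption: real entries carry the same attribute).
sample_infos = [
    SimpleNamespace(pretrained_model_name="stt_zh_citrinet_1024_gamma_0_25"),
    SimpleNamespace(pretrained_model_name="stt_en_citrinet_512"),
]

print(names_with_prefix(sample_infos, "stt_zh"))  # ['stt_zh_citrinet_1024_gamma_0_25']
```

With NeMo loaded, the call would look like `names_with_prefix(nemo_asr.models.EncDecCTCModel.list_available_models(), "stt_zh")`.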


### **Load Pretrained ASR Model in NeMo**

In [None]:
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_zh_citrinet_1024_gamma_0_25").cuda()


### **Load Pretrained FastPitch TTS Model**

In [None]:
spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch").cuda()


### **Load Pretrained HiFi-GAN Vocoder**

In [None]:
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan").cuda()


### **Download Audio Sample**

In [None]:
audio_sample = 'common_voice_zh-CN_21347786.mp3'
!wget 'https://nemo-public.s3.us-east-2.amazonaws.com/zh-samples/common_voice_zh-CN_21347786.mp3'
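
The cell above types the file name twice, once for `audio_sample` and once inside the URL. One possible way to keep the two in sync is to derive the name from the URL and skip the download when the file is already present. This is only a sketch using the standard library; `filename_from_url` and `download_if_missing` are hypothetical helpers, not part of NeMo.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve

SAMPLE_URL = (
    "https://nemo-public.s3.us-east-2.amazonaws.com/"
    "zh-samples/common_voice_zh-CN_21347786.mp3"
)


def filename_from_url(url):
    """Return the last path component of the URL."""
    return os.path.basename(urlparse(url).path)


def download_if_missing(url, directory="."):
    """Download `url` into `directory` unless the file already exists."""
    path = os.path.join(directory, filename_from_url(url))
    if not os.path.exists(path):
        urlretrieve(url, path)
    return path


print(filename_from_url(SAMPLE_URL))  # common_voice_zh-CN_21347786.mp3
```

Calling `audio_sample = download_if_missing(SAMPLE_URL)` would then replace both lines of the cell.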

### **Play Audio Sample**

In [None]:
IPython.display.Audio(audio_sample)


### **Transcribe Audio Using ASR Model**

In [None]:
transcribed_text = asr_model.transcribe([audio_sample])
print(transcribed_text)

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]
[NeMo W 2025-02-05 19:05:45 nemo_logging:405] Function ``_transcribe_output_processing`` is deprecated. The return type of args will be updated in the upcoming release to ensure a consistent output format across all decoder types, such that a Hypothesis object is always returned.
Transcribing: 100%|██████████| 1/1 [00:16<00:00, 16.78s/it]

['我们尽了最大努力']
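
The deprecation warning in the output above says that `transcribe()` will eventually return Hypothesis objects rather than plain strings. The sketch below shows one way downstream code could accept either shape. It assumes the returned object exposes the transcript as a `.text` attribute, and it uses a stand-in object so that it runs without NeMo. `to_text` and `transcripts` are hypothetical helper names.

```python
from types import SimpleNamespace


def to_text(result):
    """Return the transcript for one result, whether str or object."""
    if isinstance(result, str):
        return result
    return result.text  # assumption: transcript is stored in `.text`


def transcripts(results):
    """Normalize a list of transcribe() results to plain strings."""
    return [to_text(result) for result in results]


string_output = ["我们尽了最大努力"]  # shape shown in the output above
object_output = [SimpleNamespace(text="我们尽了最大努力")]  # stand-in object

print(transcripts(string_output) == transcripts(object_output))  # True
```

In the notebook this would be used as `transcripts(asr_model.transcribe([audio_sample]))`.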





### **Text-to-Speech Conversion with NeMo**

In [None]:
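import nemo.collections.tts as nemo_tts
import IPython
import torch

# FastPitch turns text into a mel spectrogram; HiFi-GAN turns the spectrogram into a waveform.
# eval() switches off dropout; NeMo warns when parse() or generate_spectrogram() runs in train mode.
spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch").cuda().eval()

vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan").cuda().eval()

def text_to_audio(text):
    """Synthesize `text` and return the waveform as a NumPy array."""
    with torch.no_grad():  # inference only, so skip gradient tracking
        parsed = spectrogram_generator.parse(text)
        spectrogram = spectrogram_generator.generate_spectrogram(tokens=parsed)
        audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
    return audio.to('cpu').detach().numpy()

english_text = "Hello, this is a test of the NeMo text-to-speech system."
audio = text_to_audio(english_text)
# 22050 Hz matches the sample_rate in the FastPitch training config
IPython.display.Audio(audio, rate=22050)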

[NeMo I 2025-02-05 19:06:23 nemo_logging:393] Found existing object /root/.cache/torch/NeMo/NeMo_2.2.0rc2/tts_en_fastpitch_align/b7d086a07b5126c12d5077d9a641a38c/tts_en_fastpitch_align.nemo.
[NeMo I 2025-02-05 19:06:23 nemo_logging:393] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.2.0rc2/tts_en_fastpitch_align/b7d086a07b5126c12d5077d9a641a38c/tts_en_fastpitch_align.nemo
[NeMo I 2025-02-05 19:06:23 nemo_logging:393] Instantiating model from pre-trained checkpoint


 NeMo-text-processing :: INFO     :: Creating ClassifyFst grammars.
INFO:NeMo-text-processing:Creating ClassifyFst grammars.
[NeMo W 2025-02-05 19:06:56 nemo_logging:405] apply_to_oov_word=None, This means that some of words will remain unchanged if they are not handled by any of the rules in self.parse_one_word(). This may be intended if phonemes and chars are both valid inputs, otherwise, you may see unexpected deletions in your input.
[NeMo W 2025-02-05 19:06:56 nemo_logging:405] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.torch.data.TTSDataset
      manifest_filepath: /ws/LJSpeech/nvidia_ljspeech_train_clean_ngc.json
      sample_rate: 22050
      sup_data_path: /raid/LJSpeech/supplementary
      sup_data_types:
      - align_prior_matrix
      - pitch
      n_fft: 1024
      win_length: 10

[NeMo I 2025-02-05 19:06:56 nemo_logging:393] PADDING: 1
[NeMo I 2025-02-05 19:06:57 nemo_logging:393] Model FastPitchModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.2.0rc2/tts_en_fastpitch_align/b7d086a07b5126c12d5077d9a641a38c/tts_en_fastpitch_align.nemo.
[NeMo I 2025-02-05 19:06:57 nemo_logging:393] Found existing object /root/.cache/torch/NeMo/NeMo_2.2.0rc2/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.
[NeMo I 2025-02-05 19:06:57 nemo_logging:393] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.2.0rc2/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo
[NeMo I 2025-02-05 19:06:57 nemo_logging:393] Instantiating model from pre-trained checkpoint


[NeMo W 2025-02-05 19:07:02 nemo_logging:405] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/train_finetune.txt
      min_duration: 0.75
      n_segments: 8192
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 64
      num_workers: 4
    
[NeMo W 2025-02-05 19:07:02 nemo_logging:405] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/val_finetune.txt
      min_duration: 3
      n_segmen

[NeMo I 2025-02-05 19:07:02 nemo_logging:393] PADDING: 0


[NeMo W 2025-02-05 19:07:02 nemo_logging:405] Using torch_stft is deprecated and has been removed. The values have been forcibly set to False for FilterbankFeatures and AudioToMelSpectrogramPreprocessor. Please set exact_pad to True as needed.


[NeMo I 2025-02-05 19:07:02 nemo_logging:393] PADDING: 0
[NeMo I 2025-02-05 19:07:03 nemo_logging:393] Model HifiGanModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.2.0rc2/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.


[NeMo W 2025-02-05 19:07:03 nemo_logging:405] parse() is meant to be called in eval mode.
[NeMo W 2025-02-05 19:07:03 nemo_logging:405] generate_spectrogram() is meant to be called in eval mode.
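
`IPython.display.Audio` only plays the result inside the notebook. To keep the synthesized speech, the samples can be written to a WAV file. The following is a sketch using only the standard library. `write_wav` is a hypothetical helper that expects mono float samples in the range -1.0 to 1.0, and the short 440 Hz tone stands in for real TTS output so the code runs without NeMo.

```python
import array
import math
import os
import sys
import tempfile
import wave


def write_wav(path, samples, rate=22050):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM; return frame count."""
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)


# Stand-in signal: 0.1 s of a 440 Hz tone at 22050 Hz.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(2205)]
out_path = os.path.join(tempfile.gettempdir(), "nemo_tts_demo.wav")
frames = write_wav(out_path, tone)
print(frames)  # 2205
```

Assuming the array returned by `text_to_audio` has shape `(1, num_samples)`, the call in the notebook would be `write_wav("speech.wav", audio[0].tolist())`.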


### **List ASR Model Classes in NeMo**

In [None]:
asr_models = [model for model in dir(nemo_asr.models) if model.endswith("Model")]
asr_models

['ASRModel',
 'EncDecCTCModel',
 'EncDecClassificationModel',
 'EncDecDenoiseMaskedTokenPredModel',
 'EncDecDiarLabelModel',
 'EncDecFrameClassificationModel',
 'EncDecHybridRNNTCTCBPEModel',
 'EncDecHybridRNNTCTCModel',
 'EncDecK2RnntSeqModel',
 'EncDecK2SeqModel',
 'EncDecMaskedTokenPredModel',
 'EncDecMultiTaskModel',
 'EncDecRNNTBPEModel',
 'EncDecRNNTModel',
 'EncDecSpeakerLabelModel',
 'SLUIntentSlotBPEModel',
 'SortformerEncLabelModel',
 'SpeechEncDecSelfSupervisedModel']

### **List NLP Model Classes in NeMo**

In [None]:
nlp_models = [model for model in dir(nemo_nlp.models) if model.endswith("Model")]
nlp_models

['BERTLMModel',
 'BertDPRModel',
 'BertJointIRModel',
 'DuplexDecoderModel',
 'DuplexTaggerModel',
 'DuplexTextNormalizationModel',
 'EntityLinkingModel',
 'GLUEModel',
 'IntentSlotClassificationModel',
 'MTEncDecModel',
 'MegatronGPTPromptLearningModel',
 'MultiLabelIntentSlotClassificationModel',
 'PunctuationCapitalizationLexicalAudioModel',
 'PunctuationCapitalizationModel',
 'QAModel',
 'SpellcheckingAsrCustomizationModel',
 'Text2SparqlModel',
 'TextClassificationModel',
 'ThutmoseTaggerModel',
 'TokenClassificationModel',
 'TransformerLMModel',
 'ZeroShotIntentModel']

### **List TTS Model Classes in NeMo**

In [None]:
tts_models = [model for model in dir(nemo_tts.models) if model.endswith("Model")]
tts_models

['AlignerModel',
 'AudioCodecModel',
 'FastPitchModel',
 'GriffinLimModel',
 'HifiGanModel',
 'MelPsuedoInverseModel',
 'MixerTTSModel',
 'RadTTSModel',
 'SpectrogramEnhancerModel',
 'Tacotron2Model',
 'TwoStagesModel',
 'UnivNetModel',
 'VitsModel',
 'WaveGlowModel']
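
The three listings above share one pattern: filter `dir(module)` for names ending in `Model`. Note that this suffix test skips classes such as `EncDecCTCModelBPE`, which is used in the next section but ends in `BPE`. The sketch below wraps the pattern in a hypothetical helper, `list_model_classes`, with an option to match `Model` anywhere in the name. A stand-in namespace replaces the NeMo collection so the code runs on its own.

```python
from types import SimpleNamespace


def list_model_classes(module, suffix_only=True):
    """Return sorted public names in `module` that look like model classes."""
    def looks_like_model(name):
        if suffix_only:
            return name.endswith("Model")
        return "Model" in name

    return sorted(
        name
        for name in dir(module)
        if not name.startswith("_") and looks_like_model(name)
    )


# Stand-in for a NeMo collection such as nemo_asr.models.
fake_models = SimpleNamespace(
    EncDecCTCModel=object, EncDecCTCModelBPE=object, some_helper=len
)

print(list_model_classes(fake_models))  # ['EncDecCTCModel']
print(list_model_classes(fake_models, suffix_only=False))  # ['EncDecCTCModel', 'EncDecCTCModelBPE']
```

With NeMo imported, `list_model_classes(nemo_asr.models, suffix_only=False)` would be the equivalent call.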

### **Load Pretrained Citrinet ASR Model**

In [None]:
citrinet = nemo_asr.models.EncDecCTCModelBPE.from_pretrained('stt_en_citrinet_512')


[NeMo I 2025-02-05 19:15:15 nemo_logging:393] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_citrinet_512/versions/1.0.0rc1/files/stt_en_citrinet_512.nemo to /root/.cache/torch/NeMo/NeMo_2.2.0rc2/stt_en_citrinet_512/3262321355385bb7cf5a583146117d77/stt_en_citrinet_512.nemo
[NeMo I 2025-02-05 19:15:21 nemo_logging:393] Instantiating model from pre-trained checkpoint
[NeMo I 2025-02-05 19:15:25 nemo_logging:393] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2025-02-05 19:15:25 nemo_logging:405] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 32
    trim_silence: true
    max_duration: 16.7
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    
[NeMo W 2025-02-05 19:15:25 nemo_logging:405] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 32
    shuffle: false
    
[NeMo W 2025-02-05 19:15:25 nemo_logging:405] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data l

[NeMo I 2025-02-05 19:15:25 nemo_logging:393] PADDING: 16
[NeMo I 2025-02-05 19:15:27 nemo_logging:393] Model EncDecCTCModelBPE was successfully restored from /root/.cache/torch/NeMo/NeMo_2.2.0rc2/stt_en_citrinet_512/3262321355385bb7cf5a583146117d77/stt_en_citrinet_512.nemo.


### **Summarize Citrinet ASR Model**

In [None]:
citrinet.summarize()


  | Name              | Type                              | Params | Mode 
--------------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0      | train
1 | encoder           | ConvASREncoder                    | 36.3 M | train
2 | decoder           | ConvASRDecoder                    | 657 K  | train
3 | loss              | CTCLoss                           | 0      | train
4 | spec_augmentation | SpectrogramAugmentation           | 0      | train
5 | wer               | WER                               | 0      | train
--------------------------------------------------------------------------------
37.0 M    Trainable params
0         Non-trainable params
37.0 M    Total params
147.977   Total estimated model params size (MB)
943       Modules in train mode
0         Modules in eval mode
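
The size estimate in the summary can be checked by hand: roughly 37.0 M parameters at 4 bytes each comes to about 148 MB. The sketch below reproduces that arithmetic, assuming 32-bit parameters and megabytes of 10^6 bytes. Both function names are hypothetical.

```python
def estimated_size_mb(num_params, bits_per_param=32):
    """Size in megabytes (10**6 bytes) of `num_params` parameters."""
    return num_params * (bits_per_param / 8) / 1e6


def params_from_size_mb(size_mb, bits_per_param=32):
    """Invert the estimate: parameter count implied by a size in MB."""
    return round(size_mb * 1e6 / (bits_per_param / 8))


total_params = params_from_size_mb(147.977)
print(total_params)  # 36994250
print(round(estimated_size_mb(total_params), 3))  # 147.977
print(round(estimated_size_mb(total_params, bits_per_param=16), 4))  # 73.9885
```

The implied count of about 36.99 M agrees with the 37.0 M total in the table, and the last line shows that 16-bit storage would halve the estimate.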