Zero-shot evaluation pipeline for mcore RETRO (NVIDIA#8941)

* update branch Signed-off-by: eharper <eharper@nvidia.com> * Add dist ckpt support for regular optimizers (NVIDIA#7749) * Add dist ckpt support for regular optimizers Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com> * [tutorial] fixed missing RIR scripts file. (NVIDIA#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * fix imports Signed-off-by: dimapihtar <dpihtar@gmail.com> * imports fix Signed-off-by: dimapihtar <dpihtar@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci imports fix Signed-off-by: dimapihtar <dpihtar@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert asr notebook Signed-off-by: dimapihtar <dpihtar@gmail.com> * revert asr notebook Signed-off-by: dimapihtar <dpihtar@gmail.com> --------- Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: dimapihtar <dpihtar@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: dimapihtar <dpihtar@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Pin lhotse=1.19.2 in r1.23.0 (NVIDIA#8303) Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Cache Aware Streaming tutorial notebook (NVIDIA#8296) * add notebook Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> * rename old notebook to Buffered_Streaming Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> * call setup_streaming_params in set_default_att_context_size method Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> * update links in docs Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> * update links to tutorials in docs Signed-off-by: Elena Rastorgueva 
<erastorgueva@nvidia.com> * remove hard-coding Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> * rename var Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> --------- Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> * fix path location and branch (NVIDIA#8304) * fix path location and branch Signed-off-by: Nithin Rao Koluguri <nithinraok> * change to a floating point number Signed-off-by: Nithin Rao Koluguri <nithinraok> --------- Signed-off-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com> * add deallocate pipeline output optimization (NVIDIA#8279) * add deallocate pipeline output optimization Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> Co-authored-by: Jimmy Zhang <jiemingz@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fix memory leak caused by context parallelism hanging references by omegaconf (NVIDIA#8299) * save cp_size to self Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * use parallel_state instead of self Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> --------- Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> Co-authored-by: Jimmy Zhang <jiemingz@nvidia.com> Co-authored-by: Eric Harper <complex451@gmail.com> * remove assertion (NVIDIA#8302) Signed-off-by: dimapihtar <dpihtar@gmail.com> * Update PEFT Doc (NVIDIA#8262) * update peft doc Signed-off-by: Chen Cui <chcui@nvidia.com> * remove old prompt learning doc and notebook Signed-off-by: Chen Cui <chcui@nvidia.com> * fix table Signed-off-by: Chen Cui <chcui@nvidia.com> * fix table Signed-off-by: Chen Cui <chcui@nvidia.com> * fix table Signed-off-by: Chen Cui <chcui@nvidia.com> * Merge branch 'r1.23.0' into chcui/update_peft_doc Signed-off-by: Chen Cui 
<chcui@nvidia.com> * revert accidental changes Signed-off-by: Chen Cui <chcui@nvidia.com> * revert accidental changes Signed-off-by: Chen Cui <chcui@nvidia.com> --------- Signed-off-by: Chen Cui <chcui@nvidia.com> * Attention encoder-decoder models for multiple speech-to-text tasks (NVIDIA#8242) (NVIDIA#8324) * Rebasing canary changes at current main Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Move the changes from asr transformer to nlp transformer as originally intended Signed-off-by: Piotr Żelasko <petezor@gmail.com> * update eval to strip spaces before punctuations Signed-off-by: stevehuang52 <heh@nvidia.com> * update pc strip Signed-off-by: stevehuang52 <heh@nvidia.com> * [canary] Refactor: `PromptedAudioToTextLhotseDataset` and `EncDecMultiTaskModel` (NVIDIA#8247) * Create a separate CanaryDataset and use it inside `transformer_bpe_models.py`. Ditches `token_sequence_format`. Signed-off-by: Piotr Żelasko <petezor@gmail.com> * [canary] Refactor: move changes in transformer_bpe_models.py to Canar… (NVIDIA#8252) * [canary] Refactor: move changes in transformer_bpe_models.py to CanaryModel Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Rename `CanaryModel` to `EncDecMultiTaskModel` and remove inheritance from `EncDecTransfModelBPE`; add a separate config for this model Signed-off-by: Piotr Żelasko <petezor@gmail.com> --------- Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Rename `CanaryDataset` to `PromptedAudioToTextLhotseDataset`; add `prompt_format_fn` argument; clean-up the `_canary_prompt_format` function a bit Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Move tokenization into `prompt_format_fn`, fix usage, add docs Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Backward-compatible utterance validation Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Improve type annotations Signed-off-by: Piotr Żelasko <petezor@gmail.com> * config and prompt_fn registration changes from review Signed-off-by: Piotr Żelasko <petezor@gmail.com> 
--------- Signed-off-by: Piotr Żelasko <petezor@gmail.com> * fix transcribe config Signed-off-by: stevehuang52 <heh@nvidia.com> * Refactor Canary to follow schema of remaining ASR models (NVIDIA#8260) * Initial draft of multi task beam decoding strategy Signed-off-by: smajumdar <titu1994@gmail.com> * Stabilize inference Signed-off-by: smajumdar <titu1994@gmail.com> * Update AED Multi Task model to mostly conform to Archetype-Type format. Update config Signed-off-by: smajumdar <titu1994@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add change decoding strategy Signed-off-by: smajumdar <titu1994@gmail.com> * Remove redundant imports Signed-off-by: smajumdar <titu1994@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Cleanup Signed-off-by: smajumdar <titu1994@gmail.com> * Cleanup Signed-off-by: smajumdar <titu1994@gmail.com> * remove asr transformer dependency on nlp Signed-off-by: stevehuang52 <heh@nvidia.com> * clean up Signed-off-by: stevehuang52 <heh@nvidia.com> * copy token_classifier from nlp to asr Signed-off-by: stevehuang52 <heh@nvidia.com> * Address comments Signed-off-by: smajumdar <titu1994@gmail.com> * Add typing to beam decoding Signed-off-by: smajumdar <titu1994@gmail.com> * Make prompt format configurable Signed-off-by: smajumdar <titu1994@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * drop asr dependency on nlp Signed-off-by: stevehuang52 <heh@nvidia.com> --------- Signed-off-by: smajumdar <titu1994@gmail.com> Signed-off-by: stevehuang52 <heh@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: stevehuang52 <heh@nvidia.com> * fix transcribe, update asr evaluator Signed-off-by: stevehuang52 <heh@nvidia.com> * Extend the docs for the canary prompt_fn Signed-off-by: Piotr Żelasko 
<petezor@gmail.com> * Incorporate changes from Nithin's code review Signed-off-by: Piotr Żelasko <petezor@gmail.com> * training bug fix and adding launch script for speech_multitask (NVIDIA#8270) * bug fix and adding launch script for speech_multitask Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * update launch script example in speech_to_text_aed.py Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> --------- Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> * Fix: drop_last must be true in validation/test otherwise the training will hang Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com> * revert to current transcribe API Signed-off-by: stevehuang52 <heh@nvidia.com> * revert changes to NLP, update docs Signed-off-by: stevehuang52 <heh@nvidia.com> * update eval utils Signed-off-by: stevehuang52 <heh@nvidia.com> * update docs Signed-off-by: stevehuang52 <heh@nvidia.com> * Remove DALI; rename compute_audio_loss to compute_loss Signed-off-by: Piotr Żelasko <petezor@gmail.com> * set default use_model_transcribe=False Signed-off-by: stevehuang52 <heh@nvidia.com> * change os.path.dirname to pathlib Signed-off-by: stevehuang52 <heh@nvidia.com> * [canary] Test for CanaryTokenizer + refactoring (NVIDIA#8285) * Test for CanaryTokenizer Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Attempt at refactor... 
Signed-off-by: Piotr Żelasko <petezor@gmail.com> --------- Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Update config for AED models (NVIDIA#8294) Signed-off-by: smajumdar <titu1994@gmail.com> * set default calculate_wer=False in transcribe_speech.py Signed-off-by: stevehuang52 <heh@nvidia.com> * Attention encoder-decoder models for multiple speech-to-text tasks Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Apply suggestions from code review, part 1 Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com> Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Apply suggestions from code review, part 2 Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Document compute_loss Signed-off-by: Piotr Żelasko <petezor@gmail.com> * update transcribe_speech.py Signed-off-by: stevehuang52 <heh@nvidia.com> * add docstring Signed-off-by: stevehuang52 <heh@nvidia.com> * Attention encoder-decoder models for multiple speech-to-text tasks Signed-off-by: Piotr Żelasko <petezor@gmail.com> --------- Signed-off-by: Piotr Żelasko <petezor@gmail.com> Signed-off-by: stevehuang52 <heh@nvidia.com> Signed-off-by: smajumdar <titu1994@gmail.com> Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com> Co-authored-by: stevehuang52 <heh@nvidia.com> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Krishna Puvvada <93558329+krishnacpuvvada@users.noreply.github.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com> (cherry picked from commit d10726d) Co-authored-by: Piotr Żelasko <petezor@gmail.com> * add code for calling mcore_retro in NeMo * add code for calling mcore_retro in NeMo * runnable, training curve match retro mcore and nemo * working on retro inference * working on 
megatron_retro_eval.py and megatron_retro_inference.yaml * refactoring text_generation_utils code and retro inference relevant files * clean PR * resolving quick hacks (reading number of train/valid samples from workdir, discrepancy in total samples and samples with neighbors retrieved, tokenizers) * clean repository * revert changes to inference/eval code to original in main * clean code * runnable training code, with already implemented eval code * [tutorial] fixed missing RIR scripts file. (NVIDIA#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add values to en tts dict (NVIDIA#7879) Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com> * Add Bert HF checkpoint converter (NVIDIA#8088) * Add Bert HF checkpoint converter Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Reformat Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Add BERT ONNX export * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add NeMo BERT to HF BERT script * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Clean code Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update argument names Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Update build_transformer_config in Bert Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> --------- Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Bobby Chen <bobchen@nvidia.com> * revert to original eval code files * revert to original eval code files 2 * revert to original eval code files 3 * revert to original eval code files 4 * clean code * clean code * update my code to support changes from latest main *
commit before rebase r1.23.0 * Multimodal r1.23.0 bug fix (NVIDIA#8315) * Rename quick-gelu Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * ddpm config guard Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix ddpm edit api Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix insert_image_token cfg issue Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * neva updates Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * reformat Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Add back jenkins Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix jenkins Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bugs Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Update default neva template Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> --------- Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * copy paste files from r1.23.0 * clean PR * Fixes for MoE parameter passing & use of AutoTokenizer/Model for mistral. 
(NVIDIA#8272) Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Keep max_seqlen and cu_seqlens_argmin for later micro-batches when PP>1 (NVIDIA#8334) Signed-off-by: Sangkug Lym <slym@nvidia.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Remove asr webapp (NVIDIA#8347) Signed-off-by: smajumdar <titu1994@gmail.com> * remove _target_ at model level in aed config (NVIDIA#8351) Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> * revert changes for tts and asr * Add change_vocabulary and save_tokenizers() support to Multitask ASR models (NVIDIA#8357) * Add change_vocabulary and save_tokenizers() support Signed-off-by: smajumdar <titu1994@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update nemo/collections/asr/models/aed_multitask_models.py Co-authored-by: Piotr Żelasko <petezor@gmail.com> Signed-off-by: Somshubra Majumdar <titu1994@gmail.com> --------- Signed-off-by: smajumdar <titu1994@gmail.com> Signed-off-by: Somshubra Majumdar <titu1994@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Piotr Żelasko <petezor@gmail.com> * Change default (NVIDIA#8371) Signed-off-by: smajumdar <titu1994@gmail.com> * implement retro's own fwd_bwd_step() and validation_step() to not have argument first_val_step, which the MLM commit doesn't support * adding megatron compile_helpers(), in future can be fixed with correct MLM commit * bug fix in fast-conformer-aed.yaml and adding jenkins test for speech_to_text_aed model (NVIDIA#8368) Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com> * Enable megatron core loggers for GPT pretraining (NVIDIA#8354) * Logging changes tested for gpt_pretraining Signed-off-by: Aishwarya Bhandare <abhandare@nvidia.com> * Additional 
args Signed-off-by: Aishwarya Bhandare <abhandare@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Aishwarya Bhandare <abhandare@nvidia.com> Co-authored-by: Aishwarya Bhandare <abhandare@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> * mcore ds fix (NVIDIA#8283) * [tutorial] fixed missing RIR scripts file. (NVIDIA#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add values to en tts dict (NVIDIA#7879) Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com> * mcore ds fix Signed-off-by: Dmytro Pykhtar <dpykhtar@login-eos01.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update mcore Signed-off-by: dimapihtar <dpihtar@gmail.com> * revert asr files Signed-off-by: dimapihtar <dpihtar@gmail.com> * add comments Signed-off-by: dimapihtar <dpihtar@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add support for mcore mock dataset Signed-off-by: dimapihtar <dpihtar@gmail.com> * update mcore version Signed-off-by: dimapihtar <dpihtar@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update gpt cfg Signed-off-by: dimapihtar <dpihtar@gmail.com> * update mcore commit Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix Bert unit tests Signed-off-by: dimapihtar <dpihtar@gmail.com> * update bert tests Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix bert mcore test Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix gpt jenkins tests Signed-off-by: dimapihtar <dpihtar@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update apex & TE commits Signed-off-by: dimapihtar 
<dpihtar@gmail.com> * revert apex installation Signed-off-by: dimapihtar <dpihtar@gmail.com> * turn off the fusion for jenkins Signed-off-by: dimapihtar <dpihtar@gmail.com> --------- Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com> Signed-off-by: Dmytro Pykhtar <dpykhtar@login-eos01.eos.clusters.nvidia.com> Signed-off-by: dimapihtar <dpihtar@gmail.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Mariana <47233618+mgrafu@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar <dpykhtar@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Pablo Garay <palenq@gmail.com> * addressing Eric's reviews * adding existing implementation RETRO files * adding existing implementation RETRO files * Add Finetuning tutorial with HF Datasets (NVIDIA#8356) * Add Finetuning tutorial with HF Datasets Signed-off-by: Nithin Rao Koluguri <nithinraok> * update on Som comments Signed-off-by: Nithin Rao Koluguri <nithinraok> --------- Signed-off-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: Nithin Rao Koluguri <nithinraok> * release updates (NVIDIA#8378) * [tutorial] fixed missing RIR scripts file. 
(NVIDIA#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add values to en tts dict (NVIDIA#7879) Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com> * mcore ds fix Signed-off-by: Dmytro Pykhtar <dpykhtar@login-eos01.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update mcore Signed-off-by: dimapihtar <dpihtar@gmail.com> * revert asr files Signed-off-by: dimapihtar <dpihtar@gmail.com> * add comments Signed-off-by: dimapihtar <dpihtar@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add support for mcore mock dataset Signed-off-by: dimapihtar <dpihtar@gmail.com> * update mcore version Signed-off-by: dimapihtar <dpihtar@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update gpt cfg Signed-off-by: dimapihtar <dpihtar@gmail.com> * update mcore commit Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix Bert unit tests Signed-off-by: dimapihtar <dpihtar@gmail.com> * update bert tests Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix bert mcore test Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix gpt jenkins tests Signed-off-by: dimapihtar <dpihtar@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add support for dict data input type Signed-off-by: dimapihtar <dpihtar@gmail.com> * add mock ds test Signed-off-by: dimapihtar <dpihtar@gmail.com> * add test for dict data input type Signed-off-by: dimapihtar <dpihtar@gmail.com> * mcore ds fix Signed-off-by: dimapihtar <dpihtar@gmail.com> * data input fix Signed-off-by: dimapihtar <dpihtar@gmail.com> --------- Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com> Signed-off-by: Dmytro Pykhtar 
<dpykhtar@login-eos01.eos.clusters.nvidia.com> Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Mariana <47233618+mgrafu@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar <dpykhtar@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Pablo Garay <palenq@gmail.com> * MCore dataset compatibility for tokenizers (NVIDIA#8390) * Add unique_identifiers for all tokenizers and eod for SentencePieceTokenizer Signed-off-by: Valerie Sarge <vsarge@nvidia.com> * Add generalized token aliases to TokenizerSpec to conform with MegatronTokenizer's interface. Remove now-redundant individual fixes from AutoTokenizer and SentencePieceTokenizer. Signed-off-by: Valerie Sarge <vsarge@nvidia.com> --------- Signed-off-by: Valerie Sarge <vsarge@nvidia.com> Co-authored-by: Pablo Garay <palenq@gmail.com> * Mcore customization doc (NVIDIA#8298) * [tutorial] fixed missing RIR scripts file. 
(NVIDIA#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add values to en tts dict (NVIDIA#7879) Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com> * Add Bert HF checkpoint converter (NVIDIA#8088) * Add Bert HF checkpoint converter Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Reformat Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Add BERT ONNX export * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add NeMo BERT to HF BERT script * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Clean code Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update argument names Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Update build_transformer_config in Bert Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> --------- Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Bobby Chen <bobchen@nvidia.com> * initial placeholder Signed-off-by: Huiying Li <huiyingl@nvidia.com> * add to intro/index.rst Signed-off-by: Huiying Li <huiyingl@nvidia.com> * initial content update Signed-off-by: Huiying Li <willwin.lee@gmail.com> * add diff images Signed-off-by: Huiying Li <willwin.lee@gmail.com> size Signed-off-by: Huiying Li <willwin.lee@gmail.com> * minor fixes * minor language change Signed-off-by: Chen Cui <chcui@nvidia.com> * clean changes --------- Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: Huiying Li <huiyingl@nvidia.com> Signed-off-by: Huiying Li <willwin.lee@gmail.com> 
Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Mariana <47233618+mgrafu@users.noreply.github.com> Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Bobby Chen <bobchen@nvidia.com> Co-authored-by: Huiying Li <huiyingl@nvidia.com> Co-authored-by: Chen Cui <chcui@nvidia.com> * wer fix (NVIDIA#8404) Signed-off-by: Travis Bartley <tbartley@nvidia.com> * updated link to pubmed (NVIDIA#8402) Signed-off-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: Nithin Rao Koluguri <nithinraok> * Update NFA video download link (NVIDIA#8406) * update nfa nasa video link Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> * update link in markdown Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> --------- Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> * revert changes (NVIDIA#8410) Signed-off-by: Chen Cui <chcui@nvidia.com> * Fix dreambooth data sampler issue (NVIDIA#8400) * Turn on drop last Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Some neva fixes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fixed errors in the CTM gen functions (NVIDIA#8416) Signed-off-by: Taejin Park <tango4j@gmail.com> * add ensemble decoding fix (NVIDIA#8427) Signed-off-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: Nithin Rao Koluguri <nithinraok> * SDE bugfix log (NVIDIA#8430) Signed-off-by: George <gzelenfroind@nvidia.com> * mcore customization doc minor fix (NVIDIA#8421) Signed-off-by: Huiying Li <willwin.lee@gmail.com> * NeMo-Mistral to HF converter bugfix. 
(NVIDIA#8353) Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Fixing mcore bert for TP, PP and SP (NVIDIA#8336) * Fixing mcore bert for TP, PP and SP * Fixing mcore bert for TP, PP and SP * Fixing mcore version * Fixing mcore version * Update Jenkinsfile Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> * Update Jenkinsfile Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> * Update Jenkinsfile Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> --------- Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Co-authored-by: Shanmugam Ramasamy <shanmugamr@shanmugamr-mlt.client.nvidia.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Add settings to suppress bf16 compile errors in CI on V100 (NVIDIA#8481) * Add settings to suppress bf16 compile errors in CI on V100 Signed-off-by: Abhishree <abhishreetm@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree <abhishreetm@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * MoE parameter passing (NVIDIA#8255) * MoE parameter passing Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Pass EP/MoE params in consumer scripts. 
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * PR fixes Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Use latest commit of mcore-0.5 Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * CI fix Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Co-authored-by: Alexandros Koumparoulis <akoumparouli@dgx1v-loki-21.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update k2 version (NVIDIA#8478) (NVIDIA#8492) Signed-off-by: Vladimir Bataev <vbataev@nvidia.com> * Add fp8 support for SD/Update notebook paths (NVIDIA#8489) * Add fp8 support for SD/Update notebook paths Signed-off-by: Mingyuan Ma <mingyuanm@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Mingyuan Ma <mingyuanm@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> * pin to 0.5.0 (NVIDIA#8465) Signed-off-by: eharper <eharper@nvidia.com> * Update NeMo Multimodal Requirements (NVIDIA#8515) * Update requirements_multimodal.txt Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update github raw content link (NVIDIA#8517) Signed-off-by: Chen Cui <chcui@nvidia.com> * Add dep notice for notebooks (NVIDIA#8522) * add dep notice Signed-off-by: eharper <eharper@nvidia.com> * revert Signed-off-by: eharper <eharper@nvidia.com> --------- Signed-off-by: eharper <eharper@nvidia.com> * Revert FP8 
integration (NVIDIA#8520) * Revert FP8 integration Signed-off-by: Mingyuan Ma <mingyuanm@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Mingyuan Ma <mingyuanm@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update data prep notebook (NVIDIA#8532) Signed-off-by: Mingyuan Ma <mingyuanm@nvidia.com> * before update branch with latest r1.23.0 * update to run with MLM ae2817b3dde4efb1515061a5311d01d8f85bd99c (runnable training and saving checkpoint) * remove compile_helpers * reverse changes from main branch to r1.23.0 * adding *_legacy files * update MLM commit in Jenkinsfile to latest * debugging Jenkinstest: test different mcore import in retro_dataset * update Jenkinsfile edit megatron_retro_mutransfer_pretrain_legacy.py * removing all mcore RETRO to pass the Jenkinstest * fixing import legacy problem for tests/collections/nlp/test_indexed_retrieval_dataset.py * update Jenkinsfile file to use TE v0.7 * update NeMo to work with latest mcore RETRO (solving TE problems) * update TE commit Jenkinsfile to be the same with r1.23.0's Jenkinsfile * update commit for MLM * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * jenkinstest debugging * temporary fix RETRO's __init__ for jenkinstest * edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster * edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster * edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster * edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster * add model.data.dataloader_type=cyclic to jenkinsfile * runnable for inference * update code to work with latest megatron-lm main 81dab6067 * update M-LM commit in Jenkinsfile to latest main M-LM 81dab6067 * cleaning 
inference code * fix to by pass CI test bf16 problem (following this PR https://github.com/NVIDIA/NeMo/pull/8481/files) * isort and black * adjusting model.micro_batch_size to 1 * fix BRANCH = 'r1.23.0' * replace tutorials dir from main branch to huvu/mcore_retro * fix minor merges conflict * update Jenkinsfile * runnable with a temporary fix from Jacek (unfound -unfinished problem) * runnable with a temporary fix from Jacek (unfound -unfinished problem) * modified nlp_overrides.py back to original * fix checkpoint from Jacek Bieniusiewicz * config Jenkinsfile test * set RETRO Jenkins MBS to 1 * black fix * isort fix * update TE commit * update to latest Jenkinsfile with latest container and commits * remove new RETRO jenkinstest * merge latest main * put RETRO Jenkinstest to the right place * update code for megatron_retro_pretraining_legacy.py * update Jenkins and _legacy.py * update new RETRO jenkinstest to run faster * fixing errors from GitHub Advanced Security / CodeQL * fixing errors from GitHub Advanced Security / CodeQL * update manually branch to huvu/mcore_retro * remove DEBUGGING markers * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * copy paste scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt * update codes to fix Github warnings; adding cicd-main.yml action tests * cleaning code, addressing Shanmugam's comments * saving before pulling from main * cleaning code * adding deprecations note * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: eharper <eharper@nvidia.com> Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: Piotr Żelasko <petezor@gmail.com> Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> Signed-off-by: Nithin Rao Koluguri <nithinraok> Signed-off-by: Jimmy Zhang 
<jiemingz@nvidia.com> Signed-off-by: Chen Cui <chcui@nvidia.com> Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: Sangkug Lym <slym@nvidia.com> Signed-off-by: smajumdar <titu1994@gmail.com> Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Signed-off-by: Somshubra Majumdar <titu1994@gmail.com> Signed-off-by: Aishwarya Bhandare <abhandare@nvidia.com> Signed-off-by: Dmytro Pykhtar <dpykhtar@login-eos01.eos.clusters.nvidia.com> Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Signed-off-by: Valerie Sarge <vsarge@nvidia.com> Signed-off-by: Huiying Li <huiyingl@nvidia.com> Signed-off-by: Huiying Li <willwin.lee@gmail.com> Signed-off-by: Travis Bartley <tbartley@nvidia.com> Signed-off-by: Taejin Park <tango4j@gmail.com> Signed-off-by: George <gzelenfroind@nvidia.com> Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Signed-off-by: Abhishree <abhishreetm@gmail.com> Signed-off-by: Vladimir Bataev <vbataev@nvidia.com> Signed-off-by: Mingyuan Ma <mingyuanm@nvidia.com> Co-authored-by: eharper <eharper@nvidia.com> Co-authored-by: mikolajblaz <mikolajblaz@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: dimapihtar <dpihtar@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Piotr Żelasko <petezor@gmail.com> Co-authored-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com> Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com> Co-authored-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> Co-authored-by: Jimmy Zhang 
<jiemingz@nvidia.com> Co-authored-by: Chen Cui <chcui@nvidia.com> Co-authored-by: Huy Vu2 <huvu@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Mariana <47233618+mgrafu@users.noreply.github.com> Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: Bobby Chen <bobchen@nvidia.com> Co-authored-by: akoumpa <153118171+akoumpa@users.noreply.github.com> Co-authored-by: Sangkug Lym <slym@nvidia.com> Co-authored-by: Krishna Puvvada <93558329+krishnacpuvvada@users.noreply.github.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: ashbhandare <ash.bhandare@gmail.com> Co-authored-by: Aishwarya Bhandare <abhandare@nvidia.com> Co-authored-by: Dmytro Pykhtar <dpykhtar@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Pablo Garay <palenq@gmail.com> Co-authored-by: Valerie Sarge <vsarge@nvidia.com> Co-authored-by: Huiying <willwin.lee@gmail.com> Co-authored-by: Huiying Li <huiyingl@nvidia.com> Co-authored-by: tbartley94 <90423858+tbartley94@users.noreply.github.com> Co-authored-by: Taejin Park <tango4j@gmail.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Co-authored-by: Shanmugam Ramasamy <shanmugamr@shanmugamr-mlt.client.nvidia.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Alexandros Koumparoulis <akoumparouli@dgx1v-loki-21.nvidia.com> Co-authored-by: Vladimir Bataev <vbataev@nvidia.com> Co-authored-by: Ming <111467530+Victor49152@users.noreply.github.com> Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com> Co-authored-by: root <root@eos0150.eos.clusters.nvidia.com>
alxzhang-amazon · Apr 26, 2024 · 1614576 · 1614576
1 parent ee2ff42
commit 1614576
Showing 9 changed files with 856 additions and 145 deletions.
diff --git a/examples/nlp/language_modeling/conf/megatron_retro_inference.yaml b/examples/nlp/language_modeling/conf/megatron_retro_inference.yaml
@@ -3,42 +3,40 @@ inference:
   top_k: 0  # The number of highest probability vocabulary tokens to keep for top-k-filtering.
   top_p: 0.9 # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
   temperature: 1.0 # sampling temperature
-  add_BOS: True # add the bos token at the begining of the prompt
+  add_BOS: False # add the bos token at the beginning of the prompt
   tokens_to_generate: 30 # The maximum length of the sequence to be generated.
   all_probs: False  # whether return the log prob for all the tokens in vocab
   repetition_penalty: 1.2  # The parameter for repetition penalty. 1.0 means no penalty.
   min_tokens_to_generate: 0  # The minimum length of the sequence to be generated.
   compute_logprob: False  # a flag used to compute logprob of all the input text, a very special case of running inference, default False
-
+  end_strings: ["<|endoftext|>"]  # generation will stop when one of these tokens is generated
+  # RETRO-specific arguments
+  retro_inference:
+    retro_gpt_retrieved_length: 128
+    retro_num_neighbors: 2
+    ft_neighbours: 0
+    reuse_top: False
 
 trainer:
   devices: 1
   num_nodes: 1
   accelerator: gpu
   logger: False # logger provided by exp_manager
-  precision: 16 # 16, 32, or bf16
-
-inference_batch_size: 2
+  precision: 32 # 16, 32, or bf16
+  use_distributed_sampler: False
+  
 tensor_model_parallel_size: -1
 pipeline_model_parallel_size: -1
 pipeline_model_parallel_split_rank: -1 # used for encoder and decoder model (0 for others)
-retro_model_file: null  # RETRO nemo file path
+megatron_amp_O2: False  # Enable O2-level automatic mixed precision to save memory
 
-use_predict_method: False  # whether to use the predict method
+retro_model_file: null  # Retro nemo file path
+checkpoint_dir: null # checkpoint file dir. This is used to load the PTL checkpoint generated during the Retro training
+checkpoint_name: null # PTL checkpoint file name, only used for PTL checkpoint loading
+hparams_file: null # model configuration file, only used for PTL checkpoint loading
 
-prompts: # prompts for RETRO model inference
-  - "hello,"
-  - "good morning,"
-  - "good afternoon,"
-  - "good evening,"
-
-########### Faiss service parameters ########
-retrieval_service:
-  strategy: RetroModelTextGenerationStrategy  # choose customized inference strategy 
-  neighbors: 4
-  frequent_query: False  # for the current token generation, frequently update the retrieval context. If false, update it every 64 tokens 
-  pad_tokens: True # pad the tokens at the beginning to make it minimum of 64 tokens for retrieving at least once
-  store_retrieved: False # whether store the retrieved documents, so it can be checked
-  combo_service:
-    service_ip: '0.0.0.0'
-    service_port: 17181 
+# RETRO inference
+prompt: "sample prompt"
+neighbors:
+  - "neighbor text 1"
+  - "neighbor text 2"
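The new config replaces the Faiss retrieval service with a single `prompt` and an explicit `neighbors` list, next to `inference.retro_inference.retro_num_neighbors`. The usage example in this commit passes two neighbors together with `retro_num_neighbors=2`; assuming the two are meant to agree, a pre-flight check could be sketched as follows. The helper name `check_retro_request` and the plain-dict config are illustrative only and are not part of NeMo.

```python
def check_retro_request(cfg: dict) -> tuple[str, list[str]]:
    """Return (prompt, neighbors) after basic consistency checks.

    Hypothetical helper: `cfg` is a plain dict mirroring the YAML above.
    """
    prompt = cfg["prompt"]
    neighbors = list(cfg["neighbors"])
    expected = cfg["inference"]["retro_inference"]["retro_num_neighbors"]
    if not isinstance(prompt, str) or not prompt:
        raise ValueError("prompt must be a non-empty string")
    # Assumption: the neighbor count should match retro_num_neighbors.
    if len(neighbors) != expected:
        raise ValueError(f"expected {expected} neighbors, got {len(neighbors)}")
    return prompt, neighbors


# Values copied from the defaults in megatron_retro_inference.yaml.
example_cfg = {
    "prompt": "sample prompt",
    "neighbors": ["neighbor text 1", "neighbor text 2"],
    "inference": {"retro_inference": {"retro_num_neighbors": 2}},
}
prompt, neighbors = check_retro_request(example_cfg)
```

Running a check like this before building the model would surface a mismatched override early, rather than after the checkpoint has been loaded.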
diff --git a/examples/nlp/language_modeling/conf/megatron_retro_inference_legacy.yaml b/examples/nlp/language_modeling/conf/megatron_retro_inference_legacy.yaml
@@ -0,0 +1,46 @@
+# (This inference config for native NeMo RETRO will soon be deprecated. For the new mcore RETRO inference config, see ./megatron_retro_inference.yaml)
+
+inference:
+  greedy: False # Whether or not to use sampling ; use greedy decoding otherwise
+  top_k: 0  # The number of highest probability vocabulary tokens to keep for top-k-filtering.
+  top_p: 0.9 # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
+  temperature: 1.0 # sampling temperature
+  add_BOS: True # add the bos token at the beginning of the prompt
+  tokens_to_generate: 30 # The maximum length of the sequence to be generated.
+  all_probs: False  # whether return the log prob for all the tokens in vocab
+  repetition_penalty: 1.2  # The parameter for repetition penalty. 1.0 means no penalty.
+  min_tokens_to_generate: 0  # The minimum length of the sequence to be generated.
+  compute_logprob: False  # a flag used to compute logprob of all the input text, a very special case of running inference, default False
+
+
+trainer:
+  devices: 1
+  num_nodes: 1
+  accelerator: gpu
+  logger: False # logger provided by exp_manager
+  precision: 16 # 16, 32, or bf16
+
+inference_batch_size: 2
+tensor_model_parallel_size: -1
+pipeline_model_parallel_size: -1
+pipeline_model_parallel_split_rank: -1 # used for encoder and decoder model (0 for others)
+retro_model_file: null  # RETRO nemo file path
+
+use_predict_method: False  # whether to use the predict method
+
+prompts: # prompts for RETRO model inference
+  - "hello,"
+  - "good morning,"
+  - "good afternoon,"
+  - "good evening,"
+
+########### Faiss service parameters ########
+retrieval_service:
+  strategy: RetroModelTextGenerationStrategy  # choose customized inference strategy 
+  neighbors: 4
+  frequent_query: False  # for the current token generation, frequently update the retrieval context. If false, update it every 64 tokens 
+  pad_tokens: True # pad the tokens at the beginning to make it minimum of 64 tokens for retrieving at least once
+  store_retrieved: False # whether store the retrieved documents, so it can be checked
+  combo_service:
+    service_ip: '0.0.0.0'
+    service_port: 17181 
diff --git a/examples/nlp/language_modeling/conf/megatron_retro_qatask.yaml b/examples/nlp/language_modeling/conf/megatron_retro_qatask.yaml
@@ -0,0 +1,40 @@
+inference:
+  greedy: False # Whether or not to use sampling ; use greedy decoding otherwise
+  top_k: 0  # The number of highest probability vocabulary tokens to keep for top-k-filtering.
+  top_p: 0.9 # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
+  temperature: 1.0 # sampling temperature
+  add_BOS: False # add the bos token at the beginning of the prompt
+  tokens_to_generate: 30 # The maximum length of the sequence to be generated.
+  all_probs: False  # whether return the log prob for all the tokens in vocab
+  repetition_penalty: 1.2  # The parameter for repetition penalty. 1.0 means no penalty.
+  min_tokens_to_generate: 0  # The minimum length of the sequence to be generated.
+  compute_logprob: False  # a flag used to compute logprob of all the input text, a very special case of running inference, default False
+  end_strings: ["<|endoftext|>"]  # generation will stop when one of these tokens is generated
+  # RETRO-specific arguments
+  retro_inference:
+    retro_gpt_retrieved_length: 128
+    retro_num_neighbors: 2
+    ft_neighbours: 0
+    reuse_top: False
+
+trainer:
+  devices: 1
+  num_nodes: 1
+  accelerator: gpu
+  logger: False # logger provided by exp_manager
+  precision: 32 # 16, 32, or bf16
+  use_distributed_sampler: False
+
+tensor_model_parallel_size: -1
+pipeline_model_parallel_size: -1
+pipeline_model_parallel_split_rank: -1 # used for encoder and decoder model (0 for others)
+megatron_amp_O2: False  # Enable O2-level automatic mixed precision to save memory
+
+retro_model_file: null  # Retro nemo file path
+checkpoint_dir: null # checkpoint file dir. This is used to load the PTL checkpoint generated during the Retro training
+checkpoint_name: null # PTL checkpoint file name, only used for PTL checkpoint loading
+hparams_file: null # model configuration file, only used for PTL checkpoint loading
+
+# qa tasks
+qa_file_path: null
+pred_file_path: null
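The `checkpoint_dir` and `checkpoint_name` fields above are joined into one path, and the eval script in this commit treats a directory at that path as a distributed checkpoint and anything else as a legacy checkpoint. A minimal sketch of that decision is shown below. `classify_checkpoint` is a hypothetical name, the file and directory names are made up, and the rank injection NeMo applies to legacy checkpoints is left out.

```python
import os
import tempfile


def classify_checkpoint(checkpoint_dir: str, checkpoint_name: str) -> tuple[str, str]:
    """Join the two config fields and report which loading path would apply."""
    path = os.path.join(checkpoint_dir, checkpoint_name)
    # A directory means a distributed checkpoint; anything else is treated as
    # a legacy checkpoint (the model-parallel rank rewrite is omitted here).
    kind = "distributed" if os.path.isdir(path) else "legacy"
    return path, kind


with tempfile.TemporaryDirectory() as root:
    # "step_100" and "model.ckpt" are placeholder names for illustration.
    os.mkdir(os.path.join(root, "step_100"))
    dist_path, dist_kind = classify_checkpoint(root, "step_100")
    legacy_path, legacy_kind = classify_checkpoint(root, "model.ckpt")
```

Because a missing path is also "not a directory", a typo in `checkpoint_name` would fall into the legacy branch in this sketch; an explicit existence check may be worth adding in practice.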
diff --git a/examples/nlp/language_modeling/megatron_retro_eval.py b/examples/nlp/language_modeling/megatron_retro_eval.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2022, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2021, NVIDIA CORPORATION.  All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,128 +12,119 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import datetime
 import os
 
-from examples.nlp.language_modeling.megatron_gpt_eval import RequestDataSet
-from omegaconf.omegaconf import OmegaConf, open_dict
-from pytorch_lightning import Trainer
-from torch.utils.data import DataLoader
+import torch
+from omegaconf import OmegaConf
+from pytorch_lightning.trainer.trainer import Trainer
+from torch.utils.data import DataLoader, Dataset
 
-from nemo.collections.nlp.models.language_modeling.megatron_retrieval_model import MegatronRetrievalModel
-from nemo.collections.nlp.modules.common.transformer.text_generation import LengthParam, SamplingParam
-from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy, NLPSaveRestoreConnector
+from nemo.collections.nlp.models.language_modeling.megatron_retro_model import MegatronRetroModel
+from nemo.collections.nlp.modules.common.megatron.megatron_init import fake_initialize_model_parallel
+from nemo.collections.nlp.parts.nlp_overrides import CustomProgressBar, NLPDDPStrategy
 from nemo.core.config import hydra_runner
-
-try:
-    from megatron.core import parallel_state
-
-    HAVE_MEGATRON_CORE = True
-
-except (ImportError, ModuleNotFoundError):
-
-    HAVE_MEGATRON_CORE = False
+from nemo.utils.app_state import AppState
+from nemo.utils.model_utils import inject_model_parallel_rank
 
 """
-This is the script to run RETRO Model text generation.
+This is the script to run Retro text generation.
 
 Usage:
-    Assume the model has TP=1, PP=1
-    run greedy inference from a nemo file:
+    Currently, Mcore-based RETRO only supports a batch size of 1.
+    Example running greedy inference from a distributed checkpoint dir:
         python megatron_retro_eval.py \
+            checkpoint_dir=PATH_TO_CHECKPOINT \
+            checkpoint_name=CHECKPOINT_NAME \
+            inference.greedy=True \
+            inference.add_BOS=False \
             trainer.devices=1 \
             trainer.num_nodes=1 \
-            trainer.accelerator=gpu \
-            trainer.precision=16 \
-            inference.tokens_to_generate=128 \
-            inference.greedy=True \
-            retro_model_file=path_to_retro_nemo_file \
             tensor_model_parallel_size=-1 \
             pipeline_model_parallel_size=-1 \
-            retrieval_service.faiss_devices='0' \
-            retrieval_service.faiss_index=path_to_faiss_index \
-            retrieval_service.retrieval_index=path_to_retrieval_dataset \
-            retrieval_service.neighbors=20
-"""
+            prompt="sample prompt" \
+            inference.retro_inference.retro_num_neighbors=2 \
+            neighbors=["neighbor text 1", "neighbor text 2"]
 
 
-@hydra_runner(config_path="conf", config_name="megatron_retro_inference")
-def main(cfg) -> None:
-    trainer = Trainer(strategy=NLPDDPStrategy(), **cfg.trainer)
+        ```
+"""
 
-    model_path = cfg.retro_model_file
+if not torch.cuda.is_available():
+    raise EnvironmentError("GPU is needed for the inference")
 
-    save_restore_connector = NLPSaveRestoreConnector()
 
-    if os.path.isdir(model_path):
-        save_restore_connector.model_extracted_dir = model_path
+class RequestDataSet(Dataset):
+    def __init__(self, sentences, neighbors):
+        super().__init__()
+        self.sentences = sentences
+        self.neighbors = neighbors
 
-    model_cfg = MegatronRetrievalModel.restore_from(
-        model_path, trainer=trainer, return_config=True, save_restore_connector=save_restore_connector,
-    )
+    def __len__(self,):
+        return len(self.sentences)
 
-    with open_dict(model_cfg):
-        model_cfg.precision = trainer.precision
-        model_cfg.sequence_parallel = False
-        model_cfg.activations_checkpoint_granularity = None
-        model_cfg.activations_checkpoint_method = None
-
-    if (
-        cfg.tensor_model_parallel_size < 0
-        or cfg.pipeline_model_parallel_size < 0
-        or cfg.get('pipeline_model_parallel_split_rank', -1) < 0
-    ):
-        with open_dict(cfg):
-            cfg.tensor_model_parallel_size = model_cfg.get('tensor_model_parallel_size', 1)
-            cfg.pipeline_model_parallel_size = model_cfg.get('pipeline_model_parallel_size', 1)
-            cfg.pipeline_model_parallel_split_rank = model_cfg.get('pipeline_model_parallel_split_rank', 0)
-
-    model = MegatronRetrievalModel.restore_from(
-        model_path, trainer=trainer, save_restore_connector=save_restore_connector, override_config_path=model_cfg,
-    )
+    def __getitem__(self, idx):
+        return {'prompts': self.sentences[idx], 'neighbors': self.neighbors[idx]}
 
-    length_params: LengthParam = {
-        "max_length": cfg.inference.tokens_to_generate,
-        "min_length": cfg.inference.min_tokens_to_generate,
-    }
 
-    sampling_params: SamplingParam = {
-        "use_greedy": cfg.inference.greedy,
-        "temperature": cfg.inference.temperature,
-        "top_k": cfg.inference.top_k,
-        "top_p": cfg.inference.top_p,
-        "repetition_penalty": cfg.inference.repetition_penalty,
-        "add_BOS": cfg.inference.add_BOS,
-        "all_probs": cfg.inference.all_probs,
-        "compute_logprob": cfg.inference.compute_logprob,
-    }
+@hydra_runner(config_path="conf", config_name="megatron_retro_inference")
+def main(cfg) -> None:
+
+    # trainer required for restoring model parallel models
+    trainer = Trainer(
+        strategy=NLPDDPStrategy(timeout=datetime.timedelta(seconds=18000)),
+        **cfg.trainer,
+        callbacks=[CustomProgressBar()],
+    )
 
-    # check whether the DDP is initialized
-    if not parallel_state.is_initialized():
+    if cfg.checkpoint_dir:
+        app_state = AppState()
+        if cfg.tensor_model_parallel_size > 1 or cfg.pipeline_model_parallel_size > 1:
+            app_state.model_parallel_size = cfg.tensor_model_parallel_size * cfg.pipeline_model_parallel_size
+            app_state.tensor_model_parallel_size = cfg.tensor_model_parallel_size
+            app_state.pipeline_model_parallel_size = cfg.pipeline_model_parallel_size
+            (
+                app_state.tensor_model_parallel_rank,
+                app_state.pipeline_model_parallel_rank,
+                app_state.model_parallel_size,
+                app_state.data_parallel_size,
+                app_state.pipeline_model_parallel_split_rank,
+                app_state.virtual_pipeline_model_parallel_rank,
+            ) = fake_initialize_model_parallel(
+                world_size=app_state.model_parallel_size,
+                rank=trainer.global_rank,
+                tensor_model_parallel_size_=cfg.tensor_model_parallel_size,
+                pipeline_model_parallel_size_=cfg.pipeline_model_parallel_size,
+                pipeline_model_parallel_split_rank_=cfg.pipeline_model_parallel_split_rank,
+            )
+        checkpoint_path = os.path.join(cfg.checkpoint_dir, cfg.checkpoint_name)
+        # checkpoint_path is a dir in case of distributed checkpointing
+        if not os.path.isdir(checkpoint_path):
+            # legacy checkpoint needs model parallel rank injection
+            checkpoint_path = inject_model_parallel_rank(os.path.join(cfg.checkpoint_dir, cfg.checkpoint_name))
+        model = MegatronRetroModel.load_from_checkpoint(
+            checkpoint_path, hparams_file=cfg.hparams_file, trainer=trainer
+        )
+    else:
+        raise ValueError("A distributed checkpoint dir (checkpoint_dir) is required for loading Mcore RETRO.")
 
-        def dummy():
-            return
+    model.freeze()
 
-        if model.trainer.strategy.launcher is not None:
-            model.trainer.strategy.launcher.launch(dummy, trainer=model.trainer)
-        model.trainer.strategy.setup_environment()
+    # Have to turn off activations_checkpoint_method for inference
+    try:
+        model.model.language_model.encoder.activations_checkpoint_method = None
+    except AttributeError:
+        pass
 
+    prompt = [cfg.prompt]
+    neighbors = [cfg.neighbors]
+    ds = RequestDataSet(prompt, neighbors)
+    bs = 1
+    request_dl = DataLoader(dataset=ds, batch_size=bs)
     config = OmegaConf.to_container(cfg.inference)
-    retrieval_service = OmegaConf.to_container(cfg.retrieval_service)
-    model.set_inference_config(config, retrieval_service)
-
-    if not cfg.use_predict_method:
-        # First method of running text generation, call model.generate method
-        response = model.generate(
-            inputs=OmegaConf.to_container(cfg.prompts),
-            length_params=length_params,
-            sampling_params=sampling_params,
-            strategy=model.inference_strategy,
-        )
-    else:
-        # Second method of running text generation, call trainer.predict
-        ds = RequestDataSet(OmegaConf.to_container(cfg.prompts))
-        request_dl = DataLoader(dataset=ds, batch_size=cfg.inference_batch_size)
-        response = trainer.predict(model, request_dl)
+    model.set_inference_config(config)
+
+    response = trainer.predict(model, request_dl)
 
     print("***************************")
     print(response)