# Using TensorRT-LLM to Run Phi-3-mini with LoRA Fine-tuned Models

In this notebook, you'll learn how to use NVIDIA's [TensorRT-LLM](https://developer.nvidia.com/tensorrt#section-inference-for-llms) to run a Phi-3-mini base model along with a task-specific Phi-3 model fine-tuned using the Low-Rank Adaptation (LoRA) technique. LoRA is a parameter-efficient fine-tuning (PEFT) technique that introduces low-rank matrices into each layer of the LLM architecture and trains only these matrices while keeping the original LLM weights frozen. It is one of several LLM customization methods supported in NVIDIA [NeMo](https://www.nvidia.com/en-us/ai-data-science/generative-ai/nemo-framework/), described in this [blog](https://developer.nvidia.com/blog/selecting-large-language-model-customization-techniques).
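To make the low-rank idea concrete, here is a minimal pure-Python sketch of a LoRA update applied to a single linear layer. The helper names, the tiny shapes, and the `alpha / r` scaling convention are illustrative assumptions and are not taken from NeMo or TensorRT-LLM.

```python
# Minimal LoRA sketch for one linear layer (illustrative; all names are hypothetical).
# The frozen weight W has shape (d_out, d_in); the trainable matrices are
# A with shape (r, d_in) and B with shape (d_out, r), where r is the LoRA rank.

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x) without modifying W."""
    r = len(A)                       # rank = number of rows in A
    base = matvec(W, x)              # output of the frozen base layer
    delta = matvec(B, matvec(A, x))  # low-rank correction
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

def lora_param_counts(d_out, d_in, r):
    """Return (parameters in the full weight, parameters in the LoRA pair)."""
    return d_out * d_in, r * (d_in + d_out)

W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # d_out=2, d_in=3, frozen
A = [[1.0, 1.0, 1.0]]                    # r=1
B = [[0.5], [-0.5]]
x = [1.0, 2.0, 3.0]

y = lora_forward(W, A, B, x, alpha=2.0)
full, lora = lora_param_counts(3072, 3072, 16)
print(y, full, lora)
```

For a hypothetical 3072 x 3072 layer with rank 16, the LoRA pair holds roughly 1% of the parameters of the full weight matrix, which is why only a small adapter needs to be trained and shipped alongside the base model.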

TensorRT-LLM is an open-source library that accelerates LLM inference performance on NVIDIA GPUs. NeMo is an end-to-end framework for building, customizing, and deploying generative AI applications. You can try Phi-3 models right away through a browser user interface or through API endpoints running on a fully accelerated NVIDIA stack from the [NVIDIA API catalog](http://build.nvidia.com/), where each Phi-3 model is packaged as an [NVIDIA NIM](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/) with a standard API that can be deployed anywhere.

To accelerate inference and deliver state-of-the-art performance on NVIDIA GPUs, TensorRT-LLM compiles models into TensorRT engines, turning model layers into optimized CUDA kernels using [pattern matching and fusion](https://nvidia.github.io/TensorRT-LLM/architecture/core-concepts.html#pattern-matching-and-fusion). These engines are executed by the TensorRT-LLM runtime, which includes several advanced optimizations such as [in-flight batching](https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#in-flight-batching), [KV caching](https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.html#paged-kv-cache), and [quantization](https://nvidia.github.io/TensorRT-LLM/reference/precision.html) to support lower-precision workloads.

## Initial Setup

In [None]:
!sudo apt-get update && sudo apt-get -y install openmpi-bin libopenmpi-dev git git-lfs

## Install TensorRT-LLM

In [1]:
!pip install -q ipywidgets
!pip install tensorrt_llm -U -q --extra-index-url https://pypi.nvidia.com

In [None]:
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/run.py -P .
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/utils.py -P .
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/quantization/quantize.py -P .

TensorRT-LLM supports running Phi-3-mini/small models with FP16/BF16/FP32 LoRA. In this notebook, we'll use Phi-3-mini as an example to show how to run an FP8 base model with an FP16 LoRA module.

- Download the base model and the LoRA model from Hugging Face.

In [None]:
!git lfs install
!git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
!git clone https://huggingface.co/sikoraaxd/Phi-3-mini-4k-instruct-ru-lora
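Before building an engine, it can help to confirm which rank and target modules the downloaded adapter uses. The sketch below assumes the adapter directory follows the Hugging Face PEFT layout and contains an `adapter_config.json` file with `r`, `lora_alpha`, and `target_modules` keys; the helper name `summarize_lora_adapter` is hypothetical.

```python
import json
from pathlib import Path

def summarize_lora_adapter(adapter_dir):
    """Read a PEFT-style adapter_config.json and return its key LoRA settings.

    Assumes the PEFT key names "r", "lora_alpha", and "target_modules";
    missing keys are reported as None rather than raising.
    """
    config_path = Path(adapter_dir) / "adapter_config.json"
    with config_path.open() as f:
        cfg = json.load(f)
    return {
        "rank": cfg.get("r"),
        "alpha": cfg.get("lora_alpha"),
        "target_modules": cfg.get("target_modules"),
    }

# Only inspect the adapter if the clone above has completed.
adapter_dir = Path("./Phi-3-mini-4k-instruct-ru-lora")
if (adapter_dir / "adapter_config.json").exists():
    print(summarize_lora_adapter(adapter_dir))
```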

## Quantize Model

Now let's quantize the Phi-3-mini base model from Hugging Face to FP8, producing a smaller model with a lower memory footprint and typically only a small loss in accuracy.

In [None]:
# Quantize the base model

#BASE_PHI_3_MINI_MODEL=./Phi-3-mini-4k-instruct
!python3 quantize.py --model_dir ./Phi-3-mini-4k-instruct \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --output_dir phi3_mini_4k_instruct/trt_ckpt/fp8/1-gpu \
                                   --calib_size 512
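As a rough back-of-the-envelope check on the memory savings, the sketch below estimates the weight footprint at 2 bytes per parameter (FP16) and 1 byte per parameter (FP8). It assumes the approximate published size of Phi-3-mini, 3.8 billion parameters, and assumes that every weight is stored at the quantized precision, which is an upper bound on the savings because some tensors may stay in higher precision. It ignores the KV cache and activations.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Estimate the memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30

# Assumption: approximate parameter count for Phi-3-mini.
PHI3_MINI_PARAMS = 3.8e9

fp16_gib = weight_memory_gib(PHI3_MINI_PARAMS, 2)  # 2 bytes per FP16 weight
fp8_gib = weight_memory_gib(PHI3_MINI_PARAMS, 1)   # 1 byte per FP8 weight

print(f"FP16 weights: ~{fp16_gib:.1f} GiB")
print(f"FP8 weights:  ~{fp8_gib:.1f} GiB")
```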

## Build Optimized TensorRT Engine

Next, we build the TensorRT engine for the base model, passing the LoRA model directory as a build parameter.

In [8]:
# Build TensorRT engine

!trtllm-build --checkpoint_dir phi3_mini_4k_instruct/trt_ckpt/fp8/1-gpu \
             --output_dir phi3_mini_4k_instruct/trt_engines/fp8_lora/1-gpu \
             --gemm_plugin auto \
             --max_batch_size 8 \
             --max_input_len 1024 \
             --max_seq_len 2048 \
             --lora_plugin auto \
             --lora_dir ./Phi-3-mini-4k-instruct-ru-lora

[TensorRT-LLM] TensorRT-LLM version: 0.11.0
[07/21/2024-06:50:42] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[07/21/2024-06:50:42] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[07/21/2024-06:50:42] [TRT-LLM] [I] Set gemm_plugin to auto.
[07/21/2024-06:50:42] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[07/21/2024-06:50:42] [TRT-LLM] [I] Set nccl_plugin to auto.
[07/21/2024-06:50:42] [TRT-LLM] [I] Set lookup_plugin to None.
[07/21/2024-06:50:42] [TRT-LLM] [I] Set lora_plugin to auto.
[07/21/2024-06:50:42] [TRT-LLM] [I] Set moe_plugin to auto.
[07/21/2024-06:50:42] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[07/21/2024-06:50:42] [TRT-LLM] [I] Set context_fmha to True.
[07/21/2024-06:50:42] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[07/21/2024-06:50:42] [TRT-LLM] [I] Set paged_kv_cache to True.
[07/21/2024-06:50:42] [TRT-LLM] [I] Set remove_input_padding to True.
[07/21/2024-06:50:42] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[07/21/2024-06:50:42] [TRT-LLM] [
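The limits passed to `trtllm-build` bound the requests the engine can serve. The following sketch checks whether a request fits, assuming `max_seq_len` covers input tokens plus generated tokens; the `EngineLimits` class and `fits` function are hypothetical helpers for illustration and are not part of TensorRT-LLM.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EngineLimits:
    """Build-time limits, defaulting to the values used in the build command above."""
    max_batch_size: int = 8
    max_input_len: int = 1024
    max_seq_len: int = 2048

def fits(limits, batch_size, input_len, output_len):
    """Return True if a request stays within the engine's build-time limits.

    Assumes max_seq_len bounds input_len + output_len.
    """
    return (
        batch_size <= limits.max_batch_size
        and input_len <= limits.max_input_len
        and input_len + output_len <= limits.max_seq_len
    )

limits = EngineLimits()
print(fits(limits, batch_size=1, input_len=40, output_len=500))     # short prompt, 500 new tokens
print(fits(limits, batch_size=1, input_len=1024, output_len=1025))  # exceeds max_seq_len
```

With these limits, a prompt at the maximum input length of 1024 tokens leaves room for at most 1024 generated tokens.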

## Run Inference for Q&A Task

In [9]:
# Run inference

!python3 run.py --engine_dir phi3_mini_4k_instruct/trt_engines/fp8_lora/1-gpu \
                 --max_output_len 500 \
                 --tokenizer_dir ./Phi-3-mini-4k-instruct-ru-lora \
                 --input_text "<|user|>\nCan you provide ways to eat combinations of bananas and dragonfruits?<|end|>\n<|assistant|>" \
                 --lora_task_uids 0 \
                 --use_py_session

[TensorRT-LLM] TensorRT-LLM version: 0.11.0
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.producer = {'name': 'modelopt', 'version': '0.13.1'}
[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.residual_mlp = False
[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.bias = False
[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rotary_pct = 1.0
[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rank = 0
[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.decoder = phi3
[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rmsnorm = True
[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.lm_head_bias = False
[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rotary_base = 10000.0
[07/
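The `--input_text` argument above wraps the question in Phi-3 chat markers. As a small sketch, the same single-turn prompt can be assembled in Python; the function name `build_phi3_prompt` is hypothetical, and the layout simply mirrors the string passed to `run.py`.

```python
def build_phi3_prompt(user_message):
    """Wrap a single user message in the chat markers used in the command above."""
    return f"<|user|>\n{user_message}<|end|>\n<|assistant|>"

prompt = build_phi3_prompt(
    "Can you provide ways to eat combinations of bananas and dragonfruits?"
)
print(prompt)
```

Note that this helper emits real newline characters, whereas inside a double-quoted shell string `\n` is passed to the program as two literal characters.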