# Simple Hugging Face inference with Hugging Face Adapted FMS LLaMA

*Note: This notebook uses PyTorch 2.1.0 and Transformers 4.35.0.dev0.*

To run a similar pipeline as a script, see `scripts/hf_compile_example.py`.

In [1]:
import transformers
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from fms.models.hf.llama import get_model

## Load the Hugging Face Adapted FMS LLaMA model

Load the Hugging Face LLaMA model and convert it to a Hugging Face Adapted FMS LLaMA model.

In [2]:
model_path = "/path/to/hf_llama_model"

In [3]:
model = get_model(model_path).to("cuda")


## Simple inference with Hugging Face pipelines

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [5]:
%%timeit -r 1 -n 1
pipe = pipeline(task="text-generation", model=model, max_new_tokens=25, tokenizer=tokenizer, device="cuda")
prompt = """I believe the meaning of life is"""
result = pipe(prompt)
print(result)

[{'generated_text': 'I believe the meaning of life is to find happiness and fulfillment. Here are some of the things that bring me joy and fulfillment:\n\n'}]
1.36 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


## Compilation

All FMS models support `torch.compile` for faster inference, so Hugging Face Adapted FMS models support it as well.

*Note: `generate` calls the underlying decoder rather than the model itself, so it is the decoder that must be compiled.*

In [6]:
model.decoder = torch.compile(model.decoder)
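
The note above can be illustrated with a toy sketch in plain Python. Nothing here uses `torch` or `fms`: `ToyDecoder`, `ToyWrapper`, and `CountingWrapper` are hypothetical stand-ins, and `CountingWrapper` only mimics the idea of wrapping a callable, not what `torch.compile` actually does. The point is that when `generate` calls `self.decoder` directly, wrapping the outer model has no effect on generation, while wrapping the decoder does.

```python
class ToyDecoder:
    """Hypothetical stand-in for the underlying decoder."""

    def __call__(self, tokens):
        # "Predict" one more token: here simply the current sequence length.
        return tokens + [len(tokens)]


class ToyWrapper:
    """Hypothetical stand-in for an adapted model whose generate() uses self.decoder."""

    def __init__(self):
        self.decoder = ToyDecoder()

    def __call__(self, tokens):
        return self.decoder(tokens)

    def generate(self, tokens, max_new_tokens):
        for _ in range(max_new_tokens):
            tokens = self.decoder(tokens)  # goes straight to the decoder
        return tokens


class CountingWrapper:
    """Assumed stand-in for a compile-style wrapper: counts calls that pass through it."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        return self.fn(*args, **kwargs)

    def __getattr__(self, name):
        # Any attribute the wrapper lacks is looked up on the wrapped object.
        return getattr(self.fn, name)


model = ToyWrapper()

# Wrapping the outer model: generate() never passes through the wrapper's __call__.
outer = CountingWrapper(model)
outer.generate([0], max_new_tokens=3)
outer_calls = outer.calls

# Wrapping the decoder: every generation step passes through the wrapper.
model.decoder = CountingWrapper(model.decoder)
out = model.generate([0], max_new_tokens=3)
decoder_calls = model.decoder.calls

print(outer_calls, decoder_calls, out)  # 0 3 [0, 1, 2, 3]
```

In this sketch only the second arrangement intercepts the per-token calls, which is the same reason the cell above assigns the compiled module back to `model.decoder`.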

Because `torch.compile` is lazy, we first run a single generation through the pipeline to trigger compilation of the graph.

In [7]:
pipe = pipeline(task="text-generation", model=model, max_new_tokens=25, tokenizer=tokenizer, device="cuda")
prompt = """I believe the meaning of life is"""
result = pipe(prompt)

At this point the graph should be compiled, so we can take representative performance measurements.
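
Why the warm-up run matters can be sketched in plain Python. `LazyCompiled` below is a hypothetical stand-in that pays a simulated one-time cost on its first call (a 50 ms sleep, an arbitrary value chosen for illustration). It is not how `torch.compile` is implemented, but it shows the same timing pattern: the first call is slow and later calls are not.

```python
import time


class LazyCompiled:
    """Hypothetical stand-in for a callable that is compiled lazily on first use."""

    def __init__(self, fn, compile_seconds=0.05):
        self.fn = fn
        self.compile_seconds = compile_seconds  # simulated one-time cost
        self.compiled = False

    def __call__(self, *args):
        if not self.compiled:
            time.sleep(self.compile_seconds)  # pretend to compile
            self.compiled = True
        return self.fn(*args)


def timed(fn, *args):
    """Return the wall-clock seconds taken by a single call to fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


step = LazyCompiled(lambda x: x * 2)
first_call = timed(step, 21)   # includes the simulated compile cost
second_call = timed(step, 21)  # already "compiled"

print(f"first: {first_call:.4f} s, second: {second_call:.6f} s")
```

Timing only the first call would mix the one-time cost into the measurement, which is why the warm-up cell is run before the timed cell.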

In [8]:
%%timeit -r 1 -n 1
pipe = pipeline(task="text-generation", model=model, max_new_tokens=25, tokenizer=tokenizer, device="cuda")
prompt = """I believe the meaning of life is"""
result = pipe(prompt)
print(result)

[{'generated_text': 'I believe the meaning of life is to find happiness and fulfillment. Here are some of the things that bring me joy and fulfillment:\n\n'}]
587 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
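
The speedup implied by the two `%%timeit` results can be worked out directly. This is simple arithmetic on the numbers printed above. Both are single-run measurements (`-r 1 -n 1`) that also include constructing the pipeline, so the result is a rough indication rather than a benchmark.

```python
eager_seconds = 1.36      # uncompiled run, from the first %%timeit output
compiled_seconds = 0.587  # compiled run, from the second %%timeit output

speedup = eager_seconds / compiled_seconds
time_saved = eager_seconds - compiled_seconds

print(f"speedup: {speedup:.2f}x, time saved: {time_saved * 1000:.0f} ms")
# speedup: 2.32x, time saved: 773 ms
```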
