# Simple HuggingFace inference with Huggingface Adapted FMS LLaMA

In [1]:
import transformers
from fms.models import llama
import torch
from fms.models.hf.llama.modeling_llama_hf import LLaMAHFForCausalLM
from transformers import LlamaForCausalLM, pipeline, AutoTokenizer
from fms.models.hf.utils import register_fms_models

## load Huggingface LLaMA model

Simply load the Huggingface LLaMA model from a path containing the Huggingface LLaMA checkpoint

In [2]:
model_path = "/path/to/hf_model"
hf_model = LlamaForCausalLM.from_pretrained(model_path)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## convert Huggingface LLaMA model to FMS LLaMA model

fms provides a simple function which will convert a pre-trained Huggingface LLaMA model to a pre-trained FMS LLaMA model

In [3]:
model = llama.convert_hf_llama(hf_model)

## Convert FMS LLaMA model to its Huggingface adapted FMS LLaMA model

all fms models can be wrapped to act like a Huggingface model with a single line, but with performance benefits of the underlying fms model

In [4]:
# register LLaMAHFForCausalLM with AutoModel
register_fms_models()
# convert FMS LLaMA to HF-Adapted FMS LLaMA
model = LLaMAHFForCausalLM.from_fms_model(model)

## Simple inference with Huggingface pipelines

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [6]:
%%timeit -r 1 -n 1
pipe = pipeline(task="text-generation", model=model, max_new_tokens=25, tokenizer=tokenizer)
prompt = """I believe the meaning of life is"""
result = pipe(prompt)
print(result)

[{'generated_text': 'I believe the meaning of life is to find happiness and fulfillment. Here are some of the things that bring me joy and fulfillment:\n\n'}]
36.5 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


## Compilation

all fms models support torch compile for faster inference, therefore Huggingface Adapted FMS models also support this feature. 

*Note: `generate` calls the underlying decoder and not the model itself, which requires compiling the underlying decoder.*

In [7]:
model.decoder = torch.compile(model.decoder)

Because compile is lazy, we first just do a single generation pipeline to compile the graph

In [None]:
# pipe = pipeline(task="text-generation", model=model, max_new_tokens=25, tokenizer=tokenizer)
# prompt = """I believe the meaning of life is"""
# result = pipe(prompt)

In [9]:
%%timeit -r 1 -n 1
pipe = pipeline(task="text-generation", model=model, max_new_tokens=25, tokenizer=tokenizer)
prompt = """I believe the meaning of life is"""
result = pipe(prompt)
print(result)

[{'generated_text': 'I believe the meaning of life is to find happiness and fulfillment. Here are some of the things that bring me joy and fulfillment:\n\n'}]
26.3 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
