# Simple Hugging Face inference with Hugging Face Adapted FMS models

*Note: This notebook uses Torch 2.1.0 and Transformers 4.35.0.dev0.*

If you would like to run a similar pipeline using a script, please view the following file: `scripts/hf_compile_example.py`

In [None]:
import transformers
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from fms.models import get_model
from fms.models.hf import to_hf_api

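## Load Hugging Face Adapted FMS model

Load a Hugging Face checkpoint into the equivalent FMS model, then wrap it so that it exposes the Hugging Face API. Start by naming the architecture, the variant, and the path to the checkpoint.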

In [2]:
architecture = "llama"
variant = "13b"
model_path = "/path/to/hf_model"

If you intend to use half-precision tensors, set the default device to `cuda` and the default dtype to `torch.half` before loading the model, so that the weights are created in half precision on the GPU and take up less memory.

In [3]:
torch.set_default_device("cuda")
torch.set_default_dtype(torch.half)
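To make the memory saving concrete, here is a back-of-the-envelope sketch of weight storage at full and half precision. The figure of 13 billion parameters is an assumption taken from the `13b` variant name rather than a value read from the model, and the helper `weight_memory_gb` is hypothetical. The estimate covers weights only, not activations or the key-value cache.

```python
# Rough weight-memory estimate. The parameter count is an assumption based on
# the "13b" variant name; it is not read from the loaded model.
BYTES_PER_DTYPE = {"float32": 4, "float16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_DTYPE[dtype] / 1e9


approx_params = 13e9  # assumption: roughly 13 billion parameters
fp32_gb = weight_memory_gb(approx_params, "float32")
fp16_gb = weight_memory_gb(approx_params, "float16")
print(f"float32: ~{fp32_gb:.0f} GB, float16: ~{fp16_gb:.0f} GB")
```

Under these assumptions, half precision halves the weight footprint, which is why the defaults are set before the model is created rather than converting afterwards.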

Get the model and wrap it in the Hugging Face adapter API.

In [11]:
model = get_model(architecture, variant, model_path=model_path, source="hf", device_type="cuda", norm_eps=1e-6)
model = to_hf_api(model)

## Simple inference with Hugging Face pipelines

In [12]:
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [13]:
%%timeit -r 1 -n 1
pipe = pipeline(task="text-generation", model=model, max_new_tokens=25, tokenizer=tokenizer, device="cuda")
prompt = """I believe the meaning of life is"""
result = pipe(prompt)
print(result)

[{'generated_text': 'I believe the meaning of life is to find your purpose and to fulfill it.\n\nI believe that everyone has a unique purpose in life, and that'}]
1.14 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
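The printed result is a list containing one dictionary per generated sequence, and in this output the `generated_text` value includes the prompt. A minimal sketch of pulling out only the newly generated text, using the output shown above as literal data (no model is called here):

```python
# The result below is copied from the notebook output above; nothing is generated here.
prompt = "I believe the meaning of life is"
result = [{
    "generated_text": (
        "I believe the meaning of life is to find your purpose and to fulfill it."
        "\n\nI believe that everyone has a unique purpose in life, and that"
    )
}]

full_text = result[0]["generated_text"]
# Drop the prompt prefix if present, leaving only the continuation.
if full_text.startswith(prompt):
    completion = full_text[len(prompt):]
else:
    completion = full_text
print(completion.strip())
```

Checking the prefix before slicing keeps the snippet safe if the returned text does not begin with the exact prompt string.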


## Compilation

All FMS models support `torch.compile` for faster inference, so Hugging Face Adapted FMS models support it as well.

*Note: `generate` calls the underlying decoder rather than the model itself, so the decoder is the module that must be compiled.*

In [14]:
model.decoder = torch.compile(model.decoder)

Because compilation is lazy, we first run a single generation through the pipeline to trigger compilation of the graph.

In [15]:
pipe = pipeline(task="text-generation", model=model, max_new_tokens=25, tokenizer=tokenizer, device="cuda")
prompt = """I believe the meaning of life is"""
result = pipe(prompt)

At this point, the graph should be compiled, and we can measure representative performance numbers.

In [16]:
%%timeit -r 1 -n 1
pipe = pipeline(task="text-generation", model=model, max_new_tokens=25, tokenizer=tokenizer, device="cuda")
prompt = """I believe the meaning of life is"""
result = pipe(prompt)
print(result)

[{'generated_text': 'I believe the meaning of life is to find your purpose and to fulfill it.\n\nI believe that everyone has a unique purpose in life, and that'}]
648 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
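The warm-up-then-measure pattern used in this section can be wrapped in a small helper. The sketch below is an illustration only: `time_after_warmup` and `LazyFirstCall` are hypothetical names, and the stand-in class merely simulates a one-time cost on the first call instead of invoking a real compiled model.

```python
import time


def time_after_warmup(fn, warmup=1, repeats=3):
    """Call fn `warmup` times untimed, then return the mean seconds over `repeats` timed calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


class LazyFirstCall:
    """Hypothetical stand-in for a lazily compiled model: slow on the first call only."""

    def __init__(self):
        self.calls = 0
        self.compiled = False

    def __call__(self):
        self.calls += 1
        if not self.compiled:
            time.sleep(0.05)  # simulated one-time compilation cost
            self.compiled = True


fake_model = LazyFirstCall()
mean_s = time_after_warmup(fake_model, warmup=1, repeats=3)
print(f"mean steady-state time: {mean_s:.6f} s over 3 runs")
```

Averaging over several timed calls gives a steadier figure than a single run. For the single runs recorded in this notebook, 1.14 s versus 648 ms is roughly a 1.8x speedup, though both measurements include pipeline construction and come from one run each.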
