# Simple Hugging Face inference with Hugging Face-adapted FMS models

*Note: This notebook is using Torch 2.1.0 and Transformers 4.35.0.dev0*

To run a similar pipeline as a script, see `scripts/hf_compile_example.py`.

In [1]:
import transformers
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from fms.models import get_model
from fms.models.hf import to_hf_api

ModuleNotFoundError: No module named 'triton'

## Load the Hugging Face-adapted FMS model

Get the Hugging Face model and convert it to an equivalent Hugging Face-adapted FMS model.

In [None]:
architecture = "llama"
variant = "llama2_1.4b"
model_path = "/Users/dwertheimer/Downloads/oc_work/downloads/llama2-base.pth"

If you intend to use half-precision tensors, set the default device to `cuda` and the default dtype to `torch.half` before loading the model. Doing this first means the weights are created in half precision and take up less memory.

In [None]:
# torch.set_default_device("cuda")
torch.set_default_dtype(torch.half)
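
As a rough illustration of the memory saving, the sketch below estimates the space needed for the weights alone in single and half precision. The parameter count of about 1.4 billion is an assumption read off the variant name `llama2_1.4b`, and `weight_memory_gib` is a hypothetical helper, not part of FMS or Torch. Activations, KV cache, and framework overhead are not counted.

```python
# Back-of-the-envelope weight memory by dtype (weights only, no activations).
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(n_params: int, dtype: str) -> float:
    """Return the memory needed to hold n_params weights, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


# Assumption: ~1.4e9 parameters, inferred from the variant name "llama2_1.4b".
n_params = 1_400_000_000

for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_memory_gib(n_params, dtype):.2f} GiB")
```

Half precision stores each weight in 2 bytes instead of 4, so the weight footprint is halved.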

Get the model and wrap it in the Hugging Face adapter API.

In [None]:
from torch.distributed.checkpoint import (
    FileSystemReader,
    FileSystemWriter,
    load_state_dict,
    save_state_dict,
)
from torch.distributed.checkpoint.default_planner import (
    DefaultLoadPlanner,
    DefaultSavePlanner,
)
from torch.distributed.fsdp import FullStateDictConfig
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType

from fms.models.llama import LLaMA, LLaMAConfig

# Start from the default config and override the architecture hyperparameters.
c = LLaMAConfig()
c.kvheads = 8
c.nlayers = 16
c.hidden_grow_factor = 3
c.emb_dim = 2048
c.nheads = 16

model = LLaMA(c)

print("Model built!")

# model = get_model(architecture, variant, model_path=model_path, source="fms", device_type="cpu", norm_eps=1e-6)
# model = to_hf_api(model)

In [None]:
model

In [None]:
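# Load the checkpoint and pull out the model weights.
d = torch.load(model_path)['model_state']
# Drop the first 10 characters of every key (presumably a wrapper prefix)
# so that the parameter names match those of `model`.
d = {k[10:]: v for k, v in d.items()}
model.load_state_dict(d)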
print("Model loaded!")
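
The `k[10:]` slice above removes a fixed number of characters from every key, which silently corrupts any key that lacks the prefix. A minimal sketch of a more defensive variant is shown below, using plain dictionaries so it runs without a checkpoint. The helper `strip_prefix` and the example keys are hypothetical; the prefix `_orig_mod.` is used only because it is 10 characters long, and the actual prefix in this checkpoint is not shown in the notebook.

```python
def strip_prefix(state: dict, prefix: str) -> dict:
    """Return a copy of `state` with `prefix` removed from keys that start with it.

    Keys without the prefix are kept unchanged instead of being truncated.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state.items()
    }


# Hypothetical keys standing in for a real state dict.
raw = {
    "_orig_mod.layer.weight": 1,
    "_orig_mod.layer.bias": 2,
    "head.weight": 3,  # no prefix: left as-is
}

clean = strip_prefix(raw, "_orig_mod.")
print(sorted(clean))
```

With a real checkpoint, the same helper could be applied to `torch.load(model_path)['model_state']` before calling `model.load_state_dict`.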

In [5]:
from torch.distributed import init_process_group
import os

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
init_process_group("gloo", rank=0, world_size=1)

In [7]:
# with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
state_dict = model.state_dict()
model_ckp = {"model_state": state_dict}
load_state_dict(
    state_dict=model_ckp,
    storage_reader=FileSystemReader(model_path),
    planner=DefaultLoadPlanner(),
)


CheckpointException: CheckpointException ranks:dict_keys([0])
Traceback (most recent call last): (RANK 0)
  File "/Users/dwertheimer/miniconda3/envs/python3/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 173, in reduce_scatter
    local_data = map_fun()
  File "/Users/dwertheimer/miniconda3/envs/python3/lib/python3.8/site-packages/torch/distributed/checkpoint/state_dict_loader.py", line 150, in local_step
    metadata = storage_reader.read_metadata()
  File "/Users/dwertheimer/miniconda3/envs/python3/lib/python3.8/site-packages/torch/distributed/checkpoint/filesystem.py", line 498, in read_metadata
    return pickle.load(metadata_file)
AttributeError: Can't get attribute '_MEM_FORMAT_ENCODING' on <module 'torch.distributed.checkpoint.metadata' from '/Users/dwertheimer/miniconda3/envs/python3/lib/python3.8/site-packages/torch/distributed/checkpoint/metadata.py'>


## Simple inference with Hugging Face pipelines

In [12]:
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [13]:
%%timeit -r 1 -n 1
pipe = pipeline(task="text-generation", model=model, max_new_tokens=25, tokenizer=tokenizer, device="cuda")
prompt = """I believe the meaning of life is"""
result = pipe(prompt)
print(result)

[{'generated_text': 'I believe the meaning of life is to find your purpose and to fulfill it.\n\nI believe that everyone has a unique purpose in life, and that'}]
1.14 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


## Compilation

All FMS models support `torch.compile` for faster inference, so Hugging Face-adapted FMS models support it as well.

*Note: `generate` calls the underlying decoder rather than the model itself, so the decoder is what must be compiled.*

In [14]:
model.decoder = torch.compile(model.decoder)

Because compilation is lazy, we first run a single generation through the pipeline to trigger graph compilation.

In [15]:
pipe = pipeline(task="text-generation", model=model, max_new_tokens=25, tokenizer=tokenizer, device="cuda")
prompt = """I believe the meaning of life is"""
result = pipe(prompt)

At this point the graph should be compiled, so we can collect representative performance numbers.

In [16]:
%%timeit -r 1 -n 1
pipe = pipeline(task="text-generation", model=model, max_new_tokens=25, tokenizer=tokenizer, device="cuda")
prompt = """I believe the meaning of life is"""
result = pipe(prompt)
print(result)

[{'generated_text': 'I believe the meaning of life is to find your purpose and to fulfill it.\n\nI believe that everyone has a unique purpose in life, and that'}]
648 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
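
The warm-up-then-measure pattern used above can be sketched as a small reusable helper. This is an illustration under stated assumptions, not the notebook's method: `time_after_warmup` and `LazyStandIn` are hypothetical names, and the stand-in only simulates a one-time compilation cost with `time.sleep` so the example runs without Torch or a GPU.

```python
import time


def time_after_warmup(fn, warmup=1, runs=3):
    """Call `fn` `warmup` times untimed, then return the best of `runs` timed calls (seconds)."""
    for _ in range(warmup):
        fn()  # absorbs one-time costs such as lazy compilation
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


class LazyStandIn:
    """Hypothetical stand-in for a lazily compiled callable: only the first call is slow."""

    def __init__(self):
        self.compiled = False
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if not self.compiled:
            time.sleep(0.05)  # simulated one-time compilation cost
            self.compiled = True


fn = LazyStandIn()
best = time_after_warmup(fn)
print(f"best of 3 after warm-up: {best * 1e3:.3f} ms")
```

In this notebook, `fn` could be something like `lambda: pipe(prompt)`. Taking several timed runs rather than one also makes the comparison between the compiled and uncompiled timings less sensitive to noise.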
