# Local deployment of Llama 2 from Hugging Face

In this section, we’ll go through different approaches to running inference with the Llama 2 models. Before using these models, make sure you have requested access to one of the models in the official Meta Llama 2 repositories.

Note: Make sure to also fill out the official Meta form. Access to the repository is granted a few hours after both forms have been submitted.

## Using transformers

As of transformers release 4.31, you can use Llama 2 with the tools in the Hugging Face ecosystem, such as:

- training and inference scripts and examples
- safe file format (safetensors)
- integrations with tools such as bitsandbytes (4-bit quantization) and PEFT (parameter-efficient fine-tuning)
- utilities and helpers to run generation with the model
- mechanisms to export the models for deployment

Make sure you are using the latest transformers release and are logged in to your Hugging Face account.

In [1]:
#!pip install transformers --upgrade
#!pip install "tokenizers>=0.13.3" --upgrade  # quoted so the shell does not treat >= as a redirect
#!pip install ipywidgets
#!pip install xformers --upgrade
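Since Llama 2 support arrived in transformers 4.31, it can help to check the installed version before loading anything. The following is a minimal sketch, not part of the original notebook; the helper names `version_tuple`, `meets_minimum`, and `check_transformers` are hypothetical, and the parser only handles plain dotted versions with optional suffixes such as `.dev0` or `rc1`.

```python
from importlib import metadata

MIN_TRANSFORMERS = "4.31.0"  # release named in the text above


def version_tuple(version: str) -> tuple:
    """Turn '4.31.0' or '4.32.0.dev0' into a 3-tuple of ints such as (4, 31, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad '4.31' to (4, 31, 0)
    return tuple(parts[:3])


def meets_minimum(installed: str, required: str = MIN_TRANSFORMERS) -> bool:
    """Compare versions numerically, so '4.9.0' is older than '4.31.0'."""
    return version_tuple(installed) >= version_tuple(required)


def check_transformers() -> str:
    """Report whether the installed transformers package is recent enough."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if meets_minimum(installed):
        return f"transformers {installed} is recent enough"
    return f"transformers {installed} is older than {MIN_TRANSFORMERS}; upgrade it"


print(check_transformers())
```

Comparing integer tuples rather than raw strings avoids the classic mistake of ranking "4.9" above "4.31".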

In [1]:
# Log in to Hugging Face with a token read from the environment
import os
hf_api_key = os.environ.get('hf_api_token')  # None if the variable is unset
!huggingface-cli login --token {hf_api_key}

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/alfred/.cache/huggingface/token
Login successful
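The login cell reads the token with `os.environ.get`, which silently returns `None` when the variable is unset. A small guard can fail early with a clearer message. This is a sketch under that assumption; the helper `read_hf_token` is hypothetical, and only the variable name `hf_api_token` comes from the notebook.

```python
import os


def read_hf_token(env=None, var_name="hf_api_token"):
    """Return the token stored in `var_name`, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    token = (env.get(var_name) or "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your "
            "Hugging Face access token before logging in."
        )
    return token


# Demonstration with a stand-in mapping instead of the real environment.
print(read_hf_token({"hf_api_token": "  hf_example_token  "}))
```

In the notebook, `hf_api_key = read_hf_token()` would replace the bare `os.environ.get` call.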


In [5]:
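import torch
import transformers
from transformers import AutoTokenizer

# Alternative checkpoints. Note that GGML files target llama.cpp
# and cannot be loaded through transformers.
#model = "meta-llama/Llama-2-13b-chat-hf"
#model = 'TheBloke/Llama-2-13B-chat-GGML'
model = "/model/hf/Llama-2-7b-chat-hf"  # local copy of the 7B chat checkpoint
tokenizer = AutoTokenizer.from_pretrained(model)

# Build a text-generation pipeline in half precision.
# device_map="auto" places the weights on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)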

The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loading checkpoint shards: 100%|██████████| 2/2 [01:41<00:00, 50.95s/it]
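Loading with `torch_dtype=torch.float16` stores each weight in 16 bits, while the bitsandbytes integration mentioned earlier can store weights in 4 bits. The back-of-the-envelope estimate below is only a sketch. It assumes nominal parameter counts of 7 and 13 billion, and it ignores activations, the KV cache, and quantization overhead, so real memory use will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Nominal sizes (assumed for illustration, not exact parameter counts).
for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_memory_gib(n_params, 16)
    int4 = weight_memory_gib(n_params, 4)
    print(f"Llama 2 {name}: ~{fp16:.1f} GiB in float16, ~{int4:.1f} GiB at 4 bits")
```

A quick estimate like this helps decide whether a checkpoint fits on a single GPU or needs quantization or sharding across devices.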


In [None]:
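sequences = pipeline(
    'I liked "Top Gun: Maverick" and "Tomorrow Never Dies". Do you have any recommendations of other movies I might like?\n',
    do_sample=True,            # sample instead of greedy decoding
    top_k=20,                  # consider only the 20 most likely next tokens
    top_p=0.65,                # then keep the smallest set covering 65% of the probability
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=1024,           # prompt plus generated tokens
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")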
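The generation call combines `top_k=20` with `top_p=0.65`. To make the interaction concrete, here is a simplified pure-Python sketch of top-k followed by top-p (nucleus) filtering over a toy distribution. It illustrates the idea only and is not the transformers implementation, which operates on logit tensors; the function name `top_k_top_p_filter` and the toy probabilities are made up for the example.

```python
import random


def top_k_top_p_filter(probs: dict, top_k: int, top_p: float) -> dict:
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p."""
    # Rank tokens from most to least likely and apply the top-k cut.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]  # renormalize after top-k

    # Top-p: accumulate probability until the threshold is reached.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}  # renormalize the survivors


# Toy next-token distribution (invented values).
toy_probs = {"Mission: Impossible": 0.40, "Skyfall": 0.30, "Heat": 0.15,
             "Alien": 0.10, "Cars": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=4, top_p=0.65)
print(filtered)

# Sample one token from what is left.
choice = random.choices(list(filtered), weights=list(filtered.values()))[0]
print("sampled:", choice)
```

With these toy values, top-k drops the least likely token and top-p then keeps only the two most likely ones, so lowering `top_p` makes the output more focused and raising it makes the output more varied.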

