# Run Llama 3.1 8B Instruct with < 5GB VRAM!

Powered by Transformers & AutoAWQ

[Model Checkpoint](https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4)

Note: While this example uses only the 8B Instruct checkpoint, the same code works for any Llama 3.1 checkpoint, such as 70B or 405B, as well as fine-tunes of them!


## Setup Environment

Since Llama 3.1 comes with minor modeling changes (primarily RoPE scaling), we need a recent version of transformers (4.43.0 or later), so we upgrade to the latest release.

In [1]:
!pip install -q --upgrade transformers autoawq accelerate
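
After the upgrade, it can be worth confirming that the installed version is new enough. The snippet below is a minimal sketch using only the standard library; the `4.43.0` floor is an assumption based on the release in which Llama 3.1 support is documented, and `parse_release` and `meets_minimum` are hypothetical helper names, not part of transformers.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: Llama 3.1 (with its RoPE scaling changes) needs transformers >= 4.43.0.
MIN_TRANSFORMERS = "4.43.0"


def parse_release(v: str) -> tuple:
    """Turn '4.44.0.dev0' into (4, 44, 0), stopping at the first non-numeric piece."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings numerically, padding the shorter one with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


try:
    installed = version("transformers")
    ok = meets_minimum(installed, MIN_TRANSFORMERS)
    print(f"transformers {installed}: {'OK' if ok else 'too old, re-run the upgrade'}")
except PackageNotFoundError:
    print("transformers is not installed")
```

In a notebook, restart the kernel after upgrading so the new version is the one actually imported.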

## Load Tokenizer and Model checkpoint

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # Note: update for your use case; it should cover prompt length plus generated tokens
    do_fuse=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.float16,
  low_cpu_mem_usage=True,
  quantization_config=quantization_config
).to("cuda")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
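
To see roughly where the "< 5GB" figure comes from, here is a back-of-the-envelope estimate of the weight footprint at different precisions. It is a sketch under simplifying assumptions: a nominal 8 billion parameters, every weight stored at the stated bit width, and no allowance for quantization scales and zero points, layers kept in higher precision, the KV cache, activations, or CUDA overhead. Real usage will be somewhat higher than the number it prints. `weight_footprint_gib` is a hypothetical helper name.

```python
def weight_footprint_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


NUM_PARAMS = 8e9  # nominal "8B"; the exact parameter count is slightly higher

fp16 = weight_footprint_gib(NUM_PARAMS, 16)
int4 = weight_footprint_gib(NUM_PARAMS, 4)

print(f"fp16 weights: ~{fp16:.1f} GiB")  # ~14.9 GiB
print(f"int4 weights: ~{int4:.1f} GiB")  # ~3.7 GiB
```

The 4x reduction from 16-bit to 4-bit weights is what brings the model within reach of small GPUs; the remaining headroom goes to the overheads listed above.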

## Define Prompt & Tokenize

In [4]:
prompt = [
  {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
  {"role": "user", "content": "What's Deep Learning?"},
]

inputs = tokenizer.apply_chat_template(
  prompt,
  tokenize=True,
  add_generation_prompt=True,
  return_tensors="pt",
  return_dict=True,
).to("cuda")
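
To make it clearer what `apply_chat_template` does with the message list, here is a hand-written approximation of the Llama 3 prompt layout. This is a sketch, not the real template: the authoritative version ships with the tokenizer and may add extra content, and `render_llama3_prompt` is a hypothetical function written for illustration only.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 chat layout as a plain string (illustrative only)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn marker.
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header cues the model to write the next turn.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

print(render_llama3_prompt(prompt))
```

With `add_generation_prompt=True` the string ends in an open assistant header, which is why the model answers as the assistant rather than continuing the user's message.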

## Generate

In [5]:
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=25)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


["system\n\nYou are a helpful assistant, that responds as a pirate.user\n\nWhat's Deep Learning?assistant\n\nArrr, ye landlubber! Ye be askin' about Deep Learnin', eh? Well, matey,"]
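
The decoded text above includes the prompt because `generate` returns the prompt tokens followed by the new ones. If only the reply is wanted, the prompt tokens can be sliced off before decoding. The sketch below shows the idea on plain Python lists with made-up token ids so it runs without a GPU; `strip_prompt` is a hypothetical helper, and the tensor form in the comment assumes the `inputs` and `outputs` variables from the cells above.

```python
def strip_prompt(sequences, prompt_len):
    """Drop the first `prompt_len` token ids from each generated sequence."""
    return [seq[prompt_len:] for seq in sequences]


# Toy ids standing in for real model output: 4 prompt tokens, then 3 new tokens.
prompt_ids = [101, 102, 103, 104]
generated = [prompt_ids + [7, 8, 9]]

new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)  # [[7, 8, 9]]

# Equivalent slicing on the notebook's tensors (assumes `inputs` and `outputs` exist):
#   prompt_len = inputs["input_ids"].shape[1]
#   print(tokenizer.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True))
```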


## Voila! You now have a smart and capable assistant! 🦙