# Running Llama2 on your Mac with Hugging Face

Sticking as closely as possible to the original [Hackers' Guide by Jeremy Howard](https://github.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb), I tried to run some LLMs locally on my machine. I am working on a MacBook Pro with an M2 Max, so I do not have any Nvidia support, and I needed to adapt the code to make it compatible with [Apple's Metal framework](https://developer.apple.com/documentation/metalperformanceshaders).

As expected, there were some hurdles along the way, and performance was not exactly great. This is therefore more of an experiment; there are better ways to run LLMs on a Mac, which I will explore in another notebook.

## Downloading Llama 2

Before you can access the Llama2 model, you need to agree to Meta's terms and conditions for Llama2.

At the time of writing, the process was as follows:

- Visit [the model's home page at Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- Go to Meta's website (https://ai.meta.com/resources/models-and-libraries/llama-downloads/), and complete the registration form
- Confirm the terms and conditions on the Hugging Face website (see [screenshot](access-llama2-on-hf.png))

The approval only took a couple of minutes.

When you want to access the model from a notebook, you need to authenticate with Hugging Face so that they can check that you agreed to the terms and conditions.

You can easily create an access token on the [Hugging Face website](https://huggingface.co/settings/tokens) as explained [here](https://huggingface.co/docs/hub/security-tokens).

I recommend storing this token in an environment variable so that you don't have to include it in your Jupyter notebook. The environment variable is `HF_TOKEN`, so just add it to your `.env` file.

> 💡 Note: For more details on how to manage your access token, please check out my [notebook on accessing the OpenAI API](https://chrwittm.github.io/posts/2024-01-27-how-to-call-openai-api/).

In [1]:
from dotenv import load_dotenv
import os

load_dotenv(".env")

False
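
Since `load_dotenv` returns `False` when it did not set any variable, as in the output above, an explicit check that the token is really available can be reassuring. The following is a minimal sketch using only the standard library; the helper names `require_token` and `mask` are made up for this illustration, only `HF_TOKEN` comes from the text above.

```python
import os


def require_token(var_name="HF_TOKEN", env=None):
    """Return the token stored in `var_name`, or raise a readable error."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise ValueError(f"{var_name} is not set. Please check your .env file.")
    return token


def mask(token, visible=4):
    """Keep only the first few characters so the secret never shows up in the notebook."""
    return token[:visible] + "*" * max(len(token) - visible, 0)


try:
    print("Token found:", mask(require_token()))
except ValueError as err:
    print(err)
```

Printing only the masked value keeps the token out of the saved notebook output, which is the whole point of storing it in an environment variable.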

In [2]:
#!pip install accelerate
#!pip install bitsandbytes

Once authorization is sorted out, you can load the model from Hugging Face. The first time you run the code, it will take some time because it needs to download 15 GB of model data.

As you can see in the code, I needed to adjust the parameters for compatibility with the Apple Metal framework:

- `device_map="auto"`: The original value `0` specifies that a CUDA-compatible device (from Nvidia) should be used. With the value `auto`, the memory usage [is optimized](https://huggingface.co/docs/accelerate/v0.26.1/en/package_reference/big_modeling#accelerate.load_checkpoint_and_dispatch.device_map) to [first fill all the space in your GPU(s)](https://huggingface.co/docs/accelerate/usage_guides/big_modeling), then the CPU, and finally, if there is not enough RAM, the disk (the absolute slowest option). This also works on Apple Silicon.

- `load_in_8bit=False`: Apple Silicon currently does not support the same 8-bit quantization optimizations that Nvidia's GPUs do. Therefore, you need to load the model without quantizing it to 8-bit format.

> 💡 Note: "Quantization" refers to the process of reducing the precision of the numbers used to represent model weights and activations. This is typically done to reduce memory usage and improve the performance and efficiency of a model during inference.
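
To make the idea of quantization more tangible, here is a minimal pure-Python sketch of one simple scheme (absmax scaling to 8-bit integers). It is an illustration under simplifying assumptions and not the method that `load_in_8bit=True` uses internally; the function names and the sample weights are invented for this example.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # made-up sample values
quantized, scale = quantize_absmax(weights)
restored = dequantize(quantized, scale)

print(quantized)                        # small integers, 1 byte each
print([round(r, 3) for r in restored])  # close to the original floats
```

Each weight now fits into one byte instead of two or four, at the price of a rounding error of at most half a scale step per value.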

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

In [4]:
model_name = "meta-llama/Llama-2-7b-hf"
token = os.getenv('HF_TOKEN')
if token is None:
    raise ValueError("Hugging Face token not found. Please check your .env file.")
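# CUDA variant (device 0, 8-bit quantization), kept for reference:
#model = AutoModelForCausalLM.from_pretrained(mn, use_auth_token=token, device_map=0, load_in_8bit=True)
# Variant for Apple Silicon: automatic device placement, no 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(model_name, token=token, device_map="auto", load_in_8bit=False)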


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



There are a few things to note regarding the download process:

- The model is downloaded once and cached on your disk. The cache directory is `/Users/<YourUser>/.cache/huggingface/hub`.
- When loading the model, you can observe that `shards` are loaded. A shard is, in simple terms, a chunk of the model. The model is split up into several chunks for efficiency.
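
If you are curious how much disk space the cache takes up, a small standard-library sketch like the following can add it up. The helper name `directory_size_gb` is made up for this example, and the path is the default location mentioned above; yours may differ if you configured another cache directory.

```python
from pathlib import Path


def directory_size_gb(path):
    """Sum the size of all regular files below `path`, in GB (10**9 bytes)."""
    root = Path(path)
    if not root.exists():
        return 0.0
    # Symlinks are skipped so that linked files are not counted twice.
    total = sum(
        f.stat().st_size
        for f in root.rglob("*")
        if f.is_file() and not f.is_symlink()
    )
    return total / 1e9


cache_dir = Path.home() / ".cache" / "huggingface" / "hub"
print(f"{cache_dir}: {directory_size_gb(cache_dir):.1f} GB")
```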

## Tokenization

The next step is to get the **tokenizer**. A tokenizer is a class, specific to the model, which converts words into numbers and vice versa. A token is a numerical representation of a word or a segment of a word.

When tokenizing a **prompt**, we request a PyTorch tensor ("`pt`"):

In [5]:
tokr = AutoTokenizer.from_pretrained(model_name, token=token)
prompt = "Jeremy Howard is a "
toks = tokr(prompt, return_tensors="pt")

In [6]:
toks

{'input_ids': tensor([[    1,  5677,  6764, 17430,   338,   263, 29871]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

Converting the numbers back to words returns the same result, preceded by the special start token `<s>`.

In [7]:
tokr.batch_decode(toks['input_ids'])

['<s> Jeremy Howard is a ']
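
To illustrate the round trip between text and numbers without downloading anything, here is a toy sketch. The vocabulary and its ids are invented for this example and are not Llama 2's, and the greedy longest-match lookup is a simplification; the real tokenizer works with a much larger learned subword vocabulary.

```python
# Toy vocabulary: pieces and ids are made up for illustration only.
TOY_VOCAB = {"<s>": 1, "Jer": 2, "emy": 3, " Howard": 4, " is": 5, " a": 6, " ": 7}
ID_TO_PIECE = {idx: piece for piece, idx in TOY_VOCAB.items()}


def encode(text):
    """Prepend the start token, then greedily match the longest known piece."""
    pieces = sorted((p for p in TOY_VOCAB if p != "<s>"), key=len, reverse=True)
    ids = [TOY_VOCAB["<s>"]]
    pos = 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                ids.append(TOY_VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"no known piece at position {pos}")
    return ids


def decode(ids):
    """Look up each id and glue the pieces back together."""
    return "".join(ID_TO_PIECE[i] for i in ids)


ids = encode("Jeremy Howard is a ")
print(ids)
print(decode(ids))
```

Note how "Jeremy" is split into two pieces while " Howard" is a single one: a token can be a whole word or just a segment of it.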

## GPU vs. CPU

Before we use the model, let's quickly explore the methods `.to("mps")` and `.to("cpu")`:

In [8]:
print(toks.to("cpu"))
print(toks.to("mps"))

{'input_ids': tensor([[    1,  5677,  6764, 17430,   338,   263, 29871]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
{'input_ids': tensor([[    1,  5677,  6764, 17430,   338,   263, 29871]], device='mps:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]], device='mps:0')}


The difference is that the `.to("mps")` tensor has an additional `device='mps:0'` element. This means that this tensor is processed on the GPU, unlike the `.to("cpu")` tensor. In the context of CUDA acceleration, the analogous statements `.to("cuda")` and `.to("cpu")` would not only switch between GPU and CPU, but also copy the tensor between RAM and VRAM (the dedicated video RAM of the GPU). Apple Silicon operates on unified (shared) memory, meaning that the full memory of a machine can potentially be used by the GPU cores. While this saves time on Apple Silicon because there is no copying, the memory of Nvidia cards tends to be faster than the RAM in Apple machines.

When we pass the tokens to the LLM, we want the prompt tokens to be on the GPU, because even at inference, the LLM is very compute intensive, and it uses operations like matrix multiplication which can be optimized and parallelized very well on a GPU.

We want the resulting tensor to be on the CPU because the subsequent operations (like converting the generated tokens to text) expect a tensor to be on the CPU. In the Apple Silicon world we do not need to free valuable GPU memory, but moving the tensor to the CPU is best for compatibility with further processing.
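
If the same notebook should also run on machines without an Apple GPU, the device name can be picked at runtime instead of hard-coding `"mps"`. The following is a small sketch; the helper name `pick_device` is made up, and it falls back to `"cpu"` when PyTorch is not installed or reports no accelerator.

```python
def pick_device():
    """Return "mps", "cuda" or "cpu", depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch versions
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(device)
```

With this in place, a call such as `toks.to(device)` would work unchanged on Apple Silicon, on an Nvidia machine, and on a CPU-only setup.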

## Running Llama2

So let's go ahead and generate the response of the model by passing the prompt tokens and requesting new tokens.

In [9]:
%%time
res = model.generate(**toks.to("mps"), max_new_tokens=15).to('cpu')
res

CPU times: user 3.63 s, sys: 33.6 s, total: 37.3 s
Wall time: 3min 28s


tensor([[    1,  5677,  6764, 17430,   338,   263, 29871, 29896, 29929, 29899,
          6360, 29899,  1025,   515,   951,  5779, 29889,   940,   756,  1063,
          9701,   297]])

In [10]:
tokr.batch_decode(res)

['<s> Jeremy Howard is a 19-year-old from Leeds. He has been involved in']

So far so unperformant:

- CPU times: user 3.63 s, sys: 33.6 s, total: 37.3 s
- Wall time: 3min 28s
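
To put the wall time into perspective, it helps to convert it into a rate per token. The sketch below only uses the numbers from the run above; it attributes the entire wall time to the 15 new tokens, which is a simplification because processing the prompt is included as well.

```python
new_tokens = 15             # max_new_tokens in the generate call
wall_seconds = 3 * 60 + 28  # "Wall time: 3min 28s"

seconds_per_token = wall_seconds / new_tokens
tokens_per_second = new_tokens / wall_seconds

print(f"{seconds_per_token:.1f} seconds per token")
print(f"{tokens_per_second:.3f} tokens per second")
```

That is roughly 14 seconds for every single token, far too slow for interactive use.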

The result is pretty poor on my M2 Max with 32 GB RAM, especially considering that Jeremy's machine did the same thing in less than 2 seconds. There are probably a couple of reasons for this dramatic difference in performance:

- Nvidia memory throughput is a lot better than that of Apple's unified RAM.
- The original guide loads the model with 8-bit quantization, which relies on Nvidia GPUs. To run the model on my MacBook, I had to disable the 8-bit quantization (`load_in_8bit=False`), among other changes. While this adaptation was necessary for compatibility with Apple Silicon, it gave up the memory savings and optimizations that come with quantization.
- PyTorch's optimization for CUDA is probably still way better than its MPS optimization.
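
A back-of-the-envelope estimate can put the quantization point into numbers. The sketch below assumes a rounded 7 billion parameters (taken from the "7b" in the model name) and counts the weights only, ignoring activations and any other overhead. Which precision was actually used in the run above was not checked, so treat the figures as orientation only.

```python
PARAMS = 7e9  # rounded parameter count, an assumption based on the model name

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weights_gb(n_params, dtype):
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weights_gb(PARAMS, dtype):.0f} GB")
```

Without quantization, the weights take two to four times the memory of the 8-bit variant. On a machine with 32 GB of shared memory this could be a relevant factor, although that is a guess and not something measured here.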

Whatever the true reason is, my intuition is that it is not mainly the "hardware's fault", but rather the way I ran the model. After all, I had to remove the quantization to get Llama2 to run on my Mac. But there are other options out there, and I will explore them in other notebooks.

Spoiler alert: It is possible to run Llama2 on a Mac at a good speed.