# An Introduction to Quantization with MLX
In this notebook, we will learn the basics of **quantization** and why it can be important in the context of large language models (LLMs). We'll start by demonstrating how to perform quantization in general, and then we'll move into quantization using Apple's new Python framework, MLX.

## Why Quantization?
When an LLM is trained, the weights of the model are generally stored in a floating-point datatype, commonly `float32`. Float values are very precise, but they can be computationally expensive to work with, because each `float32` value takes up 32 bits (4 bytes) in order to cover a very wide range of values. This precision can be important for getting the most accuracy out of an LLM, but in addition to the computational cost, it also means that the stored weights of a model can make for very large files.

Let's now compare this with a smaller datatype, like `int8`. `int8` has a much smaller span of values, ranging from -128 to 127. What people have discovered is that we can rather effectively map the wide range of `float32` values to this much narrower set of `int8` values without degrading the model performance too much. Additionally, this shrinks the storage size of the model considerably, down to something that can run on smaller hardware.
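To put rough numbers on the storage savings, here is a quick back-of-the-envelope sketch. The parameter count is an assumption on my part ("7B" is a rounded figure), and the estimate ignores any file format overhead.

```python
# Approximate parameter count; "7B" is a rounded figure, so treat this as an assumption
NUM_PARAMETERS = 7_000_000_000

# Bytes needed to store a single weight in each datatype
BYTES_PER_WEIGHT = {'float32': 4, 'float16': 2, 'int8': 1, '4-bit': 0.5}

def estimate_size_gb(num_parameters, bytes_per_weight):
    """Returns the approximate storage size in gigabytes (1 GB = 10^9 bytes)."""
    return num_parameters * bytes_per_weight / 1e9

# Printing the estimated size for each datatype
for dtype, num_bytes in BYTES_PER_WEIGHT.items():
    print(f'{dtype:>8}: {estimate_size_gb(NUM_PARAMETERS, num_bytes):.1f} GB')
```

Under these assumptions, `float32` comes out at about 28 GB, while a 4-bit representation needs only about 3.5 GB for the weights themselves.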

For example, let's say we had the following four `float32` values.

- 0.1357
- -0.9875
- 0.5432
- -0.3214

If we were to map these values to the `int8` datatype using a very simple scheme (multiplying each value by 100 and dropping the fractional part), they would transform the following way:

- 13
- -98
- 54
- -32
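The integers above match what you get by multiplying each value by 100 and dropping the fraction. The sketch below reproduces that, and then compares it against a common alternative called absmax quantization, which rescales so that the largest magnitude lands on 127. The function names here are my own and purely illustrative. This is not how any particular library implements quantization.

```python
values = [0.1357, -0.9875, 0.5432, -0.3214]

def quantize_fixed_scale(xs, scale = 100):
    """Multiplies by a fixed scale and drops the fractional part."""
    return [int(x * scale) for x in xs]

def quantize_absmax(xs, max_int = 127):
    """Scales so the largest magnitude maps to max_int, then rounds to the nearest integer."""
    scale = max_int / max(abs(x) for x in xs)
    return [round(x * scale) for x in xs], scale

def dequantize(quantized, scale):
    """Maps the integers back to approximate float values."""
    return [q / scale for q in quantized]

# Reproducing the simple mapping
print(quantize_fixed_scale(values))   # [13, -98, 54, -32]

# Using the full int8 range with absmax scaling
quantized, scale = quantize_absmax(values)
print(quantized)                      # [17, -127, 70, -41]

# Recovering approximate floats to see how much information was lost
restored = dequantize(quantized, scale)
for original, approximation in zip(values, restored):
    print(f'{original:+.4f} -> {approximation:+.4f}')
```

The restored values are close to the originals but not identical. That small rounding error is the information we give up in exchange for the smaller datatype.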

Again, we are indeed losing information here, but as you'll discover, this trade-off may be acceptable for our purposes. Namely, running an LLM on a cloud GPU can be very costly, so using your own personal hardware (like a MacBook) to run a quantized LLM can save you quite a bit of money!

Throughout this notebook, we will be making use of the open weight model [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/). We will specifically be making use of it from the Hugging Face Hub, downloading the model in its "raw" state and then later quantizing that for usage on a MacBook for inference purposes.



## YMMV (Your Mileage May Vary)
In case you are curious, I personally am working on a MacBook Pro with an M1 Pro chip and 16GB of RAM (memory). This is the "baseline" 16 inch MacBook Pro model that was for sale a few years ago, and Apple has since developed newer models with even more powerful M2 and M3 chips. (Note: MLX is specifically designed for Apple Silicon Macs and will NOT work with older Intel-based Macs.)

The reason I share this is because MLX will work on any Apple Silicon Mac, but the hardware restrictions limit what you can effectively do with MLX. For example, I attempted to download Mistral's larger [Mixtral](https://mistral.ai/news/mixtral-of-experts/) model. Mixtral is significantly larger than Mistral. Whereas Mistral 7B is roughly a 28GB download, Mixtral is a whopping 100GB. When I attempted to quantize Mixtral on my own hardware, my hardware kept crapping out. 😅 This is why we'll be working with Mistral 7B, since it is smaller and can run more effectively on hardware such as mine.
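If you are not sure whether your Mac qualifies, a small check like the one below can help. This is a minimal sketch using only the standard library. The function name is my own, and one caveat is that a Python interpreter running under Rosetta may report `x86_64` even on an Apple Silicon machine.

```python
import platform

def is_apple_silicon(system = None, machine = None):
    """Returns True when the reported OS is macOS and the CPU architecture is arm64."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == 'Darwin' and machine == 'arm64'

# Checking the machine that is running this notebook
if is_apple_silicon():
    print('Apple Silicon detected. MLX should be usable here.')
else:
    print('This does not look like an Apple Silicon Mac, so MLX will likely not work.')
```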

## Notebook Setup
Let's begin by talking about the Python libraries that we'll be installing / using throughout this notebook.

- `transformers`: This is the primary Python library supported by Hugging Face. While `transformers` is great for many purposes, the sole reason we'll be using it in this notebook is to simply download the tokenizer and model for Mistral 7B.
- `mlx`: This is the base ML framework created by Apple to run advanced predictive models and more directly on Apple hardware. While we will be using a MacBook in our case, MLX also has a flavoring in Swift, meaning that you can effectively use MLX to run AI models on iPhones and iPads! While I have not created an app like this myself, there are actually apps that you can purchase on the App Store right now that will run a quantized version of Mistral 7B. These apps seem more sandboxy in nature, so I won't give any particular recommendations. But it's nice to know this is possible!
- `mlx-lm`: Knowing that people are very interested in running LLMs on Mac hardware, Apple created a companion library to `mlx` called `mlx-lm` for this very purpose. Technically speaking, if all you're interested in is model inference, all you need is `mlx-lm` with a few short lines of code. We will be going a little bit deeper in this notebook for educational purposes.

While these are the primary libraries that we'll be using, we also need to install a few more for dependency purposes. They will not necessarily be used directly.

- `bitsandbytes`: `bitsandbytes` is a library that is tightly integrated with the Hugging Face ecosystem, and it is a common choice for quantizing LLMs at a general level. We will quickly demonstrate how to make use of it on Mistral 7B, but given the focus of this notebook is more on MLX, we will not actually be making use of this Bits and Bytes quantized model. Additionally, we don't make direct use of this library when we use the `transformers` library. Instead, we simply make use of the `BitsAndBytesConfig`, which we will demonstrate in a later section.
- `torch`: This is the Python package for the popular ML framework PyTorch. While PyTorch remains a very popular library in the AI/ML community (including amongst Hugging Face users), we will not be using it directly in this notebook. The only reason we need it is as a dependency for the Bits & Bytes quantization.
- `accelerate`: This Hugging Face library is designed to accelerate ML workflows, and our specific usage of it is as a dependency for Bits & Bytes to perform its quantization. (Note: We will actually not be able to effectively demonstrate this in this notebook, because as of this notebook's creation, `accelerate` does not support Apple Silicon hardware. I can still assure you the commented out code works fine in other environments!)


You can install all these libraries by running the following command:

`pip install transformers torch accelerate bitsandbytes mlx mlx-lm`
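After installing, you may want to confirm that everything can actually be found by Python. Here is a small sketch using only the standard library. The helper name is my own, and note that the `mlx-lm` package is imported under the module name `mlx_lm`.

```python
from importlib.util import find_spec

# Module names as they are imported in Python (not necessarily the pip package names)
REQUIRED_MODULES = ['transformers', 'torch', 'accelerate', 'bitsandbytes', 'mlx', 'mlx_lm']

def find_missing_modules(module_names):
    """Returns the subset of module names that Python cannot locate."""
    return [name for name in module_names if find_spec(name) is None]

# Reporting anything that still needs to be installed
missing = find_missing_modules(REQUIRED_MODULES)
print('Missing modules:', missing if missing else 'none')
```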

In [1]:
# Importing the necessary Python libraries
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from mlx_lm import load, generate, convert



## Downloading Mistral 7B from the Hugging Face Hub
The first thing we will need to do is to load the [Mistral 7B model from the Hugging Face Hub](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). If you are not familiar with how downloading LLMs from Hugging Face works, it's pretty easy to get up and going! Generally speaking, most LLMs can be loaded using `AutoModelForCausalLM` and `AutoTokenizer`. If the `Auto` prefix is throwing you off, it simply means that when you pass a string representing the model name to the `from_pretrained` method of one of these classes, it will automatically figure out what model architecture it needs to load on the backend to make your model work. So effectively, the `AutoModelForCausalLM` will turn into an architecture that specifically works for Mistral 7B.
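To make the `Auto` idea a bit more concrete, here is a toy sketch of the dispatch pattern: read the model type out of a config, then look up the matching architecture. The registry and function below are hypothetical and only illustrate the concept. The real lookup inside `transformers` is more involved.

```python
import json

# Hypothetical registry for illustration only (not the real table inside `transformers`)
ARCHITECTURE_REGISTRY = {
    'mistral': 'MistralForCausalLM',
    'llama': 'LlamaForCausalLM',
}

def resolve_architecture(config_json):
    """Picks an architecture name based on the `model_type` field of a model config."""
    config = json.loads(config_json)
    model_type = config['model_type']
    if model_type not in ARCHITECTURE_REGISTRY:
        raise ValueError(f'Unrecognized model type: {model_type}')
    return ARCHITECTURE_REGISTRY[model_type]

# A trimmed-down stand-in for the kind of config file that ships alongside a model
example_config = '{"model_type": "mistral"}'
print(resolve_architecture(example_config))   # MistralForCausalLM
```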

To save ourselves the headache of downloading the model artifacts every time, we will be saving them to a local directory. (Note: My `.gitignore` file intentionally keeps these files out of GitHub.) This is because the model artifacts for Mistral 7B are roughly 28GB in size, so downloading them can take a little while depending on your internet speed. (I personally have gigabit internet, and it takes about 10-15 minutes for me to download the artifacts.)

In [2]:
# Setting constant values to represent model name and directory
MODEL_NAME = 'mistralai/Mistral-7B-Instruct-v0.2'
BASE_DIRECTORY = '../models'
BNB_DIRECTORY = f'{BASE_DIRECTORY}/bnb'
MLX_DIRECTORY = f'{BASE_DIRECTORY}/mlx'

# Setting the full model directory path for each of our three model types (base, quantized with bits & bytes, quantized with MLX)
model_directory = f'{BASE_DIRECTORY}/{MODEL_NAME}'
bnb_model_directory = f'{BNB_DIRECTORY}/{MODEL_NAME}'
mlx_model_directory = f'{MLX_DIRECTORY}/{MODEL_NAME}'

In [3]:
# Checking to see if the directory has already been created
if os.path.exists(model_directory):

    # Loading the tokenizer and model from local file
    print('Loading the model artifacts from disk.')
    tokenizer = AutoTokenizer.from_pretrained(model_directory)
    model = AutoModelForCausalLM.from_pretrained(model_directory)

else:

    # Creating the new model directory
    os.makedirs(model_directory)

    # Downloading the tokenizer and model from Hugging Face
    print('No local model found. Downloading artifacts from the Hugging Face Hub.')
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

    # Saving the tokenizer and model to model directory
    print('Saving the model artifacts and tokenizer to disk.')
    tokenizer.save_pretrained(save_directory = model_directory)
    model.save_pretrained(save_directory = model_directory)

Loading the model artifacts from disk.


Loading checkpoint shards: 100%|██████████| 6/6 [00:47<00:00,  7.84s/it]


In [4]:
# Deleting the model and tokenizer variables to save on memory (Comment out this cell if you don't want to remove these from RAM.)
del tokenizer
del model
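If you would like to verify how much disk space the saved artifacts take up, a helper along these lines can do it. This is a minimal sketch with a function name of my own choosing. The path is the same base model directory set earlier in this notebook.

```python
import os

def directory_size_gb(directory):
    """Sums the sizes of all regular files under a directory, in gigabytes (1 GB = 10^9 bytes)."""
    total_bytes = 0
    for root, _, files in os.walk(directory):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            # Skipping symbolic links so that linked files are not counted twice
            if not os.path.islink(file_path):
                total_bytes += os.path.getsize(file_path)
    return total_bytes / 1e9

# Same location as the base model directory used earlier in this notebook
model_directory = '../models/mistralai/Mistral-7B-Instruct-v0.2'
if os.path.exists(model_directory):
    print(f'Base model size on disk: {directory_size_gb(model_directory):.1f} GB')
```

The same helper can be pointed at the quantized model directory later on to compare the two sizes.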

## Quantization with Bits & Bytes
Now that we have downloaded our model artifacts, we can demonstrate how to generally quantize them using the Bits & Bytes library, which is the usual choice in the Hugging Face ecosystem. You'll see just how easy this is to do! All we need to do is set a configuration that specifies how we would like Bits & Bytes to perform the quantization.

Unfortunately, this code actually is not compatible with Apple Silicon, so I will leave it commented out. But I can confirm this will work in environments like an AWS SageMaker notebook!

In [6]:
# # Importing PyTorch, which is only needed for the compute datatype below
# import torch

# # Loading the Bits & Bytes configuration for 4-bit quantization
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit = True,
#     bnb_4bit_quant_type = "nf4",  # Use NF4 data type for weights from normal distribution
#     bnb_4bit_use_double_quant = True,  # Use nested quantization
#     bnb_4bit_compute_dtype = torch.bfloat16  # Use bfloat16 for faster computation
# )

In [7]:
# # Checking to see if the directory has already been created
# if os.path.exists(bnb_model_directory):

#     # Loading the tokenizer and model from local file
#     print('Loading the BNB-quantized model artifacts from disk.')
#     tokenizer = AutoTokenizer.from_pretrained(bnb_model_directory)
#     model = AutoModelForCausalLM.from_pretrained(bnb_model_directory)

# else:

#     # Creating the new model directory
#     os.makedirs(bnb_model_directory)

#     # Loading the tokenizer and base model from disk, quantizing the model as it loads
#     print('No local BNB model found. Quantizing artifacts using Bits & Bytes.')
#     tokenizer = AutoTokenizer.from_pretrained(model_directory)
#     model = AutoModelForCausalLM.from_pretrained(model_directory, quantization_config = bnb_config)

#     # Saving the tokenizer and model to model directory
#     print('Saving the BNB-quantized model artifacts and tokenizer to disk.')
#     tokenizer.save_pretrained(save_directory = bnb_model_directory)
#     model.save_pretrained(save_directory = bnb_model_directory)

## Quantization with MLX
Now that we have demonstrated how to perform quantization with Bits & Bytes, let's move into quantization with MLX. While it is certainly possible to do this with the base `mlx` library, Apple has made it very easy on us by enabling this functionality within the `mlx-lm` library. Please be aware this quantization is not supported for all models on the Hugging Face Hub. For a list of models that you can effectively quantize with MLX, [please visit this page](https://github.com/ml-explore/mlx-examples/tree/main/llms#supported-models).

In [8]:
# Checking to see if the directory has already been created
if os.path.exists(mlx_model_directory):

    # Loading the tokenizer and model from local file
    print('Loading the MLX-quantized model artifacts from disk.')
    model, tokenizer = load(mlx_model_directory)

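else:

    # Creating only the parent directory (some mlx-lm releases refuse to convert into a path that already exists)
    os.makedirs(os.path.dirname(mlx_model_directory), exist_ok = True)

    # Quantizing the base model with MLX and saving the result to the MLX model directory
    print('No local MLX model found. Quantizing artifacts using MLX.')
    convert(hf_path = model_directory,
            mlx_path = mlx_model_directory,
            quantize = True)

    # Saving the tokenizer alongside the quantized weights
    tokenizer = AutoTokenizer.from_pretrained(model_directory)
    tokenizer.save_pretrained(save_directory = mlx_model_directory)

    # Loading the freshly quantized model and tokenizer so they are ready for the next section
    model, tokenizer = load(mlx_model_directory)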

Loading the MLX-quantized model artifacts from disk.
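If you are curious what `quantize = True` roughly does under the hood, here is a pure Python sketch of group-wise affine quantization, which is the general technique as I understand it. To my knowledge, `mlx-lm` defaults to 4 bits with groups of 64 weights, but treat those numbers and every function name below as assumptions. MLX's actual implementation packs the integers and runs on the GPU.

```python
import math

def quantize_group(weights, bits = 4):
    """Affine-quantizes one group of weights to integers in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # Guarding against a group where every weight is identical
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    quantized = [round((w - w_min) / scale) for w in weights]
    return quantized, scale, w_min

def dequantize_group(quantized, scale, bias):
    """Maps the integers back to approximate float weights."""
    return [q * scale + bias for q in quantized]

def quantize_groupwise(weights, group_size = 64, bits = 4):
    """Splits the weights into groups, each with its own scale and offset."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]

# Toy "weights" standing in for one row of a real weight matrix
toy_weights = [math.sin(i) for i in range(128)]
groups = quantize_groupwise(toy_weights)

# Rebuilding the approximate weights to measure the round-trip error
restored = []
for quantized, scale, bias in groups:
    restored.extend(dequantize_group(quantized, scale, bias))
max_error = max(abs(w - r) for w, r in zip(toy_weights, restored))
print(f'Groups: {len(groups)}, largest round-trip error: {max_error:.4f}')
```

Giving each group its own scale and offset keeps the rounding error small even when different parts of the weight matrix have very different ranges. The cost is a little extra storage per group: if the scale and offset were each stored as 16-bit floats (an assumption on my part), that would add 32 bits per 64 weights, or roughly 4.5 bits per weight in total.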


## Making Use of the Quantized MLX Model
Now that we have successfully quantized Mistral 7B, let's demonstrate how we can make use of it!

In [9]:
# Setting the prompt we would like to ask of the quantized model
prompt = 'Give me a recipe for delicious chocolate chip cookies. Please give the response in the tone of Jar Jar Binks.'

# Producing the response (completion) with the quantized model
response = generate(
    model = model,
    tokenizer = tokenizer,
    prompt = prompt,
    max_tokens = 1000
)

# Printing the response
print(response)



Mesa have dewicious chwocow chip cowookies fow you! Mesa give you step by step instructions, but mesa warn you, dis will be a wowising journey to de wowest cookie in the galaxy!

Ingredients you need:
- 2 1/4 cups all-purpose flour
- 1/2 cup unsalted butter, softened
- 1/2 cup shortening
- 1 cup granulated sugar
- 1 cup packed brown sugar
- 2 eggs
- 2 teaspoons vanilla extract
- 3 cups quick-cooking oats
- 1 teaspoon baking soda
- 1/2 teaspoon baking powder
- 1/2 teaspoon salt
- 2 cups semisweet chocolate chips
- 1 cup chopped walnuts (optional)

Mesa begin! Preheat oven to 350 degrees F (175 degrees C). Mesa cream butter, shortening, granulated sugar, and brown sugar until light and fluffy. Mesa add eggs one at a time, beating well after each addition. Mesa stir in vanilla extract.

Mesa combine flour, baking soda, baking powder, and salt; gradually add to butter mixture and mix well. Mesa add oats, chocolate chips, and nuts (if desired).

Mesa drop by rounded tablespoonfuls onto un