# MISTRAL-7B

List of main relevant papers:
* [Jiang et al. (2023)](https://arxiv.org/abs/2310.06825). Mistral 7B

List of main relevant blogs:
* [Mistral (2023)](https://mistral.ai/news/announcing-mistral-7b/). Mistral 7B. The best 7B model to date, Apache 2.0

List of main relevant Youtube videos:
* [PromptEngineering (2023)](https://www.youtube.com/watch?v=z4wPiallZcI). Mistral 7B -The Most Powerful 7B Model Yet

Model: https://huggingface.co/mistralai/Mistral-7B-v0.1

# 1 - Introduction

In the rapidly evolving domain of Natural Language Processing (NLP), the race towards higher model performance often necessitates an escalation in model size. However, this scaling tends to increase computational costs and inference latency, raising barriers to deployment in practical, real-world scenarios. In this context, the search for balanced models that deliver both high performance and efficiency becomes essential.

Mistral 7B demonstrates that a carefully designed language model can deliver high performance while maintaining efficient inference. Mistral 7B outperforms the previous best 13B model (Llama 2, [Touvron et al., 2023](https://arxiv.org/pdf/2307.09288.pdf)) across all tested benchmarks, and surpasses the best 34B model (Llama 1 34B, [Touvron et al., 2023](https://arxiv.org/pdf/2302.13971.pdf)) in mathematics and code generation. Furthermore, Mistral 7B approaches the coding performance of Code-Llama 7B ([Rozière et al., 2023](https://arxiv.org/abs/2308.12950)), without sacrificing performance on non-code related benchmarks.


Mistral 7B leverages grouped-query attention (GQA, [Ainslie et al., 2023](https://arxiv.org/abs/2305.13245)) and sliding window attention (SWA, [Beltagy et al., 2020](https://arxiv.org/abs/2004.05150)). GQA significantly accelerates inference and reduces the memory requirement during decoding, allowing for larger batch sizes and hence higher throughput, a crucial factor for real-time applications. In addition, SWA is designed to handle longer sequences more effectively at a reduced computational cost, thereby alleviating a common limitation in LLMs.

----

**Note:** These attention mechanisms are aimed at improving inference efficiency and should not, by themselves, improve general model performance. A plausible explanation for the performance improvement is the proprietary data that was used to train the model, which is probably much cleaner than most open-source data sources.


----

<table>
    <tr>
        <td><img src="./images_1/results_bar.png" width="700"/></td>
    </tr>
</table>

The benchmarks are categorized by their themes:

* **Commonsense Reasoning:** 0-shot average of Hellaswag, Winogrande, PIQA, SIQA, OpenbookQA, ARC-Easy, ARC-Challenge, and CommonsenseQA.
* **World Knowledge:** 5-shot average of NaturalQuestions and TriviaQA.
* **Reading Comprehension:** 0-shot average of BoolQ and QuAC.
* **Math:** Average of 8-shot GSM8K with maj@8 and 4-shot MATH with maj@4.
* **Code:** Average of 0-shot Humaneval and 3-shot MBPP.
* **Popular aggregated results:** 5-shot MMLU, 3-shot BBH, and 3-5-shot AGI Eval (English multiple-choice questions only).

<table>
    <tr>
        <td><img src="./images_1/results_table.png" width="900"/></td>
    </tr>
</table>

An interesting way to compare how models fare in the cost/performance plane is to compute “equivalent model sizes”. On reasoning, comprehension and STEM reasoning (MMLU), **Mistral 7B performs equivalently to a Llama 2 model that would be more than 3x its size**. This translates directly into memory saved and throughput gained.

<table>
    <tr>
        <td><img src="./images_1/results_effective_sizes.png" width="700"/></td>
    </tr>
</table>

Mistral 7B largely outperforms Llama 2 13B on all evaluations, except on knowledge benchmarks, where it is on par (this is likely due to its limited parameter count, which restricts the amount of knowledge it can compress).

# 2 - Architecture

Mistral 7B is based on an autoregressive Transformer architecture. The main parameters of the architecture are summarized in the following table:

| Parameter     | Value    |
|-------------- | -------- |
| dim           | 4096     |
| n_layers      | 32       |
| head_dim      | 128      |
| hidden_dim    | 14336    |
| n_heads       | 32       |
| n_kv_heads    | 8        |
| window_size   | 4096     |
| context_len   | 8192     |
| vocab_size    | 32000    |

Compared to LLaMA, it introduces a few changes:

* **Sliding Window Attention (SWA)**.
* **Rolling Buffer Cache**.
* **Pre-fill and Chunking**.

In addition, it uses GQA (which LLaMA 2 did not use for its smaller models).
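To make the memory benefit of GQA concrete, the following back-of-the-envelope sketch estimates the KV cache size per token from the table above, with and without grouped KV heads. It assumes the cache is stored in 16-bit precision (2 bytes per value); that storage format is an assumption for illustration, not something stated in the table.

```python
# Values taken from the architecture table above.
N_LAYERS = 32
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumption: cache stored in FP16/BF16


def kv_cache_bytes_per_token(n_kv_heads: int) -> int:
    """Bytes needed to cache keys and values for one token across all layers."""
    # The factor 2 accounts for storing both a key and a value vector per head.
    return 2 * N_LAYERS * n_kv_heads * HEAD_DIM * BYTES_PER_VALUE


mha = kv_cache_bytes_per_token(N_HEADS)     # one K/V head per query head
gqa = kv_cache_bytes_per_token(N_KV_HEADS)  # 4 query heads share each K/V head

print(f"Multi-head attention:    {mha // 1024} KiB per token")
print(f"Grouped-query attention: {gqa // 1024} KiB per token")
print(f"Reduction: {mha // gqa}x")
```

Under these assumptions, sharing each key/value head among 4 query heads shrinks the cache by a factor of 4, which is what allows larger batch sizes in the same memory budget.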

## 2.1 - Sliding Window Attention

Vanilla Transformer-based models struggle to process long sequences because their self-attention operation scales quadratically with the sequence length. To address this limitation, [Beltagy et al. (2020)](https://arxiv.org/abs/2004.05150) introduce a sliding window. By limiting the attention computation to a fixed-size window, Sliding Window Attention (SWA) reduces the time complexity from quadratic to linear in the sequence length (for a fixed window size). This makes it more efficient for long sequences, as each token no longer needs to attend to all preceding tokens.

With SWA, each token can attend to at most $W$ tokens from the previous layer. However, tokens outside the sliding window still influence next word prediction. At each attention layer, information can move forward by $W$ tokens. Hence, after $k$ attention layers, information can move forward by up to $k \times W$ tokens.

<table>
    <tr>
        <td><img src="./images_1/sliding_window.png" width="600"/></td>
    </tr>
</table>

SWA exploits the stacked layers of a Transformer to attend to information beyond the window size $W$. The hidden state at position $i$ of layer $k$, $h_{i}$, attends to all hidden states from the previous layer with positions between $i - W$ and $i$. Recursively, $h_{i}$ can access tokens from the input layer at a distance of up to $k \times W$ tokens. At the last layer, with a window size of $W = 4096$ and 32 layers, we have a theoretical attention span of approximately 131K tokens. In practice, for a sequence length of 16K and $W = 4096$, changes made to [FlashAttention](https://github.com/Dao-AILab/flash-attention) and [xFormers](https://github.com/facebookresearch/xformers) yield a 2x speed improvement over a vanilla attention baseline.
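As a minimal sketch of the idea, the snippet below builds a causal sliding-window mask in plain Python and computes the theoretical attention span. The function name `sliding_window_mask` is hypothetical, and the convention that the window includes the current token (so each query sees at most $W$ keys) is an assumption made for this illustration.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[int]]:
    """mask[i][j] == 1 when query position i may attend to key position j."""
    return [
        # Causal (j <= i) and within the last `window` positions (i - j < window).
        [1 if 0 <= i - j < window else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


for row in sliding_window_mask(seq_len=6, window=3):
    print(row)

# Theoretical attention span: each layer extends the reach by up to W tokens.
n_layers, window_size = 32, 4096
print(n_layers * window_size)  # 131072, i.e. roughly 131K tokens
```

Each row has at most `window` ones, so the work per token is bounded by the window size rather than by the sequence length.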



## 2.2 - Rolling Buffer (KV) Cache

### 2.2.1 - Reminder of KV cache

Since the decoder is causal (i.e., the attention of a token only depends on its preceding tokens), at each generation step we would recompute the keys, values and attention of all previous tokens, **when we actually just want to calculate the attention for the new token**.

This is where the KV cache comes into play. By caching the previous keys and values, we only need to compute the query, key and value of the new token and attend over the cached entries.

<table>
    <tr>
        <td><img src="./images_1/kv_caching_2.gif" width="600"/></td>
    </tr>
</table>
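The toy example below illustrates why caching is safe: attending over cached keys and values gives exactly the same result as rebuilding them at every step. It is a single-head sketch in plain Python with made-up vectors; in a real model the queries, keys and values come from learned projections, and the helper `attend` is hypothetical.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]


# Toy (query, key, value) vectors for three tokens.
tokens = [
    ([1.0, 0.0], [0.5, 0.1], [1.0, 2.0]),
    ([0.0, 1.0], [0.2, 0.9], [3.0, 4.0]),
    ([1.0, 1.0], [0.7, 0.3], [5.0, 6.0]),
]

# Without a cache: rebuild the full list of keys and values at every step.
no_cache = []
for t in range(len(tokens)):
    keys = [k for _, k, _ in tokens[: t + 1]]
    values = [v for _, _, v in tokens[: t + 1]]
    no_cache.append(attend(tokens[t][0], keys, values))

# With a cache: append only the new key/value and reuse everything else.
k_cache, v_cache, with_cache = [], [], []
for query, key, value in tokens:
    k_cache.append(key)
    v_cache.append(value)
    with_cache.append(attend(query, k_cache, v_cache))

print(no_cache == with_cache)  # True: caching does not change the result
```

The saving comes from the work that is skipped: with the cache, each step only processes the newest token instead of the whole prefix.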

### 2.2.2 - Rolling KV cache

A fixed attention span means that we can limit our cache size using a rolling buffer cache. The cache has a fixed size of $W$, and the keys and values for timestep $i$ are stored in position $i \bmod W$ of the cache (written `i % W` in code). As a result, when the position $i$ is larger than $W$, past values in the cache are overwritten, and the size of the cache stops increasing.

To illustrate this, let's say you have a rolling buffer cache with a fixed size of $W = 3$. Now, consider a sequence of timesteps, i = 1, 2, 3, 4, and so on:

* For timestep $i$ = 1, the position will be 1 % 3, which is 1. So, the keys and values for timestep 1 will be stored in position 1 of the cache.
* For timestep $i$ = 2, the position will be 2 % 3, which is 2. So, the keys and values for timestep 2 will be stored in position 2 of the cache.
* For timestep $i$ = 3, the position will be 3 % 3, which is 0. So, the keys and values for timestep 3 will be stored in position 0 of the cache.
* For timestep $i$ = 4, the position will be 4 % 3, which is 1. So, the keys and values for timestep 4 will overwrite what's in position 1 of the cache because it's older, and the cache doesn't grow beyond size W.
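The same walkthrough can be reproduced with a few lines of Python. The class `RollingBufferCache` below is a hypothetical, minimal sketch that stores placeholder strings instead of real key/value tensors.

```python
class RollingBufferCache:
    """Fixed-size cache: the entry for timestep i lives in slot i % window."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window

    def store(self, timestep: int, kv):
        # Overwrites whatever older entry occupied this slot.
        self.slots[timestep % self.window] = kv


cache = RollingBufferCache(window=3)
for i in range(1, 5):
    cache.store(i, f"kv_{i}")
    print(i, cache.slots)
```

After timestep 4, slot 1 holds `kv_4` in place of `kv_1`, and the list still has exactly 3 slots.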

<table>
    <tr>
        <td><img src="./images_1/rolling_kv_cache.png" width="800"/></td>
    </tr>
</table>

## 2.3 - Pre-fill and Chunking

When generating a sequence, we need to predict tokens one by one, as each token is conditioned on the previous ones. However, the prompt is known in advance, so we can pre-fill the KV cache with the prompt. If the prompt is very large, we can chunk it into smaller pieces and pre-fill the cache with each chunk. For this purpose, we can select the window size as our chunk size. For each chunk, we thus need to compute the attention over the cache and over the chunk.

The following figure shows how the attention mask works over both the cache and the chunk.

<table>
    <tr>
        <td><img src="./images_1/prefill_chunking.png" width="800"/></td>
    </tr>
</table>
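As a rough sketch of the bookkeeping involved, the snippet below splits a prompt into window-sized chunks and lists which key positions each query in a chunk can see. The function `chunked_prefill_plan` is hypothetical, works on position indices only (no tensors), and assumes the window includes the current token.

```python
def chunked_prefill_plan(prompt_len: int, window: int):
    """Yield (chunk_positions, visible_keys_per_query) for each prompt chunk."""
    for start in range(0, prompt_len, window):
        chunk = list(range(start, min(start + window, prompt_len)))
        visible = {
            # Keys come from the cache (earlier chunks) and the current chunk,
            # restricted to the causal sliding window.
            i: list(range(max(0, i - window + 1), i + 1))
            for i in chunk
        }
        yield chunk, visible


for chunk, visible in chunked_prefill_plan(prompt_len=7, window=3):
    print("chunk", chunk)
    for i, keys in visible.items():
        print(f"  query {i} attends to keys {keys}")
```

For example, query 3 (the first token of the second chunk) sees keys 1 and 2 from the cache and key 3 from its own chunk.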

# 3 - Testing Mistral 7B




Unless a lower-precision data type is requested, models from Hugging Face are typically loaded and run in 32-bit floating-point (FP32) precision. This format provides a wide dynamic range, which helps maintain numerical stability during training.

Quantization is a technique used to reduce the model size by converting the model weights from a higher precision (like FP32) to a lower precision, such as 16-bit floating point (FP16), 8-bit integer (INT8) or 4-bit integer (INT4).
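To give an intuition for what this conversion does, here is a toy absmax quantization to 8-bit integers in plain Python. This is a simplified illustration under stated assumptions, not the scheme `bitsandbytes` uses internally; the function names and the example weights are made up.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in q_weights]


weights = [0.12, -0.5, 0.33, 1.27, -0.9]
q_weights, scale = quantize_absmax_int8(weights)
restored = dequantize(q_weights, scale)

print("quantized:", q_weights)
print("max round-trip error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight is now stored as a small integer plus one shared scale, at the cost of a rounding error of at most half a quantization step.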

Since running a 7B-parameter model in FP32 requires roughly 28 GB of GPU memory for the weights alone (7 billion parameters x 4 bytes), we are going to use a quantized version of the model. More specifically:
* 8-bit version (about 7 GB of memory for the weights during inference)
* 4-bit version (about 3.5 GB of memory for the weights during inference)
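These figures follow from simple arithmetic on the number of bytes per parameter. The sketch below assumes a round 7 billion parameters and counts only the weights; activations, the KV cache and framework overhead come on top, so measured usage will be higher.

```python
N_PARAMS = 7e9  # assumption: round figure; the exact count is slightly higher

BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for name, n_bytes in BYTES_PER_PARAM.items():
    gigabytes = N_PARAMS * n_bytes / 1e9
    print(f"{name}: ~{gigabytes:.1f} GB for the weights")
```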

-----

> **Warning:** As of 11/10/2024, `bitsandbytes` is only supported on CUDA GPU hardware. Support for AMD GPUs and M1 chips (MacOS) is coming soon.

-----

#### When does quantization occur?

An interesting property of quantization is that it happens AFTER we download the model, so we can try different quantization schemes without keeping duplicate copies of the model parameters on disk.

* Quantization, such as 8-bit (using bitsandbytes or similar libraries), is applied locally after the model has been downloaded and loaded into memory.
* The full-precision model is initially loaded, and then the quantization library reduces the precision of the weights from 32-bit or 16-bit floating-point down to 8-bit or 4-bit integers, depending on your configuration.
* This process happens in-memory, so there is no need to re-download the model each time you use a different quantization scheme.

In [None]:
!pip install -U bitsandbytes

> Google Colab: Remember to restart the session after installing the package

In [None]:
import torch
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

In [None]:
login()

### 8-bit quantization

In [None]:
# Define the quantization configuration for 8-bit
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,  # Load the weights in 8-bit precision
)

# Load the model and tokenizer
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)

In [None]:
# Optional: add a dedicated padding token and resize the embeddings to match
# tokenizer.add_special_tokens({'pad_token': '<pad>'})
# model.resize_token_embeddings(len(tokenizer))

print(tokenizer.vocab_size)
print(model.get_input_embeddings())

print(tokenizer.special_tokens_map)

In [None]:
import torch

# Check if CUDA is available
if torch.cuda.is_available():
    # Get the current device (usually device 0)
    device = torch.device("cuda")

    # Memory allocated on the GPU (in bytes)
    allocated_memory = torch.cuda.memory_allocated(device)

    # Memory reserved by PyTorch's memory allocator (in bytes)
    reserved_memory = torch.cuda.memory_reserved(device)

    # Print the memory information (in MB for better readability)
    print(f"Allocated memory: {allocated_memory / (1024 ** 2)} MB")
    print(f"Reserved memory: {reserved_memory / (1024 ** 2)} MB")
else:
    print("CUDA is not available on this system.")


In [None]:
# Prepare a prompt
prompt = "<s>[INST] What is your favourite condiment? [/INST]"
# prompt = "What is your favourite condiment?"

# Tokenize the prompt and move the tensors to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a response (max_new_tokens counts only the newly generated tokens)
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=50,
)

# Decode the generated tokens to text
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(response)

### 4-bit quantization

For more information about INT4 quantization: https://arxiv.org/pdf/2301.12017

> **Warning:** 4-bit quantization is considerably less stable than 8-bit; I ran into issues when trying to run the model.


#### Breakdown of the quantization configuration

- **`load_in_4bit=True`**:
  This parameter indicates that the model should be loaded using 4-bit quantization. This significantly reduces the model size compared to the default 32-bit or even 16-bit floating-point formats, allowing for faster inference and lower memory usage.

- **`bnb_4bit_compute_dtype=torch.float16`**:
  This specifies the data type used for computations during inference. Setting this to `torch.float16` means that even though the model weights are quantized to 4 bits, the computations (such as activations and gradients, if applicable) will be performed in 16-bit floating-point (FP16). This can help maintain accuracy while leveraging the reduced precision of the weights.

- **`bnb_4bit_quant_type="nf4"`**:
  This specifies the quantization data type used for the 4-bit representation. `"nf4"` stands for 4-bit NormalFloat (NF4), a data type introduced with QLoRA whose quantization levels are chosen for weights that are approximately normally distributed. It retains more information about the distribution of the weights than simpler 4-bit formats, which can lead to better model performance after quantization.

- **`bnb_4bit_use_double_quant=True`**:
  When set to `True`, this parameter enables double (nested) quantization: the quantization constants produced by the first quantization step are themselves quantized. Its purpose is to save additional memory on top of 4-bit quantization, at little or no cost in accuracy, rather than to reduce the quantization error.
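The memory effect of double quantization can be estimated with a short calculation. The block sizes below (64 weights per constant, 256 constants per second-level constant, constants re-quantized to 8 bits) are the values described in the QLoRA paper; treating them as the settings in effect here is an assumption.

```python
BLOCK_SIZE = 64          # weights sharing one quantization constant
SECOND_BLOCK_SIZE = 256  # constants sharing one second-level constant

# Single quantization: one FP32 constant per block of weights.
single = 32 / BLOCK_SIZE

# Double quantization: constants stored in 8 bits, plus one FP32
# second-level constant per block of constants.
double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_BLOCK_SIZE)

print(f"overhead without double quantization: {single:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
print(f"saved: {single - double:.3f} bits/param")
```

Under these assumptions the saving is about 0.37 bits per parameter, which is small per weight but adds up over billions of parameters.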

In [None]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
)

`low_cpu_mem_usage` was None, now set to True since model is quantized.



In [None]:
# Prepare a prompt
prompt = "<s>[INST] What is your favourite condiment? [/INST]"

# Tokenize the prompt and move the tensors to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a response (max_new_tokens counts only the newly generated tokens)
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=50,
)

# Decode the generated tokens to text
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


What is your favourite condiment? 


In [None]:
outputs

tensor([[    1, 23325,     2]])