# Mixtral in Colab

Welcome! In this notebook you can run [Mixtral-8x7B-Instruct](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) at a decent generation speed **right in Google Colab or on a consumer-grade GPU**. This is made possible by quantizing the original model in mixed precision and implementing a MoE-specific offloading strategy.

To learn more, read our [tech report](https://arxiv.org/abs/2312.17238) or check out the [repo](https://github.com/dvmazur/mixtral-offloading) on GitHub.

You will need approximately 16 GB of VRAM and 11 GB of RAM to run this notebook and generate somewhat long texts.


<details>

<summary>How to balance between RAM and GPU VRAM usage</summary>

You can balance RAM and GPU VRAM usage by changing the <code>offload_per_layer</code> variable in the <a href="#scrollTo=_mIpePTMFyRY&line=10&uniqifier=1">Initialize model</a> section. Increasing <code>offload_per_layer</code> decreases GPU VRAM usage, increases RAM usage, and decreases generation speed. Decreasing <code>offload_per_layer</code> has the opposite effect.

Note that this notebook should run normally in Google Colab with <code>offload_per_layer = 4</code>, but may crash with other values. However, if you run it somewhere else, you're free to experiment with this variable.
</details>
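As a rough illustration of this trade-off, the sketch below mirrors the `main_size` and `offload_size` arithmetic used in the Initialize model section. The helper name `split_experts` is hypothetical and not part of the repository; the layer and expert counts (32 and 8) are the values this Mixtral checkpoint reports.

```python
def split_experts(num_layers: int, num_experts: int, offload_per_layer: int) -> dict:
    """Count experts kept on the GPU versus offloaded to RAM (hypothetical helper)."""
    if not 0 <= offload_per_layer <= num_experts:
        raise ValueError("offload_per_layer must be between 0 and num_experts")
    return {
        "on_gpu": num_layers * (num_experts - offload_per_layer),
        "offloaded": num_layers * offload_per_layer,
    }


# 32 decoder layers with 8 experts each.
for k in (3, 4, 5):
    print(k, split_experts(num_layers=32, num_experts=8, offload_per_layer=k))
```

Each step of `offload_per_layer` moves 32 experts (one per layer) between GPU memory and RAM.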

## Install and import libraries

In [1]:
# fix numpy in colab
import numpy
from IPython.display import clear_output

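# fix triton in colab
# `!export ...` runs in a throwaway subshell and does not persist,
# so the variables are set through os.environ instead.
import os
os.environ["LC_ALL"] = "en_US.UTF-8"
os.environ["LD_LIBRARY_PATH"] = "/etc/alternatives/cuda/targets/x86_64-linux/include:/usr/include/python3.6m:" + os.environ.get("LD_LIBRARY_PATH", "")
os.environ["LIBRARY_PATH"] = "/etc/alternatives/cuda/lib64/stubs"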
# !ldconfig /etc/alternatives/cuda/lib64/lib64-nvidia

# !git clone https://github.com/dvmazur/mixtral-offloading.git --quiet
# !cd mixtral-offloading && pip install -q -r requirements.txt
!huggingface-cli download lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo --quiet --local-dir Mixtral-8x7B-Instruct-v0.1-offloading-demo

clear_output()

In [2]:

import os, sys

# Make the parent directory of the working directory importable.
parent_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)

# Also look for modules inside a local clone of the repo.
sys.path.append("mixtral-offloading")
import torch
from torch.nn import functional as F
from hqq.core.quantize import BaseQuantizeConfig
from huggingface_hub import snapshot_download
from IPython.display import clear_output
from tqdm.auto import trange
from transformers import AutoConfig, AutoTokenizer
from transformers.utils import logging as hf_logging

from src.build_model import OffloadConfig, QuantConfig, build_model

hqq_aten package not installed. HQQBackend.ATEN backend will not work unless you install the hqq_aten lib in hqq/kernels.




In [3]:
# Reload imported modules (e.g. get_decode_model_characterstics) every time a cell is executed,
# so you don't need to restart the notebook after editing the source code.
%load_ext autoreload
%autoreload 2  

## Initialize model

In [4]:
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_name = "lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo"
state_path = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"

config = AutoConfig.from_pretrained(quantized_model_name)

device = torch.device("cuda:0")

##### Change this to 5 if you have only 12 GB of GPU VRAM #####
offload_per_layer = 4
# offload_per_layer = 5
###############################################################

num_experts = config.num_local_experts

offload_config = OffloadConfig(
    main_size=config.num_hidden_layers * (num_experts - offload_per_layer),
    offload_size=config.num_hidden_layers * offload_per_layer,
    buffer_size=4,
    offload_per_layer=offload_per_layer,
)


attn_config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    quant_zero=True,
    quant_scale=True,
)
attn_config["scale_quant_params"]["group_size"] = 256


ffn_config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    quant_zero=True,
    quant_scale=True,
)
quant_config = QuantConfig(ffn_config=ffn_config, attn_config=attn_config)


model = build_model(
    device=device,
    quant_config=quant_config,
    offload_config=offload_config,
    state_path=state_path,
)

Loading experts: 100%|██████████| 32/32 [00:10<00:00,  3.03it/s]
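To get a feel for why the experts are quantized to 2 bits, here is a back-of-the-envelope size estimate. It is a sketch under stated assumptions, not a measurement: the FFN width of 14336 and the three weight matrices per expert are taken from the public Mixtral-8x7B architecture and do not appear in this notebook, and the estimate ignores quantization metadata (scales and zero points), so it is a lower bound.

```python
HIDDEN_SIZE = 4096          # matches Embedding(32000, 4096) in the model printout
INTERMEDIATE_SIZE = 14336   # assumption: Mixtral-8x7B FFN width, not shown in this notebook
MATRICES_PER_EXPERT = 3     # assumption: three projection matrices per expert


def expert_weight_bytes(nbits: int) -> int:
    """Bytes needed for one expert's packed weights, ignoring scales and zero points."""
    params = MATRICES_PER_EXPERT * HIDDEN_SIZE * INTERMEDIATE_SIZE
    return params * nbits // 8


per_expert = expert_weight_bytes(nbits=2)   # nbits taken from ffn_config
offloaded = 32 * 4                          # num_hidden_layers * offload_per_layer
print(f"{per_expert / 2**20:.1f} MiB per expert")
print(f"{offloaded * per_expert / 2**30:.2f} GiB for {offloaded} offloaded experts")
```

The real footprint is somewhat larger, since each group of 16 weights also stores a scale and a zero point.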


In [5]:
model

MixtralForCausalLM(
  (model): MixtralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MixtralDecoderLayer(
        (self_attn): MixtralAttention(
          (q_proj): HQQLinearTritonSavable()
          (k_proj): HQQLinearTritonSavable()
          (v_proj): HQQLinearTritonSavable()
          (o_proj): HQQLinearTritonSavable()
          (rotary_emb): MixtralRotaryEmbedding()
        )
        (block_sparse_moe): SparseMoeWrapper(
          (gate): Linear(in_features=4096, out_features=8, bias=False)
        )
        (input_layernorm): MixtralRMSNorm()
        (post_attention_layernorm): MixtralRMSNorm()
      )
    )
    (norm): MixtralRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
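The `gate` layer in the printout above maps each 4096-dimensional token representation to 8 scores, one per expert. The sketch below shows how such scores are typically turned into an expert selection. It assumes top-2 routing, which is Mixtral's published setting but is not shown in this notebook; `route_top_k` is a hypothetical helper written in plain Python rather than the wrapper's actual code.

```python
import math


def route_top_k(gate_logits, k=2):
    """Pick the k highest-scoring experts and normalize their weights to sum to 1."""
    # Indices of the k largest logits, best first.
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (shifted by the max for stability).
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Made-up gate scores for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 3.0]
for expert_id, weight in route_top_k(logits):
    print(f"expert {expert_id}: weight {weight:.3f}")
```

Only the selected experts need to be resident on the GPU for a given token, which is what makes offloading the rest feasible.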

## Run the model

In [19]:
from transformers import TextStreamer


tokenizer = AutoTokenizer.from_pretrained(model_name)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
past_key_values = None
sequence = None

seq_len = 0
# while True:
print("User: ", end="")
user_input = "1+1 = "
print("\n")

user_entry = dict(role="user", content=user_input)
input_ids = tokenizer.apply_chat_template([user_entry], return_tensors="pt").to(device)

if past_key_values is None:
  attention_mask = torch.ones_like(input_ids)
else:
  seq_len = input_ids.size(1) + past_key_values[0][0][0].size(1)
  attention_mask = torch.ones([1, seq_len - 1], dtype=torch.int, device=device)

print("Mixtral: ", end="")
result = model.generate(
  input_ids=input_ids,
  attention_mask=attention_mask,
  past_key_values=past_key_values,
  streamer=streamer,
  do_sample=True,
  temperature=0.9,
  top_p=0.9,
  min_new_tokens=2,
  max_new_tokens=2,
  pad_token_id=tokenizer.eos_token_id,
  return_dict_in_generate=True,
  output_hidden_states=True,
)
print("\n")

# sequence = result["sequences"]
# past_key_values = result["past_key_values"]

User: 

Mixtral: tensor([0, 1, 1, 0, 0, 0, 1, 1])
tensor([0, 0, 1, 1, 1, 0, 0, 1])
tensor([1, 0, 1, 0, 0, 1, 1, 0])
tensor([1, 0, 0, 0, 0, 1, 1, 1])
tensor([1, 1, 1, 0, 1, 0, 0, 0])
tensor([1, 1, 1, 0, 1, 0, 0, 0])
tensor([1, 1, 0, 0, 0, 1, 1, 0])
tensor([0, 1, 1, 0, 0, 1, 0, 1])
tensor([1, 1, 1, 1, 0, 0, 0, 0])
tensor([1, 1, 1, 0, 0, 0, 0, 1])
tensor([1, 1, 0, 1, 0, 0, 0, 1])
tensor([0, 0, 1, 0, 1, 1, 0, 1])
tensor([1, 0, 0, 0, 1, 1, 0, 1])
tensor([0, 1, 1, 0, 0, 1, 0, 1])
tensor([1, 0, 1, 1, 1, 0, 0, 0])
tensor([0, 1, 1, 0, 1, 0, 0, 1])
tensor([1, 0, 0, 0, 1, 0, 1, 1])
tensor([0, 1, 0, 1, 0, 0, 1, 1])
tensor([0, 1, 0, 1, 0, 0, 1, 1])
tensor([1, 0, 1, 1, 0, 1, 0, 0])
tensor([1, 1, 1, 0, 0, 0, 1, 0])
tensor([0, 1, 1, 0, 1, 0, 1, 0])
tensor([0, 1, 0, 0, 0, 1, 1, 1])
tensor([1, 1, 0, 1, 0, 0, 0, 1])
tensor([0, 1, 0, 1, 1, 1, 0, 0])
tensor([1, 0, 1, 1, 0, 0, 0, 1])
tensor([1, 0, 1, 0, 1, 0, 0, 1])
tensor([1, 1, 0, 0, 1, 0, 0, 1])
tensor([1, 0, 0, 1, 1, 0, 0, 1])
tensor([1, 1, 1, 0, 1, 0, 

In [7]:
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity
   
log_dir = "moe_profiling1"
warmup = 1
tracing_schedule = schedule(wait=0, warmup=warmup, active=1)
trace_handler = tensorboard_trace_handler(dir_name=log_dir)
print("starting inf")
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=tracing_schedule,
    on_trace_ready=trace_handler,
    profile_memory=True,
    record_shapes=True,
    with_stack=True,
) as prof:
    # Run the warm-up step(s) followed by one recorded step.
    for _ in range(warmup + 1):
        result = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            streamer=streamer,
            do_sample=True,
            temperature=0.9,
            top_p=0.9,
            min_new_tokens=2,
            max_new_tokens=3,
            pad_token_id=tokenizer.eos_token_id,
            return_dict_in_generate=True,
            output_hidden_states=True,
        )
        # Tell the profiler that one step of the schedule has finished.
        prof.step()

starting inf
defaultdict(<class 'src.expert_cache.EvictionGroupInfo'>, {0: EvictionGroupInfo(main_infos=OrderedDict([((0, 4), ExpertInfo(uid=(0, 4), eviction_group=0, offloaded=False, index=3)), ((0, 5), ExpertInfo(uid=(0, 5), eviction_group=0, offloaded=False, index=0)), ((0, 6), ExpertInfo(uid=(0, 6), eviction_group=0, offloaded=False, index=1)), ((0, 7), ExpertInfo(uid=(0, 7), eviction_group=0, offloaded=False, index=2))]), offloaded_infos=OrderedDict([((0, 0), ExpertInfo(uid=(0, 0), eviction_group=0, offloaded=True, index=0)), ((0, 1), ExpertInfo(uid=(0, 1), eviction_group=0, offloaded=True, index=2)), ((0, 2), ExpertInfo(uid=(0, 2), eviction_group=0, offloaded=True, index=3)), ((0, 3), ExpertInfo(uid=(0, 3), eviction_group=0, offloaded=True, index=1))]), hits=9, misses=7), 1: EvictionGroupInfo(main_infos=OrderedDict([((1, 6), ExpertInfo(uid=(1, 6), eviction_group=1, offloaded=False, index=7)), ((1, 4), ExpertInfo(uid=(1, 4), eviction_group=1, offloaded=False, index=4)), ((1, 5), E

STAGE:2024-04-10 00:41:17 74047:74047 ActivityProfilerController.cpp:314] Completed Stage: Warm Up


defaultdict(<class 'src.expert_cache.EvictionGroupInfo'>, {0: EvictionGroupInfo(main_infos=OrderedDict([((0, 7), ExpertInfo(uid=(0, 7), eviction_group=0, offloaded=False, index=2)), ((0, 1), ExpertInfo(uid=(0, 1), eviction_group=0, offloaded=False, index=1)), ((0, 5), ExpertInfo(uid=(0, 5), eviction_group=0, offloaded=False, index=0)), ((0, 6), ExpertInfo(uid=(0, 6), eviction_group=0, offloaded=False, index=3))]), offloaded_infos=OrderedDict([((0, 0), ExpertInfo(uid=(0, 0), eviction_group=0, offloaded=True, index=0)), ((0, 2), ExpertInfo(uid=(0, 2), eviction_group=0, offloaded=True, index=2)), ((0, 3), ExpertInfo(uid=(0, 3), eviction_group=0, offloaded=True, index=1)), ((0, 4), ExpertInfo(uid=(0, 4), eviction_group=0, offloaded=True, index=3))]), hits=15, misses=12), 1: EvictionGroupInfo(main_infos=OrderedDict([((1, 3), ExpertInfo(uid=(1, 3), eviction_group=1, offloaded=False, index=5)), ((1, 4), ExpertInfo(uid=(1, 4), eviction_group=1, offloaded=False, index=4)), ((1, 5), ExpertInfo(u

STAGE:2024-04-10 00:41:22 74047:74047 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-04-10 00:41:22 74047:74047 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
STAGE:2024-04-10 00:41:38 74047:74047 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-04-10 00:41:38 74047:74047 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-04-10 00:41:38 74047:74047 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
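The two (truncated) cache dumps above include `hits` and `misses` counters for eviction group 0. A hit rate can be derived from them as shown below; `hit_rate` is a hypothetical helper, and only the two visible pairs of counters are used.

```python
def hit_rate(hits: int, misses: int) -> float:
    """Fraction of expert lookups that were served without loading from RAM."""
    total = hits + misses
    return hits / total if total else 0.0


# (hits, misses) for eviction group 0, copied from the two dumps above.
reported = {"first dump": (9, 7), "second dump": (15, 12)}
for label, (hits, misses) in reported.items():
    print(f"{label}: {hit_rate(hits, misses):.1%}")
```

A higher hit rate means fewer expert transfers from RAM to the GPU per generated token.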


In [8]:
!nvidia-smi

Wed Apr 10 00:41:38 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA TITAN RTX               On  |   00000000:1A:00.0 Off |                  N/A |
| 40%   40C    P2             65W /  280W |   12357MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA TITAN RTX               On  |   00

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [9]:
%%capture
%load_ext tensorboard
%tensorboard --logdir moe_profiling1 --port 6006

ModuleNotFoundError: No module named 'tensorboard'
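The error above means the `tensorboard` package is not installed in this environment. A minimal sketch of a guard that could be run before loading the extension is shown below; `is_installed` is a hypothetical helper built on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("tensorboard"):
    print("tensorboard found; the extension can be loaded")
else:
    print("tensorboard missing; install it (for example with pip) and rerun the cell")
```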