# Mixtral in Colab

Welcome! In this notebook you can run [Mixtral-8x7B-Instruct](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) with decent generation speed **right in Google Colab or on a consumer-grade GPU**. This is made possible by quantizing the original model in mixed precision and implementing a MoE-specific offloading strategy.

To learn more, read our [tech report](https://arxiv.org/abs/2312.17238) or check out the [repo](https://github.com/dvmazur/mixtral-offloading) on GitHub.

You will need approximately 16 GB of VRAM and 11 GB of RAM to run this notebook and generate moderately long texts.
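If you are not sure what your machine offers, a quick check like the one below can help. This is a minimal sketch, not part of the original notebook: the helper names are made up for illustration, the RAM query relies on `os.sysconf` (available on Linux and most Unix systems), and the VRAM query only works when PyTorch with CUDA is installed. It only reports totals and does not decide whether the notebook will fit.

```python
import os


def total_ram_gb():
    """Total physical RAM in GiB, or None if the platform does not expose it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30
    except (AttributeError, ValueError, OSError):
        return None


def total_vram_gb():
    """Total memory of GPU 0 in GiB, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 2**30


def resource_report():
    """Collect both totals in one dictionary (hypothetical helper)."""
    return {"ram_gb": total_ram_gb(), "vram_gb": total_vram_gb()}


info = resource_report()
for name, value in info.items():
    shown = "unknown" if value is None else f"{value:.1f} GiB"
    print(f"{name}: {shown}")
```

Compare the printed totals with the approximate figures above before loading the model.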


<details>

<summary>How to balance between RAM and GPU VRAM usage</summary>

You can balance between RAM and GPU VRAM usage by changing the <code>offload_per_layer</code> variable in the <a href="#scrollTo=_mIpePTMFyRY&line=10&uniqifier=1">Initialize model</a> section. Increasing <code>offload_per_layer</code> will decrease GPU VRAM usage, increase RAM usage and decrease generation speed. Decreasing <code>offload_per_layer</code> will have the opposite effect.

Note that this notebook should run normally in Google Colab with <code>offload_per_layer = 4</code>, but may crash with other values. However, if you run this somewhere else, you're free to play with this variable.
</details>
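To make the trade-off concrete, here is a minimal sketch of how `offload_per_layer` splits the experts between GPU and RAM. It mirrors the `main_size` and `offload_size` expressions used in the Initialize model section and assumes Mixtral-8x7B's 32 layers with 8 experts each. The helper name `expert_split` is ours and is not part of the repo.

```python
NUM_LAYERS = 32   # assumed value of config.num_hidden_layers for Mixtral-8x7B
NUM_EXPERTS = 8   # assumed value of config.num_local_experts for Mixtral-8x7B


def expert_split(offload_per_layer, num_layers=NUM_LAYERS, num_experts=NUM_EXPERTS):
    """Return (experts kept on the GPU, experts offloaded to RAM)."""
    if not 0 <= offload_per_layer <= num_experts:
        raise ValueError("offload_per_layer must be between 0 and num_experts")
    on_gpu = num_layers * (num_experts - offload_per_layer)  # same as main_size
    in_ram = num_layers * offload_per_layer                  # same as offload_size
    return on_gpu, in_ram


for k in (3, 4, 5):
    on_gpu, in_ram = expert_split(k)
    print(f"offload_per_layer={k}: {on_gpu} experts on GPU, {in_ram} in RAM")
```

Each step of `offload_per_layer` moves 32 experts (one per layer) from the GPU pool to RAM, which is why VRAM usage drops while RAM usage grows.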

## Install and import libraries

In [1]:
# fix numpy in colab
import numpy
from IPython.display import clear_output

# fix triton in colab
# note: `!export` runs in a throwaway subshell, so %env is used to make the variables persist
%env LC_ALL=en_US.UTF-8
%env LD_LIBRARY_PATH=/usr/lib64-nvidia
%env LIBRARY_PATH=/usr/local/cuda/lib64/stubs
!ldconfig /usr/lib64-nvidia

!git clone https://github.com/dvmazur/mixtral-offloading.git --quiet
!cd mixtral-offloading && pip install -q -r requirements.txt
!huggingface-cli download lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo --quiet --local-dir Mixtral-8x7B-Instruct-v0.1-offloading-demo

clear_output()

In [2]:
import sys

sys.path.append("mixtral-offloading")
import torch
from torch.nn import functional as F
from hqq.core.quantize import BaseQuantizeConfig
from huggingface_hub import snapshot_download
from IPython.display import clear_output
from tqdm.auto import trange
from transformers import AutoConfig, AutoTokenizer
from transformers.utils import logging as hf_logging

from src.build_model import OffloadConfig, QuantConfig, build_model

hqq_aten package not installed. HQQBackend.ATEN backend will not work unless you install the hqq_aten lib in hqq/kernels.


The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.



## Initialize model

In [3]:
MODEL_NAME = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_name = "lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo"
state_path = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"

In [5]:
config = AutoConfig.from_pretrained(quantized_model_name)
device = torch.device("cuda:0")
offload_per_layer = 5  # experts per layer kept in RAM instead of on the GPU
num_experts = config.num_local_experts

In [6]:
offload_config = OffloadConfig(
    main_size=config.num_hidden_layers * (num_experts - offload_per_layer),
    offload_size=config.num_hidden_layers * offload_per_layer,
    buffer_size=4,
    offload_per_layer=offload_per_layer
)
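For intuition about what the GPU-resident pool buys you, here is a toy least-recently-used (LRU) expert cache, the general idea behind MoE offloading as described in the tech report linked at the top. This is an illustrative sketch only: `ToyExpertCache` and its methods are hypothetical names, it tracks expert ids rather than moving real weights, and it is not how `OffloadConfig` or the repo's cache is implemented.

```python
from collections import OrderedDict


class ToyExpertCache:
    """Tracks which expert ids are 'on the GPU' using an LRU eviction policy."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._on_gpu = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        """Request an expert; return 'hit' if it was resident, else 'miss'."""
        if expert_id in self._on_gpu:
            self._on_gpu.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return "hit"
        self.misses += 1
        if len(self._on_gpu) >= self.capacity:
            self._on_gpu.popitem(last=False)  # evict the least recently used
        self._on_gpu[expert_id] = True  # stands in for a RAM -> GPU copy
        return "miss"

    def resident(self):
        """Expert ids currently resident, from least to most recently used."""
        return list(self._on_gpu)


cache = ToyExpertCache(capacity=2)
trace = [cache.fetch(e) for e in (0, 1, 0, 2, 1)]
print(trace, cache.resident())
```

A miss is the expensive case, since it stands for copying an expert from RAM to the GPU. Keeping more experts resident (a smaller `offload_per_layer`) makes misses rarer, which matches the speed trade-off described in the introduction.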

In [7]:
attn_config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    quant_zero=True,
    quant_scale=True
)
attn_config["scale_quant_params"]["group_size"] = 256

In [9]:
ffn_config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    quant_zero=True,
    quant_scale=True
)
quant_config = QuantConfig(ffn_config=ffn_config, attn_config=attn_config)
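As a rough sanity check on these settings, the sketch below estimates the effective number of bits stored per weight. It assumes that, with `quant_zero=True` and `quant_scale=True`, each group carries one 8-bit scale and one 8-bit zero point (our reading of HQQ, not something this notebook states), and it ignores the second-level metadata. The Mixtral dimensions (hidden size 4096, FFN size 14336, three weight matrices per expert) are also filled in by us.

```python
def effective_bits(nbits, group_size, meta_bits_per_group=16):
    """Bits per weight, counting an assumed 8-bit scale + 8-bit zero per group."""
    return nbits + meta_bits_per_group / group_size


ffn_bits = effective_bits(nbits=2, group_size=16)   # expert (FFN) weights
attn_bits = effective_bits(nbits=4, group_size=64)  # attention weights

# Assumed Mixtral-8x7B expert shape: three 4096 x 14336 matrices.
params_per_expert = 3 * 4096 * 14336
expert_bytes = params_per_expert * ffn_bits / 8

print(f"FFN: {ffn_bits} bits/weight, attention: {attn_bits} bits/weight")
print(f"one expert: about {expert_bytes / 2**20:.0f} MiB")
```

Under these assumptions one expert takes about 63 MiB, so the 160 experts offloaded with `offload_per_layer = 5` would need roughly 9.8 GiB of RAM. Treat this as an estimate only; the real footprint also includes buffers and the rest of the model.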

In [11]:
model = build_model(
    device=device,
    quant_config=quant_config,
    offload_config=offload_config,
    state_path=state_path
)

config.json:   0%|          | 0.00/720 [00:00<?, ?B/s]



Loading experts:   0%|          | 0/32 [00:00<?, ?it/s]

## Run the model

In [20]:
from transformers import TextStreamer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
past_key_values = None
sequence = None

seq_len = 0
while True:
    print("User: ", end="")
    user_input = input()
    if user_input == "end":
        break
    print("\n")
    
    user_entry = dict(role="user", content=user_input)
    input_ids = tokenizer.apply_chat_template([user_entry], return_tensors="pt").to(device)
    
    if past_key_values is None:
        attention_mask = torch.ones_like(input_ids)
    else:
        seq_len = input_ids.size(1) + past_key_values[0][0][0].size(1)
        attention_mask = torch.ones([1, seq_len-1], dtype=torch.int, device=device)
    
    print("Mixtral: ", end="")
    result = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        past_key_values=past_key_values,
        streamer=streamer,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id,
        return_dict_in_generate=True,
        output_hidden_states=True,
    )
    print("\n")

    sequence = result["sequences"]
    past_key_values = result["past_key_values"]

User: 

 show me the most effective way to manage time for better academic success




Mixtral: Time management is a crucial skill for academic success. Here are some strategies that can help you make the most of your time and improve your academic performance:

1. Create a schedule: Plan out your days, weeks, and even months in advance. Make a timetable that includes classes, study time, breaks, and extracurricular activities. Make sure to allocate time for each task and prioritize your most important tasks.
2. Set clear goals: Set specific, measurable, achievable, relevant, and time-bound (SMART) goals for yourself. This will help you stay focused and motivated, and will enable you to track your progress.
3. Use a planner: Use a planner to keep track of your assignments, deadlines, and appointments. This will help you stay organized and reduce the chances of forgetting important tasks.
4. Prioritize your tasks: Prioritize your tasks based on their importance and urgency. Focus on completing your most important tasks first, and then move on to less important tasks.
5.

 how to personal branding




Mixtral: Personal branding is the process of creating a distinct and recognizable image and reputation for yourself, highlighting your strengths, values, and personality. Here are some steps to help you develop your personal brand:

1. Define your brand: Start by defining who you are, what you stand for, and what makes you unique. Identify your strengths, skills, values, and passions. Consider what makes you different from others in your field and what you want to be known for.
2. Know your audience: Understand who your target audience is and what they are looking for. Consider their needs, interests, and values. Tailor your personal brand to resonate with them and stand out from the competition.
3. Establish a consistent image: Create a consistent visual and messaging style across all your online and offline platforms. Use the same color scheme, font, and language to create a cohesive and recognizable brand.
4. Leverage social media: Use social media platforms to showcase your exper

 how to become a content creator




Mixtral: Content creation is a popular and in-demand career path, with many opportunities available for those who are passionate, skilled, and persistent. Here are some steps to help you become a content creator:

1. Identify your niche: Choose a topic or theme that you are passionate about and knowledgeable in. Consider your audience and what they are looking for. Narrow down your focus to a specific niche, where you can establish yourself as an expert and attract a loyal following.
2. Develop your voice and style: Establish a unique and recognizable voice and style that reflects your personality, values, and expertise. Develop your own style, tone, and approach to content creation.
3. Build your skills: Learn the basics of writing, editing, and visual storytelling. Familiarize yourself with the tools, platforms, and techniques used in content creation. Practice and refine your skills through regular writing, photography, or videography.
4. Create a portfolio: Build a portfolio of y

 end
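The chat loop above samples with `temperature=0.8` and `top_p=0.9`. For readers unfamiliar with these knobs, here is a plain-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It illustrates the general technique on a toy list of logits and is not the `transformers` implementation; the function names `nucleus` and `top_p_sample` are hypothetical.

```python
import math
import random


def nucleus(logits, temperature=0.8, top_p=0.9):
    """Return (kept token ids, their probabilities) after top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept, [probs[i] for i in kept]


def top_p_sample(logits, temperature=0.8, top_p=0.9, rng=None):
    """Draw one token id from the filtered distribution."""
    rng = rng or random
    kept, weights = nucleus(logits, temperature, top_p)
    return rng.choices(kept, weights=weights, k=1)[0]  # weights are renormalized


toy_logits = [2.0, 1.0, 0.0, -5.0]
kept_ids, _ = nucleus(toy_logits, temperature=1.0, top_p=0.9)
print(kept_ids, top_p_sample(toy_logits, rng=random.Random(0)))
```

Lower temperatures sharpen the distribution, so fewer tokens are needed to reach the `top_p` mass and the output becomes more predictable.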


In [None]:
# One chat turn outside the loop: the same steps as the loop body above,
# with the prompt read explicitly so the cell no longer depends on the enclosing loop.
user_input = input("User: ")
print("\n")

user_entry = dict(role="user", content=user_input)
input_ids = tokenizer.apply_chat_template([user_entry], return_tensors="pt").to(device)

if past_key_values is None:
    attention_mask = torch.ones_like(input_ids)
else:
    seq_len = input_ids.size(1) + past_key_values[0][0][0].size(1)
    attention_mask = torch.ones([1, seq_len - 1], dtype=torch.int, device=device)

print("Mixtral: ", end="")
result = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    past_key_values=past_key_values,
    streamer=streamer,
    do_sample=True,
    temperature=0.9,
    top_p=0.9,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id,
    return_dict_in_generate=True,
    output_hidden_states=True,
)
print("\n")

sequence = result["sequences"]
past_key_values = result["past_key_values"]

User: Write a funny poem about Python, please





Mixtral: There once was a language so bright,

named Python, a helpful little wight.

It slithered through code with ease,

No bug stood a chance, to say the least.



Its syntax so clean, and easy on the eyes,

Programmers from far and wide would shout their surprise.

But unlike its namesake, it's not sneaky or sly,

It's open source, and free to the sky!



With libraries so vast, it's a data geek's dream,

From AI to web scraping, it's the ultimate scheme.

It's the Swiss Army knife of coding, or perhaps a black belt,

In the world of programming, it's the ultimate feel.



So, whether you're a beginner or a seasoned coder,

Python's the language that will make you feel higher.

With a community so welcoming and a syntax so neat,

It's no wonder that Python is simply hard to beat!



So here's to the snake, in programming so great,

May it continue to dominate, until a much later date.

In the world of technology, it's here to stay