<a href="https://colab.research.google.com/github/darthchudi/mistral-inference-google-colab/blob/main/mistral_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook runs inference with [Mistral's 7B model](https://mistral.ai/) on Google Colab.

This is based on [Josh Bickett's notebook](https://colab.research.google.com/github/joshbickett/run-mistral-7b/blob/main/inference.ipynb#scrollTo=5r6GrJzvNWFQ), which runs inference on the [sharded Mistral 7B](https://huggingface.co/filipealmeida/Mistral-7B-Instruct-v0.1-sharded) instruct model.

Previously, I tried running inference directly on the Mistral 7B model on Google Colab but ran out of memory. Josh's notebook loads a sharded version of the Mistral model, so the checkpoint is read in smaller pieces, and applies 4-bit quantization, so the neural network's parameters (weights) take up less memory.
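
To make the memory saving concrete, here is a rough back-of-the-envelope sketch of how much space the weights alone need at different precisions. The parameter count (about 7.24 billion) and the helper name `weight_memory_gib` are assumptions for illustration; the estimate ignores activations, the KV cache, and the small overhead of quantization constants, so real usage will be somewhat higher.

```python
# Rough estimate of weight memory at different precisions.
# ASSUMPTION: ~7.24 billion parameters for Mistral 7B. Activations, the KV cache,
# and quantization constants are ignored, so real usage is higher.
N_PARAMS = 7.24e9


def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the weights need roughly 13.5 GiB in bfloat16 but only about 3.4 GiB at 4 bits per parameter, which is the difference between not fitting and fitting comfortably on a small GPU.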



In [None]:
!pip install git+https://github.com/huggingface/transformers
!pip install -q peft accelerate bitsandbytes safetensors
!pip install sentencepiece


Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-ul9bo_0r
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-ul9bo_0r
  Resolved https://github.com/huggingface/transformers to commit 9ed538f2e67ee10323d96c97284cf83d44f0c507
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "filipealmeida/Mistral-7B-Instruct-v0.1-sharded"

# 4-bit NF4 quantization with double quantization; compute runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Pass the 4-bit setting only through quantization_config: recent transformers
# releases may raise an error if load_in_4bit is also given as a keyword argument.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.bos_token_id = 1  # id of the <s> token in the Mistral vocabulary

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
device = "cuda"

start = time.time()

# Mistral's instruct format wraps the user message in [INST] ... [/INST] tags.
text = "[INST] ~Tell me a short story about Nigeria's history ~ [/INST]"
model_input = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(device)

# Sample up to 1500 new tokens; the decoded string holds the prompt followed by the completion.
generated_ids = model.generate(**model_input, max_new_tokens=1500, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)

elapsed = time.time() - start
print(f"Ran inference in {elapsed} seconds")
print(decoded[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Ran inference in 23.62244415283203 seconds
[INST] ~Tell me a short story about Nigeria's history ~ [/INST] Nigeria has a rich and

Once upon a time, there was a young king in the kingdom of Nigeria, a land prosperous with rich natural resources. He used his power to bring peace and prosperity to his people and to establish a reputation for fairness and righteousness. Despite his young age, the king was respected and loved by all who knew him.

But one day, a neighboring kingdom launched a surprise attack on Nigeria, invading its lands and plundering its treasures. The king worked tirelessly to defend his people, rallying them to face the enemy in battle.

In the end, the king's bravery and leadership led his people to victory, driving the invaders back beyond their borders. His kingdom was left stronger than ever before, with a newfound sense of pride and unity among its people. And his legacy lived on, inspiring generations of sons and daughters to come.</s>
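
The decoded output above still contains the original prompt and the closing `</s>` token. If only the model's reply is needed, a small helper along these lines could strip both. This is a minimal sketch: `extract_completion` is a hypothetical name, and it assumes the prompt ends with `[/INST]` and the end-of-sequence token is rendered as `</s>`, as in the output shown here.

```python
def extract_completion(decoded: str,
                       end_of_instruction: str = "[/INST]",
                       eos: str = "</s>") -> str:
    """Return only the model's reply from a decoded prompt-plus-completion string."""
    # Keep everything after the last closing instruction tag.
    # If the tag is absent, rpartition leaves the whole string in `reply`.
    _, _, reply = decoded.rpartition(end_of_instruction)
    reply = reply.strip()
    # Drop a trailing end-of-sequence marker if generation stopped on it.
    if reply.endswith(eos):
        reply = reply[: -len(eos)]
    return reply.strip()


sample = "[INST] Say hi [/INST] Hello there!</s>"
print(extract_completion(sample))  # prints "Hello there!"
```

In the notebook this would be called as `extract_completion(decoded[0])`. Passing `skip_special_tokens=True` to `tokenizer.batch_decode` is another way to drop `</s>`, though the echoed prompt would still need to be removed.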
