<a href="https://www.kaggle.com/code/ashishkumarak/mixtral-moe-8x7b-instruct-inference-t4-2-gpu?scriptVersionId=174489589" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## About the Model : Mistral Mixture of Experts (MoE)

<p><img src="https://www.unite.ai/wp-content/uploads/2023/12/DALL%C2%B7E-2023-12-11-19.51.47-Design-a-banner-with-a-white-background-intended-to-showcase-pixelated-letters-_M-8x7_-in-a-bold-font-style.-The-letters-should-feature-a-gradient-ef.png" height="700" width="700" style="object-fit: cover;"></p>
<p style="text-align:justify;">
</p>  

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 10px;
   font-size : 120%;
    background-color: #FEF2EF;
    border-radius: 3px;
    font-color :  #581845  ;        
    border: 1px solid #FF5733 ;">  
    
Mixtral 8x7B, often referred to as a "miniature GPT-4," is a Mixture of Experts (MoE) model with eight individual experts and roughly 47B total parameters. This architectural decision is noteworthy because only two experts participate in the inference process for each token, so only a fraction of the parameters (about 13B) is active per token, signifying a move towards more streamlined and targeted AI processing.

A standout feature of Mixtral is its capacity to handle an extensive context of 32,000 tokens, offering a wide-ranging scope for tackling intricate tasks. Moreover, the model's multilingual support extends to English, French, Italian, German, and Spanish, making it a versatile tool for a diverse global developer community.
    
Read more about it here : https://www.unite.ai/mistral-ais-latest-mixture-of-experts-moe-8x7b-model/
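
To make the "two experts per token" idea concrete, here is a minimal sketch of top-2 routing in plain Python. It is an illustration under stated assumptions, not Mixtral's actual implementation: the experts are toy scalar functions, the router scores are made up, and the names `top_k_gate` and `moe_forward` are hypothetical.

```python
import math

def top_k_gate(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalise their scores."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(weight * experts[i](x) for i, weight in top_k_gate(router_logits, k))

# Eight toy "experts": expert i simply multiplies its input by (i + 1)
experts = [lambda x, scale=i + 1: x * scale for i in range(8)]
# Hypothetical router scores for a single token
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

gates = top_k_gate(router_logits)
output = moe_forward(1.0, experts, router_logits)
print(gates)
print(output)
```

The other six experts are never evaluated for this token, which is where the compute saving comes from.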

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 10px;
   font-size : 120%;
    background-color: #FFFFFF;
    border-radius: 20px;
    font-color :  #581845  ;        
    border: 4px solid #000000 ;"> 
    
Don't forget to upvote 👆 if you find it useful :) 

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 10px;
   font-size : 120%;
    background-color: #FEF2EF;
    border-radius: 2px;
    font-color :  #581845  ;        
    border: 1px solid #FF5733 ;"> 
    
- Install the relevant libraries directly from their GitHub repositories. [Note: some methods, e.g. `merge_and_unload()` for 4-bit PEFT adapters, are still in beta; every feature that is already in a released version behaves the same way.]

In [1]:
%%capture
!pip install git+https://github.com/huggingface/transformers.git  -U 
!pip install git+https://github.com/huggingface/accelerate.git  -U 
!pip install bitsandbytes 
!pip install git+https://github.com/huggingface/peft.git  -U 
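
Because the cell above installs development builds, it can help to record which versions were actually picked up. The following is a small sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

print(installed_versions(["transformers", "accelerate", "bitsandbytes", "peft"]))
```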

## Quantization

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 10px;
   font-size : 120%;
    background-color: #FEF2EF;
    border-radius: 2px;
    font-color :  #581845  ;        
    border: 2px solid #FF5733 ;"> 
    
- If you're new to quantization, start by reading [this](https://en.wikibooks.org/wiki/A-level_Computing/AQA/Paper_2/Fundamentals_of_data_representation/Floating_point_numbers#:~:text=In%20decimal%2C%20very%20large%20numbers,be%20used%20for%20binary%20numbers.) page.
    
- The quantization method is based on the QLoRA paper, whose abstract reads:
    >*We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters~(LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training*
    
- Bitsandbytes makes it easy to quantize and finetune the model with lower memory requirements
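
As a rough intuition for what 4-bit quantization does, the sketch below quantizes one block of weights with a shared scale (absmax quantization). This is a simplification: it uses evenly spaced integer levels, not the NF4 data type from the QLoRA paper, and the function names are hypothetical rather than bitsandbytes APIs.

```python
def quantize_block(values, bits=4):
    """Map floats to signed integer codes that share one scale (the block's absmax)."""
    levels = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit codes
    absmax = max(abs(v) for v in values) or 1.0   # avoid division by zero for an all-zero block
    codes = [round(v / absmax * levels) for v in values]
    return codes, absmax

def dequantize_block(codes, absmax, bits=4):
    """Recover approximate floats from the integer codes and the stored scale."""
    levels = 2 ** (bits - 1) - 1
    return [c / levels * absmax for c in codes]

weights = [0.12, -0.5, 0.33, 0.0]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
print(codes)
print(restored)
```

Only the small integer codes and one scale per block need to be stored, at the cost of a bounded rounding error per weight.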

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from peft import PeftModel, PeftConfig


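bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store the weights in 4 bits
    bnb_4bit_quant_type="nf4",               # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.float16,    # dtype used for the matrix multiplications
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
    llm_int8_enable_fp32_cpu_offload=True,   # allow offloaded modules to stay on CPU in fp32
)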

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 10px;
   font-size : 120%;
    background-color: #FEF2EF;
    border-radius: 2px;
    font-color :  #581845  ;        
    border: 2px solid #FF5733 ;"> 
    
- `load_in_4bit` : bitsandbytes stores the model weights in 4 bits
- `bnb_4bit_quant_type= "nf4"` : 4-bit NormalFloat (NF4); the QLoRA paper recommends NF4 quantization for better performance
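- `bnb_4bit_compute_dtype= torch.float16` : the dtype used for computation once the 4-bit weights are dequantized. The figure below compares float32, float16 and bfloat16:
<p><img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Three_floating-point_formats.max-700x700.png" height="600" width="600" style="object-fit: cover;"></p>
<p style="text-align:justify;">
</p>  
bfloat16 has the same dynamic range as float32 while taking up half the memory. float16 also takes half the memory but has a narrower range; it is the dtype used in this notebook, which suits the T4 GPUs since they lack native bfloat16 support.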
- `bnb_4bit_use_double_quant= True` :  Uses a second quantization after the first one to save an additional 0.4 bits per parameter
- `llm_int8_enable_fp32_cpu_offload = True` : If you want to split your model in different parts and run some parts in int8 on GPU and some parts in fp32 on CPU, you can use this flag. This is useful for offloading large models such as google/flan-t5-xxl. Note that the int8 operations will not be run on CPU.
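
The "0.4 bits per parameter" figure can be reproduced with a little arithmetic. The sketch below assumes the block sizes described in the QLoRA paper (one 32-bit constant per 64 weights, and with double quantization 8-bit constants plus one 32-bit constant per 256 of them), and it assumes a total of 46.7B parameters for Mixtral 8x7B. Both are assumptions for illustration, and the function names are hypothetical.

```python
def bits_per_parameter(weight_bits=4, block_size=64, constant_bits=32,
                       double_quant=False, dq_block_size=256, dq_constant_bits=8):
    """Average storage per weight, including the per-block quantization constants."""
    if not double_quant:
        return weight_bits + constant_bits / block_size
    # With double quantization the constants are stored in 8 bits,
    # plus one 32-bit constant for every dq_block_size of them.
    return (weight_bits
            + dq_constant_bits / block_size
            + constant_bits / (block_size * dq_block_size))

plain = bits_per_parameter()
double = bits_per_parameter(double_quant=True)
saving = plain - double

ASSUMED_PARAMS = 46.7e9  # assumed total parameter count, not taken from this notebook

def weights_gb(bits):
    """Approximate size of the quantized weights in GB."""
    return ASSUMED_PARAMS * bits / 8 / 1e9

print(f"bits/param without double quant: {plain}")
print(f"bits/param with double quant:    {double:.3f}")
print(f"saving per parameter:            {saving:.3f} bits")
print(f"estimated weight size:           {weights_gb(double):.1f} GB")
```

This is only a rough estimate: some layers are typically kept in 16-bit, and activations and the key-value cache need memory on top of the weights.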


In [3]:
torch.cuda.empty_cache()

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 10px;
   font-size : 120%;
    background-color: #FEF2EF;
    border-radius: 2px;
    font-color :  #581845  ;        
    border: 2px solid #FF5733 ;"> 
    
`torch.cuda.empty_cache()` : releases all unoccupied cached memory currently held by the caching allocator, so that it can be used by other GPU applications and shows up as free in `nvidia-smi`.

In [4]:
import gc
gc.collect()

30

## Get the Model

In [5]:
model = AutoModelForCausalLM.from_pretrained(
    '/kaggle/input/mixtral/pytorch/8x7b-instruct-v0.1-hf/1',
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 10px;
   font-size : 120%;
    background-color: #FEF2EF;
    border-radius: 2px;
    font-color :  #581845  ;        
    border: 2px solid #FF5733 ;"> 
    
- `device_map="auto"` : lets accelerate infer automatically how to place the model across the available devices. Manually setting a device once the model has been loaded with `device_map` is not recommended, so avoid any device assignment call on the model or on any of its submodules after that line, unless you know what you are doing
- Set `trust_remote_code=True` to use a model with custom code 
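
To see where the layers ended up, models loaded with a `device_map` expose the resulting placement as `model.hf_device_map`, a dictionary from module name to device. The helper below is a sketch that counts modules per device; `summarize_device_map` is a hypothetical name and the example map is made up so that the snippet runs without loading the model.

```python
from collections import Counter

def summarize_device_map(device_map):
    """Count how many modules were placed on each device."""
    return dict(Counter(str(device) for device in device_map.values()))

# Made-up placement for illustration; in the notebook you would pass model.hf_device_map
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "lm_head": "cpu",
}
print(summarize_device_map(example_map))
```

Any modules reported on `cpu` or `disk` were offloaded and will slow down inference.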

## Get the tokenizer

In [6]:
from transformers import pipeline, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('/kaggle/input/mixtral/pytorch/8x7b-instruct-v0.1-hf/1', trust_remote_code=True)

2024-04-28 16:07:06.694379: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-28 16:07:06.694478: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-28 16:07:06.847619: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [7]:
import gc
gc.collect()

66

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 10px;
   font-size : 120%;
    background-color: #FEF2EF;
    border-radius: 2px;
    font-color :  #581845  ;        
    border: 2px solid #FF5733 ;"> 

- Python allocates and frees memory automatically. The user does not have to preallocate or deallocate memory.
- Calling the garbage collector manually (with `gc.collect()`) during the execution of a program can help reclaim memory held by reference cycles, which reference counting alone cannot free.
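
The number printed after `gc.collect()` in the cells above is its return value: the number of unreachable objects the collector found. The following self-contained sketch builds a small reference cycle and shows that only the cycle collector can reclaim it; the `Node` class is a hypothetical example.

```python
import gc

class Node:
    """Tiny object that can point at another Node."""
    def __init__(self):
        self.other = None

def make_cycle():
    # a and b reference each other, so their reference counts never drop to zero
    a, b = Node(), Node()
    a.other, b.other = b, a

gc.collect()           # clear any garbage left over from earlier code
make_cycle()           # the two nodes are now unreachable but still alive
freed = gc.collect()   # the cycle collector finds and frees them
print(freed)
```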

## Model Inference

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 10px;
   font-size : 120%;
    background-color: #FEF2EF;
    border-radius: 2px;
    font-color :  #581845  ;        
    border: 2px solid #FF5733 ;"> 

#### Creating a prompt template along with the input to pass to the model
- Here I pass an abstract to the model and ask it to create catchy titles that maximise click appeal. The prompt follows the format expected by the Mistral models (also described on [this](https://www.kaggle.com/models/mistral-ai/mixtral/frameworks/PyTorch/variations/8x7b-instruct-v0.1-hf/versions/1) page)

In [8]:
abstract = 'Diabetes is now one of the major public health challenges, globally. Prolonged diabetes leads to various diabetic microvascular complications (DMCs) like retinopathy, nephropathy, and neuropathy. Multiple factors are likely to be involved in predisposing diabetic individuals to complications. Early detection or diagnosis is essential in developing strategies to reduce the risk factors and management costs of these diabetic complications. In this study, we employed Raman Spectroscopy (RS) to analyse the plasma samples of diabetes patients without and with DMCs along with the plasma samples of healthy subjects. Spectral comparisons revealed decrease in protein content in Diabetes group and further subsequent decrease in proteins in DMC groups when compared with control group, which corroborates with the fact that there exists increased secretion of proteins in urine and corresponding decreased protein content in their blood in case of diabetic individuals. Among all study groups, it was noted that 75% of control spectra show correct classification, while spectral misclassification is high amongst the subjects with Diabetes and DMCs. Interestingly, very few Diabetes and DMC plasma spectra are misclassified as control spectra. Findings demonstrate that 70% of the Diabetes subjects without complications can be correctly identified from diabetes with complications. Further, investigations could also attempt to explore the use of serum instead of plasma to reduce the spectral misclassifications as one of the abundant constituents namely clotting factors could be avoided. The outcome of RS study may be imminent for the early detection or diagnosis of DMCs.'

example = {'instruction' : 'Write 5 catchy title each not more than 15 words for the following paper abstract. The title should be a single sentence that accurately captures what you have done and sounds interesting to the people who work on the same or a similar topic. The title should contain the important title keywords that other researchers use when looking for literature in databases. The title should also use synonyms, broader terms, or abstractive keywords to make it more appealing and informative. Do not use words that are not related to the paper extract or the topic',
    'input' : abstract }
def formatting_func(example):
    text = f"<s>[INST]Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.{example['instruction']}\n\n### Input:{example['input']}[/INST]"
    return text
eval_prompt = formatting_func(example)
print(eval_prompt)

<s>[INST]Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.Write 5 catchy title each not more than 15 words for the following paper abstract. The title should be a single sentence that accurately captures what you have done and sounds interesting to the people who work on the same or a similar topic. The title should contain the important title keywords that other researchers use when looking for literature in databases. The title should also use synonyms, broader terms, or abstractive keywords to make it more appealing and informative. Do not use words that are not related to the paper extract or the topic

### Input:Diabetes is now one of the major public health challenges, globally. Prolonged diabetes leads to various diabetic microvascular complications (DMCs) like retinopathy, nephropathy, and neuropathy. Multiple factors are likely to be involved in predisposing diabetic ind
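
The same `[INST] ... [/INST]` convention extends to multi-turn conversations. Below is a minimal sketch of a prompt builder that follows the layout given on the Mixtral Instruct model card; the function name `build_prompt` and the example messages are hypothetical, and the exact whitespace is an assumption worth checking against the tokenizer's own chat template.

```python
def build_prompt(turns):
    """Build an instruct prompt from (user_message, assistant_reply_or_None) pairs."""
    prompt = "<s>"
    for user_msg, assistant_msg in turns:
        prompt += f" [INST] {user_msg} [/INST]"
        if assistant_msg is not None:
            # A finished assistant reply is closed with the end-of-sequence token
            prompt += f" {assistant_msg}</s>"
    return prompt

conversation = [
    ("Suggest a title for my abstract.", "Raman Spectroscopy for Early Detection"),
    ("Make it shorter.", None),  # the model is expected to answer this last turn
]
print(build_prompt(conversation))
```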

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 10px;
   font-size : 120%;
    background-color: #FEF2EF;
    border-radius: 2px;
    font-color :  #581845  ;        
    border: 2px solid #FF5733 ;"> 

#### To filter out any unnecessary warnings

In [9]:
import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=UserWarning, module="transformers")

In [10]:
from datetime import datetime
import re

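def model_seq_gen(model):
    # Build a text-generation pipeline around the already loaded model and tokenizer
    pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
    start = datetime.now()
    sequences = pipe(
        eval_prompt,
        do_sample=True,       # sample instead of greedy decoding
        max_new_tokens=200,   # upper bound on the number of generated tokens
        temperature=0.7,
        top_p=0.95,
    )
    # Keep only the text after the [/INST] tag and strip quote characters
    extracted_title = re.sub(r'[\'"]', '', sequences[0]['generated_text'].split("[/INST]")[1])
    stop = datetime.now()
    time_taken = stop - start
    print(f"Execution Time : {time_taken}")
    return extracted_title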

extracted_title = model_seq_gen(model)
print(extracted_title)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Execution Time : 0:00:48.228708
 1. Raman Spectroscopy for Early Detection of Diabetic Microvascular Complications: A Comparative Study
2. Decreased Protein Content in Diabetes and Diabetic Microvascular Complications: A Raman Spectroscopy Analysis
3. Raman Spectroscopy as a Tool for Predicting Diabetic Microvascular Complications: An Exploration
4. Spectral Comparisons of Plasma Samples Reveal Differences in Diabetes and Diabetic Microvascular Complications
5. The Role of Raman Spectroscopy in Identifying Diabetes and Diabetic Microvascular Complications: A Study of Plasma Samples
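
The model returns the five titles as one numbered block of text. If you need them as a Python list, a small regular expression is enough. This is a sketch that assumes the output keeps the `1. ...` numbering shown above; `parse_numbered_titles` is a hypothetical helper and the sample is copied from the first two lines of that output.

```python
import re

def parse_numbered_titles(text):
    """Extract the title from every line that looks like '<number>. <title>'."""
    titles = []
    for line in text.splitlines():
        match = re.match(r"\s*\d+\.\s+(.*\S)", line)
        if match:
            titles.append(match.group(1))
    return titles

sample = """ 1. Raman Spectroscopy for Early Detection of Diabetic Microvascular Complications: A Comparative Study
2. Decreased Protein Content in Diabetes and Diabetic Microvascular Complications: A Raman Spectroscopy Analysis"""

titles = parse_numbered_titles(sample)
print(titles)
```

Since sampling is enabled, the model may occasionally format its answer differently, so it is worth checking the length of the returned list.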


<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 10px;
   font-size : 120%;
    background-color: #FEF2EF;
    border-radius: 2px;
    font-color :  #581845  ;        
    border: 2px solid #FF5733 ;"> 

- `do_sample=True` : `model.generate()` samples the next token from the predicted distribution instead of always picking the most likely one (greedy decoding)
- `max_new_tokens=200` : maximum number of new tokens to generate, not counting the prompt
- `temperature=0.7` : values below 1 sharpen the distribution, raising the probability of likely tokens and lowering that of unlikely ones : <p><img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*41TqaBXrhIGU2V1JCEzU5Q.png" height="600" width="600" style="object-fit: cover;"></p>
<p style="text-align:justify;">
</p>   In the figure, t is the temperature value. At t=0.5, the most probable words such as "i", "yeah" and "me" have a higher chance of being generated. At the same time, the less probable words become less likely, although this does not stop them from occurring.
- `top_p=0.95` :  Instead of considering all possible next words, top-p sampling only considers the smallest set of top words whose cumulative probability exceeds a certain threshold, p. A higher value of p means more words are considered, leading to more randomness in the generated text.
 <p><img src="https://api.wandb.ai/files/darek/images/projects/37727390/20e4f024.png" height="700" width="700" style="object-fit: cover;"></p>
<p style="text-align:justify;">

Learn more about it [here](https://towardsdatascience.com/decoding-strategies-that-you-need-to-know-for-response-generation-ba95ee0faadc)
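
The two sampling knobs can be sketched in a few lines of plain Python. This is an illustration of the idea rather than the code path used by `model.generate()`; the logits are made up and the function names are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperatures sharpen the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalise so the surviving probabilities sum to 1
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
warm = softmax_with_temperature(logits, temperature=1.0)
cool = softmax_with_temperature(logits, temperature=0.5)
nucleus = top_p_filter(warm, top_p=0.9)
print(warm, cool, nucleus)
```

With the lower temperature the top token takes a larger share of the probability mass, and with `top_p=0.9` the least likely token is dropped before sampling.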

In [11]:
!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Sun Apr 28 16:08:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   70C    P0              32W /  70W |  10929MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:00:0

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 10px;
   font-size : 120%;
    background-color: #FEF2EF;
    border-radius: 2px;
    font-color :  #581845  ;        
    border: 2px solid #FF5733 ;"> 

- On 2×T4 GPUs, almost 25 GB of the 30 GB of VRAM is occupied
- Inference fits comfortably in the remaining memory: generating the five titles above took about 48 seconds.
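
To turn the `nvidia-smi` table into numbers, the memory column can be pulled out with a regular expression. This is a sketch under the assumption that the table keeps the `usedMiB / totalMiB` layout shown above; `parse_memory_usage` is a hypothetical helper and the sample row is copied from the GPU 0 line of that output.

```python
import re

def parse_memory_usage(smi_text):
    """Return (used_mib, total_mib) pairs, one per GPU row found in nvidia-smi text."""
    pairs = re.findall(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_text)
    return [(int(used), int(total)) for used, total in pairs]

row = "| N/A   70C    P0              32W /  70W |  10929MiB / 15360MiB |      0%      Default |"
usage = parse_memory_usage(row)
print(usage)
```

Summing the first element of each pair over both GPUs gives the total VRAM in use.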

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 10px;
   font-size : 120%;
    background-color: #FEF2EF;
    border-radius: 2px;
    font-color :  #581845  ;        
    border: 2px solid #FF5733 ;"> 
    
    
### References : 

1. https://huggingface.co/docs/transformers/en/main_classes/quantization
2. https://huggingface.co/blog/4bit-transformers-bitsandbytes
3. https://huggingface.co/docs/transformers/en/custom_models
4. https://www.geeksforgeeks.org/garbage-collection-python/
5. https://www.kaggle.com/code/sangeek/pynvml-module-to-identify-and-monitor-gpu-usage
6. https://towardsdatascience.com/decoding-strategies-that-you-need-to-know-for-response-generation-ba95ee0faadc
7. https://wandb.ai/darek/llmapps/reports/A-Gentle-Introduction-to-LLM-APIs--Vmlldzo0NjM0MTMz
