# Fine-Tuning Open-Source LLM using QLoRA with MLflow and PEFT

## Overview

Many powerful open-source LLMs have emerged and are easily accessible. However, they are not designed to be deployed to your production environment out-of-the-box; instead, you have to **fine-tune** them for your specific tasks, such as a chatbot, content generation, etc. One challenge, though, is that training LLMs is usually very expensive. Even if your dataset for fine-tuning is small, the backpropagation step needs to compute gradients for billions of parameters. For example, fully fine-tuning the Llama7B model requires 112GB of VRAM, i.e. at least two 80GB A100 GPUs. Fortunately, there are many research efforts on how to reduce the cost of LLM fine-tuning.
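
The 112GB figure matches a common back-of-the-envelope estimate. The sketch below assumes full-precision training with Adam at 16 bytes per parameter (4 for the weights, 4 for the gradients, 8 for the two optimizer moment buffers) and ignores activations. This breakdown is an assumption used for illustration and is not stated in the tutorial itself.

```python
def full_finetune_vram_gb(num_params, bytes_per_param=16):
    """Rough VRAM estimate (decimal GB) for full fine-tuning, excluding activations."""
    return num_params * bytes_per_param / 1e9


# 7e9 is an assumed round parameter count for a "7B" model.
estimate = full_finetune_vram_gb(7e9)
print(f"~{estimate:.0f} GB")  # ~112 GB
```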

In this tutorial, we will demonstrate how to build a powerful **text-to-Python** generator by fine-tuning the Mistral 7B model with **a single 24GB VRAM GPU**.

### What You Will Learn
1. Hands-on learning of the typical LLM fine-tuning process.
2. Understand how to use **QLoRA** and **PEFT** to overcome the GPU memory limitation for fine-tuning.
3. Manage the model training cycle using **MLflow** to log the model artifacts, hyperparameters, metrics, and prompts.
4. Save the prompt template and inference parameters (e.g. `max_token_length`) in MLflow to simplify the prediction interface.

### Key Actors
In this tutorial, you will learn about the techniques and methods behind efficient LLM fine-tuning by actually running the code. Each cell is explained in more detail below, but let's start with a brief preview of the main libraries and methods used in this tutorial.

* [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) model is a pretrained text-generation model with 7 billion parameters, developed by [mistral.ai](https://mistral.ai/). The model employs various optimization techniques such as Grouped-Query Attention, Sliding-Window Attention, and a Byte-fallback BPE tokenizer, and outperforms Llama 2 13B on benchmarks with fewer parameters.
* [QLoRA](https://github.com/artidoro/qlora) is a novel method that allows us to fine-tune large foundational models with limited GPU resources. It reduces the number of trainable parameters by learning pairs of rank-decomposition matrices and also applies 4-bit quantization to the frozen pretrained model to further reduce the memory footprint.
* [PEFT](https://huggingface.co/docs/peft/en/index) is a library developed by HuggingFace🤗, that enables developers to easily integrate various optimization methods with pretrained models available on the HuggingFace Hub. With PEFT, you can apply QLoRA to the pretrained model with a few lines of configurations and run fine-tuning just like the normal Transformers model training.
* [MLflow](https://mlflow.org/) manages an exploding number of configurations, assets, and metrics during the LLM training on your behalf. MLflow is natively integrated with Transformers and PEFT, and plays a crucial role in organizing the fine-tuning cycle.

## 1. Environment Set up

### Hardware Requirement
Please ensure your GPU has at least 20GB of VRAM available. This notebook has been tested on a single NVIDIA A10G GPU with 24GB of VRAM; the sample `nvidia-smi` output below was captured on an NVIDIA A100 with 40GB.

In [None]:
! nvidia-smi

Tue Aug 20 12:34:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0              50W / 400W |  26267MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### Install Python Libraries

This tutorial utilizes the following Python libraries:

* [mlflow](https://pypi.org/project/mlflow/) - for tracking parameters, metrics, and saving trained models. Version **2.11.0 or later** is required to log PEFT models with MLflow.
* [transformers](https://pypi.org/project/transformers/) - for defining the model, tokenizer, and trainer.
* [peft](https://pypi.org/project/peft/) - for creating a LoRA adapter on top of the Transformer model.
* [bitsandbytes](https://pypi.org/project/bitsandbytes/) - for loading the base model with 4-bit quantization for QLoRA.
* [accelerate](https://pypi.org/project/accelerate/) - a dependency required by bitsandbytes.
* [datasets](https://pypi.org/project/datasets/) - for loading the training dataset from the HuggingFace hub.

**Note**: Restarting the Python kernel may be necessary after installing these dependencies.

The notebook has been tested with `mlflow==2.11.0`, `transformers==4.35.2`, `peft==0.8.2`, `bitsandbytes==0.42.0`, `accelerate==0.27.2`, and `datasets==2.17.1`.

In [None]:
%pip install transformers peft accelerate bitsandbytes datasets -q -U
# Quote the requirement so the shell does not treat ">=" as an output redirection
!pip install --upgrade "comet-ml>=3.43.2"

## 2. Dataset Preparation

### Load Dataset from HuggingFace Hub

We will use the `flytech/python-codes-25k` dataset from the [Hugging Face Hub](https://huggingface.co/datasets/flytech/python-codes-25k) for this tutorial. This dataset comprises pairs of natural language instructions and their corresponding Python code, making it well suited for training a text-to-Python model. The code below uses three of its columns:

* `instruction`: A natural language description of the task.
* `input`: Additional context for the task; it fills the `{context}` slot of the prompt template.
* `output`: The Python code that represents the expected output.

In [None]:
import pandas as pd
from datasets import load_dataset
from IPython.display import HTML, display

dataset_name = "flytech/python-codes-25k"
dataset = load_dataset(dataset_name, split="train")
dataset = dataset.select(range(3000))

def display_table(dataset_or_sample):
    # A helper function to nicely display a dataset or a single sample that contains multi-line strings
    pd.set_option("display.max_colwidth", None)
    pd.set_option("display.width", None)
    pd.set_option("display.max_rows", None)

    if isinstance(dataset_or_sample, dict):
        df = pd.DataFrame(dataset_or_sample, index=[0])
    else:
        df = pd.DataFrame(dataset_or_sample)

    html = df.to_html().replace("\\n", "<br>")
    styled_html = f"""<style> .dataframe th, .dataframe tbody td {{ text-align: left; padding-right: 30px; }} </style> {html}"""
    display(HTML(styled_html))


#display_table(dataset.select(range(3)))

### Split Train and Test Dataset
The `flytech/python-codes-25k` dataset consists of a single split, "train". We will separate 20% of this as test samples.

In [None]:
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

print(f"Training dataset contains {len(train_dataset)} text-to-python pairs")
print(f"Test dataset contains {len(test_dataset)} text-to-python pairs")

Training dataset contains 2400 text-to-python pairs
Test dataset contains 600 text-to-python pairs


### Define Prompt Template

The Mistral 7B model is a text comprehension model, so we have to construct a text prompt that incorporates the user's question, context, and our system instructions. The new `prompt` column in the dataset will contain the text prompt to be fed into the model during training. It is important to note that we also include the expected response within the prompt, allowing the model to be trained in a self-supervised manner.

In [None]:
PROMPT_TEMPLATE = """You are a powerful text-to-python model. Given the data shape and natural language question, your job is to write python code that answers the question.

### data:
{context}

### Question:
{question}

### Response:
{output}"""


def apply_prompt_template(row):
    prompt = PROMPT_TEMPLATE.format(
        question=row["instruction"],
        context=row["input"],
        output=row["output"],
    )
    return {"prompt": prompt}


train_dataset = train_dataset.map(apply_prompt_template)
test_dataset = test_dataset.map(apply_prompt_template)
#display_table(train_dataset.select(range(2)))


### Padding the Training Dataset

As a final step of dataset preparation, we need to apply **padding** to the training dataset. Padding ensures that all input sequences in a batch are of the same length.

A crucial point to note is the need to *add padding to the left*. This approach is adopted because the model generates tokens autoregressively, meaning it continues from the last token. Adding padding to the right would cause the model to generate new tokens from these padding tokens, resulting in the output sequence including padding tokens in the middle.


* Padding to right:

```
Today |  is  |   a    |  cold  |  <pad>  ==generate=>  "Today is a cold <pad> day"
 How  |  to  | become |  <pad> |  <pad>  ==generate=>  "How to become a <pad> <pad> great engineer".
```

* Padding to left:

```
<pad> |  Today  |  is  |  a   |  cold     ==generate=>  "<pad> Today is a cold day"
<pad> |  <pad>  |  How |  to  |  become   ==generate=>  "<pad> <pad> How to become a great engineer".
```

In [None]:
from transformers import AutoTokenizer

base_model_id = "mistralai/Mistral-7B-v0.1"

# You can use a different max length if your custom dataset has shorter/longer input sequences.
MAX_LENGTH = 256

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    model_max_length=MAX_LENGTH,
    padding_side="left",
    add_eos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token


def tokenize_and_pad_to_fixed_length(sample):
    result = tokenizer(
        sample["prompt"],
        truncation=True,
        max_length=MAX_LENGTH,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result


tokenized_train_dataset = train_dataset.map(tokenize_and_pad_to_fixed_length)
tokenized_test_dataset = test_dataset.map(tokenize_and_pad_to_fixed_length)
assert all(len(x["input_ids"]) == MAX_LENGTH for x in tokenized_train_dataset)

#display_table(tokenized_train_dataset.select(range(1)))


## 3. Load the Base Model (with 4-bit quantization)

Next, we'll load the Mistral 7B model, which will serve as our base model for fine-tuning. This model can be loaded from the HuggingFace Hub repository [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) using the Transformers' `from_pretrained()` API. However, here we are also providing a `quantization_config` parameter.

This parameter embodies the key technique of [QLoRA](https://github.com/artidoro/qlora) that significantly reduces memory usage during fine-tuning. The following paragraph details the method and the implications of this configuration. However, feel free to skip if it appears complex. After all, we rarely need to modify the `quantization_config` values ourselves :)

**How It Works**

In short, QLoRA is a combination of **Q**uantization and **LoRA**. To grasp its functionality, it's simpler to begin with LoRA. [LoRA (Low Rank Adaptation)](https://github.com/microsoft/LoRA) is a preceding method for resource-efficient fine-tuning, by reducing the number of trainable parameters through matrix decomposition. Let `W'` represent the final weight matrix from fine-tuning. In LoRA, `W'` is approximated by the sum of the original weight and its update, i.e., `W + ΔW`, then decomposing the delta part into two low-dimensional matrices, i.e., `ΔW ≈ AB`. Suppose `W` is `m`x`m`, and we select a smaller `r` for the rank of `A` and `B`, where `A` is `m`x`r` and `B` is `r`x`m`. Now, the original trainable parameters, which are quadratic in size of `W` (i.e., `m^2`), after decomposition, become `2mr`. Empirically, we can choose a much smaller number for `r`, e.g., 32, 64, compared to the full weight matrix size, therefore this significantly reduces the number of parameters to train.
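
The parameter saving described above can be checked with a few lines of arithmetic. In this sketch, `m = 4096` is an assumed, illustrative matrix size and `r = 32` is one of the example ranks mentioned in the text.

```python
def lora_trainable_params(m, r):
    """Trainable parameters for an m x m weight: full update vs. rank-r factors A (m x r) and B (r x m)."""
    full = m * m
    lora = 2 * m * r
    return full, lora


full, lora = lora_trainable_params(m=4096, r=32)
print(full, lora, f"{lora / full:.2%}")  # 16777216 262144 1.56%
```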

[QLoRA](https://github.com/artidoro/qlora) extends LoRA, employing the same strategy for matrix decomposition. However, it further reduces memory usage by applying 4-bit quantization to the frozen pretrained model `W`. According to their research, the largest memory usage during LoRA fine-tuning is the backpropagation through the frozen parameters `W` to compute gradients for the adaptors `A` and `B`. Thus, quantizing `W` to 4-bit significantly reduces the overall memory consumption. This is achieved with the `load_in_4bit=True` setting shown below.

Moreover, QLoRA introduces additional techniques to optimize resource usage without significantly impacting model performance. For more technical details, please refer to [the paper](https://arxiv.org/pdf/2305.14314.pdf), but we implement them by setting the following quantization configurations in bitsandbytes:
* The 4-bit NormalFloat type is specified by `bnb_4bit_quant_type="nf4"`.
* Double quantization is activated by `bnb_4bit_use_double_quant=True`.
* QLoRA de-quantizes the 4-bit weights back to a higher precision when computing the gradients for `A` and `B`, to prevent performance degradation. This datatype is specified by `bnb_4bit_compute_dtype=torch.bfloat16`.
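
As a rough illustration of why 4-bit storage matters, the sketch below estimates the memory needed for the frozen weights alone. It assumes a round 7 billion parameters and ignores the small overhead of quantization constants, activations, and the adapter itself.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory (decimal GB) needed to store the weights alone."""
    return num_params * bits_per_param / 8 / 1e9


params = 7e9  # assumed round figure for a 7B model
print(f"16-bit: {weight_memory_gb(params, 16):.1f} GB")  # 16-bit: 14.0 GB
print(f"4-bit:  {weight_memory_gb(params, 4):.1f} GB")   # 4-bit:  3.5 GB
```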


In [None]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    # Load the model with 4-bit quantization
    load_in_4bit=True,
    # Use double quantization
    bnb_4bit_use_double_quant=True,
    # Use 4-bit Normal Float for storing the base model weights in GPU memory
    bnb_4bit_quant_type="nf4",
    # De-quantize the weights to 16-bit (Brain) float before the forward/backward pass
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=quantization_config)

`low_cpu_mem_usage` was None, now set to True since model is quantized.



### How Does the Base Model Perform?
First, let's assess the performance of the vanilla Mistral model on the Python generation task before any fine-tuning. In the sample below, the base model does produce Python-like code, but it invents a data file and column names and calls `plt` without importing it. This outcome suggests that the model can benefit from fine-tuning for our specific task.


In [None]:
import transformers

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
pipeline = transformers.pipeline(model=model, tokenizer=tokenizer, task="text-generation")

sample = test_dataset[100]
prompt = PROMPT_TEMPLATE.format(
    context=sample["input"], question=sample["instruction"], output=""
)  # Leave the answer part blank
import time
start_time = time.time()
with torch.no_grad():
    response = pipeline(prompt, max_new_tokens=256, repetition_penalty=1.15, return_full_text=False)
duration = time.time() - start_time
display_table({"prompt": prompt, "generated_query": response[0]["generated_text"]})

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Unnamed: 0,prompt,generated_query
0,"You are a powerful text-to-python model. Given the data shape and natural language question, your job is to write python code that answers the question. ### data: Finding the best time to post... ### Question: Help me find the best time to post on social media based on my followers! ### Response:","```python import pandas as pd from datetime import date # Load the data into a DataFrame df = pd.read_csv('data/followers.csv') # Filter the data by country filtered_df = df[df['country'] == 'US'] # Calculate the number of followers for each hour of the day hourly_counts = filtered_df.groupby(pd.Grouper(key='time', freq='H'))['follower_count'].sum() # Plot the hourly counts hourly_counts.plot(kind='bar') plt.title('Number of Followers per Hour') plt.xlabel('Time (hours)') plt.ylabel('Followers') plt.show() # Find the peak hours peak_hours = hourly_counts.idxmax() print(""The peak hours are:"", peak_hours) # Convert the peak hours to dates peak_dates = [date(2019, 5, int(hour)) for hour in peak_hours]"


In [None]:
PROMPT_TEMPLATE

'You are a powerful text-to-python model. Given the data shape and natural language question, your job is to write python code that answers the question.\n\n### data:\n{context}\n\n### Question:\n{question}\n\n### Response:\n{output}'

## 4. Define a PEFT Model

As discussed earlier, QLoRA stands for **Quantization** + **LoRA**. Having applied the quantization part, we now proceed with the LoRA aspect. Although the mathematics behind LoRA is intricate, [PEFT](https://huggingface.co/docs/peft/en/index) helps us by simplifying the process of adapting LoRA to the pretrained Transformer model.

In the next cell, we create a [LoraConfig](https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/config.py) with various settings for LoRA. Contrary to the earlier `quantization_config`, these hyperparameters might need optimization to achieve the best model performance for your specific task. **MLflow** facilitates this process by tracking these hyperparameters, the associated model, and its outcomes.

At the end of the cell, we display the number of trainable parameters during fine-tuning, and their percentage relative to the total model parameters. Here, we are training only 1.16% of the total 7 billion parameters.

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Enable gradient checkpointing to reduce memory usage during training
model.gradient_checkpointing_enable()
# Prepare the quantized model for training, e.g. casting layers and freezing parameters
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    # This is the rank of the decomposed matrices A and B to be learned during fine-tuning. A smaller number will save more GPU memory but might result in worse performance.
    r=32,
    # This is the coefficient for the learned ΔW factor, so the larger number will typically result in a larger behavior change after fine-tuning.
    lora_alpha=64,
    # Drop out ratio for the layers in LoRA adaptors A and B.
    lora_dropout=0.1,
    # We fine-tune all linear layers in the model. It might sound a bit large, but the trainable adapter size is still only 1.16% of the whole model.
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    # Bias parameters to train. 'none' is recommended to keep the original model performing equally when turning off the adapter.
    bias="none",
)

peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()

trainable params: 85,041,152 || all params: 7,326,773,248 || trainable%: 1.1607
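
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the layer dimensions of Mistral-7B-v0.1 (hidden size 4096, key/value projection size 1024, MLP size 14336, vocabulary 32000, 32 decoder layers); these values are assumptions taken from the model's published configuration, not from this tutorial. Each adapted linear layer adds `r * (in_features + out_features)` parameters.

```python
# Assumed Mistral-7B-v0.1 dimensions (not stated in this tutorial).
HIDDEN, KV_DIM, INTERMEDIATE, VOCAB, NUM_LAYERS = 4096, 1024, 14336, 32000, 32
RANK = 32

# (in_features, out_features) of each targeted linear layer inside one decoder block.
per_layer_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r=RANK):
    # One factor has shape (r x in_features), the other (out_features x r).
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o) for i, o in per_layer_shapes.values())
total = per_layer * NUM_LAYERS + lora_params(HIDDEN, VOCAB)  # lm_head appears once
print(total)                                  # 85041152
print(f"{100 * total / 7_326_773_248:.4f}")   # 1.1607
```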


**That's it!!!** PEFT has made the LoRA setup super easy.

An additional bonus is that the PEFT model exposes the same interfaces as a Transformers model. This means that everything from here on is quite similar to the standard model training process using Transformers.

## 5. Kick-off a Training Job

Similar to conventional Transformers training, we'll first set up a Trainer object to organize the training iterations. There are numerous hyperparameters to configure, but MLflow will manage them on your behalf.

To enable MLflow logging, you can specify `report_to="mlflow"` and name your training trial with the `run_name` parameter. This initiates an [MLflow run](https://mlflow.org/docs/latest/tracking.html#runs) that automatically logs training metrics, hyperparameters, configurations, and the trained model. The cells in this notebook instead report to Comet by setting `report_to=["comet_ml"]`.

In [None]:
!pip install codebleu



In [None]:
import numpy as np
import torch
from codebleu import calc_codebleu
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments


def compute_metrics(eval_pred):
    """Compute CodeBLEU and token-level accuracy from the Trainer's evaluation output."""
    logits, labels = eval_pred

    # The Trainer usually passes NumPy arrays; convert tensors if we receive them instead.
    if isinstance(logits, torch.Tensor):
        logits = logits.cpu().numpy()
    if isinstance(labels, torch.Tensor):
        labels = labels.cpu().numpy()

    # A causal LM predicts token t + 1 at position t, so shift to align predictions with labels.
    predictions = np.argmax(logits, axis=-1)[:, :-1]
    labels = labels[:, 1:]

    # Positions labelled -100 are ignored by the loss, so exclude them from the metrics as well.
    mask = labels != -100
    accuracy = float((predictions[mask] == labels[mask]).mean()) if mask.any() else 0.0

    # Decode only the unmasked positions, replacing IDs outside the tokenizer's vocabulary with 1.
    vocab_size = len(tokenizer)
    decoded_preds, decoded_labels = [], []
    for pred, label, keep in zip(predictions, labels, mask):
        pred_ids = np.where((pred >= 0) & (pred < vocab_size), pred, 1)[keep]
        decoded_preds.append(tokenizer.decode(pred_ids.tolist(), skip_special_tokens=True))
        decoded_labels.append(tokenizer.decode(label[keep].tolist(), skip_special_tokens=True))

    # calc_codebleu expects a callable that splits text into tokens, not a Hugging Face tokenizer,
    # so its default whitespace split is used here.
    codebleu_score = calc_codebleu(
        decoded_labels, decoded_preds, lang="python", weights=(0.25, 0.25, 0.25, 0.25)
    )

    # The Trainer adds its own eval_loss to the metrics, so it is not recomputed here.
    return {
        "eval_codebleu": codebleu_score["codebleu"],
        "eval_accuracy": accuracy,
    }
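
When computing token-level accuracy for a causal language model, the logits at position `t` predict the token at position `t + 1`, and labels equal to `-100` are skipped by the loss. The following is a minimal, dependency-free sketch of that alignment on a single sequence; the helper name and the token IDs are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def shifted_token_accuracy(predicted_ids, label_ids):
    """Token accuracy for one sequence, comparing the prediction at position t with the label at t + 1."""
    pairs = [
        (pred, label)
        for pred, label in zip(predicted_ids[:-1], label_ids[1:])
        if label != IGNORE_INDEX
    ]
    if not pairs:
        return 0.0
    return sum(pred == label for pred, label in pairs) / len(pairs)


labels = [5, 6, 7, 8, IGNORE_INDEX]
preds = [6, 7, 9, 0, 0]  # the prediction at index 0 targets labels[1], and so on
print(shifted_token_accuracy(preds, labels))  # two of the three scored positions match
```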


In [None]:
!pip install pyngrok --quiet
!pip install tree-sitter
!pip install -q -U trl

In [None]:
%pip install comet-llm datasets --quiet

In [None]:
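import comet_llm, os
from getpass import getpass
# Never hard-code API keys in a notebook; read the key from the environment or prompt for it
if not os.environ.get("COMET_API_KEY"):
    os.environ["COMET_API_KEY"] = getpass("Comet API key: ")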

In [None]:
os.environ["COMET_LOG_ASSETS"] = "true"

In [None]:
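tags = ["prompt with template", "summarization"]
metadata = {
    "model": "Mistral-7B",
    "max_tokens": 256,
}

# Log the base model's prompt/response pair captured earlier
comet_llm.log_prompt(
    prompt=prompt,
    prompt_template=PROMPT_TEMPLATE,
    # Map each template variable name to the value used to build the prompt
    prompt_template_variables={"context": sample["input"], "question": sample["instruction"]},
    output=response[0]["generated_text"],
    tags=tags,
    duration=duration * 1000,  # seconds -> milliseconds
    metadata=metadata,
)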

In [None]:
from transformers.integrations import CometCallback
from datasets import load_dataset  # `load_metric` is unused here and deprecated in `datasets`

In [None]:
!pip install --upgrade "comet-ml>=3.43.2"

In [None]:
from datetime import datetime

import comet_ml
from transformers import DataCollatorForLanguageModeling, TrainingArguments
from trl import SFTTrainer

# Reuse the EOS token as the padding token
tokenizer.pad_token = tokenizer.eos_token

# Set up the trainer
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    dataset_text_field="prompt",
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        max_steps=1000,  # Takes precedence over num_train_epochs
        logging_steps=50,  # Log every 50 steps
        output_dir="output_dir",
        gradient_checkpointing=True,
        eval_strategy="steps",  # Evaluate every `eval_steps` steps
        eval_steps=100,
        save_strategy="steps",  # Save a checkpoint every `save_steps` steps
        save_steps=100,
        save_total_limit=2,  # Limit the number of saved checkpoints
        # load_best_model_at_end=True,
        metric_for_best_model="eval_codebleu",  # Metric for selecting the best model
        report_to=["comet_ml"],  # Report to Comet for logging
        run_name=f"Mistral-7B-Python-QLoRA-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}",
        warmup_steps=5,
        ddp_find_unused_parameters=False,
        fp16=True,  # Enable mixed precision training for efficiency (if supported by hardware)
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    compute_metrics=compute_metrics,
    callbacks=[CometCallback()],
)

peft_model.config.use_cache = False



Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
You are adding a <class 'transformers.integrations.integration_utils.CometCallback'> to the callbacks of this Trainer, but there is already one. The currentlist of callbacks is
:DefaultFlowCallback
CometCallback
max_steps is given, it will override any value given in num_train_epochs
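
The last message matters for how long training runs. As a rough sketch, assuming a single GPU and the 2,400 training examples prepared earlier, the step budget works out as follows.

```python
import math

train_examples = 2400
per_device_batch = 2
grad_accum_steps = 4
max_steps = 1000

# One optimizer step consumes per_device_batch * grad_accum_steps examples on a single GPU.
effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = math.ceil(train_examples / effective_batch)
epochs_covered = max_steps / steps_per_epoch
print(effective_batch, steps_per_epoch, round(epochs_covered, 2))  # 8 300 3.33
```

In other words, `max_steps=1000` corresponds to a little over three passes through the training set rather than the single epoch requested by `num_train_epochs=1`.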


The training duration may span several hours, contingent upon your hardware specifications. Nonetheless, the primary objective of this tutorial is to acquaint you with the process of fine-tuning using PEFT and MLflow, rather than to cultivate a highly performant Python code generator. If you don't care much about the model performance, you may specify a smaller number of steps or interrupt the following cell to proceed with the rest of the notebook.

In [None]:
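# End any experiment that is still running (e.g. from a previous run of this cell) before starting a new one
running_experiment = comet_ml.get_global_experiment()
if running_experiment is not None:
    running_experiment.end()

experiment = comet_ml.Experiment(project_name="Mistral-finetune")
trainer.train()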

In [None]:
!pip install tree-sitter
!pip install tree-sitter-python

In [None]:
from comet_ml import Experiment

comet_ml.get_global_experiment().log_model("Mistral", "./output_dir/checkpoint-500")

In [None]:
from comet_ml import Experiment
exp = Experiment()
exp.log_model("Mistral", "./output_dir/checkpoint-500")

In [None]:
from comet_ml import API
api=API()
experiment=api.get("eliasammari/mistral-finetune/striped_dragon_2485")
experiment.register_model("mistral")

In [None]:
model = api.get_model("eliasammari", "mistral")
md = model.download("1.0.0")

In [None]:
from peft import PeftModel

# Define the name of the model you will be pushing to Hugging Face Model Hub
new_model = "mistral-CodePython_V2"  # Replace with your desired model name on the Hub

# Save the fine-tuned model using the trainer
trainer.model.save_pretrained(new_model)

# Load the base model and apply the PEFT model to it
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,  # Replace with the ID of the base model used for fine-tuning
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)

# Merge the PEFT model into the base model
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Configure the tokenizer first so the saved and pushed copies match, then save the merged model and tokenizer
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")

# Push the model and tokenizer to the Hugging Face Model Hub
merged_model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

print(f"Model pushed to Hugging Face Hub: https://huggingface.co/{new_model}")
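
Conceptually, merging an adapter folds the low-rank update into the base weight so that no extra matrices are needed at inference time. The toy sketch below illustrates this with nested lists; it assumes the update is scaled by `lora_alpha / r` (the usual LoRA scaling) and follows the common naming in which the update is the product `B @ A`, which is the same idea as the `ΔW ≈ AB` notation used earlier with the factor names swapped. All values and helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A), i.e. the adapter folded into the base weight."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # out_features x r
A = [[0.5, -0.5]]    # r x in_features
merged = merge_lora(W, B, A, lora_alpha=2, r=1)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```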


In [None]:
import transformers

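# Path or Hub ID of the fine-tuned model; "merged_model" is the local directory saved above
model = "merged_model"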
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
pipeline = transformers.pipeline(model=model, tokenizer=tokenizer, task="text-generation")

sample = test_dataset[100]
prompt = PROMPT_TEMPLATE.format(
    context=sample["input"], question=sample["instruction"], output=""
)  # Leave the answer part blank
import time
start_time = time.time()
with torch.no_grad():
    response = pipeline(prompt, max_new_tokens=256, repetition_penalty=1.15, return_full_text=False)
duration = time.time() - start_time
display_table({"prompt": prompt, "generated_query": response[0]["generated_text"]})