Select the desired [quantized Code Llama](https://huggingface.co/models?search=TheBloke/CodeLlama) model from HuggingFace.

In [None]:
model_id = "TheBloke/CodeLlama-7B-Instruct-GPTQ"

Configure the GPU flag that indicates on which device tensors are allocated.

In [None]:
runtimeFlag = "cuda:0" #Run on GPU (you can't run [GPTQ](https://arxiv.org/pdf/2210.17323.pdf) on CPU)
cache_dir = None
scaling_factor = 1.0 # 1.0 leaves the context length unchanged; a factor of 6.0 would allow a max sequence length of 16384*6 = 98304, but it requires Colab Pro and a V100 or A100 GPU to have sufficient memory.
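
The arithmetic in the comment above can be checked with a short sketch. It assumes the 16384-token base length quoted in that comment, and `scaled_context_length` is a hypothetical helper for illustration, not part of `transformers`.

```python
def scaled_context_length(base_length: int, factor: float) -> int:
    """Return the context length reachable with a given RoPE scaling factor."""
    if factor < 1.0:
        raise ValueError("scaling factor must be >= 1.0")
    return int(base_length * factor)


BASE_LENGTH = 16384  # base context length quoted in the comment above

print(scaled_context_length(BASE_LENGTH, 1.0))  # 16384
print(scaled_context_length(BASE_LENGTH, 6.0))  # 98304
```

Longer contexts increase activation memory during training, which is why the larger factor is tied to bigger GPUs.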

Set up the standard system prompt and configure instruction and system prompt tokens to control generation.

In [None]:
DEFAULT_SYSTEM_PROMPT = """You are a powerful model specialized in refactoring java code.

You must output a refactored version of the code."""

SYSTEM_PROMPT = DEFAULT_SYSTEM_PROMPT

B_INST, E_INST = "[INST]", "[/INST]"  # for instruction models
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

print(SYSTEM_PROMPT)

Install the required dependencies.

In [None]:
!pip install -q -U transformers peft accelerate optimum bitsandbytes

!pip install datasets==2.10.1

!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

Load global dependencies.

In [None]:
import torch
import json
import os

Download the quantized pre-trained model from HuggingFace and load it in memory.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    # rope_scaling = {"type": "dynamic", "factor": scaling_factor}
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Check the model configuration.

In [None]:
print(model.config)

# Prepare dataset for fine-tuning

Load the dataset.

In [None]:
from datasets import load_dataset

eval_dataset = load_dataset('json', data_files='PATH', split='train') # `data_files` argument should be the path to validation split files
train_dataset = load_dataset('json', data_files='PATH', split='train') # `data_files` argument should be the path to training split files

If an error arises while reading the input JSON files, replace all single quotes (') with escaped double quotes (\") in the training and validation files.
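
As an alternative to a global search-and-replace, the sketch below repairs records one at a time. It assumes the files hold one record per line written as Python-style dictionaries. `repair_record` is a hypothetical helper. A blanket quote replacement can corrupt Java code that itself contains quote characters, whereas parsing each record avoids that.

```python
import ast
import json


def repair_record(line: str) -> str:
    """Convert one single-quoted, Python-style record into valid JSON."""
    try:
        json.loads(line)
        return line  # already valid JSON, leave it untouched
    except json.JSONDecodeError:
        record = ast.literal_eval(line)  # parses Python literals only, no code execution
        return json.dumps(record)


# Hypothetical record using the `before` / `after` keys read later in this notebook
broken = "{'before': 'int x=1;', 'after': 'int x = 1;'}"
fixed = repair_record(broken)
print(json.loads(fixed)["after"])  # int x = 1;
```

Applying the helper to every line of a file and writing the results back yields a file that `load_dataset('json', ...)` should be able to read.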

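Check the data format. Field descriptions are [here](https://github.com/madaan/pie-perf#dataset). The contents of the 'input' and 'target' fields are the slow and optimized program pairs. Note that the prompt generation function defined later in this notebook reads the `before` and `after` fields of each record, so the loaded files must provide those keys.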

In [None]:
train_dataset[0]

Set up tokenization settings such as left padding so that [training uses less memory](https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa).

In [None]:
tokenizer.add_eos_token = True
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"
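
To make the effect of `padding_side` concrete, here is a minimal pure-Python sketch of batch padding. `pad_batch` is a hypothetical stand-in for what the tokenizer and data collator do internally, using the same pad id of 0 as the cell above.

```python
from typing import List


def pad_batch(sequences: List[List[int]], pad_id: int = 0, side: str = "left") -> List[List[int]]:
    """Pad every sequence to the length of the longest one in the batch."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    if not sequences:
        return []
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        # Left padding puts the filler first, so real tokens end at the same position
        padded.append(padding + seq if side == "left" else seq + padding)
    return padded


batch = [[5, 6, 7], [8, 9]]
print(pad_batch(batch, side="left"))   # [[5, 6, 7], [0, 8, 9]]
print(pad_batch(batch, side="right"))  # [[5, 6, 7], [8, 9, 0]]
```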

Define the `tokenize` function so that `labels` and `input_ids` are the same (self-supervised learning).

In [None]:
def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding=False,
        return_tensors=None,
        add_special_tokens=True
    )

    # "self-supervised learning" means the labels are also the inputs:
    result["labels"] = result["input_ids"].copy()

    return result

Convert each `data_point` of the dataset into a prompt, so the model is fine-tuned using such prompts. The same method is used to generate few-shot examples from the dataset for prompt-tuning.

In [None]:
def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""# java code before refactoring:
{data_point["before"]}

# refactored version of the same java code:
{data_point["after"]}
"""
    return tokenize(f"{B_INST} {B_SYS}{SYSTEM_PROMPT}{E_SYS}{full_prompt} {E_INST}")
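
To inspect what the model actually sees, the following sketch assembles the same string without tokenizing it. The record and the shortened system prompt are hypothetical values chosen for the example. `build_prompt` is an illustrative helper that mirrors the template above.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
SYSTEM_PROMPT = "You must output a refactored version of the code."  # shortened for the example


def build_prompt(data_point: dict) -> str:
    """Assemble the string that would be passed to `tokenize`, without tokenizing it."""
    full_prompt = f"""# java code before refactoring:
{data_point['before']}

# refactored version of the same java code:
{data_point['after']}
"""
    return f"{B_INST} {B_SYS}{SYSTEM_PROMPT}{E_SYS}{full_prompt} {E_INST}"


example = {"before": "int x=1;", "after": "int x = 1;"}  # hypothetical record
print(build_prompt(example))
```

Printing a few rendered prompts before training is a cheap way to catch missing fields or misplaced instruction tokens.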

Apply the prompt generation function to each data point.

In [None]:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

# Set up LoRA

Here, we perform the necessary configuration and train a LoRA adapter. We first define the location where the LoRA adapter will be saved, along with checkpoints.

In [None]:
output_dir = "OUTPUT_DIRECTORY" # replace with the output directory of the LoRA adapter

We then configure the LoRA procedure and prepare the loaded model for fine-tuning.

In [None]:
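from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
)

model.train() # put model back into training mode
# `prepare_model_for_int8_training` was removed from recent `peft` releases in favor of this function
model = prepare_model_for_kbit_training(model)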

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ],
    lora_dropout=0.01,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
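
A rough estimate of how many parameters this configuration trains can be sketched as follows. It assumes the 7B model has 32 decoder layers whose four attention projections are all 4096 x 4096. The helper name is hypothetical, and `model.print_trainable_parameters()` from `peft` reports the actual figure for the loaded model.

```python
def lora_trainable_params(r: int, in_features: int, out_features: int,
                          modules_per_layer: int, num_layers: int) -> int:
    """Count LoRA weights: each adapted layer gets A (r x in) and B (out x r)."""
    per_module = r * (in_features + out_features)
    return per_module * modules_per_layer * num_layers


HIDDEN_SIZE = 4096  # assumed hidden size of the 7B model
NUM_LAYERS = 32     # assumed number of decoder layers
R, LORA_ALPHA = 16, 16

# q_proj, k_proj, v_proj and o_proj are adapted in every layer
print(lora_trainable_params(R, HIDDEN_SIZE, HIDDEN_SIZE, 4, NUM_LAYERS))  # 16777216

# The adapter update is scaled by lora_alpha / r
print(LORA_ALPHA / R)  # 1.0
```

Under these assumptions, roughly 16.8 million weights are trained, a small fraction of the roughly 7 billion weights in the base model.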

To resume from a checkpoint, set `resume_from_checkpoint` to the path of the `adapter_model.safetensors` file you want to resume from. The code below overwrites the weights of the LoRA adapter attached to the model with the ones stored in that file.

In [None]:
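import os
from peft import set_peft_model_state_dict
from safetensors.torch import load_file

resume_from_checkpoint = "CHECKPOINT_PATH"  # set this to the adapter weights path, or to "" to start from scratch

if resume_from_checkpoint:
    if os.path.exists(resume_from_checkpoint):
        print(f"Restarting from {resume_from_checkpoint}")
        # `torch.load` cannot read `.safetensors` files, so pick the loader by extension
        if resume_from_checkpoint.endswith(".safetensors"):
            adapters_weights = load_file(resume_from_checkpoint)
        else:
            adapters_weights = torch.load(resume_from_checkpoint)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        print(f"Checkpoint {resume_from_checkpoint} not found")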
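
When several checkpoints exist, the most recent one can be located programmatically. The sketch below assumes the `checkpoint-<step>` folder naming that the `Trainer` uses when saving into `output_dir`, and that each folder contains an `adapter_model.safetensors` file. `latest_adapter_path` is a hypothetical helper.

```python
import os
import re
import tempfile
from typing import Optional


def latest_adapter_path(output_dir: str,
                        filename: str = "adapter_model.safetensors") -> Optional[str]:
    """Return the adapter file in the highest-numbered `checkpoint-<step>` folder, if any."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        candidate = os.path.join(output_dir, name, filename)
        # Compare steps as integers so that checkpoint-100 ranks above checkpoint-40
        if match and os.path.isfile(candidate) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), candidate
    return best_path


# Demonstration on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for step in (20, 100, 40):
        folder = os.path.join(tmp, f"checkpoint-{step}")
        os.makedirs(folder)
        open(os.path.join(folder, "adapter_model.safetensors"), "wb").close()
    found = latest_adapter_path(tmp)
    print(os.path.basename(os.path.dirname(found)))  # checkpoint-100
```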

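If more than one GPU is available, flag the model as parallelizable so that the `Trainer` treats it as already split across devices instead of wrapping it in `DataParallel`.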

In [None]:
if torch.cuda.device_count() > 1:
    model.is_parallelizable = True
    model.model_parallel = True

Configure the training parameters. If you run out of GPU memory, lower `per_device_train_batch_size`. Because `gradient_accumulation_steps` is computed as `batch_size // per_device_train_batch_size`, the effective batch size stays at `batch_size`, so this change should not affect batch dynamics during the training run. All the other variables are standard.

In [None]:
from datetime import datetime

from transformers import (
    TrainingArguments,
    Trainer,
    DataCollatorForSeq2Seq
)

batch_size = 8
per_device_train_batch_size = 1
gradient_accumulation_steps = batch_size // per_device_train_batch_size

training_args = TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=100,
        max_steps=400,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=10,
        optim="adamw_torch",
        evaluation_strategy="steps", # if val_set_size > 0 else "no",
        save_strategy="steps",
        eval_steps=20,
        save_steps=20,
        output_dir=output_dir,
        # save_total_limit=3,
        load_best_model_at_end=False,
        # ddp_find_unused_parameters=False if ddp else None,
        group_by_length=True, # group sequences of roughly the same length together to speed up training
        report_to="none",
        run_name=None
    )

trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=training_args,
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    )
)
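
The interplay of these numbers can be sketched in plain Python. The schedule below assumes the default linear scheduler of `TrainingArguments` (linear warmup followed by linear decay to zero). Both helper functions are hypothetical and only reproduce the arithmetic.

```python
def effective_batch_size(per_device: int, accumulation_steps: int) -> int:
    """Number of sequences contributing to each optimizer step on one device."""
    return per_device * accumulation_steps


def linear_schedule_lr(step: int, peak_lr: float = 3e-4,
                       warmup_steps: int = 100, max_steps: int = 400) -> float:
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))


batch = effective_batch_size(per_device=1, accumulation_steps=8)
print(batch)        # 8
print(batch * 400)  # 3200 sequences processed over max_steps=400
print(400 // 20)    # 20 evaluations and checkpoints with eval_steps=save_steps=20

for step in (0, 50, 100, 250, 400):
    print(step, linear_schedule_lr(step))
```

With these settings the learning rate peaks at step 100 and falls back to zero at step 400, so a quarter of the run is spent warming up.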

PyTorch-related optimization: disable the generation cache during training and, on PyTorch 2 or later, compile the model (makes training faster but doesn't affect accuracy).

In [None]:
import sys
from peft import get_peft_model_state_dict

model.config.use_cache = False

if torch.__version__ >= "2" and sys.platform != "win32":
    print("compiling the model")
    model = torch.compile(model)
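
The version check above compares strings, which works for current releases but is fragile in general. A more robust variant could parse the major version number first. `major_version` is a hypothetical helper, and the version strings are examples.

```python
import re


def major_version(version: str) -> int:
    """Extract the leading major number from a version string such as '2.1.0+cu118'."""
    match = re.match(r"(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1))


print(major_version("2.1.0+cu118") >= 2)  # True
print(major_version("1.13.1") >= 2)       # False

# String comparison is lexicographic, so a two-digit major version would be misjudged
print("10.0.0" >= "2")                    # False
print(major_version("10.0.0") >= 2)       # True
```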

In [None]:
trainer.train()