<a href="https://colab.research.google.com/github/abhilesh11111/LLM_fine_tuning/blob/main/fine_tune_code_llama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine tune your own private Copilot

Integrating GitHub with Colab has been annoyingly difficult. While you can open a notebook in Colab from a GitHub link, none of the rest of the repository is brought into the Colab runtime. This makes it cumbersome to use other materials saved in your repo, including dataset preprocessing scripts, structured training code, and maybe even the dataset itself. People have compromised and resorted to workarounds to complete a fine tuning lifecycle:

1. Create a dataset and put it in GDrive or a Hugging Face dataset repo.
2. Put some code in a notebook and run it in Colab, loading models from a Hugging Face model repo.
3. Save the fine tuned model back into a Hugging Face model repo.
4. Evaluate the fine tuned model and, if it's not good enough, go back to step 1.

This breaks one project into three pieces stored in different places: a dataset repo, a source code (notebook) repo, and a model repo, with no good way to cross-reference their individual versions. For example, if one fine tuning run makes the model worse, you have to search manually through three parallel histories, let alone the difficulty of reverting to a known-good state.

In this guide we demonstrate that you can:
1. Version **all** three pieces together in one GitHub repo managed by the [XetData](https://github.com/apps/XetData) GitHub app.
2. Clone **only** what you need for training into the Colab runtime using the [Lazy clone](https://xethub.com/assets/docs/large-repos/lazy-clone) feature.


This fine tuning example uses a LoRA approach on top of [Code Llama](https://ai.meta.com/blog/code-llama-large-language-model-coding/): it quantizes the base model to int8, freezes its weights, and trains only an adapter. Please accept the Code Llama license at https://ai.meta.com/resources/models-and-libraries/llama-downloads/. Much of the code is refactored from [[1]](https://github.com/tloen/alpaca-lora), [[2]](https://github.com/samlhuillier/code-llama-fine-tune-notebook/tree/main), [[3]](https://github.com/pacman100/DHS-LLM-Workshop/tree/main/personal_copilot).


Note: avoid running this on V100 GPUs, as [BF16 is not supported on V100](https://github.com/facebookresearch/llama-recipes/issues/284) and training will throw errors.
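Whether a GPU handles BF16 natively can be read off its CUDA compute capability. The following is a minimal sketch, assuming that native BF16 support starts at compute capability 8.0 (the Ampere generation). The helper name `supports_bf16` and the small lookup table are illustrative and not part of the original notebook.

```python
def supports_bf16(capability):
    """Return True if a (major, minor) CUDA compute capability has native BF16.

    Assumption: native BF16 arrives with compute capability 8.0 (Ampere).
    """
    major, _minor = capability
    return major >= 8


# Compute capabilities of GPUs commonly offered in Colab (illustrative table).
KNOWN_GPUS = {"V100": (7, 0), "T4": (7, 5), "A100": (8, 0)}

for name, capability in KNOWN_GPUS.items():
    status = "BF16 ok" if supports_bf16(capability) else "no native BF16"
    print(f"{name}: {status}")
```

In a live runtime, the tuple for the attached GPU can be obtained from `torch.cuda.get_device_capability()` and passed to the same helper.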

### Set up environment
This installs the necessary training libraries and [git-xet](https://xethub.com/assets/docs/getting-started/install), which adds native support for managing large files to Git, and sets up authorization to access your GitHub repo. To get started, first create a GitHub personal access token as described [here](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens).

In [None]:
# Install python dependencies
!pip install tqdm nbformat
!pip install git+https://github.com/huggingface/transformers.git@main bitsandbytes
!pip install git+https://github.com/huggingface/peft.git@main
!pip install datasets
import locale # colab workaround
locale.getpreferredencoding = lambda x=False:"UTF-8" # colab workaround

Collecting git+https://github.com/huggingface/transformers.git@main
  Cloning https://github.com/huggingface/transformers.git (to revision main) to /tmp/pip-req-build-cpahwlbd
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-cpahwlbd
  Resolved https://github.com/huggingface/transformers.git to commit b0f0c61899019d316db17a493023828aa44db06d
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting tokenizers<0.21,>=0.20 (from transformers==4.46.0.dev0)
  Downloading tokenizers-0.20.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)

In [None]:
# Download and install git-xet
!curl -fsSLO https://github.com/xetdata/xet-tools/releases/latest/download/xet-linux-x86_64.tar.gz
!tar -xvf xet-linux-x86_64.tar.gz && rm xet-linux-x86_64.tar.gz
!mv git-xet /usr/local/bin
!git xet install

git-xet
xet


In [None]:
# Set up authorization to access your repo where models, source code, etc. are versioned.
from IPython.display import clear_output
user = input("GitHub user name?")
%env GH_USER=$user
email = input("GitHub user email?")
%env GH_USER_EMAIL=$email
token = input("GitHub token?")
%env GH_TOKEN=$token
%env XET_LOG_PATH=log.txt
clear_output()

In [None]:
# The repo that contains the model and fine tuning code.
model_repo = "LTTS" # change to your own model repo
model_repo_url = f"https://{user}:{token}@github.com/{user}/{model_repo}.git"
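Embedding credentials in a URL only works if they contain no characters with special meaning in URLs, such as `@`, `:` or `/`. As a defensive sketch (the helper `build_repo_url` is hypothetical and not part of the original notebook), the user name and token can be percent-encoded before they are interpolated:

```python
from urllib.parse import quote


def build_repo_url(user, token, repo, host="github.com"):
    """Build an HTTPS clone URL with percent-encoded credentials.

    `host` defaults to github.com; the remaining arguments mirror the
    `user`, `token` and `model_repo` variables used in the notebook.
    """
    safe_user = quote(user, safe="")
    safe_token = quote(token, safe="")
    return f"https://{safe_user}:{safe_token}@{host}/{user}/{repo}.git"


# Made-up credentials, purely for illustration.
example_url = build_repo_url("octocat", "tok/en@1", "LTTS")
print(example_url)
```

Reading the token with `getpass.getpass("GitHub token?")` instead of `input` would also keep it from being echoed into the cell output.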

In [None]:
# Configure git for later commit author info
!git config --global user.name "$GH_USER"
!git config --global user.email "$GH_USER_EMAIL"

# Clone in lazy mode so as to materialize files on need basis
!git xet clone --lazy {model_repo_url} {model_repo}

Preparing to clone Xet repository.
Cloning into 'LTTS'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (3/3), done.


In [None]:
!pip install git+https://github.com/huggingface/peft.git@main


Collecting git+https://github.com/huggingface/peft.git@main
  Cloning https://github.com/huggingface/peft.git (to revision main) to /tmp/pip-req-build-sko5h0vy
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft.git /tmp/pip-req-build-sko5h0vy
  Resolved https://github.com/huggingface/peft.git to commit cff2a454ad0254ecdfcb9dfa3fac4abf2b4b9f09
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
from datetime import datetime
import os
import sys

import torch
from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_kbit_training,  # Updated function
    set_peft_model_state_dict,
)

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq


### Load model
Load the model from the cloned model repo. This is the base Code Llama model or your fine tuned model saved from previous runs. You can drop other LLMs into this repo, resting assured that XetData [supports per-repository limit of over 100TB and no per file or number of file limits](https://xethub.com/assets/docs/).

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/CodeLlama-7b-hf"  # Example model name on Hugging Face
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

print("Model and tokenizer loaded successfully.")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/CodeLlama-7b-hf.
401 Client Error. (Request ID: Root=1-6719f088-7346d71446c8a4b2443bb53c;00467305-9392-405a-be2c-3bbb7684f731)

Cannot access gated repo for url https://huggingface.co/meta-llama/CodeLlama-7b-hf/resolve/main/config.json.
Access to model meta-llama/CodeLlama-7b-hf is restricted. You must have access to it and be authenticated to access it. Please log in.

In [None]:


# Set the model name and repository
model_name = "CodeLlama-7b-hf"  # the model that you want to fine-tune on
model_repo = "LTTS"  # your model repository

# Materialize the model files to local
!cd {model_repo} && git xet materialize {model_name}

# Define the base model path
base_model = f"./{model_repo}/{model_name}"

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Verify the model and tokenizer are loaded correctly
print("Model and tokenizer loaded successfully.")


Didn't find any checked in files under ["CodeLlama-7b-hf"], skip materializing.


OSError: Incorrect path_or_model_id: './LTTS/CodeLlama-7b-hf'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

In [None]:
# Ensure git-xet is installed and configured
!pip install git+https://github.com/xetdata/xet-tools.git

# Set the model name and repository
model_name = "CodeLlama-7b-hf"  # the model that you want to fine-tune on
model_repo = "LTTS"  # your model repository

# Materialize the model files to local
!cd {model_repo} && git xet materialize {model_name}

# Define the base model path
base_model = f"./{model_repo}/{model_name}"

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Verify the model and tokenizer are loaded correctly
print("Model and tokenizer loaded successfully.")


Collecting git+https://github.com/xetdata/xet-tools.git
  Cloning https://github.com/xetdata/xet-tools.git to /tmp/pip-req-build-ie0y5va2
  Running command git clone --filter=blob:none --quiet https://github.com/xetdata/xet-tools.git /tmp/pip-req-build-ie0y5va2
  Resolved https://github.com/xetdata/xet-tools.git to commit 9e40423f5c901d0b90aef51698bc6278c5d6d7d8
ERROR: git+https://github.com/xetdata/xet-tools.git does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.
Didn't find any checked in files under ["CodeLlama-7b-hf"], skip materializing.


OSError: Incorrect path_or_model_id: './LTTS/CodeLlama-7b-hf'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

In [None]:
model_name = "CodeLlama-7b-hf" # the model that you want to fine tune on
!cd {model_repo} && git xet materialize {model_name} # brings the model files to local

base_model = f"./{model_repo}/{model_name}"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

Didn't find any checked in files under ["CodeLlama-7b-hf"], skip materializing.


OSError: Incorrect path_or_model_id: './LTTS/CodeLlama-7b-hf'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

### Check base model
As a baseline, let's first check how the existing model behaves.


In [None]:
eval_prompt = """
def parse_url(url, force_domain='xethub.com', partial_remote=False):
    '''
    Parses a Xet URL of the form
     - xet://user/repo/branch/[path]
     - /user/repo/branch/[path]

    Into a XetPathInfo which forms it as remote=https://[domain]/user/repo
    branch=[branch] and path=[path].

    branches with '/' are not supported.

    If partial_remote==True, allows [repo] to be optional. i.e. it will
    parse /user or xet://user
    '''

    <FILL_ME>
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=300)[0], skip_special_tokens=True))

I get the below output, which is erroneous and contains repetitive code.
```
if url.startswith('xet://'):
        url = url[len('xet://'):]

    if url.startswith('/'):
        url = url[1:]

    if '/' in url:
        repo, branch, path = url.split('/', 2)
    else:
        repo, branch = url.split('/', 1)
        path = ''

    if not repo:
        raise ValueError('No repo specified')

    if not branch:
        raise ValueError('No branch specified')

    if not path:
        path = ''

    if not partial_remote:
        if not repo.startswith('xet-'):
            raise ValueError('Invalid repo name')

        if not repo.endswith('.git'):
            raise ValueError('Invalid repo name')

        repo = repo[4:-4]

    if not repo:
        raise ValueError('No repo specified')

    if not branch:
        raise ValueError('No branch specified')

    if not path:
        path = ''

    if not path.startswith('/'):
        path = '/' + path

    if not path.endswith('/'):
        path = path + '/'

    return XetPathInfo(
        remote='https://'
```

### Load dataset
This example fine tunes using the [pyxet](https://github.com/xetdata/pyxet) project source code as the dataset. We use the scripts stored together in the model repo to clone pyxet and extract source code into a pandas DataFrame of format `['repo_id', 'file_path', 'content']`.

In [None]:
import importlib
myscripts=importlib.import_module(f"{model_repo}.scripts")
import pandas as pd
from datasets import Dataset

# Clones a source code repository as fine tuning data
username='xetdata'
repository='pyxet'
parquet_file = myscripts.create_dataset_from_git_repo(username,repository)
# Optionally you can save this dataset back to the model repo
df = pd.read_parquet(parquet_file)
dataset = Dataset.from_pandas(df, split="train")
# Split once so that the train and eval sets are disjoint
split_dataset = dataset.train_test_split(test_size=0.1)
train_dataset = split_dataset["train"]
eval_dataset = split_dataset["test"]
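The helper `myscripts.create_dataset_from_git_repo` lives in the model repo and is not shown here. As a rough sketch of what such an extraction step might look like (the function `collect_source_files` and its `extensions` parameter are hypothetical), the following walks a checked-out repository and builds one `repo_id`/`file_path`/`content` row per source file:

```python
import os
import tempfile


def collect_source_files(repo_id, root, extensions=(".py",)):
    """Walk `root` and return one row per source file with a matching extension."""
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune Git metadata in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            if not name.endswith(extensions):
                continue
            full_path = os.path.join(dirpath, name)
            with open(full_path, encoding="utf-8", errors="ignore") as f:
                content = f.read()
            rows.append({
                "repo_id": repo_id,
                "file_path": os.path.relpath(full_path, root),
                "content": content,
            })
    return rows


# Tiny demo on a throwaway directory instead of a real clone.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "hello.py"), "w", encoding="utf-8") as f:
        f.write("print('hi')\n")
    with open(os.path.join(tmp, "notes.txt"), "w", encoding="utf-8") as f:
        f.write("not source code\n")
    demo_rows = collect_source_files("xetdata/pyxet", tmp)

print(demo_rows)
```

A list of such rows can be passed to `pd.DataFrame(rows)` to obtain the DataFrame layout described above.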

### Tokenization
Source code files vary drastically in length. Feeding them directly into tensors requires padding and/or truncation, which either inflates memory usage or discards information. Instead, we split the file content into constant-length chunks (512 tokens) and tokenize each chunk.

In [None]:
tokenized_train_dataset, tokenized_val_dataset = myscripts.constant_length_token_seq_from(tokenizer, train_dataset, eval_dataset, seq_length=512)
tokenized_train_dataset = [s for s in tokenized_train_dataset]
tokenized_val_dataset = [s for s in tokenized_val_dataset]
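The implementation of `constant_length_token_seq_from` is part of the model repo and may differ from what follows. One common way to build constant-length sequences, sketched here under the assumption that files are already tokenized into lists of ids, is to concatenate all files with an end-of-sequence token between them and cut the stream into fixed-size chunks. The function name and the toy token ids are illustrative only.

```python
def constant_length_chunks(token_sequences, seq_length, eos_token_id):
    """Concatenate tokenized files, separated by EOS, and cut into fixed-size chunks."""
    buffer = []
    for tokens in token_sequences:
        buffer.extend(tokens)
        buffer.append(eos_token_id)  # mark the file boundary

    chunks = []
    # Step through the buffer in strides of seq_length; a trailing partial
    # chunk is dropped so every example has exactly seq_length tokens.
    for start in range(0, len(buffer) - seq_length + 1, seq_length):
        ids = buffer[start:start + seq_length]
        # For causal language modelling the labels are a copy of the inputs.
        chunks.append({"input_ids": ids, "labels": list(ids)})
    return chunks


# Toy example: three "files" of token ids, chunk length 4, EOS id 0.
demo_chunks = constant_length_chunks([[1, 2, 3], [4, 5], [6, 7, 8, 9]], 4, 0)
for chunk in demo_chunks:
    print(chunk["input_ids"])
```

Because no padding is involved, every token in a batch contributes to the loss, which is the memory argument made above.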

### Set up LoRA

In [None]:
model.train() # put model back into training mode
model = prepare_model_for_kbit_training(model)  # matches the function imported from peft above

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=[
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
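To get a feel for how small the adapter is, the number of trainable parameters can be estimated by hand: for a weight matrix of shape `d_out x d_in`, LoRA adds two matrices of shapes `r x d_in` and `d_out x r`. The sketch below assumes a hidden size of 4096 and 32 transformer layers for the 7B model, with all four targeted projections being square. These figures are assumptions, not values read from the loaded model.

```python
def lora_trainable_params(d_in, d_out, r, modules_per_layer, n_layers):
    """Count the parameters added by LoRA adapters of rank r."""
    per_module = r * (d_in + d_out)  # A is (r x d_in), B is (d_out x r)
    return per_module * modules_per_layer * n_layers


hidden_size = 4096  # assumed hidden size of the 7B model
n_layers = 32       # assumed number of transformer layers

trainable = lora_trainable_params(
    d_in=hidden_size,
    d_out=hidden_size,
    r=16,                 # same rank as in the LoraConfig above
    modules_per_layer=4,  # q_proj, k_proj, v_proj, o_proj
    n_layers=n_layers,
)
print(f"{trainable:,} trainable adapter parameters")
```

On the actual model, `model.print_trainable_parameters()` reports the real count after `get_peft_model` has been applied.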

### Training arguments
If you run out of GPU memory, reduce `per_device_train_batch_size`. Because `gradient_accumulation_steps` is derived from it, the effective batch size (and therefore the training dynamics) stays the same. The other arguments are standard settings that I wouldn't recommend changing:
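The relationship between the two settings is simple arithmetic: the effective batch size is the per-device batch size times the number of accumulation steps (times the number of data-parallel devices, if any). A small sketch, with a hypothetical helper name and an `n_devices` parameter that the notebook itself does not use:

```python
def accumulation_steps(target_batch_size, per_device_batch_size, n_devices=1):
    """Return the gradient accumulation steps needed to hit the target batch size."""
    micro_batch = per_device_batch_size * n_devices
    if target_batch_size % micro_batch != 0:
        raise ValueError(
            f"target batch size {target_batch_size} is not a multiple of {micro_batch}"
        )
    return target_batch_size // micro_batch


# Halving the per-device batch doubles the accumulation steps,
# leaving the effective batch size at 128 in every case.
for per_device in (4, 2, 1):
    steps = accumulation_steps(128, per_device)
    print(f"per_device={per_device} steps={steps} effective={per_device * steps}")
```

The divisibility check matters because integer division would otherwise silently shrink the effective batch size, for example with a per-device batch size of 3.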

In [None]:
if torch.cuda.device_count() > 1:
    # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available
    model.is_parallelizable = True
    model.model_parallel = True

In [None]:
batch_size = 128
per_device_train_batch_size = 4
gradient_accumulation_steps = batch_size // per_device_train_batch_size
output_dir = "code-llama" # to write checkpoints

training_args = TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=100,
        max_steps=400,
        #num_train_epochs=3,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=10,
        optim="adamw_torch",
        evaluation_strategy="steps", # if val_set_size > 0 else "no",
        save_strategy="steps",
        eval_steps=20,
        save_steps=20,
        output_dir=output_dir,
        # save_total_limit=3,
        load_best_model_at_end=False,
        # ddp_find_unused_parameters=False if ddp else None,
        report_to="none", # if use_wandb else "none",
        run_name=f"codellama-{datetime.now().strftime('%Y-%m-%d-%H-%M')}", # if use_wandb else None,
    )

trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=training_args,
)

Then we apply some PyTorch-related optimizations, which only make training faster and don't affect accuracy:

In [None]:
model.config.use_cache = False

if int(torch.__version__.split(".")[0]) >= 2 and sys.platform != "win32":  # compare the major version numerically
    print("compiling the model")
    model = torch.compile(model)

In [None]:
trainer.train()

On a T4 GPU, I'm getting the following training speed:
 - 21 steps or ~3.70 epochs in 2:04:20
 - 109 steps or ~20.56 epochs in 9:29:53

In [None]:
model.save_pretrained(output_dir)

### Load the final checkpoint

To load from a specific checkpoint, load the base model and the adapter separately. The checkpoint directory should contain an `adapter_config.json` and an `adapter_model.safetensors`:
```
from peft import PeftModel

# Illustrative path only: Trainer writes checkpoints to {output_dir}/checkpoint-<step>,
# so pick a step that actually exists in your run.
checkpoint_dir = f"{output_dir}/checkpoint-100"

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, checkpoint_dir)
```
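Before loading, it can save a confusing stack trace to confirm that the checkpoint directory really contains the two adapter files named above. A minimal sketch (the helper `missing_adapter_files` is hypothetical, and the demo uses a temporary directory rather than a real checkpoint):

```python
import os
import tempfile

# File names taken from the description of the checkpoint directory above.
REQUIRED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(checkpoint_dir):
    """Return the required adapter files that are absent from checkpoint_dir."""
    return [
        name
        for name in REQUIRED_ADAPTER_FILES
        if not os.path.isfile(os.path.join(checkpoint_dir, name))
    ]


# Demo: a directory that only has the config file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "adapter_config.json"), "w").close()
    demo_missing = missing_adapter_files(tmp)

print(demo_missing)
```

An empty list means the directory looks loadable. Any returned names point to an incomplete or wrongly chosen checkpoint path.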

Otherwise, directly try the same prompt as before:

In [None]:
eval_prompt = """
def parse_url(url, force_domain='xethub.com', partial_remote=False):
    '''
    Parses a Xet URL of the form
     - xet://user/repo/branch/[path]
     - /user/repo/branch/[path]

    Into a XetPathInfo which forms it as remote=https://[domain]/user/repo
    branch=[branch] and path=[path].

    branches with '/' are not supported.

    If partial_remote==True, allows [repo] to be optional. i.e. it will
    parse /user or xet://user
    '''

    <FILL_ME>
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=512)[0], skip_special_tokens=True))


After training for 100 steps, the model outputs the code below, which is much better than before fine tuning.
```
import re
    from .xet_path import XetPathInfo

    if url.startswith('xet://'):
        url = url[len('xet://'):]

    if url.startswith('/'):
        url = url[1:]

    if len(url) == 0:
        raise ValueError('Invalid Xet URL')

    parts = url.split('/')
    if len(parts) < 3:
        raise ValueError('Invalid Xet URL')

    if len(parts) == 3:
        branch = parts[2]
        path = ''
    else:
        branch = parts[2]
        path = '/'.join(parts[3:])

    if len(branch) == 0:
        raise ValueError('Invalid Xet URL')

    if partial_remote:
        if len(parts) == 2:
            return XetPathInfo(f'https://{force_domain}/{parts[0]}', branch, path)
        else:
            return XetPathInfo(f'https://{force_domain}/{parts[0]}/{parts[1]}', branch, path)
    else:
        return XetPathInfo(f'https://{force_domain}/{parts[0]}/{parts[1]}', branch, path)
```



In [None]:
# Finally merge the adapter into the model and save the model back to the repo.
model = model.merge_and_unload()
model.save_pretrained(base_model)
commit_id=!cd {repository} && git rev-parse --short HEAD
commit_id=commit_id[0]
!cd {model_repo} && git add {model_name} && git commit -m "Fine tuned model trained on {username}/{repository}@{commit_id}" && git push

### Finally, save this notebook back to your GitHub repository if there are changes.