# Fine-tuning

Let's start by importing packages!

In [1]:
!module load CUDA
!module load cuDNN/8.9.2.26-CUDA-12.1.1

In [2]:
%pip uninstall -y torch

Found existing installation: torch 2.4.0+cu121
Uninstalling torch-2.4.0+cu121:
  Successfully uninstalled torch-2.4.0+cu121
Note: you may need to restart the kernel to use updated packages.


In [3]:
%pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch==2.4.0
  Using cached https://download.pytorch.org/whl/cu121/torch-2.4.0%2Bcu121-cp311-cp311-linux_x86_64.whl (799.1 MB)
Collecting typing-extensions>=4.8.0 (from torch==2.4.0)
  Using cached https://download.pytorch.org/whl/typing_extensions-4.9.0-py3-none-any.whl (32 kB)
Installing collected packages: typing-extensions, torch
[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 3.0.2 requires requests>=2.32.2, but you have requests 2.31.0 which is incompatible.
mlflow 2.17.0 requires pyarrow<18,>=4.0.0, but you have pyarrow 18.0.0 which is incompatible.[0m[31m
[0mSuccessfully installed torch-2.4.0+cu121 typing-extensions-4.9.0
Note: you may need to restart the kernel to use updated packages.


In [4]:

# Install necessary libraries
%pip install transformers==4.45.0  peft accelerate


Defaulting to user installation because normal site-packages is not writeable
Collecting transformers==4.45.0
  Using cached transformers-4.45.0-py3-none-any.whl.metadata (44 kB)
Collecting typing-extensions>=3.7.4.3 (from huggingface-hub<1.0,>=0.23.2->transformers==4.45.0)
  Using cached typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)
Using cached transformers-4.45.0-py3-none-any.whl (9.9 MB)
Using cached typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Installing collected packages: typing-extensions, transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.46.1
    Uninstalling transformers-4.46.1:
      Successfully uninstalled transformers-4.46.1
[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 3.0.2 requires requests>=2.32.2, but you have requests 2.31.0 which is incompatible.
mlflow 2

In [5]:

# Import necessary libraries for LoRA fine-tuning
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import torch


In [6]:
print("torch version:", torch.__version__)
print("CUDA Version:", torch.version.cuda)
print("CUDA Available:", torch.cuda.is_available())
print("Number of GPUs:", torch.cuda.device_count())
print("Current CUDA Device:", torch.cuda.current_device())
print("Device Name:", torch.cuda.get_device_name(torch.cuda.current_device()))

torch version: 2.4.0+cu121
CUDA Version: 12.1
CUDA Available: True
Number of GPUs: 1
Current CUDA Device: 0
Device Name: NVIDIA A100-PCIE-40GB


In [7]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)

In [8]:
%pip install --upgrade  pip
%pip install -U  transformers accelerate datasets deepspeed
%pip install torch --index-url https://download.pytorch.org/whl/cu121

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Collecting transformers
  Using cached transformers-4.46.1-py3-none-any.whl.metadata (44 kB)
Collecting requests (from transformers)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting typing-extensions>=3.7.4.3 (from huggingface-hub<1.0,>=0.23.2->transformers)
  Using cached typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)
Using cached transformers-4.46.1-py3-none-any.whl (10.0 MB)
Using cached requests-2.32.3-py3-none-any.whl (64 kB)
Using cached typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Installing collected packages: typing-extensions, requests, transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.45.0
    Uninstalling transformers-4.45.0:
      Successfully uninstalled transfo

In [9]:
%pip install flash-attn

Defaulting to user installation because normal site-packages is not writeable
Collecting typing-extensions>=4.8.0 (from torch->flash-attn)
  Using cached typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)
Using cached typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Installing collected packages: typing-extensions
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 3.0.2 requires requests>=2.32.2, but you have requests 2.31.0 which is incompatible.
mlflow 2.17.0 requires pyarrow<18,>=4.0.0, but you have pyarrow 18.0.0 which is incompatible.[0m[31m
[0mSuccessfully installed typing-extensions-4.9.0
Note: you may need to restart the kernel to use updated packages.


In [10]:
import os
os.environ['CUDA_HOME'] = '/cvmfs/hpc.rug.nl/versions/2023.01/rocky8/x86_64/amd/zen3/software/CUDA/12.1.1'
os.environ['PATH'] = f"{os.environ['CUDA_HOME']}/bin:{os.environ['PATH']}"
os.environ['LD_LIBRARY_PATH'] = f"{os.environ['CUDA_HOME']}/lib64:{os.environ.get('LD_LIBRARY_PATH', '')}"

In [11]:

# Load the base model and tokenizer
model_name = "stabilityai/stable-code-3b"  # Replace with your desired model
import os
os.environ["HF_TOKEN"] = "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"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
model.config.use_cache = False
model.gradient_checkpointing_enable()
# Set up the LoRA configuration
lora_config = LoraConfig(
    r=32,                # LoRA rank
    lora_alpha=64,            # Scaling factor for LoRA
    lora_dropout=0.05,        # Dropout for LoRA layers
    target_modules=["q_proj", "v_proj"]  # Set LoRA on attention layers (adjust based on architecture)
)

# Wrap the model with LoRA
model = get_peft_model(model, lora_config)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [12]:

# Load dataset for training
dataset = load_dataset("json", data_files="../habrok/dataset.json")
split_dataset = dataset["train"].train_test_split(test_size=0.2)
train_eval_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

train_eval_split = train_eval_dataset.train_test_split(test_size=0.25)
train_dataset = train_eval_split["train"]
eval_dataset = train_eval_split["test"]


print(f"Train dataset size: {len(train_dataset)}")
print(f"Eval dataset size: {len(eval_dataset)}")
print(f"Test dataset size: {len(test_dataset)}")
test_dataset.save_to_disk("test_dataset")


Train dataset size: 920
Eval dataset size: 307
Test dataset size: 307


Saving the dataset (0/1 shards):   0%|          | 0/307 [00:00<?, ? examples/s]

In [13]:
def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,

        padding="max_length",
        max_length=1024
    )

    # "self-supervised learning" means the labels are also the inputs:
    result["labels"] = result["input_ids"].copy()
    return result


def formatting_prompts_func(datapoint):
    question = datapoint["question"]
    query = datapoint["SQL"]
    database_schema = datapoint["database_schema"]
    prompt = f"""Given the following SQL tables, your job is to generate the Sqlite SQL query given the user's question.
Put your answer inside the ⁠```sql and ```⁠ tags.
{database_schema}
###
Question: {question}

⁠```sql
{query} ;
```
<|EOT|>
"""

    return tokenize(prompt)


train_dataset = train_dataset.map(formatting_prompts_func, batched=False)
eval_dataset = eval_dataset.map(formatting_prompts_func, batched=False)

Map:   0%|          | 0/920 [00:00<?, ? examples/s]

Map:   0%|          | 0/307 [00:00<?, ? examples/s]

In [14]:
train_dataset

Dataset({
    features: ['question_id', 'db_id', 'question', 'evidence', 'SQL', 'difficulty', 'database_schema', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 920
})

In [15]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    return_tensors="pt",
    pad_to_multiple_of=8,  # Efficient padding for GPU
)

In [16]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32, # effective batch size
    learning_rate=5e-5,
    bf16=True,
    logging_steps=10,
    save_steps=500,
    save_total_limit=2,
    eval_strategy="steps",
    eval_steps=10,  # Evaluate every 100 steps
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    group_by_length=True,
)

In [17]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [18]:
%pip install numpy

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [19]:
%pip install --upgrade pyarrow datasets numpy

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Defaulting to user installation because normal site-packages is not writeable
Collecting requests>=2.32.2 (from datasets)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Using cached requests-2.32.3-py3-none-any.whl (64 kB)
Installing collected packages: requests
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
mlflow 2.17.0 requires pyarrow<18,>=4.0.0, but you have pyarrow 18.0.0 which is incompatible.[0m[31m
[0mSuccessfully installed requests-2.32.3
Note: you may need to restart the kernel to use updated packages.


In [20]:

# Start training using LoRA fine-tuning
trainer.train()


[2024-10-29 19:16:36,511] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/cvmfs/hpc.rug.nl/versions/2023.01/rocky8/x86_64/intel/icelake/software/binutils/2.40-GCCcore-12.3.0/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/cvmfs/hpc.rug.nl/versions/2023.01/rocky8/x86_64/intel/icelake/software/binutils/2.40-GCCcore-12.3.0/bin/ld: /cvmfs/hpc.rug.nl/versions/2023.01/rocky8/x86_64/amd/zen3/software/CUDA/12.1.1/lib64/libcufile.so: undefined reference to `dlopen'
/cvmfs/hpc.rug.nl/versions/2023.01/rocky8/x86_64/intel/icelake/software/binutils/2.40-GCCcore-12.3.0/bin/ld: /cvmfs/hpc.rug.nl/versions/2023.01/rocky8/x86_64/amd/zen3/software/CUDA/12.1.1/lib64/libcufile.so: undefined reference to `dlclose'
/cvmfs/hpc.rug.nl/versions/2023.01/rocky8/x86_64/intel/icelake/software/binutils/2.40-GCCcore-12.3.0/bin/ld: /cvmfs/hpc.rug.nl/versions/2023.01/rocky8/x86_64/amd/zen3/software/CUDA/12.1.1/lib64/libcufile.so: undefined reference to `dlerror'
/cvmfs/hpc.rug.nl/versions/2023.01/rocky8/x86_64/intel/icelake/software/binutils/

Step,Training Loss,Validation Loss
10,0.8501,No log
20,0.6924,No log
30,0.559,No log
40,0.4561,No log
50,0.3904,No log


TrainOutput(global_step=56, training_loss=0.5676604764802116, metrics={'train_runtime': 400.3622, 'train_samples_per_second': 4.596, 'train_steps_per_second': 0.14, 'total_flos': 2.947555793043456e+16, 'train_loss': 0.5676604764802116, 'epoch': 1.9478260869565216})

In [21]:

# Save the LoRA fine-tuned model and tokenizer
model.save_pretrained("./lora_finetuned_model")
tokenizer.save_pretrained("./lora_finetuned_model")


('./lora_finetuned_model/tokenizer_config.json',
 './lora_finetuned_model/special_tokens_map.json',
 './lora_finetuned_model/tokenizer.json')

In [22]:
trainer.evaluate(eval_dataset)

{'eval_runtime': 13.9833,
 'eval_samples_per_second': 21.955,
 'eval_steps_per_second': 2.789,
 'epoch': 1.9478260869565216}