# Phi2 SFT training baseline

## data

I used all the public data but dropped rows with duplicate rewrite prompts.

## hyperparameters

epochs: 5

batch size: 2

gradient_accumulation_steps: 8

max_seq_length: 1024

learning rate: 1e-4
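As a rough check on what these settings imply, the sketch below computes the effective batch size and the number of optimizer steps. The 7,000-example training set size is taken from the pipeline outline further down; the helper name `optimizer_steps` is hypothetical, and rounding a partial accumulation window up to a full step is an assumption (the Trainer may round differently).

```python
import math

def optimizer_steps(num_examples, batch_size, grad_accum, epochs):
    """Return (effective batch size, steps per epoch, total steps)."""
    effective_batch = batch_size * grad_accum
    # Assumption: a final partial accumulation window still counts as one step.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

# Baseline settings listed above: batch size 2, accumulation 8.
print(optimizer_steps(7000, 2, 8, 5))
# Reduced settings used in the notebook cells: batch size 1, accumulation 8.
print(optimizer_steps(7000, 1, 8, 5))
```

With batch size 1 the total comes to 4375, which matches the step count quoted in the outline.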

# 1. Load the quantized model (1.35GB)
model = load_quantized_model()

# 2. Inject LoRA (adds ~10M parameters)
model = inject_lora(model)

# 3. Prepare the data (7,000 examples)
data = format_as_instructions(raw_data)

# 4. Train (5 epochs, ~4375 steps in total)
for epoch in range(5):
    for batch in data:
        loss = model(batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# 5. Save the adapter (33MB)
save_lora_adapter(model)

# Resource usage
Time: ~10 hours (T4 GPU)
Memory: ~8GB
Cost: ~$5
Quality: roughly 95%+ of full fine-tuning


The inference notebook is [here](https://www.kaggle.com/code/mozhiwenmzw/0-61-llmpr-phi2-sft-model-generate-infer?scriptVersionId=169380324).

In [1]:
!pip install -Uq /kaggle/input/llm-whls/bitsandbytes-0.41.1-py3-none-any.whl
!pip install -Uq /kaggle/input/llm-whls/peft-0.4.0-py3-none-any.whl
!pip install -Uq /kaggle/input/library-off-for-llm/transformers-4.38.2-py3-none-any.whl

In [2]:
!pip install -Uq /kaggle/input/llm-whls/trl-0.5.0-py3-none-any.whl

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import TrainingArguments

from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from peft import LoraConfig



In [4]:
exp_name = 'phi2_public_data_sft'
data_path = '/kaggle/input/llmpr-public-10k-unique/public_10k_unique_rewrite_prompt.csv'
model_path = '/kaggle/input/phi/transformers/2/1'
output_path = 'outputs'
model_save_path = f'{exp_name}_adapter'

In [5]:
epochs = 5
batch_size = 1  # baseline: 2
max_seq_length = 512  # baseline: 1024
lr = 1e-4

In [6]:
df = pd.read_csv(data_path)
train_df, val_df = train_test_split(df, test_size=0.3, random_state=42)
train_df = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)

In [7]:
train_ds = Dataset.from_pandas(train_df)
val_ds = Dataset.from_pandas(val_df)

In [8]:
tokenizer = AutoTokenizer.from_pretrained(model_path)
# The Phi-2 tokenizer ships without a pad token, so reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [9]:
# 4-bit NF4 quantization with float16 compute; double quantization is disabled.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype='float16',
    bnb_4bit_use_double_quant=False,
)

In [10]:
# model_path is a local Kaggle directory, so no Hub auth token is needed.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    trust_remote_code=True,
)

`low_cpu_mem_usage` was None, now set to True since model is quantized.



You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.


In [11]:
model.config.gradient_checkpointing = False

In [12]:
def token_len(text):
    """Return the number of tokens the tokenizer produces for `text`."""
    tokenized = tokenizer(text, return_length=True)
    length = tokenized['length'][0]
    return length

In [13]:
def formatting_prompts_func(example):
    """Turn a batch of rows into Phi-2 style 'Instruct: ... Output: ...' strings."""
    output_texts = []
    for i in range(len(example['rewritten_text'])):
        ori_text = example['original_text'][i]
        rew_text = example['rewritten_text'][i]
        rew_prompt = example['rewrite_prompt'][i]
        text = f"Instruct: Original Text:{ori_text}\nRewritten Text:{rew_text}\nWrite a prompt that was likely given to the LLM to rewrite original text into rewritten text.Output: {rew_prompt}"
        # Drop examples that would not fit in max_seq_length instead of truncating them.
        if token_len(text) > max_seq_length:
            continue
        output_texts.append(text)
    return output_texts
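To make the template easier to inspect, here is a minimal sketch that builds the same string for a single made-up example, without the tokenizer or the length filter. The helper `build_prompt` and the sample texts are hypothetical, and the option to omit the target (to get an inference-style prompt) is an assumption; the linked inference notebook may build its prompt differently.

```python
def build_prompt(original_text, rewritten_text, rewrite_prompt=None):
    """Build the training string; leave rewrite_prompt out to stop at 'Output:'."""
    prompt = (
        f"Instruct: Original Text:{original_text}\n"
        f"Rewritten Text:{rewritten_text}\n"
        "Write a prompt that was likely given to the LLM to rewrite "
        "original text into rewritten text.Output:"
    )
    if rewrite_prompt is None:
        return prompt
    return f"{prompt} {rewrite_prompt}"

# Made-up example row, for illustration only.
example = build_prompt("The cat sat.", "Yon feline didst sit.", "Rewrite as Shakespeare.")

# Everything after the first "Output:" is the part the model is trained to produce.
prompt_part, target_part = example.split("Output:", 1)
print(repr(prompt_part))
print(repr(target_part))
```

Note that the split is on the first occurrence of `Output:`, so a row whose original or rewritten text already contains that string would be split in the wrong place.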

In [None]:
response_template = "Output:"
# Compute the loss only on the tokens that follow the response template.
collator = DataCollatorForCompletionOnlyLM(response_template=response_template,
                                           tokenizer=tokenizer)
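The idea behind a completion-only collator can be illustrated with plain lists of token ids: every label up to and including the response template is replaced with `-100`, so only the target tokens contribute to the loss. This is a simplified sketch, not TRL's implementation; `mask_prompt_tokens` and the toy ids are hypothetical, and masking the whole example when the template is missing is an assumption (the installed TRL version may raise an error instead).

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_prompt_tokens(input_ids, response_ids):
    """Return labels with everything through the response template masked out."""
    labels = list(input_ids)
    n = len(response_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_ids:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # Template not found: mask everything so the example adds no loss.
    return [IGNORE_INDEX] * len(input_ids)

# Toy token ids; 7 and 8 stand in for the tokens of "Output:".
tokens = [1, 2, 3, 7, 8, 4, 5]
print(mask_prompt_tokens(tokens, [7, 8]))
```

Here only the last two positions keep their labels, which corresponds to training on the rewrite prompt alone.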

In [15]:
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)
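A back-of-the-envelope count of the parameters this config adds: each LoRA adapter contributes two low-rank matrices, so a layer with `in` inputs and `out` outputs gains `r * (in + out)` weights. The sketch assumes Phi-2's published dimensions (hidden size 2560, 32 decoder layers) and that each of the four target modules maps 2560 to 2560 once per layer; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, in_features, out_features):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r

# Assumed Phi-2 dimensions: hidden size 2560, 32 decoder layers,
# with q_proj, k_proj, v_proj and dense all mapping 2560 -> 2560.
hidden_size = 2560
num_layers = 32
targets_per_layer = 4

total = num_layers * targets_per_layer * lora_param_count(16, hidden_size, hidden_size)
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total is about 10.5M, in line with the "~10M parameters" figure in the outline near the top.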

In [16]:
args = TrainingArguments(
    output_dir=output_path,
    fp16=True,
    learning_rate=lr,
    optim="adafactor",
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size*2,
    gradient_accumulation_steps=8,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    logging_steps=50,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    report_to='none',
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    )
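The arguments above combine `lr_scheduler_type="cosine"` with `warmup_ratio=0.1`. The sketch below shows the general shape of such a schedule: a linear ramp from 0 to the base learning rate over the first 10% of steps, then a half-cosine decay to 0. It is an illustration rather than the library's code; `lr_at_step` is a hypothetical helper, and the total of 4375 steps is borrowed from the outline near the top.

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-4, warmup_ratio=0.1):
    """Learning rate at a given optimizer step for linear warmup + cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 4375  # assumed total number of optimizer steps
for step in (0, 219, 438, 2000, 4375):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

The peak of 1e-4 is reached at the end of warmup (step 438 here), after which the rate falls smoothly to zero at the final step.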

In [17]:
trainer = SFTTrainer(
    model=model,
    args=args,
    max_seq_length=max_seq_length,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
    peft_config=peft_config,
)




Token indices sequence length is longer than the specified maximum sequence length for this model (2159 > 2048). Running this sequence through the model will result in indexing errors



dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss


In [None]:
trainer.save_model(model_save_path)
tokenizer.save_pretrained(model_save_path)