## Fine-tune a pretrained Meta-Llama-3.1-8B model

#### Using a pretrained model has significant benefits: it reduces computation costs and your carbon footprint, and it lets you use state-of-the-art models without training one from scratch. This notebook fine-tunes the model with the 🤗 Transformers Trainer.

In [22]:
# Ignore insignificant warnings (e.g., deprecation warnings)
import warnings
warnings.filterwarnings('ignore')

In [75]:
from datasets import load_dataset

ds = load_dataset("mi-rei/ClinicalTrial-gov-LLaMA", split="train")
ds = ds.select(range(2000))

In [76]:
# 80/20 split; a fixed seed keeps the split reproducible across runs
ds = ds.train_test_split(test_size=0.2, shuffle=True, seed=42)

In [103]:
ds['train']['text'][100]

'<s>[INST]<<SYS>>You are given the full description of a clinical trial. Please summarize it.<</SYS>> - At the first visit the following information will be collected about the participant: original diagnosis, the date and type of transplant, transplant conditioning regimen, cGVHD prophylaxis regimen, the time when oral cGVHD was first noticed, specific treatments for oral cGVHD, and any current medications. - At each visit, and before the participant begins phototherapy treatment, they will answer a series of questions asking about how their mouth feels and what they are able to eat. A clinical examination of the mouth will be performed and recorded and photographs will be taken of the inside of the mouth. - Participants will then receive phototherapy treatment. This will take approximately three minutes and will involve opening the mouth and closing the eyes. Following phototherapy, the participant will be asked several questions on how they tolerated the treatment. Phototherapy trea
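The record above follows a Llama-2-style prompt layout, with the system prompt wrapped in `<<SYS>> ... <</SYS>>`. As a minimal sketch, the helper below separates the system prompt from the text that follows it. `split_prompt` and `PROMPT_RE` are hypothetical names introduced for illustration, and because the output above is truncated, the sketch makes no assumption about where the instruction ends and the summary begins.

```python
import re

# Hypothetical helper (not part of the dataset's tooling): separate the
# system prompt from the rest of one Llama-2-style example string.
PROMPT_RE = re.compile(r"<<SYS>>(?P<system>.*?)<</SYS>>(?P<body>.*)", re.DOTALL)

def split_prompt(text: str) -> dict:
    match = PROMPT_RE.search(text)
    if match is None:
        raise ValueError("no <<SYS>> ... <</SYS>> block found")
    return {
        "system": match.group("system").strip(),
        "body": match.group("body").strip(),
    }

# Shortened copy of the record printed above
sample = (
    "<s>[INST]<<SYS>>You are given the full description of a clinical trial. "
    "Please summarize it.<</SYS>> - At the first visit the following "
    "information will be collected about the participant"
)
parts = split_prompt(sample)
print(parts["system"])
```

Inspecting a few records this way is a quick check that every example carries the same instruction before committing to a long training run.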

### Load Tokenizer

In [81]:
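from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

def tokenize_function(examples):
    # padding=True pads to the longest sequence in each batch passed by map();
    # truncation=True cuts sequences at the tokenizer's maximum length.
    return tokenizer(examples["text"], padding=True, truncation=True)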

In [82]:
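# The Llama 3.1 tokenizer ships without a padding token. Reusing the existing
# end-of-sequence token avoids adding a new vocabulary entry, which would
# otherwise require resizing the model's embedding matrix.
tokenizer.pad_token = tokenizer.eos_token
tokenized_datasets = ds.map(tokenize_function, batched=True)
#print("Writing datasets to disk...")
#tokenized_datasets.save_to_disk('tk_datasets/llama3_1Finetune_clinical_trial_datasets')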

Map:   0%|          | 0/1600 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]
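To make the effect of `padding=True` and `truncation=True` concrete, here is a toy sketch that works on token ids already produced by some tokenizer. `PAD_ID`, `MAX_LENGTH`, and `pad_and_truncate` are made-up names and values for illustration, and right-side padding is assumed; the real tokenizer handles all of this internally.

```python
# Toy illustration of padding and truncation on already-tokenized sequences.
# PAD_ID and MAX_LENGTH are made-up values for the example.
PAD_ID = 0
MAX_LENGTH = 6

def pad_and_truncate(batch, max_length=MAX_LENGTH, pad_id=PAD_ID):
    # Truncate first, then right-pad every sequence to the longest one left.
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26, 27, 28]]
encoded = pad_and_truncate(batch)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because every sequence is padded to the longest one in its batch, a single very long trial description inflates the whole batch, which is one reason to consider an explicit `max_length` when memory is tight.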

### Shuffle the train and test splits (the dataset was already cut to 2,000 examples above, so `select` keeps every row)

In [83]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1600))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(400))

### Evaluate

In [84]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

In [85]:
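from transformers import EvalPrediction

def compute_metrics(eval_pred: EvalPrediction):
    logits, labels = eval_pred
    # Causal LM: the logits at position i predict the token at position i + 1
    predictions = np.argmax(logits, axis=-1)[:, :-1]
    labels = labels[:, 1:]
    mask = labels != -100  # skip padding and other ignored positions
    return metric.compute(predictions=predictions[mask], references=labels[mask])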
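For a causal language model, the logits at position `i` score the token at position `i + 1`, so token-level accuracy is usually computed after shifting predictions and labels by one and dropping positions labelled `-100`. The sketch below shows that computation on a tiny hand-made batch; `shifted_token_accuracy` is a hypothetical helper and the numbers are invented for illustration.

```python
import numpy as np

def shifted_token_accuracy(logits, labels, ignore_index=-100):
    # Prediction at position i is compared with the label at position i + 1
    preds = np.argmax(logits, axis=-1)[:, :-1]
    targets = labels[:, 1:]
    mask = targets != ignore_index  # drop padding / ignored positions
    return float((preds[mask] == targets[mask]).mean())

# One sequence of 4 positions over a 3-token vocabulary (made-up values)
logits = np.array([[[0.1, 0.8, 0.1],    # argmax 1, next label is 1: correct
                    [0.7, 0.2, 0.1],    # argmax 0, next label is 2: wrong
                    [0.1, 0.1, 0.8],    # next label is -100: ignored
                    [0.3, 0.3, 0.4]]])  # last position has no next token
labels = np.array([[0, 1, 2, -100]])

acc = shifted_token_accuracy(logits, labels)
print(acc)  # 0.5
```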

### Monitor Evaluation

In [86]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="test_trainer",
    eval_strategy="epoch",
    overwrite_output_dir=True,
    save_strategy="epoch",
)
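With `eval_strategy="epoch"` and `save_strategy="epoch"`, evaluation and checkpointing happen once per pass over the training set. A rough sketch of how many optimizer steps that implies for the 1,600 training examples follows. It assumes a single device and, to my knowledge, the library defaults of `per_device_train_batch_size=8`, `gradient_accumulation_steps=1`, and `num_train_epochs=3`, none of which are set explicitly above; `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size=8, num_devices=1,
                    gradient_accumulation_steps=1, num_epochs=3):
    # Examples consumed per optimizer update
    effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * num_epochs

steps_per_epoch, total_steps = optimizer_steps(1600)
print(steps_per_epoch, total_steps)  # 200 600
```

Under these assumptions the run evaluates and saves a checkpoint every 200 steps, three times in total.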

### Load Model

In [87]:
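# Load the model directly. The checkpoint must match the tokenizer loaded above
# (meta-llama/Meta-Llama-3.1-8B); a Llama 2 checkpoint has a different vocabulary.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")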

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
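Before training, it helps to estimate whether full fine-tuning fits in memory. The back-of-the-envelope sketch below assumes fp32 weights, fp32 gradients, and an Adam-style optimizer that keeps two fp32 states per parameter. It ignores activations, and it takes a round 8 billion parameters from the model name in the title rather than an exact count. `full_finetune_memory_gb` is a hypothetical helper.

```python
def full_finetune_memory_gb(num_params, bytes_weights=4, bytes_grads=4,
                            bytes_optimizer=8):
    # weights + gradients + two optimizer states, all in fp32 by assumption
    bytes_per_param = bytes_weights + bytes_grads + bytes_optimizer
    return num_params * bytes_per_param / 1e9

estimate = full_finetune_memory_gb(8e9)
print(f"~{estimate:.0f} GB before activations")  # ~128 GB before activations
```

If that figure exceeds the available accelerator memory, lower-precision weights or parameter-efficient methods are the usual alternatives, though neither is used in this notebook.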

### Train

In [104]:
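from transformers import DataCollatorForLanguageModeling

# Copies input_ids into labels (padding masked out) so the model returns a loss;
# mlm=False selects the causal language modeling objective.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)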

In [None]:
trainer.train()
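Once training finishes, the evaluation loss is often easier to read as perplexity, which is the exponential of the mean cross-entropy loss. The sketch below assumes the loss is available under the key `eval_loss`, as in the metrics dictionary returned by `trainer.evaluate()`. The value used here is made up for illustration and is not a result from this run.

```python
import math

def perplexity(eval_loss: float) -> float:
    # Perplexity is exp(mean cross-entropy loss in nats)
    return math.exp(eval_loss)

metrics = {"eval_loss": 2.0}  # placeholder value, not a measured result
ppl = perplexity(metrics["eval_loss"])
print(f"perplexity: {ppl:.2f}")
```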