# How to finetune HuggingFace models on text data of any size and format with custom splitting (not random)

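This notebook shows one way to fine-tune on text data of any size and format using a custom train/test split. Random splitting is not recommended for protein sequences, because closely related sequences can end up in both the training and the test set.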

In [1]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import numpy as np
from Bio import SeqIO
from datasets import Dataset, DatasetDict
from BioML.utilities import split_methods

## https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb
## https://github.com/huggingface/notebooks/blob/main/examples/protein_language_modeling-tf.ipynb

## Load data

Name the target column `labels` so that `Trainer` can recognize it.  
`Dataset` can be used for any use case involving large files; it does not depend on transformers.  
Outside `Trainer`, you would need a PyTorch `DataLoader` to turn it into batches. The tokenizer only returns input IDs and attention masks, so it remains to be checked whether the batches also contain the labels.

In [3]:
def fasta_generator(fasta_file: str="../data/whole_sequence.fasta"):
    with open(fasta_file, 'r') as f:
        seqs = SeqIO.parse(f, 'fasta')
        for seq in seqs:
            yield {"id":seq.id, "seq":str(seq.seq)}

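b = Dataset.from_generator(fasta_generator, gen_kwargs={"fasta_file":"../data/whole_sequence.fasta"})
# Random placeholder labels (0 or 1), used here only to demonstrate the workflow
y = np.random.randint(0, 2, size=len(b))
# The target column must be called "labels" so that Trainer recognizes it
dataset = b.add_column("labels", y)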

## Custom splitting with indices

In [4]:
cluster = split_methods.ClusterSpliter("../data/resultsDB_clu.tsv")
train, test = cluster.train_test_split(range(len(dataset)), index=dataset["id"])

In [5]:
new = DatasetDict({"train":dataset.select(train), "test":dataset.select(test)})
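
The internals of `ClusterSpliter` are not shown in this notebook. As a rough illustration of the idea only, the sketch below assumes a two-column, tab-separated cluster table (representative ID, member ID) and assigns whole clusters to one split, so that no cluster is shared between train and test. The names `cluster_train_test_split` and `read_cluster_tsv` and their parameters are hypothetical and are not part of BioML.

```python
import random
from collections import defaultdict


def cluster_train_test_split(ids, member_to_cluster, test_size=0.2, seed=0):
    """Split positions of `ids` so that every cluster lands in exactly one split.

    ids: sequence IDs in dataset order.
    member_to_cluster: dict mapping sequence ID -> cluster representative ID.
    Returns (train_indices, test_indices) as sorted lists of integer positions.
    """
    # Group dataset positions by cluster; IDs missing from the table become singletons.
    clusters = defaultdict(list)
    for position, seq_id in enumerate(ids):
        clusters[member_to_cluster.get(seq_id, seq_id)].append(position)

    # Shuffle clusters (not individual sequences) reproducibly.
    names = sorted(clusters)
    random.Random(seed).shuffle(names)

    # Fill the test split cluster by cluster until it reaches the target size.
    target = round(test_size * len(ids))
    train, test = [], []
    for name in names:
        bucket = test if len(test) < target else train
        bucket.extend(clusters[name])
    return sorted(train), sorted(test)


def read_cluster_tsv(path):
    """Read 'representative<TAB>member' lines into a member -> representative dict."""
    mapping = {}
    with open(path) as handle:
        for line in handle:
            if line.strip():
                representative, member = line.rstrip("\n").split("\t")[:2]
                mapping[member] = representative
    return mapping


# Toy example: six sequences in three clusters (assumed data, for illustration).
ids = ["a1", "a2", "a3", "b1", "b2", "c1"]
member_to_cluster = {"a1": "a1", "a2": "a1", "a3": "a1",
                     "b1": "b1", "b2": "b1", "c1": "c1"}
train_idx, test_idx = cluster_train_test_split(ids, member_to_cluster, test_size=0.3)
print(train_idx, test_idx)
```

Index lists of this kind can be passed to `dataset.select(...)` in the same way as `train` and `test` in the cell above.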

## Load model

In [6]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Overridden to force CPU for this run
device="cpu"

In [7]:
model = AutoModelForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")

Some weights of EsmForSequenceClassification were not initialized from the model checkpoint at facebook/esm2_t6_8M_UR50D and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
new["train"] = new["train"].map(lambda examples: tokenizer(examples["seq"], return_tensors="np",padding=True, truncation=True), batched=True)
new["test"] = new["test"].map(lambda examples: tokenizer(examples["seq"], return_tensors="np",padding=True, truncation=True), batched=True)

Map:   0%|          | 0/117 [00:00<?, ? examples/s]

Map:   0%|          | 0/30 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'seq', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 117
})

### Create the training arguments

In [8]:
from transformers import TrainingArguments, Trainer

In [9]:
lr = 8e-5
bs = 1
epochs = 4

Set `use_cpu` to `False` when you want to use GPUs (the Trainer will then use them automatically). When `fp16` is `True`, training only works on GPUs.

In [10]:
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.2, lr_scheduler_type='cosine', fp16=False,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to=['mlflow'],
    load_best_model_at_end=True, metric_for_best_model="matthews_correlation", 
    save_total_limit=3, save_strategy="epoch", seed=3242342, gradient_accumulation_steps=4, use_cpu=True) 

## lr_scheduler_type='cosine' decays the learning rate along a cosine curve after the warmup
## weight_decay is applied by the AdamW optimizer -> this is what fast.ai does
## fp16 is half precision -> mixed-precision training (using fp32 and fp16)
## save_total_limit=3 -> only 3 checkpoints are kept
## save_strategy="epoch" saves a checkpoint at the end of each epoch (the default "steps" strategy saves every 500 steps)
## report_to=['mlflow'] sends the logs to MLflow
# Open question: how to inspect the results in MLflow?
# The LR finder does not give reliable results for Transformers models
## Train the model

In [11]:
import evaluate

You can use your own function as an evaluation metric; it then has to return a dict.  
Or you can use the `evaluate` library from Hugging Face to load different metrics: [evaluate](https://huggingface.co/docs/evaluate/a_quick_tour)


In [12]:
METRICS = ["accuracy", "f1", "matthews_correlation", "precision", "recall"]
# Load each metric once instead of on every evaluation call
LOADED_METRICS = {metric: evaluate.load(metric) for metric in METRICS}

def compute_metrics(eval_pred):
    # For evaluation, Trainer passes only the logits and the labels.
    # (The classification model itself can also return the loss, attentions
    # and hidden states, but those are not passed to this function.)
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Each metric returns a dict keyed by its own name, e.g. {"accuracy": 0.5}
    results = {metric: LOADED_METRICS[metric].compute(predictions=predictions, references=labels)[metric]
               for metric in METRICS}
    return results

In [13]:
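class CustomTrainer(Trainer):
    # **kwargs absorbs extra arguments that newer transformers versions may pass
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Compute the loss directly; calling self.compute_loss here would recurse forever.
        # Cross-entropy is used as an example and can be swapped for any custom loss.
        loss = torch.nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss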

In [14]:
trainer = Trainer(model, args, train_dataset=new['train'], eval_dataset=new['test'], # we need to pass tokenized datasets
                  tokenizer=tokenizer, compute_metrics=compute_metrics)

In [15]:
trainer.train()

  0%|          | 0/116 [00:00<?, ?it/s]

  0%|          | 0/15 [00:00<?, ?it/s]

{'eval_loss': 0.7221236824989319, 'eval_accuracy': 0.5333333333333333, 'eval_f1': 0.6956521739130436, 'eval_matthews_correlation': 0.0, 'eval_precision': 0.5333333333333333, 'eval_recall': 1.0, 'eval_runtime': 31.508, 'eval_samples_per_second': 0.952, 'eval_steps_per_second': 0.476, 'epoch': 0.99}


  0%|          | 0/15 [00:00<?, ?it/s]

{'eval_loss': 0.7303746938705444, 'eval_accuracy': 0.5333333333333333, 'eval_f1': 0.6956521739130436, 'eval_matthews_correlation': 0.0, 'eval_precision': 0.5333333333333333, 'eval_recall': 1.0, 'eval_runtime': 31.2142, 'eval_samples_per_second': 0.961, 'eval_steps_per_second': 0.481, 'epoch': 1.98}


  0%|          | 0/15 [00:00<?, ?it/s]

{'eval_loss': 0.7445932030677795, 'eval_accuracy': 0.5333333333333333, 'eval_f1': 0.6956521739130436, 'eval_matthews_correlation': 0.0, 'eval_precision': 0.5333333333333333, 'eval_recall': 1.0, 'eval_runtime': 31.2265, 'eval_samples_per_second': 0.961, 'eval_steps_per_second': 0.48, 'epoch': 2.97}


  0%|          | 0/15 [00:00<?, ?it/s]

{'eval_loss': 0.753971517086029, 'eval_accuracy': 0.6, 'eval_f1': 0.7272727272727273, 'eval_matthews_correlation': 0.2857142857142857, 'eval_precision': 0.5714285714285714, 'eval_recall': 1.0, 'eval_runtime': 31.5365, 'eval_samples_per_second': 0.951, 'eval_steps_per_second': 0.476, 'epoch': 3.97}
{'train_runtime': 618.7594, 'train_samples_per_second': 0.756, 'train_steps_per_second': 0.187, 'train_loss': 0.6006371070598734, 'epoch': 3.97}


TrainOutput(global_step=116, training_loss=0.6006371070598734, metrics={'train_runtime': 618.7594, 'train_samples_per_second': 0.756, 'train_steps_per_second': 0.187, 'train_loss': 0.6006371070598734, 'epoch': 3.97})
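
The evaluation metrics logged above can be checked by hand. With 30 test sequences, a recall of 1.0, and precision equal to accuracy (0.533), the first three epochs are consistent with the model predicting the positive class for every sequence, which is why the Matthews correlation coefficient is 0.0 there. The sketch below recomputes the metrics from confusion-matrix counts; the counts are inferred from the logged values, not read from the run, and `binary_metrics` is a hypothetical helper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "matthews_correlation": mcc,
    }


# Epochs 1 to 3: every sequence predicted positive (16 true, 14 false positives).
all_positive = binary_metrics(tp=16, fp=14, tn=0, fn=0)
# Epoch 4: two negatives are now classified correctly.
last_epoch = binary_metrics(tp=16, fp=12, tn=2, fn=0)

print(all_positive)
print(last_epoch)
```

Since the labels in this notebook are random placeholders, scores near chance level are expected.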

## Search for hyperparameters such as the learning rate, which is the most important

Actually, both batch size and learning rate matter: smaller batch sizes tend to work better than large ones, but the learning rate depends on the batch size as well, and a larger batch needs a higher learning rate.

Fix everything else and tune the learning rate. The learning rate finder doesn't seem to work very well for transformers?  
But the idea of a learning rate finder is just to test different learning rates, so could I not simply test them directly?
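
One way to act on this is a small manual sweep: a few short runs at log-spaced learning rates while everything else stays fixed. The sketch below shows only the sweep logic. `log_spaced`, `sweep`, and `run_trial` are hypothetical names; in practice `run_trial` would build fresh `TrainingArguments` with the given learning rate, train briefly, and return a validation metric.

```python
import math


def log_spaced(low, high, count):
    """Return `count` learning rates evenly spaced on a log scale between low and high."""
    if count == 1:
        return [low]
    ratio = (high / low) ** (1 / (count - 1))
    return [low * ratio**i for i in range(count)]


def sweep(learning_rates, run_trial):
    """Run one trial per learning rate and return (best_lr, all_scores)."""
    scores = {lr: run_trial(lr) for lr in learning_rates}
    best_lr = max(scores, key=scores.get)
    return best_lr, scores


def run_trial(lr):
    # Hypothetical placeholder score that peaks near 1e-4. Replace it with a
    # short training run that returns, e.g., eval_matthews_correlation.
    return -abs(math.log10(lr) - math.log10(1e-4))


# Five candidates, roughly 1e-6, 1e-5, 1e-4, 1e-3 and 1e-2.
candidates = log_spaced(1e-6, 1e-2, 5)
best_lr, scores = sweep(candidates, run_trial)
print(f"best learning rate: {best_lr:.0e}")
```

Each trial should start from a freshly loaded model, otherwise later learning rates would be tested on already fine-tuned weights.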

ktrain: a wrapper for many tasks that includes a learning rate finder: [ktrain](https://github.com/amaiya/ktrain)

Perhaps use PyTorch Lightning: [pytorch_lightning_huggingface](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_PyTorch_Lightning.ipynb)