In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive') 

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to do so.

In [None]:
! pip install datasets transformers

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!). Then uncomment the following cell and enter your username and password. This only works on Colab; in a regular notebook, you need to do this in a terminal:

In [None]:
# !huggingface-cli login

Then you need to install Git-LFS and set up Git if you haven't already. Run the following instructions after adapting them with your name and email:

In [None]:
!apt install git-lfs
!git config --global user.email "jirarotej@gmail.com"
!git config --global user.name "Jirarote Jirasirikul"

Make sure your version of Transformers is at least 4.8.1, since the functionality for pushing models to the Hub was introduced in that version:

In [None]:
import transformers

print(transformers.__version__)
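
The printed version can also be checked programmatically. The following is a minimal sketch, not part of the original notebook: `version_tuple` and `meets_minimum` are hypothetical helper names, and the parsing only looks at the leading numeric components of the version string.

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn '4.8.1' or '4.9.0.dev0' into a tuple of its leading integer parts."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.8.1") -> bool:
    """Compare two version strings component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("4.8.1"))
print(meets_minimum("4.6.1"))
```

In the notebook itself, the call would presumably be `meets_minimum(transformers.__version__)`.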

In [None]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/text-classification).

# Fine-tuning a model on a text classification task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a text classification task of the [GLUE Benchmark](https://gluebenchmark.com/).

![Widget inference on a text classification task](https://github.com/huggingface/notebooks/blob/master/examples/images/text_classification.png?raw=1)

The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences which are:

- [CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not.
- [MNLI](https://arxiv.org/abs/1704.05426) (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. (This dataset has two versions, one with the validation and test set coming from the same distribution, another called mismatched where the validation and test use out-of-domain data.)
- [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases of one another or not.
- [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. (This dataset is built from the SQuAD dataset.)
- [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Quora Question Pairs) Determine if two questions are semantically equivalent or not.
- [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
- [SST-2](https://nlp.stanford.edu/sentiment/index.html) (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.
- [STS-B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 0 to 5.
- [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) (Winograd Natural Language Inference) Determine if a sentence with an ambiguous pronoun and a sentence with this pronoun replaced are entailed or not. (This dataset is built from the Winograd Schema Challenge dataset.)

We will see how to easily load the dataset for each one of those tasks and use the `Trainer` API to fine-tune a model on it. Each task is named by its acronym, with `mnli-mm` standing for the mismatched version of MNLI (so same training set as `mnli` but different validation and test sets):

In [None]:
# GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

This notebook is built to run on any of the tasks in the list above, with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a classification head. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters (`task`, `model_checkpoint` and `BATCH_SIZE`, defined further down in this notebook), and the rest of the notebook should run smoothly.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:
import json
import os
import pandas as pd
from datasets import load_dataset, load_metric, Dataset, DatasetDict, ClassLabel

Apart from `mnli-mm` being a special code, we can directly pass our task name to those functions. `load_dataset` will cache the dataset to avoid downloading it again the next time you run this cell.

In [None]:
# actual_task = "mnli" if task == "mnli-mm" else task
# dataset = load_dataset("glue", actual_task)
# metric = load_metric('glue', actual_task)
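
The special handling of `mnli-mm` can be sketched as a small helper. This is an illustration only: `resolve_glue_task` is a hypothetical name, and the split keys `validation_matched` and `validation_mismatched` are the ones the GLUE `mnli` configuration is assumed to expose.

```python
def resolve_glue_task(task: str):
    """Return (dataset configuration name, validation split key) for a task name."""
    # "mnli-mm" is not a dataset configuration of its own: it reuses the "mnli" data
    actual_task = "mnli" if task == "mnli-mm" else task
    if task == "mnli-mm":
        validation_key = "validation_mismatched"
    elif task == "mnli":
        validation_key = "validation_matched"
    else:
        validation_key = "validation"
    return actual_task, validation_key


for name in ["cola", "mnli", "mnli-mm"]:
    print(name, resolve_glue_task(name))
```

The first element is what would be passed to `load_dataset("glue", ...)`; the second selects the split to evaluate on.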

### HoC

In [None]:
HOC_LABEL_LIST = ['label_IM', 'label_ID', 'label_CE', 'label_RI', 'label_GS', 'label_GI', 'label_A', 'label_CD', 'label_PS', 'label_TPI']
        
def extract_hoc_label(df_input):
    try:
        temp = df_input['labels'].str.split(',').apply(lambda x: [int(i.split('_')[1]) for i in x])
        temp_df = pd.DataFrame(temp.tolist())
        temp_df.columns = HOC_LABEL_LIST
        df_output = pd.concat([df_input,temp_df], axis=1)
        print("Extract HoC label")
        return df_output
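    except Exception as err:
        # Fail loudly: silently returning None would only break Dataset.from_pandas later
        raise ValueError("Could not parse the HoC 'labels' column") from err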

In [None]:
DATAPATH = "/content/drive/MyDrive/MinorThesis/"
DATASET = "HoC"

temppath_train = os.path.join(DATAPATH,"datasets","raw",DATASET,"train.tsv")
temppath_valid = os.path.join(DATAPATH,"datasets","raw",DATASET,"dev.tsv")
temppath_test = os.path.join(DATAPATH,"datasets","raw",DATASET,"test.tsv")

df_train = pd.read_csv(temppath_train, sep='\t')
df_test = pd.read_csv(temppath_test, sep='\t')
df_valid = pd.read_csv(temppath_valid, sep='\t')

df_train = extract_hoc_label(df_train)
df_test = extract_hoc_label(df_test)
df_valid = extract_hoc_label(df_valid)

dataset = DatasetDict({
    'train':Dataset.from_pandas(df_train),
    'validation':Dataset.from_pandas(df_valid),
    'test':Dataset.from_pandas(df_test),
})

dataset = dataset.remove_columns(['labels'])
# for col in HOC_LABEL_LIST:
#     dataset = dataset.class_encode_column(col)
# dataset.rename_column_("final_decision", "labels")
dataset.num_rows
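
To make the label parsing in `extract_hoc_label` concrete, here is a minimal pure-Python sketch of the same idea on a single row. It assumes, based on the parsing code above, that each entry of the `labels` column is a comma-separated string of `index_value` pairs; the example string and the helper name `parse_hoc_labels` are made up for illustration.

```python
HOC_LABEL_LIST = ['label_IM', 'label_ID', 'label_CE', 'label_RI', 'label_GS',
                  'label_GI', 'label_A', 'label_CD', 'label_PS', 'label_TPI']


def parse_hoc_labels(label_string: str) -> list:
    """Convert '0_0,1_1,2_0' into [0, 1, 0] by keeping the value after each underscore."""
    return [int(item.split("_")[1]) for item in label_string.split(",")]


# Hypothetical row: hallmarks 1 and 7 are present, all others absent
example = "0_0,1_1,2_0,3_0,4_0,5_0,6_0,7_1,8_0,9_0"
flags = parse_hoc_labels(example)
row = dict(zip(HOC_LABEL_LIST, flags))
print(row)
```

Each hallmark then becomes its own binary column, which is what allows one binary classifier to be trained per label later on.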

In [None]:
# dataset = get_pubmedqa_fold(list_data_fold,df_test,0)
# dataset.num_rows

The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for the training, validation and test set (with more keys for the mismatched validation and test set in the special case of `mnli`).

To access an actual element, you need to select a split first, then give an index:

In [None]:
dataset["train"].features

In [None]:
dataset["train"][0]

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(dataset["train"])
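
The index-picking loop in `show_random_elements` draws distinct positions by retrying on duplicates. As an alternative sketch (not the notebook's own code), the standard library's `random.sample` gives the same guarantee in one call; `pick_unique_indices` is a hypothetical helper name.

```python
import random


def pick_unique_indices(dataset_length: int, num_examples: int = 10) -> list:
    """Draw `num_examples` distinct row indices from range(dataset_length)."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample never returns the same position twice
    return random.sample(range(dataset_length), num_examples)


picks = pick_unique_indices(100, 10)
print(sorted(picks))
```

The resulting list could be used to index a split in the same way as `picks` in the function above.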

The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [None]:
# https://github.com/huggingface/datasets/tree/master/metrics
from datasets import load_metric
metric = load_metric("accuracy")

# metric = load_metric("https://github.com/huggingface/datasets/blob/master/metrics/accuracy/accuracy.py")
# Example of typical usage
# for batch in dataset:
#     inputs, references = batch
#     predictions = model(inputs)
#     metric.add_batch(predictions=predictions, references=references)
# score = metric.compute()

In [None]:
metric

You can call its `compute` method with your predictions and labels directly and it will return a dictionary with the metric(s) value:

In [None]:
import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)
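
For intuition, the value reported by the accuracy metric is just the fraction of predictions that match their reference. The following is a hand-written sketch of that calculation, not the library's implementation; it returns a dictionary keyed by `"accuracy"` to mirror the shape of the result above.

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # {'accuracy': 0.75}
```

With random predictions and random binary labels, as in the cell above, the value should hover around 0.5.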

## Fine-tuning the model

In [None]:
# # MY GLOBAL FUNCTION - 

ENABLE_LOGS = 1
def print_log(*arg, log_type="Info"):
    global ENABLE_LOGS
    if(ENABLE_LOGS==1 or log_type!="Info"): 
        print("["+log_type+"]"," ".join(str(x) for x in arg))

In [None]:
import seaborn as sns
import torch
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             precision_recall_fscore_support)
class my_evaluator:
    LABELS_HOC_FULL = ['activating invasion and metastasis', 'avoiding immune destruction',
                  'cellular energetics', 'enabling replicative immortality', 'evading growth suppressors',
                  'genomic instability and mutation', 'inducing angiogenesis', 'resisting cell death',
                  'sustaining proliferative signaling', 'tumor promoting inflammation']
    LABELS_HOC_SHORT = ['label_IM', 'label_ID', 
                       'label_CE', 'label_RI', 'label_GS', 
                       'label_GI', 'label_A', 'label_CD', 
                       'label_PS', 'label_TPI']
    @classmethod
    def divide(self, x, y):
        # np.float was removed in NumPy 1.24; the built-in float is equivalent here
        return np.true_divide(x, y, out=np.zeros_like(x, dtype=float), where=y != 0)

    @classmethod
    def get_p_r_f_arrary(self, test_predict_label, test_true_label):
        num, cat = test_predict_label.shape
        # print(num,cat)
        acc_list = []
        prc_list = []
        rec_list = []
        f_score_list = []
        for i in range(num):
            # print(test_predict_label[i])
            # print(test_true_label[i])

            acc = accuracy_score(test_true_label[i], test_predict_label[i])
            prc,rec,f_score,_ = precision_recall_fscore_support(test_true_label[i], test_predict_label[i], average='macro')

            if prc == 0 and rec == 0:
                f_score = 0
            else:
                f_score = 2 * prc * rec / (prc + rec)

            acc_list.append(acc)
            prc_list.append(prc)
            rec_list.append(rec)
            f_score_list.append(f_score)

        # print(prc_list)
        # print(rec_list)

        mean_prc = np.mean(prc_list)
        mean_rec = np.mean(rec_list)
        f_score = self.divide(2 * mean_prc * mean_rec, (mean_prc + mean_rec))
        return mean_prc, mean_rec, f_score

    @classmethod
    def hoc_sentence2doc(self, input_df):
        # Output variables
        data = {}
        input_labels_count = dict(zip(self.LABELS_HOC_FULL, [0]*len(self.LABELS_HOC_FULL))) # sentence

        # Group sentence back into documents      
        for i in range(len(input_df)):
            input_row = input_df.iloc[i]
            
            key = input_row['index'][:input_row['index'].find('_')]

            if key not in data:
                data[key] = set()

            if not pd.isna(input_row['labels']):
                for l in input_row['labels'].split(','):
                    ind,val = l.split('_')
                    if(val != '0'):
                        data[key].add(self.LABELS_HOC_FULL[int(ind)])
                        input_labels_count[self.LABELS_HOC_FULL[int(ind)]] += 1

        return data, input_labels_count

    @classmethod
    def hoc_labels2np(self, data):
        labels_list = dict(zip(self.LABELS_HOC_FULL, [[],[],[],[],[],[],[],[],[],[]]))

        y_np = []

        for k, v in data.items():
            # print(k)
            # print(true,pred)
            t = [0] * len(self.LABELS_HOC_FULL)
            for i in v:
                t[self.LABELS_HOC_FULL.index(i)] = 1

            y_np.append(t)

            for lab in self.LABELS_HOC_FULL:
                if(lab in v):
                    labels_list[lab].append(1)
                else:
                    labels_list[lab].append(0)

        return np.array(y_np),labels_list

    @classmethod
    def analysis_hoc(self, input_df):
        # Reformat to Paper evaluator format
        input_df = input_df[['filename_line','label']].copy()
        input_df.columns = ['index','labels']

        data, labels_counts_sen = self.hoc_sentence2doc(input_df)

        print_log('HoC Dataset Details')
        print_log('No. of Documents:',len(data))
        print_log('No. of Sentences:',len(input_df))
        # print(labels_counts_sen)

        y_np, labels_counts_doc = self.hoc_labels2np(data)
        # print(y_np)
        # print(labels_counts_doc)
        res = pd.DataFrame()
        for lab in self.LABELS_HOC_FULL:
            print_log(lab,sum(labels_counts_doc[lab]),len(labels_counts_doc[lab]))
            temp_df = pd.DataFrame([[sum(labels_counts_doc[lab]),len(labels_counts_doc[lab])]], columns=['sum','len'], index=[lab])
            res = pd.concat([res, temp_df])  # DataFrame.append was removed in pandas 2.0

        return res

    @classmethod
    def eval_hoc(self, input_df):   
        print_log("eval hoc",log_type="Function")
    
        # Reformat to Paper evaluator format
        ## Label need to be in a format of list of 10 cancers in fixed order
        true_df = input_df[['filename_line','label']].copy()
        pred_df = input_df[['filename_line','prediction']].copy()
        true_df.columns = ['index','labels']
        pred_df.columns = ['index','labels']

        # Group sentence back into documents 
        data_true, true_labels_count_sen = self.hoc_sentence2doc(true_df)
        data_pred, pred_labels_count_sen = self.hoc_sentence2doc(pred_df)

        # merge data_true/pred into format of {'key':(set(true),set(pred))}
        assert data_true.keys() == data_pred.keys(), 'Key mismatch'
        all_keys = set(data_true.keys()).union(data_pred.keys()) 

        data = {}
        for k in all_keys:
            data[k] = (data_true[k],data_pred[k]) 
        # print(data)
        assert len(data) == 371, 'There are 371 documents in the test set: %d' % len(data)
        
        print_log('HoC Dataset Details')
        print_log('No. of Documents:',len(data))
        print_log('No. of Sentences:',len(true_df),'/',len(pred_df))
        print(true_labels_count_sen,pred_labels_count_sen)

        # Write into dataframe
        res_count_sen = pd.DataFrame()
        for lab in self.LABELS_HOC_FULL:
            temp_df = pd.DataFrame([[true_labels_count_sen[lab],pred_labels_count_sen[lab],len(true_df)]], columns=['sentence_count_label','sentence_count_pred','sentence_count_total'], index=[lab])
            res_count_sen = pd.concat([res_count_sen, temp_df])  # DataFrame.append was removed in pandas 2.0
        # print(res_count)

        y_test, true_labels_count_doc = self.hoc_labels2np(data_true)
        y_pred, pred_labels_count_doc = self.hoc_labels2np(data_pred)
        
        res_count_doc = pd.DataFrame()
        for lab in self.LABELS_HOC_FULL:
            temp_df = pd.DataFrame([[sum(true_labels_count_doc[lab]),sum(pred_labels_count_doc[lab]),len(true_labels_count_doc[lab])]], columns=['doc_count_label','doc_count_pred','doc_count_total'], index=[lab])
            res_count_doc = pd.concat([res_count_doc, temp_df])  # DataFrame.append was removed in pandas 2.0
        res_confmat = pd.DataFrame(columns=['tn','fp','fn','tp'])

        # print(true_labels_list,pred_labels_list)
        for lab in self.LABELS_HOC_FULL:
            # print(lab)
            df2 = pd.DataFrame([list(confusion_matrix(true_labels_count_doc[lab],pred_labels_count_doc[lab]).ravel())], columns=['tn','fp','fn','tp'], index=[lab])
            res_confmat = pd.concat([res_confmat, df2])  # DataFrame.append was removed in pandas 2.0
        # print(res_confmat)
        df_res = pd.concat([res_confmat,res_count_sen,res_count_doc], axis=1)
        print(df_res)

        # get_p_r_f_arrary returns (precision, recall, f-score), in that order
        p, r, f1 = self.get_p_r_f_arrary(y_pred, y_test)
        print('Precision: {:.6f}'.format(p))
        print('Recall   : {:.6f}'.format(r))
        print('F1       : {:.6f}'.format(f1))
        return float(r), float(p), float(f1) , df_res

    @classmethod
    def multi_acc(self, y_pred, y_test):
        y_pred_softmax = torch.log_softmax(y_pred, dim = 1)
        _, y_pred_tags = torch.max(y_pred_softmax, dim = 1)    
        
        correct_pred = (y_pred_tags == y_test).float()
        acc = correct_pred.sum() / len(correct_pred)
        
        acc = torch.round(acc * 100)
        
        return acc

    @classmethod
    def eval_pubmedqa(self, input_df):
        class2idx = {
                'no':0,
                'maybe':1,
                'yes':2
        }
        idx2class = {v: k for k, v in class2idx.items()}
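        # Assign the result back instead of relying on inplace replacement of a column selection
        input_df['label'] = input_df['label'].replace(idx2class)
        input_df['prediction'] = input_df['prediction'].replace(idx2class)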

        confusion_matrix_df = pd.DataFrame(confusion_matrix(input_df.label.values, input_df.prediction.values))
        class_report = classification_report(input_df.label.values, input_df.prediction.values, digits=3,output_dict = True)
        print(class_report)
        sns.heatmap(confusion_matrix_df, annot=True)
        return class_report , confusion_matrix_df

    @classmethod
    def eval_bioasq(self, input_df):
        class2idx = {
                'no':0,
                'yes':1
        }
        idx2class = {v: k for k, v in class2idx.items()}
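        # Assign the result back instead of relying on inplace replacement of a column selection
        input_df['label'] = input_df['label'].replace(idx2class)
        input_df['prediction'] = input_df['prediction'].replace(idx2class)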

        confusion_matrix_df = pd.DataFrame(confusion_matrix(input_df.label.values, input_df.prediction.values))
        class_report = classification_report(input_df.label.values, input_df.prediction.values, digits = 3,output_dict = True)
        print(class_report)
        sns.heatmap(confusion_matrix_df, annot=True)


        acc = accuracy_score(input_df.label.values, input_df.prediction.values)
        prc,rec,f_score,_ = precision_recall_fscore_support(input_df.label.values, input_df.prediction.values, average='macro')

        print("acc",acc)
        print("prc",prc)
        print("rec",rec)
        print("f_score",f_score)

        return class_report , confusion_matrix_df

# eval_hoc(temp_df)
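
The evaluator scores HoC at the document level: sentence-level labels are first merged back into one label set per document, as `hoc_sentence2doc` does. Below is a minimal pure-Python sketch of that grouping step. The sentence identifiers are invented for illustration, and the sketch assumes that the document key is everything before the first underscore and that labels use the `index_value` string format.

```python
def group_sentences_by_document(rows):
    """Merge sentence-level label strings into one set of active label indices per document.

    rows: iterable of (sentence_id, label_string) pairs, e.g. ('12345_s1', '0_0,1_1').
    """
    documents = {}
    for sentence_id, label_string in rows:
        doc_id = sentence_id.split("_", 1)[0]
        active = documents.setdefault(doc_id, set())
        for item in label_string.split(","):
            index, value = item.split("_")
            if value != "0":
                active.add(int(index))
    return documents


# Hypothetical sentences from two documents
rows = [
    ("12345_s1", "0_0,1_1,2_0"),
    ("12345_s2", "0_1,1_0,2_0"),
    ("67890_s1", "0_0,1_0,2_0"),
]
docs = group_sentences_by_document(rows)
print(docs)
```

A document is therefore tagged with a hallmark as soon as any one of its sentences carries it, and a document with no positive sentence keeps an empty set.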

#### For all folds

In [None]:
import torch
torch.cuda.empty_cache()

In [None]:
task = "HoC"
model_checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
# model_checkpoint = "dmis-lab/biobert-base-cased-v1.1"
# model_checkpoint = "bert-base-uncased"

LEARNING_RATE = 1e-5
BATCH_SIZE = 8
NUM_LABELS = 2
MAX_LENGTH = 512

In [None]:
metric_name = "accuracy"
metric = load_metric("accuracy")

model_name = model_checkpoint.split("/")[-1]

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # predictions = predictions[:, 0]
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
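
In `compute_metrics`, `np.argmax(predictions, axis=1)` turns each row of logits into the index of its largest entry. A pure-Python sketch of the same step, with made-up logits for a binary classifier (`argmax_rows` is a hypothetical helper name):

```python
def argmax_rows(logits):
    """Return, for each row, the index of the largest value (first index on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Hypothetical logits for three examples and two classes
logits = [[2.1, -0.3], [-1.0, 0.4], [0.2, 0.9]]
predicted_classes = argmax_rows(logits)
print(predicted_classes)  # [0, 1, 1]
```

The resulting class indices are what get compared against the reference labels by the metric.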

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

def preprocess_function(examples):
    return tokenizer(examples["sentence"], truncation=True,padding=True,max_length=MAX_LENGTH)
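
To illustrate what `truncation=True`, `padding=True` and `max_length` do to a batch, here is a toy sketch that works on already-tokenized id lists. It is not the tokenizer's implementation: it ignores special tokens, and `pad_and_truncate` and the pad id `0` are assumptions made for the example.

```python
def pad_and_truncate(batch_token_ids, max_length, pad_id=0):
    """Cut every sequence to max_length, then pad all of them to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_token_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        padding = longest - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(batch["input_ids"])       # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Note that padding to the longest sequence happens per call, so sequences that were processed in different calls can end up with different padded lengths.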

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, matthews_corrcoef
from sklearn.metrics import precision_recall_fscore_support,accuracy_score

# r, p, f1, _ = my_evaluator.eval_hoc(output_df)

In [None]:
def format_output(list_output,filename="result_adapter_hoc.csv"):
    list_df_labels = []
    list_df_preds = []
    for k,v in list_output.items():
        print(k,v)
        y_pred = np.argmax(v.predictions, axis=1)
        temp_df = pd.DataFrame(data=y_pred, columns=[k])
        list_df_preds.append(temp_df)

        y_true = v.label_ids
        temp_df = pd.DataFrame(data=y_true, columns=[k])
        list_df_labels.append(temp_df)

    temp_df_lab = pd.DataFrame(data=dataset["test"]['index'], columns=["filename_line"])
    temp_df_lab


    list_cancer = ['label_IM', 'label_ID', 'label_CE', 'label_RI', 'label_GS', 'label_GI', 'label_A', 'label_CD', 'label_PS', 'label_TPI']
        
    temp_df = pd.concat(list_df_labels,axis=1)
    for col in temp_df.columns:
        print(col)
        temp_df[col] = str(list_cancer.index(col))+'_'+temp_df[col].astype(str)
    temp_df['label'] = temp_df.apply(lambda x: ','.join(x.dropna().values.tolist()), axis=1)
    temp_df

    temp_df2 = pd.concat(list_df_preds,axis=1)
    for col in temp_df2.columns:
        print(col)
        temp_df2[col] = str(list_cancer.index(col))+'_'+temp_df2[col].astype(str)
    temp_df2['prediction'] = temp_df2.apply(lambda x: ','.join(x.dropna().values.tolist()), axis=1)
    temp_df2

    temp_df3 = pd.concat([temp_df_lab,temp_df.label,temp_df2.prediction], axis=1)
    temp_df3

    temp_df3.to_csv(os.path.join(DATAPATH, filename))

    return temp_df3

# output_df = format_output(list_output,"result_adapter_hoc_"+model_name+"_"+LEARNING_RATE+".csv")
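
`format_output` writes labels and predictions back into the `index_value` string format that the evaluator expects. A minimal sketch of that encoding for a single row, written independently of pandas (`encode_hoc_labels` is a hypothetical helper name):

```python
def encode_hoc_labels(flags):
    """Convert [0, 1, 0] into '0_0,1_1,2_0': the label position, an underscore, the value."""
    return ",".join(f"{index}_{value}" for index, value in enumerate(flags))


encoded = encode_hoc_labels([0, 1, 0, 0, 0, 0, 0, 1, 0, 0])
print(encoded)  # 0_0,1_1,2_0,3_0,4_0,5_0,6_0,7_1,8_0,9_0
```

This assumes the per-label columns are in the same order as the label list, which is what `list_cancer.index(col)` enforces in the function above.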

In [None]:
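# AutoModelWithHeads is assumed to come from the adapter-transformers package,
# which is installed in place of the stock transformers library.
from transformers import AutoConfig, AutoModelWithHeads

def train_adapter_hoc(LANG_MODEL,LEARNING_RATE):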
    model_name = LANG_MODEL.split("/")[-1]
    print("--------",model_name,LEARNING_RATE,"--------")
    training_args = TrainingArguments(
        learning_rate=LEARNING_RATE,
        num_train_epochs=6,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        eval_accumulation_steps=1,
        logging_steps=200,
        output_dir="./training_output",
        overwrite_output_dir=True,
        # The next line is important to ensure the dataset labels are properly passed to the model
        remove_unused_columns=False,
    )
    list_f1 = []
    list_output = {}
    for lab in HOC_LABEL_LIST:
        torch.cuda.empty_cache()

        print("###### LAB",lab,"######")

        HOC_LABEL_LIST_temp = HOC_LABEL_LIST.copy()
        HOC_LABEL_LIST_temp.remove(lab)

        # dataset_lab = dataset.remove_columns(HOC_LABEL_LIST[1:])

        dataset_lab = dataset.remove_columns(HOC_LABEL_LIST_temp)
        print("num_rows",dataset_lab.num_rows)
        # print(dataset_lab)

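        # Tokenize with the tokenizer that matches LANG_MODEL (the global `tokenizer` belongs to model_checkpoint)
        lab_tokenizer = AutoTokenizer.from_pretrained(LANG_MODEL, use_fast=True)
        encoded_dataset = dataset_lab.map(
            lambda examples: lab_tokenizer(examples["sentence"], truncation=True, padding=True, max_length=MAX_LENGTH),
            batched=True,
        )
        # rename_column returns a new DatasetDict; the in-place rename_column_ is deprecated
        encoded_dataset = encoded_dataset.rename_column(lab, "labels")
        encoded_dataset.set_format(type="torch", columns=["input_ids", "attention_mask","labels"])
        print(encoded_dataset)

        # dataset_lab.rename_column_(lab, "labels")
        # dataset_lab.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
        # dataset_lab = dataset_lab.class_encode_column("labels")
        # encoded_dataset = dataset_lab.map(preprocess_function, batched=True)
        # encoded_dataset.set_format(type="torch", columns=["input_ids", "attention_mask","labels"])

        # Load the checkpoint passed to this function rather than the global model_checkpoint
        print("Setup model",LANG_MODEL)
        config = AutoConfig.from_pretrained(
            LANG_MODEL,
            num_labels=2,
        )
        model = AutoModelWithHeads.from_pretrained(
            LANG_MODEL,
            config=config,
        )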

        # Add a new adapter
        model.add_adapter("hoc")
        # Add a matching classification head
        model.add_classification_head(
            "hoc",
            num_labels=2,
            # id2label={ 0: "👎", 1: "👍"}
          )
        # Activate the adapter
        model.train_adapter("hoc")

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=encoded_dataset["train"],
            eval_dataset=encoded_dataset["validation"],
            compute_metrics=compute_metrics,  # compute_accuracy is not defined anywhere in this notebook
        )

        trainer.train()

        eval_pred = trainer.predict(encoded_dataset['test'])
        list_output[lab] = eval_pred

    output_df = format_output(list_output,"result_adapter_hoc_"+model_name+"_"+str(LEARNING_RATE)+".csv")
    r, p, f1, _ = my_evaluator.eval_hoc(output_df)
    print("")
    print("+++++","+++++")
    print("+++++","r",r,"+++++")
    print("+++++","p",p,"+++++")
    print("+++++","F1",f1,"+++++")
    print("+++++","+++++")
    print("")


for model in ["bert-base-uncased"]:
    for lr in [5e-5]:
        train_adapter_hoc(model,lr)
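
The training loop treats the ten hallmarks as ten independent binary problems: for each target label, every other label column is dropped before training. The column selection can be sketched on its own as follows (`columns_to_drop` is a hypothetical helper name; the label list is copied from the notebook):

```python
HOC_LABEL_LIST = ['label_IM', 'label_ID', 'label_CE', 'label_RI', 'label_GS',
                  'label_GI', 'label_A', 'label_CD', 'label_PS', 'label_TPI']


def columns_to_drop(target, all_labels=HOC_LABEL_LIST):
    """Return every label column except the one being trained on."""
    if target not in all_labels:
        raise ValueError(f"Unknown label column: {target}")
    return [name for name in all_labels if name != target]


for lab in HOC_LABEL_LIST[:2]:
    print(lab, "->", columns_to_drop(lab))
```

Building a new list instead of calling `remove` on a copy gives the same result as the loop above without mutating anything.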

In [None]:
list_f1 = []
list_output = {}
for lab in HOC_LABEL_LIST:
    torch.cuda.empty_cache()
    print("###### LAB",lab,"######")

    HOC_LABEL_LIST_temp = HOC_LABEL_LIST.copy()
    HOC_LABEL_LIST_temp.remove(lab)

    dataset_lab = dataset.remove_columns(HOC_LABEL_LIST_temp)
    # rename_column returns a new DatasetDict; the in-place rename_column_ is deprecated
    dataset_lab = dataset_lab.rename_column(lab, "labels")
    dataset_lab = dataset_lab.class_encode_column("labels")

    print("num_rows",dataset_lab.num_rows)
    print(dataset_lab['train'].features)

    encoded_dataset = dataset_lab.map(preprocess_function, batched=True)
    encoded_dataset.set_format(type="torch", columns=["input_ids", "attention_mask","labels"])

    model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=NUM_LABELS)

    args = TrainingArguments(
            f"{model_name}-finetuned-{task}",
            evaluation_strategy = "epoch",
            save_strategy = "epoch",
            learning_rate=LEARNING_RATE,
            per_device_train_batch_size=BATCH_SIZE,
            per_device_eval_batch_size=BATCH_SIZE,
            eval_accumulation_steps=1,
            num_train_epochs=6,
            weight_decay=0.01,
            load_best_model_at_end=True,
            metric_for_best_model=metric_name,
            # Let the Trainer drop dataset columns that the model's forward method does not accept
            remove_unused_columns=True,
            # push_to_hub=True,
            # push_to_hub_model_id=f"{model_name}-finetuned-{task}",
        )

    trainer = Trainer(
            model,
            args,
            train_dataset=encoded_dataset["train"],
            eval_dataset=encoded_dataset["validation"],
            tokenizer=tokenizer,
            compute_metrics=compute_metrics,
        )

    trainer.train()

    commit_msg = ""
    commit_msg += "BATCH_SIZE="+str(BATCH_SIZE)+"\n"
    commit_msg += "LEARNING_RATE="+str(LEARNING_RATE)+"\n"
    commit_msg += "MAX_LENGTH="+str(MAX_LENGTH)+"\n"
    commit_msg += "LABEL="+lab+"\n"

    # trainer.push_to_hub(commit_msg)
    # output = trainer.evaluate(encoded_dataset['test'])
    # list_f1.append(output)

    eval_pred = trainer.predict(encoded_dataset['test'])
    list_output[lab] = eval_pred
    # temp = {}
    # temp['predictions'] = eval_pred['predictions']
    # temp['label_ids'] = eval_pred['label_ids']
    # list_output[lab] = temp


In [None]:
list_df_labels = []
list_df_preds = []
for k,v in list_output.items():
    print(k,v)
    y_pred = np.argmax(v.predictions, axis=1)
    temp_df = pd.DataFrame(data=y_pred, columns=[k])
    list_df_preds.append(temp_df)

    y_true = v.label_ids
    temp_df = pd.DataFrame(data=y_true, columns=[k])
    list_df_labels.append(temp_df)

In [None]:
temp_df_lab = pd.DataFrame(data=dataset["test"]['index'], columns=["filename_line"])
temp_df_lab

In [None]:
list_cancer = ['label_IM', 'label_ID', 'label_CE', 'label_RI', 'label_GS', 'label_GI', 'label_A', 'label_CD', 'label_PS', 'label_TPI']
    
temp_df = pd.concat(list_df_labels,axis=1)
for col in temp_df.columns:
    print(col)
    temp_df[col] = str(list_cancer.index(col))+'_'+temp_df[col].astype(str)
temp_df['label'] = temp_df.apply(lambda x: ','.join(x.dropna().values.tolist()), axis=1)
temp_df

In [None]:
temp_df2 = pd.concat(list_df_preds,axis=1)
for col in temp_df2.columns:
    print(col)
    temp_df2[col] = str(list_cancer.index(col))+'_'+temp_df2[col].astype(str)
temp_df2['prediction'] = temp_df2.apply(lambda x: ','.join(x.dropna().values.tolist()), axis=1)
temp_df2

In [None]:
temp_df3 = pd.concat([temp_df_lab,temp_df.label,temp_df2.prediction], axis=1)
temp_df3

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, matthews_corrcoef
from sklearn.metrics import precision_recall_fscore_support,accuracy_score

r, p, f1, _ = my_evaluator.eval_hoc(temp_df3)
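
The final F1 reported by `eval_hoc` is not an average of per-document F1 scores: `get_p_r_f_arrary` averages precision and recall over documents first and then takes their harmonic mean. A minimal sketch of that last step, using made-up per-document values (`f_score_from_means` is a hypothetical helper name):

```python
def f_score_from_means(precisions, recalls):
    """Average precision and recall over documents, then combine them into one F-score."""
    mean_p = sum(precisions) / len(precisions)
    mean_r = sum(recalls) / len(recalls)
    denominator = mean_p + mean_r
    # Guard against 0/0 when both means are zero
    f_score = 2 * mean_p * mean_r / denominator if denominator else 0.0
    return mean_p, mean_r, f_score


# Hypothetical precision and recall for two documents
mean_p, mean_r, f_score = f_score_from_means([1.0, 0.5], [0.5, 0.5])
print(mean_p, mean_r, round(f_score, 6))
```

Here the mean precision is 0.75 and the mean recall is 0.5, so the combined score is 2 × 0.75 × 0.5 / 1.25 = 0.6.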