<a href="https://colab.research.google.com/github/crux82/BISS-2024/blob/main/BISS-2024_LAB-2.4_ExtremITA_data_decoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### BISS-2024 Tutorial

## LAB 2.4: Large Language Models and How to Instruction Tune Them (in a Sustainable Way)

**Authors**: C.D. Hromei & D. Croce

This Notebook shows how to train and use a Large Language Model (based on [LLaMA](https://ai.meta.com/blog/large-language-model-llama-meta-ai/)) with instructions in order to solve the linguistic tasks of [EVALITA](https://www.evalita.it/campaigns/evalita-2023/). In this lab, we will see how to encode datasets from any format into a sequence-to-sequence format, train the model using [Q-LoRA](https://arxiv.org/abs/2305.14314), run inference with the trained model to generate answers to instructions, and finally decode the generated data back into the original format. All of this is done using only the *T4 GPU with 15GB available on Google Colab*.

The tutorial is split into 4 steps, reflecting the aforementioned process:
- Step 1 - Encoding the data
- Step 2 - Training the LLaMA model
- Step 3 - Inference: generating answers
- **Step 4 - Decoding the data**

# Index:
1. Introduction, Workflow and Objectives
2. Preliminary steps
3. Loading the data from previous step
4. Decoding: generating the PubTator format
5. Bonus: ExtremITA demo

## Step 4 - Decoding the data

In this Notebook we will look at the decoding step: once the model has generated its answer, we transform it back into the task-specific format. Here we focus on the [CLinkaRT](https://e3c.fbk.eu/clinkart) task: we take the sequence of events (the tests) and results, look up their character indices in the original text, and produce a file in a PubTator-like format.
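
As a minimal sketch of the first half of this process, the snippet below pulls the pairs out of a generated string. The `[BREL] ... [SEP] ... [EREL]` markers are the ones that appear in the prediction file used in this Notebook; the helper name `extract_pairs` and the choice to drop repeated pairs at this stage are assumptions made for illustration, not part of the ExtremITA code.

```python
import re

# Markers as they appear in the sample prediction of this Notebook.
PAIR_RE = re.compile(r"\[BREL\]\s*(.*?)\s*\[SEP\]\s*(.*?)\s*\[EREL\]")

def extract_pairs(prediction):
    """Return the unique (first, second) pairs found in a generated string."""
    pairs = []
    for first, second in PAIR_RE.findall(prediction):
        pair = (first, second)
        if pair not in pairs:  # the model may repeat the same relation
            pairs.append(pair)
    return pairs

prediction = "[BREL] 289 [SEP] troponina [EREL] [BREL] 289 [SEP] troponina [EREL]"
print(extract_pairs(prediction))  # [('289', 'troponina')]
```

In the sample prediction the first element of a pair is the result (`289`) and the second is the test (`troponina`).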

## Input
The input of this Notebook is the 4-column file generated in the previous step, which contains the predictions for the CLinkaRT task.
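
To make the layout of this file concrete, here is a small sketch that splits one line into its four columns (id, task name, text, prediction) and unpacks the id into a document id and two character offsets, following the `doc_char-from_char-to` pattern visible in the sample id `11_0_66`. The `PredictionRow` container and the `parse_row` helper are hypothetical names, and the sketch assumes plain tab-separated fields without quoting.

```python
from typing import NamedTuple

class PredictionRow(NamedTuple):  # hypothetical container, for illustration only
    doc_id: str
    char_from: int
    char_to: int
    task: str
    text: str
    prediction: str

def parse_row(line):
    """Split one tab-separated line into its four columns and unpack the id."""
    example_id, task, text, prediction = line.rstrip("\n").split("\t")
    doc_id, char_from, char_to = example_id.split("_")
    return PredictionRow(doc_id, int(char_from), int(char_to), task, text, prediction)

# The single example shipped with this Notebook (repeated relation shortened).
line = ("11_0_66\tclinkart\t"
        "Veniva documentato, inoltre, il rialzo della troponina TnT-hs (289; \t"
        "[BREL] 289 [SEP] troponina [EREL]\n")
row = parse_row(line)
print(row.doc_id, row.char_from, row.char_to, row.task)  # 11 0 66 clinkart
```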

# Let's decode

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import random
from os.path import isdir
from os import mkdir
import spacy
import re
import pandas as pd

### Handle external data:
Download one example from the CLinkaRT dataset, as labeled by the LLM.

In [None]:
! wget -nc https://raw.githubusercontent.com/crux82/CLiC-it_2023_tutorial/main/data/clinkart_predictions.tsv

input_file_path = "clinkart_predictions.tsv"

--2024-03-06 16:24:14--  https://raw.githubusercontent.com/crux82/CLiC-it_2023_tutorial/main/data/clinkart_predictions.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 153 [text/plain]
Saving to: ‘clinkart_predictions.tsv’


2024-03-06 16:24:14 (13.1 MB/s) - ‘clinkart_predictions.tsv’ saved [153/153]



In [None]:
! cat clinkart_predictions.tsv

11_0_66	clinkart	Veniva documentato, inoltre, il rialzo della troponina TnT-hs (289; 	[BREL] 289 [SEP] troponina [EREL] [BREL] 289 [SEP] troponina [EREL]

-------------------------
### Decode
Now we will load the file with the predictions for the CLinkaRT task and convert them back into the PubTator format. We will use the task-specific decoder from the list of [decoders](https://github.com/crux82/ExtremITA/tree/main/tasks).

**Note**: this is the most complicated decoder among the EVALITA 2023 tasks. We invite you to take a look at the others.
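
One step of this decoder is worth isolating: when the second element of a relation occurs more than once in the sentence, the occurrence closest to the first element is kept. The snippet below is a sketch of that idea only. The function name `closest_span` is hypothetical, the example sentence is invented (it is not from the dataset), and escaping the search string with `re.escape` is an assumption of this sketch.

```python
import re

def closest_span(needle, text, anchor_start):
    """Return the (start, end) span of the occurrence of `needle` in `text`
    whose start is nearest to `anchor_start`, or None if it never occurs."""
    pattern = r"\b{}\b".format(re.escape(needle))
    spans = [m.span() for m in re.finditer(pattern, text)]
    if not spans:
        return None  # e.g. the model hallucinated a word that is not in the text
    return min(spans, key=lambda span: abs(span[0] - anchor_start))

# Invented sentence with a repeated word, for illustration only.
text = "glucosio 90, poi glucosio 140"
value_start = text.index("140")
print(closest_span("glucosio", text, value_start))  # (17, 25)
```

Returning `None` for a missing string lets the caller skip the relation explicitly instead of relying on an exception.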

In [None]:
def decode(preds):
  out = dict()
  texts = dict()

  for example_pred in preds:
    relations = []

    example_id = example_pred[0]  # avoid shadowing the built-in `id`
    text = example_pred[2]
    prediction = example_pred[3]

    doc_id, char_from, char_to = example_id.split("_")
    if not doc_id in out:
        out[doc_id] = []
    if not doc_id in texts:
        texts[doc_id] = ""
    texts[doc_id] += text

    # the offsets are plain integers: parse them with int(), not eval()
    char_from = int(char_from)
    char_to = int(char_to)
    regex = re.compile(r"\[BREL\].*?\[SEP\].*?\[EREL\]")

    matched_list = re.findall(regex, prediction)
    for matched in matched_list:
        # strip the [BREL]/[EREL] delimiters, then split on [SEP]
        tmp = re.sub(r"^\[BREL\] ", "", matched)
        tmp = re.sub(r" \[EREL\]$", "", tmp)
        parts = tmp.split(" [SEP] ")
        if len(parts) != 2:
            continue  # malformed relation: skip it
        brel, erel = parts

        # discard relations that still contain a leftover marker
        markers = ("[BREL]", "[SEP]", "[EREL]")
        if any(marker in brel or marker in erel for marker in markers):
            continue
        relations.append((brel, erel))

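    for brel, erel in relations:
        # escape the strings so that characters such as "(" are matched literally
        brel_match = re.search(r"\b{}\b".format(re.escape(brel)), text)
        if brel_match is None:
            continue  # it can happen when the model hallucinates
        m_from_brel, m_to_brel = brel_match.span()

        # find the EREL closest to the BREL
        erel_matches = list(re.finditer(r"\b{}\b".format(re.escape(erel)), text))
        if not erel_matches:
            continue  # hallucinated EREL: skip it instead of reusing stale offsets
        closest = min(erel_matches, key=lambda m: abs(m.start() - m_from_brel))
        m_from_erel, m_to_erel = closest.span()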

        # correct the offset wrt the sentence
        m_from_brel += char_from
        m_from_erel += char_from
        m_to_brel += char_from
        m_to_erel += char_from

        obj = (brel, erel, m_from_brel, m_to_brel, m_from_erel, m_to_erel)
        if obj not in out[doc_id]:
            out[doc_id].append(obj)

  with open("clinkart.txt", "w") as fo:
      for doc_id, sentences in out.items():
          text = texts[doc_id]
          fo.write(f"{doc_id}|t|{text}\n")
          for brel, erel, m_from_brel, m_to_brel, m_from_erel, m_to_erel in sentences:
              fo.write(f"{doc_id}\tREL\t{m_from_brel}-{m_to_brel}\t{m_from_erel}-{m_to_erel}\t{brel}\t{erel}\n")
          fo.write("\n")

Here we load the previously generated file, turn it into a list of 4-element lists, and then produce the requested format.

In [None]:
import csv

predictions = []

with open(input_file_path, 'r') as file:
    reader = csv.reader(file, delimiter='\t')

    for row in reader:
      predictions.append(row)

decode(predictions)

If we load the saved file, we can see what's inside.

In [None]:
with open("clinkart.txt", "r") as f:
  lines = f.readlines()
  for line in lines:
    print(line)

11|t|Veniva documentato, inoltre, il rialzo della troponina TnT-hs (289; 

11	REL	63-66	45-54	289	troponina
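
As a quick sanity check on this output, the sketch below slices the document text with the two character spans of the `REL` line and compares the result with the strings stored in the last two columns. The helper name `check_relation` is hypothetical, and the check assumes that the spans are offsets into the text written on the `|t|` line.

```python
text = "Veniva documentato, inoltre, il rialzo della troponina TnT-hs (289; "
rel_line = "11\tREL\t63-66\t45-54\t289\ttroponina"

def check_relation(rel_line, text):
    """Check that both spans of a REL line point at the stated strings."""
    _doc_id, _label, span_a, span_b, str_a, str_b = rel_line.split("\t")
    for span, expected in ((span_a, str_a), (span_b, str_b)):
        start, end = (int(x) for x in span.split("-"))
        if text[start:end] != expected:
            return False
    return True

print(check_relation(rel_line, text))  # True
```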





## Decoding all the datasets
We saw how to decode one dataset (CLinkaRT). If you want to decode the others, [here](https://github.com/crux82/ExtremITA/tree/main/tasks) you can find the list with the links to the code. Each decoder produces a file in the task-specific format, ready to be submitted to the competition.

If you clone the GitHub repository, you can use the `decode` method in the root to automatically decode each dataset.

- [acti](https://github.com/crux82/ExtremITA/blob/main/tasks/acti.py)
- [clinkart](https://github.com/crux82/ExtremITA/blob/main/tasks/clinkart.py)
- [discotex](https://github.com/crux82/ExtremITA/blob/main/tasks/discotex.py)
- [emit](https://github.com/crux82/ExtremITA/blob/main/tasks/emit.py)
- [emotivita](https://github.com/crux82/ExtremITA/blob/main/tasks/emotivita.py)
- [geolingit](https://github.com/crux82/ExtremITA/blob/main/tasks/geolingit.py)
- [haspeede](https://github.com/crux82/ExtremITA/blob/main/tasks/haspeede.py)
- [hodi](https://github.com/crux82/ExtremITA/blob/main/tasks/hodi.py)
- [langlearn](https://github.com/crux82/ExtremITA/blob/main/tasks/langlearn.py)
- [multifakedetective](https://github.com/crux82/ExtremITA/blob/main/tasks/multifakedetective.py)
- [nermud](https://github.com/crux82/ExtremITA/blob/main/tasks/nermud.py)
- [politicit](https://github.com/crux82/ExtremITA/blob/main/tasks/politicit.py)
- [wicita](https://github.com/crux82/ExtremITA/blob/main/tasks/wicita.py)
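
If you prefer to stay inside a Notebook, one possible way to handle several tasks at once is to group the rows by the task name in the second column and call one decoder per task. The sketch below illustrates this idea under stated assumptions: `group_by_task`, `decode_all` and the `decoders` registry are hypothetical names, the rows are invented, and this is not the interface of the repository's own `decode` entry point.

```python
from collections import defaultdict

def group_by_task(rows):
    """Group 4-column rows (id, task, text, prediction) by their task name."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[1]].append(row)
    return dict(grouped)

def decode_all(rows, decoders):
    """Call the decoder registered for each task; return the tasks without one."""
    missing = []
    for task, task_rows in group_by_task(rows).items():
        decoder = decoders.get(task)
        if decoder is None:
            missing.append(task)
        else:
            decoder(task_rows)
    return missing

# Invented rows and registry: in practice each value would be a task-specific
# function, such as the `decode` function defined earlier in this Notebook.
rows = [
    ["11_0_66", "clinkart", "some text", "[BREL] 289 [SEP] troponina [EREL]"],
    ["7", "haspeede", "some text", "some answer"],
]
seen = []
decoders = {"clinkart": lambda task_rows: seen.append(len(task_rows))}
print(decode_all(rows, decoders))  # ['haspeede']
```

Returning the list of tasks without a decoder makes it easy to spot predictions that would otherwise be silently ignored.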