<a href="https://colab.research.google.com/github/crux82/BISS-2024/blob/main/BISS-2024_LAB-2.3_ExtremITA_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Comparing the Camoscio and Minerva LLMs fine-tuned on the GeoLingIt dataset

The tutorial is split into 4 steps, reflecting the aforementioned process:
- Step 1 - Encoding the data
- Step 2 - Training the LLaMA model
- **Step 3 - Inference: generating answers**
- Step 4 - Decoding the data

# Index:
1. Introduction, Workflow and Objectives
2. Preliminary steps
3. Loading the model
4. Generating answers
5. Saving the data in the 4-column format

In [1]:
# install any required packages that may be missing

! pip3 install peft
! pip3 install sentencepiece
! pip3 install accelerate
! pip3 install bitsandbytes
! pip3 install geopy
! pip3 install scikit-learn



In [2]:
import warnings
warnings.filterwarnings('ignore')

In [52]:
import torch
import pandas as pd

from peft import PeftModel
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
import re
from os import makedirs
from os.path import isdir
import csv
import math
import pprint
from tqdm import tqdm
from geopy.distance import geodesic
from sklearn.metrics import f1_score
import numpy as np

In [4]:
relPath = '.'
TASK = 'geolingit'

## Encode the test set

In [5]:
def clean_input_text(text):
    # \s+ already matches tabs and newlines, so a single substitution is enough
    text = re.sub(r'\s+', ' ', text)
    text = text.rstrip()
    return text

def encode():
    if not isdir(f"out/{TASK}"):
        makedirs(f"out/{TASK}")

    data = dict()

    with open(f"{relPath}/GeoLingIt/test_a_GOLD.tsv", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            text = clean_input_text(row['text'])
            label = row['region']
            data[row['id']] = {
                'text': text,
                'label': label,
            }

    with open(f"{relPath}/GeoLingIt/test_b_GOLD.tsv", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # parse with float() instead of eval(): safer and equivalent for numeric strings
            latitude = math.floor(float(row['latitude'])*100)/100.
            longitude = math.floor(float(row['longitude'])*100)/100.
            data[row['id']]['latitude'] = latitude
            data[row['id']]['longitude'] = longitude

    with open(f"out/{TASK}/test.txt", "w", encoding="utf-8") as f_o:
        for row_id, features in data.items():
            f_o.write(f"{row_id}\t{TASK}\t{features['text']}\t[regione] {features['label']} [geo] {features['latitude']} {features['longitude']}\n")
            
encode()
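
As a minimal sketch of the 4-column line that `encode()` writes (id, task, input text, expected output), the snippet below builds one line from a made-up row. The helper names `truncate_2dp` and `encode_row` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import math


def truncate_2dp(value: str) -> float:
    """Parse a coordinate string and floor it to two decimal places."""
    return math.floor(float(value) * 100) / 100.0


def encode_row(row_id, task, text, region, latitude, longitude):
    """Build one tab-separated line: id, task, input text, expected output."""
    target = (
        f"[regione] {region} "
        f"[geo] {truncate_2dp(latitude)} {truncate_2dp(longitude)}"
    )
    return f"{row_id}\t{task}\t{text}\t{target}"


# Hypothetical row, not taken from the dataset
line = encode_row("42", "geolingit", "testo di esempio", "Lazio", "42.4467", "12.0691")
print(line.split("\t"))
```

Note that `math.floor` rounds towards negative infinity, so this truncation only behaves as "drop the extra decimals" for positive coordinates, which is the case for Italy.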

### Utils code for generating text in the ad hoc form for each task

In [6]:
def task_to_prompt(task: str):
    if task == "geolingit":
        return "Scrivi la regione di appartenenza di chi ha scritto questo testo, seguito dalla latitudine, seguita dalla longitudine."
    else:
        return "task sconosciuto"


 ################ GENERATE METHODS ################
def generate_prompt_pred(instruction, input_):
    return f"""Di seguito è riportata un'istruzione che descrive un task, insieme ad un input che fornisce un contesto più ampio. Scrivete una risposta che completi adeguatamente la richiesta.
### Istruzione:
{instruction}
### Input:
{input_}
### Risposta:"""

# Download the model

This section shows how to download and load the model. We focus on the extremITA model, which was trained and evaluated as part of EVALITA 2023. For details on the competition and the model's performance, see https://ceur-ws.org/Vol-3473/paper13.pdf.

The model is available on Hugging Face and consists of two components: the base language model (Camoscio: ``sag-uniroma2/extremITA-Camoscio-7b``) and the adapters obtained by fine-tuning on all the EVALITA 2023 data (``sag-uniroma2/extremITA-Camoscio-7b-adapters``).

The model can be loaded at different precision levels: with 4-bit or 8-bit quantization, or in 16-bit floating point, as shown below.
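
To get a feel for why the precision level matters, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It assumes a model of about 7 billion parameters (an assumption based on the "7b" in the model name) and ignores activations, the KV cache, and any quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: roughly 7 billion parameters

for bits_per_param in (4, 8, 16):
    size = weight_memory_gb(N_PARAMS, bits_per_param)
    print(f"{bits_per_param:>2}-bit weights: ~{size:.1f} GB")
```

Under these assumptions, halving the number of bits halves the weight footprint, which is what makes the quantized variants fit on smaller GPUs.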

In [7]:
bits = "4" #@param [4, 8, "full"]

In [8]:
tokenizer = LlamaTokenizer.from_pretrained("yahma/llama-7b-hf")
tokenizer.padding_side = "left"
tokenizer.pad_token_id = (0)

# base model here, choose between 4, 8 bits or full precision
if bits == "8":
  model = LlamaForCausalLM.from_pretrained(
    "./LLaMinerva/checkpoint-426",
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
  )
elif bits == "4":
  model = LlamaForCausalLM.from_pretrained(
    "./LLaMinerva/checkpoint-426",
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
  )
else:
  model = LlamaForCausalLM.from_pretrained(
    "./LLaMinerva/checkpoint-426",
    torch_dtype=torch.float16,
    device_map="auto",
  )

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/39 [00:00<?, ?it/s]

In [9]:
model.config.pad_token_id = tokenizer.pad_token_id = 0
model.config.bos_token_id = tokenizer.bos_token_id = 1
model.config.eos_token_id = tokenizer.eos_token_id = 2

model.eval()
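# compare the major version as an integer; a plain string comparison would misorder e.g. "10.0" and "2"
if int(torch.__version__.split(".")[0]) >= 2: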
    model = torch.compile(model)

In [10]:
inputs = []
with open(f"{relPath}/out/{TASK}/test.txt", "r", encoding="utf-8") as f:
    for line in f.readlines():
        lc = re.split(r"\t|\[regione\]|\[geo\]", line)
        inputs.append(["1", TASK, lc[2], "[regione]" + lc[4] + " [geo]" + lc[5]])
    pprint.pp(inputs)

[['1',
  'geolingit',
  '#NapoliRijeka i cambi hanno fatto la differenza...anche se il gol di [USER] '
  'è stato di cazzimma! Quello di [USER] con doppia cazzimma! Peccato per le '
  "occasioni di #Petagna e #Insigne si poteva finire con un'imbarcata! "
  '#ForzaNapoliSempre',
  '[regione] Campania  [geo] 40.85 14.24\n'],
 ['1',
  'geolingit',
  '[USER] [USER] E ti chiama Gugliermo Passeri di bar Armida e ti porta un’ape '
  'du’ cami e tre purmi pieni di nuovi giocatori raccattati ni sudicio',
  '[regione] Toscana  [geo] 43.78 11.24\n'],
 ['1',
  'geolingit',
  '[USER] [USER] In dialetto ti ghe rason...😉😁',
  '[regione] Piemonte  [geo] 45.44 8.61\n'],
 ['1',
  'geolingit',
  '[USER] Je c’è vo a cariola pe portalo a spasso',
  '[regione] Lazio  [geo] 42.44 12.06\n'],
 ['1',
  'geolingit',
  '[USER] [USER] Frastimo, ma no isco frastimare ca Deus non m’hat dadu su '
  'destinu. Males cantas renas b’hat in mare e unzas cantu pesat su terrinu. '
  'Su cannau ti tostet su Buzinu manzanu a 

In [20]:
def elaborate_generated_output(text):
    region, coordinates = [e.strip() for e in text.removeprefix("[regione]").strip().split('[geo]')]
    latitude, longitude = [float(e) for e in coordinates.split()]
    return region, (latitude, longitude)
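
The parser above assumes the text always has the form `[regione] <region> [geo] <lat> <lon>`; a generated answer that deviates from it (for example the `"UNK"` fallback) would raise an exception. A more defensive variant could look like the sketch below. The name `parse_output` and the choice to return `None` on malformed text are assumptions for illustration; how such cases should count in the evaluation is a separate decision.

```python
import re
from typing import Optional, Tuple

# Region name, then two decimal numbers after the [geo] tag
PATTERN = re.compile(
    r"\[regione\]\s*(?P<region>.+?)\s*\[geo\]\s*"
    r"(?P<lat>-?\d+(?:\.\d+)?)\s+(?P<lon>-?\d+(?:\.\d+)?)"
)


def parse_output(text: str) -> Optional[Tuple[str, Tuple[float, float]]]:
    """Return (region, (latitude, longitude)), or None if the text is malformed."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return match["region"], (float(match["lat"]), float(match["lon"]))


print(parse_output("[regione] Campania  [geo] 40.85 14.24\n"))
print(parse_output("UNK"))
```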

In [22]:
# generate prompts based on task and text
pred_text = []
true_text = []

for sample in tqdm(inputs):  # avoid shadowing the built-ins `input` and `id`
    sample_id = sample[0]
    task = sample[1]
    text = sample[2]
    expected_output = sample[3]

    instruction = task_to_prompt(task)
    prompt = generate_prompt_pred(instruction, text) # make sure the prompt does not exceed the maximum input length of your model

    # tokenization
    tokenized_inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)

    # inference
    model.eval()
    with torch.no_grad():
        gen_outputs = model.generate(
            **tokenized_inputs,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=256, # how many tokens (wordpieces) to add to the input prompt. In ExtremITA answers are short
            do_sample=False # no sampling or beam search is needed: we just want the "best" solution, so greedy search is fine: https://huggingface.co/blog/how-to-generate
        )

        # decoding and printing
        for i in range(len(gen_outputs[0])):
            output = tokenizer.decode(gen_outputs[0][i], skip_special_tokens=True)
            if "### Risposta:" in output:
                response = output.split("### Risposta:")[1].strip()
            else:
                response = "UNK"

            # print(text)
            # print(f"\t {expected_output} \t {response}")
            # print(50*"*")
            
            pred_text.append(elaborate_generated_output(response))
            true_text.append(elaborate_generated_output(expected_output))

100%|██████████| 818/818 [22:05<00:00,  1.62s/it]


## Compute the metrics: macro F1 score for classification and average distance in km between coordinates

In [53]:
pred_region, pred_coord = tuple(zip(*pred_text))
true_region, true_coord = tuple(zip(*true_text))

avg_km = np.array([geodesic(p, t).km for p, t in zip(pred_coord, true_coord)]).mean()
score = f1_score(true_region, pred_region, average='macro')

print(f"Average distance in km: {avg_km}")
print(f"F1 score: {score}")

Average distance in km: 167.96833368812435
F1 score: 0.3365633846445403
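
To make the two metrics concrete, the sketch below recomputes them by hand on toy data. It is not the code used above: the distance uses the haversine formula on a sphere of radius 6371 km, whereas `geopy.distance.geodesic` works on an ellipsoid, so the values differ slightly. The macro F1 is the unweighted mean of the per-class F1 scores. The function names and toy inputs are illustrative assumptions.

```python
import math


def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(h))


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data, not taken from the test set
print(haversine_km((40.85, 14.24), (43.78, 11.24)))
print(macro_f1(["Lazio", "Lazio", "Toscana"], ["Lazio", "Toscana", "Toscana"]))
```

Because every class weighs the same in the macro average, regions with few test examples influence the score as much as frequent ones.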


![image.png](assets/extremita.PNG)