<a href="https://colab.research.google.com/github/crux82/BISS-2024/blob/main/BISS-2024_LAB-2.3_ExtremITA_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Comparing the Camoscio and Minerva LLMs fine-tuned on the GeoLingIt dataset

The tutorial is split into four steps; this notebook covers Step 3:
- Step 1 - Encoding the data
- Step 2 - Training the LLaMA model
- **Step 3 - Inference: generating answers**
- Step 4 - Decoding the data

# Index:
1. Introduction, Workflow and Objectives
2. Preliminary steps
3. Loading the model
4. Generating answers
5. Saving the data in the 4-column format

In [2]:
# install any packages that may be required

! pip3 install peft
! pip3 install sentencepiece
! pip3 install accelerate
! pip3 install bitsandbytes

Defaulting to user installation because normal site-packages is not writeable
distutils: /home/dosclic/.local/lib/python3.9/site-packages
sysconfig: /home/dosclic/.local/lib64/python3.9/site-packages
user = True
home = None
root = None
prefix = None
You should consider upgrading via the '/usr/bin/python -m pip install --upgrade pip' command.

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
import torch
import pandas as pd

from peft import PeftModel
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig

### Utility code for building the task-specific prompt

In [5]:
def task_to_prompt(task: str):
    if task == "geolingit":
        return "Scrivi la regione di appartenenza di chi ha scritto questo testo, seguito dalla latitudine, seguita dalla longitudine."
    else:
        return "task sconosciuto"


################ GENERATE METHODS ################
def generate_prompt_pred(instruction, input_):
    return f"""Di seguito è riportata un'istruzione che descrive un task, insieme ad un input che fornisce un contesto più ampio. Scrivete una risposta che completi adeguatamente la richiesta.
### Istruzione:
{instruction}
### Input:
{input_}
### Risposta:"""
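
To see what the model actually receives, the two helpers can be combined on one of the dev-set tweets. The sketch below restates them in condensed form so that it runs on its own; `INSTRUCTION`, `TEMPLATE`, and `build_prompt` are names introduced here for illustration only and are not part of the notebook.

```python
# Standalone sketch: condensed copies of the notebook helpers above.
INSTRUCTION = (
    "Scrivi la regione di appartenenza di chi ha scritto questo testo, "
    "seguito dalla latitudine, seguita dalla longitudine."
)

TEMPLATE = (
    "Di seguito è riportata un'istruzione che descrive un task, insieme ad un "
    "input che fornisce un contesto più ampio. Scrivete una risposta che "
    "completi adeguatamente la richiesta.\n"
    "### Istruzione:\n{instruction}\n"
    "### Input:\n{input_}\n"
    "### Risposta:"
)


def build_prompt(text: str) -> str:
    """Hypothetical helper: fill the template for the geolingit task."""
    return TEMPLATE.format(instruction=INSTRUCTION, input_=text)


prompt = build_prompt("[USER] Anvedi andò stai.")
print(prompt)
```

The prompt ends with the `### Risposta:` marker, which is why the decoding step later splits the generated text on that same string to isolate the answer.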

# Download the model

This section shows how to download and load the model. The reference is the ExtremITA model, which was trained and evaluated as part of the EVALITA 2023 evaluation campaign. For details on the competition and the model's performance, see https://ceur-ws.org/Vol-3473/paper13.pdf.

The model is available on Hugging Face in two parts: the base language model (Camoscio: ``sag-uniroma2/extremITA-Camoscio-7b``) and the adapters obtained by fine-tuning on all the Evalita 2023 data (``sag-uniroma2/extremITA-Camoscio-7b-adapters``).

The model can be loaded at different precision levels: 4-bit or 8-bit quantization, or 16-bit floating point (the `full` option), as outlined below. Note that the loading cell in this notebook points to a local fine-tuned checkpoint, `./LLaMinerva/checkpoint-426`, rather than to the Hugging Face identifiers above.

In [6]:
bits = "4" #@param [4, 8, "full"]

In [7]:
tokenizer = LlamaTokenizer.from_pretrained("yahma/llama-7b-hf")
tokenizer.padding_side = "left"
tokenizer.pad_token_id = 0

# base model: choose between 4-bit, 8-bit, or full (float16) precision
load_kwargs = {"torch_dtype": torch.float16, "device_map": "auto"}
if bits == "8":
  load_kwargs["load_in_8bit"] = True
elif bits == "4":
  load_kwargs["load_in_4bit"] = True

model = LlamaForCausalLM.from_pretrained("./LLaMinerva/checkpoint-426", **load_kwargs)

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
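
The deprecation warning above recommends passing a `BitsAndBytesConfig` object through `quantization_config`. A minimal sketch of that variant follows, assuming a `transformers` release that provides `BitsAndBytesConfig` and a working `bitsandbytes` installation. The helper names `quantization_settings` and `load_model` are hypothetical and not part of the notebook.

```python
def quantization_settings(bits: str) -> dict:
    """Map the notebook's `bits` selector to BitsAndBytesConfig keyword arguments."""
    if bits == "4":
        return {"load_in_4bit": True}
    if bits == "8":
        return {"load_in_8bit": True}
    return {}  # "full": no quantization, plain float16 weights


def load_model(path: str, bits: str):
    """Hypothetical loader; the heavy imports are deferred until it is called."""
    import torch
    from transformers import BitsAndBytesConfig, LlamaForCausalLM

    kwargs = {"torch_dtype": torch.float16, "device_map": "auto"}
    settings = quantization_settings(bits)
    if settings:
        kwargs["quantization_config"] = BitsAndBytesConfig(**settings)
    return LlamaForCausalLM.from_pretrained(path, **kwargs)


print(quantization_settings("4"))
```

With this in place, the loading cell would reduce to something like `model = load_model("./LLaMinerva/checkpoint-426", bits)`.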



In [8]:
model.config.pad_token_id = tokenizer.pad_token_id = 0
model.config.bos_token_id = tokenizer.bos_token_id = 1
model.config.eos_token_id = tokenizer.eos_token_id = 2

model.eval()
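# compile only on PyTorch 2.x or newer; compare the major version as an integer, not as a string
if int(torch.__version__.split(".")[0]) >= 2:
    model = torch.compile(model)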

In [75]:
import pprint
import re

relPath = '.'
TASK = "geolingit"

inputs = []
with open(f"{relPath}/out/{TASK}/dev.txt", "r", encoding="utf-8") as f:
    for line in f:
        # split on tabs and on the [regione]/[geo] tags:
        # lc[2] is the text, lc[4] the region, lc[5] the coordinates
        lc = re.split(r"\t|\[regione\]|\[geo\]", line)
        inputs.append(["1", TASK, lc[2], "[regione]" + lc[4] + " [geo]" + lc[5]])
pprint.pp(inputs)
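
The split pattern in the cell above is easier to follow on a single line. The sketch below uses a hypothetical line whose layout (id, task, text, and tagged label separated by tabs) is an assumption inferred from the indices used in the cell, not a documented file format.

```python
import re

# Hypothetical line in the layout the split pattern implies:
# id <TAB> task <TAB> text <TAB> [regione] <region> [geo] <lat> <lon>
line = "1\tgeolingit\t[USER] Anvedi andò stai.\t[regione] Lazio [geo] 41.89 12.54\n"

fields = re.split(r"\t|\[regione\]|\[geo\]", line)
text = fields[2]                                  # the tweet
region = fields[4].strip()                        # text between [regione] and [geo]
lat, lon = (float(v) for v in fields[5].split())  # coordinates after [geo]
print(text, region, lat, lon)
```

Under this assumption `fields[3]` is the empty string between the last tab and the `[regione]` tag, which is why the cell skips from index 2 to index 4.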

[['1',
  'geolingit',
  "[USER] Mortacci, na roba che nse po' vede, Por, na cacata rara.",
  '[regione] Lazio  [geo] 41.89 12.54\n'],
 ['1',
  'geolingit',
  '[USER] In Liguria diciamo: CIU INLA’ U GHE CIOSU..!😂😂😂🤡🤡🤡',
  '[regione] Liguria  [geo] 43.9 8.0\n'],
 ['1',
  'geolingit',
  '[USER] Uuuuuuaaaaa.... [USER] si nu bell cafè cu nu bicchier d’ acqua '
  'minerale ‘ngopp a na nave e crocier 🤣🤣🤣🤣🌹🌹🌹🌹',
  '[regione] Campania  [geo] 40.85 14.24\n'],
 ['1',
  'geolingit',
  '[USER] [USER] Boffe a dui a dui finu caddivientanu rispari Pure così',
  '[regione] Sicilia  [geo] 38.13 13.34\n'],
 ['1',
  'geolingit',
  '[USER] Anvedi andò stai. La prossima volta dimmelo pe tempo che te faccio '
  'un saluto.',
  '[regione] Lazio  [geo] 41.89 12.54\n'],
 ['1',
  'geolingit',
  'Sta Mejo lei che [USER] #nzonzi ma de che stamo a parlà [URL]',
  '[regione] Lazio  [geo] 41.89 12.54\n'],
 ['1',
  'geolingit',
  '[USER] Certo che con questo approccio così "incisivo" cascheranno tutte ai '
  "suoi pie

In [76]:
# generate prompts based on task and text
preds = []
for sample in inputs:
  sample_id, task, text, expected_output = sample

  instruction = task_to_prompt(task)
  # make sure the input does not exceed the maximum length of your model
  prompt = generate_prompt_pred(instruction, text)

  # tokenization
  tokenized_inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)

  # inference
  with torch.no_grad():
    gen_outputs = model.generate(
      **tokenized_inputs,
      return_dict_in_generate=True,
      output_scores=True,
      max_new_tokens=256, # how many tokens (wordpieces) to add to the input prompt. In ExtremITA answers are short
      do_sample=False # no sampling or beam search needed: greedy search returns the single "best" continuation, see https://huggingface.co/blog/how-to-generate
    )

  # decoding and printing
  for sequence in gen_outputs.sequences:
    output = tokenizer.decode(sequence, skip_special_tokens=True)
    if "### Risposta:" in output:
      # keep only the text generated after the answer marker
      response = output.split("### Risposta:")[1].strip()
    else:
      response = "UNK"

    print(text)
    print(f"\t {expected_output} \t {response}")
    print(50*"*")

[USER] Mortacci, na roba che nse po' vede, Por, na cacata rara.
	 [regione] Lazio  [geo] 41.89 12.54
 	 [regione] Lazio [geo] 41.89 12.54
**************************************************
[USER] In Liguria diciamo: CIU INLA’ U GHE CIOSU..!😂😂😂🤡🤡🤡
	 [regione] Liguria  [geo] 43.9 8.0
 	 [regione] Liguria [geo] 44.44 8.88
**************************************************
[USER] Uuuuuuaaaaa.... [USER] si nu bell cafè cu nu bicchier d’ acqua minerale ‘ngopp a na nave e crocier 🤣🤣🤣🤣🌹🌹🌹🌹
	 [regione] Campania  [geo] 40.85 14.24
 	 [regione] Campania [geo] 40.85 14.24
**************************************************
[USER] [USER] Boffe a dui a dui finu caddivientanu rispari Pure così
	 [regione] Sicilia  [geo] 38.13 13.34
 	 [regione] Sicilia [geo] 37.46 15.03
**************************************************
[USER] Anvedi andò stai. La prossima volta dimmelo pe tempo che te faccio un saluto.
	 [regione] Lazio  [geo] 41.89 12.54
 	 [regione] Lazio [geo] 41.89 12.54
*************************

KeyboardInterrupt: 
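
The printed answers follow the `[regione] <name> [geo] <lat> <lon>` layout. One possible way to turn them into structured values for comparison is sketched below with a hypothetical `parse_answer` helper, which is not part of the notebook. Answers that do not match the layout, such as the `UNK` fallback, yield `None`.

```python
import re

# Pattern for answers such as "[regione] Lazio [geo] 41.89 12.54".
ANSWER_RE = re.compile(
    r"\[regione\]\s*(.+?)\s*\[geo\]\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)"
)


def parse_answer(answer: str):
    """Return (region, lat, lon), or None when the answer is malformed."""
    match = ANSWER_RE.search(answer)
    if match is None:
        return None
    region, lat, lon = match.groups()
    return region, float(lat), float(lon)


# Expected and generated answers taken from the output above.
expected = parse_answer("[regione] Liguria  [geo] 43.9 8.0\n")
predicted = parse_answer("[regione] Liguria [geo] 44.44 8.88")
print(expected, predicted, expected[0] == predicted[0])
```

Parsing both strings with the same function sidesteps whitespace differences such as the double space in the expected output.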