#Homework 1A - Task 2

In this homework we create a JSON file from existing data so that it can be used with generative language models.

Starting from existing datasets, we convert them into a format suitable for evaluating LLMs.

NERMuD is a task presented at EVALITA 2023 that consists of extracting and classifying the named entities in a document, such as persons, organizations, and locations.

The following is an example of the input data.

    L'              O
    astronauta      O
    Umberto         B-PER
    Guidoni         I-PER
    ,               O
    dell'           O
    Agenzia         B-ORG
    Spaziale        I-ORG
    Europea         I-ORG
    ,               O
    svela           O
    ai              O
    bambini         O
    i               O
    segreti         O
    della           O
    Stazione        B-LOC
    Spaziale        I-LOC
    Internazionale  I-LOC
    .               O

#Imports and constants

In [1]:
!git clone https://github.com/dhfbk/KIND.git

Cloning into 'KIND'...
remote: Enumerating objects: 128, done.
remote: Counting objects: 100% (128/128), done.
remote: Compressing objects: 100% (101/101), done.
remote: Total 128 (delta 58), reused 68 (delta 22), pack-reused 0
Receiving objects: 100% (128/128), 9.34 MiB | 10.12 MiB/s, done.
Resolving deltas: 100% (58/58), done.


I create a dictionary that maps each label of the original dataset to an Italian word that the LLM can understand.

In [2]:
path = "KIND/evalita-2023/"

categories = {
    "PER": "persona",
    "LOC": "luogo",
    "ORG": "organizzazione",
}

In [3]:
from __future__ import unicode_literals, print_function
import os
import json
import csv
import torch
import random

files = []
for file in os.listdir(path):
  if file.endswith(".tsv") and not file.endswith("_nolabel.tsv"):
    files.append(file)

files.sort()
print(files)

['ADG_dev.tsv', 'ADG_test.tsv', 'ADG_train.tsv', 'FIC_dev.tsv', 'FIC_test.tsv', 'FIC_train.tsv', 'WN_dev.tsv', 'WN_test.tsv', 'WN_train.tsv']


# Define the extraction function

Looking at the structure of the dataset, I noticed that sentences are separated by an empty row, so I first split each file into sentences and collect them in a list.

In [4]:
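def extract_sentences(path, file):
  # Each row holds "token<TAB>label"; an empty row marks the end of a sentence.
  with open(path + file, 'r', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter='\t')
    sentences = []
    text = ""
    for line in reader:
      if line != []:
        text = text + line[0] + " "
      else:
        sentences.append(text)
        text = ""
    # Keep the last sentence when the file does not end with an empty row.
    if text:
      sentences.append(text)
  return sentences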

sentences = extract_sentences(path, files[0])
print(sentences)

['Il Paese e i contadini ', 'Nel nostro Trentino attraversiamo un momento storico importante . ', 'Lo abbiamo detto cento e cento volte ; ', 'fino ad ora chi da noi faceva nuvolo e sereno erano quattro dottori , quattro avvocati , che quando non avevano da presentarsi al publico della città a farvi qualche bel gesto , a parlarvi del Paese nostro , accoppavano il tempo a giocare a tresette e tarocco nei caffè , criticando negli intermezzi i passanti e le ... passanti . ', 'Lo abbiamo detto ancora : ', 'i nostri politicanti da caffè non furono capaci che di chiacchiere , di promesse , di grandi progetti ; ', 'in una cosa si distinsero , nel criticare cioè e bollare come traditori della patria quei disgraziati che avessero avuto civile ardire , oltre che di far progetti , di metterli anche in esecuzione . ', 'Con la politica del « tutto o nulla » bisogna farla finita ; ', 'bisognava capire una buona volta che con un Governo come il nostro , con partiti nemici come abbiamo noi , a dir « tu
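One caveat with `csv.reader`: by default it treats `"` as a quote character, so a row whose token is a lone double quote could make it read several rows as one field. The following is a minimal sketch of the same sentence splitting done with plain `str.split`, which sidesteps that behavior. The function name `split_sentences` and the sample rows are assumptions made for illustration; they are not part of the original notebook.

```python
def split_sentences(lines):
    """Group 'token<TAB>label' rows into sentences of (token, label) pairs.

    An empty row closes the current sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Empty row: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            # Skip malformed rows that carry no label.
            continue
        current.append((fields[0], fields[1]))
    # Keep the last sentence when the input does not end with an empty row.
    if current:
        sentences.append(current)
    return sentences


# Hypothetical rows, including a lone double quote as a token.
lines = ["Lo\tO", "abbiamo\tO", "", "\"\tO", "Roma\tB-LOC"]
print(split_sentences(lines))
```

Keeping the labels next to the tokens also makes it possible to rebuild the text with `" ".join(token for token, _ in sentence)` when needed.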

This function builds a list of all the tokens whose label is not 'O'. Each token is paired with its label, which can be B-PER, B-LOC, B-ORG, I-PER, I-LOC or I-ORG; duplicate (token, label) pairs are kept only once.

In [5]:
def extract_entities(path, file):
  with open(path + file, 'r', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter='\t')
    entities = []
    for line in reader:
      # Keep labelled rows only, skipping 'O' tokens and duplicates.
      if len(line) > 1 and line[1] != 'O' and line not in entities:
        entities.append(line)
  return entities

entities = extract_entities(path, files[3])
print(entities)

[['Garibaldi', 'B-PER'], ['Italia', 'B-LOC'], ['Settentrionale', 'I-LOC'], ['Sicilia', 'B-LOC'], ['Roma', 'B-LOC'], ['governo', 'B-ORG'], ['italiano', 'I-ORG'], ['Francia', 'B-ORG'], ['Napoleone', 'B-PER'], ['III', 'I-PER'], ['Vittorio', 'B-PER'], ['Emanuele', 'I-PER'], ['Europa', 'B-LOC'], ['Italia', 'B-ORG'], ['Catania', 'B-LOC'], ['Campidoglio', 'B-LOC'], ['Calabria', 'B-LOC'], ['Governo', 'B-ORG'], ['Aspromonte', 'B-LOC'], ['Varignano', 'B-LOC'], ['Angelo', 'B-LOC'], ['Genova', 'B-LOC'], ['Canelli', 'B-LOC'], ['Gaminella', 'B-LOC'], ['Valino', 'B-PER'], ['Nuto', 'B-PER'], ['Belbo', 'B-LOC'], ['Virgilia', 'B-PER'], ['Angiolina', 'B-PER'], ['Giulia', 'B-PER'], ['Stato', 'B-ORG'], ['Po', 'B-LOC'], ['Piacenza', 'B-LOC'], ['Mezzanacorti', 'B-LOC'], ['Annibale', 'B-PER'], ['Volturno', 'B-LOC'], ['Pontelagoscuro', 'B-LOC'], ['Borgoforte', 'B-LOC'], ['Sestocalende', 'B-LOC'], ['Taranto', 'B-LOC'], ['Tevere', 'B-LOC'], ['Umberto', 'B-PER'], ['Margherita', 'B-PER'], ['Spezia', 'B-LOC'], ['Ca
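The list above keeps every token separately, so a name such as "Napoleone III" appears as two entries. As a possible refinement, the B-/I- tags can be used to merge consecutive tokens into one multi-word entity. The sketch below is only an illustration: the helper name `bio_to_spans` is hypothetical, and the input rows are taken from the example sentence at the top of the notebook.

```python
def bio_to_spans(rows):
    """Merge (token, tag) pairs into (entity_text, entity_type) spans.

    A B- tag opens a span, matching I- tags extend it, anything else closes it.
    An I- tag with no open span of the same type is ignored.
    """
    spans = []
    current_tokens, current_type = [], None
    for token, tag in rows:
        if tag.startswith("B-"):
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            current_tokens.append(token)
        else:
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    # Flush a span that reaches the end of the input.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


rows = [
    ("L'", "O"), ("astronauta", "O"),
    ("Umberto", "B-PER"), ("Guidoni", "I-PER"),
    (",", "O"), ("dell'", "O"),
    ("Agenzia", "B-ORG"), ("Spaziale", "I-ORG"), ("Europea", "I-ORG"),
]
print(bio_to_spans(rows))
# [('Umberto Guidoni', 'PER'), ('Agenzia Spaziale Europea', 'ORG')]
```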

#Generate JSON

In this section I create the JSON files from the information extracted above. Each record of the JSON file has the following shape:


    {
        "sentence_id": int, # an incremental integer (starting from zero)

        "text": str, # the input sentence,
        
        "target_entity": str, # can be a multi-word

        "choices": List[str],
        
        "label": int, # the correct answer
    }
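As an illustration of this shape, here is a minimal sketch that builds a single record from the example sentence shown at the top of the notebook. The names `CATEGORIES`, `CHOICES` and `make_record` are assumptions introduced for this example; the label is the position of the correct category inside `choices`.

```python
# Same mapping as the `categories` dictionary, repeated to keep the sketch self-contained.
CATEGORIES = {"PER": "persona", "LOC": "luogo", "ORG": "organizzazione"}
CHOICES = list(CATEGORIES.values())


def make_record(sentence_id, text, target_entity, tag):
    """Build one record; `tag` is a BIO label such as 'B-PER' or 'I-LOC'."""
    entity_type = tag.split("-")[-1]  # 'B-PER' -> 'PER'
    return {
        "sentence_id": sentence_id,
        "text": text,
        "target_entity": target_entity,
        "choices": list(CHOICES),
        "label": CHOICES.index(CATEGORIES[entity_type]),
    }


text = ("L' astronauta Umberto Guidoni , dell' Agenzia Spaziale Europea , "
        "svela ai bambini i segreti della Stazione Spaziale Internazionale .")
record = make_record(0, text, "Stazione Spaziale Internazionale", "B-LOC")
print(record["choices"], record["label"])
# ['persona', 'luogo', 'organizzazione'] 1
```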

In [6]:
def generate_records(entities, sentences):
  sentence_id = 0
  all_records = []
  for entity in entities:
    token, tag = entity[0], entity[1]

    # Candidate sentences: those containing the token followed by a space.
    reference = token + " "
    possible_sentences = [sentence for sentence in sentences if reference in sentence]

    if possible_sentences:
      choices = [categories['PER'], categories['LOC'], categories['ORG']]
      record = {
          "sentence_id": sentence_id, # an incremental integer (starting from zero)
          "text": random.choice(possible_sentences), # the input sentence, picked at random among the candidates
          "target_entity": token, # a single token: B-/I- parts are not merged here
          "choices": choices,
          "label": choices.index(categories[tag[-3:]]), # the correct answer
      }
      sentence_id += 1
      all_records.append(record)

  return all_records

all_records = generate_records(entities, sentences)
print(all_records)

[{'sentence_id': 0, 'text': 'E che non ci sia franchezza voi lo constatate anche oggi perché vedete che si chiama in ballo Garibaldi e ci si nasconde dietro il cosiddetto Fronte popolare . ', 'target_entity': 'Garibaldi', 'choices': ['persona', 'luogo', 'organizzazione'], 'label': 0}, {'sentence_id': 1, 'text': "L' Italia conferma ancora una volta il suo desiderio di accordarsi col popolo jugoslavo sulla base di un' equa considerazione dei diritti e degli interessi di entrambe le parti . ", 'target_entity': 'Italia', 'choices': ['persona', 'luogo', 'organizzazione'], 'label': 1}, {'sentence_id': 2, 'text': "l' autonomia , infatti , porterà a contatto le forze direttive con quelle del lavoro , il primo contributo in questo senso lo darà proprio la Sicilia e questa sarà la prima fase ; ", 'target_entity': 'Sicilia', 'choices': ['persona', 'luogo', 'organizzazione'], 'label': 1}, {'sentence_id': 3, 'text': 'Quanto al Msi , ricorda la denuncia del questore di Roma : ', 'target_entity': 'Ro
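The substring test `reference in sentence` also matches when the token is only the tail of a longer word. Since the sentences are space-joined tokens, a stricter check can compare whole tokens instead. The following is a minimal sketch of that idea; the function name `contains_token_sequence` and the sample sentences are hypothetical.

```python
def contains_token_sequence(sentence, entity):
    """Return True if the tokens of `entity` occur as consecutive whole tokens in `sentence`."""
    tokens = sentence.split()
    target = entity.split()
    n = len(target)
    if n == 0:
        return False
    return any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))


sentence = "Il malgoverno durava da anni . "
print("governo " in sentence)                          # True: substring match inside "malgoverno"
print(contains_token_sequence(sentence, "governo"))    # False: no whole token "governo"
print(contains_token_sequence("Il governo italiano decise . ", "governo italiano"))  # True
```

The same helper works for multi-word entities, because it compares a window of tokens rather than a single one.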

In [7]:
json_files = []
for file in files:

  sentences = extract_sentences(path, file)
  entities = extract_entities(path, file)
  records = generate_records(entities, sentences)
  filename = "NERMuD_" + file[:-4] + ".jsonl"

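  # json.dump writes a single indented JSON array (not one object per line, despite the .jsonl extension).
  with open(filename, 'w', encoding='utf-8') as json_file:
      json.dump(records, json_file, ensure_ascii=False, indent=4)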

  json_files.append(filename)
print(json_files)

['NERMuD_ADG_dev.jsonl', 'NERMuD_ADG_test.jsonl', 'NERMuD_ADG_train.jsonl', 'NERMuD_FIC_dev.jsonl', 'NERMuD_FIC_test.jsonl', 'NERMuD_FIC_train.jsonl', 'NERMuD_WN_dev.jsonl', 'NERMuD_WN_test.jsonl', 'NERMuD_WN_train.jsonl']
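The files above carry the `.jsonl` extension but contain one indented JSON array. If a real JSON Lines layout (one JSON object per line) is preferred, a minimal sketch could look like the following. The helpers `write_jsonl` and `read_jsonl` are hypothetical names, and an in-memory buffer stands in for the file; the evaluation cells, which read the whole file with a single `json.loads`, would then need to read line by line as well.

```python
import io
import json


def write_jsonl(records, handle):
    """Write one compact JSON object per line."""
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(handle):
    """Read back a list of records, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]


sample_records = [
    {"sentence_id": 0, "text": "Roma è una città . ", "target_entity": "Roma",
     "choices": ["persona", "luogo", "organizzazione"], "label": 1},
    {"sentence_id": 1, "text": "Nuto tornò a casa . ", "target_entity": "Nuto",
     "choices": ["persona", "luogo", "organizzazione"], "label": 0},
]

buffer = io.StringIO()
write_jsonl(sample_records, buffer)
buffer.seek(0)
restored = read_jsonl(buffer)
print(len(restored), restored[0]["target_entity"])
```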


#Generate prompts

In [8]:
prompts = {
    "prompt_1":"Data la seguente frase '{text}' in questo contesto la parola '{target_entity}' si riferisce ad una (persona), ad una (luogo) o ad un' (organizzazione) ?",
    "prompt_2":"Analizzando la frase '{text}' come classificheresti la parola '{target_entity}' date le seguenti possibili classi (persona), (luogo), o (organizzazione) ?",
    "prompt_3":"Una parola come '{target_entity}' può essere interpretata in modi differenti, nel testo '{text}', è usata riferendosi ad una (persona), ad una (luogo) o ad un' (organizzazione) ? ",
}

output_file = "/content/NERMuD_prompt.jsonl"

with open(output_file, 'w') as json_file:
    json.dump(prompts, json_file, indent=2)


#Test the prompt

In [9]:
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    torch.cuda.set_device(device)
else:
    device = torch.device("cpu")

In [10]:
runtimeFlag = device
cache_dir = None
scaling_factor = 1.0

In [11]:
!pip install -q -U transformers peft accelerate optimum

!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

In [12]:
!pip install -q -U pdfminer.six # could maybe add pre-built wheels to speed this up.

In [13]:
import transformers
import torch
import json
import os
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

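# A torch.device works both with .to() and with pipeline(); .to(-1) would fail on CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")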

model_id = "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"
tokenizer_llama = AutoTokenizer.from_pretrained(model_id)
model_llama = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-it-en", device=device)

classifier = pipeline("zero-shot-classification", model="xlm-roberta-large", device=device)

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.
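These warnings are worth a note. As far as I understand the `zero-shot-classification` pipeline, it relies on natural language inference: each candidate label is turned into a hypothesis, paired with the input text as the premise, and scored through the model's entailment class. A checkpoint without an NLI-trained head, as the warnings suggest for `xlm-roberta-large`, is therefore unlikely to give meaningful scores. The sketch below only shows how such premise/hypothesis pairs can be formed; `build_nli_pairs` and the Italian hypothesis template are assumptions for illustration, not the pipeline's internals.

```python
def build_nli_pairs(premise, candidate_labels, hypothesis_template="Questo esempio riguarda: {}."):
    """Return one (premise, hypothesis) pair per candidate label."""
    return [(premise, hypothesis_template.format(label)) for label in candidate_labels]


premise = "Quanto al Msi , ricorda la denuncia del questore di Roma : "
pairs = build_nli_pairs(premise, ["persona", "luogo", "organizzazione"])
for _, hypothesis in pairs:
    print(hypothesis)
```

An NLI model would then score each pair, and the label whose hypothesis receives the highest entailment score would be the prediction.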


#Translation and Classification

In [15]:
count_array_per_file = []
correct_array_per_file=[]

with open("/content/NERMuD_prompt.jsonl", 'r') as file:

    prompts = json.loads(file.read())
    print(prompts)

for json_file in json_files:
    count_array = []
    correct_array = []

    # Load the records once per file and evaluate only the first 50.
    with open("/content/" + json_file, 'r', encoding='utf-8') as file:
        records = json.load(file)[:50]

    for i in range(1, len(prompts) + 1):
        template = prompts['prompt_' + str(i)]
        count_input = 0
        correct = 0

        for record in records:
            # Fill the Italian template, then translate the whole prompt to English.
            compiled_string = template.format(text=record['text'], target_entity=record['target_entity'])
            compiled_string = translator(compiled_string, max_length=1024)[0]['translation_text']

            token = tokenizer_llama(compiled_string, return_tensors="pt").to(device)

            # Inference only: no gradients are needed.
            with torch.no_grad():
                output = model_llama(**token)

            # Index of the highest-scoring class of the model head,
            # compared directly with the record's label index.
            prediction = torch.argmax(output.logits, dim=-1).item()

            count_input += 1
            if prediction == record['label']:
                correct += 1

        # Also report files with fewer than 50 records (skip empty ones).
        if count_input:
            count_array.append(count_input)
            correct_array.append(correct)

    count_array_per_file.append(count_array)
    correct_array_per_file.append(correct_array)

for i in range(len(json_files)):
    print(f"File: {json_files[i]}")
    for j in range(len(count_array_per_file[i])):
        accuracy = correct_array_per_file[i][j] / count_array_per_file[i][j] * 100
        print(f"Accuracy with prompt {j+1}: {accuracy:.2f}%")


{'prompt_1': "Data la seguente frase '{text}' in questo contesto la parola '{target_entity}' si riferisce ad una (persona), ad una (luogo) o ad un' (organizzazione) ?", 'prompt_2': "Analizzando la frase '{text}' come classificheresti la parola '{target_entity}' date le seguenti possibili classi (persona), (luogo), o (organizzazione) ?", 'prompt_3': "Una parola come '{target_entity}' può essere interpretata in modi differenti, nel testo '{text}', è usata riferendosi ad una (persona), ad una (luogo) o ad un' (organizzazione) ? "}
File: NERMuD_ADG_dev.jsonl
Accuracy with prompt 1: 28.00%
Accuracy with prompt 2: 16.00%
Accuracy with prompt 3: 14.00%
File: NERMuD_ADG_test.jsonl
Accuracy with prompt 1: 32.00%
Accuracy with prompt 2: 32.00%
Accuracy with prompt 3: 42.00%
File: NERMuD_ADG_train.jsonl
Accuracy with prompt 1: 30.00%
Accuracy with prompt 2: 16.00%
Accuracy with prompt 3: 20.00%
File: NERMuD_FIC_dev.jsonl
Accuracy with prompt 1: 62.00%
Accuracy with prompt 2: 30.00%
Accuracy with 

#Zero shot classification

In [16]:
with open("/content/NERMuD_prompt.jsonl", 'r') as file:

    prompts = json.loads(file.read())
    print(prompts)

count_array_per_file = []
correct_array_per_file = []

candidate_labels = ["persona", "luogo", "organizzazione"]

for json_file in json_files:
  count_array = []
  correct_array = []

  # Load the records once per file and evaluate only the first 50.
  with open("/content/" + json_file, 'r', encoding='utf-8') as file:
    records = json.load(file)[:50]

  for i in range(1, len(prompts) + 1):
    template = prompts['prompt_' + str(i)]
    count_input = 0
    correct = 0

    for record in records:
      compiled_string = template.format(text=record['text'], target_entity=record['target_entity'])

      # The pipeline returns the candidate labels sorted by decreasing score.
      output = classifier(compiled_string, candidate_labels=candidate_labels)
      count_input += 1

      if output['labels'][0] == candidate_labels[record['label']]:
        correct += 1

    # Also report files with fewer than 50 records (skip empty ones).
    if count_input:
      count_array.append(count_input)
      correct_array.append(correct)

  count_array_per_file.append(count_array)
  correct_array_per_file.append(correct_array)

for i in range(len(json_files)):
  print(f"File: {json_files[i]}")
  for j in range(len(count_array_per_file[i])):
      accuracy = correct_array_per_file[i][j] / count_array_per_file[i][j] * 100
      print(f"Accuracy with prompt {j+1}: {accuracy:.2f}%")

{'prompt_1': "Data la seguente frase '{text}' in questo contesto la parola '{target_entity}' si riferisce ad una (persona), ad una (luogo) o ad un' (organizzazione) ?", 'prompt_2': "Analizzando la frase '{text}' come classificheresti la parola '{target_entity}' date le seguenti possibili classi (persona), (luogo), o (organizzazione) ?", 'prompt_3': "Una parola come '{target_entity}' può essere interpretata in modi differenti, nel testo '{text}', è usata riferendosi ad una (persona), ad una (luogo) o ad un' (organizzazione) ? "}
File: NERMuD_ADG_dev.jsonl
Accuracy with prompt 1: 56.00%
Accuracy with prompt 2: 50.00%
Accuracy with prompt 3: 48.00%
File: NERMuD_ADG_test.jsonl
Accuracy with prompt 1: 38.00%
Accuracy with prompt 2: 38.00%
Accuracy with prompt 3: 44.00%
File: NERMuD_ADG_train.jsonl
Accuracy with prompt 1: 26.00%
Accuracy with prompt 2: 26.00%
Accuracy with prompt 3: 30.00%
File: NERMuD_FIC_dev.jsonl
Accuracy with prompt 1: 24.00%
Accuracy with prompt 2: 34.00%
Accuracy with 
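To put these percentages in context, it can help to compare them with a trivial baseline: with three choices, a uniform random guess is right about a third of the time, and always answering the most frequent class can do better when the labels are unbalanced. The sketch below computes both accuracy and the majority-class baseline; the function names and the label list are hypothetical and do not come from the NERMuD files.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Percentage of predictions equal to the gold labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, 100.0 * count / len(labels)


# Hypothetical gold labels (0 = persona, 1 = luogo, 2 = organizzazione).
gold = [1, 1, 0, 1, 2, 1, 0, 1, 1, 2]
label, baseline = majority_baseline(gold)
print(f"Majority class {label}: {baseline:.2f}%")
print(f"Accuracy of always predicting {label}: {accuracy([label] * len(gold), gold):.2f}%")
```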