# Fitness function

## MAX version
$$\text{Fitness} = W_{acc} \cdot \text{Acc} + W_{struct} \cdot \text{ROUGE} - W_{tok} \cdot \text{Token} - W_{time} \cdot \text{Time}$$


The evaluation function combines four components:

- **Accuracy (Acc)**: Measures the correctness of the content (obtained via keyword matching).

- **ROUGE-L (ROUGE)**: Measures the structural similarity to the reference answer.

- **Token Count (Token)**: Measures the length of the answer.

- **Latency (Time)**: Measures the time taken to generate the answer.
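
To make the ROUGE-L component concrete, here is a minimal sketch of the ROUGE-L F-measure computed from the longest common subsequence (LCS) of the two token sequences. It assumes plain lowercase whitespace tokenization; the `rouge_score` package used later applies its own tokenizer, so values can differ on text with punctuation. The function names `lcs_length` and `rouge_l_f1` are illustrative and not part of the notebook.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F-measure with simplified whitespace tokenization (assumption)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)  # share of the prediction that is in the LCS
    recall = lcs / len(ref)      # share of the reference that is in the LCS
    return 2 * precision * recall / (precision + recall)


print(round(rouge_l_f1("They come from plants", "from plants"), 4))  # 0.6667
```

A short prediction that contains the whole reference gets full recall but lower precision, which is why extra words reduce the score.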


## MIN version

$$\text{Loss} = W_{err} \cdot (1 - \text{Acc}) + W_{struct} \cdot (1 - \text{ROUGE}) + W_{tok} \cdot \text{TokenCount} + W_{time} \cdot \text{Latency}$$
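
The two formulations are related by a constant shift. As a worked step, expanding the products in the loss and regrouping gives the expression below, where TokenCount and Latency are the same quantities as the Token and Time terms of the MAX version:

$$\text{Loss} = W_{err} + W_{struct} - \left( W_{err} \cdot \text{Acc} + W_{struct} \cdot \text{ROUGE} - W_{tok} \cdot \text{TokenCount} - W_{time} \cdot \text{Latency} \right)$$

So, under the assumption that $W_{err} = W_{acc}$ and the other weights are shared, $\text{Loss} = W_{err} + W_{struct} - \text{Fitness}$, and minimizing the loss ranks candidates exactly as maximizing the fitness does. This identity holds only for the formulas as written here; it no longer applies if the two versions use different weights or if one of them post-processes the ROUGE term.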

# "Maximization problem" version

In [1]:
# Install the Hugging Face 'datasets' library
!pip install datasets=="2.19.1"


In [2]:
from datasets import load_dataset

ds = load_dataset("kaist-ai/CoT-Collection")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



In [8]:
# 4. Show the dataset structure
print("\n--- Dataset structure ---")
print(ds)

# 5. Show the first examples of the training split ('train').
# The relevant columns are 'source' (question), 'target' (final answer)
# and 'rationale' (the Chain-of-Thought reasoning).
print("\n--- First examples ('train' split) ---")

# Take the first 5 examples
examples = ds['train'].select(range(5))

for item in examples:
    print(f"Question: {item['source']}\nTarget: {item['target']}\nRationale: {item['rationale']}\n")
    print("="*60)


--- Dataset structure ---
DatasetDict({
    train: Dataset({
        features: ['source', 'target', 'rationale', 'task', 'type'],
        num_rows: 1837928
    })
})

--- First examples ('train' split) ---
Question: Article: Phytochemistry is a branch of plant biochemistry primarily concerned with the chemical substances produced by plants during secondary metabolism. Some of these compounds are toxins such as the alkaloid coniine from hemlock. Others, such as the essential oils peppermint oil and lemon oil are useful for their aroma, as flavourings and spices (e.g., capsaicin), and in medicine as pharmaceuticals as in opium from opium poppies. Many medicinal and recreational drugs, such as tetrahydrocannabinol (active ingredient in cannabis), caffeine, morphine and nicotine come directly from plants. Others are simple derivatives of botanical natural products. For example, the pain killer aspirin is the acetyl ester of salicylic acid, originally isolated from the bark of willow tree

# Testing fitness function

In [9]:
!pip install evaluate rouge_score



### Fitness function class

In [14]:
import evaluate
import numpy as np

# Load the ROUGE metric only if it is not already defined in this session
try:
    rouge_metric
except NameError:
    rouge_metric = evaluate.load("rouge")

class FitnessCalculator:
    def __init__(self,
                 w_accuracy=2.0,       # Formerly alpha: reward for a correct answer
                 w_token_penalty=0.02, # Formerly beta: cost per word used
                 w_latency_penalty=0.1,# Formerly gamma: cost per second of generation time
                 w_rouge_reward=1.0):  # Formerly delta: reward for structural similarity

        self.w_acc = w_accuracy
        self.w_token = w_token_penalty
        self.w_time = w_latency_penalty
        self.w_rouge = w_rouge_reward

    def calculate_keyword_accuracy(self, prediction, reference):
        """
        Check whether the target's keywords appear in the prediction.
        Returns 1.0 if every non-stopword of the target is present, otherwise 0.0.
        Tokenization is by whitespace, so punctuation stays attached to words.
        """
        pred_words = set(prediction.strip().lower().split())
        ref_words = set(reference.strip().lower().split())

        # Basic stop words, ignored so that purely grammatical words do not count as matches
        stop_words = {'the', 'a', 'an', 'in', 'on', 'at', 'to', 'from', 'of', 'and', 'is', 'are'}
        important_ref_words = ref_words - stop_words

        # Edge case (empty target or stop words only): fall back to a substring check
        if not important_ref_words:
            return 1.0 if reference.strip().lower() in prediction.strip().lower() else 0.0

        # Coverage: fraction of target keywords found in the prediction
        matches = len(important_ref_words.intersection(pred_words))
        coverage = matches / len(important_ref_words)

        return 1.0 if coverage == 1.0 else 0.0

    def compute(self, generated_text, target_text, generation_time):
        """
        Compute the total fitness score.
        """
        # 1. Accuracy (basic content correctness)
        accuracy = self.calculate_keyword_accuracy(generated_text, target_text)

        # 2. ROUGE (structural similarity)
        rouge_results = rouge_metric.compute(predictions=[generated_text], references=[target_text])
        rouge_l = rouge_results['rougeL']

        # 3. Token count (length), approximated by whitespace-separated words
        token_count = len(generated_text.split())

        # --- GATING LOGIC ---
        # If the answer is wrong (accuracy 0), the ROUGE reward is ignored.
        # Well-formed sentences that state something false should not be rewarded.
        effective_rouge = rouge_l if accuracy > 0 else 0.0

        # --- FITNESS COMPUTATION ---
        # Fitness = (accuracy reward) + (structure reward) - (length cost) - (time cost)
        fitness = (self.w_acc * accuracy) + \
                  (self.w_rouge * effective_rouge) - \
                  (self.w_token * token_count) - \
                  (self.w_time * generation_time)

        return {
            "fitness": round(fitness, 4),
            "details": {
                "accuracy_score": accuracy,
                "rouge_score": round(effective_rouge, 4),
                "tokens_used": token_count,
                "time_taken": round(generation_time, 4)
            }
        }

# Instantiate with explicit parameter names
fitness_engine = FitnessCalculator(
    w_accuracy=2.0,
    w_token_penalty=0.02,
    w_latency_penalty=0.1,
    w_rouge_reward=1.0
)

print("Fitness Engine v3.0 ready with explicit naming.")

Fitness Engine v3.0 ready with explicit naming.
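
One consequence of the whitespace tokenization in `calculate_keyword_accuracy` is that punctuation stays attached to words, so a keyword followed by a period in the prediction does not match the target. The sketch below reproduces the coverage computation in standalone form and compares it with a regex-based tokenizer. `tokenize_words` and `coverage_regex` are hypothetical helpers, not part of the class above, and both functions assume the target contains at least one non-stopword.

```python
import re

STOP_WORDS = {'the', 'a', 'an', 'in', 'on', 'at', 'to', 'from', 'of', 'and', 'is', 'are'}


def coverage_split(prediction, reference):
    """Keyword coverage with whitespace tokenization, mirroring the class above."""
    pred_words = set(prediction.strip().lower().split())
    ref_words = set(reference.strip().lower().split()) - STOP_WORDS
    return len(ref_words & pred_words) / len(ref_words)


def tokenize_words(text):
    """Hypothetical helper: lowercase alphanumeric tokens, punctuation dropped."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def coverage_regex(prediction, reference):
    """Same coverage, but with punctuation stripped before matching."""
    ref_words = tokenize_words(reference) - STOP_WORDS
    return len(ref_words & tokenize_words(prediction)) / len(ref_words)


# The trailing period turns "plants" into "plants." under str.split()
print(coverage_split("They come from plants.", "from plants"))  # 0.0
print(coverage_regex("They come from plants.", "from plants"))  # 1.0
```

Whether this normalization is desirable depends on the targets: for answers where punctuation is meaningful (numbers, code, abbreviations) the stricter whitespace matching may be the safer choice.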


### Testing function with some examples

In [15]:
# Target taken from the dataset
target_real = "from plants"

# Simulate 3 different generation scenarios

# SCENARIO A: perfect and fast answer
gen_a = "They come from plants"
time_a = 0.1 # seconds

# SCENARIO B: correct answer, but verbose and slow
gen_b = "Based on the text provided, we can clearly see that many medicines and recreational drugs actually originate directly from various types of plants found in nature."
time_b = 0.8 # seconds

# SCENARIO C: wrong and short answer
gen_c = "from chemicals"
time_c = 0.1 # seconds

# Compute the fitness scores with the engine instantiated earlier (fitness_engine)
res_a = fitness_engine.compute(gen_a, target_real, time_a)
res_b = fitness_engine.compute(gen_b, target_real, time_b)
res_c = fitness_engine.compute(gen_c, target_real, time_c)

# Helper to print the results. compute() returns the raw token count and time,
# so the two penalties are rebuilt here from the engine's weights.
def print_report(name, text, res):
    details = res["details"]
    token_penalty = round(fitness_engine.w_token * details["tokens_used"], 4)
    latency_penalty = round(fitness_engine.w_time * details["time_taken"], 4)
    print(f"--- {name} ---")
    print(f"Text: '{text}'")
    print(f"TOTAL FITNESS: {res['fitness']}")
    print(f"Details: Accuracy: {details['accuracy_score']} | "
          f"Rouge Reward: +{details['rouge_score']} | "
          f"Token Penalty: -{token_penalty} | "
          f"Latency Penalty: -{latency_penalty}")
    print("\n")

print_report("Scenario A (Concise and Correct)", gen_a, res_a)
print_report("Scenario B (Verbose)", gen_b, res_b)
print_report("Scenario C (Wrong)", gen_c, res_c)

--- Scenario A (Concise and Correct) ---
Text: 'They come from plants'
TOTAL FITNESS: 2.5767
Details: Accuracy: 1.0 | Rouge Reward: +0.6667 | Token Penalty: -0.08 | Latency Penalty: -0.01


--- Scenario B (Verbose) ---
Text: 'Based on the text provided, we can clearly see that many medicines and recreational drugs actually originate directly from various types of plants found in nature.'
TOTAL FITNESS: 1.5429
Details: Accuracy: 1.0 | Rouge Reward: +0.1429 | Token Penalty: -0.52 | Latency Penalty: -0.08


--- Scenario C (Wrong) ---
Text: 'from chemicals'
TOTAL FITNESS: -0.05
Details: Accuracy: 0.0 | Rouge Reward: +0.0 | Token Penalty: -0.04 | Latency Penalty: -0.01




OK, test passed ✅
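
As a sanity check, the reported totals can be reproduced by hand from the four components. The sketch below hard-codes the weights passed to `fitness_engine` and takes the ROUGE-L values from the reported output (0.6667 is 2/3, 0.1429 is 4/28). The raw ROUGE-L of 0.5 for Scenario C is an assumption based on one shared token out of two on each side; it is gated to zero because the accuracy is 0. `fitness_from_components` is an illustrative name, not part of the notebook.

```python
# Weights used when fitness_engine was instantiated
W_ACC, W_ROUGE, W_TOKEN, W_TIME = 2.0, 1.0, 0.02, 0.1


def fitness_from_components(accuracy, rouge_l, tokens, seconds):
    """Recompute the fitness from its parts, including the ROUGE gating."""
    effective_rouge = rouge_l if accuracy > 0 else 0.0  # no reward for wrong answers
    return (W_ACC * accuracy
            + W_ROUGE * effective_rouge
            - W_TOKEN * tokens
            - W_TIME * seconds)


# Scenario A: correct, ROUGE-L = 2/3, 4 words, 0.1 s
print(round(fitness_from_components(1.0, 2 / 3, 4, 0.1), 4))   # 2.5767
# Scenario B: correct, ROUGE-L = 4/28, 26 words, 0.8 s
print(round(fitness_from_components(1.0, 4 / 28, 26, 0.8), 4))  # 1.5429
# Scenario C: wrong, raw ROUGE-L assumed 0.5 but gated to 0, 2 words, 0.1 s
print(round(fitness_from_components(0.0, 0.5, 2, 0.1), 4))      # -0.05
```

Scenario C shows why the gating matters: without it, the shared word "from" would add 0.5 to the score of a wrong answer.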

### Test 2: fine-grained test

In [16]:
import pandas as pd

# Target for this test
target_ref = "Battle of Appomattox Court House"

# List of edge cases
test_cases = [
    # 1. PERFECT
    {"desc": "Perfect", "text": "Battle of Appomattox Court House", "time": 0.1},

    # 2. CORRECT VARIATIONS
    {"desc": "Correct (lowercase)", "text": "battle of appomattox court house", "time": 0.1},
    {"desc": "Correct + intro (verbose)", "text": "The correct answer is the Battle of Appomattox Court House", "time": 0.5},
    {"desc": "Correct + very long intro", "text": "I have analyzed the history books and I can confidently say that the event in question is definitely the Battle of Appomattox Court House which ended the war.", "time": 1.2},

    # 3. "ALMOST" CORRECT CASES (tests how strict the keyword accuracy is)
    {"desc": "Missing 1 word (Battle)", "text": "Appomattox Court House", "time": 0.1},
    {"desc": "Missing 1 word (House)", "text": "Battle of Appomattox Court", "time": 0.1},

    # 4. HALLUCINATIONS (similar words, different fact)
    {"desc": "Hallucination (Gettysburg)", "text": "Battle of Gettysburg Court House", "time": 0.2},

    # 5. WRONG CASES
    {"desc": "Wrong, short", "text": "Civil War", "time": 0.1},
    {"desc": "Wrong, long", "text": "General Lee surrendered at a different location entirely.", "time": 0.5},

    # 6. NOISE
    {"desc": "Empty/symbols", "text": "...", "time": 0.01},
]

results = []

# Run the test
print(f"TARGET: '{target_ref}'\n")

for case in test_cases:
    # Compute the fitness
    score = fitness_engine.compute(case["text"], target_ref, case["time"])

    # Store the data for the table
    results.append({
        "Description": case["desc"],
        "Generated answer": case["text"],
        "Fitness": score["fitness"],
        "Acc (Kwd)": score["details"]["accuracy_score"],
        "Rouge": score["details"]["rouge_score"],
        "Tokens": score["details"]["tokens_used"],
        "Time": score["details"]["time_taken"]
    })

# Build a DataFrame for a readable view
df_results = pd.DataFrame(results)

# Sort by descending fitness (best on top)
df_results = df_results.sort_values(by="Fitness", ascending=False)

# Formatted display
from IPython.display import display
display(df_results)

TARGET: 'Battle of Appomattox Court House'



| | Description | Generated answer | Fitness | Acc (Kwd) | Rouge | Tokens | Time |
|---|---|---|---|---|---|---|---|
| 0 | Perfect | Battle of Appomattox Court House | 2.89 | 1.0 | 1.0 | 5 | 0.1 |
| 1 | Correct (lowercase) | battle of appomattox court house | 2.89 | 1.0 | 1.0 | 5 | 0.1 |
| 2 | Correct + intro (verbose) | The correct answer is the Battle of Appomattox... | 2.4167 | 1.0 | 0.6667 | 10 | 0.5 |
| 3 | Correct + very long intro | I have analyzed the history books and I can co... | 1.623 | 1.0 | 0.303 | 28 | 1.2 |
| 9 | Empty/symbols | ... | -0.021 | 0.0 | 0.0 | 1 | 0.01 |
| 7 | Wrong, short | Civil War | -0.05 | 0.0 | 0.0 | 2 | 0.1 |
| 4 | Missing 1 word (Battle) | Appomattox Court House | -0.07 | 0.0 | 0.0 | 3 | 0.1 |
| 5 | Missing 1 word (House) | Battle of Appomattox Court | -0.09 | 0.0 | 0.0 | 4 | 0.1 |
| 6 | Hallucination (Gettysburg) | Battle of Gettysburg Court House | -0.12 | 0.0 | 0.0 | 5 | 0.2 |
| 8 | Wrong, long | General Lee surrendered at a different locatio... | -0.21 | 0.0 | 0.0 | 8 | 0.5 |
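
The table shows that every answer with accuracy 0 is ranked purely by its token and time penalties, so the answer "..." scores above "Appomattox Court House". This follows from the all-or-nothing threshold on keyword coverage. The sketch below prints the raw coverage before thresholding for a few of the test answers, to show how much information the threshold discards. `keyword_coverage` is an illustrative standalone function that mirrors the coverage step of the class and assumes the target has at least one non-stopword.

```python
STOP_WORDS = {'the', 'a', 'an', 'in', 'on', 'at', 'to', 'from', 'of', 'and', 'is', 'are'}


def keyword_coverage(prediction, reference):
    """Fraction of the target's non-stopword keywords found in the prediction."""
    pred_words = set(prediction.strip().lower().split())
    ref_words = set(reference.strip().lower().split()) - STOP_WORDS
    return len(ref_words & pred_words) / len(ref_words)


target = "Battle of Appomattox Court House"
answers = [
    "Battle of Appomattox Court House",
    "Appomattox Court House",
    "Battle of Gettysburg Court House",
    "Civil War",
    "...",
]

for text in answers:
    coverage = keyword_coverage(text, target)
    accuracy = 1.0 if coverage == 1.0 else 0.0  # the threshold used by the class
    print(f"{text!r}: coverage={coverage:.2f} -> accuracy={accuracy}")
```

A near miss and a hallucination both reach a coverage of 0.75 here, so the strict threshold is what keeps the hallucinated answer from earning partial credit, at the cost of treating a near miss like an empty answer.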


# Version 2: "Minimizable" fitness


In [17]:
import evaluate
import numpy as np

# Load ROUGE if it is not already defined in this session
try:
    rouge_metric
except NameError:
    rouge_metric = evaluate.load("rouge")

class CostCalculator:
    def __init__(self,
                 w_error=10.0,        # Very large penalty for a wrong answer
                 w_token_cost=0.05,   # Cost per word used
                 w_time_cost=0.5,     # Cost per second of generation time
                 w_divergence=2.0):   # Cost for a structure that differs from the target

        self.w_err = w_error
        self.w_tok = w_token_cost
        self.w_time = w_time_cost
        self.w_div = w_divergence

    def calculate_keyword_accuracy(self, prediction, reference):
        """
        Returns 1.0 if correct, 0.0 if wrong.
        """
        pred_words = set(prediction.strip().lower().split())
        ref_words = set(reference.strip().lower().split())
        stop_words = {'the', 'a', 'an', 'in', 'on', 'at', 'to', 'from', 'of', 'and', 'is', 'are'}
        important_ref_words = ref_words - stop_words

        # Edge case (empty target or stop words only): fall back to a substring check
        if not important_ref_words:
            return 1.0 if reference.strip().lower() in prediction.strip().lower() else 0.0

        matches = len(important_ref_words.intersection(pred_words))
        coverage = matches / len(important_ref_words)
        return 1.0 if coverage == 1.0 else 0.0

    def compute(self, generated_text, target_text, generation_time):
        """
        Compute the total LOSS (to be minimized).
        The theoretical optimum is 0.0; it is not reachable in practice because
        any answer pays some token and time cost, but lower is always better.
        """
        # 1. Accuracy
        accuracy = self.calculate_keyword_accuracy(generated_text, target_text)
        # Convert to an error (0 if correct, 1 if wrong)
        error_rate = 1.0 - accuracy

        # 2. ROUGE (structure)
        rouge_results = rouge_metric.compute(predictions=[generated_text], references=[target_text])
        rouge_l = rouge_results['rougeL']

        # NO GATING HERE: the divergence (1 - ROUGE) is always computed,
        # whether or not the answer is correct.
        # When accuracy is 0, a low divergence cannot make up for the mistake,
        # because the error term (w_err) dominates the loss.
        structural_divergence = 1.0 - rouge_l

        # 3. Resource costs
        token_count = len(generated_text.split())

        # --- LOSS COMPUTATION ---

        # Base cost = (error weight * error) + (divergence weight * divergence)
        # If accuracy = 1 (correct), the first term vanishes.
        base_cost = (self.w_err * error_rate) + (self.w_div * structural_divergence)

        # Efficiency cost = tokens + time
        efficiency_cost = (self.w_tok * token_count) + (self.w_time * generation_time)

        total_loss = base_cost + efficiency_cost

        return {
            "loss": round(total_loss, 4),
            "details": {
                "is_correct": accuracy > 0,
                "error_cost": round(self.w_err * error_rate, 4),
                "divergence_cost": round(self.w_div * structural_divergence, 4),
                "token_cost": round(self.w_tok * token_count, 4),
                "time_cost": round(self.w_time * generation_time, 4)
            }
        }

# Initialization
# Note that w_error (10.0) is much larger than w_token_cost * average_length
loss_engine = CostCalculator(
    w_error=10.0,       # Top priority: do NOT answer incorrectly
    w_token_cost=0.05,  # Prefer short answers
    w_time_cost=0.5,    # Prefer fast answers
    w_divergence=1.0    # Prefer a similar structure (overrides the class default of 2.0)
)

print("Cost function (loss) ready for minimization.")

Cost function (loss) ready for minimization.


### Test

In [18]:
import pandas as pd

target_ref = "Battle of Appomattox Court House"

test_cases = [
    # EXPECTED WINNER (low cost)
    {"desc": "Perfect", "text": "Battle of Appomattox Court House", "time": 0.1},

    # MEDIUM COST (correct but expensive)
    {"desc": "Correct but long", "text": "The correct answer is the Battle of Appomattox Court House", "time": 0.5},

    # HIGH COST (wrong)
    {"desc": "Wrong, short", "text": "Civil War", "time": 0.1},

    # VERY HIGH COST (wrong + long)
    {"desc": "Wrong, long", "text": "General Lee surrendered at a different location entirely.", "time": 0.5},

    # TRAP (empty/fast)
    {"desc": "Empty (empty trap)", "text": ".", "time": 0.01},
]

results = []

print(f"TARGET: '{target_ref}'\n")

for case in test_cases:
    res = loss_engine.compute(case["text"], target_ref, case["time"])

    results.append({
        "Description": case["desc"],
        "LOSS (minimize)": res["loss"], # Lower = better
        "Correct?": res["details"]["is_correct"],
        "Err Cost": res["details"]["error_cost"],
        "Div Cost": res["details"]["divergence_cost"],
        "Tok Cost": res["details"]["token_cost"]
    })

df_loss = pd.DataFrame(results)
# Sort from lowest (best) to highest
df_loss = df_loss.sort_values(by="LOSS (minimize)", ascending=True)

from IPython.display import display
display(df_loss)

TARGET: 'Battle of Appomattox Court House'



| | Description | LOSS (minimize) | Correct? | Err Cost | Div Cost | Tok Cost |
|---|---|---|---|---|---|---|
| 0 | Perfect | 0.3 | True | 0.0 | 0.0 | 0.25 |
| 1 | Correct but long | 1.0833 | True | 0.0 | 0.3333 | 0.5 |
| 4 | Empty (empty trap) | 11.055 | False | 10.0 | 1.0 | 0.05 |
| 2 | Wrong, short | 11.15 | False | 10.0 | 1.0 | 0.1 |
| 3 | Wrong, long | 11.65 | False | 10.0 | 1.0 | 0.4 |
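
The table does not include the time cost, so the visible cost columns do not add up to the loss (for the first row, 0.25 against a loss of 0.3). The sketch below recomputes the loss from all four terms, using the weights passed to `loss_engine` and the ROUGE-L values implied by the Div Cost column. `loss_from_components` is an illustrative name, not part of the notebook.

```python
# Weights used when loss_engine was instantiated
W_ERR, W_DIV, W_TOK, W_TIME = 10.0, 1.0, 0.05, 0.5


def loss_from_components(accuracy, rouge_l, tokens, seconds):
    """Return the rounded total loss and the four cost terms that make it up."""
    parts = {
        "error": W_ERR * (1.0 - accuracy),
        "divergence": W_DIV * (1.0 - rouge_l),
        "token": W_TOK * tokens,
        "time": W_TIME * seconds,  # the term missing from the table
    }
    return round(sum(parts.values()), 4), parts


# Exact match: 5 words, 0.1 s -> 0.25 (tokens) + 0.05 (time)
loss, parts = loss_from_components(1.0, 1.0, 5, 0.1)
print(loss, round(parts["time"], 4))  # 0.3 0.05

# Answer "Civil War": wrong, no overlap with the target, 2 words, 0.1 s
loss, parts = loss_from_components(0.0, 0.0, 2, 0.1)
print(loss)  # 11.15
```

Adding a time cost column to the results table would make each row sum to its loss.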
