# HuLU eval of Koboldcpp hosted LLM
HuLU dataset is created with similar goals in mind as GLUE, but for Hungarian specific language evaluation in mind. This means, it is just a classification task list, which required to be evaluated on any kind of NLP classifier you can throw at it.

It's intended workflow is the following:
- One trains the classifier with training set from the whole dataset
- Checks back with the evaluation dataset how it performs (it is also labeled)
- Then runs the whole Test set, and stores the results (it doesn't contain labels, it's private)
- Uploads the result to somewhere, which calculates the metrics, and possibly displays it's score on a leader board

## Why keep the test labels secret?
Simple enough: to prevent the LLMs to contaminate their knowledge with the correct results, thus preventing some LLM developers to artificially train their LLMs with the good results, thus advancing their creation on the leader board.

Though, so far I didn't came accross any leader boards, nor services which calculates the metrics for me.

## How to evaluate then?
The idea, is to use the evaluation part of the dataset, to measure. We skip the part, where we train the LLM for the classification part, and instead we use the context learning capability of the model: we give a short description of the work they need to perform, and give some examples (5 diverse examples to be concrate) from the training set. This gives us a few-shot-example-prompt, which according to my experience, most of the LLMs can easily comply. For this, to work, the most interresting candidates for evaluation, are the instruction finetuned versions of the models.

## Why use Kobobldcpp to host LLM?
Koboldcpp is easy to configure through command line, can use CPU, CUDA, OpenCL, etc, so it's easy to squeeze out most of what's available on the local, or remote machine, without touching this very notebook, and it's codes. Also, one can use any arbitrary quantized version of any openly available LLM, provided Koboldcpp is capable to run it. It it can run, a lot of different types of LLMs.

In [1]:
!pip install datasets evaluate requests scipy scikit-learn openai



Load required packages

In [2]:
import datasets
print(datasets.__version__)

3.0.1


In this notebook, we will see how to evaluate one of the [Transformers](https://github.com/huggingface/transformers) model on [HuLU Benchmark](https://hulu.nytud.hu/) dataset.

The HuLU Benchmark is a group of six classification tasks on sentences or pairs of sentences which are:

- [HuCoLA](https://github.com/nytud/HuCOLA) (Hungarian Corpus of Linguistic Acceptability) contains 9 076 Hungarian sentences labeled for their acceptability/grammaticality (0/1).
- [HuCoPA](https://github.com/nytud/HuCoPA) (Hungarian Choice of Plausible Alternatives Corpus) contains 1,000 instances. Each instance is composed of a premise and two alternatives. The task is to select the alternative that describes a situation standing in causal relation to the situation described by the premise.
- [HuCB](https://github.com/nytud/HuCommitmentBank) The HuCommitmentBank consists of short text fragments in which at least one sentence contains a subordinating clause, which is syntactically subordinated to a logical inference-cancelling operator.
- [HuRTE](https://github.com/nytud/HuRTE) (Hungarian Recognizing Textual Entailment) The dataset contains 4 504 instances. Each example contains a (sometimes multi-sentence) premise and a one-sentence hypothesis, and the task is to decide whether the former entails the latter or not.Determine if a sentence entails a given hypothesis or not.
- [HuSST](https://github.com/nytud/HuSST) (Hungarian version of the Stanford Sentiment Treebank) contains 11 683 sentences. Each sentence is annotated for its sentiment on a three-point scale.
- [HuWNLI](https://github.com/nytud/HuWNLI) (Winograd Natural Language Inference) Anaphora resolution datasets for Hungarian as an inference task; this is a Hungarian dataset of anaphora resolution, designed as a sentence pair classification task of natural language inference.


## Loading the dataset

We will use git clone to download data, and [Datasets](https://github.com/huggingface/datasets) library to load the data and [Evaluate](https://github.com/huggingface/evaluate) library to get the metric we need to use for evaluation (to compare our model to the benchmark).

In [4]:
!mkdir hulu
!git clone https://github.com/nytud/HuCOLA/ hulu/hucola
!git clone https://github.com/nytud/HuCoPA/ hulu/hucopa
!git clone https://github.com/nytud/HuCommitmentBank/ hulu/hucb
!git clone https://github.com/nytud/HuRTE/ hulu/hurte
!git clone https://github.com/nytud/HuSST/ hulu/husst
!git clone https://github.com/nytud/HuWNLI/ hulu/huwnli

Cloning into 'hulu/hucola'...
Cloning into 'hulu/hucopa'...
Cloning into 'hulu/hucb'...
Cloning into 'hulu/hurte'...
Cloning into 'hulu/husst'...
Cloning into 'hulu/huwnli'...


### Define tasks
What can be found where, and which metric belongs to which

In [1]:
HULU_TASKS = [
    ("hucola", "hulu/hucola/data/cola_", ["train", "dev", "test"], "cola"), 
    ("hucopa", "hulu/hucopa/data/", ["train", "val", "test"], "rte"), 
    ("hucb", "hulu/hucb/data/hucb_", ["train", "dev", "test"], "mnli"), 
    ("hurte", "hulu/hurte/data/rte_", ["train", "dev", "test"], "rte"), 
    ("husst", "hulu/husst/data/sst_", ["train", "dev", "test"], "sst2"), 
    ("huwnli", "hulu/huwnli/data/", ["train", "dev", "test"], "wnli")
]
task = "huwnli"


**Note**: I had a little problem reading in '''hucb''' json files (deserialization errors), and the solution was to open them in editor, and save back with UTF-8 marker chars in the beginning of the file.

In [2]:
from datasets import load_dataset
from evaluate import load as load_metric

for (actual_task, path, variants, glue_metric) in HULU_TASKS:
    if actual_task == task:
        dataset = load_dataset('json', data_files=f"{path}{variants[0]}.json")
        metric = load_metric('glue', glue_metric)
        break

references = [0, 1, 0, 1]
predictions = [0, 1, 1, 1]
results = metric.compute(predictions=predictions, references=references)
print(results)
dataset


{'accuracy': 0.75}


DatasetDict({
    train: Dataset({
        features: ['orig_id', 'id', 'sentence1', 'sentence2', 'label', 'Column6'],
        num_rows: 562
    })
})

## Output values

The output of the metric depends on the GLUE subset chosen, consisting of a dictionary that contains one or several of the following metrics:

`accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information).

`f1`: the harmonic mean of the precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is 0-1 – its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall.

`pearson`: a measure of the linear relationship between two datasets (see [Pearson correlation](https://huggingface.co/metrics/pearsonr) for more information). Its range is between -1 and +1, with 0 implying no correlation, and -1/+1 implying an exact linear relationship. Positive correlations imply that as x increases, so does y, whereas negative correlations imply that as x increases, y decreases.

`spearmanr`: a nonparametric measure of the monotonicity of the relationship between two datasets(see [Spearman Correlation](https://huggingface.co/metrics/spearmanr) for more information). spearmanr has the same range as pearson.

`matthews_correlation`: a measure of the quality of binary and multiclass classifications (see [Matthews Correlation](https://huggingface.co/metrics/matthews_correlation) for more information). Its range of values is between -1 and +1, where a coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction.

The cola subset returns matthews_correlation, the stsb subset returns pearson and spearmanr, the mrpc and qqp subsets return both accuracy and f1, and all other subsets of GLUE return only accuracy.

In [3]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: f"{typ.names[i]} ({i})")
    display(HTML(df.to_html()))

We can visualize a random portion of the loaded dataset by calling the above defined method:

In [4]:
show_random_elements(dataset["train"])

Unnamed: 0,orig_id,id,sentence1,sentence2,label,Column6
0,411,360,"Kimentem egy kis ételért, inkább az idő elütése végett, mint azért, mert szükségem volt rá.","Kimentem egy kis ételért, inkább az idő elütése végett, mint azért, mert szükségem volt az időre.",0,
1,16,13,Betettem a tortát a hűtőbe. Sok vaj van benne.,A tortában sok vaj van.,1,
2,137,118,"Nem találtam kanalat, szóval egy tollal próbáltam megkeverni a kávémat. Ez rossz ötletnek bizonyult, ugyanis tele lett tintával.",A toll tele lett tintával.,0,
3,563,491,Az asztalra raktam a nehéz könyvet és összetört.,Az asztal összetört.,1,
4,14,11,"Laci eddig mindig segített apának a munkával. De most nem tudott segíteni neki, mert apa azt mondta, hogy a főnöke a vasúti társaságnál nem akarja, hogy rajta kívül bárki az irodájában dolgozzon.",Apa most nem tudott segíteni.,0,
5,482,421,"Alíz a nappalit porolta, és próbálta megtalálni a gombot, amit anya elrejtett. Ma nem volt ideje arra, hogy régi képeket nézegessen a kedvenc fotóalbumában. Ma egy gombot kellett keresnie, ezért az albumot egy székre tette anélkül, hogy kinyitotta volna .","Az albumot egy székre tette anélkül, hogy kinyitotta volna az albumot.",1,
6,545,477,"Sam Goodman monográfiája a spártai tábornok Xenophanészról jól tükrözi azokat a nehézségeket, amelyekkel gyerekkora során szembesült.",Xenophanész nehézségekkel szembesült.,1,
7,573,497,"Janka bekopogtatott Zsuzsához, de senki nem nyitott ajtót. Csalódott volt.",Janka csalódott volt.,1,
8,39,34,"Jakab megnyugtatta Kevint, mert zaklatott volt.",Kevin zaklatott volt.,1,
9,517,452,"Erzsi nem lett mérges Sacira, aki félbeszakította, mert megállt és bocsánatot kért.",Erzsi bocsánatot kért.,0,


## Task description
Here we define a task description for each of the tasks:
- Where the data can be found in the dataset
- What needs the LLM do with those
- Regexp for the result to parse for

Out of these a prompt will be constructed for each of the tasks, automatically.

In [5]:
task_to_keys = {
    "hucola": ("Sent", None, None, None, "Határozd meg, hogy a mondat nyelvtanilag helyes-e (1), vagy helytelen (0).", r'[01]'),
    "hucopa": ("premise", "choice1", "choice2", "question", "Válaszd ki, melyik mondat (1 vagy 2) a kérdésnek megfelelő válasz.", r'[12]'),
    "hucb": ("premise", "hypothesis", None, None, "Határozd meg, hogy a premisa és hipotézis milyen kapcsolatban állnak: ellentmondás (0), semleges (1), következmény (2)", r'[012]'),
    "hurte": ("premise", "hypothesis", None, None, "Határozd meg, hogy a premisából következik-e (1) a hipotézis, vagy sem (0).", r'[01]'),
    "husst": ("Sent", None, None, None, "Határozd meg, hogy a mondat pozitív (1) vagy negatív (0) hangulatú-e.", r'[01]'),
    "huwnli": ("sentence1", "sentence2", None, None, "Határozd meg, hogy az első mondatból következik-e (1) a második mondat, vagy sem (0).", r'[01]'),
}

## Calling the LLM web service
For this, we define a method. Only select/run the one you want to use.

### 1.) Koboldcpp web service call

In [6]:
import requests
def call_llm(prompt):
    api_url = "http://localhost:5001/api/v1"
    stop_words = ["###","**Observation**","</s>","<|"]
    headers = {
        "Content-Type": "application/json"
    }

    data = {
        "prompt": prompt,
        "max_tokens": 10,
        "temperature": 0,
        "top_p": 1.0,
        "n": 20,
        "stop": stop_words
    }
    
    response = requests.post(f"{api_url}/completion", headers=headers, json=data)
    result = response.json()["choices"][0]["text"]
    for sw in stop_words:
        result = result.replace(sw, "")
    return result.strip()
    

### 2.) AzureOpenAI web service call
For this to work, one need to define 3 environment variables:
- ```AZURE_OPENAI_API_KEY```
- ```AZURE_OPENAI_ENDPOINT```
- ```AZURE_OPENAI_DEPLOYMENT_NAME```

In [8]:
from openai import AzureOpenAI
import os

azureClient = AzureOpenAI(api_key = os.getenv("AZURE_OPENAI_API_KEY"), azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"), api_version = "2023-05-15")
def call_llm(prompt):
    try:
        response = azureClient.completions.create(
            model=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),  # Specify the engine you want to use
            prompt=prompt,
            max_tokens=10,  # Adjust the number of tokens as needed
            temperature=0
        )
    
        return response.choices[0].text.strip()
    except Exception as e:
        return e


In [7]:
call_llm("Tell me, how much is 2x2?")

'2 times 2 is 4.'

## The prompt
To prepare the prompt, we provide a system description, comming from `task_to_keys`, and additional 5 examples from the training dataset, with scores included.

The actual test is comming from the "validation" dataset part, one by one, only providing the sentences to operate on, and simply prompting for a numerical answer by placing <|assistant|> at the line start.

In [8]:
import pandas as pd
import re

def find_first_number(input_str, matchstr):
    match = re.search(matchstr, input_str)
    return float(match.group()) if match else None

def get_label_value(input_str):
    num = find_first_number(input_str, r'\d+(\.\d+)?')
    if not num is None:
        return num
    if input_str == "contradiction" or input_str == "negative":
        return 0
    elif input_str == "neutral" or input_str == "positive":
        return 1
    return 2

def measure(actual_task, path, variants, glue_metric):
    (field1, field2, field3, field4, system_prompt, matchstr) = task_to_keys[actual_task]
    dataset = load_dataset('json', data_files=f"{path}{variants[0]}.json")
    metric = load_metric('glue', glue_metric)
    
    prompt = f"<|system|>{system_prompt}\n"
    training_data = dataset['train']
    
    # Here we collect a diverse labled training set of 5 to make a few shot examples
    last_label = 0
    num_examples = 0
    for idx, f1 in enumerate(training_data[field1]):
        if training_data['label'][idx] == last_label: continue
        prompt += f"<|{field1 if field1 != "Sent" else "sentence"}|>{f1}\n"
        if not field2 is None:
            prompt += f"<|{field2}|>{training_data[field2][idx]}\n"
        if not field3 is None:
            prompt += f"<|{field3}|>{training_data[field3][idx]}\n"
        if not field4 is None:
            if training_data[field4][idx] == "cause":
                prompt += f"<|{field4}|>indok\n"
            else:
                prompt += f"<|{field4}|>következmény\n"
        prompt += f"<|assistant|>{get_label_value(training_data['label'][idx])}\n"
        
        last_label = training_data['label'][idx]
        num_examples += 1
        if num_examples >= 5: break
    
    references = []
    predictions = []
    failed = []
    testset = load_dataset('json', data_files=f"{path}{variants[1]}.json")
    testset = testset['train']
    count = 0
    for idx, f1 in enumerate(testset[field1]):
        query = prompt
        query += f"<|{field1 if field1 != "Sent" else "sentence"}|>{f1}\n"
        if not field2 is None:
            query += f"<|{field2}|>{training_data[field2][idx]}\n"
        if not field3 is None:
            query += f"<|{field3}|>{training_data[field3][idx]}\n"
        if not field4 is None:
            if training_data[field4][idx] == "cause":
                query += f"<|{field4}|>indok\n"
            else:
                query += f"<|{field4}|>következmény\n"
        query += f"<|assistant|>"

        result = call_llm(query)
        
        num_result = find_first_number(result, matchstr)
        if num_result is None:
            failed.append(idx)
        else:
            references.append(get_label_value(testset['label'][idx]))
            predictions.append(num_result)
        count += 1
        if count % 10 == 0:
            print(f"Task: {actual_task}, {count/len(testset[field1])*100:5.1f}%", end="\r")
    
    results = metric.compute(predictions=predictions, references=references)
    return (len(testset[field1]), len(failed), results)

In [None]:
for (actual_task, path, variants, glue_metric) in HULU_TASKS:
    if actual_task == task:
        (number, failed, result) = measure(task, path, variants, glue_metric)
        break

print(f"Task: {task} {number}/{failed}               ")
print(result)

# Putting all together
- iterate through all tasks
- load related dataset & metric
- evaluate all verification elements


In [12]:
for (task, path, variants, glue_metric) in HULU_TASKS:
    (number, failed, result) = measure(task, path, variants, glue_metric)
    print(f"Task: {task} {number}/{failed}               ")
    print(result)

Task: hucola 910/0               
{'matthews_correlation': 0.2671840356811353}
Task: hucopa 100/74               
{'accuracy': 0.6538461538461539}
Task: hucb 103/1               
{'accuracy': 0.3431372549019608}
Task: hurte 243/0               
{'accuracy': 0.4444444444444444}
Task: husst 1165/0               
{'accuracy': 0.6283261802575107}
Task: huwnli 60/0               
{'accuracy': 0.5333333333333333}
