# GLUE eval of Koboldcpp hosted LLM
Install required python packages

In [1]:
#!pip install ipywidgets

In [2]:
#!pip install datasets evaluate requests scipy scikit-learn

Load required packages

In [2]:
import datasets

In this notebook, we will see how to fine-tune one of the [Transformers](https://github.com/huggingface/transformers) model to a text classification task of the [GLUE Benchmark](https://gluebenchmark.com/).

The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences which are:

- [CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not.is a  dataset containing sentences labeled grammatically correct or not.
- [MNLI](https://arxiv.org/abs/1704.05426) (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. (This dataset has two versions, one with the validation and test set coming from the same distribution, another called mismatched where the validation and test use out-of-domain data.)
- [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases from one another or not.
- [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. (This dataset is built from the SQuAD dataset.)
- [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Quora Question Pairs2) Determine if two questions are semantically equivalent or not.
- [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
- [SST-2](https://nlp.stanford.edu/sentiment/index.html) (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.
- [STS-B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 1 to 5.
- [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) (Winograd Natural Language Inference) Determine if a sentence with an anonymous pronoun and a sentence with this pronoun replaced are entailed or not. (This dataset is built from the Winograd Schema Challenge dataset.)

We will see how to easily load the dataset for each one of those tasks and use the `Trainer` API to fine-tune a model on it. Each task is named by its acronym, with `mnli-mm` standing for the mismatched version of MNLI (so same training set as `mnli` but different validation and test sets):

In [3]:
GLUE_TASKS = ["cola", "mnli", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]
#GLUE_TASKS = ["qqp", "rte", "sst2", "stsb", "wnli"]
task = "qqp"

actual_task = "mnli" if task == "mnli-mm" else task


## Loading the dataset

We will use the [Datasets](https://github.com/huggingface/datasets) library to download the data and [Evaluate](https://github.com/huggingface/evaluate) library to get the metric we need to use for evaluation (to compare our model to the benchmark).

In [4]:
from datasets import load_dataset

dataset = load_dataset("glue", actual_task)

dataset


DatasetDict({
    train: Dataset({
        features: ['question1', 'question2', 'label', 'idx'],
        num_rows: 363846
    })
    validation: Dataset({
        features: ['question1', 'question2', 'label', 'idx'],
        num_rows: 40430
    })
    test: Dataset({
        features: ['question1', 'question2', 'label', 'idx'],
        num_rows: 390965
    })
})

In [5]:
from evaluate import load as load_metric
metric = load_metric('glue', actual_task)

print("example output")
references = [0, 1, 0, 1]
predictions = [0, 1, 1, 1]


results = metric.compute(predictions=predictions, references=references)
results

example output


{'accuracy': 0.75, 'f1': 0.8}

## Output values

The output of the metric depends on the GLUE subset chosen, consisting of a dictionary that contains one or several of the following metrics:

`accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information).

`f1`: the harmonic mean of the precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is 0-1 – its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall.

`pearson`: a measure of the linear relationship between two datasets (see [Pearson correlation](https://huggingface.co/metrics/pearsonr) for more information). Its range is between -1 and +1, with 0 implying no correlation, and -1/+1 implying an exact linear relationship. Positive correlations imply that as x increases, so does y, whereas negative correlations imply that as x increases, y decreases.

`spearmanr`: a nonparametric measure of the monotonicity of the relationship between two datasets(see [Spearman Correlation](https://huggingface.co/metrics/spearmanr) for more information). spearmanr has the same range as pearson.

`matthews_correlation`: a measure of the quality of binary and multiclass classifications (see [Matthews Correlation](https://huggingface.co/metrics/matthews_correlation) for more information). Its range of values is between -1 and +1, where a coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction.

The cola subset returns matthews_correlation, the stsb subset returns pearson and spearmanr, the mrpc and qqp subsets return both accuracy and f1, and all other subsets of GLUE return only accuracy.

In [6]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: f"{typ.names[i]} ({i})")
    display(HTML(df.to_html()))

We can visualize a random portion of the loaded dataset by calling the above defined method:

In [7]:
show_random_elements(dataset["train"])

Unnamed: 0,question1,question2,label,idx
0,How can I improve my data analysis skills?,How do I improve my research analysis and data analysis skills?,duplicate (1),170446
1,What is SAP?,What is sap bi?,not_duplicate (0),140876
2,Will the electoral college change their vote if enough people protest against Donald Trump?,Do you think republican members of the electoral college will go against tradition and refuse to vote for Trump in December 2016?,duplicate (1),329463
3,What will be a better choice between XLRI and IIFT?,Which is better XLRI or IIFT?,duplicate (1),47384
4,How do cameras work? How have they improved since their creation?,How do cameras work?,duplicate (1),185125
5,What happens if your car runs out of engine coolant?,How do I tell if my car is leaking coolant?,not_duplicate (0),162766
6,Which topics I should cover in air wave propagation?,"Why does sound need a medium like air or water in order to travel, but radio waves do not?",not_duplicate (0),186856
7,What are some yoga poses to help me lose weight?,What are the best yoga poses for weight loss?,duplicate (1),120778
8,How do block pornsites and pornwords on Chrome is?,"If the United States goes to war with another country, what happens to its own citizens residing in that country or in another country nearby?",not_duplicate (0),281613
9,How can I earn money from Facebook page?,How do I earn money with my Facebook page?,duplicate (1),60637


## Task description
Here we define a task description for each of the tasks:
- Where the data can be found in the dataset
- What needs the LLM do with those

Out of these a prompt will be constructed for each of the tasks, automatically.

In [8]:
task_to_keys = {
    "cola": ("sentence", None, "Determine if a sentence is grammatically correct (1) or not (0).", "validation", r'[01]'),
    "mnli": ("premise", "hypothesis", "Determine if a sentence entails (0), contradicts (1) or is unrelated (2) to a given hypothesis.", "validation_matched", r'[012]'),
    "mrpc": ("sentence1", "sentence2", "Determine if two sentences are paraphrases from one another (1) or not (0).", "validation", r'[01]'),
    "qnli": ("question", "sentence", "Determine if the answer to a question is in the second sentence (0) or not (1).", "validation", r'[01]'),
    "qqp": ("question1", "question2", "Determine if two questions are semantically equivalent (1) or not (0).", "validation", r'[01]'),
    "rte": ("sentence1", "sentence2", "Determine if a sentence entails a given hypothesis (0) or not (1).", "validation", r'[01]'),
    "sst2": ("sentence", None, "Determine if the sentence has a positive (1) or negative (0) sentiment.", "validation", r'[01]'),
    "stsb": ("sentence1", "sentence2", "Rate the semantic similarity between two sentences with a numeric score between 0.0 to 5.0.", "validation", r'\d+(\.\d+)?'),
    "wnli": ("sentence1", "sentence2", "Determine if a sentence with an anonymous pronoun and a sentence with this pronoun replaced are entailed (1) or not (0).", "validation", r'[01]'),
}

## The prompt
To prepare the prompt, we provide a system description, comming from `task_to_keys`, and additional 5 examples from the training dataset, with scores included.

The actual test is comming from the "validation" dataset part, one by one, only providing the sentences to operate on, and simply prompting for a numerical answer by placing <|assistant|> at the line start.

In [9]:
import requests
import pandas as pd
import re

def find_first_number(input_str, matchstr):
    match = re.search(matchstr, input_str)
    return float(match.group()) if match else None

def measure(actual_task, dataset, metric):
    (field1, field2, system_prompt, dsetname, matchstr) = task_to_keys[actual_task]
    prompt = f"<|system|>{system_prompt}\n"
    training_data = dataset['train']
    
    # Here we collect a diverse labled training set of 5 to make a few shot examples
    last_label = 0
    num_examples = 0
    for idx, f1 in enumerate(training_data[field1]):
        if training_data['label'][idx] == last_label: continue
        prompt += f"<|{field1}|>{f1}\n"
        if not field2 is None:
            prompt += f"<|{field2}|>{training_data[field2][idx]}\n"
        prompt += f"<|assistant|>{training_data['label'][idx]}\n"
        
        last_label = training_data['label'][idx]
        num_examples += 1
        if num_examples >= 5: break
    
    api_url = "http://localhost:5001/api/v1"
    stop_words = ["###","**Observation**","</s>","<|"]
    headers = {
        "Content-Type": "application/json"
    }
    
    references = []
    predictions = []
    failed = []
    testset = dataset[dsetname]
    count = 0
    for idx, f1 in enumerate(testset[field1]):
        query = prompt
        query += f"<|{field1}|>{f1}\n"
        if not field2 is None:
            query += f"<|{field2}|>{testset[field2][idx]}\n"
        query += f"<|assistant|>"
    
        data = {
            "prompt": query,
            "max_tokens": 10,
            "temperature": 0,
            "top_p": 1.0,
            "n": 20,
            "stop": stop_words
        }
        
        response = requests.post(f"{api_url}/completion", headers=headers, json=data)
        result = response.json()["choices"][0]["text"]
        for sw in stop_words:
            result = result.replace(sw, "")
        result = result.strip()
        num_result = find_first_number(result, matchstr)
        if num_result is None:
            failed.append(idx)
        else:
            references.append(testset['label'][idx])
            predictions.append(num_result)
        count += 1
        if count % 10 == 0:
            print(f"Task: {actual_task}, {count/len(testset[field1])*100:5.1f}%", end="\r")
    
    results = metric.compute(predictions=predictions, references=references)
    return (len(testset[field1]), len(failed), results)

In [48]:
(number, failed, result) = measure(task, dataset, metric)
print(f"Task: {task} {number}/{failed}               ")
print(result)

Task: mrpc 100/0               
{'accuracy': 0.74, 'f1': 0.8333333333333334}


# Putting all together
- iterate through all tasks
- load related dataset & metric
- evaluate all verification elements


In [10]:
for task in GLUE_TASKS:
    dataset = load_dataset("glue", task)
    metric = load_metric('glue', task)
    (number, failed, result) = measure(task, dataset, metric)
    print(f"Task: {task} {number}/{failed}               ")
    print(result)

Task: cola 1043/1               
{'matthews_correlation': 0.45562723707910674}
Task: mnli 9815/0               
{'accuracy': 0.3869587366276108}
Task: mrpc 408/0               
{'accuracy': 0.7549019607843137, 'f1': 0.8355263157894737}
Task: qnli 5463/0               
{'accuracy': 0.13637195680029288}
Task: qqp 40430/0               
{'accuracy': 0.8235221370269602, 'f1': 0.7773304621914303}
Task: rte 277/0               
{'accuracy': 0.35379061371841153}
Task: sst2 872/0               
{'accuracy': 0.9357798165137615}
Task: stsb 1500/0               
{'pearson': 0.826388381403035, 'spearmanr': 0.8408428920714652}
Task: wnli 71/0               
{'accuracy': 0.7183098591549296}
