# LLM Evaluation Demo

In this notebook we show how to create custom evaluation tasks with the `lm-eval-harness` library, a standard tool (as of mid-2023) for quantitative LLM evaluation.

### Table of Contents
1. Install libraries
2. Creating custom HuggingFace `Dataset` objects
3. Testing LM harness (custom task)
   - Define custom task (check JSON file too!)
   - Run evaluation on checkpoint (zero shot)
4. Testing a real Elsevier gold set - Embase's ICSR snapshot (zero shot)
5. Measure (word) perplexity
6. Repeat with a different model

## 1. Install libraries

To run this notebook, create a `conda` environment with Python 3.9+ and a Jupyter kernel, then clone the `lm-eval-harness` GitHub repo.
Here we demonstrate a basic installation. For advanced installations (multilingual support for MT evaluation, or support for 4-, 8- and 16-bit quantized models), please check the instructions in the repo:

   - ``https://github.com/EleutherAI/lm-evaluation-harness.git``

In [29]:
# !pip install pytest
# !pip install -e ../tools/lm-evaluation-harness
# !pip install --upgrade --force-reinstall pandas # uncomment if it doesn't install properly 

## 2. Creating custom HuggingFace `Dataset` objects<a class="anchor" id="s1"></a>

Here we demonstrate how to parse custom, JSON-formatted NLP gold standards into a HuggingFace `Dataset` and `DatasetDict`, **locally**. By default, HuggingFace expects corpora to be uploaded to its public hub, which is fine for open data but not for proprietary data. `lm-eval-harness` operates on HuggingFace `Dataset` objects.

The example data is located in the `relevancy` directory; it is only a stub that demonstrates the format. Other datasets are handled in the same way.
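
To make the expected file layout concrete, here is a minimal sketch of what one split file might look like. It assumes the records sit in a list under a top-level `"data"` key (which is what `field="data"` selects in the call below) and that labels are stored as strings. The three field names come from this notebook; the record values and the temporary path are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stub records; only the three field names come from the notebook.
records = [
    {"EntityText": "aspirin", "Context": "Aspirin was dissolved in ethanol.", "Relevancy": "True"},
    {"EntityText": "ethanol", "Context": "Aspirin was dissolved in ethanol.", "Relevancy": "False"},
]

with tempfile.TemporaryDirectory() as tmp:
    split_path = Path(tmp) / "test.json"
    # The records live under a top-level "data" key, matching field="data".
    split_path.write_text(json.dumps({"data": records}, indent=2), encoding="utf-8")

    # Read the file back the way any JSON consumer would.
    loaded = json.loads(split_path.read_text(encoding="utf-8"))
    rows = loaded["data"]
    print(len(rows), sorted(rows[0].keys()))
```

Each split (`train.json`, `test.json`, `validation.json`) would follow the same layout.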

In [58]:
from datasets import load_dataset
from datasets.builder import BuilderConfig

In [66]:
dataset = load_dataset("../datasets/harness/relevancy", # path to directory; the name of the directory will become the dataset name
                       data_files= {"train"     : "train.json", 
                                    "test"      : "test.json", 
                                    "validation": "validation.json"},
                       field="data"
                      )

In [67]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Relevancy', 'EntityText', 'Context'],
        num_rows: 2
    })
    test: Dataset({
        features: ['Relevancy', 'EntityText', 'Context'],
        num_rows: 2
    })
    validation: Dataset({
        features: ['Relevancy', 'EntityText', 'Context'],
        num_rows: 2
    })
})

In [68]:
dataset['test'].features

{'Relevancy': Value(dtype='string', id=None),
 'EntityText': Value(dtype='string', id=None),
 'Context': Value(dtype='string', id=None)}

In [69]:
dataset_icsr = load_dataset("../datasets/harness/icsr", # path to directory; the name of the directory will become the dataset name
                       data_files= {"test"      : "test.json"},
                       field="data",
                      )

In [70]:
dataset_icsr

DatasetDict({
    test: Dataset({
        features: ['Abstract', 'Relevant'],
        num_rows: 2440
    })
})

In [71]:
dataset_icsr['test'].features

{'Abstract': Value(dtype='string', id=None),
 'Relevant': Value(dtype='string', id=None)}

## 3. Testing LM harness (custom task)

The trick here is to define a custom task class that inherits from the class `Task`. The purpose of this class is to parse a JSON gold set into a HuggingFace `Dataset`, and then to transform its test-set partition into a list of LLM prompts.

As our relevancy examples are totally arbitrary, the resulting accuracy carries no meaning. The goal here is only to check that the pipeline runs without exceptions or parsing errors.

In [72]:
# Add the repo to the Python path if it didn't install properly
import sys
sys.path.insert(0,'../tools/lm-evaluation-harness')

In [73]:
from lm_eval import tasks, utils, evaluator

### Define custom task (check JSON file too!)

This class reads our custom benchmark into a `DatasetDict`, and then transforms every row (in the relevancy example, a dictionary `{'EntityText': <v1>, 'Context': <v2>, 'Relevancy': <v3>}`) into a prompt via the public methods `doc_to_text()` (inputs) and `doc_to_target()` (outputs) below.

In [76]:
import numpy as np
from lm_eval.base import rf, Task
from lm_eval.metrics import mean
import datasets

_CITATION = """
@article{DBLP:journals/biodb/AkhondiRSMTNISI19,
  author       = {Saber A. Akhondi and
                  Hinnerk Rey and
                  Markus Schw{\"{o}}rer and
                  Michael Maier and
                  John P. Toomey and
                  Heike Nau and
                  Gabriele Ilchmann and
                  Mark Sheehan and
                  Matthias Irmer and
                  Claudia Bobach and
                  Marius A. Doornenbal and
                  Michelle Gregory and
                  Jan A. Kors},
  title        = {Automatic identification of relevant chemical compounds from patents},
  journal      = {Database J. Biol. Databases Curation},
  volume       = {2019},
  pages        = {baz001},
  year         = {2019},
  url          = {https://doi.org/10.1093/database/baz001},
  doi          = {10.1093/database/baz001},
  timestamp    = {Thu, 13 Aug 2020 12:41:41 +0200},
  biburl       = {https://dblp.org/rec/journals/biodb/AkhondiRSMTNISI19.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
"""

class Relevancy(Task):
    VERSION = "0.0.0"
    DATASET_PATH = "../datasets/harness/relevancy" # absolute or relative path to dataset
    DATASET_NAME = 'default'
    
    def __init__(self, data_dir=None, cache_dir=None, download_mode=None):
        """
        Override default constructor so that the dataset is correctly
        parsed by HuggingFace into a DatasetDict
        """
        self._training_docs = None
        self._fewshot_docs = None
        self.dataset = datasets.load_dataset(
            path=self.DATASET_PATH,
            name=self.DATASET_NAME,
            data_dir=data_dir,
            cache_dir=cache_dir,
            download_mode=download_mode,
            field="data",
        )

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return True
        
    def training_docs(self):
        if self.has_training_docs():
            return self.dataset["train"]

    def validation_docs(self):
        if self.has_validation_docs():
            return self.dataset["validation"]

    def test_docs(self):
        if self.has_test_docs():
            return self.dataset["test"]

    def doc_to_text(self, doc):
        # Do we *really*
        # want to do it exactly as OA did?
        return (
            "Is entity" + doc["EntityText"]
            + " relevant in this text: "
            + doc["Context"] + "? "
            + "Return either True or False.\nAnswer:"
        )

    def should_decontaminate(self):
        return True

    def doc_to_decontamination_query(self, doc):
        return doc["Context"]

    def doc_to_target(self, doc):
        return " " + doc["Relevancy"]
        
    def construct_requests(self, doc, ctx):
        """Uses RequestFactory to construct Requests and returns an iterable of
        Requests which will be sent to the LM.

        :param doc:
            The document as returned from training_docs, validation_docs, or test_docs.
        :param ctx: str
            The context string, generated by fewshot_context. This includes the natural
            language description, as well as the few shot examples, and the question
            part of the document for `doc`.
        """
        ll_true, x = rf.loglikelihood(ctx, " True")
        ll_false, y = rf.loglikelihood(ctx, " False")
        #print(ll_true, ll_false)
        #print(x, y)
        return ll_true, ll_false

    def process_results(self, doc, results):
        """Take a single document and the LM results and evaluates, returning a
        dict where keys are the names of submetrics and values are the values of
        the metric for that one document

        :param doc:
            The document as returned from training_docs, validation_docs, or test_docs.
        :param results:
            The results of the requests created in construct_requests.
        """
        print(results)
        gold    = doc["Relevancy"]
        pred    = np.argmax(results) # Notice that this library relies on *single token* best-first decoding
        labels  = {0: "True", 1: "False"}
        pred    = "{}".format(labels[pred])
        print("gold:{} pred:{}".format(gold, pred))
        return {"acc": pred == gold}

    def aggregation(self):
        """
        :returns: {str: [float] -> float}
            A dictionary where keys are the names of submetrics and values are
            functions that aggregate a list of metrics
        """
        return {"acc": mean}

    def higher_is_better(self):
        """
        :returns: {str: bool}
            A dictionary where keys are the names of submetrics and values are
            whether a higher value of the submetric is better
        """
        return {"acc": True}

In [77]:
task = Relevancy() # instantiate the class to parse the sample gold set

In [78]:
task.dataset # should return a DatasetDict

DatasetDict({
    train: Dataset({
        features: ['Context', 'EntityText', 'Relevancy'],
        num_rows: 2
    })
    validation: Dataset({
        features: ['Context', 'EntityText', 'Relevancy'],
        num_rows: 2
    })
    test: Dataset({
        features: ['Context', 'EntityText', 'Relevancy'],
        num_rows: 2
    })
})

In [79]:
task.DATASET_NAME # returns the HuggingFace config name ('default'), not the directory name

'default'

We need to add our custom task to the `lm-eval-harness` registry of known NLP tasks:

In [110]:
tasks.TASK_REGISTRY["relevancy"] = Relevancy
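
The assignment above treats the registry as a plain dictionary from task names to task classes. The toy sketch below is a stand-in, not the harness's own code: `ToyTask`, `TOY_REGISTRY` and `get_task` are hypothetical names. It illustrates why the strings passed later in `tasks=[...]` have to match the registered key exactly.

```python
# Toy stand-in for a name -> class registry (not the harness's own code).
class ToyTask:
    """Hypothetical minimal task class used only for this illustration."""
    VERSION = "0.0.0"


TOY_REGISTRY = {}
TOY_REGISTRY["relevancy"] = ToyTask


def get_task(name):
    """Look up a task class by the exact key it was registered under."""
    try:
        return TOY_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(TOY_REGISTRY))
        raise KeyError(f"unknown task {name!r}; registered tasks: {known}") from None


found = get_task("relevancy")
try:
    get_task("default")  # the dataset config name is not a registry key
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(found.__name__, lookup_failed)
```

In other words, the registry key (`"relevancy"`) and the `DATASET_NAME` attribute (`'default'`) are independent of each other.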

For comparison, we also download HellaSwag data from HuggingFace:

In [112]:
hellaswag = tasks.hellaswag.HellaSwag()

In [113]:
hellaswag.dataset

DatasetDict({
    train: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings', 'source_id', 'split', 'split_type', 'label'],
        num_rows: 39905
    })
    test: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings', 'source_id', 'split', 'split_type', 'label'],
        num_rows: 10003
    })
    validation: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings', 'source_id', 'split', 'split_type', 'label'],
        num_rows: 10042
    })
})

### Run evaluation on checkpoint

Here, we evaluate a previously downloaded / cached LLM (`galactica-125m` from Meta, see `model_path` below). To try a different one, set `model_path` to a HuggingFace Hub model ID or to a local path.

**<u>Important Notice</u>:** Evaluation in this notebook works only on CUDA (due to some obscure bug in the evaluation code that I wasn't able to fix), so don't run it on CPU-only hosts.

In [114]:
model_path = "facebook/galactica-125m"

We check the registry of available model backends:

In [115]:
from lm_eval import models
models.MODEL_REGISTRY

{'hf': lm_eval.models.gpt2.HFLM,
 'hf-causal': lm_eval.models.gpt2.HFLM,
 'hf-causal-experimental': lm_eval.models.huggingface.AutoCausalLM,
 'hf-seq2seq': lm_eval.models.huggingface.AutoSeq2SeqLM,
 'gpt2': lm_eval.models.gpt2.HFLM,
 'gpt3': lm_eval.models.gpt3.GPT3LM,
 'anthropic': lm_eval.models.anthropic_llms.AnthropicLM,
 'textsynth': lm_eval.models.textsynth.TextSynthLM,
 'dummy': lm_eval.models.dummy.DummyLM}

We first demonstrate evaluation on the HellaSwag benchmark. Notice that we limit the test to 5 examples (`limit=5`). Evaluation over an entire corpus can take many minutes, even on a GPU, unless you use a multi-GPU machine. Recall that the run-time per example in transformers is quadratic in the number of input tokens (due to attention). Also, some benchmarks can be huge.
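
To make the quadratic growth concrete, the sketch below counts the pairwise attention scores for two sequence lengths. The lengths and the helper name are illustrative assumptions, and the count ignores every other cost (feed-forward layers, number of heads and layers).

```python
def attention_score_entries(num_tokens):
    """Number of pairwise attention scores for one head on one sequence."""
    return num_tokens * num_tokens


short_len, long_len = 512, 2048  # illustrative sequence lengths
ratio = attention_score_entries(long_len) / attention_score_entries(short_len)
print(ratio)
```

A prompt four times as long therefore needs sixteen times as many attention scores, which is why long inputs dominate evaluation time.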

With `check_integrity` enabled, `lm-eval-harness` first runs its PyTest-based task version checks, as the log below shows.

In [116]:
results = evaluator.simple_evaluate(
        model="hf-causal",
        model_args=" pretrained=" + model_path,
        tasks=["hellaswag"],
        num_fewshot=0,
        device="cpu",
        limit=5, # We limit the number of queries
        check_integrity=True, # run the task version checks before evaluating
)

Using device 'cpu'
platform linux -- Python 3.10.14, pytest-8.2.0, pluggy-1.5.0
rootdir: /home/jovyan/LREC-2024-notebooks/tools/lm-evaluation-harness
plugins: anyio-4.3.0
collected 441 items / 440 deselected / 1 selected

../tools/lm-evaluation-harness/tests/test_version_stable.py .            [100%]

tests/test_version_stable.py::test_versions_stable[hellaswag-HellaSwag]
  You can avoid this message in future by passing the argument `trust_remote_code=True`.
  Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.

Task: hellaswag; number of docs: 10042
Task: hellaswag; document 0; context prompt (starting on next line):
Personal Care and Style: How to increase breast size with a bra. Check your bra size. Wearing a bra that is too big will not make your breasts look larger. That is why it is important to wear the right size bra for you.
(end of prompt on previous line)
Requests: [Req_loglikelihood('Personal 

0it [00:00, ?it/s]


The result table reports accuracy over the 5 sampled examples:

In [117]:
print(evaluator.make_table(results))

|  Task   |Version| Metric |Value|   |Stderr|
|---------|------:|--------|----:|---|-----:|
|hellaswag|      0|acc     |  0.2|±  |0.2000|
|         |       |acc_norm|  0.6|±  |0.2449|



We now evaluate over our custom benchmark:

In [118]:
results_r = evaluator.simple_evaluate(
        model="hf-causal",
        model_args=" pretrained=" + model_path,
        tasks=["relevancy"],
        device="cpu",
        num_fewshot=0,
        limit=10,
)

Using device 'cpu'
Task: relevancy; number of docs: 2
Task: relevancy; document 0; context prompt (starting on next line):
Is entity4 relevant in this text: -5.5? Return either True or False.
Answer:
(end of prompt on previous line)
Requests: (Req_loglikelihood('Is entity4 relevant in this text: -5.5? Return either True or False.\nAnswer:', ' True')[0]
, Req_loglikelihood('Is entity4 relevant in this text: -5.5? Return either True or False.\nAnswer:', ' False')[0]
)
Running loglikelihood requests


0it [00:00, ?it/s]

[-1.8439056873321533, -3.0800535678863525]
gold:True pred:True
[-2.043750047683716, -3.6137354373931885]
gold:False pred:True





In [119]:
print(evaluator.make_table(results_r))

|  Task   |Version|Metric|Value|   |Stderr|
|---------|-------|------|----:|---|-----:|
|relevancy|0.0.0  |acc   |  0.5|±  |   0.5|
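
The numbers in this table can be reproduced by hand from the log-likelihood pairs printed during the run. The sketch below uses plain Python instead of `np.argmax`, and it assumes the `Stderr` column is the standard error of the mean computed with the sample (n - 1) variance. That assumption is not taken from the harness source, but it matches the values shown here.

```python
import math

# Log-likelihood pairs (" True", " False") as printed in the run above.
results_per_doc = [
    [-1.8439056873321533, -3.0800535678863525],
    [-2.043750047683716, -3.6137354373931885],
]
gold_labels = ["True", "False"]
labels = ["True", "False"]  # index 0 -> " True" request, index 1 -> " False" request


def predict(loglikelihoods):
    """Pick the label whose continuation has the highest log-likelihood."""
    best = max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])
    return labels[best]


hits = [float(predict(r) == g) for r, g in zip(results_per_doc, gold_labels)]
n = len(hits)
acc = sum(hits) / n
# Standard error of the mean, using the sample (n - 1) variance.
variance = sum((h - acc) ** 2 for h in hits) / (n - 1)
stderr = math.sqrt(variance / n)
print(acc, stderr)
```

With only two documents the standard error is as large as the accuracy itself, which is another reminder that this stub run is only a smoke test.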



Few-shot evaluation (one in-context example, `num_fewshot=1`):

In [121]:
results_few = evaluator.simple_evaluate(
        model="hf-causal",
        model_args=" pretrained=" + model_path,
        tasks=["relevancy"],
        num_fewshot=1,
        device="cpu",
        limit=10,
)

Using device 'cpu'
Task: relevancy; number of docs: 2
Task: relevancy; document 0; context prompt (starting on next line):
Is entity1 relevant in this text: 2.0? Return either True or False.
Answer: False

Is entity4 relevant in this text: -5.5? Return either True or False.
Answer:
(end of prompt on previous line)
Requests: (Req_loglikelihood('Is entity1 relevant in this text: 2.0? Return either True or False.\nAnswer: False\n\nIs entity4 relevant in this text: -5.5? Return either True or False.\nAnswer:', ' True')[0]
, Req_loglikelihood('Is entity1 relevant in this text: 2.0? Return either True or False.\nAnswer: False\n\nIs entity4 relevant in this text: -5.5? Return either True or False.\nAnswer:', ' False')[0]
)
Running loglikelihood requests


0it [00:00, ?it/s]

[-0.5465310215950012, -0.921859085559845]
gold:True pred:True
[-0.43615809082984924, -1.0989350080490112]
gold:False pred:True





In [122]:
print(evaluator.make_table(results_few))

|  Task   |Version|Metric|Value|   |Stderr|
|---------|-------|------|----:|---|-----:|
|relevancy|0.0.0  |acc   |  0.5|±  |   0.5|
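
The one-shot prompt recorded above can be rebuilt with a few lines of plain Python. This is a simplified sketch, not the harness's `fewshot_context` implementation: it assumes solved examples and the open question are joined by a blank line, which is what the logged prompt suggests. The field values are read off that log.

```python
def doc_to_text(doc):
    """Same prompt template as the Relevancy task above."""
    return (
        "Is entity" + doc["EntityText"]
        + " relevant in this text: "
        + doc["Context"] + "? "
        + "Return either True or False.\nAnswer:"
    )


def doc_to_target(doc):
    return " " + doc["Relevancy"]


def fewshot_context(shots, doc):
    """Join solved examples and the open question with blank lines (assumed separator)."""
    solved = [doc_to_text(s) + doc_to_target(s) for s in shots]
    return "\n\n".join(solved + [doc_to_text(doc)])


# Field values read off the recorded prompt above.
shot = {"EntityText": "1", "Context": "2.0", "Relevancy": "False"}
query = {"EntityText": "4", "Context": "-5.5", "Relevancy": "True"}
ctx = fewshot_context([shot], query)
print(ctx)
```

The model is then scored on the continuations `" True"` and `" False"` appended to this context, exactly as in the zero-shot case.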



## 4. Testing a real Elsevier gold set - Embase's ICSR snapshot

We start by measuring accuracy, computed as:
$$
Acc = \frac{TP + TN}{TP + TN + FP + FN}
$$
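
As a small worked example, accuracy counts the correct predictions of both classes (true positives and true negatives) over all predictions. The counts below are hypothetical and are not results from this notebook.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions among all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


# Hypothetical confusion counts for a 10-example run.
example_acc = accuracy(tp=3, tn=4, fp=2, fn=1)
print(example_acc)
```

The task class below computes the same quantity per document (`pred == gold`) and averages it with `mean`.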

In [123]:
import numpy as np
from lm_eval.base import rf, Task
from lm_eval.metrics import mean
import datasets

_CITATION = """
NaN
"""

class ICSR(Task):
    VERSION = "0.1.0"
    DATASET_PATH = "../datasets/harness/icsr" # absolute or relative path to dataset
    DATASET_NAME = "default"
    
    def __init__(self, data_dir=None, cache_dir=None, download_mode=None):
        """
        Override default constructor so that the dataset is correctly
        parsed by HuggingFace into a DatasetDict
        """
        self._training_docs = None
        self._fewshot_docs = None
        self.dataset = datasets.load_dataset(
            path=self.DATASET_PATH,
            name=self.DATASET_NAME,
            data_dir=data_dir,
            cache_dir=cache_dir,
            download_mode=download_mode,
            field="data",
        )

    def has_training_docs(self):
        return False  # the ICSR snapshot only ships a test split

    def has_validation_docs(self):
        return False  # the ICSR snapshot only ships a test split

    def has_test_docs(self):
        return True
        
    def training_docs(self):
        if self.has_training_docs():
            return self.dataset["train"]

    def validation_docs(self):
        if self.has_validation_docs():
            return self.dataset["validation"]

    def test_docs(self):
        if self.has_test_docs():
            return self.dataset["test"]

    def doc_to_text(self, doc):
        # Do we *really*
        # want to do it exactly as OA did?
        return (
            "Does this paper abstract: " + doc["Abstract"]
            + " mention adverse drug effects? "
            + "Please answer Yes or No.\nAnswer:"
        )

    def should_decontaminate(self):
        return True

    def doc_to_decontamination_query(self, doc):
        return doc["Abstract"]
        
    def doc_to_target(self, doc):
        # The target is the gold label ("Yes" / "No"), not the abstract itself
        return " " + doc["Relevant"]

    def construct_requests(self, doc, ctx):
        """Uses RequestFactory to construct Requests and returns an iterable of
        Requests which will be sent to the LM.

        :param doc:
            The document as returned from training_docs, validation_docs, or test_docs.
        :param ctx: str
            The context string, generated by fewshot_context. This includes the natural
            language description, as well as the few shot examples, and the question
            part of the document for `doc`.
        """
        ll_true, _ = rf.loglikelihood(ctx, " Yes")
        ll_false, _ = rf.loglikelihood(ctx, " No")
        return ll_true, ll_false

    def process_results(self, doc, results):
        """Take a single document and the LM results and evaluates, returning a
        dict where keys are the names of submetrics and values are the values of
        the metric for that one document

        :param doc:
            The document as returned from training_docs, validation_docs, or test_docs.
        :param results:
            The results of the requests created in construct_requests.
        """
        gold    = doc["Relevant"]
        pred    = np.argmax(results) # Notice that this library relies on *single token* best-first decoding
        labels  = {0: "Yes", 1: "No"}
        pred    = "{}".format(labels[pred])
        return {"acc": pred == gold}

    def aggregation(self):
        """
        :returns: {str: [float] -> float}
            A dictionary where keys are the names of submetrics and values are
            functions that aggregate a list of metrics
        """
        return {"acc": mean}

    def higher_is_better(self):
        """
        :returns: {str: bool}
            A dictionary where keys are the names of submetrics and values are
            whether a higher value of the submetric is better
        """
        return {"acc": True}

In [124]:
task_icsr = ICSR()

In [125]:
task_icsr.dataset

DatasetDict({
    test: Dataset({
        features: ['Relevant', 'Abstract'],
        num_rows: 2440
    })
})

In [126]:
task_icsr.dataset['test'][0]

{'Relevant': 'Yes',
 'Abstract': 'Aims/Introduction: The convergence of tuberculosis (TB) and diabetes mellitus (DM) is a new challenge in Asia as a result of the rising prevalence of diabetes mellitus with higher TB infection rates, and also because diabetes mellitus itself enhances TB disease activity and consequently the spread of TB. We aimed to address the risk presented by diabetes mellitus for TB infection. Materials and Methods: Patients with diabetes mellitus were retrospectively recruited. The baseline assessments included age, sex, body mass index, fasting blood glucose, glycated hemoglobin, urine albumin-to-creatinine ratio and estimated glomerular filtration rate. TB was determined by meeting the international classification of disease, for TB diagNosis and receiving anti-TB treatment for at least 2 months. Results: In total, 9,750 individuals with diabetes mellitus were recruited. The event rate of TB was 47 (0.48%). Younger age, lower proportion of men, higher fasting b

In [127]:
task_icsr.DATASET_NAME

'default'

In [128]:
tasks.TASK_REGISTRY["icsr"] = ICSR

Run zero-shot eval on ICSR (should return accuracy > 0):

In [129]:
results_icsr = evaluator.simple_evaluate(
        model="hf-causal",
        model_args=" pretrained=" + model_path,
        tasks=["icsr"],
        num_fewshot=0,
        device="cpu",
        limit=10,
)

Using device 'cpu'
Task: icsr; number of docs: 2440
Task: icsr; document 0; context prompt (starting on next line):
Does this paper abstract: Objectives The aims of this study were to describe the following: (1) the time to change of therapy in patients with type 2 diabetes who had initiated metformin moNotherapy as first-line treatment and (2) the sequence in which subsequent therapeutic regimens were introduced. Design Cohort study. Setting National study based on linked data from the New Zealand Ministry of Health's National Collections of health and pharmaceutical dispensing data. Participants People with type 2 diabetes mellitus who initiated metformin moNotherapy between 1 January 2006 and 30 September 2014 (n=93 874). Primary outcome measures Cumulative incidence curves were plotted to show the time taken to move from one regimen to aNother, while sunburst plots were used to illustrate the sequence in which regimens were introduced. Results About 10% and 35% of cohort members ha

0it [00:00, ?it/s]


In [130]:
print(evaluator.make_table(results_icsr)) # 22.3% of all labels are 'Yes'!

|Task|Version|Metric|Value|   |Stderr|
|----|-------|------|----:|---|-----:|
|icsr|0.1.0  |acc   |  0.4|±  |0.1633|



## 5. Measure (word) perplexity

See:
- https://towardsdatascience.com/perplexity-of-language-models-revisited-6b9b4cf46792
- https://en.wikipedia.org/wiki/Perplexity
- https://huggingface.co/docs/transformers/perplexity

#### Definition 1
Let $V$ be a vocabulary of size $|V|$ with tokens $t \in V$, and let $D \subseteq V^*$ be a corpus of $|D|$ sentences $s_1, \dots, s_{|D|}$ (every sentence is a sequence over $V$), each with probability $p(s_i)$. Let $N = \sum_{i=1}^{|D|} |s_i|$ be the total number of tokens in $D$.
The perplexity of the probability distribution $p$ on $D$ is defined as:
$$
\begin{align}
PPL(p) & ~~=~~ \left(\prod_{i=1}^{|D|} p(s_i)\right)^{-\frac{1}{N}}
\end{align}
$$

#### Definition 2

This definition expresses perplexity as a function of the cross entropy $CE(\cdot,\cdot)$ between the probability distribution $p$ that
we assume generated corpus $D$ (viz. $D \sim p$)
and the probability distribution $q_{\theta}$ estimated by model $\theta$ from $D$, where $t_1,\dots,t_N$ are the $N$ tokens of $D$:
$$
\begin{align}
PPL(p,q_{\theta}) & ~~=~~       2^{CE(p, q_{\theta})} \\
                  & ~~\approx~~ 2^{ - \frac{1}{N} \sum_{i=1}^{N} \log_2 q_{\theta}(t_{i}|t_{1},\dots,t_{i-1}) }
\end{align}
$$
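
The approximation in Definition 2 can be checked numerically on a toy case. The sketch below is a plain-Python illustration, not the harness's implementation: it takes per-token base-2 log-probabilities (hypothetical values here) and returns 2 raised to their negated average.

```python
import math


def perplexity(token_log2_probs):
    """2 ** (average negative log2-probability per token)."""
    if not token_log2_probs:
        raise ValueError("need at least one token")
    cross_entropy = -sum(token_log2_probs) / len(token_log2_probs)
    return 2.0 ** cross_entropy


# Hypothetical model that assigns probability 1/4 to every token of an 8-token text.
uniform_log2_probs = [math.log2(0.25)] * 8
ppl = perplexity(uniform_log2_probs)
print(ppl)
```

A model that is always torn between four equally likely tokens has perplexity 4, which is the usual reading of perplexity as an effective branching factor.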

In [103]:
import datasets

from lm_eval.base import PerplexityTask
from lm_eval.utils import escaped_split

class ICSRPerplexity(PerplexityTask):
    '''
    We re-use here the ICSR data, but we throw away the labels.
    We score the abstracts with the model (rolling log-likelihood), and measure LM perplexity.
    '''
    VERSION = "0.1.0"
    DATASET_PATH = "../datasets/harness/icsr"
    DATASET_NAME = "default"

    def __init__(self, data_dir=None, cache_dir=None, download_mode=None):
        self.data_dir = data_dir
        self.dataset = datasets.load_dataset(
            path=self.DATASET_PATH,
            name=self.DATASET_NAME,
            data_dir=data_dir,
            cache_dir=cache_dir,
            download_mode=download_mode,
            field="data",
        )
        self._training_docs = None
        self._fewshot_docs = None

    def download(self, data_dir=None, cache_dir=None, download_mode=None):
        raise TypeError("cannot download an arbitrary JSON dataset")

    def has_validation_docs(self):
        return False
    
    def has_training_docs(self):
        return False

    def has_test_docs(self):
        return True

    def test_docs(self):
        return map(self._process_doc, self.dataset["test"])

    def _process_doc(self, doc):
        return doc['Abstract']

In [104]:
perp_task = ICSRPerplexity()

In [108]:
tasks.TASK_REGISTRY["icsr_perplexity"] = ICSRPerplexity

In [132]:
results_p = evaluator.simple_evaluate(
        model="hf-causal",
        model_args=" pretrained=" + model_path,
        tasks=["icsr_perplexity"],
        num_fewshot=0,
        device='cpu',
        limit=10, # We limit the number of queries
        #check_integrity="store_true",
)

Using device 'cpu'
Task: icsr_perplexity; number of docs: 2440
Task: icsr_perplexity; document 0; context prompt (starting on next line):

(end of prompt on previous line)
Requests: Req_loglikelihood_rolling("Objectives The aims of this study were to describe the following: (1) the time to change of therapy in patients with type 2 diabetes who had initiated metformin moNotherapy as first-line treatment and (2) the sequence in which subsequent therapeutic regimens were introduced. Design Cohort study. Setting National study based on linked data from the New Zealand Ministry of Health's National Collections of health and pharmaceutical dispensing data. Participants People with type 2 diabetes mellitus who initiated metformin moNotherapy between 1 January 2006 and 30 September 2014 (n=93 874). Primary outcome measures Cumulative incidence curves were plotted to show the time taken to move from one regimen to aNother, while sunburst plots were used to illustrate the sequence in which regim

  0%|          | 0/1 [00:00<?, ?it/s]


TypeError: 'NoneType' object cannot be interpreted as an integer

In [None]:
print(evaluator.make_table(results_p))

## 6. Repeat with a different model

We repeat zero-shot and perplexity experiments.

In [16]:
model_path_new = ""

In [19]:
results_acc = evaluator.simple_evaluate(
        model="hf-causal",
        model_args=" pretrained=" + model_path_new,
        tasks=["icsr"], # must match the key used in tasks.TASK_REGISTRY
        num_fewshot=0,
)

Using device 'cuda'


Found cached dataset json (/home/jovyan/.cache/huggingface/datasets/json/icsr-2e3e9176217a300b/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
100%|██████████| 1/1 [00:00<00:00, 257.41it/s]


Task: icsr; number of docs: 2440
Task: icsr; document 0; context prompt (starting on next line):
Does this paper abstract: Objectives The aims of this study were to describe the following: (1) the time to change of therapy in patients with type 2 diabetes who had initiated metformin moNotherapy as first-line treatment and (2) the sequence in which subsequent therapeutic regimens were introduced. Design Cohort study. Setting National study based on linked data from the New Zealand Ministry of Health's National Collections of health and pharmaceutical dispensing data. Participants People with type 2 diabetes mellitus who initiated metformin moNotherapy between 1 January 2006 and 30 September 2014 (n=93 874). Primary outcome measures Cumulative incidence curves were plotted to show the time taken to move from one regimen to aNother, while sunburst plots were used to illustrate the sequence in which regimens were introduced. Results About 10% and 35% of cohort members had moved to a second

Token indices sequence length is longer than the specified maximum sequence length for this model (2273 > 2048). Running this sequence through the model will result in indexing errors
100%|██████████| 4348/4348 [15:19<00:00,  4.73it/s]


In [17]:
results_per = evaluator.simple_evaluate(
        model="hf-causal",
        model_args=" pretrained=" + model_path_new,
        tasks=["icsr-perplexity"],
        num_fewshot=0,
)

Using device 'cuda'


Found cached dataset json (/home/jovyan/.cache/huggingface/datasets/json/icsr-2e3e9176217a300b/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
100%|██████████| 1/1 [00:00<00:00, 299.91it/s]


Task: icsr-perplexity; number of docs: 2440
Task: icsr-perplexity; document 0; context prompt (starting on next line):

(end of prompt on previous line)
Requests: Req_loglikelihood_rolling("Objectives The aims of this study were to describe the following: (1) the time to change of therapy in patients with type 2 diabetes who had initiated metformin moNotherapy as first-line treatment and (2) the sequence in which subsequent therapeutic regimens were introduced. Design Cohort study. Setting National study based on linked data from the New Zealand Ministry of Health's National Collections of health and pharmaceutical dispensing data. Participants People with type 2 diabetes mellitus who initiated metformin moNotherapy between 1 January 2006 and 30 September 2014 (n=93 874). Primary outcome measures Cumulative incidence curves were plotted to show the time taken to move from one regimen to aNother, while sunburst plots were used to illustrate the sequence in which regimens were introduced

 25%|██▍       | 609/2440 [01:44<04:14,  7.18it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (2253 > 2048). Running this sequence through the model will result in indexing errors
100%|██████████| 2440/2440 [07:01<00:00,  5.79it/s]


In [20]:
print(evaluator.make_table(results_acc))

|Task|Version|Metric|Value |   |Stderr|
|----|-------|------|-----:|---|-----:|
|icsr|0.1.0  |acc   |0.2307|±  |0.0085|
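
To put this accuracy in context, it helps to compare it with the two constant classifiers. The sketch below assumes that the 22.3% share of 'Yes' labels quoted in the comment of the earlier zero-shot cell refers to this same test set; everything else is arithmetic.

```python
yes_fraction = 0.223   # share of 'Yes' labels, as quoted in the earlier cell comment
observed_acc = 0.2307  # accuracy reported in the table above

# A classifier that always answers the same label is right exactly
# as often as that label occurs in the gold set.
always_yes_acc = yes_fraction
always_no_acc = 1.0 - yes_fraction

closer_to_yes = abs(observed_acc - always_yes_acc) < abs(observed_acc - always_no_acc)
print(f"always-Yes: {always_yes_acc:.3f}, always-No: {always_no_acc:.3f}, "
      f"observed: {observed_acc:.4f}")
```

Under that assumption the observed accuracy sits close to the always-'Yes' baseline and well below the always-'No' baseline, so accuracy alone says little about this model on such an imbalanced set.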



In [18]:
print(evaluator.make_table(results_per))

|     Task      |Version|    Metric     | Value |   |Stderr|
|---------------|-------|---------------|------:|---|------|
|icsr-perplexity|0.1.0  |word_perplexity|33.7934|   |      |
|               |       |byte_perplexity| 1.6510|   |      |
|               |       |bits_per_byte  | 0.7233|   |      |
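
The last two rows of this table are two views of the same quantity. Assuming `bits_per_byte` is the base-2 logarithm of `byte_perplexity` (an assumption that the reported values bear out), the relation can be checked directly:

```python
import math

byte_perplexity = 1.6510       # value from the table above
reported_bits_per_byte = 0.7233

# Perplexity is 2 ** (bits per unit), so bits per unit is log2(perplexity).
bits_per_byte = math.log2(byte_perplexity)
difference = abs(bits_per_byte - reported_bits_per_byte)
print(bits_per_byte, difference)
```

The small remaining difference comes from the four-decimal rounding in the table.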

