# The No-Data Algorithm

This is an implementation of two toy examples (Experiment 1 in the paper) to illustrate the inner workings of the No-Data Algorithm. 

The idea is to generate a 'good' (in-phenomenon) dataset with a rubric, and a bad ('other'; out-of-phenomenon) dataset with a separate rubric. 
We do not call these in-distribution / out-of-distribution for learning-theoretical reasons. 

The evaluator is a decision tree trained with the 'good' data, for which we know the accuracy in 'good' (and that it cannot possibly know 'other'). The rationale is that the decision tree _learnt_ the rubric.

In [1]:
# Shared functions/imports that you'll need
import json
import random
from collections import Counter
from tqdm import tqdm
from sklearn.metrics import f1_score, accuracy_score
import math

import numpy as np

K = 12 # Bitstring length
rounds = 3 # EV Rounds

def print_metrics(labels, successes, test_set, suff, flips=None, return_instead=False):
    Y = [int(p[-1]) for p in test_set]
    _good_labels = [int(g) for g in labels]
    acc = round(accuracy_score(Y, _good_labels), 3)
    f1 = round(f1_score(Y, _good_labels)*100., 3)
    succ = round(sum(successes)*100/len(test_set), 3)
    if flips is not None:
        flips = [int(g[0]) for g in flips]
        flips = round(sum(flips)*100/len(test_set), 3)
        if not return_instead:
            print(f"{suff}-phenomenon test score: {acc} | Successes: {succ} | F1: {f1} | Flips: {flips}")
        else:
            return acc, f1, succ, flips
    else:
        if not return_instead:
            print(f"{suff}-phenomenon test score: {acc} | Successes: {succ} | F1: {f1}")
        else:
            return acc, f1, succ


## Data Generation, Rubrics, etc.

In [1]:
def dataset_generator(n, k, rubric, balance=False, is_prime=False):
    '''
    Generate a dataset of k-ary binary strings given a rubric. 
    '''
    dataset = []
    ones, zeros = 0, 0
    while len(dataset) < n:
        x = "".join([random.choice(["0", "1"]) for _ in range(k)])
        answer = apply_rubric_to(x, rubric)
        y = get_label_for(answer, is_prime=is_prime)
        if balance:
            if y == "1" and ones > n // 2:
                continue
            if y == "0" and zeros > n // 2:
                continue
        dataset.append((x, y))
        if y == "1":
            ones += 1
        else:
            zeros += 1

    random.shuffle(dataset)
    if balance:
        one_labels = [p for p in dataset if p[-1] == "1"]
        zero_labels = [p for p in dataset if p[-1] == "0"]
        dataset = []
        for a, b in zip(one_labels, zero_labels):
            dataset.append(a)
            dataset.append(b)        
    return dataset


def apply_rubric_to(x, rubric):
    '''
    Apply a given rubric (set of criteria) to the given input.
    '''
    criteria_output = []
    for criteria in rubric:
        label = criteria(x)
        criteria_output.append(str(label))
    return "".join(criteria_output)


def get_label_for(x, is_prime=False):
    '''
    Simple majority vote aggregator, with tiebreaking.
    When it is prime, it is a mixture of tiebreaking and some hardcoded
    rules.
    '''
    _z = Counter(x)
    z = _z.most_common()
    vote = z[0][0]
    if len(z) > 1:
        # This will not be called for a 3-criteria rubric
        # but will for the 6-criteria ones. We default to 1
        # in case of ties
        if z[0][-1] == z[1][-1]:
            vote = "1"

    return vote


def what_matched_what(x, criterion, position=None, is_full_string=False):
    ''' 
    Helper function to determine subsets that matched a given criterion.
    We only need ONE subset fulfiling the criterion. 
    The difficulty lies on distinguishing criteria that evaluate
    the whole string, positional values, versus any subset. 
    
    This was because I dug my own grave with this code, but it should not
    affect any results. 
    '''
    if is_full_string:
        return x if criterion(x) == 1 else None
    if position is not None:
        return x[position] if criterion(x) == 1 else None
    subsets = []
    for i in range(len(x)):
        for j in range(i + 1, len(x) + 1):
            if criterion(x[i:j]) == 1:
                subsets.append(x[i:j])
    if subsets == []:
        return None
    return subsets


# Rubric to generate data that is `good' (in-phenomenon)
def xor(a, b):
    return a^b

rubric_good = [
    lambda x: 1 if x.count("0") % 2 == 0 else 0, # Does it have an even number of zeros?
    lambda x: 1 if xor(x[0] == "0", "10101" in x) else 0, # Does it start with a zero OR contain 10101?
    lambda x: 1 if x.count("1") > 5 else 0, # Is the number of ones larger than 5? (k-dependent!)
]

with_xors = {1: [lambda x: x[0] == "0", lambda x: "10101" in x]}

# Rubric to generate data that is `bad' (out-of-phenomenon)
rubric_other = [
    lambda x: 1 if x.find("111") != -1 else 0, # Does it contain three ones next to one another?
    lambda x: 1 if x[-1] == "1" else 0, # Does it end with a 1?
    lambda x: 1 if x.find("110001") != -1 else 0 , # Does it contain either 110 and 001?
]

# Natural-language descriptions (which we will use in LLMs)
good_rubric_nl = "- If the string contains an even number of zeros, 1. Otherwise 0\n\
    - If the string starts with a zero OR contains 10101 (but not both), it is 1. Otherwise 0\n\
    - If the string has more than five ones, it is 1. Otherwise 0."

other_rubric_nl = "- If the string contains three consecutive ones, 1. Otherwise 0\n\
    - If the string ends with a one, it is 1. Otherwise 0\n\
    - If the string contains the substring 110001, it is 1. Otherwise 0."

# Natural-language description of the aggregator, for the LLM.
# While in theory it should NOT know this, it does seem like without 
# bprop it'll be nearly impossible to solve otherwise.
aggregator_nl = "The final label should be determined by majority vote."

## Generate data (or load it)

We generate the dataset for reproducibility purposes, and then load it (hence why it is all commented out)

In [31]:
#good_data = dataset_generator(n=2500, k=K, rubric=rubric_good, balance=True)
#other_data = dataset_generator(n=2500, k=K, rubric=rubric_other, balance=True)
#with open("good_data_train_with_xor.json", "w", encoding="utf-8") as f: json.dump(good_data[:2000], f)
#with open("good_data_test_with_xor.json", "w", encoding="utf-8") as f: json.dump(good_data[2000:], f)
#with open("other_data_train_with_xor.json", "w", encoding="utf-8") as f: json.dump(other_data[:2000], f)
#with open("other_data_test_with_xor.json", "w", encoding="utf-8") as f: json.dump(other_data[2000:], f)

good_data_train = json.load(open("good_data_train_with_xor.json", "r", encoding="utf-8"))
good_data_test = json.load(open("good_data_test_with_xor.json", "r", encoding="utf-8"))
other_data_train = json.load(open("other_data_train_with_xor.json", "r", encoding="utf-8"))
other_data_test = json.load(open("other_data_test_with_xor.json", "r", encoding="utf-8"))

## Algorithm and so on

In [32]:
def no_data_algorithm(X, evaluator, generator, rubric, max_rounds=3,
                      max_ev_rounds=1, noise=None):
    '''
    Baseline implementation of the no-data algorithm without flipping labels.
    The generator parameter is because we will poll the LLM with its understanding
    of the data (decision trees aren't really good at generating, so we'll use a random choice)
    '''
    labels = []
    successes = []
    for x in tqdm(X):
        success, label, _ = ev_protocol(x, evaluator, generator, rubric,
                                             max_rounds, noise)
        if success:
            labels.append(label)
            successes.append(1)
        else:
            labels.append(not label)
    return labels, successes


def no_data_algorithm_with_flips(X, evaluator, generator, rubric, max_rounds=3,
                                 phi=0.3, noise=None):
    '''
    The actual implementation of the No-Data algorithm, flipping labels.
    The generator parameter is because we will poll the LLM with its understanding
    of the data (decision trees aren't really good at generating, so we'll use a random choice)
    '''
    labels = []
    flips = []
    successes = []
    for x in tqdm(X):

        success, label, reason = ev_protocol(x, evaluator, generator, rubric, 
                                                max_rounds, noise)
        if success:
            flips.append([False, reason] if "Random" not in reason else [True, reason])
            labels.append(label)
            successes.append(1)
        else:
            if random.uniform(0, 1) < phi:
                labels.append(label)
                flips.append([False, "Random flip"])
            else:
                label = not label
                labels.append(label)
                flips.append([True, "Random flip"])

    return labels, successes, flips


def ev_protocol(x, evaluator, generator, rubric, max_rounds, noise):
    ''' 
    Quick and dirty implementation of the EV Protocol. In here, 
    we understand lying as generating a datapoint that is not within
    the correct phenomenon. 
    '''
    for round in range(max_rounds):
        # Step 1: generate x \cong x' based on the belief of y
        y_tilde = evaluator(x)
        # Always generate with the rubric you know
        x_tilde = generator(x, y_tilde, rubric_good)
        # Step 2: get the challenge
        check = random.choice([check1, check2])
        # Step 3: git checkd
        success, reason = check(x_tilde, x, rubric)
        # Step 3.5: Randomness, if enabled, for automated accept
        if noise is not None and success is False:
            if random.uniform(0, 1) < noise:
                return True, y_tilde, ("Random guess", x, x_tilde)
        # Last bugfix: if the label is zero, the checks should fail... right?
        if not success: break
    return success, y_tilde, (reason, x, x_tilde)


def check1(x_tilde, x, rubric):
    ''' 
    Assert that a subset matching the criteria exists in both
    x_tilde and x.
    I.e., verify the isomorphism
    '''
    for i, c in enumerate(rubric):
        if i in with_xors and int(rubric[0]("0000")) == 1:
            c1, c1tilde = with_xors[i][0](x), with_xors[i][0](x_tilde)
            c2, c2tilde = with_xors[i][1](x), with_xors[i][1](x_tilde)
            if c1 != c1tilde:
                return False, f"Ch1 Failed iso {i}"
            if c2 != c2tilde:
                return False, f"Ch1 Failed iso {i}"
            if c(x) != c(x_tilde):
                return False, f"Ch1 Failed iso {i}"
        else:
            if c(x) != c(x_tilde):
                return False, f"Ch1 Failed iso {i}"

    return True, ""


def check2(x_tilde, x, rubric):
    ''' 
    Assert that the output of the criteria matches in both cases
    '''
    x_tilde_encoding = "".join([str(c(x_tilde)) for c in rubric])
    x_encoding = "".join([str(c(x)) for c in rubric])
    if x_tilde_encoding == x_encoding:
        return True, ""
    return False, "Ch2 Encoding"


# Experiments - DT
A good-old fashioned decision tree. This one is trained with the 'good' dataset and thus we can make a strong argument that it has neither seen the 'other' dataset nor memorised it.

There'll be several sub-experiments to evaluate the generator (oracle, noisy, etc) as well as the evaluator (pretrained, lying, etc)

This uses the `dt_generator_base`

In [33]:
def dt_generator_base(x, y=None, rubric=None):
    ''' 
    Call our data generation function for a single datapoint, until
    we find one that satisfies the rubric (whatever that is).
    This is the base case (when it actually knows what it is talking about).
    This one also is used to construct the exemplars in the LLM call
    '''
    # Note that we could have done something more sophisticated, like
    # random indices, but the code would be more complicated (you will need)
    # to check for isomorphism AND permutation
    found = False
    while not found:
        x_tilde = "".join([random.choice(["0", "1"]) for _ in range(K)])
        found = True
        for i, c in enumerate(rubric):
            if i in with_xors and int(rubric[0]("0000")) == 1:
                c1, c1tilde = with_xors[i][0](x), with_xors[i][0](x_tilde)
                c2, c2tilde = with_xors[i][1](x), with_xors[i][1](x_tilde)
                if c1 != c1tilde:
                    found = False
                if c2 != c2tilde:
                    found = False
                if c(x) != c(x_tilde):
                    found = False
            else:
                if c(x_tilde) != c(x):
                    found = False
    return x_tilde


def dt_generator_probabilistic(x, y=None, rubric=None, p=0.25):
    ''' 
    Call our data generation function for a single datapoint, until
    we find one that satisfies the rubric (whatever that is).
    Then, with probability 1/4, lie.
    '''
    # Note that we could have done something more sophisticated, like
    # random indices, but the code would be more complicated (you will need)
    # to check for isomorphism AND permutation
    found = False
    while not found:
        x_tilde = "".join([random.choice(["0", "1"]) for _ in range(K)])
        found = True
        for i, c in enumerate(rubric):
            if i in with_xors and int(rubric[0]("0000")) == 1:
                c1, c1tilde = with_xors[i][0](x), with_xors[i][0](x_tilde)
                c2, c2tilde = with_xors[i][1](x), with_xors[i][1](x_tilde)
                if c1 != c1tilde:
                    found = False
                if c2 != c2tilde:
                    found = False
                if c(x) != c(x_tilde):
                    found = False
            else:
                if c(x_tilde) != c(x):
                    found = False
    if random.uniform(0,1) < p:
        return "".join([random.choice(["0", "1"]) for _ in range(K)])
    return x_tilde


def dt_generator_lying_no_f(x, y=None, rubric=None):
    ''' 
    A lying generator (no f known)
    This is equivalent to claiming (falsely) "I understand the task"
    '''
    found = False
    while not found:
        x_tilde = "".join([random.choice(["0", "1"]) for _ in range(K)])
        y_tilde = get_label_for(x_tilde)
        if int(y_tilde) == int(y): found = True
    return x_tilde


def dt_generator_lying_no_sigma(x, y=None, rubric=None):
    ''' 
    A lying generator (no sigma known). This one is more sophisticated,
    since it "understands" the rubric but it can't generate something that
    is isomorphic, just similar.
    It is equivalent to falsely claiming you know how to label the data 
    given the rubric. 
    '''
    expected = "".join([str(c(x)) for c in rubric])
    expected = ''.join(sorted(expected))
    while True:
        x_tilde = "".join([random.choice(["0", "1"]) for _ in range(K)])
        prototype = [str(c(x_tilde)) for c in rubric]
        prototype = ''.join(sorted(prototype))
        if expected == prototype:
            break
    return x_tilde


def dt_evaluator(x):
    ''' 
    Wrapper to maintain signatures 
    '''
    return int(clf.predict(np.array([x]).reshape(-1, 1))[0])


def dt_evaluator_lying(x, p=0.1):
    ''' 
    Wrapper to maintain signatures 
    '''
    label = int(clf.predict(np.array([x]).reshape(-1, 1))[0])
    if random.uniform(0, 1) < p:
        return int(not label)
    return label



## Baseline

In [34]:
# First we baseline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score

import numpy as np

clf = DecisionTreeClassifier(random_state=123)

X = np.array([p[0] for p in good_data_train])
clf.fit(X.reshape(-1, 1), [p[-1] for p in good_data_train])

X = np.array([p[0] for p in good_data_test])
Y = [int(p[-1]) for p in good_data_test]
preds = [int(k) for k in clf.predict(X.reshape(-1, 1))]
dt_in_phenomenon_baseline_accuracy = round(accuracy_score(Y, preds), 3)
dt_in_phenomenon_baseline_f1 = round(f1_score(Y, preds), 3)

X = np.array([p[0] for p in other_data_test])
Y = [int(p[-1]) for p in other_data_test]
preds = [int(k) for k in clf.predict(X.reshape(-1, 1))]
dt_out_phenomenon_baseline_accuracy = round(accuracy_score(Y, preds), 3)
dt_out_phenomenon_baseline_f1 = round(f1_score(Y, preds)*100, 3)

print(f"In-phenomenon baseline test score: {dt_in_phenomenon_baseline_accuracy} | {dt_in_phenomenon_baseline_f1}")
print(f"Out-of-phenomenon baseline test score: {dt_out_phenomenon_baseline_accuracy} | {dt_out_phenomenon_baseline_f1}")


In-phenomenon baseline test score: 0.622 | 0.598
Out-of-phenomenon baseline test score: 0.542 | 54.217


## The No-Data Algorithm

In [45]:
phi = 0.5 # Doesn't matter in ID since it is using the oracular generator
good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=dt_evaluator,
                                                generator=dt_generator_base,
                                                rubric=rubric_good,
                                                max_rounds=rounds,
                                                phi=phi, noise=None)


other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                evaluator=dt_evaluator,
                                                generator=dt_generator_base,
                                                rubric=rubric_other,
                                                max_rounds=rounds, 
                                                phi=phi, noise=None) # Check with the correct datapoint

print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)


100%|██████████| 498/498 [00:00<00:00, 1816.34it/s]
100%|██████████| 498/498 [00:00<00:00, 3309.05it/s]

In-phenomenon test score: 0.622 | Successes: 100.0 | F1: 59.829 | Flips: 0.0
Out-phenomenon test score: 0.528 | Successes: 4.819 | F1: 52.138 | Flips: 46.386





## Ablation Studies

#### No flips - Lying Generator

In [25]:
for generator_name, generator in [("no sigma", dt_generator_lying_no_sigma), 
                                ("no f", dt_generator_lying_no_f),
                                ("p = 0.1", dt_generator_probabilistic),
                                ("oracle", dt_generator_base)]:

    good_labels, good_successes = no_data_algorithm([p[0] for p in good_data_test], 
                                                    evaluator=dt_evaluator,
                                                    generator=generator,
                                                    rubric=rubric_good,
                                                    max_rounds=rounds,
                                                    noise=None)

    other_labels, other_successes = no_data_algorithm([p[0] for p in other_data_test], 
                                                    evaluator=dt_evaluator,
                                                    generator=generator,
                                                    rubric=rubric_other,
                                                    max_rounds=rounds,
                                                    noise=None) # Check with the correct datapoint

    print(f"Generator: {generator_name}")
    print_metrics(good_labels, good_successes, good_data_test, "In")
    print_metrics(other_labels, other_successes, other_data_test, "Out")

100%|██████████| 498/498 [00:00<00:00, 6569.51it/s]
100%|██████████| 498/498 [00:00<00:00, 9029.44it/s]


Generator: no sigma
In-phenomenon test score: 0.488 | Successes: 18.273 | F1: 51.243
Out-phenomenon test score: 0.46 | Successes: 1.807 | F1: 46.307


100%|██████████| 498/498 [00:00<00:00, 13510.06it/s]
100%|██████████| 498/498 [00:00<00:00, 8355.99it/s]


Generator: no f
In-phenomenon test score: 0.382 | Successes: 0.402 | F1: 41.667
Out-phenomenon test score: 0.476 | Successes: 2.61 | F1: 47.695


100%|██████████| 498/498 [00:00<00:00, 2335.46it/s]
100%|██████████| 498/498 [00:00<00:00, 3597.99it/s]


Generator: p = 0.1
In-phenomenon test score: 0.502 | Successes: 50.201 | F1: 50.4
Out-phenomenon test score: 0.464 | Successes: 4.217 | F1: 46.493


100%|██████████| 498/498 [00:00<00:00, 1899.78it/s]
100%|██████████| 498/498 [00:00<00:00, 4270.64it/s]

Generator: oracle
In-phenomenon test score: 0.622 | Successes: 100.0 | F1: 59.829
Out-phenomenon test score: 0.462 | Successes: 4.819 | F1: 45.082





#### Flips -- Lying Generator

In [40]:
for phi in [0.0, 0.3, 0.5, 0.6, 0.9, 1.0]:
    print(f"--------------- Phi: {phi} ---------------")
    for generator_name, generator in [("no sigma", dt_generator_lying_no_sigma), 
                                    ("no f", dt_generator_lying_no_f),
                                    ("p = 0.1", dt_generator_probabilistic),
                                    ("oracle", dt_generator_base)]:

        good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                        evaluator=dt_evaluator,
                                                        generator=generator,
                                                        rubric=rubric_good,
                                                        max_rounds=rounds,
                                                        phi=phi, noise=None)

        other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                        evaluator=dt_evaluator,
                                                        generator=generator,
                                                        rubric=rubric_other,
                                                        max_rounds=rounds,
                                                        phi = phi, noise=None) # Check with the correct datapoint

        print(f"Generator: {generator_name}")
        print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
        print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

--------------- Phi: 0.0 ---------------


100%|██████████| 498/498 [00:00<00:00, 1473.43it/s]
100%|██████████| 498/498 [00:00<00:00, 1943.57it/s]


Generator: no sigma
In-phenomenon test score: 0.482 | Successes: 16.064 | F1: 50.763 | Flips: 83.936
Out-phenomenon test score: 0.468 | Successes: 2.209 | F1: 46.894 | Flips: 97.791


100%|██████████| 498/498 [00:00<00:00, 2624.65it/s]
100%|██████████| 498/498 [00:00<00:00, 2717.70it/s]


Generator: no f
In-phenomenon test score: 0.376 | Successes: 0.602 | F1: 40.987 | Flips: 99.398
Out-phenomenon test score: 0.484 | Successes: 4.217 | F1: 49.109 | Flips: 95.783


100%|██████████| 498/498 [00:00<00:00, 713.42it/s]
100%|██████████| 498/498 [00:00<00:00, 1198.30it/s]


Generator: p = 0.1
In-phenomenon test score: 0.484 | Successes: 47.59 | F1: 49.31 | Flips: 52.41
Out-phenomenon test score: 0.468 | Successes: 4.618 | F1: 46.247 | Flips: 95.382


100%|██████████| 498/498 [00:00<00:00, 592.39it/s]
100%|██████████| 498/498 [00:00<00:00, 1019.07it/s]


Generator: oracle
In-phenomenon test score: 0.622 | Successes: 100.0 | F1: 59.829 | Flips: 0.0
Out-phenomenon test score: 0.464 | Successes: 5.823 | F1: 45.842 | Flips: 94.177
--------------- Phi: 0.3 ---------------


100%|██████████| 498/498 [00:00<00:00, 1845.26it/s]
100%|██████████| 498/498 [00:00<00:00, 3339.41it/s]


Generator: no sigma
In-phenomenon test score: 0.514 | Successes: 19.277 | F1: 54.682 | Flips: 60.643
Out-phenomenon test score: 0.476 | Successes: 1.606 | F1: 48.521 | Flips: 69.679


100%|██████████| 498/498 [00:00<00:00, 3305.39it/s]
100%|██████████| 498/498 [00:00<00:00, 3166.29it/s]


Generator: no f
In-phenomenon test score: 0.472 | Successes: 0.602 | F1: 49.326 | Flips: 68.072
Out-phenomenon test score: 0.52 | Successes: 2.008 | F1: 51.125 | Flips: 68.072


100%|██████████| 498/498 [00:00<00:00, 813.98it/s]
100%|██████████| 498/498 [00:00<00:00, 1213.46it/s]


Generator: p = 0.1
In-phenomenon test score: 0.508 | Successes: 46.386 | F1: 50.505 | Flips: 39.558
Out-phenomenon test score: 0.478 | Successes: 2.61 | F1: 46.939 | Flips: 65.06


100%|██████████| 498/498 [00:01<00:00, 489.90it/s]
100%|██████████| 498/498 [00:00<00:00, 1684.11it/s]


Generator: oracle
In-phenomenon test score: 0.622 | Successes: 100.0 | F1: 59.829 | Flips: 0.0
Out-phenomenon test score: 0.496 | Successes: 4.618 | F1: 49.087 | Flips: 66.064
--------------- Phi: 0.5 ---------------


100%|██████████| 498/498 [00:00<00:00, 2790.31it/s]
100%|██████████| 498/498 [00:00<00:00, 3551.70it/s]


Generator: no sigma
In-phenomenon test score: 0.542 | Successes: 19.277 | F1: 54.032 | Flips: 40.161
Out-phenomenon test score: 0.484 | Successes: 1.004 | F1: 49.31 | Flips: 50.0


100%|██████████| 498/498 [00:00<00:00, 4254.65it/s]
100%|██████████| 498/498 [00:00<00:00, 4302.72it/s]


Generator: no f
In-phenomenon test score: 0.512 | Successes: 1.004 | F1: 50.909 | Flips: 47.992
Out-phenomenon test score: 0.514 | Successes: 2.811 | F1: 52.174 | Flips: 48.996


100%|██████████| 498/498 [00:00<00:00, 729.92it/s] 
100%|██████████| 498/498 [00:00<00:00, 824.02it/s]


Generator: p = 0.1
In-phenomenon test score: 0.544 | Successes: 47.791 | F1: 52.008 | Flips: 27.51
Out-phenomenon test score: 0.522 | Successes: 2.811 | F1: 50.826 | Flips: 47.791


100%|██████████| 498/498 [00:00<00:00, 689.23it/s]
100%|██████████| 498/498 [00:00<00:00, 1789.62it/s]


Generator: oracle
In-phenomenon test score: 0.622 | Successes: 100.0 | F1: 59.829 | Flips: 0.0
Out-phenomenon test score: 0.546 | Successes: 6.426 | F1: 55.159 | Flips: 47.791
--------------- Phi: 0.6 ---------------


100%|██████████| 498/498 [00:00<00:00, 3097.67it/s]
100%|██████████| 498/498 [00:00<00:00, 2826.87it/s]


Generator: no sigma
In-phenomenon test score: 0.586 | Successes: 16.867 | F1: 58.8 | Flips: 32.53
Out-phenomenon test score: 0.524 | Successes: 2.811 | F1: 53.801 | Flips: 35.141


100%|██████████| 498/498 [00:00<00:00, 3221.11it/s]
100%|██████████| 498/498 [00:00<00:00, 1284.84it/s]


Generator: no f
In-phenomenon test score: 0.506 | Successes: 0.602 | F1: 48.101 | Flips: 37.349
Out-phenomenon test score: 0.514 | Successes: 3.012 | F1: 52.174 | Flips: 35.743


100%|██████████| 498/498 [00:00<00:00, 780.08it/s]
100%|██████████| 498/498 [00:00<00:00, 1763.02it/s]


Generator: p = 0.1
In-phenomenon test score: 0.558 | Successes: 43.574 | F1: 54.167 | Flips: 24.9
Out-phenomenon test score: 0.476 | Successes: 4.418 | F1: 49.516 | Flips: 37.952


100%|██████████| 498/498 [00:00<00:00, 950.82it/s] 
100%|██████████| 498/498 [00:00<00:00, 1965.77it/s]


Generator: oracle
In-phenomenon test score: 0.622 | Successes: 100.0 | F1: 59.829 | Flips: 0.0
Out-phenomenon test score: 0.486 | Successes: 6.024 | F1: 47.325 | Flips: 35.341
--------------- Phi: 0.9 ---------------


100%|██████████| 498/498 [00:00<00:00, 3294.75it/s]
100%|██████████| 498/498 [00:00<00:00, 4277.31it/s]


Generator: no sigma
In-phenomenon test score: 0.592 | Successes: 16.667 | F1: 57.263 | Flips: 9.036
Out-phenomenon test score: 0.522 | Successes: 2.209 | F1: 52.778 | Flips: 10.843


100%|██████████| 498/498 [00:00<00:00, 4643.80it/s]
100%|██████████| 498/498 [00:00<00:00, 3659.47it/s]


Generator: no f
In-phenomenon test score: 0.588 | Successes: 1.205 | F1: 55.724 | Flips: 9.839
Out-phenomenon test score: 0.532 | Successes: 3.213 | F1: 51.357 | Flips: 9.839


100%|██████████| 498/498 [00:00<00:00, 948.40it/s] 
100%|██████████| 498/498 [00:00<00:00, 1672.24it/s]


Generator: p = 0.1
In-phenomenon test score: 0.612 | Successes: 47.189 | F1: 58.134 | Flips: 5.823
Out-phenomenon test score: 0.53 | Successes: 3.414 | F1: 53.012 | Flips: 10.843


100%|██████████| 498/498 [00:00<00:00, 764.02it/s]
100%|██████████| 498/498 [00:00<00:00, 1013.22it/s]


Generator: oracle
In-phenomenon test score: 0.622 | Successes: 100.0 | F1: 59.829 | Flips: 0.0
Out-phenomenon test score: 0.544 | Successes: 4.217 | F1: 54.691 | Flips: 11.446
--------------- Phi: 1.0 ---------------


100%|██████████| 498/498 [00:00<00:00, 1773.24it/s]
100%|██████████| 498/498 [00:00<00:00, 3170.64it/s]


Generator: no sigma
In-phenomenon test score: 0.622 | Successes: 17.47 | F1: 59.829 | Flips: 0.0
Out-phenomenon test score: 0.542 | Successes: 3.012 | F1: 54.217 | Flips: 0.0


100%|██████████| 498/498 [00:00<00:00, 4981.16it/s]
100%|██████████| 498/498 [00:00<00:00, 4978.47it/s]


Generator: no f
In-phenomenon test score: 0.622 | Successes: 1.205 | F1: 59.829 | Flips: 0.0
Out-phenomenon test score: 0.542 | Successes: 3.614 | F1: 54.217 | Flips: 0.0


100%|██████████| 498/498 [00:00<00:00, 1390.72it/s]
100%|██████████| 498/498 [00:00<00:00, 2542.65it/s]


Generator: p = 0.1
In-phenomenon test score: 0.622 | Successes: 49.598 | F1: 59.829 | Flips: 0.0
Out-phenomenon test score: 0.542 | Successes: 4.016 | F1: 54.217 | Flips: 0.0


100%|██████████| 498/498 [00:00<00:00, 1340.61it/s]
100%|██████████| 498/498 [00:00<00:00, 2343.44it/s]

Generator: oracle
In-phenomenon test score: 0.622 | Successes: 100.0 | F1: 59.829 | Flips: 0.0
Out-phenomenon test score: 0.542 | Successes: 5.422 | F1: 54.217 | Flips: 0.0





#### No Flips -- Lying Evaluator

In [47]:
phi=0.5
for generator_name, generator in [("no sigma", dt_generator_lying_no_sigma), 
                                ("no f", dt_generator_lying_no_f),
                                ("p = 0.1", dt_generator_probabilistic),
                                ("oracle", dt_generator_base)]:

    good_labels, good_successes = no_data_algorithm([p[0] for p in good_data_test], 
                                                    evaluator=dt_evaluator_lying,
                                                    generator=generator,
                                                    rubric=rubric_good,
                                                    max_rounds=rounds,
                                                     noise=None)

    other_labels, other_successes = no_data_algorithm([p[0] for p in other_data_test], 
                                                    evaluator=dt_evaluator_lying,
                                                    generator=generator,
                                                    rubric=rubric_other,
                                                    max_rounds=rounds,
                                                    noise=None) # Check with the correct datapoint

    print(f"Generator: {generator_name}")
    print_metrics(good_labels, good_successes, good_data_test, "In")
    print_metrics(other_labels, other_successes, other_data_test, "Out")

100%|██████████| 498/498 [00:00<00:00, 4380.58it/s]
100%|██████████| 498/498 [00:00<00:00, 4477.69it/s]


Generator: no sigma
In-phenomenon test score: 0.426 | Successes: 15.06 | F1: 45.0
Out-phenomenon test score: 0.49 | Successes: 2.209 | F1: 49.402


100%|██████████| 498/498 [00:00<00:00, 7767.40it/s]
100%|██████████| 498/498 [00:00<00:00, 8795.16it/s]


Generator: no f
In-phenomenon test score: 0.394 | Successes: 0.201 | F1: 42.366
Out-phenomenon test score: 0.512 | Successes: 2.811 | F1: 52.816


100%|██████████| 498/498 [00:00<00:00, 1985.39it/s]
100%|██████████| 498/498 [00:00<00:00, 3813.64it/s]


Generator: p = 0.1
In-phenomenon test score: 0.49 | Successes: 51.205 | F1: 49.402
Out-phenomenon test score: 0.466 | Successes: 4.618 | F1: 48.649


100%|██████████| 498/498 [00:00<00:00, 1582.13it/s]
100%|██████████| 498/498 [00:00<00:00, 2214.54it/s]

Generator: oracle
In-phenomenon test score: 0.59 | Successes: 100.0 | F1: 57.5
Out-phenomenon test score: 0.456 | Successes: 6.426 | F1: 44.807





#### Flips -- Lying Evaluator

In [1]:
for phi in [0.1, 0.3, 0.5, 0.6, 0.9, 1.0]:
    print(f"-------------------- Phi: {phi} -----------------")
    for generator_name, generator in [("no sigma", dt_generator_lying_no_sigma), 
                                    ("no f", dt_generator_lying_no_f),
                                    ("p = 0.1", dt_generator_probabilistic),
                                    ("oracle", dt_generator_base)]:

        good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                        evaluator=dt_evaluator_lying,
                                                        generator=generator,
                                                        rubric=rubric_good,
                                                        max_rounds=rounds,
                                                        phi=phi, noise=None)

        other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                        evaluator=dt_evaluator_lying,
                                                        generator=generator,
                                                        rubric=rubric_other,
                                                        max_rounds=rounds,
                                                        phi = phi, noise=None) # Check with the correct datapoint

        print(f"Generator: {generator_name}")
        print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
        print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

-------------------- Phi: 0.1 -----------------


100%|██████████| 498/498 [00:00<00:00, 3590.98it/s]
100%|██████████| 498/498 [00:00<00:00, 4637.47it/s]


Generator: no sigma
In-phenomenon test score: 0.514 | Successes: 19.277 | F1: 52.918 | Flips: 71.888
Out-phenomenon test score: 0.474 | Successes: 2.61 | F1: 46.091 | Flips: 84.94


100%|██████████| 498/498 [00:00<00:00, 5676.03it/s]
100%|██████████| 498/498 [00:00<00:00, 5262.58it/s]


Generator: no f
In-phenomenon test score: 0.452 | Successes: 0.602 | F1: 48.393 | Flips: 88.554
Out-phenomenon test score: 0.462 | Successes: 2.008 | F1: 47.451 | Flips: 89.357


100%|██████████| 498/498 [00:00<00:00, 1445.48it/s]
100%|██████████| 498/498 [00:00<00:00, 2587.76it/s]


Generator: p = 0.1
In-phenomenon test score: 0.49 | Successes: 48.394 | F1: 48.163 | Flips: 46.787
Out-phenomenon test score: 0.48 | Successes: 3.614 | F1: 48.915 | Flips: 82.932


100%|██████████| 498/498 [00:00<00:00, 1246.78it/s]
100%|██████████| 498/498 [00:00<00:00, 2575.05it/s]


Generator: oracle
In-phenomenon test score: 0.596 | Successes: 100.0 | F1: 57.862 | Flips: 0.0
Out-phenomenon test score: 0.468 | Successes: 6.225 | F1: 47.106 | Flips: 84.337
-------------------- Phi: 0.3 -----------------


100%|██████████| 498/498 [00:00<00:00, 3973.53it/s]
100%|██████████| 498/498 [00:00<00:00, 4062.24it/s]


Generator: no sigma
In-phenomenon test score: 0.492 | Successes: 16.667 | F1: 50.489 | Flips: 60.442
Out-phenomenon test score: 0.532 | Successes: 2.811 | F1: 54.043 | Flips: 66.867


100%|██████████| 498/498 [00:00<00:00, 3618.51it/s]
100%|██████████| 498/498 [00:00<00:00, 3779.72it/s]


Generator: no f
In-phenomenon test score: 0.494 | Successes: 0.602 | F1: 50.588 | Flips: 68.072
Out-phenomenon test score: 0.476 | Successes: 3.012 | F1: 47.485 | Flips: 68.474


100%|██████████| 498/498 [00:00<00:00, 1479.23it/s]
100%|██████████| 498/498 [00:00<00:00, 2628.80it/s]


Generator: p = 0.1
In-phenomenon test score: 0.5 | Successes: 50.201 | F1: 48.871 | Flips: 35.141
Out-phenomenon test score: 0.474 | Successes: 2.41 | F1: 46.964 | Flips: 68.072


100%|██████████| 498/498 [00:00<00:00, 1364.00it/s]
100%|██████████| 498/498 [00:00<00:00, 2996.93it/s]


Generator: oracle
In-phenomenon test score: 0.59 | Successes: 100.0 | F1: 56.962 | Flips: 0.0
Out-phenomenon test score: 0.474 | Successes: 4.418 | F1: 47.177 | Flips: 65.261
-------------------- Phi: 0.5 -----------------


100%|██████████| 498/498 [00:00<00:00, 3320.92it/s]
100%|██████████| 498/498 [00:00<00:00, 4198.30it/s]


Generator: no sigma
In-phenomenon test score: 0.528 | Successes: 15.06 | F1: 51.745 | Flips: 39.357
Out-phenomenon test score: 0.506 | Successes: 2.008 | F1: 50.8 | Flips: 47.992


100%|██████████| 498/498 [00:00<00:00, 4269.08it/s]
100%|██████████| 498/498 [00:00<00:00, 4106.32it/s]


Generator: no f
In-phenomenon test score: 0.496 | Successes: 0.602 | F1: 49.293 | Flips: 50.602
Out-phenomenon test score: 0.51 | Successes: 1.807 | F1: 50.407 | Flips: 45.582


100%|██████████| 498/498 [00:00<00:00, 982.55it/s] 
100%|██████████| 498/498 [00:00<00:00, 1461.80it/s]


Generator: p = 0.1
In-phenomenon test score: 0.522 | Successes: 48.594 | F1: 50.826 | Flips: 26.908
Out-phenomenon test score: 0.51 | Successes: 3.012 | F1: 51.2 | Flips: 50.602


100%|██████████| 498/498 [00:00<00:00, 1006.39it/s]
100%|██████████| 498/498 [00:00<00:00, 1973.54it/s]


Generator: oracle
In-phenomenon test score: 0.588 | Successes: 100.0 | F1: 57.906 | Flips: 0.0
Out-phenomenon test score: 0.492 | Successes: 3.614 | F1: 48.473 | Flips: 50.602
-------------------- Phi: 0.6 -----------------


100%|██████████| 498/498 [00:00<00:00, 3546.27it/s]
100%|██████████| 498/498 [00:00<00:00, 3895.19it/s]


Generator: no sigma
In-phenomenon test score: 0.576 | Successes: 15.663 | F1: 56.495 | Flips: 32.329
Out-phenomenon test score: 0.508 | Successes: 1.406 | F1: 50.704 | Flips: 41.165


100%|██████████| 498/498 [00:00<00:00, 5568.19it/s]
100%|██████████| 498/498 [00:00<00:00, 4415.37it/s]


Generator: no f
In-phenomenon test score: 0.49 | Successes: 0.201 | F1: 50.196 | Flips: 38.755
Out-phenomenon test score: 0.484 | Successes: 1.606 | F1: 46.122 | Flips: 41.968


100%|██████████| 498/498 [00:00<00:00, 1285.88it/s]
100%|██████████| 498/498 [00:00<00:00, 2776.33it/s]


Generator: p = 0.1
In-phenomenon test score: 0.59 | Successes: 44.98 | F1: 58.367 | Flips: 20.683
Out-phenomenon test score: 0.506 | Successes: 3.213 | F1: 48.101 | Flips: 37.751


100%|██████████| 498/498 [00:00<00:00, 1319.71it/s]
100%|██████████| 498/498 [00:00<00:00, 2596.77it/s]


Generator: oracle
In-phenomenon test score: 0.606 | Successes: 100.0 | F1: 58.824 | Flips: 0.0
Out-phenomenon test score: 0.536 | Successes: 5.823 | F1: 54.076 | Flips: 37.149
-------------------- Phi: 0.9 -----------------


100%|██████████| 498/498 [00:00<00:00, 4255.34it/s]
100%|██████████| 498/498 [00:00<00:00, 5648.19it/s]


Generator: no sigma
In-phenomenon test score: 0.588 | Successes: 16.265 | F1: 57.906 | Flips: 8.635
Out-phenomenon test score: 0.514 | Successes: 2.008 | F1: 51.984 | Flips: 11.044


100%|██████████| 498/498 [00:00<00:00, 6739.88it/s]
100%|██████████| 498/498 [00:00<00:00, 5520.67it/s]


Generator: no f
In-phenomenon test score: 0.574 | Successes: 0.602 | F1: 55.833 | Flips: 11.245
Out-phenomenon test score: 0.526 | Successes: 2.61 | F1: 51.639 | Flips: 7.631


100%|██████████| 498/498 [00:00<00:00, 1021.77it/s]
100%|██████████| 498/498 [00:00<00:00, 2182.83it/s]


Generator: p = 0.1
In-phenomenon test score: 0.586 | Successes: 47.791 | F1: 56.723 | Flips: 6.225
Out-phenomenon test score: 0.532 | Successes: 2.811 | F1: 52.738 | Flips: 7.229


100%|██████████| 498/498 [00:00<00:00, 786.02it/s]
100%|██████████| 498/498 [00:00<00:00, 1769.19it/s]


Generator: oracle
In-phenomenon test score: 0.582 | Successes: 100.0 | F1: 55.932 | Flips: 0.0
Out-phenomenon test score: 0.54 | Successes: 6.426 | F1: 52.784 | Flips: 9.639
-------------------- Phi: 1.0 -----------------


100%|██████████| 498/498 [00:00<00:00, 2825.32it/s]
100%|██████████| 498/498 [00:00<00:00, 2846.28it/s]


Generator: no sigma
In-phenomenon test score: 0.622 | Successes: 18.876 | F1: 61.157 | Flips: 0.0
Out-phenomenon test score: 0.548 | Successes: 1.807 | F1: 53.988 | Flips: 0.0


100%|██████████| 498/498 [00:00<00:00, 4337.19it/s]
100%|██████████| 498/498 [00:00<00:00, 2286.56it/s]


Generator: no f
In-phenomenon test score: 0.586 | Successes: 0.201 | F1: 57.787 | Flips: 0.0
Out-phenomenon test score: 0.502 | Successes: 3.213 | F1: 48.971 | Flips: 0.0


100%|██████████| 498/498 [00:00<00:00, 674.88it/s]
100%|██████████| 498/498 [00:00<00:00, 1645.09it/s]


Generator: p = 0.1
In-phenomenon test score: 0.598 | Successes: 52.008 | F1: 57.265 | Flips: 0.0
Out-phenomenon test score: 0.544 | Successes: 2.61 | F1: 54.326 | Flips: 0.0


100%|██████████| 498/498 [00:00<00:00, 551.00it/s]
100%|██████████| 498/498 [00:00<00:00, 1394.39it/s]

Generator: oracle
In-phenomenon test score: 0.582 | Successes: 100.0 | F1: 55.172 | Flips: 0.0
Out-phenomenon test score: 0.534 | Successes: 4.618 | F1: 54.15 | Flips: 0.0





# Experiment 2: LLM Tests
1. Baselines are GPT-4o and o3-mini
2. Includes ablation study with the standard generator vs the picker

The LLM Client we are using supports SLMs that are open, and _techincally_ also the closed AI models (like Open AI models). 
However, you'll have to bring in your own subscription. For these cases just modify `llmclient.py` to suit your needs.

### LLM parameters, generators, etc

In [1]:
from llmclient import LLMClient, get_llm_response
import re

def build_synthetic_rationale_eval(x, rubric, rubric_nl):
    '''
    Automatically generate the chain-of-thought for the label.
    What, you thought I'd just randomly prompt it and call it science? 
    '''
    reasons = ""
    values = []
    for i, (c, _d) in enumerate(zip(rubric, rubric_nl.split("\n"))):
        d = _d.split(".")[0].replace("-", "").strip().lower()
        is_full_string = False
        if i == 0 and rubric[0]("0000") == 1:
            is_full_string = True
        position = None
        if i == 1:
            position = 0 if rubric[i]("010") == 1 else -1
        subsets = what_matched_what(x, c, position=position, is_full_string=is_full_string)
        if subsets is None or subsets == []:
            reasons += f"- There are no matches for {d}, so we make sure to not add any to our final string.\n"
            values.append("0")
        else:
            if is_full_string:
                reasons += f"- The string fully matches {d}, so the value is 1.\n"
            elif position:
                reasons += f"- There is a match for {d} in the string at {position}, so the value is 1.\n"
            else:
                reasons += f"- There is a match for {d} in the string: {subsets[0]}, so the value is 1.\n"
            values.append("1")
    final_label = get_label_for("".join(values))
    reasons += f"- Hence the final label is {final_label}"
    return reasons


def get_evaluator_prompt(x, num_exemplars, data, use_other=False, include_aggregator=True):
    '''
    Evaluator prompt. Honestly there was more writing here than in the decision tree...
    '''
    rubric_nl = good_rubric_nl if not use_other else other_rubric_nl
    rubric = rubric_good if not use_other else rubric_other

    system_prompt = "You are labelling binary strings based on a rubric (given below).\n\
    First return the parts of the criterion that match the string, and the values.\n"
    if include_aggregator:
        system_prompt += "Then return the label based on the aggregate function.\n"
    system_prompt += f"\n# Rubric: {rubric_nl}\n"
    if include_aggregator:
        system_prompt += "\n# Aggregation Function:\n{aggregator_nl}\n"
    system_prompt += "Return your answer in the form:\n|reasons|\n(list of reasons)\n|reasons|\n|label|\n(the label)\n|label|"

    exemplars = []
    for n in range(num_exemplars):
        x_data, y_data = data[n]
        reasons = build_synthetic_rationale_eval(x_data, rubric, rubric_nl)
        ex = f"|reasons|\n{reasons}\n|reasons|\n"
        ex += f"|label|\n{y_data}\n|label|"
        exemplars.append({"role": "user", "content": x_data})
        exemplars.append({"role": "assistant", "content": ex})

    prompt = [{"role": "system", "content": system_prompt}]
    prompt += exemplars
    prompt += [{"role": "user", "content": x}]
    return prompt


def build_synthetic_rationale_generation(index, rubric_nl):
    '''
    Automatically generate the chain-of-thought for the label.
    ---
    1) If the string contains an even number of zeros, 1.
    2) If the string starts with a zero OR contains 10101 (but not both), it is 1.
    3) If the string has more than five ones, it is 1.
    '''
    map_to_wrongs = [
        ['111001111100', ('1', '101')],
        ['001000001001', ('0', '010')],
        ['011010000100', ('1', '110')],
        ['010111101111', ('1', '011')],
        ['110010001010', ('0', '000')]
    ]
    x, tmp_cc_data = map_to_wrongs[index]
    y, cc_data = tmp_cc_data

    running_string = "0100"
    reasons = f"Our starting datapoint is {running_string}, as always.\n"

    rubric_entries = rubric_nl.split("\n")
    for i, value in enumerate(cc_data): 
        entry = rubric_entries[i]
        d = entry.split(".")[0].replace("-", "").strip().lower()
        kth = ["first", "second", "third"][i]
        reasons += f"We now look at the {kth} criterion. The criterion is {d}.\n"
        boilerplate = "Since it is a match, we move on to the next criterion.\n"

        if i == 0:
            reasons += f"The value for {x} in this criterion is {value}, because it has an {'even' if value == 1 else 'odd'} number of zeros.\n"
            reasons += f"Our datapoint ({running_string}) has 3 zeros, which is odd. "
            if value == 1:
                reasons += boilerplate
            else:
                reasons += f"We add a new zero at the end to avoid conflicts with the other criteria; and we get {running_string}0.\n"
                running_string += "0"
        elif i == 1:
            reasons += f"The value for {x} in this criterion is {value}, because it {'starts' if value == 1 else 'does not start'} with a zero.\n"
            reasons += f"Moreover, it {'does not contain' if '10101' not in x else 'contains'} the pattern 10101.\n"
            # Triggered when neither or both patterns are in x.
            # However, x_tilde does NOT have one of the patterns, so we need to fix this.
            xor_triggered = False
            if x[0] == "0" and "10101" in x:
                xor_triggered = True
            if x[0] != "0" and "10101" not in x:
                xor_triggered = True
                
            if not xor_triggered:
                reasons += "The value is because only one of the tests is in the criteria."
                if "10101" in x:
                    reasons += f"Our datapoint ({running_string}) starts with a zero, and does not have the pattern. The datapoint, however, has it backwards.\n"
                    running_string = "1" + running_string[1:]
                    reasons += f"We flip the first bit to match it: {running_string}, "
                    running_string += "10101"
                    reasons += f"and add the pattern to it: {running_string}.\n"
                else:
                    reasons += f"Our datapoint ({running_string}) starts with a zero, and does not have the pattern. This matches the datapoint.\n"
                    reasons += boilerplate
            else:
                if x[0] != "0":
                    reasons += f"Our datapoint ({running_string}) starts with a zero and does not have the pattern."
                    running_string = "1" + running_string[1:]
                    reasons += f"We then flip the first bit to match it: {running_string}.\n"
                else:
                    reasons += "Since our string already starts with a zero, we need to add the pattern to ensure that it matches.\n"
                    running_string += "10101"
                    reasons += f"Our string is now {running_string}.\n"
        elif i == 2:
            reasons += f"The value for the datapoint in this criterion is {value}, since it has {x.count('1')} {"(less than or equal to)" if value == 1 else "(at most)"} five ones.\n"
            reasons += f"Our datapoint ({running_string}) has {running_string.count('1')} ones. "
            if value == 1:
                reasons += "Both strings have at most five ones, so we are finished.\n"
            else:
                nzeros = running_string.count('0')
                flips_needed = nzeros - 5 if value == 1 else 5 - nzeros + 1
                reasons += f"We have {nzeros} zeros, so we need to flip at least {flips_needed} zeros. We focus on the inside of the string, to not break criterions 1 and 2.\n"
                zindices = [j for j, c in enumerate(running_string) if c == '0'][:flips_needed]
                new_string = [c if j not in zindices else str(int(not c)) for j, c in enumerate(running_string)]
                running_string = "".join(new_string)
                reasons += f"So we now have {running_string}.\n"
    reasons += f"Hence our final string is:\n|datapoint|\n{running_string}\n|datapoint|"
    return x, y, reasons


def get_generator_prompt(x, y, k, num_exemplars, data, use_other=False):
    '''
    Generator prompt, generating a new x-tilde based on the rubric.
    Note: the experiments never use `use_other`
    '''
    rubric_nl = good_rubric_nl if not use_other else other_rubric_nl
    rubric = rubric_good if not use_other else rubric_other

    system_prompt = f"You are a datapoint generator over binary strings.\n\
        Given a rubric (given below), a datapoint, and a label, return a similar datapoint that has the same label, and fulfils the same conditions as the rubric.\n\
        For convenience, always start with the same datapoint: 0100. It will be easier to work with.\n"
    # First analyse the datapoint based on the rubric and then return a similar datapoint.\n"
    system_prompt += f"\n# Rubric: {rubric_nl}\n"
    system_prompt += "Return your rationale, and then the final datapoint in the form:\n|datapoint|\n(the datapoint)\n|datapoint|"

    exemplars = []
    for n in range(num_exemplars):
        x_data, y_data, reasons = build_synthetic_rationale_generation(n, rubric_nl)
        exemplars.append({"role": "user", "content": f"Datapoint: {x_data}\nLabel: {y_data}"})
        exemplars.append({"role": "assistant", "content": reasons})

    prompt = [{"role": "system", "content": system_prompt}]
    prompt += exemplars
    prompt += [{"role": "user", "content": f"Datapoint: {x}\nLabel: {y}"}]
    return prompt



In [1]:
def llm_generator(x, y, rubric, data=good_data_train, num_exemplars=5):
    ''' 
    LLM version of the data generation, which involves a call.
    It does not need the rubric (always rubric_good) because it is already baked in.
    '''
    def parse_response(_r):
        r = _r.split("|datapoint|")
        if len(r) > 1:
            r = r[-2]
            r = r.replace("|datapoint|", "").strip()
        else:
            x = re.findall('[0|1].+', r[0])[-1]
            r = x.replace("|", "").strip()
        if not all(c in '01' for c in r): 
            return None
        return r

    prompt = get_generator_prompt(x, y, k=K, num_exemplars=num_exemplars, 
                                  data=data)
    resp, _resp = None, None
    max_retries = 5
    trial = 0
    while resp is None:
        if trial > max_retries: break
        _resp = get_llm_response(llm, prompt)
        try:
            resp = parse_response(_resp)
            with open("o3_log_generator.json", "a", encoding="utf-8") as f:
                f.write(json.dumps({"raw_resp": _resp, 
                                    "parse": resp, 
                                    "eval": "".join([str(c(resp)) for c in rubric_good]),
                                    "eval_x": "".join([str(c(x)) for c in rubric_good])}) + "\n")
        except:
            resp = None
            trial += 1
    if trial > 0: 
        with open("o3_generator_parse_failure_logs.tsv", "a", encoding="utf-8") as f:
            f.write(f"{trial}\t{_resp}\n") #\t{''.join([str(c(resp)) for c in rubric_good])}\n")
    if resp is None or resp == "":
        resp = "".join([random.choice(["0", "1"]) for _ in range(K)])
    return resp


def llm_evaluator(x, data=good_data_train, num_exemplars=5):
    ''' 
    Wrapper to maintain signatures 
    '''
    def parse_response(_r):
        r = _r.split("|reasons|")[-1]
        r = r.replace("|label|", "").strip()
        return int(r)

    prompt = get_evaluator_prompt(x, num_exemplars, data)
    resp = None
    max_retries = 10
    trial = 0 
    while resp is None:
        if trial > max_retries: break
        _resp = get_llm_response(llm, prompt)
        try:
            resp = parse_response(_resp)
        except:
            resp = None
            trial += 1
    if resp is None: return random.choice([0, 1])
    return resp


### Baseline

In [1]:
# First we baseline (omni)
MODEL = "gpt-4o-2024-05-13"
params = {"max_tokens": 1024, "temperature": 0.0} 
llm = LLMClient(params, MODEL)


preds = []
for pt in tqdm(good_data_test):
    x = pt[0]
    y = llm_evaluator(x, good_data_train, 5)
    preds.append(int(y))

Y = [int(p[-1]) for p in good_data_test]
gpt4o_in_phenomenon_baseline_accuracy = round(accuracy_score(Y, preds), 3)
gpt4o_in_phenomenon_baseline_f1 = round(f1_score(Y, preds), 3)

preds = []
for pt in tqdm(other_data_test):
    x = pt[0]
    y = llm_evaluator(x, good_data_train, 5) # the model only knows good
    preds.append(int(y))

Y = [int(p[-1]) for p in other_data_test]
gpt4o_out_phenomenon_baseline_accuracy = round(accuracy_score(Y, preds), 3)
gpt4o_out_phenomenon_baseline_f1 = round(f1_score(Y, preds), 3)

print(f"In-phenomenon baseline test score: {gpt4o_in_phenomenon_baseline_accuracy} | {gpt4o_in_phenomenon_baseline_f1}")
print(f"Out-of-phenomenon baseline test score: {gpt4o_out_phenomenon_baseline_accuracy} | {gpt4o_out_phenomenon_baseline_f1}")


100%|██████████| 498/498 [28:17<00:00,  3.41s/it]
100%|██████████| 498/498 [20:48<00:00,  2.51s/it]

In-phenomenon baseline test score: 0.61 | 0.697
Out-of-phenomenon baseline test score: 0.558 | 0.664





In [1]:
# First we baseline (o3-mini)
MODEL = "gpt-o3-mini"
params = {"max_completion_tokens": 10000} #, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

preds = []
for pt in tqdm(good_data_test):
    x = pt[0]
    y = llm_evaluator(x, good_data_train, 5)
    preds.append(int(y))

Y = [int(p[-1]) for p in good_data_test]
gpt4o_in_phenomenon_baseline_accuracy = round(accuracy_score(Y, preds), 3)
gpt4o_in_phenomenon_baseline_f1 = round(f1_score(Y, preds), 3)

preds = []
for pt in tqdm(other_data_test):
    x = pt[0]
    y = llm_evaluator(x, good_data_train, 5) # the model only knows good
    preds.append(int(y))

Y = [int(p[-1]) for p in other_data_test]
gpt4o_out_phenomenon_baseline_accuracy = round(accuracy_score(Y, preds), 3)
gpt4o_out_phenomenon_baseline_f1 = round(f1_score(Y, preds), 3)

print(f"In-phenomenon baseline test score: {gpt4o_in_phenomenon_baseline_accuracy} | {gpt4o_in_phenomenon_baseline_f1}")
print(f"Out-of-phenomenon baseline test score: {gpt4o_out_phenomenon_baseline_accuracy} | {gpt4o_out_phenomenon_baseline_f1}")


100%|██████████| 498/498 [2:20:42<00:00, 16.95s/it]  
100%|██████████| 498/498 [2:14:21<00:00, 16.19s/it]  

In-phenomenon baseline test score: 0.998 | 0.998
Out-of-phenomenon baseline test score: 0.606 | 0.662





In [1]:
# First we baseline (DeepSeek)
MODEL = "deepseek-r1-distill-qwen-32b"
params = {"max_tokens": 2048, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

preds = []
for pt in tqdm(good_data_test):
    x = pt[0]
    y = llm_evaluator(x, good_data_train, 5)
    preds.append(int(y))

Y = [int(p[-1]) for p in good_data_test]
gpt4o_in_phenomenon_baseline_accuracy = round(accuracy_score(Y, preds), 3)
gpt4o_in_phenomenon_baseline_f1 = round(f1_score(Y, preds), 3)

preds = []
for pt in tqdm(other_data_test):
    x = pt[0]
    y = llm_evaluator(x, good_data_train, 5) # the model only knows good
    preds.append(int(y))

Y = [int(p[-1]) for p in other_data_test]
gpt4o_out_phenomenon_baseline_accuracy = round(accuracy_score(Y, preds), 3)
gpt4o_out_phenomenon_baseline_f1 = round(f1_score(Y, preds), 3)

print(f"In-phenomenon baseline test score: {gpt4o_in_phenomenon_baseline_accuracy} | {gpt4o_in_phenomenon_baseline_f1}")
print(f"Out-of-phenomenon baseline test score: {gpt4o_out_phenomenon_baseline_accuracy} | {gpt4o_out_phenomenon_baseline_f1}")


100%|██████████| 498/498 [4:56:54<00:00, 35.77s/it]    
100%|██████████| 498/498 [5:28:04<00:00, 39.53s/it]    

In-phenomenon baseline test score: 0.61 | 0.708
Out-of-phenomenon baseline test score: 0.544 | 0.656





In [1]:
# First we baseline (qwen-25-vl7b)
MODEL = "qwen-25-vl7b"
params = {"max_tokens": 1024, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

preds = []
for pt in tqdm(good_data_test):
    x = pt[0]
    y = llm_evaluator(x, good_data_train, 5)
    preds.append(int(y))

Y = [int(p[-1]) for p in good_data_test]
gpt4o_in_phenomenon_baseline_accuracy = round(accuracy_score(Y, preds), 3)
gpt4o_in_phenomenon_baseline_f1 = round(f1_score(Y, preds), 3)

preds = []
for pt in tqdm(other_data_test):
    x = pt[0]
    y = llm_evaluator(x, good_data_train, 5) # the model only knows good
    preds.append(int(y))

Y = [int(p[-1]) for p in other_data_test]
gpt4o_out_phenomenon_baseline_accuracy = round(accuracy_score(Y, preds), 3)
gpt4o_out_phenomenon_baseline_f1 = round(f1_score(Y, preds), 3)

print(f"In-phenomenon baseline test score: {gpt4o_in_phenomenon_baseline_accuracy} | {gpt4o_in_phenomenon_baseline_f1}")
print(f"Out-of-phenomenon baseline test score: {gpt4o_out_phenomenon_baseline_accuracy} | {gpt4o_out_phenomenon_baseline_f1}")


100%|██████████| 498/498 [14:37<00:00,  1.76s/it]
100%|██████████| 498/498 [14:29<00:00,  1.75s/it]

In-phenomenon baseline test score: 0.5 | 0.667
Out-of-phenomenon baseline test score: 0.502 | 0.668





## The ND Algorithm

o3-mini

In [1]:
phi = 0.0 #o3-mini, picker
MODEL = "gpt-o3-mini"
params = {"max_completion_tokens": 50000} #, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator_with_picker,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator_with_picker,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [15:43:34<00:00, 113.68s/it]   
100%|██████████| 498/498 [8:18:26<00:00, 60.05s/it]   

In-phenomenon test score: 0.805 | Successes: 81.124 | F1: 78.775 | Flips: 18.876
Out-phenomenon test score: 0.492 | Successes: 27.711 | F1: 43.4 | Flips: 72.289





In [1]:
phi = 0.9 #o3-mini, picker
MODEL = "gpt-o3-mini"
params = {"max_completion_tokens": 50000} #, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator_with_picker,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator_with_picker,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [18:14:34<00:00, 131.88s/it]     
100%|██████████| 498/498 [6:12:11<00:00, 44.84s/it]  

In-phenomenon test score: 0.976 | Successes: 81.325 | F1: 97.581 | Flips: 1.807
Out-phenomenon test score: 0.59 | Successes: 27.912 | F1: 63.958 | Flips: 6.024





In [1]:
phi = 0.5 #o3-mini, picker
MODEL = "gpt-o3-mini"
params = {"max_completion_tokens": 50000} #, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator_with_picker,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator_with_picker,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [10:18:07<00:00, 74.47s/it]   
100%|██████████| 498/498 [5:47:07<00:00, 41.82s/it]  

In-phenomenon test score: 0.894 | Successes: 80.723 | F1: 88.795 | Flips: 10.442
Out-phenomenon test score: 0.534 | Successes: 27.912 | F1: 55.725 | Flips: 37.349





In [1]:
phi = 0.6 #o3-mini, picker
MODEL = "gpt-o3-mini"
params = {"max_completion_tokens": 50000} #, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator_with_picker,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator_with_picker,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [10:51:35<00:00, 78.51s/it]  
100%|██████████| 498/498 [7:08:04<00:00, 51.58s/it]    

In-phenomenon test score: 0.898 | Successes: 79.518 | F1: 89.441 | Flips: 9.639
Out-phenomenon test score: 0.574 | Successes: 27.912 | F1: 60.0 | Flips: 30.723





GPT-4o

In [1]:
phi = 0.0 #gpt-4o, picker
MODEL = "gpt-4o-2024-05-13"
params = {"max_tokens": 1024, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator_with_picker,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator_with_picker,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [4:02:51<00:00, 29.26s/it]  
100%|██████████| 498/498 [2:03:09<00:00, 14.84s/it]  

In-phenomenon test score: 0.371 | Successes: 28.715 | F1: 27.714 | Flips: 71.285
Out-phenomenon test score: 0.404 | Successes: 9.438 | F1: 20.375 | Flips: 90.562





In [1]:
phi = 0.9 #gpt-4o, picker
MODEL = "gpt-4o-2024-05-13"
params = {"max_tokens": 1024, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator_with_picker,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator_with_picker,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [2:55:23<00:00, 21.13s/it]  
100%|██████████| 498/498 [2:39:31<00:00, 19.22s/it]  

In-phenomenon test score: 0.58 | Successes: 28.715 | F1: 67.087 | Flips: 5.622
Out-phenomenon test score: 0.564 | Successes: 10.843 | F1: 65.391 | Flips: 8.835





In [1]:
phi = 0.5 #gpt-4o, picker
MODEL = "gpt-4o-2024-05-13"
params = {"max_tokens": 1024, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator_with_picker,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator_with_picker,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [3:07:32<00:00, 22.60s/it]  
100%|██████████| 498/498 [3:55:23<00:00, 28.36s/it]   

In-phenomenon test score: 0.508 | Successes: 30.924 | F1: 54.545 | Flips: 33.534
Out-phenomenon test score: 0.486 | Successes: 8.835 | F1: 49.606 | Flips: 46.988





In [1]:
phi = 0.6 #gpt-4o, picker
MODEL = "gpt-4o-2024-05-13"
params = {"max_tokens": 1024, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator_with_picker,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator_with_picker,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [6:05:10<00:00, 44.00s/it]   
100%|██████████| 498/498 [2:19:07<00:00, 16.76s/it]  

In-phenomenon test score: 0.482 | Successes: 25.301 | F1: 52.574 | Flips: 29.518
Out-phenomenon test score: 0.514 | Successes: 10.643 | F1: 54.851 | Flips: 35.141





DeepSeek-R1

In [1]:
phi = 0.0 #deepseek, picker
MODEL = "deepseek-r1-distill-qwen-32b"
params = {"max_tokens": 2048, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator_with_picker,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator_with_picker,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [42:10:54<00:00, 304.93s/it]   
100%|██████████| 498/498 [31:19:05<00:00, 226.40s/it]   

In-phenomenon test score: 0.42 | Successes: 24.498 | F1: 32.947 | Flips: 75.502
Out-phenomenon test score: 0.396 | Successes: 11.647 | F1: 19.303 | Flips: 88.353





In [1]:
phi = 0.9 #deepseek, picker
MODEL = "deepseek-r1-distill-qwen-32b"
params = {"max_tokens": 2048, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator_with_picker,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator_with_picker,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [44:03:13<00:00, 318.46s/it]     
100%|██████████| 498/498 [36:01:40<00:00, 260.44s/it]   

In-phenomenon test score: 0.598 | Successes: 26.506 | F1: 68.454 | Flips: 7.43
Out-phenomenon test score: 0.526 | Successes: 6.627 | F1: 62.42 | Flips: 11.847





In [1]:
phi = 0.5 #deepseek, picker
MODEL = "deepseek-r1-distill-qwen-32b"
params = {"max_tokens": 2048, "temperature": 0.0} 
llm = LLMClient(params, MODEL)


good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator_with_picker,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator_with_picker,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [45:55:56<00:00, 332.04s/it]   
100%|██████████| 498/498 [38:23:24<00:00, 277.52s/it]   

In-phenomenon test score: 0.508 | Successes: 16.466 | F1: 55.21 | Flips: 38.554
Out-phenomenon test score: 0.48 | Successes: 6.225 | F1: 49.116 | Flips: 50.402





In [1]:
phi = 0.6 #deepseek, picker
MODEL = "deepseek-r1-distill-qwen-32b"
params = {"max_tokens": 2048, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator_with_picker,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator_with_picker,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)
print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [44:06:29<00:00, 318.85s/it]   
100%|██████████| 498/498 [40:08:51<00:00, 290.22s/it]    

In-phenomenon test score: 0.52 | Successes: 15.261 | F1: 56.781 | Flips: 36.145
Out-phenomenon test score: 0.48 | Successes: 5.622 | F1: 50.478 | Flips: 40.161





Qwen

In [1]:
phi = 0.0 # qwen, with phi picked based on baseline perf
MODEL = "qwen-25-vl7b"
params = {"max_tokens": 1024, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)
print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [1:07:53<00:00,  8.18s/it]


In-phenomenon test score: 0.54 | Successes: 13.253 | F1: 27.302 | Flips: 86.747


100%|██████████| 498/498 [1:11:08<00:00,  8.57s/it]

Out-phenomenon test score: 0.335 | Successes: 16.265 | F1: 0.0 | Flips: 83.735





In [1]:
phi = 0.9 # qwen, with phi picked based on baseline perf
MODEL = "qwen-25-vl7b"
params = {"max_tokens": 1024, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)
print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [1:07:03<00:00,  8.08s/it]


In-phenomenon test score: 0.512 | Successes: 13.052 | F1: 65.434 | Flips: 8.835


100%|██████████| 498/498 [1:10:58<00:00,  8.55s/it]

Out-phenomenon test score: 0.502 | Successes: 16.265 | F1: 64.873 | Flips: 8.032





In [1]:
phi = 0.5 # qwen, with phi picked based on baseline perf
MODEL = "qwen-25-vl7b"
params = {"max_tokens": 1024, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)
print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [1:07:49<00:00,  8.17s/it]


In-phenomenon test score: 0.532 | Successes: 13.052 | F1: 54.224 | Flips: 47.791


100%|██████████| 498/498 [1:10:38<00:00,  8.51s/it]

Out-phenomenon test score: 0.444 | Successes: 16.265 | F1: 47.834 | Flips: 43.574





## Ablation -- Generator

### ND Algorithm

In [1]:
phi = 0.0 # o3-mini, with phi picked based on baseline perf

MODEL = "gpt-o3-mini"
params = {"max_completion_tokens": 50000} #, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)
print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [15:03:31<00:00, 108.86s/it]  


In-phenomenon test score: 0.54 | Successes: 53.815 | F1: 58.887 | Flips: 46.185


100%|██████████| 498/498 [9:29:31<00:00, 68.62s/it]    

Out-phenomenon test score: 0.584 | Successes: 26.305 | F1: 56.604 | Flips: 73.695





In [1]:
phi = 0.9 # o3-mini, with phi picked based on baseline perf

MODEL = "gpt-o3-mini"
params = {"max_completion_tokens": 50000} #, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)
print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [14:35:06<00:00, 105.43s/it]  


In-phenomenon test score: 0.942 | Successes: 55.422 | F1: 94.235 | Flips: 5.02


100%|██████████| 498/498 [10:38:16<00:00, 76.90s/it]   

Out-phenomenon test score: 0.606 | Successes: 28.715 | F1: 65.493 | Flips: 7.831





In [1]:
phi = 0.5 # o3-mini, with phi picked based on baseline perf

MODEL = "gpt-o3-mini"
params = {"max_completion_tokens": 50000} #, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)
print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [13:43:47<00:00, 99.25s/it]   


In-phenomenon test score: 0.815 | Successes: 58.233 | F1: 82.375 | Flips: 18.474


100%|██████████| 498/498 [9:32:06<00:00, 68.93s/it]    

Out-phenomenon test score: 0.564 | Successes: 28.715 | F1: 59.287 | Flips: 35.944





GPT-4o

In [1]:
phi = 0.0 # gpt-4o, with phi picked based on baseline perf

MODEL = "gpt-4o-2024-05-13"
params = {"max_tokens": 1024, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)
print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [1:48:04<00:00, 13.02s/it]  


In-phenomenon test score: 0.428 | Successes: 6.627 | F1: 23.592 | Flips: 93.373


100%|██████████| 498/498 [4:27:02<00:00, 32.17s/it]  

Out-phenomenon test score: 0.787 | Successes: 34.137 | F1: 78.099 | Flips: 65.863





In [1]:
phi = 0.9 # gpt-4o, with phi picked based on baseline perf

MODEL = "gpt-4o-2024-05-13"
params = {"max_tokens": 1024, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)
print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [2:52:16<00:00, 20.76s/it]  


In-phenomenon test score: 0.566 | Successes: 5.823 | F1: 63.758 | Flips: 10.843


100%|██████████| 498/498 [2:25:13<00:00, 17.50s/it]  

Out-phenomenon test score: 0.534 | Successes: 34.538 | F1: 63.636 | Flips: 4.618





In [1]:
phi = 0.5 # gpt-4o, with phi picked based on baseline perf

MODEL = "gpt-4o-2024-05-13"
params = {"max_tokens": 1024, "temperature": 0.0} 
llm = LLMClient(params, MODEL)

good_labels, good_successes, good_flips = no_data_algorithm_with_flips([p[0] for p in good_data_test], 
                                                evaluator=llm_evaluator,
                                                generator=llm_generator,
                                                rubric=rubric_good,
                                                phi=phi,
                                                max_rounds=rounds)
print_metrics(good_labels, good_successes, good_data_test, "In", good_flips)

other_labels, other_successes, other_flips = no_data_algorithm_with_flips([p[0] for p in other_data_test], 
                                                  evaluator=llm_evaluator,
                                                  generator=llm_generator,
                                                  rubric=rubric_other,
                                                  phi=phi,
                                                  max_rounds=rounds,)

print_metrics(other_labels, other_successes, other_data_test, "Out", other_flips)

100%|██████████| 498/498 [1:10:26<00:00,  8.49s/it]


In-phenomenon test score: 0.486 | Successes: 6.627 | F1: 48.8 | Flips: 47.791


100%|██████████| 498/498 [1:33:05<00:00, 11.22s/it]

Out-phenomenon test score: 0.633 | Successes: 33.735 | F1: 67.38 | Flips: 34.137



