# The Logic of Unsupervised Evaluation: A simple demonstration

Unsupervised evaluation is the problem of testing noisy decision makers with unlabeled data. It is an instance of the principal/agent monitoring paradox. A job has been delegated to an agent, whether robotic or human, but we know the agent is not perfect in its decisions. In some settings, this imperfection is unsafe and must be detected to prevent costly future failures. The paradox is that we then have to "do the work twice": check the very job we delegated to the agent so as not to do it ourselves!

We hear this often nowadays with LLMs: they are great until they are not. They speed up our work, but now we have to do work to check that they are doing their tasks correctly. NTQR logic is about mitigating this monitoring paradox by increasing the reliability of unsupervised evaluation.

The utility of NTQR logic comes from three properties:

1. It is **universal**: The solution for error-independent classifiers in the code applies to all domains as long as its sole assumption is satisfied: error independence on the test. More importantly, the postulates in `ntqr.r2.postulates` are **always** true for any group of classifiers in **any** domain. NTQR logic achieves this by being completely algebraic on the observed decisions of the classifiers. It has no other parameters to tune or set to produce its outputs.
2. It is **complete**: Universal postulates have been recognized before, most notably by Platanios et al. in the **agreement equations**. What has not been recognized by most ML researchers and engineers is that there is a complete set of postulates that explains **any** unsupervised evaluation of classifiers. The agreement equations are but a small subset of the complete set needed for classifiers. The power of completeness for safety is that it can provide ironclad proof that a given evaluation cannot possibly be correct.
3. It allows you to create **self-alarming** evaluation algorithms: The finite nature of any test for classifiers allows us to restrict all solutions of sample statistics to a small set of rational numbers inside the unit cube. Since the algebraic algorithms of NTQR logic are polynomial functions of unknown statistics of correctness for the classifiers, they always produce **algebraic** numbers. The class of **algebraic** numbers also includes **irrationals**. Using this, one can prove that if you evaluate a set of three binary classifiers and the prevalence estimates contain irrational numbers, the classifiers **are** error correlated on the test. We know of no other evaluation method on unlabeled data that can signal the failure of its own independence assumption.

## A simple demonstration of their universality

We say the expressions in `ntqr.r2.postulates` are postulates because they are always identically zero for sample statistics that come from summarizing the decisions of binary classifiers. This section demonstrates this with the example data sketch from a UCI Adult dataset evaluation contained in `ntqr.r2.examples`.

In [1]:
from pprint import pprint

import ntqr

In [5]:
from ntqr.r2.examples import uciadult_label_counts as data_sketch
pprint(data_sketch)

{'a': {('a', 'a', 'a'): 715,
       ('a', 'a', 'b'): 161,
       ('a', 'b', 'a'): 2406,
       ('a', 'b', 'b'): 455,
       ('b', 'a', 'a'): 290,
       ('b', 'a', 'b'): 94,
       ('b', 'b', 'a'): 1335,
       ('b', 'b', 'b'): 231},
 'b': {('a', 'a', 'a'): 271,
       ('a', 'a', 'b'): 469,
       ('a', 'b', 'a'): 3395,
       ('a', 'b', 'b'): 7517,
       ('b', 'a', 'a'): 272,
       ('b', 'a', 'b'): 399,
       ('b', 'b', 'a'): 6377,
       ('b', 'b', 'b'): 12455}}


Let us use these by-label voting event counts to calculate the exact evaluation for the classifiers on this test.

In [6]:
trio_label_counts = ntqr.TrioLabelVoteCounts(data_sketch)
supervised_eval = ntqr.SupervisedEvaluation(trio_label_counts)
supervised_eval_exact = supervised_eval.evaluation_exact
supervised_eval_float = supervised_eval.evaluation_float

The exact evaluation, which contains only rational numbers (ratios of integers), looks as follows:

In [7]:
pprint(supervised_eval_exact)

{'accuracies': [{'a': Fraction(3737, 5687), 'b': Fraction(6501, 10385)},
                {'a': Fraction(1260, 5687), 'b': Fraction(29744, 31155)},
                {'a': Fraction(4746, 5687), 'b': Fraction(4168, 6231)}],
 'pair_correlations': {'a': {(0, 1): Fraction(273192, 32341969),
                             (0, 2): Fraction(13325, 32341969),
                             (1, 2): Fraction(-264525, 32341969)},
                       'b': {(0, 1): Fraction(2204576, 323544675),
                             (0, 2): Fraction(-79682, 12941787),
                             (1, 2): Fraction(94508, 38825361)}},
 'prevalence': {'a': Fraction(5687, 36842), 'b': Fraction(31155, 36842)}}
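
The statistics above can be recomputed from the raw counts with nothing but the standard library. The following is a minimal sketch, not the `ntqr` implementation: the helper names `label_accuracy` and `pair_correlation` are hypothetical, and it assumes that a pair correlation is the covariance of the two classifiers' correctness indicators on items of a given label (an assumption that reproduces the values printed above).

```python
from fractions import Fraction

# By-label vote counts copied from the data sketch printed above.
label_counts = {
    "a": {("a", "a", "a"): 715, ("a", "a", "b"): 161,
          ("a", "b", "a"): 2406, ("a", "b", "b"): 455,
          ("b", "a", "a"): 290, ("b", "a", "b"): 94,
          ("b", "b", "a"): 1335, ("b", "b", "b"): 231},
    "b": {("a", "a", "a"): 271, ("a", "a", "b"): 469,
          ("a", "b", "a"): 3395, ("a", "b", "b"): 7517,
          ("b", "a", "a"): 272, ("b", "a", "b"): 399,
          ("b", "b", "a"): 6377, ("b", "b", "b"): 12455},
}

# Number of test items carrying each true label, and the test size.
label_totals = {label: sum(counts.values())
                for label, counts in label_counts.items()}
test_size = sum(label_totals.values())

prevalence = {label: Fraction(total, test_size)
              for label, total in label_totals.items()}


def label_accuracy(classifier, label):
    """Fraction of `label` items on which `classifier` voted `label`."""
    correct = sum(count for votes, count in label_counts[label].items()
                  if votes[classifier] == label)
    return Fraction(correct, label_totals[label])


def pair_correlation(i, j, label):
    """Covariance of correctness for classifiers i and j on `label` items."""
    both_correct = sum(count for votes, count in label_counts[label].items()
                       if votes[i] == label and votes[j] == label)
    return (Fraction(both_correct, label_totals[label])
            - label_accuracy(i, label) * label_accuracy(j, label))


print(prevalence["a"])              # 5687/36842
print(label_accuracy(0, "a"))       # 3737/5687
print(pair_correlation(0, 1, "a"))  # 273192/32341969
```

Because every quantity is a ratio of integer counts, `Fraction` keeps the whole evaluation exact.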


There should also be two three-way error correlations in a full supervised evaluation. That is for 0.2! Although exact, these numbers are hard for humans to interpret since we prefer decimal numbers.

In [8]:
pprint(supervised_eval_float)

{'accuracies': [{'a': 0.6571127132055565, 'b': 0.625999037072701},
                {'a': 0.22155793915948654, 'b': 0.9547103193708875},
                {'a': 0.8345349041673993, 'b': 0.6689134970309741}],
 'pair_correlations': {'a': {(0, 1): 0.008446981072797392,
                             (0, 2): 0.00041200336318422665,
                             (1, 2): -0.008179001099160041},
                       'b': {(0, 1): 0.0068138225424356005,
                             (0, 2): -0.006156954986200901,
                             (1, 2): 0.0024341821316226785}},
 'prevalence': {'a': 0.15436186960534173, 'b': 0.8456381303946583}}


We can now use these sample statistics of correctness for the three binary classifiers on this test to see if the postulates are obeyed or not.

## The NTQR postulates for N=2, R=2 binary classifiers

In [2]:
from ntqr.r2.postulates import single_binary_classifier_postulate

Whenever we have $N$ classifiers, there will be some postulates that involve all $N$ of them, signalled by their use of $N$-way error correlations. There will also be postulates for $N-1$ of them, and so on down to $N=1$. So let us demonstrate them in reverse order. We start with the single postulate that applies to any single binary classifier.

In [3]:
single_binary_classifier_postulate

P_a*(P_{i, a} - f_{a_i}) - P_b*(P_{i, b} - f_{b_i})
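
One way to see why this expression vanishes is sketched here. It assumes a reading of the symbols that is consistent with the checks in this notebook: $P_a$ and $P_b$ are the label prevalences with $P_a + P_b = 1$, $P_{i,a}$ and $P_{i,b}$ are the label accuracies of classifier $i$, and $f_{a_i}$ and $f_{b_i}$ are its observed voting frequencies with $f_{a_i} + f_{b_i} = 1$. Classifier $i$ votes $a$ either correctly on an $a$ item or incorrectly on a $b$ item, so

$$
f_{a_i} = P_a P_{i,a} + P_b \,(1 - P_{i,b}), \qquad f_{b_i} = 1 - f_{a_i}.
$$

Substituting $f_{b_i} = 1 - f_{a_i}$ into the postulate gives

$$
\begin{aligned}
P_a (P_{i,a} - f_{a_i}) - P_b (P_{i,b} - f_{b_i})
&= P_a P_{i,a} - P_b P_{i,b} - P_a f_{a_i} + P_b (1 - f_{a_i}) \\
&= P_a P_{i,a} + P_b (1 - P_{i,b}) - (P_a + P_b)\, f_{a_i} \\
&= f_{a_i} - f_{a_i} = 0 .
\end{aligned}
$$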

Let's redefine some names to make the checks easier to write out.

In [11]:
pa = supervised_eval_exact['prevalence']['a']
pb = supervised_eval_exact['prevalence']['b']
pprint(pa)
pprint(pb)

Fraction(5687, 36842)
Fraction(31155, 36842)


The single-classifier postulate needs two more ingredients for each of the three classifiers: its label accuracies and its observed voting frequencies.

In [26]:
classifier_accuracies = [
    {label: supervised_eval_exact['accuracies'][classifier][label]
     for label in ('a', 'b')
    } for classifier in range(3)]
pprint(classifier_accuracies)
cla = classifier_accuracies
    

[{'a': Fraction(3737, 5687), 'b': Fraction(6501, 10385)},
 {'a': Fraction(1260, 5687), 'b': Fraction(29744, 31155)},
 {'a': Fraction(4746, 5687), 'b': Fraction(4168, 6231)}]


In [18]:
trio_vote_counts = trio_label_counts.to_TrioVoteCounts()
pprint(trio_vote_counts)
tvc = trio_vote_counts

TrioVoteCounts(vote_counts={('a', 'a', 'a'): 986,
                            ('a', 'a', 'b'): 630,
                            ('a', 'b', 'a'): 5801,
                            ('a', 'b', 'b'): 7972,
                            ('b', 'a', 'a'): 562,
                            ('b', 'a', 'b'): 493,
                            ('b', 'b', 'a'): 7712,
                            ('b', 'b', 'b'): 12686})


In [20]:
classifiers_voting_frequencies = [
    {label: tvc.classifier_label_frequency(classifier, label) 
     for label in ('a', 'b')
    } for classifier in range(3)]
pprint(classifiers_voting_frequencies)
cvf = classifiers_voting_frequencies

[{'a': Fraction(15389, 36842), 'b': Fraction(21453, 36842)},
 {'a': Fraction(2671, 36842), 'b': Fraction(34171, 36842)},
 {'a': Fraction(15061, 36842), 'b': Fraction(21781, 36842)}]
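
The vote counts and voting frequencies are the only quantities here that can be observed without an answer key: they come from summing the by-label counts over the true label. A minimal standard-library sketch of that marginalization follows. It is an illustration, not how `to_TrioVoteCounts` or `classifier_label_frequency` are implemented, and the helper name `voting_frequency` is hypothetical.

```python
from collections import Counter
from fractions import Fraction

# By-label vote counts copied from the data sketch printed earlier.
label_counts = {
    "a": {("a", "a", "a"): 715, ("a", "a", "b"): 161,
          ("a", "b", "a"): 2406, ("a", "b", "b"): 455,
          ("b", "a", "a"): 290, ("b", "a", "b"): 94,
          ("b", "b", "a"): 1335, ("b", "b", "b"): 231},
    "b": {("a", "a", "a"): 271, ("a", "a", "b"): 469,
          ("a", "b", "a"): 3395, ("a", "b", "b"): 7517,
          ("b", "a", "a"): 272, ("b", "a", "b"): 399,
          ("b", "b", "a"): 6377, ("b", "b", "b"): 12455},
}

# Forget the true label: add the counts of each voting pattern across labels.
vote_counts = Counter()
for counts in label_counts.values():
    vote_counts.update(counts)

test_size = sum(vote_counts.values())


def voting_frequency(classifier, label):
    """Fraction of all test items on which `classifier` voted `label`."""
    votes_for_label = sum(count for votes, count in vote_counts.items()
                          if votes[classifier] == label)
    return Fraction(votes_for_label, test_size)


print(vote_counts[("a", "a", "a")])  # 986
print(voting_frequency(0, "a"))      # 15389/36842
```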


In [31]:
# By construction, these observed label voting frequencies must add to one.
[sum(cvf[classifier].values()) for classifier in range(3)]

[Fraction(1, 1), Fraction(1, 1), Fraction(1, 1)]

In [27]:
# Checking the single binary classifier postulate on all three classifiers
[ (pa * (cla[classifier]['a'] - cvf[classifier]['a']) - 
   pb * (cla[classifier]['b'] - cvf[classifier]['b']))
 for classifier in range(3)]

[Fraction(0, 1), Fraction(0, 1), Fraction(0, 1)]

### Simple demonstration of the utility of exact over inexact calculations for logic

How well are the postulates obeyed when we use inexact float computations?

In [28]:
# Checking the single binary classifier postulate on all three classifiers
# but now we use floats

[ (float(pa) * (float(cla[classifier]['a']) - float(cvf[classifier]['a'])) - 
   float(pb) * (float(cla[classifier]['b']) - float(cvf[classifier]['b'])))
 for classifier in range(3)]

[3.469446951953614e-17, 5.551115123125783e-17, -4.163336342344337e-17]
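
The practical consequence is that a logical test of the form "is this exactly zero?" only makes sense with exact arithmetic. With floats, the best one can do is compare against a tolerance. The sketch below contrasts the two for classifier 0, using the exact statistics printed above. The `postulate` helper and the tolerance `1e-12` are illustrative choices, not part of `ntqr`.

```python
import math
from fractions import Fraction

# Exact statistics for classifier 0, copied from the outputs above.
pa, pb = Fraction(5687, 36842), Fraction(31155, 36842)
acc_a, acc_b = Fraction(3737, 5687), Fraction(6501, 10385)
freq_a, freq_b = Fraction(15389, 36842), Fraction(21453, 36842)


def postulate(pa, pb, acc_a, acc_b, freq_a, freq_b):
    """Single binary classifier postulate; works for Fractions or floats."""
    return pa * (acc_a - freq_a) - pb * (acc_b - freq_b)


stats = (pa, pb, acc_a, acc_b, freq_a, freq_b)
exact_residue = postulate(*stats)
float_residue = postulate(*map(float, stats))

# Exact arithmetic supports a strict equality test.
print(exact_residue == 0)  # True

# Float arithmetic needs a tolerance, which has to be chosen by hand.
print(math.isclose(float_residue, 0.0, abs_tol=1e-12))  # True
```

The tolerance is an extra parameter that the exact computation does not need, which is the point of preferring `Fraction` for logical checks.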

In [32]:
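# Caution: `1 - pb` equals P_a, not P_b, so the expression below uses P_a in
# both terms rather than substituting P_b = 1 - P_a. The large residues come
# from that swap, not from float rounding.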

[ (float(pa) * (float(cla[classifier]['a']) - float(cvf[classifier]['a'])) - 
   float(1-pb) * (float(cla[classifier]['b']) - float(cvf[classifier]['b'])))
 for classifier in range(3)]

[0.03020991947924493, 0.01880900348260213, 0.053721315513076]