# The Logic of Unsupervised Evaluation: A simple demonstration

Unsupervised evaluation is the problem of testing noisy decision makers using unlabeled data. It is an instance of the principal/agent monitoring paradox. A job has been delegated to an agent - whether robotic or human - but we know the agent is not perfect in its decisions. In some settings, this imperfection is unsafe and must be detected to prevent costly future failures. The paradox is that we then have to "do the work twice": check the very job we delegated to the agent so as not to do it ourselves!

We hear this often nowadays with LLMs - they are great until they are not. They speed up our work, but now we have to do extra work to check that they are doing their tasks correctly. NTQR logic aims to mitigate this monitoring paradox by increasing the reliability of unsupervised evaluation.

The utility of NTQR logic comes from three properties:

1. It is **universal**: The solution for error-independent classifiers in the code applies to all domains as long as its sole assumption is satisfied - error independence on the test. More importantly, the postulates in `ntqr.r2.postulates` are **always** true for any group of classifiers in **any** domain. The logic achieves this by being completely algebraic in the observed decisions of the classifiers. It has no other parameters to tune or set to produce its outputs.
2. It is **complete**: Universal postulates have been recognized before, most notably by Platanios et al. in the **agreement equations**. What most ML researchers and engineers have not recognized is that there is a complete set of postulates that explains **any** unsupervised evaluation of classifiers. The agreement equations are but a small subset of that complete set. The power of completeness for safety is that it can provide ironclad proof that a given evaluation cannot possibly be correct.
3. It allows you to create **self-alarming** evaluation algorithms: The finite nature of any test for classifiers restricts all possible values of the sample statistics to a finite set of rational numbers inside the unit cube. Since the algebraic algorithms of NTQR logic solve polynomial equations in the unknown statistics of correctness for the classifiers, they always produce **algebraic** numbers. The class of **algebraic** numbers also includes **irrational** numbers. Using this, one can prove that if you evaluate a set of three binary classifiers under the independence assumption and the prevalence estimates contain irrational numbers, the classifiers **are** error correlated on the test. We know of no other evaluation method on unlabeled data that can signal the failure of its own independence assumption.
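
The self-alarm idea in the third property can be illustrated with a toy sketch. It is not the `ntqr` API: the function names and the quadratic coefficients below are hypothetical, and we simply assume, for illustration, that a prevalence estimate arises as a root of a quadratic with integer coefficients. A true sample prevalence on a test of `test_size` items must equal `k / test_size` for some integer `k`, so a root that is irrational cannot be a valid sample statistic.

```python
from fractions import Fraction
from math import isqrt
from typing import List, Optional


def rational_roots(a: int, b: int, c: int) -> Optional[List[Fraction]]:
    """Roots of a*x**2 + b*x + c = 0 as exact fractions.

    Returns None when the equation has no rational root, i.e. when the
    discriminant is negative or is not a perfect square. This helper is
    hypothetical and only illustrates the rational/irrational distinction.
    """
    if a == 0:
        raise ValueError("not a quadratic: a must be non-zero")
    disc = b * b - 4 * a * c
    if disc < 0:
        return None  # no real roots at all
    root = isqrt(disc)
    if root * root != disc:
        return None  # real but irrational roots: the "alarm" case
    return sorted({Fraction(-b - root, 2 * a), Fraction(-b + root, 2 * a)})


def is_valid_sample_statistic(x: Fraction, test_size: int) -> bool:
    """True when x equals k / test_size for some integer 0 <= k <= test_size."""
    return 0 <= x <= 1 and (x * test_size).denominator == 1


# Hypothetical coefficients: 6x^2 - 5x + 1 = 0 has rational roots 1/3 and 1/2.
consistent = rational_roots(6, -5, 1)

# Hypothetical coefficients: 8x^2 - 8x + 1 = 0 has discriminant 32, which is
# not a perfect square, so its roots are irrational and the alarm is raised.
alarm = rational_roots(8, -8, 1) is None
```

Even a rational root can be checked further against the test size: `1/3` is a possible prevalence on a test of 12 items but not on a test of 10 items, which is what `is_valid_sample_statistic` tests.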

## A simple demonstration of their universality