# Pass^K Tutorial

This tutorial is based on a metric called "Pass^K", proposed in this [arXiv paper](https://arxiv.org/pdf/2406.12045). The authors developed the metric to expose how reliably an AI agent performs across multiple attempts, because succeeding once matters less than succeeding consistently. They state:

> For real-world agent tasks requiring reliability and consistency like
customer service, we propose a new metric – pass^k (pass hat k), defined as the chance that all k
i.i.d. task trials are successful, averaged across tasks. 

In their paper, they define "Pass^K" with the equation:

$$
 \text{pass\textasciicircum k} = \mathbb{E}_{\text{task}}\left[ \binom{c}{k} \middle/ \binom{n}{k} \right]
$$

Where the variables in the formula are as follows:
* `c` is the number of successful trials for a task
* `n` is the total number of trials run for each task
  * Note: if this feels weird, you're not alone. I struggled to see why they wouldn't make this "k", and then define n as the current trial you're on...
* `k` is the number of trials that must all succeed, matching the "all k i.i.d. task trials" in the quote above (so `k` can be at most `n`)
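
To make the formula concrete, here is a minimal sketch of the per-task term $\binom{c}{k} / \binom{n}{k}$ for a single task. The function name `pass_hat_k_single_task` and the example trial results are assumptions made for illustration; they come from neither the paper nor the tutorial data. The brute-force check enumerates every way of picking `k` of the `n` trials and measures the fraction of picks in which every trial passed, which should agree with the closed form.

```python
from itertools import combinations
from math import comb


def pass_hat_k_single_task(c: int, n: int, k: int) -> float:
    """Per-task term of pass^k: comb(c, k) / comb(n, k)."""
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # math.comb returns 0 when k > c, so a task with too few successes scores 0
    return comb(c, k) / comb(n, k)


# Hypothetical results for one task: 10 trials, 6 of them successful
trials = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
n = len(trials)
c = trials.count("pass")
k = 3

closed_form = pass_hat_k_single_task(c, n, k)

# Brute force: fraction of k-sized subsets of trials in which every trial passed
subsets = list(combinations(trials, k))
all_pass = sum(all(t == "pass" for t in subset) for subset in subsets)
brute_force = all_pass / len(subsets)

print(closed_form, brute_force)
```

With 6 successes in 10 trials and `k = 3`, both values come out to 20/120, or roughly 0.167. Averaging this per-task term over all tasks gives the expectation in the formula.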

In this tutorial you will build a function that calculates `pass^k` for a dataframe with any number of trials per task. We're taking the metric out of the context of agents and generalizing it to nondeterministic systems, so we can apply it to any Generative AI application.

## Getting started 
Start by importing the data and requirements. Then learn more about the shape and nature of your data by running the following commands:
* `tutorial_data.info()`
* `tutorial_data.describe()`
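
As a rough sketch of what those commands show, the example below runs them on a small stand-in dataframe. The `sample` frame and its values are hypothetical and only mimic the column layout of the tutorial data; `value_counts()` is an extra command that may be useful for counting passes and fails in one column.

```python
import pandas as pd

# Hypothetical stand-in for tutorial_data, with the same layout but fewer columns
sample = pd.DataFrame(
    {
        "ID": ["task1", "task2", "task3"],
        "Pass1": ["pass", "pass", "fail"],
        "Pass2": ["pass", "fail", "fail"],
    }
)

sample.info()             # column names, dtypes, and non-null counts
print(sample.describe())  # for text columns: count, unique, top, freq

# Pass/fail counts for a single trial column
print(sample["Pass1"].value_counts())
```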

In [3]:
# Import the necessary requirements to work with the data 
from math import comb
import pandas as pd

In [14]:
# load the tutorial data from a CSV file into a dataframe
tutorial_data = pd.read_csv("Data/passHat10_data.csv")

#print the head of the dataframe so you can see it
tutorial_data.head()

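| Unnamed: 0 | ID | Pass1 | Pass2 | Pass3 | Pass4 | Pass5 | Pass6 | Pass7 | Pass8 | Pass9 | Pass10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | task1 | pass | pass | pass | fail | pass | fail | fail | fail | pass | fail |
| 1 | task2 | pass | fail | fail | fail | pass | fail | pass | pass | fail | pass |
| 2 | task3 | pass | pass | pass | fail | fail | fail | fail | pass | fail | fail |
| 3 | task4 | pass | pass | pass | pass | pass | pass | fail | pass | fail | fail |
| 4 | task5 | pass | fail | pass | pass | pass | fail | fail | pass | fail | fail |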


In [None]:
#Cell where you can do some simple summarization to explore the data.

## A brute-force implementation of pass^1

We will start by taking just the first column `Pass1` and calculating the metric for this dataset.

By doing this you will explore:
* extracting data from a data frame into a list
* encoding the mathematical formula for pass^1 in python
* looping through a list to calculate pass^1 for each index
* validating your work

In [15]:
# Turn a column in a dataframe into a list
pass1_values = tutorial_data.Pass1.values.tolist()
print("Values:", pass1_values)

Values: ['pass', 'pass', 'pass', 'pass', 'pass', 'pass', 'pass', 'pass', 'fail', 'fail', 'pass', 'fail', 'pass', 'pass', 'fail', 'pass', 'fail', 'pass', 'fail']


In [16]:
# Encode the mathematical formula into a naive pass^1 function
def passHat1(values: list):
    '''Take a list of "pass"/"fail" values and iterate over it to calculate the pass^1
    value for each item, adding that value to a running total. At the end of the list,
    return the total divided by the length of the list to provide the average/expected value.'''
    total = 0
    for value in values:
        # c is the number of successes in this single trial: 1 for a pass, otherwise 0
        c = 1 if value == "pass" else 0
        # with n = 1 and k = 1 the formula is comb(c, 1) / comb(1, 1)
        total += comb(c, 1) / comb(1, 1)
    return total / len(values)

In [17]:
print(passHat1(pass1_values))

0.6842105263157895
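
One way to validate this result: with `k = 1` and a single trial per task, $\binom{c}{1} / \binom{1}{1}$ is 1 for a pass and 0 for a fail, so pass^1 reduces to the plain fraction of passes. The sketch below recomputes that fraction directly from the `Pass1` values printed above; the variable names `passes` and `pass_rate` are illustrative.

```python
# The Pass1 values printed earlier in this tutorial
pass1_values = ['pass', 'pass', 'pass', 'pass', 'pass', 'pass', 'pass', 'pass',
                'fail', 'fail', 'pass', 'fail', 'pass', 'pass', 'fail', 'pass',
                'fail', 'pass', 'fail']

passes = pass1_values.count("pass")
pass_rate = passes / len(pass1_values)

print(passes, len(pass1_values), pass_rate)
```

Here 13 of the 19 values are passes, and 13/19 is approximately 0.6842, which matches the output of `passHat1`.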


## Preparation for building your pass^K function
Now that you've seen a simple implementation of pass^1, it's time to build your own pass^K function that will traverse a dataframe of any size. To do this, you will need the function to do the following five things:
1. iterate over every task (i.e., row) in the dataframe
2. increase `k` as you iterate over the columns, based on the column it is in (i.e., for `Pass1` k == 1 and for `Pass2` k == 2)
3. sum the successes across a single task to make sure `c` is correct at each step of the iteration
4. use the Python version of the formula, `comb(c,k) / comb(n,k)`, to calculate the metric per data point
5. average across the rows in the dataframe to account for the expected value in the metric

To help you check that your values are correct, I've provided `passK_reference_data.csv`, which includes the correct calculations for the following pass^K values:
* pass^1 == 68.42%
* pass^3 == 61.40%
* pass^6 == 22.02%
* pass^10 == 7.98%

To do this directly in pandas, you will want to use `df.itertuples()`; see the documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.itertuples.html). That page also links to other iteration options you can explore.
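
As a minimal sketch of row-wise iteration with `itertuples()`, the example below counts the successes for each task in a small frame. The `frame` data and the `successes_per_task` dictionary are hypothetical, and this covers only the iteration and counting steps, not the full pass^K calculation. It also assumes the first column is the task ID; the tutorial data has an extra `Unnamed: 0` column that you would need to skip.

```python
import pandas as pd

# Hypothetical three-trial frame; the real tutorial data has ten trial columns
frame = pd.DataFrame(
    {
        "ID": ["task1", "task2"],
        "Pass1": ["pass", "fail"],
        "Pass2": ["pass", "pass"],
        "Pass3": ["fail", "pass"],
    }
)

successes_per_task = {}
# index=False drops the row index so each tuple holds only the column values
for row in frame.itertuples(index=False):
    task_id, *results = row  # first field is the ID, the rest are trial outcomes
    successes_per_task[task_id] = sum(result == "pass" for result in results)

print(successes_per_task)
```

Each task in this toy frame has two passes out of three trials, so both counts are 2.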

In [80]:
#I've provided the variables I used to complete this, with a default value for success
#based on the fact that the tutorial data has "pass" and "fail".
def passHatK(frame,n:int,success="pass"):
    total = 0
    #insert your code here to calculate the pass^K metric
    
    return total/len(frame)

In [75]:
print(passHatK(tutorial_data,1))

0.6842105263157895


In [76]:
print(passHatK(tutorial_data,3))

0.6140350877192984


In [77]:
print(passHatK(tutorial_data,6))

0.2201754385964912


In [78]:
print(passHatK(tutorial_data,10))

0.07976190476190476
