## Testing our performance

As any respectable scientist will tell you: the proof is in the pudding, baby.

Since this is a code-generating agent, we will focus on tests designed specifically for programming.

### HumanEval Leaderboard
[Code Generation on HumanEval](https://paperswithcode.com/sota/code-generation-on-humaneval)

| Rank | Model                  | pass@1 (%) | Paper Title                                                                    | Year      |
|------|------------------------|------------|--------------------------------------------------------------------------------|-----------|
| 1    | Reflexion (GPT-4)      | 91.0       | Reflexion: Language Agents with Verbal Reinforcement Learning                  | 2023      |
| 2    | Parsel (GPT-4 + CodeT) | 85.1       | Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions | 2023      |
| *    | SimpleCoder (GPT-3.5)  | 68.9       | <--- this repo                                                                 | July 2023 |
| 3    | GPT-4 (zero-shot)      | 67.0       | GPT-4 Technical Report                                                         | 2023      |
| ...  | ...                    | ...        | ...                                                                            | ...       |
| 8    | GPT-3.5                | 48.1       |                                                                                | 2023      |

#### Installing HumanEval
Clone and install the test suite:

In [None]:
%%bash
source ../venv/bin/activate
pip install -r ../requirements.txt

In [None]:
import sys
#!source venv/bin/activate
!git clone https://github.com/openai/human-eval
!{sys.executable} -m pip install -e human-eval
sys.path.append('human-eval')
sys.path.append('src')

### Code execution

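For the HumanEval test suite to actually run the generated code, you must uncomment the line in `human_eval/execution.py` that executes it. The `sed` command below strips the leading `# ` from line 58 of that file; adjust the path if your copy of `human-eval` is installed elsewhere.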

#### WARNING: This will cause your machine to execute code generated by the LLM, which may be unsafe. But who cares, this is science!

In [4]:
!sed -i '58s/# //' ../venv/src/human-eval/human_eval/execution.py
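
A hard-coded line number stops working if `execution.py` changes upstream, so the same edit can also be made by matching the line's content. The following is a minimal sketch, not part of this repo: `uncomment_exec` is a hypothetical helper, and it assumes the disabled line contains `exec(check_program, exec_globals)`, which is how it appears in the upstream `human-eval` source as far as we know.

```python
# Assumption: the disabled line looks like "#     exec(check_program, exec_globals)".
EXEC_CALL = "exec(check_program, exec_globals)"


def uncomment_exec(source: str) -> str:
    """Return `source` with the commented-out exec call re-enabled.

    Mirrors `sed 's/# //'` on the matching line: only the first "# " is
    removed, and every other line is left untouched.
    """
    out = []
    for line in source.splitlines(keepends=True):
        if EXEC_CALL in line and line.lstrip().startswith("#"):
            line = line.replace("# ", "", 1)
        out.append(line)
    return "".join(out)
```

To apply it, read the file with `pathlib.Path.read_text`, pass the contents through `uncomment_exec`, and write the result back with `write_text`. Running it twice is harmless, since an already uncommented line no longer starts with `#`.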

#### Take the test
Now let's have our agent take the test and generate responses, which we will grade in the next step.
To run these in a Jupyter notebook, we first need to add the following directories to Python's module search path (`sys.path`). If you did not run this in the setup step above, run it now:

In [None]:
import sys
import os
sys.path.append('human-eval')
sys.path.append('src')
print(sys.path)

Now let's import the file with our test code, and run the test suite:

In [5]:
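import os  # needed below; the setup cell only imports sys
from test_humaneval import generate_samples
C_OUTPUT_FOLDER = 'test-output'
C_TEST_NAME = 'human-eval-output-3.5'
# Top-level await works inside a Jupyter/IPython cell
await generate_samples(output_dir=os.path.join(C_OUTPUT_FOLDER, C_TEST_NAME))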
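
The evaluator used in the next step reads a JSON Lines file in which each line is an object with a `task_id` and a `completion`, as described in the `human-eval` README. How `generate_samples` produces that file is not shown here, so the sketch below only illustrates the expected shape. `write_samples` and the example completion are hypothetical.

```python
import json
import os


def write_samples(samples, output_dir):
    """Write one JSON object per line to <output_dir>/samples.jsonl."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "samples.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")
    return path


# Illustrative entry only; real completions come from the agent.
example_samples = [
    {"task_id": "HumanEval/0", "completion": "    return False\n"},
]
```

Calling `write_samples(example_samples, os.path.join(C_OUTPUT_FOLDER, C_TEST_NAME))` would place the file at the path passed to `evaluate_functional_correctness` later on.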


### Evaluating Performance

In [9]:
!pwd
!evaluate_functional_correctness {C_OUTPUT_FOLDER}/{C_TEST_NAME}/samples.jsonl

/home/space/Documents/projects/ai-code-builder/git-molecul-ai/src
Reading samples...
164it [00:00, 7410.91it/s]
Running test suites...
100%|████████████████████████████████████████| 164/164 [00:01<00:00, 116.42it/s]
Writing results to test-output/human-eval-output-3.5/samples.jsonl_results.jsonl...
100%|██████████████████████████████████████| 164/164 [00:00<00:00, 30527.04it/s]
{'pass@1': 0.6890243902439024}
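
The reported `pass@1` of 0.689 is the 68.9 shown in the leaderboard table. As a rough guide to where the number comes from, the sketch below uses the unbiased pass@k estimator from the HumanEval paper, 1 - C(n-c, k) / C(n, k), where n is the number of samples per task and c is how many of them pass. It assumes one sample per task, and the count of 113 passing tasks is inferred from the printed score (113/164), not read from the results file.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one task: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over tasks; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Assumed: one sample per task, 113 of 164 tasks passing (inferred from the score).
results = [(1, 1)] * 113 + [(1, 0)] * 51
score = mean_pass_at_k(results, k=1)
print(f"pass@1 = {score:.3f}")
```

With a single sample per task, the estimator reduces to the plain fraction of tasks whose tests pass.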
