A flexible and easily adaptable tool for testing the reliability of Large Language Models at code modification. The code provided here lets you test the modification of TikZ files with GPT-3.5 Turbo.
The code was designed and tested with Python 3.10.12. Its dependencies are installed automatically the first time you run it.
To use the code, first modify the benchs.yaml file. Example:

```yaml
parameters:
  - prompt: "your_prompt_here"
    tikz: "../Ressources/Tikz/your_file.tex"
    perf_tikz: "../Ressources/Tikz/your_perfect_file.tex"
```
Then execute main.py without any arguments.
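For reference, a benchmark entry like the one above can be parsed with PyYAML. This is only a sketch of what the loading step might look like; the variable and key names below mirror the example, but the actual parsing code in main.py has not been shown here.

```python
import yaml  # PyYAML, one of the auto-installed dependencies

# Inline copy of the example benchs.yaml structure from above.
BENCHS_YAML = """
parameters:
  - prompt: "your_prompt_here"
    tikz: "../Ressources/Tikz/your_file.tex"
    perf_tikz: "../Ressources/Tikz/your_perfect_file.tex"
"""

config = yaml.safe_load(BENCHS_YAML)
# Each entry in "parameters" describes one benchmark run.
benchmarks = config["parameters"]
```

Each dictionary in `benchmarks` then carries the `prompt`, `tikz`, and `perf_tikz` fields that main.py passes on to caller.py.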
You can modify the various code parameters by editing the config.yaml file. For example, you can change the LLM temperature:

```yaml
# Default
temperature: 0.2
# Modified
temperature: 0.8
```

Note that you need to set up your OpenAI API key as an environment variable named OPENAI_API_KEY (see this guide for more information).
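A quick way to verify the key is configured before a run is to read it from the environment. The helper below is illustrative, not part of the repository's code:

```python
import os

def check_api_key(env=os.environ):
    """Return the OPENAI_API_KEY value, or None if it is not set.

    Call this before launching a benchmark to fail fast with a clear
    message instead of an opaque OpenAI authentication error.
    """
    return env.get("OPENAI_API_KEY")
```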
The file main.py reads the config.yaml file and installs all the dependencies if this is the first time the program is used.
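A first-run dependency bootstrap of this kind is commonly implemented by checking whether each package can be imported and invoking pip otherwise. The function below is a sketch under that assumption; the actual mechanism in main.py may differ:

```python
import importlib.util
import subprocess
import sys

def ensure_package(name: str) -> None:
    """Install `name` with pip if it cannot be imported yet.

    importlib.util.find_spec returns None when the package is absent,
    so already-installed packages are skipped on later runs.
    """
    if importlib.util.find_spec(name) is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", name])
```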
Several benchmarks can be listed in the benchs.yaml file (see Usage/Examples).
For each benchmark, main.py will call caller.py with the parameters specific to that benchmark.
caller.py then compiles the TikZ file ("tikz" in benchs.yaml), calls an LLM with the prompt ("prompt" in benchs.yaml) passed as a parameter, compiles the LLM's output, and computes the difference between that result and the ground truth ("perf_tikz" in benchs.yaml).
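The comparison step can be pictured as a similarity score between the generated file and the reference. The snippet below sketches a plain-text line diff with the standard-library `difflib`; the real pipeline may well compare compiled output instead, so treat both the approach and the function name as assumptions:

```python
import difflib

def diff_ratio(generated: str, ground_truth: str) -> float:
    """Similarity between two TikZ sources, from 0.0 to 1.0.

    Compares line by line with SequenceMatcher; 1.0 means the files
    are identical. A text diff is only a proxy for visual equality
    of the compiled figures.
    """
    return difflib.SequenceMatcher(
        None, generated.splitlines(), ground_truth.splitlines()
    ).ratio()
```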
The results are stored in a YAML file in the Resources/Results folder so that they can be reused later for statistical analysis.
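Serializing a result record to YAML keeps it both human-readable and easy to reload for later statistics. A minimal sketch, assuming PyYAML and an illustrative record shape (the actual fields written by the tool are not specified here):

```python
import yaml

def serialize_result(result: dict) -> str:
    """Render one benchmark result as YAML text.

    The caller is expected to write this string into the
    Resources/Results folder; safe_dump keeps the output free of
    arbitrary Python object tags, so it can be reloaded with safe_load.
    """
    return yaml.safe_dump(result)
```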