
☁️ CloudEval-YAML

🤔 Considering integrating LLMs into your cloud application but unsure about their performance? Check this out 👇

[Benchmark results table]

📊 The table above is generated with CloudEval-YAML, a practical Large Language Model (LLM) benchmark specifically designed for cloud-native application configuration generation. It comprises a comprehensive dataset of 1011 problems covering widely deployed applications including Kubernetes, Envoy, and Istio. The benchmark is unique in its practical and realistic approach: each problem features a hand-written question context, a reference YAML file, and a unit test script. It can be used to compare different LLMs as well as different sampling and prompt engineering techniques. 📄 Please refer to the following technical report for more detailed information: CloudEval-YAML: A Practical Benchmark for Cloud Native YAML Configuration Generation.
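
Conceptually, each benchmark problem bundles those three pieces. The sketch below is purely illustrative of that structure; the field names are ours, not the repository's actual data layout.

    from dataclasses import dataclass

    @dataclass
    class Problem:
        """Illustrative only; not the repository's actual data layout."""
        question: str        # hand-written question context given to the LLM
        reference_yaml: str  # reference YAML solution used by the text-similarity metrics
        unit_test: str       # test script that checks the behavior of the generated YAML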

🚀 Quick Start

This quick start benchmarks GPT-3.5 with all metrics except the unit tests.

📋 Prerequisites:

Install the Python prerequisites

pip install -r requirements.txt

🔑 Configure OpenAI API Key

export OPENAI_API_KEY='YOUR_API_KEY'

🏃 Run the Benchmark

python benchmark.py

Sample output from the terminal:

------ Problem: Kubernetes_pod_q1 ------
bleu score: 0.69 / 1
edit_distance score: 0.57 / 1
exact_match score: 0.00 / 1
kv_match score: 0.00 / 1
kv_wildcard score: 0.73 / 1

(more problems...)

------ Category: Kubernetes_pod ------
bleu score: 30.44 / 48
edit_distance score: 23.60 / 48
exact_match score: 4.00 / 48
kv_match score: 10.00 / 48
kv_wildcard score: 28.83 / 48

(more categories...)

------ Overall --- Model: gpt-3.5 ------
Dataset: original (337 problems)
bleu score: 211.66 / 337
edit_distance score: 174.34 / 337
exact_match score: 29.00 / 337
kv_match score: 58.00 / 337
kv_wildcard score: 212.39 / 337

(repeat above: more datasets...)

====== Overall === Model: gpt-3.5 ======
Dataset: original, simplified, translated (1011 problems)
bleu score: 0.631
edit_distance score: 0.522
exact_match score: 0.071
kv_match score: 0.154
kv_wildcard score: 0.605
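
Per-problem scores are normalized to [0, 1]; the category and dataset blocks report their sums, and the final overall block reports averages across all 1011 problems. As rough intuition for the text-similarity metrics, the sketch below approximates an exact-match score and an edit-distance-style similarity using only Python's standard library; it is not the benchmark's actual scoring code, which lives in this repository.

    import difflib

    def exact_match(generated: str, reference: str) -> float:
        """Illustrative: 1.0 only when the generated YAML matches the reference verbatim."""
        return 1.0 if generated.strip() == reference.strip() else 0.0

    def edit_similarity(generated: str, reference: str) -> float:
        """Illustrative stand-in for the edit_distance score: character-level
        similarity in [0, 1], where higher means closer to the reference."""
        return difflib.SequenceMatcher(None, generated, reference).ratio()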

By default, all prompts, generated code, and scores are saved in outputs/. You may use them as checkpoints for further analysis.

🎉 Congratulations! You have successfully completed the quick start and benchmarked GPT-3.5 on all metrics except the unit tests.

🤖 Benchmarking Different Models

OpenAI GPT-3.5/4

  • Configure OpenAI API key
    export OPENAI_API_KEY='YOUR_API_KEY'
  • Include GPT-3.5/4 for benchmarking in config.json
    "models": [
      "gpt-3_5",
      "gpt-4"
    ]
  • The model-specific query logic is defined in models/gpt.py. You may customize it if needed; a minimal sketch is shown below. Please refer to the OpenAI API documentation for more details.
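
The sketch below assumes the current openai Python client and the gpt-3.5-turbo model name; the repository's models/gpt.py may differ.

    from openai import OpenAI

    # The client reads OPENAI_API_KEY from the environment.
    client = OpenAI()

    def query_gpt(prompt: str, model: str = "gpt-3.5-turbo") -> str:
        """Send a single prompt and return the raw completion text."""
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content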

Google PaLM 2

  • Configure PaLM API key
    export PALM_API_KEY='YOUR_API_KEY'
  • Include PaLM 2 for benchmarking in config.json
    "models": [
      "palm-2"
    ]
  • The model-specific query logic is defined in models/palm.py. You may customize it if needed; a minimal sketch is shown below. Please refer to the Google API documentation for more details.
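
The sketch below assumes the google.generativeai PaLM text API (since superseded by the Gemini API) and the text-bison-001 model name; the repository's models/palm.py may differ.

    import os
    import google.generativeai as palm

    # Authenticate with the key exported as PALM_API_KEY.
    palm.configure(api_key=os.environ["PALM_API_KEY"])

    def query_palm(prompt: str, model: str = "models/text-bison-001") -> str:
        """Send a single prompt to PaLM 2 and return the generated text."""
        completion = palm.generate_text(model=model, prompt=prompt)
        return completion.result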

Replicate

Replicate is a platform that allows you to run open-source models, or your own, at scale in the cloud. It abstracts away the complexity of deploying models locally and provides a simple query/response API.

  • Configure Replicate API token
    export REPLICATE_API_TOKEN='YOUR_API_TOKEN'
  • Currently, a subset of the supported open-source models is integrated into the benchmark. You may include them for benchmarking in config.json.
    "models": [
      "llama-2-70b-chat",
      "llama-2-13b-chat",
      "llama-2-7b-chat",
      "llama-13b-lora",
      "llama-7b",
      "wizardcoder-34b-v1.0",
      "wizardcoder-15b-v1.0",
      "codellama-13b-instruct",
      "codellama-7b-instruct"
    ]
  • The model-specific query logic is defined in models/replicate.py. You may customize it, integrate more models, or even include your own if needed; a minimal sketch is shown below. Please refer to the Replicate API documentation for more details.
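
The sketch below uses the replicate Python client; the model identifier is only an example (Replicate model references and versions change over time), and the repository's models/replicate.py may differ.

    import replicate

    def query_replicate(prompt: str, model: str = "meta/llama-2-70b-chat") -> str:
        """Send a prompt to a Replicate-hosted model and return the full text."""
        # replicate.run reads REPLICATE_API_TOKEN from the environment and, for
        # language models, streams the output as an iterator of text chunks.
        output = replicate.run(model, input={"prompt": prompt})
        return "".join(output)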

🏃 Run the Unit Tests

The easiest way to run the unit tests is to launch an AWS EC2 instance from the public AMI named cloudyaml_public (ID ami-012306c17129a3b71). Note that a full evaluation takes several hours. Alternatively, you can create a pull request and we can help run it on a cluster.

After launching the EC2 instance, pull this repository and enable unit tests in config.json

"metrics": {
  "unit_test": true
}

Then run the benchmark:

python benchmark.py 

Sample output from the terminal:

------ Problem: Kubernetes_pod_q1 ------
bleu score: 0.69 / 1
edit_distance score: 0.57 / 1
exact_match score: 0.00 / 1
kv_match score: 0.00 / 1
kv_wildcard score: 0.73 / 1
unit_test score: 1.00 / 1

(more problems, categories and datasets...)

====== Overall === Model: gpt-3.5 ======
Dataset: original, simplified, translated (1011 problems)
bleu score: 0.631
edit_distance score: 0.522
exact_match score: 0.071
kv_match score: 0.154
kv_wildcard score: 0.605
unit_test score: 0.418

Please refer to Advanced.md for how to run your own model and other advanced features.

📄 Citation

If you find this benchmark useful, please cite our paper:

@article{xu2023cloudeval,
  title={CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation},
  author={Xu, Yifei and Chen, Yuning and Zhang, Xumiao and Lin, Xianshang and Hu, Pan and Ma, Yunfei and Lu, Songwu and Du, Wan and Mao, Zhuoqing and Zhai, Ennan and others},
  journal={arXiv preprint arXiv:2401.06786},
  year={2023}
}