# Twinkle AI Evaluation Tool


## Chinese Version

### Introduction

本專案為 LLM（Large Language Model）評測框架，採用並行且隨機化測試方法，提供客觀的模型性能分析與穩定性評估，並支援多種常見評測數據集。

GitHub: [Link](https://github.com/ai-twinkle/Eval)

### 支援格式及常見數據集

任何符合以下格式的 .csv、.json、.jsonl 或 .parquet 檔案，內容需包含下列欄位格式（不限於 TMMLU+）：

```markdown
  question,A,B,C,D,answer
```

### 功能特色

- **自動化評測多個檔案**：可批次處理並統一生成評測結果。
- **可自訂評測參數與生成控制**：可設定溫度、top_p 等生成參數。
- **選項隨機排列功能**：避免模型因選項順序產生偏好。
- **Pattern 或 Box 雙模式評測**：支援文字匹配或框選評分邏輯。
- **多次測試平均分析**：設定測試回合數以觀察模型表現穩定性。
- **計算平均正確率與穩定性指標**：量化模型答題準確度與波動程度。
- **紀錄 LLM 推論與統計結果**：用於後續分析模型在各類題型的表現。
- **支援 OpenAI API 格式**：相容於常見的 GPT API 輸入與輸出格式。
- **安全地處理 API 金鑰**：避免金鑰暴露於程式碼或日誌中。
- **API 請求限流控制與自動重試機制**：減少錯誤發生並提高 API 請求成功率。

### Google Colab Example

此 Google Colab 範例由 Google AI GDE 與 APMIC MLOps 工程師 [Simon Liu](https://simonliuyuwei-4ndgcf4.gamma.site/) 製作。

此範例將透過 [Google Gemini Model API](https://aistudio.google.com/welcome) 和 [TMMLUplus](https://huggingface.co/datasets/ikala/tmmluplus) 資料集作為範例，讓大家了解如何使用 Twinkle AI Evaluation Tool。

## English Version

### Introduction

This project is an evaluation framework for Large Language Models (LLMs). It adopts a parallel and randomized testing methodology to provide objective model performance analysis and stability evaluation. It also supports various commonly used benchmark datasets.

GitHub: [Link](https://github.com/ai-twinkle/Eval)

### Supported Formats and Common Datasets

Any `.csv`, `.json`, `.jsonl`, or `.parquet` file matching the following format is supported. The content must include the following column structure (not limited to TMMLU+):

```markdown
  question,A,B,C,D,answer
```
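
Before running an evaluation, it can help to confirm that a file actually has these columns. The following is a minimal sketch using only the standard library; the helper name `missing_columns` is hypothetical and is not part of the Twinkle tool.

```python
import csv
import io

# Column names taken from the format shown above.
REQUIRED_COLUMNS = ["question", "A", "B", "C", "D", "answer"]


def missing_columns(csv_text: str) -> list[str]:
    """Return the required columns that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in present]


sample = "question,A,B,C,D,answer\nWhat is 1 + 1?,1,2,3,4,B\n"
print(missing_columns(sample))                   # → []
print(missing_columns("question,A,B,answer\n"))  # → ['C', 'D']
```

Extra columns are ignored in this sketch; whether the tool itself tolerates them is not confirmed here.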

### Key Features

- **Automated evaluation of multiple files**: Supports batch processing and unified result generation.
- **Customizable evaluation and generation parameters**: Allows setting generation parameters such as `temperature`, `top_p`, etc.
- **Randomized option ordering**: Prevents model bias caused by a fixed order of answer choices.
- **Dual evaluation modes: Pattern and Box**: Scores answers either by text pattern matching or by extracting the option wrapped in `\box{...}`.
- **Multi-round testing with average analysis**: Configure the number of test rounds to assess model performance consistency.
- **Average accuracy and stability metric calculation**: Quantifies model accuracy and variability across test runs.
- **LLM inference and statistics logging**: Enables follow-up analysis of model performance on different question types.
- **Compatible with OpenAI API format**: Supports standard GPT API input and output structures.
- **Secure API key handling**: Prevents exposure of API keys in code or logs.
- **Rate limiting and automatic retry mechanism for API requests**: Minimizes errors and improves request success rates.
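
To illustrate the randomized option ordering feature listed above, here is a minimal sketch of how options could be permuted while keeping the answer key consistent. It is an assumption about the general technique, not the tool's actual implementation, and `shuffle_options` is a hypothetical name.

```python
import random

LABELS = ["A", "B", "C", "D"]


def shuffle_options(row: dict, rng: random.Random) -> dict:
    """Return a copy of `row` with option texts permuted and `answer` remapped."""
    texts = [row[label] for label in LABELS]
    order = list(range(len(LABELS)))
    rng.shuffle(order)  # order[j] = original index now shown under LABELS[j]
    shuffled = {label: texts[i] for label, i in zip(LABELS, order)}
    # The new answer is whichever label now holds the originally correct option.
    new_answer = LABELS[order.index(LABELS.index(row["answer"]))]
    return {"question": row["question"], **shuffled, "answer": new_answer}


row = {"question": "2 + 2 = ?", "A": "3", "B": "4", "C": "5", "D": "6", "answer": "B"}
new_row = shuffle_options(row, random.Random(0))
print(new_row[new_row["answer"]])  # → 4
```

The remapping works on indices rather than option text, so it stays correct even when two options happen to contain the same string.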

### Google Colab Example

This Google Colab example was created by [Simon Liu](https://simonliuyuwei-4ndgcf4.gamma.site/), a Google AI GDE and APMIC MLOps Engineer.

It uses the [Google Gemini Model API](https://aistudio.google.com/welcome) and the [TMMLUplus](https://huggingface.co/datasets/ikala/tmmluplus) dataset to show how to use the Twinkle AI Evaluation Tool.


## Step 1: Clone the project from GitHub
Use `git` to clone the GitHub repository.

In [1]:
!git clone https://github.com/ai-twinkle/Eval.git

Cloning into 'Eval'...
remote: Enumerating objects: 40, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 40 (delta 13), reused 3 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (40/40), 444.00 KiB | 12.33 MiB/s, done.
Resolving deltas: 100% (13/13), done.


## Step 2: Install the required Python packages
Use `pip` to install the Python packages.

In [2]:
!pip install -q datasets

ERROR: pip's dep

In [3]:
!pip install -qr ./Eval/requirements.txt


## Step 3: Prepare your evaluation dataset

Here, we use the `datasets` Python package to download the `accounting` subset of TMMLUplus and save its validation split as a CSV file.

In [4]:
from datasets import load_dataset

dataset_accounting = load_dataset("ikala/tmmluplus", "accounting")
dataset_accounting

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['question', 'A', 'B', 'C', 'D', 'answer'],
        num_rows: 5
    })
    validation: Dataset({
        features: ['question', 'A', 'B', 'C', 'D', 'answer'],
        num_rows: 21
    })
    test: Dataset({
        features: ['question', 'A', 'B', 'C', 'D', 'answer'],
        num_rows: 191
    })
})

In [5]:
!mkdir -p ./Eval/dataset

In [6]:
dataset_accounting["validation"].to_csv("./Eval/dataset/accounting.csv")

11350

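## Step 4: Configure the model and evaluation parameters

Store your Gemini API key as a Colab secret named `GOOGLE_API_KEY` so the notebook can call the Google Gemini Model API. If the secret is unavailable, paste the key into the fallback field in the cell below.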

In [7]:
from google.colab import userdata

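try:
    GEMINI_API_KEY = userdata.get('GOOGLE_API_KEY')
except Exception:
    # Fallback: paste the key into this form field if the Colab secret is unavailable.
    GEMINI_API_KEY = "Enter your API key here, or add GOOGLE_API_KEY to the Colab Secrets panel on the left." #@param {type:"string"}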
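
Outside Colab, `google.colab.userdata` is not available. As a rough sketch of an alternative, the key could be read from an environment variable and masked before it is ever printed. The helper names `resolve_api_key` and `mask_key` are hypothetical and not part of the tool.

```python
import os


def resolve_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Read the key from an environment variable, failing loudly if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key


def mask_key(key: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters so the key is safe to print."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


print(mask_key("abcd1234efgh"))  # → ********efgh
```

Failing early when the key is missing avoids writing a placeholder string into `config.yaml` and only discovering the problem when the first API request is rejected.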

You can also adjust the parameters below to change how the datasets are evaluated.

In [9]:
config_content = f"""
llm_api:
  base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"
  api_key: "{GEMINI_API_KEY}"
  disable_ssl_verify: false
  api_rate_limit: 0.2
  max_retries: 3
  timeout: 600

model:
  name: "gemini-2.0-flash"
  temperature: 0.0
  top_p: 0.9
  max_tokens: 4096
  frequency_penalty: 0.0
  presence_penalty: 0.0

evaluation:
  dataset_paths:
    - dataset
  evaluation_method: "box"
  system_prompt: |
    使用者將提供一個題目，並附上選項 A、B、C、D
    請仔細閱讀題目要求，根據題意選出最符合的選項，並將選項以以下格式輸出：
    \\box{{選項}}
    請確保僅將選項包含在 {{ }} 中，否則將不計算為有效答案。
    務必精確遵循輸出格式，避免任何多餘內容或錯誤格式。
  repeat_runs: 1
  shuffle_options: true

logging:
  level: "INFO"
"""

with open("./Eval/config.yaml", "w", encoding="utf-8") as f:
    f.write(config_content)
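
Note that inside the f-string, `\\box{{選項}}` is written to `config.yaml` as `\box{選項}`, so the system prompt asks the model to wrap its chosen option in `\box{...}`. A minimal sketch of how such an answer could be parsed is shown below. The regular expression and the choice to take the last match are assumptions for illustration, not the tool's confirmed logic.

```python
import re
from typing import Optional

# Matches \box{X} where X is one of the option letters A-D (assumed format).
BOX_PATTERN = re.compile(r"\\box\{\s*([A-D])\s*\}")


def extract_boxed_answer(response: str) -> Optional[str]:
    """Return the last boxed option letter in `response`, or None if absent."""
    matches = BOX_PATTERN.findall(response)
    return matches[-1] if matches else None


print(extract_boxed_answer(r"The answer is \box{C}"))  # → C
print(extract_boxed_answer("I think it is C"))          # → None
```

A response without a well-formed box yields `None`, which matches the prompt's warning that answers outside the braces are not counted as valid.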

In [10]:
%cd Eval

!python main.py

/content/Eval
掃描目錄： dataset
正在讀取: dataset/accounting.csv
準備題庫中: 100% 21/21 [01:40<00:00,  4.76s/it]
處理回應中: 100% 21/21 [00:00<00:00, 39.89it/s]
✅ 評測完成，結果已儲存至 results/details/eval_results_20250402_0117_run0.jsonl
已執行 100.0% (1/1) 資料集 dataset 評測完成，平均正確率: 42.86% (±0.00%)
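
The final log line reports an average accuracy of 42.86% (±0.00%). With 21 questions and `repeat_runs: 1`, this corresponds to 9 correct answers, and the spread is zero because there is only one run. The sketch below reproduces that arithmetic; using the population standard deviation as the stability metric is an assumption, and `summarize_runs` is a hypothetical helper.

```python
from statistics import mean, pstdev


def summarize_runs(correct_counts: list[int], total: int) -> tuple[float, float]:
    """Return (mean accuracy, population std dev) across runs, as percentages."""
    accuracies = [100.0 * c / total for c in correct_counts]
    return mean(accuracies), pstdev(accuracies)


# One run with 9 of 21 questions answered correctly.
avg, spread = summarize_runs([9], total=21)
print(f"{avg:.2f}% (±{spread:.2f}%)")  # → 42.86% (±0.00%)
```

Setting `repeat_runs` higher than 1 gives the spread term something to measure, which is what makes it useful as a stability indicator.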


## Optional: Evaluate the whole dataset

You can use `git` to clone the full dataset from Hugging Face and evaluate every file in it.

In [11]:
!git clone https://huggingface.co/datasets/ikala/tmmluplus

Cloning into 'tmmluplus'...
remote: Enumerating objects: 367, done.
remote: Total 367 (delta 0), reused 0 (delta 0), pack-reused 367 (from 1)
Receiving objects: 100% (367/367), 2.73 MiB | 6.30 MiB/s, done.
Resolving deltas: 100% (96/96), done.


In [12]:
config_content = f"""
llm_api:
  base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"
  api_key: "{GEMINI_API_KEY}"
  disable_ssl_verify: false
  api_rate_limit: 0.2
  max_retries: 3
  timeout: 600

model:
  name: "gemini-2.0-flash"
  temperature: 0.0
  top_p: 0.9
  max_tokens: 4096
  frequency_penalty: 0.0
  presence_penalty: 0.0

evaluation:
  dataset_paths:
    - tmmluplus/data/
  evaluation_method: "box"
  system_prompt: |
    使用者將提供一個題目，並附上選項 A、B、C、D
    請仔細閱讀題目要求，根據題意選出最符合的選項，並將選項以以下格式輸出：
    \\box{{選項}}
    請確保僅將選項包含在 {{ }} 中，否則將不計算為有效答案。
    務必精確遵循輸出格式，避免任何多餘內容或錯誤格式。
  repeat_runs: 1
  shuffle_options: true

logging:
  level: "INFO"
"""

with open("./config.yaml", "w", encoding="utf-8") as f:
    f.write(config_content)
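
The `max_retries: 3` setting in this config suggests that failed API requests are retried a bounded number of times. As an illustration of the general idea only, here is a sketch of retrying with exponential backoff. The backoff schedule and the helper `call_with_retry` are assumptions; how the tool actually spaces its retries, and how it interprets `api_rate_limit`, is not confirmed here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    request: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise AssertionError("unreachable")


# Usage: a stand-in request that fails twice before succeeding.
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(call_with_retry(flaky, sleep=lambda s: None))  # → ok
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; in real use the default `time.sleep` would apply.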

In [13]:
!python main.py

掃描目錄： tmmluplus/data/
正在讀取: tmmluplus/data/business_management_dev.csv
準備題庫中: 100% 5/5 [00:20<00:00,  4.00s/it]
處理回應中: 100% 5/5 [00:00<00:00,  9.70it/s]
✅ 評測完成，結果已儲存至 results/details/eval_results_20250402_0120_run0.jsonl
已執行 0.4% (1/267) 正在讀取: tmmluplus/data/dentistry_train.csv
準備題庫中:  80% 4/5 [00:15<00:04,  4.09s/it]Exception ignored in: <generator object tqdm.__iter__ at 0x7f86d78e6770>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/tqdm/std.py", line 1196, in __iter__
    self.close()
  File "/usr/local/lib/python3.11/dist-packages/tqdm/std.py", line 1275, in close
    self._decr_instances(self)
  File "/usr/local/lib/python3.11/dist-packages/tqdm/std.py", line 696, in _decr_instances
    with cls._lock:
  File "/usr/local/lib/python3.11/dist-packages/tqdm/std.py", line 110, in __enter__
    def __enter__(self):

KeyboardInterrupt: 
Traceback (most recent call last):
  File "/content/Eval/main.py", line 45, in <module>
    result = evaluate_file(c