## Run the KLUE benchmarks



In [1]:
%pwd

'/usr/local/google/home/thekim/github/aimldl/llm_benchmarks_asian_langs'

### KLUE MRC (Machine Reading Comprehension) Task

In [2]:
%cd klue_mrc

/usr/local/google/home/thekim/github/aimldl/llm_benchmarks_asian_langs/klue_mrc


In [3]:
%ls

ABOUT_KLUE_MRC.md   get_errors.sh*              run*
README.md           install_dependencies.sh*    setup.sh*
TROUBLESHOOTING.md  klue_mrc-gemini2_5flash.py  test_logging.sh*
VERTEX_AI_SETUP.md  logs/                       test_setup.py
benchmark_results/  requirements.txt            verify_scripts.sh*
eval_dataset/       result_analysis/


### Set up

Run the following command to log in to your Google Cloud account.
```bash
$ gcloud auth login
```

In [5]:
!./setup.sh full

[INFO] Checking prerequisites...
[SUCCESS] Prerequisites check passed
[INFO] Installing Python dependencies...
[SUCCESS] Dependencies installed successfully!
[INFO] Testing the setup...
KLUE MRC Benchmark Setup Test (Vertex AI)
Testing package imports...
✓ google.cloud.aiplatform
✓ vertexai
✓ datasets
✓ pandas
✓ tqdm
✓ huggingface_hub
✓ google.auth

✅ All packages imported successfully!

Testing environment variables...
✓ GOOGLE_CLOUD_PROJECT: vertex-workbench-notebook
⚠ GOOGLE_APPLICATION_CREDENTIALS: Not set (using default credentials)

Testing KLUE MRC dataset loading...
✓ KLUE mrc dataset for MRC loaded successfully
  - Train samples: 17554
  - Validation samples: 5841
  - Sample from validation set:
    - Title: BMW 코리아, 창립 25주년 기념 ‘BMW 코리아 25주년 에디션’ 한정 출시
    - Context: BMW 코리아(대표 한상윤)는 창립 25주년을 기념하는 ‘BMW 코리아 25주년 에디션’을 한정 출시한다고 밝혔다. 이번 BMW 코리아 25주년 에디션(이하 25주년 에디션)은 B...
    - Question: 말라카이트에서 나온 색깔을 사용한 에디션은?
    - Answers
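
The dataset check above can also be reproduced by hand. The following is a minimal sketch, assuming the Hugging Face `datasets` package and the public `klue` dataset with its `mrc` configuration. `load_klue_mrc` and `preview` are hypothetical helper names and are not taken from `test_setup.py`.

```python
def load_klue_mrc():
    """Download KLUE MRC from the Hugging Face Hub (needs network access)."""
    from datasets import load_dataset  # lazy import; assumes `datasets` is installed
    return load_dataset("klue", "mrc")


def preview(sample, max_context_chars=100):
    """Format one MRC record in the style of the setup test output."""
    context = sample["context"]
    if len(context) > max_context_chars:
        context = context[:max_context_chars] + "..."
    return "\n".join([
        f"- Title: {sample['title']}",
        f"- Context: {context}",
        f"- Question: {sample['question']}",
        f"- Answers: {sample['answers']['text']}",
    ])
```

With network access, `ds = load_klue_mrc()` followed by `print(len(ds["train"]), len(ds["validation"]))` should report the split sizes shown in the log, and `print(preview(ds["validation"][0]))` prints one record.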

### Test-Run

In [6]:
!./run test

Running small test with 10 samples...
2025-07-14 07:47:55,375 - INFO - Initialized Vertex AI with project: vertex-workbench-notebook, location: us-central1
2025-07-14 07:47:55,375 - INFO - Model name set to: gemini-2.5-flash
2025-07-14 07:47:55,375 - INFO - Loading KLUE MRC dataset for machine reading comprehension...
2025-07-14 07:48:06,218 - INFO - Preparing to load a subset of 10 samples.
2025-07-14 07:48:06,219 - INFO - Reached sample limit of 10. Halting data loading.
2025-07-14 07:48:06,219 - INFO - ✅ Successfully loaded 10 samples.
2025-07-14 07:48:06,219 - INFO - Starting benchmark...
project_id: vertex-workbench-notebook
Processing samples:   0%|          | 0/10 [00:00<?, ?it/s]2025-07-14 07:48:06,220 - INFO - AFC is enabled with max remote calls: 10.
2025-07-14 07:48:09,135 - INFO - HTTP Request: POST https://us-central1-aiplatform.googleapis.com/v1beta1/projects/vertex-workbench-notebook/locations/us-central1/publishers/google/models/gemini-2.5-flash:generateContent "HTTP/1.
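
The `AFC is enabled` message and the `generateContent` request in the log are consistent with the Google Gen AI SDK (`google-genai`) talking to Vertex AI. As a rough sketch of what a single request might look like under that assumption, the helpers below send one prompt and return the reply text. `make_client` and `ask` are hypothetical names, and the actual code in `klue_mrc-gemini2_5flash.py` may differ.

```python
def make_client(project, location="us-central1"):
    """Create a Vertex AI client (needs Google Cloud credentials)."""
    from google import genai  # lazy import; assumes `google-genai` is installed
    return genai.Client(vertexai=True, project=project, location=location)


def ask(client, prompt, model="gemini-2.5-flash"):
    """Send one prompt and return the stripped text of the reply."""
    response = client.models.generate_content(model=model, contents=prompt)
    # `response.text` can be None when the model returns no text part.
    return (response.text or "").strip()
```

Passing the client in as an argument keeps `ask` easy to exercise with a stub object, without network access or credentials.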

### Test it with More Samples (after fixing errors)

In [11]:
!./run custom 50

Running custom benchmark with 50 samples...
2025-07-14 15:46:17,008 - INFO - Initialized Vertex AI with project: vertex-workbench-notebook, location: us-central1
2025-07-14 15:46:17,008 - INFO - Model name set to: gemini-2.5-flash
2025-07-14 15:46:17,008 - INFO - Loading KLUE MRC dataset for machine reading comprehension...
2025-07-14 15:46:30,908 - INFO - Preparing to load a subset of 50 samples.
2025-07-14 15:46:30,912 - INFO - Reached sample limit of 50. Halting data loading.
2025-07-14 15:46:30,912 - INFO - ✅ Successfully loaded 50 samples.
2025-07-14 15:46:30,913 - INFO - Starting benchmark...
project_id: vertex-workbench-notebook
Processing samples:   0%|          | 0/50 [00:00<?, ?it/s]2025-07-14 15:46:30,913 - INFO - AFC is enabled with max remote calls: 10.
2025-07-14 15:46:33,962 - INFO - HTTP Request: POST https://us-central1-aiplatform.googleapis.com/v1beta1/projects/vertex-workbench-notebook/locations/us-central1/publishers/google/models/gemini-2.5-flash:generateContent "H

### Baseline Performance

To evaluate baseline performance, use the `./run full` command. Be aware that this process is time-consuming and may run overnight or for a full workday.

To ensure uninterrupted execution, even if your terminal disconnects, we highly recommend running `./run full` within a **tmux session**.

### `tmux` Session Commands:

* **Create and start a new session:** `tmux new -s klue`
* **Run the command within the session:** `./run full`
* **Detach from the session:** Press `Ctrl+b` then `d`
* **Reattach to the `klue` session:** `tmux attach -t klue`

For more detailed information on `tmux`, refer to [background_processing_with_tmux.md](background_processing_with_tmux.md).

Alternatively, run the process in the background with the **nohup command** so that it keeps running after you log out, for example `nohup ./run full > run_full.log 2>&1 &` (the log file name here is arbitrary).
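
If you prefer to start the long run from Python rather than from a shell, the effect of `nohup` can be approximated with the standard library. This is a sketch and not part of the repository: `launch_detached` and the log file name are hypothetical, and `start_new_session` is POSIX-only.

```python
import subprocess


def launch_detached(cmd, log_path):
    """Start `cmd` in its own session so it survives a closed terminal.

    stdout and stderr are appended to `log_path`. Returns the Popen handle.
    """
    log_file = open(log_path, "ab")
    try:
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log_file,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # detach from the controlling terminal
        )
    finally:
        log_file.close()  # the child keeps its own copy of the file descriptor
```

A call such as `launch_detached(["./run", "full"], "run_full.log")` would start the full benchmark and return immediately. Progress can then be followed by reading the log file.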

In [8]:
# Running the command in the background is recommended.
# Uncomment the following command if you still wish to run it in a cell
#!./run full

### Improve the Prompt for Better Performance

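The previous prompt was not specific enough about requiring short answers, so the model gave verbose responses such as "노르웨이로 파견되었다...." ("was dispatched to Norway") instead of just "노르웨이" ("Norway"). The prompt was revised to state the answer format explicitly.

Specific answer format rules with clear examples were added, where (O) marks an acceptable answer and (X) an unacceptable one:
- 질문: "어디로 보내졌나?" → 답변: "노르웨이" (O), "노르웨이로 파견되었다" (X) (Question: "Where was it sent?" → Answer: "Norway", not "was dispatched to Norway")
- 질문: "가격은?" → 답변: "79달러" (O), "79달러에 팔린다" (X) (Question: "What is the price?" → Answer: "79 dollars", not "It sells for 79 dollars")

Conciseness is emphasized with explicit instructions:
- "답변은 가능한 한 짧고 정확해야 합니다" ("Answers must be as short and accurate as possible")
- "문장을 완성하지 말고 핵심 답만 제공하세요" ("Do not write a full sentence; give only the core answer")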

The instruction section of the prompt changed from
```
지침:
```

to
```
중요한 지침:

1. **정확한 답 찾기**: 질문에 대한 답이 지문에 명확히 나와 있는지 확인하세요.
2. **문맥 이해**: 지문의 전체적인 맥락을 파악하여 정확한 답을 찾으세요.
3. **답의 형태**: 
   - 답이 지문에 있으면: 지문에서 그대로 추출하여 답하세요
   - 답이 지문에 없으면: "답을 찾을 수 없습니다"라고 답하세요
4. **한국어 특성 고려**: 한국어의 문법과 표현을 정확히 이해하여 답하세요.
5. **명확성**: 답은 간결하고 명확해야 합니다.

**답변 형식 규칙:**
- 답변은 가능한 한 짧고 정확해야 합니다
- 문장을 완성하지 말고 핵심 답만 제공하세요
- 예시:
  - 질문: "어디로 보내졌나?" → 답변: "노르웨이" (O), "노르웨이로 파견되었다" (X)
  - 질문: "가격은?" → 답변: "79달러" (O), "79달러에 팔린다" (X)
  - 질문: "물질은?" → 답변: "실리콘유" (O), "무명천" (X)
```
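
To show how instructions like these might be combined with a passage and a question, here is a minimal sketch. `build_prompt`, the `지문`/`질문`/`답변` field labels, and the overall layout are assumptions for illustration. Only the two quoted format rules come from the prompt above, and the template actually used by `klue_mrc-gemini2_5flash.py` is not shown in this notebook.

```python
FORMAT_RULES = (
    "**답변 형식 규칙:**\n"
    "- 답변은 가능한 한 짧고 정확해야 합니다\n"
    "- 문장을 완성하지 말고 핵심 답만 제공하세요\n"
)


def build_prompt(context: str, question: str) -> str:
    """Assemble a prompt: rules first, then the passage and the question."""
    return (
        f"{FORMAT_RULES}\n"
        f"지문: {context.strip()}\n"   # passage (label is hypothetical)
        f"질문: {question.strip()}\n"  # question (label is hypothetical)
        "답변:"                        # answer slot for the model to complete
    )
```

Ending the prompt with an empty answer slot is one common way to nudge a model toward a bare answer rather than a full sentence.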

Results:
- Before the prompt change: Exact Match: 0.7600, F1: 0.8195
- After the prompt change: Exact Match: 0.9000, F1: 0.9000
- Exact Match on answerable questions: 0.9412 (up from 0.7561)
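
To make the two metrics concrete, the sketch below computes Exact Match and a character-level F1 for a single prediction. It is an illustration under assumptions: the whitespace-stripping normalization and the character-level overlap are choices made here, and the scoring code in the benchmark script may normalize and score differently.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Drop all whitespace (an assumed, deliberately simple normalization)."""
    return "".join(text.split())


def exact_match(prediction: str, references: list[str]) -> float:
    """1.0 if the prediction equals any reference after normalization."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))


def char_f1(prediction: str, references: list[str]) -> float:
    """Best character-overlap F1 over all references."""
    pred = Counter(normalize(prediction))
    best = 0.0
    for ref in references:
        gold = Counter(normalize(ref))
        overlap = sum((pred & gold).values())
        if overlap == 0:
            continue
        precision = overlap / sum(pred.values())
        recall = overlap / sum(gold.values())
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


print(exact_match("노르웨이", ["노르웨이"]))                      # 1.0
print(exact_match("노르웨이로 파견되었다", ["노르웨이"]))           # 0.0
print(round(char_f1("노르웨이로 파견되었다", ["노르웨이"]), 4))     # 0.5714
```

Under this scoring, the verbose answer earns no Exact Match credit and only partial F1, which illustrates why a stricter answer format can move both metrics.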

#### Test-run

In [None]:
!./run test

#### Test it with More Samples

In [10]:
!./run custom 100

Running custom benchmark with 100 samples...
2025-07-14 09:07:58,077 - INFO - Initialized Vertex AI with project: vertex-workbench-notebook, location: us-central1
2025-07-14 09:07:58,077 - INFO - Model name set to: gemini-2.5-flash
2025-07-14 09:07:58,077 - INFO - Loading KLUE MRC dataset for machine reading comprehension...
2025-07-14 09:08:12,478 - INFO - Preparing to load a subset of 100 samples.
2025-07-14 09:08:12,486 - INFO - Reached sample limit of 100. Halting data loading.
2025-07-14 09:08:12,486 - INFO - ✅ Successfully loaded 100 samples.
2025-07-14 09:08:12,487 - INFO - Starting benchmark...
project_id: vertex-workbench-notebook
Processing samples:   0%|          | 0/100 [00:00<?, ?it/s]2025-07-14 09:08:12,487 - INFO - AFC is enabled with max remote calls: 10.
2025-07-14 09:08:15,494 - INFO - HTTP Request: POST https://us-central1-aiplatform.googleapis.com/v1beta1/projects/vertex-workbench-notebook/locations/us-central1/publishers/google/models/gemini-2.5-flash:generateConte

### Measure the Improved Performance

```bash
# Reattach to the `klue` session
tmux attach -t klue

# Run the full benchmark inside the tmux session
./run full

# Detach from the session: press Ctrl+b, then d
```