[CI] llama-factory-finetuning / quick-train-llamafactory-lora failed on halo (windows)

This issue was opened automatically by the **Test Playbooks** workflow after the test `quick-train-llamafactory-lora` failed on the `main` branch.

## Failure scope

- **Playbook:** `llama-factory-finetuning`
- **Test id:** `quick-train-llamafactory-lora`
- **Device:** `halo`
- **Operating system:** `windows`
- **Runner labels:** `self-hosted`, `Windows`, `halo`
- **Runner name:** `xsj-aimlab-halo-0`
- **Commit:** `950e710579a751d40d80a40f8c2d323930685d25`
- **Workflow run:** https://github.com/amd/playbooks/actions/runs/27079427346

## Hardware / OS to use to reproduce

Run the failing test on a machine that matches the runner labels above (OS = `windows`, device = `halo`). The repo's self-hosted runners already advertise these labels; if you reproduce locally, use the same OS family and the same AMD device class.

## How to dispatch the same test from CI

Re-run only the failing playbook on the same matrix entry by triggering the workflow with the playbook id:

```bash
gh workflow run test-playbooks.yml --repo amd/playbooks -f playbook_id=llama-factory-finetuning
```

The workflow's matrix narrows down to this `(device, platform)` combination automatically based on the playbook's `tested_platforms`.

## How to run just this test locally

```bash
python .github/scripts/run_playbook_tests.py --playbook llama-factory-finetuning --platform windows --device halo
```

The runner extracts test blocks from `playbooks/*/llama-factory-finetuning/README.md` (the failing block starts around line 490).

## Failing test (verbatim from the README)

- **Setup:** `venv\Scripts\activate`
- **Timeout:** `3600s`

```powershell
Set-Location -Path "LlamaFactory"

Copy-Item -Path "examples/train_lora/qwen3_lora_sft.yaml" -Destination "examples/train_lora/qwen3_lora_sft_ci.yaml"

$filePath = "examples/train_lora/qwen3_lora_sft_ci.yaml"
(Get-Content -Path $filePath) -replace 'lora_rank: 8', 'lora_rank: 6' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'bf16:\s*true', 'fp16: true' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'dataloader_num_workers:\s*4', 'dataloader_num_workers: 0' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'output_dir: .*', 'output_dir: saves/qwen3_lora_sft_ci' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'overwrite_output_dir: false', 'overwrite_output_dir: true' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'per_device_train_batch_size: .*', 'per_device_train_batch_size: 1' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'gradient_accumulation_steps: .*', 'gradient_accumulation_steps: 1' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'num_train_epochs: .*', 'num_train_epochs: 1' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'logging_steps: .*', 'logging_steps: 1' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'save_steps: .*', 'save_steps: 5' | Set-Content -Path $filePath

llamafactory-cli train examples/train_lora/qwen3_lora_sft_ci.yaml
```

## Result

- **Exit code:** `-1`
- **Runner error:** Test timed out after 3600 seconds

---
_This issue is opened and deduplicated by `.github/scripts/create_failure_issues.py`. Close it once the failure is fixed; subsequent failures with the same scope will reopen a fresh issue._

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] llama-factory-finetuning / quick-train-llamafactory-lora failed on halo (windows) #355

Failure scope

Hardware / OS to use to reproduce

How to dispatch the same test from CI

How to run just this test locally

Failing test (verbatim from the README)

Result

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[CI] llama-factory-finetuning / quick-train-llamafactory-lora failed on halo (windows) #355

Description

Failure scope

Hardware / OS to use to reproduce

How to dispatch the same test from CI

How to run just this test locally

Failing test (verbatim from the README)

Result

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions