[CI] llama-factory-finetuning / quick-train-llamafactory-lora failed on halo (windows) #355

@github-actions

This issue was opened automatically by the Test Playbooks workflow after the test quick-train-llamafactory-lora failed on the main branch.

Failure scope

  • Playbook: llama-factory-finetuning
  • Test id: quick-train-llamafactory-lora
  • Device: halo
  • Operating system: windows
  • Runner labels: self-hosted, Windows, halo
  • Runner name: xsj-aimlab-halo-0
  • Commit: 950e710579a751d40d80a40f8c2d323930685d25
  • Workflow run: https://github.com/amd/playbooks/actions/runs/27079427346

Hardware / OS to use to reproduce

Run the failing test on a machine that matches the runner labels above (OS = windows, device = halo). The repo's self-hosted runners already advertise these labels; if you reproduce locally, use the same OS family and the same AMD device class.

How to dispatch the same test from CI

Re-run only the failing playbook on the same matrix entry by triggering the workflow with the playbook id:

gh workflow run test-playbooks.yml --repo amd/playbooks -f playbook_id=llama-factory-finetuning

The workflow's matrix narrows down to this (device, platform) combination automatically based on the playbook's tested_platforms.
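The narrowing can be pictured as expanding the playbook's declared platforms into matrix entries. This is only a sketch under stated assumptions, not the workflow's actual code: the `{device: [platforms]}` layout assumed for `tested_platforms`, the `build_matrix` helper, and the sample data are all hypothetical.

```python
from typing import Dict, List


def build_matrix(tested_platforms: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Expand a {device: [platform, ...]} mapping into matrix entries.

    Hypothetical helper: the real layout of tested_platforms may differ.
    """
    return [
        {"device": device, "platform": platform}
        for device, platforms in sorted(tested_platforms.items())
        for platform in sorted(platforms)
    ]


# Hypothetical metadata for a playbook tested on a single combination.
example = {"halo": ["windows"]}
matrix = build_matrix(example)
print(matrix)
```

With a playbook that declares one device and one platform, the matrix collapses to the single entry that failed here.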

How to run just this test locally

python .github/scripts/run_playbook_tests.py --playbook llama-factory-finetuning --platform windows --device halo

The runner extracts test blocks from playbooks/*/llama-factory-finetuning/README.md (the failing block starts around line 490).

Failing test (verbatim from the README)

  • Setup: venv\Scripts\activate
  • Timeout: 3600s

Set-Location -Path "LlamaFactory"

Copy-Item -Path "examples/train_lora/qwen3_lora_sft.yaml" -Destination "examples/train_lora/qwen3_lora_sft_ci.yaml"

$filePath = "examples/train_lora/qwen3_lora_sft_ci.yaml"
(Get-Content -Path $filePath) -replace 'lora_rank: 8', 'lora_rank: 6' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'bf16:\s*true', 'fp16: true' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'dataloader_num_workers:\s*4', 'dataloader_num_workers: 0' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'output_dir: .*', 'output_dir: saves/qwen3_lora_sft_ci' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'overwrite_output_dir: false', 'overwrite_output_dir: true' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'per_device_train_batch_size: .*', 'per_device_train_batch_size: 1' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'gradient_accumulation_steps: .*', 'gradient_accumulation_steps: 1' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'num_train_epochs: .*', 'num_train_epochs: 1' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'logging_steps: .*', 'logging_steps: 1' | Set-Content -Path $filePath
(Get-Content -Path $filePath) -replace 'save_steps: .*', 'save_steps: 5' | Set-Content -Path $filePath

llamafactory-cli train examples/train_lora/qwen3_lora_sft_ci.yaml
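To inspect the config these edits produce without starting a training run, the same substitutions can be replayed in Python. This is a rough equivalent, not part of the playbook: the patterns are copied from the test above, while `apply_edits` and the sample YAML are hypothetical. It mirrors two PowerShell behaviours: `-replace` is a case-insensitive regex replace, and `Get-Content` yields lines, so each pattern is applied per line.

```python
import re

# (pattern, replacement) pairs copied from the PowerShell test above.
EDITS = [
    (r"lora_rank: 8", "lora_rank: 6"),
    (r"bf16:\s*true", "fp16: true"),
    (r"dataloader_num_workers:\s*4", "dataloader_num_workers: 0"),
    (r"output_dir: .*", "output_dir: saves/qwen3_lora_sft_ci"),
    (r"overwrite_output_dir: false", "overwrite_output_dir: true"),
    (r"per_device_train_batch_size: .*", "per_device_train_batch_size: 1"),
    (r"gradient_accumulation_steps: .*", "gradient_accumulation_steps: 1"),
    (r"num_train_epochs: .*", "num_train_epochs: 1"),
    (r"logging_steps: .*", "logging_steps: 1"),
    (r"save_steps: .*", "save_steps: 5"),
]


def apply_edits(text: str) -> str:
    """Apply every edit to every line, in order, ignoring case."""
    lines = text.splitlines()
    for pattern, replacement in EDITS:
        lines = [
            re.sub(pattern, replacement, line, flags=re.IGNORECASE)
            for line in lines
        ]
    return "\n".join(lines) + "\n"


# Hypothetical sample, NOT the real qwen3_lora_sft.yaml.
sample = (
    "lora_rank: 8\n"
    "output_dir: saves/example\n"
    "overwrite_output_dir: false\n"
    "bf16: true\n"
    "save_steps: 500\n"
)
result = apply_edits(sample)
print(result)
```

One thing the replay makes visible: the unanchored pattern `output_dir: .*` also matches the tail of any `overwrite_output_dir:` line, so that line is rewritten before its own edit runs. Whether this affects the real file depends on the keys present in the upstream YAML, and nothing here establishes it as the cause of the timeout.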

Result

  • Exit code: -1
  • Runner error: Test timed out after 3600 seconds
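An exit code of -1 paired with a timeout message is what a wrapper typically reports when it kills the process itself rather than observing a real exit status. A minimal sketch of that pattern, assuming the runner uses something like `subprocess.run` with a timeout; `run_with_timeout` and the -1 convention are illustrative, not taken from `run_playbook_tests.py`.

```python
import subprocess
import sys
from typing import List, Tuple


def run_with_timeout(cmd: List[str], timeout_s: float) -> Tuple[int, str]:
    """Run cmd and return (exit_code, error_message).

    Reports -1 when the timeout is hit (assumed convention).
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return -1, f"Test timed out after {timeout_s:g} seconds"
    return completed.returncode, ""


# A child that sleeps longer than the limit triggers the timeout path.
code, error = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5
)
print(code, error)
```

Under this reading, the result above says only that training did not finish within 3600 seconds; it does not indicate whether the process was progressing slowly or was hung.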

This issue is opened and deduplicated by .github/scripts/create_failure_issues.py. Close it once the failure is fixed; subsequent failures with the same scope will then open a fresh issue.
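Deduplication by scope can be sketched as deriving one title per (playbook, test id, device, operating system) and skipping creation while an open issue with that title exists. The title format below follows this issue's own title; the helpers and the title-matching strategy are assumptions, not the actual logic of `create_failure_issues.py`.

```python
from typing import Iterable


def issue_title(playbook: str, test_id: str, device: str, os_name: str) -> str:
    """Build a title in the same shape as this issue's title."""
    return f"[CI] {playbook} / {test_id} failed on {device} ({os_name})"


def should_open_issue(title: str, open_titles: Iterable[str]) -> bool:
    """Hypothetical rule: open a new issue only if no open issue has this title."""
    return title not in set(open_titles)


title = issue_title(
    "llama-factory-finetuning", "quick-train-llamafactory-lora", "halo", "windows"
)
print(title, should_open_issue(title, [title]))
```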
