# Fine-tune using fms-hf-tuning with Alpaca Dataset

This example demonstrates how to fine-tune an LLM on the Alpaca dataset with fms-hf-tuning, driven through the Kubeflow Trainer SDK's `BuiltinTrainer` in the same way torchtune is.

The demo works by creating a new training runtime for fms-hf-tuning (listed below as `fms-hf-tuning-runtime`) and converting the torchtune arguments into fms-hf-tuning arguments.
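
The argument conversion described above can be pictured as a small mapping from torchtune-style settings to fms-hf-tuning command-line flags. The following is only a minimal sketch under stated assumptions, not the runtime's actual conversion code: `TuneLikeConfig` and `to_fms_args` are hypothetical names, the `/workspace/...` paths are placeholders, and only `model_name_or_path` and `torch_dtype` are confirmed by the trainer log in this notebook; the remaining flag names are assumed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TuneLikeConfig:
    """Hypothetical stand-in for torchtune-style settings (not an SDK class)."""

    model_path: str
    dataset_path: str
    output_dir: str = "/workspace/output"
    dtype: Optional[str] = None
    epochs: Optional[int] = None
    batch_size: Optional[int] = None


def to_fms_args(cfg: TuneLikeConfig) -> List[str]:
    """Translate the settings into a flat list of CLI-style arguments.

    Flag names other than --model_name_or_path and --torch_dtype are
    assumptions made for illustration.
    """
    pairs = [
        ("--model_name_or_path", cfg.model_path),
        ("--training_data_path", cfg.dataset_path),
        ("--output_dir", cfg.output_dir),
        ("--torch_dtype", cfg.dtype),
        ("--num_train_epochs", cfg.epochs),
        ("--per_device_train_batch_size", cfg.batch_size),
    ]
    args: List[str] = []
    for flag, value in pairs:
        # Unset options are skipped so the tool's own defaults apply.
        if value is not None:
            args += [flag, str(value)]
    return args


if __name__ == "__main__":
    example = TuneLikeConfig(
        model_path="/workspace/model",      # placeholder path
        dataset_path="/workspace/dataset",  # placeholder path
        dtype="bfloat16",
    )
    print(to_fms_args(example))
```

Skipping unset values keeps the generated command short and leaves every unspecified option at whatever default the training tool chooses.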

Granite 4.0 350M: https://huggingface.co/ibm-granite/granite-4.0-350m/

Alpaca Dataset: https://huggingface.co/datasets/tatsu-lab/alpaca

## List the Training Runtimes

In [11]:
# List all available Kubeflow Training Runtimes.
from kubeflow.trainer import *
from kubeflow_trainer_api import models

client = TrainerClient()
for runtime in client.list_runtimes():
    print(runtime)

Runtime(name='deepspeed-distributed', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='deepspeed', image='ghcr.io/kubeflow/trainer/deepspeed-runtime', num_nodes=1, device='Unknown', device_count='1'), pretrained_model=None)
Runtime(name='fms-hf-tuning-runtime', trainer=RuntimeTrainer(trainer_type=<TrainerType.BUILTIN_TRAINER: 'BuiltinTrainer'>, framework='torchtune', image='fms-hf-tuning:test', num_nodes=1, device='gpu', device_count='2.0'), pretrained_model=None)
Runtime(name='mlx-distributed', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='mlx', image='ghcr.io/kubeflow/trainer/mlx-runtime', num_nodes=1, device='Unknown', device_count='1'), pretrained_model=None)
Runtime(name='torch-distributed', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='torch', image='fms-hf-tuning:test', num_nodes=1, device='Unknown', device_count='Unknown'), pretrained_mode

## Create a PVC for Models and Datasets

In [12]:
# Create a PersistentVolumeClaim for the fms-hf-tuning runtime.
client.backend.core_api.create_namespaced_persistent_volume_claim(
    namespace="default",
    body=models.IoK8sApiCoreV1PersistentVolumeClaim(
        apiVersion="v1",
        kind="PersistentVolumeClaim",
        metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta(
            name="fms-hf-tuning-pvc"
        ),
        spec=models.IoK8sApiCoreV1PersistentVolumeClaimSpec(
            accessModes=["ReadWriteOnce"],
            resources=models.IoK8sApiCoreV1VolumeResourceRequirements(
                requests={
                    "storage": models.IoK8sApimachineryPkgApiResourceQuantity("10Gi")
                }
            ),
        ),
    ).to_dict(),
)

{'api_version': 'v1',
 'kind': 'PersistentVolumeClaim',
 'metadata': {'annotations': None,
              'creation_timestamp': datetime.datetime(2025, 11, 12, 21, 45, 7, tzinfo=tzutc()),
              'deletion_grace_period_seconds': None,
              'deletion_timestamp': None,
              'finalizers': ['kubernetes.io/pvc-protection'],
              'generate_name': None,
              'generation': None,
              'labels': None,
              'managed_fields': [{'api_version': 'v1',
                                  'fields_type': 'FieldsV1',
                                  'fields_v1': {'f:spec': {'f:accessModes': {},
                                                           'f:resources': {'f:requests': {'.': {},
                                                                                          'f:storage': {}}},
                                                           'f:volumeMode': {}}},
                                  'manager': 'OpenAPI-Generator',
    

## Bootstrap LLM Fine-tuning Workflow

The Kubeflow TrainJob trains the model using the referenced (Cluster)TrainingRuntime.

In [13]:
job_name = client.train(
    runtime=client.get_runtime(name="fms-hf-tuning-runtime"),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data"
        ),
        model=HuggingFaceModelInitializer(
            storage_uri="hf://ibm-granite/granite-4.0-350M",
        )
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET, split="train[:1000]"
            ),
            resources_per_node={
                "memory": "4G",
                "gpu": 1,
            },
        )
    )
)
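
The `storage_uri` values above use an `hf://` scheme. The initializer logs later in this notebook report the dataset as `tatsu-lab/alpaca`, which suggests that the first two path segments name the Hugging Face repository and anything after them is a path inside it. The sketch below illustrates that reading only; `HfUri` and `split_hf_uri` are hypothetical helpers, not part of the Kubeflow SDK, and the SDK's own parsing may differ.

```python
from typing import NamedTuple


class HfUri(NamedTuple):
    """Hypothetical container for the two parts of an hf:// URI."""

    repo_id: str
    subpath: str


def split_hf_uri(storage_uri: str) -> HfUri:
    """Split 'hf://<org>/<repo>[/<subpath>]' into repo id and subpath."""
    prefix = "hf://"
    if not storage_uri.startswith(prefix):
        raise ValueError(f"expected an hf:// URI, got {storage_uri!r}")
    # Drop empty segments so a trailing slash does not matter.
    parts = [p for p in storage_uri[len(prefix):].split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected hf://<org>/<repo>, got {storage_uri!r}")
    return HfUri(repo_id="/".join(parts[:2]), subpath="/".join(parts[2:]))


if __name__ == "__main__":
    print(split_hf_uri("hf://tatsu-lab/alpaca/data"))
    print(split_hf_uri("hf://ibm-granite/granite-4.0-350M"))
```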

## Wait for the Running Status

In [14]:
# Wait for the TrainJob to reach the Running status.
client.wait_for_job_status(name=job_name, status={"Running"})

TrainJob(name='h871cd09097a', runtime=Runtime(name='fms-hf-tuning-runtime', trainer=RuntimeTrainer(trainer_type=<TrainerType.BUILTIN_TRAINER: 'BuiltinTrainer'>, framework='torchtune', image='fms-hf-tuning:test', num_nodes=1, device='gpu', device_count='2.0'), pretrained_model=None), steps=[Step(name='dataset-initializer', status='Succeeded', pod_name='h871cd09097a-dataset-initializer-0-0-tkh4s', device='Unknown', device_count='Unknown'), Step(name='model-initializer', status='Succeeded', pod_name='h871cd09097a-model-initializer-0-0-rjpv7', device='Unknown', device_count='Unknown'), Step(name='node-0', status='Running', pod_name='h871cd09097a-node-0-0-mmzr2', device='gpu', device_count='1')], num_nodes=1, creation_timestamp=datetime.datetime(2025, 11, 12, 21, 45, 51, tzinfo=TzInfo(0)), status='Running')

## Watch the TrainJob Logs

### Dataset Initializer

In [15]:
from kubeflow.trainer.constants import constants

for line in client.get_job_logs(job_name, follow=True, step=constants.DATASET_INITIALIZER):
    print(line)

2025-11-12T21:45:55Z INFO     [__main__.py:18] Starting dataset initialization
2025-11-12T21:45:55Z INFO     [huggingface.py:28] Downloading dataset: tatsu-lab/alpaca
2025-11-12T21:45:55Z INFO     [huggingface.py:29] ----------------------------------------
Fetching 3 files: 100%|██████████| 3/3 [00:00<00:00,  3.10it/s]
2025-11-12T21:45:57Z INFO     [huggingface.py:41] Dataset has been downloaded


### Model Initializer

In [16]:
for line in client.get_job_logs(job_name, follow=True, step=constants.MODEL_INITIALIZER):
    print(line)

2025-11-12T21:45:55Z INFO     [__main__.py:17] Starting pre-trained model initialization
2025-11-12T21:45:55Z INFO     [huggingface.py:26] Downloading model: ibm-granite/granite-4.0-350M
2025-11-12T21:45:55Z INFO     [huggingface.py:27] ----------------------------------------
Fetching 8 files: 100%|██████████| 8/8 [00:15<00:00,  1.96s/it]
2025-11-12T21:46:12Z INFO     [huggingface.py:43] Model has been downloaded


### Trainer Node

In [17]:
for c in client.get_job(name=job_name).steps:
    print(f"Step: {c.name}, Status: {c.status}, Devices: {c.device} x {c.device_count}\n")

for line in client.get_job_logs(job_name, follow=True):
    print(line)

Step: dataset-initializer, Status: Succeeded, Devices: Unknown x Unknown

Step: model-initializer, Status: Succeeded, Devices: Unknown x Unknown

Step: node-0, Status: Running, Devices: gpu x 1

Rank-0 [INFO]:sft_trainer.py:main: fms-hf-tuning execution start
Rank-0 [INFO]:sft_trainer.py:main: 
---------------------------- Model Arguments -----------------------
  embedding_size_multiple_of : 1
  flash_attn_implementation  : flash_attention_2
  model_name_or_path         : ibm-granite/granite-4.0-350M
  tokenizer_name_or_path     : None
  torch_dtype                : torch.bfloat16
  use_flash_attn             : False
---------------------------- Data Arguments -----------------------
  add_special_tokens         : None
  chat_template              : None
  data_config_path           : None
  data_formatter_template    : None
  dataset_conversation_field : None
  dataset_image_field        : None
  dataset_text_field         : text
  do_dataprocessing_only     : False
  instruction_tem

In [18]:
# Delete the TrainJob once you are done with it.
client.delete_job(name=job_name)

## Get the Fine-tuned Model

After the trainer node completes the fine-tuning task, the fine-tuned model is stored in the `/workspace/output` directory, which can be shared across Pods by mounting the PVC. If you mount the PVC under `/<mountDir>` in another Pod, the model is available in that Pod's `/<mountDir>/output` directory.
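
To check what the training run wrote, you could list the files under the mounted output directory from inside the other Pod. This is a minimal sketch using only the standard library; `list_model_files` is a hypothetical helper, and the mount directory passed to it is whatever path you chose for `/<mountDir>`.

```python
from pathlib import Path
from typing import List


def list_model_files(mount_dir: str) -> List[str]:
    """Return the files under <mount_dir>/output, relative to that directory.

    An empty list is returned when the output directory does not exist,
    for example when the PVC is not mounted at the given path.
    """
    output_dir = Path(mount_dir) / "output"
    if not output_dir.is_dir():
        return []
    return sorted(
        p.relative_to(output_dir).as_posix()
        for p in output_dir.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    # "/mnt/pvc" is a placeholder; substitute your own mount directory.
    for name in list_model_files("/mnt/pvc"):
        print(name)
```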