From 914aa539378928240804e4009ea1d2574fe195e7 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Fri, 29 May 2026 03:58:28 +0000
Subject: [PATCH] docs: add MLOps-with-coding-agents how-to

Follow-on to the on-prem coding agent guide. Covers four agent-driven
MLOps workflows: managing InferenceService and LLMInferenceService
resources, configuring authentication and rate limiting on Envoy AI
Gateway, an iterative agent-driven performance tuning loop, and reusable
templates for fine-tuning plans and post-run reports. Links to the
existing fine-tuning paths (Workbench Notebook, Training Hub, Kubeflow
Trainer v2, LLM Compressor) and to the Envoy AI Gateway install doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../how_to/mlops_with_coding_agents.mdx       | 266 ++++++++++++++++++
 1 file changed, 266 insertions(+)
 create mode 100644 docs/en/model_inference/inference_service/how_to/mlops_with_coding_agents.mdx

diff --git a/docs/en/model_inference/inference_service/how_to/mlops_with_coding_agents.mdx b/docs/en/model_inference/inference_service/how_to/mlops_with_coding_agents.mdx
new file mode 100644
index 0000000..044bfed
--- /dev/null
+++ b/docs/en/model_inference/inference_service/how_to/mlops_with_coding_agents.mdx
@@ -0,0 +1,266 @@
+---
+weight: 18
+i18n:
+  title:
+    en: Run MLOps with Coding Agents and On-Premise LLMs
+    zh: 使用编码智能体与本地部署 LLM 开展 MLOps
+---
+
+# Run MLOps with Coding Agents and On-Premise LLMs
+
+## Introduction
+
+Once a coding agent is wired to a self-hosted model on Alauda AI (see [Use Coding Agents with On-Premise Inference Services](./coding_agents_with_inference_service.mdx)), the same agent can drive day-to-day MLOps on the platform. Because both the model and the operations target the same cluster, prompts, manifests, training data references, and benchmark results never leave your environment — which is what makes self-hosted agents attractive for regulated work.
+
+This document describes four workflows where a coding agent is most useful:
+
+- Authoring and managing `InferenceService` and `LLMInferenceService` resources.
+- Configuring the inference traffic gateway — authentication and rate limits via [Alauda Build of Envoy AI Gateway](../../../envoy_ai_gateway/intro.mdx).
+- Iteratively tuning an inference service's performance to fit specific hardware.
+- Planning fine-tuning runs and generating structured reports from their results.
+
+This guide assumes the agent is already running and can reach an on-premise OpenAI-compatible endpoint with **tool calling** enabled. If not, start with the guide linked above.
+
+:::warning
+A coding agent that can run `kubectl` against a real cluster can also delete things. Scope its kubeconfig to a single namespace, prefer `--dry-run=server` for any apply during exploration, and require a human review of every change before it lands in production. Treat the agent like a junior engineer with cluster access, not an autonomous operator.
+:::
+
+## Set up the agent's working environment \{#set-up-environment}
+
+Before delegating MLOps work, give the agent a small, reliable context to operate in. Three things are almost always worth doing once per project:
+
+1. **Scope cluster access.** Create a dedicated namespace (for example, `mlops-demo-ai-test` used in the platform samples) and a `ServiceAccount` / kubeconfig with permissions limited to the resources the agent should touch — typically `InferenceService`, `LLMInferenceService`, `TrainJob`, `TrainingRuntime`, `AIGatewayRoute`, `AIServiceBackend`, `BackendSecurityPolicy`, `SecurityPolicy`, `BackendTrafficPolicy`, and the secrets/configmaps they reference. Avoid cluster-wide write access.
+2. **Pin a default hardware profile.** Platform Hardware Profiles encode the GPU type, taints, tolerations, and node selectors for your fleet. Pick the right profile up front and tell the agent to use it — this prevents the agent from inventing affinity blocks. See [Hardware Profiles](../../../infrastructure_management/hardware_profile/intro.mdx).
+3. **Commit an agent context file.** Most coding agents read a project-level instructions file (for example, `AGENTS.md`, `CLAUDE.md`, or `opencode.md`). Use it to record the cluster name, target namespace, the on-prem model endpoint, naming conventions, "always run `kubectl apply --dry-run=server` first", and any internal links the agent should follow. Once this file exists, every subsequent prompt becomes shorter and more accurate.
+
+## Manage InferenceServices and LLMInferenceServices \{#manage-inference-services}
+
+The platform supports two related resources for serving models:
+
+- **`InferenceService`** (`serving.kserve.io/v1beta1`) — the standard KServe predictor used in [Create Inference Service using CLI](./create_inference_service_cli.mdx). Best for single-container model servers (vLLM, Triton, custom runtimes).
+- **`LLMInferenceService`** — KServe's higher-level LLM resource for multi-component LLM serving (orchestrating predictors, optional prefill/decode disaggregation, and gateway/inference-extension integration). It is recognized by platform features such as Hardware Profiles, which mention it alongside `InferenceService` (see [Hardware Profiles](../../../infrastructure_management/hardware_profile/intro.mdx)). Use it when a single-container `InferenceService` is no longer enough.
+
+A good agent loop for either resource is the same:
+
+```text
+draft YAML  →  kubectl apply --dry-run=server  →  apply  →  poll status  →  smoke test  →  iterate
+```
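+
+The loop above can be pinned down as an ordered list of commands that the agent (or a wrapper script) executes. The following is a minimal sketch, not a platform tool: the function name `loop_commands` and the file and object names are hypothetical, and it assumes the resource reports a `Ready` condition that `kubectl wait` can observe.
+
+```python
+import shlex
+
+
+def loop_commands(manifest: str, namespace: str, resource: str) -> list[list[str]]:
+    """Return the kubectl invocations for one pass of the loop, in order.
+
+    `resource` is a kubectl TYPE/NAME reference such as "inferenceservice/qwen-2".
+    """
+    kubectl = ["kubectl", "--namespace", namespace]
+    return [
+        # Server-side validation only; nothing is persisted.
+        kubectl + ["apply", "--dry-run=server", "-f", manifest],
+        # Show the change against the live object for human review.
+        kubectl + ["diff", "-f", manifest],
+        # The real apply, run only after the review.
+        kubectl + ["apply", "-f", manifest],
+        # Poll status until the object is Ready or the timeout expires.
+        kubectl + ["wait", "--for=condition=Ready", resource, "--timeout=600s"],
+    ]
+
+
+if __name__ == "__main__":
+    # Hypothetical file and object names, for illustration only.
+    for argv in loop_commands("qwen-2.yaml", "mlops-demo-ai-test", "inferenceservice/qwen-2"):
+        print(shlex.join(argv))
+```
+
+Note that `kubectl diff` exits with status 1 when it finds differences, so a wrapper should not treat that exit code as a failure. The smoke test and the iterate step stay outside this sketch because they depend on the model being served.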
+
+Useful prompts to start from:
+
+- "Generate an `InferenceService` for model `Qwen2.5-Coder-7B-Instruct` using the `aml-vllm` runtime, hardware profile `single-a30-24g`, namespace `mlops-demo-ai-test`. Enable prefix caching and tool calling with the `hermes` parser. Run `kubectl apply --dry-run=server` to validate it, then `kubectl diff` to show me the changes against any existing object before applying."
+- "Convert this `InferenceService` to an `LLMInferenceService` for prefill/decode disaggregation; keep the same model, hardware profile, and served-model name. Show me what changes and why."
+- "List all `InferenceService` and `LLMInferenceService` objects in `mlops-demo-ai-test`, their `READY` status, and the model each one serves. Flag any that have been `NotReady` for more than 10 minutes and summarize the most recent predictor pod events."
+
+For the YAML fields and platform-specific labels/annotations the agent needs to reproduce, point it at [Create Inference Service using CLI](./create_inference_service_cli.mdx) as the canonical example. For exposing a new service externally, point it at [Configure External Access for Inference Services](./external_access_inference_service.mdx).
+
+## Manage gateways: authentication and rate limits \{#manage-gateways}
+
+Alauda Build of Envoy AI Gateway is a required dependency of Alauda Build of KServe and fronts inference traffic with an OpenAI-compatible API surface, AI-aware routing, and per-model policies (see [Envoy AI Gateway introduction](../../../envoy_ai_gateway/intro.mdx) and [installation](../../../envoy_ai_gateway/install.mdx)). The agent is well-suited to authoring the gateway's custom resources, which are otherwise verbose to write by hand:
+
+| Concern | CRD / Resource | Where it comes from |
+|---|---|---|
+| Route requests to one or more model backends | `AIGatewayRoute`, `AIServiceBackend` | Envoy AI Gateway |
+| Authenticate the **client** (downstream): API key, JWT, OIDC | `SecurityPolicy` | Envoy Gateway |
+| Authenticate to the **upstream** model (when chaining to a hosted provider) | `BackendSecurityPolicy` | Envoy AI Gateway |
+| Per-route or per-model rate limiting and token-budget enforcement | `BackendTrafficPolicy` (global rate limit) or `AIGatewayRoute` token-rate-limit settings | Envoy Gateway / Envoy AI Gateway |
+| TLS termination, observability | Standard `Gateway` / `HTTPRoute` and Envoy Gateway features | Envoy Gateway |
+
+A practical agent workflow:
+
+1. **Tell the agent your intent in business terms.** For example: "Expose `qwen-2` and `llama-3-70b` behind one OpenAI-compatible endpoint at `https://ai.example.internal`. Require an `Authorization: Bearer` API key from a Kubernetes `Secret` named `ai-gateway-keys`. Limit each key to 60 requests/minute and 200k tokens/hour. Send `qwen-2` traffic to the `qwen-2` `InferenceService` in `mlops-demo-ai-test` and `llama-3-70b` to the `LLMInferenceService` of the same name."
+2. **Have the agent draft the manifests** in a directory under your infra repo, one file per resource, with comments calling out each policy decision.
+3. **Validate before applying.** Ask the agent to run `kubectl apply --dry-run=server -f ./gateway/` and `kubectl diff -f ./gateway/`, and to summarize what would change. Apply only after you review.
+4. **Smoke-test the new policies.** Have the agent send a valid request, an unauthenticated request, and a request that exceeds the rate limit, and confirm the expected 200 / 401 / 429 responses. Capture the test as a small script alongside the manifests so future changes can be re-verified.
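+
+The smoke test in step 4 can be captured as a small script like the one below. This is a sketch under assumptions: the case names (`valid_key`, `no_key`, `over_limit`) and both function names are hypothetical, the request body and headers are left to the caller, and the over-limit case assumes the caller has already sent enough requests to exhaust the configured budget.
+
+```python
+import urllib.error
+import urllib.request
+
+# Expected HTTP status per smoke-test case (case names are illustrative).
+EXPECTED = {"valid_key": 200, "no_key": 401, "over_limit": 429}
+
+
+def http_status(url: str, headers: dict[str, str], body: bytes) -> int:
+    """POST once and return the HTTP status code, including 4xx/5xx."""
+    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
+    try:
+        with urllib.request.urlopen(req, timeout=30) as resp:
+            return resp.status
+    except urllib.error.HTTPError as err:
+        # urllib raises on 4xx/5xx; the code is what the smoke test wants.
+        return err.code
+
+
+def check(observed: dict[str, int], expected: dict[str, int] = EXPECTED) -> list[str]:
+    """Return one message per mismatching case; an empty list means pass."""
+    failures = []
+    for case, want in expected.items():
+        got = observed.get(case)
+        if got != want:
+            failures.append(f"{case}: expected {want}, got {got}")
+    return failures
+```
+
+A wrapper would call `http_status` once per case against the gateway URL, collect the results into a dictionary, and fail the run if `check` returns anything. Keeping `check` free of network calls makes the expectation table easy to review alongside the manifests.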
+
+For the exact field shape of each resource, defer to the upstream documentation listed under References. Versions change, and the agent should read the live spec rather than invent fields.
+
+## Tune service performance to fit your hardware \{#tune-performance}
+
+The vLLM and KServe knobs are the ones listed in [Best practices: tune inference service performance](./coding_agents_with_inference_service.mdx#best-practices). This section focuses on how an agent can *drive* that tuning instead of you doing it by hand.
+
+A productive loop:
+
+<Steps>
+
+### 1. Define service-level objectives
+
+Pin numbers before tuning. Tell the agent what "good enough" looks like:
+
+- Maximum time to first token (TTFT) at the expected concurrency.
+- Maximum P95 inter-token latency or total response time for a representative prompt.
+- Minimum sustainable throughput (requests/min or tokens/sec).
+- Maximum context length the agent traffic will send.
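+
+Writing the objectives down as data lets the agent check each benchmark run mechanically. The sketch below is illustrative only: the metric names are hypothetical, and it uses the nearest-rank definition of a percentile, which is one of several common conventions.
+
+```python
+import math
+
+
+def percentile(samples: list[float], pct: float) -> float:
+    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
+    if not samples:
+        raise ValueError("no samples")
+    ordered = sorted(samples)
+    rank = max(1, math.ceil(pct * len(ordered) / 100))
+    return ordered[rank - 1]
+
+
+def slo_violations(
+    measured: dict[str, float],
+    max_limits: dict[str, float],
+    min_limits: dict[str, float],
+) -> list[str]:
+    """Compare measured metrics against upper bounds (latency) and lower bounds (throughput)."""
+    problems = []
+    for name, limit in max_limits.items():
+        if measured[name] > limit:
+            problems.append(f"{name}={measured[name]} exceeds max {limit}")
+    for name, limit in min_limits.items():
+        if measured[name] < limit:
+            problems.append(f"{name}={measured[name]} below min {limit}")
+    return problems
+
+
+if __name__ == "__main__":
+    # Made-up latency samples in milliseconds, for illustration only.
+    ttft_ms = [180.0, 210.0, 190.0, 650.0, 200.0]
+    measured = {"ttft_p95_ms": percentile(ttft_ms, 95), "tokens_per_sec": 900.0}
+    print(slo_violations(measured, {"ttft_p95_ms": 500.0}, {"tokens_per_sec": 800.0}))
+```
+
+An empty result means every objective is met, which is also the stop condition for the tuning loop.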
+
+### 2. Generate a reproducible benchmark
+
+Ask the agent to write a small benchmark script that mirrors your real traffic — typical prompt size, system prompt, concurrency. Useful starting points include the built-in `vllm bench serve` command, `genai-perf`, or a `k6`/Python script that drives `/v1/chat/completions` directly. Have the agent run it against the current `InferenceService` and record the results in a markdown table.
+
+### 3. Have the agent propose one change at a time
+
+Give the agent the benchmark output and the current YAML. Ask for **one** change with an expected effect, for example:
+
+- "Add `--enable-prefix-caching` and re-run; expected: lower TTFT on the repeated system-prompt prefix."
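+- "Switch the model from FP16 to an AWQ INT4 build and re-run; expected: more KV cache headroom, larger sustainable context length."
+- "Raise `--gpu-memory-utilization` to 0.92 and re-run; expected: a slightly larger KV cache, with less memory headroom left on the GPU."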
+- "Increase `--max-num-seqs`; expected: higher throughput at the cost of higher P95 latency."
+
+One change per iteration keeps cause and effect attributable.
+
+### 4. Apply, measure, and record
+
+The agent updates the `InferenceService` YAML, applies it, waits for `READY`, re-runs the benchmark, and appends a new row to the results table with the configuration delta.
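+
+The results table can be kept consistent by generating each row from the same helper. This is a minimal sketch, assuming hypothetical column names and one decimal place of precision; neither is prescribed by the platform.
+
+```python
+def table_header(columns: list[str]) -> str:
+    """Return the markdown header and separator rows for the results table."""
+    head = ["iter", "change"] + columns
+    return "| " + " | ".join(head) + " |\n|" + "---|" * len(head)
+
+
+def results_row(iteration: int, change: str, metrics: dict[str, float], columns: list[str]) -> str:
+    """Return one markdown row: iteration number, configuration delta, then metrics in column order."""
+    cells = [str(iteration), change] + [f"{metrics[c]:.1f}" for c in columns]
+    return "| " + " | ".join(cells) + " |"
+
+
+if __name__ == "__main__":
+    # Made-up numbers, for illustration only.
+    cols = ["ttft_p95_ms", "tokens_per_sec"]
+    print(table_header(cols))
+    print(results_row(0, "baseline", {"ttft_p95_ms": 412.0, "tokens_per_sec": 950.3}, cols))
+```
+
+Recording the configuration delta in the `change` column, rather than the full configuration, keeps each row attributable to a single knob.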
+
+### 5. Stop on SLO or hardware ceiling
+
+The loop ends when SLOs are met, or when the next sensible knob is "different hardware" or "different model" — at which point the agent should say so explicitly rather than churn. Common ceilings: KV cache saturated at the target context length, tensor-parallel scaling no longer linear, decode-bound at single-request latency.
+
+</Steps>
+
+For model-size vs. GPU-memory selection, see the table in the prior doc's [Choose a model that fits your hardware](./coding_agents_with_inference_service.mdx#best-practices) section. For autoscaling and cold-start trade-offs, see [Configure Scaling for Inference Services](./autoscale_settings.mdx). For interactive-latency wins, see [Speculative Decoding for vLLM Inference Services](./vllm_speculative_decoding.mdx).
+
+## Plan fine-tuning and generate reports \{#fine-tuning-plans-and-reports}
+
+Fine-tuning has two failure modes that coding agents are unusually good at preventing: skipping the planning step ("just run SFT") and skipping the reporting step ("the loss looked fine"). The agent's job is to make both explicit.
+
+### Pick the right tool for the job
+
+| Situation | Recommended tool | Reference |
+|---|---|---|
+| Interactive exploration, small dataset, one or two GPUs | Workbench Notebook | [Fine-tuning with Notebooks](../../../workbench/how_to/fine_tunning_using_notebooks.mdx) |
+| Production-grade SFT / OSFT with automatic memory management | Training Hub | [Fine-tuning LLMs with Training Hub](../../../workbench/how_to/training_hub_fine_tuning.mdx) |
+| Reusable templates, many runs, scheduled / batched on Kueue | Kubeflow Trainer v2 + LlamaFactory | [Fine-Tuning with Kubeflow Trainer v2](../../../kubeflow/how_to/fine-tune-with-trainer-v2.mdx) |
+| Already-tuned model needs to fit a smaller GPU before serving | LLM Compressor | [LLM Compressor with Alauda AI](../../../llm-compressor/how_to/compressor_by_workbench.mdx) |
+
+### A reusable fine-tuning plan template
+
+Have the agent fill in this template **before** any job is submitted, and commit the result alongside the training code. This separates "what we intend" from "what we ran," which is exactly the comparison the report needs later.
+
+```markdown
+# Fine-tuning plan: <run-id>
+
+## Objective
+- Business goal:
+- Success metric (what improves; how it's measured):
+- Acceptance threshold (minimum acceptable score on the metric):
+
+## Base model
+- Model and revision:
+- Why this base (capability, license, context window, tool-calling support):
+
+## Dataset
+- Source(s) and license:
+- Size (examples / tokens):
+- Format (e.g., JSONL chat messages):
+- Splits (train / eval / held-out):
+- Known biases or contamination risks:
+
+## Method
+- Approach (SFT / LoRA / QLoRA / OSFT / continued pre-train):
+- Justification vs. the alternatives:
+- Tool (Training Hub / Kubeflow Trainer v2 / Notebook / LlamaFactory):
+
+## Compute budget
+- Hardware (GPU type, count, hours):
+- Hardware Profile to use:
+- Estimated cost / wall-clock:
+
+## Hyperparameters
+- Effective batch size, max_tokens_per_gpu, lr, epochs, scheduler, seed:
+- Checkpoint cadence and retention:
+
+## Evaluation plan
+- Benchmarks (public + internal):
+- Eval harness and seed:
+- Comparison baselines (the base model, prior runs):
+
+## Risks and rollback
+- What could go wrong (catastrophic forgetting, tool-calling regression, license conflict):
+- How we'll detect it:
+- Rollback (which model artifact to revert to):
+```
+
+Useful prompt: "Read `plan.md`. Draft a Kubeflow Trainer v2 `TrainingRuntime` and `TrainJob` (or a Training Hub notebook) that implements exactly this plan in namespace `mlops-demo-ai-test`. Highlight any field where the plan is ambiguous and ask me before guessing."
+
+### A reusable fine-tuning report template
+
+After the job finishes, ask the agent to ingest the training logs, eval outputs, and resource metrics, and fill in this report. Commit it next to the plan.
+
+```markdown
+# Fine-tuning report: <run-id>
+
+## Provenance
+- Plan: link to plan.md and its commit SHA
+- TrainJob / Notebook: name, namespace, start/end time
+- Hardware actually used (vs. planned):
+- Model artifact location (PVC / model repo path / OCI image):
+
+## Training summary
+- Steps / epochs completed:
+- Final training loss; loss trend (link to TensorBoard / MLflow run):
+- Throughput (tokens/sec, samples/sec):
+- Wall-clock and GPU-hours:
+- Anomalies (loss spikes, restarts, OOMs):
+
+## Evaluation results
+- Headline metric vs. baseline and acceptance threshold:
+- Per-benchmark scores table (this run, base model, prior best):
+- Tool-calling sanity check (pass/fail with example):
+- Qualitative samples (3–5 prompts; this run vs. base, side by side):
+
+## Cost
+- GPU-hours, $ (if applicable), $/percentage-point of improvement:
+
+## Decision
+- Promote / re-run / abandon:
+- If promote: which `InferenceService` to update and how (image, storageUri, runtime flags):
+- If re-run: what to change in the next plan.md:
+
+## Next actions
+- Owner / date:
+```
+
+Useful prompt: "Generate `report.md` for TrainJob `qwen-coder-sft-2026-05-29` in `mlops-demo-ai-test`. Pull metrics from MLflow run `<id>`, training logs from the pod, and eval results from `s3://aml-evals/<run-id>/`. Compare against the previous run `qwen-coder-sft-2026-05-15`. If any section can't be filled in from the available data, mark it `TODO` rather than fabricating numbers."
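+
+Before accepting a generated report, it helps to list which template fields are still empty or marked `TODO`. The following is a rough sketch, assuming the report keeps the template's `- label: value` bullet format and that an unfilled field still ends with its colon; the function name `unfilled_fields` is hypothetical.
+
+```python
+def unfilled_fields(report_md: str) -> list[str]:
+    """Return the bullet lines that are still empty or explicitly marked TODO."""
+    missing = []
+    for line in report_md.splitlines():
+        stripped = line.strip()
+        if not stripped.startswith("- "):
+            continue  # headings, blank lines, and prose are not template fields
+        field = stripped[2:]
+        # An untouched template bullet ends with ":"; a deferred one says TODO.
+        if field.endswith(":") or "TODO" in field:
+            missing.append(field)
+    return missing
+
+
+if __name__ == "__main__":
+    # Made-up report fragment, for illustration only.
+    sample = "## Cost\n- GPU-hours: 12\n- Owner / date:\n- Final training loss: TODO\n"
+    print(unfilled_fields(sample))
+```
+
+A reviewer can then decide whether the remaining gaps block the promote, re-run, or abandon decision.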
+
+For experiment tracking and run metadata, [MLflow on Kubeflow](../../../kubeflow/how_to/mlflow.mdx) is the platform-native option; tell the agent to log there from inside the training code so the report has a real source of truth.
+
+## A daily MLOps loop \{#daily-loop}
+
+A useful end-to-end sequence the agent can drive, given the setup above:
+
+1. **Triage.** "List inference services in my namespace, surface anything `NotReady` or scaled to zero unexpectedly, summarize recent gateway 4xx/5xx rates."
+2. **Tune.** "P95 on `qwen-2` is over budget. Propose one change, apply, re-benchmark, report."
+3. **Update.** "There's a new model artifact for `qwen-coder-sft-2026-05-29`. Draft the YAML to swap it into the `qwen-2` `InferenceService`, gate the rollout to one replica first, and write the smoke test."
+4. **Plan.** "Draft a fine-tuning plan to fix the tool-calling regression we saw in last week's eval. Justify the method choice."
+5. **Report.** "Last night's job finished. Generate the report and tell me whether to promote."
+
+Each step is a separate prompt with its own diff to review. The agent is the typist; you are still the engineer of record.
+
+## Best practices and guardrails \{#best-practices-and-guardrails}
+
+- **Read-only first, write second.** Start every new task by asking the agent to read state (`get`, `describe`, logs, metrics) and *describe what it would do* before making changes.
+- **Always `--dry-run=server`.** Make it a standing rule in the agent context file; mention it in every prompt that involves `kubectl apply`.
+- **One change per iteration.** Especially for performance tuning, mixing two changes hides which one helped.
+- **Never let the agent fabricate metrics.** Require it to cite the file, log, or run ID it pulled each number from, and to mark `TODO` when data is missing.
+- **Keep the loop on-prem.** Confirm that no fallback model in any agent config points at a hosted provider (see [Connect your coding agent](./coding_agents_with_inference_service.mdx) for the per-agent settings to check).
+- **Commit everything.** Plans, reports, generated YAML, and benchmark scripts all go into Git so the next person — or the next agent — can pick up where you left off.
+
+## References
+
+- [Use Coding Agents with On-Premise Inference Services](./coding_agents_with_inference_service.mdx)
+- [Create Inference Service using CLI](./create_inference_service_cli.mdx)
+- [Configure External Access for Inference Services](./external_access_inference_service.mdx)
+- [Configure Scaling for Inference Services](./autoscale_settings.mdx)
+- [Speculative Decoding for vLLM Inference Services](./vllm_speculative_decoding.mdx)
+- [Extend Inference Runtimes](./custom_inference_runtime.mdx)
+- [Envoy AI Gateway — introduction](../../../envoy_ai_gateway/intro.mdx)
+- [Install Envoy AI Gateway](../../../envoy_ai_gateway/install.mdx)
+- [Hardware Profiles](../../../infrastructure_management/hardware_profile/intro.mdx)
+- [Fine-Tuning with Kubeflow Trainer v2](../../../kubeflow/how_to/fine-tune-with-trainer-v2.mdx)
+- [Fine-tuning LLMs with Training Hub](../../../workbench/how_to/training_hub_fine_tuning.mdx)
+- [Fine-tuning with Notebooks](../../../workbench/how_to/fine_tunning_using_notebooks.mdx)
+- [LLM Compressor with Alauda AI](../../../llm-compressor/how_to/compressor_by_workbench.mdx)
+- [MLflow on Kubeflow](../../../kubeflow/how_to/mlflow.mdx)
+- [Envoy AI Gateway upstream documentation](https://aigateway.envoyproxy.io/)
+- [Envoy Gateway upstream documentation](https://gateway.envoyproxy.io/)
+- [KServe LLMInferenceService](https://kserve.github.io/website/)
+- [vLLM documentation (benchmarking with `vllm bench serve`)](https://docs.vllm.ai/en/latest/)