Skip to content

aws-samples/sample-slemify

Slemify

Generate, fine-tune, and validate Small Language Models. One YAML, one command.

SLMs handle the high-volume, repetitive tasks in your AI workflows (classification, routing, extraction) so your LLMs can focus on what they're best at. Slemify automates the path from raw data to a validated, production-ready GGUF model. How you deploy that model is up to you.

slemify deploy --config expert.yaml

When to Use an SLM

Not every task needs a frontier model. Most agentic AI systems have "hot spots". repetitive sub-tasks that run thousands of times a day with the same pattern. These are ideal for a specialized SLM:

Task Type Example Why SLM
Classification Alert triage, intent routing, document categorization Same pattern, different inputs. Fast, predictable output.
Extraction Pull structured fields from logs, invoices, clinical notes Rigid output schema. Doesn't need world knowledge.
Routing Pick which tool/API/agent handles a request Binary or multi-class decision. Sub-100ms matters.
Validation Safety checks, compliance gates, format verification Rule-based logic baked into weights. Runs on every request.

The criteria: high repetition, low semantic variation, structured output. If the task looks the same every time with different inputs, an SLM can do it faster and cheaper than a general-purpose LLM. often with higher accuracy for that specific task.

SLMs + LLMs Together

Slemify doesn't replace LLMs. It adds a fast, cheap layer alongside them.

[Request] → [SLM Router] → high confidence → [SLM Result] → done (50ms, $0)
                          → low confidence  → [LLM Fallback] → done (3s, $0.01)

The inference endpoint exposes an OpenAI-compatible API (/v1/chat/completions). Any agent, orchestrator, or application can call it directly via HTTP. Set llm_endpoint in your config to any OpenAI-compatible API (vLLM, llama.cpp, Bedrock proxy) for LLM fallback. The SLM handles 70-90% of requests at fixed cost. The LLM handles the rest.

Architecture Cost at 10K requests/day Avg Latency
100% LLM API ~$3,000/mo 1-3s
SLM + LLM (90/10) ~$500/mo 200ms avg

How It Works

expert.yaml → [DATA] → [TRAINING] → [SERVING + VALIDATION]
                 │          │                   │
            Ingest +    QLoRA on          Deploy model,
            Synthetic   Spot GPU          run eval report,
            via Bedrock via Unsloth       generate HTML
  1. Data. Ingests your raw data from S3. Bedrock generates synthetic training pairs from your source content. You verify the output before training.
  2. Training. QLoRA fine-tuning on Spot GPU via Unsloth. Exports a quantized GGUF model to S3.
  3. Serving + Validation. Deploys the model on a live endpoint, runs the evaluation dataset through it, and generates an HTML report with accuracy, latency, and cost projections.

The output is a GGUF model file in S3 and a production readiness report. The serving deployment that Slemify creates is production-quality and serves as a reference for your own infrastructure. You can use it as-is, adapt it, or serve the GGUF with any compatible runtime (llama.cpp, vLLM, Ollama). See the Serving deep dive for deployment guidance and best practices.

Quick Start

Prerequisites

  • EKS cluster with Karpenter
  • S3 bucket for data and artifacts
  • AWS credentials with Bedrock access
  • kubectl configured for your cluster

1. Define your task

apiVersion: slemify/v1

project:
  name: support-intent-noisy
  domain: >
    Email triage for customer support. Extract intent and sentiment
    from noisy, unstructured emails containing OCR artifacts,
    mobile-device typos, conversational tangents, and corrupted
    character encodings.
  labels:
    intent:
      - refund_request
      - setup_help
      - billing_question
      - technical_issue
      - feedback
      - account_change
      - shipping_inquiry
    sentiment:
      - angry
      - frustrated
      - neutral
      - satisfied

model:
  base: ""  # HuggingFace model ID
  quantize: q4_k_m

data:
  bucket: slemify-data
  path: support-intent-noisy/data/
  sources:
    - path: emails/
      type: raw
  synthetic:
    model: eu.anthropic.claude-sonnet-4-6
    pairs: 800
  evaluation:
    model: eu.anthropic.claude-sonnet-4-6
    pairs: 100
    sources:
      - path: eval-emails/
        type: raw

training:
  spot: true

2. Upload your training data

aws s3 sync ./data/emails s3://slemify-data/support-intent-noisy/data/emails/
aws s3 sync ./data/eval-emails s3://slemify-data/support-intent-noisy/data/eval-emails/

3. Deploy

slemify deploy --config expert.yaml

Slemify handles data processing, synthetic pair generation, training, quantization, and validation. The resulting GGUF model is uploaded to S3. You then deploy it in your own infrastructure using the reference deployment as a starting point.

4. View the report

slemify report --config expert.yaml

Downloads the HTML report from S3 and opens it in your browser. The report includes accuracy metrics, latency benchmarks, SLM vs LLM comparison, and cost projections.

How Much Data Do I Need?

Task Type Training Examples Notes
Classification (routing, triage) 200-500 Binary or multi-class. Clear categories.
Extraction (fields from text) 500-1,000 More examples = better edge case coverage.
Structured generation (commands, configs) 500-1,000 Model needs to learn output format precisely.

Quality matters more than quantity. 500 well-curated instruction-response pairs beat 10,000 noisy ones. Bedrock generates synthetic examples from your source data, so you don't need to write them all by hand.

Cost

Item Cost
Training (Spot GPU, one-time) ~$0.15
Synthetic data (Bedrock) ~$10-50
Total to generate a model ~$15-55

Inference cost depends on how you deploy. The reference deployment (llama.cpp on CPU Spot) runs at ~$117/mo per replica. Throughput scales linearly: 3 replicas = 3x throughput at 3x cost. No rate limits, no per-token charges. See the Serving deep dive for cost comparisons across CPU, GPU, and LLM API options.

Examples

  • Support Intent (Noisy). Classify messy customer support emails into intent categories
  • K8s Autoscaling Auditor. Tiered SLM system: a 4B triage classifier routes queries, an 8B auditor produces structured reasoning about Karpenter/KEDA/HPA misconfigurations

Deep Dives

Technical docs covering the design decisions, best practices, and research behind each pipeline stage. Written for Platform Engineers.

  • Getting Started. End-to-end tutorial: build a multi-agent K8s expert from scratch
  • Data Stage. Raw data quality, synthetic generation, label taxonomy, verification
  • Training Stage. QLoRA, model sizing, Spot GPU, checkpointing, quantization
  • Serving Stage. Reference deployment, CPU inference, autoscaling guidance
  • Report Stage. Accuracy measurement, SLM vs LLM comparison, cost projections

Architecture

The pipeline runs on Kubernetes (EKS). The output is a GGUF model in S3.

  • Karpenter. GPU nodes for training (Spot), CPU nodes for the reference deployment
  • Unsloth. QLoRA fine-tuning, 2-5x faster than standard training
  • llama.cpp. GGUF inference on CPU (used in the reference deployment and validation report)
  • Pod Identity. IAM access to S3 and Bedrock, no static credentials
  • Systems Manager. Remote container builds via SSM, no SSH keys or open ports required

The reference serving deployment (llama.cpp on CPU) is included for validation and as a starting point. You can serve the GGUF model with any compatible runtime: llama.cpp, vLLM, Ollama, or any tool that reads GGUF files.

Commands

Command Description
slemify deploy Run the full pipeline
slemify deploy --stage training --no-wait Submit a stage and exit
slemify status my-project Show pipeline progress
slemify status my-project -o json Machine-readable status for agents
slemify validate Validate config without deploying
slemify report Download and open the accuracy report in the browser
slemify report --output my-report.html Save report to a custom path
slemify report --no-open Download without opening the browser
slemify build Build container images to ECR

FAQ

Q: When should I use an SLM vs just calling an LLM API? A: If the task is repetitive, structured, and runs more than ~1,000 times/day. or if data can't leave your VPC. Below that volume, an LLM API is simpler and fine.

Q: Can a 3B model really match a frontier LLM? A: For general tasks, no. For YOUR specific structured task with YOUR categories, a fine-tuned 3B model matches or beats general-purpose LLMs. Salesforce's xLAM-2-8B beat GPT-4o and Claude 3.5 at tool calling on the Berkeley Function-Calling Leaderboard. Specialization beats size.

Q: What about RAG? A: SLMs and RAG solve different problems. RAG retrieves relevant context for knowledge questions. SLMs handle classification, routing, and extraction where you don't need retrieval. you need a fast decision. They work well together: SLM routes the query, RAG handles the knowledge lookup.

Q: Can I use a different base model? A: Yes. Any HuggingFace-compatible model supported by Unsloth. The auto-sizer adjusts infrastructure based on model size.

Q: What happens during a Spot interruption? A: Training checkpoints sync to S3 every 500 steps. The next pod resumes from the last checkpoint automatically. Max work lost: ~20 minutes.

Agent Skill

Slemify includes an agent skill compatible with Claude Code, OpenAI Codex, Gemini CLI, and Cursor. The skill teaches AI coding agents how to identify SLM opportunities in your system, design the agent's role, write the expert.yaml config, run the pipeline, and interpret results.

Install in Claude Code:

/plugin install slemify@<your-repo>

Or reference the skill directly:

"Use the Slemify skill to identify which of my LLM calls could be replaced with a specialized SLM."

The skill includes templates for two patterns:

  • Router Agent (1-4B): fast classification and routing decisions
  • Analyst Agent (7-8B): structured reasoning grounded by RAG

References

About

Slemify demonstrates how to fine-tune and serve Small Language Models (1-8B parameters) on Kubernetes using CPUs. It takes a single YAML configuration, generates synthetic training data via an LLM API, fine-tunes a base model on a Spot GPU, quantizes it, and deploys it for inference on CPU nodes with autoscaling.

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages