Slemify

Generate, fine-tune, and validate Small Language Models. One YAML, one command.

SLMs handle the high-volume, repetitive tasks in your AI workflows (classification, routing, extraction) so your LLMs can focus on what they're best at. Slemify automates the path from raw data to a validated, production-ready GGUF model. How you deploy that model is up to you.

slemify deploy --config expert.yaml

When to Use an SLM

Not every task needs a frontier model. Most agentic AI systems have "hot spots". repetitive sub-tasks that run thousands of times a day with the same pattern. These are ideal for a specialized SLM:

Task Type	Example	Why SLM
Classification	Alert triage, intent routing, document categorization	Same pattern, different inputs. Fast, predictable output.
Extraction	Pull structured fields from logs, invoices, clinical notes	Rigid output schema. Doesn't need world knowledge.
Routing	Pick which tool/API/agent handles a request	Binary or multi-class decision. Sub-100ms matters.
Validation	Safety checks, compliance gates, format verification	Rule-based logic baked into weights. Runs on every request.

The criteria: high repetition, low semantic variation, structured output. If the task looks the same every time with different inputs, an SLM can do it faster and cheaper than a general-purpose LLM. often with higher accuracy for that specific task.

SLMs + LLMs Together

Slemify doesn't replace LLMs. It adds a fast, cheap layer alongside them.

[Request] → [SLM Router] → high confidence → [SLM Result] → done (50ms, $0)
                          → low confidence  → [LLM Fallback] → done (3s, $0.01)

The inference endpoint exposes an OpenAI-compatible API (/v1/chat/completions). Any agent, orchestrator, or application can call it directly via HTTP. Set llm_endpoint in your config to any OpenAI-compatible API (vLLM, llama.cpp, Bedrock proxy) for LLM fallback. The SLM handles 70-90% of requests at fixed cost. The LLM handles the rest.

Architecture	Cost at 10K requests/day	Avg Latency
100% LLM API	~$3,000/mo	1-3s
SLM + LLM (90/10)	~$500/mo	200ms avg

How It Works

expert.yaml → [DATA] → [TRAINING] → [SERVING + VALIDATION]
                 │          │                   │
            Ingest +    QLoRA on          Deploy model,
            Synthetic   Spot GPU          run eval report,
            via Bedrock via Unsloth       generate HTML

Data. Ingests your raw data from S3. Bedrock generates synthetic training pairs from your source content. You verify the output before training.
Training. QLoRA fine-tuning on Spot GPU via Unsloth. Exports a quantized GGUF model to S3.
Serving + Validation. Deploys the model on a live endpoint, runs the evaluation dataset through it, and generates an HTML report with accuracy, latency, and cost projections.

The output is a GGUF model file in S3 and a production readiness report. The serving deployment that Slemify creates is production-quality and serves as a reference for your own infrastructure. You can use it as-is, adapt it, or serve the GGUF with any compatible runtime (llama.cpp, vLLM, Ollama). See the Serving deep dive for deployment guidance and best practices.

Quick Start

Prerequisites

EKS cluster with Karpenter
S3 bucket for data and artifacts
AWS credentials with Bedrock access
kubectl configured for your cluster

1. Define your task

apiVersion: slemify/v1

project:
  name: support-intent-noisy
  domain: >
    Email triage for customer support. Extract intent and sentiment
    from noisy, unstructured emails containing OCR artifacts,
    mobile-device typos, conversational tangents, and corrupted
    character encodings.
  labels:
    intent:
      - refund_request
      - setup_help
      - billing_question
      - technical_issue
      - feedback
      - account_change
      - shipping_inquiry
    sentiment:
      - angry
      - frustrated
      - neutral
      - satisfied

model:
  base: ""  # HuggingFace model ID
  quantize: q4_k_m

data:
  bucket: slemify-data
  path: support-intent-noisy/data/
  sources:
    - path: emails/
      type: raw
  synthetic:
    model: eu.anthropic.claude-sonnet-4-6
    pairs: 800
  evaluation:
    model: eu.anthropic.claude-sonnet-4-6
    pairs: 100
    sources:
      - path: eval-emails/
        type: raw

training:
  spot: true

2. Upload your training data

aws s3 sync ./data/emails s3://slemify-data/support-intent-noisy/data/emails/
aws s3 sync ./data/eval-emails s3://slemify-data/support-intent-noisy/data/eval-emails/

3. Deploy

slemify deploy --config expert.yaml

Slemify handles data processing, synthetic pair generation, training, quantization, and validation. The resulting GGUF model is uploaded to S3. You then deploy it in your own infrastructure using the reference deployment as a starting point.

4. View the report

slemify report --config expert.yaml

Downloads the HTML report from S3 and opens it in your browser. The report includes accuracy metrics, latency benchmarks, SLM vs LLM comparison, and cost projections.

How Much Data Do I Need?

Task Type	Training Examples	Notes
Classification (routing, triage)	200-500	Binary or multi-class. Clear categories.
Extraction (fields from text)	500-1,000	More examples = better edge case coverage.
Structured generation (commands, configs)	500-1,000	Model needs to learn output format precisely.

Quality matters more than quantity. 500 well-curated instruction-response pairs beat 10,000 noisy ones. Bedrock generates synthetic examples from your source data, so you don't need to write them all by hand.

Cost

Item	Cost
Training (Spot GPU, one-time)	~$0.15
Synthetic data (Bedrock)	~$10-50
Total to generate a model	~$15-55

Inference cost depends on how you deploy. The reference deployment (llama.cpp on CPU Spot) runs at ~$117/mo per replica. Throughput scales linearly: 3 replicas = 3x throughput at 3x cost. No rate limits, no per-token charges. See the Serving deep dive for cost comparisons across CPU, GPU, and LLM API options.

Examples

Support Intent (Noisy). Classify messy customer support emails into intent categories
K8s Autoscaling Auditor. Tiered SLM system: a 4B triage classifier routes queries, an 8B auditor produces structured reasoning about Karpenter/KEDA/HPA misconfigurations

Deep Dives

Technical docs covering the design decisions, best practices, and research behind each pipeline stage. Written for Platform Engineers.

Getting Started. End-to-end tutorial: build a multi-agent K8s expert from scratch
Data Stage. Raw data quality, synthetic generation, label taxonomy, verification
Training Stage. QLoRA, model sizing, Spot GPU, checkpointing, quantization
Serving Stage. Reference deployment, CPU inference, autoscaling guidance
Report Stage. Accuracy measurement, SLM vs LLM comparison, cost projections

Architecture

The pipeline runs on Kubernetes (EKS). The output is a GGUF model in S3.

Karpenter. GPU nodes for training (Spot), CPU nodes for the reference deployment
Unsloth. QLoRA fine-tuning, 2-5x faster than standard training
llama.cpp. GGUF inference on CPU (used in the reference deployment and validation report)
Pod Identity. IAM access to S3 and Bedrock, no static credentials
Systems Manager. Remote container builds via SSM, no SSH keys or open ports required

The reference serving deployment (llama.cpp on CPU) is included for validation and as a starting point. You can serve the GGUF model with any compatible runtime: llama.cpp, vLLM, Ollama, or any tool that reads GGUF files.

Commands

Command	Description
`slemify deploy`	Run the full pipeline
`slemify deploy --stage training --no-wait`	Submit a stage and exit
`slemify status my-project`	Show pipeline progress
`slemify status my-project -o json`	Machine-readable status for agents
`slemify validate`	Validate config without deploying
`slemify report`	Download and open the accuracy report in the browser
`slemify report --output my-report.html`	Save report to a custom path
`slemify report --no-open`	Download without opening the browser
`slemify build`	Build container images to ECR

FAQ

Q: When should I use an SLM vs just calling an LLM API? A: If the task is repetitive, structured, and runs more than ~1,000 times/day. or if data can't leave your VPC. Below that volume, an LLM API is simpler and fine.

Q: Can a 3B model really match a frontier LLM? A: For general tasks, no. For YOUR specific structured task with YOUR categories, a fine-tuned 3B model matches or beats general-purpose LLMs. Salesforce's xLAM-2-8B beat GPT-4o and Claude 3.5 at tool calling on the Berkeley Function-Calling Leaderboard. Specialization beats size.

Q: What about RAG? A: SLMs and RAG solve different problems. RAG retrieves relevant context for knowledge questions. SLMs handle classification, routing, and extraction where you don't need retrieval. you need a fast decision. They work well together: SLM routes the query, RAG handles the knowledge lookup.

Q: Can I use a different base model? A: Yes. Any HuggingFace-compatible model supported by Unsloth. The auto-sizer adjusts infrastructure based on model size.

Q: What happens during a Spot interruption? A: Training checkpoints sync to S3 every 500 steps. The next pod resumes from the last checkpoint automatically. Max work lost: ~20 minutes.

Agent Skill

Slemify includes an agent skill compatible with Claude Code, OpenAI Codex, Gemini CLI, and Cursor. The skill teaches AI coding agents how to identify SLM opportunities in your system, design the agent's role, write the expert.yaml config, run the pipeline, and interpret results.

Install in Claude Code:

/plugin install slemify@<your-repo>

Or reference the skill directly:

"Use the Slemify skill to identify which of my LLM calls could be replaced with a specialized SLM."

The skill includes templates for two patterns:

Router Agent (1-4B): fast classification and routing decisions
Analyst Agent (7-8B): structured reasoning grounded by RAG

References

Small Language Models are the Future of Agentic AI (NVIDIA, 2025). Position paper arguing SLMs under 10B parameters can handle 60-80% of agentic AI tasks
xLAM: Large Action Models (Salesforce). 8B model that beat GPT-4o at tool calling, proving specialization beats size
Forbes: Don't Default to the Biggest AI Model. 40-70% of agentic AI invocations can use SLMs
Hallucination Propensity in Small Models. Research on knowledge mismatch between fine-tuning data and base model knowledge
QLoRA: Efficient Finetuning of Quantized Language Models. The fine-tuning technique Slemify uses
Unsloth. 2-5x faster QLoRA training with custom Triton kernels
llama.cpp. GGUF inference engine for CPU deployment
Model Context Protocol. How SLMs expose tools to AI assistants

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
skills/slemify		skills/slemify
src		src
.gitignore		.gitignore
.semgrepignore		.semgrepignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
THIRD-PARTY-LICENSES		THIRD-PARTY-LICENSES

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Slemify

When to Use an SLM

SLMs + LLMs Together

How It Works

Quick Start

Prerequisites

1. Define your task

2. Upload your training data

3. Deploy

4. View the report

How Much Data Do I Need?

Cost

Examples

Deep Dives

Architecture

Commands

FAQ

Agent Skill

References

About

Uh oh!

Releases 1

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Slemify

When to Use an SLM

SLMs + LLMs Together

How It Works

Quick Start

Prerequisites

1. Define your task

2. Upload your training data

3. Deploy

4. View the report

How Much Data Do I Need?

Cost

Examples

Deep Dives

Architecture

Commands

FAQ

Agent Skill

References

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages