# SLM - Small Language Model Examples

Small Language Models are efficient, compact versions of LLMs optimized for speed and resource constraints.

## What is an SLM?
- **Size**: 100M - 7B parameters
- **Purpose**: Efficient text processing on edge devices
- **Examples**: Phi-3, Gemma, TinyLlama, Mistral 7B
- **Advantages**: Fast, low memory, can run locally
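The memory figures quoted in this notebook follow from simple arithmetic: weight memory ≈ parameter count × bytes per parameter. Here is a minimal sketch of that estimate. The helper name `estimate_weight_memory_gb` is hypothetical, and it counts weights only, so real usage (activations, KV cache, framework overhead) will be somewhat higher.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes).

    Ignores activations, the KV cache, quantization constants, and framework
    overhead, so actual usage will be somewhat higher.
    """
    return n_params * (bits_per_param / 8) / 1e9


# Parameter counts as given in this notebook
for name, n_params in [("TinyLlama", 1.1e9), ("Phi-3-mini", 3.8e9)]:
    fp16 = estimate_weight_memory_gb(n_params, 16)
    int4 = estimate_weight_memory_gb(n_params, 4)
    print(f"{name:<11} fp16 ≈ {fp16:.1f} GB | 4-bit ≈ {int4:.1f} GB")
```

This is why a 3.8B-parameter model needs about 7.6GB in fp16 but roughly a quarter of that at 4 bits.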

---

In [None]:
# Install required packages (bitsandbytes is only needed for the 4-bit example in Example 6)
!pip install transformers torch accelerate bitsandbytes -q

## Example 1: Running Phi-3 Locally

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load Phi-3 Mini (3.8B parameters)
model_name = "microsoft/Phi-3-mini-4k-instruct"

print(f"Loading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True
)

print("✓ Model loaded successfully!")
print("Model size: ~3.8B parameters")
print("Memory usage: ~7.6GB of weights (fp16 = 2 bytes per parameter)")

In [None]:
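def generate_with_slm(prompt, max_length=200):
    """Generate a chat response with the Small Language Model.

    max_length is the maximum number of *new* tokens to generate (prompt excluded).
    """
    messages = [{"role": "user", "content": prompt}]

    # return_dict=True returns input_ids together with the attention_mask
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True
    ).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_length,  # counts only generated tokens, not the prompt
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode only the newly generated tokens so the prompt is not echoed back
    prompt_length = inputs["input_ids"].shape[-1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response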

# Test the model
prompt = "Explain what a neural network is in simple terms."
print(f"Prompt: {prompt}\n")
print("Response:")
print(generate_with_slm(prompt))

## Example 2: Code Generation with SLM

In [None]:
code_prompt = "Write a Python function to calculate the factorial of a number."

print(f"Prompt: {code_prompt}\n")
print("Generated Code:")
result = generate_with_slm(code_prompt, max_length=300)
print(result)
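SLMs usually return code wrapped in Markdown prose and triple-backtick fences, so the `result` above mixes explanation with code. A minimal sketch for pulling out just the code, assuming the model fences its answer (common, but not guaranteed). The helper `extract_code_blocks` and the `sample` string are hypothetical illustrations, not part of any library.

```python
import re
from typing import List, Optional

FENCE = "`" * 3  # a triple backtick, built indirectly so it does not close this Markdown block

# Opening fence, optional language tag, newline, body (non-greedy), closing fence
CODE_BLOCK_RE = re.compile(FENCE + r"[ \t]*([\w+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(response: str, language: Optional[str] = None) -> List[str]:
    """Return the bodies of fenced code blocks found in a model response.

    If `language` is given, keep only blocks tagged with it (case-insensitive).
    """
    blocks = []
    for tag, body in CODE_BLOCK_RE.findall(response):
        if language is None or tag.lower() == language.lower():
            blocks.append(body.rstrip("\n"))
    return blocks


# A made-up response in the shape SLMs typically produce
sample = (
    "Here is a factorial function:\n"
    f"{FENCE}python\n"
    "def factorial(n):\n"
    "    return 1 if n <= 1 else n * factorial(n - 1)\n"
    f"{FENCE}\n"
    "It uses recursion."
)
print(extract_code_blocks(sample, "python")[0])
```

Calling `extract_code_blocks(result, "python")` on the Example 2 output returns an empty list if the model did not fence its answer, so check the list before indexing into it.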

## Example 3: Running TinyLlama (Even Smaller - 1.1B)

In [None]:
from transformers import pipeline

# Load TinyLlama - extremely lightweight
print("Loading TinyLlama (1.1B parameters)...")
tiny_llama = pipeline(
    "text-generation",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float16,
    device_map="auto"
)

print("✓ TinyLlama loaded!")
print("Memory usage: ~2.2GB of weights (fp16)")
print("Well suited to: edge devices, low-latency applications, and mobile (usually after further quantization)")

In [None]:
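# Test TinyLlama
prompt = "What are the benefits of machine learning?"

# TinyLlama-Chat is instruction-tuned, so wrap the prompt in its chat template
messages = [{"role": "user", "content": prompt}]
chat_prompt = tiny_llama.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = tiny_llama(
    chat_prompt,
    max_new_tokens=150,      # limit on generated tokens (prompt excluded)
    temperature=0.7,
    do_sample=True,
    return_full_text=False   # return only the completion, not the prompt
)

print(f"Prompt: {prompt}\n")
print("TinyLlama Response:")
print(response[0]['generated_text'])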

## Example 4: Benchmarking SLM Speed

In [None]:
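import time

def benchmark_model(model_fn, prompt, name, tokenizer):
    """Time a single generation call and estimate throughput in tokens/second."""
    start = time.perf_counter()
    result = model_fn(prompt)
    elapsed = time.perf_counter() - start

    # Count actual tokens, not whitespace-separated words
    n_tokens = len(tokenizer.encode(result, add_special_tokens=False))

    return {
        "model": name,
        "time": elapsed,
        "response_length": len(result),
        "tokens_per_second": n_tokens / elapsed
    }

test_prompt = "Explain photosynthesis."

# Benchmark TinyLlama; return_full_text=False so only generated text is counted
def tiny_generate(p):
    return tiny_llama(p, max_new_tokens=100, return_full_text=False)[0]['generated_text']

tiny_stats = benchmark_model(tiny_generate, test_prompt, "TinyLlama-1.1B", tiny_llama.tokenizer)

print("Performance Results:\n")
print(f"Model: {tiny_stats['model']}")
print(f"Response time: {tiny_stats['time']:.2f}s")
print(f"Tokens/sec: {tiny_stats['tokens_per_second']:.1f}")
print("\nNote: this times a single SLM call. For a real comparison, run the same")
print("prompt through a larger model on the same hardware and compare the numbers.")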
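A single timed call includes one-off warm-up costs (kernel setup, memory allocation) and gives only one sample. A minimal sketch of a steadier measurement: run a few warm-up calls, then several timed runs, and report the median latency. The helper `benchmark_generation` and the stand-in `fake_generate` are hypothetical; the stand-in only exists so the sketch runs without a model.

```python
import statistics
import time
from typing import Callable


def benchmark_generation(
    generate_fn: Callable[[str], str],
    prompt: str,
    count_tokens: Callable[[str], int],
    runs: int = 3,
    warmup: int = 1,
) -> dict:
    """Time `generate_fn` several times; report median latency and overall throughput."""
    # Warm-up calls are not timed: the first call often pays one-off setup costs
    for _ in range(warmup):
        generate_fn(prompt)

    latencies, token_counts = [], []
    for _ in range(runs):
        start = time.perf_counter()
        text = generate_fn(prompt)
        latencies.append(time.perf_counter() - start)
        token_counts.append(count_tokens(text))

    return {
        "runs": runs,
        "median_seconds": statistics.median(latencies),
        "median_tokens": statistics.median(token_counts),
        "tokens_per_second": sum(token_counts) / sum(latencies),
    }


# Stand-in generator: sleeps briefly and returns fixed text
def fake_generate(prompt: str) -> str:
    time.sleep(0.01)
    return "Photosynthesis converts light energy into chemical energy."


stats = benchmark_generation(fake_generate, "Explain photosynthesis.", count_tokens=lambda t: len(t.split()))
print(stats)
```

With the model loaded, the same helper could time the `tiny_generate` function from the cell above, for example with `count_tokens=lambda t: len(tiny_llama.tokenizer.encode(t, add_special_tokens=False))`.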

## Example 5: Text Classification with SLM

In [None]:
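def classify_sentiment_slm(text):
    """Classify sentiment as positive, negative, or neutral using the SLM."""
    prompt = f"""Classify the sentiment of this text as positive, negative, or neutral.

Text: {text}

Sentiment:"""

    # Greedy decoding (do_sample=False) gives a deterministic label.
    # max_new_tokens bounds only the answer; max_length would also count the prompt's tokens.
    response = tiny_llama(prompt, max_new_tokens=5, do_sample=False, return_full_text=False)
    answer = response[0]['generated_text'].strip()
    # Keep just the first word, lowercased and without trailing punctuation
    return answer.split()[0].strip(".,!").lower() if answer else "unknown"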

# Test sentiment classification
texts = [
    "I love this product! It's amazing!",
    "Terrible experience. Very disappointed.",
    "The item arrived on time."
]

for text in texts:
    sentiment = classify_sentiment_slm(text)
    print(f"Text: {text}")
    print(f"Sentiment: {sentiment}\n")
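Small models often add punctuation, capitalization, or extra words around the label ("Positive.", "The sentiment is negative"), so taking the first word of the output is fragile. A minimal sketch of a more forgiving parser that returns the first recognized label anywhere in the output. `parse_sentiment` is a hypothetical helper; note it would misread negations such as "not positive".

```python
import re

LABELS = ("positive", "negative", "neutral")


def parse_sentiment(raw_output: str, default: str = "unknown") -> str:
    """Map free-form model output onto one of LABELS.

    Scans the lowercased words in order and returns the first one that is a
    known label, or `default` if none appears.
    """
    for word in re.findall(r"[a-z]+", raw_output.lower()):
        if word in LABELS:
            return word
    return default


for raw in ["Positive.", "  The sentiment is NEGATIVE!", "neutral\nText: ...", "I am not sure"]:
    print(repr(raw), "->", parse_sentiment(raw))
```

Inside `classify_sentiment_slm`, this could be applied to the raw completion in place of the first-word heuristic.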

## Example 6: Quantization for Even Smaller Models

In [None]:
# Using 4-bit quantization for extreme efficiency (needs bitsandbytes and, typically, an NVIDIA GPU)
from transformers import BitsAndBytesConfig

# 4-bit quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

print("Loading quantized model...")
quantized_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)

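print("✓ Quantized model loaded!")
# get_memory_footprint() reports the bytes used by the model's parameters and buffers
print(f"Measured footprint: {quantized_model.get_memory_footprint() / 1e9:.1f} GB")
print("Typical memory for Phi-3-mini weights:")
print("- FP16: ~7.6GB")
print("- 4-bit: ~2GB (roughly 70% less)")
print("- Trade-off: usually a small quality drop; speed depends on hardware and is not always faster than fp16")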
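The `nf4` setting above stores each weight as one of 16 levels plus a scale shared by a block of weights, and `bnb_4bit_use_double_quant=True` additionally quantizes those per-block scales. To make the blockwise idea concrete, here is a simplified sketch of absmax quantization. It is a stand-in, not the bitsandbytes implementation: it uses 15 evenly spaced integer levels (-7 to 7) rather than NF4's normal-distribution-based levels, the block size of 64 is an assumption, and both function names are hypothetical.

```python
import random
from typing import List, Tuple


def quantize_absmax_4bit(weights: List[float], block_size: int = 64) -> Tuple[List[int], List[float]]:
    """Symmetric blockwise absmax quantization to signed 4-bit integers in [-7, 7].

    Each block stores one float scale (its largest |w| divided by 7) plus one small int per weight.
    """
    codes, scales = [], []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        absmax = max(abs(w) for w in block)
        scale = absmax / 7 if absmax > 0 else 1.0
        scales.append(scale)
        # Round to the nearest level and clamp to the 4-bit range
        codes.extend(max(-7, min(7, round(w / scale))) for w in block)
    return codes, scales


def dequantize_absmax_4bit(codes: List[int], scales: List[float], block_size: int = 64) -> List[float]:
    """Recover approximate weights by multiplying each code by its block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(256)]  # synthetic weights, not from a real model
codes, scales = quantize_absmax_4bit(weights)
restored = dequantize_absmax_4bit(codes, scales)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"{len(scales)} blocks, max abs error {max_err:.5f}")
```

Each weight is off by at most half its block's scale, which is why per-block scaling (rather than one scale for the whole tensor) keeps the error small when a few weights are much larger than the rest.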

## Example 7: Use Cases for SLMs

In [None]:
print("✅ Perfect Use Cases for SLMs:\n")

use_cases = {
    "Mobile Apps": "Run AI directly on smartphones without cloud",
    "Edge Devices": "IoT devices, smart cameras, embedded systems",
    "Privacy-Sensitive": "Medical, legal, financial (on-premise)",
    "Real-time Apps": "Chatbots, autocomplete, live translation",
    "Cost Reduction": "Lower API costs, no cloud fees",
    "Offline Apps": "Work without internet connection",
    "Prototyping": "Quick testing and development",
    "Fine-tuning": "Easier to customize for specific tasks"
}

for use_case, description in use_cases.items():
    print(f"• {use_case}: {description}")

print("\n❌ Not Suitable For:")
print("• Complex reasoning requiring deep context")
print("• Tasks needing extensive world knowledge")
print("• Multi-step complex problem solving")
print("• When accuracy is critical (medical diagnosis)")

## Summary

### SLM Examples Covered:
1. ✅ Running Phi-3 (3.8B) locally
2. ✅ TinyLlama (1.1B) for extreme efficiency
3. ✅ Performance benchmarking
4. ✅ Text classification tasks
5. ✅ 4-bit quantization for memory savings
6. ✅ Real-world use cases

### Key Advantages:
- ⚡ **Speed**: Typically lower latency than large LLMs, since each token needs far less computation
- 💾 **Memory**: ~2-8GB of fp16 weights for the models above vs. tens to hundreds of GB for large models
- 💰 **Cost**: No API fees, lower compute costs
- 🔒 **Privacy**: Run completely offline
- 📱 **Portability**: Works on consumer hardware

### Popular SLMs:
- **Phi-3** (Microsoft): 3.8B params, excellent quality
- **Gemma** (Google): 2B/7B params, open weights
- **TinyLlama**: 1.1B params, the smallest and fastest in this list
- **Mistral 7B**: 7B params, strong performance
- **StableLM**: Various sizes, customizable

### When to Choose SLM:
- ✅ Resource constraints (mobile, edge)
- ✅ Real-time requirements
- ✅ Privacy concerns
- ✅ Cost optimization
- ✅ Offline operation needed

### Comparison:
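| Metric | SLM (Phi-3-mini) | LLM (GPT-4) |
|--------|------------------|-------------|
| Parameters | 3.8B | Not publicly disclosed |
| Memory | ~7.6GB (fp16), ~2GB (4-bit) | N/A (cloud API only) |
| Speed | Fast on local hardware | Slower (larger model plus network latency) |
| Cost | No per-call fees (runs locally) | Pay per token via API |
| Quality | Good | Excellent |
| Use Case | Specific tasks | General purpose |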