# Local LLM Inference: Running Modern Models on Your Computer

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Run modern language models locally on Mac (Intel/Apple Silicon) or PC
* Work with efficient models optimized for CPU/Apple Silicon
* Understand practical limits of local inference
* Master text generation without cloud dependencies
* Implement efficient processing for research tasks

</div>

### 💻 Best Models for Local Inference

| Model | Size | RAM Needed |
|-------|------|------------|
| Llama-3.2-1B | 1B | 4GB | 
| Qwen2.5-1.5B | 1.5B | 4GB | 
| Qwen2.5-3B | 3B | 6GB | 
| Llama-3.2-3B | 3B | 6GB | 
| Phi-3.5-mini | 3.8B | 8GB | 
| GPT-OSS-20B | 21B | 16GB |
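
The table can be turned into a quick lookup. The sketch below is not part of the original notebook: `MODEL_RAM_GB` simply copies the table, and `models_that_fit` is a hypothetical helper that lists the models whose stated RAM need fits a given budget. It is probably best to pass your *available* RAM rather than the installed total, since other applications take their share.

```python
# RAM requirements copied from the table above (model name -> GB needed)
MODEL_RAM_GB = {
    "Llama-3.2-1B": 4,
    "Qwen2.5-1.5B": 4,
    "Qwen2.5-3B": 6,
    "Llama-3.2-3B": 6,
    "Phi-3.5-mini": 8,
    "GPT-OSS-20B": 16,
}

def models_that_fit(ram_gb):
    """Hypothetical helper: return table models whose RAM need is <= ram_gb."""
    return [name for name, need in MODEL_RAM_GB.items() if need <= ram_gb]

for budget in (4, 8, 16):
    print(f"{budget} GB: {models_that_fit(budget)}")
```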


### Sections
1. [Hardware Detection](#setup)
2. [Simple Installation](#install)
3. [Loading Models](#load)
4. [Text Generation](#generation)
5. [Practical Examples](#examples)
6. [Performance Tips](#performance)

<a id='setup'></a>

# Hardware Detection

Let's check what hardware you have available for local inference.

In [1]:
import platform
import psutil
import torch

def detect_hardware():
    """Detect local hardware capabilities"""
    print("System Information")
    print("=" * 50)
    
    # Operating System
    system = platform.system()
    print(f"OS: {system} {platform.release()}")
    print(f"Python: {platform.python_version()}")
    
    # CPU
    print(f"\nCPU: {platform.processor()}")
    print(f"Cores: {psutil.cpu_count(logical=False)} physical, {psutil.cpu_count()} logical")
    
    # Memory
    ram = psutil.virtual_memory().total / (1024**3)
    available_ram = psutil.virtual_memory().available / (1024**3)
    print(f"\nRAM: {ram:.1f} GB total")
    print(f"Available: {available_ram:.1f} GB")
    
    # Check for acceleration
    print("\nAcceleration:")
    
    # Apple Silicon (MPS)
    if torch.backends.mps.is_available():
        print("Apple Silicon detected (MPS acceleration available)")
        device = "mps"
    # NVIDIA GPU
    elif torch.cuda.is_available():
        print(f"NVIDIA GPU detected: {torch.cuda.get_device_name(0)}")
        device = "cuda"
    # CPU only
    else:
        print("Using CPU (no GPU acceleration detected)")
        device = "cpu"
    
    return device

# Detect hardware
DEVICE = detect_hardware()

System Information
OS: Darwin 24.0.0
Python: 3.10.13

CPU: arm
Cores: 10 physical, 10 logical

RAM: 16.0 GB total
Available: 4.2 GB

Acceleration:
Apple Silicon detected (MPS acceleration available)


<a id='install'></a>

# Simple Installation

We only need the essential packages - no complex dependencies!

In [2]:
# Install only essential packages
!pip install -q torch transformers accelerate
!pip install -q psutil  # For system monitoring
!pip install -q pydantic  # For validating structured (JSON) output

# Optional: visualization
# !pip install -q matplotlib pandas

# Verify versions
import transformers
print(f"\nTransformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")


Transformers version: 4.44.2
PyTorch version: 2.4.1


<a id='load'></a>

# Loading Models

Load a model appropriate for your hardware. We'll use the standard `transformers` library, which is simple and reliable.

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Choose your model (change this based on your RAM)
# MODEL_NAME = RECOMMENDED_MODEL  # Only if you have defined RECOMMENDED_MODEL yourself
MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"

# Alternative models for different RAM sizes:
# 4GB RAM:  "meta-llama/Llama-3.2-1B-Instruct" (1B)
# 6GB RAM:  "microsoft/phi-2" (2.7B)
# 8GB RAM:  "microsoft/Phi-3.5-mini-instruct" (3.8B)
# 32GB RAM: "Qwen/Qwen2.5-7B-Instruct" (7B)

print(f"Loading {MODEL_NAME}...")
print("This may take a few minutes on first download.\n")

try:
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    
    # Load model with appropriate settings
    if DEVICE == "mps":  # Apple Silicon
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_NAME,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        ).to(DEVICE)
    elif DEVICE == "cuda":  # NVIDIA GPU
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_NAME,
            torch_dtype=torch.float16,
            device_map="auto",
            trust_remote_code=True
        )
    else:  # CPU
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_NAME,
            torch_dtype=torch.float32,  # Full precision for CPU
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )
    
    # Set pad token if needed
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    print(f"✅ Model loaded successfully!")
    print(f"   Device: {DEVICE}")
    print(f"   Model size: ~{sum(p.numel() for p in model.parameters())/1e9:.1f}B parameters")
    
except Exception as e:
    print(f"❌ Error loading model: {e}")

Loading Qwen/Qwen2.5-3B-Instruct...
This may take a few minutes on first download.




✅ Model loaded successfully!
   Device: mps
   Model size: ~3.1B parameters


<a id='generation'></a>

# Text Generation

Simple, efficient text generation optimized for local hardware.

In [4]:
def generate_text(prompt, max_new_tokens=100, temperature=0.7):
    """
    Generate text locally with your model
    
    Args:
        prompt: Input text
        max_new_tokens: Maximum tokens to generate
        temperature: Creativity (0.0 = focused, 1.0+ = creative)
    """
    # Format for instruction models
    if "instruct" in MODEL_NAME.lower() or "chat" in MODEL_NAME.lower():
        if hasattr(tokenizer, 'apply_chat_template'):
            messages = [{"role": "user", "content": prompt}]
            formatted_prompt = tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
        else:
            formatted_prompt = f"User: {prompt}\nAssistant:"
    else:
        formatted_prompt = prompt
    
    # Tokenize
    inputs = tokenizer(formatted_prompt, return_tensors="pt", truncation=True)
    
    # Move to device if not CPU
    if DEVICE != "cpu":
        inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
            pad_token_id=tokenizer.pad_token_id,
        )
    
    # Decode
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Remove prompt from output
    if generated.startswith(formatted_prompt):
        generated = generated[len(formatted_prompt):]
    elif prompt in generated:
        generated = generated.split(prompt)[-1]
    
    return generated.strip()

## Test Generation

In [5]:
# Test the model
prompt = "What are the main benefits of renewable energy?"

print(f"Model: {MODEL_NAME}")
print(f"Prompt: {prompt}\n")
print("Generating response...\n")

import time
start = time.time()

response = generate_text(prompt, max_new_tokens=150, temperature=0.7)

elapsed = time.time() - start

print(f"Response:\n{response}\n")
print(f"⏱️ Generation time: {elapsed:.2f} seconds")
# Rough estimate: assumes all 150 requested tokens were generated
print(f"   Speed: ~{150/elapsed:.1f} tokens/second")

Model: Qwen/Qwen2.5-3B-Instruct
Prompt: What are the main benefits of renewable energy?

Generating response...





Response:
assistant
Renewable energy offers several significant benefits that make it an attractive and essential component of our future energy mix. Here are some of the main advantages:

1. **Environmental Sustainability**: Renewable energy sources like solar, wind, hydro, and geothermal power do not produce greenhouse gases or other pollutants during operation. This reduces air pollution and helps mitigate climate change.

2. **Abundance**: Unlike fossil fuels, which are finite resources, renewable energy sources are virtually inexhaustible. Solar, wind, and hydroelectric power can be generated continuously in most cases, making them sustainable for long-term use.

3. **Energy Security**: Diversifying energy sources with renewables can reduce dependency on imported fuels, enhancing national energy security. Countries with abundant renewable resources can

⏱️ Generation time: 49.70 seconds
   Speed: ~3.0 tokens/second
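
The response above begins with a stray `assistant` line. A likely cause: `skip_special_tokens=True` removes the chat template's marker tokens during decoding, so the decoded text no longer starts with `formatted_prompt`, and the fallback split on `prompt` leaves the role label behind. A common fix is to drop the prompt's token ids before decoding instead of trimming the text afterwards. The sketch below illustrates the slicing on plain lists with made-up token ids; `strip_prompt_tokens` is a hypothetical helper, not part of the notebook.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the newly generated ids (everything after the prompt)."""
    return output_ids[prompt_length:]

# Toy ids standing in for real tokens: generate() returns prompt + continuation
prompt_ids = [101, 7, 42, 9]
output_ids = prompt_ids + [55, 56, 57]

new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)  # [55, 56, 57]
```

Inside `generate_text`, the equivalent would be to pass `outputs[0][inputs["input_ids"].shape[1]:]` to `tokenizer.decode`. The length of that slice is also the actual number of generated tokens, which would make the tokens-per-second figure exact rather than an estimate.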


<a id='examples'></a>

# Practical Examples

## Multiple Prompts

In [6]:
# Process multiple prompts one after another
prompts = [
    "Explain machine learning in simple terms",
    "What are the causes of climate change?",
    "How does social media affect society?"
]

print("Processing multiple prompts...\n")

for i, prompt in enumerate(prompts, 1):
    print(f"Prompt {i}: {prompt}")
    response = generate_text(prompt, max_new_tokens=100, temperature=0.7)
    print(f"Response: {response}\n")
    print("-" * 50)

Processing multiple prompts...

Prompt 1: Explain machine learning in simple terms
Response: assistant
Sure! Imagine you have a big pile of data about different types of fruits: apples, bananas, oranges, and so on. Each piece of fruit has some characteristics, like color, size, and texture.

Now, let's say you want to create a program that can automatically identify what kind of fruit an image is of. This program is essentially a "machine" that learns from the data you give it.

Here’s how it works in simple steps:

1. **Data**: The machine starts

--------------------------------------------------
Prompt 2: What are the causes of climate change?
Response: assistant
Climate change is primarily caused by human activities that lead to an increase in greenhouse gases (GHGs) in the atmosphere, particularly carbon dioxide (CO₂), methane (CH₄), and nitrous oxide (N₂O). Here are some key factors contributing to this increase:

1. **Deforestation and Land Use Changes**: Trees absorb CO₂ during

## Different Temperature Settings

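Theoretical maximum: there is no hard upper limit. You can set temperature to 10, 100, or even 1000.

Practical maximum: values around 1.5-2.0 are usually the useful limit. Beyond that, sampling approaches random and the output tends to lose coherence.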
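
Temperature works by dividing the model's raw scores (logits) by the temperature before the softmax turns them into probabilities. A minimal pure-Python sketch with made-up logits for three candidate tokens (the numbers are illustrative, not taken from the model):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.3, 1.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures concentrate probability on the top token, while high temperatures flatten the distribution towards uniform, which is why very large values produce near-random text.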

In [16]:
# Compare different temperature settings
prompt = "Complete this sentence: 'Happiness is like a"
temperatures = [0.3, 1.0, 2.0, 5.0]

print(f"Testing temperature effects\n")
print(f"Prompt: {prompt}\n")

for temp in temperatures:
    print(f"Temperature {temp}:")
    response = generate_text(prompt, max_new_tokens=80, temperature=temp)
    print(f"{response}\n")

Testing temperature effects

Prompt: Complete this sentence: 'Happiness is like a

Temperature 0.3:




assistant
Happiness is like a fine wine that improves with time and sharing, or a flower that blooms beautifully when nurtured with care and attention.

Temperature 1.0:
assistant
Happiness is like a puzzle, beautifully complete only when all its pieces align just right in your life.

Temperature 2.0:
assistant
flower; sometimes you sow seeds with it in your heart and patiently wait for the bloom, or you stumble upon it unexpectedly as if it just appeared on the breeze. Happiness can come from within through personal growth and satisfaction, it can be a sudden burst like winning the lottery or finding the perfect partner, or it can be a series of small, everyday moments of kindness, appreciation, and joy that add

Temperature 5.0:
assistant
"Happier moments, like the morning mist lifting over a serene valley after night has settled,"" though not every such analogy may apply personally or universally. Happiness isn't always directly measurable or tied purely temporally like those serene

Smaller models tend to show less diverse "creativity": they have learned fewer patterns, so they fall back on common metaphors.

## Explore Probabilities

In [18]:
import torch
import torch.nn.functional as F

def get_token_probabilities(prompt, temperature=1.0):
    """Get probability distribution for next token"""
    
    # Tokenize - returns PyTorch tensors
    inputs = tokenizer(prompt, return_tensors="pt")

    # move to GPU/Apple Silicon if available
    if DEVICE != "cpu":
        inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
    
    # Get model output (raw logits)
    with torch.no_grad():                   # Don't calculate gradients (save memory)
        outputs = model(**inputs)           # Run the model forward pass
        logits = outputs.logits[0, -1, :]   # Last token's predictions. 0 = batch item, -1 = last position in seq (after "a"), : = all vocab tokens
        
    # Apply temperature - this mirrors what generate() does before sampling
    logits_with_temp = logits / temperature
    
    # Convert to probabilities with softmax 
    probs = F.softmax(logits_with_temp, dim=-1)
    
    # Get top tokens
    top_k = 20
    top_probs, top_indices = torch.topk(probs, top_k)
    
    # Decode tokens back to text
    tokens = [tokenizer.decode([idx.item()]) for idx in top_indices]
    
    return tokens, top_probs.cpu().numpy(), logits.cpu().numpy()

# Analyze a prompt
prompt = "Complete this sentence: 'Happiness is like a"
tokens, probs, raw_logits = get_token_probabilities(prompt, temperature=1.0)

# Display results
print(f"Top 10 token probabilities for: '{prompt}'\n")
for token, prob in zip(tokens[:10], probs[:10]):
    bar = "█" * int(prob * 100)
    print(f"{token:15s} {prob:.4f} {bar}")

Top 10 token probabilities for: 'Complete this sentence: 'Happiness is like a'

 butterfly      0.1519 ███████████████
 garden         0.0995 █████████
 __             0.0428 ████
 flower         0.0421 ████
 ___            0.0313 ███
 ______         0.0252 ██
 pe             0.0209 ██
 rose           0.0187 █
 beautiful      0.0168 █
____            0.0139 █
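
With `do_sample=True`, generation draws the next token at random from this distribution rather than always taking the top entry. The sketch below imitates such draws using four of the word tokens and probabilities printed above. Restricting the draw to just these four is a simplification for illustration (the real draw covers the whole vocabulary), and `sample_next_token` is a hypothetical helper.

```python
import random

# Four word tokens and their probabilities, copied from the output above
candidates = {" butterfly": 0.1519, " garden": 0.0995, " flower": 0.0421, " rose": 0.0187}

def sample_next_token(token_probs, rng):
    """Draw one token with probability proportional to its weight."""
    tokens = list(token_probs)
    weights = list(token_probs.values())
    # random.choices accepts unnormalized weights
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the draws are repeatable
draws = [sample_next_token(candidates, rng) for _ in range(1000)]
for token in candidates:
    print(f"{token!r:14} drawn {draws.count(token)} times")
```

The most probable token is drawn most often, but the others still appear, which is where the variety between runs comes from.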


## Using Templated Prompts

In [21]:
topics = ["remote work", "artificial intelligence", "meditation"]

for topic in topics:
    prompt = f"Write a brief summary about {topic}:"
    print(f"\nSummary for {topic}:")
    output = generate_text(prompt, max_new_tokens=100, temperature=0.5)
    print(output)



Summary for remote work:
assistant
Remote work, also known as telecommuting or working from home, refers to the practice of performing one's job duties and responsibilities outside of a traditional office setting. This approach allows employees to work from any location with reliable internet access, using digital tools and communication platforms to collaborate with colleagues and complete tasks.

Key aspects of remote work include:

1. Flexibility: Workers can often set their own schedules, which can help balance work and personal life.
2. Cost savings: Reduced commuting time and expenses

Summary for artificial intelligence:
assistant
Artificial Intelligence (AI) is a field of computer science that aims to create intelligent machines capable of performing tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. AI systems can be broadly categorized into two types: narrow or specialized AI, which is de

## Using Structured Output (JSON)

In [29]:
import json
from pydantic import BaseModel
from typing import List

class TopicAnalysis(BaseModel):
    topic: str
    pros: List[str]
    cons: List[str]
    summary: str
    key_points: List[str]

def generate_structured_real(topic):
    prompt = f"""
    Analyze {topic} and return a JSON object with this exact structure:
    {{
        "topic": "{topic}",
        "pros": ["pro1", "pro2", "pro3"],
        "cons": ["con1", "con2", "con3"], 
        "summary": "brief summary here",
        "key_points": ["point1", "point2", "point3"]
    }}
    
    Return only valid JSON, no other text.
    """
    
    response = generate_text(prompt, temperature=0.1, max_new_tokens=300)
    
    # Clean up the response: strip the stray "assistant" role label if present
    cleaned_response = response
    if cleaned_response.startswith("assistant"):
        cleaned_response = cleaned_response[len("assistant"):].strip()

    # Print the JSON response
    print("JSON Response:")
    print(cleaned_response)
    print()

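    # Parse and validate with Pydantic
    try:
        data = json.loads(cleaned_response)
        return TopicAnalysis(**data)
    except (ValueError, TypeError) as err:
        # json.JSONDecodeError and pydantic's ValidationError both subclass ValueError;
        # TypeError covers JSON that is not an object
        print(f"Raw response: {response}")
        print(f"Cleaned response: {cleaned_response}")
        raise ValueError("Model didn't return valid JSON matching TopicAnalysis") from err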

# Run it
result = generate_structured_real("remote work")

print("Structured data:")
print(result.pros)       # list of pros
print(len(result.cons))  # number of cons

JSON Response:
{
    "topic": "remote work",
    "pros": ["increased flexibility", "improved work-life balance", "cost savings on commuting"],
    "cons": ["potential for reduced social interaction", "difficulty in maintaining focus", "limited collaboration opportunities"],
    "summary": "Remote work offers several benefits such as increased flexibility and improved work-life balance, but it also presents challenges like reduced social interaction and difficulty in maintaining focus.",
    "key_points": ["increased flexibility", "improved work-life balance", "potential for reduced social interaction", "difficulty in maintaining focus", "limited collaboration opportunities"]
}

Structured data:
['increased flexibility', 'improved work-life balance', 'cost savings on commuting']
3
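
Small models do not always honor "Return only valid JSON": they may add a sentence before the object or wrap it in a Markdown code fence, and `json.loads` then fails. One tolerant approach, sketched below, is to parse only the span from the first `{` to the last `}`. `extract_json_object` is a hypothetical helper, and it assumes the reply contains a single top-level JSON object.

```python
import json

def extract_json_object(text):
    """Parse the span from the first '{' to the last '}' in text as JSON."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in model output")
    return json.loads(text[start:end + 1])

# A made-up reply with extra text around the JSON
reply = 'assistant\nHere is the analysis:\n{"topic": "remote work", "pros": ["flexibility"]}\nHope this helps!'
data = extract_json_object(reply)
print(data["topic"])  # remote work
```

The resulting dictionary could then be passed to `TopicAnalysis(**data)` for validation as before.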


<a id='performance'></a>

# Tips for Local Inference

1. **Model Selection**
   - Start small: Test with 1-3B models first
   - Match to RAM: ~2 GB of RAM per billion parameters in float16 (about 4 GB in float32, as used on CPU here), plus working memory
   - Quality vs. speed: Larger models generally give better output but run slower

2. **Performance Optimization**
   - Close other applications to free RAM
   - Reduce `max_new_tokens` for faster responses
   - Lower the temperature for more focused output (it does not make generation faster)

3. **When to Use Local vs Cloud**
   - **Local**: Privacy-sensitive data, offline work, no usage limits
   - **Cloud**: Need larger models, faster inference, GPU acceleration
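
The rule of thumb of roughly 2 GB per billion parameters follows from storing each weight in 2 bytes (float16). A rough sketch of the arithmetic, using decimal gigabytes and counting weights only; activations and the generation cache add to this, so treat the result as a lower bound. `estimate_weight_memory_gb` is a hypothetical helper.

```python
# Bytes used to store one parameter in each data type
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def estimate_weight_memory_gb(params_billions, dtype="float16"):
    """Approximate memory for the weights alone, in decimal GB."""
    # (params_billions * 1e9 params) * bytes per param / (1e9 bytes per GB)
    return params_billions * BYTES_PER_PARAM[dtype]

for dtype in ("float16", "float32"):
    print(f"3B model in {dtype}: ~{estimate_weight_memory_gb(3, dtype):.0f} GB")
```

This is why the CPU path in this notebook, which loads the model in float32, needs roughly twice the memory of the MPS and CUDA paths.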

---

## 🌟 Stretch Goals 

With Hugging Face’s transformers library, you can try out a variety of pretrained and fine-tuned models. If you finish early, explore some of these challenges:

1. Sentiment Analysis → Analyze the sentiment of AITA posts. (Hint: distilbert-base-uncased-finetuned-sst-2-english)
2. Text Classification → Classify Reddit posts by topic or category. (Hint: search Hugging Face for “text classification”)
3. Question Answering → Ask questions about an AITA post and see if the model can extract an answer. (Hint: deepset/roberta-base-squad2)
4. Summarization → Generate concise summaries of posts. (Hint: facebook/bart-large-cnn)
5. Translation → Try translating posts into another language. (Hint: Helsinki-NLP opus-mt models)

In [None]:
# Space for stretch goals
