# Running Inference Locally

Large Language Models (LLMs) have revolutionized AI applications, but they don't always need to be accessed through cloud APIs. In this lesson, we'll explore how to download, save, and run LLMs locally in your development environment.

## Learning Objectives

By the end of this notebook, you will be able to:
- Download and save LLMs for offline use
- Load models from local storage
- Generate text with various parameters
- Understand the tradeoffs between local and cloud-based inference

## Setup: Install Required Libraries

In [None]:
# Install necessary libraries (if not already installed)
# - Use copy link mode in containers to avoid hardlink warnings
# - Install PyTorch CPU build compatible with your Python
%env UV_LINK_MODE=copy
!uv pip install --upgrade pip
!uv pip install --extra-index-url https://download.pytorch.org/whl/cpu torch torchvision torchaudio
!uv pip install transformers

## Verify Installation

In [None]:
# Quick verification of installed packages
try:
    import torch
    import transformers
    print("✓ PyTorch version:", torch.__version__)
    print("✓ CUDA available:", torch.cuda.is_available())
    print("✓ Transformers version:", transformers.__version__)
    print("\nSetup complete! You're ready to work with local LLMs.")
except Exception as e:
    print("✗ Error during verification:")
    import traceback
    traceback.print_exc()
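
If importing the packages is slow or fails for unrelated reasons, the installed versions can also be read from package metadata without importing anything. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is an assumption introduced for illustration and is not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in the active environment
            versions[name] = None
    return versions


# Report on the two packages this notebook depends on
report = installed_versions(["torch", "transformers"])
for name, version in report.items():
    status = version if version is not None else "not installed"
    print(f"{name}: {status}")
```

This only confirms that the packages are present; it does not verify that they import cleanly, which is what the cell above checks.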

## Understanding Local LLMs

Running LLMs locally offers several advantages:
- **Privacy**: Your data doesn't leave your environment
- **Cost**: No per-token API charges
- **Latency**: No network round trip per request (generation speed then depends on your local hardware)
- **Customization**: Full control over model parameters
- **Offline capability**: Works without internet connection

However, local LLMs also have limitations:
- **Hardware requirements**: Models need sufficient RAM and compute resources
- **Model size**: Smaller models fit locally but may have reduced capabilities
- **Updates**: You manage model versions yourself
- **Initial download**: First-time setup requires downloading the model

### Downloading a Model

Let's start by downloading **DistilGPT2**, a distilled version of GPT-2 that's lightweight and perfect for learning.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import os

# Set the directory where you want to save the model
save_directory = "./downloaded_model"

# Create the directory if it doesn't exist
os.makedirs(save_directory, exist_ok=True)

# Download model and tokenizer from Hugging Face Hub
print("Downloading model from Hugging Face Hub...")
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Display model information
print(f"\nModel: {model_name}")
print(f"Number of parameters: {model.num_parameters():,}")
# Rough estimate that assumes float32 weights (4 bytes per parameter)
print(f"Model size on disk: ~{model.num_parameters() * 4 / (1024 * 1024):.2f} MB (estimated)")

# Save the model and tokenizer to the specified directory
print(f"\nSaving model to {save_directory}...")
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
print("Model and tokenizer saved successfully!")
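
The size printed above is an estimate derived from the parameter count. To compare it with what was actually written, the save directory can be inspected directly. This is a minimal sketch using only the standard library; the helper names are assumptions introduced here, and the exact set of files found depends on the installed library version.

```python
from pathlib import Path


def list_saved_files(directory):
    """Return the relative paths of all files under the directory, sorted."""
    root = Path(directory)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


def directory_size_mb(directory):
    """Return the total size of all files under the directory, in MiB."""
    root = Path(directory)
    total_bytes = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return total_bytes / (1024 ** 2)


# Same path as save_directory in the cell above
model_dir = Path("./downloaded_model")

if model_dir.is_dir():
    for name in list_saved_files(model_dir):
        print(name)
    print(f"Actual size on disk: {directory_size_mb(model_dir):.2f} MB")
else:
    print(f"{model_dir} not found; run the download cell first.")
```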

## Loading and Using a Local Model

Once saved, we can load the model from local storage instead of downloading it again. This is especially useful for larger models or when working in environments with limited internet access.

In [None]:
# Load model from local directory instead of downloading again
print("Loading model from local directory...")
local_model = AutoModelForCausalLM.from_pretrained(save_directory)
local_tokenizer = AutoTokenizer.from_pretrained(save_directory)

# Set pad token to avoid warnings during generation
local_tokenizer.pad_token = local_tokenizer.eos_token

print("Model loaded successfully from local directory!")

# Set device (CPU for most codespace environments) and move the model to it
device = "cpu"
local_model.to(device)
print(f"Using device: {device}")

## Generating Text with Your Local LLM

Now let's create a text generation function with customizable parameters. Understanding these parameters is crucial for controlling the model's output:

- **Temperature**: Controls randomness (higher = more creative, lower = more deterministic)
  - `< 1.0`: More focused and deterministic
  - `= 1.0`: Default randomness
  - `> 1.0`: More creative and unpredictable

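- **Max length**: The maximum total length of the output sequence in tokens, counting the prompt. The number of newly generated tokens is therefore at most `max_length` minus the prompt length; `max_new_tokens` is the alternative setting that caps only the generated tokens.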

- **Top-p (nucleus sampling)**: Limits token selection to the smallest set of most likely tokens whose cumulative probability reaches p
  - Range: 0.0 to 1.0
  - Lower values = more focused responses

- **Top-k**: Limits selection to the k most likely tokens at each step
  - Common values: 20-100

- **do_sample**: Whether to use sampling (True) or greedy decoding (False)
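
The top-k and top-p rules described above can be illustrated on a toy next-token distribution. This is a simplified sketch in plain Python, not the library's implementation (which works on tensors of logits); the function names and the four-token distribution are assumptions made up for the example.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical next-token distribution, for illustration only
toy_probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}

print("top_k=2:  ", sorted(top_k_filter(toy_probs, 2)))
print("top_p=0.9:", sorted(top_p_filter(toy_probs, 0.9)))
```

Both filters remove the unlikely tail before sampling. Top-k always keeps a fixed number of candidates, while top-p keeps more candidates when the distribution is flat and fewer when it is peaked.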

In [2]:
import re

def generate_text(prompt,
                  max_length=50,
                  temperature=0.8,
                  top_p=0.9,
                  top_k=50,
                  do_sample=True):
    """
    Generate text from a prompt with customizable parameters.
    
    Args:
        prompt (str): The input text to continue
        max_length (int): Maximum total length in tokens (prompt plus generated tokens)
        temperature (float): Controls randomness (>1.0 = more random, <1.0 = more deterministic)
        top_p (float): Nucleus sampling parameter (0.0-1.0)
        top_k (int): Limits selection to k most likely tokens
        do_sample (bool): If False, uses greedy decoding instead of sampling
        
    Returns:
        str: The generated text including the prompt
    """
    # Prepare the inputs
    inputs = local_tokenizer(
        prompt, 
        return_tensors="pt",
        return_attention_mask=True
    ).to(device)

    # Generate text
    output = local_model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=max_length,
        do_sample=do_sample,
        temperature=temperature if do_sample else None,
        top_p=top_p if do_sample else None,
        top_k=top_k if do_sample else None,
        pad_token_id=local_tokenizer.pad_token_id
    )

    # Decode and clean the output
    generated_text = local_tokenizer.decode(output[0], skip_special_tokens=True)
    
    # Clean up excess whitespace
    cleaned_text = re.sub(r'\s+', ' ', generated_text).strip()

    return cleaned_text

print("✓ Text generation function defined successfully!")

✓ Text generation function defined successfully!


## Experimenting with Different Generation Parameters

Let's see how different parameters affect the model's output. We'll use the same prompt with varying settings to observe the differences.

In [None]:
prompt = "Welcome to Fundamentals of AI Engineering on LinkedIn Learning. This class"

print("=" * 100)
print("Example 1: Default parameters (temperature=0.8)")
print("=" * 100)
print(f"Prompt: \"{prompt}\"")
print(f"Generated: {generate_text(prompt, max_length=75)}")

print("\n" + "=" * 100)
print("Example 2: Low temperature (more deterministic)")
print("=" * 100)
print(f"Prompt: \"{prompt}\"")
print(f"Generated: {generate_text(prompt, temperature=0.2, max_length=75)}")

print("\n" + "=" * 100)
print("Example 3: High temperature (more creative/random)")
print("=" * 100)
print(f"Prompt: \"{prompt}\"")
print(f"Generated: {generate_text(prompt, temperature=1.5, max_length=75)}")

print("\n" + "=" * 100)
print("Example 4: Greedy decoding (no sampling)")
print("=" * 100)
print(f"Prompt: \"{prompt}\"")
print(f"Generated: {generate_text(prompt, do_sample=False, max_length=75)}")
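
Because `max_length` counts the prompt tokens too, the room left for new text in the examples above is smaller than 75 tokens. The arithmetic can be sketched as follows; the helper name and the prompt length of 15 tokens are hypothetical, since the real count depends on how the tokenizer splits the prompt.

```python
def new_token_budget(prompt_token_count, max_length):
    """Tokens left for generation when max_length also counts the prompt."""
    return max(0, max_length - prompt_token_count)


# Hypothetical prompt length; the real value comes from the tokenizer
hypothetical_prompt_tokens = 15

for max_length in (50, 75):
    budget = new_token_budget(hypothetical_prompt_tokens, max_length)
    print(f"max_length={max_length}: up to {budget} new tokens")
```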

## Key Observations

**Temperature Effects:**
- Low temperature (0.2): Produces more consistent, predictable outputs
- High temperature (1.5): Creates more varied and creative (sometimes incoherent) outputs
- Greedy decoding: Always selects the most likely token, producing deterministic results
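
The temperature effect can be reproduced numerically: dividing the logits by the temperature before the softmax sharpens the distribution when the temperature is below 1.0 and flattens it when above. The sketch below uses the three temperatures from the examples; the logits are hypothetical values, not output from DistilGPT2.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits for four candidate tokens
toy_logits = [2.0, 1.0, 0.5, 0.0]

for temperature in (0.2, 0.8, 1.5):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(f"temperature={temperature}: top-token probability = {max(probs):.3f}")

# Greedy decoding ignores temperature and picks the highest-scoring token
greedy_choice = max(range(len(toy_logits)), key=toy_logits.__getitem__)
print(f"greedy choice: token index {greedy_choice}")
```

As the temperature rises, probability mass shifts from the top token to the alternatives, which is why high-temperature samples vary more.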

**Best Practices:**
- Use low temperature (0.2-0.5) for factual or structured tasks
- Use medium temperature (0.7-1.0) for a balance of coherence and variety
- Use high temperature (above 1.0) for creative writing or brainstorming
- Use greedy decoding when you need reproducible outputs
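
Sampling can also be made repeatable by fixing the random seed, which keeps the variety of sampled output while still allowing a run to be reproduced. The toy sketch below shows the idea with Python's standard `random` module; in the notebook's stack the analogous step would be seeding the framework (for example with `torch.manual_seed` or `transformers.set_seed`) before calling `generate`. The function names and the token distribution are assumptions for illustration.

```python
import random


def sample_tokens(probs, num_tokens, seed=None):
    """Draw tokens according to their probabilities, optionally with a fixed seed."""
    rng = random.Random(seed)
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=num_tokens)


def greedy_tokens(probs, num_tokens):
    """Always pick the single most likely token."""
    best = max(probs, key=probs.get)
    return [best] * num_tokens


# Hypothetical next-token distribution, for illustration only
toy_probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}

print("greedy:        ", greedy_tokens(toy_probs, 5))
print("seeded sample: ", sample_tokens(toy_probs, 5, seed=42))
print("same seed again:", sample_tokens(toy_probs, 5, seed=42))
```

This toy samples every position from the same distribution; a real model recomputes the distribution after each token, but the role of the seed is the same.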