# Running Inference Locally

Large Language Models (LLMs) have revolutionized AI applications, but they don't always need to be accessed through cloud APIs. In this lesson, we'll explore how to download, save, and run LLMs locally in your development environment.

In [1]:
# Install necessary libraries (if not already installed)
# - Use copy link mode in containers to avoid hardlink warnings
# - Install PyTorch CPU build compatible with your Python
%env UV_LINK_MODE=copy
!uv pip install --upgrade pip
!uv pip install --extra-index-url https://download.pytorch.org/whl/cpu torch torchvision torchaudio
!uv pip install transformers

env: UV_LINK_MODE=copy
Using Python 3.12.11 environment at: /workspaces/fundamentals-of-ai-engineering-principles-and-practical-applications-6026542/.venv
Resolved 1 package in 176ms
Audited 1 package in 0.12ms
Using Python 3.12.11 environment at: /workspaces/fundamentals-of-ai-engineering-principles-and-practical-applications-6026542/.venv
Audited 3 packages in 8ms
Using Python 3.12.11 environment at: /workspaces/fundamentals-of-ai-engineering-principles-and-practical-applications-6026542/.venv
Audited 1 package in 6ms


## Understanding Local LLMs

Running LLMs locally offers several advantages:
- **Privacy**: Your data doesn't leave your environment
- **Cost**: No per-token API charges
- **Latency**: No network delays
- **Customization**: Full control over model parameters

However, local LLMs also have limitations:
- **Hardware requirements**: Models need enough RAM to hold their weights, and larger models usually need a GPU to run at a usable speed
- **Model size**: Smaller models fit locally but may have reduced capabilities
- **Updates**: You manage model versions yourself
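
To make the hardware requirement concrete, here is a minimal sketch that estimates how much memory a model's weights occupy for a given parameter count and numeric format. The helper name `estimate_weight_size_mb` and the `BYTES_PER_PARAM` table are illustrative assumptions, not part of any library, and the estimate covers the weights only, not activations or runtime overhead. The parameter count used is the one this notebook reports for DistilGPT2.

```python
# Assumed storage width, in bytes, for each numeric format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def estimate_weight_size_mb(num_parameters: int, dtype: str = "float32") -> float:
    """Rough size of the weights alone, in MB (1 MB = 1024 * 1024 bytes)."""
    return num_parameters * BYTES_PER_PARAM[dtype] / (1024 * 1024)


# Parameter count reported for DistilGPT2 later in this notebook.
distilgpt2_parameters = 81_912_576

for dtype in BYTES_PER_PARAM:
    size_mb = estimate_weight_size_mb(distilgpt2_parameters, dtype)
    print(f"{dtype}: ~{size_mb:.2f} MB")
```

Halving the bytes per parameter halves the estimate, which is why lower-precision formats are a common way to fit a model onto limited hardware.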

Let's start by downloading a small LLM called DistilGPT2, a distilled version of GPT-2.

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import os

# Set the directory where you want to save the model
save_directory = "./downloaded_model"  # Change this to your preferred path

# Create the directory if it doesn't exist
os.makedirs(save_directory, exist_ok=True)

# Load model and tokenizer
print("Downloading model from Hugging Face Hub...")
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Print model information
# The size estimate assumes each parameter is stored as a 4-byte float32 value
print(f"\nModel: {model_name}")
print(f"Number of parameters: {model.num_parameters():,}")
print(
    f"Model size on disk: ~{model.num_parameters() * 4 / (1024 * 1024):.2f} MB (estimated)")

# Save the model and tokenizer to the specified directory
print(f"\nSaving model to {save_directory}...")
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
print("Model and tokenizer saved successfully!")

Downloading model from Hugging Face Hub...


model.safetensors:   0%|          | 0.00/353M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]


Model: distilgpt2
Number of parameters: 81,912,576
Model size on disk: ~312.47 MB (estimated)

Saving model to ./downloaded_model...
Model and tokenizer saved successfully!


## Loading and Using a Local Model

Once saved, we can load the model from local storage instead of downloading it again. This is especially useful for larger models or when working in environments with limited internet access.
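
Before loading, it can help to confirm that the save step actually wrote the files you expect. The sketch below is one way to do that with the standard library. The helper `missing_model_files` is hypothetical, and the file names in `EXPECTED_FILES` are an assumption about what recent versions of `save_pretrained` write; the exact set can vary with the library version and the model.

```python
from pathlib import Path

# Assumed file names; the actual set depends on the library version and model.
EXPECTED_FILES = ("config.json", "model.safetensors")


def missing_model_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_model_files("./downloaded_model")
if missing:
    print("Missing files:", missing)
else:
    print("All expected files are present.")
```

If you also want to be sure that no network request is attempted, `from_pretrained` accepts a `local_files_only=True` argument.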

In [4]:
# Now we can load from local directory instead of downloading again
print("Loading model from local directory...")
local_model = AutoModelForCausalLM.from_pretrained(save_directory)
local_tokenizer = AutoTokenizer.from_pretrained(save_directory)
local_tokenizer.pad_token = local_tokenizer.eos_token
print("Model loaded from local directory!")

# Codespaces usually run on CPU-only machines, so we use the CPU here
device = "cpu"
local_model.to(device)  # make sure the model lives on the same device as its inputs
print(f"\nUsing device: {device}")

Loading model from local directory...
Model loaded from local directory!

Using device: cpu


## Generating Text with Your Local LLM

Now let's create a more versatile text generation function that allows us to control various parameters:

- **Temperature**: Controls randomness (higher = more creative, lower = more deterministic)
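- **Max length**: The maximum total number of tokens in the output, counting the prompt as well as the newly generated tokens
- **Top-p (nucleus sampling)**: Limits token selection to the smallest set of most likely tokens whose cumulative probability reaches p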
- **Top-k**: Limits selection to the k most likely tokens
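
The sampling parameters above can be sketched on a toy distribution in plain Python. This is an illustration of the ideas only, not the Transformers implementation: the function names and the example `logits` are assumptions made up for this sketch, and the real library works on tensors and differs in detail.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely token indices and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a four-token vocabulary

for temperature in (0.2, 0.8, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(f"temperature={temperature}: {[round(x, 3) for x in probs]}")

probs = softmax_with_temperature(logits)
print("top-k (k=2):", top_k_filter(probs, 2))
print("top-p (p=0.9):", top_p_filter(probs, 0.9))
```

At a low temperature almost all probability mass lands on the highest-scoring token, which is why low-temperature output looks close to greedy decoding.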

In [6]:
def generate_text(prompt,
                  max_length=50,
                  temperature=0.8,
                  top_p=0.9,
                  top_k=50,
                  do_sample=True):
    """Generate text from a prompt with customizable parameters
    
    Args:
        prompt (str): The input text to continue
        max_length (int): Maximum total length in tokens (prompt plus generated text)
        temperature (float): Higher values (>1.0) increase randomness, lower values (<1.0) make it more deterministic
        top_p (float): Nucleus sampling parameter (0-1.0)
        top_k (int): Limits selection to k most likely tokens
        do_sample (bool): If False, uses greedy decoding instead of sampling
        
    Returns:
        str: The generated text including the prompt
    """
    # Prepare the inputs
    inputs = local_tokenizer(prompt, return_tensors="pt",
                             return_attention_mask=True).to(device)

    # Generate text
    output = local_model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=max_length,
        do_sample=do_sample,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        pad_token_id=local_tokenizer.pad_token_id
    )

    # Decode the output
    generated_text = local_tokenizer.decode(
        output[0], skip_special_tokens=True)
    
    # Clean up excess whitespace with regex
    import re
    cleaned_text = re.sub(r'\s+', ' ', generated_text)

    return cleaned_text
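
Because `max_length` counts the prompt as well as the generated tokens, a long prompt leaves less room for new text. The sketch below makes that arithmetic explicit. The helper `new_token_budget` is hypothetical and the prompt lengths are made-up values for illustration.

```python
def new_token_budget(prompt_token_count: int, max_length: int) -> int:
    """Tokens left for generation once the prompt has been counted."""
    return max(0, max_length - prompt_token_count)


# Illustrative prompt lengths, in tokens, against the max_length used in this lesson.
for prompt_tokens in (15, 50, 80):
    budget = new_token_budget(prompt_tokens, max_length=75)
    print(f"prompt of {prompt_tokens} tokens -> room for {budget} new tokens")
```

If you would rather set the number of new tokens directly, `generate` also accepts `max_new_tokens`, which does not count the prompt.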

## Experimenting with Different Generation Parameters

Let's try generating text with different parameters to see how they affect the output:

In [7]:
prompt = "Welcome to Fundamentals of AI Engineering on LinkedIn Learning. This class"

print("Example 1: Default parameters (temperature=0.8)")
print(f"Prompt: \"{prompt}\"")
print(f"Generated: {generate_text(prompt, max_length=75)}")
print("-"*100 + "\n")
print("Example 2: Low temperature (more deterministic)")
print(f"Prompt: \"{prompt}\"")
print(f"Generated: {generate_text(prompt, temperature=0.2, max_length=75)}")
print("-"*100 + "\n")
print("Example 3: High temperature (more creative/random)")
print(f"Prompt: \"{prompt}\"")
print(f"Generated: {generate_text(prompt, temperature=1.5, max_length=75)}")
print("-"*100 + "\n")
print("Example 4: Greedy decoding (no sampling, always selects most likely token)")
print(f"Prompt: \"{prompt}\"")
print(f"Generated: {generate_text(prompt, top_p=None, temperature=None, do_sample=False, max_length=75)}")

Example 1: Default parameters (temperature=0.8)
Prompt: "Welcome to Fundamentals of AI Engineering on LinkedIn Learning. This class"
Generated: Welcome to Fundamentals of AI Engineering on LinkedIn Learning. This class will cover a wide range of topics. Topics include: 
----------------------------------------------------------------------------------------------------

Example 2: Low temperature (more deterministic)
Prompt: "Welcome to Fundamentals of AI Engineering on LinkedIn Learning. This class"
Generated: Welcome to Fundamentals of AI Engineering on LinkedIn Learning. This class is designed to help you learn how t

In [8]:
# Quick verification of installs
import traceback

try:
    import torch
    import transformers
    print("torch:", torch.__version__, "cuda:", torch.cuda.is_available())
    print("transformers:", transformers.__version__)
except Exception:
    # Show the full traceback so a failed or broken install is easy to diagnose
    traceback.print_exc()

torch: 2.9.0+cpu cuda: False
transformers: 4.57.1
