# LLM Fundamentals: Understanding the Basics

This notebook covers the fundamental concepts of working with Large Language Models (LLMs).

## Topics Covered:
1. Setup and installation
2. Basic API calls
3. Understanding tokens and tokenization
4. Temperature and sampling parameters
5. System messages and conversation context
6. Streaming responses
7. Cost estimation
8. Testing our custom utils
9. Best practices summary

---

## 1. Setup and Installation

First, let's import the necessary libraries and set up our environment.

In [2]:
import os
import sys
from dotenv import load_dotenv
from openai import OpenAI
import tiktoken

# Add parent directory to path to import our utils
sys.path.append('..')
from utils.config import Config
from utils.text_processing import count_tokens
from utils.performance import CostEstimator, timer

# Load environment variables
load_dotenv()

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

print("✅ Setup complete!")

✅ Setup complete!


## 2. Your First LLM API Call

Let's start with a simple API call to understand the basic structure.

In [3]:
def simple_completion(prompt: str, model: str = "gpt-3.5-turbo"):
    """Make a simple completion request."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content

# Test it out
prompt = "Explain what a large language model is in one sentence."
response = simple_completion(prompt)

print(f"Prompt: {prompt}")
print(f"\nResponse: {response}")

Prompt: Explain what a large language model is in one sentence.

Response: A large language model is a type of artificial intelligence system that uses vast amounts of data and complex algorithms to understand and generate human language.


### Understanding the Response Object

Let's examine what information the API returns.

In [4]:
# Get full response object
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)

print("=" * 60)
print("RESPONSE OBJECT STRUCTURE")
print("=" * 60)
print(f"\nModel used: {response.model}")
print(f"ID: {response.id}")
print(f"Created: {response.created}")
print(f"\nToken usage:")
print(f"  - Prompt tokens: {response.usage.prompt_tokens}")
print(f"  - Completion tokens: {response.usage.completion_tokens}")
print(f"  - Total tokens: {response.usage.total_tokens}")
print(f"\nFinish reason: {response.choices[0].finish_reason}")
print(f"\nMessage content: {response.choices[0].message.content}")

RESPONSE OBJECT STRUCTURE

Model used: gpt-3.5-turbo-0125
ID: chatcmpl-CnQ2XT8zIKlExoqmWIgsD8Vrq9rsG
Created: 1765894205

Token usage:
  - Prompt tokens: 9
  - Completion tokens: 9
  - Total tokens: 18

Finish reason: stop

Message content: Hello! How can I assist you today?
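The `finish_reason` field is worth checking in code, because `stop` means the model ended on its own while `length` means the output ran into a token limit. A small sketch follows, using a stand-in object instead of a live API response; `describe_finish_reason` is a hypothetical helper name, not part of this notebook's `utils`.

```python
from types import SimpleNamespace


def describe_finish_reason(response) -> str:
    """Translate the first choice's finish_reason into a short explanation."""
    reason = response.choices[0].finish_reason
    explanations = {
        "stop": "model finished naturally or hit a stop sequence",
        "length": "output was cut off by max_tokens or the context limit",
        "content_filter": "content was omitted by a content filter",
        "tool_calls": "model requested a tool call",
    }
    return explanations.get(reason, f"unrecognized finish reason: {reason}")


# Stand-in that mimics only the attributes the helper reads
fake_response = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(describe_finish_reason(fake_response))
```

With a real response, the same helper can be called directly on the object returned by `client.chat.completions.create`.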


## 3. Understanding Tokens and Tokenization

Tokens are the fundamental units that LLMs process. Understanding tokenization is crucial for:
- Cost estimation (pricing is per token)
- Context window management
- Prompt engineering

### What is a Token?
- A token is a piece of text (not exactly a word)
- Common words = 1 token (e.g., "cat")
- Uncommon words = multiple tokens (e.g., "unconventional" ≈ 3 tokens)
- 1 token ≈ 4 characters in English
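The 4-characters-per-token rule of thumb can be turned into a quick estimator. The sketch below is a heuristic only (`rough_token_estimate` is a hypothetical helper, not part of `utils`); the "measured" counts are the ones tiktoken reports in this section's cell output.

```python
import math


def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate token count from character length (English-oriented heuristic)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


# Measured values are copied from the tokenizer output in this section
samples = {
    "Hello, world!": 4,
    "The quick brown fox jumps over the lazy dog.": 10,
    "人工智能正在改变世界": 11,
}
for text, measured in samples.items():
    estimate = rough_token_estimate(text)
    print(f"{estimate:>3} estimated vs {measured:>3} measured: {text}")
```

The estimate tracks the English samples reasonably well but badly underestimates the Chinese one, so a real tokenizer is the safer choice for non-English text.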

In [5]:
def count_and_display_tokens(text: str, model: str = "gpt-4"):
    """Count tokens and show token breakdown."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)

    print(f"Text: '{text}'")
    print(f"Length: {len(text)} characters")
    print(f"Tokens: {len(tokens)}")
    print(f"Token IDs: {tokens[:20]}...")  # Show first 20
    print(f"Chars per token: {len(text) / len(tokens):.2f}")
    print("-" * 60)

# Test with different texts
texts = [
    "Hello, world!",
    "The quick brown fox jumps over the lazy dog.",
    "Supercalifragilisticexpialidocious",
    "AI and ML are transforming technology.",
    "人工智能正在改变世界"  # Chinese text: "AI is changing the world"
]

for text in texts:
    count_and_display_tokens(text)

Text: 'Hello, world!'
Length: 13 characters
Tokens: 4
Token IDs: [9906, 11, 1917, 0]...
Chars per token: 3.25
------------------------------------------------------------
Text: 'The quick brown fox jumps over the lazy dog.'
Length: 44 characters
Tokens: 10
Token IDs: [791, 4062, 14198, 39935, 35308, 927, 279, 16053, 5679, 13]...
Chars per token: 4.40
------------------------------------------------------------
Text: 'Supercalifragilisticexpialidocious'
Length: 34 characters
Tokens: 11
Token IDs: [10254, 3035, 278, 333, 4193, 321, 4633, 4683, 532, 307, 78287]...
Chars per token: 3.09
------------------------------------------------------------
Text: 'AI and ML are transforming technology.'
Length: 38 characters
Tokens: 7
Token IDs: [15836, 323, 20187, 527, 46890, 5557, 13]...
Chars per token: 5.43
------------------------------------------------------------
Text: '人工智能正在改变世界'
Length: 10 characters
Tokens: 11
Token IDs: [17792, 49792, 45114, 118, 27327, 97655, 23226, 

### Token Counting with Our Custom Utility

In [6]:
# Using our custom utility
text = """Large Language Models (LLMs) are advanced AI systems trained on vast amounts
of text data. They can understand and generate human-like text, making them useful
for tasks like translation, summarization, and question-answering."""

token_count = count_tokens(text, model="gpt-4")
print(f"Text length: {len(text)} characters")
print(f"Estimated tokens: {token_count}")
print(f"\nText preview:")
print(text)

Text length: 226 characters
Estimated tokens: 48

Text preview:
Large Language Models (LLMs) are advanced AI systems trained on vast amounts
of text data. They can understand and generate human-like text, making them useful
for tasks like translation, summarization, and question-answering.


## 4. Temperature and Sampling Parameters

Temperature controls the randomness of the model's output:
- **Temperature = 0**: Near-deterministic; the model almost always picks the most likely token (good for factual tasks)
- **Temperature = 0.3-0.7**: Balanced creativity and consistency
- **Temperature = 1.0+**: More random and creative (good for creative writing)
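One common way to picture temperature is as a divisor applied to the model's raw scores (logits) before the softmax. The sketch below uses that textbook formulation with made-up logits; it is an illustration, not a description of how the API implements sampling internally, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the top logit
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # illustrative scores for three candidate tokens
for temperature in (0.0, 0.7, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, while higher temperatures flatten the distribution so less likely tokens are sampled more often.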

In [7]:
def compare_temperatures(prompt: str, temperatures: list):
    """Compare outputs at different temperatures."""
    print(f"Prompt: {prompt}")
    print("=" * 80)

    for temp in temperatures:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=temp,
            max_tokens=100
        )

        print(f"\n🌡️  Temperature: {temp}")
        print(f"Response: {response.choices[0].message.content}")
        print("-" * 80)

# Test with a factual question
compare_temperatures(
    "What is the capital of France?",
    temperatures=[0.0, 0.7, 1.5]
)

Prompt: What is the capital of France?

🌡️  Temperature: 0.0
Response: The capital of France is Paris.
--------------------------------------------------------------------------------

🌡️  Temperature: 0.7
Response: The capital of France is Paris.
--------------------------------------------------------------------------------

🌡️  Temperature: 1.5
Response: The capital of France is Paris.
--------------------------------------------------------------------------------


In [8]:
# Test with a creative task
compare_temperatures(
    "Write a creative tagline for an AI startup.",
    temperatures=[0.0, 0.7, 1.5]
)

Prompt: Write a creative tagline for an AI startup.

🌡️  Temperature: 0.0
Response: "Empowering the future with intelligent technology."
--------------------------------------------------------------------------------

🌡️  Temperature: 0.7
Response: "Empowering the future with intelligent technology."
--------------------------------------------------------------------------------

🌡️  Temperature: 1.5
Response: "Evolving intelligence, extending possibilities."
--------------------------------------------------------------------------------


### Other Important Parameters

In [9]:
# max_tokens: Limit response length
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    max_tokens=50  # Short response
)
print("Short response (max_tokens=50):")
print(response.choices[0].message.content)
print(f"\nTokens used: {response.usage.completion_tokens}")

print("\n" + "="*60 + "\n")

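# top_p: Nucleus sampling, which samples only from the most probable tokens
# whose cumulative probability reaches top_p. OpenAI's docs suggest tuning
# either top_p or temperature, not both; both are set here only for illustration.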
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a haiku about AI."}],
    top_p=0.9,
    temperature=0.8
)
print("Creative output (top_p=0.9, temperature=0.8):")
print(response.choices[0].message.content)

Short response (max_tokens=50):
Quantum computing is a type of computing that uses principles of quantum mechanics to perform operations. Traditional computers use bits, which can be either a 0 or a 1, to perform calculations. Quantum computers, on the other hand, use quantum bits

Tokens used: 50


Creative output (top_p=0.9, temperature=0.8):
Artificial minds
Learning and evolving fast
Future now in hand
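Nucleus (top-p) sampling can be pictured as keeping the smallest set of highest-probability tokens whose cumulative probability reaches `top_p`, then renormalizing. The sketch below uses made-up probabilities and a hypothetical helper, `top_p_filter`; it illustrates the idea and is not the API's internal code.

```python
def top_p_filter(probs, top_p):
    """Return {token_index: renormalized_probability} for the nucleus."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept = []
    cumulative = 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:  # nucleus is complete
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}


probs = [0.5, 0.3, 0.15, 0.05]  # illustrative next-token probabilities
print(top_p_filter(probs, top_p=0.9))
```

With `top_p=0.9`, the least likely token is dropped and the remaining three are rescaled to sum to 1, so sampling can never pick the discarded tail.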


## 5. System Messages and Conversation Context

System messages set the behavior and context for the model.

In [10]:
def chat_with_system_message(system_msg: str, user_msg: str):
    """Make a call with a system message."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg}
        ]
    )
    return response.choices[0].message.content

# Example 1: Technical expert
response1 = chat_with_system_message(
    system_msg="You are a senior software engineer with expertise in Python.",
    user_msg="How do I optimize a loop that processes 1 million items?"
)
print("Technical Expert Response:")
print(response1)

print("\n" + "="*80 + "\n")

# Example 2: Simple explainer
response2 = chat_with_system_message(
    system_msg="You explain complex topics to 10-year-olds using simple language and analogies.",
    user_msg="How do I optimize a loop that processes 1 million items?"
)
print("Simple Explainer Response:")
print(response2)

Technical Expert Response:
Optimizing a loop that processes 1 million items in Python is important to ensure efficient execution and minimize processing time. Here are some tips to optimize your loop:

1. **Use List Comprehension**: List comprehension can be more efficient than traditional loops for simple operations. If you can express your processing logic using list comprehension, it can be faster.

2. **Minimize Function Calls**: If there are function calls inside the loop that are not necessary for each iteration (e.g., calling the same function with the same arguments multiple times), consider moving them outside the loop.

3. **Avoid Unnecessary Calculations**: Make sure your loop only performs calculations that are necessary. Avoid unnecessary checks or calculations inside the loop that can be done before the loop starts.

4. **Use Built-in Functions**: Utilize built-in functions like `map()`, `filter()`, and `reduce()` where appropriate. These functions are optimized for perfo

### Multi-turn Conversations

In [11]:
# Maintaining conversation context
conversation = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]

# First exchange
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=conversation
)
assistant_reply = response.choices[0].message.content
conversation.append({"role": "assistant", "content": assistant_reply})

print("User: What's the capital of France?")
print(f"Assistant: {assistant_reply}\n")

# Follow-up question (references previous context)
conversation.append({"role": "user", "content": "What's its population?"})
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=conversation
)
assistant_reply = response.choices[0].message.content

print("User: What's its population?")
print(f"Assistant: {assistant_reply}")

User: What's the capital of France?
Assistant: The capital of France is Paris.

User: What's its population?
Assistant: As of 2021, the population of Paris, France is estimated to be around 2.2 million people.
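Because the whole `conversation` list is resent on every call, its token count grows with each turn. A minimal sketch of one way to bound it, assuming a simple message-count limit (a token-based budget would be more precise); `trim_history` is a hypothetical helper.

```python
def trim_history(messages, max_messages: int):
    """Keep all system messages plus the most recent non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Guard against rest[-0:], which would return the whole list
    recent = rest[-max_messages:] if max_messages > 0 else []
    return system + recent


conversation = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's its population?"},
]
for message in trim_history(conversation, max_messages=2):
    print(message["role"], "->", message["content"])
```

Trimming too aggressively has a cost: here the follow-up "What's its population?" only makes sense if the earlier answer about Paris is still in the list.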


## 6. Streaming Responses

Streaming lets you receive the response token by token as it is generated, which reduces perceived latency.

In [12]:
def stream_completion(prompt: str):
    """Stream a completion response."""
    print("Streaming response: ", end="", flush=True)

    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )

    full_response = ""
    for chunk in stream:
        # Skip chunks with no choices or no text (e.g., the final chunk)
        if chunk.choices and chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            print(content, end="", flush=True)
            full_response += content

    print()  # New line
    return full_response

# Test streaming
response = stream_completion(
    "Explain the concept of neural networks in 3 sentences."
)

Streaming response: Neural networks are a type of machine learning model inspired by the structure of the human brain. They consist of layers of interconnected nodes, or artificial neurons, that process and learn from data. By adjusting the weights and connections between neurons, neural networks can recognize patterns and make predictions in a wide range of applications, from image recognition to natural language processing.
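The accumulation loop can be exercised offline with stand-in chunks, which is handy for testing. In the sketch below, `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake objects mimic only the attributes the loop reads; real chunks carry more fields, and some can arrive with `content` set to `None` or with an empty `choices` list.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Join the text deltas from an iterable of streamed chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices at all
            continue
        content = chunk.choices[0].delta.content
        if content:  # skip None and empty strings
            parts.append(content)
    return "".join(parts)


def fake_chunk(content):
    """Build a stand-in chunk exposing choices[0].delta.content."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


chunks = [
    fake_chunk("Neural "),
    fake_chunk("networks"),
    fake_chunk(None),               # chunk with no text
    SimpleNamespace(choices=[]),    # chunk with no choices
]
print(collect_stream(chunks))
```

Collecting parts in a list and joining once avoids repeated string concatenation on long responses.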


## 7. Cost Estimation

Understanding and tracking costs is crucial when working with LLM APIs.

In [13]:
def estimate_and_call(prompt: str, model: str = "gpt-3.5-turbo"):
    """Estimate cost before making the call."""
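    # Estimate input tokens from the raw prompt text; the API's prompt_tokens
    # also counts chat-format overhead, so the actual figure runs a bit higher
    input_tokens = count_tokens(prompt, model)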

    # Estimate cost (assuming ~150 output tokens)
    estimated_cost = CostEstimator.estimate_cost(
        model=model,
        input_tokens=input_tokens,
        output_tokens=150
    )

    print(f"Estimated input tokens: {input_tokens}")
    print(f"Estimated cost: ${estimated_cost:.6f}")
    print("-" * 60)

    # Make the actual call
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )

    # Calculate actual cost
    actual_cost = CostEstimator.estimate_cost(
        model=model,
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens
    )

    print(f"\nActual tokens:")
    print(f"  Input: {response.usage.prompt_tokens}")
    print(f"  Output: {response.usage.completion_tokens}")
    print(f"  Total: {response.usage.total_tokens}")
    print(f"Actual cost: ${actual_cost:.6f}")

    return response.choices[0].message.content

# Test it
response = estimate_and_call(
    "Explain the difference between supervised and unsupervised learning."
)
print(f"\nResponse:\n{response}")

Estimated input tokens: 12
Estimated cost: $0.000231
------------------------------------------------------------

Actual tokens:
  Input: 19
  Output: 175
  Total: 194
Actual cost: $0.000272

Response:
Supervised learning is a type of machine learning where the algorithm is trained on a labeled dataset, meaning the input data is paired with the correct output. The algorithm learns to map the input to the output based on the labeled examples provided during training. Common types of supervised learning algorithms include classification and regression.

On the other hand, unsupervised learning is a type of machine learning where the algorithm is trained on an unlabeled dataset, meaning there is no corresponding output given for the input data. The algorithm must learn the inherent patterns and structures in the data without any explicit guidance. Common types of unsupervised learning algorithms include clustering and dimensionality reduction.

In summary, the main difference between sup
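The gap between the estimated input (12 tokens) and the reported input (19 tokens) is consistent with chat-format overhead, since the API also encodes the role and some framing around each message. The sketch below follows the counting recipe published in the OpenAI Cookbook for gpt-3.5-turbo and gpt-4 era models; the constants (3 tokens per message, 3 to prime the reply) and the 1-token cost of the role string are assumptions that may not hold for other models.

```python
PROMPT = "Explain the difference between supervised and unsupervised learning."


def estimate_chat_tokens(messages, count_fn, tokens_per_message=3, reply_priming=3):
    """Estimate prompt tokens for a chat request, including format overhead."""
    total = reply_priming  # tokens that prime the assistant's reply (assumed)
    for message in messages:
        total += tokens_per_message  # per-message framing (assumed)
        for value in message.values():  # role and content are both encoded
            total += count_fn(value)
    return total


# Stand-in counter: 12 is the notebook's own estimate for PROMPT;
# 1 token for the role string "user" is an assumption.
known_counts = {"user": 1, PROMPT: 12}

messages = [{"role": "user", "content": PROMPT}]
print(estimate_chat_tokens(messages, known_counts.__getitem__))  # 3 + (3 + 1 + 12) = 19
```

In real use, `count_fn` would be a tokenizer-backed counter, for example `lambda s: len(encoding.encode(s))` with a tiktoken encoding.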

### Comparing Costs Across Models

In [14]:
# Compare costs for different models
prompt = "Write a 100-word summary of machine learning."
models = ["gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"]

input_tokens = 100  # Illustrative fixed count, not measured from `prompt`
output_tokens = 150  # Illustrative fixed count

print("Cost Comparison for Same Task:")
print("=" * 60)
for model in models:
    cost = CostEstimator.estimate_cost(model, input_tokens, output_tokens)
    print(f"{model:20s}: ${cost:.6f}")

Cost Comparison for Same Task:
gpt-3.5-turbo       : $0.000275
gpt-4               : $0.012000
gpt-4-turbo         : $0.005500
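The arithmetic behind these figures is a per-token price multiplied by the token count, summed over input and output. The sketch below is not the `CostEstimator` implementation (its source is not shown here); the per-1K-token prices are inferred from the costs printed in this notebook and may no longer match current pricing.

```python
# USD per 1,000 tokens, inferred from this notebook's printed costs (assumption)
PRICES_PER_1K = {
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-4-turbo": {"input": 0.01, "output": 0.03},
}


def estimate_cost(model, input_tokens, output_tokens, prices=PRICES_PER_1K):
    """Return the estimated request cost in USD."""
    rate = prices[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1000


for model in PRICES_PER_1K:
    print(f"{model:20s}: ${estimate_cost(model, 100, 150):.6f}")
```

Output tokens cost more than input tokens for all three models in this table, which is why capping `max_tokens` has an outsized effect on cost.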


## 8. Testing Our Custom Utils

Let's test the utility functions we created.

In [15]:
# Test timer decorator
@timer
def slow_completion(prompt: str):
    """A completion that we'll time."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

result = slow_completion("Tell me a fun fact about Python.")
print(f"\nResult: {result}")

slow_completion took 0.81 seconds

Result: Python was named after the British comedy show "Monty Python's Flying Circus" and not the snake as many people assume. Guido van Rossum, the creator of Python, is a fan of the show and named the programming language in honor of it.
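The `timer` decorator comes from `utils.performance`, whose source is not shown here. One plausible shape for such a decorator, written as an assumption and not as the actual implementation, is sketched below under the hypothetical name `simple_timer`.

```python
import functools
import time


def simple_timer(func):
    """Print how long each call to func takes, then return its result."""
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failed calls are timed too
            elapsed = time.perf_counter() - start
            print(f"{func.__name__} took {elapsed:.2f} seconds")
    return wrapper


@simple_timer
def add(a, b):
    """Trivial stand-in for a slow API call."""
    return a + b


print(add(2, 3))
```

Using `time.perf_counter` avoids the coarse resolution and clock adjustments that can affect `time.time` for short measurements.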


In [16]:
# Test Config utility
config = Config.from_env()

print("Configuration loaded from environment:")
print(f"Model: {config.get('model')}")
print(f"Temperature: {config.get('temperature')}")
print(f"Max tokens: {config.get('max_tokens')}")

# Use config in a call
response = client.chat.completions.create(
    model=config.get('model'),
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=config.get('temperature'),
    max_tokens=config.get('max_tokens')
)
print(f"\nResponse: {response.choices[0].message.content}")

Configuration loaded from environment:
Model: gpt-3.5-turbo
Temperature: 0.0
Max tokens: 2000

Response: Hello! How can I assist you today?


## 9. Best Practices Summary

### When to Use Different Settings:

| Task Type | Temperature | Model | Why |
|-----------|-------------|-------|-----|
| Factual Q&A | 0.0 - 0.3 | GPT-3.5-Turbo | Consistent, cheap |
| Data extraction | 0.0 | GPT-3.5-Turbo | Deterministic output |
| Creative writing | 0.7 - 1.0 | GPT-4 | More creative, high quality |
| Code generation | 0.0 - 0.3 | GPT-4 | Reliable, accurate |
| Summarization | 0.3 - 0.5 | GPT-3.5-Turbo | Balanced |
| Brainstorming | 0.8 - 1.2 | GPT-4 | Diverse ideas |

### Cost Optimization Tips:
1. Use GPT-3.5-Turbo for simple tasks
2. Keep prompts concise
3. Use `max_tokens` to limit output
4. Cache common responses
5. Batch similar requests
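Tip 4 can be sketched as an in-memory cache keyed on the full request. `ResponseCache` and `fake_llm` below are hypothetical names for illustration; caching makes the most sense for low-temperature calls, where repeated requests are expected to give near-identical answers.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache keyed on model, messages, and temperature."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model, messages, temperature):
        payload = json.dumps(
            {"model": model, "messages": messages, "temperature": temperature},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call_fn, temperature=0.0):
        """Return a cached response, calling call_fn only on a cache miss."""
        key = self._key(model, messages, temperature)
        if key not in self._store:
            self._store[key] = call_fn(model, messages, temperature)
        return self._store[key]


calls = []

def fake_llm(model, messages, temperature):
    """Stand-in for a real API call; records each invocation."""
    calls.append(messages)
    return "Hello! How can I assist you today?"


cache = ResponseCache()
request = [{"role": "user", "content": "Hello!"}]
first = cache.get_or_call("gpt-3.5-turbo", request, fake_llm)
second = cache.get_or_call("gpt-3.5-turbo", request, fake_llm)
print(first == second, len(calls))
```

The second lookup is served from the cache, so the stand-in API function runs only once.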

### Common Pitfalls:
1. Not counting tokens properly
2. Using GPT-4 when GPT-3.5 suffices
3. Forgetting to handle rate limits
4. Not streaming for long responses
5. Ignoring token context limits
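For pitfall 3, a common remedy is retrying with exponential backoff and jitter. The sketch below is a generic helper (`call_with_backoff` is a hypothetical name) demonstrated with a fake error and a stubbed `sleep` so it runs instantly; with the OpenAI Python SDK you would typically pass `retry_on=(openai.RateLimitError,)`.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn, retrying on the given exceptions with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries; surface the original error
            # Delay doubles each attempt, plus jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


class FakeRateLimitError(Exception):
    """Stand-in for a rate-limit error raised by an API client."""


attempts = {"count": 0}

def flaky():
    """Fail twice, then succeed."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("429: rate limit exceeded")
    return "ok"


delays = []  # record requested delays instead of actually sleeping
print(call_with_backoff(flaky, retry_on=(FakeRateLimitError,), sleep=delays.append))
```

Restricting `retry_on` to rate-limit errors matters in practice, since retrying on every exception would also repeat requests that fail for permanent reasons such as an invalid API key.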