## Using OpenAI's Language Models

This notebook provides an explanatory guide to the first component of Retrieval-Augmented Generation (RAG), focusing on leveraging OpenAI's Large Language Models (LLMs). The key areas covered include:

1. **Calling OpenAI**: Demonstrating a simple test message to understand how to interact with OpenAI's API.
2. **Chat Completion Models**: Focusing on chat completion models in "normal mode." While JSON mode can be useful in some scenarios, it is not the primary focus here.
3. **Parameter Explanation**: Detailing important parameters and making a case for setting the temperature parameter to 0 for predictable outputs.
4. **OpenAI Pricing**: Discussing the pricing model, including:
    - **Tokens**: How usage is calculated based on tokens.
    - **Different Models**: Exploring the various available models.
    - **Website Updates**: Noting some discrepancies on OpenAI's website, such as references to models that no longer exist.

This introduction will set the stage for understanding the integration of OpenAI's models within the RAG framework, ensuring reliable and accurate outputs.


In [1]:
from IPython.display import Markdown, display
from openai import OpenAI
from dotenv import load_dotenv

# Load the OpenAI API key from the .env file into the environment variable called OPENAI_API_KEY
load_dotenv()

# Instantiate the OpenAI client
client = OpenAI()

# Calling the OpenAI API

### The call:
```python
response = client.chat.completions.create()
```
- client: The object representing your OpenAI client, which interfaces with the API.
- chat: The namespace for handling chat-based requests.
- completions: A sub-namespace for generating completions in a chat model.
- create(): The method that sends a request to the API to generate a response.

### Method parameters:
- model: Specifies which model to use (e.g., GPT-4).
- messages: A list of messages representing the conversation or inputs to the model.
- max_tokens: Limits the length of the model’s output (in tokens).
- temperature: Controls the randomness or creativity of the output.
- n: The number of completions (responses) the model will generate.

### Why is ```choices[0]``` indexed at 0?
The choices array in the API response contains all the generated completions. By default, the array has just one completion (so it's accessed via choices[0]). If you specify n>1, the API returns multiple completions, and you can loop through choices to access each completion individually.

### Most important setting
- Temperature == 1 (the default) gives varied, more creative output. The API accepts values from 0 to 2.
- Temperature == 0 makes results as reproducible as the API allows.
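
To make the effect of temperature concrete, here is a minimal, self-contained sketch of temperature-scaled softmax, the general technique used to turn a model's raw scores into sampling probabilities. It illustrates the idea only and is not OpenAI's actual implementation; the function name `softmax_with_temperature` and the example scores are made up for this sketch.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding: all mass on the top score.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens
logits = [2.0, 1.0, 0.5]

for t in (1.0, 0.5, 0.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability mass moves to the highest-scoring token, until at 0 that token is always chosen.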

In [2]:
# Colors for the chat messages
PASTELS = [('Pale Red', '#ffcccc'), ('Pale Green', '#ccffcc'), ('Pale Blue', '#cceeff')]

# The conversation starts with one system message and one user message; the roles are 'system' and 'user'
TEST_MESSAGE = [
        { "role": "system",
          "content": "You are a helpful assistant."},
        {
           "role": "user",
           "content": "If Sarah is older than Tom, and Tom is older than Jane, who is the youngest and how do you know?"
        }
    ]

In [3]:
# Temperature == 1 maximizes creativity, but also randomness
completion = client.chat.completions.create(
      model="gpt-4o-mini"
    , messages=TEST_MESSAGE
    , max_tokens=100 # caps completion_tokens; the output is cut short if it would exceed this limit
    , temperature=1  # randomness of the output: 0 is (nearly) deterministic, 1 is more random (repeated calls will give different results)
    , n=3            # This will generate 3 separate completions
)

for i, choice in enumerate(completion.choices):
    color = PASTELS[i][1]
    display(Markdown(f"<div style='background-color: {color}; color: black;'>{choice.message.content}</div>"))



<div style='background-color: #ffcccc; color: black;'>Based on the information provided:

1. Sarah is older than Tom.
2. Tom is older than Jane.

From this, we can conclude the following order of ages:

- Sarah > Tom > Jane

Since Jane is at the end of this order, it indicates that Jane is the youngest. Therefore, Jane is the youngest because she is younger than both Tom and Sarah.</div>

<div style='background-color: #ccffcc; color: black;'>Based on the information provided, if Sarah is older than Tom, and Tom is older than Jane, we can establish the following order of age from oldest to youngest:

1. Sarah (oldest)
2. Tom
3. Jane (youngest)

From this order, we can conclude that Jane is the youngest because both Sarah and Tom are older than her.</div>

<div style='background-color: #cceeff; color: black;'>If Sarah is older than Tom, and Tom is older than Jane, then Jane is the youngest. 

We can determine this from the relationships given:
- Sarah > Tom (Sarah is older than Tom)
- Tom > Jane (Tom is older than Jane)

Since both Sarah and Tom are older than Jane, it logically follows that Jane must be the youngest among the three.</div>

## The exact same call, but with temperature == 0

In [4]:
# Temperature == 0 removes sampling randomness, so the three completions should match
completion = client.chat.completions.create(
      model="gpt-4o-mini"
    , messages=TEST_MESSAGE
    , max_tokens=100 # caps completion_tokens; the output is cut short if it would exceed this limit
    , temperature=0  # randomness of the output: 0 is (nearly) deterministic, 1 is more random
    , n=3            # This will generate 3 separate completions
)

for i, choice in enumerate(completion.choices):
    color = PASTELS[i][1]
    display(Markdown(f"<div style='background-color: {color}; color: black;'>{choice.message.content}</div>"))

<div style='background-color: #ffcccc; color: black;'>If Sarah is older than Tom, and Tom is older than Jane, then Jane is the youngest. 

We can determine this by analyzing the relationships:
- Sarah > Tom (Sarah is older than Tom)
- Tom > Jane (Tom is older than Jane)

From these two statements, we can infer that:
- Sarah > Tom > Jane

Since Jane is at the end of this chain, she is the youngest.</div>

<div style='background-color: #ccffcc; color: black;'>If Sarah is older than Tom, and Tom is older than Jane, then Jane is the youngest. 

We can determine this by analyzing the relationships:
- Sarah > Tom (Sarah is older than Tom)
- Tom > Jane (Tom is older than Jane)

From these two statements, we can infer that:
- Sarah > Tom > Jane

Since Jane is at the end of this chain, she is the youngest.</div>

<div style='background-color: #cceeff; color: black;'>If Sarah is older than Tom, and Tom is older than Jane, then Jane is the youngest. 

We can determine this by analyzing the relationships:
- Sarah > Tom (Sarah is older than Tom)
- Tom > Jane (Tom is older than Jane)

From these two statements, we can infer that:
- Sarah > Tom > Jane

Since Jane is at the end of this chain, she is the youngest.</div>

Using `temperature = 0` in a system that needs to be tested for answer accuracy is justified because it makes the model's outputs as deterministic and consistent as the API allows. Here's why:

1. **Determinism and Reproducibility**: A temperature of 0 makes the model choose the highest-probability token at each step for a given input. This removes sampling randomness, so the same input produces the same output in nearly all cases (small differences can still occur between runs, for example after a backend change). When you compare results across multiple runs, this means differences reflect changes in your system rather than sampling noise.

2. **Focus on Accuracy**: When testing for accuracy, it is important to avoid introducing variability. Higher temperatures introduce randomness in the sampling process, leading to more creative but less predictable outputs. With temperature set to 0, the model returns its single most likely answer. That answer is not guaranteed to be correct, but it is the stable baseline you need for evaluating accuracy.

3. **Error Identification**: With a consistent response at `temperature = 0`, it is easier to identify errors or inaccuracies in the model's predictions. If the model consistently generates incorrect answers for certain inputs, you can more reliably track patterns or weaknesses that need to be addressed.

Thus, in a testing scenario where accuracy is the primary goal, using `temperature = 0` is an effective way to get consistent, comparable results.
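
One way to put this into practice is to check whether repeated completions agree. The helper below is a minimal sketch: `all_identical` is a hypothetical name, and it takes plain strings (for example `[c.message.content for c in completion.choices]`) so it runs without an API call. The example strings are stand-ins, not real model output.

```python
def all_identical(outputs):
    """Return True when every output string is the same (ignoring surrounding whitespace)."""
    normalized = {text.strip() for text in outputs}
    return len(normalized) <= 1

# Stand-in strings; in the notebook these would come from completion.choices
varied = ["Jane is the youngest.", "The youngest is Jane.", "Jane is the youngest."]
repeated = ["Jane is the youngest."] * 3

print(all_identical(varied))    # False
print(all_identical(repeated))  # True
```

A check like this could be run over the `n=3` completions to confirm that a given prompt behaves consistently before it is used in an accuracy test.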


# OpenAI's Pricing
OpenAI uses a [**pay-as-you-go**](https://openai.com/api/pricing/) pricing model, which means you only pay for what you use. The costs are based on the number of tokens processed.
 - [**Calculating LLM Token Counts: A Practical Guide**](https://winder.ai/calculating-token-counts-llm-context-windows-practical-guide/) - an article explaining what tokens are all about 
 - [**Tokenizer**](https://platform.openai.com/tokenizer) - an online tool that can be used to tokenize our text input. The tool colors the different tokens to help us get a visual understanding of the details

   
Tokens are explained in more detail in the next session, which covers text embeddings.




In [5]:
# Prices in USD per 1M tokens. Not all model names listed on the OpenAI website work, even when copied directly... 😳
MODEL_PRICING_PER_M_TOKENS = {
    'gpt-4o': {'prompt_tokens': 5.00, 'completion_tokens': 15.00},
    'gpt-4o-2024-08-06': {'prompt_tokens': 2.50, 'completion_tokens': 10.00},
    'gpt-4o-mini': {'prompt_tokens': 0.150, 'completion_tokens': 0.600},
    'gpt-4o-mini-2024-07-18': {'prompt_tokens': 0.150, 'completion_tokens': 0.600},
    'o1-preview': {'prompt_tokens': 15.00, 'completion_tokens': 60.00}
}

def model(persona, prompt, model="gpt-4o-mini"):
    completion = client.chat.completions.create(
          model=model
        , messages=[
            { "role": "system", "content": persona},
            { "role": "user", "content": prompt}
    ]
        , temperature=0
    )
    # Get the pricing for the model used in the completion
    pricing = MODEL_PRICING_PER_M_TOKENS[completion.model]

    # Calculate the cost of the completion
    prompt_cost = completion.usage.prompt_tokens * pricing['prompt_tokens']
    generation_cost = completion.usage.completion_tokens * pricing['completion_tokens']
    total_cost = (prompt_cost + generation_cost) / 10**6

    # Extract the message from the completion
    message = completion.choices[0].message.content

    return message, total_cost

In [6]:
model("you are a helpful assistant", "What is the capital of France?")

('The capital of France is Paris.', 7.65e-06)

In [7]:
# We can get the same answer for more money by insisting on using the gpt-4o model instead of the gpt-4o-mini 😀:
model("you are a helpful assistant", "What is the capital of France?", model='gpt-4o-2024-08-06')

('The capital of France is Paris.', 0.0001275)

## Logging and Pricing: Key Features of the OpenAI Integration

The code below includes two important features that help manage the use of the OpenAI API: logging and pricing calculations.

### Logging Interactions

The code defines a decorator function called `log_interaction` that wraps the main `model` function. This decorator is responsible for:

1. Logging the user's prompt to a file called `llm_interactions.log`.
2. Calling the `model` function to get the AI's response and the cost of the interaction.
3. Logging the AI's response and the cost of the interaction to the same log file.
4. Updating a global variable called `total_price` to keep track of the total cost of all interactions.

This logging functionality is important for tracking the usage of the OpenAI API, both in terms of the content of the interactions and the associated costs.

### Calculating Pricing

The code also includes a dictionary called `MODEL_PRICING_PER_M_TOKENS` that stores the pricing information for different OpenAI models. This pricing information is used in the `model` function to calculate the cost of each interaction.

The `model` function first gets the pricing information for the specific model used in the completion. It then calculates the cost of the interaction by:

1. Multiplying the number of prompt tokens by the prompt token price.
2. Multiplying the number of completion tokens by the completion token price.
3. Adding the prompt cost and generation cost, and dividing the total by 1 million to get the cost in dollars.

This accurate pricing calculation is important for understanding the financial implications of using the OpenAI API, especially for applications that may generate a large number of requests.
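
The arithmetic can be checked by hand. The sketch below repeats the calculation as a standalone function (`estimate_cost` is a name introduced here, not part of the notebook) and applies it to 23 prompt tokens and 7 completion tokens. Those counts are an assumption: the notebook does not print them, but they are consistent with both costs shown earlier (7.65e-06 for `gpt-4o-mini` and 0.0001275 for `gpt-4o-2024-08-06`).

```python
# Prices in USD per 1M tokens, copied from the notebook's pricing table
PRICING_PER_M_TOKENS = {
    "gpt-4o-2024-08-06": {"prompt_tokens": 2.50, "completion_tokens": 10.00},
    "gpt-4o-mini": {"prompt_tokens": 0.150, "completion_tokens": 0.600},
}

def estimate_cost(model_name, prompt_tokens, completion_tokens):
    """Return the cost in USD of one call, given its token counts."""
    pricing = PRICING_PER_M_TOKENS[model_name]
    prompt_cost = prompt_tokens * pricing["prompt_tokens"]
    generation_cost = completion_tokens * pricing["completion_tokens"]
    # Prices are per million tokens, so divide by 1,000,000 at the end
    return (prompt_cost + generation_cost) / 10**6

# Assumed token counts (not reported in the notebook output)
mini_cost = estimate_cost("gpt-4o-mini", 23, 7)
full_cost = estimate_cost("gpt-4o-2024-08-06", 23, 7)

print(f"gpt-4o-mini:       ${mini_cost:.8f}")
print(f"gpt-4o-2024-08-06: ${full_cost:.8f}")
print(f"ratio: {full_cost / mini_cost:.1f}x")
```

Under these assumed counts, the larger model costs roughly 17 times as much for the same one-sentence answer.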

By combining these logging and pricing features, the code provides a robust and transparent way to manage the use of the OpenAI API, making it easier to track usage, costs, and potential issues that may arise during the integration.

### Focused Explanation: Logging and Price Calculation

This code includes two important features: a **logging decorator** to keep track of user interactions and a **price calculation** system to manage OpenAI API costs.

---

### Logging Decorator (`log_interaction`)

The `log_interaction` decorator is a function that "wraps" around the main `model` function to add extra functionality—specifically logging and cost tracking—each time the `model` function is called.

- **Purpose**: The decorator records each user input, the AI’s response, and the cost of each interaction in a log file (`llm_interactions.log`).
- **How it Works**:
  - Before calling `model`, it logs the user’s message.
  - After `model` returns, it logs the AI's response and the cost of the interaction.
  - Finally, it updates a running total of all interaction costs and logs that cumulative total as well.

This way, each prompt and response pair, along with its cost, is recorded for easy tracking.
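
The same pattern can be tried without an API key. The sketch below is a simplified stand-in, not the notebook's code: `fake_model` returns a canned answer with a made-up cost, `track_cost` is a hypothetical decorator name, and log lines go to an in-memory list instead of `llm_interactions.log`.

```python
from functools import wraps

log_lines = []      # stands in for the log file
total_price = 0.0   # running total across all calls

def track_cost(func):
    """Log the prompt, the response, and the cost around each call to func."""
    @wraps(func)
    def wrapper(persona, prompt):
        global total_price
        # Before the call: record what the user asked
        log_lines.append(f"User: {prompt}")
        message, cost = func(persona, prompt)
        # After the call: record the answer, its cost, and the running total
        log_lines.append(f"LLM: {message}")
        log_lines.append(f"Cost of this interaction: ${cost:.6f}")
        total_price += cost
        log_lines.append(f"Total price so far: ${total_price:.6f}")
        return message, cost
    return wrapper

@track_cost
def fake_model(persona, prompt):
    # Canned response and made-up cost; a real version would call the API here
    return f"(stub answer to: {prompt})", 0.000008

fake_model("You are a helpful assistant.", "What is the capital of France?")
fake_model("You are a helpful assistant.", "What is the capital of England?")

for line in log_lines:
    print(line)
```

Because the wrapper returns the original `(message, cost)` pair unchanged, callers do not need to know that logging is happening.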

### Price Calculation in `model`

The `model` function calculates the cost of each API call based on **token usage**. OpenAI charges per token, and the cost varies by model. Here’s how it works:

1. **Retrieve Token Counts**: 
   - When the API responds, it includes `prompt_tokens` (tokens used in the prompt) and `completion_tokens` (tokens generated in the response).
   
2. **Look Up Model Pricing**: 
   - Each model has different rates, stored in the `MODEL_PRICING_PER_M_TOKENS` dictionary. For example, `gpt-4o-mini` is listed at $0.150 per 1 million prompt tokens and $0.600 per 1 million completion tokens.

3. **Calculate Total Cost**:
   - The code calculates the cost of the prompt and the response separately:
     - `prompt_cost = prompt_tokens * price_per_million_prompt_tokens`
     - `generation_cost = completion_tokens * price_per_million_completion_tokens`
   - It then adds the two and divides by 1,000,000, because the prices are quoted per million tokens, to get the total cost of the interaction in dollars.

4. **Return Cost**:
   - The total cost for that interaction is returned alongside the AI’s response, which the decorator then logs.

---

### Summary

Together, the logging decorator and price calculation allow the code to:
- **Track each interaction** with the AI, logging both input and output.
- **Calculate and log the cost** of each call, giving a transparent view of API usage expenses.

In [8]:
import logging
#from datetime import datetime
from functools import wraps

# Set up logging
logging.basicConfig(filename='llm_interactions.log', level=logging.INFO,
                    format='%(asctime)s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S')

# Global variable to keep track of total price
total_price = 0


MODEL_PRICING_PER_M_TOKENS = {
    'gpt-4o': {'prompt_tokens': 5.00, 'completion_tokens': 15.00},
    'gpt-4o-2024-08-06': {'prompt_tokens': 2.50, 'completion_tokens': 10.00},
    'gpt-4o-mini': {'prompt_tokens': 0.150, 'completion_tokens': 0.600},
    'gpt-4o-mini-2024-07-18': {'prompt_tokens': 0.150, 'completion_tokens': 0.600},
    'o1-preview': {'prompt_tokens': 15.00, 'completion_tokens': 60.00}
}

def log_interaction(func):
    @wraps(func)
    def wrapper(persona, prompt, model="gpt-4o-mini"):
        global total_price
        
        # Log user message
        logging.info(f"User: {prompt}")
        
        # Call the original function
        message, cost = func(persona, prompt, model)
        
        # Log LLM response and cost
        logging.info(f"LLM: {message}")
        logging.info(f"Cost of this interaction: ${cost:.6f}")
        
        # Update total price
        total_price += cost
        logging.info(f"Total price so far: ${total_price:.6f}")
        
        return message, cost
    return wrapper

@log_interaction
def model(persona, prompt, model="gpt-4o-mini"):
    completion = client.chat.completions.create(
          model=model
        , messages=[
            { "role": "system", "content": persona},
            { "role": "user", "content": prompt}
    ]
        , temperature=0
    )
    # Get the pricing for the model used in the completion
    pricing = MODEL_PRICING_PER_M_TOKENS[completion.model]

    # Calculate the cost of the completion
    prompt_cost = completion.usage.prompt_tokens * pricing['prompt_tokens']
    generation_cost = completion.usage.completion_tokens * pricing['completion_tokens']
    total_cost = (prompt_cost + generation_cost) / 10**6

    # Extract the message from the completion
    message = completion.choices[0].message.content

    return message, total_cost

After execution, a file named ```llm_interactions.log``` should appear in the working directory.

In [9]:
# Example usage
if __name__ == "__main__":

    # first interaction
    persona = "You are a helpful assistant."
    prompt = "What is the capital of France?"
    response, cost = model(persona, prompt)
    print(f"Response: {response}")
    print(f"Cost: ${cost:.6f}")

    # second interaction
    prompt = "What is the capital of England?"
    response, cost = model(persona, prompt)
    print(f"Response: {response}")
    print(f"Cost: ${cost:.6f}")

    # print cumulative cost
    print(f"Total cost of all interactions: ${total_price:.6f}")

Response: The capital of France is Paris.
Cost: $0.000008
Response: The capital of England is London.
Cost: $0.000008
Total cost of all interactions: $0.000016
