# Generative AI Optimization Examples
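This notebook demonstrates several techniques for reducing the cost and latency of OpenAI API calls: prompt compression, output compression, response caching, parallel requests, and streaming. You'll need an API key to run these examples.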

> ⚠️ Make sure you have installed the `openai` Python package and either set the `OPENAI_API_KEY` environment variable or configured the key manually in the notebook.

In [2]:
# Setup
from openai import OpenAI

client = OpenAI()

# Optionally pass your API key explicitly if not using environment variables
# client = OpenAI(api_key="your-api-key-here")

## Example 1: Prompt Compression
Remove unnecessary instructions from the input prompt to cut prompt tokens.

In [15]:
# Verbose prompt
verbose_prompt = '''
You're an intelligent and helpful assistant. Please help me answer this question in a clear, concise, and professional manner.
The question is: How can I reduce the latency of my GPT-4 application in production?
'''

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": verbose_prompt}],
    temperature=0.7,
)
print(response.choices[0].message.content)
print("\n--- Token Usage ---")
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")

Reducing the latency of your GPT-4 application in production can be achieved through several strategies:

1. **Optimize your model**: Try using a smaller version of the GPT-4 model if the latency issue arises from the model's complexity. Smaller models can still deliver high performance with lower latency.

2. **Parallel Processing**: Use parallel processing to divide the model's tasks across multiple CPUs, GPUs, or machines. This can dramatically reduce the time it takes to process requests.

3. **Optimize your code**: Review your code for any inefficiencies or bottlenecks that could be slowing down processing. Use profiling tools to identify these problem areas.

4. **Use Caching**: If your application often produces the same responses, caching these responses can reduce the need for the model to generate them, thereby reducing latency.

5. **Upgrade your hardware**: Faster CPUs, more memory, and high-performance GPUs can all contribute to lower latency. Cloud-based solutions can pro

In [11]:
# Compressed prompt
compressed_prompt = "How can I reduce GPT-4 latency in production?"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": compressed_prompt}],
    temperature=0.7,
)
print(response.choices[0].message.content)
print("\n--- Token Usage ---")
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")

As of the time of writing this, GPT-4 has not been released by OpenAI. Therefore, it's hard to provide specific advice on how to reduce its latency in production. However, the following general tips might be useful for optimizing AI models like GPT-3 and potentially GPT-4:

1. Model Optimization: Use techniques like quantization, pruning, and knowledge distillation to reduce the complexity of the model. This could potentially reduce the time it takes to make predictions.

2. Hardware Acceleration: Use GPUs or other hardware accelerators, which are often designed to efficiently run AI workloads.

3. Efficient Coding: Optimize your code to reduce unnecessary computations. Make sure your implementation is as efficient as possible.

4. Use Edge Computing: If possible, deploy the model closer to the user to reduce network latency. This could mean using edge computing solutions.

5. Parallel Processing: If the model needs to make multiple predictions, try to parallelize these predictions to 
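
The exact savings come from `response.usage.prompt_tokens` in the two cells above. As a quick offline check before calling the API, the sketch below estimates the difference using the rough rule of thumb of about four characters per token for English text. The helper `estimate_tokens` is hypothetical and the figure is only an approximation, not a tokenizer.

```python
# Rough estimate only: assumes ~4 characters per token for English text.
# For exact counts, read response.usage.prompt_tokens from the API response.
def estimate_tokens(text: str) -> int:
    """Approximate the token count of `text` (hypothetical helper)."""
    return max(1, round(len(text.strip()) / 4))


verbose_prompt = (
    "You're an intelligent and helpful assistant. Please help me answer this "
    "question in a clear, concise, and professional manner.\n"
    "The question is: How can I reduce the latency of my GPT-4 application "
    "in production?"
)
compressed_prompt = "How can I reduce GPT-4 latency in production?"

verbose_estimate = estimate_tokens(verbose_prompt)
compressed_estimate = estimate_tokens(compressed_prompt)
savings = 1 - compressed_estimate / verbose_estimate

print(f"Verbose prompt:    ~{verbose_estimate} tokens")
print(f"Compressed prompt: ~{compressed_estimate} tokens")
print(f"Estimated prompt-token savings: {savings:.0%}")
```

The estimate is only useful for comparing two prompts relative to each other; billing is based on the token counts the API reports.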

## Example 2: Output Compression
Limit the size and verbosity of the model's response.

In [16]:
# Unconstrained response
prompt = "Tell me about photosynthesis."
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
print("\n--- Token Usage ---")
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")

Photosynthesis is the process by which plants, algae, and certain bacteria convert light energy from the sun into chemical energy stored in glucose molecules. This process is essential for the survival of plants and other photosynthetic organisms, as it provides them with the energy they need to grow and sustain life.

Photosynthesis takes place in the chloroplasts of plant cells, where chlorophyll – a green pigment that absorbs light – is located. During photosynthesis, carbon dioxide from the air and water from the soil are taken up by the plant. The chlorophyll absorbs light energy from the sun and uses it to convert these raw materials into glucose and oxygen.

The overall chemical equation for photosynthesis is:

6CO2 + 6H2O + light energy → C6H12O6 + 6O2

In this process, carbon dioxide is converted into glucose, a form of sugar that plants use as a source of energy. Oxygen is also produced as a byproduct and released into the atmosphere.

Photosynthesis is a vital process for ma

In [17]:
# Constrained response
prompt = "Tell me about photosynthesis."
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer in 3 short bullet points."},
        {"role": "user", "content": prompt},
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
print("\n--- Token Usage ---")
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")

- Photosynthesis is the process by which green plants, algae, and some bacteria convert light energy into chemical energy.
- This process involves using carbon dioxide, water, and sunlight to produce glucose (a sugar) and oxygen as a byproduct.
- Photosynthesis is essential for the survival of plants and ultimately fuels most life on Earth by providing food and oxygen.

--- Token Usage ---
Prompt tokens: 25
Completion tokens: 72
Total tokens: 97
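
A hard cap such as `max_tokens=100` can cut a response off mid-sentence. The Chat Completions API reports this through `finish_reason`, which is `"length"` when the token limit was reached. The sketch below shows one way to check for it; `was_truncated` and `fake_response` are hypothetical helpers, and the stand-in objects only mimic the shape of a real response so the check can run without an API call.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """Return True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"


def fake_response(finish_reason: str):
    """Build a stand-in with the same attribute path as a chat completion (assumption)."""
    return SimpleNamespace(choices=[SimpleNamespace(finish_reason=finish_reason)])


complete = fake_response("stop")   # model finished on its own
cut_off = fake_response("length")  # model ran into max_tokens

print(f"complete truncated? {was_truncated(complete)}")
print(f"cut_off truncated?  {was_truncated(cut_off)}")
```

With a real response, the same function can be called on the object returned by `client.chat.completions.create(...)`. Pairing the cap with an instruction like the system message above makes truncation less likely, because the model aims for a short answer rather than being stopped partway.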


## Example 3: Cache Prompts
Serve repeated prompts from a cache instead of calling the API again.

In [25]:
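import hashlib

# In-memory cache mapping prompt hashes to response text
cache = {}


def prompt_hash(prompt):
    """Return a stable SHA-256 key for the prompt text."""
    return hashlib.sha256(prompt.encode()).hexdigest()


def fetch_answer(prompt):
    """Return the cached answer for a prompt, calling the API only on a miss."""
    key = prompt_hash(prompt)
    if key in cache:
        print("Cache hit! Returning cached response.")
        return cache[key]
    print("Cache miss. Calling API...")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )
    result = response.choices[0].message.content
    # Store the answer so the next identical prompt skips the API call
    cache[key] = result
    return result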

# Test the caching functionality
test_prompt = "What is the capital of France?"

print("First call:")
answer1 = fetch_answer(test_prompt)
print(f"Answer: {answer1}\n")

print("Second call (should use cache):")
answer2 = fetch_answer(test_prompt)
print(f"Answer: {answer2}\n")

print(f"Cache size: {len(cache)} entries")

First call:
Cache miss. Calling API...
Answer: The capital of France is Paris.

Second call (should use cache):
Cache hit! Returning cached response.
Answer: The capital of France is Paris.

Cache size: 1 entries
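
The cache above keys only on the prompt text, so the same prompt sent to a different model or with different parameters would return the same cached answer. A minimal sketch of one way to avoid that is to hash the model and parameters together with the prompt. The function name `cache_key` and the payload layout are assumptions for illustration.

```python
import hashlib
import json


def cache_key(prompt: str, model: str, **params) -> str:
    """Hash the prompt together with the model and request parameters (hypothetical helper)."""
    # sort_keys=True keeps the serialized payload stable regardless of argument order
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


question = "What is the capital of France?"
key_a = cache_key(question, "gpt-3.5-turbo", temperature=0.7)
key_b = cache_key(question, "gpt-3.5-turbo", temperature=0.7)
key_c = cache_key(question, "gpt-4", temperature=0.7)

print(f"Same request, same key:       {key_a == key_b}")
print(f"Different model, same key:    {key_a == key_c}")
```

Note that with a non-zero temperature the API can return different answers for the same request, so caching trades that variety for lower cost and latency.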


## Example 4: Batch/Parallel Requests
Process multiple API requests concurrently to reduce total execution time. This example compares sequential vs parallel processing and shows the performance improvement.

In [30]:
import asyncio
import time
from openai import AsyncOpenAI

# Create async client
async_client = AsyncOpenAI()

# Sample prompts to process in parallel
prompts = [
    "Explain machine learning in one sentence.",
    "What is the capital of Japan?",
    "Define artificial intelligence briefly.",
    "How does photosynthesis work in simple terms?",
    "What is the speed of light?"
]

async def fetch_completion(prompt):
    """Fetch a single completion asynchronously"""
    response = await async_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=50
    )
    return response.choices[0].message.content

async def parallel_requests(prompts):
    """Process multiple prompts in parallel"""
    tasks = [fetch_completion(prompt) for prompt in prompts]
    return await asyncio.gather(*tasks)

def sequential_requests(prompts):
    """Process prompts sequentially for comparison"""
    results = []
    for prompt in prompts:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=50
        )
        results.append(response.choices[0].message.content)
    return results

# Compare sequential vs parallel processing
print("Testing 5 API calls...")
print("\n--- Sequential Processing ---")
start_time = time.time()
sequential_results = sequential_requests(prompts)
sequential_time = time.time() - start_time
print(f"Sequential time: {sequential_time:.2f} seconds")

print("\n--- Parallel Processing ---")
start_time = time.time()
parallel_results = await parallel_requests(prompts)
parallel_time = time.time() - start_time
print(f"Parallel time: {parallel_time:.2f} seconds")

print(f"\nSpeedup: {sequential_time/parallel_time:.2f}x faster")

# Display results
print("\n--- Results ---")
for i, (prompt, result) in enumerate(zip(prompts, parallel_results)):
    print(f"{i+1}. Q: {prompt}")
    print(f"   A: {result}\n")

Testing 5 API calls...

--- Sequential Processing ---
Sequential time: 7.10 seconds

--- Parallel Processing ---
Parallel time: 1.07 seconds

Speedup: 6.66x faster

--- Results ---
1. Q: Explain machine learning in one sentence.
   A: Machine learning is a subset of artificial intelligence that allows computers to learn and improve from experience without being explicitly programmed.

2. Q: What is the capital of Japan?
   A: The capital of Japan is Tokyo.

3. Q: Define artificial intelligence briefly.
   A: Artificial intelligence (AI) refers to the simulation of human intelligence processes by machines, particularly computer systems. This includes tasks such as reasoning, learning, problem-solving, perception, and language understanding.

4. Q: How does photosynthesis work in simple terms?
   A: Photosynthesis is the process by which plants, algae, and some bacteria convert sunlight into energy in the form of glucose (sugar).
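
Launching every request at once with `asyncio.gather` works for five prompts, but a larger batch may run into rate limits. One common way to cap the number of in-flight requests is an `asyncio.Semaphore`. The sketch below uses a hypothetical `fake_completion` coroutine that sleeps instead of calling the API, and `MAX_CONCURRENCY = 2` is an assumed value, not a recommendation.

```python
import asyncio

MAX_CONCURRENCY = 2  # assumed limit; tune to your own rate limits


async def fake_completion(prompt: str) -> str:
    """Stand-in for an API call (hypothetical); sleeps instead of using the network."""
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"


async def bounded_requests(prompts, limit=MAX_CONCURRENCY):
    """Run all prompts concurrently, but never more than `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0  # highest number of simultaneous calls observed

    async def worker(prompt):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_completion(prompt)
            finally:
                in_flight -= 1

    # gather returns results in the same order as the input prompts
    results = await asyncio.gather(*(worker(p) for p in prompts))
    return results, peak


demo_prompts = [f"question {i}" for i in range(5)]
results, peak = asyncio.run(bounded_requests(demo_prompts))
print(f"Completed {len(results)} requests, peak concurrency {peak}")
```

Inside a notebook, where an event loop is already running, replace `asyncio.run(...)` with `await bounded_requests(demo_prompts)`. To use it with the real client, swap `fake_completion` for the `fetch_completion` coroutine defined in the cell above.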

## Example 5: Streaming Responses
Reduce perceived latency by streaming the response as it is generated instead of waiting for the complete response.

In [4]:
import time

print("--- Non-streaming Response ---")
start_time = time.time()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing simply."}],
    temperature=0.7,
    stream=False
)
non_streaming_time = time.time() - start_time
print(response.choices[0].message.content)
print(f"\nTotal time to first output: {non_streaming_time:.2f} seconds")

print("\n\n--- Streaming Response ---")
start_time = time.time()
first_chunk_time = None

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing simply."}],
    temperature=0.7,
    stream=True
)

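for chunk in response:
    # Some chunks can arrive with no choices (for example, a usage-only chunk); skip them
    if not chunk.choices:
        continue
    if chunk.choices[0].delta.content is not None:
        if first_chunk_time is None:
            first_chunk_time = time.time() - start_time
        print(chunk.choices[0].delta.content, end="", flush=True)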

print(f"\n\nTime to first chunk: {first_chunk_time:.2f} seconds")
print(f"Perceived latency improvement: {((non_streaming_time - first_chunk_time) / non_streaming_time * 100):.1f}%")

--- Non-streaming Response ---
Quantum computing is a type of computing that uses quantum bits, or 'qubits', instead of the typical bits used in digital computing. While regular bits can be either 0s or 1s, qubits can be both at the same time, thanks to a property called superposition. This allows quantum computers to process a much higher number of possibilities instantly. 

In addition to superposition, quantum computers also use another quantum mechanic property called entanglement, where the state of one particle is directly related to the state of another, no matter the distance between them. This allows quantum computers to process information in a way that is significantly faster and more complex than traditional computers. 

However, quantum computing is currently still in the experimental stage, as there are many difficulties in maintaining the state of qubits and reducing errors in calculations.

Total time to first output: 9.16 seconds


--- Streaming Response ---
Quantum co