# Week 1 Exercise, by Mougang Thomas Gasmyr

## Approach
A technical Q&A tool that takes a question and gets explanations from two LLMs:
- **GPT-4o-mini** (via OpenRouter) — streamed response with live display updates
- **Llama 3.2** (via local Ollama) — streamed response with live display updates

In [1]:
# imports

import os
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display, update_display

In [2]:
# constants

MODEL_GPT = 'gpt-4o-mini'
MODEL_LLAMA = 'llama3.2'

In [3]:
# set up environment

load_dotenv(override=True)

api_key = os.getenv("OPENROUTER_API_KEY")
if not api_key:
    raise ValueError("OPENROUTER_API_KEY not found in .env file")

openrouter_base_url = "https://openrouter.ai/api/v1"

# OpenRouter client (uses the OpenAI SDK against OpenRouter's OpenAI-compatible endpoint)
openai_client = OpenAI(
    base_url=openrouter_base_url,
    api_key=api_key
)

# Ollama client (OpenAI-compatible endpoint)
ollama_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# System prompt for technical explanations
system_prompt = "You are a helpful technical tutor. Explain concepts clearly and concisely, using examples where appropriate."

In [4]:
# reusable functions for querying LLMs with streaming display

def stream_answer(question, client, model, system_prompt):
    """Stream a chat completion and update a Markdown display as chunks arrive."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question}
    ]
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        # some chunks may arrive with no choices; skip them instead of indexing into an empty list
        if not chunk.choices:
            continue
        # delta.content can be None (e.g. on the final chunk), so fall back to ""
        response += chunk.choices[0].delta.content or ""
        update_display(Markdown(response), display_id=display_handle.display_id)
    return response

def ask_gpt(question, client, model, system_prompt):
    """Stream a response from GPT and display live."""
    return stream_answer(question, client, model, system_prompt)

def ask_ollama(question, client, model, system_prompt):
    """Stream a response from Ollama and display live."""
    return stream_answer(question, client, model, system_prompt)
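
The accumulation step of the streaming loop above can be checked offline, without calling OpenRouter or Ollama. The sketch below is only an illustration: `fake_chunk` and `accumulate_stream` are hypothetical helpers that are not part of this notebook, and the fake chunks merely imitate the `choices[0].delta.content` shape used in the loop.

```python
from types import SimpleNamespace


def fake_chunk(text):
    """Build a stand-in for one streamed chunk (hypothetical test helper)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def accumulate_stream(stream):
    """Join the text deltas of a stream, skipping chunks that carry no text."""
    response = ""
    for chunk in stream:
        if not chunk.choices:  # chunk with an empty choices list
            continue
        response += chunk.choices[0].delta.content or ""
    return response


chunks = [
    fake_chunk("Hello"),
    fake_chunk(None),             # delta whose content is None
    SimpleNamespace(choices=[]),  # chunk with no choices at all
    fake_chunk(", world"),
]
answer = accumulate_stream(chunks)
print(answer)
```

Keeping the accumulation separate from the display calls is one possible design choice; it makes the text handling testable outside a notebook.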

In [9]:
# ask your technical question (edit the string below to change it)

question = "How to evaluate the performance of production AI agents? What are the best practices and tools for monitoring and improving their performance over time?"

In [10]:
# Get gpt-4o-mini to answer, with streaming

print("## GPT-4o-mini Response\n")
try:
    gpt_response = ask_gpt(question, openai_client, MODEL_GPT, system_prompt)
except Exception as e:
    print(f"GPT error: {e}")

## GPT-4o-mini Response



Evaluating the performance of production AI agents involves a systematic approach that encompasses various metrics, best practices, and tools. Here’s a concise guide:

### Key Concepts in Performance Evaluation

1. **Performance Metrics**:
   - **Accuracy**: The percentage of correct predictions made by the AI agent.
   - **Precision**: The ratio of true positives to the sum of true positives and false positives (useful in classification tasks).
   - **Recall (Sensitivity)**: The ratio of true positives to the sum of true positives and false negatives (important in imbalanced classes).
   - **F1 Score**: The harmonic mean of precision and recall, providing a balance between the two.
   - **AUC-ROC**: The area under the receiver operating characteristic curve, useful for binary classification.
   - **Response Time**: For real-time AI applications, how quickly the agent responds to queries.
   - **Throughput**: The number of queries processed per unit of time.

2. **Error Analysis**: Understanding the types of errors the AI agent is making (false positives vs. false negatives) can help identify areas for improvement.

3. **User Feedback**: Collect feedback from end-users to assess the practical effectiveness of the AI agent in real-world scenarios.

### Best Practices for Performance Evaluation

1. **Continuous Monitoring**: Implement monitoring systems that track key performance metrics in real-time to quickly identify any performance degradation.

2. **Regular Retraining**: Periodically retrain the AI model with new data to ensure it stays relevant and adapts to changes over time.

3. **A/B Testing**: Deploy different versions of models or changes to model configurations concurrently to compare performance in a controlled manner.

4. **Version Control**: Maintain clear versioning for models to keep track of changes, allowing for easier rollback if a new version underperforms.

5. **Data Quality Assessment**: Ensure that the training and evaluation datasets are of high quality and representative of the real-world context in which the AI agent operates.

6. **Bias and Fairness Checks**: Regularly assess the model for bias to ensure outputs are fair and equitable across all user demographics.

### Tools for Monitoring and Improving Performance

1. **Monitoring Platforms**:
   - **Prometheus** and **Grafana**: For real-time performance monitoring and visualization of metrics.
   - **Elastic Stack (ELK)**: For logging and monitoring AI application performance.
   - **Neptune.ai**, **Weights & Biases**, or **Comet.ml**: Tools specifically designed for tracking experiments and model performance over time.

2. **Model Evaluation Frameworks**:
   - **MLflow**: For tracking experiments, managing models, and monitoring performance.
   - **TensorBoard**: Useful for visualizing the training process and metrics of TensorFlow models.

3. **Data Drift Detection**: Tools like **Evidently** or **NannyML** can be employed to detect changes in data distributions over time, guiding when a model may need retraining.

4. **Feedback Systems**: Incorporate user satisfaction surveys or feedback collection forms to gather actionable insights.

By utilizing these concepts, practices, and tools, you can establish a robust system for evaluating and continuously improving the performance of AI agents in production environments. This ensures these systems remain effective, reliable, and relevant as conditions change.
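
The classification metrics listed in the response above can be computed directly from confusion-matrix counts. This is a minimal sketch, assuming a binary classification setting; the function name `classification_metrics` and the example counts are made up for illustration and do not come from any real evaluation.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# illustrative counts only (hypothetical data)
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The zero-denominator guards return 0.0 rather than raising, which is one common convention; a stricter evaluation might prefer to report the metric as undefined.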

In [7]:
# Get Llama 3.2 to answer, with streaming

print("## Llama 3.2 Response\n")
try:
    llama_response = ask_ollama(question, ollama_client, MODEL_LLAMA, system_prompt)
except Exception as e:
    print(f"Ollama error: {e}. Is Ollama running?")

## Llama 3.2 Response



Evaluating the performance of production AI agents is crucial to ensure they are meeting their expected objectives and delivering consistent results. Here's a comprehensive overview of best practices, tools, and techniques for monitoring and improving performance:

**Evaluation Metrics**

1. **Accuracy**: Measure the proportion of correct predictions or classifications.
2. **Precision**: Evaluate the ratio of true positives to total predicted positive instances.
3. **Recall**: Calculate the proportion of actual positive instances that were correctly identified.
4. **F1-score**: Compute the harmonic mean of precision and recall.
5. **Time-to-Resolution**: Measure response time for critical tasks or decisions.

**Real-Time Monitoring**

1. **Instrumentation**: Integrate monitoring tools with your AI system to collect performance data as it runs.
2. **Logging**: Establish a logging framework to capture relevant metrics, errors, and warnings.
3. **Monitoring Dashboard**: Create visualizations to help developers quickly grasp the current health of their AI systems.

**Tools for Monitoring**

1. **TensorBoard**: A popular tool for monitoring TensorFlow models in real-time.
2. **HyperLance**: An all-in-one platform for logging, visualization, and hyperparameter tuning.
3. **Prometheus**: A scalable time-series database for storing performance metrics.
4. **Grafana**: A powerful dashboarding tool for creating visualizations and alerts.

**Improvement Techniques**

1. **Exploratory Data Analysis (EDA)**: Investigate data distribution and patterns to identify trends or biases.
2. **Model Evaluation**: Regularly assess model performance on a separate test dataset using metrics such as accuracy, precision, recall, and F1-score.
3. **Hyperparameter Tuning**: Use techniques like grid search, random search, or Bayesian optimization to optimize hyperparameters for production AI agents.
4. **Data Augmentation**: Enhance data quality by creating augmented versions of existing data (e.g., noisy instances) to improve model robustness.

**Automated Testing and Validation**

1. **Randomized Controlled Trials (RCTs)**: Conduct experiments to test the efficacy of new models or techniques in production environments.
2. **Automated Testing Frameworks**: Use frameworks like Pytest, Behave, or Nose to write automated tests for AI systems.
3. **Validation Frameworks**: Establish a suite of pre-built verification routines to check against expected behaviors.

**Model Serving and Model Updates**

1. **Model Serving Platforms**: Utilize platforms like TensorFlow Serving, AWS SageMaker, or Azure Machine Learning to host and deploy AI models in production environments.
2. **Automated Model Updating**: Design pipelines that automate feature updates, data refreshes, or new model deployment for maintaining accurate and updated performance.

By implementing these best practices, tools, and techniques, you can ensure the reliable and high-performing operation of your production AI agents. Regular monitoring, exploration, and improvement will keep your models adaptable to changing environments and continuous learning from user feedback and operational context.
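
Both responses mention response time as a metric, and the same idea applies to comparing the two models used here. The sketch below is a rough illustration under stated assumptions: `timed_call` is a hypothetical helper that is not part of this notebook, and `fake_model` stands in for a real model call so the example runs without a network.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) measured with a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_model(question):
    """Stand-in for a model call (hypothetical); sleeps briefly to simulate latency."""
    time.sleep(0.05)
    return f"answer to: {question}"


result, elapsed = timed_call(fake_model, "What is latency?")
print(f"{result!r} took {elapsed:.3f}s")
```

In this notebook the same wrapper could be applied as `timed_call(ask_gpt, question, openai_client, MODEL_GPT, system_prompt)` and likewise for `ask_ollama`. A single run is a noisy measurement, so several repetitions would be needed before drawing conclusions.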