# 3. [Experiments](https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingExperimentation.md)

It sounds like a simplistic statement, but "the best Search solution is the one that returns the best results".

The only way to get the best and most relevant results is via experimentation. As a result, the experimentation phase is crucial and effort should be invested to create experiments that can:

- be tracked and evaluated against a consistent set of metrics
- be repeatable, following a methodical process for tracking and recording results

When creating/running search experiments, there are multiple factors that shape the outcome of each experiment. These are small changes that add up over time and change the functionality and effectiveness of your search experience. These tweaks should help you determine which combination of document shaping and indexing techniques will provide the most relevant set of documents returned for the set of queries that you care about.
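
One way to keep experiments tracked and repeatable is to give every configuration a stable identifier and log its metrics next to it. The following is a minimal sketch, not a prescribed implementation: `ExperimentConfig`, its fields (`chunk_size`, `chunk_overlap`, `search_mode`), and the metric name used in the usage note are hypothetical and chosen only for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical knobs for one search experiment (illustrative only)."""
    chunk_size: int
    chunk_overlap: int
    search_mode: str  # e.g. "vector", "semantic", "hybrid"

    def experiment_id(self) -> str:
        # Hash the sorted configuration so the same setup always gets the same ID.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def record_result(log: list, config: ExperimentConfig, metrics: dict) -> dict:
    """Append one experiment run to an in-memory log and return the entry."""
    entry = {
        "id": config.experiment_id(),
        "config": asdict(config),
        "metrics": dict(metrics),
    }
    log.append(entry)
    return entry


def best_run(log: list, metric: str) -> dict:
    """Return the logged run with the highest value for the given metric."""
    return max(log, key=lambda entry: entry["metrics"][metric])
```

For example, after calling `record_result` for two configurations with a shared metric such as `recall_at_5`, `best_run(log, "recall_at_5")` returns the entry to compare against in the next round. In practice the log would likely be persisted (a file or an experiment tracker) rather than kept in memory.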

```{note}
Each situation is different, and the techniques you use will depend on the type of documents you have, the type of queries you expect to receive, and the type of results you want to return.
```

As you build out your experimentation process for search, reference the [Things to Consider](https://github.com/microsoft/rag-openai/blob/main/topics/RAG_ThingsToConsider.md) document, which highlights some important features to include in your experiments. You can also see the [existing learnings from engagements](https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingSearch.md#learnings-from-our-engagements-2).

## The Role of Experimentation

Creating an effective solution is a delicate balance of several factors:

- Which search mechanism to use - whether it is vector search, semantic search, a combination, or other (the options are discussed in the [Enabling Search](https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingSearch.md) document)

- Which model to use - GPT-4, GPT-3 (Ada, Curie, Davinci), GPT Turbo etc. The available models are described here: https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models

- The prompt - the instruction given to the model in order to produce the desired result. An effective prompt has several components:
  1. the actual question a user typed into a box
  2. the history of the conversation, including both questions and answers
  3. the prompt itself: the scope established for the LLM, which helps refine and focus the predictive elements the model uses to answer the question

  Writing an effective prompt is referred to as "Prompt Engineering". It is an empirical and iterative process. More information on prompt engineering is available here: https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/prompt-engineering

- The arguments passed to the OpenAI APIs, which also impact the result. These are arguments such as "temperature", "logit_bias", "presence_penalty" etc. More information can be found here: https://learn.microsoft.com/en-us/azure/cognitive-services/openai/reference
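
Because these factors interact, it can help to enumerate the combinations you intend to test up front rather than changing them ad hoc. The sketch below expands a set of candidate values into one settings dictionary per experiment. The helper `parameter_grid` and the specific values are assumptions for illustration; only the argument names `temperature` and `presence_penalty` come from the text above, and no API call is made here.

```python
from itertools import product


def parameter_grid(options: dict) -> list:
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = sorted(options)  # fixed order keeps the grid reproducible
    value_lists = [options[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Illustrative candidate values, not recommendations.
grid = parameter_grid({
    "temperature": [0.0, 0.7],
    "presence_penalty": [0.0, 0.5],
})

for settings in grid:
    # Each dictionary would be passed to whatever code issues the model call.
    print(settings)
```

The grid grows multiplicatively with every added argument, so it is usually worth starting with a small number of values per argument and widening only where the metrics show sensitivity.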

```{seealso}
https://github.com/microsoft/rag-openai/blob/438999a5470bef7946fa1c8714ed1090e1ed40c3/topics/RAG_ThingsToConsider.md
```

## Evaluation

One of the most important aspects of developing a system for production is regular evaluation and iterative experimentation. This process allows you to measure performance, troubleshoot issues, and fine-tune to improve accuracy and efficiency.
Note that there are different types of metrics and ways to evaluate a system. There is an end-to-end business evaluation, where we want to measure whether the deployed system has met certain business metrics (e.g., a sales increase), and there is a technical evaluation, where we want to ensure that each functional part of the system meets certain technical requirements or baseline metrics so that we are pushing a quality system to production.

In RAG, there are two main components: the retrieval part and the generative part. These components must be evaluated individually before evaluating the end-to-end solution.
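
For the retrieval part, evaluation on its own typically means comparing the ranked documents returned for a query against a set of documents known to be relevant. The sketch below shows two commonly used measures, recall@k and reciprocal rank. The document does not prescribe these metrics; they are offered as an assumption about what a retrieval-only evaluation might compute, and the document IDs are made up.

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_metric(per_query_scores: list) -> float:
    """Average a per-query metric over the whole query set."""
    if not per_query_scores:
        return 0.0
    return sum(per_query_scores) / len(per_query_scores)


# Hypothetical ranked results and ground truth for a single query.
retrieved = ["d3", "d1", "d9"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=2))  # 0.5
print(reciprocal_rank(retrieved, relevant))   # 0.5
```

Averaging these scores over a fixed query set with `mean_metric` gives a single number per experiment that can be compared across retrieval configurations before the generative part is involved.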

### Evaluation Strategy

An OpenAI solution (or system) may involve several components that work together to provide the core functionality of the chatbot. Each component is a group of prompts plus an OpenAI endpoint call that performs a specific task of the system. For example, we could have:

- An agent that takes care of the conversation part of the solution.
- An agent that detects intent within a conversation.
- An agent that extracts vehicle information from a conversation.

The system may need to be integrated with live services apart from OpenAI. For example, an agent may need to call a hosted API (service) to verify that the customer has fully specified the vehicle's manufacturer, model, and trim or variant before the conversation can progress.

In evaluating the system, we need to evaluate every component of the system individually.
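
A per-component evaluation can be sketched as a small harness that runs each component against its own labeled test cases and reports one score per component. Everything named here is hypothetical: `detect_intent` is a keyword-based stand-in for what would really be a prompt plus an OpenAI endpoint call, and exact-match accuracy is just one possible scoring rule.

```python
from typing import Any, Callable


def evaluate_component(component: Callable[[Any], Any], cases: list) -> float:
    """Fraction of (input, expected) cases the component gets exactly right."""
    if not cases:
        return 0.0
    correct = sum(1 for given, expected in cases if component(given) == expected)
    return correct / len(cases)


def evaluate_system(components: dict, test_sets: dict) -> dict:
    """Score every component against its own test set, keyed by component name."""
    return {
        name: evaluate_component(fn, test_sets.get(name, []))
        for name, fn in components.items()
    }


def detect_intent(text: str) -> str:
    # Stand-in for the intent agent; a real one would call the model.
    return "buy_vehicle" if "buy" in text.lower() else "other"


report = evaluate_system(
    {"intent": detect_intent},
    {"intent": [
        ("I want to buy a car", "buy_vehicle"),
        ("What's the weather?", "other"),
        ("Purchase a truck", "buy_vehicle"),  # missed by the keyword stand-in
    ]},
)
print(report)
```

Keeping a separate test set per component makes it easier to tell which part regressed when the end-to-end result changes. Components that depend on live services would need those services stubbed or recorded so that runs stay repeatable.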

#### Data

https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingSearch.md

Quality measurement guides the entire development process of the Search solution. The team must know what performance metric they are chasing.

This is easier said than done. We are developing systems where the results are fuzzy by nature. What documents do you expect to be retrieved when you ask a question?

We can use three potential approaches:

- We build the test data already knowing the result - **ideal, but not always feasible**
- We have a human analyze each of the results that comes back and score it right or wrong - **laborious and prone to errors**
- We build an AI-based validation mechanism - **dangerous, but sometimes the only method available**

Do not set a metric target until the baseline is well known. There are many AI projects out there targeting 99% accuracy when the current methods produce 10-12% accuracy, and when even humans are not better than that. So, before setting a target, take time to understand what humans can do and what the current process produces. When accepting a target, focus on small improvements. We would never have automated Speech to Text if we had started with ambitious and unrealistic targets. https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingSearch.md#learnings-from-our-engagements-1
