## Goal:

**Maximize Pipeline Efficiency:**

* Achieve the highest possible performance gains for the entire pipeline. 

## Agenda:

**Performance Analysis:**

* **A) Benchmarking:** Evaluate the performance improvement of the algorithm using the epam10k Azure Rag dataset.
* **B) Quality Assessment:** Determine if the algorithm modification impacts the quality of results on the epam10k Azure Rag dataset.
* **C) Component-Level Analysis:** Analyze the performance gains achieved by each individual component of the algorithm using the epam10k Azure Rag dataset.
* **D) Hard Benchmarking:** Evaluate the performance improvement of the algorithm using the gpt4 log dataset.

# A)
### LLM Models:

- gemini-flash1: gemini-flash.001
- gemini-flash2: gemini-flash.002
- gpt4o-mini: gpt4o-mini

### Prompt:

- same: NLI and answer refusal tasks in the same prompt
- separated: NLI and answer refusal tasks in different prompts

### Concurrency:

- no concur: old algorithm
- concur %d: new concurrent algorithm, where %d is the number of threads
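
As a rough illustration of what `concur %d` means, the sketch below fans hypotheses out over a thread pool. It is a minimal sketch rather than the pipeline's actual code: `get_nli` is a hypothetical stand-in for the LLM request, and using `ThreadPoolExecutor` is an assumption about how the threads are managed.

```python
from concurrent.futures import ThreadPoolExecutor


def get_nli(hypothesis: str) -> float:
    """Hypothetical stand-in for one LLM request that returns an NLI score."""
    # A real implementation would call the model here; this stub is deterministic.
    return float(len(hypothesis) % 2)


def get_nli_concurrently(hypotheses: list[str], n_threads: int) -> list[float]:
    """Score hypotheses with up to n_threads requests in flight at once."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Executor.map yields results in input order, whatever the completion order.
        return list(pool.map(get_nli, hypotheses))


scores = get_nli_concurrently(["a", "bb", "ccc"], n_threads=32)
```

Because the work is dominated by waiting on network responses, threads are usually enough here; results come back in the same order as the input hypotheses.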


| LLM Model     | Prompt     | Concurrency | Time transforming (mm:ss) | Time getting NLI (mm:ss) | Input tokens | Output tokens | Total tokens |
|---------------|------------|-------------|---------------------------|--------------------------|--------------|---------------|--------------|
| gemini-flash1 | same       | no concur   | 01:45                     | 10:14                    | NAN          | NAN           | NAN          |
| gemini-flash1 | separated  | no concur   | 00:41                     | 07:01                    | NAN          | NAN           | NAN          |
| gemini-flash1 | separated  | concur 32   | 00:05.9123                | 00:27                    | 227514       | 35973         | 263487       |
| gemini-flash2 | separated  | concur 32   | 00:07.0558                | 00:38                    | 227552       | 34828         | 262380       |
| gpt4o-mini    | separated  | concur 32   | 00:10.5841                | 00:51                    | 213086       | 35989         | 249075       |


# B)

### ROC AUC score: manually labeled hypotheses vs. experiments


| LLM Model     | Prompt     | Concurrency | Context to answer ROC AUC | Answer to ground truth answer ROC AUC | Ground truth answer to answer ROC AUC |
|---------------|------------|-------------|---------------------------|---------------------------------------|---------------------------------------|
| gemini-flash1 | same       | no concur   | 0.9473684210526316        | 0.8596837944664033                    | 0.8538461538461539                    |
| gemini-flash1 | separated  | no concur   | 0.9473684210526316        | 0.969038208168643                     | 0.984615384615384                     |
| gemini-flash1 | separated  | concur 32   | 0.9473684210526316        | 0.9137022397891963                    | 0.923076923076923                     |
| gemini-flash2 | separated  | concur 32   | 0.9342105263157895        | 0.9538866930171278                    | 0.8747252747252748                    |
| gpt4o-mini    | separated  | concur 32   | 0.9605263157894737        | 0.7476943346508563                    | 0.9285714285714286                    |
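
The report does not state how the ROC AUC values were computed. The sketch below shows one standard definition, assuming the manual labels are binary (1 = entailed, 0 = not) and each experiment yields a continuous score per hypothesis. The function name and sample values are illustrative only; for binary labels, scikit-learn's `roc_auc_score` should give the same result.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """ROC AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative values only, not taken from the dataset.
manual_labels = [1, 1, 0, 0]
model_scores = [0.9, 0.4, 0.6, 0.1]
auc = roc_auc(manual_labels, model_scores)
```

The pairwise form is quadratic in the number of hypotheses, which is acceptable for a few dozen manually labeled items.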


Evaluations were compared as follows: hypotheses were matched pairwise across evaluations, and two scores for the same hypothesis were considered to carry the same label if they differed by at most eps = 0.2.
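
The comparison rule can be sketched as follows. This is a minimal sketch under assumptions: scores are taken to be numeric values keyed by a hypothesis identifier, and `count_same` and the sample data are hypothetical names and values, not part of the pipeline.

```python
EPS = 0.2  # tolerance stated in the text


def count_same(
    scores_a: dict[str, float],
    scores_b: dict[str, float],
    eps: float = EPS,
) -> tuple[int, int]:
    """Return (same, all) over hypotheses present in both evaluations."""
    shared = scores_a.keys() & scores_b.keys()
    same = sum(1 for h in shared if abs(scores_a[h] - scores_b[h]) <= eps)
    return same, len(shared)


# Illustrative scores only.
baseline = {"h1": 0.9, "h2": 0.1, "h3": 0.5}
candidate = {"h1": 0.8, "h2": 0.6, "h3": 0.5}
same, total = count_same(baseline, candidate)
```

The returned pair corresponds to the "same / all" format used when reporting agreement between two experiments.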

### Comparison of the "gflash1, nli / ar, no concur" experiment (the best previous experiment) with the others


| LLM Model     | Prompt     | Concurrency | Context to answer (same / all) | Answer to ground truth answer (same / all) | Ground truth answer to answer (same / all) |
|---------------|------------|-------------|--------------------------------|--------------------------------------------|--------------------------------------------|
| gemini-flash1 | separated  | concur 32   | 41 / 48                        | 48 / 56                                    | 44 / 48                                    |
| gemini-flash2 | separated  | concur 32   | 38 / 48                        | 48 / 56                                    | 38 / 48                                    |
| gpt4o-mini    | separated  | concur 32   | 41 / 48                        | 35 / 56                                    | 37 / 48                                    |




# C)

LLM Model | Prompt | Concurrency | Time Transforming | Time NLI Preprocessing | Time Getting NLI Only | Time Getting Answer Refusal Only | Time NLI Postprocessing | Algorithm Performance Change vs 1 Thread (Getting NLI Only) | Input Tokens | Output Tokens | Total Tokens
------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | --------
gemini-flash1 | separated | concur 1 | 00:42.7573 | 00:00.00011 | 05:49.913 | 00:09.1756 | 00:00.0087 | 1 | 227600 | 35881 | 263481
gemini-flash1 | separated | concur 8 | 00:08.0823 | 00:00.00010 | 00:51.0014 | 00:03.455 | 00:00.0720 | 6.860850878603333 | 227514 | 35835 | 263349
gemini-flash1 | separated | concur 32 | 00:05.9123 | 00:00.0001 | 00:25.6965 | 00:04.7539 | 00:00.0077 | 13.617146303971358 | 227514 | 35973 | 263487
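
The "Algorithm Performance Change vs 1 Thread" column appears to be the ratio of the 1-thread "Time Getting NLI Only" to that of the given run. The sketch below reproduces it under that assumption, reading timings as `mm:ss.fff`; the helper names are hypothetical.

```python
def to_seconds(timing: str) -> float:
    """Convert a 'mm:ss' or 'mm:ss.fff' timing string to seconds."""
    minutes, seconds = timing.split(":")
    return int(minutes) * 60 + float(seconds)


def speedup(baseline: str, candidate: str) -> float:
    """How many times faster the candidate run is than the baseline run."""
    return to_seconds(baseline) / to_seconds(candidate)


# 'Time Getting NLI Only' values from the table above.
speedup_8 = speedup("05:49.913", "00:51.0014")
speedup_32 = speedup("05:49.913", "00:25.6965")
```

Under this reading, 8 threads give roughly a 6.86x speedup and 32 threads roughly 13.6x, so the gain per added thread falls off well before 32.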


### ROC AUC score: manually labeled hypotheses vs. experiments

LLM Model | Prompt | Concurrency | context_to_answer ROC AUC | answer_to_ground_truth_answer ROC AUC | ground_truth_answer_to_answer ROC AUC
------- | -------- | -------- | -------- | -------- | --------
gemini-flash1 | separated | concur 1 | 0.9342105263157895 | 0.9137022397891963 | 0.9241758241758241
gemini-flash1 | separated | concur 8 | 0.9342105263157895 | 0.9137022397891963 | 0.9241758241758241
gemini-flash1 | separated | concur 32 | 0.9342105263157895 | 0.9137022397891963 | 0.9241758241758241

# D)
In all the experiments above, hypotheses were submitted to the prompt one at a time.

New parameter: `batch_size`, which specifies how many hypotheses are fed into a single prompt at once when getting NLI.
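
A minimal sketch of how such a parameter could group hypotheses before prompting. The function name `make_batches` and the sample identifiers are hypothetical; how the pipeline actually builds the batched prompt is not described here.

```python
def make_batches(hypotheses: list[str], batch_size: int) -> list[list[str]]:
    """Split hypotheses into consecutive groups of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        hypotheses[i:i + batch_size]
        for i in range(0, len(hypotheses), batch_size)
    ]


batches = make_batches(["h1", "h2", "h3", "h4", "h5"], batch_size=2)
```

With `batch_size=1` this reduces to the one-hypothesis-per-prompt behaviour of the earlier experiments; larger batches send fewer prompts, since the shared context is sent once per batch instead of once per hypothesis.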



| LLM Model | Prompt | Concurrency | Batch Size | Time Transforming | Time Getting NLI | Input Tokens | Output Tokens | Total Tokens |
|---|---|---|---|---|---|---|---|---|
| gemini-flash1 | same | no concur | 1 | 15:11 | 47:05 | 4115347 | 501073 | 4616420 |
| gemini-flash1 | separated | no concur | 8 | 03:36 | 17:34 | 766753 | 306424 | 1073177 |
| gemini-flash2 | separated | concur 32 | 1 | 00:25.6166 | 03:06.249 | 3534768 | 372050 | 3906818 |
| gpt4o-mini | separated | concur 32 | 1 | 00:32.7575 | 06:16.492 | 3987196 | 402602 | 4389798 | 

## Conclusion:

Either I have made a mistake somewhere, or I do not understand why the results of the concur and no concur algorithms have started to differ so drastically within just one week. The temperature is 0 and the prompts are the same. I do not see any particular shift in the new results; only some hypotheses have different NLI scores.