# Inference with `google/gemma-7b-it` Fine-Tuned Using PEFT QLoRA

## Purpose
This notebook runs inference with the `google/gemma-7b-it` language model, loaded through Hugging Face's `transformers` library and fine-tuned using **PEFT (Parameter-Efficient Fine-Tuning)** with **QLoRA** (Quantized Low-Rank Adaptation). The test dataset consists of 1363 records, which are fetched via an HTTP call to the **Palaven API**, a custom API developed exclusively for this thesis. The generated responses, along with evaluation metrics, are saved back to the Palaven API for future analysis.

## Overview

1. **Declaration of Global Variables**:
    - The notebook initializes all necessary global variables, including paths, API credentials, batch sizes, and any configuration settings required for inference and evaluation.

2. **Loading the Fine-tuned Model and Tokenizer**:
    - The `google/gemma-7b-it` model and its corresponding tokenizer are loaded from the Hugging Face Hub using the `transformers` library.
    - The model has been fine-tuned using **PEFT** with **QLoRA**, allowing for efficient adaptation to the specific task with reduced memory requirements.

3. **Inference Execution**:
    - **Dataset fetching**: The test dataset, consisting of 1363 records, is retrieved from the cloud via the **Palaven API** using an HTTP request.
    - **Batch processing**: The dataset is processed in batches of up to 50 instructions each, which gives 28 batches for 1363 records. This keeps memory and processing requirements manageable when working with the fine-tuned model.
    - **Model inference**: For each instruction in the dataset, the fine-tuned `google/gemma-7b-it` model generates a response.
    - **Saving responses**: All generated responses are persisted back to the Palaven API for future evaluation and analysis.

4. **Evaluation Metrics**:
    - **BERTScore**: Precision, Recall, and F1 scores are calculated for each batch to measure the semantic similarity between the generated text and the ground truth.
    - **ROUGE**: ROUGE-1, ROUGE-2, and ROUGE-L are computed to evaluate unigram, bigram, and longest-common-subsequence overlap, with Precision, Recall, and F1 calculated for each metric.
    - **BLEU**: The BLEU score is calculated for each batch, assessing the fluency and correctness of the generated text based on n-gram precision.
    - **Persisting metrics**: All computed metrics are persisted to the cloud via the Palaven API, allowing for deeper analysis and comparisons with other experiments (e.g., out-of-the-box vs. fine-tuned).
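
To make the n-gram overlap behind ROUGE-N concrete, here is a minimal pure-Python sketch of clipped n-gram Precision, Recall, and F1 between a candidate and a reference text. It is an illustration only and not the metric implementation used in this notebook, which is not shown in this section; the function names and the lowercase whitespace tokenization are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap_scores(candidate, reference, n=1):
    """ROUGE-N style precision, recall and F1 (assumed whitespace tokens)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


p, r, f1 = ngram_overlap_scores("el gato come pescado", "el gato come", n=1)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Precision penalizes extra tokens in the generated text, while Recall penalizes reference tokens that are missing, which is why all three values are reported for each ROUGE variant.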


## Key Sections
### 1. Declaration of global variables

| **Variable**             | **Description**                                                                                   | **Value**                                                   |
|--------------------------|---------------------------------------------------------------------------------------------------|-------------------------------------------------------------|
| `hface_read_token`        | Access token for reading from Hugging Face, retrieved from user data.                             | `userdata.get('hface-read-token')`                          |
| `palaven_base_url`        | Base URL for the Palaven API, retrieved from user data.                                           | `userdata.get('palaven-base-url')`                          |
| `batch_size`              | Batch size for inference and evaluation execution.                                                | `50`                                                        |
| `dataset_id`              | Unique identifier for the dataset used in evaluation.                                             | `'F0444B12-5485-4299-B03B-3BDB6D4A2578'`                    |
| `evaluation_session_id`   | Unique identifier for the current evaluation session.                                             | `'EB9C5839-7B20-4D7D-B3F7-17528180676D'`                    |
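| `llm_palaven_name`        | Name of the model used in Palaven.                                                                | `'google-gemma'`                                            |
| `llm_model_name`          | Full name of the model used from Hugging Face.                                                    | `'google/gemma-7b-it'`                                      |
| `llm_palaven_peft`        | Path to the fine-tuned PEFT QLoRA adapter checkpoint on Google Drive.                             | `'/content/drive/MyDrive/Colab Notebooks/palaven/gemma-7b-it-palaven'` |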
| `device_info`             | Information about the device where the inferences and evaluations are executed.                   | `'GPU A100'`                                                |
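
As a quick sanity check on how `batch_size` relates to the dataset size, the sketch below works out how many batches 1363 records produce at 50 per batch. The helper name `batch_plan` is illustrative and not part of the notebook.

```python
import math


def batch_plan(total_records, batch_size):
    """Return (number_of_batches, size_of_last_batch)."""
    if total_records == 0:
        return 0, 0
    n_batches = math.ceil(total_records / batch_size)
    last_batch = total_records - (n_batches - 1) * batch_size
    return n_batches, last_batch


n_batches, last_batch = batch_plan(1363, 50)
print(n_batches, last_batch)  # 28 13
```

This agrees with the 28 batches, numbered 1 to 28, that the inference loop iterates over.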


In [2]:
from google.colab import userdata
from palaven_api_v2 import PalavenApi

hface_read_token = userdata.get('hface-read-token')
palaven_base_url = userdata.get('palaven-base-url')

batch_size = 50
dataset_id = 'F0444B12-5485-4299-B03B-3BDB6D4A2578'
evaluation_session_id = 'EB9C5839-7B20-4D7D-B3F7-17528180676D'
llm_palaven_name = 'google-gemma'
llm_model_name = 'google/gemma-7b-it'
llm_palaven_peft = '/content/drive/MyDrive/Colab Notebooks/palaven/gemma-7b-it-palaven'
device_info = 'GPU A100'

palaven_api = PalavenApi(palaven_base_url)
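
The `palaven_api_v2` module is not included in this notebook. For a dry run of the inference loop without network access, a stand-in along the following lines could be used. Everything in it is hypothetical: the class name `FakePalavenApi`, the in-memory storage, the parameter names of `save_model_response`, and the record shape (dictionaries with `instructionId` and `instruction` keys, inferred from how the notebook consumes the fetched items). Only the two method names and the order of the call arguments are taken from the notebook.

```python
class FakePalavenApi:
    """Hypothetical in-memory stand-in; NOT the real palaven_api_v2 client."""

    def __init__(self, records, batch_size=50):
        self._records = list(records)
        self._batch_size = batch_size
        self.saved = []  # responses collected by save_model_response

    def fetch_instruction_test_dataset(self, evaluation_session_id,
                                       batch_number, evaluation_exercise):
        # Batches are numbered from 1, as in the notebook's loop.
        start = (batch_number - 1) * self._batch_size
        return self._records[start:start + self._batch_size]

    def save_model_response(self, evaluation_session_id, evaluation_exercise,
                            batch_number, instruction_id, chat_completion,
                            elapsed_time):
        # Store the response locally instead of sending an HTTP request.
        self.saved.append({
            "batch_number": batch_number,
            "instruction_id": instruction_id,
            "chat_completion": chat_completion,
            "elapsed_time": elapsed_time,
        })


# Usage with made-up records.
records = [{"instructionId": i, "instruction": f"pregunta {i}"} for i in range(1, 121)]
fake_api = FakePalavenApi(records)
first_batch = fake_api.fetch_instruction_test_dataset("session", batch_number=1,
                                                      evaluation_exercise="llmfinetuned")
print(len(first_batch))
```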

### 2. Load the base model and the tokenizer

In [None]:
!pip install -U transformers
!pip install accelerate

Collecting transformers
  Downloading transformers-4.44.2-py3-none-any.whl.metadata (43 kB)
Downloading transformers-4.44.2-py3-none-any.whl (9.5 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.42.4
    Uninstalling transformers-4.42.4:
      Successfully uninstalled transformers-4.42.4
Successfully installed transformers-4.44.2


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(llm_model_name, token=hface_read_token)
base_model = AutoModelForCausalLM.from_pretrained(llm_model_name, device_map="auto", token=hface_read_token)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


### 2.1. Load the PEFT QLoRA fine-tuned adapters and merge them with the base model `google/gemma-7b-it`
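
Conceptually, merging a LoRA adapter folds the low-rank update into each adapted weight matrix: `W' = W + (alpha / r) * B A`, where `A` and `B` are the trained low-rank factors, `r` is the rank, and `alpha` is the LoRA scaling hyperparameter. The pure-Python sketch below illustrates this arithmetic on tiny matrices. The values and helper names are made up for illustration, and the sketch does not reproduce PEFT's internals.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scaling = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy example: a 2x3 weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]      # shape (out_features, rank)
A = [[0.5, 0.0, 1.0]]   # shape (rank, in_features)

merged = merge_lora(W, B, A, alpha=2, rank=1)
print(merged)
```

After merging, the result is a plain weight matrix of the original shape, so the merged model runs without separate adapter modules in the forward pass.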

In [None]:
!pip install peft

Collecting peft
  Downloading peft-0.12.0-py3-none-any.whl.metadata (13 kB)
Downloading peft-0.12.0-py3-none-any.whl (296 kB)
Installing collected packages: peft
Successfully installed peft-0.12.0


In [None]:
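from peft import PeftModel

# Attach the fine-tuned adapter weights stored at llm_palaven_peft to the base model.
peft_model = PeftModel.from_pretrained(base_model, llm_palaven_peft, token=hface_read_token)
# Fold the adapter weights into the base weights and drop the PEFT wrappers.
model = peft_model.merge_and_unload()
model.to("cuda")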

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 3072, padding_idx=0)
    (layers): ModuleList(
      (0-27): 28 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (k_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (v_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=3072, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=3072, out_features=24576, bias=False)
          (up_proj): Linear(in_features=3072, out_features=24576, bias=False)
          (down_proj): Linear(in_features=24576, out_features=3072, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm((3072,), eps=1e-06)
        (post_attention_layernorm): GemmaRMSNorm((3072,), eps=1
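
The layer shapes printed above are enough to estimate the model's size. The sketch below counts only the weights of the modules visible in the (truncated) output: the embedding table and the 28 decoder layers. Anything cut off from the printout is left out, and the bytes-per-parameter figures are the standard sizes of float32 and bfloat16 rather than values reported by the notebook.

```python
# Dimensions read from the printed module summary.
vocab, hidden, attn_dim, mlp_dim, n_layers = 256000, 3072, 4096, 24576, 28

embed = vocab * hidden                              # embed_tokens
attn = 3 * hidden * attn_dim + attn_dim * hidden    # q, k, v and o projections
mlp = 2 * hidden * mlp_dim + mlp_dim * hidden       # gate, up and down projections
norms = 2 * hidden                                  # two RMSNorm weight vectors
per_layer = attn + mlp + norms
total = embed + n_layers * per_layer

print(f"parameters counted: {total:,}")
for name, nbytes in (("float32", 4), ("bfloat16", 2)):
    print(f"{name}: {total * nbytes / 1024**3:.1f} GiB")
```

The count comes to roughly 8.5 billion weights, which helps explain why a large-memory GPU such as the A100 named in `device_info` is used.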

### 3. Inference Execution

In [None]:
import time
import torch

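def get_chat_completion(instruction_prompt, instruction_id, instruction_count):
  """Generate a completion for one prompt and return (decoded_text, elapsed_seconds)."""

  start_time = time.time()

  # Tokenize the prompt and move the input tensors to the GPU.
  tokenized_instruction = tokenizer(instruction_prompt, return_tensors="pt").to("cuda")

  # Gradients are not needed for inference.
  with torch.no_grad():
    chat_completion_result = model.generate(**tokenized_instruction, max_new_tokens=1245)

  # The generated sequence starts with the prompt tokens, so the decoded text
  # contains the prompt and special tokens as well as the model's answer.
  chat_completion = tokenizer.decode(chat_completion_result[0])

  end_time = time.time()
  elapsed_time = end_time - start_time

  print(f'{instruction_count} - LLM-ChatCompletion. InstructionId: {instruction_id},  Elapsed-Time: {elapsed_time:.2f} seconds')

  return chat_completion, elapsed_time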
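
Because `tokenizer.decode` is applied to the full generated sequence, the returned text includes the prompt and Gemma's control tokens. If only the model's answer is needed downstream, a post-processing step along these lines could be applied. This is a sketch under the assumption that no such cleanup already happens elsewhere (for example, inside the Palaven API); the function name `extract_model_answer` is illustrative and not part of the notebook.

```python
def extract_model_answer(decoded_text):
    """Return the text of the last model turn, without Gemma control tokens."""
    marker = "<start_of_turn>model"
    # Keep only what follows the last model-turn marker
    # (the whole text if the marker is absent).
    answer = decoded_text.rsplit(marker, 1)[-1]
    for token in ("<end_of_turn>", "<eos>", "<bos>"):
        answer = answer.replace(token, "")
    return answer.strip()


decoded = (
    "<bos>\n      <start_of_turn>user\n      ¿Qué es el IVA?<end_of_turn>\n"
    "      <start_of_turn>model\n    El IVA es un impuesto.<end_of_turn><eos>"
)
print(extract_model_answer(decoded))
```

Whether to strip the prompt before or after persisting is a design choice: keeping the raw text preserves everything the model saw and produced, while stripping it avoids counting prompt tokens as overlap when the text is scored against a reference.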

In [None]:
def add_instruction_to_df(df, instruction_id, instruction):
    df.loc[instruction_id, 'instruction'] = instruction
    df.loc[instruction_id, 'chat_completion'] = None
    df.loc[instruction_id, 'elapsed_time'] = None

In [None]:
import pandas as pd
import numpy as np
import json

data_shape = {
    'instruction_id': [],
    'instruction': [],
    'chat_completion': [],
    'elapsed_time': []
}

# 1363 test records in batches of 50 give 28 batches, numbered from 1.
for batch_number in range(1, 29):

  print(f'Start fetching batch {batch_number}')

  instructions = palaven_api.fetch_instruction_test_dataset(evaluation_session_id, batch_number=batch_number, evaluation_exercise='llmfinetuned')

  print(f'Start fetching batch done...')

  # Build a fresh DataFrame for this batch, indexed by instruction id.
  llm_responses_df = pd.DataFrame(data_shape)
  llm_responses_df.set_index('instruction_id', inplace=True)

  for item in instructions:
    add_instruction_to_df(llm_responses_df, item['instructionId'], item['instruction'])

  # 1-based position of the instruction within the batch, used only for logging.
  instruction_count = 1

  for index, row in llm_responses_df.iterrows():
    instruction = llm_responses_df.loc[index, 'instruction']

    # Gemma chat format: a user turn followed by an open model turn.
    instruction_prompt = f"""
      <start_of_turn>user
      Answer the following question in a concise and informative manner. The question is written in Spanish language, then answer in Spanish language.
      {instruction}<end_of_turn>
      <start_of_turn>model
    """

    instruction_id = int(index)

    chat_completion, elapsed_time = get_chat_completion(instruction_prompt, instruction_id, instruction_count)
    # Persist the response and its generation time to the Palaven API.
    palaven_api.save_model_response(evaluation_session_id, 'llmfinetuned', batch_number, instruction_id, chat_completion, elapsed_time)

    instruction_count += 1

Start fetching batch 1
Start fetching batch done...


  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 3878,  Elapsed-Time: 3.89 seconds
Palaven.SaveResponse. InstructionId: 3878,  Elapsed-Time: 0.51 seconds
2 - LLM-ChatCompletion. InstructionId: 3880,  Elapsed-Time: 1.17 seconds
Palaven.SaveResponse. InstructionId: 3880,  Elapsed-Time: 0.25 seconds
3 - LLM-ChatCompletion. InstructionId: 3882,  Elapsed-Time: 1.96 seconds
Palaven.SaveResponse. InstructionId: 3882,  Elapsed-Time: 0.17 seconds
4 - LLM-ChatCompletion. InstructionId: 3884,  Elapsed-Time: 3.06 seconds
Palaven.SaveResponse. InstructionId: 3884,  Elapsed-Time: 0.15 seconds
5 - LLM-ChatCompletion. InstructionId: 3886,  Elapsed-Time: 2.51 seconds
Palaven.SaveResponse. InstructionId: 3886,  Elapsed-Time: 0.20 seconds
6 - LLM-ChatCompletion. InstructionId: 3888,  Elapsed-Time: 1.92 seconds
Palaven.SaveResponse. InstructionId: 3888,  Elapsed-Time: 0.24 seconds
7 - LLM-ChatCompletion. InstructionId: 3894,  Elapsed-Time: 4.36 seconds
Palaven.SaveResponse. InstructionId: 3894,  Elapsed-Time: 0.19 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 4017,  Elapsed-Time: 3.19 seconds
Palaven.SaveResponse. InstructionId: 4017,  Elapsed-Time: 0.15 seconds
2 - LLM-ChatCompletion. InstructionId: 4025,  Elapsed-Time: 5.53 seconds
Palaven.SaveResponse. InstructionId: 4025,  Elapsed-Time: 0.25 seconds
3 - LLM-ChatCompletion. InstructionId: 4027,  Elapsed-Time: 1.47 seconds
Palaven.SaveResponse. InstructionId: 4027,  Elapsed-Time: 0.15 seconds
4 - LLM-ChatCompletion. InstructionId: 4029,  Elapsed-Time: 1.79 seconds
Palaven.SaveResponse. InstructionId: 4029,  Elapsed-Time: 0.19 seconds
5 - LLM-ChatCompletion. InstructionId: 4031,  Elapsed-Time: 1.68 seconds
Palaven.SaveResponse. InstructionId: 4031,  Elapsed-Time: 0.17 seconds
6 - LLM-ChatCompletion. InstructionId: 4033,  Elapsed-Time: 3.17 seconds
Palaven.SaveResponse. InstructionId: 4033,  Elapsed-Time: 0.17 seconds
7 - LLM-ChatCompletion. InstructionId: 4035,  Elapsed-Time: 2.61 seconds
Palaven.SaveResponse. InstructionId: 4035,  Elapsed-Time: 0.26 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 4163,  Elapsed-Time: 1.62 seconds
Palaven.SaveResponse. InstructionId: 4163,  Elapsed-Time: 0.18 seconds
2 - LLM-ChatCompletion. InstructionId: 4165,  Elapsed-Time: 1.01 seconds
Palaven.SaveResponse. InstructionId: 4165,  Elapsed-Time: 0.19 seconds
3 - LLM-ChatCompletion. InstructionId: 4175,  Elapsed-Time: 4.72 seconds
Palaven.SaveResponse. InstructionId: 4175,  Elapsed-Time: 0.15 seconds
4 - LLM-ChatCompletion. InstructionId: 4177,  Elapsed-Time: 2.23 seconds
Palaven.SaveResponse. InstructionId: 4177,  Elapsed-Time: 0.25 seconds
5 - LLM-ChatCompletion. InstructionId: 4179,  Elapsed-Time: 1.77 seconds
Palaven.SaveResponse. InstructionId: 4179,  Elapsed-Time: 0.22 seconds
6 - LLM-ChatCompletion. InstructionId: 4181,  Elapsed-Time: 2.29 seconds
Palaven.SaveResponse. InstructionId: 4181,  Elapsed-Time: 0.17 seconds
7 - LLM-ChatCompletion. InstructionId: 4183,  Elapsed-Time: 2.59 seconds
Palaven.SaveResponse. InstructionId: 4183,  Elapsed-Time: 0.16 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 4308,  Elapsed-Time: 2.26 seconds
Palaven.SaveResponse. InstructionId: 4308,  Elapsed-Time: 0.23 seconds
2 - LLM-ChatCompletion. InstructionId: 4310,  Elapsed-Time: 1.38 seconds
Palaven.SaveResponse. InstructionId: 4310,  Elapsed-Time: 0.17 seconds
3 - LLM-ChatCompletion. InstructionId: 4318,  Elapsed-Time: 4.40 seconds
Palaven.SaveResponse. InstructionId: 4318,  Elapsed-Time: 0.16 seconds
4 - LLM-ChatCompletion. InstructionId: 4320,  Elapsed-Time: 2.01 seconds
Palaven.SaveResponse. InstructionId: 4320,  Elapsed-Time: 0.17 seconds
5 - LLM-ChatCompletion. InstructionId: 4322,  Elapsed-Time: 2.12 seconds
Palaven.SaveResponse. InstructionId: 4322,  Elapsed-Time: 0.21 seconds
6 - LLM-ChatCompletion. InstructionId: 4324,  Elapsed-Time: 1.41 seconds
Palaven.SaveResponse. InstructionId: 4324,  Elapsed-Time: 0.20 seconds
7 - LLM-ChatCompletion. InstructionId: 4326,  Elapsed-Time: 2.54 seconds
Palaven.SaveResponse. InstructionId: 4326,  Elapsed-Time: 0.20 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 4449,  Elapsed-Time: 1.84 seconds
Palaven.SaveResponse. InstructionId: 4449,  Elapsed-Time: 0.18 seconds
2 - LLM-ChatCompletion. InstructionId: 4459,  Elapsed-Time: 4.98 seconds
Palaven.SaveResponse. InstructionId: 4459,  Elapsed-Time: 0.16 seconds
3 - LLM-ChatCompletion. InstructionId: 4461,  Elapsed-Time: 1.15 seconds
Palaven.SaveResponse. InstructionId: 4461,  Elapsed-Time: 0.24 seconds
4 - LLM-ChatCompletion. InstructionId: 4463,  Elapsed-Time: 1.85 seconds
Palaven.SaveResponse. InstructionId: 4463,  Elapsed-Time: 0.19 seconds
5 - LLM-ChatCompletion. InstructionId: 4465,  Elapsed-Time: 2.15 seconds
Palaven.SaveResponse. InstructionId: 4465,  Elapsed-Time: 0.17 seconds
6 - LLM-ChatCompletion. InstructionId: 4467,  Elapsed-Time: 2.33 seconds
Palaven.SaveResponse. InstructionId: 4467,  Elapsed-Time: 0.19 seconds
7 - LLM-ChatCompletion. InstructionId: 4469,  Elapsed-Time: 1.71 seconds
Palaven.SaveResponse. InstructionId: 4469,  Elapsed-Time: 0.17 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 4599,  Elapsed-Time: 3.20 seconds
Palaven.SaveResponse. InstructionId: 4599,  Elapsed-Time: 0.17 seconds
2 - LLM-ChatCompletion. InstructionId: 4601,  Elapsed-Time: 2.20 seconds
Palaven.SaveResponse. InstructionId: 4601,  Elapsed-Time: 0.18 seconds
3 - LLM-ChatCompletion. InstructionId: 4603,  Elapsed-Time: 1.29 seconds
Palaven.SaveResponse. InstructionId: 4603,  Elapsed-Time: 0.17 seconds
4 - LLM-ChatCompletion. InstructionId: 4605,  Elapsed-Time: 2.18 seconds
Palaven.SaveResponse. InstructionId: 4605,  Elapsed-Time: 0.24 seconds
5 - LLM-ChatCompletion. InstructionId: 4614,  Elapsed-Time: 4.48 seconds
Palaven.SaveResponse. InstructionId: 4614,  Elapsed-Time: 0.17 seconds
6 - LLM-ChatCompletion. InstructionId: 4616,  Elapsed-Time: 0.84 seconds
Palaven.SaveResponse. InstructionId: 4616,  Elapsed-Time: 0.16 seconds
7 - LLM-ChatCompletion. InstructionId: 4618,  Elapsed-Time: 1.31 seconds
Palaven.SaveResponse. InstructionId: 4618,  Elapsed-Time: 0.17 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 4742,  Elapsed-Time: 1.87 seconds
Palaven.SaveResponse. InstructionId: 4742,  Elapsed-Time: 0.15 seconds
2 - LLM-ChatCompletion. InstructionId: 4747,  Elapsed-Time: 4.47 seconds
Palaven.SaveResponse. InstructionId: 4747,  Elapsed-Time: 0.24 seconds
3 - LLM-ChatCompletion. InstructionId: 4749,  Elapsed-Time: 2.45 seconds
Palaven.SaveResponse. InstructionId: 4749,  Elapsed-Time: 0.17 seconds
4 - LLM-ChatCompletion. InstructionId: 4751,  Elapsed-Time: 2.20 seconds
Palaven.SaveResponse. InstructionId: 4751,  Elapsed-Time: 0.17 seconds
5 - LLM-ChatCompletion. InstructionId: 4753,  Elapsed-Time: 1.77 seconds
Palaven.SaveResponse. InstructionId: 4753,  Elapsed-Time: 0.17 seconds
6 - LLM-ChatCompletion. InstructionId: 4755,  Elapsed-Time: 1.94 seconds
Palaven.SaveResponse. InstructionId: 4755,  Elapsed-Time: 0.17 seconds
7 - LLM-ChatCompletion. InstructionId: 4757,  Elapsed-Time: 2.38 seconds
Palaven.SaveResponse. InstructionId: 4757,  Elapsed-Time: 0.24 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 4881,  Elapsed-Time: 0.84 seconds
Palaven.SaveResponse. InstructionId: 4881,  Elapsed-Time: 0.24 seconds
2 - LLM-ChatCompletion. InstructionId: 4889,  Elapsed-Time: 4.52 seconds
Palaven.SaveResponse. InstructionId: 4889,  Elapsed-Time: 0.17 seconds
3 - LLM-ChatCompletion. InstructionId: 4891,  Elapsed-Time: 2.72 seconds
Palaven.SaveResponse. InstructionId: 4891,  Elapsed-Time: 0.17 seconds
4 - LLM-ChatCompletion. InstructionId: 4893,  Elapsed-Time: 1.69 seconds
Palaven.SaveResponse. InstructionId: 4893,  Elapsed-Time: 0.20 seconds
5 - LLM-ChatCompletion. InstructionId: 4895,  Elapsed-Time: 1.82 seconds
Palaven.SaveResponse. InstructionId: 4895,  Elapsed-Time: 0.16 seconds
6 - LLM-ChatCompletion. InstructionId: 4897,  Elapsed-Time: 2.05 seconds
Palaven.SaveResponse. InstructionId: 4897,  Elapsed-Time: 0.18 seconds
7 - LLM-ChatCompletion. InstructionId: 4899,  Elapsed-Time: 1.94 seconds
Palaven.SaveResponse. InstructionId: 4899,  Elapsed-Time: 0.15 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 5024,  Elapsed-Time: 1.74 seconds
Palaven.SaveResponse. InstructionId: 5024,  Elapsed-Time: 0.17 seconds
2 - LLM-ChatCompletion. InstructionId: 5030,  Elapsed-Time: 4.48 seconds
Palaven.SaveResponse. InstructionId: 5030,  Elapsed-Time: 0.16 seconds
3 - LLM-ChatCompletion. InstructionId: 5032,  Elapsed-Time: 3.90 seconds
Palaven.SaveResponse. InstructionId: 5032,  Elapsed-Time: 0.19 seconds
4 - LLM-ChatCompletion. InstructionId: 5034,  Elapsed-Time: 2.15 seconds
Palaven.SaveResponse. InstructionId: 5034,  Elapsed-Time: 0.20 seconds
5 - LLM-ChatCompletion. InstructionId: 5036,  Elapsed-Time: 1.00 seconds
Palaven.SaveResponse. InstructionId: 5036,  Elapsed-Time: 0.19 seconds
6 - LLM-ChatCompletion. InstructionId: 5038,  Elapsed-Time: 2.68 seconds
Palaven.SaveResponse. InstructionId: 5038,  Elapsed-Time: 0.18 seconds
7 - LLM-ChatCompletion. InstructionId: 5040,  Elapsed-Time: 1.46 seconds
Palaven.SaveResponse. InstructionId: 5040,  Elapsed-Time: 0.20 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 5168,  Elapsed-Time: 2.56 seconds
Palaven.SaveResponse. InstructionId: 5168,  Elapsed-Time: 0.26 seconds
2 - LLM-ChatCompletion. InstructionId: 5170,  Elapsed-Time: 1.51 seconds
Palaven.SaveResponse. InstructionId: 5170,  Elapsed-Time: 0.20 seconds
3 - LLM-ChatCompletion. InstructionId: 5172,  Elapsed-Time: 1.87 seconds
Palaven.SaveResponse. InstructionId: 5172,  Elapsed-Time: 0.17 seconds
4 - LLM-ChatCompletion. InstructionId: 5174,  Elapsed-Time: 2.18 seconds
Palaven.SaveResponse. InstructionId: 5174,  Elapsed-Time: 0.19 seconds
5 - LLM-ChatCompletion. InstructionId: 5176,  Elapsed-Time: 1.94 seconds
Palaven.SaveResponse. InstructionId: 5176,  Elapsed-Time: 0.16 seconds
6 - LLM-ChatCompletion. InstructionId: 5178,  Elapsed-Time: 1.80 seconds
Palaven.SaveResponse. InstructionId: 5178,  Elapsed-Time: 0.19 seconds
7 - LLM-ChatCompletion. InstructionId: 5180,  Elapsed-Time: 2.08 seconds
Palaven.SaveResponse. InstructionId: 5180,  Elapsed-Time: 0.19 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 5309,  Elapsed-Time: 2.01 seconds
Palaven.SaveResponse. InstructionId: 5309,  Elapsed-Time: 0.18 seconds
2 - LLM-ChatCompletion. InstructionId: 5311,  Elapsed-Time: 1.83 seconds
Palaven.SaveResponse. InstructionId: 5311,  Elapsed-Time: 0.18 seconds
3 - LLM-ChatCompletion. InstructionId: 5313,  Elapsed-Time: 1.84 seconds
Palaven.SaveResponse. InstructionId: 5313,  Elapsed-Time: 0.16 seconds
4 - LLM-ChatCompletion. InstructionId: 5315,  Elapsed-Time: 1.45 seconds
Palaven.SaveResponse. InstructionId: 5315,  Elapsed-Time: 0.21 seconds
5 - LLM-ChatCompletion. InstructionId: 5317,  Elapsed-Time: 1.88 seconds
Palaven.SaveResponse. InstructionId: 5317,  Elapsed-Time: 0.24 seconds
6 - LLM-ChatCompletion. InstructionId: 5319,  Elapsed-Time: 1.49 seconds
Palaven.SaveResponse. InstructionId: 5319,  Elapsed-Time: 0.16 seconds
7 - LLM-ChatCompletion. InstructionId: 5321,  Elapsed-Time: 2.21 seconds
Palaven.SaveResponse. InstructionId: 5321,  Elapsed-Time: 0.17 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 5452,  Elapsed-Time: 1.53 seconds
Palaven.SaveResponse. InstructionId: 5452,  Elapsed-Time: 0.15 seconds
2 - LLM-ChatCompletion. InstructionId: 5454,  Elapsed-Time: 2.08 seconds
Palaven.SaveResponse. InstructionId: 5454,  Elapsed-Time: 0.15 seconds
3 - LLM-ChatCompletion. InstructionId: 5456,  Elapsed-Time: 2.99 seconds
Palaven.SaveResponse. InstructionId: 5456,  Elapsed-Time: 0.21 seconds
4 - LLM-ChatCompletion. InstructionId: 5458,  Elapsed-Time: 1.43 seconds
Palaven.SaveResponse. InstructionId: 5458,  Elapsed-Time: 0.16 seconds
5 - LLM-ChatCompletion. InstructionId: 5460,  Elapsed-Time: 2.24 seconds
Palaven.SaveResponse. InstructionId: 5460,  Elapsed-Time: 0.17 seconds
6 - LLM-ChatCompletion. InstructionId: 5469,  Elapsed-Time: 4.14 seconds
Palaven.SaveResponse. InstructionId: 5469,  Elapsed-Time: 0.17 seconds
7 - LLM-ChatCompletion. InstructionId: 5471,  Elapsed-Time: 1.33 seconds
Palaven.SaveResponse. InstructionId: 5471,  Elapsed-Time: 0.23 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 5597,  Elapsed-Time: 1.62 seconds
Palaven.SaveResponse. InstructionId: 5597,  Elapsed-Time: 0.18 seconds
2 - LLM-ChatCompletion. InstructionId: 5599,  Elapsed-Time: 2.10 seconds
Palaven.SaveResponse. InstructionId: 5599,  Elapsed-Time: 0.19 seconds
3 - LLM-ChatCompletion. InstructionId: 5606,  Elapsed-Time: 5.14 seconds
Palaven.SaveResponse. InstructionId: 5606,  Elapsed-Time: 0.25 seconds
4 - LLM-ChatCompletion. InstructionId: 5608,  Elapsed-Time: 1.61 seconds
Palaven.SaveResponse. InstructionId: 5608,  Elapsed-Time: 0.17 seconds
5 - LLM-ChatCompletion. InstructionId: 5610,  Elapsed-Time: 2.41 seconds
Palaven.SaveResponse. InstructionId: 5610,  Elapsed-Time: 0.15 seconds
6 - LLM-ChatCompletion. InstructionId: 5612,  Elapsed-Time: 1.39 seconds
Palaven.SaveResponse. InstructionId: 5612,  Elapsed-Time: 0.18 seconds
7 - LLM-ChatCompletion. InstructionId: 5614,  Elapsed-Time: 2.12 seconds
Palaven.SaveResponse. InstructionId: 5614,  Elapsed-Time: 0.17 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 5737,  Elapsed-Time: 1.24 seconds
Palaven.SaveResponse. InstructionId: 5737,  Elapsed-Time: 0.18 seconds
2 - LLM-ChatCompletion. InstructionId: 5739,  Elapsed-Time: 2.00 seconds
Palaven.SaveResponse. InstructionId: 5739,  Elapsed-Time: 0.15 seconds
3 - LLM-ChatCompletion. InstructionId: 5741,  Elapsed-Time: 2.65 seconds
Palaven.SaveResponse. InstructionId: 5741,  Elapsed-Time: 0.16 seconds
4 - LLM-ChatCompletion. InstructionId: 5748,  Elapsed-Time: 6.12 seconds
Palaven.SaveResponse. InstructionId: 5748,  Elapsed-Time: 0.26 seconds
5 - LLM-ChatCompletion. InstructionId: 5750,  Elapsed-Time: 2.21 seconds
Palaven.SaveResponse. InstructionId: 5750,  Elapsed-Time: 0.17 seconds
6 - LLM-ChatCompletion. InstructionId: 5752,  Elapsed-Time: 2.27 seconds
Palaven.SaveResponse. InstructionId: 5752,  Elapsed-Time: 0.18 seconds
7 - LLM-ChatCompletion. InstructionId: 5754,  Elapsed-Time: 1.72 seconds
Palaven.SaveResponse. InstructionId: 5754,  Elapsed-Time: 0.15 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 5884,  Elapsed-Time: 5.65 seconds
Palaven.SaveResponse. InstructionId: 5884,  Elapsed-Time: 0.23 seconds
2 - LLM-ChatCompletion. InstructionId: 5886,  Elapsed-Time: 1.21 seconds
Palaven.SaveResponse. InstructionId: 5886,  Elapsed-Time: 0.17 seconds
3 - LLM-ChatCompletion. InstructionId: 5888,  Elapsed-Time: 1.59 seconds
Palaven.SaveResponse. InstructionId: 5888,  Elapsed-Time: 0.17 seconds
4 - LLM-ChatCompletion. InstructionId: 5890,  Elapsed-Time: 1.73 seconds
Palaven.SaveResponse. InstructionId: 5890,  Elapsed-Time: 0.17 seconds
5 - LLM-ChatCompletion. InstructionId: 5892,  Elapsed-Time: 1.63 seconds
Palaven.SaveResponse. InstructionId: 5892,  Elapsed-Time: 0.15 seconds
6 - LLM-ChatCompletion. InstructionId: 5894,  Elapsed-Time: 2.33 seconds
Palaven.SaveResponse. InstructionId: 5894,  Elapsed-Time: 0.15 seconds
7 - LLM-ChatCompletion. InstructionId: 5896,  Elapsed-Time: 1.94 seconds
Palaven.SaveResponse. InstructionId: 5896,  Elapsed-Time: 0.18 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 6020,  Elapsed-Time: 1.95 seconds
Palaven.SaveResponse. InstructionId: 6020,  Elapsed-Time: 0.16 seconds
2 - LLM-ChatCompletion. InstructionId: 6022,  Elapsed-Time: 1.98 seconds
Palaven.SaveResponse. InstructionId: 6022,  Elapsed-Time: 0.19 seconds
3 - LLM-ChatCompletion. InstructionId: 6030,  Elapsed-Time: 1.63 seconds
Palaven.SaveResponse. InstructionId: 6030,  Elapsed-Time: 0.22 seconds
4 - LLM-ChatCompletion. InstructionId: 6032,  Elapsed-Time: 1.93 seconds
Palaven.SaveResponse. InstructionId: 6032,  Elapsed-Time: 0.19 seconds
5 - LLM-ChatCompletion. InstructionId: 6034,  Elapsed-Time: 1.58 seconds
Palaven.SaveResponse. InstructionId: 6034,  Elapsed-Time: 0.15 seconds
6 - LLM-ChatCompletion. InstructionId: 6036,  Elapsed-Time: 2.55 seconds
Palaven.SaveResponse. InstructionId: 6036,  Elapsed-Time: 0.18 seconds
7 - LLM-ChatCompletion. InstructionId: 6038,  Elapsed-Time: 2.21 seconds
Palaven.SaveResponse. InstructionId: 6038,  Elapsed-Time: 0.17 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 6162,  Elapsed-Time: 1.91 seconds
Palaven.SaveResponse. InstructionId: 6162,  Elapsed-Time: 0.16 seconds
2 - LLM-ChatCompletion. InstructionId: 6164,  Elapsed-Time: 1.64 seconds
Palaven.SaveResponse. InstructionId: 6164,  Elapsed-Time: 0.16 seconds
3 - LLM-ChatCompletion. InstructionId: 6166,  Elapsed-Time: 1.43 seconds
Palaven.SaveResponse. InstructionId: 6166,  Elapsed-Time: 0.19 seconds
4 - LLM-ChatCompletion. InstructionId: 6168,  Elapsed-Time: 1.88 seconds
Palaven.SaveResponse. InstructionId: 6168,  Elapsed-Time: 0.25 seconds
5 - LLM-ChatCompletion. InstructionId: 6176,  Elapsed-Time: 4.82 seconds
Palaven.SaveResponse. InstructionId: 6176,  Elapsed-Time: 0.17 seconds
6 - LLM-ChatCompletion. InstructionId: 6178,  Elapsed-Time: 1.40 seconds
Palaven.SaveResponse. InstructionId: 6178,  Elapsed-Time: 0.16 seconds
7 - LLM-ChatCompletion. InstructionId: 6180,  Elapsed-Time: 1.29 seconds
Palaven.SaveResponse. InstructionId: 6180,  Elapsed-Time: 0.19 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 6306,  Elapsed-Time: 1.52 seconds
Palaven.SaveResponse. InstructionId: 6306,  Elapsed-Time: 0.27 seconds
2 - LLM-ChatCompletion. InstructionId: 6308,  Elapsed-Time: 1.74 seconds
Palaven.SaveResponse. InstructionId: 6308,  Elapsed-Time: 0.17 seconds
3 - LLM-ChatCompletion. InstructionId: 6310,  Elapsed-Time: 1.50 seconds
Palaven.SaveResponse. InstructionId: 6310,  Elapsed-Time: 0.17 seconds
4 - LLM-ChatCompletion. InstructionId: 6312,  Elapsed-Time: 2.01 seconds
Palaven.SaveResponse. InstructionId: 6312,  Elapsed-Time: 0.17 seconds
5 - LLM-ChatCompletion. InstructionId: 6314,  Elapsed-Time: 1.49 seconds
Palaven.SaveResponse. InstructionId: 6314,  Elapsed-Time: 0.15 seconds
6 - LLM-ChatCompletion. InstructionId: 6322,  Elapsed-Time: 5.06 seconds
Palaven.SaveResponse. InstructionId: 6322,  Elapsed-Time: 0.25 seconds
7 - LLM-ChatCompletion. InstructionId: 6324,  Elapsed-Time: 2.00 seconds
Palaven.SaveResponse. InstructionId: 6324,  Elapsed-Time: 0.17 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 6444,  Elapsed-Time: 1.51 seconds
Palaven.SaveResponse. InstructionId: 6444,  Elapsed-Time: 0.17 seconds
2 - LLM-ChatCompletion. InstructionId: 6446,  Elapsed-Time: 1.79 seconds
Palaven.SaveResponse. InstructionId: 6446,  Elapsed-Time: 0.17 seconds
3 - LLM-ChatCompletion. InstructionId: 6453,  Elapsed-Time: 4.86 seconds
Palaven.SaveResponse. InstructionId: 6453,  Elapsed-Time: 0.24 seconds
4 - LLM-ChatCompletion. InstructionId: 6455,  Elapsed-Time: 2.55 seconds
Palaven.SaveResponse. InstructionId: 6455,  Elapsed-Time: 0.21 seconds
5 - LLM-ChatCompletion. InstructionId: 6457,  Elapsed-Time: 3.15 seconds
Palaven.SaveResponse. InstructionId: 6457,  Elapsed-Time: 0.16 seconds
6 - LLM-ChatCompletion. InstructionId: 6459,  Elapsed-Time: 1.84 seconds
Palaven.SaveResponse. InstructionId: 6459,  Elapsed-Time: 0.17 seconds
7 - LLM-ChatCompletion. InstructionId: 6461,  Elapsed-Time: 2.36 seconds
Palaven.SaveResponse. InstructionId: 6461,  Elapsed-Time: 0.22 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 6582,  Elapsed-Time: 2.42 seconds
Palaven.SaveResponse. InstructionId: 6582,  Elapsed-Time: 0.23 seconds
2 - LLM-ChatCompletion. InstructionId: 6584,  Elapsed-Time: 3.54 seconds
Palaven.SaveResponse. InstructionId: 6584,  Elapsed-Time: 0.17 seconds
3 - LLM-ChatCompletion. InstructionId: 6591,  Elapsed-Time: 4.49 seconds
Palaven.SaveResponse. InstructionId: 6591,  Elapsed-Time: 0.19 seconds
4 - LLM-ChatCompletion. InstructionId: 6593,  Elapsed-Time: 2.37 seconds
Palaven.SaveResponse. InstructionId: 6593,  Elapsed-Time: 0.26 seconds
5 - LLM-ChatCompletion. InstructionId: 6595,  Elapsed-Time: 2.24 seconds
Palaven.SaveResponse. InstructionId: 6595,  Elapsed-Time: 0.15 seconds
6 - LLM-ChatCompletion. InstructionId: 6597,  Elapsed-Time: 2.01 seconds
Palaven.SaveResponse. InstructionId: 6597,  Elapsed-Time: 0.23 seconds
7 - LLM-ChatCompletion. InstructionId: 6599,  Elapsed-Time: 2.26 seconds
Palaven.SaveResponse. InstructionId: 6599,  Elapsed-Time: 0.15 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 6724,  Elapsed-Time: 2.47 seconds
Palaven.SaveResponse. InstructionId: 6724,  Elapsed-Time: 0.23 seconds
2 - LLM-ChatCompletion. InstructionId: 6726,  Elapsed-Time: 1.82 seconds
Palaven.SaveResponse. InstructionId: 6726,  Elapsed-Time: 0.19 seconds
3 - LLM-ChatCompletion. InstructionId: 6728,  Elapsed-Time: 0.76 seconds
Palaven.SaveResponse. InstructionId: 6728,  Elapsed-Time: 0.18 seconds
4 - LLM-ChatCompletion. InstructionId: 6736,  Elapsed-Time: 4.96 seconds
Palaven.SaveResponse. InstructionId: 6736,  Elapsed-Time: 0.19 seconds
5 - LLM-ChatCompletion. InstructionId: 6738,  Elapsed-Time: 1.09 seconds
Palaven.SaveResponse. InstructionId: 6738,  Elapsed-Time: 0.15 seconds
6 - LLM-ChatCompletion. InstructionId: 6740,  Elapsed-Time: 1.93 seconds
Palaven.SaveResponse. InstructionId: 6740,  Elapsed-Time: 0.26 seconds
7 - LLM-ChatCompletion. InstructionId: 6742,  Elapsed-Time: 0.88 seconds
Palaven.SaveResponse. InstructionId: 6742,  Elapsed-Time: 0.18 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 6860,  Elapsed-Time: 0.95 seconds
Palaven.SaveResponse. InstructionId: 6860,  Elapsed-Time: 0.21 seconds
2 - LLM-ChatCompletion. InstructionId: 6868,  Elapsed-Time: 4.89 seconds
Palaven.SaveResponse. InstructionId: 6868,  Elapsed-Time: 0.17 seconds
3 - LLM-ChatCompletion. InstructionId: 6870,  Elapsed-Time: 1.70 seconds
Palaven.SaveResponse. InstructionId: 6870,  Elapsed-Time: 0.18 seconds
4 - LLM-ChatCompletion. InstructionId: 6872,  Elapsed-Time: 1.60 seconds
Palaven.SaveResponse. InstructionId: 6872,  Elapsed-Time: 0.25 seconds
5 - LLM-ChatCompletion. InstructionId: 6874,  Elapsed-Time: 1.85 seconds
Palaven.SaveResponse. InstructionId: 6874,  Elapsed-Time: 0.17 seconds
6 - LLM-ChatCompletion. InstructionId: 6876,  Elapsed-Time: 2.00 seconds
Palaven.SaveResponse. InstructionId: 6876,  Elapsed-Time: 0.17 seconds
7 - LLM-ChatCompletion. InstructionId: 6878,  Elapsed-Time: 1.24 seconds
Palaven.SaveResponse. InstructionId: 6878,  Elapsed-Time: 0.16 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 6999,  Elapsed-Time: 2.27 seconds
Palaven.SaveResponse. InstructionId: 6999,  Elapsed-Time: 0.19 seconds
2 - LLM-ChatCompletion. InstructionId: 7001,  Elapsed-Time: 1.43 seconds
Palaven.SaveResponse. InstructionId: 7001,  Elapsed-Time: 0.18 seconds
3 - LLM-ChatCompletion. InstructionId: 7003,  Elapsed-Time: 1.66 seconds
Palaven.SaveResponse. InstructionId: 7003,  Elapsed-Time: 0.18 seconds
4 - LLM-ChatCompletion. InstructionId: 7005,  Elapsed-Time: 1.18 seconds
Palaven.SaveResponse. InstructionId: 7005,  Elapsed-Time: 0.21 seconds
5 - LLM-ChatCompletion. InstructionId: 7015,  Elapsed-Time: 4.91 seconds
Palaven.SaveResponse. InstructionId: 7015,  Elapsed-Time: 0.15 seconds
6 - LLM-ChatCompletion. InstructionId: 7017,  Elapsed-Time: 1.53 seconds
Palaven.SaveResponse. InstructionId: 7017,  Elapsed-Time: 0.18 seconds
7 - LLM-ChatCompletion. InstructionId: 7019,  Elapsed-Time: 2.54 seconds
Palaven.SaveResponse. InstructionId: 7019,  Elapsed-Time: 0.24 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 7144,  Elapsed-Time: 3.16 seconds
Palaven.SaveResponse. InstructionId: 7144,  Elapsed-Time: 0.19 seconds
2 - LLM-ChatCompletion. InstructionId: 7146,  Elapsed-Time: 3.41 seconds
Palaven.SaveResponse. InstructionId: 7146,  Elapsed-Time: 0.21 seconds
3 - LLM-ChatCompletion. InstructionId: 7154,  Elapsed-Time: 4.91 seconds
Palaven.SaveResponse. InstructionId: 7154,  Elapsed-Time: 0.17 seconds
4 - LLM-ChatCompletion. InstructionId: 7156,  Elapsed-Time: 2.33 seconds
Palaven.SaveResponse. InstructionId: 7156,  Elapsed-Time: 0.17 seconds
5 - LLM-ChatCompletion. InstructionId: 7158,  Elapsed-Time: 3.25 seconds
Palaven.SaveResponse. InstructionId: 7158,  Elapsed-Time: 0.24 seconds
6 - LLM-ChatCompletion. InstructionId: 7160,  Elapsed-Time: 2.30 seconds
Palaven.SaveResponse. InstructionId: 7160,  Elapsed-Time: 0.17 seconds
7 - LLM-ChatCompletion. InstructionId: 7162,  Elapsed-Time: 1.83 seconds
Palaven.SaveResponse. InstructionId: 7162,  Elapsed-Time: 0.18 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 7294,  Elapsed-Time: 5.30 seconds
Palaven.SaveResponse. InstructionId: 7294,  Elapsed-Time: 0.26 seconds
2 - LLM-ChatCompletion. InstructionId: 7296,  Elapsed-Time: 0.89 seconds
Palaven.SaveResponse. InstructionId: 7296,  Elapsed-Time: 0.19 seconds
3 - LLM-ChatCompletion. InstructionId: 7298,  Elapsed-Time: 1.76 seconds
Palaven.SaveResponse. InstructionId: 7298,  Elapsed-Time: 0.20 seconds
4 - LLM-ChatCompletion. InstructionId: 7300,  Elapsed-Time: 1.64 seconds
Palaven.SaveResponse. InstructionId: 7300,  Elapsed-Time: 0.15 seconds
5 - LLM-ChatCompletion. InstructionId: 7302,  Elapsed-Time: 2.06 seconds
Palaven.SaveResponse. InstructionId: 7302,  Elapsed-Time: 0.19 seconds
6 - LLM-ChatCompletion. InstructionId: 7304,  Elapsed-Time: 1.74 seconds
Palaven.SaveResponse. InstructionId: 7304,  Elapsed-Time: 0.19 seconds
7 - LLM-ChatCompletion. InstructionId: 7306,  Elapsed-Time: 1.64 seconds
Palaven.SaveResponse. InstructionId: 7306,  Elapsed-Time: 0.19 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 7433,  Elapsed-Time: 2.55 seconds
Palaven.SaveResponse. InstructionId: 7433,  Elapsed-Time: 0.17 seconds
2 - LLM-ChatCompletion. InstructionId: 7435,  Elapsed-Time: 1.79 seconds
Palaven.SaveResponse. InstructionId: 7435,  Elapsed-Time: 0.16 seconds
3 - LLM-ChatCompletion. InstructionId: 7437,  Elapsed-Time: 1.51 seconds
Palaven.SaveResponse. InstructionId: 7437,  Elapsed-Time: 0.19 seconds
4 - LLM-ChatCompletion. InstructionId: 7439,  Elapsed-Time: 1.91 seconds
Palaven.SaveResponse. InstructionId: 7439,  Elapsed-Time: 0.17 seconds
5 - LLM-ChatCompletion. InstructionId: 7441,  Elapsed-Time: 1.51 seconds
Palaven.SaveResponse. InstructionId: 7441,  Elapsed-Time: 0.23 seconds
6 - LLM-ChatCompletion. InstructionId: 7451,  Elapsed-Time: 4.41 seconds
Palaven.SaveResponse. InstructionId: 7451,  Elapsed-Time: 0.17 seconds
7 - LLM-ChatCompletion. InstructionId: 7453,  Elapsed-Time: 1.86 seconds
Palaven.SaveResponse. InstructionId: 7453,  Elapsed-Time: 0.16 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 7577,  Elapsed-Time: 4.97 seconds
Palaven.SaveResponse. InstructionId: 7577,  Elapsed-Time: 0.21 seconds
2 - LLM-ChatCompletion. InstructionId: 7579,  Elapsed-Time: 1.42 seconds
Palaven.SaveResponse. InstructionId: 7579,  Elapsed-Time: 0.19 seconds
3 - LLM-ChatCompletion. InstructionId: 7581,  Elapsed-Time: 2.01 seconds
Palaven.SaveResponse. InstructionId: 7581,  Elapsed-Time: 0.15 seconds
4 - LLM-ChatCompletion. InstructionId: 7583,  Elapsed-Time: 0.90 seconds
Palaven.SaveResponse. InstructionId: 7583,  Elapsed-Time: 0.17 seconds
5 - LLM-ChatCompletion. InstructionId: 7585,  Elapsed-Time: 1.53 seconds
Palaven.SaveResponse. InstructionId: 7585,  Elapsed-Time: 0.15 seconds
6 - LLM-ChatCompletion. InstructionId: 7587,  Elapsed-Time: 1.91 seconds
Palaven.SaveResponse. InstructionId: 7587,  Elapsed-Time: 0.20 seconds
7 - LLM-ChatCompletion. InstructionId: 7589,  Elapsed-Time: 1.06 seconds
Palaven.SaveResponse. InstructionId: 7589,  Elapsed-Time: 0.26 

  df.loc[instruction_id, 'instruction'] = instruction


1 - LLM-ChatCompletion. InstructionId: 7712,  Elapsed-Time: 1.72 seconds
Palaven.SaveResponse. InstructionId: 7712,  Elapsed-Time: 0.16 seconds
2 - LLM-ChatCompletion. InstructionId: 7714,  Elapsed-Time: 1.67 seconds
Palaven.SaveResponse. InstructionId: 7714,  Elapsed-Time: 0.18 seconds
3 - LLM-ChatCompletion. InstructionId: 7716,  Elapsed-Time: 1.85 seconds
Palaven.SaveResponse. InstructionId: 7716,  Elapsed-Time: 0.20 seconds
4 - LLM-ChatCompletion. InstructionId: 7725,  Elapsed-Time: 4.99 seconds
Palaven.SaveResponse. InstructionId: 7725,  Elapsed-Time: 0.23 seconds
5 - LLM-ChatCompletion. InstructionId: 7727,  Elapsed-Time: 2.00 seconds
Palaven.SaveResponse. InstructionId: 7727,  Elapsed-Time: 0.17 seconds
6 - LLM-ChatCompletion. InstructionId: 7729,  Elapsed-Time: 2.03 seconds
Palaven.SaveResponse. InstructionId: 7729,  Elapsed-Time: 0.16 seconds
7 - LLM-ChatCompletion. InstructionId: 7731,  Elapsed-Time: 1.33 seconds
Palaven.SaveResponse. InstructionId: 7731,  Elapsed-Time: 0.17 

### 4. Evaluation metrics

### 4.1. BERTScore

In [None]:
!pip install bert_score

Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert_score
Successfully installed bert_score-0.3.13


In [4]:
import pandas as pd

def build_responses_df():
  # One row per model response, indexed by instruction_id (set below).
  dataframe = pd.DataFrame(columns=['instruction_id',
    'evaluation_session_id', 'dataset_id', 'batch_size', 'large_language_model',
    'device_info', 'excercise_type', 'batch_number', 'instruction',
    'response', 'category', 'response_to_evaluate', 'elapsed_time'])

  # Cast the text columns to 'object' explicitly so they can hold strings.
  text_columns = ['evaluation_session_id', 'dataset_id', 'large_language_model',
    'device_info', 'excercise_type', 'instruction', 'response', 'category',
    'response_to_evaluate']
  for column in text_columns:
    dataframe[column] = dataframe[column].astype('object')

  dataframe.set_index('instruction_id', inplace=True)

  return dataframe


# Maps each DataFrame column to the corresponding key of a Palaven API response item.
COLUMN_TO_RESPONSE_KEY = {
  'evaluation_session_id': 'evaluationSessionId',
  'dataset_id': 'datasetId',
  'batch_size': 'batchSize',
  'large_language_model': 'largeLanguageModel',
  'device_info': 'deviceInfo',
  'excercise_type': 'evaluationExercise',
  'batch_number': 'batchNumber',
  'instruction': 'instruction',
  'response': 'response',
  'category': 'category',
  'response_to_evaluate': 'llmResponseToEvaluate',
  'elapsed_time': 'elapsedTime',
}


def add_response_to_df(dataframe, response):
  # Insert or update the row identified by the response's instruction id.
  instruction_id = response['instructionId']
  for column, key in COLUMN_TO_RESPONSE_KEY.items():
    dataframe.loc[instruction_id, column] = response[key]

In [None]:
from bert_score import score as bert_score

for batch_number in range(1, 29):

  model_responses_df = build_responses_df()

  print(f'Palaven. Bertscore evaluation. Start processing batch: {batch_number}')

  model_responses = palaven_api.fetch_model_responses(evaluation_session_id, 'llmfinetuned', batch_number)

  for item in model_responses:
    add_response_to_df(model_responses_df, item)

  references = model_responses_df['response'].tolist()
  candidates = model_responses_df['response_to_evaluate'].tolist()

  # bert_score returns per-sentence precision, recall, and F1 tensors, in that order.
  precision, recall, f1 = bert_score(candidates, references, lang='es', verbose=True, device='cuda')

  precision = precision.tolist()
  recall = recall.tolist()
  f1 = f1.tolist()

  average_precision = sum(precision) / len(precision)
  average_recall = sum(recall) / len(recall)
  average_f1 = sum(f1) / len(f1)

  # The log label reads "accuracy", but the value reported is the average BERTScore precision.
  print(f'Palaven. Bertscore evaluation. Batch: {batch_number}. Average accuracy: {average_precision}. Average recall: {average_recall}. Average F1: {average_f1}')

  palaven_api.save_bert_score_metrics(evaluation_session_id, 'llmfinetuned', batch_number, average_precision, average_recall, average_f1)

  print(f'Palaven. Bertscore evaluation. Posted batch {batch_number}')

Palaven. Bertscore evaluation. Start processing batch: 1


calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.61 seconds, 81.99 sentences/sec
Palaven. Bertscore evaluation. Batch: 1. Average accuracy: 0.7433831167221069. Average recall: 0.769189248085022. Average F1: 0.7550797998905182
Palaven. Bertscore evaluation. Posted batch 1
Palaven. Bertscore evaluation. Start processing batch: 2




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.34 seconds, 146.86 sentences/sec
Palaven. Bertscore evaluation. Batch: 2. Average accuracy: 0.7605230522155761. Average recall: 0.7669429576396942. Average F1: 0.7628635048866272
Palaven. Bertscore evaluation. Posted batch 2
Palaven. Bertscore evaluation. Start processing batch: 3




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.30 seconds, 166.14 sentences/sec
Palaven. Bertscore evaluation. Batch: 3. Average accuracy: 0.7644301629066468. Average recall: 0.7859627294540406. Average F1: 0.774216297864914
Palaven. Bertscore evaluation. Posted batch 3
Palaven. Bertscore evaluation. Start processing batch: 4




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.29 seconds, 172.54 sentences/sec
Palaven. Bertscore evaluation. Batch: 4. Average accuracy: 0.7473346304893493. Average recall: 0.7715391671657562. Average F1: 0.7586507439613343
Palaven. Bertscore evaluation. Posted batch 4
Palaven. Bertscore evaluation. Start processing batch: 5




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.32 seconds, 157.36 sentences/sec
Palaven. Bertscore evaluation. Batch: 5. Average accuracy: 0.7484136587381363. Average recall: 0.7876072919368744. Average F1: 0.7661528611183166
Palaven. Bertscore evaluation. Posted batch 5
Palaven. Bertscore evaluation. Start processing batch: 6




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.32 seconds, 156.97 sentences/sec
Palaven. Bertscore evaluation. Batch: 6. Average accuracy: 0.7555696129798889. Average recall: 0.7731912088394165. Average F1: 0.7630904507637024
Palaven. Bertscore evaluation. Posted batch 6
Palaven. Bertscore evaluation. Start processing batch: 7




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.37 seconds, 134.17 sentences/sec
Palaven. Bertscore evaluation. Batch: 7. Average accuracy: 0.7640463185310363. Average recall: 0.7878015053272247. Average F1: 0.7749661898612976
Palaven. Bertscore evaluation. Posted batch 7
Palaven. Bertscore evaluation. Start processing batch: 8




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.27 seconds, 185.62 sentences/sec
Palaven. Bertscore evaluation. Batch: 8. Average accuracy: 0.7306301546096802. Average recall: 0.7709739673137664. Average F1: 0.7494364726543427
Palaven. Bertscore evaluation. Posted batch 8
Palaven. Bertscore evaluation. Start processing batch: 9




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.31 seconds, 160.85 sentences/sec
Palaven. Bertscore evaluation. Batch: 9. Average accuracy: 0.7540300369262696. Average recall: 0.7854915690422059. Average F1: 0.7682133680582046
Palaven. Bertscore evaluation. Posted batch 9
Palaven. Bertscore evaluation. Start processing batch: 10




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.28 seconds, 176.46 sentences/sec
Palaven. Bertscore evaluation. Batch: 10. Average accuracy: 0.76978710770607. Average recall: 0.7780516886711121. Average F1: 0.7731849813461303
Palaven. Bertscore evaluation. Posted batch 10
Palaven. Bertscore evaluation. Start processing batch: 11




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.31 seconds, 162.43 sentences/sec
Palaven. Bertscore evaluation. Batch: 11. Average accuracy: 0.7462961173057556. Average recall: 0.7765087509155273. Average F1: 0.7602407884597778
Palaven. Bertscore evaluation. Posted batch 11
Palaven. Bertscore evaluation. Start processing batch: 12




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.31 seconds, 159.75 sentences/sec
Palaven. Bertscore evaluation. Batch: 12. Average accuracy: 0.7390928769111633. Average recall: 0.7751446402072907. Average F1: 0.7558504748344421
Palaven. Bertscore evaluation. Posted batch 12
Palaven. Bertscore evaluation. Start processing batch: 13




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.28 seconds, 176.11 sentences/sec
Palaven. Bertscore evaluation. Batch: 13. Average accuracy: 0.7582178282737732. Average recall: 0.7789684200286865. Average F1: 0.7679412388801574
Palaven. Bertscore evaluation. Posted batch 13
Palaven. Bertscore evaluation. Start processing batch: 14




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.35 seconds, 144.67 sentences/sec
Palaven. Bertscore evaluation. Batch: 14. Average accuracy: 0.7575856268405914. Average recall: 0.780554724931717. Average F1: 0.7682488143444062
Palaven. Bertscore evaluation. Posted batch 14
Palaven. Bertscore evaluation. Start processing batch: 15




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.29 seconds, 172.68 sentences/sec
Palaven. Bertscore evaluation. Batch: 15. Average accuracy: 0.736667799949646. Average recall: 0.7602731585502625. Average F1: 0.7475263905525208
Palaven. Bertscore evaluation. Posted batch 15
Palaven. Bertscore evaluation. Start processing batch: 16




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.37 seconds, 133.43 sentences/sec
Palaven. Bertscore evaluation. Batch: 16. Average accuracy: 0.7649624752998352. Average recall: 0.7870264899730682. Average F1: 0.7748508763313293
Palaven. Bertscore evaluation. Posted batch 16
Palaven. Bertscore evaluation. Start processing batch: 17




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.32 seconds, 154.85 sentences/sec
Palaven. Bertscore evaluation. Batch: 17. Average accuracy: 0.748036071062088. Average recall: 0.7764980673789978. Average F1: 0.7610766077041626
Palaven. Bertscore evaluation. Posted batch 17
Palaven. Bertscore evaluation. Start processing batch: 18




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.27 seconds, 183.42 sentences/sec
Palaven. Bertscore evaluation. Batch: 18. Average accuracy: 0.7483013164997101. Average recall: 0.7774491751194. Average F1: 0.7611872184276581
Palaven. Bertscore evaluation. Posted batch 18
Palaven. Bertscore evaluation. Start processing batch: 19




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.38 seconds, 132.10 sentences/sec
Palaven. Bertscore evaluation. Batch: 19. Average accuracy: 0.7628227961063385. Average recall: 0.791687558889389. Average F1: 0.7761777472496033
Palaven. Bertscore evaluation. Posted batch 19
Palaven. Bertscore evaluation. Start processing batch: 20




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.29 seconds, 171.36 sentences/sec
Palaven. Bertscore evaluation. Batch: 20. Average accuracy: 0.7528304255008698. Average recall: 0.7936446678638458. Average F1: 0.7718592834472656
Palaven. Bertscore evaluation. Posted batch 20
Palaven. Bertscore evaluation. Start processing batch: 21




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.33 seconds, 152.44 sentences/sec
Palaven. Bertscore evaluation. Batch: 21. Average accuracy: 0.7551948380470276. Average recall: 0.7888238036632538. Average F1: 0.7703576052188873
Palaven. Bertscore evaluation. Posted batch 21
Palaven. Bertscore evaluation. Start processing batch: 22




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.28 seconds, 177.06 sentences/sec
Palaven. Bertscore evaluation. Batch: 22. Average accuracy: 0.7779153430461884. Average recall: 0.7999326920509339. Average F1: 0.787801387310028
Palaven. Bertscore evaluation. Posted batch 22
Palaven. Bertscore evaluation. Start processing batch: 23




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.27 seconds, 182.66 sentences/sec
Palaven. Bertscore evaluation. Batch: 23. Average accuracy: 0.7500274431705475. Average recall: 0.7806232297420501. Average F1: 0.7638490295410156
Palaven. Bertscore evaluation. Posted batch 23
Palaven. Bertscore evaluation. Start processing batch: 24




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.32 seconds, 155.46 sentences/sec
Palaven. Bertscore evaluation. Batch: 24. Average accuracy: 0.7329964745044708. Average recall: 0.773547786474228. Average F1: 0.7517793703079224
Palaven. Bertscore evaluation. Posted batch 24
Palaven. Bertscore evaluation. Start processing batch: 25




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.35 seconds, 141.63 sentences/sec
Palaven. Bertscore evaluation. Batch: 25. Average accuracy: 0.7531195223331452. Average recall: 0.786001627445221. Average F1: 0.7682339859008789
Palaven. Bertscore evaluation. Posted batch 25
Palaven. Bertscore evaluation. Start processing batch: 26




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.27 seconds, 185.71 sentences/sec
Palaven. Bertscore evaluation. Batch: 26. Average accuracy: 0.7351296091079712. Average recall: 0.790207496881485. Average F1: 0.7600894284248352
Palaven. Bertscore evaluation. Posted batch 26
Palaven. Bertscore evaluation. Start processing batch: 27




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.28 seconds, 178.62 sentences/sec
Palaven. Bertscore evaluation. Batch: 27. Average accuracy: 0.7456803929805755. Average recall: 0.7789078187942505. Average F1: 0.7606176614761353
Palaven. Bertscore evaluation. Posted batch 27
Palaven. Bertscore evaluation. Start processing batch: 28




calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.12 seconds, 108.61 sentences/sec
Palaven. Bertscore evaluation. Batch: 28. Average accuracy: 0.7982437473077041. Average recall: 0.79885811989124. Average F1: 0.7972541451454163
Palaven. Bertscore evaluation. Posted batch 28
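
The loop above stores one average per batch. If a single corpus-level figure is needed later, a plain mean of the 28 batch averages would give the final, smaller batch the same weight as a full one. The sketch below shows a size-weighted mean instead. It is only an illustration: `weighted_mean`, the example means, and the example sizes are hypothetical and are not part of the notebook or the Palaven API. The size of 13 assumes 1363 records split into batches of 50.

```python
def weighted_mean(batch_means, batch_sizes):
    """Combine per-batch means into one overall mean, weighting by batch size."""
    if len(batch_means) != len(batch_sizes):
        raise ValueError("batch_means and batch_sizes must have the same length")
    total = sum(batch_sizes)
    if total == 0:
        raise ValueError("at least one batch must be non-empty")
    return sum(m * n for m, n in zip(batch_means, batch_sizes)) / total


# Hypothetical values: two full batches of 50 and a final batch of 13.
example_means = [0.75, 0.77, 0.80]
example_sizes = [50, 50, 13]

overall = weighted_mean(example_means, example_sizes)
plain = sum(example_means) / len(example_means)

print(f"weighted: {overall:.4f}  unweighted: {plain:.4f}")
```

With these illustrative numbers the weighted figure is lower than the unweighted one, because the small batch with the highest mean no longer counts as much as a full batch.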


### 4.2. ROUGE (1, 2, L)

In [None]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=2264653c5e22aed77a260be3e3cc12227f1315e324d5e4ac50d94d18a00ab78a
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [None]:
from rouge_score import rouge_scorer

for batch_number in range(1, 29):

  model_responses_df = build_responses_df()

  print(f'Palaven. ROUGE evaluation. Start processing batch: {batch_number}')
  model_responses = palaven_api.fetch_model_responses(evaluation_session_id, 'llmfinetuned', batch_number)

  for item in model_responses:
    add_response_to_df(model_responses_df, item)

  scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

  r1_precision, r1_recall, r1_f1 = [], [], []
  r2_precision, r2_recall, r2_f1 = [], [], []
  rL_precision, rL_recall, rL_f1 = [], [], []

  for _, row in model_responses_df.iterrows():
    reference = row['response']
    candidate = row['response_to_evaluate']

    scores = scorer.score(reference, candidate)

    #ROUGE-1
    r1_precision.append(scores['rouge1'].precision)
    r1_recall.append(scores['rouge1'].recall)
    r1_f1.append(scores['rouge1'].fmeasure)

    #ROUGE-2
    r2_precision.append(scores['rouge2'].precision)
    r2_recall.append(scores['rouge2'].recall)
    r2_f1.append(scores['rouge2'].fmeasure)

    #ROUGE-L
    rL_precision.append(scores['rougeL'].precision)
    rL_recall.append(scores['rougeL'].recall)
    rL_f1.append(scores['rougeL'].fmeasure)


  average_r1_precision = sum(r1_precision) / len(r1_precision)
  average_r1_recall = sum(r1_recall) / len(r1_recall)
  average_r1_f1 = sum(r1_f1) / len(r1_f1)

  average_r2_precision = sum(r2_precision) / len(r2_precision)
  average_r2_recall = sum(r2_recall) / len(r2_recall)
  average_r2_f1 = sum(r2_f1) / len(r2_f1)

  average_rL_precision = sum(rL_precision) / len(rL_precision)
  average_rL_recall = sum(rL_recall) / len(rL_recall)
  average_rL_f1 = sum(rL_f1) / len(rL_f1)

  rouge_metrics = []

  rouge_metrics.append({
      'rougeScoreType': 'rouge1',
      'batchNumber': batch_number,
      'precision': average_r1_precision,
      'recall': average_r1_recall,
      'f1': average_r1_f1
  })

  rouge_metrics.append({
      'rougeScoreType': 'rouge2',
      'batchNumber': batch_number,
      'precision': average_r2_precision,
      'recall': average_r2_recall,
      'f1': average_r2_f1
  })

  rouge_metrics.append({
      'rougeScoreType': 'rougeL',
      'batchNumber': batch_number,
      'precision': average_rL_precision,
      'recall': average_rL_recall,
      'f1': average_rL_f1
  })

  print(rouge_metrics)

  print(f'Palaven. ROUGE evaluation finished for batch: {batch_number}.')

  palaven_api.save_rouge_metrics(evaluation_session_id, 'llmfinetuned', batch_number, rouge_metrics)

  print(f'Palaven. ROUGE evaluation. Posted batch {batch_number}')

Palaven. ROUGE evaluation. Start processing batch: 1
[{'rougeScoreType': 'rouge1', 'batchNumber': 1, 'precision': 0.3745757037925945, 'recall': 0.5200344327247403, 'f1': 0.3994700148016524}, {'rougeScoreType': 'rouge2', 'batchNumber': 1, 'precision': 0.20918508754335519, 'recall': 0.26557602773344674, 'f1': 0.22013673991462032}, {'rougeScoreType': 'rougeL', 'batchNumber': 1, 'precision': 0.28583078366517883, 'recall': 0.40565022807253615, 'f1': 0.3035854998657952}]
Palaven. ROUGE evaluation finished for batch: 1.
Palaven. ROUGE evaluation. Posted batch 1
Palaven. ROUGE evaluation. Start processing batch: 2
[{'rougeScoreType': 'rouge1', 'batchNumber': 2, 'precision': 0.4215934871983799, 'recall': 0.48263873944799485, 'f1': 0.4245934431197033}, {'rougeScoreType': 'rouge2', 'batchNumber': 2, 'precision': 0.24555796082318845, 'recall': 0.27986940522397963, 'f1': 0.24894575912864922}, {'rougeScoreType': 'rougeL', 'batchNumber': 2, 'precision': 0.3324951399514729, 'recall': 0.38245134230
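
To make the precision, recall, and F1 values reported for ROUGE-1 easier to interpret, the sketch below computes unigram overlap by hand. It is a simplified illustration, not the `rouge_score` implementation: it assumes plain whitespace tokenization with lowercasing and applies no stemming, so its numbers will generally differ from the library's. The function name `rouge1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Return (precision, recall, f1) for unigram overlap on whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A short candidate that only uses reference words: high precision, lower recall.
p, r, f = rouge1("el gato duerme en casa", "el gato duerme")
print(p, r, f)
```

Precision is the share of candidate tokens found in the reference, and recall is the share of reference tokens found in the candidate. A recall that is higher than precision, as in the batch outputs above, suggests candidates that are longer than their references.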

### 4.3. BLEU

In [5]:
from nltk.translate.bleu_score import sentence_bleu

for batch_number in range(1, 29):

    model_responses_df = build_responses_df()

    print(f'Palaven. BLEU evaluation. Start processing batch: {batch_number}')

    model_responses = palaven_api.fetch_model_responses(evaluation_session_id, 'llmfinetuned', batch_number)

    for item in model_responses:
        add_response_to_df(model_responses_df, item)

    references = model_responses_df['response'].tolist()
    candidates = model_responses_df['response_to_evaluate'].tolist()

    bleu_scores = []

    for candidate, reference in zip(candidates, references):
        reference_tokens = [reference.split()]
        candidate_tokens = candidate.split()
        bleu_score = sentence_bleu(reference_tokens, candidate_tokens)
        bleu_scores.append(bleu_score)

    average_bleu = sum(bleu_scores) / len(bleu_scores)

    print(f'Palaven. BLEU evaluation. Batch: {batch_number}. Average BLEU score: {average_bleu}')

    palaven_api.save_bleu_score_metrics(evaluation_session_id, 'llmfinetuned', batch_number, average_bleu)

    print(f'Palaven. BLEU evaluation. Posted batch {batch_number}')
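
The BLEU loop above calls `sentence_bleu` without a smoothing function. BLEU combines the 1-gram to 4-gram precisions through a geometric mean, so a single order with no matches pulls the whole sentence score to zero. NLTK warns about this case and suggests `SmoothingFunction()`. The sketch below reproduces the effect in plain Python. It is a simplified, single-reference illustration and not NLTK's implementation: `simple_bleu`, its `epsilon` parameter, and the example sentences are hypothetical, and the epsilon substitution is only similar in spirit to adding a small count where a match count is zero.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, candidate, n):
    """Return (clipped matches, total candidate n-grams) for order n."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    matches = sum((cand & ref).values())
    return matches, sum(cand.values())


def simple_bleu(reference, candidate, max_n=4, epsilon=0.0):
    """Single-reference BLEU with uniform weights.

    When epsilon > 0, a zero match count is replaced by epsilon (a crude smoothing).
    """
    log_sum = 0.0
    for n in range(1, max_n + 1):
        matches, total = modified_precision(reference, candidate, n)
        if total == 0:
            return 0.0  # candidate is too short to contain n-grams of this order
        if matches == 0:
            if epsilon == 0.0:
                return 0.0  # one empty order zeroes the geometric mean
            matches = epsilon
        log_sum += math.log(matches / total) / max_n

    # Brevity penalty for candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_sum)


reference = "el gato duerme en la casa grande".split()
candidate = "el gato come en la casa".split()

unsmoothed = simple_bleu(reference, candidate)
smoothed = simple_bleu(reference, candidate, epsilon=0.1)
print(unsmoothed, smoothed)
```

In this example the candidate shares unigrams, bigrams, and one trigram with the reference but no 4-gram, so the unsmoothed score is zero while the smoothed score stays positive. Sentences that hit this case contribute zero (or nearly zero) to the batch average, which may partly explain low average BLEU values; whether to enable smoothing is a design choice, since it changes the reported numbers.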

Palaven. BLEU evaluation. Start processing batch: 1


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Palaven. BLEU evaluation. Batch: 1. Average BLEU score: 0.11486506402694947
Palaven. BLEU evaluation. Posted batch 1
Palaven. BLEU evaluation. Start processing batch: 2
Palaven. BLEU evaluation. Batch: 2. Average BLEU score: 0.15236380874337024
Palaven. BLEU evaluation. Posted batch 2
Palaven. BLEU evaluation. Start processing batch: 3
Palaven. BLEU evaluation. Batch: 3. Average BLEU score: 0.1655851070035646
Palaven. BLEU evaluation. Posted batch 3
Palaven. BLEU evaluation. Start processing batch: 4
Palaven. BLEU evaluation. Batch: 4. Average BLEU score: 0.13790962083561736
Palaven. BLEU evaluation. Posted batch 4
Palaven. BLEU evaluation. Start processing batch: 5
Palaven. BLEU evaluation. Batch: 5. Average BLEU score: 0.16511324529277963
Palaven. BLEU evaluation. Posted batch 5
Palaven. BLEU evaluation. Start processing batch: 6
Palaven. BLEU evaluation. Batch: 6. Average BLEU score: 0.11459135034872349
Palaven. BLEU evaluation. Posted batch 6
Palaven. BLEU evaluation. Start processing batch: 7
Palaven. BLEU evaluation. Batch: 7. Average BLEU score: 0.1673654001035878
Palaven. BLEU evaluation. Posted batch 7
Palaven. BLEU evaluation. Start processing batch: 8
Palaven. BLEU evaluation. Batch: 8. Average BLEU score: 0.1334810953570832
Palaven. BLEU evaluation. Posted batch 8
Palaven. BLEU evaluation. Start processing batch: 9
Palaven. BLEU evaluation. Batch: 9. Average BLEU score: 0.15909562838424118
Palaven. BLEU evaluation. Posted batch 9
Palaven. BLEU evaluation. Start processing batch: 10
Palaven. BLEU evaluation. Batch: 10. Average BLEU score: 0.18807503957463545
Palaven. BLEU evaluation. Posted batch 10
Palaven. BLEU evaluation. Start processing batch: 11
Palaven. BLEU evaluation. Batch: 11. Average BLEU score: 0.13379522801333416
Palaven. BLEU evaluation. Posted batch 11
Palaven. BLEU evaluation. Start processing batch: 12
Palaven. BLEU evaluation. Batch: 12. Average BLEU score: 0.1487499391295174
Palaven. BLEU evaluation. Posted batch 12
Palaven. BLEU evaluation. Start processing batch: 13
Palaven. BLEU evaluation. Batch: 13. Average BLEU score: 0.15356690445383264
Palaven. BLEU evaluation. Posted batch 13
Palaven. BLEU evaluation. Start processing batch: 14
Palaven. BLEU evaluation. Batch: 14. Average BLEU score: 0.15022504453149765
Palaven. BLEU evaluation. Posted batch 14
Palaven. BLEU evaluation. Start processing batch: 15
Palaven. BLEU evaluation. Batch: 15. Average BLEU score: 0.09036382165942221
Palaven. BLEU evaluation. Posted batch 15
Palaven. BLEU evaluation. Start processing batch: 16
Palaven. BLEU evaluation. Batch: 16. Average BLEU score: 0.14316454528810357
Palaven. BLEU evaluation. Posted batch 16
Palaven. BLEU evaluation. Start processing batch: 17
Palaven. BLEU evaluation. Batch: 17. Average BLEU score: 0.15681700826153133
Palaven. BLEU evaluation. Posted batch 17
Palaven. BLEU evaluation. Start processing batch: 18
Palaven. BLEU evaluation. Batch: 18. Average BLEU score: 0.14190437746648665
Palaven. BLEU evaluation. Posted batch 18
Palaven. BLEU evaluation. Start processing batch: 19
Palaven. BLEU evaluation. Batch: 19. Average BLEU score: 0.1495264740666921
Palaven. BLEU evaluation. Posted batch 19
Palaven. BLEU evaluation. Start processing batch: 20
Palaven. BLEU evaluation. Batch: 20. Average BLEU score: 0.1551756253706172
Palaven. BLEU evaluation. Posted batch 20
Palaven. BLEU evaluation. Start processing batch: 21
Palaven. BLEU evaluation. Batch: 21. Average BLEU score: 0.14946274423044706
Palaven. BLEU evaluation. Posted batch 21
Palaven. BLEU evaluation. Start processing batch: 22
Palaven. BLEU evaluation. Batch: 22. Average BLEU score: 0.19301521132837338
Palaven. BLEU evaluation. Posted batch 22
Palaven. BLEU evaluation. Start processing batch: 23
Palaven. BLEU evaluation. Batch: 23. Average BLEU score: 0.14608361968006928
Palaven. BLEU evaluation. Posted batch 23
Palaven. BLEU evaluation. Start processing batch: 24
Palaven. BLEU evaluation. Batch: 24. Average BLEU score: 0.12550930187337916
Palaven. BLEU evaluation. Posted batch 24
Palaven. BLEU evaluation. Start processing batch: 25
Palaven. BLEU evaluation. Batch: 25. Average BLEU score: 0.13167329291015042
Palaven. BLEU evaluation. Posted batch 25
Palaven. BLEU evaluation. Start processing batch: 26
Palaven. BLEU evaluation. Batch: 26. Average BLEU score: 0.13396212907490831
Palaven. BLEU evaluation. Posted batch 26
Palaven. BLEU evaluation. Start processing batch: 27
Palaven. BLEU evaluation. Batch: 27. Average BLEU score: 0.12628881974102202
Palaven. BLEU evaluation. Posted batch 27
Palaven. BLEU evaluation. Start processing batch: 28
Palaven. BLEU evaluation. Batch: 28. Average BLEU score: 0.2671366620441368
Palaven. BLEU evaluation. Posted batch 28
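
The NLTK warnings in the output above appear because BLEU is a geometric mean of the clipped n-gram precisions for orders 1 to 4, scaled by a brevity penalty. If a candidate shares no n-gram of some order with its reference, that precision is 0 and the whole sentence score becomes 0, however many lower-order n-grams match. Those zeros are included in each batch average. The sketch below illustrates the effect using only the standard library. It is a simplified single-reference illustration, not NLTK's implementation, and the names `ngram_counts`, `modified_precision`, and `unsmoothed_bleu` are hypothetical helpers introduced here.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference_tokens, candidate_tokens, n):
    """Clipped n-gram precision of the candidate against a single reference."""
    candidate = ngram_counts(candidate_tokens, n)
    reference = ngram_counts(reference_tokens, n)
    total = sum(candidate.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, reference[gram]) for gram, count in candidate.items())
    return clipped / total


def unsmoothed_bleu(reference, candidate, max_order=4):
    """Sentence-level BLEU with uniform weights and no smoothing (illustrative only)."""
    reference_tokens = reference.split()
    candidate_tokens = candidate.split()
    precisions = [
        modified_precision(reference_tokens, candidate_tokens, n)
        for n in range(1, max_order + 1)
    ]
    # A single zero precision forces the geometric mean, and so the score, to zero.
    if min(precisions) == 0.0:
        return 0.0
    c, r = len(candidate_tokens), len(reference_tokens)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    log_mean = sum(math.log(p) for p in precisions) / max_order
    return brevity_penalty * math.exp(log_mean)


reference = "the cat sat on the mat"
shuffled = "mat the on sat cat the"  # same words, but no shared bigram

print(modified_precision(reference.split(), shuffled.split(), 1))  # every unigram matches
print(unsmoothed_bleu(reference, shuffled))  # 0.0, because the bigram precision is 0
```

The warning text itself names two ways to avoid the collapse: evaluating with a lower n-gram order, or passing one of the methods of NLTK's `SmoothingFunction` to `sentence_bleu`. Either choice changes the scale of the scores, so values computed that way are not directly comparable with the batch averages reported above.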
