# F-One Responses Quality Evaluation with UniEval Metric

We assessed the quality of F-One responses with UniEval metrics. Specifically, we used UniEval to evaluate 
the chatbot responses produced by the information-retrieval system for the FIA Formula 1 regulations. 

### UniEval paper:

*[Towards a Unified Multi-Dimensional Evaluator for Text Generation](https://arxiv.org/abs/2210.07197)*

### UniEval GitHub Repository:

https://github.com/maszhongming/UniEval.git


### Environment
```
git clone https://github.com/maszhongming/UniEval.git
cd UniEval
pip install -r requirements.txt
```


## First Phase

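In this phase, we used UniEval to obtain the factual consistency score of a response with respect to its source document.
The consistency score lies in the range [0, 1], with higher values indicating greater consistency.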

```python
from utils import convert_to_json
from metric.evaluator import get_evaluator

task = 'fact'

# a list of source documents
src_list = ["""The term "Engineering Trailer" refers to a branded temporary standalone structure that is brought into the paddock by an F1 Team. It includes any irremovable fixtures, fittings, and equipment integrated into the structure. The purpose of the Engineering Trailer is to provide a working environment for engineering activities during a Competition or Testing of Current Cars. It's important to note that the definition excludes any structures, fixtures, fittings, or equipment that are constructed or installed into permanent or existing paddock buildings, such as the pit garages.
 
"""]
# a list of model outputs (claims) to be evaluated
output_list = [""""Engineering Trailer" means a branded temporary standalone structure, and any irremovable fixtures, fittings and equipment integrated into such structure that is brought into the paddock and constructed by an F1 Team to provide a working environment for engineering purposes during a Competition or Testing of Current Cars.

"""]

# Prepare data for pre-trained evaluators
data = convert_to_json(output_list=output_list, src_list=src_list)
# Initialize evaluator for a specific task
evaluator = get_evaluator(task)
# Get factual consistency scores
eval_scores = evaluator.evaluate(data, print_result=True)
```

```
OUTPUT:

Evaluation scores are shown below:
+-------------+----------+
|  Dimensions |  Score   |
+-------------+----------+
| consistency | 0.756445 |
+-------------+----------+
```
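
When several responses are evaluated in one call, the per-sample scores can be aggregated afterwards. The following is a minimal sketch, assuming that `evaluate` returns one dictionary per sample keyed by dimension name (as the printed table suggests). The helper name `summarize_dimension` and the 0.5 threshold are hypothetical and are not part of UniEval.

```python
def summarize_dimension(eval_scores, dimension="consistency", threshold=0.5):
    """Return the mean score for a dimension and the indices of low-scoring samples.

    eval_scores is assumed to be a list of dicts, one per evaluated sample.
    The threshold is an illustrative cut-off, not a UniEval recommendation.
    """
    values = [sample[dimension] for sample in eval_scores]
    if not values:
        raise ValueError("eval_scores is empty")
    average = sum(values) / len(values)
    flagged = [i for i, value in enumerate(values) if value < threshold]
    return average, flagged


# The single score reported in the output above
eval_scores = [{"consistency": 0.756445}]
average, flagged = summarize_dimension(eval_scores)
print(f"mean consistency: {average:.6f}, samples below threshold: {flagged}")
```

Responses listed in `flagged` would be the natural candidates for manual review against the regulation text.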

## Second Phase

In this phase, we used UniEval to evaluate the response generated by F-One at the dialogue level. Unlike the first phase, this evaluation covers the following dimensions:
- Naturalness
- Coherence
- Engagingness
- Groundedness
- Understandability 

Engagingness is the only dimension that uses summation scores, as it indicates the total volume of interesting facts 
presented in the response. Therefore, the scoring range for engagingness is [0, +∞), while all others are [0, 1].
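
The effect of summation on the scoring range can be illustrated with a small sketch. It assumes that each sentence of a response contributes a score in [0, 1] and that these contributions are added together. The function name and the per-sentence values are hypothetical and are not UniEval output.

```python
def summed_score(sentence_scores):
    """Sum per-sentence scores, each of which must lie in [0, 1]."""
    for score in sentence_scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"per-sentence score out of range: {score}")
    return sum(sentence_scores)


# Hypothetical per-sentence scores for a three-sentence response
short_response = [0.9, 0.8, 0.7]
# The same scores repeated, standing in for a six-sentence response
long_response = short_response * 2

short_total = summed_score(short_response)
long_total = summed_score(long_response)

# A summed score grows with the number of sentences, so it has no fixed upper bound
print(short_total > 1.0, long_total > short_total)
```

Under this assumption, a longer response can score higher on engagingness simply by containing more sentences, so engagingness values are not directly comparable with the bounded dimensions.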


```python
from utils import convert_to_json
from metric.evaluator import get_evaluator

task = 'dialogue'

# a list of dialogue histories
src_list = ["""What is the number of rear lights that all cars must have operating for the duration of the competition and which are their features?
"""]
# a list of additional context that should be included into the generated response
context_list = ["""All cars must have three rear lights in working order throughout the Competition which: have been supplied by FIA designated manufacturers, are clearly visible from the rear, can be switched on by the driver when seated normally in the car.

"""]
# a list of model outputs to be evaluated
output_list = ["""All cars must have three rear lights in working order throughout the competition. These lights must be supplied by FIA designated manufacturers, clearly visible from the rear, and can be switched on by the driver when seated normally in the car. Additionally, the lights described in Article 14.3 of the Technical Regulations must be illuminated at all times when using intermediate or wet-weather tyres.
"""]

# Prepare data for pre-trained evaluators
data = convert_to_json(output_list=output_list, 
                       src_list=src_list, context_list=context_list)
# Initialize evaluator for a specific task
evaluator = get_evaluator(task)
# Get multi-dimensional evaluation scores
eval_scores = evaluator.evaluate(data, print_result=True)
```

```
OUTPUT:

Evaluation scores are shown below:
+-------------------+----------+
|     Dimensions    |  Score   |
+-------------------+----------+
|    naturalness    | 0.999625 |
|     coherence     | 0.999819 |
|    engagingness   | 2.998616 |
|    groundedness   | 0.998462 |
| understandability | 0.999607 |
|      overall      | 1.399226 |
+-------------------+----------+
```
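
The printed `overall` value is consistent with an unweighted mean of the five dimensions, which the short check below reproduces. Treating `overall` as a plain mean is an assumption inferred from these numbers, and the variable names are illustrative.

```python
# Scores copied from the output table above
scores = {
    "naturalness": 0.999625,
    "coherence": 0.999819,
    "engagingness": 2.998616,
    "groundedness": 0.998462,
    "understandability": 0.999607,
}
reported_overall = 1.399226

# Unweighted mean over the five dimensions
recomputed_overall = sum(scores.values()) / len(scores)
print(f"{recomputed_overall:.6f}")  # prints 1.399226
```

Because engagingness is unbounded, an overall value computed this way can exceed 1, as it does here.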