## Formatting improvements

The goal of this notebook is to improve the formatting evaluator and to identify why it flags such a high number of broken rules.<br>
This was an iterative process: inspect the results, then fix any issues found in the evaluator.<br>
The result files are under `s3://project-rag/data/eval/formatting/`. They were obtained by running different versions of the formatting evaluator on the `final_2_all.jsonl` dataset (except for `system-responds-results`, which comes from the SystemResponse evaluator).

In [14]:
import pandas as pd
import sys

from pathlib import Path
from textwrap import wrap
from collections import Counter
from itertools import chain

sys.path.append(Path.cwd().parent.parent.as_posix())

from src.online.data_models import EndToEndGeneration
from src.evaluation.system_response import SystemResponse
from src.evaluation.formatting import Formatting

from nltk.tokenize import sent_tokenize

In [15]:
system_response_evaluator = SystemResponse()
formatting_evaluator = Formatting()

In [16]:
all_gens = pd.read_json("s3://project-rag/data/dataset_generation/unece_sprint/final_2_all.jsonl", lines=True)

all_gens["gen_uuid"] = all_gens["generation"].apply(lambda x: EndToEndGeneration.model_validate(x).uuid)

In [52]:
df = pd.read_json("s3://project-rag/data/eval/formatting/formatting-results-with-comments-multiple-latest.jsonl", lines=True)

system_responds_df = pd.read_json("s3://project-rag/data/eval/formatting/system-responds-results.jsonl", lines=True)

In [59]:
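def wrapped(text: str) -> str:
    """Wrap text to 100 characters per line while keeping its original newlines."""
    # textwrap.wrap replaces newlines with spaces, so swap them for a placeholder and restore them afterwards.
    return "\n".join(wrap(text.replace("\n", "NNNNN"), width=100)).replace("NNNNN", "\n")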


def print_example(example: EndToEndGeneration):
    print("="*100)
    print(wrapped(example.rag_request.query))
    print("-"*100)
    print(wrapped(example.rag_response.text))
    print("-"*100)
    print(wrapped(example.rag_response.retrieved_passages_as_string()))

In [54]:
system_does_not_respond = system_responds_df[(system_responds_df["score"] == 0.0) & (system_responds_df["comment"].isna())].gen_uuid.tolist()

print("System did not respond:", len(system_does_not_respond))

system_half_responds = system_responds_df[(system_responds_df["score"] == 0.5) & (system_responds_df["comment"].isna())].gen_uuid.tolist()

print("System said it wouldn't respond, but later it did:", len(system_half_responds))

System did not respond: 2747
System said it wouldn't respond, but later it did: 735


In [55]:
print("Total rule break counts:")
Counter(chain.from_iterable(df[~df.comments.isna()].comments.tolist())).most_common()

Total rule break counts:


[('no_citation', 3450),
 ('fictitious_citation', 1179),
 ('rag_response_is_none', 524),
 ('quotations_not_verbatim', 347),
 ('answer_not_english', 5)]

In [56]:
print("Rule break counts without system-does-not-respond:")
Counter(chain.from_iterable(df[(~df.comments.isna()) & (~df.gen_uuid.isin(system_does_not_respond))].comments.tolist())).most_common()

Rule break counts without system-does-not-respond:


[('no_citation', 1280),
 ('fictitious_citation', 1145),
 ('quotations_not_verbatim', 314),
 ('answer_not_english', 5)]

In [57]:
print("Rule break counts where system responds:")

Counter(
    chain.from_iterable(
        df[
            (~df.comments.isna()) & 
            (~df.gen_uuid.isin(system_does_not_respond + system_half_responds))
        ].comments.tolist()
    )
).most_common()

Rule break counts where system responds:


[('fictitious_citation', 1116),
 ('no_citation', 585),
 ('quotations_not_verbatim', 102),
 ('answer_not_english', 5)]

Looking through the `no_citation` cases, we see that:
- most of them are genuine missing citations (e.g. `aa0596d9965f6f84973e40dcd40be45d`, `a84ef8ba73bc4332e71fe4927a68c416`, `7c170ed3940971fa0e6c33d3eec4d67c`)
- some are citation formatting issues, e.g. `[Source 2]`, which is not our required citation format
- some are no-response cases that are worded differently (e.g. `d01fb208fe258d3483eef45eced6621f`, `28c2189d1140f196158ce45b0b586e3b`)
- some are false positives, where the first sentence is oddly formatted but the later bullet points are cited (e.g. `8cdb8b0ceba78275839377d2d15a16a1`, or `c94fb4e53a9305de91ccfbcc2c090299` due to sentence splitting)
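
To separate the citation formatting issues from the genuinely missing citations, one option is to look for bracketed markers that are not bare integers. The following is only a minimal sketch: it assumes the required format is `[n]` (as in the responses shown in this notebook), and `classify_citation_markers` is a hypothetical helper, not part of the `Formatting` evaluator.

```python
import re

# Assumption: a well-formed citation is a bare integer in square brackets, e.g. [1].
VALID_CITATION = re.compile(r"\[\d+\]")
# Any bracketed token at all, e.g. [1] or [Source 2].
BRACKETED = re.compile(r"\[[^\[\]]+\]")


def classify_citation_markers(text: str) -> dict[str, list[str]]:
    """Split bracketed markers into well-formed and malformed citations."""
    valid, malformed = [], []
    for match in BRACKETED.finditer(text):
        token = match.group(0)
        if VALID_CITATION.fullmatch(token):
            valid.append(token)
        else:
            malformed.append(token)
    return {"valid": valid, "malformed": malformed}


markers = classify_citation_markers(
    "The tax applies from 2020 [Source 2]. It was amended later [1]."
)
print(markers)
```

A response with only malformed markers could then be reported under a separate rule instead of `no_citation`.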

The `fictitious_citation` cases show the known issue:
- citations are shifted by one, so the system references `[0]` (which doesn't exist) instead of `[1]` (e.g. `fd8edd6c2e34c3a1c5ea4884ecb4532b`)
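
The off-by-one pattern can be told apart from other fictitious citations with a small heuristic. This sketch assumes that retrieved passages are numbered from 1 and that citations are bare integers in square brackets; both function names are hypothetical and do not come from the evaluator.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def fictitious_citations(text: str, n_passages: int) -> list[int]:
    """Return cited indices that match no passage, assuming passages are numbered 1..n_passages."""
    cited = {int(index) for index in CITATION.findall(text)}
    return sorted(index for index in cited if not 1 <= index <= n_passages)


def looks_shifted_by_one(text: str, n_passages: int) -> bool:
    """Heuristic: [0] is cited and every cited index would be valid under 0-based numbering."""
    cited = {int(index) for index in CITATION.findall(text)}
    return 0 in cited and all(0 <= index < n_passages for index in cited)


response = "- 7 Jadishna [0]\n- 240 Naka [1]\n- 24, 2000 Trithvaja [2]"
print(fictitious_citations(response, n_passages=3))
print(looks_shifted_by_one(response, n_passages=3))
```

The heuristic cannot prove a shift; a response that cites `[0]` for some other reason would also match.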

The `quotations_not_verbatim` cases are:
- catching hallucinations (e.g. `5a3099c0eec7822f4efdbb3ad687d4e3`, `cb14aa8c319acd54c03a4a81ff95195d`, `a8359b155f6333b23a609e9be21a23ee`)
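
As an illustration of what a verbatim check can look like, the sketch below extracts quoted spans from an answer and reports those that appear in none of the passages. It is an assumption about the general approach, not the evaluator's actual implementation: it only handles straight double quotes and only normalises whitespace, and `non_verbatim_quotes` is a hypothetical name.

```python
import re

# Assumption: quotations are delimited by straight double quotes.
QUOTED = re.compile(r'"([^"]+)"')


def normalise(text: str) -> str:
    """Collapse all runs of whitespace so line wrapping does not break the match."""
    return " ".join(text.split())


def non_verbatim_quotes(answer: str, passages: list[str]) -> list[str]:
    """Return quoted spans from the answer that occur in none of the passages."""
    sources = [normalise(passage) for passage in passages]
    return [
        quote
        for quote in QUOTED.findall(answer)
        if not any(normalise(quote) in source for source in sources)
    ]


passages = ["The Act enters into force\non 1 January 2021."]
answer = 'The law "enters into force on 1 January 2021" and "applies to all vehicles" [1].'
print(non_verbatim_quotes(answer, passages))
```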

The `answer_not_english` cases:
- 3/5 are false positives
- 2/5 contain text that is mostly not in English (`caed0a8b3b23fc3d3d21ac2be35f4d8f` and `5cf2fb63ba78db4bb07d0f8d6c409078`) and are of low quality
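
When reviewing these flags by hand, a rough cross-check is the share of common English function words in the response. This is a crude heuristic for illustration only, not how the evaluator detects language; the stopword list and the 0.15 threshold are arbitrary assumptions. A check based on the character set alone would not help for a response such as `5cf2fb63ba78db4bb07d0f8d6c409078`, which is written in Latin script.

```python
import re

# Assumption: a small hand-picked list of English function words is enough for a rough check.
ENGLISH_STOPWORDS = frozenset({
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "that",
    "this", "it", "for", "on", "with", "as", "by", "from", "was", "be", "not",
})


def english_stopword_ratio(text: str) -> float:
    """Fraction of alphabetic tokens that are common English function words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in ENGLISH_STOPWORDS for token in tokens) / len(tokens)


def looks_english(text: str, threshold: float = 0.15) -> bool:
    """Return True when the stopword ratio reaches the (arbitrary) threshold."""
    return english_stopword_ratio(text) >= threshold


print(looks_english("The document mentions the following dates in the annex."))
print(looks_english("Jadishna Shakitau Ruretat Naka Yagyuna Mashmam Karfuk"))
```

Short answers and answers made up mostly of names or numbers will score low regardless of language, so this is only useful as a second opinion on flagged cases.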


In [69]:
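# Rule breaks among generations where the system did respond
# (despite the variable name, the no-response and half-response cases are excluded here).
no_response_formatting = df[(~df.comments.isna()) & (~df.gen_uuid.isin(system_does_not_respond + system_half_responds))]

# UUIDs of the generations flagged as `answer_not_english`.
_ids = no_response_formatting[no_response_formatting.comments.apply(lambda x: "answer_not_english" in x)].gen_uuid.tolist()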

item = _ids[3]

e2e = EndToEndGeneration.model_validate(all_gens[all_gens.gen_uuid == item]["generation"].tolist()[0])

print(e2e.uuid)
print_example(e2e)
print("="*100)
print(df[df.gen_uuid.isin({item})].comments.tolist())

5cf2fb63ba78db4bb07d0f8d6c409078
Does the document mention any specific dates?
----------------------------------------------------------------------------------------------------
- Yes, the document mentions the following dates:
    - 7 Jadishna Shakitau Ruretat [0]

- 240 Naka [1]
    - 2020/922 Yagyuna [1]
    - 4498 Mashmam Karfuk [1]
    - 2020 (222
Kagyan [1]
    - 5838 Jabitar Tashtashakti [1]
    - 24, 2000 Trithvaja, Ekkranti [2]
----------------------------------------------------------------------------------------------------
[1]: "(S) Kamiyatanal Dhyaazbhajan Patra Jama Shat Nirsadaan Jana Thaaritan Nimnavid (Yagaaz 7
Jadishna Shakitau Ruretat, Yasha :-
[2]: Naka, 240] (Sakranti, 2020/922 Yagyuna, 4498
[3]:
Panibich marya kafreshak kk tighaniye
[['answer_not_english']]
