In [1]:
import os
import json
import sys
import pandas as pd

src_dir = os.path.abspath('./src')
sys.path.append(src_dir)

from evaluation.metrics import RetrievalMetrics, SpecificAssetQueriesMetrics



# Query generation

### Generic queries

Three levels of descriptiveness:
- `least_descriptive` -> *A concise user query, up to 70 characters, capturing only the essential and most significant properties of the dataset.*

- `moderately_descriptive` -> *A detailed user query, up to 200 characters, providing additional information and properties to offer a clearer description of the dataset.*

- `most_descriptive` -> *A comprehensive user query, up to 500 characters, encompassing a wide range of details and characteristics to thoroughly describe the dataset.*
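
The character limits above can be checked mechanically. The following is a minimal sketch, assuming each query is a dict with a `"text"` key as in the JSON files loaded below; the helper names `is_within_limit` and `find_too_long` are hypothetical and not part of the project code.

```python
# Character limits per descriptiveness level, taken from the definitions above.
MAX_QUERY_CHARS = {
    "least_descriptive": 70,
    "moderately_descriptive": 200,
    "most_descriptive": 500,
}


def is_within_limit(query: str, level: str) -> bool:
    """Return True if the query respects the character limit of its level."""
    return len(query) <= MAX_QUERY_CHARS[level]


def find_too_long(queries: list[dict], level: str) -> list[str]:
    """Collect the texts of queries that exceed the limit for `level`.

    Assumption: every query is a dict with a "text" key.
    """
    return [q["text"] for q in queries if not is_within_limit(q["text"], level)]
```

Running `find_too_long(queries, lvl)` on each loaded file would list any generated query that overshoots its budget.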

In [2]:
generic_queries_path = "data/queries/generic"

for lvl in ["least_descriptive", "moderately_descriptive", "most_descriptive"]:
    print(f"DESCRIPTIVENESS LEVEL: {lvl}")
    
    with open(os.path.join(generic_queries_path, f"{lvl}.json")) as f:
        queries = json.load(f)

    for q in queries[:3]:
        print(q["text"])
    print("\n\n")

DESCRIPTIVENESS LEVEL: least_descriptive
image classification dataset
text summarization data
speech recognition dataset



DESCRIPTIVENESS LEVEL: moderately_descriptive
datasets for image classification tasks with labels
text summarization datasets in English language
speech recognition datasets with transcriptions



DESCRIPTIVENESS LEVEL: most_descriptive
image classification datasets with high-resolution images, labeled categories, and balanced class distribution
text summarization datasets in English, containing news articles, summaries, and metadata
speech recognition datasets with transcriptions, audio recordings in various accents, and noise levels





----

### Asset-specific queries

Four asset categories to evaluate:
- `long_description_many_tags`

- `long_description_few_tags`

- `moderate_description_many_tags`

- `poor_description_many_tags`

Three levels of descriptiveness:
- `least_descriptive`

- `moderately_descriptive`

- `most_descriptive`

#### Assets with good description

In [3]:
def get_asset_specific_queries_examples(asset_cat):
    asset_specific_queries_path = "data/queries/asset-specific"
    text_dirpath = "data/basic-texts"
    descrip_level = ["least_descriptive", "moderately_descriptive", "most_descriptive"]

    queries = [[],[],[]]
    docs = []
    for lvl_it, lvl in enumerate(descrip_level):
        path = os.path.join(asset_specific_queries_path, f"{lvl}-{asset_cat}.json")
        with open(path) as f:
            data = json.load(f)
        q = [obj["text"] for obj in data[:2]]
        queries[lvl_it].extend(q)
        
        if lvl_it == 0:
            doc_ids = [obj["annotated_docs"][0]["id"] for obj in data[:2]]
            for doc_id in doc_ids:
                with open(os.path.join(text_dirpath, f"{doc_id}.txt")) as f:
                    docs.append(f.read())
    
    return docs, queries

In [4]:
def print_out_assets(docs, queries):
    descrip_level = ["least_descriptive", "moderately_descriptive", "most_descriptive"]
    
    for it, (doc, least_q, moder_q, most_q) in enumerate(zip(docs, *queries)):
        print(f"============ DOCUMENT {it} ============")
        print(doc)

        print("\n")
        print(f"{descrip_level[0]} query: {least_q}")
        print(f"{descrip_level[1]} query: {moder_q}")
        print(f"{descrip_level[2]} query: {most_q}")

        print("\n\n\n")

In [5]:
docs, queries = get_asset_specific_queries_examples(asset_cat="long_description_many_tags")
print_out_assets(docs, queries)

Platform: huggingface
Asset name: ddrg/super_eurlex
Description: Super-EURLEX dataset containing legal documents from multiple languages.
                The datasets are build/scrapped from the EURLEX Website [https://eur-lex.europa.eu/homepage.html]
                With one split per language and sector, because the available features (metadata) differs for each 
                sector. Therefore, each sample contains the content of a full legal document in up to 3 different 
                formats. Those are raw HTML and cleaned HTML (if the HTML format was available on the EURLEX website 
                during the scrapping process) and cleaned text.
                The cleaned text should be available for each sample and was extracted from HTML or PDF.
                'Cleaned' HTML stands here for minor cleaning that was done to preserve to a large extent the necessary 
                HTML information like table structures while removing unnecessary complexity which was introd

#### Assets with little to no description

In [6]:
docs, queries = get_asset_specific_queries_examples(asset_cat="poor_description_many_tags")
print_out_assets(docs, queries)

Platform: huggingface
Asset name: KBLab/overlim
Description: \
Keywords: region:us | task_categories:text-classification | task_ids:sentiment-classification | task_ids:text-scoring | multilinguality:translation | license:cc-by-4.0 | size_categories:unknown | task_ids:natural-language-inference | language:sv | language:da | language_creators:other | annotations_creators:other | qa-nli | task_ids:semantic-similarity-classification | language:nb | paraphrase-identification | source_datasets:extended|glue | source_datasets:extended|super_glue


least_descriptive query: Sentiment classification dataset for Nordic languages
moderately_descriptive query: Dataset for sentiment classification and text scoring in Swedish, Danish, and Norwegian
most_descriptive query: Multilingual dataset for sentiment classification, text scoring, and natural language inference in Swedish, Danish, and Norwegian. Includes tasks like semantic similarity classification and paraphrase identification. Licensed under 

----

# Preliminary results of evaluation of retrieval systems

#### Aspects that were evaluated:

- embedding model -> GTE large / multilingual E5

- text processing -> relevant fields / basic fields

- evaluation pipelines -> precision evaluation / accuracy (recall) evaluation

- **TODO:** *(for precision evaluation) we can play with a function that calculates the relevance of documents/assets to user queries*

### Embedding models

**GTE**
- `Alibaba-NLP/gte-large-en-v1.5`
- encoder-only architecture, 430M params
- English language
- input size: 4k

**E5**
- `intfloat/multilingual-e5-large`
- encoder-only architecture, 560M params
- multilingual model
- input size: 512 (hierarchical document processing)
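
To make the "hierarchical document processing" for E5 concrete: a document longer than the input size is split into chunks, each chunk is embedded, and the chunk embeddings are averaged into one document vector. The sketch below illustrates that idea in plain Python. It assumes non-overlapping chunks and a caller-supplied `embed_chunk` function; both are assumptions for illustration, and `toy_embed` is a stand-in rather than a real model.

```python
from typing import Callable, List

Vector = List[float]


def split_into_chunks(tokens: List[str], chunk_size: int = 512) -> List[List[str]]:
    """Split a token list into consecutive, non-overlapping chunks (assumption)."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a list of equally sized vectors dimension by dimension."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def embed_document(
    tokens: List[str],
    embed_chunk: Callable[[List[str]], Vector],
    chunk_size: int = 512,
) -> Vector:
    """Embed every chunk and mean-pool the results into one document vector."""
    if not tokens:
        raise ValueError("cannot embed an empty document")
    chunks = split_into_chunks(tokens, chunk_size)
    return mean_pool([embed_chunk(chunk) for chunk in chunks])


def toy_embed(chunk: List[str]) -> Vector:
    """Hypothetical stand-in for a real embedding model."""
    return [float(len(chunk)), 1.0]
```

With GTE's 4k input size most documents fit into a single chunk, so this pooling step mainly matters for E5.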

### Text processing

**Relevant fields**
- extract all the seemingly relevant fields from the documents

**Basic fields**
- take only: platform, name, description, tags
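
As an illustration of the basic-fields variant, the sketch below renders an asset into the text layout seen in the `basic-texts` examples in this notebook (`Platform`, `Asset name`, `Description`, `Keywords` joined by ` | `). The input dict and its keys (`platform`, `name`, `description`, `keywords`) are assumptions made for this example; the actual extraction code may differ.

```python
def build_basic_text(asset: dict) -> str:
    """Render the four basic fields of an asset as one text document.

    Assumption: `asset` is a flat dict with the keys used below; missing
    fields are rendered as empty strings.
    """
    keywords = " | ".join(asset.get("keywords", []))
    lines = [
        f"Platform: {asset.get('platform', '')}",
        f"Asset name: {asset.get('name', '')}",
        f"Description: {asset.get('description', '')}",
        f"Keywords: {keywords}",
    ]
    return "\n".join(lines)
```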

In [7]:
doc_id = "315961"

with open(os.path.join("./data/relevant-texts", f"{doc_id}.txt")) as f:
    rel_doc = f.read()
with open(os.path.join("./data/basic-texts", f"{doc_id}.txt")) as f:
    basic_doc = f.read()

In [8]:
print("RELEVANT FIELDS\n")
print(rel_doc)

RELEVANT FIELDS

platform: zenodo
name: Unpublished data on birds feeding on dead honey bees
date_published: 2022-08-09T00:00:00
year_published: 2022
month_published: 8
day_published: 9
description: The data were collected by me during two years with the aim to initiate a larger research project. However, I neither found the time nor the research funding for the project. Hence, I decided to upload the data so they may be used for a scientific publication, preliminary data set for a similar project, or any other research. About 2500 observations of birds visiting six honey bee colonies are available in the excel file.
keyword: ecology, bird, apis mellifera, honeybee, parus major, animal behaviour, pica pica, unpulished
DISTRIBUTION:
	name:Counts_All_2017_2018.xlsx, encoding_format:application/octet-stream
	name:Background_Material_Methods.docx, encoding_format:application/octet-stream
	name:sitesID-FolderID.xlsx, encoding_format:application/octet-stream
	name:Examples sites.zip, encodin

In [9]:
print("BASIC FIELDS\n")
print(basic_doc)

BASIC FIELDS

Platform: zenodo
Asset name: Unpublished data on birds feeding on dead honey bees
Description: The data were collected by me during two years with the aim to initiate a larger research project. However, I neither found the time nor the research funding for the project. Hence, I decided to upload the data so they may be used for a scientific publication, preliminary data set for a similar project, or any other research. About 2500 observations of birds visiting six honey bee colonies are available in the excel file.
Keywords: ecology | bird | apis mellifera | honeybee | parus major | animal behaviour | pica pica | unpulished


# Precision evaluation

- Retrieve top K (K=10) most similar documents to the **GENERIC QUERIES**
- Utilize LLM-as-a-judge to estimate the relevance of retrieved documents to queries
- Compute retrieval precision

Relevance score function: the LLM's prediction is used directly
- **TODO**: *Create a simple heuristic function that estimates document relevance based on how many of the user's criteria in a query are fulfilled*
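
For reference, the two metric families reported below (`prec@k` and `ndcg@k`) can be computed from per-document relevance judgements roughly as follows. This is a sketch of the standard textbook definitions, not the `RetrievalMetrics` implementation; in particular, taking the ideal ranking from the retrieved list itself is an assumption.

```python
import math
from typing import Sequence


def precision_at_k(relevant: Sequence[bool], k: int) -> float:
    """Fraction of the top-k retrieved documents judged relevant.

    Divides by k even if fewer than k documents were retrieved.
    """
    return sum(relevant[:k]) / k


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG normalised by the DCG of the best possible ordering.

    Assumption: the ideal ordering is the retrieved list sorted by gain.
    """
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

Under this definition nDCG only measures how well the retrieved documents are ordered, which would explain why it can stay high while precision is low.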

In [10]:
def highlight_precision(s: pd.Series):
    is_max = s == s.max()
    is_second_max = s == sorted(s.unique())[-2] if len(s.unique()) > 1 else s == s.max()
    return ['background-color: rgba(255, 0, 0, 0.3); color: white' if v else 'background-color: rgba(255, 165, 0, 0.3); color: white' if is_second_max[idx] else '' for idx, v in is_max.items()]

In [11]:
def add_middle_border(df):
    border_style = {
        'selector': 'td:nth-child({})'.format(len(df.columns) // 2 + 1),
        'props': [('border-right', '8px solid white')]
    }
    return df.set_table_styles([border_style], overwrite=False)

In [12]:
def create_precision_results_dataframe(descriptiveness_level: str) -> pd.DataFrame:
    precision_results_path = "data/results/precision"
    model_variants = [
        "gte_large--basic",
        "multilingual_e5_large--basic",
        "gte_large--relevant",
        "multilingual_e5_large--relevant",
    ]
    metric_col_names = [
        "prec@3", "prec@5", "prec@10",
        "ndcg@3", "ndcg@5", "ndcg@10"
    ]
    dataframe_rows = [[] for _ in range(len(model_variants))]
    for it_var, var in enumerate(model_variants):
        path = os.path.join(
            precision_results_path, var, descriptiveness_level, "results.json"
        )
        with open(path) as f:
            metrics = RetrievalMetrics.load(json.load(f))
        
        dataframe_rows[it_var].append(var)
        for col_name in metric_col_names:
            m_name, k = col_name.split("@")
            m_value = getattr(metrics.results_in_top[k], m_name)
            dataframe_rows[it_var].append(m_value)    
        
    df = pd.DataFrame(data=dataframe_rows, columns=["Input Config"] + metric_col_names)
    df = df.set_index(keys=["Input Config"], drop=True)
    df = df.style.apply(highlight_precision, subset=pd.IndexSlice[:, df.columns])
    df = add_middle_border(df)
    return df

**Least_descriptive queries** (10 evaluated generic queries)

In [13]:
least_df = create_precision_results_dataframe("least_descriptive")
least_df

| Input Config | prec@3 | prec@5 | prec@10 | ndcg@3 | ndcg@5 | ndcg@10 |
|---|---|---|---|---|---|---|
| gte_large--basic | 0.8 | 0.82 | 0.74 | 0.986923 | 0.95191 | 0.945446 |
| multilingual_e5_large--basic | 0.833333 | 0.8 | 0.7 | 0.972574 | 0.958498 | 0.950455 |
| gte_large--relevant | 1.0 | 0.94 | 0.94 | 0.984728 | 0.9815 | 0.976043 |
| multilingual_e5_large--relevant | 0.733333 | 0.7 | 0.58 | 0.947052 | 0.939588 | 0.946707 |


**Moderately_descriptive queries** (30 evaluated generic queries)

In [14]:
moderate_df = create_precision_results_dataframe("moderately_descriptive")
moderate_df

| Input Config | prec@3 | prec@5 | prec@10 | ndcg@3 | ndcg@5 | ndcg@10 |
|---|---|---|---|---|---|---|
| gte_large--basic | 0.888889 | 0.866667 | 0.763333 | 0.979968 | 0.963921 | 0.960796 |
| multilingual_e5_large--basic | 0.633333 | 0.633333 | 0.536667 | 0.934481 | 0.90462 | 0.901566 |
| gte_large--relevant | 0.877778 | 0.833333 | 0.806667 | 0.970217 | 0.968852 | 0.958966 |
| multilingual_e5_large--relevant | 0.622222 | 0.6 | 0.51 | 0.949426 | 0.945942 | 0.936925 |


**Most_descriptive queries** (50 evaluated generic queries)

In [15]:
most_df = create_precision_results_dataframe("most_descriptive")
most_df

| Input Config | prec@3 | prec@5 | prec@10 | ndcg@3 | ndcg@5 | ndcg@10 |
|---|---|---|---|---|---|---|
| gte_large--basic | 0.604938 | 0.562963 | 0.498148 | 0.949366 | 0.939821 | 0.923724 |
| multilingual_e5_large--basic | 0.401235 | 0.385185 | 0.355556 | 0.94985 | 0.92101 | 0.896907 |
| gte_large--relevant | 0.777778 | 0.722222 | 0.67037 | 0.943909 | 0.937441 | 0.921251 |
| multilingual_e5_large--relevant | 0.444444 | 0.433333 | 0.372222 | 0.936379 | 0.917776 | 0.901505 |


#### Precision evaluation conclusion
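- GTE is superior to the E5 model (<span style="color:red">**GTE > E5**</span>)
- Extracting all relevant fields is crucial for retrieving more relevant documents with GTE (<span style="color:red">**relevant > basic**</span>)
    - This is especially evident for the `most_descriptive` queries (GTE prec@10: 0.67 with relevant fields vs. 0.50 with basic fields)
    - For E5 the effect is mixed: relevant fields improve precision only on the `most_descriptive` queries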

- **Assumptions:**
    - LLM relevance scores assigned to the retrieved assets are correct
    - TODO: To validate these results, we need to re-evaluate the models using an additional heuristic relevance score function
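
One possible shape for the heuristic relevance function mentioned in the TODO is sketched below: relevance is the fraction of query criteria that literally appear in the document text. This is only an illustration. The function name, the substring matching, and the assumption that criteria are already available as a list of strings are all hypothetical; extracting criteria from a free-text query is not covered here.

```python
def heuristic_relevance(doc_text: str, criteria: list[str]) -> float:
    """Return the share of criteria found in the document (0.0 to 1.0).

    Assumption: a criterion counts as fulfilled when it occurs as a
    case-insensitive substring of the document text.
    """
    if not criteria:
        return 0.0
    text = doc_text.lower()
    fulfilled = sum(1 for criterion in criteria if criterion.lower() in text)
    return fulfilled / len(criteria)
```

A score like this could be thresholded to obtain binary relevance labels and then compared against the LLM judgements.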

# Accuracy/Hit-rate evaluation

- Retrieve top K (K=100) most similar documents to the **ASSET-SPECIFIC QUERIES**
- Check the existence and the position of the GROUND TRUTH assets that correspond to the individual queries
- Compute metrics
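
The steps above can be sketched as follows. This is not the `SpecificAssetQueriesMetrics` implementation: it assumes that positions are 0-based and that the mean position is averaged only over queries whose ground-truth asset was found in the top K. The function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def hit_position(retrieved_ids: Sequence[str], ground_truth_id: str, k: int) -> Optional[int]:
    """0-based rank of the ground-truth asset within the top k, or None if absent."""
    top = list(retrieved_ids[:k])
    return top.index(ground_truth_id) if ground_truth_id in top else None


def hit_rate_and_mean_position(
    results: List[Tuple[Sequence[str], str]], k: int
) -> Tuple[float, Optional[float]]:
    """Compute hit rate@k and the mean position of the hits.

    `results` holds one (retrieved_ids, ground_truth_id) pair per query.
    The mean position is None when no query produced a hit.
    """
    positions = [hit_position(ids, gt, k) for ids, gt in results]
    hits = [p for p in positions if p is not None]
    hit_rate = len(hits) / len(positions) if positions else 0.0
    mean_position = sum(hits) / len(hits) if hits else None
    return hit_rate, mean_position
```

Because the mean position is computed over hits only, it should be read together with the hit rate: a low position paired with a low hit rate means few assets were found, but those few ranked near the top.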

In [16]:
def highlight_hitrate(s: pd.Series):
    is_max = s == s.max()
    is_second_max = s == sorted(s.unique())[-2] if len(s.unique()) > 1 else s == s.max()
    return ['background-color: rgba(255, 0, 0, 0.3); color: white' if v else 'background-color: rgba(255, 165, 0, 0.3); color: white' if is_second_max[idx] else '' for idx, v in is_max.items()]

def highlight_hit_position(s: pd.Series):
    is_min = s == s.min()
    is_second_min = s == sorted(s.unique())[1] if len(s.unique()) > 1 else s == s.min()
    return ['background-color: rgba(255, 0, 0, 0.3); color: white' if v else 'background-color: rgba(255, 165, 0, 0.3); color: white' if is_second_min[idx] else '' for idx, v in is_min.items()]

In [17]:
def create_hitrate_results_dataframe(asset_quality: str, descriptiveness_level: str) -> pd.DataFrame:
    hit_rate_results_path = "data/results/hit_rate"
    model_variants = [
        "gte_large--basic",
        "multilingual_e5_large--basic",
        "gte_large--relevant",
        "multilingual_e5_large--relevant",
    ]
    metric_col_names = [
        "asset_hit_rate@5", "asset_hit_rate@10", "asset_hit_rate@50", "asset_hit_rate@100",
        "asset_position@5", "asset_position@10", "asset_position@50", "asset_position@100",
    ]
    dataframe_rows = [[] for _ in range(len(model_variants))]
    for it_var, var in enumerate(model_variants):
        path = os.path.join(
            hit_rate_results_path, var,
            f"{descriptiveness_level}-{asset_quality}",
            "results.json"
        )
        with open(path) as f:
            metrics = SpecificAssetQueriesMetrics.load(json.load(f))
        
        dataframe_rows[it_var].append(var)
        for col_name in metric_col_names:
            m_name, k = col_name.split("@")
            m_value = getattr(metrics.results_in_top[k], m_name)
            dataframe_rows[it_var].append(m_value)    
        
    df = pd.DataFrame(data=dataframe_rows, columns=["Input Config"] + metric_col_names)
    df = df.set_index(keys=["Input Config"], drop=True)

    cols = df.columns
    df = df.style.apply(highlight_hitrate, subset=pd.IndexSlice[:, cols[: len(cols) // 2]])
    df = df.apply(highlight_hit_position, subset=pd.IndexSlice[:, cols[len(cols) // 2:]])
    df = add_middle_border(df)
    return df

#### Assets with long descriptions (50 assets)
- description with over 1000 characters

In [18]:
long_doc_least_df = create_hitrate_results_dataframe(asset_quality="long_description_many_tags", descriptiveness_level="least_descriptive")
long_doc_least_df

| Input Config | asset_hit_rate@5 | asset_hit_rate@10 | asset_hit_rate@50 | asset_hit_rate@100 | asset_position@5 | asset_position@10 | asset_position@50 | asset_position@100 |
|---|---|---|---|---|---|---|---|---|
| gte_large--basic | 0.88 | 0.88 | 0.92 | 0.92 | 0.545455 | 0.545455 | 1.369565 | 1.369565 |
| multilingual_e5_large--basic | 0.7 | 0.78 | 0.86 | 0.88 | 0.742857 | 1.410256 | 3.697674 | 5.522727 |
| gte_large--relevant | 0.84 | 0.9 | 0.92 | 0.92 | 0.571429 | 0.977778 | 1.521739 | 1.521739 |
| multilingual_e5_large--relevant | 0.42 | 0.42 | 0.48 | 0.52 | 0.666667 | 0.666667 | 3.166667 | 6.961538 |


In [19]:
long_doc_moderate_df = create_hitrate_results_dataframe(asset_quality="long_description_many_tags", descriptiveness_level="moderately_descriptive")
long_doc_moderate_df


| Input Config | asset_hit_rate@5 | asset_hit_rate@10 | asset_hit_rate@50 | asset_hit_rate@100 | asset_position@5 | asset_position@10 | asset_position@50 | asset_position@100 |
|---|---|---|---|---|---|---|---|---|
| gte_large--basic | 0.84 | 0.9 | 0.92 | 0.92 | 0.309524 | 0.666667 | 0.869565 | 0.869565 |
| multilingual_e5_large--basic | 0.8 | 0.84 | 0.94 | 0.94 | 0.425 | 0.642857 | 2.553191 | 2.553191 |
| gte_large--relevant | 0.84 | 0.88 | 0.9 | 0.9 | 0.380952 | 0.590909 | 0.844444 | 0.844444 |
| multilingual_e5_large--relevant | 0.54 | 0.6 | 0.66 | 0.66 | 0.740741 | 1.366667 | 3.424242 | 3.424242 |


In [20]:
long_doc_most_df = create_hitrate_results_dataframe(asset_quality="long_description_many_tags", descriptiveness_level="most_descriptive")
long_doc_most_df


| Input Config | asset_hit_rate@5 | asset_hit_rate@10 | asset_hit_rate@50 | asset_hit_rate@100 | asset_position@5 | asset_position@10 | asset_position@50 | asset_position@100 |
|---|---|---|---|---|---|---|---|---|
| gte_large--basic | 0.86 | 0.88 | 0.88 | 0.88 | 0.27907 | 0.477273 | 0.477273 | 0.477273 |
| multilingual_e5_large--basic | 0.86 | 0.9 | 0.9 | 0.9 | 0.395349 | 0.755556 | 0.755556 | 0.755556 |
| gte_large--relevant | 0.9 | 0.92 | 0.94 | 0.94 | 0.2 | 0.304348 | 0.829787 | 0.829787 |
| multilingual_e5_large--relevant | 0.58 | 0.62 | 0.64 | 0.64 | 0.275862 | 0.83871 | 1.28125 | 1.28125 |


-----

#### Assets with mediocre descriptions (50 assets)
- description length between 200 and 500 characters

In [21]:
moderate_doc_least_df = create_hitrate_results_dataframe(asset_quality="moderate_description_many_tags", descriptiveness_level="least_descriptive")
moderate_doc_least_df


| Input Config | asset_hit_rate@5 | asset_hit_rate@10 | asset_hit_rate@50 | asset_hit_rate@100 | asset_position@5 | asset_position@10 | asset_position@50 | asset_position@100 |
|---|---|---|---|---|---|---|---|---|
| gte_large--basic | 0.74 | 0.86 | 0.92 | 0.92 | 0.567568 | 1.465116 | 2.152174 | 2.152174 |
| multilingual_e5_large--basic | 0.78 | 0.82 | 0.94 | 0.94 | 0.512821 | 0.878049 | 4.06383 | 4.06383 |
| gte_large--relevant | 0.78 | 0.8 | 0.88 | 0.88 | 0.871795 | 1.025 | 3.363636 | 3.363636 |
| multilingual_e5_large--relevant | 0.26 | 0.32 | 0.32 | 0.32 | 1.076923 | 2.0625 | 2.0625 | 2.0625 |


In [22]:
moderate_doc_moderate_df = create_hitrate_results_dataframe(asset_quality="moderate_description_many_tags", descriptiveness_level="moderately_descriptive")
moderate_doc_moderate_df


| Input Config | asset_hit_rate@5 | asset_hit_rate@10 | asset_hit_rate@50 | asset_hit_rate@100 | asset_position@5 | asset_position@10 | asset_position@50 | asset_position@100 |
|---|---|---|---|---|---|---|---|---|
| gte_large--basic | 0.8 | 0.84 | 0.88 | 0.88 | 0.675 | 1.0 | 1.545455 | 1.545455 |
| multilingual_e5_large--basic | 0.84 | 0.86 | 0.88 | 0.88 | 0.285714 | 0.465116 | 0.727273 | 0.727273 |
| gte_large--relevant | 0.78 | 0.86 | 0.92 | 0.92 | 0.564103 | 1.162791 | 2.673913 | 2.673913 |
| multilingual_e5_large--relevant | 0.26 | 0.26 | 0.28 | 0.28 | 0.307692 | 0.307692 | 2.214286 | 2.214286 |


In [23]:
moderate_doc_most_df = create_hitrate_results_dataframe(asset_quality="moderate_description_many_tags", descriptiveness_level="most_descriptive")
moderate_doc_most_df


| Input Config | asset_hit_rate@5 | asset_hit_rate@10 | asset_hit_rate@50 | asset_hit_rate@100 | asset_position@5 | asset_position@10 | asset_position@50 | asset_position@100 |
|---|---|---|---|---|---|---|---|---|
| gte_large--basic | 0.94 | 0.94 | 0.94 | 0.94 | 0.489362 | 0.489362 | 0.489362 | 0.489362 |
| multilingual_e5_large--basic | 0.9 | 0.9 | 0.94 | 0.94 | 0.288889 | 0.288889 | 0.978723 | 0.978723 |
| gte_large--relevant | 0.86 | 0.88 | 0.94 | 0.94 | 0.372093 | 0.5 | 1.212766 | 1.212766 |
| multilingual_e5_large--relevant | 0.44 | 0.46 | 0.46 | 0.46 | 0.272727 | 0.608696 | 0.608696 | 0.608696 |


-----

#### Assets with short to no descriptions (50 assets)
- description with fewer than 50 characters

In [24]:
short_doc_least_df = create_hitrate_results_dataframe(asset_quality="poor_description_many_tags", descriptiveness_level="least_descriptive")
short_doc_least_df


| Input Config | asset_hit_rate@5 | asset_hit_rate@10 | asset_hit_rate@50 | asset_hit_rate@100 | asset_position@5 | asset_position@10 | asset_position@50 | asset_position@100 |
|---|---|---|---|---|---|---|---|---|
| gte_large--basic | 0.404762 | 0.428571 | 0.47619 | 0.5 | 1.0 | 1.222222 | 2.45 | 5.238095 |
| multilingual_e5_large--basic | 0.380952 | 0.452381 | 0.5 | 0.5 | 0.5 | 1.473684 | 4.0 | 4.0 |
| gte_large--relevant | 0.285714 | 0.357143 | 0.52381 | 0.547619 | 0.5 | 1.733333 | 7.772727 | 10.043478 |
| multilingual_e5_large--relevant | 0.095238 | 0.095238 | 0.119048 | 0.119048 | 0.5 | 0.5 | 7.4 | 7.4 |


In [25]:
short_doc_moderate_df = create_hitrate_results_dataframe(asset_quality="poor_description_many_tags", descriptiveness_level="moderately_descriptive")
short_doc_moderate_df


| Input Config | asset_hit_rate@5 | asset_hit_rate@10 | asset_hit_rate@50 | asset_hit_rate@100 | asset_position@5 | asset_position@10 | asset_position@50 | asset_position@100 |
|---|---|---|---|---|---|---|---|---|
| gte_large--basic | 0.333333 | 0.47619 | 0.571429 | 0.690476 | 0.071429 | 2.05 | 5.458333 | 15.034483 |
| multilingual_e5_large--basic | 0.452381 | 0.5 | 0.571429 | 0.619048 | 0.473684 | 1.142857 | 3.541667 | 8.538462 |
| gte_large--relevant | 0.357143 | 0.380952 | 0.642857 | 0.714286 | 0.8 | 1.25 | 13.259259 | 20.3 |
| multilingual_e5_large--relevant | 0.095238 | 0.095238 | 0.142857 | 0.190476 | 0.0 | 0.0 | 7.0 | 21.5 |


In [26]:
short_doc_most_df = create_hitrate_results_dataframe(asset_quality="poor_description_many_tags", descriptiveness_level="most_descriptive")
short_doc_most_df


| Input Config | asset_hit_rate@5 | asset_hit_rate@10 | asset_hit_rate@50 | asset_hit_rate@100 | asset_position@5 | asset_position@10 | asset_position@50 | asset_position@100 |
|---|---|---|---|---|---|---|---|---|
| gte_large--basic | 0.547619 | 0.595238 | 0.738095 | 0.809524 | 0.521739 | 0.96 | 5.129032 | 11.0 |
| multilingual_e5_large--basic | 0.595238 | 0.595238 | 0.690476 | 0.714286 | 0.48 | 0.48 | 3.758621 | 6.066667 |
| gte_large--relevant | 0.5 | 0.595238 | 0.809524 | 0.833333 | 0.619048 | 1.64 | 8.294118 | 9.771429 |
| multilingual_e5_large--relevant | 0.142857 | 0.166667 | 0.214286 | 0.285714 | 0.666667 | 1.714286 | 6.0 | 23.083333 |


#### Hit-rate evaluation conclusion
- E5 with extracted relevant fields performs much worse than the other alternatives
    - I assume this may have something to do with the hierarchical approach to processing larger documents, where the embedding of the whole document is computed by mean pooling the chunk embeddings
- E5 with basic fields performs similarly to, and sometimes better than, the GTE variants
    - In the basic extraction setting, the documents are typically shorter than 512 tokens, which makes pooling of chunk representations unnecessary
    - However, we don't want a model that depends on the input format, hence we should rather use GTE

- TODO Best model for RECALL



-----

# Conclusions

- GTE models (namely GTE-relevant) achieved better results in terms of precision
- The E5-basic model kept up with, and sometimes surpassed, the GTE models in the hit-rate evaluation
    - E5-relevant (which works with larger documents than E5-basic) may be disadvantaged by the hierarchical processing, i.e. by the aggregation of chunk embeddings


----

# Next steps

**Incorporate LLM into the retrieval system**
- start off with a simple **RAG pipeline**
    - GTE embedding model
    - test both basic/relevant text processing approaches

- experiment with **condition/filter parsing from user queries**...


**Misc: Hierarchical processing of documents**
- Compare our previous implementation (1 large document of N chunks == 1 emb) to (1 large document of N chunks == N emb)
    - We would store the embeddings of all the chunks of a document without aggregating them into a single vector
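
The difference between the two variants can be illustrated with a toy example. The sketch below scores a document against a query either via one mean-pooled vector or via the best-matching chunk. Scoring a document by the maximum chunk similarity is an assumption made here for illustration; the source only states that all chunk embeddings would be stored. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_pooled(query: Vector, chunk_embs: List[Vector]) -> float:
    """N chunks == 1 emb: mean-pool the chunks, then compare to the query."""
    n = len(chunk_embs)
    pooled = [sum(dim) / n for dim in zip(*chunk_embs)]
    return cosine(query, pooled)


def score_per_chunk(query: Vector, chunk_embs: List[Vector]) -> float:
    """N chunks == N emb: compare to every chunk, keep the best match (assumption)."""
    return max(cosine(query, chunk) for chunk in chunk_embs)
```

For a query that matches exactly one of two orthogonal chunks, the per-chunk score stays at 1.0 while the pooled score drops to about 0.71, which is the kind of dilution suspected for E5-relevant.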

**Bigger retrieval evaluation**
- <span style="color:red">[BLOCKER]: Wait for some AIoD fixes to be made in terms of asset schema, etc.</span>
- Compare embedding models to other pipelines (RAG, ...)
- Create a large set of data to evaluate -> generate more queries