In [1]:
from typing import Literal
import os
import json
import sys
import pandas as pd

src_dir = os.path.abspath('./src')
sys.path.append(src_dir)

from evaluation.metrics import RetrievalMetrics, SpecificAssetQueriesMetrics



# Query generation

### Generic queries

Three levels of descriptiveness:
- `least_descriptive` -> *A concise user query, up to 70 characters, capturing only the essential and most significant properties of the dataset.*

- `moderately_descriptive` -> *A detailed user query, up to 200 characters, providing additional information and properties to offer a clearer description of the dataset.*

- `most_descriptive` -> *A comprehensive user query, up to 500 characters, encompassing a wide range of details and characteristics to thoroughly describe the dataset.*
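
The character budgets above can be checked mechanically. The following is a minimal sketch, assuming each query is a dict with a `text` field (as in the JSON files loaded in this notebook). `MAX_CHARS` and `within_limit` are hypothetical names introduced for illustration and are not part of the project code.

```python
# Hypothetical helper: the limits are copied from the three level definitions above.
MAX_CHARS = {
    "least_descriptive": 70,
    "moderately_descriptive": 200,
    "most_descriptive": 500,
}


def within_limit(query: dict, level: str) -> bool:
    """Return True if the query text fits the character budget of its level."""
    return len(query["text"]) <= MAX_CHARS[level]


example = {"text": "image classification dataset"}
print(within_limit(example, "least_descriptive"))
```

A check like this can be run over each generated query file to flag queries that exceed their budget.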

In [2]:
generic_queries_path = "data/queries/generic"

for lvl in ["least_descriptive", "moderately_descriptive", "most_descriptive"]:
    print(f"DESCRIPTIVENESS LEVEL: {lvl}")
    
    with open(os.path.join(generic_queries_path, f"{lvl}.json")) as f:
        queries = json.load(f)

    for q in queries[:3]:
        print(q["text"])
    print("\n\n")

DESCRIPTIVENESS LEVEL: least_descriptive
image classification dataset
text summarization data
speech recognition dataset



DESCRIPTIVENESS LEVEL: moderately_descriptive
datasets for image classification tasks with labels
text summarization datasets in English language
speech recognition datasets with transcriptions



DESCRIPTIVENESS LEVEL: most_descriptive
image classification datasets with high-resolution images, labeled categories, and balanced class distribution
text summarization datasets in English, containing news articles, summaries, and metadata
speech recognition datasets with transcriptions, audio recordings in various accents, and noise levels





----

### Asset-specific queries

Four asset categories to evaluate:
- `long_description_many_tags`

- `long_description_few_tags`

- `moderate_description_many_tags`

- `poor_description_many_tags`

Three levels of descriptiveness:
- `least_descriptive`

- `moderately_descriptive`

- `most_descriptive`

#### Assets with good description

In [3]:
def get_asset_specific_queries_examples(asset_cat):
    asset_specific_queries_path = "data/queries/asset-specific"
    text_dirpath = "data/basic-texts"
    descrip_level = ["least_descriptive", "moderately_descriptive", "most_descriptive"]

    queries = [[],[],[]]
    docs = []
    for lvl_it, lvl in enumerate(descrip_level):
        path = os.path.join(asset_specific_queries_path, f"{lvl}-{asset_cat}.json")
        with open(path) as f:
            data = json.load(f)
        q = [obj["text"] for obj in data[:2]]
        queries[lvl_it].extend(q)
        
        if lvl_it == 0:
            doc_ids = [obj["annotated_docs"][0]["id"] for obj in data[:2]]
            for doc_id in doc_ids:
                with open(os.path.join(text_dirpath, f"{doc_id}.txt")) as f:
                    docs.append(f.read())
    
    return docs, queries

In [4]:
def print_out_assets(docs, queries):
    descrip_level = ["least_descriptive", "moderately_descriptive", "most_descriptive"]
    
    for it, (doc, least_q, moder_q, most_q) in enumerate(zip(docs, *queries)):
        print(f"============ DOCUMENT {it} ============")
        print(doc)

        print("\n")
        print(f"{descrip_level[0]} query: {least_q}")
        print(f"{descrip_level[1]} query: {moder_q}")
        print(f"{descrip_level[2]} query: {most_q}")

        print("\n\n\n")

In [5]:
docs, queries = get_asset_specific_queries_examples(asset_cat="long_description_many_tags")
print_out_assets(docs, queries)

Platform: huggingface
Asset name: ddrg/super_eurlex
Description: Super-EURLEX dataset containing legal documents from multiple languages.
                The datasets are build/scrapped from the EURLEX Website [https://eur-lex.europa.eu/homepage.html]
                With one split per language and sector, because the available features (metadata) differs for each 
                sector. Therefore, each sample contains the content of a full legal document in up to 3 different 
                formats. Those are raw HTML and cleaned HTML (if the HTML format was available on the EURLEX website 
                during the scrapping process) and cleaned text.
                The cleaned text should be available for each sample and was extracted from HTML or PDF.
                'Cleaned' HTML stands here for minor cleaning that was done to preserve to a large extent the necessary 
                HTML information like table structures while removing unnecessary complexity which was introd

#### Assets with little to no description

In [6]:
docs, queries = get_asset_specific_queries_examples(asset_cat="poor_description_many_tags")
print_out_assets(docs, queries)

Platform: huggingface
Asset name: KBLab/overlim
Description: \
Keywords: region:us | task_categories:text-classification | task_ids:sentiment-classification | task_ids:text-scoring | multilinguality:translation | license:cc-by-4.0 | size_categories:unknown | task_ids:natural-language-inference | language:sv | language:da | language_creators:other | annotations_creators:other | qa-nli | task_ids:semantic-similarity-classification | language:nb | paraphrase-identification | source_datasets:extended|glue | source_datasets:extended|super_glue


least_descriptive query: Sentiment classification dataset for Nordic languages
moderately_descriptive query: Dataset for sentiment classification and text scoring in Swedish, Danish, and Norwegian
most_descriptive query: Multilingual dataset for sentiment classification, text scoring, and natural language inference in Swedish, Danish, and Norwegian. Includes tasks like semantic similarity classification and paraphrase identification. Licensed under 

----

# Preliminary evaluation results for retrieval systems

#### Aspects that were evaluated:

- **embedding models**
    - working with document embeddings => GTE large / multilingual E5
    - working with chunk embeddings => BGE large / multilingual E5

- **text processing** -> relevant fields / basic fields

- **evaluation pipelines** 
    - precision evaluation -> AI scores VS heuristic scores
    - hit-rate evaluation

### Embedding models

**GTE**
- `Alibaba-NLP/gte-large-en-v1.5`
- encoder-only architecture, 430M params
- English language
- input size: 4k tokens

**E5**
- `intfloat/multilingual-e5-large`
- encoder-only architecture, 560M params
- multilingual model
- input size: 512 tokens (requires chunking)

**BGE**
- `BAAI/bge-large-en-v1.5`
- encoder-only architecture, 335M params
- English language
- input size: 512 tokens (requires chunking)
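
Models with a 512-token input cannot embed long documents in one pass, so the text has to be split and each chunk embedded separately. The sketch below illustrates one common way to do this. It is not the project's actual chunking code, which is not shown here. It assumes overlapping windows of whitespace-separated words (not tokenizer tokens) and assumes a document is scored by its best-matching chunk. `chunk_words`, `cosine` and `document_score` are hypothetical helpers.

```python
import math


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(words), 1), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def document_score(query_vec: list[float], chunk_vecs: list[list[float]]) -> float:
    """Assumed aggregation: a document scores as high as its best chunk."""
    return max((cosine(query_vec, c) for c in chunk_vecs), default=0.0)
```

Max-aggregation favours documents that contain at least one passage closely matching the query. Averaging chunk scores would be an alternative with different trade-offs.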

### Text processing

**Relevant fields**
- extract all the seemingly relevant fields from the documents

**Basic fields**
- take only: platform, name, description, tags
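
As an illustration of the basic representation, the sketch below assembles the four fields into the layout used by the basic-text files printed in this notebook (`Platform`, `Asset name`, `Description`, `Keywords` joined by ` | `). The input dict keys and the `build_basic_text` name are assumptions made for this example. The project's actual extraction code is not shown.

```python
def build_basic_text(asset: dict) -> str:
    """Render platform, name, description and tags as a plain-text document."""
    # The keys "platform", "name", "description" and "keywords" are hypothetical.
    keywords = " | ".join(asset.get("keywords", []))
    lines = [
        f"Platform: {asset.get('platform', '')}",
        f"Asset name: {asset.get('name', '')}",
        f"Description: {asset.get('description', '')}",
        f"Keywords: {keywords}",
    ]
    return "\n".join(lines)


asset = {
    "platform": "zenodo",
    "name": "Unpublished data on birds feeding on dead honey bees",
    "description": "About 2500 observations of birds visiting six honey bee colonies.",
    "keywords": ["ecology", "bird", "honeybee"],
}
print(build_basic_text(asset))
```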

In [7]:
doc_id = "315961"

with open(os.path.join("./data/relevant-texts", f"{doc_id}.txt")) as f:
    rel_doc = f.read()
with open(os.path.join("./data/basic-texts", f"{doc_id}.txt")) as f:
    basic_doc = f.read()

In [8]:
print("RELEVANT FIELDS\n")
print(rel_doc)

RELEVANT FIELDS

platform: zenodo
name: Unpublished data on birds feeding on dead honey bees
date_published: 2022-08-09T00:00:00
year_published: 2022
month_published: 8
day_published: 9
description: The data were collected by me during two years with the aim to initiate a larger research project. However, I neither found the time nor the research funding for the project. Hence, I decided to upload the data so they may be used for a scientific publication, preliminary data set for a similar project, or any other research. About 2500 observations of birds visiting six honey bee colonies are available in the excel file.
keyword: ecology, bird, apis mellifera, honeybee, parus major, animal behaviour, pica pica, unpulished
DISTRIBUTION:
	name:Counts_All_2017_2018.xlsx, encoding_format:application/octet-stream
	name:Background_Material_Methods.docx, encoding_format:application/octet-stream
	name:sitesID-FolderID.xlsx, encoding_format:application/octet-stream
	name:Examples sites.zip, encodin

In [9]:
print("BASIC FIELDS\n")
print(basic_doc)

BASIC FIELDS

Platform: zenodo
Asset name: Unpublished data on birds feeding on dead honey bees
Description: The data were collected by me during two years with the aim to initiate a larger research project. However, I neither found the time nor the research funding for the project. Hence, I decided to upload the data so they may be used for a scientific publication, preliminary data set for a similar project, or any other research. About 2500 observations of birds visiting six honey bee colonies are available in the excel file.
Keywords: ecology | bird | apis mellifera | honeybee | parus major | animal behaviour | pica pica | unpulished


# Precision evaluation

- Retrieve the top K (K=10) documents most similar to each of the **GENERIC QUERIES**
- Use an LLM-as-a-judge to estimate the relevance of the retrieved documents to each query
- Compute retrieval precision
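
To make the reported metrics concrete, here is a minimal sketch of precision@k and nDCG@k over a ranked list of per-document relevance scores. It is not the `RetrievalMetrics` implementation, whose internals are not shown here. The sketch assumes relevance scores in [0, 1], a binarisation threshold of 0.5 for precision, and an nDCG normalised against the ideal ordering of the retrieved documents themselves.

```python
import math


def precision_at_k(relevance: list[float], k: int, threshold: float = 0.5) -> float:
    """Fraction of the top-k results whose relevance reaches the threshold."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(r >= threshold for r in relevance[:k]) / k


def dcg_at_k(relevance: list[float], k: int) -> float:
    """Discounted cumulative gain with a log2 position discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevance[:k]))


def ndcg_at_k(relevance: list[float], k: int) -> float:
    """DCG divided by the DCG of the same scores sorted in descending order."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


# Relevance of five retrieved documents, in ranked order (illustrative values).
scores = [1.0, 0.0, 1.0, 1.0, 0.0]
print(precision_at_k(scores, 3), ndcg_at_k(scores, 3))
```

Under this normalisation, nDCG measures only how well the retrieved documents are ordered, so it can stay high even when precision is low.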

In [10]:
model_variants = [
    # basic dataset representations
    "gte_large--basic",
    "multilingual_e5_large--basic",
    "multilingual_e5_large--CHUNK_EMBEDS--basic",
    "bge_large--CHUNK_EMBEDS--basic",

    # relevant dataset representations
    "gte_large--relevant",
    "multilingual_e5_large--relevant",   
    "multilingual_e5_large--CHUNK_EMBEDS--relevant",
    "bge_large--CHUNK_EMBEDS--relevant"
]

In [11]:
def highlight_precision(s: pd.Series):
    # Red background for the best (highest) value in the column, orange for the second best.
    best = 'background-color: rgba(255, 0, 0, 0.3); color: white'
    second = 'background-color: rgba(255, 165, 0, 0.3); color: white'
    uniq = sorted(s.unique())
    second_max = uniq[-2] if len(uniq) > 1 else uniq[-1]
    return [best if v == s.max() else second if v == second_max else '' for v in s]

In [12]:
def add_middle_border(df):
    # `df` is a pandas Styler. Draw a border to the right of the middle column,
    # separating the two groups of metric columns.
    horizontal_border_style = {
        'selector': 'td:nth-child({})'.format(len(df.columns) // 2 + 1),
        'props': [('border-right', '4px solid white')]
    }
    df = df.set_table_styles([horizontal_border_style], overwrite=False)

    # Draw a border below the middle row, separating the two halves of the model variants.
    vertical_border_style = {
        'selector': 'tr:nth-child({})'.format(len(df.index) // 2),
        'props': [('border-bottom', '4px solid white')]
    }
    return df.set_table_styles([vertical_border_style], overwrite=False)



In [13]:
def create_precision_results_dataframe(
    model_variants: list[str],
    descriptiveness_level: Literal["least_descriptive", "moderately_descriptive", "most_descriptive", "all"] = "all", 
    function_score: Literal["llm_scores", "heuristic_scores"] = "llm_scores"
) -> pd.DataFrame:
    precision_results_path = "data/results/precision"
    metric_col_names = [
        "prec@3", "prec@5", "prec@10",
        "ndcg@3", "ndcg@5", "ndcg@10"
    ]
    dataframe_rows = [[] for _ in range(len(model_variants))]
    for it_var, var in enumerate(model_variants):
        path = os.path.join(
            precision_results_path, var, descriptiveness_level, f"{function_score}_results.json"
        )
        with open(path) as f:
            metrics = RetrievalMetrics.load(json.load(f))
        
        dataframe_rows[it_var].append(var)
        for col_name in metric_col_names:
            m_name, k = col_name.split("@")
            m_value = getattr(metrics.results_in_top[k], m_name)
            dataframe_rows[it_var].append(m_value)    
        
    df = pd.DataFrame(data=dataframe_rows, columns=["Input Config"] + metric_col_names)
    df = df.set_index(keys=["Input Config"], drop=True)
    
    index_values = df.index.values
    df = df.style.apply(highlight_precision, subset=pd.IndexSlice[index_values[: len(df.index) // 2], df.columns])
    df = df.apply(highlight_precision, subset=pd.IndexSlice[index_values[len(df.index) // 2:], df.columns])

    df = add_middle_border(df)
    return df

### LLM predictions

**Least_descriptive queries** (10 evaluated generic queries)

In [14]:
least_df = create_precision_results_dataframe(model_variants, descriptiveness_level="least_descriptive", function_score="llm_scores")
least_df

Input Config,prec@3,prec@5,prec@10,ndcg@3,ndcg@5,ndcg@10
gte_large--basic,0.766667,0.8,0.8,0.972676,0.937512,0.942396
multilingual_e5_large--basic,0.733333,0.74,0.73,0.964233,0.957615,0.951463
multilingual_e5_large--CHUNK_EMBEDS--basic,0.733333,0.72,0.75,0.978543,0.948829,0.941886
bge_large--CHUNK_EMBEDS--basic,0.866667,0.88,0.89,0.956778,0.950238,0.936497
gte_large--relevant,0.9,0.92,0.93,0.976137,0.972917,0.969004
multilingual_e5_large--relevant,0.7,0.7,0.62,0.950564,0.943011,0.949284
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.7,0.76,0.78,0.947159,0.920603,0.917171
bge_large--CHUNK_EMBEDS--relevant,0.933333,0.92,0.93,0.976098,0.97607,0.970392


**Moderately_descriptive queries** (30 evaluated generic queries)

In [15]:
moderate_df = create_precision_results_dataframe(model_variants, descriptiveness_level="moderately_descriptive", function_score="llm_scores")
moderate_df

Input Config,prec@3,prec@5,prec@10,ndcg@3,ndcg@5,ndcg@10
gte_large--basic,0.888889,0.86,0.84,0.985571,0.969686,0.963227
multilingual_e5_large--basic,0.622222,0.626667,0.57,0.937279,0.906974,0.894725
multilingual_e5_large--CHUNK_EMBEDS--basic,0.722222,0.68,0.633333,0.945122,0.927554,0.918066
bge_large--CHUNK_EMBEDS--basic,0.722222,0.693333,0.713333,0.946026,0.921762,0.907321
gte_large--relevant,0.877778,0.826667,0.806667,0.969002,0.959832,0.953724
multilingual_e5_large--relevant,0.6,0.573333,0.523333,0.962599,0.954739,0.944011
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.655556,0.64,0.583333,0.940617,0.927743,0.918019
bge_large--CHUNK_EMBEDS--relevant,0.733333,0.773333,0.76,0.946369,0.934562,0.926124


**Most_descriptive queries** (50 evaluated generic queries)

In [16]:
most_df = create_precision_results_dataframe(model_variants, descriptiveness_level="most_descriptive", function_score="llm_scores")
most_df

Input Config,prec@3,prec@5,prec@10,ndcg@3,ndcg@5,ndcg@10
gte_large--basic,0.574074,0.537037,0.496296,0.955986,0.938104,0.915184
multilingual_e5_large--basic,0.382716,0.377778,0.359259,0.942299,0.91432,0.893017
multilingual_e5_large--CHUNK_EMBEDS--basic,0.345679,0.344444,0.361111,0.938123,0.912004,0.867619
bge_large--CHUNK_EMBEDS--basic,0.481481,0.455556,0.466667,0.932391,0.909394,0.868128
gte_large--relevant,0.753086,0.696296,0.637037,0.94603,0.932958,0.916135
multilingual_e5_large--relevant,0.41358,0.388889,0.366667,0.93178,0.92161,0.89621
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.438272,0.433333,0.4,0.941046,0.905147,0.895709
bge_large--CHUNK_EMBEDS--relevant,0.598765,0.596296,0.590741,0.942958,0.922455,0.89624


**All queries** (90 queries)

In [17]:
all_df = create_precision_results_dataframe(model_variants, descriptiveness_level="all", function_score="llm_scores")
all_df

Input Config,prec@3,prec@5,prec@10,ndcg@3,ndcg@5,ndcg@10
gte_large--basic,0.695035,0.668085,0.638298,0.967204,0.948121,0.933412
multilingual_e5_large--basic,0.496454,0.495745,0.465957,0.94303,0.916582,0.89978
multilingual_e5_large--CHUNK_EMBEDS--basic,0.507092,0.491489,0.489362,0.944657,0.920885,0.89162
bge_large--CHUNK_EMBEDS--basic,0.599291,0.576596,0.590426,0.939337,0.917686,0.88791
gte_large--relevant,0.808511,0.761702,0.72234,0.956565,0.945786,0.933756
multilingual_e5_large--relevant,0.503546,0.480851,0.443617,0.943614,0.934459,0.917112
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.535461,0.534043,0.498936,0.94156,0.914002,0.905112
bge_large--CHUNK_EMBEDS--relevant,0.677305,0.687234,0.680851,0.947572,0.932023,0.913666


----

### Heuristic predictions

**Least_descriptive queries** (10 evaluated generic queries)

In [18]:
least_df = create_precision_results_dataframe(model_variants, descriptiveness_level="least_descriptive", function_score="heuristic_scores")
least_df

Input Config,prec@3,prec@5,prec@10,ndcg@3,ndcg@5,ndcg@10
gte_large--basic,0.733333,0.78,0.79,0.78247,0.780608,0.760539
multilingual_e5_large--basic,0.7,0.72,0.71,0.89842,0.893156,0.862875
multilingual_e5_large--CHUNK_EMBEDS--basic,0.7,0.7,0.71,0.943353,0.917547,0.883177
bge_large--CHUNK_EMBEDS--basic,0.733333,0.74,0.8,0.805183,0.814949,0.824731
gte_large--relevant,0.866667,0.88,0.89,0.934847,0.911381,0.875902
multilingual_e5_large--relevant,0.666667,0.66,0.59,0.736219,0.739891,0.736146
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.566667,0.58,0.65,0.790439,0.826773,0.799051
bge_large--CHUNK_EMBEDS--relevant,0.933333,0.9,0.9,0.913543,0.898612,0.899889


**Moderately_descriptive queries** (30 evaluated generic queries)

In [19]:
moderate_df = create_precision_results_dataframe(model_variants, descriptiveness_level="moderately_descriptive", function_score="heuristic_scores")
moderate_df

Input Config,prec@3,prec@5,prec@10,ndcg@3,ndcg@5,ndcg@10
gte_large--basic,0.866667,0.813333,0.77,0.950458,0.922393,0.881324
multilingual_e5_large--basic,0.488889,0.506667,0.48,0.861554,0.819019,0.819698
multilingual_e5_large--CHUNK_EMBEDS--basic,0.622222,0.573333,0.546667,0.890512,0.874402,0.843532
bge_large--CHUNK_EMBEDS--basic,0.644444,0.62,0.62,0.875908,0.857962,0.817924
gte_large--relevant,0.822222,0.78,0.756667,0.951764,0.931141,0.902206
multilingual_e5_large--relevant,0.6,0.533333,0.48,0.888735,0.875225,0.854429
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.577778,0.533333,0.516667,0.919225,0.906729,0.884359
bge_large--CHUNK_EMBEDS--relevant,0.677778,0.726667,0.713333,0.920551,0.896629,0.870544


**Most_descriptive queries** (50 evaluated generic queries)

In [20]:
most_df = create_precision_results_dataframe(model_variants, descriptiveness_level="most_descriptive", function_score="heuristic_scores")
most_df

Input Config,prec@3,prec@5,prec@10,ndcg@3,ndcg@5,ndcg@10
gte_large--basic,0.462963,0.414815,0.37037,0.810439,0.793206,0.788892
multilingual_e5_large--basic,0.271605,0.285185,0.257407,0.78621,0.779858,0.777268
multilingual_e5_large--CHUNK_EMBEDS--basic,0.253086,0.277778,0.262963,0.828089,0.810527,0.778892
bge_large--CHUNK_EMBEDS--basic,0.345679,0.32963,0.346296,0.867957,0.850223,0.815994
gte_large--relevant,0.592593,0.548148,0.501852,0.899618,0.880512,0.858624
multilingual_e5_large--relevant,0.351852,0.325926,0.309259,0.826988,0.817029,0.796365
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.314815,0.311111,0.275926,0.85541,0.827598,0.806311
bge_large--CHUNK_EMBEDS--relevant,0.506173,0.496296,0.487037,0.902635,0.891687,0.874122


**All queries** (90 queries)

In [21]:
all_df = create_precision_results_dataframe(model_variants, descriptiveness_level="all", function_score="heuristic_scores")
all_df

Input Config,prec@3,prec@5,prec@10,ndcg@3,ndcg@5,ndcg@10
gte_large--basic,0.620567,0.580851,0.542553,0.852151,0.833096,0.815375
multilingual_e5_large--basic,0.386525,0.402128,0.376596,0.822193,0.80441,0.799917
multilingual_e5_large--CHUNK_EMBEDS--basic,0.41844,0.417021,0.401064,0.860273,0.842298,0.810616
bge_large--CHUNK_EMBEDS--basic,0.48227,0.465957,0.481915,0.863816,0.84894,0.81754
gte_large--relevant,0.695035,0.657447,0.624468,0.920008,0.899954,0.874371
multilingual_e5_large--relevant,0.464539,0.42766,0.393617,0.837038,0.827396,0.80849
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.425532,0.410638,0.392553,0.868865,0.852765,0.830448
bge_large--CHUNK_EMBEDS--relevant,0.606383,0.612766,0.603191,0.909513,0.894001,0.875721


#### Precision evaluation conclusion

- <span style="color:red">The GTE model performs best</span>, followed by the BGE model
    - the E5 variants perform worst

- Using chunk embeddings instead of document embeddings for the E5 model improves performance, but only slightly

- Best field extraction technique
    - on average, <span style="color:red">RELEVANT > BASIC</span>
    - for moderately descriptive queries: BASIC > RELEVANT
    - for most descriptive queries: RELEVANT > BASIC

- **Both the LLM predictions and the heuristic predictions lead to the same conclusions**
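
The RELEVANT versus BASIC comparison can be quantified directly from the reported numbers. The sketch below uses the prec@10 values from the "All queries" table with LLM scores, copied by hand, and computes the per-model gain from switching to the relevant-fields representation. The dictionary layout and variable names are illustrative only.

```python
# prec@10 on all generic queries, LLM scores (values copied from the results table).
prec_at_10 = {
    "gte_large": {"basic": 0.638298, "relevant": 0.72234},
    "multilingual_e5_large": {"basic": 0.465957, "relevant": 0.443617},
    "multilingual_e5_large--CHUNK_EMBEDS": {"basic": 0.489362, "relevant": 0.498936},
    "bge_large--CHUNK_EMBEDS": {"basic": 0.590426, "relevant": 0.680851},
}

# Positive gain means the relevant-fields representation scored higher.
gains = {
    model: round(values["relevant"] - values["basic"], 6)
    for model, values in prec_at_10.items()
}
mean_gain = sum(gains.values()) / len(gains)

for model, gain in gains.items():
    print(f"{model}: {gain:+.4f}")
print(f"mean gain: {mean_gain:+.4f}")
```

The gain is positive on average, but not for every model: the document-level E5 variant scores lower with the relevant-fields representation.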

-----
-----
-----

# Accuracy/Hit-rate evaluation

- Retrieve the top K (K=100) documents most similar to each of the **ASSET-SPECIFIC QUERIES**
- Check whether the GROUND TRUTH asset for each query is among the results, and at which position
- Compute metrics
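
The two metric families reported in this section can be sketched as follows. This is an illustration, not the `SpecificAssetQueriesMetrics` implementation, which is not shown here. It assumes one ground-truth asset per query, 0-based positions, and a mean position computed only over queries whose asset appears in the top k. The average positions below 1 reported in the results are consistent with 0-based indexing, but this remains an assumption.

```python
def hit_rate_and_position(
    ranked_ids_per_query: list[list[str]],
    ground_truth_ids: list[str],
    k: int,
) -> tuple[float, float]:
    """Return (hit rate @k, mean 0-based position of the hits within the top k)."""
    positions = []
    for ranked, truth in zip(ranked_ids_per_query, ground_truth_ids):
        top = ranked[:k]
        if truth in top:
            positions.append(top.index(truth))
    n_queries = len(ground_truth_ids)
    hit_rate = len(positions) / n_queries if n_queries else 0.0
    # With no hits the mean position is undefined, so NaN is returned.
    mean_position = sum(positions) / len(positions) if positions else float("nan")
    return hit_rate, mean_position


ranked = [["a", "b", "c"], ["x", "y", "z"], ["q", "r", "s"]]
truth = ["b", "z", "missing"]
print(hit_rate_and_position(ranked, truth, k=3))
```

Because the mean position only covers hits, a model with a low hit rate can still show a very good average position. The two columns should therefore be read together.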

In [22]:
def highlight_hitrate(s: pd.Series):
    # Higher hit rate is better: red for the best value, orange for the second best.
    best = 'background-color: rgba(255, 0, 0, 0.3); color: white'
    second = 'background-color: rgba(255, 165, 0, 0.3); color: white'
    uniq = sorted(s.unique())
    second_max = uniq[-2] if len(uniq) > 1 else uniq[-1]
    return [best if v == s.max() else second if v == second_max else '' for v in s]

def highlight_hit_position(s: pd.Series):
    # Lower position is better: red for the smallest value, orange for the second smallest.
    best = 'background-color: rgba(255, 0, 0, 0.3); color: white'
    second = 'background-color: rgba(255, 165, 0, 0.3); color: white'
    uniq = sorted(s.unique())
    second_min = uniq[1] if len(uniq) > 1 else uniq[0]
    return [best if v == s.min() else second if v == second_min else '' for v in s]

In [23]:
def create_hitrate_results_dataframe(
    model_variants: list[str], 
    asset_quality: Literal["long_description_few_tags", "long_description_many_tags", "moderate_description_many_tags", "poor_description_many_tags", "all"] = "all", 
    descriptiveness_level: Literal["least_descriptive", "moderately_descriptive", "most_descriptive", "all"] = "all",
) -> pd.DataFrame:
    precision_results_path = "data/results/hit_rate"
    metric_col_names = [
        "asset_hit_rate@5", "asset_hit_rate@10", "asset_hit_rate@20", "asset_hit_rate@30",
        "asset_position@5", "asset_position@10", "asset_position@20", "asset_position@30",
    ]
    dataframe_rows = [[] for _ in range(len(model_variants))]
    for it_var, var in enumerate(model_variants):
        if asset_quality == "all" and descriptiveness_level == "all":
            path = os.path.join(
                precision_results_path, var, "all", "results.json"
            )
        elif asset_quality == "all":
            path = os.path.join(
                precision_results_path, var, descriptiveness_level, "results.json"
            )
        elif descriptiveness_level == "all":
            path = os.path.join(
                precision_results_path, var, asset_quality, "results.json"
            )
        else:
            path = os.path.join(
                precision_results_path, var, 
                f"{descriptiveness_level}-{asset_quality}",
                "results.json"
            )
            
        with open(path) as f:
            metrics = SpecificAssetQueriesMetrics.load(json.load(f))
        
        dataframe_rows[it_var].append(var)
        for col_name in metric_col_names:
            m_name, k = col_name.split("@")
            m_value = getattr(metrics.results_in_top[k], m_name)
            dataframe_rows[it_var].append(m_value)    
        
    df = pd.DataFrame(data=dataframe_rows, columns=["Input Config"] + metric_col_names)
    df = df.set_index(keys=["Input Config"], drop=True)

    cols = df.columns
    index_values = df.index.values
        
    df = df.style.apply(highlight_hitrate, subset=pd.IndexSlice[index_values[: len(df.index) // 2], cols[: len(cols) // 2]])
    df = df.apply(highlight_hitrate, subset=pd.IndexSlice[index_values[len(df.index) // 2:], cols[: len(cols) // 2]])

    df = df.apply(highlight_hit_position, subset=pd.IndexSlice[index_values[: len(df.index) // 2], cols[len(cols) // 2:]])
    df = df.apply(highlight_hit_position, subset=pd.IndexSlice[index_values[len(df.index) // 2:], cols[len(cols) // 2:]])

    df = add_middle_border(df)
    return df

## Retrieval performance for individual (asset quality, descriptiveness level) pairs

#### Assets with long descriptions (50 assets)
- description with over 1000 characters

In [24]:
long_doc_least_df = create_hitrate_results_dataframe(model_variants, asset_quality="long_description_many_tags", descriptiveness_level="least_descriptive")
long_doc_least_df

Input Config,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
gte_large--basic,0.82,0.82,0.84,0.86,0.536585,0.536585,0.785714,1.372093
multilingual_e5_large--basic,0.66,0.72,0.74,0.74,0.787879,1.361111,1.675676,1.675676
multilingual_e5_large--CHUNK_EMBEDS--basic,0.7,0.82,0.86,0.86,0.771429,1.682927,2.27907,2.27907
bge_large--CHUNK_EMBEDS--basic,0.84,0.94,0.94,0.98,0.571429,1.191489,1.191489,2.204082
gte_large--relevant,0.8,0.84,0.84,0.84,0.5,0.809524,0.809524,0.809524
multilingual_e5_large--relevant,0.34,0.34,0.38,0.38,0.470588,0.470588,1.894737,1.894737
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.7,0.72,0.82,0.86,0.914286,1.055556,2.609756,3.697674
bge_large--CHUNK_EMBEDS--relevant,0.88,0.94,0.94,0.96,0.613636,1.042553,1.042553,1.5625


In [25]:
long_doc_moderate_df = create_hitrate_results_dataframe(model_variants, asset_quality="long_description_many_tags", descriptiveness_level="moderately_descriptive")
long_doc_moderate_df

Input Config,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
gte_large--basic,0.8,0.86,0.88,0.88,0.325,0.697674,0.909091,0.909091
multilingual_e5_large--basic,0.68,0.72,0.78,0.8,0.352941,0.611111,1.589744,2.075
multilingual_e5_large--CHUNK_EMBEDS--basic,0.82,0.86,0.92,0.94,0.560976,0.860465,1.695652,2.12766
bge_large--CHUNK_EMBEDS--basic,0.92,0.96,0.98,0.98,0.304348,0.520833,0.77551,0.77551
gte_large--relevant,0.84,0.88,0.9,0.9,0.380952,0.590909,0.844444,0.844444
multilingual_e5_large--relevant,0.4,0.42,0.42,0.42,0.3,0.571429,0.571429,0.571429
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.76,0.82,0.88,0.9,0.631579,0.97561,2.090909,2.688889
bge_large--CHUNK_EMBEDS--relevant,0.98,1.0,1.0,1.0,0.469388,0.62,0.62,0.62


In [26]:
long_doc_most_df = create_hitrate_results_dataframe(model_variants, asset_quality="long_description_many_tags", descriptiveness_level="most_descriptive")
long_doc_most_df


Input Config,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
gte_large--basic,0.78,0.8,0.8,0.8,0.230769,0.45,0.45,0.45
multilingual_e5_large--basic,0.72,0.76,0.76,0.76,0.388889,0.763158,0.763158,0.763158
multilingual_e5_large--CHUNK_EMBEDS--basic,0.96,0.98,0.98,0.98,0.291667,0.428571,0.428571,0.428571
bge_large--CHUNK_EMBEDS--basic,0.96,1.0,1.0,1.0,0.291667,0.56,0.56,0.56
gte_large--relevant,0.86,0.88,0.88,0.9,0.162791,0.272727,0.272727,0.822222
multilingual_e5_large--relevant,0.48,0.5,0.5,0.5,0.291667,0.52,0.52,0.52
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.98,0.98,1.0,1.0,0.306122,0.306122,0.58,0.58
bge_large--CHUNK_EMBEDS--relevant,0.98,1.0,1.0,1.0,0.346939,0.52,0.52,0.52


-----

#### Assets with mediocre descriptions (50 assets) 
- description length between 200 and 500 characters

In [27]:
moderate_doc_least_df = create_hitrate_results_dataframe(model_variants, asset_quality="moderate_description_many_tags", descriptiveness_level="least_descriptive")
moderate_doc_least_df


Input Config,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
gte_large--basic,0.62,0.7,0.76,0.76,0.612903,1.2,2.052632,2.052632
multilingual_e5_large--basic,0.7,0.76,0.8,0.84,0.371429,0.921053,1.6,2.904762
multilingual_e5_large--CHUNK_EMBEDS--basic,0.72,0.76,0.82,0.82,0.5,0.815789,1.804878,1.804878
bge_large--CHUNK_EMBEDS--basic,0.66,0.74,0.8,0.82,0.969697,1.567568,2.525,2.95122
gte_large--relevant,0.76,0.78,0.8,0.84,0.894737,1.051282,1.375,2.357143
multilingual_e5_large--relevant,0.22,0.24,0.24,0.24,1.090909,1.416667,1.416667,1.416667
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.48,0.5,0.56,0.56,0.791667,1.04,2.607143,2.607143
bge_large--CHUNK_EMBEDS--relevant,0.72,0.8,0.88,0.88,0.805556,1.475,2.386364,2.386364


In [28]:
moderate_doc_moderate_df = create_hitrate_results_dataframe(model_variants, asset_quality="moderate_description_many_tags", descriptiveness_level="moderately_descriptive")
moderate_doc_moderate_df


Input Config,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
gte_large--basic,0.76,0.8,0.82,0.82,0.710526,1.05,1.341463,1.341463
multilingual_e5_large--basic,0.72,0.74,0.76,0.76,0.222222,0.432432,0.736842,0.736842
multilingual_e5_large--CHUNK_EMBEDS--basic,0.88,0.9,0.92,0.92,0.25,0.422222,0.695652,0.695652
bge_large--CHUNK_EMBEDS--basic,0.74,0.82,0.88,0.88,0.648649,1.170732,2.022727,2.022727
gte_large--relevant,0.78,0.84,0.86,0.9,0.538462,1.0,1.348837,2.488889
multilingual_e5_large--relevant,0.22,0.22,0.22,0.22,0.181818,0.181818,0.181818,0.181818
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.6,0.62,0.62,0.64,0.533333,0.709677,0.709677,1.53125
bge_large--CHUNK_EMBEDS--relevant,0.8,0.92,0.92,0.94,0.45,1.282609,1.282609,1.723404


In [29]:
moderate_doc_most_df = create_hitrate_results_dataframe(model_variants, asset_quality="moderate_description_many_tags", descriptiveness_level="most_descriptive")
moderate_doc_most_df


Input Config,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
gte_large--basic,0.88,0.88,0.88,0.88,0.477273,0.477273,0.477273,0.477273
multilingual_e5_large--basic,0.86,0.86,0.9,0.9,0.302326,0.302326,0.977778,0.977778
multilingual_e5_large--CHUNK_EMBEDS--basic,0.9,0.9,0.94,0.94,0.222222,0.222222,0.93617,0.93617
bge_large--CHUNK_EMBEDS--basic,0.88,0.88,0.9,0.9,0.795455,0.795455,1.022222,1.022222
gte_large--relevant,0.86,0.86,0.92,0.92,0.372093,0.372093,1.108696,1.108696
multilingual_e5_large--relevant,0.4,0.4,0.4,0.4,0.3,0.3,0.3,0.3
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.64,0.68,0.72,0.72,0.3125,0.764706,1.555556,1.555556
bge_large--CHUNK_EMBEDS--relevant,0.92,0.94,0.94,0.94,0.76087,0.87234,0.87234,0.87234


-----

#### Assets with short to no descriptions (50 assets)
- description with fewer than 50 characters

In [30]:
short_doc_least_df = create_hitrate_results_dataframe(model_variants, asset_quality="poor_description_many_tags", descriptiveness_level="least_descriptive")
short_doc_least_df


Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.357143,0.380952,0.428571,0.428571,0.866667,1.125,2.5,2.5
multilingual_e5_large--basic,0.285714,0.333333,0.357143,0.357143,0.5,1.5,2.2,2.2
multilingual_e5_large--CHUNK_EMBEDS--basic,0.5,0.595238,0.595238,0.595238,0.333333,1.24,1.24,1.24
bge_large--CHUNK_EMBEDS--basic,0.309524,0.380952,0.380952,0.404762,0.692308,1.875,1.875,3.0
gte_large--relevant,0.261905,0.333333,0.428571,0.452381,0.454545,1.714286,4.333333,5.315789
multilingual_e5_large--relevant,0.02381,0.02381,0.02381,0.02381,2.0,2.0,2.0,2.0
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.166667,0.166667,0.166667,0.166667,0.857143,0.857143,0.857143,0.857143
bge_large--CHUNK_EMBEDS--relevant,0.261905,0.285714,0.333333,0.357143,0.727273,1.166667,2.857143,4.2


In [31]:
short_doc_moderate_df = create_hitrate_results_dataframe(model_variants, asset_quality="poor_description_many_tags", descriptiveness_level="moderately_descriptive")
short_doc_moderate_df


Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.309524,0.428571,0.452381,0.47619,0.076923,2.0,2.684211,3.95
multilingual_e5_large--basic,0.428571,0.452381,0.5,0.5,0.611111,0.947368,2.52381,2.52381
multilingual_e5_large--CHUNK_EMBEDS--basic,0.5,0.571429,0.571429,0.571429,0.47619,1.333333,1.333333,1.333333
bge_large--CHUNK_EMBEDS--basic,0.309524,0.357143,0.404762,0.428571,0.615385,1.466667,3.352941,4.5
gte_large--relevant,0.333333,0.357143,0.428571,0.452381,0.785714,1.266667,3.722222,5.052632
multilingual_e5_large--relevant,0.02381,0.02381,0.047619,0.047619,0.0,0.0,7.0,7.0
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.142857,0.142857,0.190476,0.190476,0.5,0.5,3.5,3.5
bge_large--CHUNK_EMBEDS--relevant,0.261905,0.285714,0.380952,0.404762,0.727273,1.25,4.625,5.588235


In [32]:
short_doc_most_df = create_hitrate_results_dataframe(model_variants, asset_quality="poor_description_many_tags", descriptiveness_level="most_descriptive")
short_doc_most_df


Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.5,0.547619,0.595238,0.690476,0.380952,0.869565,1.84,4.931034
multilingual_e5_large--basic,0.452381,0.452381,0.52381,0.52381,0.631579,0.631579,2.5,2.5
multilingual_e5_large--CHUNK_EMBEDS--basic,0.595238,0.595238,0.642857,0.666667,0.4,0.4,1.333333,2.107143
bge_large--CHUNK_EMBEDS--basic,0.428571,0.47619,0.547619,0.571429,0.611111,1.4,3.217391,3.958333
gte_large--relevant,0.5,0.595238,0.690476,0.738095,0.571429,1.6,3.448276,4.806452
multilingual_e5_large--relevant,0.119048,0.142857,0.166667,0.166667,0.8,2.0,4.0,4.0
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.166667,0.238095,0.285714,0.285714,0.142857,1.9,3.916667,3.916667
bge_large--CHUNK_EMBEDS--relevant,0.404762,0.428571,0.5,0.547619,0.647059,0.944444,2.333333,4.434783


## Analyzing influence of asset_quality on retrieval performance


**long_description_many_tags** (150 queries)

In [33]:
df = create_hitrate_results_dataframe(model_variants, asset_quality="long_description_many_tags", descriptiveness_level="all")
df

Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.8,0.826667,0.84,0.846667,0.366667,0.564516,0.722222,0.92126
multilingual_e5_large--basic,0.686667,0.733333,0.76,0.766667,0.504854,0.909091,1.342105,1.513043
multilingual_e5_large--CHUNK_EMBEDS--basic,0.826667,0.886667,0.92,0.926667,0.516129,0.954887,1.427536,1.57554
bge_large--CHUNK_EMBEDS--basic,0.906667,0.966667,0.973333,0.986667,0.382353,0.751724,0.835616,1.175676
gte_large--relevant,0.833333,0.866667,0.873333,0.88,0.344,0.553846,0.641221,0.825758
multilingual_e5_large--relevant,0.406667,0.42,0.433333,0.433333,0.344262,0.52381,0.938462,0.938462
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.813333,0.84,0.9,0.92,0.581967,0.738095,1.688889,2.23913
bge_large--CHUNK_EMBEDS--relevant,0.946667,0.98,0.98,0.986667,0.471831,0.721088,0.721088,0.891892


**long_description_few_tags** (150 queries)

In [34]:
df = create_hitrate_results_dataframe(model_variants, asset_quality="long_description_few_tags", descriptiveness_level="all")
df

Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.566667,0.686667,0.72,0.753333,1.035294,1.980583,2.564815,3.469027
multilingual_e5_large--basic,0.473333,0.56,0.62,0.626667,0.84507,1.75,2.989247,3.191489
multilingual_e5_large--CHUNK_EMBEDS--basic,0.586667,0.666667,0.76,0.773333,0.840909,1.57,3.210526,3.560345
bge_large--CHUNK_EMBEDS--basic,0.506667,0.6,0.753333,0.773333,0.986842,1.788889,4.380531,4.905172
gte_large--relevant,0.633333,0.713333,0.753333,0.786667,0.905263,1.551402,2.141593,3.0
multilingual_e5_large--relevant,0.353333,0.386667,0.4,0.406667,1.113208,1.586207,2.033333,2.393443
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.593333,0.686667,0.713333,0.753333,0.977528,1.68932,2.28972,3.415929
bge_large--CHUNK_EMBEDS--relevant,0.513333,0.633333,0.706667,0.76,1.116883,2.242105,3.424528,4.798246


**moderate_description_many_tags** (150 queries)

In [35]:
df = create_hitrate_results_dataframe(model_variants, asset_quality="moderate_description_many_tags", descriptiveness_level="all")
df

Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.753333,0.793333,0.82,0.82,0.59292,0.882353,1.252033,1.252033
multilingual_e5_large--basic,0.76,0.786667,0.82,0.833333,0.298246,0.542373,1.105691,1.552
multilingual_e5_large--CHUNK_EMBEDS--basic,0.833333,0.853333,0.893333,0.893333,0.312,0.46875,1.119403,1.119403
bge_large--CHUNK_EMBEDS--basic,0.76,0.813333,0.86,0.866667,0.798246,1.155738,1.829457,1.969231
gte_large--relevant,0.8,0.826667,0.86,0.886667,0.591667,0.798387,1.271318,1.969925
multilingual_e5_large--relevant,0.28,0.286667,0.286667,0.286667,0.47619,0.581395,0.581395,0.581395
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.573333,0.6,0.633333,0.64,0.523256,0.822222,1.589474,1.854167
bge_large--CHUNK_EMBEDS--relevant,0.813333,0.886667,0.913333,0.92,0.672131,1.195489,1.49635,1.644928


**poor_description_many_tags** (150 queries)

In [36]:
df = create_hitrate_results_dataframe(model_variants, asset_quality="poor_description_many_tags", descriptiveness_level="all")
df

Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.388889,0.452381,0.492063,0.531746,0.44898,1.298246,2.290323,3.985075
multilingual_e5_large--basic,0.388889,0.412698,0.460317,0.460317,0.591837,0.980769,2.431034,2.431034
multilingual_e5_large--CHUNK_EMBEDS--basic,0.531746,0.587302,0.603175,0.611111,0.402985,0.986486,1.302632,1.584416
bge_large--CHUNK_EMBEDS--basic,0.349206,0.404762,0.444444,0.468254,0.636364,1.568627,2.875,3.847458
gte_large--relevant,0.365079,0.428571,0.515873,0.547619,0.608696,1.537037,3.769231,5.014493
multilingual_e5_large--relevant,0.055556,0.063492,0.079365,0.079365,0.857143,1.75,4.4,4.4
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.15873,0.18254,0.214286,0.214286,0.5,1.217391,3.0,3.0
bge_large--CHUNK_EMBEDS--relevant,0.309524,0.333333,0.404762,0.436508,0.692308,1.095238,3.196078,4.727273


## Analyzing influence of descriptiveness level on retrieval performance

**Least_descriptive** (200 queries)

In [37]:
least_descriptive_df = create_hitrate_results_dataframe(model_variants, asset_quality="all", descriptiveness_level="least_descriptive")
least_descriptive_df


Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.578125,0.645833,0.677083,0.692708,0.720721,1.306452,1.815385,2.315789
multilingual_e5_large--basic,0.526042,0.598958,0.635417,0.645833,0.643564,1.46087,2.188525,2.620968
multilingual_e5_large--CHUNK_EMBEDS--basic,0.614583,0.703125,0.755208,0.755208,0.661017,1.407407,2.289655,2.289655
bge_large--CHUNK_EMBEDS--basic,0.588542,0.65625,0.723958,0.744792,0.840708,1.428571,2.661871,3.237762
gte_large--relevant,0.59375,0.651042,0.692708,0.71875,0.754386,1.248,2.0,2.73913
multilingual_e5_large--relevant,0.234375,0.25,0.265625,0.270833,0.977778,1.291667,2.039216,2.461538
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.489583,0.515625,0.567708,0.588542,1.031915,1.272727,2.568807,3.336283
bge_large--CHUNK_EMBEDS--relevant,0.604167,0.661458,0.708333,0.729167,0.844828,1.393701,2.147059,2.771429


**Moderately_descriptive** (200 queries)

In [38]:
moderately_descriptive_df = create_hitrate_results_dataframe(model_variants, asset_quality="all", descriptiveness_level="moderately_descriptive")
moderately_descriptive_df


Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.609375,0.692708,0.723958,0.739583,0.57265,1.285714,1.81295,2.316901
multilingual_e5_large--basic,0.572917,0.614583,0.666667,0.677083,0.445455,0.864407,1.90625,2.207692
multilingual_e5_large--CHUNK_EMBEDS--basic,0.692708,0.739583,0.78125,0.791667,0.496241,0.93662,1.626667,1.894737
bge_large--CHUNK_EMBEDS--basic,0.625,0.697917,0.760417,0.776042,0.591667,1.141791,2.294521,2.724832
gte_large--relevant,0.671875,0.723958,0.760417,0.78125,0.643411,1.086331,1.712329,2.373333
multilingual_e5_large--relevant,0.260417,0.270833,0.28125,0.28125,0.54,0.75,1.277778,1.277778
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.53125,0.578125,0.614583,0.635417,0.666667,1.099099,1.966102,2.811475
bge_large--CHUNK_EMBEDS--relevant,0.645833,0.71875,0.765625,0.791667,0.596774,1.26087,2.061224,2.723684


**Most_descriptive** (200 queries)

In [39]:
most_descriptive_df = create_hitrate_results_dataframe(model_variants, asset_quality="all", descriptiveness_level="most_descriptive")
most_descriptive_df


Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.723958,0.760417,0.78125,0.807292,0.532374,0.821918,1.173333,1.890323
multilingual_e5_large--basic,0.65625,0.682292,0.71875,0.71875,0.484127,0.70229,1.427536,1.427536
multilingual_e5_large--CHUNK_EMBEDS--basic,0.796875,0.822917,0.869792,0.880208,0.392157,0.594937,1.413174,1.680473
bge_large--CHUNK_EMBEDS--basic,0.713542,0.770833,0.828125,0.838542,0.583942,1.067568,1.943396,2.217391
gte_large--relevant,0.744792,0.786458,0.828125,0.854167,0.412587,0.748344,1.377358,2.054878
multilingual_e5_large--relevant,0.354167,0.375,0.380208,0.380208,0.514706,0.875,1.082192,1.082192
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.630208,0.6875,0.713542,0.723958,0.396694,0.916667,1.408759,1.683453
bge_large--CHUNK_EMBEDS--relevant,0.729167,0.791667,0.822917,0.848958,0.642857,1.138158,1.531646,2.233129


## ALL results

**ALL** (averaged over all results, roughly 600 queries)

In [40]:
all_df = create_hitrate_results_dataframe(model_variants, asset_quality="all", descriptiveness_level="all")
all_df

Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.637153,0.699653,0.727431,0.746528,0.60218,1.124069,1.584726,2.162791
multilingual_e5_large--basic,0.585069,0.631944,0.673611,0.680556,0.519288,0.994505,1.824742,2.063776
multilingual_e5_large--CHUNK_EMBEDS--basic,0.701389,0.755208,0.802083,0.809028,0.50495,0.958621,1.757576,1.939914
bge_large--CHUNK_EMBEDS--basic,0.642361,0.708333,0.770833,0.786458,0.664865,1.203431,2.283784,2.706402
gte_large--relevant,0.670139,0.720486,0.760417,0.784722,0.590674,1.012048,1.678082,2.369469
multilingual_e5_large--relevant,0.282986,0.298611,0.309028,0.310764,0.650307,0.953488,1.41573,1.541899
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.550347,0.59375,0.631944,0.649306,0.671924,1.078947,1.936813,2.550802
bge_large--CHUNK_EMBEDS--relevant,0.659722,0.723958,0.765625,0.789931,0.689474,1.256595,1.897959,2.562637


#### Hit-rate evaluation conclusion 

- The field extraction strategy has a strong effect on the results of the models:
    - GTE and BGE perform very similarly (BGE slightly outperforms GTE) under both extraction strategies
        - both perform better on relevant fields
    - E5 (chunk)
        - performs better on basic fields with description
        - on relevant fields with little to no description, its performance drops noticeably compared to the results of GTE/BGE
            - given this sensitivity to its input, we would rather avoid this model if possible
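
To make the two column families in the tables above easier to read, here is a minimal sketch of how `asset_hit_rate@k` and `asset_position@k` could be computed. It assumes that the position is the zero-based rank of the target asset, averaged only over the queries that hit within the top `k`. The function name `hit_rate_and_position` and the toy data are hypothetical; the actual logic lives in `SpecificAssetQueriesMetrics` and may differ.

```python
from typing import Sequence, Tuple


def hit_rate_and_position(
    ranked_ids_per_query: Sequence[Sequence[str]],
    target_ids: Sequence[str],
    k: int,
) -> Tuple[float, float]:
    """Return (hit rate@k, mean zero-based position of the target among the hits)."""
    positions = []
    for ranked_ids, target in zip(ranked_ids_per_query, target_ids):
        top_k = list(ranked_ids[:k])
        if target in top_k:
            positions.append(top_k.index(target))

    hit_rate = len(positions) / len(target_ids)
    # The mean position is undefined when no query hits (assumption: report NaN).
    mean_position = sum(positions) / len(positions) if positions else float("nan")
    return hit_rate, mean_position


# Toy example: three queries, each looking for one specific asset.
rankings = [["a", "b", "c"], ["x", "a", "y"], ["q", "r", "s"]]
targets = ["a", "y", "z"]

# 2 of 3 targets are found; the hits sit at zero-based positions 0 and 2.
hit_rate, mean_position = hit_rate_and_position(rankings, targets, k=3)
```

Under this reading, a low `asset_position@k` is only meaningful together with the hit rate: a model that hits rarely can still show a small average position.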

-----

# Conclusions

- The GTE model performs best on the precision evaluation, but achieves worse results than BGE on the hit-rate evaluation
- The E5 model applied to basic fields performs best on the hit-rate evaluation; however, its results plummet once the model is applied to relevant fields
    - Hence we prefer the more stable models: GTE/BGE

**TODO:**
- We need to test how well the GTE model performs when applied to separate chunks rather than whole documents

**CURRENT MODEL SELECTIONS:**

*If we were to select the best model from the current results, we would choose:*

- GTE_large (relevant fields) -> as an embedding model only (HIGH PRECISION)

- GTE_large/BGE_large (relevant fields) -> as a retrieval part of RAG pipeline (HIGH RECALL)


# Comparison of the best models

In this section we will look at the overall results of the following 3 models using both types of field extraction techniques:
- GTE_large (document embeddings store)
- GTE_large (chunk embeddings store)
- BGE_large (chunk embeddings store)


## PRECISION EVALUATION

In [None]:
model_variants = [
    # basic dataset representations
    "gte_large--basic",
    "gte_large_hierarchical--CHUNK_EMBEDS--basic",
    "bge_large--CHUNK_EMBEDS--basic",

    # relevant dataset representations
    "gte_large--relevant",
    "gte_large_hierarchical--CHUNK_EMBEDS--relevant",
    "bge_large--CHUNK_EMBEDS--relevant"
]   

**LLM predictions - All queries** (90 queries)

In [48]:
all_df = create_precision_results_dataframe(model_variants, descriptiveness_level="all", function_score="llm_scores")
all_df

Unnamed: 0_level_0,prec@3,prec@5,prec@10,ndcg@3,ndcg@5,ndcg@10
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
gte_large--basic,0.695035,0.668085,0.638298,0.967204,0.948121,0.933412
gte_large_hierarchical--CHUNK_EMBEDS--basic,0.755319,0.708511,0.681915,0.957709,0.944415,0.927771
bge_large--CHUNK_EMBEDS--basic,0.599291,0.576596,0.590426,0.939337,0.917686,0.88791
gte_large--relevant,0.808511,0.761702,0.72234,0.956565,0.945786,0.933756
gte_large_hierarchical--CHUNK_EMBEDS--relevant,0.808511,0.770213,0.735106,0.959514,0.953607,0.938973
bge_large--CHUNK_EMBEDS--relevant,0.677305,0.687234,0.680851,0.947572,0.932023,0.913666


**Heuristic predictions - All queries** (90 queries)

In [49]:
all_df = create_precision_results_dataframe(model_variants, descriptiveness_level="all", function_score="heuristic_scores")
all_df

Unnamed: 0_level_0,prec@3,prec@5,prec@10,ndcg@3,ndcg@5,ndcg@10
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
gte_large--basic,0.620567,0.580851,0.542553,0.852151,0.833096,0.815375
gte_large_hierarchical--CHUNK_EMBEDS--basic,0.624113,0.604255,0.581915,0.894,0.870545,0.84455
bge_large--CHUNK_EMBEDS--basic,0.48227,0.465957,0.481915,0.863816,0.84894,0.81754
gte_large--relevant,0.695035,0.657447,0.624468,0.920008,0.899954,0.874371
gte_large_hierarchical--CHUNK_EMBEDS--relevant,0.70922,0.680851,0.626596,0.934196,0.919122,0.89431
bge_large--CHUNK_EMBEDS--relevant,0.606383,0.612766,0.603191,0.909513,0.894001,0.875721
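
For reference, the `prec@k` and `ndcg@k` columns above can be read along the lines of the textbook definitions sketched below. This is an illustration only: the relevance threshold, the choice to compute the ideal ordering over the retrieved list itself, and all function names are assumptions, and the `RetrievalMetrics` class used in this notebook may implement these metrics differently.

```python
import math
from typing import Sequence


def precision_at_k(relevance: Sequence[float], k: int, threshold: float = 0.5) -> float:
    """Fraction of the top-k results whose relevance score reaches the threshold."""
    return sum(1 for r in relevance[:k] if r >= threshold) / k


def dcg_at_k(relevance: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with the usual log2 rank discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevance[:k]))


def ndcg_at_k(relevance: Sequence[float], k: int) -> float:
    """DCG normalised by the DCG of the best possible ordering of the same scores."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


# Relevance of the retrieved documents for one query, in ranked order (toy values).
scores = [1.0, 0.0, 1.0, 0.0, 0.0]

p3 = precision_at_k(scores, 3)  # two of the top three results are relevant
n3 = ndcg_at_k(scores, 3)       # below 1.0 because a relevant result is ranked third
```

Normalising against the retrieved list (rather than against all relevant documents in the collection) tends to give high nDCG values even when precision is moderate, so the two metric families should be read together.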


### Precision final conclusion

- The hierarchical GTE version achieves similar or even better results than the plain GTE model on average
    - This approach is more generic and can handle arbitrarily long documents, making it more versatile and robust than working within a 4k-token limit and storing a single representation of each whole document
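
The idea of chunk-level embeddings can be sketched as follows: a long document is split into overlapping chunks, each chunk gets its own embedding, and a document is ranked by its best-matching chunk. This is a simplified sketch, not the pipeline evaluated here: word-based chunking, max-pooling of chunk scores, the function names, and the toy vectors standing in for real model embeddings are all assumptions.

```python
import math
from typing import Dict, List, Sequence


def split_into_chunks(words: Sequence[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split a word sequence into fixed-size chunks that overlap by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(list(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the document
    return chunks


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_documents(
    query_vec: Sequence[float],
    chunk_vecs_by_doc: Dict[str, List[Sequence[float]]],
) -> List[str]:
    """Rank documents by the similarity of their best-matching chunk (max-pooling)."""
    doc_scores = {
        doc_id: max(cosine(query_vec, chunk) for chunk in chunks)
        for doc_id, chunks in chunk_vecs_by_doc.items()
        if chunks
    }
    return sorted(doc_scores, key=doc_scores.get, reverse=True)


# Ten words, chunks of four words, one word of overlap.
chunks = split_into_chunks([f"w{i}" for i in range(10)], chunk_size=4, overlap=1)

# Toy 2-d vectors in place of real chunk embeddings.
ranking = rank_documents(
    [1.0, 0.0],
    {"doc_a": [[0.0, 1.0], [1.0, 0.1]], "doc_b": [[0.5, 0.5]]},
)
```

Because each chunk is scored on its own, one highly relevant passage can surface a document even when the rest of its text is unrelated to the query, which a single whole-document embedding may dilute.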

## HIT RATE EVALUATION

In [64]:
model_variants = [
    # basic dataset representations
    "gte_large--basic",
    "gte_large_hierarchical--CHUNK_EMBEDS--basic",
    "multilingual_e5_large--CHUNK_EMBEDS--basic",
    "bge_large--CHUNK_EMBEDS--basic",

    # relevant dataset representations
    "gte_large--relevant",
    "gte_large_hierarchical--CHUNK_EMBEDS--relevant",
    "multilingual_e5_large--CHUNK_EMBEDS--relevant",
    "bge_large--CHUNK_EMBEDS--relevant"
]   

**Least_descriptive queries** (200 queries)

In [65]:
least_descriptive_df = create_hitrate_results_dataframe(model_variants, asset_quality="all", descriptiveness_level="least_descriptive")
least_descriptive_df


Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.578125,0.645833,0.677083,0.692708,0.720721,1.306452,1.815385,2.315789
gte_large_hierarchical--CHUNK_EMBEDS--basic,0.630208,0.739583,0.776042,0.791667,0.586777,1.457746,1.959732,2.388158
multilingual_e5_large--CHUNK_EMBEDS--basic,0.614583,0.703125,0.755208,0.755208,0.661017,1.407407,2.289655,2.289655
bge_large--CHUNK_EMBEDS--basic,0.588542,0.65625,0.723958,0.744792,0.840708,1.428571,2.661871,3.237762
gte_large--relevant,0.59375,0.651042,0.692708,0.71875,0.754386,1.248,2.0,2.73913
gte_large_hierarchical--CHUNK_EMBEDS--relevant,0.614583,0.671875,0.71875,0.755208,0.677966,1.224806,2.152174,3.2
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.489583,0.515625,0.567708,0.588542,1.031915,1.272727,2.568807,3.336283
bge_large--CHUNK_EMBEDS--relevant,0.604167,0.661458,0.708333,0.729167,0.844828,1.393701,2.147059,2.771429


**Moderately_descriptive** (200 queries)

In [66]:
moderately_descriptive_df = create_hitrate_results_dataframe(model_variants, asset_quality="all", descriptiveness_level="moderately_descriptive")
moderately_descriptive_df


Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.609375,0.692708,0.723958,0.739583,0.57265,1.285714,1.81295,2.316901
gte_large_hierarchical--CHUNK_EMBEDS--basic,0.666667,0.75,0.786458,0.791667,0.585938,1.298611,1.854305,2.0
multilingual_e5_large--CHUNK_EMBEDS--basic,0.692708,0.739583,0.78125,0.791667,0.496241,0.93662,1.626667,1.894737
bge_large--CHUNK_EMBEDS--basic,0.625,0.697917,0.760417,0.776042,0.591667,1.141791,2.294521,2.724832
gte_large--relevant,0.671875,0.723958,0.760417,0.78125,0.643411,1.086331,1.712329,2.373333
gte_large_hierarchical--CHUNK_EMBEDS--relevant,0.666667,0.71875,0.765625,0.78125,0.617188,1.050725,1.782313,2.193333
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.53125,0.578125,0.614583,0.635417,0.666667,1.099099,1.966102,2.811475
bge_large--CHUNK_EMBEDS--relevant,0.645833,0.71875,0.765625,0.791667,0.596774,1.26087,2.061224,2.723684


**Most_descriptive** (200 queries)

In [67]:
most_descriptive_df = create_hitrate_results_dataframe(model_variants, asset_quality="all", descriptiveness_level="most_descriptive")
most_descriptive_df


Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.723958,0.760417,0.78125,0.807292,0.532374,0.821918,1.173333,1.890323
gte_large_hierarchical--CHUNK_EMBEDS--basic,0.802083,0.833333,0.859375,0.880208,0.5,0.725,1.157576,1.775148
multilingual_e5_large--CHUNK_EMBEDS--basic,0.796875,0.822917,0.869792,0.880208,0.392157,0.594937,1.413174,1.680473
bge_large--CHUNK_EMBEDS--basic,0.713542,0.770833,0.828125,0.838542,0.583942,1.067568,1.943396,2.217391
gte_large--relevant,0.744792,0.786458,0.828125,0.854167,0.412587,0.748344,1.377358,2.054878
gte_large_hierarchical--CHUNK_EMBEDS--relevant,0.776042,0.807292,0.828125,0.864583,0.47651,0.677419,0.968553,1.921687
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.630208,0.6875,0.713542,0.723958,0.396694,0.916667,1.408759,1.683453
bge_large--CHUNK_EMBEDS--relevant,0.729167,0.791667,0.822917,0.848958,0.642857,1.138158,1.531646,2.233129


**ALL queries** (averaged over all results, roughly 600 queries)

In [68]:
all_df = create_hitrate_results_dataframe(model_variants, asset_quality="all", descriptiveness_level="all")
all_df

Unnamed: 0_level_0,asset_hit_rate@5,asset_hit_rate@10,asset_hit_rate@20,asset_hit_rate@30,asset_position@5,asset_position@10,asset_position@20,asset_position@30
Input Config,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
gte_large--basic,0.637153,0.699653,0.727431,0.746528,0.60218,1.124069,1.584726,2.162791
gte_large_hierarchical--CHUNK_EMBEDS--basic,0.699653,0.774306,0.807292,0.821181,0.55335,1.143498,1.64086,2.044397
multilingual_e5_large--CHUNK_EMBEDS--basic,0.701389,0.755208,0.802083,0.809028,0.50495,0.958621,1.757576,1.939914
bge_large--CHUNK_EMBEDS--basic,0.642361,0.708333,0.770833,0.786458,0.664865,1.203431,2.283784,2.706402
gte_large--relevant,0.670139,0.720486,0.760417,0.784722,0.590674,1.012048,1.678082,2.369469
gte_large_hierarchical--CHUNK_EMBEDS--relevant,0.685764,0.732639,0.770833,0.800347,0.582278,0.966825,1.605856,2.412148
multilingual_e5_large--CHUNK_EMBEDS--relevant,0.550347,0.59375,0.631944,0.649306,0.671924,1.078947,1.936813,2.550802
bge_large--CHUNK_EMBEDS--relevant,0.659722,0.723958,0.765625,0.789931,0.689474,1.256595,1.897959,2.562637


### Hitrate final conclusion

- The hierarchical GTE version was able to beat the performance of the E5 basic-fields model
    - However, because of the gain in precision, we still prefer relevant-fields extraction, which retrieves more information that may be crucial to the user query
        - By chunking and computing chunk embeddings separately, we don't need to worry about lengthening the textual representation of a document, which could make its embedding more generic when a single embedding is stored for the whole document

# THE VERY BEST MODEL

- Based on the evaluations we conducted, we identified the GTE_large (hierarchical) model as the best of all evaluated embedding models
    - This model can be used either separately or as part of a RAG pipeline or any other pipeline in a retrieval system