# Valutazione

1. Creare tot sistemi diversi (per esempio, lucene, bert, scibert, tabert (con e senza contesto))
2. Creare un sottoinsieme dei paper (tipo 20) da usare come ground truth (o a caso oppure con lucene i più rilevanti per argomento che abbiano tabelle interessanti) - cerchiamo di limitare il numero di tabelle a ca. 50
3. Per ogni query $q \in Q$ (min 5):
    1. Fare il ranking a mano delle tabelle
        - Salviamo i ranking per ogni query in un json, con le informazioni rilevanti, tipo il ranking, il valore di rilevanza per ogni elemento etc.
    2. Interrogare ogni sistema sulla query
    3. Calcolare le metriche: 
        - Reciprocal Rank: $\text{RR}_q = \frac{1}{rank_i}$ dove $i$ è l’elemento più rilevante.
            - nella pratica possiamo controllare se l’elemento scelto dal motore ha almeno lo score massimo (potrebbero esserci dei parimerito)
        - Normalized Discounted Cumulative Gain con taglio $\text{K} = \set{5,15}$:
            
            $$
            \text{NDCG@K}_q = \frac{\text{DCG@K}_q}{\text{IDCG@K}_q}
            $$
            
            - dove dividiamo il $\text{DCG@K}_q = rel_1 + \sum_{i=2}^K \frac{rel_i}{\log_2 (i + 1)}$ con quello ideale, cioè dove il ranking è il migliore possibile
4. Calcolare la media delle metriche:
    - Mean Reciprocal Rank: $\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \text{RR}_q$
    - Media dei NDCG: $\frac{1}{|Q|} \sum_{q \in Q} \text{NDCG@K}_q$

### Query (in verde stesso ranking ma proviamo sinonimi)

1. NDCG su dataset movielens ✅
2. Recommender systems Recall su dataset goodbook ✅
3. Recommender systems MRR ✅
4. Deep Learning dataset Apple Flower ✅
5. Deep Learning GPT3 precision f1 ✅
6. Deep Learning GPT3 precision f-measure ✅

In [1]:
import json

results_file = "./results.json"
ground_truth_path = "./ground_truth"
num_queries = 6

# model -> method -> query -> (position, id) 
results: dict[str, dict[str, dict[str, dict[str, str]]]] = {}

# query -> (position, id) 
ground_truth: dict[str, dict[str, str]] = {}

with open(results_file, "r", encoding="utf-8") as file:
    results = json.load(file)
    
for i in range(1, num_queries + 1):
    query_id = f"q{i}"
    with open(ground_truth_path + f"/{query_id}_rank.json", "r", encoding="utf-8") as file:
        ground_truth[query_id] = json.load(file)
            

## MRR

In [2]:
mrr_values: dict[str, dict[str, float]] = {}

for model, methods in results.items():
    mrr_values[model] = {}
    for method, queries in methods.items():
        sum_rr = 0
        not_founds = 0
        
        for query_id, ranking in queries.items():
            best_table: str = ground_truth[query_id]["1"]
            best_table_id = best_table["paper_id"] + "#" + best_table["table_id"]
            
            rr = 0
            for pos, table_id in ranking.items():
                if (table_id == best_table_id): rr = 1.0 / float(pos)
            
            if rr == 0: not_founds += 1
            sum_rr += rr
            
            
        
        mrr = sum_rr / num_queries
        mrr_values[model][method] = mrr
        print(f"MRR value for {model} using method: {method} is {mrr} --- best not found in {not_founds}/{num_queries} queries.")
        

MRR value for lucene using method: bm25 is 0.0 --- best not found in 6/6 queries.
MRR value for sentence-transformers/sentence-t5-large using method: tab_embedding is 0.027777777777777776 --- best not found in 5/6 queries.
MRR value for sentence-transformers/sentence-t5-large using method: tab_cap_embedding is 0.05555555555555555 --- best not found in 5/6 queries.
MRR value for sentence-transformers/sentence-t5-large using method: tab_cap_ref_embedding is 0.041666666666666664 --- best not found in 5/6 queries.
MRR value for sentence-transformers/sentence-t5-large using method: weighted_embedding is 0.041666666666666664 --- best not found in 5/6 queries.
MRR value for bert-base-uncased using method: tab_embedding is 0.020833333333333332 --- best not found in 5/6 queries.
MRR value for bert-base-uncased using method: tab_cap_embedding is 0.013888888888888888 --- best not found in 5/6 queries.
MRR value for bert-base-uncased using method: tab_cap_ref_embedding is 0.0 --- best not found in

In [None]:
import math

K = 15
ndcg_values: dict[str, dict[str, float]] = {}

idcg_values: dict[str, float] = {}
rel_values: dict[str, dict[str, float]] = {}

for query_id, ranking in ground_truth.items():    
    idcg = 0
    rel_values[query_id] = {}

    for i in range(1, K + 1):
        rank_data = ground_truth[query_id][str(i)]
        table_id = rank_data["paper_id"] + "#" + rank_data["table_id"]
        rel = float(rank_data["rel"])

        idcg += rel / math.log2(i + 1)
        rel_values[query_id][table_id] = rel
    
    idcg_values[query_id] = idcg

for model, methods in results.items():
    ndcg_values[model] = {}
    for method, queries in methods.items():
        
        for query_id, ranking in queries.items():
            dcg = 0
            for pos, table_id in ranking.items():
                table_data = results[model][method][query_id][pos]
                try:
                    rel = rel_values[query_id][table_id]
                except KeyError:
                    rel = 0
                
                dcg += rel / math.log2(int(pos) + 1)
            ndcg = dcg / idcg_values[query_id]
            if math.isnan(ndcg): ndcg = 0
            ndcg_values[model][method] = ndcg
            print(f"NDCG@{K} for query {query_id} using {model} with method {method} is {ndcg}.")

DCG for lucene using method bm25 with query is 0.
NDCG@15 for query q1 using lucene with method bm25 is 0.0.
DCG for lucene using method bm25 with query is 0.
NDCG@15 for query q2 using lucene with method bm25 is 0.0.
DCG for lucene using method bm25 with query is 0.
NDCG@15 for query q3 using lucene with method bm25 is 0.0.
DCG for lucene using method bm25 with query is 0.
NDCG@15 for query q4 using lucene with method bm25 is 0.0.
DCG for lucene using method bm25 with query is 0.
NDCG@15 for query q5 using lucene with method bm25 is 0.0.
DCG for lucene using method bm25 with query is 0.
NDCG@15 for query q6 using lucene with method bm25 is 0.0.
DCG for sentence-transformers/sentence-t5-large using method tab_embedding with query is 0.34867868206391234.
NDCG@15 for query q1 using sentence-transformers/sentence-t5-large with method tab_embedding is 0.020280614244343485.
DCG for sentence-transformers/sentence-t5-large using method tab_embedding with query is 0.725569216607176.
NDCG@15 fo

In [4]:
# Precision at k
K = 5
precision_values: dict[str, dict[str, float]] = {}

for model, methods in results.items():
    precision_values[model] = {}
    for method, queries in methods.items():
        sum_precision = 0
        for query_id, ranking in queries.items():
            relevant = 0
            for pos, table_id in ranking.items():
                table_data = results[model][method][query_id][pos]
                try:
                    rel = rel_values[query_id][table_id]
                except KeyError:
                    rel = 0
                
                if rel > 0: relevant += 1
            
            precision = relevant / K
            sum_precision += precision
            #print(f"Precision@{K} for query {query_id} using {model} and method {method} is {precision}.")
        
        precision_values[model][method] = sum_precision / num_queries
        print(f"Avg Precision@{K} for {model} using method {method} is {precision_values[model][method]}.")

Avg Precision@5 for lucene using method bm25 is 0.0.
Avg Precision@5 for sentence-transformers/sentence-t5-large using method tab_embedding is 0.3666666666666667.
Avg Precision@5 for sentence-transformers/sentence-t5-large using method tab_cap_embedding is 0.4666666666666666.
Avg Precision@5 for sentence-transformers/sentence-t5-large using method tab_cap_ref_embedding is 0.5333333333333333.
Avg Precision@5 for sentence-transformers/sentence-t5-large using method weighted_embedding is 0.5666666666666668.
Avg Precision@5 for bert-base-uncased using method tab_embedding is 0.3.
Avg Precision@5 for bert-base-uncased using method tab_cap_embedding is 0.5.
Avg Precision@5 for bert-base-uncased using method tab_cap_ref_embedding is 0.36666666666666664.
Avg Precision@5 for bert-base-uncased using method weighted_embedding is 0.43333333333333335.
Avg Precision@5 for distilbert-base-uncased using method tab_embedding is 0.3.
Avg Precision@5 for distilbert-base-uncased using method tab_cap_embed

In [5]:
# MAP@K
K = 15
map_values: dict[str, dict[str, float]] = {}

for model, methods in results.items():
    map_values[model] = {}
    for method, queries in methods.items():
        sum_ap = 0
        for query_id, ranking in queries.items():
            relevant = 0
            sum_precisions = 0
            for pos, table_id in ranking.items():
                table_data = results[model][method][query_id][pos]
                try:
                    rel = rel_values[query_id][table_id]
                except KeyError:
                    rel = 0
                
                if rel > 0:
                    relevant += 1
                    sum_precisions += relevant / int(pos)
            
            ap = sum_precisions / relevant if relevant > 0 else 0
            sum_ap += ap
        
        map_value = sum_ap / num_queries
        map_values[model][method] = map_value
        print(f"MAP@{K} for {model} using method: {method} is {map_value}.")

MAP@15 for lucene using method: bm25 is 0.0.
MAP@15 for sentence-transformers/sentence-t5-large using method: tab_embedding is 0.4068181818181818.
MAP@15 for sentence-transformers/sentence-t5-large using method: tab_cap_embedding is 0.3290963665963666.
MAP@15 for sentence-transformers/sentence-t5-large using method: tab_cap_ref_embedding is 0.4362830687830688.
MAP@15 for sentence-transformers/sentence-t5-large using method: weighted_embedding is 0.35884023384023384.
MAP@15 for bert-base-uncased using method: tab_embedding is 0.20338689088689088.
MAP@15 for bert-base-uncased using method: tab_cap_embedding is 0.24989061864061865.
MAP@15 for bert-base-uncased using method: tab_cap_ref_embedding is 0.38847772597772595.
MAP@15 for bert-base-uncased using method: weighted_embedding is 0.3718253968253968.
MAP@15 for distilbert-base-uncased using method: tab_embedding is 0.2976545676545676.
MAP@15 for distilbert-base-uncased using method: tab_cap_embedding is 0.31247594997594996.
MAP@15 for d