# Valutazione

1. Creare tot sistemi diversi (per esempio, lucene, bert, scibert, tabert (con e senza contesto))
2. Creare un sottoinsieme dei paper (tipo 20) da usare come ground truth (o a caso oppure con lucene i più rilevanti per argomento che abbiano tabelle interessanti) - cerchiamo di limitare il numero di tabelle a ca. 50
3. Per ogni query $q \in Q$ (min 5):
    1. Fare il ranking a mano delle tabelle
        - Salviamo i ranking per ogni query in un json, con le informazioni rilevanti, tipo il ranking, il valore di rilevanza per ogni elemento etc.
    2. Interrogare ogni sistema sulla query
    3. Calcolare le metriche: 
        - Reciprocal Rank: $\text{RR}_q = \frac{1}{rank_i}$ dove $i$ è l’elemento più rilevante.
            - nella pratica possiamo controllare se l’elemento scelto dal motore ha almeno lo score massimo (potrebbero esserci dei parimerito)
        - Normalized Discounted Cumulative Gain con taglio $\text{K} = \set{5,15}$:
            
            $$
            \text{NDCG@K}_q = \frac{\text{DCG@K}_q}{\text{IDCG@K}_q}
            $$
            
            - dove dividiamo il $\text{DCG@K}_q = rel_1 + \sum_{i=2}^K \frac{rel_i}{\log_2 (i + 1)}$ con quello ideale, cioè dove il ranking è il migliore possibile
4. Calcolare la media delle metriche:
    - Mean Reciprocal Rank: $\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \text{RR}_q$
    - Media dei NDCG: $\frac{1}{|Q|} \sum_{q \in Q} \text{NDCG@K}_q$

### Query (in verde stesso ranking ma proviamo sinonimi)

1. NDCG su dataset movielens ✅
2. Recommender systems Recall su dataset goodbook ✅
3. Recommender systems MRR ✅
4. Deep Learning dataset Apple Flower ✅
5. Deep Learning GPT3 precision f1 ✅
6. Deep Learning GPT3 precision f-measure ✅

In [49]:
import json
import re

results_file = "./results_hybrid.json"
ground_truth_path = "./ground_truth"
num_queries = 6

# model -> method -> query -> (position, id) 
results: dict[str, dict[str, dict[str, dict[str, str]]]] = {}

# query -> (position, table[table_id, query_id, rel]) 
ground_truth: dict[str, dict[str, dict[str, str]]] = {}

with open(results_file, "r", encoding="utf-8") as file:
    results = json.load(file)
    
for i in range(1, num_queries + 1):
    query_id = f"q{i}"
    with open(ground_truth_path + f"/{query_id}_rank.json", "r", encoding="utf-8") as file:
        ground_truth[query_id] = json.load(file)
            

def compare_id(id1: str, id2: str) -> bool:
    id1 = str.lower(re.sub(r'v\d+', '', id1))
    id2 = str.lower(re.sub(r'v\d+', '', id2))

    return id1 == id2

## MRR

In [50]:
mrr_values: dict[str, dict[str, float]] = {}

for model, methods in results.items():
    mrr_values[model] = {}
    for method, queries in methods.items():
        sum_rr = 0
        not_founds = 0
        
        for query_id, ranking in queries.items():
            best_tables_ids: list[str] = []
            best_table =  ground_truth[query_id]["1"]
            
            # this checks for equal relevance tables other than the first position
            for pos, table in ground_truth[query_id].items():
                if table["rel"] == best_table["rel"]:
                    best_tables_ids.append(table["paper_id"] + "#" + table["table_id"])
            
            rr = 0
            for pos, table_id in ranking.items():
                if (table_id in best_tables_ids): rr = 1.0 / float(pos)
            
            if rr == 0: not_founds += 1
            sum_rr += rr
            
        mrr = sum_rr / num_queries
        mrr_values[model][method] = mrr
        print(f"MRR value for {model} using method: {method} is {mrr} --- best not found in {not_founds}/{num_queries} queries.")
        

MRR value for lucene using method: bm25 is 0.20833333333333334 --- best not found in 4/6 queries.
MRR value for bert-base-uncased using method: tab_embedding is 0.05000000000000001 --- best not found in 4/6 queries.
MRR value for bert-base-uncased using method: tab_cap_embedding is 0.05555555555555555 --- best not found in 4/6 queries.
MRR value for bert-base-uncased using method: tab_cap_ref_embedding is 0.19047619047619047 --- best not found in 4/6 queries.
MRR value for bert-base-uncased using method: weighted_embedding is 0.047619047619047616 --- best not found in 4/6 queries.
MRR value for distilbert-base-uncased using method: tab_embedding is 0.03333333333333333 --- best not found in 5/6 queries.
MRR value for distilbert-base-uncased using method: tab_cap_embedding is 0.05416666666666667 --- best not found in 4/6 queries.
MRR value for distilbert-base-uncased using method: tab_cap_ref_embedding is 0.125 --- best not found in 4/6 queries.
MRR value for distilbert-base-uncased usin

In [None]:
import math

K = 15
idcg_values: dict[str, float] = {}

for query_id, ranking in ground_truth.items():    
    idcg = 0

    for i in range(1, K + 1):
        table = ground_truth[query_id][str(i)]
        table_id = table["paper_id"] + "#" + table["table_id"]
        rel: float = float(table["rel"])

        idcg += rel / math.log2(i + 1)
    
    idcg_values[query_id] = idcg

for model, methods in results.items():
    for method, queries in methods.items():
        print(f"============== {model} --> {method} ==============")
        sum_ndcg: float = 0
        for query_id, ranking in queries.items():
            dcg = 0
            for pos, table_id in ranking.items():
                rel: float = 0
                
                for _, gt_table in ground_truth[query_id].items():
                   gt_table_id = gt_table["paper_id"] + "#" + gt_table["table_id"]
                   if compare_id(table_id, gt_table_id):
                       rel = float(gt_table["rel"])
                
                dcg += rel / math.log2(int(pos) + 1)
                
            ndcg = dcg / idcg_values[query_id] if dcg / idcg_values[query_id] <= 1 else 1
            sum_ndcg += ndcg
            
            #print(f"NDCG@{K} for query {query_id} is {ndcg}.")
        print(f"Average NDCG@{K} is {sum_ndcg / num_queries}.\n\n")

Average NDCG@5 is 0.9214215230998138.


Average NDCG@5 is 0.6241123305534374.


Average NDCG@5 is 0.5993533732831239.


Average NDCG@5 is 0.5559778214570784.


Average NDCG@5 is 0.5665435088985525.


Average NDCG@5 is 0.6809919308527355.


Average NDCG@5 is 0.6797787542922435.


Average NDCG@5 is 0.5131625572007309.


Average NDCG@5 is 0.5816040473636793.


Average NDCG@5 is 0.5177521424429389.


Average NDCG@5 is 0.46056144447118824.


Average NDCG@5 is 0.3818974304306249.


Average NDCG@5 is 0.5231405709447273.


Average NDCG@5 is 0.8898029643540261.


Average NDCG@5 is 0.8450019325705456.


Average NDCG@5 is 0.9337763415287165.


Average NDCG@5 is 0.8958585116891989.


Average NDCG@5 is 0.8114090068950669.


Average NDCG@5 is 0.929145052516951.


Average NDCG@5 is 0.906092866266504.


Average NDCG@5 is 0.8642840030473726.


Average NDCG@5 is 0.8874118798157687.


Average NDCG@5 is 0.8415450985092011.


Average NDCG@5 is 0.9096234993857024.


Average NDCG@5 is 0.89637124514452.


Ave

In [52]:
# Precision at k
def precision_at_k(k: int):
    precision_values: dict[str, dict[str, dict[str, float]]] = {}

    for model, methods in results.items():
        precision_values[model] = {}
        for method, queries in methods.items():
            precision_values[model][method] = {}
            for query_id, ranking in queries.items():
                relevant = 0
                for pos, table_id in ranking.items():
                    if int(pos) > k: break
                    
                    rel: float = 0
                    
                    # find relevance
                    for _, gt_table in ground_truth[query_id].items():
                        gt_table_id = gt_table["paper_id"] + "#" + gt_table["table_id"]
                        if compare_id(table_id, gt_table_id):
                            rel = float(gt_table["rel"])
                    
                    if rel > 0: relevant += 1
                
                precision = relevant / k
                precision_values[model][method][query_id] = precision
    
    return precision_values

# Avg Precision at K
K = 15
# model -> method -> query, avg_precision@k
ap_values: dict[str, dict[str, dict[str, float]]] = {}

for model, methods in results.items():
    ap_values[model] = {}
    for method, queries in methods.items():
        ap_values[model][method] = {}
        
        for query_id, ranking in queries.items():
            sum_p = 0
            
            for k in range(1, K + 1):
                precision_values_at_k = precision_at_k(k)
                sum_p += precision_values_at_k[model][method][query_id]
                
            ap_values[model][method][query_id] = sum_p / K
            print(f"Avg Precision@{K} of {model} with {method} for query {query_id} is: {ap_values[model][method][query_id]}")

Avg Precision@15 of lucene with bm25 for query q1 is: 0.8353537203537204
Avg Precision@15 of lucene with bm25 for query q2 is: 0.49310522810522806
Avg Precision@15 of lucene with bm25 for query q3 is: 0.64999000999001
Avg Precision@15 of lucene with bm25 for query q4 is: 0.5959721759721762
Avg Precision@15 of lucene with bm25 for query q5 is: 0.909818144818145
Avg Precision@15 of lucene with bm25 for query q6 is: 0.9507688607688608
Avg Precision@15 of bert-base-uncased with tab_embedding for query q1 is: 0.5697006697006698
Avg Precision@15 of bert-base-uncased with tab_embedding for query q2 is: 0.20914270914270913
Avg Precision@15 of bert-base-uncased with tab_embedding for query q3 is: 0.4165806415806415
Avg Precision@15 of bert-base-uncased with tab_embedding for query q4 is: 0.6762839012839011
Avg Precision@15 of bert-base-uncased with tab_embedding for query q5 is: 0.5587747437747439
Avg Precision@15 of bert-base-uncased with tab_embedding for query q6 is: 0.5703315203315202
Avg P

In [53]:
# MAP@K (K è quello sopra)
map_values: dict[str, dict[str, float]] = {}

for model, methods in results.items():
    map_values[model] = {}
    for method, queries in methods.items():
        sum_ap = 0
        
        for avg_prec in ap_values[model][method].values():
            sum_ap += avg_prec
        
        map_value = sum_ap / num_queries
        map_values[model][method] = map_value
            
        print(f"MAP@{K} for {model} using method: {method} is {map_value}.")

MAP@15 for lucene using method: bm25 is 0.7391680233346901.
MAP@15 for bert-base-uncased using method: tab_embedding is 0.5001356976356978.
MAP@15 for bert-base-uncased using method: tab_cap_embedding is 0.5034342817676152.
MAP@15 for bert-base-uncased using method: tab_cap_ref_embedding is 0.4559785584785585.
MAP@15 for bert-base-uncased using method: weighted_embedding is 0.4619518752852086.
MAP@15 for distilbert-base-uncased using method: tab_embedding is 0.5479131979131979.
MAP@15 for distilbert-base-uncased using method: tab_cap_embedding is 0.5390943007609674.
MAP@15 for distilbert-base-uncased using method: tab_cap_ref_embedding is 0.4154117487450821.
MAP@15 for distilbert-base-uncased using method: weighted_embedding is 0.4787630887630887.
MAP@15 for allenai/scibert_scivocab_uncased using method: tab_embedding is 0.37687722771056104.
MAP@15 for allenai/scibert_scivocab_uncased using method: tab_cap_embedding is 0.35901771068437743.
MAP@15 for allenai/scibert_scivocab_uncased us