# BERTScore-based ROUGE-Style Evaluation

__Author__: Cody Buntain (cbuntain@umd.edu)

## Description

For automated evaluation in CrisisFACTS, we compare participant-system summaries to three additional sources of event summaries:

1. Wikipedia - A simple summary of each event, though we expect these summaries to be of limited use for situational awareness, attention support, or decision making.

2. ICS 209 Archive - A dataset of real daily hazard reports, gathered from Lise St. Denis. This data comes from a pre-release version of their updated NIMS database.
    
3. NIST Assessor Summaries - A dataset of event summaries generated by NIST assessors, where CrisisFACTS coordinators asked NIST assessors to identify and timestamp important facts from each event.

We use BERTScore to compare the top-k most important facts from each participant system to each of the above summaries.
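As a rough illustration of what BERTScore measures, the sketch below implements its greedy-matching precision, recall, and F1 over token embeddings, without the optional IDF weighting. The two-dimensional vectors are made up for the example, and the helper names `cosine` and `greedy_match_scores` are hypothetical; the real `bert_score` package derives contextual embeddings from a transformer model instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(candidate_vecs, reference_vecs):
    """Greedy-matching precision, recall, and F1 (no IDF weighting)."""
    # Precision: match each candidate token to its most similar reference token.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: match each reference token to its most similar candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy embeddings (illustrative values only, not produced by any model).
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]

p, r, f = greedy_match_scores(candidate, reference)
print(p, r)  # 0.5 1.0
```

Here the second candidate token has no similar reference token, so precision drops to 0.5, while recall stays at 1.0 because the single reference token is matched exactly.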

In [2]:
import pandas as pd
import numpy as np
import json
import glob
import gzip

import scipy.stats

import matplotlib.pyplot as plt

In [None]:
!pip install bert-score

In [None]:
import bert_score
bert_score.__version__

<hr>
Gold summaries, generated by the `00-CreateMultiSummaries` script.

In [4]:
with gzip.open("gold.summaries.json.gz", "rb") as in_file:
    summaries = json.load(in_file)

In [6]:
with open("CrisisFACTs-2022.facts.json", "r") as in_file:
    facts = json.load(in_file)

We use the CrisisFACTS 2022 fact list from NIST assessors to determine the number of facts *per day*.

We use this count as the "depth" k: for each day, we keep only the k most important facts from each participant system.

For example, if a system returns 1,000 facts but the NIST assessors identified only 417 facts for that event-day pair, we take the 417 most important facts, as ranked by the participant system.
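A minimal sketch of this truncation, assuming each submitted fact carries the `importance` and `factText` fields used later in this notebook. The helper name `top_k_facts` and the toy facts are hypothetical.

```python
def top_k_facts(facts, k):
    """Return the text of the k highest-importance facts, most important first."""
    ranked = sorted(facts, key=lambda fact: fact["importance"], reverse=True)
    return [fact["factText"] for fact in ranked[:k]]

# Toy system output for one event-day pair (made-up values).
system_facts = [
    {"factText": "Evacuations ordered", "importance": 0.9},
    {"factText": "Road closed", "importance": 0.4},
    {"factText": "Fire 10% contained", "importance": 0.7},
]

# Suppose the assessors identified 2 facts for this event-day pair.
assessor_fact_count = 2
selected = top_k_facts(system_facts, assessor_fact_count)
print(selected)  # ['Evacuations ordered', 'Fire 10% contained']
```

If a system returns fewer facts than the assessor count, the slice simply returns everything the system submitted.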

In [7]:
event_request_fact_count_map = {}

day_count = 0 
total_fact_count = 0
for event in facts:
    event_name = event["event"]
    event_id = event["eventID"]
    event_requests = event["summaryRequests"]
    event_factsXrequests = event["factsByRequest"]

    print(event_id, event_name)
    for event_request in event_requests:
        req_id = event_request["requestID"]        
        this_req_facts = event_factsXrequests[req_id]
        fact_count = len(this_req_facts)
        fact_collection = [fact["fact"] for fact in this_req_facts]
        
        print("\t", req_id, fact_count)
        event_request_fact_count_map[req_id] = fact_count
        
        total_fact_count+=fact_count
        day_count+=1

CrisisFACTS-001 Lilac Wildfire 2017
	 CrisisFACTS-001-r3 267
	 CrisisFACTS-001-r4 75
	 CrisisFACTS-001-r5 14
	 CrisisFACTS-001-r6 29
	 CrisisFACTS-001-r7 19
	 CrisisFACTS-001-r8 5
	 CrisisFACTS-001-r9 3
	 CrisisFACTS-001-r10 3
	 CrisisFACTS-001-r11 2
CrisisFACTS-002 Cranston Wildfire 2018
	 CrisisFACTS-002-r1 27
	 CrisisFACTS-002-r2 10
	 CrisisFACTS-002-r3 4
	 CrisisFACTS-002-r4 13
	 CrisisFACTS-002-r5 7
	 CrisisFACTS-002-r6 1
CrisisFACTS-003 Holy Wildfire 2018
	 CrisisFACTS-003-r5 37
	 CrisisFACTS-003-r6 42
	 CrisisFACTS-003-r7 39
	 CrisisFACTS-003-r8 37
	 CrisisFACTS-003-r9 9
	 CrisisFACTS-003-r10 17
	 CrisisFACTS-003-r11 4
CrisisFACTS-004 Hurricane Florence 2018
	 CrisisFACTS-004-r8 5
	 CrisisFACTS-004-r9 5
	 CrisisFACTS-004-r10 2
	 CrisisFACTS-004-r11 4
	 CrisisFACTS-004-r12 8
	 CrisisFACTS-004-r13 15
	 CrisisFACTS-004-r14 55
	 CrisisFACTS-004-r15 26
	 CrisisFACTS-004-r16 14
	 CrisisFACTS-004-r17 37
	 CrisisFACTS-004-r18 46
	 CrisisFACTS-004-r19 3
	 CrisisFACTS-004-r20 6
	 CrisisFA

<hr>

For each submission, we iterate through each event. For each event, we take the top facts for each day and add them to a running summary for that event. After constructing the full event summary across all days, we use `bert_score` to score the full event summary.

NOTE: We do not evaluate daily summaries because Wikipedia provides only top-level summaries, not daily ones.
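The aggregation described above can be sketched as follows, assuming request IDs take the form `<eventID>-r<N>` as in the fact counts printed earlier. The helper name `build_event_summaries` and the toy fact texts are hypothetical.

```python
def build_event_summaries(top_facts_by_request):
    """Concatenate per-day top facts into one summary string per event."""
    facts_by_event = {}
    for request_id, fact_texts in top_facts_by_request.items():
        # "CrisisFACTS-001-r3" -> "CrisisFACTS-001"
        event_id = request_id.rpartition("-")[0]
        facts_by_event.setdefault(event_id, []).extend(fact_texts)
    return {event_id: " ".join(texts) for event_id, texts in facts_by_event.items()}

# Toy per-day top facts (made-up text).
toy_top_facts = {
    "CrisisFACTS-001-r3": ["Fire started.", "Evacuations ordered."],
    "CrisisFACTS-001-r4": ["Fire contained."],
    "CrisisFACTS-002-r1": ["New fire reported."],
}

summaries_by_event = build_event_summaries(toy_top_facts)
print(summaries_by_event["CrisisFACTS-001"])
# Fire started. Evacuations ordered. Fire contained.
```

Each resulting string is what would be passed to BERTScore as the single candidate text for its event.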

In [23]:
submission_metrics = {}

In [52]:
# Take the top-k facts from each run and each event-request pair per run
event_request_fact_list = {k:{} for k in event_request_fact_count_map.keys()}
for f in glob.glob("submissions.*/*.json.gz"):
    
    this_run_id = f.partition("/")[-1].replace(".json.gz", "")
    print(f, "-->", this_run_id)
    
    this_run_event_request_facts = {k:[] for k in event_request_fact_count_map.keys()}
    with gzip.open(f, "r") as in_file:
        for line_ in in_file:
            line = line_.decode("utf8")
            
            entry = json.loads(line)
            
            this_run_event_request_facts[entry["requestID"]].append(entry)
            
    event_summaries = {s["eventID"]:[] for s in summaries}
    for event_request,this_fact_list in this_run_event_request_facts.items():
        event_id = event_request.rpartition("-")[0]
        
        sorted_fact_list = sorted(this_fact_list, key=lambda v: v["importance"], reverse=True)
        
        this_event_request_k = event_request_fact_count_map[event_request]
        this_day_summary = [this_top_fact["factText"] for this_top_fact in sorted_fact_list[:this_event_request_k]]
        
        event_summaries[event_id] = event_summaries[event_id] + this_day_summary
        

    ics_dfs = []
    wiki_dfs = []
    nist_dfs = []
    for event in summaries:
        event_id = event["eventID"]
        
        this_submitted_summary = event_summaries[event_id]

        this_summary_text = " ".join(this_submitted_summary)
        print(event_id, len(this_summary_text))
        
        nist_summary = event["nist.summary"]
        wiki_summary = event["wiki.summary"]
        ics_summary = event.get("ics.summary", "")

        nist_metric_ = bert_score.score([this_summary_text], [nist_summary], model_type="microsoft/deberta-xlarge-mnli")
        wiki_metric_ = bert_score.score([this_summary_text], [wiki_summary], model_type="microsoft/deberta-xlarge-mnli")
        ics_metric_ = bert_score.score([this_summary_text], [ics_summary], model_type="microsoft/deberta-xlarge-mnli")
        
        nist_metric = {
            "precision": nist_metric_[0],
            "recall": nist_metric_[1],
            "f1": nist_metric_[2],
        }
        
        wiki_metric = {
            "precision": wiki_metric_[0],
            "recall": wiki_metric_[1],
            "f1": wiki_metric_[2],
        }
        
        ics_metric = {
            "precision": ics_metric_[0],
            "recall": ics_metric_[1],
            "f1": ics_metric_[2],
        }
        
        this_ics_df = pd.DataFrame([{"metric":k, "value":v.item(), "event": event_id} for k,v in ics_metric.items()])
        this_wiki_df = pd.DataFrame([{"metric":k, "value":v.item(), "event": event_id} for k,v in wiki_metric.items()])
        this_nist_df = pd.DataFrame([{"metric":k, "value":v.item(), "event": event_id} for k,v in nist_metric.items()])
        
        ics_dfs.append(this_ics_df)
        wiki_dfs.append(this_wiki_df)
        nist_dfs.append(this_nist_df)
        
    full_ics_df = pd.concat(ics_dfs)
    full_wiki_df = pd.concat(wiki_dfs)
    full_nist_df = pd.concat(nist_dfs)
    
    submission_metrics[this_run_id] = {
        "ics": full_ics_df,
        "wiki": full_wiki_df,
        "nist": full_nist_df,
    }
    
    # Average only the numeric "value" column; the string "event" column cannot be averaged
    display(full_nist_df.groupby("metric")[["value"]].mean())


submissions.abstractive/unicamp.NM-2.json.gz --> unicamp.NM-2
CrisisFACTS-001 75054


Some weights of the model checkpoint at microsoft/deberta-xlarge-mnli were not used when initializing DebertaModel: ['classifier.weight', 'pooler.dense.weight', 'pooler.dense.bias', 'classifier.bias']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
  query_layer = query_layer / torch.tensor(scale, dtype=query_layer.dtype)
  p2c_att = torch.matmul(key_layer, torch.tensor(pos_query_layer.transpose(-1, -2), dtype=key_layer.dtype))
Some weights of the model checkpoint at microsoft/deberta-xlarge-mnli were not used when initializing DebertaModel: ['classifier.weight', 'pooler.dense.weigh

CrisisFACTS-002 9286



CrisisFACTS-003 35571



CrisisFACTS-004 56480



CrisisFACTS-005 17367



CrisisFACTS-006 15705



CrisisFACTS-007 21142



CrisisFACTS-008 46870



| metric | value |
|---|---|
| f1 | 0.557346 |
| precision | 0.571183 |
| recall | 0.544588 |


submissions.abstractive/unicamp.NM-1.json.gz --> unicamp.NM-1
CrisisFACTS-001 75054



CrisisFACTS-002 9286



CrisisFACTS-003 35571



CrisisFACTS-004 56480



CrisisFACTS-005 17367



CrisisFACTS-006 15705



CrisisFACTS-007 21142



CrisisFACTS-008 46873



| metric | value |
|---|---|
| f1 | 0.557346 |
| precision | 0.571183 |
| recall | 0.544588 |


submissions.extractive/eXSum22.eXSum22_submission_02.json.gz --> eXSum22.eXSum22_submission_02
CrisisFACTS-001 37634



KeyboardInterrupt: 

In [None]:
!mkdir -p evaluation.output.bertscore

Save the evaluation for each participant system to its own file.

In [53]:
all_runs = []
for k,v in submission_metrics.items():
    print(k)
    
    stackable = []
    for comparator,ldf in v.items():
        stackable_ldf = ldf.copy()
        stackable_ldf["target.summary"] = comparator

        stackable.append(stackable_ldf)

    this_run_df = pd.concat(stackable)
    this_run_df["run"] = k
    
    all_runs.append(this_run_df)
    this_run_df.to_csv("evaluation.output.bertscore/%s.csv" % k, index=False)
    
all_runs_df = pd.concat(all_runs)
all_runs_df.to_csv("evaluation.output.bertscore/all_runs.csv", index=False)

eXSum22.eXSum22_submission_02
IISER22.submission_final.json
baseline.run1
ohm_kiz.BM25_QAcrisis_ILP
umcp.rr_now
ohm_kiz.BM25_QAasnq_ILP
IRIT_IRIS.IRIT_IRIS_mean_USE
baseline.run2
eXSum22.eXSum22_submission_01
umcp.combsum
umcp.mrr_nobrf
IISER22.submission_final_4
umcp.mrr_sum
umcp.mrr_all
ohm_kiz.ColBERT_ILP
SiPEO.nazmultum11
IISER22.submission_LM_DS_3
IRIT_IRIS.IRIT_IRIS_tssubert
IISER22.submission_LM_JM_2
umcp.mrr_no_dd
ohm_kiz.BM25_Heuristic_ILP
IRIT_IRIS.IRIT_IRIS_mean_USE_INeeds
umcp.mrr_main
unicamp.NM-2
unicamp.NM-1


Summarize the evaluation data and store its summary for each of the three gold-standard summaries.
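To make the aggregation explicit, here is a pandas-free sketch of the same idea: for one target summary, average each run's F1 over events and rank the runs from best to worst. The row fields mirror the columns of the per-run CSV files; the helper name `rank_runs_by_mean_f1`, the run names, and the scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def rank_runs_by_mean_f1(rows, target):
    """Return (run, mean F1) pairs for one target summary, best first."""
    f1_by_run = defaultdict(list)
    for row in rows:
        if row["target.summary"] == target and row["metric"] == "f1":
            f1_by_run[row["run"]].append(row["value"])
    means = {run: mean(values) for run, values in f1_by_run.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)

# Toy evaluation rows (illustrative values only).
toy_rows = [
    {"run": "run_a", "event": "E1", "metric": "f1", "value": 0.50, "target.summary": "nist"},
    {"run": "run_a", "event": "E2", "metric": "f1", "value": 0.60, "target.summary": "nist"},
    {"run": "run_b", "event": "E1", "metric": "f1", "value": 0.70, "target.summary": "nist"},
    {"run": "run_b", "event": "E1", "metric": "recall", "value": 0.10, "target.summary": "nist"},
    {"run": "run_b", "event": "E1", "metric": "f1", "value": 0.20, "target.summary": "wiki"},
]

ranking = rank_runs_by_mean_f1(toy_rows, "nist")
for run, score in ranking:
    print(run, round(score, 3))
```

Rows for other targets and other metrics are ignored, so `run_b` ranks first against the NIST summaries on the strength of its single F1 value.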

In [54]:
target_summaries = {}
for target in ["ics", "wiki", "nist"]:
    this_target_df = all_runs_df[all_runs_df["target.summary"] == target]
    
    index = []
    rows = []
    for run_name,group in this_target_df.groupby("run"):
        print(run_name)
        this_row = group.pivot(index="event", columns="metric", values="value").mean()
        rows.append(this_row)
        index.append(run_name)

    summary_df = pd.DataFrame(rows, index=index)[[
        "f1", 
    ]]

    final_df = summary_df.sort_values(by="f1", ascending=False)
    final_df.to_csv("evaluation.output.bertscore/%s.summary.csv" % target)
    
    target_summaries[target] = final_df

IISER22.submission_LM_DS_3
IISER22.submission_LM_JM_2
IISER22.submission_final.json
IISER22.submission_final_4
IRIT_IRIS.IRIT_IRIS_mean_USE
IRIT_IRIS.IRIT_IRIS_mean_USE_INeeds
IRIT_IRIS.IRIT_IRIS_tssubert
SiPEO.nazmultum11
baseline.run1
baseline.run2
eXSum22.eXSum22_submission_01
eXSum22.eXSum22_submission_02
ohm_kiz.BM25_Heuristic_ILP
ohm_kiz.BM25_QAasnq_ILP
ohm_kiz.BM25_QAcrisis_ILP
ohm_kiz.ColBERT_ILP
umcp.combsum
umcp.mrr_all
umcp.mrr_main
umcp.mrr_no_dd
umcp.mrr_nobrf
umcp.mrr_sum
umcp.rr_now
unicamp.NM-1
unicamp.NM-2
IISER22.submission_LM_DS_3
IISER22.submission_LM_JM_2
IISER22.submission_final.json
IISER22.submission_final_4
IRIT_IRIS.IRIT_IRIS_mean_USE
IRIT_IRIS.IRIT_IRIS_mean_USE_INeeds
IRIT_IRIS.IRIT_IRIS_tssubert
SiPEO.nazmultum11
baseline.run1
baseline.run2
eXSum22.eXSum22_submission_01
eXSum22.eXSum22_submission_02
ohm_kiz.BM25_Heuristic_ILP
ohm_kiz.BM25_QAasnq_ILP
ohm_kiz.BM25_QAcrisis_ILP
ohm_kiz.ColBERT_ILP
umcp.combsum
umcp.mrr_all
umcp.mrr_main
umcp.mrr_no_dd
umcp.mrr