# ROUGE-Style Evaluation

__Author__: Cody Buntain (cbuntain@umd.edu)

## Description

For automated evaluation in CrisisFACTS, we compare participant-system summaries to three additional sources of event summaries:

1. Wikipedia - A simple summary of each event, though we expect these summaries to be of limited use for situational awareness, attention support, or decision making.

2. ICS 209 Archive - A dataset of real daily hazard reports, gathered from Lise St. Denis. This data comes from a pre-release version of their updated NIMS database.
    
3. NIST Assessor Summaries - A dataset of event summaries generated by NIST assessors, where CrisisFACTS coordinators asked NIST assessors to identify and timestamp important facts from each event.

We use ROUGE scores to compare the top-k most important facts from each participant system against each of the above summaries.
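
As a rough illustration of what ROUGE measures, the sketch below computes unigram-overlap (ROUGE-1-style) precision, recall, and F-measure in plain Python. It is a simplification under stated assumptions: whitespace tokenization, lowercasing, and no stemming. The function name `rouge1_sketch` and the example sentences are hypothetical, and the evaluation in this notebook relies on `torchmetrics` rather than this code.

```python
from collections import Counter


def rouge1_sketch(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F-measure (illustrative only)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()

    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())

    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Hypothetical candidate and reference texts.
scores = rouge1_sketch("the fire spread north overnight", "the fire spread quickly")
```

Here three of the five candidate tokens appear in the four-token reference, so precision is 3/5 and recall is 3/4.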

In [2]:
import pandas as pd
import numpy as np
import json
import glob
import gzip

import scipy.stats

import matplotlib.pyplot as plt

In [None]:
!pip install torchmetrics nltk

In [3]:
from torchmetrics.text.rouge import ROUGEScore

<hr>
Gold summaries, generated by the `00-CreateMultiSummaries` script.

In [None]:
with gzip.open("gold.summaries.json.gz", "rb") as in_file:
    summaries = json.load(in_file)

In [None]:
with open("CrisisFACTs-2022.facts.json", "r") as in_file:
    facts = json.load(in_file)

We use the CrisisFACTS 2022 fact list from NIST assessors to determine the number of facts *per day*.

We use this "depth" to select the most important facts from each participant system for that day.

E.g., if a system returns 1000 facts, but the NIST assessors found only 417 facts for that event-day pair, we take the 417 most important facts, as ranked by the participant system.
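
The depth-based cutoff described above can be sketched as follows. This is a minimal example, assuming each submitted fact is a dict with the `importance` and `factText` fields used later in this notebook; the helper name `top_k_facts` and the example facts are hypothetical.

```python
def top_k_facts(system_facts, k):
    """Return the text of the k facts with the highest system-assigned importance."""
    ranked = sorted(system_facts, key=lambda fact: fact["importance"], reverse=True)
    return [fact["factText"] for fact in ranked[:k]]


# Hypothetical system output for one event-day pair.
example_facts = [
    {"factText": "Road closed", "importance": 0.2},
    {"factText": "Evacuation ordered", "importance": 0.9},
    {"factText": "Shelter opened", "importance": 0.5},
]

# Suppose the assessors found two facts for this day, so the depth is 2.
top_two = top_k_facts(example_facts, k=2)
```

If a system returns fewer facts than the depth, the slice simply keeps all of them.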

In [None]:
# Map each summary request (event-day pair) to the number of assessor facts for it
event_request_fact_count_map = {}

day_count = 0 
total_fact_count = 0
for event in facts:
    event_name = event["event"]
    event_id = event["eventID"]
    event_requests = event["summaryRequests"]
    event_factsXrequests = event["factsByRequest"]

    print(event_id, event_name)
    for event_request in event_requests:
        req_id = event_request["requestID"]        
        this_req_facts = event_factsXrequests[req_id]
        fact_count = len(this_req_facts)
        fact_collection = [fact["fact"] for fact in this_req_facts]
        
        print("\t", req_id, fact_count)
        event_request_fact_count_map[req_id] = fact_count
        
        total_fact_count += fact_count
        day_count += 1

<hr>

For each submission, we iterate through each event. For each event, we take the top facts for each day and add them to a running summary for that event. After constructing the full event summary across all days, we use `rouge` to score the full event summary against each gold summary.

NOTE: We do not evaluate daily summaries because Wikipedia provides only event-level summaries, not daily ones.
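
The roll-up from daily facts to one event-level text can be sketched as below. It assumes, as the code in this notebook does, that the event ID is the part of the request ID before the last hyphen; the request IDs and fact texts here are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical top facts per request (event-day pair), already cut to depth.
daily_top_facts = {
    "EVENT-A-r1": ["Fire reported near the ridge."],
    "EVENT-A-r2": ["Evacuation ordered.", "Highway closed."],
    "EVENT-B-r1": ["River crest expected Friday."],
}

event_summaries = defaultdict(list)
for request_id, fact_texts in daily_top_facts.items():
    # Strip the trailing "-<request suffix>" to recover the event ID.
    event_id = request_id.rpartition("-")[0]
    event_summaries[event_id].extend(fact_texts)

# One space-joined text per event, ready to be scored against a gold summary.
event_texts = {
    event_id: " ".join(texts) for event_id, texts in event_summaries.items()
}
```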

In [None]:
rouge = ROUGEScore(
    use_stemmer=True,
)

In [None]:
submission_metrics = {}

In [None]:
# Take the top-k facts from each run and each event-request pair per run
event_request_fact_list = {k:{} for k in event_request_fact_count_map.keys()}
for f in glob.glob("submissions.*/*.json.gz"):
    
    this_run_id = f.partition("/")[-1].replace(".json.gz", "")
    print(f, "-->", this_run_id)
    
    this_run_event_request_facts = {k:[] for k in event_request_fact_count_map.keys()}
    with gzip.open(f, "r") as in_file:
        for line_ in in_file:
            line = line_.decode("utf8")
            
            entry = json.loads(line)
            
            this_run_event_request_facts[entry["requestID"]].append(entry)
            
    event_summaries = {s["eventID"]:[] for s in summaries}
    for event_request,this_fact_list in this_run_event_request_facts.items():
        event_id = event_request.rpartition("-")[0]
        
        sorted_fact_list = sorted(this_fact_list, key=lambda v: v["importance"], reverse=True)
        
        this_event_request_k = event_request_fact_count_map[event_request]
        this_day_summary = [this_top_fact["factText"] for this_top_fact in sorted_fact_list[:this_event_request_k]]
        
        event_summaries[event_id] = event_summaries[event_id] + this_day_summary
        

    ics_dfs = []
    wiki_dfs = []
    nist_dfs = []
    for event in summaries:
        event_id = event["eventID"]
        
        this_submitted_summary = event_summaries[event_id]

        this_summary_text = " ".join(this_submitted_summary)
        print(event_id, len(this_summary_text))
        
        nist_summary = event["nist.summary"]
        wiki_summary = event["wiki.summary"]
        ics_summary = event.get("ics.summary", "")

        nist_metric = rouge(this_summary_text, nist_summary)
        wiki_metric = rouge(this_summary_text, wiki_summary)
        ics_metric = rouge(this_summary_text, ics_summary)
        
        this_ics_df = pd.DataFrame([{"metric":k, "value":v.item(), "event": event_id} for k,v in ics_metric.items()])
        this_wiki_df = pd.DataFrame([{"metric":k, "value":v.item(), "event": event_id} for k,v in wiki_metric.items()])
        this_nist_df = pd.DataFrame([{"metric":k, "value":v.item(), "event": event_id} for k,v in nist_metric.items()])
        
        ics_dfs.append(this_ics_df)
        wiki_dfs.append(this_wiki_df)
        nist_dfs.append(this_nist_df)
        
    full_ics_df = pd.concat(ics_dfs)
    full_wiki_df = pd.concat(wiki_dfs)
    full_nist_df = pd.concat(nist_dfs)
    
    submission_metrics[this_run_id] = {
        "ics": full_ics_df,
        "wiki": full_wiki_df,
        "nist": full_nist_df,
    }
    
    display(full_nist_df.groupby("metric").mean())


In [None]:
!mkdir -p evaluation.output.rouge

<hr>
Save the evaluation for each participant system to its own file.

In [None]:
all_runs = []
for k,v in submission_metrics.items():
    print(k)
    
    stackable = []
    for comparator,ldf in v.items():
        stackable_ldf = ldf.copy()
        stackable_ldf["target.summary"] = comparator

        stackable.append(stackable_ldf)

    this_run_df = pd.concat(stackable)
    this_run_df["run"] = k
    
    all_runs.append(this_run_df)
    this_run_df.to_csv("evaluation.output.rouge/%s.csv" % k, index=False)
    
all_runs_df = pd.concat(all_runs)
all_runs_df.to_csv("evaluation.output.rouge/all_runs.csv", index=False)

In [None]:
target_summaries = {}
for target in ["ics", "wiki", "nist"]:
    this_target_df = all_runs_df[all_runs_df["target.summary"] == target]
    
    index = []
    rows = []
    for run_name,group in this_target_df.groupby("run"):
        print(run_name)
        # Keyword arguments are required by recent pandas versions
        this_row = group.pivot(index="event", columns="metric", values="value").mean()
        rows.append(this_row)
        index.append(run_name)

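    # torchmetrics' ROUGEScore reports keys such as "rouge1_fmeasure"; it has no
    # plain "f1" key, so keep the F-measure columns instead.
    summary_df = pd.DataFrame(rows, index=index)
    summary_df = summary_df[[c for c in summary_df.columns if c.endswith("_fmeasure")]]

    # Assumption: rank runs by ROUGE-1 F-measure; substitute another key if preferred.
    final_df = summary_df.sort_values(by="rouge1_fmeasure", ascending=False)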
    final_df.to_csv("evaluation.output.rouge/%s.summary.csv" % target)
    
    target_summaries[target] = final_df
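
<hr>
The final ranking step averages each run's per-event scores and sorts runs from best to worst. A minimal plain-Python sketch of that aggregation follows; the run names, event IDs, and score values are hypothetical and are not results from any submission.

```python
from statistics import mean

# Hypothetical per-event F-measure values for two hypothetical runs.
run_event_scores = {
    "run_a": {"EVENT-A": 0.25, "EVENT-B": 0.5},
    "run_b": {"EVENT-A": 0.5, "EVENT-B": 0.75},
}

# Macro-average: every event contributes equally, regardless of summary length.
run_means = {run: mean(scores.values()) for run, scores in run_event_scores.items()}

# Highest mean score first.
ranking = sorted(run_means.items(), key=lambda item: item[1], reverse=True)
```

Averaging over events rather than over individual facts keeps long-running events from dominating the ranking.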