# Discussion points
* in the current approach, we do not compute micro averages
* how do we map from **a set of system mentions** to the **set of gold mentions**?
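
To make the first point concrete, the sketch below contrasts macro and micro averaging of precision over several questions. The per-question counts and the helper names `macro_precision` and `micro_precision` are hypothetical and only serve as an illustration; they are not part of the evaluation described here.

```python
# Hypothetical per-question counts: (correct system ids, total system ids).
# The numbers are made up purely to illustrate the two averaging schemes.
counts = [(4, 6), (1, 1)]


def macro_precision(counts):
    """Average the per-question precision scores (every question weighs the same)."""
    scores = [correct / total if total else 0.0 for correct, total in counts]
    return sum(scores) / len(scores) if scores else 0.0


def micro_precision(counts):
    """Pool the counts over all questions before dividing (large questions weigh more)."""
    correct = sum(c for c, _ in counts)
    total = sum(t for _, t in counts)
    return correct / total if total else 0.0


print('macro precision', round(macro_precision(counts), 2))
print('micro precision', round(micro_precision(counts), 2))
```

With these toy counts the macro average is (4/6 + 1/1) / 2 and the micro average is 5/7, so the two schemes can disagree as soon as questions differ in size.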

# Evaluation
Assuming that questions will always be of the form **how many x?**:
* the **gold** consists of:
    * for all questions
        * incident identifier(s)
        * documents belonging to that incident identifier
    * for the manually annotated subset
        * mentions
        
* the **system** consists of:
    * sets of mentions
        * length indicates system answer

In [11]:
gold = {('d1.s24.t15', 'd1.s24.t15'),
        ('d1.s28.t11', 'd1.s28.t11'),
        ('d1.s36.t5', 'd1.s36.t5'),
        ('d2.s5.t16', 'd2.s5.t16'),
        ('d2.s32.t8', 'd2.s32.t8'),
        ('d3.s11.t18', 'd3.s11.t18'),
        ('d3.s19.t19', 'd3.s19.t19'),
        ('d3.s22.t5', 'd3.s22.t5'),
        ('d3.s23.t21', 'd3.s23.t21'),
        ('d4.s11.t12', 'd4.s11.t12')}
gold_number = 2
system = {('d1.s24.t15', 'd1.s24.t15'),
          ('d1.s28.t11', 'd1.s28.t11'),
          ('d1.s36.t5', 'd1.s36.t5'),
          ('d2.s5.t16', 'd2.s5.t16'),
          ('d2.s32.t8', 'd2.s32.t8'),
          ('d3.s11.t18', 'd3.s11.t18'),
          ('d3.s19.t19', 'd3.s19.t19'),
          ('d4.s11.t12', 'd4.s11.t12'),
          ('d5.s11.t12', 'd5.s11.t12'),
          ('d6.s11.t12', 'd6.s11.t12'),
          }
system_number = 2

In [12]:
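def extract_ids_at_level(mentions, level):
    """
    Extract identifiers at the requested level: document | sentence | token.

    ASSUMPTION: the start and end of a mention are in the same sentence,
    so only the start identifier is inspected.

    :param set mentions: set of (start, end) mention tuples, e.g.
    ('d001.s024.t015', 'd001.s024.t015')
    :param str level: document | sentence | token

    :rtype: set
    :return: the mentions themselves (token level), or the set of
    document / sentence identifiers in which they occur
    """
    if level not in {'document', 'sentence', 'token'}:
        raise ValueError('unknown level: %r' % (level,))

    if level == 'token':
        return mentions

    output = set()
    for start_id, _end_id in mentions:
        # identifiers have the form <document>.<sentence>.<token>
        doc_id, sent_id, _token_id = start_id.split('.')

        if level == 'document':
            output.add(doc_id)
        else:  # level == 'sentence'
            # prefix with the document id so sentence ids stay unique
            output.add('.'.join([doc_id, sent_id]))

    return output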

In [20]:
level = 'document'
system_info = extract_ids_at_level(system, level=level)
gold_info = extract_ids_at_level(gold, level=level)

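# identifiers found in both the system output and the gold standard
overlap = system_info & gold_info
# guard against empty sets to avoid a ZeroDivisionError
precision = len(overlap) / len(system_info) if system_info else 0.0
recall = len(overlap) / len(gold_info) if gold_info else 0.0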

print()
print('level', level)
print('gold', sorted(gold_info))
print('system', sorted(system_info))
print()
print('precision', round(precision, 2))
print('recall', round(recall, 2))


level document
gold ['d1', 'd2', 'd3', 'd4']
system ['d1', 'd2', 'd3', 'd4', 'd5', 'd6']

precision 0.67
recall 1.0
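
The same computation can be repeated at every level. The sketch below is one possible way to do so, under stated assumptions: the helpers `ids_at_level` and `score`, the F1 measure, and the toy mentions are additions for illustration only and do not come from the cells above.

```python
def ids_at_level(mentions, level):
    """Compact variant of the extraction step, repeated to keep the sketch self-contained."""
    if level == 'token':
        return set(mentions)
    output = set()
    for start_id, _end_id in mentions:
        # identifiers have the form <document>.<sentence>.<token>
        doc_id, sent_id, _token_id = start_id.split('.')
        output.add(doc_id if level == 'document' else doc_id + '.' + sent_id)
    return output


def score(system_ids, gold_ids):
    """Return (precision, recall, f1); a score is 0.0 when its denominator is 0."""
    overlap = len(system_ids & gold_ids)
    precision = overlap / len(system_ids) if system_ids else 0.0
    recall = overlap / len(gold_ids) if gold_ids else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up toy mentions, not taken from any annotated data.
toy_gold = {('d1.s24.t15', 'd1.s24.t15'),
            ('d1.s28.t11', 'd1.s28.t11'),
            ('d2.s32.t8', 'd2.s32.t8')}
toy_system = {('d1.s24.t15', 'd1.s24.t15'),
              ('d1.s28.t9', 'd1.s28.t9'),
              ('d5.s11.t12', 'd5.s11.t12')}

results = {}
for level in ('document', 'sentence', 'token'):
    results[level] = score(ids_at_level(toy_system, level),
                           ids_at_level(toy_gold, level))
    precision, recall, f1 = results[level]
    print(level, round(precision, 2), round(recall, 2), round(f1, 2))
```

On this toy input the scores drop as the level gets finer: one of two documents matches, two of three sentences match, and only one of three token-level mentions matches exactly.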
