## Scorer: semEval22 baselines, inferences on semEval16 

    =================
    - Author: Nana 
    - Created Date: 2022.7.28
    =================
- models: two semEval22 baseline models (they extract opinion expressions, `polarity_expression`)
- inference dataset: semEval16 ABSA RESTAURANT dataset (with gold answer)
- metrics: micro-f1, roughly according to [SemEval-2016 Task 5: Aspect Based Sentiment Analysis](https://aclanthology.org/S16-1002/)
- tweaks: 
    - The s22 baselines do not return an Aspect Category (e.g. FOOD#QUALITY), so the score cannot be computed
    strictly by the s16 metrics. Moreover, micro-F1 relies on exact string match: gold 'wine list' vs. pred 'the wine list' counts as a false prediction,
    which approximate string matching could mitigate.
    - The s16 gold answers do not provide `polarity_expression`, so the score cannot be computed by the s22 metrics either.
    - The scorer function is therefore adapted slightly, as described in section 1.
    

In [7]:
datadir = "../semEval16/data"
seqdir = "../semEval16/sequence_labeling/predictions"
graphdir = "../semEval16/graph_parser/predictions"

In [3]:
splits = ['train', 'test-B', 'test-gold']

## 0.1 Check if test-B and test-gold are the same splits -> Yes

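- Ans: Yes, the review_id and sent_id values form the same set in both splits.
- test-gold is the split with complete answers (everything is annotated except intensity).
- test-B lacks polarity labels (both polarity and intensity are unannotated).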

In [10]:
splitpaths = [f'{datadir}/train/ABSA16_Restaurants_Train_SB1_v2.json',
    f'{datadir}/test-B/EN_REST_SB1_TEST-B.json',
    f'{datadir}/test-gold/EN_REST_SB1_TEST-gold.json']


import json 

splitD = {'train':None, 'test-B':None, 'test-gold':None}
for split, sppath in zip(splits, splitpaths):
    print('Loading from', sppath)
    with open(sppath, 'r') as f:
        splitD[split] = json.load(f)

Loading from ../semEval16/data/train/ABSA16_Restaurants_Train_SB1_v2.json
Loading from ../semEval16/data/test-B/EN_REST_SB1_TEST-B.json
Loading from ../semEval16/data/test-gold/EN_REST_SB1_TEST-gold.json


In [23]:
Bset = set(x['sent_id'] for x in splitD['test-B'])
Gset = set(x['sent_id'] for x in splitD['test-gold']) 
Bset == Gset

True

In [24]:
len(Bset)

676

In [25]:
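from random import randrange
# randrange excludes the upper bound, so the index is always valid
# (randint(0, n) can return n, which is out of range)
rindex = randrange(len(splitD['test-gold']))
splitD['test-gold'][rindex] 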

{'review_id': 'en_OpenSesame_477970940',
 'sent_id': 'en_OpenSesame_477970940:5',
 'text': 'Also, they serve THE best hummus in America, with a drizzle of fragrant olive oil (which, I believe is the traditional way)!',
 'opinions': [{'Source': [[], []],
   'Target': ['hummus', ['26:32']],
   'Polar_expression': [[], []],
   'Polarity': 'positive',
   'Intensity': '',
   'Category': 'FOOD#QUALITY'},
  {'Source': [[], []],
   'Target': ['hummus', ['26:32']],
   'Polar_expression': [[], []],
   'Polarity': 'positive',
   'Intensity': '',
   'Category': 'FOOD#STYLE_OPTIONS'}]}

In [26]:
splitD['test-B'][rindex]

{'review_id': 'en_OpenSesame_477970940',
 'sent_id': 'en_OpenSesame_477970940:5',
 'text': 'Also, they serve THE best hummus in America, with a drizzle of fragrant olive oil (which, I believe is the traditional way)!',
 'opinions': [{'Source': [[], []],
   'Target': ['hummus', ['26:32']],
   'Polar_expression': [[], []],
   'Polarity': '',
   'Intensity': '',
   'Category': 'FOOD#QUALITY'},
  {'Source': [[], []],
   'Target': ['hummus', ['26:32']],
   'Polar_expression': [[], []],
   'Polarity': '',
   'Intensity': '',
   'Category': 'FOOD#STYLE_OPTIONS'}]}
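
A quick way to confirm what the two splits disagree on is to diff the opinion fields of paired records. The sketch below is only an illustration and not part of the original notebook: `differing_fields` is a hypothetical helper, the two records are trimmed copies of the sample above, and it assumes both records list their opinions in the same order.

```python
def differing_fields(rec_a, rec_b):
    """Return the set of opinion keys whose values differ between two records.

    Assumes both records hold the same number of opinions in the same order.
    """
    diff = set()
    for opn_a, opn_b in zip(rec_a['opinions'], rec_b['opinions']):
        for key in opn_a.keys() | opn_b.keys():
            if opn_a.get(key) != opn_b.get(key):
                diff.add(key)
    return diff


# Trimmed copies of the sample record shown above.
gold = {'opinions': [{'Target': ['hummus', ['26:32']], 'Polarity': 'positive',
                      'Intensity': '', 'Category': 'FOOD#QUALITY'}]}
test_b = {'opinions': [{'Target': ['hummus', ['26:32']], 'Polarity': '',
                        'Intensity': '', 'Category': 'FOOD#QUALITY'}]}

print(differing_fields(gold, test_b))
```

For this sample only `Polarity` differs, which is consistent with the notes in 0.1.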

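## 1 eval metrics
   - `Target`
       For each gold target, try to find a predicted opinion with the same or a similar target. If one is found, append gold_target to the predictions; otherwise append ''.
   - `(Target, polarity)`
       For each gold target, try to find a predicted opinion with the same or a similar target.
       If one is found, take its predicted polarity and append f'{gold_target}_{polarity}' to the predictions; otherwise append null_null.
   - F1 is computed per tuple: the y_true passed to sklearn.metrics.f1_score(y_true, y_pred) should have one entry per gold tuple in the whole dataset.
   - note that polarity is case-normalized before comparison (the code below lowercases it)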
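
The alignment described above can be sketched on a toy sentence. This is a simplified illustration under stated assumptions, not the notebook's scorer: `lcs_len` and `align` are hypothetical names, the similarity test uses `difflib` from the standard library, and the 4-character threshold is taken from the plan further down.

```python
from difflib import SequenceMatcher


def lcs_len(a, b):
    """Length of the longest common substring of a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def align(gold_targets, pred_targets, min_overlap=4):
    """Greedily pair each gold target with the first unused similar prediction."""
    remaining = list(pred_targets)
    y_true, y_pred = [], []
    for gold in gold_targets:
        y_true.append(gold)
        for i, pred in enumerate(remaining):
            if gold == pred or lcs_len(gold, pred) >= min_overlap:
                y_pred.append(gold)   # a match is recorded under the gold label
                remaining.pop(i)      # each prediction is used at most once
                break
        else:
            y_pred.append('')         # no similar prediction found
    return y_true, y_pred


y_true, y_pred = align(['wine list', 'service'], ['the wine list'])
print(y_true, y_pred)
```

Here 'the wine list' is credited as a hit for 'wine list', while 'service' has no counterpart and receives the empty label, so both lists keep one entry per gold target.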

In [None]:
splits = ['train', 'test-gold']

In [31]:
#### loading predictions 
from collections import defaultdict
predD = defaultdict(lambda: defaultdict(dict))
# 3 dataset-pretrained models 
prt_dataset = ['darmstadt_unis', 'mpqa', 'opener_en']   
for i, baselinedir in enumerate([seqdir, graphdir]):
    baseline = 'seq' if i == 0 else 'graph'
    for split in splits:
        for pretrained_d in prt_dataset:
            fullpath = f'{baselinedir}/{split}/{pretrained_d}_prediction.json'
            with open(fullpath, 'r') as f:
                predD[baseline][split][pretrained_d] = json.load(f)

In [33]:
#  predD

In [None]:
# for each sent_id:
#   run through the gold targets
#     run through the pred targets
#       how to ignore duplicates?
#       if lcs(gold, pred) >= 4:
#         consider the prediction correct and append it to the predictions

In [42]:
cursplit = 'train'
baseline = 'graph'
pretrained = 'opener_en'

In [43]:
golden = splitD[cursplit]
prediction = predD[baseline][cursplit][pretrained]
pairing = defaultdict(lambda: {'gold':None, 'pred':None})
for y in golden:
    sid = y['sent_id']
    pairing[sid]['gold'] = y 
for y_ in prediction:
    sid = y_['sent_id']
    pairing[sid]['pred'] = y_

In [76]:
# credit to: https://www.geeksforgeeks.org/longest-common-substring-dp-29/
def LCSubStr(X, Y):
    m, n = len(X), len(Y)
    # Create a table to store lengths of
    # longest common suffixes of substrings.
    # Note that LCSuff[i][j] contains the
    # length of longest common suffix of
    # X[0...i-1] and Y[0...j-1]. The first
    # row and first column entries have no
    # logical meaning, they are used only
    # for simplicity of the program.
 
    # LCSuff is the table with zero
    # value initially in each cell
    LCSuff = [[0 for k in range(n+1)] for l in range(m+1)]
 
    # To store the length of
    # longest common substring
    result = 0
 
    # Following steps to build
    # LCSuff[m+1][n+1] in bottom up fashion
    for i in range(m + 1):
        for j in range(n + 1):
            if (i == 0 or j == 0):
                LCSuff[i][j] = 0
            elif (X[i-1] == Y[j-1]):
                LCSuff[i][j] = LCSuff[i-1][j-1] + 1
                result = max(result, LCSuff[i][j])
            else:
                LCSuff[i][j] = 0
    return result
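
Since each table row depends only on the row before it, the same length can be computed with two rows instead of the full table. The variant below is a sketch for comparison, not the notebook's function; `lcs_len_rolling` is a hypothetical name.

```python
def lcs_len_rolling(x, y):
    """Longest common substring length, keeping only one previous DP row."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # extend the common suffix ending at x[i-2], y[j-2]
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


print(lcs_len_rolling('wine list', 'the wine list'))  # 9
print(lcs_len_rolling('food', 'seafood'))             # 4
print(lcs_len_rolling('menu', 'staff'))               # 0
```

The second case is worth keeping in mind: with a threshold of 4 characters, 'food' and 'seafood' would be treated as the same target.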

In [116]:
def collect_targets(data):
    opns = data['opinions']
    x = set()
    for opn in opns:
        if isinstance(opn['Target'][0], str):
            x.add(opn['Target'][0])
        elif len(opn['Target'][0]) > 0:
            x.add(opn['Target'][0][0])
    return list(x)

### 1.1 target eval metrics

In [228]:
def target_scorer(pairing):
    '''
    pairing: defaultdict(lambda: {'gold':None, 'pred':None})
        gold_data = pairing[sent_id]['gold']
        pred_data = pairing[sent_id]['pred']
    '''
    import re
    from sklearn.metrics import f1_score

    goldtargets = [] # hold gold targets
    predtargets = [] # hold preprocessed pred targets
    correct = {'exact':0, 'lcs':0} 
    for k, pair in pairing.items():

        goldtgts = collect_targets(pair['gold'])
        predtgts = collect_targets(pair['pred'])
        # print(goldtgts, predtgts)
        for goldtarget in goldtgts:
            goldtarget = re.sub(r"[,.;@#?!&$]+\ *", "", goldtarget)
            goldtargets.append(goldtarget)

            addFlag = False
            for pi, predtarget in enumerate(predtgts):
                predtarget = re.sub(r"[,.;@#?!&$]+\ *", "", predtarget)
                # exact match first, then fall back to the longest common substring (LCS)
                if goldtarget == predtarget:
                    predtargets.append(goldtarget)
                    predtgts.pop(pi)
                    addFlag = True
                    correct['exact'] += 1
                    break 
                if LCSubStr(goldtarget, predtarget) >= 4:
                    # print(goldtarget, '|', predtarget)
                    predtargets.append(goldtarget)
                    predtgts.pop(pi)
                    addFlag = True
                    correct['lcs'] += 1
                    break
            if not addFlag:
                predtargets.append('')
    assert len(goldtargets) == len(predtargets)
    print('='*7+' TARGET SCORER '+'='*7)
    print(f'- # sentences: {len(pairing)}')
    print(f'- # gold tuples: {len(goldtargets)}')
    print(f'- # correct: {correct}')
    f1 = f1_score(goldtargets, predtargets, average="micro")
    print(f'- f1 score (micro): {f1}')
    return f1


In [207]:
target_scorer(pairing)

- # sentences: 2000
- # gold tuples: 1741
- # correct: {'exact': 282, 'lcs': 479}
- f1 score (micro): 0.43710511200459506
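
Because every gold tuple receives exactly one predicted label, the micro-averaged F1 here appears to coincide with plain accuracy, i.e. matched gold tuples divided by all gold tuples. The sketch below checks that on toy labels with a hand-written micro F1 (`micro_f1` is a hypothetical helper, not sklearn's implementation) and then applies the same ratio to the counts logged above.

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over single-label predictions, pooled across labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # p was predicted but wrong
            fn[t] += 1   # t was missed
    n_tp, n_fp, n_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    return 2 * n_tp / (2 * n_tp + n_fp + n_fn) if n_tp else 0.0


y_true = ['wine list', 'service', 'hummus', 'staff']
y_pred = ['wine list', '', 'hummus', '']
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(micro_f1(y_true, y_pred), accuracy)

# Same identity applied to the logged run above: (exact + lcs) / gold tuples.
print((282 + 479) / 1741)
```

The last line reproduces the logged score of about 0.4371, so the reported number can be read as the share of gold targets that found a matching prediction; unmatched predictions do not lower it.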


### 1.2 (target, pol) eval metrics

In [208]:
def collect_tuples(data):
    opns = data['opinions']
    x = set()
    for opn in opns:
        if isinstance(opn['Target'][0], str):
            x.add((opn['Target'][0], opn['Polarity'].lower()))
        elif len(opn['Target'][0]) > 0:
            x.add((opn['Target'][0][0], opn['Polarity'].lower()))
    return list(x)

In [227]:
def tuple_scorer(pairing):
    '''
    pairing: defaultdict(lambda: {'gold':None, 'pred':None})
        gold_data = pairing[sent_id]['gold']
        pred_data = pairing[sent_id]['pred']
    '''
    import re
    from sklearn.metrics import f1_score

    goldtuples = [] # hold gold 'target_polarity' labels
    predtuples = [] # hold aligned predicted 'target_polarity' labels

    for k, pair in pairing.items():

        goldtups = collect_tuples(pair['gold'])
        predtups = collect_tuples(pair['pred'])
        for goldtarget, goldpol in goldtups:
            goldtarget = re.sub(r"[,.;@#?!&$]+\ *", "", goldtarget)
            goldtuples.append(f'{goldtarget}_{goldpol}')

            addFlag = False
            for pi, (predtarget, predpol) in enumerate(predtups):
                predtarget = re.sub(r"[,.;@#?!&$]+\ *", "", predtarget)
                # check lcs (longest common substring)
                if goldtarget == predtarget or LCSubStr(goldtarget, predtarget) >= 4:
                    predtuples.append(f'{goldtarget}_{predpol}')
                    predtups.pop(pi)
                    addFlag = True
                    break
            if not addFlag:
                predtuples.append('null_null')
    
    
    assert len(goldtuples) == len(predtuples)
    print('='*7+' TUPLE SCORER '+'='*7)
    print('- TUPLE = (target, polarity) ')
    print(f'- # sentences: {len(pairing)}')
    print(f'- # gold tuples: {len(goldtuples)}')
    # print(goldtuples[:10], predtuples[:10])
    f1 = f1_score(goldtuples, predtuples, average="micro")
    print(f'- f1 score (micro): {f1}')
    return f1


In [217]:
tuple_scorer(pairing)

- TUPLE = (target, polairty) 
- # sentences: 2000
- # gold tuples: 1770
- f1 score (micro): 0.3338983050847458
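
The tuple scorer counts 1770 gold tuples where the target scorer counted 1741. A likely explanation, inferred from the two collector functions rather than stated in the notebook, is that deduplication happens per sentence on the target alone in one case and on the (target, polarity) pair in the other. The toy opinions below use a simplified string `Target` field as an assumption.

```python
# One sentence whose target is mentioned with two different polarities.
opinions = [
    {'Target': 'menu', 'Polarity': 'positive'},
    {'Target': 'menu', 'Polarity': 'negative'},
    {'Target': 'menu', 'Polarity': 'positive'},   # duplicate of the first
]

targets = {o['Target'] for o in opinions}
tuples = {(o['Target'], o['Polarity']) for o in opinions}

print(len(targets), len(tuples))
```

A target that carries two polarities in the same sentence contributes one entry to the target count but two to the tuple count.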


## 2 Compute scores for each annotation 
   
   - 2 baselines x 3 pretrained datasets x 2 splits

In [220]:
def makepair(G, P):
    '''Pair gold records G with predicted records P by sent_id.'''
    pairing = defaultdict(lambda: {'gold':None, 'pred':None})
    for y in G:
        sid = y['sent_id']
        pairing[sid]['gold'] = y 
    for y_ in P:
        sid = y_['sent_id']
        pairing[sid]['pred'] = y_
    return pairing
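
If a sentence appears in only one of the two files, its pairing entry keeps a `None` on the other side, and the scorers would fail when collecting targets from it. A small pre-check, sketched here with a hypothetical helper `unpaired_ids` and toy records, can surface such cases before scoring.

```python
def unpaired_ids(gold, pred):
    """Return (ids only in gold, ids only in pred) as sorted lists."""
    gold_ids = {rec['sent_id'] for rec in gold}
    pred_ids = {rec['sent_id'] for rec in pred}
    return sorted(gold_ids - pred_ids), sorted(pred_ids - gold_ids)


# Toy records: the second gold sentence has no prediction.
gold = [{'sent_id': 'r1:0'}, {'sent_id': 'r1:1'}]
pred = [{'sent_id': 'r1:0'}]

print(unpaired_ids(gold, pred))
```

Two empty lists mean every sentence has both a gold record and a prediction.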

### 2.1 score log

In [234]:
import pandas as pd

splits = ['train', 'test-gold']
splitstats = {}
for split in splits:
    goldfile = splitD[split]
    seqstats = pd.Series(index=prt_dataset, name='seq_labeling', dtype=object)
    graphstats = pd.Series(index=prt_dataset, name='graph_parser', dtype=object)
    for baseline in ['seq', 'graph']:
        for pretrained in prt_dataset:
            predfile = predD[baseline][split][pretrained]
            print(f'\nNOW EVALUATING {baseline}_{split}_{pretrained}...\n')
            # pair gold and predicted records by sent_id, then score
            pairing = makepair(goldfile, predfile)
            tgtf1 = round(target_scorer(pairing), 4)
            tupf1 = round(tuple_scorer(pairing), 4)
            if baseline == 'seq':
                seqstats[pretrained] = (tgtf1, tupf1)
            else:
                graphstats[pretrained] = (tgtf1, tupf1)
    # one table per split: rows are pretraining datasets, columns are baselines
    splitstats[split] = pd.concat([seqstats, graphstats], axis=1)
    


NOW EVALUATING seq_train_darmstadt_unis...

- # sentences: 2000
- # gold tuples: 1741
- # correct: {'exact': 146, 'lcs': 51}
- f1 score (micro): 0.1131533601378518
- TUPLE = (target, polairty) 
- # sentences: 2000
- # gold tuples: 1770
- f1 score (micro): 0.07909604519774012

NOW EVALUATING seq_train_mpqa...

- # sentences: 2000
- # gold tuples: 1741
- # correct: {'exact': 3, 'lcs': 22}
- f1 score (micro): 0.014359563469270534
- TUPLE = (target, polairty) 
- # sentences: 2000
- # gold tuples: 1770
- f1 score (micro): 0.005084745762711864

NOW EVALUATING seq_train_opener_en...

- # sentences: 2000
- # gold tuples: 1741
- # correct: {'exact': 235, 'lcs': 508}
- f1 score (micro): 0.4267662263067203
- TUPLE = (target, polairty) 
- # sentences: 2000
- # gold tuples: 1770
- f1 score (micro): 0.3135593220338983

NOW EVALUATING graph_train_darmstadt_unis...

- # sentences: 2000
- # gold tuples: 1741
- # correct: {'exact': 184, 'lcs': 18}
- f1 score (micro): 0.11602527283170591
- TUPLE = (targ

### 2.2 score tables
score description: `(target micro f1, target_polarity tuple micro f1)`

In [235]:
splitstats['train']

| | seq_labeling | graph_parser |
|---|---|---|
| darmstadt_unis | (0.1132, 0.0791) | (0.116, 0.0633) |
| mpqa | (0.0144, 0.0051) | (0.0218, 0.013) |
| opener_en | (0.4268, 0.3136) | (0.4371, 0.3339) |


In [236]:
splitstats['test-gold']

| | seq_labeling | graph_parser |
|---|---|---|
| darmstadt_unis | (0.1111, 0.0738) | (0.0915, 0.053) |
| mpqa | (0.0196, 0.0064) | (0.0327, 0.0144) |
| opener_en | (0.3987, 0.2825) | (0.4036, 0.3323) |

