## SemEval 2022 Dev set EDA: Seq-labeling Baseline
- Author: Nana
- Created: 2022.7.19

- Dataset statistics 
    src: [1.0-semEval22-eda.ipynb](https://github.com/Nana2929/ckip_absa_en/blob/main/notebooks/1.0-semEval22-eda.ipynb)
    - Number of examples
    
    |-|opener_en|darmstadt_unis|mpqa|
    |-------|------|------|-------|
    |train  |1744  |2253  |5643   |
    |dev    |249   |232   |1997   |
    |test   |499   |318   |2055   |

    - Number of examples without sentiment annotations
    
    |-|opener_en|darmstadt_unis|mpqa|
    |-------|------|------|-------|
    |train  |344   |1572  |4389   |
    |dev    |51    |150   |1581   |
    |test   |92    |214   |1665   |
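
Both tables can be recomputed from a loaded split with a few lines. This is a minimal sketch, assuming each split is a list of dicts with an `opinions` list (as in the examples further down). `split_stats` and the toy data are hypothetical and not part of the original notebook.

```python
def split_stats(examples):
    """Return (total examples, examples whose 'opinions' list is empty)."""
    total = len(examples)
    empty = sum(1 for ex in examples if not ex.get('opinions'))
    return total, empty

# Toy data standing in for one loaded split (hypothetical records).
toy_split = [
    {'sent_id': 'a-1', 'text': 'Stay was too short',
     'opinions': [{'Polarity': 'Negative'}]},
    {'sent_id': 'a-2', 'text': 'We arrived on Monday .', 'opinions': []},
]
total, empty = split_stats(toy_split)
print(f'total={total}, without opinions={empty}')
```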

The description below is taken from the [seq-labeling GitHub repository](https://github.com/jerbarnes/semeval22_structured_sentiment/tree/master/baselines/sequence_labeling).
- Pipelined: first train **3 separate models that extract the 3 sub-elements (holders, targets, expressions)**, each a sequence labeler (BiLSTM, 1 layer, 100 hidden dimensions) trained for 10 epochs.
- Then train a **relation prediction model (BiLSTM + max-pooling)** that creates contextualized representations of 1) the full text, 2) the first element (either a holder or a target) and 3) the sentiment expression.
- These three representations are concatenated and passed to a linear layer followed by a sigmoid function. Training consists of predicting whether two elements are related or not, which turns the problem into binary classification.

- Each dataset is trained alone (**7 datasets * (3+1) models = 28 models**).
- The best development scores (metrics: **Macro F1**):
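1. extractions: scored directly on the BIO tagging, i.e. whether each token's label is correct
2. relations: every (target, expression) pair is cross-matched and labeled True if the two are related and False otherwise (binary classification)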
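
The pairwise relation setup in item 2 can be illustrated as follows. This is only a sketch of how cross-paired candidates with binary labels might be built, not the baseline's actual code. `build_relation_examples` is a hypothetical helper, and the spans are taken from the opener_en example 2 shown later.

```python
from itertools import product

def build_relation_examples(targets, expressions, gold_pairs):
    """Cross-pair every target with every expression and attach a binary label."""
    gold = set(gold_pairs)
    return [(t, e, (t, e) in gold) for t, e in product(targets, expressions)]

targets = ['the hotel', 'the people']
expressions = ['very clean', 'very friendly', 'willing to help']
gold_pairs = [('the hotel', 'very clean'),
              ('the people', 'very friendly'),
              ('the people', 'willing to help')]

examples = build_relation_examples(targets, expressions, gold_pairs)
for target, expression, label in examples:
    print(target, '|', expression, '|', label)
```

A classifier trained on such examples has to reject the unrelated pairs itself; if it accepts every candidate, the output is the full cross product.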

|annotation_type|opener_en|darmstadt_unis|mpqa |
|---------------|---------|--------------|-----|
|sources        |0.779    |0.547         |0.432|
|targets        |0.726    |0.459         |0.511|
|expressions    |0.577    |0.243         |0.320|
|relations      |0.701    |0.435         |0.709|
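
For reference, token-level macro F1 averages the per-label F1 with equal weight for every label. The sketch below is a plain-Python illustration under that definition, not the baseline's evaluation script. `macro_f1` and the toy tag sequences are hypothetical.

```python
def macro_f1(gold_tags, pred_tags):
    """Token-level macro F1: per-label F1, averaged with equal weight per label."""
    labels = sorted(set(gold_tags) | set(pred_tags))
    pairs = list(zip(gold_tags, pred_tags))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0

# Hypothetical BIO tags for a four-token sentence.
toy_gold_tags = ['B', 'O', 'B', 'I']
toy_pred_tags = ['B', 'O', 'O', 'O']
print(round(macro_f1(toy_gold_tags, toy_pred_tags), 3))
```

Because every label counts equally, a rarely predicted label such as `I` can pull the score down sharply even when most tokens are tagged correctly.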


In [11]:
import os
os.chdir('/share/home/nana2929/')

In [38]:
import json
root_dir = './semEval22/Structured_Sentiment'
prediction_dir = f'{root_dir}/baselines/sequence_labeling/saved_models/relation_prediction'
golden_dir = f'{root_dir}/data'
# Dev-set predictions and gold annotations, keyed by dataset name
PREDICTIONS = {'darmstadt_unis':None, 'mpqa':None, 'opener_en':None}
GOLDS = {'darmstadt_unis':None, 'mpqa':None, 'opener_en':None}

for DATASET in PREDICTIONS:
    predpath = f'{prediction_dir}/{DATASET}/prediction.json'
    goldpath = f'{golden_dir}/{DATASET}/dev.json'
    with open(predpath, 'r') as f:
        PREDICTIONS[DATASET] = json.load(f)
    with open(goldpath, 'r') as f2:
        GOLDS[DATASET] = json.load(f2)

In [45]:
ALLDEVS = {}

In [39]:
# match by sent-id
from collections import defaultdict
def pred_align(preds, golds):
    '''
    preds: list of dicts
    golds: list of dicts
    Each dict must contain a 'sent_id' key.
    Returns (mapping sent_id -> {'PRED': ..., 'GOLD': ...}, list of sent_ids).
    '''
    SENTID = defaultdict(lambda: {'PRED':None, 'GOLD':None})
    # Loop over each list separately: zip() would silently drop records
    # whenever the two lists differ in length.
    for p in preds:
        SENTID[p['sent_id']]['PRED'] = p
    for g in golds:
        SENTID[g['sent_id']]['GOLD'] = g
    return SENTID, list(SENTID.keys())
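
Since `pred_align` leaves `'PRED'` or `'GOLD'` as `None` when a `sent_id` shows up on only one side, it may be worth checking for such gaps before inspecting examples. This is a small sketch. `find_unaligned` and the toy mapping are hypothetical.

```python
def find_unaligned(aligned):
    """Return (ids lacking a prediction, ids lacking a gold annotation)."""
    missing_pred = [sid for sid, pair in aligned.items() if pair['PRED'] is None]
    missing_gold = [sid for sid, pair in aligned.items() if pair['GOLD'] is None]
    return missing_pred, missing_gold

# Toy mapping in the same shape as the first value returned by pred_align.
toy_aligned = {
    's1': {'PRED': {'sent_id': 's1'}, 'GOLD': {'sent_id': 's1'}},
    's2': {'PRED': None, 'GOLD': {'sent_id': 's2'}},
}
missing_pred, missing_gold = find_unaligned(toy_aligned)
print('missing predictions:', missing_pred)
print('missing gold:', missing_gold)
```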

In [148]:
def show_random(dataset_name, sent_id = None):
    '''
    dataset_name: string, a key of ALLDEVS
    sent_id: optional string; if given, show this sentence instead of a random one
    '''
    from random import choice
    if dataset_name not in ALLDEVS:
        # Fail early: continuing would hit undefined variables below
        raise KeyError(f'No {dataset_name} in ALLDEVS.')
    PGPAIRS, ids = ALLDEVS[dataset_name]
    dev_size = len(ids)
    rsid = sent_id if sent_id else choice(ids)

    print(f'[INFO] dev set size: {dev_size}')
    print(f'[INFO] sent id: {rsid}')
    pred, gold = PGPAIRS[rsid]['PRED'], PGPAIRS[rsid]['GOLD']
    return pred, gold, rsid

In [55]:
keys = ('opener_en', 'darmstadt_unis', 'mpqa')

In [56]:
for key in keys:
    PGPAIRS, ids = pred_align(PREDICTIONS[key], GOLDS[key])
    ALLDEVS[key] = (PGPAIRS, ids)

### A. opener_en

#### example 1: short

In [199]:
key = 'opener_en'
pred, gold, sent_id = show_random(key, sent_id='opener_en/kaf/hotel/english00122_b509ee1b998f3418b1fd4a8096880441-1')

[INFO] dev set size: 249
[INFO] sent id: opener_en/kaf/hotel/english00122_b509ee1b998f3418b1fd4a8096880441-1


In [200]:
gold

{'sent_id': 'opener_en/kaf/hotel/english00122_b509ee1b998f3418b1fd4a8096880441-1',
 'text': 'Stay was too short',
 'opinions': [{'Source': [[], []],
   'Target': [['Stay'], ['0:4']],
   'Polar_expression': [['too short'], ['9:18']],
   'Polarity': 'Negative',
   'Intensity': 'Standard'}]}

In [201]:
pred

{'sent_id': 'opener_en/kaf/hotel/english00122_b509ee1b998f3418b1fd4a8096880441-1',
 'text': 'Stay was too short',
 'opinions': [{'Source': [[], []],
   'Target': [['Stay'], ['0:4']],
   'Polar_expression': [['too short'], ['9:18']],
   'Polarity': 'Negative',
   'Intensity': 'Standard'}]}
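
Each element in these records pairs surface strings with `start:end` offsets. In this example the offsets behave like Python slice indices into `text`, which the sketch below checks. Treating them as character offsets in general is an assumption, and `spans_match_text` is a hypothetical helper.

```python
def spans_match_text(text, element):
    """Check that every 'start:end' offset slices out the listed surface string."""
    words, offsets = element
    for word, offset in zip(words, offsets):
        start, end = (int(part) for part in offset.split(':'))
        if text[start:end] != word:
            return False
    return True

text = 'Stay was too short'
target = [['Stay'], ['0:4']]
expression = [['too short'], ['9:18']]
print(spans_match_text(text, target), spans_match_text(text, expression))
```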

#### example 2: multiple, simple 
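The relation prediction step does not seem to pair elements well: it cross-pairs all of the extractions with each other,
even though the extractions themselves are almost entirely correct for this example.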

- text = 'the hotel is very clean and all the people who are working their are very friendly and willing to help .'

Overall prediction:
- the hotel: clean, friendly, willing to help
- the people: clean, friendly, willing to help
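
A per-target summary like the one above can be produced automatically. The sketch below groups polar expressions by target. `pairs_by_target` is a hypothetical helper, and the record is the gold annotation of this example reduced to the fields it needs.

```python
from collections import defaultdict

def pairs_by_target(record):
    """Group polar expressions by the target they are attached to."""
    grouped = defaultdict(list)
    for opinion in record['opinions']:
        target = ' '.join(opinion['Target'][0]) or '(no target)'
        expression = ' '.join(opinion['Polar_expression'][0])
        grouped[target].append(expression)
    return dict(grouped)

gold_record = {'opinions': [
    {'Target': [['the people'], ['32:42']],
     'Polar_expression': [['very friendly'], ['69:82']]},
    {'Target': [['the people'], ['32:42']],
     'Polar_expression': [['willing to help'], ['87:102']]},
    {'Target': [['the hotel'], ['0:9']],
     'Polar_expression': [['very clean'], ['13:23']]},
]}
for target, expressions_for_target in pairs_by_target(gold_record).items():
    print(f'{target}: {", ".join(expressions_for_target)}')
```

Running the same function on the prediction and on the gold record makes over-pairing visible at a glance.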

In [67]:
pred, gold, sent_id = show_random(key)

[INFO] dev set size: 249
[INFO] sent id: opener_en/kaf/hotel/english00073_6d0660face021ed117d3193124ae7849-2


In [69]:
gold

{'sent_id': 'opener_en/kaf/hotel/english00073_6d0660face021ed117d3193124ae7849-2',
 'text': 'the hotel is very clean and all the people who are working their are very friendly and willing to help .',
 'opinions': [{'Source': [[], []],
   'Target': [['the people'], ['32:42']],
   'Polar_expression': [['very friendly'], ['69:82']],
   'Polarity': 'Positive',
   'Intensity': 'Strong'},
  {'Source': [[], []],
   'Target': [['the people'], ['32:42']],
   'Polar_expression': [['willing to help'], ['87:102']],
   'Polarity': 'Positive',
   'Intensity': 'Standard'},
  {'Source': [[], []],
   'Target': [['the hotel'], ['0:9']],
   'Polar_expression': [['very clean'], ['13:23']],
   'Polarity': 'Positive',
   'Intensity': 'Strong'}]}

In [68]:
pred

{'sent_id': 'opener_en/kaf/hotel/english00073_6d0660face021ed117d3193124ae7849-2',
 'text': 'the hotel is very clean and all the people who are working their are very friendly and willing to help .',
 'opinions': [{'Source': [[], []],
   'Target': [['the hotel'], ['0:9']],
   'Polar_expression': [['very clean'], ['13:23']],
   'Polarity': 'Positive',
   'Intensity': 'Standard'},
  {'Source': [[], []],
   'Target': [['the people'], ['32:42']],
   'Polar_expression': [['very clean'], ['13:23']],
   'Polarity': 'Positive',
   'Intensity': 'Standard'},
  {'Source': [[], []],
   'Target': [['the hotel'], ['0:9']],
   'Polar_expression': [['very friendly'], ['69:82']],
   'Polarity': 'Positive',
   'Intensity': 'Standard'},
  {'Source': [[], []],
   'Target': [['the people'], ['32:42']],
   'Polar_expression': [['very friendly'], ['69:82']],
   'Polarity': 'Positive',
   'Intensity': 'Standard'},
  {'Source': [[], []],
   'Target': [['the hotel'], ['0:9']],
   'Polar_expression': [['willing 

#### example 3: multiple, complicated expressions
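For expressions, the model only captures 'little rough from all of the Hen /' and misses 'one of the cleanest'.
For targets, 'reception area' is not extracted; only a definite article ('the') is extracted in its place.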
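
Partial matches like these can be quantified by the overlap between predicted and gold spans. This is a minimal sketch, assuming the `start:end` strings are character offsets into `text`. `parse_span` and `char_overlap` are hypothetical helpers, and the spans come from this example's gold and pred records.

```python
def parse_span(offset):
    """Turn a 'start:end' string into an (int, int) tuple."""
    start, end = offset.split(':')
    return int(start), int(end)

def char_overlap(span_a, span_b):
    """Number of characters shared by two 'start:end' spans."""
    (a_start, a_end), (b_start, b_end) = parse_span(span_a), parse_span(span_b)
    return max(0, min(a_end, b_end) - max(a_start, b_start))

gold_expression = '105:119'   # 'a little rough'
pred_expression = '107:141'   # 'little rough from all of the Hen /'
gold_length = parse_span(gold_expression)[1] - parse_span(gold_expression)[0]
overlap = char_overlap(gold_expression, pred_expression)
print(f'{overlap} of {gold_length} gold characters covered')
```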

In [71]:
pred, gold, sent_id = show_random(key)

[INFO] dev set size: 249
[INFO] sent id: opener_en/kaf/hotel/english00070_4b333078625224037956975e6fba0e41-3


In [73]:
gold

{'sent_id': 'opener_en/kaf/hotel/english00070_4b333078625224037956975e6fba0e41-3',
 'text': "The rooms are one of the cleanest Travelodges I 've been too however the corridors and reception area is a little rough from all of the Hen / Stag weekend traffic .",
 'opinions': [{'Source': [[], []],
   'Target': [['reception area'], ['87:101']],
   'Polar_expression': [['a little rough'], ['105:119']],
   'Polarity': 'Negative',
   'Intensity': 'Standard'},
  {'Source': [[], []],
   'Target': [['the corridors'], ['69:82']],
   'Polar_expression': [['a little rough'], ['105:119']],
   'Polarity': 'Negative',
   'Intensity': 'Standard'},
  {'Source': [[], []],
   'Target': [['The rooms'], ['0:9']],
   'Polar_expression': [['one of the cleanest'], ['14:33']],
   'Polarity': 'Positive',
   'Intensity': 'Standard'}]}

In [72]:
pred

{'sent_id': 'opener_en/kaf/hotel/english00070_4b333078625224037956975e6fba0e41-3',
 'text': "The rooms are one of the cleanest Travelodges I 've been too however the corridors and reception area is a little rough from all of the Hen / Stag weekend traffic .",
 'opinions': [{'Source': [[], []],
   'Target': [['The rooms'], ['0:9']],
   'Polar_expression': [['little rough from all of the Hen /'], ['107:141']],
   'Polarity': 'Negative',
   'Intensity': 'Standard'},
  {'Source': [[], []],
   'Target': [['the'], ['21:24']],
   'Polar_expression': [['little rough from all of the Hen /'], ['107:141']],
   'Polarity': 'Negative',
   'Intensity': 'Standard'},
  {'Source': [[], []],
   'Target': [['corridors'], ['73:82']],
   'Polar_expression': [['little rough from all of the Hen /'], ['107:141']],
   'Polarity': 'Negative',
   'Intensity': 'Standard'}]}

#### example 4: implicit sentiment 

'Over 40 Euros' clearly implies that the meal was too expensive,
but this cannot be judged without world knowledge (the model predicts no opinions).

In [203]:
pred, gold, sent_id = show_random(key)
gold

[INFO] dev set size: 249
[INFO] sent id: opener_en/kaf/hotel/english00210_f4aa2529a7784fec56fad627916db4b7-8


{'sent_id': 'opener_en/kaf/hotel/english00210_f4aa2529a7784fec56fad627916db4b7-8',
 'text': 'Over 40 Euros for a pizza , burger and two drinks .',
 'opinions': [{'Source': [[], []],
   'Target': [[], []],
   'Polar_expression': [['Over 40 Euros for a pizza , burger and two drinks'],
    ['0:49']],
   'Polarity': 'Negative',
   'Intensity': 'Standard'}]}

In [204]:
pred

{'sent_id': 'opener_en/kaf/hotel/english00210_f4aa2529a7784fec56fad627916db4b7-8',
 'text': 'Over 40 Euros for a pizza , burger and two drinks .',
 'opinions': []}

### B. darmstadt_unis

In [153]:
key = 'darmstadt_unis'

#### example 1: short

- gold: quality: poor
- pred: they: poor, isbn: poor

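The extractions are not all correct: the target 'quality' is missed (the model extracts 'they' and "isbn's" instead). Because of error propagation in the pipeline,
the relation pairing is wrong as well. The expression (apart from the missed intensifier 'so') and the sentiment polarity are correct.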
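
To see which targets were missed and which were spurious without reading the full records, the target strings can be compared as sets. This is a sketch. `target_strings` and `diff_targets` are hypothetical helpers, and the records are this example's gold and pred reduced to their targets.

```python
def target_strings(record):
    """Collect the non-empty target strings of a record as a set."""
    return {' '.join(op['Target'][0])
            for op in record['opinions'] if op['Target'][0]}

def diff_targets(pred_record, gold_record):
    """Return (targets missed by the model, targets it predicted spuriously)."""
    pred_set, gold_set = target_strings(pred_record), target_strings(gold_record)
    return sorted(gold_set - pred_set), sorted(pred_set - gold_set)

gold_rec = {'opinions': [{'Target': [['quality'], ['48:55']]}]}
pred_rec = {'opinions': [{'Target': [['they'], ['20:24']]},
                         {'Target': [["isbn's"], ['83:89']]}]}
missed, spurious = diff_targets(pred_rec, gold_rec)
print('missed:', missed)
print('spurious:', spurious)
```

A wrong target at this stage cannot be recovered by the relation model, which only pairs elements that were already extracted.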

In [83]:
pred, gold, sent_id = show_random(key)

[INFO] dev set size: 232
[INFO] sent id: Colorado_Technical_University_Online_76_06-17-2005-6


In [85]:
gold

{'sent_id': 'Colorado_Technical_University_Online_76_06-17-2005-6',
 'text': "The free books taht they provide are so poor in quality you cannot even find their isbn's on the internet .",
 'opinions': [{'Source': [[], []],
   'Target': [['quality'], ['48:55']],
   'Polar_expression': [['so', 'poor'], ['37:39', '40:44']],
   'Polarity': 'Negative',
   'Intensity': 'Strong'}]}

In [84]:
pred

{'sent_id': 'Colorado_Technical_University_Online_76_06-17-2005-6',
 'text': "The free books taht they provide are so poor in quality you cannot even find their isbn's on the internet .",
 'opinions': [{'Source': [[], []],
   'Target': [['they'], ['20:24']],
   'Polar_expression': [['poor'], ['40:44']],
   'Polarity': 'Negative',
   'Intensity': 'Standard'},
  {'Source': [[], []],
   'Target': [["isbn's"], ['83:89']],
   'Polar_expression': [['poor'], ['40:44']],
   'Polarity': 'Negative',
   'Intensity': 'Standard'}]}

#### example 2: short, 'joke'


Capella_University_80_09-21-2004-1 

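An irregular expression? It is captured reasonably well, but the target span is incomplete ('University' instead of 'Capella University').

Incidentally, many examples in this dataset have empty opinions (in both pred and gold).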
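
How often pred and gold agree on an empty `opinions` list can be tallied directly. This is a minimal sketch, assuming every aligned pair has both a prediction and a gold record. `emptiness_table` and the toy mapping are hypothetical.

```python
from collections import Counter

def emptiness_table(aligned):
    """Count (gold_empty, pred_empty) combinations over aligned pairs."""
    counts = Counter()
    for pair in aligned.values():
        gold_empty = not pair['GOLD']['opinions']
        pred_empty = not pair['PRED']['opinions']
        counts[(gold_empty, pred_empty)] += 1
    return counts

# Toy mapping in the same shape as the first value returned by pred_align.
toy_pairs = {
    's1': {'PRED': {'opinions': []}, 'GOLD': {'opinions': []}},
    's2': {'PRED': {'opinions': []},
           'GOLD': {'opinions': [{'Polarity': 'Neutral'}]}},
    's3': {'PRED': {'opinions': [{'Polarity': 'Negative'}]},
           'GOLD': {'opinions': [{'Polarity': 'Negative'}]}},
}
table = emptiness_table(toy_pairs)
for (gold_empty, pred_empty), n in sorted(table.items()):
    print(f'gold_empty={gold_empty} pred_empty={pred_empty}: {n}')
```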

In [101]:
pred, gold, sent_id = show_random(key) 

[INFO] dev set size: 232
[INFO] sent id: Capella_University_80_09-21-2004-1


In [103]:
gold

{'sent_id': 'Capella_University_80_09-21-2004-1',
 'text': 'Capella University is a joke .',
 'opinions': [{'Source': [[], []],
   'Target': [['Capella University'], ['0:18']],
   'Polar_expression': [['joke'], ['24:28']],
   'Polarity': 'Negative',
   'Intensity': 'Strong'}]}

In [102]:
pred

{'sent_id': 'Capella_University_80_09-21-2004-1',
 'text': 'Capella University is a joke .',
 'opinions': [{'Source': [[], []],
   'Target': [['University'], ['8:18']],
   'Polar_expression': [['joke'], ['24:28']],
   'Polarity': 'Negative',
   'Intensity': 'Standard'}]}

#### example 3: what exactly is the difference between Neutral and unannotated?

In [170]:
pred, gold, sent_id = show_random(key)

[INFO] dev set size: 232
[INFO] sent id: University_of_Phoenix_Online_27_12-21-2007-1


In [171]:
pred

{'sent_id': 'University_of_Phoenix_Online_27_12-21-2007-1',
 'text': 'I am currently a student at UoP , and for every post on here there are some truth to them .',
 'opinions': []}

In [172]:
gold

{'sent_id': 'University_of_Phoenix_Online_27_12-21-2007-1',
 'text': 'I am currently a student at UoP , and for every post on here there are some truth to them .',
 'opinions': [{'Source': [[], []],
   'Target': [['UoP'], ['28:31']],
   'Polar_expression': [['some truth'], ['71:81']],
   'Polarity': 'Neutral',
   'Intensity': 'Average'}]}
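
One way to approach this question is to look at how often each polarity label occurs in the gold data. The sketch below counts labels over a list of records. `polarity_counts` is a hypothetical helper, and the toy records reuse two gold annotations from this section plus one empty record.

```python
from collections import Counter

def polarity_counts(records):
    """Count polarity labels over all opinions in a list of records."""
    return Counter(op['Polarity'] for rec in records for op in rec['opinions'])

toy_records = [
    {'text': 'Capella University is a joke .',
     'opinions': [{'Polarity': 'Negative', 'Intensity': 'Strong'}]},
    {'text': 'I am currently a student at UoP , and for every post on here there are some truth to them .',
     'opinions': [{'Polarity': 'Neutral', 'Intensity': 'Average'}]},
    {'text': '(hypothetical sentence without opinions)', 'opinions': []},
]
counts = polarity_counts(toy_records)
print(dict(counts))
```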

### C. mpqa

In [193]:
key = 'mpqa'

#### example 1: weird prediction
'determination' is labeled Negative.


In [115]:
pred, gold, sent_id = show_random(key)

[INFO] dev set size: 1997
[INFO] sent id: non_fbis/15.45.31-12608-3


In [117]:
gold

{'sent_id': 'non_fbis/15.45.31-12608-3',
 'text': 'In response , President Chavez reiterated his commitment to the principles of constitutional rule , legality and democracy and his determination to pursue a broad - based national dialogue .',
 'opinions': []}

In [116]:
pred

{'sent_id': 'non_fbis/15.45.31-12608-3',
 'text': 'In response , President Chavez reiterated his commitment to the principles of constitutional rule , legality and democracy and his determination to pursue a broad - based national dialogue .',
 'opinions': [{'Source': [[], []],
   'Target': [[], []],
   'Polar_expression': [['determination'], ['131:144']],
   'Polarity': 'Negative',
   'Intensity': 'Standard'}]}

#### example 2: non-empty, all-correct
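The annotation looks completely correct for 'The SADC on Saturday endorsed the vote'.
The extractions are all correct, and since there is only one group, the relation pairing is correct (trivially).
Only the intensity differs slightly, which has little impact.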

In [194]:
pred, gold, sent_id = show_random(key, sent_id='20020316/20.37.48-18053-9')

[INFO] dev set size: 1997
[INFO] sent id: 20020316/20.37.48-18053-9


In [195]:
gold

{'sent_id': '20020316/20.37.48-18053-9',
 'text': 'The SADC on Saturday endorsed the vote , despite its Parliamentary Forum , which judging that the March 9 - 11 election " did not conform with the norms and standards of the SADC Parliamentary Forum , " signed on to by Zimbabwe .',
 'opinions': [{'Source': [['The SADC'], ['0:8']],
   'Target': [['the vote'], ['30:38']],
   'Polar_expression': [['endorsed'], ['21:29']],
   'Polarity': 'Positive',
   'Intensity': 'Average'}]}

In [196]:
pred

{'sent_id': '20020316/20.37.48-18053-9',
 'text': 'The SADC on Saturday endorsed the vote , despite its Parliamentary Forum , which judging that the March 9 - 11 election " did not conform with the norms and standards of the SADC Parliamentary Forum , " signed on to by Zimbabwe .',
 'opinions': [{'Source': [[], []],
   'Target': [['the vote'], ['30:38']],
   'Polar_expression': [['endorsed'], ['21:29']],
   'Polarity': 'Positive',
   'Intensity': 'Standard',
   'source': [['The SADC'], ['0:8']]}]}
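
In the prediction above, `'Source'` is empty while the holder span sits under a lowercase `'source'` key. Assuming this is a key-casing inconsistency in the prediction output (an assumption, not confirmed by the baseline's documentation), the hypothetical helper below folds the lowercase key back into `'Source'` before comparing against gold.

```python
def normalize_source_key(opinion):
    """Return a copy where a lowercase 'source' fills an empty 'Source'."""
    fixed = dict(opinion)
    lowercase_value = fixed.pop('source', None)
    # Only overwrite 'Source' when it is missing or holds no span.
    if lowercase_value is not None and not fixed.get('Source', [[], []])[0]:
        fixed['Source'] = lowercase_value
    return fixed

pred_opinion = {'Source': [[], []],
                'Target': [['the vote'], ['30:38']],
                'Polar_expression': [['endorsed'], ['21:29']],
                'Polarity': 'Positive',
                'Intensity': 'Standard',
                'source': [['The SADC'], ['0:8']]}
normalized = normalize_source_key(pred_opinion)
print(normalized['Source'])
```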

#### example 3: 6 opinions in one data, complicated relations and target spans
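The definition of targets in this dataset is quite tricky: targets are often whole phrases (so their boundaries are hard to detect), and polar expressions can be verbs.
The relation pairing prediction for this example is poor, although the model does seem to have learned that n't signals negation.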

In [149]:
pred, gold, sent_id = show_random(key, sent_id='ula/sw2015-ms98-a-trans-23')

[INFO] dev set size: 1997
[INFO] sent id: ula/sw2015-ms98-a-trans-23


In [152]:
gold

{'sent_id': 'ula/sw2015-ms98-a-trans-23',
 'text': "right yeah yeah no i i agree with you there though i mean they want to choose that particular religion then that 's fine with me too you know as long as they do n't try and pull me in and drag me in and and i do n't like the way that they do it either and and it's their mission as they do as they go door - to - door and they go out into the public and they actually have the uh teenagers serving two years like you would say like in an army and two years in going around and doing missionary type work and i do n't know i just um i just do n't particularly care for that at all and that that 's one thing that i feel really strongly about though is uh you know people coming up to my door especially religious organizations and wanting to uh you know to try and get me to join or you know become interested in their religion because i have my own",
 'opinions': [{'Source': [[], []],
   'Target': [['the way that they do it either'], ['221:251']]

In [150]:
pred

{'sent_id': 'ula/sw2015-ms98-a-trans-23',
 'text': "right yeah yeah no i i agree with you there though i mean they want to choose that particular religion then that 's fine with me too you know as long as they do n't try and pull me in and drag me in and and i do n't like the way that they do it either and and it's their mission as they do as they go door - to - door and they go out into the public and they actually have the uh teenagers serving two years like you would say like in an army and two years in going around and doing missionary type work and i do n't know i just um i just do n't particularly care for that at all and that that 's one thing that i feel really strongly about though is uh you know people coming up to my door especially religious organizations and wanting to uh you know to try and get me to join or you know become interested in their religion because i have my own",
 'opinions': [{'Source': [[], []],
   'Target': [['the way'], ['221:228']],
   'Polar_expression'