# Analysis of the Comparison Annotation

This notebook prepares the data from the comparison annotation and compares it with the crowdsourcing annotation.

---

### Package and Data Import

In [530]:
import pandas as pd
import numpy as np
import sklearn.metrics as metrics
from pymongo import MongoClient
from irrCAC.raw import CAC

In [531]:
def get_client():
    return MongoClient('mongodb://{}:{}@{}:{}'.format(
        'root',
        'root',
        '0.0.0.0',
        '27017',
    ))

def get_database():
    client = get_client()
    return client.get_database('masterthesis-goerner')

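db = get_database()
# Load all Toloka annotation results into a DataFrame
cursor = db['annotationresults'].find({})
toloka_res = pd.DataFrame(list(cursor))
# Draw the random sample for the comparison annotation
# (the fraction is chosen to yield the 250 sentences used below)
df = toloka_res.sample(frac=0.06592827004)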

---

### Import of the Toloka Annotation Results

In this section, 250 of the accepted sentences from Toloka are converted into a doccano-specific format.
They are then rated in doccano by 2 annotators.
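To make the target format concrete, the following minimal sketch serializes a single record in the shape built in the next cell: `id`, `text`, and `entities` as `[start, end, label]` triples. The sentence, offsets, and id are invented for illustration, and the helper `to_doccano_record` is hypothetical; only the field names are taken from this notebook.

```python
import json

# Hypothetical example row; all values are made up for illustration.
row = {
    "preprocessed_id": 42,
    "sentence_sample": "Die Regierung lobt den Vorschlag.",
    "main_start_pos": 4,
    "main_end_pos": 13,
    "original_word_main": "Regierung",
}


def to_doccano_record(row):
    """Map one row to the id/text/entities structure used in this notebook."""
    return {
        "id": row["preprocessed_id"],
        "text": row["sentence_sample"],
        "entities": [
            [row["main_start_pos"], row["main_end_pos"], row["original_word_main"]]
        ],
    }


record = to_doccano_record(row)
# JSONL means one JSON object per line
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Slicing `record["text"][4:13]` returns the marked word, which is a quick way to check that the offsets of a record are consistent.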

In [532]:
# Build the doccano fields: id, entities as [start, end, label], and text
df['id'] = toloka_res['preprocessed_id']  # assignment aligns on the shared index
df['entities'] = df.apply(lambda x: [[x.main_start_pos, x.main_end_pos, x.original_word_main]], axis=1)
df['text'] = df['sentence_sample']
# Drop every column that is not part of the doccano import
df = df.drop(columns=['preprocessed_word','original_word_main','date_publish','source_domain','sentence_sample','main_start_pos','main_end_pos','match_id','polarity','_id','url_id','preprocessed_id'])

In [533]:
#df.to_json('doccano.jsonl',orient='records', lines=True)

---

### Preparing the Data of Both Annotators

In [534]:
annotator1 = pd.read_json('data/validation_annotator_1.jsonl',orient='records', lines=True)
annotator1['cats'] = annotator1.apply(lambda x: np.nan if len(x.cats) < 1 else str(x.cats[0]), axis=1)

In [535]:
annotator2 = pd.read_json('data/validation_annotator_2.jsonl',orient='records', lines=True)
annotator2['cats'] = annotator2.apply(lambda x: np.nan if len(x.cats) < 1 else str(x.cats[0]), axis=1)

In [536]:
def combine_cats(cats):
    # Merge the "slightly" negative/positive classes into their main class
    if cats == 'leicht negativ':
        return 'negativ'
    if cats == 'leicht positiv':
        return 'positiv'
    return cats

annotator1['cats'] = annotator1['cats'].map(combine_cats)
annotator2['cats'] = annotator2['cats'].map(combine_cats)

---

### Computing the Coefficients
This section computes the agreement coefficients Cohen's kappa and Gwet's AC2.

In [537]:
metrics.cohen_kappa_score(annotator1['cats'].astype(str), annotator2['cats'].astype(str))

0.7138496756962991
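As a reading aid for the value above, here is a minimal plain-Python sketch of unweighted Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected from the two annotators' label frequencies. The function name and the toy labels are assumptions for illustration and are not the notebook's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    # Observed agreement: share of items with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two raters' marginal frequencies per class
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Toy labels, invented for illustration
a = ["negativ", "neutral", "positiv", "neutral", "neutral", "positiv"]
b = ["negativ", "neutral", "positiv", "neutral", "positiv", "positiv"]

kappa = cohen_kappa(a, b)
print(round(kappa, 4))  # 0.7391
```

Here the annotators agree on 5 of 6 items ($p_o = 5/6$) and chance agreement is $p_e = 13/36$, which gives $\kappa = 17/23$.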

In [538]:
cac_obj = CAC(pd.DataFrame({'annotator1':annotator1['cats'].astype(str), 'annotator2':annotator2['cats'].astype(str)}), weights='ordinal')
cac_obj.gwet()

{'est': {'coefficient_value': 0.9401,
  'coefficient_name': 'AC2',
  'confidence_interval': (0.91896, 0.96124),
  'p_value': 0.0,
  'z': 87.58067,
  'se': 0.01073,
  'pa': 0.97,
  'pe': 0.49914},
 'weights': array([[1.        , 0.83333333, 0.5       , 0.        ],
        [0.83333333, 1.        , 0.83333333, 0.5       ],
        [0.5       , 0.83333333, 1.        , 0.83333333],
        [0.        , 0.5       , 0.83333333, 1.        ]]),
 'categories': ['negativ', 'neutral', 'positiv', 'positivs']}
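The `weights` matrix in the output above is consistent with the usual ordinal weighting scheme, $w_{kl} = 1 - \binom{|k-l|+1}{2} / \binom{q}{2}$ for $q$ ordered categories. The following sketch reproduces the printed values for the four categories listed in the output; the function name is hypothetical, and the sketch is a reading aid rather than the library's implementation.

```python
from math import comb


def ordinal_weights(q):
    """Ordinal agreement weights: 1 - C(|k - l| + 1, 2) / C(q, 2)."""
    max_dist = comb(q, 2)
    return [
        [1 - comb(abs(k - l) + 1, 2) / max_dist for l in range(q)]
        for k in range(q)
    ]


w = ordinal_weights(4)
for row in w:
    print([round(v, 8) for v in row])
```

Neighbouring categories receive a weight of about 0.833, so a disagreement between adjacent classes counts as partial agreement. This helps explain why the weighted AC2 is considerably higher than the unweighted kappa.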

---

### Accepting Results via Majority Voting

In [539]:
concatenated = pd.concat([annotator1, annotator2],ignore_index=True)

In [540]:
def one_hot_encode(df, field):
    one_hot = pd.get_dummies(df[field])
    # Drop original column as it is now encoded
    df = df.drop(field,axis = 1)
    # Join the encoded df
    return df.join(one_hot) 
concatenated = one_hot_encode(concatenated, 'cats')

In [541]:
grp = concatenated.groupby(by=['text'], as_index=True).agg({'negativ':'sum', 'neutral':'sum','positiv':'sum','id':'first','text':'first'})

In [542]:
def get_accepted_values(df):
    # A sentence is accepted when more than one annotator chose the same class
    accepted = []
    for column, label in [('positiv', 'positive'), ('negativ', 'negative'), ('neutral', 'neutral')]:
        # .copy() avoids pandas' SettingWithCopyWarning when adding the column
        subset = df[df[column] > 1].copy()
        subset['sentiment'] = label
        accepted.append(subset.drop(columns=['negativ', 'neutral', 'positiv']))
    return pd.concat(accepted)

In [543]:
result = get_accepted_values(grp)


In [544]:
result['text'] = result.index

In [545]:
# Acceptance rate
len(result) / len(grp)

0.844

In [546]:
# Total accepted
len(result)

211
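The acceptance rule used above can be sketched without pandas. With two annotators, a vote count greater than 1 means that both chose the same class. The sentence keys and labels below are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical labels from two annotators, keyed by sentence
votes = [
    ("s1", "positiv"), ("s1", "positiv"),
    ("s2", "neutral"), ("s2", "negativ"),
    ("s3", "neutral"), ("s3", "neutral"),
]

# Count the votes per sentence and class
counts = defaultdict(Counter)
for sentence, label in votes:
    counts[sentence][label] += 1

# Accept a sentence only if its most frequent class has more than one vote
accepted = {}
for sentence, class_counts in counts.items():
    label, n = class_counts.most_common(1)[0]
    if n > 1:
        accepted[sentence] = label

acceptance_rate = len(accepted) / len(counts)
print(accepted, acceptance_rate)
```

In this toy example, `s2` is rejected because the annotators disagree, so two of three sentences are accepted.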

In [547]:
#db.referenceannotation.insert_many(result.to_dict('records'))

<pymongo.results.InsertManyResult at 0x7fadc3f13f40>

---

### Merging and Evaluating the Toloka and Comparison Annotations

In [516]:
toloka_res = toloka_res.drop(columns=['preprocessed_word','original_word_main','date_publish','source_domain','sentence_sample','main_start_pos','main_end_pos','match_id'])

In [517]:
merged = pd.merge(result,toloka_res,how='left',left_on='id',right_on='preprocessed_id',suffixes=('','_tlk'))

In [518]:
merged['polarity'].value_counts()

neutral     135
positive     45
negative     31
Name: polarity, dtype: int64

In [519]:
# Keep only rows that also have a Toloka label; .copy() allows adding columns safely
both_accepted = merged[merged['polarity'].notna()].copy()

In [520]:
def get_agreements(row):
    # True when the reference label matches the Toloka label
    return row['sentiment'] == row['polarity']

both_accepted['agreement'] = both_accepted.apply(get_agreements, axis=1)

In [521]:
both_accepted['agreement'].value_counts(normalize=True)

True     0.7109
False    0.2891
Name: agreement, dtype: float64

In [494]:
both_accepted[both_accepted['agreement'] == True]['sentiment'].value_counts(normalize=True)

neutral     0.713333
positive    0.186667
negative    0.100000
Name: sentiment, dtype: float64

In [522]:
# .copy() avoids pandas' SettingWithCopyWarning when adding the column below
disagreements = both_accepted[both_accepted['agreement'] == False].copy()

In [525]:
def find_opposite_classes(row):
    # True when one annotation is positive and the other is negative
    return {row['sentiment'], row['polarity']} == {'positive', 'negative'}

disagreements['opposite'] = disagreements.apply(find_opposite_classes, axis=1)


In [528]:
len(disagreements[disagreements['opposite'] == True]) / len(both_accepted)

0.014218009478672985
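The agreement rate and the share of opposite-class disagreements can both be read from a single table of label pairs. A minimal sketch with invented pairs (reference `sentiment` first, Toloka `polarity` second):

```python
from collections import Counter

# Hypothetical (sentiment, polarity) pairs, invented for illustration
pairs = [
    ("neutral", "neutral"),
    ("positive", "neutral"),
    ("negative", "positive"),
    ("positive", "positive"),
    ("neutral", "negative"),
]

# Count how often each combination of labels occurs
confusion = Counter(pairs)

# Agreement: both labels are identical
agreement = sum(n for (s, p), n in confusion.items() if s == p) / len(pairs)

# Opposite: one label is positive and the other is negative
opposite = sum(
    n for (s, p), n in confusion.items() if {s, p} == {"positive", "negative"}
) / len(pairs)

print(agreement, opposite)  # 0.4 0.2
```

The `confusion` counter also shows which class pairs account for the remaining disagreements, for example neutral versus positive.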