# 3.4.4 Post-processing of the annotation results
This notebook post-processes the results from Toloka. The evaluation results are merged into the classes positive, neutral and negative, and rejected results are removed.

---

### Package and data import

In [624]:
import pandas as pd
import numpy as np
from pymongo import MongoClient
from irrCAC.raw import CAC

In [651]:
toloka_res = pd.read_csv('data/results_first_campaign_toloka.tsv', sep="\t")

---

### Preprocessing of the Toloka results

This section preprocesses the Toloka data from the first campaign.

In [626]:
def one_hot_encode(df, field):
    one_hot = pd.get_dummies(df[field])
    # Drop original column as it is now encoded
    df = df.drop(field,axis = 1)
    # Join the encoded df
    return df.join(one_hot) 
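
To illustrate what the helper produces, here is a minimal sketch on hypothetical toy data (the two sentences and their scores are invented for the example). The helper is repeated so that the block runs on its own.

```python
import pandas as pd

def one_hot_encode(df, field):
    # Same logic as the helper above, repeated so the example is self-contained
    one_hot = pd.get_dummies(df[field])
    return df.drop(columns=field).join(one_hot)

# Hypothetical toy data: three ratings for two sentences
toy = pd.DataFrame({
    "INPUT:sentence": ["a", "a", "b"],
    "OUTPUT:senti_score": [2.0, -1.0, 2.0],
})
encoded = one_hot_encode(toy, "OUTPUT:senti_score")

# One indicator column per distinct score; summing per sentence yields vote counts
counts = encoded.groupby("INPUT:sentence").sum()
print(counts)
```

Summing the indicator columns per sentence is what later turns individual answers into vote counts per score.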

In [627]:
def preprocess_toloka_results(df):
    # drop unnecessary columns
    df = df.drop(columns=[
        'ASSIGNMENT:submitted', 'ASSIGNMENT:started', 'ASSIGNMENT:link',
        'HINT:text', 'GOLDEN:senti_score', 'ASSIGNMENT:task_id',
        'ASSIGNMENT:assignment_id', 'ASSIGNMENT:worker_id',
        'ACCEPT:verdict', 'ACCEPT:comment', 'HINT:default_language',
    ])
    # keep only rows with a score and convert the scores to int
    df = df.dropna(subset=['OUTPUT:senti_score'])
    df['OUTPUT:senti_score'] = df['OUTPUT:senti_score'].astype(int)
    # delete score if result was rejected (NaN turns the column into float)
    df['OUTPUT:senti_score'] = df['OUTPUT:senti_score'].where(df['ASSIGNMENT:status'] != 'REJECTED')
    # one hot encode scores and rejection status
    df = one_hot_encode(df, 'OUTPUT:senti_score')
    df = one_hot_encode(df, 'ASSIGNMENT:status')
    # group results by sentence
    grp = df.groupby(by=['INPUT:sentence'], as_index=True).sum()
    # group answers of light positive/negative to pos/neg
    grp['pos'] = grp[1.0] + grp[2.0]
    grp['neg'] = grp[-1.0] + grp[-2.0]
    grp['neu'] = grp[0.0]
    return grp

In [628]:
grp = preprocess_toloka_results(toloka_res)
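
The function above indexes the score columns directly (`grp[-2.0]` and so on), which would raise a `KeyError` if no annotator ever chose one of the five scores. The following is a sketch of one possible guard, not part of the original pipeline; the vote counts are hypothetical.

```python
import pandas as pd

# Hypothetical per-sentence vote counts; note that no annotator chose -2 here
votes = pd.DataFrame(
    {-1.0: [1, 0], 0.0: [2, 1], 1.0: [1, 2], 2.0: [1, 2]},
    index=["sentence a", "sentence b"],
)

# Make sure all five score columns exist before adding them up
scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
votes = votes.reindex(columns=scores, fill_value=0)

# Collapse the five-point scale into the three polarity classes
collapsed = pd.DataFrame({
    "pos": votes[1.0] + votes[2.0],
    "neu": votes[0.0],
    "neg": votes[-1.0] + votes[-2.0],
})
print(collapsed)
```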

---

### Calculating Gwet's AC2 coefficient

This section sets up the calculation of Gwet's coefficient.
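
The calculation in this section passes `weights='ordinal'`, so a disagreement between neighbouring classes is penalised less than one between positive and negative. The sketch below shows how such ordinal weights can be derived. It assumes Gwet's definition of ordinal weights and is not taken from the `irrCAC` source; the function name `ordinal_weights` is hypothetical.

```python
from math import comb

def ordinal_weights(q):
    """Ordinal agreement weights for q ordered categories (assumed definition)."""
    m_max = comb(q, 2)  # penalty for the two most distant categories
    weights = []
    for k in range(q):
        row = []
        for l in range(q):
            if k == l:
                row.append(1.0)  # identical categories: full agreement
            else:
                # the penalty grows with the distance between the categories
                m_kl = comb(abs(k - l) + 1, 2)
                row.append(1.0 - m_kl / m_max)
        weights.append(row)
    return weights

for row in ordinal_weights(3):
    print([round(w, 4) for w in row])
```

For three categories this gives 2/3 for neighbouring classes and 0 for the two extremes, which agrees with the `weights` array printed in the results of this notebook.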

In [631]:
def create_evaluation_df(df):
    # Build one row per sentence with one column per rater (5 ratings each)
    ratings_list = []
    for index, row in df.iterrows():
        res_list = (
            [-1] * int(row['neg'])             # negative votes
            + [0] * int(row['neu'])            # neutral votes
            + [1] * int(row['pos'])            # positive votes
            + [np.nan] * int(row['REJECTED'])  # rejected answers count as missing
        )
        # keep only sentences with exactly five answers
        if len(res_list) == 5:
            ratings_list.append(res_list)
    return pd.DataFrame(ratings_list, columns=['1','2','3','4','5'])

In [633]:
# calculation of AC2
cac_obj = CAC(create_evaluation_df(grp),weights='ordinal')
cac_obj.gwet()

{'est': {'coefficient_value': 0.24045,
  'coefficient_name': 'AC2',
  'confidence_interval': (0.21679, 0.26411),
  'p_value': 0.0,
  'z': 19.92578,
  'se': 0.01207,
  'pa': 0.70465,
  'pe': 0.61115},
 'weights': array([[1.        , 0.66666667, 0.        ],
        [0.66666667, 1.        , 0.66666667],
        [0.        , 0.66666667, 1.        ]]),
 'categories': [-1.0, 0.0, 1.0]}

The coefficient is reported as 0.24. Note that all rejected values are already represented as NaN at this point.
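
As a plausibility check, the coefficient can be recomputed from the two agreement figures in the output above. This is a minimal sketch that assumes `pa` is the weighted percent agreement and `pe` the weighted percent chance agreement, combined in the usual chance-corrected form `(pa - pe) / (1 - pe)`.

```python
# Values copied from the `est` dictionary printed above
pa = 0.70465  # assumed: weighted percent agreement
pe = 0.61115  # assumed: weighted percent chance agreement

# Chance-corrected agreement: share of the achievable agreement beyond chance
ac2 = (pa - pe) / (1 - pe)
print(round(ac2, 5))
```

The result agrees with the reported `coefficient_value` up to rounding of the inputs.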

---

### Acceptance of valid results

In this section, all results with 4 or more identical answers are added to a set of valid answers.
In addition, results with 3 identical answers are considered valid if there are no answers for the opposing class (that is, positive vs. negative).


In [634]:
def accept_high_mijority(df):
    # .copy() yields independent frames, so adding a column below does not
    # trigger pandas' SettingWithCopyWarning
    accepted_neg = df[df['neg'] > 3].copy()
    accepted_neg['polarity'] = 'negative'

    accepted_neu = df[df['neu'] > 3].copy()
    accepted_neu['polarity'] = 'neutral'

    accepted_pos = df[df['pos'] > 3].copy()
    accepted_pos['polarity'] = 'positive'
    return pd.concat([accepted_pos, accepted_neu, accepted_neg])
     

In [635]:
def assign_tolerance_voting_polarity(row):
    # exactly three identical answers and no answer for the opposing class
    if row['pos'] == 3 and row['neg'] == 0:
        return 'positive'
    elif row['neg'] == 3 and row['pos'] == 0:
        return 'negative'
    # neutral is accepted if at least one of the two polar classes has no votes
    elif row['neu'] == 3 and (row['neg'] == 0 or row['pos'] == 0):
        return 'neutral'
    else:
        return None

def accept_medium_majority(df):
    # work on a copy so the caller's frame is not modified
    df = df.copy()
    df['polarity'] = df.apply(assign_tolerance_voting_polarity, axis=1)
    return df[~df['polarity'].isna()]
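
The two acceptance rules can also be read as a single decision per sentence. The following sketch restates them on plain vote counts; the function name `accepted_label` and the example counts are hypothetical and only serve as illustration.

```python
def accepted_label(pos, neu, neg):
    """Return the accepted polarity for one sentence, or None if no rule applies."""
    counts = {"positive": pos, "neutral": neu, "negative": neg}
    # Rule 1: at least four identical answers
    for label, n in counts.items():
        if n >= 4:
            return label
    # Rule 2: exactly three identical answers and no vote for the opposing class
    if pos == 3 and neg == 0:
        return "positive"
    if neg == 3 and pos == 0:
        return "negative"
    if neu == 3 and (pos == 0 or neg == 0):
        return "neutral"
    return None

print(accepted_label(3, 2, 0))  # three positive, two neutral
print(accepted_label(3, 1, 1))  # a negative vote blocks acceptance
print(accepted_label(1, 3, 1))  # neutral with votes on both sides
```

Only the first call yields a label; the other two sentences stay unaccepted because an opposing vote is present.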

In [636]:
# High Majority
accepted = accept_high_mijority(grp)
accepted['polarity'].value_counts(normalize=True)

neutral     0.855639
positive    0.112782
negative    0.031579
Name: polarity, dtype: float64

In [637]:
# Medium majority with tolerance
accepted_tolerance = accept_medium_majority(grp)
accepted_tolerance['polarity'].value_counts(normalize=True)

neutral     0.580093
positive    0.301711
negative    0.118196
Name: polarity, dtype: float64

In [638]:
accepted = pd.concat([accepted, accepted_tolerance])
accepted['polarity'].value_counts(normalize=True)

neutral     0.720183
positive    0.205657
negative    0.074159
Name: polarity, dtype: float64

In [639]:
# Calculation of Acceptance rate
len(accepted) / len(grp)

0.16473551637279596

In [640]:
# Calculation of Gwet's AC2
cac_obj = CAC(create_evaluation_df(accepted),weights='ordinal')
cac_obj.gwet()

{'est': {'coefficient_value': 0.78377,
  'coefficient_name': 'AC2',
  'confidence_interval': (0.77111, 0.79644),
  'p_value': 0.0,
  'z': 121.41009,
  'se': 0.00646,
  'pa': 0.88656,
  'pe': 0.47536},
 'weights': array([[1.        , 0.66666667, 0.        ],
        [0.66666667, 1.        , 0.66666667],
        [0.        , 0.66666667, 1.        ]]),
 'categories': [-1.0, 0.0, 1.0]}

---

### Preparing the second Toloka campaign

This section identifies all sentences for a second Toloka campaign. These sentences have more than 2 rejected answers and therefore cannot be marked as valid because of the faulty answers.
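
A minimal sketch of this selection on hypothetical toy data follows. It uses `reset_index()` to turn the sentence index into a regular column on a new frame, which is one way to avoid assigning to a slice; the frame `toy_grp` and its values are invented for the example.

```python
import pandas as pd

# Hypothetical grouped results with the sentence as index
toy_grp = pd.DataFrame(
    {"REJECTED": [0, 3, 5], "APPROVED": [5, 2, 0]},
    index=pd.Index(["s1", "s2", "s3"], name="INPUT:sentence"),
)

# Select sentences with more than 2 rejected answers; reset_index() returns
# a new frame in which the sentence is an ordinary column
redo = toy_grp[toy_grp["REJECTED"] > 2].reset_index()

# Without a path, to_csv returns the TSV content as a string
tsv = redo[["INPUT:sentence"]].to_csv(sep="\t", index=False)
print(tsv)
```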

In [641]:
df = grp[grp['REJECTED'] > 2].copy()  # copy, because a column is added in the next cell
len(df)

5000

In [642]:
df['INPUT:sentence'] = df.index
#df.to_csv('toloka.tsv', sep="\t", index=False)

---
### Import and preparation of the results of the second Toloka campaign

In this section, the results of the second Toloka campaign are imported and combined with the results of the first campaign (only sentences with at most 2 rejected answers).

In [643]:
toloka_res2 = pd.read_csv('data/results_second_campaign_toloka.tsv', sep="\t")
grp2 = preprocess_toloka_results(toloka_res2)

In [644]:
final_grp = pd.concat([grp[grp['REJECTED'] <= 2],grp2])
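
Because the second campaign re-annotates exactly the sentences that are filtered out here, each sentence should appear only once in the combined frame. The sketch below shows one way to check this with `verify_integrity=True`; the two toy frames are hypothetical stand-ins for `grp` and `grp2`.

```python
import pandas as pd

# Hypothetical stand-ins for the grouped results of both campaigns
first = pd.DataFrame({"REJECTED": [0, 4]}, index=["s1", "s2"])
second = pd.DataFrame({"REJECTED": [1]}, index=["s2"])

# Keep only first-campaign sentences with at most 2 rejected answers
kept = first[first["REJECTED"] <= 2]

# verify_integrity=True raises a ValueError if a sentence occurs in both parts
combined = pd.concat([kept, second], verify_integrity=True)
print(combined)
```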

In [645]:
cac_obj = CAC(create_evaluation_df(final_grp),weights='ordinal')
cac_obj.gwet()

{'est': {'coefficient_value': 0.25151,
  'coefficient_name': 'AC2',
  'confidence_interval': (0.2394, 0.26363),
  'p_value': 0.0,
  'z': 40.69262,
  'se': 0.00618,
  'pa': 0.70893,
  'pe': 0.61112},
 'weights': array([[1.        , 0.66666667, 0.        ],
        [0.66666667, 1.        , 0.66666667],
        [0.        , 0.66666667, 1.        ]]),
 'categories': [-1.0, 0.0, 1.0]}

---

### Acceptance of valid results

In this section, all results with 4 or more identical answers are added to a set of valid answers.
In addition, results with 3 identical answers are considered valid if there are no answers for the opposing class (that is, positive vs. negative).



In [646]:
# High Majority
combined_accepted = accept_high_mijority(final_grp)
combined_accepted['polarity'].value_counts(normalize=True)

neutral     0.642391
positive    0.220652
negative    0.136957
Name: polarity, dtype: float64

In [647]:
# Medium majority with tolerance
combined_accepted_tolerance = accept_medium_majority(final_grp)
combined_accepted_tolerance['polarity'].value_counts(normalize=True)

neutral     0.549693
positive    0.285861
negative    0.164447
Name: polarity, dtype: float64

In [648]:
combined_accepted = pd.concat([combined_accepted, combined_accepted_tolerance])
combined_accepted['polarity'].value_counts(normalize=True)

neutral     0.594673
positive    0.254219
negative    0.151108
Name: polarity, dtype: float64

In [649]:
# Calculation of Acceptance rate
len(combined_accepted) / len(final_grp)

0.47758186397984886

In [650]:
# Calculation of Gwet's AC2
cac_obj = CAC(create_evaluation_df(combined_accepted),weights='ordinal')
cac_obj.gwet()

{'est': {'coefficient_value': 0.6546,
  'coefficient_name': 'AC2',
  'confidence_interval': (0.64526, 0.66394),
  'p_value': 0.0,
  'z': 137.45034,
  'se': 0.00476,
  'pa': 0.84641,
  'pe': 0.55532},
 'weights': array([[1.        , 0.66666667, 0.        ],
        [0.66666667, 1.        , 0.66666667],
        [0.        , 0.66666667, 1.        ]]),
 'categories': [-1.0, 0.0, 1.0]}

---

### Adding metadata

In this section, metadata is added and the data set is stored in the database.

In [523]:
combined_accepted['sentence'] = combined_accepted.index

In [524]:
def get_client():
    return MongoClient('mongodb://{}:{}@{}:{}'.format(
        'root',
        'root',
        '0.0.0.0',
        '27017',
    ))

def get_database():
    client = get_client()
    return client.get_database('masterthesis-goerner')

db = get_database()
cursor = db['sample'].find({})
sample =  pd.DataFrame(list(cursor))
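
The connection details above are hard-coded. As a sketch of an alternative, the URI could be assembled from environment variables. The variable names `MONGO_USER`, `MONGO_PASSWORD`, `MONGO_HOST` and `MONGO_PORT` and the function `build_mongo_uri` are assumptions made for this example; the defaults mirror the values used above.

```python
import os
from urllib.parse import quote_plus

def build_mongo_uri():
    # Hypothetical environment variable names; defaults mirror the cell above
    user = quote_plus(os.environ.get("MONGO_USER", "root"))
    password = quote_plus(os.environ.get("MONGO_PASSWORD", "root"))
    host = os.environ.get("MONGO_HOST", "0.0.0.0")
    port = os.environ.get("MONGO_PORT", "27017")
    # quote_plus escapes characters such as '@' or ':' in the credentials
    return "mongodb://{}:{}@{}:{}".format(user, password, host, port)

# The result can be passed to the client, e.g. MongoClient(build_mongo_uri())
uri = build_mongo_uri()
```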

In [525]:
combined_accepted = pd.merge(combined_accepted,sample.drop_duplicates(['sentence_toloka']),how='left', left_on="INPUT:sentence", right_on="sentence_toloka",suffixes=('', '_sample'))
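
With a left join, accepted sentences without a matching entry in `sample` silently end up with empty metadata. The following sketch shows how the `indicator` and `validate` options of `pd.merge` can make this visible; the frames `accepted_toy` and `sample_toy` and the `source` column are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the accepted sentences and the `sample` collection
accepted_toy = pd.DataFrame(
    {"polarity": ["positive", "neutral"]},
    index=pd.Index(["s1", "s2"], name="INPUT:sentence"),
)
sample_toy = pd.DataFrame(
    {"sentence_toloka": ["s1", "s1"], "source": ["doc-1", "doc-1-dup"]}
)

merged = pd.merge(
    accepted_toy.reset_index(),
    sample_toy.drop_duplicates(["sentence_toloka"]),
    how="left",
    left_on="INPUT:sentence",
    right_on="sentence_toloka",
    validate="many_to_one",  # fail loudly if the right side still has duplicates
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Rows marked 'left_only' found no metadata in the sample
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched), "accepted sentence(s) without metadata")
```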


In [526]:
combined_accepted = combined_accepted.drop(columns=['sentence_toloka',-2.0,-1.0,0.0,1.0,2.0,'APPROVED','REJECTED','pos','neg','neu','sentence','_id','HINT:default_language'])

In [527]:
#db.annotationresults.insert_many(combined_accepted.to_dict('records'))

<pymongo.results.InsertManyResult at 0x7fd7c0f70340>