# Sampling the Data for Annotation

This notebook selects the data that will later be used for annotation.
It also prepares these data for use in the annotation tool.

---

### Package and Data Import

In [None]:
from pymongo import MongoClient
import numpy as np
import pandas as pd

In [None]:
def get_client():
    return MongoClient('mongodb://{}:{}@{}:{}'.format(
        'root',
        'root',
        '0.0.0.0',
        '27017',
    ))

def get_database():
    client = get_client()
    return client.get_database('masterthesis-goerner')

db = get_database()
cursor = db['preprocessed'].find({})
preprocessed = pd.DataFrame(list(cursor))
cursor_prsmpl = db['presampled'].find({})
presampled = pd.DataFrame(list(cursor_prsmpl))

In [None]:
preprocessed['date'] = pd.to_datetime(preprocessed['date_publish']).dt.date
# inner join on _id; columns that exist on both sides get the '_y' suffix on the presampled side
preprocessed = preprocessed.merge(presampled, on='_id', suffixes=('', '_y'))
preprocessed = preprocessed.drop(columns=['url_id_y', 'preprocessed_word_y'])
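
The merge step can be illustrated on toy data. This is a minimal sketch, assuming both collections share the columns `_id`, `url_id` and `preprocessed_word`; the frames `left` and `right` and their values are invented for illustration. Note that `merge` defaults to an inner join, so rows without a counterpart are dropped.

```python
import pandas as pd

# Toy stand-ins for the two collections (values are made up).
left = pd.DataFrame({
    "_id": [1, 2, 3],
    "url_id": ["a", "b", "c"],
    "preprocessed_word": ["merkel", "spd", "cdu"],
})
right = pd.DataFrame({
    "_id": [1, 2],
    "url_id": ["a", "b"],
    "preprocessed_word": ["merkel", "spd"],
    "matches": [["b"], []],
})

# Inner join on _id: overlapping columns from `right` receive the "_y" suffix.
merged = left.merge(right, on="_id", suffixes=("", "_y"))
# Drop the duplicated right-hand copies, keeping only the new information.
merged = merged.drop(columns=["url_id_y", "preprocessed_word_y"])

print(merged.columns.tolist())
print(len(merged))  # row 3 has no counterpart in `right` and is gone
```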

---

### Analysis of the Sentiment Distribution

In this section, the previously computed sentiment scores are converted into classes, and their distribution in the dataset is analyzed.
The share of sentences with neutral sentiment is then adjusted.

In [None]:
def categorize_sen(polarity_score):
    if polarity_score > 0.2:
        return 'positive'
    elif polarity_score < -0.2:
        return 'negative'
    return 'neutral'

In [None]:
preprocessed['sentiment'] = preprocessed['polarity'].map(categorize_sen)

In [None]:
preprocessed['sentiment'].value_counts(normalize=True)
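
To make the thresholds concrete, the following sketch applies the same rule to a few invented scores and compares it with a vectorized variant based on `np.select`. The vectorized form is an alternative shown for illustration, not what the notebook uses. Scores of exactly 0.2 or -0.2 fall into the neutral class, because both comparisons are strict.

```python
import numpy as np
import pandas as pd

def categorize_sen(polarity_score):
    # Same thresholds as in the notebook cell above.
    if polarity_score > 0.2:
        return "positive"
    elif polarity_score < -0.2:
        return "negative"
    return "neutral"

# Invented polarity scores, including both boundary values.
scores = pd.Series([-0.9, -0.2, 0.0, 0.2, 0.75])

by_map = scores.map(categorize_sen)

# Vectorized equivalent: the first matching condition wins, otherwise "neutral".
by_select = pd.Series(
    np.select([scores > 0.2, scores < -0.2], ["positive", "negative"], default="neutral"),
    index=scores.index,
)

print(by_map.tolist())
```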

The neutral class is considerably larger than all other classes. Therefore, a randomly chosen half of the sentences with neutral sentiment is removed.

In [None]:
neutrals = preprocessed[preprocessed['sentiment'] == 'neutral']
neutrals_smpl = neutrals.sample(frac=0.5, random_state=1)
preprocessed = preprocessed[~preprocessed.index.isin(list(neutrals_smpl.index))]

In [None]:
preprocessed['sentiment'].value_counts(normalize=True)
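
The effect of this downsampling can be sketched on a small invented frame. The class sizes below are arbitrary and only serve to show how halving the neutral class changes the relative frequencies.

```python
import pandas as pd

# Invented toy data: 8 neutral, 2 positive, 2 negative sentences.
df = pd.DataFrame({"sentiment": ["neutral"] * 8 + ["positive"] * 2 + ["negative"] * 2})
before = df["sentiment"].value_counts(normalize=True)

# Pick half of the neutral rows at random and drop them by index label.
neutral_idx = df[df["sentiment"] == "neutral"].sample(frac=0.5, random_state=1).index
balanced = df.drop(neutral_idx)
after = balanced["sentiment"].value_counts(normalize=True)

print(before)
print(after)
```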

---

### A1: Selection of Sentence Pairs with the Same Sentiment Target

In this step, 1000 sentence pairs are sampled that address the same sentiment target on the same publication date. Only pairs whose sentences come from different news providers are used. The sampled data are removed from the original data pool to avoid selecting a sentence more than once.

In [None]:
# extract all with match
matches = preprocessed[preprocessed.matches.str.len() > 0]
# drop all entries from same article to have more variance
matches = matches.drop_duplicates(subset='url_id')
# sample all extracts, take 1k samples to have 2k results later
sample = matches.sample(frac=0.003645829536, random_state=1)

In [None]:
# extract matches from original df
def extract_matches_from_df(df, row):
    for match in row['matches']:
        match_df = df[df['url_id'] == match]
        match_df = match_df[match_df['preprocessed_word'] == row['preprocessed_word']]
        if len(match_df) > 0:
            mtch = match_df['_id'].values
            return str(mtch[0])
    return None
sample['match_id'] = sample.apply(lambda x: extract_matches_from_df(preprocessed, x), axis=1)

In [None]:
# add matches to sample df
def extract_matches(df, match_id, original_id):
    # copy the slice so that assigning match_id does not write to a view of df
    match = df[df['_id'].astype(str) == match_id].copy()
    match['match_id'] = str(original_id)
    return match

# extract match_ids from original dataframe and append them to sample
# (rows without a match have match_id None and yield an empty frame)
match_counterparts = [
    extract_matches(preprocessed, match_id, original_id)
    for match_id, original_id in zip(sample['match_id'], sample['_id'])
]
sample = pd.concat([sample] + match_counterparts)

In [None]:
# remove samples from preprocessed
preprocessed = preprocessed[~preprocessed['_id'].astype(str).isin(sample['_id'].astype(str))]
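
The pairing criteria of this section (same target, same date, different provider) can be sketched with a self-merge on toy data. This is only an illustration of the criteria: the notebook relies on a precomputed `matches` column whose construction is not shown here, and the rows below are invented.

```python
import pandas as pd

# Invented sentences; column names follow the notebook.
sentences = pd.DataFrame({
    "_id": [1, 2, 3, 4],
    "preprocessed_word": ["merkel", "merkel", "merkel", "spd"],
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01"],
    "source_domain": ["a.de", "b.de", "b.de", "a.de"],
})

# Self-merge on target and date yields every combination of sentences that share both.
pairs = sentences.merge(sentences, on=["preprocessed_word", "date"], suffixes=("", "_match"))

# Keep pairs from different providers; `_id < _id_match` removes self-pairs and mirrored duplicates.
pairs = pairs[
    (pairs["source_domain"] != pairs["source_domain_match"])
    & (pairs["_id"] < pairs["_id_match"])
]

print(pairs[["_id", "_id_match", "preprocessed_word", "date"]])
```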

---

### A2: Selection of Sentences for All Targets of an Article

In this section, 2000 sentences are sampled. The sentences are drawn article by article, with at most three sentences per article; articles with fewer than three sentences are excluded.

In [None]:
# remove sentences where not all three sentences are present for an article
# (boolean indexing on value_counts() avoids relying on the column name produced by reset_index(), which differs between pandas versions)
sentences_per_article = preprocessed['url_id'].value_counts()
less_than_3_occurrences_per_article = sentences_per_article[sentences_per_article < 3].index
cleaned = preprocessed[~preprocessed['url_id'].isin(less_than_3_occurrences_per_article)]

In [None]:
# clear duplicates in url_ids (to have only one of 3 per article) and sample 1/3 of 2000 Articles
cleaned = cleaned.drop_duplicates(subset='url_id')
cleaned_sampl = cleaned.sample(frac=0.003003178777, random_state=1)

# retrieve all articles with the selected url ids from original
res = preprocessed[preprocessed['url_id'].isin(list(cleaned_sampl.url_id))]

In [None]:
# Add results to sample data, remove selected articles from data pool
sample = pd.concat([sample, res])
preprocessed = preprocessed[~preprocessed['_id'].astype(str).isin(sample['_id'].astype(str))]
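
The article-level selection can be sketched as follows, using `groupby(...).transform("size")` as one possible way to keep only articles with at least three sentences. The toy frame and the choice of `n=1` are assumptions made for illustration.

```python
import pandas as pd

# Invented pool: articles "a" and "c" have three sentences, "b" has only two.
pool = pd.DataFrame({
    "url_id": ["a", "a", "a", "b", "b", "c", "c", "c"],
    "sentence": [f"s{i}" for i in range(8)],
})

# Number of sentences per article, aligned with the rows of `pool`.
per_article = pool.groupby("url_id")["url_id"].transform("size")
complete = pool[per_article >= 3]

# Sample at article level (one row per url_id), then fetch all sentences of the chosen articles.
chosen_articles = complete["url_id"].drop_duplicates().sample(n=1, random_state=1)
selected = pool[pool["url_id"].isin(chosen_articles)]

print(selected)
```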

---

### A3: Selection of Random Sentences

In this step, additional randomly chosen sentences are added to the sample.

In [None]:
preprocessed_a3 = preprocessed.sample(frac=0.003464590396, random_state=1)
sample = pd.concat([sample, preprocessed_a3])
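
The sampling cells in this notebook pass long decimal fractions to `sample`. Assuming these fractions were derived as target count divided by pool size (the notebook does not state this), the relationship can be sketched as below. The pool size and the target of 25 rows are invented.

```python
import pandas as pd

# Invented pool of 1000 sentences.
pool = pd.DataFrame({"sentence": [f"s{i}" for i in range(1000)]})
target = 25  # hypothetical number of rows wanted

# Requesting an absolute number of rows ...
by_n = pool.sample(n=target, random_state=1)
# ... versus expressing the same amount as a fraction of the pool.
by_frac = pool.sample(frac=target / len(pool), random_state=1)

print(len(by_n), len(by_frac))
```

Passing `n` directly keeps the intended sample size readable and independent of the current pool size.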

---

### Preparing the Data for Upload to Toloka

The data are now transformed into a Toloka-specific format. The keywords are marked in the text, and the data are finally exported as a .tsv file. The samples are also stored in the database to keep a reference between the text in Toloka and the locally used ID.

In [None]:
sample['preprocessed_id'] = sample['_id']
sample = sample.drop(columns=['_id'])


In [None]:
def add_marks(text, mentions):
    # Only 'main' mentions are marked: 'further mentions' cause problems when start and end positions overlap.
    mains = [mention for mention in mentions if mention['type'] == 'main']
    # Insert from right to left so that the positions of earlier mentions stay valid.
    for mention in sorted(mains, key=lambda m: int(m['start_pos']), reverse=True):
        start, end = int(mention['start_pos']), int(mention['end_pos'])
        # Use the 𝟇 sign instead of * directly: articles sometimes contain *, which breaks the .md layout.
        # The *s cannot be removed beforehand either, since start and end positions would no longer fit,
        # so an uncommon sign is inserted and replaced later.
        text = text[:start] + '𝟇𝟇' + text[start:end] + '𝟇𝟇' + text[end:]
    # NOTE: '\"' is the same string as '"', so this replace is a no-op as written.
    text = text.replace('"', '\"')
    text = text.replace('*', '')
    text = text.replace('𝟇', '*')
    return text

sample['sentence_toloka'] = sample.apply(lambda x: add_marks(x.sentence, x.mentions), axis=1)
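
The marking logic can be tried out on a single invented sentence. The sketch below is a simplified stand-in with the hypothetical name `mark_main_mentions`; it assumes that `end_pos` is exclusive (as the slicing in the notebook suggests) and uses a private-use character as the temporary placeholder.

```python
def mark_main_mentions(text, mentions, placeholder="\ue000"):
    """Wrap every 'main' mention in ** and strip pre-existing asterisks."""
    mains = [m for m in mentions if m["type"] == "main"]
    # Work from the rightmost mention so that earlier offsets stay valid.
    for m in sorted(mains, key=lambda m: int(m["start_pos"]), reverse=True):
        start, end = int(m["start_pos"]), int(m["end_pos"])
        text = text[:start] + placeholder * 2 + text[start:end] + placeholder * 2 + text[end:]
    # Remove asterisks from the article text, then turn placeholders into real markers.
    return text.replace("*", "").replace(placeholder, "*")

# Invented example: one main mention and one further mention that stays unmarked.
sentence = "Merkel met Macron* in Berlin."
mentions = [
    {"type": "main", "start_pos": 0, "end_pos": 6},
    {"type": "further", "start_pos": 11, "end_pos": 17},
]
marked = mark_main_mentions(sentence, mentions)
print(marked)  # **Merkel** met Macron in Berlin.
```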

In [None]:
# drop a faulty record (index label 5456)
sample = sample.drop([5456])

In [None]:
# write to DB
db.sample.insert_many(sample.to_dict('records'))

In [None]:
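# '_id' was already dropped from the frame above; errors='ignore' skips columns that are no longer present
sample = sample.drop(columns=['_id', 'sentiment', 'polarity', 'matches', 'date', 'tfidf_score'], errors='ignore')
sample['INPUT:sentence'] = sample['sentence_toloka']
sample.to_csv('toloka.tsv', sep='\t', index=False)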
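
The TSV export can be checked with a small round trip. This sketch writes to an in-memory buffer instead of `toloka.tsv` and uses two invented task rows; the `INPUT:` prefix follows the column naming used in the cell above.

```python
import io

import pandas as pd

# Invented tasks; the first sentence contains double quotes.
tasks = pd.DataFrame({
    "INPUT:sentence": ['**Merkel** said "no" today.', "The **SPD** agreed."],
})

# Write tab-separated text without the index column.
buffer = io.StringIO()
tasks.to_csv(buffer, sep="\t", index=False)
tsv_text = buffer.getvalue()

# Reading the text back should reproduce the original sentences.
restored = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(tsv_text)
```

With the default quoting, `to_csv` quotes fields that contain the quote character and doubles the inner quotes, so no manual escaping is needed for the round trip to succeed.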

---

### Preparing the Data for Upload to Doccano

The data are now transformed into a Doccano-specific format.

In [None]:
sample['text'] = sample['sentence']
sample['entities'] = sample.apply(lambda x: [[int(x.main_start_pos), int(x.main_end_pos), x.original_word_main]], axis=1)
# errors='ignore' skips columns that were already dropped in earlier cells (e.g. '_id')
sample = sample.drop(columns=['_id', 'date_publish', 'source_domain', 'url_id', 'preprocessed_word', 'original_word_main', 'sentence', 'main_start_pos', 'main_end_pos', 'mentions', 'match_id', 'preprocessed_id', 'INPUT:sentence'], errors='ignore')

In [None]:
sample.head(250).to_json('doccano.jsonl',orient='records', lines=True)
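
The resulting JSONL layout can be inspected on a single invented row. The sketch mirrors the notebook in using the surface form of the target as the third element of each entity triple, and it builds the `entities` column with a list comprehension rather than `apply`, which is a stylistic substitution.

```python
import json

import pandas as pd

# Invented input row; column names follow the notebook.
rows = pd.DataFrame({
    "sentence": ["Merkel met Macron in Berlin."],
    "main_start_pos": [0],
    "main_end_pos": [6],
    "original_word_main": ["Merkel"],
})

# One [start, end, label] triple per sentence, wrapped in a list of entities.
entities = [
    [[int(start), int(end), word]]
    for start, end, word in zip(
        rows["main_start_pos"], rows["main_end_pos"], rows["original_word_main"]
    )
]
export = pd.DataFrame({"text": rows["sentence"], "entities": entities})

# With no path given, to_json returns the JSON Lines text as a string.
jsonl = export.to_json(orient="records", lines=True)
first = json.loads(jsonl.splitlines()[0])
print(first)
```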
