## Purpose
This notebook cleans the annotation data exported from elinor so that it can be analysed further.
Observed problems in the annotation data:
- some fields were never imported into elinor but should be part of the analysis
- the id field was exported incorrectly
- some paragraphs were annotated multiple times and need to be merged
- some annotations still contain the old labels
- some specific labels were used in combination with the generic "domestic violence" label

For all files:
- extract the annotations and the metadata columns
- calculate similarity (Jaccard and Dice)
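
The similarity step relies on `annotations.calculate_similarity`, whose implementation is not shown in this notebook. As a rough illustration of what the Jaccard and Dice scores measure, the sketch below compares two label sets. The function names and the example labels are made up for illustration and are not taken from the `scripts` module.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty sets count as identical
    return len(a & b) / len(a | b)


def dice(a, b):
    """Dice coefficient: twice the intersection divided by the sum of both set sizes."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # same convention as above
    return 2 * len(a & b) / (len(a) + len(b))


# hypothetical labels from two annotators for the same paragraph
labels_1 = {"physical violence", "threat"}
labels_2 = {"physical violence", "stalking"}

print(jaccard(labels_1, labels_2))  # one shared label out of three distinct labels
print(dice(labels_1, labels_2))     # 2 * 1 / (2 + 2)
```

Dice weights the shared labels more heavily than Jaccard, so for the same pair of sets the Dice score is never lower than the Jaccard score.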

### Output:
- one file containing all annotations


### Imports & Data

In [54]:
import os
import re
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
from tqdm import tqdm
from datetime import datetime
from ast import literal_eval
import plotly.graph_objects as go
import plotly.express as px
import json

In [55]:
# connect with google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# change cwd
%cd drive/MyDrive/Work/Frontline/data
#%cd /content/drive/MyDrive/data/

In [57]:
from scripts import annotations

In [58]:
# dataset containing all relevant articles (but only one paragraph each!)
data_all=pd.read_csv("elinor/annotation_test_05_18.csv")

In [59]:
# list of dfs with all annotated datasets
dfs={}
for doc in os.listdir("annotated"):
  if doc.startswith("annotations"):
    #read json data (the with-block closes the file again)
    with open("annotated/"+doc, encoding="utf-8") as f:
      json_data=json.load(f)
    #convert to dataframe
    data=pd.DataFrame(json_data["documents"])
    #for now: filter out paragraphs that have not been annotated
    #(.copy() so that new columns are added to an independent frame, not a view)
    data=data[data["annotations"].apply(len)>0].copy()
    data.loc[:,"file"]=doc
    dfs[doc]=data

len(dfs)

5

In [60]:
final_data=pd.DataFrame(columns=["artikel_id","text","name", "datum","ressort","annotations","attributes_flat","file","artikel_order" ])

### annotations_05_18
Task:
- extract ressort, datum, name
- extract annotations
- extract artikel id

Note:
- this file does not have an article order (artikel_order)

In [62]:
data=dfs["annotations_05_18.json"]

In [63]:
# copy the article metadata out of the attributes_flat dict into separate columns
data["ressort"]=data.apply(lambda x: x.attributes_flat["ressort"],axis=1)
data["datum"]=data.apply(lambda x: x.attributes_flat["datum"],axis=1)
data["name"]=data.apply(lambda x: x.attributes_flat["name"],axis=1)
data["artikel_id"]=data.apply(lambda x: x.attributes_flat["artikel_id"],axis=1)
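
The same column extraction is repeated for every file in this notebook. A minimal sketch of how it could be wrapped in one helper, assuming `attributes_flat` holds one dict per row. `extract_attributes` is a hypothetical name that is not part of the `scripts` module, and the toy frame is made up.

```python
import pandas as pd


def extract_attributes(df, keys):
    """Copy the given keys out of the attributes_flat dicts into their own columns."""
    df = df.copy()  # leave the caller's frame untouched
    for key in keys:
        # .get returns None instead of raising a KeyError when a key is missing
        df[key] = df["attributes_flat"].apply(lambda d, k=key: d.get(k))
    return df


# made-up rows; the second one has no "ressort" key
toy = pd.DataFrame({
    "text": ["first paragraph", "second paragraph"],
    "attributes_flat": [
        {"ressort": "Politik", "datum": "2022-05-18", "name": "A", "artikel_id": 1},
        {"datum": "2022-05-19", "name": "B", "artikel_id": 2},
    ],
})

toy = extract_attributes(toy, ["ressort", "datum", "name", "artikel_id"])
print(toy[["ressort", "datum", "name", "artikel_id"]])
```

Unlike the direct `x.attributes_flat["ressort"]` lookup, this variant tolerates missing keys, which may or may not be the desired behaviour depending on whether a missing field should stop the run.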

In [64]:
# extract annotations
data.loc[:,"annotations"]=data.loc[:,"annotations"].apply(annotations.extract_annotations)

In [65]:
# calculate similarity
data.loc[:,"jaccard"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity)
data.loc[:,"dice"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity,args=["dice"])

In [66]:
# drop the id column from the export; artikel_id is used instead
data=data.drop("id",axis=1)

In [67]:
final_data=pd.concat([final_data,data])

### annotations_06_09_part1

Task:
- extract datum, artikel_id, ressort, name, artikel_order

Note:
- one third of the entries have no ressort
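
A share like this can be checked directly on the extracted column. The sketch below is only an illustration on a made-up frame. How missing values are encoded in the real export is not shown here, so it treats both `None` and empty strings as "no ressort".

```python
import pandas as pd

# made-up ressort values; two of the six rows have no ressort
toy = pd.DataFrame({"ressort": ["Politik", "", None, "Sport", "Lokales", "Kultur"]})

# treat real missing values and empty/whitespace-only strings alike
missing = toy["ressort"].fillna("").str.strip() == ""

share = missing.mean()  # mean of a boolean column is the fraction of True values
print(missing.sum(), "of", len(toy), "rows without ressort")
```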

In [68]:
data=dfs["annotations_06_09_part1.json"]

In [69]:
# copy the article metadata out of the attributes_flat dict into separate columns
data["ressort"]=data.apply(lambda x: x.attributes_flat["ressort"],axis=1)
data["datum"]=data.apply(lambda x: x.attributes_flat["datum"],axis=1)
data["name"]=data.apply(lambda x: x.attributes_flat["name"],axis=1)
data["artikel_id"]=data.apply(lambda x: x.attributes_flat["artikel_id"],axis=1)
data["artikel_order"]=data.apply(lambda x: x.attributes_flat["artikel_order"],axis=1)

In [70]:
# extract annotations
data.loc[:,"annotations"]=data.loc[:,"annotations"].apply(annotations.extract_annotations)

In [71]:
# calculate similarity
data.loc[:,"jaccard"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity)
data.loc[:,"dice"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity,args=["dice"])

In [72]:
# drop the id column from the export; artikel_id is used instead
data=data.drop("id",axis=1)

In [73]:
final_data=pd.concat([final_data,data])

### annotations_06_09_part2

Task:
- extract datum, artikel_id, ressort, name, artikel_order



In [74]:
data=dfs["annotations_06_09_part2.json"]

In [75]:
# copy the article metadata out of the attributes_flat dict into separate columns
data["ressort"]=data.apply(lambda x: x.attributes_flat["ressort"],axis=1)
data["datum"]=data.apply(lambda x: x.attributes_flat["datum"],axis=1)
data["name"]=data.apply(lambda x: x.attributes_flat["name"],axis=1)
data["artikel_id"]=data.apply(lambda x: x.attributes_flat["artikel_id"],axis=1)
data["artikel_order"]=data.apply(lambda x: x.attributes_flat["artikel_order"],axis=1)

In [76]:
# extract annotations
data.loc[:,"annotations"]=data.loc[:,"annotations"].apply(annotations.extract_annotations)

In [77]:
# calculate similarity
data.loc[:,"jaccard"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity)
data.loc[:,"dice"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity,args=["dice"])

In [78]:
# drop the id column from the export; artikel_id is used instead
data=data.drop("id",axis=1)

In [79]:
final_data=pd.concat([final_data,data])

### annotations_06_09_part3

Task:
- extract datum, artikel_id, ressort, name, artikel_order


In [80]:
data=dfs["annotations_06_09_part3.json"]

In [81]:
# copy the article metadata out of the attributes_flat dict into separate columns
data["ressort"]=data.apply(lambda x: x.attributes_flat["ressort"],axis=1)
data["datum"]=data.apply(lambda x: x.attributes_flat["datum"],axis=1)
data["name"]=data.apply(lambda x: x.attributes_flat["name"],axis=1)
data["artikel_id"]=data.apply(lambda x: x.attributes_flat["artikel_id"],axis=1)
data["artikel_order"]=data.apply(lambda x: x.attributes_flat["artikel_order"],axis=1)

In [82]:
# extract annotations
data.loc[:,"annotations"]=data.loc[:,"annotations"].apply(annotations.extract_annotations)

In [83]:
# calculate similarity
data.loc[:,"jaccard"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity)
data.loc[:,"dice"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity,args=["dice"])

In [84]:
# drop the id column from the export; artikel_id is used instead
data=data.drop("id",axis=1)

In [85]:
final_data=pd.concat([final_data,data])

### annotations_50sample_06_09
Task:
- get ressort, name, datum, artikel_id
- the column id is NOT artikel_id

Note:
- all annotations in this sample are duplicates, i.e. a subset of the other files, annotated by 3 annotators

In [None]:
data=dfs["annotations_50sample_06_09.json"]

In [110]:
data=data[["text","annotations","file"]].merge(final_data[["datum","ressort","name","text","artikel_id"]], on="text")
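
The merge above recovers the metadata by matching on `text`. With the default inner join, sample rows without a match are dropped silently, and a text that occurs several times in `final_data` multiplies the matching sample rows. A hedged sketch of how both cases could be made visible, using made-up frames (`meta` and `sample` are illustrative names, not objects from this notebook):

```python
import pandas as pd

# stand-in for the metadata already collected; "p2" occurs twice
meta = pd.DataFrame({
    "text": ["p1", "p2", "p2"],
    "artikel_id": [10, 20, 20],
    "ressort": ["Politik", "Sport", "Sport"],
})
# stand-in for the sample file; "p3" has no counterpart in meta
sample = pd.DataFrame({
    "text": ["p1", "p2", "p3"],
    "annotations": [["a"], ["b"], ["c"]],
})

# keep one metadata row per text so the merge cannot multiply sample rows
meta_unique = meta.drop_duplicates(subset="text")

merged = sample.merge(
    meta_unique,
    on="text",
    how="left",              # keep unmatched sample rows instead of dropping them
    validate="many_to_one",  # raise if a text is still duplicated on the right side
    indicator=True,          # adds a _merge column showing where each row came from
)

unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), "rows after merge,", len(unmatched), "without metadata")
```

Whether unmatched rows should be kept or dropped is a choice for the analysis; the point of the sketch is only that the row counts before and after the merge can be checked explicitly.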

In [112]:
# extract annotations
data.loc[:,"annotations"]=data.loc[:,"annotations"].apply(annotations.extract_annotations)

In [113]:
# calculate similarity
data.loc[:,"jaccard"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity)
data.loc[:,"dice"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity,args=["dice"])

In [116]:
final_data=pd.concat([final_data,data])

### Export merged data

In [117]:
final_data.to_csv("annotated/all_annotations_06_13.csv")
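
When the exported file is read back, columns that held Python lists or dicts come back as plain strings, because CSV has no notion of nested objects. A minimal sketch of restoring them with `literal_eval`, assuming the `annotations` column contains lists of dicts after `extract_annotations` (its actual output format is not shown in this notebook). The toy frame and the in-memory buffer are stand-ins for the real file.

```python
import io
from ast import literal_eval

import pandas as pd

# made-up row with a nested annotations value
toy = pd.DataFrame({
    "text": ["p1"],
    "annotations": [[{"label": "x", "start": 0}]],
})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# converters applies literal_eval to every cell of the annotations column
restored = pd.read_csv(buffer, converters={"annotations": literal_eval})
print(type(restored.loc[0, "annotations"]))
```

`literal_eval` raises an error on empty cells, so rows without annotations would need to be handled separately.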