## Purpose
This notebook cleans the annotation data exported from elinor so that it can be analysed further.
Observed problems in the annotation data:
- some fields were never imported into elinor but should be part of the analysis
- the id field was exported incorrectly
- some paragraphs were annotated multiple times and need to be merged
- some annotations still contain the old labels
- some specific labels were used in combination with the generic "domestic violence" label
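
The last problem in this list can be illustrated with a small sketch. It is not the cleaning code used in this notebook: the helper `drop_generic_label` and the label "physical violence" are hypothetical, and it assumes each paragraph's labels are available as a plain list of strings.

```python
# Hypothetical label names, chosen only for illustration.
GENERIC_LABEL = "domestic violence"

def drop_generic_label(labels, generic=GENERIC_LABEL):
    """Remove the generic label when at least one more specific label is present."""
    specific = [label for label in labels if label != generic]
    # Keep the generic label only if it is the sole label on the paragraph.
    return specific if specific else list(labels)

print(drop_generic_label(["domestic violence", "physical violence"]))
print(drop_generic_label(["domestic violence"]))
```

Whether the generic label should be dropped or kept in such cases is a design decision for the analysis; the sketch only shows one possible rule.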

For all files:
- extract the annotations and the metadata columns
- calculate similarity (Jaccard and Dice)
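
For reference, the two similarity measures used in the cells further down (`jaccard` and `dice`) can be written as below. This is a generic sketch of the Jaccard index and the Dice coefficient on two label sets. It is not the code of `annotations.calculate_similarity`, whose implementation is not shown in this notebook; the function names and the convention for two empty sets are assumptions.

```python
def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty label sets count as identical
    return len(a & b) / len(a | b)

def dice(a, b):
    """Dice coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # same assumption as above
    return 2 * len(a & b) / (len(a) + len(b))

labels_1 = {"A", "B"}
labels_2 = {"B", "C"}
print(jaccard(labels_1, labels_2))  # 1 shared label out of 3 distinct ones
print(dice(labels_1, labels_2))     # 2 * 1 / (2 + 2)
```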


### Imports & Data

In [67]:
import os
import re
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
from tqdm import tqdm
from datetime import datetime
from ast import literal_eval
import plotly.graph_objects as go
import plotly.express as px
import json

In [68]:
# connect with google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [69]:
# change cwd
%cd drive/MyDrive/Work/Frontline/data
#%cd /content/drive/MyDrive/data/

[Errno 2] No such file or directory: 'drive/MyDrive/Work/Frontline/data'
/content/drive/.shortcut-targets-by-id/1WfnZsqpG1r110J63sMbfS5TpsDOkveiV/data


In [70]:
from scripts import annotations

In [71]:
# dataset containing all relevant articles (but only one paragraph each!)
data_all=pd.read_csv("elinor/annotation_test_05_18.csv")

In [72]:
# dict of dfs with all annotated datasets, keyed by file name
dfs={}
for doc in os.listdir("annotated"):
  if doc.startswith("annotations"):
    # read json data (the with-block closes the file again)
    with open(os.path.join("annotated",doc), encoding="utf-8") as f:
      json_data=json.load(f)
    # convert to dataframe
    data=pd.DataFrame(json_data["documents"])
    # for now: filter out paragraphs that have not been annotated
    # .copy() avoids the SettingWithCopyWarning on the assignment below
    data=data[data["annotations"].apply(len)>0].copy()
    data["file"]=doc
    dfs[doc]=data

len(dfs)

# # merge jsons
# data=pd.concat(dfs)
# data=data.reset_index(drop=True)


5

In [73]:
final_data=pd.DataFrame(columns=["artikel_id","text","name", "datum","ressort","annotations","attributes_flat","file","artikel_order" ])

### Methods

In [74]:
def extract_field(entry,field):
  """Return the value stored under `field` in the row's attributes_flat dict."""
  return entry.attributes_flat[field]
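
As a usage sketch, the helper can pull several fields out of `attributes_flat` in one loop instead of one line per field. The toy DataFrame below is made up for illustration; only the column name `attributes_flat` and the field names are taken from this notebook.

```python
import pandas as pd

def extract_field(entry, field):
    """Return the value stored under `field` in the row's attributes_flat dict."""
    return entry.attributes_flat[field]

# Made-up rows that mimic the structure of the exported documents.
toy = pd.DataFrame({
    "text": ["first paragraph", "second paragraph"],
    "attributes_flat": [
        {"ressort": "Politik", "datum": "2021-05-18", "name": "a", "artikel_id": "1"},
        {"ressort": "Chronik", "datum": "2021-05-19", "name": "b", "artikel_id": "2"},
    ],
})

for field in ["ressort", "datum", "name", "artikel_id"]:
    # args passes `field` as the second positional argument of extract_field
    toy[field] = toy.apply(extract_field, axis=1, args=(field,))

print(toy[["ressort", "datum", "name", "artikel_id"]])
```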

### annotations_05_18
Task:
- extract ressort, datum, name
- extract annotations
- extract artikel id

Note:
- this file doesn't have artikel_order
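
Because this file has no `artikel_order`, that column stays empty for its rows once the file is appended to `final_data`: `pd.concat` aligns frames on column names and fills columns that are missing from one frame with NaN. A small sketch with made-up values:

```python
import pandas as pd

# Made-up frames: the first lacks artikel_order, the second has it.
without_order = pd.DataFrame({"artikel_id": ["1"], "text": ["a"]})
with_order = pd.DataFrame({"artikel_id": ["2"], "text": ["b"], "artikel_order": [3]})

combined = pd.concat([without_order, with_order], ignore_index=True)

# The row from the first frame has NaN in artikel_order.
print(combined)
print(combined["artikel_order"].isna().tolist())
```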

In [75]:
data=dfs["annotations_05_18.json"]

In [76]:
data["ressort"]=data.apply(lambda x: x.attributes_flat["ressort"],axis=1)
data["datum"]=data.apply(lambda x: x.attributes_flat["datum"],axis=1)
data["name"]=data.apply(lambda x: x.attributes_flat["name"],axis=1)
data["artikel_id"]=data.apply(lambda x: x.attributes_flat["artikel_id"],axis=1)

In [77]:
# extract annotations
data.loc[:,"annotations"]=data.loc[:,"annotations"].apply(annotations.extract_annotations)

In [78]:
# calculate similarity
data.loc[:,"jaccard"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity)
data.loc[:,"dice"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity,args=("dice",))

In [79]:
data=data.drop("id",axis=1)

In [80]:
final_data=pd.concat([final_data,data])

### annotations_06_09_part1

Task:
- extract datum, artikel_id, ressort, name, artikel_order

Note:
- 1/3 without ressort
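
The share of entries without a ressort can be checked with a sketch like the one below. It assumes that a missing ressort shows up as an empty string or a missing value after extraction, which this notebook does not confirm; the toy values are made up.

```python
import pandas as pd

# Made-up ressort column; None and "" both stand for "no ressort" here (assumption).
toy = pd.DataFrame({"ressort": ["Politik", "", None, "Chronik", "Sport", ""]})

# A row counts as missing if the value is NaN/None or an empty string.
missing = toy["ressort"].isna() | (toy["ressort"] == "")
share_missing = missing.mean()

print(f"{missing.sum()} of {len(toy)} rows have no ressort ({share_missing:.0%})")
```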

In [81]:
data=dfs["annotations_06_09_part1.json"]

In [82]:
data["ressort"]=data.apply(lambda x: x.attributes_flat["ressort"],axis=1)
data["datum"]=data.apply(lambda x: x.attributes_flat["datum"],axis=1)
data["name"]=data.apply(lambda x: x.attributes_flat["name"],axis=1)
data["artikel_id"]=data.apply(lambda x: x.attributes_flat["artikel_id"],axis=1)
data["artikel_order"]=data.apply(lambda x: x.attributes_flat["artikel_order"],axis=1)

In [83]:
# extract annotations
data.loc[:,"annotations"]=data.loc[:,"annotations"].apply(annotations.extract_annotations)

In [84]:
# calculate similarity
data.loc[:,"jaccard"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity)
data.loc[:,"dice"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity,args=("dice",))

In [85]:
data=data.drop("id",axis=1)

In [86]:
final_data=pd.concat([final_data,data])

### annotations_50sample_06_09
Task:
- get ressort, name, datum, id

In [None]:
final_data