<a href="https://colab.research.google.com/github/blue-create/langlens/blob/main/import/clean_annotation_data_V1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Purpose
This notebook cleans the annotation data exported from elinor so that it can be analysed further.
Observed problems in the annotation data:
- some fields that should be part of the analysis were never imported into elinor
- the id field was exported incorrectly
- some paragraphs were annotated multiple times and need to be merged
- some annotations still contain the old labels
- some specific labels were used in combination with the generic "domestic violence" label
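
Merging paragraphs that were annotated several times can be sketched as follows. This is only an illustration on toy data: it assumes that, after extraction, each row's `annotations` is a dict mapping an annotator name to a list of labels, and the texts, annotator names and labels are hypothetical.

```python
import pandas as pd

# Toy rows: the same paragraph text appears twice with different annotators.
df = pd.DataFrame(
    {
        "text": ["para 1", "para 1", "para 2"],
        "annotations": [{"A": ["label_x"]}, {"B": ["label_y"]}, {"A": ["label_x"]}],
    }
)

# Collect one combined annotator -> labels dict per paragraph text.
# If the same annotator occurs twice for a text, the later row wins.
merged = {}
for text, ann in zip(df["text"], df["annotations"]):
    merged.setdefault(text, {}).update(ann)

merged_df = pd.DataFrame(
    {"text": list(merged.keys()), "annotations": list(merged.values())}
)
print(merged_df)
```

Whether a later row should overwrite an earlier one for the same annotator is a design choice; keeping both and flagging the conflict would be an equally valid alternative.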

For all files:
- extract annotations, columns
- calculate similarity
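
The similarity step relies on `annotations.calculate_similarity` from the project's `scripts` module, which is not shown in this notebook. As a rough sketch of what Jaccard and Dice agreement between annotators could look like, assuming the extracted annotations are a dict mapping annotator name to a list of labels (the function names, annotators and labels below are hypothetical):

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dice(a: set, b: set) -> float:
    """Dice coefficient 2|A & B| / (|A| + |B|); two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def mean_pairwise_similarity(labels_by_annotator: dict, metric: str = "jaccard"):
    """Average the metric over all annotator pairs; None with fewer than two annotators."""
    score = {"jaccard": jaccard, "dice": dice}[metric]
    label_sets = [set(labels) for labels in labels_by_annotator.values()]
    pairs = list(combinations(label_sets, 2))
    if not pairs:
        return None
    return sum(score(a, b) for a, b in pairs) / len(pairs)


example = {"A": ["label_x", "label_y"], "B": ["label_y"]}
print(mean_pairwise_similarity(example))          # 1 shared / 2 total = 0.5
print(mean_pairwise_similarity(example, "dice"))  # 2*1 / (2+1)
```

The project's own function may handle single-annotator paragraphs or empty label sets differently; the choices above are assumptions made for the sketch.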

### Output:
- one file containing all annotations of the first annotation data set (v1)


### Imports & Data

In [135]:
import os
import re
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
from tqdm import tqdm
from datetime import datetime
from ast import literal_eval
import plotly.graph_objects as go
import plotly.express as px
import json

In [136]:
# connect with google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [137]:
# change cwd
%cd drive/MyDrive/Work/Frontline/data
#%cd /content/drive/MyDrive/data/

[Errno 2] No such file or directory: 'drive/MyDrive/Work/Frontline/data'
/content/drive/.shortcut-targets-by-id/1WfnZsqpG1r110J63sMbfS5TpsDOkveiV/data


In [138]:
from scripts import annotations

In [139]:
# dataset containing all relevant articles (but only one paragraph each!)
data_all=pd.read_csv("elinor/annotation_test_05_18.csv")

In [140]:
# dict of DataFrames, one per annotated JSON export, keyed by file name
dfs={}
for doc in os.listdir("annotated"):
  if doc.endswith(".json"):
    # read json data; the context manager closes the file afterwards
    with open(os.path.join("annotated",doc), encoding="utf-8") as f:
      json_data=json.load(f)
    # convert to dataframe
    data=pd.DataFrame(json_data["documents"])
    # for now: filter out paragraphs that have not been annotated
    # (.copy() avoids a SettingWithCopyWarning on the assignment below)
    data=data[data["annotations"].apply(len)>0].copy()
    data["file"]=doc
    dfs[doc]=data

len(dfs)

5

In [141]:
final_data=pd.DataFrame(columns=["artikel_id","text","name", "datum","ressort","annotations","attributes_flat","file","artikel_order" ])

### annotations_05_18
Task:
- extract ressort, datum, name
- extract annotations
- extract artikel id

Note:
- this file has no artikel_order field

In [142]:
data=dfs["230518_annotations.json"]

In [143]:
data["ressort"]=data.apply(lambda x: x.attributes_flat["ressort"],axis=1)
data["datum"]=data.apply(lambda x: x.attributes_flat["datum"],axis=1)
data["name"]=data.apply(lambda x: x.attributes_flat["name"],axis=1)
data["artikel_id"]=data.apply(lambda x: x.attributes_flat["artikel_id"],axis=1)

In [144]:
# extract annotations
data.loc[:,"annotations"]=data.loc[:,"annotations"].apply(annotations.extract_annotations)

In [145]:
# calculate similarity
data.loc[:,"jaccard"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity)
data.loc[:,"dice"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity,args=["dice"])

In [146]:
data=data.drop("id",axis=1)

In [147]:
final_data=pd.concat([final_data,data])
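
The attribute extraction above is repeated for every file with a slightly different set of fields. One way to factor it out is a small helper, sketched here on toy data. The helper name `add_attribute_columns` is hypothetical, and unlike the cells in this notebook it uses `dict.get`, so a missing key yields a missing value instead of raising a `KeyError`.

```python
import pandas as pd


def add_attribute_columns(df: pd.DataFrame, fields: list) -> pd.DataFrame:
    """Return a copy of df with one column per requested key of attributes_flat."""
    out = df.copy()
    for field in fields:
        # Bind field as a default argument so each lambda keeps its own key.
        out[field] = out["attributes_flat"].apply(lambda attrs, f=field: attrs.get(f))
    return out


# Toy stand-in for one annotated file; the second row has no "ressort" key.
toy = pd.DataFrame(
    {
        "text": ["para 1", "para 2"],
        "attributes_flat": [
            {"ressort": "r1", "datum": "2023-05-18", "name": "n1", "artikel_id": "a1"},
            {"datum": "2023-05-19", "name": "n2", "artikel_id": "a2"},
        ],
    }
)
toy = add_attribute_columns(toy, ["ressort", "datum", "name", "artikel_id"])
print(toy[["ressort", "artikel_id"]])
```

For the later files, `"artikel_order"` would simply be added to the list of fields.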

### 230609_annotations_part1

Task:
- extract datum, artikel_id, ressort, name, artikel_order

Note:
- about one third of the entries have no ressort

In [148]:
data=dfs["230609_annotations_part1.json"]

In [149]:
data["ressort"]=data.apply(lambda x: x.attributes_flat["ressort"],axis=1)
data["datum"]=data.apply(lambda x: x.attributes_flat["datum"],axis=1)
data["name"]=data.apply(lambda x: x.attributes_flat["name"],axis=1)
data["artikel_id"]=data.apply(lambda x: x.attributes_flat["artikel_id"],axis=1)
data["artikel_order"]=data.apply(lambda x: x.attributes_flat["artikel_order"],axis=1)

In [152]:
# extract annotations
data.loc[:,"annotations"]=data.loc[:,"annotations"].apply(annotations.extract_annotations)

In [153]:
# calculate similarity
data.loc[:,"jaccard"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity)
data.loc[:,"dice"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity,args=["dice"])

In [154]:
data=data.drop("id",axis=1)

In [155]:
final_data=pd.concat([final_data,data])

### 230609_annotations_part2

Task:
- extract datum, artikel_id, ressort, name, artikel_order



In [156]:
data=dfs["230609_annotations_part2.json"]

In [157]:
data["ressort"]=data.apply(lambda x: x.attributes_flat["ressort"],axis=1)
data["datum"]=data.apply(lambda x: x.attributes_flat["datum"],axis=1)
data["name"]=data.apply(lambda x: x.attributes_flat["name"],axis=1)
data["artikel_id"]=data.apply(lambda x: x.attributes_flat["artikel_id"],axis=1)
data["artikel_order"]=data.apply(lambda x: x.attributes_flat["artikel_order"],axis=1)

In [158]:
# extract annotations
data.loc[:,"annotations"]=data.loc[:,"annotations"].apply(annotations.extract_annotations)

In [159]:
# calculate similarity
data.loc[:,"jaccard"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity)
data.loc[:,"dice"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity,args=["dice"])

In [160]:
data=data.drop("id",axis=1)

In [161]:
final_data=pd.concat([final_data,data])

### 230609_annotations_part3

Task:
- extract datum, artikel_id, ressort, name, artikel_order
- all of J's annotations belong to V2 of the annotations, so they are removed from this dataset


In [162]:
data=dfs["230609_annotations_part3.json"]

In [163]:
data["ressort"]=data.apply(lambda x: x.attributes_flat["ressort"],axis=1)
data["datum"]=data.apply(lambda x: x.attributes_flat["datum"],axis=1)
data["name"]=data.apply(lambda x: x.attributes_flat["name"],axis=1)
data["artikel_id"]=data.apply(lambda x: x.attributes_flat["artikel_id"],axis=1)
data["artikel_order"]=data.apply(lambda x: x.attributes_flat["artikel_order"],axis=1)

In [164]:
# extract annotations
data.loc[:,"annotations"]=data.loc[:,"annotations"].apply(annotations.extract_annotations)

In [165]:
# remove J's annotations; .copy() gives an independent frame so the column
# assignments below do not trigger a SettingWithCopyWarning
data=data[data.loc[:,"annotations"].apply(lambda x: "J" not in x.keys())].copy()

In [166]:
# calculate similarity
data.loc[:,"jaccard"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity)
data.loc[:,"dice"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity,args=["dice"])


In [167]:
data=data.drop("id",axis=1)

In [168]:
final_data=pd.concat([final_data,data])

### 230609_annotations_50sample
Task:
- get ressort, name, datum, artikel_id
- the column id is NOT artikel_id

Note:
- all annotations in this sample are duplicates: the paragraphs are a subset of the other files, each annotated by 3 annotators

In [169]:
data=dfs["230609_annotations_50sample.json"]

In [170]:
data=data[["text","annotations","file"]].merge(final_data[["datum","ressort","name","text","artikel_id"]], on="text")

In [171]:
# extract annotations
data.loc[:,"annotations"]=data.loc[:,"annotations"].apply(annotations.extract_annotations)

In [172]:
# calculate similarity
data.loc[:,"jaccard"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity)
data.loc[:,"dice"]=data.loc[:,"annotations"].apply(annotations.calculate_similarity,args=["dice"])

In [173]:
final_data=pd.concat([final_data,data])

### Export merged data

In [175]:
date=datetime.today().strftime('%y%m%d')

In [177]:
final_data.to_csv(f"annotated/{date}_all_annotationsV1.csv")
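
Note that `to_csv` stores the `annotations` dicts as their string representation. When the file is read back for analysis, they can be restored with `literal_eval`, which is already imported at the top of this notebook. A minimal sketch using an in-memory buffer instead of the real file (the toy row is hypothetical):

```python
import io
from ast import literal_eval

import pandas as pd

toy = pd.DataFrame({"text": ["para 1"], "annotations": [{"A": ["label_x"]}]})

# Round-trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without the converter, the annotations column would come back as plain strings.
restored = pd.read_csv(buffer, converters={"annotations": literal_eval})
print(type(restored.loc[0, "annotations"]))
```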