<a href="https://colab.research.google.com/github/blue-create/langlens/blob/main/import/clean_annotation_data_V2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Purpose
This notebook cleans and merges the annotation data (V2 - after June 20th) from elinor so that it can be analysed further.

For every file:
- extract the annotations and the metadata columns
- calculate the similarity between annotators (Jaccard and Dice)
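
The similarity step relies on `annotations.calculate_similarity` from the project's `scripts` module, whose implementation is not shown in this notebook. As a rough sketch of what Jaccard and Dice agreement between two annotators' label sets could look like (the function names, the set-based input and the labels are assumptions for illustration, not the project's actual code):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A & B| / |A | B| (defined as 1.0 when both sets are empty)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def dice(a: set, b: set) -> float:
    """Dice coefficient: 2 * |A & B| / (|A| + |B|) (defined as 1.0 when both sets are empty)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return 2 * len(a & b) / total


# Hypothetical labels assigned by two annotators to the same paragraph
labels_a = {"label_1", "label_2", "label_3"}
labels_b = {"label_2", "label_3", "label_4"}

jaccard_score = jaccard(labels_a, labels_b)  # 2 shared labels out of 4 distinct ones
dice_score = dice(labels_a, labels_b)        # 2 * 2 shared labels over 3 + 3 labels
```

Both scores range from 0 (no overlap) to 1 (identical sets); Dice is never lower than Jaccard for the same pair of sets.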

### Output:
- one file containing all annotations of the V2 annotation data set


### Imports & Data

In [191]:
import os
import re
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
from tqdm import tqdm
from datetime import datetime
from ast import literal_eval
import plotly.graph_objects as go
import plotly.express as px
import json

In [192]:
# connect with google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [193]:
# change cwd
%cd drive/MyDrive/Work/Frontline/data
#%cd /content/drive/MyDrive/data/

[Errno 2] No such file or directory: 'drive/MyDrive/Work/Frontline/data'
/content/drive/.shortcut-targets-by-id/1WfnZsqpG1r110J63sMbfS5TpsDOkveiV/data


In [194]:
from scripts import annotations

In [195]:
# dict of DataFrames, one per annotated JSON file
dfs={}
for doc in os.listdir("annotated"):
  if doc.endswith(".json"):
    # read json data; the with-block closes the file handle
    with open(os.path.join("annotated",doc), encoding="utf-8") as f:
      json_data=json.load(f)
    # convert to dataframe
    data=pd.DataFrame(json_data["documents"])
    # for now: filter out paragraphs that have not been annotated
    # (.copy() so the column assignment below does not act on a slice)
    data=data[data["annotations"].apply(len)>0].copy()
    data["file"]=doc
    dfs[doc]=data

len(dfs)

1

In [196]:
final_data=pd.DataFrame(columns=["artikel_id","text","name","datum","ressort","annotations","attributes_flat","file","artikel_order"])

### 230609_annotations_part3

Task:
- extract datum, artikel_id, ressort, name and artikel_order from `attributes_flat`
- J's annotations all belong to V2 of the annotation data and K's all belong to V1, so K's data is removed
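
The two tasks above can be illustrated on a toy DataFrame. This is only a sketch: the values are made up, and the `{annotator: [labels]}` shape of the `annotations` column is an assumption about what `annotations.extract_annotations` returns.

```python
import pandas as pd

# Toy stand-in for one annotated file; all values are made up for illustration
toy = pd.DataFrame({
    "text": ["first paragraph", "second paragraph"],
    "attributes_flat": [
        {"artikel_id": "a1", "datum": "2023-06-01", "ressort": "politik",
         "name": "paper_x", "artikel_order": 0},
        {"artikel_id": "a2", "datum": "2023-06-02", "ressort": "sport",
         "name": "paper_y", "artikel_order": 1},
    ],
    # assumed shape after extraction: {annotator: [labels]}
    "annotations": [{"J": ["label_1"]}, {"K": ["label_2"]}],
})

# Pull each metadata field out of the attributes_flat dict into its own column
for col in ["datum", "artikel_id", "ressort", "name", "artikel_order"]:
    toy[col] = toy["attributes_flat"].apply(lambda attrs, key=col: attrs[key])

# Keep only rows without annotations by K;
# .copy() returns an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
toy_v2 = toy[toy["annotations"].apply(lambda ann: "K" not in ann)].copy()
```

Only the row annotated by J remains in `toy_v2`.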


In [197]:
data=dfs["230609_annotations_part3.json"]

In [198]:
# copy the metadata fields out of attributes_flat into their own columns
for col in ["ressort","datum","name","artikel_id","artikel_order"]:
  data[col]=data["attributes_flat"].apply(lambda attrs: attrs[col])

In [199]:
# extract annotations
data.loc[:,"annotations"]=data.loc[:,"annotations"].apply(annotations.extract_annotations)

In [200]:
# remove K's annotations (.copy() so the assignments below act on an independent DataFrame, not a slice)
data=data[data["annotations"].apply(lambda x: "K" not in x.keys())].copy()

In [201]:
# calculate similarity
data["jaccard"]=data["annotations"].apply(annotations.calculate_similarity)
data["dice"]=data["annotations"].apply(annotations.calculate_similarity,args=("dice",))


In [202]:
data=data.drop("id",axis=1)

In [203]:
final_data=pd.concat([final_data,data])

### Export merged data

In [205]:
date=datetime.today().strftime('%y%m%d')

In [206]:
final_data.to_csv(f"annotated/{date}_all_annotationsV2.csv")
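
Because `to_csv` writes dict-valued columns such as `annotations` as plain strings, a later analysis step has to parse them back. A minimal sketch of one way to do that with `ast.literal_eval` follows; the toy frame and the in-memory buffer are stand-ins for `final_data` and the exported file, not part of the original pipeline.

```python
import io
from ast import literal_eval

import pandas as pd

# Toy frame with a dict-valued column, standing in for final_data
toy = pd.DataFrame({"text": ["a paragraph"], "annotations": [{"J": ["label_1"]}]})

# Round-trip through CSV (in memory here, instead of a file on Drive)
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the index that to_csv wrote as the first column;
# converters parses the stringified dict back into a Python dict
restored = pd.read_csv(buffer, index_col=0, converters={"annotations": literal_eval})
```

Without the converter, `restored["annotations"]` would hold strings rather than dicts.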