# Covid annotation extraction

This notebook reads pickled BERTje containers holding annotated sentences for **Covid patients**, collects them into a single Pandas DataFrame, and saves the result. The annotated sentences can then be loaded quickly from one file in later work.

Note: Much of the code is now redundant. The BERTje embedding pipeline has since been updated to write a single pickle file containing all BERTje containers, and the containers now carry the relevant information as attributes of the [Container class](https://github.com/cltl/a-proof/blob/master/machine_learning/class_definitions.py).

In [2]:
import pickle
from pathlib import Path

import numpy as np
import pandas as pd
import scipy
import seaborn as sns
import sklearn
import statsmodels
import torch
from matplotlib import pyplot as plt
from tqdm import tqdm

In [3]:
# Make graphics nice
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Set sensible defaults
sns.set()
sns.set_style("ticks")
sns.set_context('paper')

In [4]:
# class_definitions must be importable so that pickle can reconstruct the container objects
import class_definitions
from class_definitions import Annotation, BertContainer

In [5]:
# Despite its name, DATADIR is the path to a single pickle file, not a directory
DATADIR = '//data2/Documents/Covid_data_11nov/traindata_covidbatch.pkl'

In [6]:
from collections import defaultdict


def preview_info_from_pkl(fpath):
    """Print the main attributes of every container in the pickle at `fpath`."""
    with open(fpath, 'rb') as file:
        all_containers = pickle.load(file)

    for container in all_containers:
        print(container.annotator, container.sen_id, container.sen,
              len(container.encoding), container.annot)


def extract_info_from_pkl(fpath):
    """Collect per-sentence metadata and annotation labels as a dict of column lists."""
    print("Opening pickle...")
    with open(fpath, 'rb') as file:
        all_containers = pickle.load(file)

    print("Reading...")
    # Show which attributes and methods a container exposes
    print(dir(all_containers[0]))

    contents = defaultdict(list)
    for container in tqdm(all_containers):
        contents['src_file'].append(container.key)
        contents['annotator'].append(container.annotator)
        contents['sentence_id'].append(container.sen_id)
        contents['sentence'].append(container.sen)
        # Disabled: record the shape of the stacked encodings
        # contents['encoding_shape'].append(torch.stack(container.encoding).shape if len(container.encoding) > 0 else None)

        if container.annot:
            # Join the labels of all annotations on this sentence into one string
            contents['annotations'].append(' | '.join(c.label for c in container.annot))
            contents['num_annot'].append(len(container.annot))
        else:
            contents['annotations'].append(None)
            contents['num_annot'].append(0)

    return contents
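
The column-building logic of `extract_info_from_pkl` can be checked without access to the real pickle by running it on stand-in objects. The sketch below is only an illustration under stated assumptions: `FakeAnnotation`, `FakeContainer`, and `containers_to_columns` are hypothetical names, the two classes mimic only the attributes read above (`key`, `annotator`, `sen_id`, `sen`, `annot`, and `label`), and all example values are invented placeholders rather than real data.

```python
from collections import defaultdict
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class FakeAnnotation:
    """Hypothetical stand-in: mimics only the `label` attribute."""
    label: str


@dataclass
class FakeContainer:
    """Hypothetical stand-in for a BERTje container (attribute names assumed)."""
    key: str
    annotator: str
    sen_id: str
    sen: str
    annot: list = field(default_factory=list)


def containers_to_columns(containers):
    """Same column logic as extract_info_from_pkl, without the pickle loading."""
    contents = defaultdict(list)
    for c in containers:
        contents['src_file'].append(c.key)
        contents['annotator'].append(c.annotator)
        contents['sentence_id'].append(c.sen_id)
        contents['sentence'].append(c.sen)
        if c.annot:
            contents['annotations'].append(' | '.join(a.label for a in c.annot))
            contents['num_annot'].append(len(c.annot))
        else:
            contents['annotations'].append(None)
            contents['num_annot'].append(0)
    return contents


# Placeholder values, invented for the example
demo = [
    FakeContainer('file_a', 'annotator_1', 's1', 'placeholder sentence one',
                  [FakeAnnotation('STM 1'), FakeAnnotation('.B152: Stemming')]),
    FakeContainer('file_a', 'annotator_1', 's2', 'placeholder sentence two'),
]
demo_df = pd.DataFrame(containers_to_columns(demo))
print(demo_df)
```

Sentences without annotations end up with `None` in `annotations` and `0` in `num_annot`, which matches the non-null counts reported by `df.info()`.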

In [7]:
contents = extract_info_from_pkl(DATADIR)

Opening pickle...
Reading...
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'add_anno', 'annot', 'annotator', 'encoding', 'key', 'print_container', 'sen', 'sen_id', 'write_to_pkl']
100%|██████████| 17365/17365 [00:00<00:00, 600539.98it/s]

In [8]:
df = pd.DataFrame(contents)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17365 entries, 0 to 17364
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   src_file     17365 non-null  object
 1   annotator    17365 non-null  object
 2   sentence_id  17365 non-null  object
 3   sentence     17365 non-null  object
 4   annotations  6308 non-null   object
 5   num_annot    17365 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 814.1+ KB


In [10]:
df.to_csv('~/gianluca_data/data/covid_traindata.tsv', index=False, sep='\t')
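
Reading the saved file back is the fast path this notebook is meant to provide. The following is a minimal sketch of the round trip, assuming a temporary file and a tiny hand-made frame instead of the real data (the values and the `demo.tsv` name are placeholders). Passing `dtype` for `sentence_id` is a precaution, not something the original code does: it keeps identifiers as strings even if they look numeric.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny placeholder frame with the same kind of columns as the real one
small = pd.DataFrame({
    'sentence_id': ['0001', '0002'],
    'annotations': ['STM 1 | .B152: Stemming', None],
    'num_annot': [2, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    tsv_path = Path(tmp) / 'demo.tsv'
    # Same write call as above: tab-separated, no index column
    small.to_csv(tsv_path, index=False, sep='\t')
    # Read it back; keep sentence_id as text so leading zeros survive
    reloaded = pd.read_csv(tsv_path, sep='\t', dtype={'sentence_id': str})

print(reloaded)
```

Missing annotations are written as empty fields and come back as `NaN`, so `reloaded['annotations'].isna()` still identifies unannotated sentences.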

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 216959 entries, 0 to 36
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   src_file        216959 non-null  object
 1   annotator       216959 non-null  object
 2   sentence_id     216959 non-null  object
 3   sentence        216959 non-null  object
 4   encoding_shape  216959 non-null  object
 5   num_annot       216959 non-null  int64 
 6   annotations     61313 non-null   object
dtypes: int64(1), object(6)
memory usage: 13.2+ MB


In [25]:
df.columns

Index(['src_file', 'annotator', 'sentence_id', 'sentence', 'encoding_shape',
       'num_annot', 'annotations'],
      dtype='object')
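
Because `annotations` stores all labels of a sentence as one ` | `-joined string, the same labels in a different order count as different values. A possible way to count individual labels instead is sketched below on a small hand-made Series (the rows are placeholders built from label names that occur in this data). The split uses plain Python `str.split` so that the `|` is treated literally and not as a regular-expression alternation.

```python
import pandas as pd

# Placeholder rows in the same ' | '-joined format as df.annotations
annotations = pd.Series([
    'STM 1 | .B152: Stemming',
    '.B152: Stemming | STM 1',
    '.D450: Lopen en zich verplaatsen',
    None,
])

# Split each string into a list of labels, then give each label its own row
labels = annotations.dropna().map(lambda s: s.split(' | ')).explode()
label_counts = labels.value_counts()
print(label_counts)
```

Applied to the real frame, the same two lines would start from `df.annotations` instead of the placeholder Series.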

In [26]:
df.annotations.value_counts().head(50)

                                                              27804
type_Background                                               23362
target                                                         4945
disregard_file                                                  671
info_Third party                                                382
STM 1 | .B152: Stemming                                         293
.D450: Lopen en zich verplaatsen                                262
.D450: Lopen en zich verplaatsen | FAC 4                        215
view_Patient                                                    124
FAC 4 | .D450: Lopen en zich verplaatsen                        112
.B152: Stemming | STM 1                                         111
*                                                               105
STM 3 | .B152: Stemming                                          87
STM 1 | stm_reaction | .B152: Stemming                           84
STM 1 | .B152: Stemming | stm_reaction