# Annotation Pipeline
This notebook walks through the typical steps of the model annotation pipeline and mocks the expected communication with the active learning API, as far as this can be shown within a notebook. It requires the AL_REST Docker container to be running. Documents take the following path between the work packages:

1. **AP2:** Documents get added to collection
2. **AP2->AP4:** Collection including new documents gets annotated by segment detection model, which creates segments
3. **AP4->AP3:** Additional text classification is performed and added to the segments
4. **AP3->AP2:** Segment annotations are added to the documents in the collection
5. **AP2 (Active Learning):** Importance score is calculated for each annotation and, based on that, each document
6. **AP2 (Active Learning) -> AP2 (Annotation Tool):** Most important documents are selected and passed to the annotation tool
7. **AP2 (Annotation Tool):** Annotation is performed by human annotators and documents updated
8. **AP2 (Annotation Tool) -> AP2 (Active Learning):** Documents in collection are updated with human annotation

Since no machine learning is implemented yet, the process stops here.

## Initialization
First, the model names and the documents used (XML and mocked) are defined, and basic request helper functions are implemented.

In [2]:
from urllib import request, parse
import xml.etree.ElementTree as ET
import json
from pprint import pprint

ANNOTATION_BOUNDING_BOX = ("segment_boundary", "segment_detection_model")
ANNOTATION_SEGMENT_TYPE = ("segment_type", "segment_type_model")
ANNOTATION_TRUNCATION = ("is_truncated", "truncation_model")
ANNOTATION_OCCLUSION  = ("is_occluded", "occlusion_model")
ANNOTATOR_TYPE = "model"
DOCUMENT_NAMES = ["testdata/3ZCCCW.pdf", "testdata/01-Anfrage-Musterbrief_1.pdf"]
DB_PATH = "/home/tobias/data"

def post_json(data_in, url):
    """POST data_in as JSON to url and return the response body as text."""
    data = json.dumps(data_in).encode('utf-8')
    # Passing data makes urllib issue a POST request
    req = request.Request(url, data=data, headers={'content-type': 'application/json'})
    with request.urlopen(req) as resp:
        return resp.read().decode('utf-8')

def get_json(url):
    """GET url and return the response body as text."""
    req = request.Request(url)
    with request.urlopen(req) as resp:
        return resp.read().decode('utf-8')


# 1. **AP2:** Documents get added to collection
The documents are added to the active learning collection. At the moment, each document consists only of a document ID and a single page (mocked, without a PDF). The models are already defined in a config file inside the Docker container.

In [6]:
"""
from shutil import copyfile
import os.path
for name in DOCUMENT_NAMES:
    copyfile(name, os.path.join(DB_PATH, name))"""

"""
def add_document(document_id):
    document = {
        "document_id": document_id,
        "pages":[{"page_number":0}]
    }
    
    url = "http://localhost:5000/add_document/"
    resp = post_json(document, url)
    print(resp)

for document_id in DOCUMENT_IDS:
    add_document(document_id)"""

url = "http://localhost:5000/start_pipeline/"

#url = "http://localhost:5000/set_document_ids/"
resp = post_json(DOCUMENT_NAMES, url)
print("Created Documents:", resp)

HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
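
The HTTP 500 above only reports the status line; the response body, which often carries the server-side error message, is discarded. A minimal sketch of a more talkative variant of post_json follows. The name post_json_verbose and the injectable opener parameter are assumptions made for illustration and are not part of the notebook or the API.

```python
import json
from urllib import error, request

def post_json_verbose(data_in, url, opener=request.urlopen):
    """POST data_in as JSON; on an HTTP error, include the response body in the exception.

    `opener` is a hypothetical hook (defaults to urllib's urlopen) so the
    function can be exercised without a running server.
    """
    data = json.dumps(data_in).encode('utf-8')
    req = request.Request(url, data=data, headers={'content-type': 'application/json'})
    try:
        with opener(req) as resp:
            return resp.read().decode('utf-8')
    except error.HTTPError as exc:
        # HTTPError doubles as a response object, so its body can be read
        body = exc.read().decode('utf-8', errors='replace')
        raise RuntimeError(f"HTTP {exc.code} {exc.reason} from {url}: {body}") from exc
```

Calling post_json_verbose(DOCUMENT_NAMES, url) in place of post_json would then surface whatever the server wrote into the error response.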

# 2. **AP2->AP4:** Collection including new documents gets annotated by segment detection model, which creates segments
## Getting document data from the server as the segment detection model
After receiving the unlabeled documents, the model can annotate the pages. How the PDF file itself is sent is still to be determined. This flow only applies to unlabeled documents. Once machine learning is involved, labeled data (including partially labeled documents) needs to be considered and transferred differently.

In [7]:
# called from AP4, but should be the other way around
url = "http://localhost:5000/get_unlabeled_documents/"
document_dict = json.loads(get_json(url))
print(document_dict)

# TODO: How should control between AP containers be handled?
# instead have function in AP4 that AP2 can call to give document data to AP4?

{'dd459a615a1fb4da9b46d5da6f9eda88': {'document_id': 'dd459a615a1fb4da9b46d5da6f9eda88', 'importance_score': -1, 'document_label': {'group_id': 0, 'importance_score': -1, 'annotation_type': 'document_label', 'annotations': []}, 'pages': []}}


## Get annotation types
Here we just get the annotation types so that we can later get the right index for labeling segments.

In [4]:
url = "http://localhost:5000/get_annotation_types/"
annotation_types = json.loads(get_json(url))
annotation_types

{'segment_boundary': {'model_ids': ['segment_detection_model'],
  'labels': ['xmin', 'ymin', 'xmax', 'ymax']},
 'segment_type': {'model_ids': ['segment_type_model'],
  'labels': ['text', 'image', 'table']},
 'is_truncated': {'model_ids': ['truncation_model'],
  'labels': ['is_truncated', 'is_not_truncated']},
 'is_occluded': {'model_ids': ['occlusion_model'],
  'labels': ['is_occluded', 'is_not_occluded']},
 'text_segment_boundary_test': {'model_ids': ['test_annotator_1',
   'test_annotator_2'],
  'labels': ['xmin', 'ymin', 'xmax', 'ymax']},
 'text_segment_label': {'model_ids': ['text_segment_label_clf1'],
  'labels': ['sender', 'receiver', 'footer', 'header']}}
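
To illustrate how these label lists are used later, the sketch below turns a label name into a one-hot annotation vector using a precomputed label-to-index lookup. The dictionary is a hand-copied excerpt of the output above, and the helper names build_label_index and one_hot are hypothetical.

```python
# Excerpt of the annotation types shown above (copied by hand for illustration)
annotation_types_excerpt = {
    'segment_type': {'model_ids': ['segment_type_model'],
                     'labels': ['text', 'image', 'table']},
    'is_truncated': {'model_ids': ['truncation_model'],
                     'labels': ['is_truncated', 'is_not_truncated']},
}

def build_label_index(types):
    """Map each annotation type to a {label: position} dictionary."""
    return {
        name: {label: idx for idx, label in enumerate(spec['labels'])}
        for name, spec in types.items()
    }

def one_hot(label_index, annotation_type, label_text):
    """Return a vector with 1.0 at the label's position and 0.0 elsewhere."""
    positions = label_index[annotation_type]
    vector = [0.0] * len(positions)
    vector[positions[label_text]] = 1.0
    return vector

label_index = build_label_index(annotation_types_excerpt)
print(one_hot(label_index, 'segment_type', 'image'))  # [0.0, 1.0, 0.0]
```

Building the lookup once avoids a linear search through the label list for every segment, which matters only when the label list is long.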

## Mocking the model
To simulate the model, annotated xml files are loaded. The annotations are read from xml and stored in the document data structure. This data structure can be returned to the API to store the page segment annotations.
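
For orientation, the sketch below parses a small hand-written XML snippet. Its element names (page1, object, SegmentLabel, id, truncated, occluded, bndbox) are inferred from the parsing code in this notebook; the root element name and all values are made up, and the real annotation files may contain more fields. The sketch also converts the coordinates to floats, which is an optional cleanup and not something the notebook does.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; values are made up for illustration
SAMPLE_XML = """
<annotation>
  <page1>
    <object>
      <id>r0</id>
      <SegmentLabel>text</SegmentLabel>
      <truncated>0</truncated>
      <occluded>1</occluded>
      <bndbox>
        <xmin>0.10 </xmin><ymin>0.20 </ymin><xmax>0.90 </xmax><ymax>0.30 </ymax>
      </bndbox>
    </object>
  </page1>
</annotation>
"""

def parse_objects(xml_text):
    """Yield (segment_id, label, truncated, occluded, box) for every object element."""
    root = ET.fromstring(xml_text.strip())
    for page_ele in root.findall("page1"):
        for obj in page_ele.findall("object"):
            bndbox = obj.find("bndbox")
            # float() tolerates surrounding whitespace such as "0.10 "
            box = [float(bndbox.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]
            yield (
                obj.find("id").text,
                obj.find("SegmentLabel").text,
                obj.find("truncated").text != '0',
                obj.find("occluded").text != '0',
                box,
            )

for row in parse_objects(SAMPLE_XML):
    print(row)
```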

In [5]:
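# Bounding boxes loaded from XML
def create_segments_from_xml(document_id):
    annotated_pages = []
    # The XML file shares the PDF's path, with the ".pdf" suffix swapped for ".xml"
    with open(document_id[:-4] + ".xml", 'r', encoding='utf-8') as fp:
        xml = fp.read()
    # NOTE: hardcoded to the ID the server returned in document_dict above
    document_id = "dd459a615a1fb4da9b46d5da6f9eda88"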
    root = ET.fromstring(xml)
    for page_ele in root.findall("page1"):
        annotated_pages.append((document_id,page_ele))
    
    for page_number, annotated_page in enumerate(annotated_pages):
        segment_annotations = []
        _, page_ele = annotated_page
        for segment_ele in page_ele.findall("object"):
            segment_label_text = segment_ele.find("SegmentLabel").text
            segment_id = segment_ele.find("id").text
            if segment_ele.find("truncated").text == '0':
                truncated = 0.0
            else:
                truncated = 1.0
            if segment_ele.find("occluded").text == '0':
                occluded = 0.0
            else:
                occluded = 1.0
            bounding_box = segment_ele.find("bndbox")
            xmin = bounding_box.find("xmin").text
            ymin = bounding_box.find("ymin").text
            xmax = bounding_box.find("xmax").text
            ymax = bounding_box.find("ymax").text
            
            segment = {
                "segment_id": segment_id,
                "child_segments": [],
                "annotation_groups":[
                    {
                        "group_id":0,
                        "importance_score":None,
                        "annotation_type":ANNOTATION_BOUNDING_BOX[0],
                        "annotations":[{
                            "annotation_id":segment_id,
                            "annotator_id":ANNOTATION_BOUNDING_BOX[1],
                            "annotator_type":ANNOTATOR_TYPE,
                            "annotation_vector":[xmin,ymin,xmax,ymax]
                        }]
                    },
                    {
                        "group_id":1,
                        "importance_score":None,
                        "annotation_type":ANNOTATION_SEGMENT_TYPE[0],
                        "annotations":[{
                            "annotation_id":segment_id,
                            "annotator_id":ANNOTATION_SEGMENT_TYPE[1],
                            "annotator_type":ANNOTATOR_TYPE,
                            "annotation_vector":get_annotation_vector(ANNOTATION_SEGMENT_TYPE[0], segment_label_text)
                        }]
                    },
                    {
                        "group_id":2,
                        "importance_score":None,
                        "annotation_type":ANNOTATION_TRUNCATION[0],
                        "annotations":[{
                            "annotation_id":segment_id,
                            "annotator_id":ANNOTATION_TRUNCATION[1],
                            "annotator_type":ANNOTATOR_TYPE,
                            "annotation_vector":[truncated, 1.0 - truncated]
                        }]
                    },
                    {
                        "group_id":3,
                        "importance_score":None,
                        "annotation_type":ANNOTATION_OCCLUSION[0],
                        "annotations":[{
                            "annotation_id":segment_id,
                            "annotator_id":ANNOTATION_OCCLUSION[1],
                            "annotator_type":ANNOTATOR_TYPE,
                            "annotation_vector":[occluded, 1.0 - occluded]
                        }]
                    },
                ]
            }
            segment_annotations.append(segment)
        pages = document_dict[document_id]["pages"]
        pages[page_number]["segments"] = segment_annotations

def get_annotation_vector(annotation_type, label_text):
    # do not use if list of labels is large
    # or label probabilities are provided by the model
    labels = annotation_types[annotation_type]["labels"]
    idx = labels.index(label_text)
    return [1.0 if i == idx else 0.0 for i in range(len(labels))]
    

for document_id in DOCUMENT_NAMES[:1]:
    create_segments_from_xml(document_id)

# all annotations for one page
document_dict[DOCUMENT_NAMES[0]]

{'document_id': 'testdata/3ZCCCW.pdf',
 'importance_score': 9.4,
 'document_label': {'group_id': 0,
  'importance_score': -1,
  'annotation_type': 'document_label',
  'annotations': []},
 'pages': [{'page_number': '0',
   'page_label': {'group_id': 0,
    'importance_score': -1,
    'annotation_type': 'page_label',
    'annotations': []},
   'segments': [{'segment_id': 'r0',
     'child_segments': [],
     'annotation_groups': [{'group_id': 0,
       'importance_score': None,
       'annotation_type': 'segment_boundary',
       'annotations': [{'annotation_id': 'r0',
         'annotator_id': 'segment_detection_model',
         'annotator_type': 'model',
         'annotation_vector': ['0.0611 ', '0.4016 ', '0.9196 ', '0.4124 ']}]},
      {'group_id': 1,
       'importance_score': None,
       'annotation_type': 'segment_type',
       'annotations': [{'annotation_id': 'r0',
         'annotator_id': 'segment_type_model',
         'annotator_type': 'model',
         'annotation_vector': [0

# 3. **AP4->AP3:** Additional text classification is performed and added to the segments

In [6]:
# TODO

# 4. **AP3->AP2:** Segment annotations are added to the documents in the collection
Using the document_dict data structure, the page segment annotations are sent to the API and saved.

In [7]:
# Send data to API
data_in = document_dict


url = "http://localhost:5000/add_segment_boundary_annotations_to_page/"
resp = post_json(data_in, url)
print("Saved annotations to db")

Saved annotations to db


# 5. **AP2 (Active Learning):** Importance score is calculated for each annotation and, based on that, each document

In [8]:
# TODO: Do in the AL container

# 6. **AP2 (Active Learning) -> AP2 (Annotation Tool):** Most important documents are selected and passed to the annotation tool
At the moment, this is implemented so that the annotation tool requests the documents from the active learning API. This direction might need to be reversed.

In [9]:
best_docs = []

url = "http://localhost:5000/init_next_most_important/"
get_json(url)

url = "http://localhost:5000/next_most_important/"
document_id, score = json.loads(get_json(url))
print(document_id, score)
best_docs.append(document_id)

url = "http://localhost:5000/next_most_important/"
document_id, score = json.loads(get_json(url))
print(document_id, score)
best_docs.append(document_id)


testdata/3ZCCCW.pdf 9.4
testdata/01-Anfrage-Musterbrief_1.pdf 5.8
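
The two identical requests above can be folded into a loop. The sketch below does this with a hypothetical helper, collect_most_important, that takes the fetch function as a parameter. It assumes that each call returns a [document_id, score] pair, as in the output above, and it does not cover what the endpoint returns once no documents remain.

```python
import json

def collect_most_important(fetch_next, count):
    """Call fetch_next() `count` times and return the (document_id, score) pairs.

    fetch_next is expected to return the JSON text of a [document_id, score] pair.
    """
    ranked = []
    for _ in range(count):
        document_id, score = json.loads(fetch_next())
        ranked.append((document_id, score))
    return ranked

# Stand-in for get_json("http://localhost:5000/next_most_important/"),
# replaying the two responses shown above
_canned = iter(['["testdata/3ZCCCW.pdf", 9.4]',
                '["testdata/01-Anfrage-Musterbrief_1.pdf", 5.8]'])
ranked = collect_most_important(lambda: next(_canned), 2)
best_docs = [document_id for document_id, _ in ranked]
print(best_docs)
```

Against the running container, the lambda would wrap get_json with the next_most_important URL, after init_next_most_important has been called once.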


In [10]:
url = "http://localhost:5000/get_document/" + best_docs[0]
document = json.loads(get_json(url))
pprint(document)

{'document_id': 'testdata/3ZCCCW.pdf',
 'document_label': {'annotation_type': 'document_label',
                    'annotations': [],
                    'group_id': 0,
                    'importance_score': -1},
 'importance_score': 9.4,
 'pages': [{'page_label': {'annotation_type': 'page_label',
                           'annotations': [],
                           'group_id': 0,
                           'importance_score': -1},
            'page_number': '0',
            'segments': [{'annotation_groups': [{'annotation_type': 'segment_boundary',
                                                 'annotations': [{'annotation_id': 'r0',
                                                                  'annotation_vector': ['0.0611 ',
                                                                                        '0.4016 ',
                                                                                        '0.9196 ',
                                                     

                                                 'annotations': [{'annotation_id': 'r22',
                                                                  'annotation_vector': [0.0,
                                                                                        1.0],
                                                                  'annotator_id': 'truncation_model',
                                                                  'annotator_type': 'model'}],
                                                 'group_id': 2,
                                                 'importance_score': None},
                                                {'annotation_type': 'is_occluded',
                                                 'annotations': [{'annotation_id': 'r22',
                                                                  'annotation_vector': [0.0,
                                                                                        1.0],
                          

In [11]:
url = "http://localhost:5000/get_document/" + best_docs[1]
document = json.loads(get_json(url))
pprint(document)

{'document_id': 'testdata/01-Anfrage-Musterbrief_1.pdf',
 'document_label': {'annotation_type': 'document_label',
                    'annotations': [],
                    'group_id': 0,
                    'importance_score': -1},
 'importance_score': 5.8,
 'pages': [{'page_label': {'annotation_type': 'page_label',
                           'annotations': [],
                           'group_id': 0,
                           'importance_score': -1},
            'page_number': '0',
            'segments': [{'annotation_groups': [{'annotation_type': 'segment_boundary',
                                                 'annotations': [{'annotation_id': 'r0',
                                                                  'annotation_vector': ['0.3283 ',
                                                                                        '0.9312 ',
                                                                                        '0.6735 ',
                                   

# 7. **AP2 (Annotation Tool):** Annotation is performed by human annotators and documents updated

# 8. **AP2 (Annotation Tool) -> AP2 (Active Learning):** Documents in collection are updated with human annotation

# Other stuff

## Handling multiple models
Active learning on segment detection needs input from multiple models: two or more models must annotate the same segment in order to determine which segment boundaries need to be annotated by humans. The API therefore needs to assign overlapping annotations to the same segment. To test this functionality, two test documents are mocked, each annotated by two models. The first document contains no overlapping annotations, while the second one does. At the moment, annotations are merged so that each segment consists of up to two annotations, which is the minimum for active learning. An alternative would be to merge multiple overlapping annotations into one segment.
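
As a rough illustration of assigning overlapping annotations to the same segment, the sketch below pairs bounding boxes from two annotators by intersection over union (IoU). This is an assumption about how such a merge could work, not the API's actual logic; the function names and the greedy pairing strategy are hypothetical. Boxes use the [xmin, ymin, xmax, ymax] layout of the segment_boundary labels.

```python
def box_area(box):
    """Area of a box given as [xmin, ymin, xmax, ymax]."""
    return (box[2] - box[0]) * (box[3] - box[1])

def iou(a, b):
    """Intersection over union of two boxes; 0.0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(width, 0.0) * max(height, 0.0)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def pair_annotations(first, second):
    """Greedily attach each annotation in `second` to the best-overlapping,
    not yet paired annotation in `first`; unmatched ones become new segments.
    Each resulting segment holds at most two annotations."""
    segments = [[ann] for ann in first]
    used = set()
    for ann in second:
        best_idx, best_iou = None, 0.0
        for idx, other in enumerate(first):
            if idx in used:
                continue
            score = iou(ann["annotation_vector"], other["annotation_vector"])
            if score > best_iou:
                best_idx, best_iou = idx, score
        if best_idx is None:
            segments.append([ann])
        else:
            used.add(best_idx)
            segments[best_idx].append(ann)
    return segments

# Boxes taken from the overlapping test document defined in this section
annotator_1 = [
    {"annotation_id": "1", "annotation_vector": [0.00, 0.10, 0.30, 0.40]},
    {"annotation_id": "2", "annotation_vector": [0.35, 0.10, 0.65, 0.40]},
    {"annotation_id": "3", "annotation_vector": [0.70, 0.10, 1.00, 0.40]},
]
annotator_2 = [
    {"annotation_id": "4", "annotation_vector": [0.00, 0.15, 0.30, 0.35]},
    {"annotation_id": "5", "annotation_vector": [0.40, 0.15, 0.95, 0.35]},
]
segments = pair_annotations(annotator_1, annotator_2)
for segment in segments:
    print([ann["annotation_id"] for ann in segment])
```

Annotation 5 overlaps two boxes of the first annotator almost equally, so which one it joins depends on floating-point rounding. A real implementation would likely need a minimum IoU threshold and an explicit tie-breaking rule.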

In [12]:
"""
# Custom Boundary Boxes for testing
# Non-Overlapping Segments
# Annotator 1
ad1 = {
    "annotation_id":"1",
    "annotator_id": "test_annotator_1",
    "annotator_type": "model",
    "annotation_vector":[0.00,0.10,0.30,0.40]
}
ad2 = {
    "annotation_id":"2",
    "annotator_id": "test_annotator_1",
    "annotator_type": "model",
    "annotation_vector":[0.35,0.10,0.65,0.40]
}
ad3 = {
    "annotation_id":"3",
    "annotator_id": "test_annotator_1",
    "annotator_type": "model",
    "annotation_vector":[0.70,0.10,1.00,0.40]
}
# Annotator 2
ad4 = {
    "annotation_id":"4",
    "annotator_id": "test_annotator_2",
    "annotator_type": "model",
    "annotation_vector":[0.00,0.60,0.30,0.90]
}
ad5 = {
    "annotation_id":"5",
    "annotator_id": "test_annotator_2",
    "annotator_type": "model",
    "annotation_vector":[0.35,0.60,0.65,0.90]
}

segment_annotations = [ad1, ad2, ad3, ad4, ad5]
page_id_dict["test1.xml"]["0"] = segment_annotations

# Overlapping Segments
# Annotator 1
ad1 = {
    "annotation_id":"1",
    "annotator_id": "test_annotator_1",
    "annotator_type": "model",
    "annotation_vector":[0.00,0.10,0.30,0.40]
}
ad2 = {
    "annotation_id":"2",
    "annotator_id": "test_annotator_1",
    "annotator_type": "model",
    "annotation_vector":[0.35,0.10,0.65,0.40]
}
ad3 = {
    "annotation_id":"3",
    "annotator_id": "test_annotator_1",
    "annotator_type": "model",
    "annotation_vector":[0.70,0.10,1.00,0.40]
}
# Annotator 2
# Overlapping with 1 Segment
ad4 = {
    "annotation_id":"4",
    "annotator_id": "test_annotator_2",
    "annotator_type": "model",
    "annotation_vector":[0.00,0.15,0.30,0.35]
}
# Overlapping with 2 Segments
ad5 = {
    "annotation_id":"5",
    "annotator_id": "test_annotator_2",
    "annotator_type": "model",
    "annotation_vector":[0.40,0.15,0.95,0.35]
}

segment_annotations = [ad1, ad2, ad3, ad4, ad5]
page_id_dict["test2.xml"]["0"] = segment_annotations
"""

In [13]:
"""
url = "http://localhost:5000/get_document/" + DOCUMENT_IDS[2]
document = json.loads(get_json(url))
pprint(document)"""

In [14]:
"""url = "http://localhost:5000/get_document/" + DOCUMENT_IDS[3]
document = json.loads(get_json(url))
pprint(document)"""
