# SP+ End-to-end Evaluation: Annotation

## 0. Setting up your environment

### 0a. git clone the mmda repo to your local device: https://github.com/allenai/mmda
Follow the repo's instructions for setting up a virtual environment and installing its requirements. You may need to install additional libraries or packages.
**Make sure you activate the env using:** _conda activate mmda_

### 0b. git clone extract-stuff-from-pdfs repo to your local device: https://github.com/allenai/extract-stuff-from-pdfs
Follow the instructions for setting up your environment for this repo.

## 1. Get data from annotation store
Load data from schema into notebook.

In [95]:
# NOTE - if schema changes, need to update "annotation-type", "bib-mention-linker-source", etc... (one time only)
import urllib.request, json 

# !!!! ANNOTATOR TODO: We will need to change SHA and annotation/attribute source for each paper evaluated
sha = '17ada3c96ef888fc89f6b995d2edd19a45383423'


attributeSource = "spp-test-1"
annotationSource = "spp-test-1"

# get data from the annotations API (returned as JSON)
with urllib.request.urlopen(f"http://annotations-api.dev.s2.allenai.org/pdf/{sha}/annotations?"
                            f"annotationSources={annotationSource}&attributeSources={attributeSource}") as url:
    data = json.load(url)
    
#sanity check, remove later
print(data)

{'docId': '17ada3c96ef888fc89f6b995d2edd19a45383423', 'attributesFromSource': {'spp-test-1': {'plain-text-doc-id': '740c1d64752b4679a742544046cf5de3452cfef4', 'mentions': {'annotation-type': ['PlainText', 'PDF'], 'mention-detection-source': 'mentions-test-3', 'bib-mention-linker-source': 'citation-links-test-1'}, 'bib-entries': {'annotation-type': 'PDF', 'bib-detection-source': 'bib_detector_test_3'}, 'bib-entry-parses': {'annotation-type': 'PDF', 'bib-parsing-source': 'bib-struct-test-1'}, 'pdf-plumber': 'pdfplumber-0.0.4'}}, 'annotationsFromSource': {}}
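
The request above builds its query string by hand. A minimal sketch of the same URL built with `urllib.parse.urlencode` follows. The helper name `build_annotations_url` is hypothetical (it is not part of any repo referenced here), and note that `urlencode` percent-encodes characters such as `|`, which this sketch assumes the API decodes as usual.

```python
from urllib.parse import urlencode


def build_annotations_url(base_url, sha, annotation_sources, attribute_sources):
    """Return the PDF annotations URL for one paper SHA (hypothetical helper)."""
    query = urlencode({
        "annotationSources": annotation_sources,
        "attributeSources": attribute_sources,
    })
    return f"{base_url}/pdf/{sha}/annotations?{query}"


# Same values as the cell above
url = build_annotations_url(
    "http://annotations-api.dev.s2.allenai.org",
    "17ada3c96ef888fc89f6b995d2edd19a45383423",
    "spp-test-1",
    "spp-test-1",
)
print(url)
```

Keeping the base URL as a parameter makes the later dev/prod switch a one-argument change.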


### 1a. Get source names and endpoint types
This part of the notebook retrieves the annotation source for each part: mentions, bib-entries, bib-entry-parses, and pdf plumber. It also retrieves whether the annotations for each source are stored in the annotation store as PDF or plain text. These sources are the inputs for retrieving each type of annotation from the annotation store.

In [96]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [102]:
# sources as variables in case they change later; may need to change strings if schema changes (one time only)
mention = "mentions"
bib_entries = "bib-entries"
bib_entry_parses = "bib-entry-parses"
plumber = "pdf-plumber"

# get source/stored formats in annotation store, store as variables
for source_type in data['attributesFromSource'][attributeSource]:  
    
    # mentions
    if source_type == mention:
        mention_annot_type = data['attributesFromSource'][attributeSource][source_type]['annotation-type']
        mention_bib_link_source = data['attributesFromSource'][attributeSource][source_type]['bib-mention-linker-source']
        mention_det_source = data['attributesFromSource'][attributeSource][source_type]['mention-detection-source']
        
    # bib entries
    if source_type == bib_entries:
        bib_entry_annot_type = data['attributesFromSource'][attributeSource][source_type]['annotation-type']
        bib_entry_det_source = data['attributesFromSource'][attributeSource][source_type]['bib-detection-source']

    # bib entry parses
    if source_type == bib_entry_parses:
        bib_parse_annot_type = data['attributesFromSource'][attributeSource][source_type]['annotation-type']
        bib_parse_source = data['attributesFromSource'][attributeSource][source_type]['bib-parsing-source']
        
    # pdf plumber
    if source_type == plumber:
        plumber_annot_type = ['PlainText', 'PDF'] #always has annotations from both endpoints
        plumber_source = data['attributesFromSource'][attributeSource][source_type]

# all source type values and endpoints from data
source_values = [[mention_annot_type, mention_bib_link_source, mention_det_source],
                [bib_entry_annot_type, bib_entry_det_source],
                [bib_parse_annot_type, bib_parse_source],
                [plumber_annot_type, plumber_source]]

# spot check of source types: outputs all the sources and what format they are stored in
source_values

[[['PlainText', 'PDF'], 'citation-links-test-1', 'mentions-test-3'],
 ['PDF', 'bib_detector_test_3'],
 ['PDF', 'bib-struct-test-1'],
 [['PlainText', 'PDF'], 'pdfplumber-0.0.4']]
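
The loop above leaves a variable undefined if a source type is missing from the payload. As a sketch only, the same lookup can be written with `dict.get` so that missing entries show up as `None`. The function name `extract_sources` and the returned key names are assumptions made for illustration; the payload keys are the ones shown in the output of section 1.

```python
def extract_sources(data, attribute_source):
    """Collect source names and storage formats from the attribute payload."""
    attrs = data["attributesFromSource"][attribute_source]
    mentions = attrs.get("mentions", {})
    bib_entries = attrs.get("bib-entries", {})
    bib_parses = attrs.get("bib-entry-parses", {})
    return {
        "mention_annot_type": mentions.get("annotation-type"),
        "mention_det_source": mentions.get("mention-detection-source"),
        "mention_bib_link_source": mentions.get("bib-mention-linker-source"),
        "bib_entry_annot_type": bib_entries.get("annotation-type"),
        "bib_entry_det_source": bib_entries.get("bib-detection-source"),
        "bib_parse_annot_type": bib_parses.get("annotation-type"),
        "bib_parse_source": bib_parses.get("bib-parsing-source"),
        "plumber_source": attrs.get("pdf-plumber"),
    }


# Trimmed copy of the payload printed in section 1
sample = {
    "attributesFromSource": {
        "spp-test-1": {
            "mentions": {
                "annotation-type": ["PlainText", "PDF"],
                "mention-detection-source": "mentions-test-3",
                "bib-mention-linker-source": "citation-links-test-1",
            },
            "bib-entries": {
                "annotation-type": "PDF",
                "bib-detection-source": "bib_detector_test_3",
            },
            "bib-entry-parses": {
                "annotation-type": "PDF",
                "bib-parsing-source": "bib-struct-test-1",
            },
            "pdf-plumber": "pdfplumber-0.0.4",
        }
    }
}

sources = extract_sources(sample, "spp-test-1")
# Any name listed here was absent from the payload
missing = [name for name, value in sources.items() if value is None]
print(sources, missing)
```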

### 1b. Retrieve Annotations for tokens, rows, and pages
Using the sources obtained in the previous cell, we now retrieve the annotations for each source, depending on whether they are stored in the annotation store as PDF or plain text.

In [103]:
"""
NOTE: For the provided sample, we are getting annotations from prod annotations-api.
      When the data is actually run through SPP, they will be in dev annotations-api.
      Will need to modify "http://annotations-api.prod.s2...." --> "http://annotations-api.dev.s2...."
"""

import boto3
import requests
import random # to get random page number to annotate

from mmda.types.span import Span
from mmda.types.annotation import SpanGroup, Box
from mmda.types.document import Document


# !!!! ANNOTATOR TODO: Record page # from output in annotation spreadsheet
# get random page number to review and annotate
page = random.randrange(1, 15)  # TODO (bailey): replace 15 with the number of pages in the PDF (randrange excludes the end value)
print("page to annotate:", page)


# helper that downloads the plain text stored at an s3:// URL (the text is needed to create span groups)
def get_text(url: str) -> str:
    s3 = boto3.resource('s3')
    url_no_prefix = url[5:]  # strip the leading "s3://"
    bucket, *key = url_no_prefix.split("/")
    key = "/".join(key)
    with s3.Bucket(bucket).Object(key).get()['Body'] as f:
        return f.read().decode('utf-8')

page to annotate: 9
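
`get_text` splits the S3 URL by slicing off the first five characters. A small sketch of the same bucket/key split using `urllib.parse.urlparse`, which also rejects URLs that are not `s3://`, is shown here. The helper name `split_s3_url` is hypothetical, and the sketch does not call S3.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/some/key' into (bucket, key) without calling S3."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_url(
    "s3://ai2-s2-science-parse-plus-prod/document/"
    "17ada3c96ef888fc89f6b995d2edd19a45383423/pdfplumber-0.0.4/text"
)
print(bucket)
print(key)
```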


#### Part 1: PDF Plumber
Get the PDF Plumber data. The text_id and text_url obtained in this block are reused when retrieving the rest of the annotations.

In [168]:
#PDF PLUMBER: 1
#request annots where source = plumber_source
resp1 = requests.get(f'http://annotations-api.prod.s2.allenai.org/pdf/{sha}/annotations?annotationSources=none&attributeSources={plumber_source}').json()

#get text_url which is needed to get text (which is needed to make initial annotation doc per paper)
text_url = resp1['attributesFromSource'][plumber_source]['text']
print("text_url:", text_url)

#returns id and s3 url, use this to get text id
resp2 = requests.post(
    'http://annotations-api.prod.s2.allenai.org/plain-text/',
    json={'s3Url': text_url}
).json()
text_id = resp2['id']
print("\ntext_id:",text_id)

#gets document id which is the same throughout all types of annotations; will reuse for future cells
#be careful of plain-text vs. pdf endpoint
resp3 = requests.get(f'http://annotations-api.prod.s2.allenai.org/plain-text/{text_id}/annotations'
                     f'?annotationSources=vila-0.0.2|layout-parser-0.0.2|{plumber_source}'
                     '&attributeSources=none').json()


#getting annotations from plumber_source
vila_annos = resp3['annotationsFromSource'][f'vila-0.0.2|layout-parser-0.0.2|{plumber_source}']['vila_span_groups']
#print(resp3['annotationsFromSource'][f'vila-0.0.2|layout-parser-0.0.2|{plumber_source}']['vila_span_groups'])

#need this to create span group
text = get_text(text_url)
vila_spangroups = []
for anno in vila_annos:
    vila_spangroups.append(
        SpanGroup(
            spans=[Span(start=anno['startChar'], end=anno['endChar'])],
            type=anno['attributesFromSource'][f'vila-0.0.2|layout-parser-0.0.2|{plumber_source}']['type']
        )
    )

#create initial document
doc = Document(text)

#annotate annotations onto doc
doc.annotate(vila=vila_spangroups)

#spot check
print(vila_spangroups[0])

text_url: s3://ai2-s2-science-parse-plus-prod/document/17ada3c96ef888fc89f6b995d2edd19a45383423/pdfplumber-0.0.4/text

text_id: 740c1d64752b4679a742544046cf5de3452cfef4
[{'id': 'vila-0.0.2|layout-parser-0.0.2|pdfplumber-0.0.4/vila_span_groups/0', 'startChar': 0, 'endChar': 125, 'attributesFromSource': {'vila-0.0.2|layout-parser-0.0.2|pdfplumber-0.0.4': {'type': 'Title', '_group': 0}}}, {'id': 'vila-0.0.2|layout-parser-0.0.2|pdfplumber-0.0.4/vila_span_groups/1', 'startChar': 126, 'endChar': 133, 'attributesFromSource': {'vila-0.0.2|layout-parser-0.0.2|pdfplumber-0.0.4': {'type': 'Paragraph', '_group': 1}}}, {'id': 'vila-0.0.2|layout-parser-0.0.2|pdfplumber-0.0.4/vila_span_groups/2', 'startChar': 134, 'endChar': 140, 'attributesFromSource': {'vila-0.0.2|layout-parser-0.0.2|pdfplumber-0.0.4': {'type': 'Keywords', '_group': 2}}}, {'id': 'vila-0.0.2|layout-parser-0.0.2|pdfplumber-0.0.4/vila_span_groups/3', 'startChar': 141, 'endChar': 180, 'attributesFromSource': {'vila-0.0.2|layout-parser-

SpanGroup(uuid='546cb098-a86d-48d8-9e6c-409251dea562', doc=<mmda.types.document.Document object at 0x7fea2e1ced90>, metadata=Metadata({'id': None, 'type': 'Title', 'text': None}), spans=[Span(start=0, end=125, box=None)], box_group=None, id=None, type='Title', text='1\nFunctions of FMS-like tyrosine kinase 3 (flt3) in zebrafish hematopoiesis and its\nrelevance to human acute myeloid leukemia')
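
To make the `startChar`/`endChar` offsets concrete, here is a standalone sketch that does not use mmda. It assumes the end offset is exclusive, as in a Python slice, which appears consistent with the sample output (the Title span ends at 125 and the next span starts at 126). The `LabeledSpan` class, the sample text, and the source name `example-source` are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class LabeledSpan:
    """Plain stand-in for an mmda SpanGroup holding a single span."""
    start: int
    end: int
    label: str


def to_labeled_spans(annotations, source):
    """Convert annotation-store JSON records into LabeledSpan objects."""
    return [
        LabeledSpan(
            start=anno["startChar"],
            end=anno["endChar"],
            label=anno["attributesFromSource"][source]["type"],
        )
        for anno in annotations
    ]


# Hypothetical document text and annotations
text = "A Short Title\nAbstract body"
source = "example-source"
annos = [
    {"startChar": 0, "endChar": 13,
     "attributesFromSource": {source: {"type": "Title"}}},
    {"startChar": 14, "endChar": 27,
     "attributesFromSource": {source: {"type": "Abstract"}}},
]

spans = to_labeled_spans(annos, source)
for span in spans:
    # Slice the document text with the span offsets
    print(span.label, repr(text[span.start:span.end]))
```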


In [118]:
#PDF PLUMBER: 2
def make_span_groups(text_spans, pdf_boxes):
    
    instance_tokens = []
    for i, text_span in enumerate(text_spans):
        box_info = pdf_boxes[i]
        associated_box = Box(
            l=box_info['x'], 
            t=box_info['y'], 
            w=box_info['width'], 
            h=box_info['height'], 
            page=box_info['page']
        )

        instance_tokens.append(
            SpanGroup(
                spans=[Span(start=text_span['startChar'], end=text_span['endChar'], box=associated_box)],
                id=text_span['attributesFromSource']['pdfplumber-0.0.4']['id']

            )
        )
    return instance_tokens
    

symbols = text
pdf_plumber_text_annos_resp = requests.get(f'http://annotations-api.prod.s2.allenai.org/plain-text/{text_id}/annotations'
                     '?annotationSources=pdfplumber-0.0.4'
                     '&attributeSources=none').json()
pdf_plumber_pdf_annos_resp = requests.get(f'http://annotations-api.prod.s2.allenai.org/pdf/{sha}/annotations'
                     '?annotationSources=pdfplumber-0.0.4'
                     '&attributeSources=none').json()

token_spans = pdf_plumber_text_annos_resp['annotationsFromSource']['pdfplumber-0.0.4']['tokens']
token_boxes = pdf_plumber_pdf_annos_resp['annotationsFromSource']['pdfplumber-0.0.4']['tokens']


instance_tokens = make_span_groups(token_spans, token_boxes)

# rows/pages specific to PDF Plumber annotations
rows = pdf_plumber_text_annos_resp['annotationsFromSource']['pdfplumber-0.0.4']['rows']
pages = pdf_plumber_text_annos_resp['annotationsFromSource']['pdfplumber-0.0.4']['pages']
vila_span_groups = vila_spangroups
print(vila_span_groups[0])

SpanGroup(uuid='47fc103c-c18c-46d0-9952-f9ffd1b96007', doc=<mmda.types.document.Document object at 0x7fea4d5d2e20>, metadata=Metadata({'id': None, 'type': 'Title', 'text': None}), spans=[Span(start=0, end=125, box=None)], box_group=None, id=None, type='Title', text='1\nFunctions of FMS-like tyrosine kinase 3 (flt3) in zebrafish hematopoiesis and its\nrelevance to human acute myeloid leukemia')
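
`make_span_groups` pairs the i-th text span with the i-th PDF box, so it relies on both endpoints returning the same number of tokens in the same order. The sketch below shows one way to check that assumption before pairing. It uses plain dictionaries instead of mmda types, and the helper name `pair_spans_with_boxes` and the sample records are hypothetical.

```python
def pair_spans_with_boxes(text_spans, pdf_boxes):
    """Pair spans and boxes by position, failing loudly if the counts differ."""
    if len(text_spans) != len(pdf_boxes):
        raise ValueError(
            f"{len(text_spans)} spans but {len(pdf_boxes)} boxes; "
            "index-based pairing would be misaligned"
        )
    return [
        {
            "start": span["startChar"],
            "end": span["endChar"],
            "page": box["page"],
            "box": (box["x"], box["y"], box["width"], box["height"]),
        }
        for span, box in zip(text_spans, pdf_boxes)
    ]


# Hypothetical records shaped like the API responses used above
example_spans = [{"startChar": 0, "endChar": 1}, {"startChar": 2, "endChar": 11}]
example_boxes = [
    {"x": 0.1, "y": 0.1, "width": 0.01, "height": 0.02, "page": 0},
    {"x": 0.2, "y": 0.1, "width": 0.08, "height": 0.02, "page": 0},
]
paired = pair_spans_with_boxes(example_spans, example_boxes)
print(paired[1])
```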


#### Part 2: Mentions
Get the mention annotations from the annotation store.

In [150]:
#retrieve mentions boxes and spans (box = pdf) (spans = plain-text)
    #Example:
    #mention_bib_link_source = citation-links-test-1
    #mention_det_source = mentions-test-3

#annotation sources = variables found above
#attribute sources = all
ment_resp1 = requests.get(f'https://annotations-api.dev.s2.allenai.org/pdf/{sha}/'
                     f'annotations?annotationSources={mention_det_source}&attributeSources=all').json()


#Plain-Text endpoint - get spans
ment_resp3 = requests.get(f'http://annotations-api.dev.s2.allenai.org/plain-text/{text_id}/annotations'
                         f'?annotationSources={mention_det_source}'
                         '&attributeSources=all').json()

#getting annotations from mentions
anno_mentions = ment_resp3['annotationsFromSource'][f'{mention_det_source}']['mentions']

#need to combine spans and boxes into single span group, which can then be annotated onto mmda doc

# method that combines box and text spans - specific to mentions
def make_span_groups(text_spans, pdf_boxes):
    
    instance_tokens = []
    for i, text_span in enumerate(text_spans):
        box_info = pdf_boxes[i]
        associated_box = Box(
            l=box_info['x'], 
            t=box_info['y'], 
            w=box_info['width'], 
            h=box_info['height'], 
            page=box_info['page']
        )

        instance_tokens.append(
            SpanGroup(
                spans=[Span(start=text_span['startChar'], end=text_span['endChar'], box=associated_box)],
                id=text_span['attributesFromSource'][f'{mention_det_source}']
            )
        )
    return instance_tokens

# create span group
# text = get_text(text_url) --already did this?
#initialize mentions spangroup list
ment_spangroups = []
for anno in anno_mentions: #sub in new variable here
    ment_spangroups.append(
        SpanGroup(
            spans=[Span(start=anno['startChar'], end=anno['endChar'])],
            type=anno['attributesFromSource'][f'{mention_det_source}']['group']
        )
    )

symbols = text
ment_text_annos_resp = requests.get(f'http://annotations-api.dev.s2.allenai.org/plain-text/{text_id}/annotations'
                     f'?annotationSources={mention_det_source}'
                     '&attributeSources=none').json()

ment_pdf_annos_resp = requests.get(f'http://annotations-api.dev.s2.allenai.org/pdf/{sha}/annotations'
                     f'?annotationSources={mention_det_source}'
                     '&attributeSources=none').json()

token_boxes = ment_pdf_annos_resp['annotationsFromSource'][f'{mention_det_source}']['mentions']
token_spans = ment_text_annos_resp['annotationsFromSource'][f'{mention_det_source}']['mentions']

instance_tokens = make_span_groups(token_spans, token_boxes)


#annotate annotations onto doc
doc.annotate(mentions=ment_spangroups)

#for now
is_overwrite=False

#print span group spot check
print(ment_spangroups[0])

SpanGroup(uuid='753f900f-8618-43d0-a3d8-4b37cce627c6', doc=<mmda.types.document.Document object at 0x7fea2f81ff70>, metadata=Metadata({'id': None, 'type': 0, 'text': None}), spans=[Span(start=4376, end=4381, box=None)], box_group=None, id=None, type=0, text='12,13')
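
Since only one randomly chosen page is annotated, it can help to look at just the mentions on that page. A minimal sketch, assuming each box record carries a `page` key as in the PDF endpoint responses used above; the function name `group_boxes_by_page` and the sample boxes are hypothetical.

```python
from collections import defaultdict


def group_boxes_by_page(boxes):
    """Map each page number to the indices of the boxes on that page."""
    by_page = defaultdict(list)
    for index, box in enumerate(boxes):
        by_page[box["page"]].append(index)
    return dict(by_page)


# Hypothetical mention boxes
sample_boxes = [
    {"x": 0.1, "y": 0.2, "width": 0.05, "height": 0.01, "page": 3},
    {"x": 0.4, "y": 0.5, "width": 0.05, "height": 0.01, "page": 9},
    {"x": 0.6, "y": 0.7, "width": 0.05, "height": 0.01, "page": 9},
]

# Indices of the mentions on the page being annotated
indices_on_page = group_boxes_by_page(sample_boxes).get(9, [])
print(indices_on_page)
```

Because the indices line up with the plain-text spans, the same indices can be used to pick the matching spans.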


#### Part 3: Bib-Entries
Get the bib-entry annotations from the annotation store.

In [190]:
#retrieve bib-entry annotations
#pdf endpoint
bib_resp1 = requests.get(f'https://annotations-api.dev.s2.allenai.org/pdf/{sha}/'
                     f'annotations?annotationSources={bib_entry_det_source}&attributeSources=all').json()


#plain-text endpoint
#not actually being used here
bib_resp3 = requests.get(f'http://annotations-api.dev.s2.allenai.org/plain-text/{text_id}/annotations'
                         f'?annotationSources={bib_entry_det_source}'
                         '&attributeSources=all').json()

#getting annotations from bib entries
#you may need to change bib_resp1 to bib_resp3 if the annotation type listed is plain-text (rather than PDF)
anno_bib = bib_resp1['annotationsFromSource'][f'{bib_entry_det_source}']['bib-entries']
#print(anno_bib)

# need to create box groups
# bib_entry_boxgroups_json to spangroups to json:
from mmda.types.annotation import BoxGroup
from copy import copy

def box_groups_json_to_boxgroups(box_groups_json):
    # small hack: a running counter supplies the id for each box group
    import itertools

    counter = itertools.count()
    
    # make BoxGroups from Json
    boxgroups = []
    for bg in box_groups_json:
        box_info = bg
        box = Box(
            l=box_info['x'], 
            t=box_info['y'], 
            w=box_info['width'], 
            h=box_info['height'],            
            page=box_info['page']
        )
        boxgroups.append(
        BoxGroup(boxes=[box],
                 id=next(counter) # OR update models?
                )
        )
    print(boxgroups[0])
    #doc.annotate(bibs=boxgroups)
    return boxgroups

bib_box_groups = box_groups_json_to_boxgroups(anno_bib)
#annotate annotations onto doc
doc.annotate(bibs=bib_box_groups)

BoxGroup(uuid='38e9764d-cca7-449c-9bac-0998da64be0a', doc=None, metadata=Metadata({'id': 0, 'type': None}), boxes=[Box(l=0.18146317915032664, t=0.866992529633839, w=0.6380922541862745, h=0.06077997036616102, page=12)], id=0, type=None)


AttributeError: 'Document' object has no attribute 'pages'
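
The box values in the output above all lie between 0 and 1, which suggests they are fractions of the page width and height. Under that assumption (it is not confirmed by the source), a sketch of converting one box to pixel coordinates for a rasterized page image looks like this. The helper name `box_to_pixels` and the sample numbers are hypothetical.

```python
def box_to_pixels(box, image_width, image_height):
    """Convert a relative box to (left, top, right, bottom) in pixels."""
    left = box["x"] * image_width
    top = box["y"] * image_height
    right = (box["x"] + box["width"]) * image_width
    bottom = (box["y"] + box["height"]) * image_height
    return tuple(round(value) for value in (left, top, right, bottom))


# Hypothetical box covering the middle of a 1000x800 image
pixel_box = box_to_pixels(
    {"x": 0.25, "y": 0.5, "width": 0.5, "height": 0.25, "page": 12},
    image_width=1000,
    image_height=800,
)
print(pixel_box)
```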

#### Part 4: Bib-Parse
Get the bib-entry-parse annotations from the annotation store.

In [159]:
#retrieve bib parse data
#pdf endpoint
bib_parse_resp1 = requests.get(f'https://annotations-api.dev.s2.allenai.org/pdf/{sha}/'
                     f'annotations?annotationSources={bib_parse_source}&attributeSources=all').json()


#plain-text endpoint
#not actually being used here
bib_parse_resp3 = requests.get(f'http://annotations-api.dev.s2.allenai.org/plain-text/{text_id}/annotations'
                         f'?annotationSources={bib_parse_source}'
                         '&attributeSources=all').json()

#test
print(bib_parse_resp1['attributesFromSource'])

#getting annotations from bib parse
#you may need to change bib_parse_resp1 to bib_parse_resp3 if the annotation type listed is plain-text (rather than PDF)
anno_bib_parse = bib_parse_resp1['annotationsFromSource'][f'{bib_parse_source}']['bib-entry-parses']

#initialize bib parse spangroup list
bib_parse_spangroups = []

# need this to create span group
for anno in anno_bib_parse: #sub in new variable here
    bib_parse_spangroups.append(
        SpanGroup(
            spans=[Span(start=anno['startChar'], end=anno['endChar'])],
            type=anno['attributesFromSource'][f'{bib_parse_source}']['id']
        )
    )

#annotate annotations onto doc
#doc.annotate(bib_parse=bib_parse_spangroups)

{'bib_detector_test_1': {'bib-entries': [{'x': 0.18146317915032664, 'y': 0.866992529633839, 'page': 12, 'width': 0.6380922541862745, 'height': 0.06077997036616102, 'attributes': {'id': 0}}, {'x': 0.18147058823529413, 'y': 0.1447024740656565, 'page': 12, 'width': 0.6380905871421568, 'height': 0.032949968926767735, 'attributes': {'id': 1}}, {'x': 0.18147058823529413, 'y': 0.1817488250378788, 'page': 12, 'width': 0.638381949408497, 'height': 0.05158693280303038, 'attributes': {'id': 2}}, {'x': 0.1814687359640523, 'y': 0.23720456310606067, 'page': 12, 'width': 0.6376549329460782, 'height': 0.032949968926767735, 'attributes': {'id': 3}}, {'x': 0.18146503142156847, 'y': 0.5522123347601015, 'page': 12, 'width': 0.6379462952124177, 'height': 0.05135935602272722, 'attributes': {'id': 4}}, {'x': 0.18146688369281047, 'y': 0.34834361602272745, 'page': 12, 'width': 0.6379592611111118, 'height': 0.05158693280303038, 'attributes': {'id': 5}}, {'x': 0.18146503142156847, 'y': 0.6076680728282833, 'page'

KeyError: 'bib-struct-test-1'
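
The `KeyError` above only names the missing source. A sketch of a lookup that also reports which sources the response does contain can make this easier to debug. The helper name `get_annotations` and the sample response are hypothetical, and the sketch assumes the response has the `annotationsFromSource` layout used throughout this notebook.

```python
def get_annotations(response, source, key):
    """Return response['annotationsFromSource'][source][key] with a clearer error."""
    by_source = response.get("annotationsFromSource", {})
    if source not in by_source:
        available = ", ".join(sorted(by_source)) or "<none>"
        raise LookupError(
            f"source {source!r} not in response; available sources: {available}"
        )
    return by_source[source].get(key, [])


# Hypothetical response that contains a different source than the one requested
example_response = {
    "annotationsFromSource": {"some-other-source": {"bib-entry-parses": []}}
}

try:
    get_annotations(example_response, "bib-struct-test-1", "bib-entry-parses")
except LookupError as err:
    message = str(err)
    print(message)
```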

### 1c. Check to make sure you have annotated all fields onto mmda doc

In [122]:
print(doc.fields)

['vila', 'mentions']
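
The check above can be automated by comparing the annotated fields with the ones this notebook intends to add. A minimal sketch follows; the expected names are taken from the `doc.annotate(...)` calls in the cells above (`bib_parse` appears only in a commented-out call), and the helper name `missing_fields` is hypothetical.

```python
# Field names used in the doc.annotate(...) calls in this notebook
EXPECTED_FIELDS = ["vila", "mentions", "bibs", "bib_parse"]


def missing_fields(actual, expected=EXPECTED_FIELDS):
    """Return the expected field names that are not yet on the document."""
    return [name for name in expected if name not in actual]


# With the output shown above, two fields are still missing
missing = missing_fields(["vila", "mentions"])
print(missing)
```

In the notebook itself this would be called as `missing_fields(doc.fields)`.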


## 2. Visualize Paper and Draw boxes on mentions

Use draw_boxes to visualize each mention.

In [160]:
#draw boxes function
import layoutparser.elements as lpe
import layoutparser.visualization as lpv

def draw_boxes(groups, canvas, color, **kwargs):
    viz = []

    for group in groups:
        for box in group:
            viz.append(lpe.TextBlock(box, type="color"))

    return lpv.draw_box(
        canvas, 
        viz,
        color_map={'color': color},
        **kwargs
    )

In [164]:
# importing sys
import sys
 
# adding the local extract-stuff-from-pdfs checkout to the system path, since the package cannot be imported otherwise
sys.path.insert(0, '/Users/baileyk/Projects/SP_end_to_end/extract-stuff-from-pdfs')

from extract_citation_mentions import download_pdf, make_boxes
from mmda.rasterizers.rasterizer import PDF2ImageRasterizer

paper_sha = '17ada3c96ef888fc89f6b995d2edd19a45383423'
page = 0

matches = papers[sha].pages[page]
print(f"found={matches.found} missed={matches.missed} incorrect/extra={matches.incorrect}")

ImportError: cannot import name 'make_boxes' from 'extract_citation_mentions' (/Users/baileyk/Projects/SP_end_to_end/extract-stuff-from-pdfs/extract_citation_mentions/__init__.py)

In [None]:
pdf_path = download_pdf(paper_sha)
images = PDF2ImageRasterizer().rasterize(pdf_path, dpi=144)

gold = [m.gold for m in matches.matches if m.gold]
baseline = [m.baseline for m in matches.matches if m.baseline]

image = images[page]
image = draw_boxes(make_boxes(gold, image), image, 'yellow')
image = draw_boxes(make_boxes(baseline, image), image, 'blue', box_width=3)

display(image)
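
The `found`, `missed`, and `incorrect` counts printed above can be summarized as precision and recall. This sketch assumes `found` counts predictions that match a gold mention, `missed` counts gold mentions with no prediction, and `incorrect` counts predictions with no gold match; that reading is an assumption, and the counts below are made up.

```python
def precision_recall(found, missed, incorrect):
    """Compute (precision, recall) from match counts, returning 0.0 on empty sets."""
    predicted = found + incorrect  # everything the system output
    gold = found + missed          # everything the annotator marked
    precision = found / predicted if predicted else 0.0
    recall = found / gold if gold else 0.0
    return precision, recall


# Hypothetical counts for one page
precision, recall = precision_recall(found=8, missed=2, incorrect=2)
print(f"precision={precision:.2f} recall={recall:.2f}")
```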