# ScienceBeam Parser: Advanced Name Model example

This example takes a more in-depth look at the ScienceBeam Parser architecture.
It explores the model data generated for the `name` model, and trains a new model
using a single training document. The main purpose is to illustrate how the individual
parts fit together. (It isn't necessarily the suggested workflow.)

### Install Dependencies (if necessary)

In [1]:
!pip install --quiet \
    "sciencebeam-parser>=0.1.4" \
    "tensorflow<2.0.0" \
    "numpy<1.17.0" \
    "pandas<1.3.0" \
    "typing_extensions"

In [2]:
!python --version

Python 3.7.6


In [3]:
!pip freeze | grep --ignore-case --extended-regexp "tensorflow|keras|sciencebeam|numpy|pandas|delft"

delft==0.2.7
Keras==2.2.4
Keras-Applications==1.0.8
keras-bert==0.84.0
keras-embed-sim==0.9.0
keras-layer-normalization==0.15.0
keras-multi-head==0.28.0
keras-pos-embd==0.12.0
keras-position-wise-feed-forward==0.7.0
Keras-Preprocessing==1.1.2
keras-self-attention==0.50.0
keras-transformer==0.39.0
numpy==1.16.6
pandas==1.2.5
sciencebeam-parser==0.1.4
sciencebeam-trainer-delft==0.0.31
tensorflow==1.15.5
tensorflow-estimator==1.15.1


### Configure Logging

In [4]:
import logging
import os
import sys
from typing import List, Tuple

import tensorflow as tf

In [5]:
# configure logging so that we see ScienceBeam Parser's output
logging.basicConfig(level='ERROR', stream=sys.stdout)

# reduce tensorflow warnings
tf.logging.set_verbosity(tf.logging.ERROR)

### Download "real" PDF documents with XML

For this example we are using a sample document from the bioRxiv 10k dataset.

In [6]:
biorxiv_10k_train_100_pdf_url = (
    'https://github.com/elifesciences/sciencebeam-datasets/releases/download/biorxiv/biorxiv-10k-train-100-pdf.zip'
)
local_biorxiv_10k_train_100_pdf_dir = os.path.expanduser('~/.keras/datasets/biorxiv-10k-train-100-pdf')
local_biorxiv_10k_train_100_pdf_zip_file = local_biorxiv_10k_train_100_pdf_dir + '.zip'
tf.keras.utils.get_file(
    local_biorxiv_10k_train_100_pdf_zip_file,
    origin=biorxiv_10k_train_100_pdf_url,
    cache_subdir=local_biorxiv_10k_train_100_pdf_dir,
    extract=True,
    archive_format='zip'
)

'/home/jovyan/.keras/datasets/biorxiv-10k-train-100-pdf.zip'

In [7]:
biorxiv_10k_train_100_xml_url = (
    'https://github.com/elifesciences/sciencebeam-datasets/releases/download/biorxiv/biorxiv-10k-train-100-xml.zip'
)
local_biorxiv_10k_train_100_xml_dir = os.path.expanduser('~/.keras/datasets/biorxiv-10k-train-100-xml')
local_biorxiv_10k_train_100_xml_zip_file = local_biorxiv_10k_train_100_xml_dir + '.zip'
tf.keras.utils.get_file(
    local_biorxiv_10k_train_100_xml_zip_file,
    origin=biorxiv_10k_train_100_xml_url,
    cache_subdir=local_biorxiv_10k_train_100_xml_dir,
    extract=True,
    archive_format='zip'
)

'/home/jovyan/.keras/datasets/biorxiv-10k-train-100-xml.zip'

In [8]:
local_sample_pdf_file = os.path.join(local_biorxiv_10k_train_100_pdf_dir, '005587v1.pdf')
assert os.path.exists(local_sample_pdf_file)

In [9]:
local_sample_xml_file = os.path.join(local_biorxiv_10k_train_100_xml_dir, '005587v1.xml')
assert os.path.exists(local_sample_xml_file)

### Parse to Semantic Document

The *Semantic Document* is the internal representation of a
semantically annotated document within ScienceBeam Parser.
It can be "improved" iteratively, e.g. by using additional models.

As an example, the `citation` model identifies `RawAuthors`, the text of usually comma-separated authors as found in the document. The `name` model is then used to replace `RawAuthors` with individual `Author` elements, each of which has sub-elements for the name parts, such as the *surname*.

In this example we are looking into training our own `name` model.

In [10]:
# create an instance of ScienceBeamParser
from sciencebeam_parser.resources.default_config import DEFAULT_CONFIG_FILE
from sciencebeam_parser.config.config import AppConfig
from sciencebeam_parser.utils.media_types import MediaTypes
from sciencebeam_parser.app.parser import ScienceBeamParser


config = AppConfig.load_yaml(DEFAULT_CONFIG_FILE)

# the parser contains all of the models
sciencebeam_parser = ScienceBeamParser.from_config(config)

Using TensorFlow backend.


In [11]:
# preload models
sciencebeam_parser.fulltext_models.preload()



In [12]:
from sciencebeam_parser.processors.fulltext.config import RequestFieldNames

In [13]:
fulltext_processor_config = sciencebeam_parser.fulltext_processor_config.get_for_requested_field_names(
    {RequestFieldNames.REFERENCES}
)._replace(extract_citation_authors=False)

In [14]:
# a session provides a scope and temporary directory for intermediate files
# it is recommended to create a separate session for every document
with sciencebeam_parser.get_new_session(fulltext_processor_config=fulltext_processor_config) as session:
    session_source = session.get_source(
        local_sample_pdf_file,
        MediaTypes.PDF
    )
    parsed_layout_document = session_source.lazy_parsed_layout_document.get()
    parsed_semantic_content = parsed_layout_document.get_parsed_semantic_document()






### Extract `SemanticReference` list from `SemanticDocument`

In [15]:
import sciencebeam_parser.document.semantic_document as sb_semantic_document


semantic_document = parsed_semantic_content.semantic_document
semantic_references = list(semantic_document.iter_by_type_recursively(
    sb_semantic_document.SemanticReference
))
semantic_raw_authors = list(semantic_document.iter_by_type_recursively(
    sb_semantic_document.SemanticRawAuthors
))
print('text:', semantic_raw_authors[0].get_text())
print('semantic content:', semantic_raw_authors[0])

text: Allen, T.
semantic content: SemanticRawAuthors(mixed_content=[SemanticTextContentWrapper(content=LayoutBlock(lines=[LayoutLine(tokens=[LayoutToken(text='Allen', font=LayoutFont(font_id='font6', font_family='tzucyh+nimbusromno9l-regu', font_size=10.909, is_bold=False, is_italics=False, is_subscript=False, is_superscript=False), whitespace='', coordinates=LayoutPageCoordinates(x=85.039, y=709.934, width=4.49455, height=13.1455, page_number=30), line_descriptor=LayoutLineDescriptor(line_id=139768066158672)), LayoutToken(text=',', font=LayoutFont(font_id='font6', font_family='tzucyh+nimbusromno9l-regu', font_size=10.909, is_bold=False, is_italics=False, is_subscript=False, is_superscript=False), whitespace=' ', coordinates=LayoutPageCoordinates(x=107.51175, y=709.934, width=4.49455, height=13.1455, page_number=30), line_descriptor=LayoutLineDescriptor(line_id=139768066158672)), LayoutToken(text='T', font=LayoutFont(font_id='font6', font_family='tzucyh+nimbusromno9l-regu', font_size=1

In [16]:
print('text:', semantic_raw_authors[1].get_text())

text: Attolini, C., Cheng, Y., Beroukhim, R., Getz, G., Abdel-Wahab, O., Levine, R. L., Mellinghoff, I. K., and Michor, F.


### Parse XML document with "ground-truth" data

In [17]:
from lxml import etree
from sciencebeam_parser.utils.xml import get_text_content

xml_root = etree.parse(local_sample_xml_file).getroot()
reference_nodes = xml_root.xpath('//ref-list/ref')
label_by_tag = {'surname': '<surname>', 'given-names': '<forename>'}
entity_label_texts_by_reference = [
    [
        (label_by_tag[node.tag], get_text_content(node))
        for node in reference_node.xpath('|'.join([
            f'.//string-name/{tag}'
            for tag in label_by_tag
        ]))
    ]
    for reference_node in reference_nodes
]
entity_label_texts_by_reference[:3]

[[('<surname>', 'Allen'), ('<forename>', 'T')],
 [('<surname>', 'Attolini'),
  ('<forename>', 'C.'),
  ('<surname>', 'Cheng'),
  ('<forename>', 'Y.'),
  ('<surname>', 'Beroukhim'),
  ('<forename>', 'R.'),
  ('<surname>', 'Getz'),
  ('<forename>', 'G.'),
  ('<surname>', 'Abdel-Wahab'),
  ('<forename>', 'O.'),
  ('<surname>', 'Levine'),
  ('<forename>', 'R. L.'),
  ('<surname>', 'Mellinghoff'),
  ('<forename>', 'I. K.'),
  ('<surname>', 'Michor'),
  ('<forename>', 'F.')],
 [('<surname>', 'Beerenwinkel'),
  ('<forename>', 'N.'),
  ('<surname>', 'Rahnenführer'),
  ('<forename>', 'J.'),
  ('<surname>', 'Däumer'),
  ('<forename>', 'M.'),
  ('<surname>', 'Hoffmann'),
  ('<forename>', 'D.'),
  ('<surname>', 'Kaiser'),
  ('<forename>', 'R.'),
  ('<surname>', 'Selbig'),
  ('<forename>', 'J.'),
  ('<surname>', 'Lengauer'),
  ('<forename>', 'T')]]

### Generate `delft` sequence model data

The training data doesn't just consist of the tokens.
Each token also comes with additional features, defined by the *data generator* of the individual model.
The format is space-separated values, with the token usually being the first value.
Sub-token values are not used by the DL model, but are there for the non-DL model (`wapiti`).
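To make the format more concrete, the following sketch rebuilds the first ten columns of a data line: the token, the lowercased token, four prefixes and four suffixes. This matches the data lines printed later in this example, but the function name `get_token_prefix_suffix_features` is hypothetical and it only approximates what `NameDataGenerator` emits. The remaining columns (line status, capitalisation, digit and punctuation features) are left out.

```python
from typing import List


def get_token_prefix_suffix_features(token: str, max_length: int = 4) -> List[str]:
    """Illustrative only: token, lowercased token, then prefixes and suffixes."""
    # prefixes and suffixes of length 1..max_length (short tokens simply repeat)
    prefixes = [token[:n] for n in range(1, max_length + 1)]
    suffixes = [token[-n:] for n in range(1, max_length + 1)]
    return [token, token.lower()] + prefixes + suffixes


# joining the values with spaces gives the start of a data line
print(' '.join(get_token_prefix_suffix_features('Allen')))
# prints: Allen allen A Al All Alle n en len llen
```

For a single-character token such as `T`, every prefix and suffix is the token itself, which is why those rows look so repetitive.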

In [18]:
from sciencebeam_parser.document.layout_document import LayoutDocument
from sciencebeam_parser.models.data import DEFAULT_DOCUMENT_FEATURES_CONTEXT
from sciencebeam_parser.models.name.data import NameDataGenerator

data_generator = NameDataGenerator(document_features_context=DEFAULT_DOCUMENT_FEATURES_CONTEXT)

In [19]:
model_data_list_list = [
    list(data_generator.iter_model_data_for_layout_document(
        LayoutDocument.for_blocks(list(
            semantic_reference.view_by_type(
                sb_semantic_document.SemanticRawAuthors
            ).iter_blocks()
        ))
    ))
    for semantic_reference in semantic_references
]
model_data_list = model_data_list_list[0]
print(model_data_list[0].layout_token)
[model_data.data_line for model_data in model_data_list]

LayoutToken(text='Allen', font=LayoutFont(font_id='font6', font_family='tzucyh+nimbusromno9l-regu', font_size=10.909, is_bold=False, is_italics=False, is_subscript=False, is_superscript=False), whitespace='', coordinates=LayoutPageCoordinates(x=85.039, y=709.934, width=4.49455, height=13.1455, page_number=30), line_descriptor=LayoutLineDescriptor(line_id=139768066158672))


['Allen allen A Al All Alle n en len llen LINESTART INITCAP NODIGIT 0 0 0 0 0 0 NOPUNCT 0',
 ', , , , , , , , , , LINEIN ALLCAP NODIGIT 1 0 0 0 0 0 COMMA 0',
 'T t T T T T T T T T LINEIN ALLCAP NODIGIT 1 0 0 0 0 0 NOPUNCT 0',
 '. . . . . . . . . . LINEEND ALLCAP NODIGIT 1 0 0 0 0 0 DOT 0']

In [20]:
# convert to data frame
import pandas as pd


def get_dataframe_for_data_lines(data_lines: List[str], has_label: bool = False) -> pd.DataFrame:
    df = pd.DataFrame([
        data_line.split(' ')
        for data_line in data_lines
    ])
    df.columns = ['token'] + list(df.columns)[1:]
    if has_label:
        df.columns = list(df.columns)[:-1] + ['label']
    return df


get_dataframe_for_data_lines([model_data.data_line for model_data in model_data_list])

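| | token | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Allen | allen | A | Al | All | Alle | n | en | len | llen | ... | INITCAP | NODIGIT | 0 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 |
| 1 | , | , | , | , | , | , | , | , | , | , | ... | ALLCAP | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 |
| 2 | T | t | T | T | T | T | T | T | T | T | ... | ALLCAP | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 |
| 3 | . | . | . | . | . | . | . | . | . | . | ... | ALLCAP | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | DOT | 0 |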


### Add labels based on "ground-truth" name information (aka "auto-annotate")

In this section we will use the "ground-truth" name part information extracted from the XML.
We will add the corresponding `delft` label to the model data.
This is a simplified implementation that relies on exact matches and all references being extracted correctly.
We ignore references where that assumption didn't hold.
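The idea can be sketched without any ScienceBeam Parser classes. The example below is a simplified stand-in, not the implementation used in this example: it tokenises with a plain regular expression, searches with `str.find` instead of word-boundary patterns, and the function name `label_tokens_by_exact_match` is hypothetical. It does show the two essential parts: searching for each entity after the previous match, and assigning `B-` to the first token of a match and `I-` to the rest.

```python
import re
from typing import List, Tuple


def label_tokens_by_exact_match(
    text: str,
    entity_label_texts: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Assigns B-/I-/O labels to simple regex tokens (illustrative only)."""
    # tokenise into words and single punctuation characters, keeping offsets
    token_matches = list(re.finditer(r'\w+|[^\w\s]', text))
    labels = ['O'] * len(token_matches)
    search_start = 0
    for entity_label, entity_text in entity_label_texts:
        # entities are expected in document order
        start = text.find(entity_text, search_start)
        if start < 0:
            return []  # give up on the whole reference
        end = start + len(entity_text)
        search_start = end
        is_first = True
        for index, token_match in enumerate(token_matches):
            if token_match.start() >= start and token_match.end() <= end:
                labels[index] = ('B-' if is_first else 'I-') + entity_label
                is_first = False
    return [
        (token_match.group(), label)
        for token_match, label in zip(token_matches, labels)
    ]


print(label_tokens_by_exact_match(
    'Levine, R. L.',
    [('<surname>', 'Levine'), ('<forename>', 'R. L')]
))
```

Tokens that are not covered by any entity, such as the separating comma, keep the `O` label.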

In [21]:
import re

from sciencebeam_parser.utils.text import normalize_text
from sciencebeam_parser.document.layout_document import LayoutTokensText, LayoutBlock


def auto_annotate_model_data_lines(
    model_data_list,
    entity_label_text: List[Tuple[str, str]]
) -> List[str]:
    """
    Append a B-/I-/O label to each data line, based on exact text matches.

    Returns an empty list if any of the entities could not be found.
    """
    layout_block = LayoutBlock.for_tokens([
        model_data.layout_token
        for model_data in model_data_list
    ])
    label_by_layout_token_id = {}
    layout_tokens_text = LayoutTokensText(layout_block)
    layout_tokens_text_str = str(layout_tokens_text)
    # entities are expected in document order,
    # so each search starts after the previous match
    previous_start = 0
    for entity_label, entity_text in entity_label_text:
        entity_text = normalize_text(entity_text)
        p = re.compile(r'\b' + re.escape(entity_text) + r'\b')
        m = p.search(layout_tokens_text_str, pos=previous_start)
        if not m:
            # retry without trailing dots: `\b` doesn't match between
            # a dot and whitespace (or the end of the text)
            p = re.compile(r'\b' + re.escape(entity_text.rstrip('.')) + r'\b')
            m = p.search(layout_tokens_text_str, pos=previous_start)
        if not m:
            # give up on the whole reference rather than produce partial labels
            print('not found: %r: %r in %r (%d)' % (
                entity_label, entity_text, layout_tokens_text_str, previous_start
            ))
            return []
        previous_start = m.end()
        for layout_index, layout_token in enumerate(
            layout_tokens_text.iter_layout_tokens_between(m.start(), m.end())
        ):
            # the first token of an entity gets `B-`, the following tokens `I-`
            if layout_index == 0:
                label_prefix = 'B-'
            else:
                label_prefix = 'I-'
            label_by_layout_token_id[id(layout_token)] = label_prefix + entity_label
    # tokens without a matching entity are labelled `O` (outside)
    return [
        model_data.data_line + ' ' + label_by_layout_token_id.get(
            id(model_data.layout_token), 'O'
        )
        for model_data in model_data_list
    ]


print('training data for 1st reference (index 0):')
get_dataframe_for_data_lines(
    auto_annotate_model_data_lines(
        model_data_list_list[0],
        entity_label_texts_by_reference[0]
    ),
    has_label=True
)

training data for 1st reference (index 0):


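| | token | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Allen | allen | A | Al | All | Alle | n | en | len | llen | ... | NODIGIT | 0 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<surname>` |
| 1 | , | , | , | , | , | , | , | , | , | , | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 | O |
| 2 | T | t | T | T | T | T | T | T | T | T | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<forename>` |
| 3 | . | . | . | . | . | . | . | . | . | . | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | DOT | 0 | O |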


In [22]:
print('training data for 2nd reference (index 1):')
get_dataframe_for_data_lines(
    auto_annotate_model_data_lines(
        model_data_list_list[1],
        entity_label_texts_by_reference[1]
    ),
    has_label=True
)

training data for 2nd reference (index 1):


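| | token | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Attolini | attolini | A | At | Att | Atto | i | ni | ini | lini | ... | NODIGIT | 0 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<surname>` |
| 1 | , | , | , | , | , | , | , | , | , | , | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 | O |
| 2 | C | c | C | C | C | C | C | C | C | C | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<forename>` |
| 3 | . | . | . | . | . | . | . | . | . | . | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | DOT | 0 | O |
| 4 | , | , | , | , | , | , | , | , | , | , | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 | O |
| 5 | Cheng | cheng | C | Ch | Che | Chen | g | ng | eng | heng | ... | NODIGIT | 0 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<surname>` |
| 6 | , | , | , | , | , | , | , | , | , | , | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 | O |
| 7 | Y | y | Y | Y | Y | Y | Y | Y | Y | Y | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<forename>` |
| 8 | . | . | . | . | . | . | . | . | . | . | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | DOT | 0 | O |
| 9 | , | , | , | , | , | , | , | , | , | , | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 | O |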


In [23]:
print('training data for 3rd reference (index 2):')
get_dataframe_for_data_lines(
    auto_annotate_model_data_lines(
        model_data_list_list[2],
        entity_label_texts_by_reference[2]
    ),
    has_label=True
)

training data for 3rd reference (index 2):


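| | token | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Beerenwinkel | beerenwinkel | B | Be | Bee | Beer | l | el | kel | nkel | ... | NODIGIT | 0 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<surname>` |
| 1 | , | , | , | , | , | , | , | , | , | , | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 | O |
| 2 | N | n | N | N | N | N | N | N | N | N | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<forename>` |
| 3 | . | . | . | . | . | . | . | . | . | . | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | DOT | 0 | O |
| 4 | , | , | , | , | , | , | , | , | , | , | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 | O |
| 5 | Rahnenführer | rahnenführer | R | Ra | Rah | Rahn | r | er | rer | hrer | ... | NODIGIT | 0 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<surname>` |
| 6 | , | , | , | , | , | , | , | , | , | , | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 | O |
| 7 | J | j | J | J | J | J | J | J | J | J | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<forename>` |
| 8 | . | . | . | . | . | . | . | . | . | . | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | DOT | 0 | O |
| 9 | , | , | , | , | , | , | , | , | , | , | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 | O |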


In [24]:
print('generating training data for all references...')
annotated_model_data_lines_list = [
    auto_annotate_model_data_lines(
        model_data_list,
        entity_label_texts
    )
    for model_data_list, entity_label_texts in zip(
        model_data_list_list, entity_label_texts_by_reference
    )
]

print('training data for 3rd reference (index 2), sample from overall training data:')
get_dataframe_for_data_lines(
    annotated_model_data_lines_list[2],
    has_label=True
)

generating training data for all references...
not found: '<forename>': 'T.' in 'Held, L., Schrödle, B., and Rue, H.' (34)
not found: '<forename>': 'R.' in 'Raphael, B. and Vandin, F.' (25)
not found: '<forename>': 'N.' in 'Reiter, J., Bozic, I., Chatterjee, K., and Nowak, M.' (51)
not found: '<surname>': 'Parmigiani' in 'Sjoblom, T., Jones, S., Wood, L. D., Parsons, D. W., Lin, J., Barber, T. D., Mandelker, D., Leary, R. J., Ptak, J., Silliman, N., Szabo, S., Buckhaults, P., Farrell, C., Meeh, P., Markowitz, S. D., Willis, J., Dawson, D., Willson, J. K. V., Gazdar, A. F., Hartigan, J., Wu, L., Liu, C., Parmi- giani, G., Park, B. H., Bachman, K. E., Papadopoulos, N., Vogelstein, B., Kinzler, K. W., and Velculescu, V. E.' (284)
not found: '<forename>': 'W.-Y.' in 'Szabo, A. and Boucher, K. M.' (27)
not found: '<forename>': 'J.' in 'Tofigh, A., Sjolund, E., Hoglund, M., and Lagergren, J.' (54)
training data for 3rd reference (index 2), sample from overall training data:


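| | token | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Beerenwinkel | beerenwinkel | B | Be | Bee | Beer | l | el | kel | nkel | ... | NODIGIT | 0 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<surname>` |
| 1 | , | , | , | , | , | , | , | , | , | , | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 | O |
| 2 | N | n | N | N | N | N | N | N | N | N | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<forename>` |
| 3 | . | . | . | . | . | . | . | . | . | . | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | DOT | 0 | O |
| 4 | , | , | , | , | , | , | , | , | , | , | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 | O |
| 5 | Rahnenführer | rahnenführer | R | Ra | Rah | Rahn | r | er | rer | hrer | ... | NODIGIT | 0 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<surname>` |
| 6 | , | , | , | , | , | , | , | , | , | , | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 | O |
| 7 | J | j | J | J | J | J | J | J | J | J | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<forename>` |
| 8 | . | . | . | . | . | . | . | . | . | . | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | DOT | 0 | O |
| 9 | , | , | , | , | , | , | , | , | , | , | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 | O |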


### Train Model

In this section we will load an existing model and continue training it on the new data.
That way we don't have to learn everything from scratch.
This approach only works if the passed-in features match those the existing model was trained with.

In [25]:
from sciencebeam_trainer_delft.embedding.manager import EmbeddingManager
from sciencebeam_trainer_delft.sequence_labelling.wrapper import (
    DEFAULT_EMBEDDINGS_PATH,
    Sequence
)

default_model_url = (
    'https://github.com/elifesciences/sciencebeam-models/releases/download'
    '/grobid-0.6.1/2021-06-28-grobid-0.6.1-name-citation-no-word-embedding-no-layout-features-e500.tar.gz'
)

embedding_manager = EmbeddingManager(
    path=DEFAULT_EMBEDDINGS_PATH,
    download_manager=sciencebeam_parser.download_manager
)
delft_model = Sequence(
    'name-citation',
    embedding_manager=embedding_manager
)
delft_model.load_from(default_model_url)

In [26]:
get_dataframe_for_data_lines(annotated_model_data_lines_list[0], has_label=True)

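| | token | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Allen | allen | A | Al | All | Alle | n | en | len | llen | ... | NODIGIT | 0 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<surname>` |
| 1 | , | , | , | , | , | , | , | , | , | , | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | COMMA | 0 | O |
| 2 | T | t | T | T | T | T | T | T | T | T | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | NOPUNCT | 0 | `B-<forename>` |
| 3 | . | . | . | . | . | . | . | . | . | . | ... | NODIGIT | 1 | 0 | 0 | 0 | 0 | 0 | DOT | 0 | O |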


In [27]:
flat_data_lines = '\n\n'.join([
    '\n'.join(annotated_model_data_lines)
    for annotated_model_data_lines in annotated_model_data_lines_list
]).splitlines()
print(flat_data_lines[:10])

['Allen allen A Al All Alle n en len llen LINESTART INITCAP NODIGIT 0 0 0 0 0 0 NOPUNCT 0 B-<surname>', ', , , , , , , , , , LINEIN ALLCAP NODIGIT 1 0 0 0 0 0 COMMA 0 O', 'T t T T T T T T T T LINEIN ALLCAP NODIGIT 1 0 0 0 0 0 NOPUNCT 0 B-<forename>', '. . . . . . . . . . LINEEND ALLCAP NODIGIT 1 0 0 0 0 0 DOT 0 O', '', 'Attolini attolini A At Att Atto i ni ini lini LINESTART INITCAP NODIGIT 0 0 0 0 0 0 NOPUNCT 0 B-<surname>', ', , , , , , , , , , LINEIN ALLCAP NODIGIT 1 0 0 0 0 0 COMMA 0 O', 'C c C C C C C C C C LINEIN ALLCAP NODIGIT 1 0 0 0 0 0 NOPUNCT 0 B-<forename>', '. . . . . . . . . . LINEIN ALLCAP NODIGIT 1 0 0 0 0 0 DOT 0 O', ', , , , , , , , , , LINEIN ALLCAP NODIGIT 1 0 0 0 0 0 COMMA 0 O']
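In the flattened lines above, an empty string separates one reference from the next. The sketch below shows how such lines could be split back into per-sequence tokens, labels and features. It is a simplified stand-in with a hypothetical name (`parse_labelled_data_lines`), not the reader from `sciencebeam_trainer_delft`. It assumes that the first column is the token, the last column is the label, and everything in between is a feature.

```python
from typing import List, Tuple

TokenSequences = List[List[str]]


def parse_labelled_data_lines(
    lines: List[str]
) -> Tuple[TokenSequences, TokenSequences, List[List[List[str]]]]:
    """Splits labelled data lines into sequences (illustrative only)."""
    texts, labels, features = [], [], []
    tokens, token_labels, token_features = [], [], []
    for line in list(lines) + ['']:  # the extra blank line flushes the last sequence
        if line.strip():
            values = line.split(' ')
            tokens.append(values[0])             # first column: the token
            token_labels.append(values[-1])      # last column: the label
            token_features.append(values[1:-1])  # assumption: everything in between
        elif tokens:
            texts.append(tokens)
            labels.append(token_labels)
            features.append(token_features)
            tokens, token_labels, token_features = [], [], []
    return texts, labels, features


sample_lines = [
    'Allen allen LINESTART B-<surname>',
    ', , LINEIN O',
    '',
    'Cheng cheng LINESTART B-<surname>'
]
sample_texts, sample_labels, sample_features = parse_labelled_data_lines(sample_lines)
print(sample_texts)   # [['Allen', ','], ['Cheng']]
print(sample_labels)  # [['B-<surname>', 'O'], ['B-<surname>']]
```

Consecutive blank lines are skipped, so a reference for which auto-annotation returned an empty list contributes no sequence.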


In [28]:
from sciencebeam_trainer_delft.sequence_labelling.reader import (
    load_data_and_labels_crf_lines
)


texts, labels, features = load_data_and_labels_crf_lines(flat_data_lines)
len(texts)

65

In [29]:
# predictions before model training
delft_model.tag(texts[:3], output_format=None, features=features[:3])




[[('Allen', 'B-<surname>'), (',', 'O'), ('T', 'B-<forename>'), ('.', 'O')],
 [('Attolini', 'B-<surname>'),
  (',', 'O'),
  ('C', 'B-<forename>'),
  ('.', 'O'),
  (',', 'O'),
  ('Cheng', 'B-<surname>'),
  (',', 'O'),
  ('Y', 'B-<forename>'),
  ('.', 'O'),
  (',', 'O'),
  ('Beroukhim', 'B-<surname>'),
  (',', 'O'),
  ('R', 'B-<forename>'),
  ('.', 'O'),
  (',', 'O'),
  ('Getz', 'B-<surname>'),
  (',', 'O'),
  ('G', 'B-<forename>'),
  ('.', 'O'),
  (',', 'O'),
  ('Abdel', 'B-<surname>'),
  ('-', 'I-<surname>'),
  ('Wahab', 'I-<surname>'),
  (',', 'O'),
  ('O', 'B-<forename>'),
  ('.', 'O'),
  (',', 'O'),
  ('Levine', 'B-<surname>'),
  (',', 'O'),
  ('R', 'B-<forename>'),
  ('.', 'O'),
  ('L', 'B-<middlename>'),
  ('.', 'O'),
  (',', 'O'),
  ('Mellinghoff', 'B-<surname>'),
  (',', 'O'),
  ('I', 'B-<forename>'),
  ('.', 'O'),
  ('K', 'B-<middlename>'),
  ('.', 'O'),
  (',', 'O'),
  ('and', 'O'),
  ('Michor', 'B-<surname>'),
  (',', 'O'),
  ('F', 'B-<forename>'),
  ('.', 'O')],
 [('Beerenwin

In [30]:
delft_model.training_config.max_epoch = 20
delft_model.training_config.patience = 5

In [31]:
vars(delft_model.training_config)

{'batch_size': 10,
 'optimizer': 'adam',
 'learning_rate': 0.001,
 'lr_decay': 0.9,
 'clip_gradients': 5.0,
 'max_epoch': 20,
 'early_stop': True,
 'patience': 5,
 'max_checkpoints_to_keep': 5,
 'multiprocessing': True,
 'initial_epoch': None,
 'input_window_stride': None,
 'checkpoint_epoch_interval': 1,
 'initial_meta': None}

In [32]:
valid_size = 3
delft_model.train(
    x_train=texts[:-valid_size],
    y_train=labels[:-valid_size],
    features_train=features[:-valid_size],
    x_valid=texts[-valid_size:],
    y_valid=labels[-valid_size:],
    features_valid=features[-valid_size:]
)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char_input (InputLayer)         (None, None, 16)     0                                            
__________________________________________________________________________________________________
char_embeddings (TimeDistribute (None, None, 16, 32) 2848        char_input[0][0]                 
__________________________________________________________________________________________________
word_input (InputLayer)         (None, None, 0)      0                                            
__________________________________________________________________________________________________
char_lstm (TimeDistributed)     (None, None, 128)    49664       char_embeddings[0][0]            
__________________________________________________________________________________________________
word_lstm_

In [33]:
# predictions after model training (on training example)
delft_model.tag(texts[:3], output_format=None, features=features[:3])




[[('Tomasetti', 'B-<surname>'),
  (',', 'O'),
  ('C', 'B-<forename>'),
  ('.', 'O'),
  (',', 'O'),
  ('Vogelstein', 'B-<surname>'),
  (',', 'O'),
  ('B', 'B-<forename>'),
  ('.', 'O'),
  (',', 'O'),
  ('and', 'O'),
  ('Parmigiani', 'B-<surname>'),
  (',', 'O'),
  ('G', 'B-<forename>'),
  ('.', 'O')],
 [('Gerstung', 'B-<surname>'),
  (',', 'O'),
  ('M', 'B-<forename>'),
  ('.', 'O'),
  (',', 'O'),
  ('Baudis', 'B-<surname>'),
  (',', 'O'),
  ('M', 'B-<forename>'),
  ('.', 'O'),
  (',', 'O'),
  ('Moch', 'B-<surname>'),
  (',', 'O'),
  ('H', 'B-<forename>'),
  ('.', 'O'),
  (',', 'O'),
  ('and', 'O'),
  ('Beerenwinkel', 'B-<surname>'),
  (',', 'O'),
  ('N', 'B-<forename>'),
  ('.', 'O')],
 [('Mather', 'B-<surname>'),
  (',', 'O'),
  ('W', 'B-<forename>'),
  ('.', 'B-<forename>'),
  ('H', 'B-<forename>'),
  ('.', 'O'),
  (',', 'O'),
  ('Hasty', 'B-<surname>'),
  (',', 'O'),
  ('J', 'B-<forename>'),
  ('.', 'O'),
  (',', 'O'),
  ('and', 'O'),
  ('Tsimring', 'B-<surname>'),
  (',', 'O'),
  (

In [34]:
model_path = os.path.expanduser('~/.cache/sciencebeam-usage-examples/name-model-1')
delft_model.save(model_path)

In [35]:
!ls -lh {model_path}/name-citation

total 1.4M
-rw-rw-r-- 1 jovyan jovyan 1.2K Nov 24 17:01 config.json
-rw-rw-r-- 1 jovyan jovyan  461 Nov 24 17:01 meta.json
-rw-rw-r-- 1 jovyan jovyan 1.4M Nov 24 17:01 model_weights.hdf5
-rw-rw-r-- 1 jovyan jovyan 2.3K Nov 24 17:01 preprocessor.json
-rw-rw-r-- 1 jovyan jovyan 1.5K Nov 24 17:01 preprocessor.pkl


### Where to go next:

- This model could now be used as the "name_citation" model
- Alternatively the "SemanticRawAuthors" could be processed separately with some custom logic
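
As a rough idea of what such custom logic might look like, the sketch below groups tagged tokens, in the `(token, label)` format returned by `delft_model.tag` above, into one dictionary per author. The function name `group_tagged_tokens_into_authors` is hypothetical, and the sketch assumes that every author starts with a surname, as in the sample references. Continuation tokens are joined without whitespace, so the original spacing is not preserved.

```python
from typing import Dict, List, Tuple


def group_tagged_tokens_into_authors(
    tagged_tokens: List[Tuple[str, str]]
) -> List[Dict[str, str]]:
    """Groups (token, label) pairs into one dict per author (illustrative only)."""
    authors: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for token, label in tagged_tokens:
        if label == 'O':
            continue  # separators such as commas, dots and "and"
        prefix, field = label[:2], label[2:]
        # assumption: a new surname starts a new author
        if prefix == 'B-' and field == '<surname>' and current:
            authors.append(current)
            current = {}
        if prefix == 'B-' and field in current:
            # repeated field within one author: keep both parts
            current[field] += ' ' + token
        else:
            # first token of a field, or an I- continuation
            current[field] = current.get(field, '') + token
    if current:
        authors.append(current)
    return authors


print(group_tagged_tokens_into_authors([
    ('Abdel', 'B-<surname>'), ('-', 'I-<surname>'), ('Wahab', 'I-<surname>'),
    (',', 'O'), ('O', 'B-<forename>'), ('.', 'O'), (',', 'O'),
    ('Levine', 'B-<surname>'), (',', 'O'), ('R', 'B-<forename>'), ('.', 'O'),
    ('L', 'B-<middlename>'), ('.', 'O')
]))
```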