This Python package provides functionality for preprocessing and parsing scientific articles. It is specifically designed to capture naming conventions common in scientific literature.
The package can be installed by:
git clone https://github.com/dave-s477/articlenizer
cd articlenizer
pip install .
or, for an editable installation:
pip install -e .
The only non-standard package included in the install is pytest.
To verify the functionality run:
pytest tests
or
python -m pytest tests
The package offers functionality centered around parsing and processing scientific articles. The most important functions are also available as command line tools.
Sentenization is performed by initially splitting the text wherever one of [.!?] is followed by whitespace.
The split is refined by additionally splitting at likely formatting errors such as "sentence.Next", which should contain a newline but lack it due to a formatting error.
After all potential splits have been generated, false positive splits are recombined in order to capture errors; for instance, a newline directly after "e.g." is likely to be erroneous and is removed.
Finally, enumerations are split, but only if the item starts with a capitalized word: (1) Like this.
Looking at an example:
from articlenizer import articlenizer as art
art.sentenize_text('Split this text in sentences. Output depends on the flag: "representation".')
# Out: ['Split this text in sentences.', 'Output depends on the flag: "representation".']
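The splitting steps described above can be illustrated with a minimal sketch. This is not the package's implementation: the function name `naive_sentenize`, the abbreviation list, and the regular expressions are assumptions chosen for illustration, and the enumeration step is left out.

```python
import re

# Hypothetical abbreviation list; the package's actual rules are not reproduced here.
ABBREVIATIONS = ("e.g.", "i.e.", "et al.", "cf.")


def naive_sentenize(text):
    """Illustrative split / refine / recombine pipeline (assumed, simplified)."""
    # Step 1: split wherever '.', '!' or '?' is followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text)

    # Step 2: also split formatting errors such as "sentence.Next"
    # (lowercase letter, period, uppercase letter, no whitespace).
    refined = []
    for part in parts:
        refined.extend(re.split(r"(?<=[a-z]\.)(?=[A-Z])", part))

    # Step 3: recombine false positives, here splits directly after an abbreviation.
    sentences = []
    for part in refined:
        if not part:
            continue
        if sentences and sentences[-1].endswith(ABBREVIATIONS):
            sentences[-1] = sentences[-1] + " " + part
        else:
            sentences.append(part)
    return sentences


print(naive_sentenize("We used tools, e.g. Python. It works.Next sentence."))
```

The order matters in this sketch: generating all candidate splits first and recombining afterwards keeps each rule simple, at the cost of a second pass over the candidates.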
Tokenization is purely regex-based and is best understood by looking at ./articlenizer/tokenize.py.
It can, for instance, be run as follows:
from articlenizer import articlenizer as art
art.tokenize_text('Tokenize a text with articlenizer v.0.1.')
# Out: ['Tokenize', 'a', 'text', 'with', 'articlenizer', 'v.', '0.1', '.']
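To give a feeling for what regex-based tokenization means, here is a deliberately small sketch. The pattern and the name `naive_tokenize` are assumptions made for illustration; the rules in ./articlenizer/tokenize.py are more elaborate, so the result differs from the package output shown above (for example, this sketch separates "v" from the following period).

```python
import re

# Assumed toy pattern: version-like numbers first, then words, then single symbols.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)+|\w+|[^\w\s]")


def naive_tokenize(text):
    """Return all non-overlapping pattern matches, scanning left to right."""
    return TOKEN_PATTERN.findall(text)


print(naive_tokenize("Tokenize a text with articlenizer v.0.1."))
```

Because the alternatives are tried in order, putting the number pattern first keeps "0.1" together instead of splitting it at the decimal point.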
Some "obvious" text errors are corrected, such as a missing space after semicolons (this;should;not;happen) and missing spaces before and after brackets (neither(should)this).
For instance:
from articlenizer import articlenizer as art
art.correct_text('Some wrong text;needs to be corrected.')
# Out: 'Some wrong text; needs to be corrected'
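The two kinds of corrections mentioned above could be approximated with a few substitutions. This is a sketch under assumptions, not the package's code: `naive_correct` and its patterns are hypothetical and handle only round brackets.

```python
import re


def naive_correct(text):
    """Insert spaces that are missing around semicolons and round brackets (assumed rules)."""
    # A semicolon directly followed by a non-space character gets a space after it.
    text = re.sub(r";(?=\S)", "; ", text)
    # An opening bracket directly after a word character gets a space before it.
    text = re.sub(r"(?<=\w)\(", " (", text)
    # A closing bracket directly before a word character gets a space after it.
    text = re.sub(r"\)(?=\w)", ") ", text)
    return text


print(naive_correct("this;should;not;happen"))
print(naive_correct("neither(should)this."))
```

Rules this blunt would also alter text such as function calls or formulas, which is one reason to restrict corrections to cases that are clearly errors.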
Sentenization and tokenization can also be combined in a single call:
from articlenizer import articlenizer as art
art.get_tokenized_sentences('Split this text in sentences with articlenizer v.0.1. Output depends on a couple of flags.')
# Out: [
# ['Split', 'this', 'text', 'in', 'sentences', 'with', 'articlenizer', 'v.', '0.1', '.'],
# ['Output', 'depends', 'on', 'a', 'couple', 'of', 'flags', '.']
# ]
Articlenizer includes a JATS XML parser that extracts plain text from JATS articles, omitting meta-data.
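As a rough illustration of the idea (not of Articlenizer's own parser, whose interface is not shown here), plain text can be pulled from the body of a JATS document with the standard library. The sample document and the function name `body_paragraphs` are assumptions for this sketch.

```python
import xml.etree.ElementTree as ET

# Tiny made-up JATS-like document: metadata lives in <front>, content in <body>.
JATS_SAMPLE = """<article>
  <front>
    <article-meta>
      <title-group><article-title>A Title</article-title></title-group>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Methods</title>
      <p>We used <italic>SPSS</italic> for the analysis.</p>
    </sec>
  </body>
</article>"""


def body_paragraphs(xml_string):
    """Return the text of every <p> inside <body>, ignoring <front> metadata."""
    root = ET.fromstring(xml_string)
    body = root.find("body")
    if body is None:
        return []
    # itertext() flattens inline markup such as <italic>; split/join normalizes whitespace.
    return [" ".join("".join(p.itertext()).split()) for p in body.iter("p")]


print(body_paragraphs(JATS_SAMPLE))
```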
Articlenizer includes functionality for transforming BRAT (stand-off format) to IOB2 and back.
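The BRAT to IOB2 direction can be sketched as follows. This is a simplified illustration, not the package's converter: the function name `brat_to_iob2`, the toy tokenizer, and the "Software" label are assumptions, and only contiguous text-bound annotations (lines starting with "T") are handled.

```python
import re


def brat_to_iob2(text, ann):
    """Map BRAT stand-off entity spans onto tokens as IOB2 tags (assumed, simplified)."""
    # Parse lines of the form "T1<TAB>Label start end<TAB>surface text".
    entities = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue
        fields = line.split("\t")
        label, start, end = fields[1].split(" ")
        entities.append((int(start), int(end), label))

    tagged = []
    # Toy tokenizer: runs of word characters or single symbols, with character offsets.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in entities:
            if start <= match.start() and match.end() <= end:
                # The first token of an entity gets B-, all following tokens get I-.
                tag = ("B-" if match.start() == start else "I-") + label
                break
        tagged.append((match.group(), tag))
    return tagged


text = "We used SPSS Statistics for the analysis."
ann = "T1\tSoftware 8 23\tSPSS Statistics"
for token, tag in brat_to_iob2(text, ann):
    print(token, tag)
```

The conversion relies on character offsets only, so token boundaries must line up with the annotated spans; a token that straddles an entity boundary is tagged "O" in this sketch.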
Articlenizer also offers functionality to transform TEI-based and HTML-based annotations to BRAT format. However, these converters were designed specifically for two corpora, Softcite (TEI) and BioNerDs (HTML), and will not generalize well to other data.