Articlenizer

This Python package provides functionality for preprocessing and parsing scientific articles. It is specifically designed to capture naming conventions common in scientific literature.

Installing

The package can be installed with:

git clone https://github.com/dave-s477/articlenizer
cd articlenizer
pip install .

or, for an editable installation:

pip install -e .

The only non-standard package included in the install is pytest. To verify that everything works, run:

pytest tests

or

python -m pytest tests

Preprocessing

The package offers various functions centered around parsing and processing scientific articles. The most important ones are also available as command line tools.

Sentenization

Sentenization is performed by initially splitting the text wherever one of [.!?] is followed by a whitespace. The split is refined by also splitting potential formatting errors such as "sentence.Next", where the whitespace after the sentence boundary is missing. After all potential splits have been generated, false positive splits are recombined: for example, a split right after "e.g." is likely erroneous and is removed. Lastly, enumerations are split, but only if the item starts with an upper-cased word: (1) Like this.
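
As a rough illustration of the first two steps, a minimal sketch (not articlenizer's actual implementation, which additionally recombines false positives and splits enumerations):

import re

text = 'A first sentence.A second one. And a third!'

# First pass: split wherever one of [.!?] is followed by whitespace.
parts = re.split(r'(?<=[.!?])\s+', text)

# Refinement: also split missing-whitespace errors such as "sentence.Next".
parts = [p for part in parts for p in re.split(r'(?<=\.)(?=[A-Z])', part)]

# parts: ['A first sentence.', 'A second one.', 'And a third!']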

Looking at an example:

from articlenizer import articlenizer as art

art.sentenize_text('Split this text in sentences. Output depends on the flag: "representation".')

# Out: ['Split this text in sentences.', 'Output depends on the flag: "representation".']

Tokenization

Tokenization is purely regex based and is best understood by taking a look at ./articlenizer/tokenize.py.

It can for instance be run by:

from articlenizer import articlenizer as art

art.tokenize_text('Tokenize a text with articlenizer v.0.1.')

# Out: ['Tokenize', 'a', 'text', 'with', 'articlenizer', 'v.', '0.1', '.']
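
For intuition, a minimal regex tokenizer in the same spirit (illustrative only; the actual, far more extensive rules live in ./articlenizer/tokenize.py):

import re

# Version markers, numbers, words and punctuation become separate tokens.
token_pattern = re.compile(r"v\.|\d+(?:\.\d+)*|\w+|[^\w\s]")

token_pattern.findall('Tokenize a text with articlenizer v.0.1.')

# Out: ['Tokenize', 'a', 'text', 'with', 'articlenizer', 'v.', '0.1', '.']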

Corrections

Some "obvious" text errors are correct such as: 1. no space after semi-colons: this;should;not;happen and no space before and after brackets: neither(should)this.

For instance:

from articlenizer import articlenizer as art

art.correct_text('Some wrong text;needs to be corrected.')

# Out: 'Some wrong text; needs to be corrected.'

All in one
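
Sentenization and tokenization can also be combined in a single call: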

from articlenizer import articlenizer as art

art.get_tokenized_sentences('Split this text in sentences with articlenizer v.0.1. Output depends on a couple of flags.')

# Out: [
#   ['Split', 'this', 'text', 'in', 'sentences', 'with', 'articlenizer', 'v.', '0.1', '.'], 
#   ['Output', 'depends', 'on', 'a', 'couple', 'of', 'flags', '.']
# ]

Format conversion

JATS

Articlenizer includes a JATS XML parser that extracts plain text from JATS articles, omitting meta-data.
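
For illustration, the extraction can be sketched with the standard library as follows (this is not articlenizer's own parser; 'article.xml' is a placeholder file name):

import xml.etree.ElementTree as ET

tree = ET.parse('article.xml')           # a JATS-encoded article
body = tree.getroot().find('.//body')    # skip <front>, i.e. the meta-data

# Plain text of every paragraph in the article body
paragraphs = [''.join(p.itertext()) for p in body.iter('p')]
plain_text = '\n'.join(paragraphs)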

BRAT and IOB2

Articlenizer includes functionality for transforming BRAT (stand-off format) annotations to IOB2 and back.
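
Conceptually, the conversion maps BRAT's character-offset annotations onto token-level tags. A minimal sketch (not articlenizer's API) for a stand-off line such as "T1\tSoftware 26 33\tSPSS 25":

text = 'We analysed the data with SPSS 25.'
entities = [('Software', 26, 33)]  # (label, start, end) character offsets

# Tokenize while tracking character offsets (whitespace split for brevity).
tokens, pos = [], 0
for tok in text.split():
    start = text.index(tok, pos)
    tokens.append((tok, start))
    pos = start + len(tok)

# IOB2: B- on the first token of an entity, I- on the rest, O elsewhere.
tags = []
for tok, start in tokens:
    tag = 'O'
    for label, e_start, e_end in entities:
        if e_start <= start < e_end:
            tag = ('B-' if start == e_start else 'I-') + label
    tags.append(tag)

# tags: ['O', 'O', 'O', 'O', 'O', 'B-Software', 'I-Software']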

TEI and HTML

It also offers functionality to transform TEI-based and HTML-based annotations to BRAT format. However, these converters were designed specifically to handle two corpora, Softcite (TEI) and BioNerDS (HTML), and will not generalize well to other problems.
