A text mining framework encapsulating content extraction, language processing and content analysis functionality with a strong focus on BioNLP.
txtfnnl currently integrates the following Apache projects:
In addition, the following direct dependencies exist:
- uimaFIT 1.4 for configuration and testing
- for making gene mention and normalization annotations, the gnamed DB has to be available on the network, which in turn (by default) requires PostgreSQL 8.4+; the SQL-related tests for txtfnnl, however, use the H2 in-memory DB.
- for the syntactic grep facilities (via the
grep pipeline), libfsmg has to be in your local Maven repository.
- for the txtfnnl-wrappers module, the relevant external tools need to be
downloaded, installed, and visible on the system
$PATH. All supported external NLP tools are listed in the section Installation below.
Before installing txtfnnl itself, the additional (independent) tools should be installed; of these, only libfsmg is mandatory. The following NLP tools are supported by txtfnnl:
- After downloading, unpacking, building, and installing (usually just a
curl-tar-configure-make-install loop), and assuming the default installation
prefix /usr/local, nothing else needs to be configured (parser version known to work with this AE wrapper: 4.7.6).
- GENIA Tagger
- The GENIA Tagger does not follow GNU Autotools' best practices, so
after downloading, unpacking, and compiling you need to make sure that the
geniatagger executable is on your
$PATH. Furthermore, you should put the whole directory that contains the
morphdic directory somewhere you can remember: each time you want to use the GENIA Tagger, you will have to add the directory containing the
morphdic directory as an argument. A sensible place for the tagger directory might be
/usr/local/share/geniatagger if you have write access to it.
- The BioLemmatizer is wrapped by this pipeline, but does not have to be installed separately; it is automatically compiled from its official Maven repository.
- The Linnaeus species name recognition software is wrapped by this library.
The relevant JAR file is distributed within the txtfnnl-wrappers package,
but you will have to separately fetch and install the species dictionary
you want to use (commonly, that would be either the species, the species-proxy,
and/or the genera-species-proxy dictionary). The dictionaries are only
required at runtime, and only if you run a pipeline and wish to use its
species normalization capabilities. You do not need to do anything extra to
install the txtfnnl framework and/or run pipelines without species normalization.
- A generic finite state machine library developed by the principal author
of the txtfnnl framework. Clone from GitHub (
git clone git://github.com/fnl/libfsmg.git), and run
mvn install in the newly created
libfsmg directory to install.
All Java dependencies should be resolved by Maven (if you have a working
Internet connection). To "install" txtfnnl itself, execute
mvn install
in the top-level directory (after installing libfsmg in a similar way). txtfnnl is known
to work on Apple OSX, Ubuntu and CentOS. The framework requires the use of
Java 1.6 or later.
After installing the Maven project, the
txtfnnl shell script in the
txtfnnl-bin module can be put anywhere on the system
To use the pipelines from the command line, execute the
txtfnnl script in
the txtfnnl-bin module directory (or copy it to your
The script expects to find the local Maven repository either in
~/.m2/repository or otherwise defined as the environment variable
Currently, the following pipelines are available:
- split: splits any kind of data Tika can extract plain-text from into sentences, one per line.
- pre: pre-processes any kind of data Tika can extract, generating XMI files with sentence, token, and chunk annotations. The tokens are PoS tagged and lemmatized.
- tag: works just as
pre, but outputs the content in plaintext format instead of XMI.
- grep: enables the use of syntax patterns (written in a style similar to the output of
tag) to annotate semantic entities and relationships between those entities. In other words, this pipeline provides a regular syntax expression language for matching token sequences and their part-of-speech tags, chunk tags, and lemmas in UIMA. This functionality is similar to that provided by GATE's JAPE, but with a much simpler grammar and far fewer features. This pipeline implements a libfsmg FSM.
- norm: a gene normalization pipeline (under construction)
- ginx: a gene interaction extraction pipeline (under construction)
Two more pipelines that are not actively maintained right now:
- entities: annotates known entity mentions on documents by supplying a mapping of input file names (without suffix) to entity identifiers (type, namespace, identifier), looking up the names for those entity IDs in a DB, and matching any of those names in the extracted plain-text. Example use: gene mention annotations using gnamed.
- patterns: extracts relationship patterns between named entities in a known relationship. A relationship is defined as one or more entity IDs (as for
entities) together with the input file name and is supposed to be contained within a single sentence. If a sentence with all required entities is found, a number of patterns used to syntactically combine the entities are extracted. Each pattern is printed on a single line and patterns for different sentences are separated by an empty line.
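The output layout described for the patterns pipeline (one pattern per line, an empty line between sentences) can be regrouped per sentence with a few lines of code. The following is only a minimal sketch: the function name `group_patterns` is hypothetical and not part of txtfnnl, and the input lines used below are placeholders, not real pipeline output.

```python
def group_patterns(lines):
    """Group output lines into one list of patterns per sentence.

    Assumes the layout described above: one pattern per line, with an
    empty line separating the patterns of different sentences.
    """
    groups, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            current.append(line)
        elif current:
            # An empty line closes the current sentence's block.
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


# Placeholder lines standing in for extracted patterns.
blocks = group_patterns(["pattern-1\n", "pattern-2\n", "\n", "pattern-3\n"])
print(len(blocks))  # → 2
```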
A quick reference of the CF regular expression grammar for syntax patterns:
    S -> Capture S? | Phrase S? | Token S?
    Capture -> "(" S ")"                          # => Semantic Relationship Annotations
    Token -> "." Quantifier? | RegEx Quantifier?  # dot "." matches any token
    Quantifier -> "*" | "?" | "+"                 # zero-or-more, zero-or-one, one-or-more
    Phrase -> "[" Chunk InPhrase "]" "?"?         # may be skipped with "?"
    Chunk -> "NP" | "VP" | "PP" | "ADVP" ...      # i.e., a chunker tag
    InPhrase -> CaptureInPhrase InPhrase? | Token InPhrase?
    CaptureInPhrase -> "(" InPhrase ")"           # => Semantic Relationship Ann.
    # InPhrase and CaptureInPhrase ensure that phrases are never nested
    RegEx -> RE1 | RE2 | RE3                      # token annotation-specific matching
    RE1 -> "<lemma>"                              # a Java regex used to match the token's lemma
    RE2 -> "<PoS>_<lemma>"                        # as RE1, two regex patterns separated by underscore
    RE3 -> "<word>_<PoS>_<lemma>"                 # idem, w/ 3 patterns (PoS = Part-of-Speech)
    # to allow any match for a word, PoS or lemma annotation in RE2 or RE3:
    # use a "*" instead of the corresponding regex, e.g.: "IN_*"
    # all terminals must be separated by white-spaces
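To make the grammar more concrete, here is a small recursive-descent checker for it. This is an illustrative sketch, not the parser txtfnnl or libfsmg actually uses: the names `count_captures` and `PatternError` are hypothetical, any terminal other than the brackets, parentheses, dot, and quantifiers is assumed to be a RegEx terminal, and a chunk tag is assumed to be any all-uppercase alphabetic terminal.

```python
RESERVED = {"(", ")", "[", "]", ".", "*", "?", "+"}
QUANTIFIERS = {"*", "?", "+"}


class PatternError(ValueError):
    """Raised when a pattern does not conform to the grammar."""


def count_captures(pattern):
    """Validate a syntax pattern and return its number of capture groups."""
    toks = pattern.split()  # all terminals are white-space separated
    pos = 0
    captures = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def expect(terminal):
        nonlocal pos
        if peek() != terminal:
            raise PatternError(
                f"expected {terminal!r} at terminal {pos}, found {peek()!r}")
        pos += 1

    def token():
        # Token -> "." Quantifier? | RegEx Quantifier?
        nonlocal pos
        tok = peek()
        if tok is None or (tok in RESERVED and tok != "."):
            raise PatternError(f"expected a token at terminal {pos}, found {tok!r}")
        pos += 1
        if peek() in QUANTIFIERS:
            pos += 1

    def sequence(in_phrase):
        # Parses S (in_phrase=False) or InPhrase (in_phrase=True).
        nonlocal pos, captures
        items = 0
        while peek() is not None and peek() not in (")", "]"):
            if peek() == "(":
                # Capture / CaptureInPhrase
                pos += 1
                captures += 1
                sequence(in_phrase)
                expect(")")
            elif peek() == "[":
                # Phrase -> "[" Chunk InPhrase "]" "?"?
                if in_phrase:
                    raise PatternError("phrases must not be nested")
                pos += 1
                chunk = peek()
                if chunk is None or not (chunk.isalpha() and chunk.isupper()):
                    raise PatternError(f"expected a chunk tag, found {chunk!r}")
                pos += 1
                sequence(True)
                expect("]")
                if peek() == "?":
                    pos += 1
            else:
                token()
            items += 1
        if items == 0:
            raise PatternError(f"empty sequence at terminal {pos}")

    sequence(False)
    if pos != len(toks):
        raise PatternError(f"unexpected {peek()!r} at terminal {pos}")
    return captures


print(count_captures("[ NP DT_* ? ( . + ) gene|protein|factor ] . * [ VP . * bind ]"))  # → 1
```

A single `sequence` function covers both S and InPhrase because the two productions differ only in whether a nested phrase is allowed.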
An example line in a pattern resource file that will annotate relationships between two entities: the first entity is a noun phrase with a head lemma of gene, protein, or factor, followed by any number of tokens, a verb phrase with a head lemma of bind, and an optional IN-preposition; the second entity may be any other noun phrase:
[ NP DT_* ? ( . + ) gene|protein|factor ] . * [ VP . * bind ] IN_* ? [ NP DT_* ? ( . + ) ] interaction PPI actor source actor target
After the pattern, separated by tabs, the annotations are specified: a match will result in a RelationshipAnnotation with namespace "interaction" and ID "PPI" between the matched entities, which are annotated as SemanticAnnotations with namespace "actor", IDs "source" and "target", respectively. I.e., the first namespace-ID-pair defines the relationship annotation, all following pairs should correspond with the number of capture groups in the pattern and define the semantic (entity) annotations that should be made.
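The line layout just described can be sketched as a small parser. This is an assumption-laden illustration rather than txtfnnl's own resource reader: `parse_pattern_line` is a hypothetical name, and capture groups are counted simply as the number of "(" terminals in the pattern.

```python
def parse_pattern_line(line):
    """Split a pattern resource line into (pattern, relationship, entities).

    Assumed layout: the pattern, then tab-separated namespace/ID pairs.
    The first pair describes the relationship annotation; each following
    pair describes the entity annotation of one capture group.
    """
    fields = line.rstrip("\n").split("\t")
    pattern, annotations = fields[0], fields[1:]
    if len(annotations) < 2 or len(annotations) % 2 != 0:
        raise ValueError("expected namespace/ID pairs after the pattern")
    pairs = list(zip(annotations[0::2], annotations[1::2]))
    relationship, entities = pairs[0], pairs[1:]
    # Terminals are white-space separated, so each "(" opens one capture group.
    n_groups = pattern.split().count("(")
    if n_groups != len(entities):
        raise ValueError(
            f"{n_groups} capture groups but {len(entities)} entity annotations")
    return pattern, relationship, entities


line = "\t".join([
    "[ NP DT_* ? ( . + ) gene|protein|factor ] . * [ VP . * bind ] "
    "IN_* ? [ NP DT_* ? ( . + ) ]",
    "interaction", "PPI", "actor", "source", "actor", "target",
])
pattern, relationship, entities = parse_pattern_line(line)
print(relationship)  # → ('interaction', 'PPI')
print(entities)      # → [('actor', 'source'), ('actor', 'target')]
```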
License, Author and Copyright Notice
txtfnnl is free, open software provided via an
Apache 2.0 License; see
LICENSE.txt in this directory for details.
The only parts of this framework that cannot be freely used in a commercial application are (some of) the wrapped tools. To the author's best knowledge, this limitation only relates to the GENIA Tagger, while all other tools are open to any kind of use.
Copyright 2012, 2013 - Florian Leitner (fnl). All rights reserved.