Python 3 tools for data mining in molecular biology




libfnl is an API and CLI that facilitates data and text mining by providing a collection of easy-to-use tools. The library is designed to work with Python 3 only. It is specifically tuned towards mining biomedical and scientific texts, but can be used in other contexts, too. It complements the gnamed gene name repository daemon and the medic PubMed mirroring tool collection. In addition, an (orphan) couchpy repository could provide a document storage facility.

The library contains the following packages:

tools to linguistically analyze text (tokenization, PoS tagging, phrase chunking, entity detection); modules to segment sentences (based on NLTK) and to map text (strings) to entries in dictionaries; this includes a Python wrapper for the GENIA Tagger, a Python wrapper for the NER Suite, and a handler for the GENIA corpus; furthermore, a Maximum Entropy classifier is available via NLTK's wrapper for MegaM
a module to evaluate inter-rater Kappa scores and a module to develop text classifiers based on Scikit-Learn
wrappers to work with text data (strings, tokens, segments, annotations, etc.)
additional utilities and tools (currently, just for handling JSON)
the CLI scripts to manage data/text, representing the main value provided by this collection
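
To illustrate what an inter-rater Kappa score measures, here is a minimal sketch of Cohen's kappa for two raters. It is not taken from libfnl: the function name `cohen_kappa` and the choice of Cohen's variant are assumptions made for illustration, and the library's own module may compute a different statistic or use a different interface.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (illustrative only)."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same items")
    n = len(rater_a)
    if n == 0:
        raise ValueError("no items to compare")
    # Observed agreement: fraction of items that received identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, estimated from each rater's label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one single, identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, with labels `["gene", "gene", "other", "other"]` from one rater and `["gene", "other", "other", "other"]` from the other, the observed agreement is 0.75, the chance agreement is 0.5, and the resulting kappa is 0.5.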

The script directory provides the following command-line interfaces:

  • fnlclassi: generate a classifier for [NER-tagged] text using Scikit-Learn.
  • fnlcorpus: store corpora in JSON format in a CouchDB.
  • fnldgrep: "grep" for tokens using a dictionary.
  • fnldictag: tag semantic tokens from a dictionary in linguistically annotated text.
  • fnlgpcounter: count gene/protein symbols in MEDLINE.
  • fnlkappa: calculate inter-rater agreement scores.
  • fnlsegment: segment text into sentences using NLTK (PunktSentenceTokenizer).
  • fnlsegtrain: train a nltk.punkt.PunktSentenceTokenizer.
  • fnltok: a fast, pure-Python, Unicode-aware string tokenizer.
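
As a rough idea of what a pure-Python, Unicode-aware tokenizer does, the sketch below splits a string into runs of letters, runs of digits, and single punctuation or symbol characters. It is an assumption-laden illustration, not the fnltok implementation: the names `TOKEN` and `tokenize` are hypothetical, and the actual tool may apply different token classes and rules.

```python
import re

# For str patterns in Python 3, \w, \d and \s are Unicode-aware by default.
# Alternatives, in order: a run of letters, a run of digits,
# one punctuation/symbol character, or one underscore.
TOKEN = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def tokenize(text):
    """Return the list of tokens in text, dropping whitespace (illustrative only)."""
    return TOKEN.findall(text)
```

Calling `tokenize("IL-2Rα binds p53.")` yields `['IL', '-', '2', 'Rα', 'binds', 'p', '53', '.']`; the Greek letter is kept inside its token because it counts as a letter under Unicode matching.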


This project is under "continuous development"; you are better off taking your own snapshot.


  • Python 3.2+
  • Numpy, SciPy, and Scikit-Learn 0.14+ (for fnlclassi)
  • NLTK 3.0+ (for the sentence segmenting tools fnlseg*)
  • DAWG (for fnlgpcounter; see Installation below)
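
Before running the scripts, it can help to check which of these packages are importable in the active environment. The following is a small sketch, not part of libfnl: the helper `missing_modules` is hypothetical, the import names (`numpy`, `scipy`, `sklearn`, `nltk`, `dawg`) are the names these packages are assumed to install under, and `importlib.util.find_spec` needs Python 3.4 or later.

```python
import importlib.util

# Assumed top-level import names for the packages listed above.
REQUIRED = ("numpy", "scipy", "sklearn", "nltk", "dawg")


def missing_modules(modules=REQUIRED):
    """Return the names in modules that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for name in missing_modules():
        print("missing:", name)
```

The check only looks up the modules without importing them, so it does not verify version numbers such as Scikit-Learn 0.14+ or NLTK 3.0+.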

Optional projects that work together with this project:

  • GENIA Tagger (optional, latest version)
  • NER Suite (optional, latest version, in turn requires CRF Suite)
  • MegaM - a MaxEnt classifier for NLTK with a (fast) L-BFGS optimizer
  • gnamed for creating gene/protein name repositories
  • medic for mirroring and handling PubMed citations
  • txtfnnl natural language processing tools based on Apache OpenNLP and UIMA


To install into a Python 3 virtual environment:

pip install virtualenv # if virtualenv is not yet installed
git clone git:// libfnl
virtualenv libfnl
cd libfnl
. bin/activate
pip install argparse # for python3 < 3.2
pip install numpy # because installing scipy fails if numpy isn't installed already
pip install -e . # installs all other dependencies

# if you prefer to install all other dependencies manually
# and/or prefer to use instead of pip:
# python install
pip install sqlalchemy
pip install scikit-learn
pip install matplotlib
pip install nltk --pre # to get 3.0

# if you want to install the test environment:
pip install pytest

# special steps to install DAWG
git clone
python install
cd ..


All parts of this library are licensed under the GNU Affero GPL v3.

See the attached LICENSE.txt file.


© 2006-2014 Florian Leitner. All rights reserved.