Python 3 tools for data mining in molecular biology

libfnl

Introduction

libfnl is an API and CLI that facilitates data and text mining by providing a collection of easy-to-use tools. The library is designed to work with Python 3 (only). It is specifically tuned towards mining biomedical/scientific texts, but can be used in other contexts, too. It complements the gnamed gene name repository daemon and the medic PubMed mirroring tool collection. In addition, an (orphan) couchpy repository could provide a document storage facility.

The library contains the following packages:

fnl.nlp
tools to linguistically analyze text (tokenization, PoS tagging, phrase chunking, entity detection); modules to segment sentences (based on NLTK) and to map text (strings) to entries in dictionaries. This includes a Python wrapper for the GENIA Tagger, a Python wrapper for the NER Suite, and a handler for the GENIA corpus. Furthermore, a Maximum Entropy classifier is available via NLTK's wrapper for MegaM.
fnl.stat
a module to evaluate inter-rater Kappa scores and a module to develop text classifiers based on Scikit-Learn
fnl.text
wrappers to work with text data (strings, tokens, segments, annotations, etc.)
fnl.utils
additional utilities and tools (currently, just for handling JSON)
scripts
the CLI scripts to manage data/text, representing the main value provided by this collection
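
To illustrate the kind of score the inter-rater module in fnl.stat evaluates, here is a minimal, self-contained sketch of Cohen's kappa for two raters. It is not the libfnl implementation: the function name `cohen_kappa` and the choice of Cohen's variant are assumptions made for this example, and the library may compute other Kappa statistics.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of labels.

    Hypothetical helper for illustration; not part of the fnl API.
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same items")
    n = len(rater_a)
    if n == 0:
        raise ValueError("at least one labeled item is required")

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' label frequencies,
    # summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one and the same label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    a = ["gene", "gene", "other", "other"]
    b = ["gene", "other", "other", "other"]
    print(cohen_kappa(a, b))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean the raters agree less often than chance would predict.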

The script directory provides the following command-line interfaces:

  • fnlclassi: generate a classifier for [NER-tagged] text using Scikit-Learn.
  • fnlcorpus: store corpora in JSON format in a CouchDB.
  • fnldgrep: "grep" for tokens using a dictionary.
  • fnldictag: tag semantic tokens from a dictionary in linguistically annotated text.
  • fnlgpcounter: count gene/protein symbols in MEDLINE.
  • fnlkappa: calculate inter-rater agreement scores.
  • fnlsegment: segment text into sentences using NLTK (PunktSentenceTokenizer).
  • fnlsegtrain: train an nltk.tokenize.punkt.PunktSentenceTokenizer.
  • fnltok: a fast, pure-Python, Unicode-aware string tokenizer.
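
As a rough idea of what a pure-Python, Unicode-aware tokenizer such as the one listed above might do, the following sketch splits a string wherever the Unicode character class changes. It is an assumption-laden illustration, not the fnltok algorithm: the names `char_class` and `tokenize` and the four character classes are invented for this example.

```python
import unicodedata


def char_class(ch):
    """Map a character to a coarse class using its Unicode category."""
    category = unicodedata.category(ch)
    if category[0] in ("L", "M"):  # letters and combining marks
        return "alpha"
    if category[0] == "N":  # any kind of number
        return "digit"
    if ch.isspace():
        return "space"
    return "symbol"


def tokenize(text):
    """Split text into runs of letters, runs of digits, and single symbols.

    Whitespace separates tokens and is dropped from the output.
    """
    tokens = []
    start = 0
    previous = None
    for index, ch in enumerate(text):
        current = char_class(ch)
        # Split on every class change; symbols never merge with neighbors.
        if index > start and (current != previous or current == "symbol"):
            if previous != "space":
                tokens.append(text[start:index])
            start = index
        previous = current
    if start < len(text) and previous != "space":
        tokens.append(text[start:])
    return tokens


if __name__ == "__main__":
    print(tokenize("IL-2Rα binds β-catenin."))
```

Because the classes come from `unicodedata.category`, Greek letters such as α and β are treated like any other letter, which matters for gene and protein names in biomedical text.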

Warning

This project is under "continuous development", so you are better off taking your own snapshot.

Requirements

  • Python 3.2+
  • NumPy, SciPy, and Scikit-Learn 0.14+ (for fnlclassi)
  • NLTK 3.0+ (for the sentence segmenting tools fnlseg*)
  • DAWG (for fnlgpcounter; see Installation below)

Optional projects that work together with this project:

  • GENIA Tagger (optional, latest version)
  • NER Suite (optional, latest version, in turn requires CRF Suite)
  • MegaM - a MaxEnt classifier for NLTK with a (fast) L-BFGS optimizer
  • gnamed for creating gene/protein name repositories
  • medic for mirroring and handling PubMed citations
  • txtfnnl natural language processing tools based on Apache OpenNLP and UIMA

Installation

Into a Python 3 virtual environment:

pip install virtualenv # if virtualenv is not yet installed
git clone https://github.com/fnl/libfnl.git libfnl # GitHub no longer serves the git:// protocol
virtualenv libfnl
cd libfnl
. bin/activate
pip install argparse # for python3 < 3.2
pip install numpy # because installing scipy fails if numpy isn't installed already
pip install -e . # installs all other dependencies

# if you prefer to install all other dependencies manually
# and/or prefer to use setup.py instead of pip:
# python setup.py install
pip install sqlalchemy
pip install scikit-learn # the PyPI package that provides the sklearn module
pip install matplotlib
pip install nltk --pre # to get 3.0

# if you want to install the test environment:
pip install pytest

# special steps to install DAWG
git clone git@github.com:fnl/DAWG.git
cd DAWG
python setup.py install
cd ..
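
After running the steps above, a quick way to see whether the required packages are importable from the active virtual environment is a small check like the following. This is only a sketch: the helper `missing_modules` is hypothetical, and the import names in `REQUIRED_MODULES` (in particular `dawg` for DAWG and `sklearn` for Scikit-Learn) are assumptions about how the packages are exposed.

```python
import importlib.util

# Assumed top-level import names for the packages listed under Requirements.
REQUIRED_MODULES = ["numpy", "scipy", "sklearn", "nltk", "dawg"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("not importable:", ", ".join(missing))
    else:
        print("all required modules found")
```

The check only looks up each module without importing it, so it runs quickly and does not fail if one of the packages is broken at import time.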

License

All parts of this library are licensed under the GNU Affero GPL v3.

See the attached LICENSE.txt file.

Copyright

© 2006-2014 Florian Leitner. All rights reserved.