ELIE: An Adaptive Information Extraction System

Overview

ELIE is a tool for adaptive information extraction. It also provides a number of other text-processing tools, e.g. POS tagging, chunking, gazetteer lookup, and stemming. It is written in Python.

More info

The following publications give more information about the algorithm that ELIE implements:

  • Finn, A. (2006). A Multi-Level Boundary Classification Approach to Information Extraction. PhD thesis, University College Dublin.
  • Finn, A. & Kushmerick, N. (2004). Multi-level Boundary Classification for Information Extraction. In Proc. European Conference on Machine Learning (Pisa).
  • Finn, A. & Kushmerick, N. (2004). Information Extraction by Convergent Boundary Classification. In AAAI-04 Workshop on Adaptive Text Extraction and Mining (San Jose).

Installation

Requirements:

  • Python 2.1 or higher
  • Java 2 or higher
  • Weka (included in distribution)
  • Brilltag (if you intend to use datasets other than those provided)

Unzip the ELIE archive. Edit the basedir, BRILLTAGPATH and java variables in config.py to match your own system. Add $ELIEHOME/lib/weka.jar to your Java classpath.
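
For example, the edited config.py might look like the sketch below (all paths are illustrative placeholders, not values shipped with ELIE):

  # config.py -- illustrative values only; point these at your own installation
  basedir = "/home/user/elie"             # directory where the ELIE archive was unzipped
  BRILLTAGPATH = "/usr/local/brilltag"    # Brill tagger location (only needed for new datasets)
  java = "/usr/bin/java"                  # Java executable used to invoke Weka

In a Bourne-style shell, the classpath can then be set with:

  export CLASSPATH=$CLASSPATH:$ELIEHOME/lib/weka.jar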

Usage

Elie contains the following executable files:

  • evaluation.py: the main way to run ELIE
  • scorer.py: calculates performance measures from ELIE logs
  • extractor.py: performs basic learning and extraction
  • preprocessCorpus.py: preprocesses a corpus of text files
  • tagging.py: performs POS tagging, chunking, etc. on a text file

Execute these files without any arguments to get usage information.

Input format

Documents should be stored in plain-text files, one document per file. Fields should be marked inline using tags of the form <field> ... </field>.
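
For example, a document in a seminar-announcements corpus might mark the speaker field like this (the field name and text are illustrative):

  The talk will be given by <speaker>Jane Smith</speaker> in room 101.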

Preprocessing

This stage adds tokenisation, orthographic, POS, chunking and gazetteer information to the input files and stores the result in ELIE's internal format. It only needs to be done once per document collection. Running 'preprocessCorpus.py datasetDirectory' creates a new directory called datasetDirectory.preprocessed containing all the files in ELIE's internal format.
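
For example, if your annotated documents are in a directory called seminars (an illustrative name):

  preprocessCorpus.py seminars

This creates seminars.preprocessed, which is what you later pass to evaluation.py.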

Note: the input files shouldn't contain any unusual control characters, and for every opening <field> tag there must be a corresponding closing </field> tag.

Running

The recommended way to run ELIE is via evaluation.py. It takes the following parameters:

  -f field
  -t trainCorpusDirectory
  -D dataDirectory
  [-T testCorpusDirectory]
  [-s splitfilebase]
  [-mpnvh]

If -t and -T are both set, ELIE trains on trainCorpusDirectory and tests on testCorpusDirectory. Otherwise it performs repeated random splits of trainCorpusDirectory.

Options:

  • -m use cached models (not yet implemented)
  • -p train proportion (default 0.5)
  • -n number of trials (default 10)
  • -v version info
  • -h help

The corpus directories should contain only preprocessed files, i.e. those created by preprocessCorpus.py. The dataDirectory is where ELIE stores all its intermediate and output files. The splitfilebase argument can be used for predefined splits.
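
For example, to extract the speaker field by training on one preprocessed corpus and testing on another (directory names are illustrative):

  evaluation.py -f speaker -t seminars.train.preprocessed -T seminars.test.preprocessed -D results

Or, to run 10 repeated random 50/50 splits over a single corpus:

  evaluation.py -f speaker -t seminars.preprocessed -D results -p 0.5 -n 10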

Output

The detail of ELIE's printed output is controlled by the config.verbosity parameter.

ELIE produces several logfiles that can be used by the bwi-scorer or by ELIE's own scorer (scorer.py). These are located in the specified dataDirectory. For example:

  scorer.py elie.speaker.*.elie.L1.log

Licence

(c) Aidan Finn, 2004, aidanf@gmail.com

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/2.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.