Document Analysis Platform

What it is:

The Document-Analysis Platform, or DAP, is a programming platform for integrating several NLP tools, making them:

interact with each other, and
conform to the same interface.

DAP is a lightweight, simple, and easy-to-use alternative to UIMA. While UIMA is a revolutionary and powerful platform, it has significant drawbacks that have become high barriers for newcomers.

The motivation behind DAP is the need for a simple, easy-to-learn, and easy-to-use alternative that preserves only the core ideas of UIMA.

The advantages of DAP over UIMA are:

UIMA takes several weeks to learn and requires reading hundreds of pages of user manuals. Getting started with DAP takes no more than 5-10 minutes, and learning all of DAP from A to Z takes only 20 minutes.
UIMA requires long and hard-to-maintain XML files. DAP requires nothing but pure-Java programming.
UIMA employs unusual paradigms for exception throwing, logging, constructing objects, etc. DAP follows normal Java conventions.

The core idea

NLP tools tend to depend on each other. Part-of-speech taggers operate over tokenized text, syntactic parsers operate over part-of-speech annotations, coreference resolvers operate over syntactic analyses, and so on. In short, higher-level tools rely on the output of lower-level ones.

This brings up the challenge of integration. The syntactic parser and the part-of-speech (POS) tagger must agree on the data structures and the format of a POS-tagged text. In other words, the POS tagger's output must be what the syntactic parser expects. The same requirement applies to every set of tools with dependencies between them.

Moreover, if all POS taggers conform to the same format, then replacing one tagger with another is transparent to the syntactic parser. Similarly, if all parsers conform to the same format, then replacing one parser with another is transparent to the coreference resolver.
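
The substitution argument above can be sketched in plain Java. This is only an illustration under stated assumptions, not DAP's actual API: the names `TaggedToken`, `PosTagger`, `SuffixTagger`, `LexiconTagger`, and `NounCounter` are hypothetical and invented for this example, and the two taggers are toy stand-ins for real tools.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical shared output format: every tagger emits this type.
class TaggedToken {
    final String text;
    final String tag;

    TaggedToken(String text, String tag) {
        this.text = text;
        this.tag = tag;
    }
}

// Hypothetical common interface that all taggers conform to.
interface PosTagger {
    List<TaggedToken> tag(List<String> tokens);
}

// Toy tagger 1: guesses the tag from the word's suffix.
class SuffixTagger implements PosTagger {
    public List<TaggedToken> tag(List<String> tokens) {
        List<TaggedToken> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(new TaggedToken(t, t.endsWith("s") ? "NOUN" : "VERB"));
        }
        return out;
    }
}

// Toy tagger 2: looks each word up in a tiny lexicon.
class LexiconTagger implements PosTagger {
    private static final Map<String, String> LEXICON =
            Map.of("dogs", "NOUN", "bark", "VERB");

    public List<TaggedToken> tag(List<String> tokens) {
        List<TaggedToken> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(new TaggedToken(t, LEXICON.getOrDefault(t, "X")));
        }
        return out;
    }
}

// Downstream tool: depends only on the interface, never on a concrete tagger.
class NounCounter {
    private final PosTagger tagger;

    NounCounter(PosTagger tagger) {
        this.tagger = tagger;
    }

    long countNouns(List<String> tokens) {
        return tagger.tag(tokens).stream()
                .filter(t -> t.tag.equals("NOUN"))
                .count();
    }
}

class SwapDemo {
    public static void main(String[] args) {
        List<String> tokens = List.of("dogs", "bark");
        // The downstream code is identical; only the injected tagger changes.
        System.out.println(new NounCounter(new SuffixTagger()).countNouns(tokens));
        System.out.println(new NounCounter(new LexiconTagger()).countNouns(tokens));
    }
}
```

Because `NounCounter` sees only the `PosTagger` interface and the `TaggedToken` format, swapping one tagger for the other requires no change to the downstream code.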

The goal of DAP is to address this integration challenge. DAP provides data structures whose characteristics and utilities make them fit virtually every standard NLP tool. The two main data structures are document and annotation. The output of every NLP tool can be stored as annotations in documents, with features, attributes, and inter-annotation relations.
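
To make the document and annotation idea concrete, here is a minimal sketch of how such structures could look. It is not DAP's real class design: the classes `Document` and `Annotation` below, their fields, and the methods `annotate`, `coveredText`, and `ofType` are all assumptions made for illustration. The 20-minute tutorial describes the actual API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical annotation: a typed span over the document text, carrying
// free-form features and named links to other annotations.
class Annotation {
    final String type;
    final int begin; // inclusive character offset
    final int end;   // exclusive character offset
    final Map<String, String> features = new HashMap<>();
    final Map<String, Annotation> relations = new HashMap<>();

    Annotation(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical document: the raw text plus every annotation made over it.
class Document {
    final String text;
    private final List<Annotation> annotations = new ArrayList<>();

    Document(String text) {
        this.text = text;
    }

    // Adds an annotation after checking that the span lies inside the text.
    Annotation annotate(String type, int begin, int end) {
        if (begin < 0 || end > text.length() || begin > end) {
            throw new IllegalArgumentException("span out of range");
        }
        Annotation a = new Annotation(type, begin, end);
        annotations.add(a);
        return a;
    }

    String coveredText(Annotation a) {
        return text.substring(a.begin, a.end);
    }

    // Returns the annotations of one type, in insertion order.
    List<Annotation> ofType(String type) {
        List<Annotation> out = new ArrayList<>();
        for (Annotation a : annotations) {
            if (a.type.equals(type)) {
                out.add(a);
            }
        }
        return out;
    }
}

class DocumentDemo {
    public static void main(String[] args) {
        Document doc = new Document("Dogs bark");
        // A tokenizer would add the spans...
        Annotation dogs = doc.annotate("Token", 0, 4);
        Annotation bark = doc.annotate("Token", 5, 9);
        // ...a POS tagger would add features...
        dogs.features.put("pos", "NOUN");
        bark.features.put("pos", "VERB");
        // ...and a dependency parser would add relations between annotations.
        bark.relations.put("nsubj", dogs);
        for (Annotation a : doc.ofType("Token")) {
            System.out.println(doc.coveredText(a) + "/" + a.features.get("pos"));
        }
    }
}
```

In this sketch each tool reads the annotations left by the previous one and writes its own, so the document itself serves as the shared format between tools.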

In addition to data structures, an actual set of part-of-speech tags, syntactic phrase types, syntactic dependency relations, etc. is required. The DAP-DKPro_1_8 project provides a standard set of NLP types, borrowed from the DKPro project.

Batteries included

Users can start working with DAP right away with dozens of state-of-the-art NLP tools for several languages by using the DAP-DKPro_1_8 library, which wraps DKPro tools inside DAP.

A demo is provided in DAP-DKPro_1_8-demo.

Usage in Maven

The project has been uploaded to the Maven Central repository.

In a Maven project, add the following:

<dependency>
  <groupId>com.github.document-analysis</groupId>
  <artifactId>dap</artifactId>
  <version>0.1.1</version>
</dependency>

To get started, related projects should be imported as well. See:

Your first steps

Start by reading the 20-minute tutorial (20_minutes_tutorial.md).

Then jump to the demo.

License

DAP is licensed under the Apache 2.0 license, a permissive license that is also suitable for commercial use.

Note that DAP-DKPro_1_8-demo depends on external libraries, which have more restrictive licenses.
