corvid

Extract and aggregate tables of empirical results from computer science papers!

Installation

This project requires Python 3.6. We recommend you set up a conda environment:

conda create -n corvid python=3.6
source activate corvid

The dependencies are listed in the requirements.in file:

pip install -r requirements.in

After installing, you can run all the unit tests:

pytest tests/

Other dependencies

If you're interested in using one of the predefined Table extractors from the table_extraction module, you'll also need to install a tool to parse PDFs to XML. We currently support PDFLib's TET toolkit v5.1 and Nuance's OmniPage Capture SDK v20.2. For TET, you'll need the path to the bin/tet executable after installation. For OmniPage, you'll need to run make to build corvid.cpp within the module omnipage/ in this repo.

Project structure

|-- corvid/
|   |-- table_extraction/
|   |   |-- table_extractor.py
|   |   |-- evaluate.py
|   |-- table_aggregation/
|   |   |-- schema_matcher.py
|   |   |-- evaluate.py
|   |-- types/
|   |   |-- table.py
|-- tests/
|-- config.py
|-- requirements.in

A few important things:

table.py contains the Table class, which is the data structure used to represent Tables. It's fine to think of Table as a wrapper around a 2D numpy array, where each [i,j] element represents a cell in the Table.
table_extractor.py contains the TableExtractor class. The .extract() method extracts Table objects from a PDF input.
schema_matcher.py contains the SchemaMatcher class. The .aggregate_tables() method takes a list of Table objects and finds alignments between columns. For example, a column "p" in Table 1 could be aligned with another column "precision" in Table 2. The .map_tables() method uses these alignments to build a single aggregate Table.
evaluate.py contains a function evaluate() which computes a suite of performance metrics on a given a Gold Table and Predicted Table pair. The table_extraction and table_aggregation modules have their own respective evaluation methods.

Usage / API

The repo contains two modules:

`table_extraction`

`table_aggregation`

Example

First, prepare paper_ids.txt that looks like:

0ad9e1f04af6a9727ea7a21d0e9e3cf062ca6d75
eda636e3abae829cf7ad8e0519fbaec3f29d1e82
...

We can download PDFs from S3 for the papers in this file:

python scripts/fetch_papers_pdfs_from_s3.py 
    --mode pdf 
    --paper_ids /path/to/paper_ids.txt 
    --input_url s3://url-with-pdfs
    --output_dir data/pdf/

After we download the PDFs, we can parse them into the TETML format using PDFLib's TET:

python scripts/parse_pdfs_to_tetml.py
    --parser /path/to/pdflib-tet-binary
    --input_dir data/pdf/
    --output_dir data/tetml/

If the options in scripts fetch_papers_*.py and parse_pdfs_*.py are left out, the scripts will attempt to use default values from a configuration file. See our example in example_config.py.

Now that we've processed all these papers to TETML format, let's try extracting tables from one of them:

from bs4 import BeautifulSoup
from corvid.table_extraction.table_extractor import TetmlTableExtractor

TETML_PATH = 'data/tetml/0ad9e1f04af6a9727ea7a21d0e9e3cf062ca6d75.tetml'
with open(TETML_PATH, 'r') as f_tetml:
    tetml = BeautifulSoup(f_tetml)
    tables = TetmlTableExtractor.extract_tables(tetml)

Let's try manipulating the first table in this list:

table = tables[0]

# visualize
print(table)

# shape
table.nrow; table.ncol; table.dim

# indexing via grid
first_row = table[0,:]
first_col = table[:,0]

# indexing via cells
first_cell = table[0]

TODO

read aliases from madeleine's annotation and add to datasets.json

font information in cells
finish evaluation module for table extraction; write example script for API
table normalizing function
reorganize data/ file structure
handling box after table transformations (maybe store externally from class)
maybe store all metadata non-specific to table externally from class
tests for file/tetml utils
[[cell for cell in row] for row in x] make possible on Table x using __iter__; make .grid private after this
Script for inspecting Table pickles
Naming. Alignment seems to denote bidirectionality vs Mapping has direction.

Future

latex source to table (for training/evaluation)
parsing heuristics

Miscellaneous

Installing PDFLib's TET toolkit on OSX

After downloading the .dmg, you'll need to mount the file:

sudo hdiutil attach TET-5.1-OSX-Perl-PHP-Python-Ruby.dmg

You can then find the TET binary at

ls /Volumes/TET-5.1-OSX-Perl-PHP-Python-Ruby/bin/tet

Name		Name	Last commit message	Last commit date
Latest commit History 200 Commits
corvid		corvid
data		data
omnipage		omnipage
scripts		scripts
tests		tests
.gitignore		.gitignore
.pylintrc		.pylintrc
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
env.sh		env.sh
example_config.py		example_config.py
requirements.in		requirements.in
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

corvid

Installation

Other dependencies

Project structure

Usage / API

`table_extraction`

`table_aggregation`

Example

TODO

Future

Miscellaneous

Installing PDFLib's TET toolkit on OSX

About

Releases

Packages

Languages

License

afcarl/corvid

Folders and files

Latest commit

History

Repository files navigation

corvid

Installation

Other dependencies

Project structure

Usage / API

table_extraction

table_aggregation

Example

TODO

Future

Miscellaneous

Installing PDFLib's TET toolkit on OSX

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

`table_extraction`

`table_aggregation`

Packages