Extract and aggregate tables of empirical results from computer science papers!
This project requires Python 3.6. We recommend you set up a conda environment:
conda create -n corvid python=3.6
source activate corvid
The dependencies are listed in the requirements.in file:
pip install -r requirements.in
After installing, you can run all the unit tests:
pytest tests/
If you're interested in using one of the predefined Table extractors from the table_extraction module, you'll also need to install a tool to parse PDFs to XML. We currently support PDFLib's TET toolkit v5.1 and Nuance's OmniPage Capture SDK v20.2. For TET, you'll need the path to the bin/tet executable after installation. For OmniPage, you'll need to run make to build corvid.cpp within the module omnipage/ in this repo.
|-- corvid/
| |-- table_extraction/
| | |-- table_extractor.py
| | |-- evaluate.py
| |-- table_aggregation/
| | |-- schema_matcher.py
| | |-- evaluate.py
| |-- types/
| | |-- table.py
|-- tests/
|-- config.py
|-- requirements.in
A few important things:
- table.py contains the Table class, which is the data structure used to represent Tables. It's fine to think of Table as a wrapper around a 2D numpy array, where each [i,j] element represents a cell in the Table.
- table_extractor.py contains the TableExtractor class. The .extract() method extracts Table objects from a PDF input.
- schema_matcher.py contains the SchemaMatcher class. The .aggregate_tables() method takes a list of Table objects and finds alignments between columns. For example, a column "p" in Table 1 could be aligned with another column "precision" in Table 2. The .map_tables() method uses these alignments to build a single aggregate Table.
- evaluate.py contains a function evaluate() which computes a suite of performance metrics on a given Gold Table and Predicted Table pair. The table_extraction and table_aggregation modules have their own respective evaluation methods.
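To make the column alignment idea concrete, here is a minimal sketch. It is not the SchemaMatcher implementation: it assumes column headers are plain strings, and the helpers header_similarity() and align_columns() as well as the 0.8 threshold are hypothetical names and values chosen for illustration. A header counts as a match when one is a case-insensitive prefix of the other (so "p" lines up with "precision"); otherwise the score falls back to difflib string similarity.

```python
from difflib import SequenceMatcher


def header_similarity(a, b):
    """Score two column headers in [0, 1]; a hypothetical scoring rule."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    # Treat an abbreviation such as "p" as a match for "precision".
    if a.startswith(b) or b.startswith(a):
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def align_columns(source_headers, target_headers, threshold=0.8):
    """Map each source column index to its best-matching target column index."""
    alignments = {}
    for i, source in enumerate(source_headers):
        scores = [header_similarity(source, target) for target in target_headers]
        if not scores:
            continue
        # Pick the highest-scoring target column (first one wins on ties).
        best_j = max(range(len(scores)), key=scores.__getitem__)
        if scores[best_j] >= threshold:
            alignments[i] = best_j
    return alignments


table1_headers = ['model', 'p', 'r']
table2_headers = ['precision', 'recall', 'model']
print(align_columns(table1_headers, table2_headers))
```

The prefix rule is deliberately crude: "p" would match "parameters" just as readily as "precision", so a real matcher would likely need to look at the cell values as well as the headers.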
The repo contains two modules: table_extraction and table_aggregation.
First, prepare a paper_ids.txt file that looks like:
0ad9e1f04af6a9727ea7a21d0e9e3cf062ca6d75
eda636e3abae829cf7ad8e0519fbaec3f29d1e82
...
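Before fetching anything, it can help to sanity-check this file. The sketch below assumes one ID per line and that every ID is a 40-character lowercase hexadecimal string, as in the two examples shown; parse_paper_ids() is a hypothetical helper and is not part of this repo.

```python
import re

# Assumption: IDs are 40-character lowercase hex strings, as in the example file.
PAPER_ID_PATTERN = re.compile(r'^[0-9a-f]{40}$')


def parse_paper_ids(lines):
    """Return the valid, de-duplicated paper IDs from an iterable of lines."""
    paper_ids = []
    seen = set()
    for line in lines:
        paper_id = line.strip()
        if not paper_id:
            continue  # skip blank lines
        if not PAPER_ID_PATTERN.match(paper_id):
            raise ValueError('Malformed paper ID: {!r}'.format(paper_id))
        if paper_id not in seen:
            seen.add(paper_id)
            paper_ids.append(paper_id)
    return paper_ids


example_lines = [
    '0ad9e1f04af6a9727ea7a21d0e9e3cf062ca6d75\n',
    'eda636e3abae829cf7ad8e0519fbaec3f29d1e82\n',
    '\n',
]
print(parse_paper_ids(example_lines))
```

To check a real file, pass an open file handle instead of the list: `with open('/path/to/paper_ids.txt') as f: parse_paper_ids(f)`.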
We can download PDFs from S3 for the papers in this file:
python scripts/fetch_papers_pdfs_from_s3.py \
    --mode pdf \
    --paper_ids /path/to/paper_ids.txt \
    --input_url s3://url-with-pdfs \
    --output_dir data/pdf/
After we download the PDFs, we can parse them into the TETML format using PDFLib's TET:
python scripts/parse_pdfs_to_tetml.py \
    --parser /path/to/pdflib-tet-binary \
    --input_dir data/pdf/ \
    --output_dir data/tetml/
If the options in scripts fetch_papers_*.py and parse_pdfs_*.py are left out, the scripts will attempt to use default values from a configuration file. See our example in example_config.py.
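One way to picture this fallback is an argparse parser whose defaults come from a config object. The sketch below is an assumption about the mechanism, not a copy of the scripts: the config here is a stand-in namespace, and the attribute names TET_BIN_PATH, PDF_DIR and TETML_DIR are hypothetical rather than taken from example_config.py.

```python
import argparse
from types import SimpleNamespace

# Stand-in for a configuration module; these attribute names are hypothetical
# and are not taken from example_config.py.
config = SimpleNamespace(
    TET_BIN_PATH='/path/to/pdflib-tet-binary',
    PDF_DIR='data/pdf/',
    TETML_DIR='data/tetml/',
)


def build_parser():
    """Build a parser whose options default to values from the config."""
    parser = argparse.ArgumentParser(description='Parse PDFs to TETML.')
    parser.add_argument('--parser', default=config.TET_BIN_PATH)
    parser.add_argument('--input_dir', default=config.PDF_DIR)
    parser.add_argument('--output_dir', default=config.TETML_DIR)
    return parser


# No options given: every value comes from the config.
print(build_parser().parse_args([]))
# One option given: it overrides the config, the rest still fall back.
print(build_parser().parse_args(['--input_dir', 'other/pdf/']))
```

With this pattern an explicit command-line option always wins, and the config only fills in what was left out.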
Now that we've processed all these papers to TETML format, let's try extracting tables from one of them:
from bs4 import BeautifulSoup
from corvid.table_extraction.table_extractor import TetmlTableExtractor
TETML_PATH = 'data/tetml/0ad9e1f04af6a9727ea7a21d0e9e3cf062ca6d75.tetml'
with open(TETML_PATH, 'r') as f_tetml:
    tetml = BeautifulSoup(f_tetml)
tables = TetmlTableExtractor.extract_tables(tetml)
Let's try manipulating the first table in this list:
table = tables[0]
# visualize
print(table)
# shape
table.nrow; table.ncol; table.dim
# indexing via grid
first_row = table[0,:]
first_col = table[:,0]
# indexing via cells
first_cell = table[0]
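To see why these indexing forms can coexist, here is a minimal sketch of a table built as a wrapper around a 2D numpy array. MiniTable is a hypothetical stand-in, not the Table class from this repo, and the cell values are made up for illustration. It assumes that an integer index selects a cell in row-major order, which is one plausible reading of "indexing via cells".

```python
import numpy as np


class MiniTable:
    """Hypothetical stand-in for a table backed by a 2D numpy array."""

    def __init__(self, rows):
        self.grid = np.array(rows, dtype=object)
        if self.grid.ndim != 2:
            raise ValueError('rows must form a rectangular 2D grid')

    @property
    def nrow(self):
        return self.grid.shape[0]

    @property
    def ncol(self):
        return self.grid.shape[1]

    @property
    def dim(self):
        return self.grid.shape

    def __getitem__(self, index):
        # An integer selects a cell in row-major order (an assumption about
        # what "indexing via cells" means); anything else goes to the grid.
        if isinstance(index, int):
            return self.grid.flat[index]
        return self.grid[index]

    def __iter__(self):
        # Iterating yields rows, so nested comprehensions over cells work.
        return iter(self.grid)

    def __str__(self):
        return '\n'.join('\t'.join(str(cell) for cell in row) for row in self.grid)


# Made-up cell values, for illustration only.
table = MiniTable([
    ['model', 'precision', 'recall'],
    ['baseline', '0.71', '0.64'],
    ['ours', '0.78', '0.69'],
])
print(table)
print(table.nrow, table.ncol, table.dim)
print(list(table[0, :]))   # first row
print(list(table[:, 0]))   # first column
print(table[0])            # first cell
```

Delegating tuple and slice indices straight to the underlying array keeps the grid-style access cheap to support, while the integer branch gives a flat view over the cells.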
- read aliases from madeleine's annotation and add to datasets.json
- font information in cells
- finish evaluation module for table extraction; write example script for API
- table normalizing function
- reorganize data/ file structure
- handling box after table transformations (maybe store externally from class)
- maybe store all metadata non-specific to table externally from class
- tests for file/tetml utils
- make [[cell for cell in row] for row in x] possible on Table x using __iter__; make .grid private after this
- Script for inspecting Table pickles
- Naming. Alignment seems to denote bidirectionality vs Mapping has direction.
- latex source to table (for training/evaluation)
- parsing heuristics
After downloading the .dmg, you'll need to mount the file:
sudo hdiutil attach TET-5.1-OSX-Perl-PHP-Python-Ruby.dmg
You can then find the TET binary at:
ls /Volumes/TET-5.1-OSX-Perl-PHP-Python-Ruby/bin/tet