Synthesis of motifs, or graph rules, for heterogeneous entity extraction.
With opam installed, the project can be compiled by navigating to this directory and running:
opam depext
opam pin .
makeThis produces a general synthesis executable ./synthesize and an evaluation executable ./evaluate.
To improve portability, some infrastructure for running the code in a Docker container is provided. To build the docker image, simply navigate to this directory and run docker build . --tag motifs.
The executables ./synthesize and ./evaluate takes several parameters, the most important of which is the problem file. A simple problem file is given below:
{
"metadata" : {
"views" : ["simple_table"]
},
"files" : [
"path/to/db/database.db",
"path/to/other/db/database.db"
],
"examples" : [
{
"file" : "path/to/ex/db/database.db",
"example" : 12345
}
]
}The metadata field carries some relevant info for execution - in this case, which view is being used for synthesis. The files field contains absolute paths to database files we want our synthesized ensemble to run on. Lastly, the examples field contains file / example pairs.
Problem files are passed in using the --problem flag. For other flags, run ./executable --help.
Three directories contain scripts for running experiments:
analysis- a Python module for constructing and evaluating ensembles over synthesized motifsscripts- a collection of scripts for running ensembles and making graphs of performancedata- the directory containing all the data necessary for the scripts inscripts
The Python modules have very few dependencies, which can be found in analysis-requirements.txt or installed with python3 -m pip install -r analysis-requirements.txt.
The data directory contains a variety of sub-directories, each of which contain relevant information for an experiment. Given an experiment name <experiment>, the flow of information is roughly as follows:
- Databases are stored in
data/db/<dataset>. Each document is a separate.dbfile. - A ground truth file contains the ground-truth positive labels for a set of database files as a
.jsonfile. A ground truth file can either be provided directly viadata/default-gt/<experiment>.jsonor constructed via a sql filedata/sql/<experiment>.json, which contains the attributedatasetand aSQLcommand in the attributesql. - Given a ground truth file, we construct the problem file
data/problem/<experiment>.json, which is built from the ground truth file above and a metadata filedata/metadata/<experiment>.json. - The problem file fully defines the synthesis constraints, so we can produce the relevant motifs in
data/motifs/<experiment>.json.
I'll fill out the rest in a bit.
Note, an example data directory (and the one used for our experiments) can be found in this repository.
Here's the relevant Docker command (assuming you've made the image as hera:latest):
docker run -it /path/to/motif-data/repo:/hera/motifs/data hera:latest /bin/bash
Then, simply run make <experiment> to build what you want. make experiments should build them all, but this is probably not useful right now.
To interact with the built Docker image, we run a container from the image with input and output directories bind-mounted at appropriate locations. This requires some configuration of the inputs.
Databases should be stored flat in a directory, along with a problem file named problem.json that references the stored databases using the prefix /data. That is, if we have data for a benchmark and store it in a directory called benchmark, our file structure looks like the following:
/benchmark
|-- problem.json
|-- database1.db
|-- database2.db
...Any references to database1.db (or any other database) in problem.json should be given via the path /data/database1.db, regardless of where benchmark is located on the host.
Finally, an output directory should be prepared. Once the data is stored in an appropriate file, and the output directory is made, we run the image with the appropriate directories bind-mounted to the Docker image directories /data and /output. Usually, absolute paths are needed for this. For example, if we have the above benchmark directory at /path/to/bm/benchmark, and we've made an output directory at /path/to/output, we can run the following Docker command:
docker run -v /path/to/bm/benchmark:/data /path/to/output:/output gr:latestThis will spin up a container, run the synthesis on the /path/to/bm/benchmark/problem.json problem file, and save the output .json files to /path/to/output/*.json on the host file system.