Update READMEs #113

Merged (2 commits) on Jan 28, 2021
21 changes: 20 additions & 1 deletion README.md
@@ -1,7 +1,10 @@
<div align="center">
<img src="./ui/src/components/sidebar/pawlsLogo.png" width="400"/>

[Demo Server](https://pawls.apps.allenai.org) | [Video Tutorial](https://www.youtube.com/watch?v=TB4kzh2H9og) | [Paper](https://arxiv.org/pdf/2101.10281v1.pdf)
</div>

------------------------------------------------
PDF Annotation With Labels and Structure (PAWLS) is software that makes it easy
to collect annotations associated with a PDF document. It was written
specifically for annotating academic papers within the [Semantic Scholar](https://www.semanticscholar.org) corpus, but can be used with any collection of PDF documents.
@@ -165,4 +168,20 @@ location /docs/ {
}
```

### Cite PAWLS

If you find PAWLS helpful for your research, please consider citing it:
```bibtex
@misc{neumann2021pawls,
title={PAWLS: PDF Annotation With Labels and Structure},
author={Mark Neumann and Zejiang Shen and Sam Skjonsberg},
year={2021},
eprint={2101.10281},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

---

PAWLS is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
67 changes: 57 additions & 10 deletions cli/readme.md
@@ -29,7 +29,7 @@ Please follow the [instructions here](https://tesseract-ocr.github.io/tessdoc/In
```
By default, pawls will use the name of the containing directory to refer to the PDF in the UI.

2. [preprocess] Process the token information for each PDF document with the given PDF preprocessor.
```bash
pawls preprocess <preprocessor-name> skiff_files/apps/pawls/papers
```
@@ -38,31 +38,78 @@ By default, pawls will use the name of the containing directory to refer to the
2. grobid *Note: to use the grobid preprocessor, you need to run `docker-compose up` in a separate shell, because grobid needs to be running as a service.*
3. ocr *Note: you might need to install [tesseract-ocr](https://tesseract-ocr.github.io/tessdoc/Installation.html) for using this preprocessor.*
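
    After preprocessing, each PDF gains a JSON file of token locations. As a rough sketch of what that data looks like (the field names below are an assumption for illustration, not the exact schema pawls writes; inspect the JSON next to your PDFs for the real layout):

    ```python
    # Illustrative per-page token structure; field names are assumed,
    # not taken from the pawls source.
    page = {
        "page": {"width": 612, "height": 792, "index": 0},
        "tokens": [
            {"text": "PAWLS", "x": 72.0, "y": 80.0, "width": 60.0, "height": 12.0},
            {"text": "demo", "x": 140.0, "y": 80.0, "width": 40.0, "height": 12.0},
        ],
    }

    # A quick sanity check after preprocessing: every page should have tokens.
    print(len(page["tokens"]))
    ```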

3. [assign] Assign annotation tasks (<PDF_SHA>s) to specific users <user>:
```bash
pawls assign ./skiff_files/apps/pawls/papers <user> <PDF_SHA>
```
Optionally at this stage, you can provide a `--name-file` argument to `pawls assign`,
which allows you to specify a name for a given pdf (for example the title of a paper).
This file should be a json file containing `sha:name` mappings.
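
    As a minimal sketch, such a name file can be generated with a few lines of Python (the SHA key below is a made-up placeholder, not a real document SHA):

    ```python
    import json

    # Map each PDF SHA to the display name it should get in the UI.
    # The key below is a placeholder for illustration only.
    names = {
        "0000000000000000000000000000000000000000": "PAWLS: PDF Annotation With Labels and Structure",
    }

    with open("names.json", "w") as f:
        json.dump(names, f, indent=2)
    ```

    The resulting `names.json` can then be passed to `pawls assign` via `--name-file names.json`.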

4. (optional) [preannotate] Create pre-annotations for the PDFs based on some model predictions `anno.json`:
```bash
pawls preannotate <labeling_folder> <labeling_config> anno.json -u <user>
```
You can find an example of generating pre-annotations in `scripts/generate_pdf_layouts.py`.
5. [status] Check annotation status for the <labeling_folder>:
```bash
pawls status <labeling_folder>
```

1. Save the labeling record table:
```bash
pawls status <labeling_folder> --output record.csv
```

6. [metric] Check Inter Annotator Agreement (IAA):
```bash
pawls metric <labeling_folder> <config_file> \
--textual-categories cat1,cat2 --non-textual-categories cat3,cat4
```
For blocks, we measure consistency using [mAP scores](https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173), a common metric
in object detection tasks that evaluates block category consistency at different
overlap levels.


For textual regions, we measure consistency based on token categories:
each PDF token is assigned the category of its containing block, and the labels
of the same token are compared across annotators. The agreement level is
measured via token accuracy.

It prints a matrix whose (i, j)-th element is computed by treating the
annotations from annotator i as the ground truth and those from annotator j
as the predictions.

1. Save the IAA report to `<save-folder>`:
```bash
pawls metric <labeling_folder> <config_file> \
--textual-categories cat1,cat2 --non-textual-categories cat3,cat4 \
--save <save-folder>
```
It creates `block-eval.csv` and `textual-eval.csv` in that folder, containing the
block and textual-region IAA respectively.

2. Specify annotators for calculating IAA:
```bash
pawls metric <labeling_folder> <config_file> \
--textual-categories cat1,cat2 --non-textual-categories cat3,cat4 \
-u <annotator1> -u <annotator2>
```
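
    The token-accuracy computation behind the matrix can be sketched in a few lines of Python (the annotator names and labels below are made up for illustration; this is not the pawls implementation):

    ```python
    def token_accuracy(truth, pred):
        """Fraction of tokens whose labels agree between two annotators."""
        return sum(t == p for t, p in zip(truth, pred)) / len(truth)

    def iaa_matrix(annotations):
        """Pairwise matrix: entry (i, j) treats i as ground truth, j as prediction."""
        names = sorted(annotations)
        return {(i, j): token_accuracy(annotations[i], annotations[j])
                for i in names for j in names}

    # Per-token labels from two hypothetical annotators.
    labels = {
        "annotator1": ["title", "author", "body", "body"],
        "annotator2": ["title", "body", "body", "body"],
    }
    matrix = iaa_matrix(labels)
    # Diagonal entries are 1.0; off-diagonal entries measure agreement.
    ```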

7. [export] Export the annotated dataset to the specified format. Currently we support export to `COCO` format and the `token` table format.

1. Export all annotations of a project of all annotators:
```bash
pawls export <labeling_folder> <labeling_config> <output_path> <format>
```

2. Export only finished annotations of a given annotator, e.g. markn:
```bash
pawls export <labeling_folder> <labeling_config> <output_path> <format> -u markn
```

3. Export all annotations (including unfinished ones) from a given annotator:
```bash
pawls export <labeling_folder> <labeling_config> <output_path> <format> -u markn --include-unfinished
```
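
A COCO-format export is a single JSON file with `images`, `categories`, and `annotations` lists. A minimal sketch of that layout and how to read it back (the structure below is constructed in place for illustration; a real `pawls export` file will contain more fields):

```python
import json

# Minimal COCO-style structure: these three top-level lists are the
# standard layout; real exports carry additional fields.
coco = {
    "images": [{"id": 1, "file_name": "paper-page-0.png", "width": 612, "height": 792}],
    "categories": [{"id": 1, "name": "title"}],
    "annotations": [
        # bbox follows the COCO convention: [x, y, width, height]
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [72.0, 80.0, 400.0, 40.0]},
    ],
}

with open("export.json", "w") as f:
    json.dump(coco, f)

with open("export.json") as f:
    data = json.load(f)

# Resolve each annotation's category id to its human-readable name.
category_names = {c["id"]: c["name"] for c in data["categories"]}
print(category_names[data["annotations"][0]["category_id"]])  # title
```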