Update READMEs (#113)
* Update main readme

* Update cli readme
lolipopshock committed Jan 28, 2021
1 parent 4400281 commit 8584b99
Showing 2 changed files with 77 additions and 11 deletions.
21 changes: 20 additions & 1 deletion README.md
@@ -1,7 +1,10 @@
<div align="center">
<img src="./ui/src/components/sidebar/pawlsLogo.png" width="400"/>

[Demo Server](https://pawls.apps.allenai.org) | [Video Tutorial](https://www.youtube.com/watch?v=TB4kzh2H9og) | [Paper](https://arxiv.org/pdf/2101.10281v1.pdf)
</div>

------------------------------------------------
PDF Annotations with Labels and Structure is software that makes it easy
to collect a series of annotations associated with a PDF document. It was written
specifically for annotating academic papers within the [Semantic Scholar](https://www.semanticscholar.org) corpus, but can be used with any collection of PDF documents.
@@ -165,4 +168,20 @@ location /docs/ {
}
```

PAWLS is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
### Cite PAWLS

If you find PAWLS helpful for your research, please consider citing it.
```
@misc{neumann2021pawls,
title={PAWLS: PDF Annotation With Labels and Structure},
author={Mark Neumann and Zejiang Shen and Sam Skjonsberg},
year={2021},
eprint={2101.10281},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

---

PAWLS is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
67 changes: 57 additions & 10 deletions cli/readme.md
@@ -29,7 +29,7 @@ Please follow the [instructions here](https://tesseract-ocr.github.io/tessdoc/In
```
By default, pawls will use the name of the containing directory to refer to the pdf in the ui.

2. Process the token information for each PDF document with the given PDF preprocessor.
2. [preprocess] Process the token information for each PDF document with the given PDF preprocessor.
```bash
pawls preprocess <preprocessor-name> skiff_files/apps/pawls/papers
```
@@ -38,31 +38,78 @@ By default, pawls will use the name of the containing directory to refer to the
2. grobid *Note: to use the grobid preprocessor, you need to run `docker-compose up` in a separate shell, because grobid needs to be running as a service.*
3. ocr *Note: you might need to install [tesseract-ocr](https://tesseract-ocr.github.io/tessdoc/Installation.html) for using this preprocessor.*

3. Assign annotation tasks (<PDF_SHA>s) to specific users <user>:
3. [assign] Assign annotation tasks (<PDF_SHA>s) to specific users <user>:
```bash
pawls assign ./skiff_files/apps/pawls/papers <user> <PDF_SHA>
```
Optionally at this stage, you can provide a `--name-file` argument to `pawls assign`,
which allows you to specify a name for a given pdf (for example the title of a paper).
This file should be a json file containing `sha:name` mappings.
4. (optional) Create pre-annotations for the PDFs based on some model predictions `anno.json`:

4. (optional) [preannotate] Create pre-annotations for the PDFs based on some model predictions `anno.json`:
```bash
pawls preannotate <labeling_folder> <labeling_config> anno.json -u <user>
```
You can find an example of generating the pre-annotations in `scripts/generate_pdf_layouts.py`.
5. Export the annotated dataset to the COCO format:

1. Export all annotations of a project of the default annotator (development_user):
5. [status] Check annotation status for the <labeling_folder>:
```bash
pawls status <labeling_folder>
```

1. Save the labeling record table:
```bash
pawls status <labeling_folder> --output record.csv
```

6. [metric] Check Inter Annotator Agreement (IAA):
```bash
pawls metric <labeling_folder> <config_file> \
--textual-categories cat1,cat2 --non-textual-categories cat3,cat4
```
For blocks, we measure the consistency using [mAP scores](https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173), a common metric
in object detection tasks that evaluates block category consistency at different overlap
levels.


For textual regions, we measure the consistency based on the token categories.
Each PDF token is assigned the category of the block that contains it, and the
label of the same token is compared across annotators. The agreement level is measured via
token accuracy.

The command prints a matrix in which the (i,j)-th element is calculated by
treating the annotations from annotator i as the "ground truth" and those from
annotator j as the "predictions".

1. Save the IAA report to `<save-folder>`:
```bash
pawls metric <labeling_folder> <config_file> \
--textual-categories cat1,cat2 --non-textual-categories cat3,cat4 \
--save <save-folder>
```
It will create `block-eval.csv` and `textual-eval.csv` in the folder for block and textual
region IAA.

2. Specify annotators for calculating IAA:
```bash
pawls metric <labeling_folder> <config_file> \
--textual-categories cat1,cat2 --non-textual-categories cat3,cat4 \
--u <annotator1> --u <annotator2>
```
7. [export] Export the annotated dataset to the specified format. Currently we support export to `COCO` format and the `token` table format.

1. Export all annotations of a project of all annotators:
```bash
pawls export <labeling_folder> <labeling_config> <output_path>
pawls export <labeling_folder> <labeling_config> <output_path> <format>
```

2. Export only finished annotations of from a given annotator, e.g. markn:
2. Export only finished annotations of a given annotator, e.g. markn:
```bash
pawls export <labeling_folder> <labeling_config> <output_path> -u markn
pawls export <labeling_folder> <labeling_config> <output_path> <format> -u markn
```

3. Export all annotations of from a given annotator:
3. Export all annotations (including unfinished annotations) from a given annotator:
```bash
pawls export <labeling_folder> <labeling_config> <output_path> -u markn --all
pawls export <labeling_folder> <labeling_config> <output_path> <format> -u markn --include-unfinished
```
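
The token-level agreement described in the `metric` step can be illustrated with a small sketch. This is not the PAWLS implementation: the function names, the toy annotator data, and the choice to compare only tokens labeled by both annotators are assumptions made for illustration.

```python
from itertools import product


def token_accuracy(reference, prediction):
    """Fraction of shared tokens whose labels agree.

    Both arguments map a token id to a category name. Only tokens present
    in both mappings are compared (an assumption of this sketch).
    """
    shared = reference.keys() & prediction.keys()
    if not shared:
        return float("nan")
    matches = sum(reference[t] == prediction[t] for t in shared)
    return matches / len(shared)


def agreement_matrix(labels_by_annotator):
    """Build the (i, j) table: annotator i is the reference, j the prediction."""
    names = sorted(labels_by_annotator)
    return {
        (i, j): token_accuracy(labels_by_annotator[i], labels_by_annotator[j])
        for i, j in product(names, repeat=2)
    }


# Hypothetical toy data: token id -> category of the block containing the token.
labels = {
    "annotator1": {0: "cat1", 1: "cat1", 2: "cat2", 3: "cat2"},
    "annotator2": {0: "cat1", 1: "cat2", 2: "cat2", 3: "cat2"},
}

matrix = agreement_matrix(labels)
for (i, j), score in sorted(matrix.items()):
    print(f"{i} as reference, {j} as prediction: {score:.2f}")
```

With plain accuracy over shared tokens the table is symmetric. The block-level mAP scores generally depend on which annotator is treated as the reference, so the (i,j) convention matters more there.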

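The `assign` step mentions a name file containing `sha:name` mappings. A minimal sketch of producing such a file is shown here, assuming the file is a single JSON object keyed by PDF SHA. The helper name, the file name `names.json`, and the placeholder SHAs and titles are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_name_file(names, path):
    """Write a {sha: name} mapping as JSON and return the path written."""
    path = Path(path)
    path.write_text(json.dumps(names, indent=2), encoding="utf-8")
    return path


# Placeholder SHAs and titles, for illustration only.
names = {
    "sha-of-first-pdf": "First Paper Title",
    "sha-of-second-pdf": "Second Paper Title",
}

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    name_file = write_name_file(names, Path(tmp) / "names.json")
    # Read the file back to confirm it round-trips as a plain JSON object.
    loaded = json.loads(name_file.read_text(encoding="utf-8"))

print(loaded)
```

A file written this way could then be passed to `pawls assign` through the `--name-file` argument.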