Update READMEs (#113)

* Update main readme
* Update cli readme
allenai · Jan 28, 2021 · 8584b99
1 parent 4400281

Showing 2 changed files with 77 additions and 11 deletions.
diff --git a/README.md b/README.md
@@ -1,7 +1,10 @@
 <div align="center">
     <img src="./ui/src/components/sidebar/pawlsLogo.png" width="400"/>
+
+[Demo Server](https://pawls.apps.allenai.org) | [Video Tutorial](https://www.youtube.com/watch?v=TB4kzh2H9og) | [Paper](https://arxiv.org/pdf/2101.10281v1.pdf)
 </div>
 
+------------------------------------------------
   PDF Annotations with Labels and Structure is software that makes it easy
   to collect a series of annotations associated with a PDF document. It was written
   specifically for annotating academic papers within the [Semantic Scholar](https://www.semanticscholar.org) corpus, but can be used with any collection of PDF documents.
@@ -165,4 +168,20 @@ location /docs/ {
 }
 ```
 
-PAWLS is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
+### Cite PAWLS
+
+If you find PAWLS helpful for your research, please consider citing it.
+```
+@misc{neumann2021pawls,
+      title={PAWLS: PDF Annotation With Labels and Structure}, 
+      author={Mark Neumann and Zejiang Shen and Sam Skjonsberg},
+      year={2021},
+      eprint={2101.10281},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+```
+
+---
+
+PAWLS is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
diff --git a/cli/readme.md b/cli/readme.md
@@ -29,7 +29,7 @@ Please follow the [instructions here](https://tesseract-ocr.github.io/tessdoc/In
 ```
 By default, pawls will use the name of the containing directory to refer to the pdf in the ui.
 
-2. Process the token information for each PDF document with the given PDF preprocessor.
+2. [preprocess] Process the token information for each PDF document with the given PDF preprocessor.
     ```bash
     pawls preprocess <preprocessor-name> skiff_files/apps/pawls/papers
     ```
@@ -38,31 +38,78 @@ By default, pawls will use the name of the containing directory to refer to the
     2. grobid *Note: to use the grobid preprocessor, you need to run `docker-compose up` in a separate shell, because grobid needs to be running as a service.*
     3. ocr *Note: you might need to install [tesseract-ocr](https://tesseract-ocr.github.io/tessdoc/Installation.html) for using this preprocessor.*
 
-3. Assign annotation tasks (<PDF_SHA>s) to specific users <user>:
+3. [assign] Assign annotation tasks (<PDF_SHA>s) to specific users <user>:
     ```bash
     pawls assign ./skiff_files/apps/pawls/papers <user> <PDF_SHA>
     ```
     Optionally at this stage, you can provide a `--name-file` argument to `pawls assign`,
     which allows you to specify a name for a given pdf (for example the title of a paper).
     This file should be a json file containing `sha:name` mappings.
-4. (optional) Create pre-annotations for the PDFs based on some model predictions `anno.json`:
+
+4. (optional) [preannotate] Create pre-annotations for the PDFs based on some model predictions `anno.json`:
     ```bash
     pawls preannotate <labeling_folder> <labeling_config> anno.json -u <user>
     ```
     You can find an example of generating the pre-annotations in `scripts/generate_pdf_layouts.py`.
-5. Export the annotated dataset to the COCO format:
 
-    1. Export all annotations of a project of the default annotator (development_user):
+5. [status] Check annotation status for the <labeling_folder>:
+    ```bash
+    pawls status <labeling_folder>
+    ```
+
+    1. Save the labeling record table:
+        ```bash
+        pawls status <labeling_folder> --output record.csv
+        ```
+
+6. [metric] Check Inter Annotator Agreement (IAA):
+    ```bash
+    pawls metric <labeling_folder> <config_file> \
+        --textual-categories cat1,cat2 --non-textual-categories cat3,cat4
+    ```
+    For blocks, we measure consistency using [mAP scores](https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173), a common metric
+    in object detection tasks that evaluates block category consistency at different overlap
+    levels.
+
+
+    For textual regions, we measure consistency based on token categories.
+    We assign each PDF token the category of the block that contains it, and compare
+    the labels of the same token across annotators. The agreement level is measured by
+    token accuracy.
+
+    The command prints a matrix in which the (i,j)-th element is calculated by
+    treating the annotations from annotator i as the "ground truth" and those from
+    annotator j as the "predictions".
+
+    1. Save the IAA report to `<save-folder>`:
+        ```bash
+        pawls metric <labeling_folder> <config_file> \
+            --textual-categories cat1,cat2 --non-textual-categories cat3,cat4 \
+            --save <save-folder>
+        ```
+        This creates `block-eval.csv` and `textual-eval.csv` in the folder, holding the block and textual
+        region IAA results.
+
+    2. Specify annotators for calculating IAA:
+        ```bash
+        pawls metric <labeling_folder> <config_file> \
+            --textual-categories cat1,cat2 --non-textual-categories cat3,cat4 \
+            --u <annotator1> --u <annotator2>
+        ```
+        
+7. [export] Export the annotated dataset to the specified format. Currently, export to the `COCO` format and the `token` table format is supported.
+
+    1. Export all annotations of a project from all annotators:
         ```bash
-        pawls export <labeling_folder> <labeling_config> <output_path>
+        pawls export <labeling_folder> <labeling_config> <output_path> <format>
         ```
 
-    2. Export only finished annotations of from a given annotator, e.g. markn:
+    2. Export only finished annotations of a given annotator, e.g. markn:
         ```bash
-        pawls export <labeling_folder> <labeling_config> <output_path> -u markn
+        pawls export <labeling_folder> <labeling_config> <output_path> <format> -u markn
         ```
 
-    3. Export all annotations of from a given annotator: 
+    3. Export all annotations (including unfinished ones) from a given annotator:
         ```bash
-        pawls export <labeling_folder> <labeling_config> <output_path> -u markn --all
+        pawls export <labeling_folder> <labeling_config> <output_path> <format> -u markn --include-unfinished
         ```
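
Note on the `metric` step: the token-level agreement matrix described in the new readme text can be illustrated with a small sketch. This is not the PAWLS implementation. The function name `token_agreement_matrix` and the input layout (a mapping from annotator name to `{token_id: category}`) are assumptions made for illustration only. It assumes each token has already been assigned the category of its containing block, and that a token the "prediction" annotator left unlabeled counts as a disagreement.

```python
from typing import Dict


def token_agreement_matrix(
    labels: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Pairwise token accuracy between annotators (hypothetical helper).

    labels[annotator][token_id] is the category assigned to that token.
    matrix[i][j] treats annotator i as ground truth and j as predictions.
    """
    annotators = sorted(labels)
    matrix: Dict[str, Dict[str, float]] = {}
    for gt in annotators:
        gt_tokens = labels[gt]
        matrix[gt] = {}
        for pred in annotators:
            if not gt_tokens:
                # No ground-truth tokens: accuracy is undefined.
                matrix[gt][pred] = float("nan")
                continue
            # A token missing from the prediction side never matches.
            correct = sum(
                1
                for token_id, category in gt_tokens.items()
                if labels[pred].get(token_id) == category
            )
            matrix[gt][pred] = correct / len(gt_tokens)
    return matrix


if __name__ == "__main__":
    example = {
        "alice": {"t0": "title", "t1": "title", "t2": "paragraph", "t3": "paragraph"},
        "bob": {"t0": "title", "t1": "paragraph", "t2": "paragraph"},
    }
    for gt, row in token_agreement_matrix(example).items():
        print(gt, row)
```

Because the denominator is the number of tokens labeled by the ground-truth annotator, the matrix is generally not symmetric: in the example, scoring bob against alice differs from scoring alice against bob, since bob labeled fewer tokens.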