DocDrop Web Applications
A collection of web applications that showcase the Hypothesis web annotator or support its use.
Each application is described in more detail below.
DocDrop (main application):
Drag and drop a document to annotate. Annotations persist and are available again if the same document is uploaded later. Supports many common document formats including PDF, EPUB and various office formats such as spreadsheets and text documents. Both Microsoft and ODF documents are generally supported. Upon upload, the user is presented with a web representation of the document for Hypothesis annotation.
Documents are stored on AWS, and a reference to each is stored in the database. Before a document is stored on AWS, it is hashed; if the hash already exists among the database references, the document is not uploaded again.
Most document types are converted using the LibreOffice headless engine. The resulting web viewable formats are pdf, csv or epub (epub files are not converted, as there is a dedicated epub viewer).
A hash of the converted document (if conversion was necessary) is also taken and stored. Thus, when a user uploads an existing document (i.e. the hash of the parent exists in the database), the derivative or converted document is provided and neither upload nor conversion occurs.
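The deduplication described above can be sketched roughly as follows. This is only an illustration: the `DocumentStore` class, its in-memory dictionary (standing in for the database references and AWS storage) and the choice of SHA-256 are assumptions, not taken from the DocDrop code.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Return a hex digest identifying the document content (SHA-256 is assumed)."""
    return hashlib.sha256(data).hexdigest()


class DocumentStore:
    """Hypothetical in-memory stand-in for the database references and AWS storage."""

    def __init__(self):
        self.references = {}  # parent hash -> hash of the derivative document
        self.uploads = 0      # counts how many times storage was actually written

    def add(self, data: bytes, convert) -> str:
        parent = file_hash(data)
        if parent in self.references:
            # Known document: hand back the existing derivative, no upload or conversion.
            return self.references[parent]
        derivative = convert(data)
        self.uploads += 1
        self.references[parent] = file_hash(derivative)
        return self.references[parent]
```

Calling `add` twice with the same bytes returns the same derivative key and increments `uploads` only once.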
The LibreOffice conversion process is managed by Celery with RabbitMQ (task queues). See tasks.py. A reference to the completed (or failed) conversion is stored in the database upon resolution. The web service checks this result to manage download of the web viewable document.
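As a rough sketch of what a conversion step like this could look like outside of Celery, the snippet below shells out to LibreOffice in headless mode and returns a small result record. The function names and the shape of the result dictionary are hypothetical; the actual task in tasks.py may differ.

```python
import subprocess
from pathlib import Path


def build_convert_command(source: str, outdir: str, target: str = "pdf") -> list:
    """Build the LibreOffice command line for a headless conversion."""
    return ["soffice", "--headless", "--convert-to", target, "--outdir", outdir, source]


def convert(source: str, outdir: str, target: str = "pdf", timeout: int = 120) -> dict:
    """Run the conversion and return a record similar to what a task might store."""
    command = build_convert_command(source, outdir, target)
    try:
        subprocess.run(command, check=True, timeout=timeout, capture_output=True)
    except (OSError, subprocess.SubprocessError) as exc:
        # Covers a missing soffice binary, a non-zero exit code and a timeout.
        return {"status": "failed", "error": str(exc)}
    output = Path(outdir) / (Path(source).stem + "." + target)
    return {"status": "completed", "output": str(output)}
```

In a Celery setup, a function like `convert` would typically be wrapped in a task and its return value written to the database.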
YouTube Video Annotator:
Displays YouTube videos with subtitles (if they exist) and allows search and annotation of the subtitles. The user can pause or jump to various points in the video by clicking on text chunks in the subtitle display. Displays an error if the user enters a video without subtitles.
Not all videos have subtitles created for them and the application is dependent on YouTube for subtitle creation.
The subtitle language can be selected (if the video has subtitles in multiple languages) by passing a query string argument in the url. The argument can be either a single language or a comma separated list of languages in order of preference, e.g.
https://docdrop.org/video/<my video id>/?lang=de or
https://docdrop.org/video/<my video id>/?lang=en,es. English is the default if no language is specified. An error occurs if subtitles are not available in the specified language. Languages are given as ISO language codes.
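The handling of the lang argument can be illustrated with a minimal sketch. The function names here are hypothetical, and the real view may resolve languages differently (for example by delegating the choice to the transcript library).

```python
def parse_lang_preference(query_value, default="en"):
    """Turn a value such as 'en,es' into an ordered list of language codes."""
    if not query_value:
        return [default]
    codes = [code.strip().lower() for code in query_value.split(",")]
    return [code for code in codes if code] or [default]


def pick_language(preferred, available):
    """Return the first preferred language that has subtitles, else raise."""
    for code in preferred:
        if code in available:
            return code
    raise LookupError("no subtitles available in: " + ", ".join(preferred))
```

For example, a request with ?lang=de,en against a video that only has English and French subtitles would resolve to en.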
Subtitles are obtained server side using the Python youtube-transcript-api package and passed to the user's browser when the video display template renders.
The subtitles can be annotated using Hypothesis.
Parses text from an image pdf and overlays it, creating a pdf with selectable text which can then be annotated using Hypothesis. If a pdf already contains text, there is an option to force a redo, which turns the pdf back into an image and OCRs that, overwriting the existing text with the application's newly OCR'd text.
Both the original and derivative (ocr'd) pdf are stored on AWS with references stored in the database.
As with the core DocDrop application, documents are only stored once as determined by hashes, and on subsequent uploads (of the same document), the existing derivative document is provided with no further processing or upload.
Celery with RabbitMQ is used to manage the OCR process in task queues. The result of the process is stored in the database and consulted by the browser to show the user that the process is complete.
The conversion is performed by OCRmyPDF, which is invoked through a system call.
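A minimal sketch of such a call is shown below. The function names are hypothetical; `--force-ocr` is the OCRmyPDF option that rasterizes pages before OCR, and it is assumed here to correspond to the "force redo" option mentioned above.

```python
import subprocess


def build_ocr_command(source: str, output: str, force: bool = False) -> list:
    """Build the ocrmypdf command line; force maps to --force-ocr (assumed)."""
    command = ["ocrmypdf"]
    if force:
        command.append("--force-ocr")
    command += [source, output]
    return command


def run_ocr(source: str, output: str, force: bool = False) -> None:
    """Run OCRmyPDF and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ocr_command(source, output, force), check=True)
```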
OCRmyPDF uses tesseract and a number of additional libraries, and the results are quite good, comparable to many commercially available tools. However, it is very computationally expensive, especially the "force" (or redo) option, which converts the pdf back into an image before OCR. To rate limit, the application creates a lock file for each conversion process and refuses additional process requests once the lock file limit is reached. This "allowed number of ocr tasks" can be altered through the environment variable MAX_SIM_OCR_PROCESSES. Additional Celery workers calling OCR may be added to other servers or VM instances at some point in the future to support higher traffic.
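The lock file limit could be implemented along these lines. This is a sketch under assumptions: the `ocr_slot` helper, the `.lock` naming scheme and the fallback limit of 2 are hypothetical, and only the MAX_SIM_OCR_PROCESSES variable name comes from the application.

```python
import os
import uuid
from contextlib import contextmanager


class TooManyOcrProcesses(RuntimeError):
    """Raised when the configured number of simultaneous OCR tasks is reached."""


@contextmanager
def ocr_slot(lock_dir, limit=None):
    """Hold one lock file for the duration of an OCR run, refusing if at the limit."""
    if limit is None:
        # The fallback value of 2 is an assumption for this sketch.
        limit = int(os.environ.get("MAX_SIM_OCR_PROCESSES", "2"))
    os.makedirs(lock_dir, exist_ok=True)
    active = [name for name in os.listdir(lock_dir) if name.endswith(".lock")]
    if len(active) >= limit:
        raise TooManyOcrProcesses(f"{len(active)} OCR tasks already running")
    lock_path = os.path.join(lock_dir, uuid.uuid4().hex + ".lock")
    open(lock_path, "x").close()  # create the lock file for this process
    try:
        yield lock_path
    finally:
        os.remove(lock_path)  # always release the slot, even on failure
```

Note that counting files and then creating one is not atomic, so two requests arriving at the same instant could both pass the check; for coarse rate limiting of an expensive job this may be acceptable.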
Changes a pdf's "fingerprint", which Hypothesis uses to determine whether a pdf is the same as one that is already known. Changing the identifier allows different sets of Hypothesis annotations to be used on the same pdf.
Refingerprinted pdfs are not stored on AWS, nor is a reference kept in the database. Intermediate and final work products are stored locally and deleted upon completion.
Celery with RabbitMQ is used to queue and manage the refingerprint process (see tasks.py). It is also used to queue the deletion mentioned above.
The refingerprint process uses pdfrw for reading and writing to the document.
The document ID is overwritten with a randomized ID and random metadata is written to the document, resulting in a pdf that has a different identity when used by Hypothesis.
Google Drive Annotator:
(broken as of 10/13/21)
Allows pdfs from Google Drive to be opened in the annotation display. Appears as an "open with" option in the Google Drive ui. The user can annotate the pdf and read annotations, including those from other uploads or instances of the same pdf.
DocDrop has evolved to support many more document types than pdf. Google changed some policies on use. There are some additional issues and considerations with the DocDrop2 format. The original approach of hacking drive app into pdf.js was not nice. All this has resulted in the drive app being broken and stranded. TODO.
Python 3.6+. A virtual environment is recommended for installing packages, but how you do it is up to you.
sudo apt install postgresql
sudo su postgres
create database dbname (e.g. dbname=docdrop)
createuser --interactive --pwprompt to set username (e.g. dduser) and password (e.g. ddpwd)
create database docdrop;
GRANT all privileges ON DATABASE docdrop TO dduser;
sudo apt install python-psycopg2
sudo apt install libpq-dev
pip install -r requirements.txt
cp _docs/env sample/ droppdf/.env
DB_NAME='droppdf'
DB_USER='dduser'
DB_PASSWORD='ddpwd'
DB_HOST='localhost'
DJANGO_SERVER='dev'
DJANGO_DEBUG=true
DJANGO_SECRET_KEY='secret123'
Create and update database tables.
python manage.py migrate
Create Super User
(optional, if you want to log on to admin interface to view db data)
python manage.py createsuperuser
(terminal or screen 1)
celery -A financial_planning_app worker -l info
(terminal or screen 2)
python manage.py runserver
systemctl status celery