- Open ONI: https://oni.cmzx.it/lccn/sn00000001/issues/
- Annotated Thumbnails: https://newspapers.codemyriad.io/lamasca-preview/index.html
This repository contains code to help with annotating and training a Machine Learning (ML) model for layout recognition of the newspaper La Masca.
The process for annotating is:
- PDF files are uploaded to https://newspapers.codemyriad.io/lamasca/1994/index.html
- The PDF files are converted to grayscale images and deskewed.
- The results are available in URLs like https://newspapers.codemyriad.io/lamasca-pages/1994/lamasca-1994-01-12/page_01.jpeg
- The pages are then uploaded to Label Studio using the `lp-labelstudio labelstudio-api projects create` command.
- Label Studio is used to annotate the pages.
- The `lp-labelstudio labelstudio-api projects fetch` command is used to download the annotations into the `annotations` directory next to the images. For each annotator, a directory is created, and each task (page) is saved as a single file in that directory. All annotations are also incorporated into a `manifest.json` file. For example:
/lamasca-pages/1994/lamasca-1994-01-12/
├── annotations/
│   └── annotator@example.com/
│       ├── page01.json
│       ├── page02.json
│       └── ...
├── manifest.json
├── page_01.jpeg
├── page_02.jpeg
└── ...
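As a rough illustration of the layout above, the sketch below walks an issue folder and groups the saved task files by annotator. The function name `list_annotation_files` is hypothetical and is not part of `lp-labelstudio`; it only assumes the `annotations/<annotator>/*.json` structure shown in the example.

```python
from pathlib import Path


def list_annotation_files(issue_dir):
    """Map each annotator directory name to its sorted task file names.

    Assumes the layout <issue_dir>/annotations/<annotator>/*.json.
    Returns an empty dict if the issue has no annotations directory.
    """
    annotations_dir = Path(issue_dir) / "annotations"
    if not annotations_dir.is_dir():
        return {}
    by_annotator = {}
    for annotator_dir in sorted(annotations_dir.iterdir()):
        if annotator_dir.is_dir():
            by_annotator[annotator_dir.name] = sorted(
                path.name for path in annotator_dir.glob("*.json")
            )
    return by_annotator
```

Calling it on `/tmp/newspapers/lamasca-pages/1994/lamasca-1994-01-12` would give one entry per annotator, which can be handy for checking which pages still lack annotations.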
- If the same issue is uploaded to Label Studio again, it will now include the annotations that have been fetched.
- The `generate-thumbnails` command can be used to generate thumbnails of the pages with overlaid annotations.
- The Sigal gallery generator has been used to generate HTML galleries of these annotated pages for easy manual checking and observation.
- A `manifest.json` file is generated for each issue folder. If annotations are already present for the issue, they will be included in the manifest. The manifest is used to generate the COCO JSON file for training. Here's the command used to generate the manifest files:

      lp-labelstudio generate-labelstudio-manifest /tmp/newspapers/lamasca-pages/1994/lamasca-*

- To prepare the annotations for training, a COCO JSON file is generated with `lp-labelstudio collect-coco` from `src/lp_labelstudio/collect_coco.py`. This is the command used:

      lp-labelstudio collect-coco $(find /tmp/newspapers/lamasca-pages -name manifest.json -size +100k)
      cp /tmp/coco-out.json /tmp/newspapers/lamasca-pages/1994/coco-all.json

- Now the training can start. The `training-image` directory defines a Docker image that can be used to train the model: `ghcr.io/codemyriad/lamasca-layoutparser`. It includes the `prepare-training.sh` script that will prepare and start the training.
- To use vast.ai to run the training, these commands can be quite handy:

      vastai search offers 'dlperf>100 cpu_ram>60 inet_down>1000 inet_up>1000 gpu_name=RTX_4090 num_gpus>=2' -o dph
      # Choose an instance id
      INSTANCEID=000000000
      vastai create instance ${INSTANCEID} --image ghcr.io/codemyriad/lamasca-layoutparser --disk 100 --onstart-cmd "byobu new-session -d -s training 'touch ~/.no_auto_tmux; bash /usr/local/bin/prepare-training.sh'"
      # Wait for the instance to start. Here's an indicator for the terminal (but you can follow on their web UI too)
      watch -n1 "vastai show instances --raw|jq .[].status_msg"
      ssh $(vastai ssh-url $(vastai show instances --raw|jq .[0].id))
      # when connected, run `byobu`
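For readers unfamiliar with the COCO step in the process above, here is a minimal sketch of the kind of conversion it involves. It is not the code in `src/lp_labelstudio/collect_coco.py`. It assumes the annotations use Label Studio's rectangle results, where `x`, `y`, `width` and `height` are percentages of the image size, and turns one such result into a COCO annotation with a pixel `[x, y, width, height]` box. The function names, the `category_ids` mapping and the `Headline` label are hypothetical.

```python
def ls_rect_to_coco_bbox(result):
    """Convert a Label Studio rectangle result (percent units) to a COCO bbox.

    Assumes `result` carries `original_width`, `original_height` and a
    `value` dict with percentage `x`, `y`, `width`, `height`.
    """
    value = result["value"]
    img_w = result["original_width"]
    img_h = result["original_height"]
    return [
        value["x"] * img_w / 100.0,
        value["y"] * img_h / 100.0,
        value["width"] * img_w / 100.0,
        value["height"] * img_h / 100.0,
    ]


def to_coco_annotation(result, image_id, annotation_id, category_ids):
    """Build one COCO annotation dict from one rectangle result.

    `category_ids` is a hypothetical mapping from label name to COCO
    category id; only the first label of the rectangle is used.
    """
    bbox = ls_rect_to_coco_bbox(result)
    label = result["value"]["rectanglelabels"][0]
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_ids[label],
        "bbox": bbox,
        "area": bbox[2] * bbox[3],
        "iscrowd": 0,
    }


# Example with a made-up label name and image size.
example = {
    "original_width": 2000,
    "original_height": 3000,
    "value": {"x": 10, "y": 20, "width": 50, "height": 25,
              "rectanglelabels": ["Headline"]},
}
annotation = to_coco_annotation(example, image_id=1, annotation_id=1,
                                category_ids={"Headline": 1})
```

Rotated rectangles are ignored in this sketch; handling them would need extra geometry.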
When working on this project, the contents of https://newspapers.codemyriad.io/ are mounted on /tmp/newspapers using the excellent rclone tool. Some code may hardcode this path.
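Because the same relative path appears both in the public URL and under the local mount, moving between the two is straightforward. The sketch below is one possible way to express that mapping; the helper names are hypothetical, and the naming scheme (ISO date in the issue folder, two-digit page number) is inferred from the examples in this README rather than from the code.

```python
from datetime import date
from pathlib import PurePosixPath

BASE_URL = "https://newspapers.codemyriad.io"
MOUNT_POINT = PurePosixPath("/tmp/newspapers")  # rclone mount point used above


def page_relative_path(issue_date, page_number):
    """Relative path of a page image, e.g. lamasca-pages/1994/lamasca-1994-01-12/page_01.jpeg."""
    issue = f"lamasca-{issue_date.isoformat()}"
    return PurePosixPath(
        "lamasca-pages", str(issue_date.year), issue, f"page_{page_number:02d}.jpeg"
    )


def page_url(issue_date, page_number):
    """Public URL of a page image."""
    return f"{BASE_URL}/{page_relative_path(issue_date, page_number)}"


def page_local_path(issue_date, page_number):
    """Path of the same page image under the local mount."""
    return MOUNT_POINT / page_relative_path(issue_date, page_number)
```

For example, `page_url(date(1994, 1, 12), 1)` reproduces the page URL given earlier, and `page_local_path` gives the matching file under `/tmp/newspapers`.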
The project package is currently named lp-labelstudio, but it's actually very specific to the use case it was developed for. Consider renaming it to reflect its purpose more accurately.
The Sigal gallery generator has been used to generate HTML galleries.
- Generate Thumbnails with Annotations:

      lp-labelstudio generate-thumbnails /tmp/newspapers/lamasca-pages/1994/ /tmp/thumbnails

- Generate the Gallery:

      sigal build -c sigal.conf.py /tmp/thumbnails/ /tmp/sigal-thumbnails

- Copy the Gallery:

      cp -r /tmp/sigal-thumbnails/* /tmp/newspapers/lamasca-preview/
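If these three steps are run often, they could be chained in a small script. The following is only a sketch: the function names are hypothetical, the command lines are copied from the steps above, and the final copy uses `shutil.copytree` instead of `cp -r`, so unlike the shell glob it also copies hidden files at the top level.

```python
import shutil
import subprocess


def gallery_commands(pages_dir, thumbnails_dir, gallery_dir, sigal_conf="sigal.conf.py"):
    """Return the thumbnail and gallery commands as argument lists."""
    return [
        ["lp-labelstudio", "generate-thumbnails", str(pages_dir), str(thumbnails_dir)],
        ["sigal", "build", "-c", sigal_conf, str(thumbnails_dir), str(gallery_dir)],
    ]


def publish_gallery(pages_dir, thumbnails_dir, gallery_dir, preview_dir):
    """Run both commands, stopping on the first failure, then copy the gallery."""
    for command in gallery_commands(pages_dir, thumbnails_dir, gallery_dir):
        subprocess.run(command, check=True)
    # Requires Python 3.8+ for dirs_exist_ok.
    shutil.copytree(gallery_dir, preview_dir, dirs_exist_ok=True)
```

A call matching the steps above would be `publish_gallery("/tmp/newspapers/lamasca-pages/1994/", "/tmp/thumbnails", "/tmp/sigal-thumbnails", "/tmp/newspapers/lamasca-preview/")`.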