codemyriad/lamasca-processing


Newspaper annotation training for lamasca

General Structure

This repository contains code to help with annotating and training a Machine Learning (ML) model for layout recognition of the newspaper La Masca.

The annotation and training workflow is:

  1. PDF files are uploaded to https://newspapers.codemyriad.io/lamasca/1994/index.html
  2. The PDF files are converted to grayscale images and deskewed.
  3. The results are available in URLs like https://newspapers.codemyriad.io/lamasca-pages/1994/lamasca-1994-01-12/page_01.jpeg
  4. The pages are then uploaded to Label Studio using the lp-labelstudio labelstudio-api projects create command.
  5. Label Studio is used to annotate the pages.
  6. The lp-labelstudio labelstudio-api projects fetch command is used to download the annotations into the annotations directory next to the images. For each annotator, a directory is created, and each task (page) is saved as a single file in that directory. All annotations are also incorporated into a manifest.json file. For example:
/lamasca-pages/1994/lamasca-1994-01-12/
├── annotations/
│   └── annotator@example.com/
│       ├── page01.json
│       ├── page02.json
│       └── ...
├── manifest.json
├── page_01.jpeg
├── page_02.jpeg
└── ...
  7. If the same issue is uploaded to Label Studio again, it will include the annotations that have already been fetched.

  8. The generate-thumbnails command generates thumbnails of the pages with the annotations overlaid.

  9. The Sigal gallery generator has been used to generate HTML galleries of these annotated pages for easy manual checking.

  10. A manifest.json file is generated for each issue folder. If annotations are already present for the issue, they are included in the manifest. The manifest is used to generate the COCO JSON file for training. Here's the command used to generate the manifest files:

    lp-labelstudio generate-labelstudio-manifest /tmp/newspapers/lamasca-pages/1994/lamasca-*
  11. To prepare the annotations for training, a COCO JSON file is generated with lp-labelstudio collect-coco, implemented in src/lp_labelstudio/collect_coco.py. This is the command used:

    lp-labelstudio collect-coco $(find /tmp/newspapers/lamasca-pages -name manifest.json -size +100k)
    cp /tmp/coco-out.json /tmp/newspapers/lamasca-pages/1994/coco-all.json
  12. Now the training can start. The training-image directory defines a Docker image that can be used to train the model: ghcr.io/codemyriad/lamasca-layoutparser. It includes the prepare-training.sh script, which prepares and starts the training.

  • To run the training on vast.ai, the following commands are useful:
    vastai search offers 'dlperf>100 cpu_ram>60 inet_down>1000 inet_up>1000 gpu_name=RTX_4090 num_gpus>=2' -o dph
    # Choose an instance id
    INSTANCEID=000000000
    vastai create instance ${INSTANCEID} --image ghcr.io/codemyriad/lamasca-layoutparser --disk 100 --onstart-cmd "byobu new-session -d -s training 'touch ~/.no_auto_tmux; bash /usr/local/bin/prepare-training.sh'"
    # Wait for the instance to start. Here's an indicator for the terminal (but you can follow on their web UI too)
    watch -n1 "vastai show instances --raw|jq .[].status_msg"
    ssh $(vastai ssh-url $(vastai show instances --raw | jq '.[0].id'))
    # when connected, run `byobu`
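
The COCO preparation step in the workflow above can be illustrated with a small sketch. This is not the code in src/lp_labelstudio/collect_coco.py; it only shows the general shape of a COCO detection file and the usual conversion from rectangles expressed as percentages of the image size (as Label Studio exports them) to COCO pixel boxes. The function names, the simplified `pages` input structure, and the labels `Headline` and `Text` are assumptions made for illustration.

```python
from typing import Dict, List


def percent_rect_to_coco_bbox(rect: Dict[str, float], img_w: int, img_h: int) -> List[float]:
    """Convert a rectangle given in percent of the image size
    into a COCO bounding box [x, y, width, height] in pixels."""
    return [
        rect["x"] / 100.0 * img_w,
        rect["y"] / 100.0 * img_h,
        rect["width"] / 100.0 * img_w,
        rect["height"] / 100.0 * img_h,
    ]


def build_coco(pages: List[dict], labels: List[str]) -> dict:
    """Assemble a minimal COCO detection dictionary.

    `pages` is a hypothetical, simplified structure: one dict per page with
    `file_name`, `width`, `height` and a list of `rects`, each carrying
    percent coordinates and a `label`. The real manifest may differ.
    """
    categories = [{"id": i + 1, "name": name} for i, name in enumerate(labels)]
    category_ids = {c["name"]: c["id"] for c in categories}
    images, annotations = [], []
    for image_id, page in enumerate(pages, start=1):
        images.append({
            "id": image_id,
            "file_name": page["file_name"],
            "width": page["width"],
            "height": page["height"],
        })
        for rect in page["rects"]:
            bbox = percent_rect_to_coco_bbox(rect, page["width"], page["height"])
            annotations.append({
                "id": len(annotations) + 1,  # annotation ids are unique across the file
                "image_id": image_id,
                "category_id": category_ids[rect["label"]],
                "bbox": bbox,
                "area": bbox[2] * bbox[3],
                "iscrowd": 0,
            })
    return {"images": images, "annotations": annotations, "categories": categories}


if __name__ == "__main__":
    example_pages = [{
        "file_name": "lamasca-1994-01-12/page_01.jpeg",
        "width": 1000,
        "height": 2000,
        "rects": [{"x": 50, "y": 25, "width": 50, "height": 25, "label": "Headline"}],
    }]
    coco = build_coco(example_pages, ["Headline", "Text"])
    print(coco["annotations"][0]["bbox"])
```

The percent-to-pixel conversion is the step that most often goes wrong when moving annotations from an annotation tool to a training format, so it is kept as a separate function that can be checked on its own.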

Notes

When working on this project, the contents of https://newspapers.codemyriad.io/ are mounted on /tmp/newspapers using the excellent rclone tool. Some code may hardcode this path.
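
Since some code may hardcode the mount path, one way to keep scripts portable is to resolve the root in a single place. The following is a minimal sketch under stated assumptions: the `NEWSPAPERS_ROOT` environment variable and both helper functions are hypothetical and are not read or provided by lp-labelstudio; only the default path and the folder naming come from this README.

```python
import os
from pathlib import Path

# Default mount point used throughout this README.
DEFAULT_ROOT = "/tmp/newspapers"


def newspapers_root() -> Path:
    """Return the mount point, overridable through the hypothetical
    NEWSPAPERS_ROOT environment variable."""
    return Path(os.environ.get("NEWSPAPERS_ROOT", DEFAULT_ROOT))


def issue_dir(issue_date: str) -> Path:
    """Return the folder of one issue, e.g. '1994-01-12' maps to
    <root>/lamasca-pages/1994/lamasca-1994-01-12."""
    year = issue_date[:4]
    return newspapers_root() / "lamasca-pages" / year / f"lamasca-{issue_date}"


if __name__ == "__main__":
    print(issue_dir("1994-01-12"))
```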

The project package is currently named lp-labelstudio, but it's actually very specific to the use case it was developed for. Consider renaming it to reflect its purpose more accurately.

Galleries

The Sigal gallery generator has been used to generate HTML galleries.

  • Generate Thumbnails with Annotations:

    lp-labelstudio generate-thumbnails /tmp/newspapers/lamasca-pages/1994/ /tmp/thumbnails
  • Generate the Gallery:

    sigal build -c sigal.conf.py /tmp/thumbnails/ /tmp/sigal-thumbnails
  • Copy the Gallery:

    cp -r /tmp/sigal-thumbnails/* /tmp/newspapers/lamasca-preview/
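
The three gallery steps above can also be chained from a small Python wrapper. This is a sketch, not part of the repository: `gallery_commands` and `build_gallery` are hypothetical names, and the command lines are copied from this section. The final copy uses `shutil.copytree`, which is only roughly equivalent to `cp -r dir/*` (it also copies hidden top-level files). With `dry_run=True` nothing is executed and the commands are only printed.

```python
import shutil
import subprocess
from typing import List


def gallery_commands(pages_dir: str, thumbs_dir: str, gallery_dir: str,
                     sigal_conf: str = "sigal.conf.py") -> List[List[str]]:
    """Return the thumbnail and gallery commands from this section as argv lists."""
    return [
        ["lp-labelstudio", "generate-thumbnails", pages_dir, thumbs_dir],
        ["sigal", "build", "-c", sigal_conf, thumbs_dir, gallery_dir],
    ]


def build_gallery(pages_dir: str, thumbs_dir: str, gallery_dir: str,
                  publish_dir: str, dry_run: bool = True) -> List[List[str]]:
    """Run (or, in dry-run mode, only print) the gallery steps in order."""
    commands = gallery_commands(pages_dir, thumbs_dir, gallery_dir)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)
    if not dry_run:
        # Requires Python 3.8+ for dirs_exist_ok.
        shutil.copytree(gallery_dir, publish_dir, dirs_exist_ok=True)
    return commands


if __name__ == "__main__":
    build_gallery(
        "/tmp/newspapers/lamasca-pages/1994/",
        "/tmp/thumbnails",
        "/tmp/sigal-thumbnails",
        "/tmp/newspapers/lamasca-preview/",
        dry_run=True,
    )
```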

About

Support for newspaper digitization of "La Masca"
