Newspaper annotation training for lamasca

Quick Links

Open ONI: https://oni.cmzx.it/lccn/sn00000001/issues/
Annotated Thumbnails: https://newspapers.codemyriad.io/lamasca-preview/index.html

General Structure

This repository contains code for annotating pages of the newspaper La Masca and for training a Machine Learning (ML) model that recognizes their layout.

The process for annotating is:

PDF files are uploaded to https://newspapers.codemyriad.io/lamasca/1994/index.html
The PDF files are converted to grayscale images and deskewed.
The results are available at URLs like https://newspapers.codemyriad.io/lamasca-pages/1994/lamasca-1994-01-12/page_01.jpeg
The pages are then uploaded to Label Studio using the `lp-labelstudio labelstudio-api projects create` command.
The pages are annotated in Label Studio.
The `lp-labelstudio labelstudio-api projects fetch` command downloads the annotations into the annotations directory next to the images. A directory is created for each annotator, and each task (page) is saved as a single file in that directory. All annotations are also incorporated into a manifest.json file. For example:

/lamasca-pages/1994/lamasca-1994-01-12/
├── annotations/
│   └── annotator@example.com/
│       ├── page01.json
│       ├── page02.json
│       └── ...
├── manifest.json
├── page_01.jpeg
├── page_02.jpeg
└── ...

If the same issue is uploaded to Label Studio again, the upload includes the annotations that have already been fetched.
The `generate-thumbnails` command generates thumbnails of the pages with the annotations overlaid. Examples are in the Annotated Thumbnails gallery listed under Quick Links.
The Sigal gallery generator is used to build HTML galleries of these annotated pages, which makes manual checking easier.
A manifest.json file is generated for each issue folder. If annotations are already present for the issue, they are included in the manifest. The manifest is used to generate the COCO JSON file for training. The manifest files are generated with this command:
```
lp-labelstudio generate-labelstudio-manifest /tmp/newspapers/lamasca-pages/1994/lamasca-*
```
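
The per-issue layout described above (an annotations directory with one subdirectory per annotator) can be inspected with a few lines of Python. This is a minimal sketch, not part of lp-labelstudio: the function name `list_annotation_files` is made up for illustration, and it only assumes the directory structure shown in the tree above.

```python
from pathlib import Path


def list_annotation_files(issue_dir):
    """Map each annotator directory name to its sorted annotation file names.

    Hypothetical helper: it relies only on the <issue>/annotations/<annotator>/
    layout shown in this README, not on any lp-labelstudio API.
    """
    annotations_dir = Path(issue_dir) / "annotations"
    result = {}
    if not annotations_dir.is_dir():
        return result  # no annotations have been fetched for this issue
    for annotator_dir in sorted(annotations_dir.iterdir()):
        if annotator_dir.is_dir():
            result[annotator_dir.name] = sorted(
                p.name for p in annotator_dir.glob("*.json")
            )
    return result
```

Calling it on an issue folder such as /tmp/newspapers/lamasca-pages/1994/lamasca-1994-01-12 gives a quick view of who annotated which pages.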

To prepare the annotations for training, a COCO JSON file is generated with `lp-labelstudio collect-coco` (implemented in src/lp_labelstudio/collect_coco.py). These are the commands used:

```
lp-labelstudio collect-coco $(find /tmp/newspapers/lamasca-pages -name manifest.json -size +100k)
cp /tmp/coco-out.json /tmp/newspapers/lamasca-pages/1994/coco-all.json
```
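
Before starting a training run it can help to sanity-check the generated COCO file. The sketch below is not part of lp-labelstudio. It assumes only the standard top-level COCO keys (`images`, `annotations`, `categories`), and the function names are made up for illustration.

```python
import json
from collections import Counter


def summarize_coco(coco):
    """Return image/annotation counts and per-category annotation counts."""
    # Map category ids to names so the summary is human readable.
    names = {c["id"]: c["name"] for c in coco.get("categories", [])}
    per_category = Counter(
        names.get(a["category_id"], "unknown") for a in coco.get("annotations", [])
    )
    return {
        "images": len(coco.get("images", [])),
        "annotations": len(coco.get("annotations", [])),
        "per_category": dict(per_category),
    }


def summarize_coco_file(path):
    """Load a COCO JSON file from disk and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize_coco(json.load(f))
```

For example, `summarize_coco_file("/tmp/coco-out.json")` shows at a glance whether any category has too few annotations to train on.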

Now the training can start. The training-image directory defines a Docker image for training the model: ghcr.io/codemyriad/lamasca-layoutparser. The image includes the prepare-training.sh script, which prepares and starts the training.

To run the training on vast.ai, these commands are handy:

```
vastai search offers 'dlperf>100 cpu_ram>60 inet_down>1000 inet_up>1000 gpu_name=RTX_4090 num_gpus>=2' -o dph
# Choose an instance id
INSTANCEID=000000000
vastai create instance ${INSTANCEID} --image ghcr.io/codemyriad/lamasca-layoutparser --disk 100 --onstart-cmd "byobu new-session -d -s training 'touch ~/.no_auto_tmux; bash /usr/local/bin/prepare-training.sh'"
# Wait for the instance to start. This shows a status indicator in the terminal (the vast.ai web UI shows it too)
# The jq filters are quoted so the shell does not treat [] as a glob pattern
watch -n1 "vastai show instances --raw | jq '.[].status_msg'"
ssh $(vastai ssh-url $(vastai show instances --raw | jq '.[0].id'))
# Once connected, run `byobu`
```
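
The same status check can be scripted in Python instead of `jq`. This is only a sketch: it assumes, as the `jq` filters above imply, that `vastai show instances --raw` prints a JSON array of objects with `id` and `status_msg` fields. The function names are made up for illustration.

```python
import json
import subprocess


def instance_statuses(raw_json):
    """Return (id, status_msg) pairs parsed from `vastai show instances --raw` output."""
    instances = json.loads(raw_json)
    # .get() tolerates instances that do not report a status message yet.
    return [(inst.get("id"), inst.get("status_msg")) for inst in instances]


def fetch_instance_statuses():
    """Run the vastai CLI (it must be installed and logged in) and parse its output."""
    completed = subprocess.run(
        ["vastai", "show", "instances", "--raw"],
        capture_output=True,
        text=True,
        check=True,
    )
    return instance_statuses(completed.stdout)
```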

Notes

When working on this project, the contents of https://newspapers.codemyriad.io/ are mounted on /tmp/newspapers using the excellent rclone tool. Some code may hardcode this path.
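
Because some code may hardcode /tmp/newspapers, one option is to resolve the mount root in a single place. The sketch below is an assumption, not existing project code: the `NEWSPAPERS_ROOT` environment variable and both function names are hypothetical.

```python
import os
from pathlib import Path

DEFAULT_ROOT = "/tmp/newspapers"  # mount point used throughout this README


def newspapers_root(environ=None):
    """Return the mount root, honoring a hypothetical NEWSPAPERS_ROOT override."""
    environ = os.environ if environ is None else environ
    return Path(environ.get("NEWSPAPERS_ROOT", DEFAULT_ROOT))


def issue_dir(year, issue_date, environ=None):
    """Build a path like <root>/lamasca-pages/1994/lamasca-1994-01-12."""
    return (
        newspapers_root(environ)
        / "lamasca-pages"
        / str(year)
        / f"lamasca-{issue_date}"
    )
```

With a helper like this, mounting the bucket somewhere else would only require setting one variable rather than editing paths in several modules.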

The project package is currently named lp-labelstudio, but it's actually very specific to the use case it was developed for. Consider renaming it to reflect its purpose more accurately.

Galleries

The Sigal gallery generator has been used to generate HTML galleries.

Generate Thumbnails with Annotations:

```
lp-labelstudio generate-thumbnails /tmp/newspapers/lamasca-pages/1994/ /tmp/thumbnails
```

Generate the Gallery:

```
sigal build -c sigal.conf.py /tmp/thumbnails/ /tmp/sigal-thumbnails
```

Copy the Gallery:

```
cp -r /tmp/sigal-thumbnails/* /tmp/newspapers/lamasca-preview/
```
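
The three gallery steps can also be chained from Python. This is a sketch under stated assumptions, not the repository's update-preview.sh: the function names are made up, and `shutil.copytree` only approximates `cp -r dir/*` (it also copies dotfiles).

```python
import shutil
import subprocess


def gallery_commands(pages_dir, thumbs_dir, gallery_dir, sigal_conf="sigal.conf.py"):
    """Return the two CLI invocations from this section as argv lists."""
    return [
        ["lp-labelstudio", "generate-thumbnails", pages_dir, thumbs_dir],
        ["sigal", "build", "-c", sigal_conf, thumbs_dir, gallery_dir],
    ]


def update_preview(pages_dir, thumbs_dir, gallery_dir, preview_dir):
    """Run both commands, then copy the built gallery into the preview directory."""
    for argv in gallery_commands(pages_dir, thumbs_dir, gallery_dir):
        subprocess.run(argv, check=True)  # stop at the first failing step
    # Merge into an existing preview directory (requires Python 3.8+).
    shutil.copytree(gallery_dir, preview_dir, dirs_exist_ok=True)
```

Returning the commands as argv lists keeps them easy to log or inspect before anything is executed.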

