Omniscribe training data

Project data for Omniscribe: https://github.com/collectionslab/Omniscribe

Omniscribe was developed to detect annotations (marginalia, interlinear markings, provenance marks, etc.) in digitized printed books hosted via the International Image Interoperability Framework (IIIF).

Files and Directories

  • rawData.csv: This CSV file stores all the labeled data created by Zooniverse users. The data includes the regions of interest that were marked, along with some information about the user who marked them. This data needs further processing before it can be used for training.

  • extractROIs.py: This script reads rawData.csv (the path is hard-coded) and generates data.json, a JSON file that lists every image from Zooniverse along with any regions marked on it. The JSON itself is a relatively complex object: it stores many images, and each image may in turn carry a list of ROIs.

    Put simply, every image has a list of ROIs, and every ROI consists of an all_points_x array and an all_points_y array, such that all_points_x[i] and all_points_y[i] form one coordinate point; each region has four such points, forming a rectangle that captures the ROI. The ROIs are structured this way to meet Mask R-CNN's input format requirements. An illustrative sketch of this structure appears after this list.

  • data.json: The file generated by extractROIs.py. It contains all the images with their labeled annotations from rawData.csv, and is used by datasetGenerator.py to generate datasets that are ready for training.

  • datasetGenerator.py: This script reads data.json and generates three JSON files, for training, validation, and testing. Each of these files has to be renamed to via_region_data.json and placed in the same directory as the images it describes. Note that changing the SEED value will produce different splits; a sketch of this idea follows the list.

  • annotation-datasets/: Contains a training set and a validation set of images that contain handwriting. Note that training the model expects child directories "train" and "val", and that those directories contain only images (see the layout check sketched below).
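
The following is a minimal sketch of how the per-image ROI structure described above for data.json might be read. The exact top-level layout assumed here (an image key mapping to a "regions" list) is an illustrative assumption, not the authoritative schema; consult extractROIs.py for the real structure.

    import json

    # Minimal sketch: pair the all_points_x / all_points_y arrays of each
    # ROI into (x, y) corner points. The top-level layout assumed here
    # (image key -> "regions" list) is illustrative only.
    with open("data.json") as f:
        data = json.load(f)

    for image_key, image_entry in data.items():
        for roi in image_entry.get("regions", []):
            xs = roi["all_points_x"]
            ys = roi["all_points_y"]
            # Each ROI is a rectangle, so four (x, y) corner points.
            corners = list(zip(xs, ys))
            print(image_key, corners)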
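
Below is a hedged sketch of the kind of seeded train/validation/test split that datasetGenerator.py performs. It is not the actual script; the split ratios and output file names are assumptions chosen for illustration. It does show why changing the SEED value yields different datasets: the seed drives the shuffle that decides which images land in each split.

    import json
    import random

    SEED = 42  # changing this value changes which images land in each split

    with open("data.json") as f:
        data = json.load(f)

    keys = sorted(data.keys())
    random.Random(SEED).shuffle(keys)

    # Illustrative 80/10/10 split; the real ratios live in datasetGenerator.py.
    n = len(keys)
    splits = {
        "train": keys[: int(0.8 * n)],
        "val": keys[int(0.8 * n) : int(0.9 * n)],
        "test": keys[int(0.9 * n) :],
    }

    for name, subset in splits.items():
        with open(f"{name}.json", "w") as out:
            json.dump({k: data[k] for k in subset}, out)

Each output file would then be renamed to via_region_data.json and placed alongside the images it describes, for example in annotation-datasets/train/ and annotation-datasets/val/.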
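
As a quick sanity check of the annotation-datasets/ layout described above, the snippet below verifies that the expected train/ and val/ directories exist and reports how many files each one holds; the directory names come from the listing above, and everything else is illustrative.

    import os

    # Sanity-check the expected layout: annotation-datasets/train and
    # annotation-datasets/val, each holding the images for that split
    # (plus, once generated, its renamed via_region_data.json).
    root = "annotation-datasets"
    for split in ("train", "val"):
        path = os.path.join(root, split)
        if not os.path.isdir(path):
            raise SystemExit(f"missing expected directory: {path}")
        print(f"{path}: {len(os.listdir(path))} files")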