separate preprocessing steps and use AlternativeImage in ocropy wrappers #10

Merged
merged 4 commits into from
Jul 16, 2019

bertsky
Collaborator

@bertsky bertsky commented Jun 25, 2019

For details, see the changelog. Roughly:

  • extra processors for binarization, deskewing, dewarping with ocropy
  • pipeline relies on AlternativeImage – as reference implementation showing how to deal with relative coordinates and coordinate transforms (rotation, translation, offset)
  • use variant of ocropy page segmentation to re-segment region rectangles into line masks
  • extra module for common ocropy functions

I will provide more examples and images shortly.

…ers:

- move binarization from recognition into extra Processor
  (also allowing region and page level operation)
- move dewarping from recognition into extra Processor
  (operating on the line level; model-independent)
- move deskewing from binarization into extra Processor
  (operating on the region level, only annotating angle in PAGE)
- always dive down the PAGE hierarchy checking whether
  AlternativeImage is referenced: use it if present,
  otherwise create an ad-hoc image for the segment
  (page/region/line) from _relative_ coordinates into
  the next higher-level image by cropping (and rotating);
  also, pass down corrected coordinates:
  - offset coordinates if the image is larger than the segment
    (e.g. from rotation),
  - rotate coordinates if the region was rotated (has @orientation)
  new functions (all to be moved into ocrd core):
  - image_from_page (AlternativeImage, or crop via Border)
  - image_from_region (AlternativeImage, or crop and rotate via Coords
    and orientation)
  - image_from_line (AlternativeImage, or crop via rotated polygon
    mask, and optionally region segmentation)
  - save_image_file (save new AlternativeImage: add to METS and
    reference in PAGE)
- use polygon masks instead of rectangles when cropping lines
  (especially useful after rotation), and try to resegment regions
  to mask components from neighbouring lines (especially useful
  against ascenders and descenders when dewarping or with sensitive
  OCR like ocropy)
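
The offset and rotation corrections listed above can be sketched roughly as follows. This is only an illustration in plain Python, not the code in this PR: the function name `relative_coordinates` and all of its parameters are hypothetical, and it assumes the parent image was rotated counter-clockwise around its center with the canvas expanded to fit.

```python
import math

def relative_coordinates(polygon, offset, angle_deg=0.0, size=(0, 0), rotated_size=None):
    """Map absolute (x, y) points into a cropped, optionally rotated parent image.

    Hypothetical helper for illustration only.
    polygon      -- list of absolute (x, y) points
    offset       -- absolute (x, y) of the top-left corner of the crop
    angle_deg    -- counter-clockwise rotation applied to the crop (as seen on screen)
    size         -- (width, height) of the crop before rotation
    rotated_size -- (width, height) of the expanded canvas after rotation
    """
    if rotated_size is None:
        rotated_size = size
    cx, cy = size[0] / 2, size[1] / 2
    ncx, ncy = rotated_size[0] / 2, rotated_size[1] / 2
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    result = []
    for x, y in polygon:
        # 1. translate into the crop, relative to the crop's center
        dx, dy = x - offset[0] - cx, y - offset[1] - cy
        # 2. rotate (the y axis points down in image coordinates)
        rx = dx * cos_a + dy * sin_a
        ry = -dx * sin_a + dy * cos_a
        # 3. re-anchor at the center of the (possibly larger) rotated canvas
        result.append((rx + ncx, ry + ncy))
    return result
```

Without rotation this reduces to subtracting the offset. With rotation, the third step accounts for the image having grown larger than the segment.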

- move common ocropy functions into extra module
  (but with additions/improvements):
  - PIL.Image vs np.ndarray conversions
  - type and plausibility checks for line/region/page level
    (but mix absolute and relative error criteria)
  - local whitelevel estimation (but keeping exact size)
  - deskewing (but expanding image size with rotation)
  - binarization (but using larger whitelevel percentile,
    smaller whitelevel local range and zoom, and
    larger white point threshold)
  - borderclean (remove black components only in the margin)
  - black and white column separator search
  - gradmap for baseline search (but with smaller minimum size
    of boxmap and sticky top/bottom for line components that
    were chopped-off)
  - line seed search (but with horizontal merge to avoid
    splitting lines at large whitespace in the absence of
    true colseps)
  - line segmentation for regions/pages without/with colseps
    (but with larger scale estimate and tighter hscale
     for higher vertical variability of broken fonts)
  - denoising
- ocrd-tool: add default input and output file groups
- update README and setup
- version: 0.0.2 -> 0.0.3
@bertsky
Collaborator Author

bertsky commented Jun 28, 2019

The last commit ensures that the functions in common are the same as those in OCR-D/ocrd_tesserocr#48.

@bertsky
Collaborator Author

bertsky commented Jun 28, 2019

examples:

from euler_rechenkunst01_1738 page phys0006 region r_5_3 (as annotated by OCR-D-GT-SEG-LINE):

  • the cropped raw region (with components of neighbouring regions at the margins)
    OCR-D-IMG_0006_r_5_3
  • and some cropped raw lines (with components of neighbouring lines at the margins):
    OCR-D-IMG_0006_r_5_3_tl_10
    OCR-D-IMG_0006_r_5_3_tl_11
    OCR-D-IMG_0006_r_5_3_tl_12
    OCR-D-IMG_0006_r_5_3_tl_13
    OCR-D-IMG_0006_r_5_3_tl_14
  • now the binarized and deskewed region (after borderclean with margin=8):
    OCR-D-IMG-BIN_0006_r_5_3
  • and the line images masked by their polygons (i.e. rotated rectangles):
    OCR-D-IMG-BIN_0006_r_5_3__tl_10_maskedrot
    OCR-D-IMG-BIN_0006_r_5_3__tl_11_maskedrot
    OCR-D-IMG-BIN_0006_r_5_3__tl_12_maskedrot
    OCR-D-IMG-BIN_0006_r_5_3__tl_13_maskedrot
    OCR-D-IMG-BIN_0006_r_5_3__tl_14_maskedrot
  • now the region's line segmentation pixel mask (for resegmentation), overlaid by the deskewed raw region image with opacity 50% (merely for illustration):
    OCR-D-IMG-BIN_0006_r_5_3 labels+raw0 5
  • and then the line images masked by that line segmentation (resegmented):
    OCR-D-IMG-BIN_0006_r_5_3__tl_10_maskedrot_resegmented
    OCR-D-IMG-BIN_0006_r_5_3__tl_11_maskedrot_resegmented
    OCR-D-IMG-BIN_0006_r_5_3__tl_12_maskedrot_resegmented
    OCR-D-IMG-BIN_0006_r_5_3__tl_13_maskedrot_resegmented
    OCR-D-IMG-BIN_0006_r_5_3__tl_14_maskedrot_resegmented
  • recropped to the actual outlines:
    OCR-D-IMG-BIN_0006_r_5_3__tl_10_maskedrot_resegmented_cropped
    OCR-D-IMG-BIN_0006_r_5_3__tl_11_maskedrot_resegmented_cropped
    OCR-D-IMG-BIN_0006_r_5_3__tl_12_maskedrot_resegmented_cropped
    OCR-D-IMG-BIN_0006_r_5_3__tl_13_maskedrot_resegmented_cropped
    OCR-D-IMG-BIN_0006_r_5_3__tl_14_maskedrot_resegmented_cropped
  • and finally, dewarped (which also adds padding at the top and bottom):
    OCR-D-IMG-DEWARP_0006_r_5_3_tl_10
    OCR-D-IMG-DEWARP_0006_r_5_3_tl_11
    OCR-D-IMG-DEWARP_0006_r_5_3_tl_12
    OCR-D-IMG-DEWARP_0006_r_5_3_tl_13
    OCR-D-IMG-DEWARP_0006_r_5_3_tl_14

@bertsky
Collaborator Author

bertsky commented Jul 8, 2019

Please do not merge yet. I will push more changes soon that will further improve:

  • rebase common functions on processing polygon coordinates throughout (instead of just bounding boxes)
  • better encapsulation
  • resegmentation as a separate processor (not mixed into common), and also used to clip graphic/separator foreground components out of text regions (where they overlap in the annotation)
  • more improvements on ocropy line segmentation

…dd clip:

- make all common functions for image extraction respect and recreate
  the full polygon coordinates (not just the bounding box):
  - use Numpy arrays for coordinates instead of dicts
  - rename rotate_polygon → rotate_coordinates
  - factor out coordinates_of_segment for shared offset/rotation calc
  - offer extra coordinates_for_segment for the reverse direction
    (to add segmentation on lower levels)
  - factor out image_from_polygon for shared background masking
- when masking a polygon from an image, fill with the background color
  (instead of white)
- when cropping a rectangle from an image, if the rectangle extends
  beyond the image (as happens with bad segmentation when segments
  extend beyond their parents in PAGE), fill with the background color
  (instead of black)
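
To illustrate the last point, here is a minimal sketch of cropping a rectangle that may extend beyond the image, padding with a background value instead of black. It uses NumPy on a grayscale array. The name `crop_with_background` is hypothetical, and taking the median as the background estimate is an assumption made here, not necessarily what this PR does.

```python
import numpy as np

def crop_with_background(image, box, background=None):
    """Crop box = (left, top, right, bottom) from a 2-D grayscale array.

    Parts of the box outside the image are filled with `background`
    (default: the image median, assuming background pixels dominate).
    """
    left, top, right, bottom = box
    height, width = image.shape[:2]
    if background is None:
        background = np.median(image)
    # start from a canvas that is entirely background
    out = np.full((bottom - top, right - left), background, dtype=image.dtype)
    # intersect the requested box with the actual image area
    x0, y0 = max(left, 0), max(top, 0)
    x1, y1 = min(right, width), min(bottom, height)
    if x0 < x1 and y0 < y1:
        # copy the overlapping part to its position inside the canvas
        out[y0 - top:y1 - top, x0 - left:x1 - left] = image[y0:y1, x0:x1]
    return out
```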

- in various processors: start introducing DPI-based zoom parameter
- when deskewing, make sure to also create a rotated AlternativeImage
- when deskewing, ignore detected angles if the drop in variance is too
  small (as happens on tiny regions)
- when binarizing, be robust against NaN results for threshold levels
- when binarizing, do not attempt borderclean (obsolete with clip)
- when binarizing, do not attempt deskewing on page level (yet)
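
The DPI-based zoom parameter could be derived as sketched below, assuming (as this changelog notes for the line segmentation) that the existing parameters are implicitly tuned for 300 DPI. The function name and the fallback for missing DPI metadata are hypothetical.

```python
def zoom_from_dpi(dpi, reference_dpi=300.0):
    """Return the scale factor of an image's resolution relative to reference_dpi.

    Falls back to 1.0 (no zoom) when no plausible DPI value is known.
    """
    if not dpi or dpi <= 0:
        return 1.0
    return dpi / reference_dpi
```

A scale-dependent parameter given in pixels would then presumably be multiplied by this factor before use.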

- add new Processor clipping connected components from neighbouring
  segments (operating on the region or line level), which produces
  images with intruding foreground components clipped to white
- move re-segmentation from `image_from_line` or binarization/dewarping
  into extra Processor (operating on the line level), which instead of
  producing images creates shrunken, non-overlapping polygon outlines
- improve line segmentation (compute_line_labels) further:
  - use more robust state transitions from bottom to top line markers:
    project seed by delta from both bottom (up) and top (down), but
    stop short if they are closer to each other already (fill only)
  - horizontally blur bottom line markers just like top line markers
  - skip horizontally blurring the resulting seeds altogether
    (to avoid accidentally joining lines)
  - this obsoletes the large (6*hscale) horizontal blur of the gradient
  - this obsoletes the sticky option for compute_gradmap: do not extend
    the gradient from the bottom/top margins
  - make the old behaviour available with robust=False
  - fix hmerge relabelling
  - when spreading line seeds to the background, first make sure that
    connected components of the foreground remain in their majority label
  - when full_page=True,
    - add remove_hlines again, but with additional height threshold,
      and smaller width threshold default
    - when searching for black column separators,
      - reduce the vertical threshold (because vlines can be discontinuous)
      - keep only connected components that are properly contained in
        the detected region (i.e. avoid damaging neighbours)
  - make checks optional here as well
  - combine scale parameters with additional top-level zoom parameter
    (to be determined from DPI factor against implicit 300)
- improve docstrings
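
The majority-label rule for foreground components mentioned above can be illustrated with the following sketch. It is not the code of `compute_line_labels`; the function name `majority_labels` is hypothetical, and it assumes the connected components have already been labelled (for example with `scipy.ndimage.label`) and that both arrays hold non-negative integers with 0 as background.

```python
import numpy as np

def majority_labels(components, seeds):
    """Give every foreground component the line label most of its pixels carry.

    components -- integer array, 0 = background, k > 0 = k-th connected component
    seeds      -- integer array of the same shape, 0 = unlabelled, n > 0 = line n
    """
    result = seeds.copy()
    for comp in np.unique(components):
        if comp == 0:
            continue  # skip the background
        mask = components == comp
        votes = np.bincount(seeds[mask])
        votes[0] = 0  # unlabelled pixels do not vote
        if votes.any():
            result[mask] = votes.argmax()
    return result
```

Components that touch no seed at all are left unchanged, so they can still be assigned when the seeds are spread to the background afterwards.
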
@bertsky
Collaborator Author

bertsky commented Jul 16, 2019

Besides the changes already announced, this last commit includes a new processor clip which can suppress neighbouring segments (lines vs. line, or regions vs. region – including SeparatorRegion, GraphicRegion etc) by clipping connected components belonging to them to white in the segment's AlternativeImage. (I had to use image clipping instead of polygon shrinking here, because many frequent cases would create interior islands or non-contiguous polygons.) This completely obsoletes the need for the borderclean mechanism.

Due to the nature of comparison between neighbours though, the segments must not have @orientation or AlternativeImage already. That is, for region-level clipping, any region-level binarization or deskewing must come afterwards.

Line-level clipping can be seen as an alternative to resegmentation which does not depend on Ocropy segmentation.
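
As a rough illustration of the clipping idea (not the processor's actual code), the sketch below paints a connected component white inside the segment when it lies more in a neighbour's mask than in the segment's own mask. That ownership criterion, the function name `clip_neighbour`, and its parameters are assumptions made for this example. Component labels are taken as given.

```python
import numpy as np

def clip_neighbour(image, components, segment_mask, neighbour_mask, white=255):
    """Return a copy of `image` with intruding neighbour components set to white.

    image          -- 2-D grayscale array
    components     -- integer array, 0 = background, k > 0 = connected component k
    segment_mask   -- boolean array, True inside the segment being clipped
    neighbour_mask -- boolean array, True inside the neighbouring segment
    """
    clipped = image.copy()
    for comp in np.unique(components):
        if comp == 0:
            continue  # skip the background
        pixels = components == comp
        in_segment = np.count_nonzero(pixels & segment_mask)
        in_neighbour = np.count_nonzero(pixels & neighbour_mask)
        # assumed criterion: the component belongs to whoever holds more of it
        if in_segment and in_neighbour > in_segment:
            clipped[pixels & segment_mask] = white
    return clipped
```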

@bertsky
Collaborator Author

bertsky commented Jul 16, 2019

Here are some examples. Starting with the above euler_rechenkunst01_1738 page phys0006 region r_5_3 again:

  • clipping the region results in a raw image with white spots where neighbours (here r_5_2 and r_5_4) intrude:
    OCR-D-IMG-CLIP_0006_r_5_3

  • this gives an improved starting point for resegmentation:
    phys_0006_r_5_3_region_labels

  • these are the original polygon line masks annotated in the GT:
    phys_0006_r_5_3_tl_10_line_mask
    phys_0006_r_5_3_tl_11_line_mask
    phys_0006_r_5_3_tl_12_line_mask
    phys_0006_r_5_3_tl_13_line_mask
    phys_0006_r_5_3_tl_14_line_mask

  • now we have a parameter in resegmentation that allows extending them by some amount of pixels first (to compensate for too tight cropping):
    phys_0006_r_5_3_tl_10_line_mask_extended
    phys_0006_r_5_3_tl_11_line_mask_extended
    phys_0006_r_5_3_tl_12_line_mask_extended
    phys_0006_r_5_3_tl_13_line_mask_extended
    phys_0006_r_5_3_tl_14_line_mask_extended

  • this is what the intersection with the region labels looks like:
    phys_0006_r_5_3_tl_10_line_labels
    phys_0006_r_5_3_tl_11_line_labels
    phys_0006_r_5_3_tl_12_line_labels
    phys_0006_r_5_3_tl_13_line_labels
    phys_0006_r_5_3_tl_14_line_labels

  • and thus, the mask of the largest label, respectively:
    phys_0006_r_5_3_tl_10_line_mask_largest
    phys_0006_r_5_3_tl_11_line_mask_largest
    phys_0006_r_5_3_tl_12_line_mask_largest
    phys_0006_r_5_3_tl_13_line_mask_largest
    phys_0006_r_5_3_tl_14_line_mask_largest

  • finally, the recropped and dewarped images (again with extra margin at the top and bottom):
    OCR-D-IMG-DEWARP_0006_r_5_3_tl_10
    OCR-D-IMG-DEWARP_0006_r_5_3_tl_11
    OCR-D-IMG-DEWARP_0006_r_5_3_tl_12
    OCR-D-IMG-DEWARP_0006_r_5_3_tl_13
    OCR-D-IMG-DEWARP_0006_r_5_3_tl_14
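
The mask steps shown in these examples (extend the line mask, intersect it with the region labels, keep the largest label) can be sketched as follows. This is a simplified NumPy illustration, not the processor's code: the function names are hypothetical, the extension uses a plain square neighbourhood, and the fallback for a mask that hits no label is an assumption.

```python
import numpy as np

def extend_mask(mask, pixels):
    """Grow a boolean mask by `pixels` in every direction (square dilation)."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, pixels, mode="constant", constant_values=False)
    height, width = mask.shape
    out = np.zeros_like(mask)
    for dy in range(2 * pixels + 1):
        for dx in range(2 * pixels + 1):
            out |= padded[dy:dy + height, dx:dx + width]
    return out

def largest_label_mask(line_mask, region_labels, extend=0):
    """Restrict a line mask to the region label it overlaps most."""
    mask = extend_mask(line_mask, extend)
    # intersection of the (extended) line mask with the region's line labels
    labels = np.where(mask, region_labels, 0)
    counts = np.bincount(labels.ravel())
    counts[0] = 0  # background does not count as a label
    if not counts.any():
        return mask  # assumed fallback: keep the mask if no label is hit
    return labels == counts.argmax()
```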

@bertsky
Collaborator Author

bertsky commented Jul 16, 2019

Here are some more clipping examples from euler_rechenkunst01_1738:

  • page phys0003 region TextRegion_1479743994178_731:
    OCR-D-IMG-CLIP_0003_TextRegion_1479743994178_731

  • page phys0004 region r_6_2 (not perfect due to suboptimal binarization):
    OCR-D-IMG-CLIP_0004_r_6_2

  • page phys0004 region r_9_3:
    OCR-D-IMG-CLIP_0004_r_9_3

  • page phys0004 region r_9_4:
    OCR-D-IMG-CLIP_0004_r_9_4

  • page phys0005 region r_8_5:
    OCR-D-IMG-CLIP_0005_r_8_5

  • page phys0005 region TextRegion_1475759982805_45:
    OCR-D-IMG-CLIP_0005_TextRegion_1475759982805_45

  • page phys0005 region TextRegion_1475759982805_44:
    OCR-D-IMG-CLIP_0005_TextRegion_1475759982805_44

  • page phys0006 region r_5_2 (just above the first example):
    OCR-D-IMG-CLIP_0006_r_5_2

  • page phys0006 region r_5_4 (just below the first example):
    OCR-D-IMG-CLIP_0006_r_5_4

@wrznr

wrznr commented Jul 16, 2019

@finkf Please review and/or merge!

@finkf
Contributor

finkf commented Jul 16, 2019

I am OK to merge this. But wouldn't it make more sense to separate all the ocropus-related parts of ocrd-cis into their own ocrd module?

@bertsky
Collaborator Author

bertsky commented Jul 16, 2019

Splendid! Definitely, it does make sense to have a separate module with all Ocropy based processors. And that's exactly what we are planning to do. This is just the first step, so others can start experimenting already. Remember, I started this in your repo because of all your good work on the recognition processor.

The idea is to restructure the repo OCR-D/ocropy into 2 packages, ocrolib (current ocrolib modules plus non-trivial functions in current CLIs) and ocropus (CLIs only, and packaged properly, possibly ported to click). (Both names are still free in PyPI!) That fork is then meant to live on as an improved, Python 3 ported, better encapsulated incarnation, which can probably be PRed back into tmbdev/ocropy.

Moreover, OCR-D/ocrd_ocropy can then become a simple OCR-D wrapper based on ocrolib.

Based on this, I will re-commit all changes not strictly related to wrappers I made here to the new ocrolib. Finally, I will introduce the new processors into OCR-D/ocrd_ocropy, and also rewrite the segmentation processor (the only one I have not touched yet).

@finkf
Contributor

finkf commented Jul 16, 2019

OK. I'll merge then.

@finkf finkf merged commit fab1e8d into cisocrgroup:dev Jul 16, 2019
@bertsky
Collaborator Author

bertsky commented Jul 16, 2019

thx!
