Mapping a set of zones generated by a segmentation algorithm to the regions generated by an OCR engine

chulwoopack/Zone2OCR

Zone2OCR

Zone2OCR is a tool for document layout analysis. It maps a set of zones generated by a segmentation algorithm (e.g., dhSegment) to the regions generated by an OCR engine.

Installation

  1. Clone this repository
  2. Install Anaconda or Miniconda (installation procedure)
  3. Create a virtual environment and activate it
conda create -n <ENV_NAME> python=3.6
conda activate <ENV_NAME>

(Optional) To run the segmentation algorithm (dhSegment) pretrained on ImageNet + Europeana historical Newspaper Project, first install TensorFlow 1.13 with

# For cpu
conda install -c conda-forge tensorflow=1.13
# For gpu
conda install tensorflow-gpu=1.13.1

and then install dhSegment dependencies with

pip install ./dhsegment/.
  4. Install Zone2OCR dependencies with
pip install .

Usage

  1. Prepare a file structure like the one below. (Note: every segmentation result XML file must have a matching OCR XML file.)
.root
├── zone_xmls     # segmentation results
│   ├── image1.xml  
│   ├── ...
│   └── image8.xml
├── ocr_xmls      # OCR results
│   ├── image1.xml  
│   ├── ...
│   └── image8.xml
└── images        # (optional) images for visual inspection
    ├── image1.jpg  
    ├── ...
    └── image8.jpg
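
Before running the mapping, it can help to confirm that the two XML folders really pair up. The following is a minimal sketch, not part of Zone2OCR: the helper name `unmatched_xmls` is hypothetical, and it assumes that "matching" means identical file names, as in the tree above.

```python
from pathlib import Path


def unmatched_xmls(zone_dir, ocr_dir):
    """Return (zone-only, OCR-only) XML file names, each list sorted.

    Assumption: a segmentation file and an OCR file "match" when they
    share the same file name (e.g. image1.xml in both folders).
    """
    zone_names = {p.name for p in Path(zone_dir).glob("*.xml")}
    ocr_names = {p.name for p in Path(ocr_dir).glob("*.xml")}
    return sorted(zone_names - ocr_names), sorted(ocr_names - zone_names)


if __name__ == "__main__":
    # Folder names follow the example tree; adjust them to your own root.
    zone_only, ocr_only = unmatched_xmls("zone_xmls", "ocr_xmls")
    for name in zone_only:
        print("no OCR XML for", name)
    for name in ocr_only:
        print("no zone XML for", name)
```

Two empty lists mean every file in one folder has a counterpart in the other.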

(Optional) Run pretrained dhSegment to collect segmentation result xml files

python run_segmentation.py -i <IMAGE_DIR> -s <SAVE_DIR> [-t <SMALL_REGION_THRESHOLD>] [-v (True|False)]
  • -i: The path to the folder containing the images to be processed
  • -s: The path to the folder in which to store the output XML files
  • -t: (Optional) A threshold, in [0,1], on the ratio area(zone)/area(full_page); zones below it are ignored as too small (default: 0.005)
  • -v: (Optional) Increase output verbosity (default: False)
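
To make the `-t` ratio concrete, here is a minimal sketch of an area-ratio filter. It is an illustration only: the function names are hypothetical, zones are assumed to be lists of (x, y) points, and whether `run_segmentation.py` compares with `>=` or `>` is not stated in this README.

```python
def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    area = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0


def drop_small_zones(zones, page_width, page_height, threshold=0.005):
    """Keep zones whose area(zone)/area(full_page) ratio reaches the threshold.

    Assumption: the comparison is inclusive (>=); the real tool may differ.
    """
    page_area = float(page_width * page_height)
    return [z for z in zones if polygon_area(z) / page_area >= threshold]


if __name__ == "__main__":
    big = [(0, 0), (100, 0), (100, 100), (0, 100)]   # ratio 0.01 on a 1000x1000 page
    tiny = [(0, 0), (50, 0), (50, 50), (0, 50)]      # ratio 0.0025 on the same page
    print(len(drop_small_zones([big, tiny], 1000, 1000)))
```

With the default of 0.005, a zone must cover at least half a percent of the page to be kept.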
  2. Run mapping
python zone2ocr.py -zx <ZONE_XML_DIR> -ox <OCR_XML_DIR> [-t <IOU_THRESHOLD>] -s <SAVE_DIR> [-v (True|False)]
  • -zx: The path to the folder containing segmentation result xml files
  • -ox: The path to the folder containing OCR xml files
  • -t: (Optional) An intersection-over-union threshold, in [0,1], for ignoring small zones (default: 0.1)
  • -s: The path to the folder to store output JSON file
  • -v: (Optional) Increase output verbosity (default: False)
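
The intersection-over-union threshold can be illustrated with a short sketch. This is not the code in `zone2ocr.py`: it assumes each zone or OCR region is a list of (x, y) points and simplifies them to axis-aligned bounding boxes, whereas the tool itself may intersect the polygons directly. All function names here are hypothetical.

```python
def bounding_box(points):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a point list."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def match_regions(zone, ocr_regions, threshold=0.1):
    """Return the OCR regions whose IoU with the zone reaches the threshold."""
    zone_box = bounding_box(zone)
    return [r for r in ocr_regions if iou(zone_box, bounding_box(r)) >= threshold]


if __name__ == "__main__":
    zone = [(0, 0), (10, 0), (10, 10), (0, 10)]
    inside = [(0, 0), (10, 0), (10, 5), (0, 5)]        # covers half of the zone
    corner = [(9, 9), (20, 9), (20, 20), (9, 20)]      # barely touches the zone
    print(match_regions(zone, [inside, corner]))
```

Raising the threshold makes the mapping stricter, so fewer OCR regions are attached to each zone.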

Remark

  • Both the segmentation result and OCR XML files must follow the PAGE XML schema
  • Output JSON file follows the below structure:
[
  {
    "zone_coord" : [
      [x1,y1],[x2,y2],[x3,y3],[x4,y4]              // Found zone 1
    ],
    "zone_texts": [
      "text1",                                     // Matched OCR zone 1's text contents within the zone 1
      "text2",                                     // Matched OCR zone 2's text contents within the zone 1
      ...
    ],
    "ocr_coord" : [
      [
        [x1,y1],[x2,y2],[x3,y3],[x4,y4]            // Matched OCR zone 1
      ],
      [
        [x1',y1'],[x2',y2'],[x3',y3'],[x4',y4']    // Matched OCR zone 2
      ],
      ...
    ],
    "ocr_texts" : [
      "text1",                                     // Matched OCR zone 1's text contents
      "text2",                                     // Matched OCR zone 2's text contents
      ...
    ]
  },
  {
    ...                                            // Found zone 2
  },
  ...
]
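
As a rough illustration of how this output might be consumed, the sketch below loads a result file and summarizes each zone. The key names come from the structure above; the helper names and the file path `output/image1.json` are hypothetical, since this README does not state how output files are named.

```python
import json


def load_zones(json_path):
    """Load the list of zone dictionaries from an output JSON file."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


def summarize(zones):
    """Return one (zone index, matched OCR region count, joined OCR text) tuple per zone."""
    return [
        (i, len(z.get("ocr_coord", [])), " ".join(z.get("ocr_texts", [])))
        for i, z in enumerate(zones)
    ]


if __name__ == "__main__":
    # Hypothetical path: point this at a file written to your <SAVE_DIR>.
    for index, n_regions, text in summarize(load_zones("output/image1.json")):
        print(index, n_regions, text[:60])
```

Using `.get` with an empty default keeps the summary from failing on a zone that has no matched OCR regions.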

Authors

Acknowledgements

The main parts of the dhSegment code are adapted from the work of Benoit Seguin and Sofia Ares Oliveira (DHLAB, EPFL): https://github.com/dhlab-epfl/dhSegment

License

This project is licensed under the GPL License - see the LICENSE file for details
