[ocr] Research ways of cutting printed statements into smaller subsections #15

LucianU opened this Issue Nov 24, 2016 · 1 comment



LucianU commented Nov 24, 2016

Since OCR gives better results on printed statements, we want to cut those statements into pieces of text that we can feed to the Google OCR API.

An example statement: [011230]ready//DA_2016-06-10_PONTA%20VICTOR%20VIOREL_70438189.pdf

We first want to get the tables. Here we need to find out how to connect tables that start on one page and finish on another. Then we take each table and cut out the cells, keeping a reference to the column each cell belongs to.
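One possible heuristic for connecting tables across pages, sketched below, is to treat the first table on a page as a continuation of the previous table when their column borders line up. This is only an assumption to start the research from, not a verified method: the `columns` and `rows` keys, the `tolerance` value, and both function names are hypothetical, and the table detection itself is assumed to have happened already.

```python
from typing import Dict, List


def same_columns(a: List[int], b: List[int], tolerance: int = 5) -> bool:
    """True when both tables have the same number of column borders
    and each border is within `tolerance` pixels of its counterpart."""
    return len(a) == len(b) and all(abs(x - y) <= tolerance for x, y in zip(a, b))


def merge_continued_tables(pages: List[List[Dict]]) -> List[Dict]:
    """Merge tables that appear to continue from one page to the next.

    `pages` is a list of pages; each page is a list of tables in
    top-to-bottom order; each table is a dict with hypothetical keys
    "columns" (x positions of the column borders) and "rows".
    """
    merged: List[Dict] = []
    for page in pages:
        for index, table in enumerate(page):
            # Only the first table on a page can continue the previous table.
            continues = (
                index == 0
                and merged
                and same_columns(merged[-1]["columns"], table["columns"])
            )
            if continues:
                merged[-1]["rows"].extend(table["rows"])
            else:
                # Copy the lists so the input pages are not modified.
                merged.append({"columns": list(table["columns"]),
                               "rows": list(table["rows"])})
    return merged
```

A repeated header row on the continuation page, or two unrelated tables that happen to share a layout, would defeat this check, so it would need testing against real statements.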


  • convert PDFs to images
  • cut the tables out
  • take each table and cut the cells out
  • feed the cells to Google OCR API and build a text representation of the statement
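The cell-cutting step could be sketched as follows, assuming the table has visible ruling lines and the page image has already been converted to a binary grid (1 for a dark pixel, 0 for a light one). The function names and the `min_fill` threshold are illustrative assumptions; a real scan would also need deskewing and thresholding, which are not shown, and the call to the Google OCR API is left out.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # 1 = dark pixel, 0 = light pixel


def ruling_lines(profile: List[int], full: int, min_fill: float = 0.9) -> List[int]:
    """Indices whose dark-pixel count covers at least `min_fill` of the span."""
    return [i for i, count in enumerate(profile) if count >= min_fill * full]


def gaps(lines: List[int]) -> List[Tuple[int, int]]:
    """Spans (start, end exclusive) that lie between consecutive ruling lines."""
    spans = []
    for a, b in zip(lines, lines[1:]):
        if b - a > 1:  # adjacent indices belong to the same thick line
            spans.append((a + 1, b))
    return spans


def cut_cells(table: Image) -> List[Dict]:
    """Cut a table image into cells, keeping each cell's row and column index."""
    height, width = len(table), len(table[0])
    row_profile = [sum(row) for row in table]
    col_profile = [sum(table[y][x] for y in range(height)) for x in range(width)]
    row_spans = gaps(ruling_lines(row_profile, width))
    col_spans = gaps(ruling_lines(col_profile, height))
    cells = []
    for r, (top, bottom) in enumerate(row_spans):
        for c, (left, right) in enumerate(col_spans):
            crop = [row[left:right] for row in table[top:bottom]]
            cells.append({"row": r, "column": c,
                          "box": (left, top, right, bottom),
                          "pixels": crop})
    return cells
```

Each returned cell carries its `column` index, so the OCR output for the crop can later be matched to the column header it belongs to.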

The final version should look like a tree:

- name: Victor Ponta
  position: DEPUTAT
    - location
    - area
    - location
    - area
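A rough sketch of how OCR'd cells might be assembled into such a tree, assuming each cell carries its row index, column index, and recognized text. The `entries` key, the cell dictionary layout, and both function names are assumptions made for illustration; the issue does not name the child list.

```python
from typing import Dict, List


def cells_to_entries(header: List[str], cells: List[dict]) -> List[Dict[str, str]]:
    """Group cells by row and label each value with its column header."""
    rows: Dict[int, Dict[str, str]] = {}
    for cell in cells:
        field = header[cell["column"]]
        rows.setdefault(cell["row"], {})[field] = cell["text"]
    # Return the rows in top-to-bottom order.
    return [rows[index] for index in sorted(rows)]


def build_statement(name: str, position: str, entries: List[Dict[str, str]]) -> dict:
    """Wrap the table entries in the top-level node of the tree."""
    return {"name": name, "position": position, "entries": entries}
```

For example, with placeholder cell texts:

```python
cells = [
    {"row": 0, "column": 0, "text": "location A"},
    {"row": 0, "column": 1, "text": "area A"},
    {"row": 1, "column": 0, "text": "location B"},
    {"row": 1, "column": 1, "text": "area B"},
]
statement = build_statement("Victor Ponta", "DEPUTAT",
                            cells_to_entries(["location", "area"], cells))
```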

LucianU added the help wanted label Nov 24, 2016

