[ocr] Research ways of cutting printed statements into smaller subsections #15

Open
LucianU opened this Issue Nov 24, 2016 · 1 comment

LucianU commented Nov 24, 2016

Since OCR gives better results on printed statements, we want to cut each statement into the pieces that contain text, so that we can feed those pieces to the Google OCR API.

An example statement: http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1119_186165-C77-I780_5-ANI-L524-00001[011230]ready//DA_2016-06-10_PONTA%20VICTOR%20VIOREL_70438189.pdf

We first want to extract the tables. Here we need to find out how to connect a table that starts on one page and finishes on another. Then we take each table and cut out its cells, keeping a reference to the column each cell belongs to.
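One possible heuristic for connecting a table that spans two pages, sketched below under stated assumptions: table detection has already produced fragments in reading order, and each fragment knows its page, its column count, and whether its first row looks like a header. The `TableFragment` type and `join_fragments` function are hypothetical names for illustration, not part of any existing code in this project. A fragment is treated as a continuation when it is the first table on the page right after the previous table's last page, has no header row, and has the same number of columns.

```python
from dataclasses import dataclass, field


@dataclass
class TableFragment:
    page: int          # page the fragment was found on (last page once merged)
    columns: int       # number of columns detected in the fragment
    has_header: bool   # True if the first row looks like a header row
    rows: list = field(default_factory=list)


def join_fragments(fragments):
    """Merge fragments that look like continuations of the previous table.

    `fragments` is assumed to be in reading order.
    """
    tables = []
    seen_pages = set()
    for frag in fragments:
        first_on_page = frag.page not in seen_pages
        seen_pages.add(frag.page)
        prev = tables[-1] if tables else None
        is_continuation = (
            prev is not None
            and first_on_page
            and not frag.has_header
            and frag.columns == prev.columns
            and frag.page == prev.page + 1
        )
        if is_continuation:
            prev.rows.extend(frag.rows)
            prev.page = frag.page  # the merged table now ends on this page
        else:
            # Copy so the caller's fragments are not mutated by later merges.
            tables.append(
                TableFragment(frag.page, frag.columns, frag.has_header, list(frag.rows))
            )
    return tables
```

The heuristic would fail for a table whose header is repeated on every page, so it should be checked against real statements before being relied on.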

Steps:

  • convert PDFs to images
  • cut the tables out
  • take each table and cut the cells out
  • feed the cells to the Google OCR API and build a text representation of the statement
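The cell-cutting step above could be approached by looking for the table's ruling lines. The following is a minimal sketch, assuming the table has already been cropped and binarized into a 2D list of 0/1 pixels (1 = ink) and that the table is drawn with full-width and full-height ruling lines. The function names and the `min_fill` threshold are assumptions for illustration. Each cell is keyed by `(row, column)`, which keeps the reference to the column it belongs to.

```python
def find_rulings(profile, length, min_fill=0.9):
    """Return the centre index of each run of lines that are almost fully inked.

    `profile[i]` is the ink count of line i; `length` is the line's pixel length.
    """
    rulings, start = [], None
    for i, count in enumerate(profile + [0]):  # sentinel closes a trailing run
        if count >= min_fill * length:
            if start is None:
                start = i
        elif start is not None:
            rulings.append((start + i - 1) // 2)
            start = None
    return rulings


def cell_boxes(image):
    """Map (row, column) to a (top, left, bottom, right) box for each cell.

    `bottom` and `right` are exclusive, so a box can be used directly for slicing.
    """
    height, width = len(image), len(image[0])
    row_profile = [sum(row) for row in image]
    col_profile = [sum(image[y][x] for y in range(height)) for x in range(width)]
    ys = find_rulings(row_profile, width)    # horizontal ruling lines
    xs = find_rulings(col_profile, height)   # vertical ruling lines
    boxes = {}
    for r, (top, bottom) in enumerate(zip(ys, ys[1:])):
        for c, (left, right) in enumerate(zip(xs, xs[1:])):
            boxes[(r, c)] = (top + 1, left + 1, bottom, right)
    return boxes
```

Scanned pages are rarely perfectly straight, so a real implementation would probably need deskewing and a more tolerant line detector before this kind of projection works reliably.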

The final version should look like a tree:

- name: Victor Ponta
  position: DEPUTAT
  institution: PARLAMENTUL ROMÂNIEI
  lands:
    - location
    - area
    ...
  buildings:
    - location
    - area
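A rough sketch of how the OCR'd cells might be assembled into such a tree, assuming each table has already been reduced to a list of column names plus rows of recognized cell text. The `build_statement` function and its input shape are hypothetical, chosen only to illustrate the idea.

```python
def build_statement(person, tables):
    """Combine top-level fields and per-section tables into one nested dict.

    `person` holds fields such as name, position and institution.
    `tables` maps a section name (e.g. "lands") to a pair
    (column_names, rows), where each row is a list of cell texts.
    """
    statement = dict(person)
    for section, (columns, rows) in tables.items():
        # Pair every cell with the name of the column it was cut from.
        statement[section] = [dict(zip(columns, row)) for row in rows]
    return statement
```

The resulting dict can then be serialized to YAML or JSON, whichever the rest of the pipeline prefers.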

LucianU added the help wanted label Nov 24, 2016
