Tesseract 3.03-rc1 and newer git versions have basic integrated(!) mixed-mode single-page PDF rendering support #85

Closed
Wikinaut opened this Issue Sep 12, 2014 · 5 comments

Wikinaut commented Sep 12, 2014

[UPDATED: I removed the information about the tessedit_pdf_compression parameter, which has been recently removed in the HEAD branch version of tesseract.]

Thanks

Dear developers of OCRmyPDF, first of all, thank you for your impressive and great work.

News

In the meantime, after your publications in autumn 2013 and later in the German magazine "c't", the Tesseract developers integrated similar PDF output support into their code starting in February 2014, which makes - not in all, but in many standard cases - the OCRmyPDF framework obsolete, in my personal view.

The new feature can currently only be used if you check out Tesseract from their source in Google git (https://code.google.com/p/tesseract-ocr/ explains how).

Searchable PDF output is a standard feature as of Tesseract version 3.03

tesseract phototest.png phototest pdf
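
A script that relies on the pdf config may want to check the installed version first. Here is a minimal sketch, assuming the version string looks like 3.03, 3.03-rc1 or 3.04.00. The function name tesseract_supports_pdf is hypothetical and not part of Tesseract:

```shell
# Hypothetical helper (not part of Tesseract): succeeds if the given
# version string is 3.03 or newer, i.e. new enough for "pdf" output.
# Assumes a "major.minor[.patch][-suffix]" layout with a two-digit minor.
tesseract_supports_pdf() {
    version="${1%%-*}"        # drop a suffix such as "-rc1"
    major="${version%%.*}"    # text before the first dot
    rest="${version#*.}"      # text after the first dot
    minor="${rest%%.*}"       # second component, e.g. "03"
    minor="${minor#0}"        # strip one leading zero: "03" -> "3"
    minor="${minor:-0}"       # "0" became empty above, restore it
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 3 ]; }
}
```

One way to obtain the version string is `tesseract --version 2>&1 | awk 'NR==1 {print $2}'`, assuming the first output line has the form "tesseract 3.03". As far as I know older builds print it on stderr, hence the redirect.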

Framework for multi-page PDFs

Tesseract cannot process multi-page PDFs as input.

Here is a framework example:

  • split the input (use pdftk) and convert each page (use convert from ImageMagick) into a single-page PNG image (losslessly encoded)
  • per page: run the Tesseract OCR process and create a single-page mixed-mode PDF
  • merge (use pdftk) the single-page mixed-mode PDFs into the multi-page mixed-mode PDF

# $tmpdir, $density, $depth, $tessdatadir and $language must be set beforehand
pdftk infile.pdf burst output "$tmpdir/page_%03d.pdf"
imagetype="png"

for file in "$tmpdir"/page_*.pdf
do
    # strip the .pdf suffix so that the OCR output is named page_NNN.pdf again
    base="${file%.pdf}"
    image="$base.$imagetype"
    convert -density "$density" -depth "$depth" "$file" "$image"
    rm "$file"
    # tesseract appends ".pdf" to the output base name
    tesseract "$image" "$base" --tessdata-dir "$tessdatadir" -l "$language" pdf
    rm "$image"
done

# merge only the per-page files, so that tmp.pdf itself is never an input
pdftk "$tmpdir"/page_*.pdf cat output "$tmpdir/tmp.pdf"

@Wikinaut Wikinaut changed the title from [IMPORTANT TO KNOW] Tesseract 3.03 has integrated support mixed-mode PDF rendering and makes the whole OCRmyPDF framework in many cases obsolete to [IMPORTANT TO KNOW] Tesseract 3.03 has integrated support mixed-mode PDF rendering and makes this OCRmyPDF framework in many cases obsolete Sep 12, 2014

@Wikinaut Wikinaut changed the title from [IMPORTANT TO KNOW] Tesseract 3.03 has integrated support mixed-mode PDF rendering and makes this OCRmyPDF framework in many cases obsolete to [IMPORTANT TO KNOW] Tesseract 3.03 has integrated support mixed-mode PDF rendering and may make this OCRmyPDF framework in many cases obsolete Sep 12, 2014

@Wikinaut Wikinaut changed the title from [IMPORTANT TO KNOW] Tesseract 3.03 has integrated support mixed-mode PDF rendering and may make this OCRmyPDF framework in many cases obsolete to [IMPORTANT TO KNOW] Tesseract 3.03 has integrated support mixed-mode PDF rendering. May make this OCRmyPDF framework in many cases obsolete Sep 12, 2014

@Wikinaut Wikinaut changed the title from [IMPORTANT TO KNOW] Tesseract 3.03 has integrated support mixed-mode PDF rendering. May make this OCRmyPDF framework in many cases obsolete to [IMPORTANT TO KNOW] Tesseract 3.03 has integrated(!) mixed-mode PDF rendering support. May make this OCRmyPDF framework in many cases obsolete Sep 12, 2014

Owner

fritz-hh commented Sep 24, 2014

Hi Wikinaut,

Thank you for your message.

Tesseract released version 3.03-rc1 (i.e. release candidate 1) on February 4, 2014 (https://code.google.com/p/tesseract-ocr/wiki/ReleaseNotes).
According to the Tesseract team, this pre-release is for developers and testers only (see this thread: https://groups.google.com/d/msg/tesseract-ocr/er3ONslwbEE/IQozlErxz9sJ).
For unclear reasons it has been shipped in some Linux distributions (e.g. Ubuntu) and is advertised as Tesseract 3.03 (even though it is 3.03-rc1).

Nevertheless, most Linux/Unix distributions do not yet have a Tesseract package that supports PDF generation.

Even after the next official version of Tesseract is released (and no release date has been announced yet), I believe that OCRmyPDF will be of strong interest to many users. Indeed, it provides many functions that (to my knowledge) won't be part of the next Tesseract release:

  • Generation of multi-page PDF/A-1 files (i.e. meeting the requirements for long-term archiving)
  • Fast generation, as it makes use of all CPU cores instead of just one
  • Keeps the exact resolution of the original embedded images
  • If required, deskews and/or cleans the image before performing OCR
  • Provides a debug mode to enable easy verification of the OCR results
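
To illustrate the multi-core point for readers who script Tesseract directly: the per-page OCR calls are independent, so they can be run side by side. The following is only a rough sketch and not how OCRmyPDF does it internally. The function name ocr_pages_parallel and the page_*.png naming are assumptions, and -0 and -P are GNU/BSD xargs extensions rather than POSIX:

```shell
# Hypothetical helper: OCR all page images of a directory in parallel.
#   $1 = directory containing page_*.png (assumed naming scheme)
#   $2 = Tesseract language code, e.g. deu
#   $3 = number of tesseract processes to run at the same time
ocr_pages_parallel() {
    dir="$1"; lang="$2"; jobs="$3"
    # One tesseract call per image; xargs keeps up to $jobs of them running.
    # Inside sh -c, $1 is the image path and $2 the language.
    find "$dir" -name 'page_*.png' -print0 |
        xargs -0 -P "$jobs" -I{} \
            sh -c 'tesseract "$1" "${1%.png}" -l "$2" pdf' _ {} "$lang"
}
```

A call could look like `ocr_pages_parallel "$tmpdir" deu "$(getconf _NPROCESSORS_ONLN)"`, which starts one process per online CPU core on systems where that getconf variable is available.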

For the time being I will keep your ticket open, as it might be helpful for some users to know that 3.03-rc1 supports single-page PDF generation from images.

@fritz-hh fritz-hh changed the title from [IMPORTANT TO KNOW] Tesseract 3.03 has integrated(!) mixed-mode PDF rendering support. May make this OCRmyPDF framework in many cases obsolete to Tesseract 3.03-rc1 has integrated(!) mixed-mode singlepage PDF rendering support Sep 24, 2014

Wikinaut commented Oct 9, 2014

@fritz-hh thanks for your detailed and correct analysis. I also think that the OCRmyPDF framework offers more options but also requires more resources (dependencies).

The main purpose of my contribution is to point you and other potential users to the (basic, single-page) PDF support that is now integrated in Tesseract - the original c't article, the follow-up articles, and also the OCRmyPDF framework were and are silent about this fact. That is understandable, because the OCRmyPDF framework was developed earlier than Tesseract's new pdf option.

@Wikinaut Wikinaut changed the title from Tesseract 3.03-rc1 has integrated(!) mixed-mode singlepage PDF rendering support to Tesseract 3.03-rc1 and newer versions have integrated(!) mixed-mode singlepage PDF rendering support Oct 9, 2014

@Wikinaut Wikinaut changed the title from Tesseract 3.03-rc1 and newer versions have integrated(!) mixed-mode singlepage PDF rendering support to Tesseract 3.03-rc1 and newer git versions have integrated(!) mixed-mode singlepage PDF rendering support Oct 9, 2014

@Wikinaut Wikinaut changed the title from Tesseract 3.03-rc1 and newer git versions have integrated(!) mixed-mode singlepage PDF rendering support to Tesseract 3.03-rc1 and newer git versions have basic integrated(!) mixed-mode single-page PDF rendering support Oct 9, 2014

Collaborator

jbarlow83 commented Jul 28, 2015

In OCRmyPDF v3.0-rc2 we're now taking advantage of Tesseract's improved (single page only) PDF output to improve the overall results of OCRmyPDF. Tesseract doesn't do everything we need.

@jbarlow83 jbarlow83 closed this Jul 28, 2015

Wikinaut commented Jul 28, 2015

@jbarlow83 thanks for the info.

Wikinaut commented Sep 6, 2015

Apparently, the Tesseract PDF rendering mode proposed in the present issue (Tesseract versions > 3.02 can generate mixed-mode PDFs directly) can be used by starting OCRmyPDF with the command-line option

--pdf-renderer tesseract

(introduced in OCRmyPDF version 3.0), as in this example:

ocrmypdf -l deu --pdf-renderer tesseract infile.pdf outfile.pdf

See https://github.com/fritz-hh/OCRmyPDF/blob/master/RELEASE_NOTES.rst
@jbarlow83 Thanks for implementing this!
