New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hangs on large PDFs #33

Closed
thebenedict opened this Issue Aug 6, 2014 · 10 comments

Comments

Projects
None yet
3 participants
@thebenedict

thebenedict commented Aug 6, 2014

I'm seeing the Python package hang on large PDFs. To reproduce, use a large pdf like this World Bank Annual Report (pdf) and run textract.process('filename.pdf'). I get no output but the command doesn't complete.

Popen.wait() in shell.py is filling up the OS buffer. Switching to Popen.communicate() as suggested in the link above fixes the issue, but since communicate() returns the output (and error) text directly, I had to modify shell and the pdf parser to return text instead of a pipe object.

Happy to submit a pull request, but is it wise to change the return value of shell (and therefore need to change every parser)? Is there something else to return that we can call sdtout.read() on?

FWIW, this post suggests using tempfile.TemporaryFile() to work around the buffer issue, but I couldn't figure out how to call pipe.stdout.read() with the tempfile.

@christomitov

This comment has been minimized.

Contributor

christomitov commented Aug 7, 2014

@thebenedict Found this on the issue: http://stackoverflow.com/questions/1180606/using-subprocess-popen-for-process-with-large-output

Going to play with the solutions proposed by Glenn Maynard there and see what happens.

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Aug 7, 2014

Well that's a bummer. Are you sure the problem is that python is hanging or
is it just that extracting the text from PDFs takes a long time?

The reason I ask is that --- depending on how the PDF parser extracts text
--- this can take a long time. e.g.,
euske/pdfminer#61

If you just run pdftotext filename.pdf, is that fast?
On Aug 6, 2014 10:57 PM, "Christo Mitov" notifications@github.com wrote:

@thebenedict https://github.com/thebenedict Found this on the issue:
http://stackoverflow.com/questions/1180606/using-subprocess-popen-for-process-with-large-output

Going to play with the two solutions proposed by Glenn Maynard there and
see what happens.

Reply to this email directly or view it on GitHub
#33 (comment)
.

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Aug 7, 2014

Also, for what its worth, I'm not opposed to having the shell.run function return a string instead of a pipe object. That choice wasn't...er hem...carefully made

(sorry I can't be more help; in the airport at the moment)

@thebenedict

This comment has been minimized.

thebenedict commented Aug 7, 2014

Fairly sure it's hanging. CPU is at 0 when it stops, and both pdftotext filename.pdf and pdf2text filename.pdf complete in a second or two in ipython shell.

I'll try to take another look at it later today. Assuming it's a buffer issue, a tempfile seems like a good approach.

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Aug 7, 2014

{{kicks floor. cloud of dust drifts in the wind}}

Bummer.

@christomitov

This comment has been minimized.

Contributor

christomitov commented Aug 7, 2014

@deanmalmgren @thebenedict Alternatively we can just remove the .wait() call as waiting for the process to finish just to read the pipe anyway is not needed. I can open a PR if you like, I basically just removed the .wait(), and the returncode check and it doesn't hang. Ideally the returncode check will have to happen after the data has been read.

@thebenedict

This comment has been minimized.

thebenedict commented Aug 7, 2014

@christomitov @deanmalmgren Sounds good to me -- is there a branch where I can try it out? I'd like to see what that does when there actually is an error in the pdf.

@christomitov

This comment has been minimized.

Contributor

christomitov commented Aug 7, 2014

@thebenedict Here you go , to do the returncode check will probably for sure require a .poll() to check if the process has terminated.

EDIT: I think we're just going to have to switch to use .communicate() like you originally suggested.. there doesn't seem to be another way to retain exceptions based on return codes otherwise.

@deanmalmgren deanmalmgren added the bug label Aug 9, 2014

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Aug 9, 2014

Thanks for looking into this. Sorry I've been neglecting it; I got busy the past few days and haven't had a chance to properly take a look.

All things being equal, I'd prefer to keep the return code check in the run function to make this as clean as possible. If it makes sense to refactor and have run return a string instead of a pipe object, I am 100% supportive of that change.

I'll try to take a closer look over the weekend and hopefully we can release a patch to fix this issue ASAP

@thebenedict

This comment has been minimized.

thebenedict commented Aug 9, 2014

Sweet, thanks @deanmalmgren. Good hanging with you both.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment