max() arg is an empty sequence error on PDFs with blank pages, is there a skip empty page option? #179

Closed
aflip opened this issue Aug 10, 2020 · 1 comment · Fixed by #189
aflip commented Aug 10, 2020

  1. Camelot is great! Thank you
  2. I'm using Camelot to extract IDSP data. I am a physician, trying to see if an epidemic calendar can be made using this data.
  3. When a file has empty pages after the tables, Camelot aborts the run and raises ValueError: max() arg is an empty sequence, instead of skipping the empty page and moving on to the next one.

The PDF that triggers this is also attached.
5.pdf

When row_tol is not specified, Camelot reports an error but still parses the file and extracts the other tables, like so:

[image]

But once row_tol is set, it doesn't return the other tables.

So a feature that lets me skip empty pages would help.

Some PDFs have a few empty pages between the tables, and when I'm processing thousands of PDFs, it's impossible to keep changing the parameters for each one.
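Until such an option exists, one possible workaround is to read the PDF one page at a time and skip any page that raises this error. The sketch below is only an illustration, not part of Camelot: `read_pages_skipping_empty` and `camelot_reader` are hypothetical helper names, the caller has to supply the page numbers, and matching on the `max()` prefix of the error message is an assumption about how the failure surfaces.

```python
def read_pages_skipping_empty(read_page, page_numbers):
    """Collect tables page by page, skipping pages that hit the max() error.

    read_page is any callable that takes a page number and returns an
    iterable of tables (hypothetical interface, for illustration only).
    """
    tables, skipped = [], []
    for page in page_numbers:
        try:
            tables.extend(read_page(page))
        except ValueError as exc:
            # Assumption: the empty-page failure is the ValueError raised by
            # max() on an empty input. Anything else is re-raised untouched.
            if not str(exc).startswith("max()"):
                raise
            skipped.append(page)
    return tables, skipped


def camelot_reader(path, **kwargs):
    """Wrap camelot.read_pdf as a per-page reader (camelot imported lazily)."""
    import camelot

    return lambda page: camelot.read_pdf(path, pages=str(page), **kwargs)
```

For the file in this report, a call might look like `read_pages_skipping_empty(camelot_reader('data/2-end/5.pdf', flavor='stream', row_tol=55), range(2, last_page + 1))`, where `last_page` is a value you would have to determine yourself. Reading page by page reopens the PDF for every call, so it trades speed for robustness.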

To reproduce:

Use the stream flavor with row_tol (or other parameters) on a PDF that contains an empty page.

System:

Linux-5.4.0-42-generic-x86_64-with-debian-bullseye-sid
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08) 
[GCC 7.5.0]
NumPy 1.1.1
OpenCV 4.4.0
Camelot 0.8.2

Full error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-437e7b144dac> in <module>
----> 1 tables= cm.read_pdf('data/2-end/5.pdf', pages='2-end', flavor = 'stream', row_tol=55)

~/miniconda3/envs/ex/lib/python3.7/site-packages/camelot/io.py in read_pdf(filepath, pages, password, flavor, suppress_stdout, layout_kwargs, **kwargs)
    115             suppress_stdout=suppress_stdout,
    116             layout_kwargs=layout_kwargs,
--> 117             **kwargs
    118         )
    119         return tables

~/miniconda3/envs/ex/lib/python3.7/site-packages/camelot/handlers.py in parse(self, flavor, suppress_stdout, layout_kwargs, **kwargs)
    170             for p in pages:
    171                 t = parser.extract_tables(
--> 172                     p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
    173                 )
    174                 tables.extend(t)

~/miniconda3/envs/ex/lib/python3.7/site-packages/camelot/parsers/stream.py in extract_tables(self, filename, suppress_stdout, layout_kwargs)
    455             sorted(self.table_bbox.keys(), key=lambda x: x[1], reverse=True)
    456         ):
--> 457             cols, rows = self._generate_columns_and_rows(table_idx, tk)
    458             table = self._generate_table(table_idx, cols, rows)
    459             table._bbox = tk

~/miniconda3/envs/ex/lib/python3.7/site-packages/camelot/parsers/stream.py in _generate_columns_and_rows(self, table_idx, tk)
    346             # calculate mode of the list of number of elements in
    347             # each row to guess the number of columns
--> 348             ncols = max(set(elements), key=elements.count)
    349             if ncols == 1:
    350                 # if mode is 1, the page usually contains not tables

ValueError: max() arg is an empty sequence
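
The last frame can be reproduced without Camelot: on a page with no text, the list of per-row element counts is presumably empty, and `max()` over an empty set raises `ValueError`. The snippet below is a minimal sketch of that failure and of one possible guard. `most_common_count` is a hypothetical name, and this is not necessarily how the fix in #189 works.

```python
def most_common_count(elements):
    """Return the mode of elements, or None when the list is empty."""
    if not elements:
        # Nothing to count: signal "no columns" instead of calling max().
        return None
    return max(set(elements), key=elements.count)


# The unguarded expression from the traceback fails on an empty list.
empty = []
try:
    max(set(empty), key=empty.count)
    raised = False
except ValueError:
    raised = True
```

A caller could treat a `None` result as "no table on this page" and continue with the next page.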

@aflip changed the title from "[Feature Request ] Skip empty page option to avoid max() arg is an empty sequence error" to "max() arg is an empty sequence error on PDFs with blank pages, is there a skip empty page option?" on Aug 11, 2020
@vinayak-mehta vinayak-mehta added this to In progress in TODO! Aug 25, 2020
@vinayak-mehta
Member

@aflip Thanks for the detailed issue report! This should be fixed in the next release. 💚 💙 💜 💛 ❤️

@vinayak-mehta vinayak-mehta moved this from In progress to Done in TODO! Aug 25, 2020