IndexError: list index out of range #357

Closed
heixincai opened this issue Jul 9, 2019 · 10 comments

heixincai commented Jul 9, 2019

When I try to read this PDF, I get the error below.
I don't know how to solve it. Thanks.
AreaPercent-lc-multPageTable1.pdf

D:\LES\Environment\Python\python.exe D:/LES/Python/Code/ReadNoBorderTable.py
Traceback (most recent call last):
  File "D:/LES/Python/Code/ReadNoBorderTable.py", line 7, in <module>
    tables = camelot.read_pdf(pdfPath,flavor=tableType,strip_text=' .\n',columns=['58,107,139,189,327,258'],split_text=True)
  File "C:\Users\suyongdeng.RD\AppData\Roaming\Python\Python37\site-packages\camelot\io.py", line 106, in read_pdf
    layout_kwargs=layout_kwargs, **kwargs)
  File "C:\Users\suyongdeng.RD\AppData\Roaming\Python\Python37\site-packages\camelot\handlers.py", line 162, in parse
    layout_kwargs=layout_kwargs)
  File "C:\Users\suyongdeng.RD\AppData\Roaming\Python\Python37\site-packages\camelot\parsers\stream.py", line 425, in extract_tables
    cols, rows = self._generate_columns_and_rows(table_idx, tk)
  File "C:\Users\suyongdeng.RD\AppData\Roaming\Python\Python37\site-packages\camelot\parsers\stream.py", line 321, in _generate_columns_and_rows
    if self.columns is not None and self.columns[table_idx] != "":
IndexError: list index out of range

Process finished with exit code 1

@heixincai
Author

When I use Excalibur, I find that Autodetect Tables detects an extra small table (see the picture):
image
If I remove the small table, Camelot can extract the tables. Maybe this is the same problem. This is all I know. Thanks~
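
The traceback above ends at `self.columns[table_idx]`. One plausible reading, consistent with the extra small table that Excalibur detects, is that `columns` holds a single entry while the parser finds more than one table, so the second table has no matching entry. The following is a minimal pure-Python sketch of that failure mode; `columns_for_table` is a hypothetical stand-in that mirrors only the expression from the traceback and is not Camelot's actual code.

```python
def columns_for_table(columns, table_idx):
    """Hypothetical stand-in for the lookup shown in the traceback.

    It mirrors only the expression `self.columns[table_idx]`;
    it is not Camelot's implementation.
    """
    if columns is not None and columns[table_idx] != "":
        return columns[table_idx]
    return None


columns = ["58,107,139,189,327,258"]  # one entry, as in the report

# First detected table: index 0 exists, so the lookup succeeds.
first = columns_for_table(columns, 0)

# Second detected table: index 1 does not exist in a one-element list.
try:
    columns_for_table(columns, 1)
    failed = False
except IndexError:
    failed = True
```

If this reading is right, the error depends on how many tables are detected, not on the values inside the column string.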

pachacamac commented Aug 13, 2019

Pretty sure I have the same problem.

Running with flavor='stream' and columns=['62,105,185,252'], I get:

  File "/home/user/.local/lib/python3.7/site-packages/camelot/io.py", line 117, in read_pdf
    **kwargs
  File "/home/user/.local/lib/python3.7/site-packages/camelot/handlers.py", line 172, in parse
    p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
  File "/home/user/.local/lib/python3.7/site-packages/camelot/parsers/stream.py", line 458, in extract_tables
    cols, rows = self._generate_columns_and_rows(table_idx, tk)
  File "/home/user/.local/lib/python3.7/site-packages/camelot/parsers/stream.py", line 336, in _generate_columns_and_rows
    if self.columns is not None and self.columns[table_idx] != "":
IndexError: list index out of range

Any way to just ignore tables that don't fall in the provided columns and move on? I get this on a lot of PDFs but the PDFs all have slightly different stuff surrounding the table I'm interested in. So I would just like to move on without failing and later filter the tables that fit my scheme.

PS: I already limited to the right pages but can't / don't know how to give the stream parser a concrete starting and ending point.

pachacamac commented Aug 13, 2019

Somewhat dirty workaround that works for my case:

import camelot

cols = ['62,105,185,252']
# Workaround: repeat the same column set so every discovered table has an entry.
cols *= 128
camelot.read_pdf(pdf_file, flavor='stream', columns=cols)

@heixincai let me know if it helps you too :)
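
A slightly more explicit variant of the workaround above, combined with the "filter later" idea from the earlier comment, might look like the sketch below. `pad_columns`, `keep_tables_with_min_rows` and `read_and_filter` are hypothetical helpers. The only Camelot features assumed are `camelot.read_pdf(..., flavor='stream', columns=...)` and the `shape` attribute (rows, columns) of the returned tables. The padding count of 128 and the row threshold are arbitrary choices, and a row count is only one possible filter criterion.

```python
def pad_columns(column_spec, max_tables=128):
    """Repeat one column spec so every detected table has an entry."""
    return [column_spec] * max_tables


def keep_tables_with_min_rows(tables, min_rows):
    """Keep only tables that have at least `min_rows` rows.

    Every table receives the same column spec, so the column count
    cannot tell tables apart; the row count is used here instead.
    """
    return [t for t in tables if t.shape[0] >= min_rows]


def read_and_filter(pdf_file, column_spec, min_rows):
    """Read with padded columns, then drop tables that are too small."""
    import camelot  # imported lazily so the helpers work without it

    tables = camelot.read_pdf(
        pdf_file, flavor="stream", columns=pad_columns(column_spec)
    )
    return keep_tables_with_min_rows(tables, min_rows)
```

A call such as `read_and_filter(pdf_file, '62,105,185,252', min_rows=5)` would then return only the larger tables, under the assumption that the stray tables are the small ones.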

@vinayak-mehta
Contributor

@heixincai @pachacamac If you know the approximate location of the table in your PDF (assuming the table always lies in this general area in all PDFs that you have), you can specify table_regions to make camelot look for tables in only these regions.
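
As a rough illustration of this suggestion, the sketch below passes a single region through the `table_regions` parameter of `camelot.read_pdf`. Each region is assumed to be a string of the form `x1,y1,x2,y2`, with (x1, y1) the left-top and (x2, y2) the right-bottom corner in PDF coordinate space. `region_spec` and `read_tables_in_region` are hypothetical helpers, the coordinates are placeholders, and whether this avoids the IndexError for a given PDF has not been confirmed in this thread.

```python
def region_spec(x1, y1, x2, y2):
    """Format a region as 'x1,y1,x2,y2' (left-top, right-bottom)."""
    return ",".join(str(v) for v in (x1, y1, x2, y2))


def read_tables_in_region(pdf_file, region, **kwargs):
    """Ask Camelot to look for tables only inside `region`."""
    import camelot  # imported lazily; requires Camelot to be installed

    return camelot.read_pdf(
        pdf_file, flavor="stream", table_regions=[region], **kwargs
    )


# Placeholder coordinates; replace them with the area of your table.
region = region_spec(50, 750, 550, 100)
```

Extra keyword arguments such as `columns=['62,105,185,252']` can be forwarded through `**kwargs`.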

pachacamac commented Aug 27, 2019

@vinayak-mehta for me the problem is that I have PDFs where I'm interested in tables by structure (same columns etc) but different height, y-position, etc. on multiple pages (unknown number of pages)

@vinayak-mehta
Contributor

vinayak-mehta commented Aug 27, 2019

Perhaps we can add another filter inside the library, exposed as a parameter, to weed out tables that do not have a certain width/height.
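
This thread does not describe such a filter as already existing in the library. The sketch below only illustrates the idea on the user side. It assumes each detected table's bounding box is available as a tuple `(x1, y1, x2, y2)`; how to obtain that box from Camelot is not covered here. The helper names and the thresholds are hypothetical.

```python
def bbox_size(bbox):
    """Return (width, height) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    return abs(x2 - x1), abs(y2 - y1)


def is_large_enough(bbox, min_width, min_height):
    """True if the box is at least min_width wide and min_height tall."""
    width, height = bbox_size(bbox)
    return width >= min_width and height >= min_height


# Made-up boxes: one page-sized table and one small stray table.
candidates = [(50, 700, 550, 200), (60, 760, 120, 740)]
kept = [b for b in candidates if is_large_enough(b, 200, 100)]
```

Using absolute differences keeps the check independent of whether y grows upward or downward in the coordinate system.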

@heixincai
Author

Sorry, I have been busy with my project these days. My solution is to get the location information for the entire PDF page and then filter the parsed data. Maybe my method only works for my own case, but the problem has been solved.
Thanks~

@pachacamac

Has this been fixed now?

@vinayak-mehta
Contributor

@pachacamac Opened it here camelot-dev/camelot#50

@helpgodsg

@pachacamac Great. It really works.
