Columns getting merged #55
Comments
I think this is a tabula-java option problem, so it would be better to ask in the tabula-java issue tracker. How about using the stream option?
|
Ok. I'll raise an issue at tabula-java. Received same output from |
The same problem occurs in tabula-py |
I am also having the same issue, @konfi. Please help us! |
@samkit-jain Were you able to raise this issue again? Please let me know the steps if it's resolved. Issue: if a column is empty on any page, tabula-py writes another column's data in its place. I tried reading the PDF file using tabula's read_pdf in Python: df=read_pdf(pdfFile, pages='1', stream='True', guess='False'). As you can see in the output screenshot, the Withdrawal and Deposit columns got merged into a single column. If I use the Tabula app (https://tabula.technology/) locally, it works fine. |
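One thing worth checking in the call above: `stream='True'` and `guess='False'` are strings, not booleans. In Python any non-empty string is truthy, so if the library only tests the flag's truthiness (an assumption about tabula-py's internals, not verified here), `guess='False'` would leave guessing switched on. A minimal sketch, using a hypothetical helper `as_bool` that is not part of tabula-py, to normalize such values before passing them on:

```python
def as_bool(value):
    """Convert common string spellings of a flag to a real bool.

    Hypothetical helper for illustration; it is not part of tabula-py.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("true", "1", "yes"):
            return True
        if text in ("false", "0", "no", ""):
            return False
    raise ValueError("cannot interpret %r as a boolean" % (value,))


# Any non-empty string is truthy, including the string 'False'.
string_flag_is_truthy = bool('False')

# Keyword arguments with real booleans, ready to pass to read_pdf.
options = {"pages": "1", "stream": as_bool('True'), "guess": as_bool('False')}

# Assuming tabula-py and Java are installed and pdfFile holds your path:
# import tabula
# df = tabula.read_pdf(pdfFile, **options)
```

Whether this changes the merged-column result depends on the PDF; it only rules out the string flags as a cause.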
@kvopencode Haven't been able to resolve this |
I tried with this PDF created by Google Spreadsheet, but couldn't reproduce the issue. Could someone provide a minimal reproducible PDF for me?
In [1]: import tabula
In [2]: path = "test.pdf"
In [23]: df= tabula.read_pdf(path)
In [24]: df
Out[24]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0
In [25]: tabula.read_pdf(path, guess=False)
Out[25]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0
In [26]: tabula.read_pdf(path, guess=False, stream=True)
Out[26]:
   ColA  Unnamed: 1  ColB  Unnamed: 3  ColC  Unnamed: 5
0   NaN         1.0   NaN         NaN   NaN         6.0
1   NaN         NaN   NaN         2.0   NaN         7.0
2   NaN         3.0   NaN         NaN   NaN         8.0
3   NaN         NaN   NaN         4.0   NaN         NaN
4   NaN         NaN   NaN         5.0   NaN         9.0
In [36]: tabula.read_pdf(path, guess=False, lattice=True)
Out[36]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0 |
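In the `stream=True` output above, each named column is empty and its values sit in the "Unnamed: N" column immediately to its right. If an extraction comes out in that shape, one possible cleanup is to fold every "Unnamed" column into the named column on its left. The sketch below is plain Python over a header list and row lists, with `None` standing in for NaN; `fold_unnamed_columns` is a hypothetical helper, not part of tabula-py or pandas, and it assumes the left-neighbor pattern seen in this particular output.

```python
def fold_unnamed_columns(header, rows):
    """Merge each 'Unnamed: N' column into the named column to its left.

    header: list of column names.
    rows: list of lists, with None for empty cells.
    Illustrative helper only; not part of tabula-py or pandas.
    """
    groups = []  # each entry: (column name, [source column indexes])
    for index, name in enumerate(header):
        if name.startswith("Unnamed:") and groups:
            groups[-1][1].append(index)
        else:
            groups.append((name, [index]))

    new_header = [name for name, _ in groups]
    new_rows = []
    for row in rows:
        merged = []
        for _, indexes in groups:
            values = [row[i] for i in indexes if row[i] is not None]
            # Keep the first non-empty value; None if the whole group is empty.
            merged.append(values[0] if values else None)
        new_rows.append(merged)
    return new_header, new_rows


# Data copied from the stream=True output, with NaN written as None.
header = ["ColA", "Unnamed: 1", "ColB", "Unnamed: 3", "ColC", "Unnamed: 5"]
rows = [
    [None, 1.0, None, None, None, 6.0],
    [None, None, None, 2.0, None, 7.0],
    [None, 3.0, None, None, None, 8.0],
    [None, None, None, 4.0, None, None],
    [None, None, None, 5.0, None, 9.0],
]
clean_header, clean_rows = fold_unnamed_columns(header, rows)
```

This does not restore the all-empty row that the other modes report as row 2, and it would not help where two real columns have already been merged into one cell.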
@chezou Can you please share your email ID? I'll send it across. |
@kvopencode Unfortunately, I don't want to support via E-mail privately, because:
For people facing this issue: |
I have the same issue. The debit and credit values should be in separate columns, but for some reason they are not. |
Was anyone able to resolve the issue of merged columns? |
A possible workaround for me is to convert the PDF file to HTML. https://datascientyst.com/extract-table-from-pdf-with-python-pandas/ |
@samkit-jain - Found any fix for this? |
@Aman0509 Not with tabula. I switched over to pdfplumber. |
Summary of your issue
I have a PDF with a table that extends across multiple pages. For some rows, the values in the last two (or second-to-last two) columns are getting merged into a single column. Please note that I have removed some of the sensitive information.
![image](https://user-images.githubusercontent.com/15127115/30755227-7871bd8c-9fe3-11e7-8d90-c56304462eb5.png)
Environment
python --version: Python 2.7.6
java -version: java version "1.8.0_144"
What did you do when you faced the problem?
Tried lattice=True option but it is not even reading the table
Example code:
Output:
What did you intend to be?