Columns getting merged #55

Closed
3 of 4 tasks
samkit-jain opened this issue Sep 22, 2017 · 15 comments
Labels
help wanted, PDF required (PDF should be provided to address this issue)

Comments

@samkit-jain

samkit-jain commented Sep 22, 2017

Summary of your issue

I have a PDF with a table that extends over multiple pages. For some rows, the values in the last two (or the second-to-last two) columns are merged into a single column. Please note that I have removed some of the sensitive information.
image

Environment

  • python --version: Python 2.7.6
  • java -version: java version "1.8.0_144"
  • OS and its version: Ubuntu 14.04
  • Your PDF URL: I have attached a screenshot of the table

What did you do when you faced the problem?

I tried the lattice=True option, but with it the table is not read at all.

Example code:

import tabula

# Read every page without treating the first row as a header
all_values = tabula.read_pdf("sample.pdf", pages='all', pandas_options={'header': None, 'error_bad_lines': False, 'warn_bad_lines': False})
all_values = all_values.values.tolist()

for val in all_values:
    print(val)

Output:

[u'22/06/17', u'IMPS-7-RAHUL-HDFC-XXXXXXXX', u'00007', u'22/06/17', nan, u'1,000.00 14,904.08', nan]
[nan, u'8-', nan, nan, nan, nan, nan]
[u'23/06/17', u'NEFT CHGS INCL ST & CESS 170617', u'000000000000000', u'23/06/17', u'2.88', u'14,901.20', nan]
[u'23/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'23/06/17', nan, u'579.00 15,480.20', nan]
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'9', nan, nan, nan, nan, nan]
[u'23/06/17', u'ACH D- AUFINANCIERS-9', u'0000002', u'23/06/17', u'14,400.00', u'1,080.20', nan]
[u'24/06/17', u'IMPS-7-ANI TECHNOLOGIES PRI-H', u'00007', u'24/06/17', nan, u'2,378.00 3,458.20', nan]
[nan, u'DFC-XXXXXXXXXXX-2017-06', nan, nan, nan, nan, nan]
[nan, u'-24-PAYMENT', nan, nan, nan, nan, nan]
[u'27/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'27/06/17', nan, u'445.00 3,903.20', nan]
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'8', nan, nan, nan, nan, nan]
[u'28/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'28/06/17', nan, u'2,279.00 6,182.20', nan]
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'0', nan, nan, nan, nan, nan]
[u'29/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'29/06/17', nan, u'1,165.00 7,347.20', nan]
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'5', nan, nan, nan, nan, nan]

What did you intend to be?

[u'22/06/17', u'IMPS-7-RAHUL-HDFC-XXXXXXXX', u'00007', u'22/06/17', nan, u'1,000.00', u'14,904.08']
[nan, u'8-', nan, nan, nan, nan, nan]
[u'23/06/17', u'NEFT CHGS INCL ST & CESS 170617', u'000000000000000', u'23/06/17', u'2.88', nan, u'14,901.20']
[u'23/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'23/06/17', nan, u'579.00', u'15,480.20']
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'9', nan, nan, nan, nan, nan]
[u'23/06/17', u'ACH D- AUFINANCIERS-9', u'0000002', u'23/06/17', u'14,400.00', nan, u'1,080.20']
[u'24/06/17', u'IMPS-7-ANI TECHNOLOGIES PRI-H', u'00007', u'24/06/17', nan, u'2,378.00', u'3,458.20']
[nan, u'DFC-XXXXXXXXXXX-2017-06', nan, nan, nan, nan, nan]
[nan, u'-24-PAYMENT', nan, nan, nan, nan, nan]
[u'27/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'27/06/17', nan, u'445.00', u'3,903.20']
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'8', nan, nan, nan, nan, nan]
[u'28/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'28/06/17', nan, u'2,279.00', u'6,182.20']
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'0', nan, nan, nan, nan, nan]
[u'29/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'29/06/17', nan, u'1,165.00', u'7,347.20']
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'5', nan, nan, nan, nan, nan]
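
Until the extraction itself is fixed, rows like the ones above could be repaired after the fact. The following is only a minimal sketch in Python 3, not a tabula feature. `repair_row` and `is_missing` are hypothetical helpers, and the 7-column layout (date, description, reference, value date, withdrawal, deposit, balance) is an assumption inferred from the sample output. It also assumes that a row never has both a withdrawal and a deposit.

```python
import math


def is_missing(value):
    """Return True for the NaN placeholder that marks an empty cell."""
    return isinstance(value, float) and math.isnan(value)


def repair_row(row):
    """Repair one row of the assumed 7-column layout.

    Columns: date, description, reference, value date,
    withdrawal, deposit, balance.
    """
    row = list(row)
    if is_missing(row[5]) or not is_missing(row[6]):
        # Continuation line, or the balance is already in its own column
        return row
    parts = str(row[5]).split()
    if len(parts) == 2:
        # "deposit balance" ended up in a single cell: split it
        row[5], row[6] = parts
    elif len(parts) == 1 and not is_missing(row[4]):
        # Withdrawal row: the lone value is the balance, shifted one column left
        row[5], row[6] = float("nan"), parts[0]
    return row


nan = float("nan")
merged = ['22/06/17', 'IMPS-7-RAHUL-HDFC-XXXXXXXX', '00007', '22/06/17',
          nan, '1,000.00 14,904.08', nan]
print(repair_row(merged)[5:])
```

This kind of repair depends entirely on the assumed layout, so it would need adjusting for any other statement format.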
@chezou
Owner

chezou commented Sep 23, 2017

I think this is a tabula-java option problem, so it would be better to ask in the tabula-java issue tracker.

How about using the stream option?

tabula.read_pdf('target.pdf', pages='all', stream=True, guess=False)
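
Another option that might be worth trying in stream mode is to pass explicit column boundaries through the `columns` parameter of `read_pdf`. The sketch below is an assumption-laden illustration rather than a confirmed fix for this PDF: `column_options` and `read_with_columns` are hypothetical helpers, and the x-coordinates are made-up values that would have to be measured from the actual document (for example with the Tabula app).

```python
def column_options(boundaries):
    """Build keyword arguments for tabula.read_pdf with fixed column boundaries."""
    xs = sorted(float(x) for x in boundaries)
    # Stream mode with guessing disabled, combined with manual boundaries
    return {
        "pages": "all",
        "stream": True,
        "guess": False,
        "columns": xs,
        "pandas_options": {"header": None},
    }


def read_with_columns(path, boundaries):
    """Read a PDF with manually specified column boundaries."""
    import tabula  # imported lazily; needs tabula-py and a Java runtime

    return tabula.read_pdf(path, **column_options(boundaries))


# Hypothetical x-coordinates (in PDF points) separating seven columns
EXAMPLE_BOUNDARIES = [70, 270, 350, 400, 470, 540]
print(column_options(EXAMPLE_BOUNDARIES)["columns"])
```

Calling `read_with_columns("target.pdf", EXAMPLE_BOUNDARIES)` would then run the extraction; whether it separates the merged columns depends on how accurate the boundaries are.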

@samkit-jain
Author

Ok. I'll raise an issue at tabula-java.

I received the same output with stream=True.

@ayubansal1998

The same problem occurs in tabula-py.
If a column is empty on any page, tabula-py writes another column's data into it.

@vaibhavsingh05

Columns containing only NaN are removed automatically by read_pdf, but I want to keep them. Is there any way to prevent this?
sample
I want the 3 columns to stay in their original order, but I am getting this:
sample2

@thiencuong14

thiencuong14 commented Aug 21, 2019

I am also having the same issue @konfi

[input]
image

[output]
image

Please help us!

@kvopencode

kvopencode commented Oct 30, 2019

@samkit-jain Were you able to raise this issue again? Please let me know the steps if it's resolved.

Issue: if a column is empty on any page, tabula-py writes another column's data into it.

I tried reading the PDF file using tabula's read_pdf in Python.
Code

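from tabula import read_pdf
from tabulate import tabulate
# stream and guess take booleans, not strings
df = read_pdf(pdfFile, pages='1', stream=True, guess=False)
df = df.dropna(axis='rows')
print(tabulate(df))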

As you can see in the output screenshot, the Withdrawal and Deposit columns got merged into a single column.

If I use the Tabula app (https://tabula.technology/) locally, it works fine.

Sample source pdf
Screenshot 2019-10-30 at 14 32 22

Python dataframe output
pythonDataframeOutput
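
A side note on the stream and guess options of read_pdf: they are meant to be booleans. In Python every non-empty string is truthy, so if a library checks such an option with a plain `if` (an assumption about tabula-py's internals, not verified here), a string like 'False' would behave like True. The sketch below shows the pitfall; `as_flag` is a hypothetical helper, not part of tabula-py.

```python
def as_flag(value):
    """Convert common textual spellings of a boolean into a real bool."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise ValueError("not a boolean flag: %r" % (value,))


# Any non-empty string is truthy, including 'False'
print(bool('False'))     # True
print(as_flag('False'))  # False
```

Passing `guess=as_flag('False')`, or simply `guess=False`, avoids relying on how the string would be interpreted.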

@samkit-jain
Author

@kvopencode Haven't been able to resolve this

@chezou
Owner

chezou commented Oct 30, 2019

I tried with this PDF, created with Google Spreadsheet, but couldn't reproduce the issue. Could someone provide a minimal reproducible PDF for me?
test.pdf

In [1]: import tabula
In [2]: path = "test.pdf"

In [23]: df=  tabula.read_pdf(path)

In [24]: df
Out[24]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0

In [25]: tabula.read_pdf(path, guess=False)
Out[25]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0

In [26]: tabula.read_pdf(path, guess=False, stream=True)
Out[26]:
   ColA  Unnamed: 1  ColB  Unnamed: 3  ColC  Unnamed: 5
0   NaN         1.0   NaN         NaN   NaN         6.0
1   NaN         NaN   NaN         2.0   NaN         7.0
2   NaN         3.0   NaN         NaN   NaN         8.0
3   NaN         NaN   NaN         4.0   NaN         NaN
4   NaN         NaN   NaN         5.0   NaN         9.0

In [36]: tabula.read_pdf(path, guess=False, lattice=True)
Out[36]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0

@kvopencode

@chezou Can you please share your email ID? I will send the PDF across.

@chezou
Owner

chezou commented Oct 31, 2019

@kvopencode Unfortunately, I don't want to provide support privately via e-mail, because:

  • I suspect this is a tabula-java related issue. If so, the PDF should be shared with the tabula-java team.
  • tabula-py is a personal project, which means I develop and maintain it in my spare time. I don't have enough resources to provide support on my own.
  • Personally, I have had really awful experiences with e-mail based requests. Some were impolite, and some tended to overuse my limited resources. I don't think you're one of those people, but it would be nice to have "minimal reproducible data" for tackling this issue, since tabula-py is an open-source project.

For people facing this issue:
Please provide the PDF in future comments on this issue. Otherwise, I might lock this issue, since I can't reproduce it and can't determine whether it is a tabula-py issue or a tabula-java one.

@chezou added the "PDF required" (PDF should be provided to address this issue) label Oct 31, 2019
@egodalle

I have the same issue. The debit and credit values should be in separate columns, but for some reason they are not.
MIB ADJ02262021.pdf

@Ayushi-Garg-1

Was anyone able to resolve the issue of merged columns?

@softhints

A possible workaround for me is to convert the PDF file to HTML.
Then read the table from the HTML content:

https://datascientyst.com/extract-table-from-pdf-with-python-pandas/
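
To illustrate the second step of this workaround, here is a minimal, dependency-free sketch of reading table rows out of HTML. It is not taken from the linked article, which may use a different approach. `TableRows`, `rows_from_html`, and the sample markup are hypothetical, and the sketch assumes the PDF-to-HTML converter emits real `<table>` markup, which not every converter does.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_from_html(html):
    """Return all table rows in the HTML as lists of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = """<table>
<tr><th>Date</th><th>Withdrawal</th><th>Deposit</th><th>Balance</th></tr>
<tr><td>23/06/17</td><td>2.88</td><td></td><td>14,901.20</td></tr>
<tr><td>23/06/17</td><td></td><td>579.00</td><td>15,480.20</td></tr>
</table>"""

for row in rows_from_html(sample):
    print(row)
```

Because empty cells survive as empty strings, the withdrawal and deposit values stay in their own columns in this sample.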

@Aman0509

@samkit-jain - Found any fix for this?

@samkit-jain
Author

@Aman0509 Not with tabula. I switched over to pdfplumber.
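
For readers who want to try the same switch, a rough sketch of table extraction with pdfplumber could look like the following. This is not the code used by the author of the comment above: `extract_rows` and `TEXT_SETTINGS` are hypothetical names, and the text-based strategies are an assumption that suits tables without ruling lines; they may need tuning for a given PDF.

```python
# Text-based strategies infer cell edges from word alignment rather than drawn lines
TEXT_SETTINGS = {"vertical_strategy": "text", "horizontal_strategy": "text"}


def extract_rows(path, settings=None):
    """Return every table row found in the PDF as a list of cell values."""
    import pdfplumber  # imported lazily; install with `pip install pdfplumber`

    rows = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            table = page.extract_table(table_settings=settings or TEXT_SETTINGS)
            if table:  # extract_table returns None when no table is found
                rows.extend(table)
    return rows
```

Calling `extract_rows("sample.pdf")` would return a list of rows, where empty cells may come back as None or empty strings.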
