Columns getting merged #55

samkit-jain · 2017-09-22T16:47:55Z

Summary of your issue

I have a PDF with a table that extends across multiple pages. For some rows, the values in the last two (or second-to-last two) columns are merged into a single column. Please note that I have removed some of the sensitive information.

Environment

python --version: Python 2.7.6
java -version: java version "1.8.0_144"
OS and its version: Ubuntu 14.04
Your PDF URL: I have attached a screenshot of the table

What did you do when you faced the problem?

I tried the lattice=True option, but with it the table is not read at all.

Example code:

import tabula

# Read every page without treating the first row as a header
all_values = tabula.read_pdf("sample.pdf", pages='all', pandas_options={'header': None, 'error_bad_lines': False, 'warn_bad_lines': False})
all_values = all_values.values.tolist()

for val in all_values:
    print val

Output:

[u'22/06/17', u'IMPS-7-RAHUL-HDFC-XXXXXXXX', u'00007', u'22/06/17', nan, u'1,000.00 14,904.08', nan]
[nan, u'8-', nan, nan, nan, nan, nan]
[u'23/06/17', u'NEFT CHGS INCL ST & CESS 170617', u'000000000000000', u'23/06/17', u'2.88', u'14,901.20', nan]
[u'23/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'23/06/17', nan, u'579.00 15,480.20', nan]
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'9', nan, nan, nan, nan, nan]
[u'23/06/17', u'ACH D- AUFINANCIERS-9', u'0000002', u'23/06/17', u'14,400.00', u'1,080.20', nan]
[u'24/06/17', u'IMPS-7-ANI TECHNOLOGIES PRI-H', u'00007', u'24/06/17', nan, u'2,378.00 3,458.20', nan]
[nan, u'DFC-XXXXXXXXXXX-2017-06', nan, nan, nan, nan, nan]
[nan, u'-24-PAYMENT', nan, nan, nan, nan, nan]
[u'27/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'27/06/17', nan, u'445.00 3,903.20', nan]
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'8', nan, nan, nan, nan, nan]
[u'28/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'28/06/17', nan, u'2,279.00 6,182.20', nan]
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'0', nan, nan, nan, nan, nan]
[u'29/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'29/06/17', nan, u'1,165.00 7,347.20', nan]
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'5', nan, nan, nan, nan, nan]

What did you intend to be?

[u'22/06/17', u'IMPS-7-RAHUL-HDFC-XXXXXXXX', u'00007', u'22/06/17', nan, u'1,000.00', u'14,904.08']
[nan, u'8-', nan, nan, nan, nan, nan]
[u'23/06/17', u'NEFT CHGS INCL ST & CESS 170617', u'000000000000000', u'23/06/17', u'2.88', nan, u'14,901.20']
[u'23/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'23/06/17', nan, u'579.00', u'15,480.20']
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'9', nan, nan, nan, nan, nan]
[u'23/06/17', u'ACH D- AUFINANCIERS-9', u'0000002', u'23/06/17', u'14,400.00', nan, u'1,080.20']
[u'24/06/17', u'IMPS-7-ANI TECHNOLOGIES PRI-H', u'00007', u'24/06/17', nan, u'2,378.00', u'3,458.20']
[nan, u'DFC-XXXXXXXXXXX-2017-06', nan, nan, nan, nan, nan]
[nan, u'-24-PAYMENT', nan, nan, nan, nan, nan]
[u'27/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'27/06/17', nan, u'445.00', u'3,903.20']
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'8', nan, nan, nan, nan, nan]
[u'28/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'28/06/17', nan, u'2,279.00', u'6,182.20']
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'0', nan, nan, nan, nan, nan]
[u'29/06/17', u'NEFT CR-YESB0000001-ANI TECHNOLOGIES PRI', u'N1', u'29/06/17', nan, u'1,165.00', u'7,347.20']
[nan, u'VATE LIMITED-RAHUL YADAV-N1', nan, nan, nan, nan, nan]
[nan, u'5', nan, nan, nan, nan, nan]
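
Until the extraction itself is fixed, one possible stopgap is to repair the rows after reading them. The sketch below is only an illustration based on the output shown above: it assumes the balance always ends up in column 5 (either alone or merged with the deposit) while column 6 stays empty. The helper names `is_missing` and `repair_row` are made up for this example, and the code targets Python 3, so the strings are plain `str` rather than `u'...'`.

```python
import math


def is_missing(value):
    """True for the float NaN that pandas uses for empty cells."""
    return isinstance(value, float) and math.isnan(value)


def repair_row(row, deposit_idx=5, balance_idx=6):
    """Return a copy of row with the deposit/balance cells separated.

    Assumption (from the output in this issue): when the balance column is
    empty, the text in the deposit column is either "deposit balance" or
    just "balance".
    """
    fixed = list(row)
    cell = fixed[deposit_idx]
    if not isinstance(cell, str) or not is_missing(fixed[balance_idx]):
        return fixed  # continuation row or already correct
    parts = cell.split()
    if len(parts) == 2:
        # "1,000.00 14,904.08" -> deposit and balance
        fixed[deposit_idx], fixed[balance_idx] = parts
    elif len(parts) == 1:
        # Only the balance was present; it was shifted one column left
        fixed[deposit_idx], fixed[balance_idx] = float("nan"), parts[0]
    return fixed


nan = float("nan")
rows = [
    ["22/06/17", "IMPS-7-RAHUL-HDFC-XXXXXXXX", "00007", "22/06/17", nan, "1,000.00 14,904.08", nan],
    ["23/06/17", "NEFT CHGS INCL ST & CESS 170617", "000000000000000", "23/06/17", "2.88", "14,901.20", nan],
    [nan, "8-", nan, nan, nan, nan, nan],
]
repaired = [repair_row(r) for r in rows]
for r in repaired:
    print(r[4:])
```

This only works while the amounts contain no internal spaces, so it is a patch over the symptom rather than a fix for the column detection.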


chezou · 2017-09-23T01:16:36Z

I think this is a tabula-java option problem, so it would be better to ask in the tabula-java issue tracker.

How about using stream option?

tabula.read_pdf('target.pdf', pages='all', stream=True, guess=False)
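
Another thing that may be worth trying in stream mode is passing the column boundaries explicitly through the `columns` option of `read_pdf`, instead of letting them be detected. The following is only a sketch: the helper names and the boundary values are hypothetical, the real x coordinates (in PDF points) would have to be measured from the actual file, and as far as I know `guess=False` is needed for `columns` to be respected.

```python
def build_read_options(column_boundaries):
    """Keyword arguments for tabula.read_pdf with fixed column boundaries."""
    return {
        "pages": "all",
        "stream": True,
        "guess": False,
        # x coordinates (in points) where one column ends and the next begins
        "columns": sorted(column_boundaries),
    }


def read_with_fixed_columns(pdf_path, column_boundaries):
    """Read the PDF using explicit column boundaries (requires Java)."""
    import tabula  # imported here so build_read_options works without tabula

    return tabula.read_pdf(pdf_path, **build_read_options(column_boundaries))


# Hypothetical boundaries; replace them with values measured from your PDF.
options = build_read_options([70, 300, 380, 440, 500, 560])
print(options)
```

Calling `read_with_fixed_columns('target.pdf', [70, 300, 380, 440, 500, 560])` would then run the extraction with those boundaries.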

samkit-jain · 2017-09-23T04:03:29Z

Ok. I'll raise an issue at tabula-java.

I got the same output with stream=True.

ayubansal1998 · 2019-06-26T13:35:31Z

The same problem occurs in tabula-py.
If a column is empty on any page, tabula-py writes another column's data into it.

vaibhavsingh05 · 2019-07-01T08:58:46Z

Columns containing only NaN get removed automatically in the read_pdf module, while I want them to be there. Is there any way to prevent this?

I want all 3 columns to stay in their original order, but this is what I am getting:

thiencuong14 · 2019-08-21T10:03:21Z

I am also having the same issue @konfi

[input]

[output]

Please help us!

kvopencode · 2019-10-30T14:30:02Z

@samkit-jain Were you able to raise this issue again? Please let me know the steps if it has been resolved.

Issue: if a column is empty on any page, tabula-py writes another column's data into it.

I tried reading the PDF file using tabula's read_pdf in Python.
Code:

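from tabula import read_pdf
from tabulate import tabulate
# stream and guess take booleans; the string 'False' would be truthy
df = read_pdf(pdfFile, pages='1', stream=True, guess=False)
df = df.dropna(axis='rows')
print(tabulate(df))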

As you can see in the output screenshot, the Withdrawal and Deposit columns got merged into a single column.

If I use the Tabula app (https://tabula.technology/) locally, it works fine.

Sample source pdf

Python dataframe output
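
A side note on the stream and guess arguments: they are meant to be booleans. A string such as 'False' is non-empty and therefore truthy in Python, so, assuming the library only checks the truthiness of the value, guess='False' would leave guessing switched on. The small helper below (`as_bool` is a made-up name, not part of tabula-py) shows the difference and one way to normalize such values when they come from a config file or the command line.

```python
def as_bool(value):
    """Convert 'True'/'False'-style strings (or real booleans) to a bool."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise ValueError("cannot interpret %r as a boolean" % (value,))


print(bool("False"))     # True: any non-empty string is truthy
print(as_bool("False"))  # False
```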

samkit-jain · 2019-10-30T14:40:18Z

@kvopencode Haven't been able to resolve this

chezou · 2019-10-30T15:25:33Z

I tried with this PDF, created with Google Spreadsheet, but couldn't reproduce the issue. Could someone provide a minimal reproducible PDF for me?
test.pdf

In [1]: import tabula
In [2]: path = "test.pdf"

In [23]: df = tabula.read_pdf(path)

In [24]: df
Out[24]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0

In [25]: tabula.read_pdf(path, guess=False)
Out[25]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0

In [26]: tabula.read_pdf(path, guess=False, stream=True)
Out[26]:
   ColA  Unnamed: 1  ColB  Unnamed: 3  ColC  Unnamed: 5
0   NaN         1.0   NaN         NaN   NaN         6.0
1   NaN         NaN   NaN         2.0   NaN         7.0
2   NaN         3.0   NaN         NaN   NaN         8.0
3   NaN         NaN   NaN         4.0   NaN         NaN
4   NaN         NaN   NaN         5.0   NaN         9.0

In [36]: tabula.read_pdf(path, guess=False, lattice=True)
Out[36]:
   ColA  ColB  ColC
0   1.0   NaN   6.0
1   NaN   2.0   7.0
2   NaN   NaN   NaN
3   3.0   NaN   8.0
4   NaN   4.0   NaN
5   NaN   5.0   9.0

kvopencode · 2019-10-30T22:45:51Z

@chezou Could you please share your email address? I will send the PDF over.

chezou · 2019-10-31T01:44:10Z

@kvopencode Unfortunately, I don't want to provide support privately via e-mail, because:

I suspect this is a tabula-java related issue. If so, the PDF should be shared with the tabula-java team.
tabula-py is a personal project, which means I develop and maintain it in my spare time. I don't have enough resources to provide support on my own.
Personally, I have had really awful experiences with e-mail-based requests. Some were impolite, and some tended to overuse my limited resources. I don't think you are one of those people, but it'd be nice if we could have "minimum reproducible data" for tackling this issue, since tabula-py is an open-source project.

For people facing this issue:
Please provide the PDF in any future comments on this issue. Otherwise, I might lock this issue, since I can't reproduce it and can't determine whether it is a tabula-py issue or a tabula-java one.

egodalle · 2021-12-29T01:39:26Z

I have the same issue. The debit and credit values should be in separate columns, but for some reason they are not.
MIB ADJ02262021.pdf

Ayushi-Garg-1 · 2022-07-17T22:33:46Z

Was anyone able to resolve the issue of merged columns?

softhints · 2022-09-30T06:49:49Z

A possible workaround for me is to convert the PDF file to HTML.
Then read the table from the HTML content:

https://datascientyst.com/extract-table-from-pdf-with-python-pandas/

Aman0509 · 2022-10-31T19:58:02Z

@samkit-jain - Found any fix for this?

samkit-jain · 2022-10-31T20:10:53Z

@Aman0509 Not with tabula. I switched over to pdfplumber.
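
For anyone who wants to try the same route, here is a rough sketch of how fixed column edges can be passed to pdfplumber. It is not necessarily the setup used above: the helper names and the edge values are hypothetical, and the x positions would need to be measured from the actual PDF. It relies on pdfplumber's `explicit` vertical strategy together with `explicit_vertical_lines`.

```python
def build_table_settings(column_edges):
    """Table settings that force the column edges instead of detecting them."""
    return {
        "vertical_strategy": "explicit",
        # x positions of every column edge, including the outer two
        "explicit_vertical_lines": sorted(column_edges),
        # rows are inferred from the alignment of the text
        "horizontal_strategy": "text",
    }


def extract_rows(pdf_path, column_edges):
    """Collect the rows of the largest table found on every page."""
    import pdfplumber  # imported here so build_table_settings works without it

    settings = build_table_settings(column_edges)
    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            table = page.extract_table(settings)
            if table:  # extract_table returns None when nothing is found
                rows.extend(table)
    return rows


# Hypothetical edges; replace them with values measured from your PDF.
print(build_table_settings([30, 70, 300, 380, 440, 500, 560]))
```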

samkit-jain closed this as completed Sep 23, 2017

ayubansal1998 mentioned this issue Jun 27, 2019

Columns getting merged and mispredict data in different column #157

Closed

chezou added the help wanted label Oct 31, 2019

chezou added the "PDF required" label (PDF should be provided to address this issue) Oct 31, 2019
