
[Python] Reading parquet file with many columns becomes slow for 0.15.0 #23204

Closed

asfimport opened this issue Oct 14, 2019 · 15 comments
asfimport commented Oct 14, 2019

Hi,

I just noticed that reading a parquet file became really slow after I upgraded to 0.15.0, using pandas.

Example:

With 0.14.1:
In [4]: %timeit df = pd.read_parquet(path)
2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With 0.15.0:
In [5]: %timeit df = pd.read_parquet(path)
22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The file is about 15 MB in size. I am testing on the same machine using the same versions of Python and pandas.

Have you received similar complaints? What could be the issue here?

Thanks a lot.

Edit 1: Some profiling I did:

0.14.1: [profiling screenshot: image-2019-10-14-18-12-07-652.png]

0.15.0: [profiling screenshot: image-2019-10-14-18-10-42-850.png]

Environment: Python 3.7
Reporter: Bob
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-6876. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
Thanks for the report. Would you be able to share a script that reproduces it (one that writes a parquet file showing the issue), or otherwise share a file?
What's the schema of the data?

Bob:
@jorisvandenbossche sorry, I cannot share the data with you because it contains our IP. Something I can share is:

In [6]: df.shape
Out[6]: (61, 31835)

All fields are just plain floats; I believe you can create a dataframe just like this with no difficulty?

One thing to note is that in our dataframe we use multilevel columns. But I suppose that is not an issue?

Bob:
@jorisvandenbossche it seems you started calling this function, which is causing the issue:

https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L1118

Joris Van den Bossche / @jorisvandenbossche:
Thanks. If it is just floats, I'll try to reproduce based on that description. It's probably related to the fact that you have a very wide dataframe (n columns >> n rows). In general, the Parquet format is not well suited for that kind of data (even in 0.14, the 2 seconds to read is very slow). That said, it's still a performance regression compared to 0.14 that is worth looking into.
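
To illustrate why width matters (an editorial sketch, not from the thread): Parquet keeps metadata such as statistics and offsets per column chunk, so the same values laid out as many columns carry far more metadata than as many rows.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

values = np.random.randn(100_000)

# Same data, two layouts: 10 columns x 10,000 rows vs. 10,000 columns x 10 rows.
tall = pa.table({'c' + str(i): values[i * 10_000:(i + 1) * 10_000] for i in range(10)})
wide = pa.table({'c' + str(i): values[i * 10:(i + 1) * 10] for i in range(10_000)})

pq.write_table(tall, "tall.parquet")
pq.write_table(wide, "wide.parquet")

# The wide file carries 1000x more column-chunk metadata for identical values.
print(pq.ParquetFile("tall.parquet").metadata)
print(pq.ParquetFile("wide.parquet").metadata)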

Bob:
@jorisvandenbossche thanks, let me know if I can help. We are a very special case here, I think. Also, I am not sure whether the multilevel columns add any complexity; it seems parquet does not handle them very well?
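
For what it's worth, a hypothetical stand-in for data of this shape (the real file was never shared) that also exercises the multilevel columns, assuming plain float values:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical stand-in: 61 rows of plain floats under a two-level
# (MultiIndex) column header, 5 * 6367 = 31835 columns in total,
# which is also roughly 15 MB of float64 values.
cols = pd.MultiIndex.from_product(
    [['group' + str(g) for g in range(5)], ['f' + str(i) for i in range(6367)]]
)
df = pd.DataFrame(np.random.randn(61, len(cols)), columns=cols)

# pyarrow flattens the MultiIndex into flat string field names and records
# the original levels in the pandas metadata, so the round trip restores it.
pq.write_table(pa.Table.from_pandas(df), "wide_multiindex.parquet")
back = pq.read_table("wide_multiindex.parquet").to_pandas()
assert back.columns.equals(df.columns)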

Bob:
I also tried fastparquet as the engine and it just threw an error when reading the file. It seems it simply cannot decode it.
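
For context, switching engines in pandas looks like this ("path" is a stand-in for the reporter's file, which was not shared):

import pandas as pd

path = "data.parquet"  # stand-in; the real file was not shared

df = pd.read_parquet(path, engine="pyarrow")      # the default when pyarrow is installed
df = pd.read_parquet(path, engine="fastparquet")  # the engine tried here, which failed to decode the file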

Joris Van den Bossche / @jorisvandenbossche:
Small reproducer:

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'c' + str(i): np.random.randn(10) for i in range(10000)})
pq.write_table(table, "test_wide.parquet")
res = pq.read_table("test_wide.parquet")
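
For reference, the numbers quoted in this thread come from IPython's %timeit; outside IPython the read can be timed roughly like this (a sketch):

import timeit

import pyarrow.parquet as pq

# Average the read over a few runs, mirroring %timeit's report.
n = 7
t = timeit.timeit(lambda: pq.read_table("test_wide.parquet"), number=n) / n
print("mean read time over", n, "runs:", round(t, 3), "s")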

Wes McKinney / @wesm:
Marked this for 0.15.1

Antoine Pitrou / @pitrou:
Issue resolved by pull request #5653.

Axel:
Hi, I am still experiencing some very slow load times with version 0.15.1.

With the reproducer above:

0.14.1:
282 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

0.15.1:
5.06 s ± 288 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

From reading the GitHub issue, I expected it to be slower than 0.14.1, but not by this much.

Joris Van den Bossche / @jorisvandenbossche:
[~axelg] would you be able to share a reproducible example? (e.g. the data, or code that creates a dummy dataset with the same characteristics and shows the problem)

Axel:
Sure! For the numbers above I used the exact same example you posted above.

import pyarrow as paimport pyarrow.parquet as pq
table = pa.table({'c' + str(i): np.random.randn(10) for i in range(10000)})
pq.write_table(table, "test_wide.parquet")
res = pq.read_table("test_wide.parquet")
 

Joris Van den Bossche / @jorisvandenbossche:
Ah, sorry, I missed the "With the reproducer above:" in your message.

I see a similar difference locally; it's indeed not the speed-up that @wesm reported on the PR: #5653 (comment) (this might depend on the machine / number of cores?)
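
One way to probe the machine / core-count hypothesis (an editorial sketch, not from the thread) is to toggle pyarrow's threaded reads:

import pyarrow as pa
import pyarrow.parquet as pq

print("cores available to pyarrow:", pa.cpu_count())

# Compare a single-threaded read against the default multi-threaded one;
# if the regression depends on core count, the gap should change.
res_single = pq.read_table("test_wide.parquet", use_threads=False)
res_multi = pq.read_table("test_wide.parquet", use_threads=True)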

Wes McKinney / @wesm:
I think there is another JIRA for follow-up investigation; can we move the discussion there?

Joris Van den Bossche / @jorisvandenbossche:
The open issue about this is ARROW-7059.
