
[Python] Reading parquet file with many columns becomes slow for 0.15.0 #23204

Closed

asfimport opened this issue Oct 14, 2019 · 15 comments
asfimport commented Oct 14, 2019

Hi,

I just noticed that reading a parquet file became really slow after I upgraded to 0.15.0, using pandas.

Example:

With 0.14.1:
In [4]: %timeit df = pd.read_parquet(path)
2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With 0.15.0:
In [5]: %timeit df = pd.read_parquet(path)
22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The file is about 15 MB in size. I am testing on the same machine using the same versions of Python and pandas.

Have you received similar complaints? What could be the issue here?

Thanks a lot.

Edit 1: Some profiling I did:

0.14.1: [profiling screenshot: image-2019-10-14-18-12-07-652.png]

0.15.0: [profiling screenshot: image-2019-10-14-18-10-42-850.png]

Environment: Python 3.7
Reporter: Bob
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-6876. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
Thanks for the report. Would you be able to share a script that reproduces it (one that writes a parquet file showing the issue), or otherwise share a file?
What's the schema of the data?

Bob:
@jorisvandenbossche sorry, I cannot share the data with you because it contains our IP. Something I can share is:

In [6]: df.shape
Out[6]: (61, 31835)

All fields are just plain floats; I believe you can create a dataframe just like this with no difficulty?

One thing to note is that in our dataframe we use multilevel columns. But I suppose that is not an issue?

Bob:
@jorisvandenbossche it seems you started calling this function, which is causing the issue:

https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L1118

Joris Van den Bossche / @jorisvandenbossche:
Thanks. If it is just floats, I'll try to reproduce based on that description. It's probably related to the fact that you have a very wide dataframe (n columns >> n rows). In general, the Parquet format is not well suited for that kind of data (even in 0.14, the 2 seconds to read is very slow). That said, it's still a performance regression compared to 0.14 that is worth looking into.
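
To illustrate why width matters (an editorial sketch, not from the thread): Parquet keeps metadata such as statistics and offsets per column chunk, so the same values laid out as many columns carry far more metadata than as many rows.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

values = np.random.randn(100_000)

# Same data, two layouts: 10 columns x 10,000 rows vs. 10,000 columns x 10 rows.
tall = pa.table({'c' + str(i): values[i * 10_000:(i + 1) * 10_000] for i in range(10)})
wide = pa.table({'c' + str(i): values[i * 10:(i + 1) * 10] for i in range(10_000)})

pq.write_table(tall, "tall.parquet")
pq.write_table(wide, "wide.parquet")

# The wide file carries 1000x more column-chunk metadata for identical values.
print(pq.ParquetFile("tall.parquet").metadata)
print(pq.ParquetFile("wide.parquet").metadata)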

Bob:
@jorisvandenbossche thanks, let me know if I can help. We are a very special case here, I think. Also, I am not sure whether the multilevel columns add any complexity; it seems parquet does not handle them very well?
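
For what it's worth, a hypothetical stand-in for data of this shape (the real file was never shared) that also exercises the multilevel columns, assuming plain float values:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical stand-in: 61 rows of plain floats under a two-level
# (MultiIndex) column header, 5 * 6367 = 31835 columns in total,
# which is also roughly 15 MB of float64 values.
cols = pd.MultiIndex.from_product(
    [['group' + str(g) for g in range(5)], ['f' + str(i) for i in range(6367)]]
)
df = pd.DataFrame(np.random.randn(61, len(cols)), columns=cols)

# pyarrow flattens the MultiIndex into flat string field names and records
# the original levels in the pandas metadata, so the round trip restores it.
pq.write_table(pa.Table.from_pandas(df), "wide_multiindex.parquet")
back = pq.read_table("wide_multiindex.parquet").to_pandas()
assert back.columns.equals(df.columns)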

Bob:
I also tried fastparquet as the engine and it just threw an error when reading the file. It seems it simply cannot decode it.
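
For context, switching engines in pandas looks like this ("path" is a stand-in for the reporter's file, which was not shared):

import pandas as pd

path = "data.parquet"  # stand-in; the real file was not shared

df = pd.read_parquet(path, engine="pyarrow")      # the default when pyarrow is installed
df = pd.read_parquet(path, engine="fastparquet")  # the engine tried here, which failed to decode the file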

Joris Van den Bossche / @jorisvandenbossche:
Small reproducer:

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'c' + str(i): np.random.randn(10) for i in range(10000)})
pq.write_table(table, "test_wide.parquet")
res = pq.read_table("test_wide.parquet")
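
For reference, the numbers quoted in this thread come from IPython's %timeit; outside IPython the read can be timed roughly like this (a sketch):

import timeit

import pyarrow.parquet as pq

# Average the read over a few runs, mirroring %timeit's report.
n = 7
t = timeit.timeit(lambda: pq.read_table("test_wide.parquet"), number=n) / n
print("mean read time over", n, "runs:", round(t, 3), "s")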

Wes McKinney / @wesm:
Marked this for 0.15.1

Antoine Pitrou / @pitrou:
Issue resolved by pull request #5653.

Axel:
Hi, I am still experiencing some very slow load times with version 0.15.1.

With the reproducer above:

0.14.1:
282 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

0.15.1:
5.06 s ± 288 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

From reading the GitHub issue, I expected it to be slower than 0.14.1, but not by this much.

Joris Van den Bossche / @jorisvandenbossche:
[~axelg] would you be able to share a reproducible example? (e.g. the data, or code that creates a dummy dataset with the same characteristics and shows the problem)

Axel:
Sure! For the numbers above I used the exact same example you posted above.

import pyarrow as paimport pyarrow.parquet as pq
table = pa.table({'c' + str(i): np.random.randn(10) for i in range(10000)})
pq.write_table(table, "test_wide.parquet")
res = pq.read_table("test_wide.parquet")
 

Joris Van den Bossche / @jorisvandenbossche:
Ah, sorry, I missed the "With the reproducer above:" in your message.

I see a similar difference locally; it's indeed not the speed-up that @wesm reported on the PR: #5653 (comment) (this might depend on the machine / number of cores?)
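
One way to probe the machine / core-count hypothesis (an editorial sketch, not from the thread) is to toggle pyarrow's threaded reads:

import pyarrow as pa
import pyarrow.parquet as pq

print("cores available to pyarrow:", pa.cpu_count())

# Compare a single-threaded read against the default multi-threaded one;
# if the regression depends on core count, the gap should change.
res_single = pq.read_table("test_wide.parquet", use_threads=False)
res_multi = pq.read_table("test_wide.parquet", use_threads=True)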

Wes McKinney / @wesm:
I think there is another JIRA for follow-up investigation; can we move the discussion there?

Joris Van den Bossche / @jorisvandenbossche:
The open issue about this is ARROW-7059.
