[Python] Reading parquet file with many columns becomes slow for 0.15.0 #23204
Comments
Joris Van den Bossche / @jorisvandenbossche:
Bob:
In [6]: df.shape
All fields are just plain floats; I believe you can create a DataFrame just like this without difficulty.
One thing to note is that in our dataframe we use multilevel columns. But I suppose that is not an issue?
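For illustration, a wide, all-float DataFrame with two-level (multilevel) columns like the one described can be built as below. This is a sketch, not the reporter's data: the labels, shape, and values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the reporter's layout: all-float data
# with a two-level column index (labels here are invented).
cols = pd.MultiIndex.from_product([['a', 'b'], [f'f{i}' for i in range(5)]])
df = pd.DataFrame(np.random.randn(10, 10), columns=cols)

print(df.columns.nlevels)   # 2
print(df.dtypes.unique())   # only float64
```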
Bob:
https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L1118
Joris Van den Bossche / @jorisvandenbossche:
Bob:
Bob:
Joris Van den Bossche / @jorisvandenbossche:
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'c' + str(i): np.random.randn(10) for i in range(10000)})
pq.write_table(table, "test_wide.parquet")
res = pq.read_table("test_wide.parquet")
Wes McKinney / @wesm:
Antoine Pitrou / @pitrou:
Axel: With the reproducer above:
From reading the GitHub issue, I expected it to be slower than 0.14.1, but not by this much.
Joris Van den Bossche / @jorisvandenbossche:
Axel:
import pyarrow as pa
import pyarrow.parquet as pq
Joris Van den Bossche / @jorisvandenbossche: I see a similar difference locally; it's indeed not the speed-up that @wesm reported on the PR: #5653 (comment) (this might depend on the machine / number of cores?)
Wes McKinney / @wesm:
Joris Van den Bossche / @jorisvandenbossche:
Hi,
I just noticed that reading a parquet file has become really slow after I upgraded to 0.15.0, using pandas.
Example:
With 0.14.1
In [4]: %timeit df = pd.read_parquet(path)
2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With 0.15.0
In [5]: %timeit df = pd.read_parquet(path)
22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The file is about 15 MB in size. I am testing on the same machine using the same versions of Python and pandas.
Have you received similar complaints? What could be the issue here?
Thanks a lot.
Edit 1:
Some profiling I did:
0.14.1:
0.15.0:
Environment: Python 3.7
Reporter: Bob
Assignee: Wes McKinney / @wesm
Related issues:
Original Issue Attachments:
PRs and other links:
Note: This issue was originally created as ARROW-6876. Please see the migration documentation for further details.