
[Python] PyArrow fails to load partitioned parquet files with non-primitive types #17479

Closed
asfimport opened this issue Sep 4, 2017 · 6 comments

When reading partitioned parquet files that contain list columns (tested with files produced by Spark), the resulting table appears to contain list data loaded from only one partition. Primitive types seem to be loaded correctly.

It can be reproduced using the following code (arrow 0.6.0, spark 2.1.1):

>>> df = spark.createDataFrame(list(zip(np.arange(10).tolist(), np.arange(20).reshape((10,2)).tolist())))
>>> df.toPandas()
   _1        _2
0   0    [0, 1]
1   1    [2, 3]
2   2    [4, 5]
3   3    [6, 7]
4   4    [8, 9]
5   5  [10, 11]
6   6  [12, 13]
7   7  [14, 15]
8   8  [16, 17]
9   9  [18, 19]
>>> df.repartition(2).write.parquet('df_parts.parquet')
>>> pq.read_table('df_parts.parquet').to_pandas()
   _1        _2
0   0    [0, 1]
1   2    [4, 5]
2   4    [8, 9]
3   6  [12, 13]
4   8  [16, 17]
5   1    [0, 1]
6   3    [4, 5]
7   5    [8, 9]
8   7  [12, 13]
9   9  [16, 17]

When the data is loaded using Spark or coalesced into one partition, everything works as expected:

>>> spark.read.parquet('df_parts.parquet').toPandas()
   _1        _2
0   1    [2, 3]
1   3    [6, 7]
2   5  [10, 11]
3   7  [14, 15]
4   9  [18, 19]
5   0    [0, 1]
6   2    [4, 5]
7   4    [8, 9]
8   6  [12, 13]
9   8  [16, 17]
>>> df.coalesce(1).write.parquet('df_single.parquet')
>>> pq.read_table('df_single.parquet').to_pandas()
   _1        _2
0   0    [0, 1]
1   1    [2, 3]
2   2    [4, 5]
3   3    [6, 7]
4   4    [8, 9]
5   5  [10, 11]
6   6  [12, 13]
7   7  [14, 15]
8   8  [16, 17]
9   9  [18, 19]

Reporter: Jonas Amrich
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-1459. Please see the migration documentation for further details.


Wes McKinney / @wesm:
I think this is ARROW-1357, which has already been fixed on trunk. If you are on Linux, can you try out a nightly build and confirm?

conda install pyarrow -c twosigma


Jonas Amrich:
Yes, that looks very similar - I must have overlooked that issue before. However, it seems that the fix doesn't solve the problem. Using 0.6.1.dev64+g9968d95d only makes things stranger:

>>> pq.read_table('df_parts.parquet').to_pandas()
   _1        _2
0   5  [10, 11]
1   7  [14, 15]
2   9  [18, 19]
3   0    [0, 1]
4   2    [4, 5]
5   4    [8, 9]
6   6    [0, 1]
7   8    [4, 5]
8   1    [8, 9]
9   3  [12, 13]


Wes McKinney / @wesm:
OK, one of us (@xhochy or me) will have to take a look so we can resolve this before 0.7.0 final goes out. If you find the problem, feel free to submit a patch.


Jonas Amrich:
I'll try to look deeper into this. However, I'm not so familiar with Arrow's internals, so I expect it will take some time.


Wes McKinney / @wesm:
PR: #1090


Uwe Korn / @xhochy:
Issue resolved by pull request #1090
