
[Python] PyArrow fails to load partitioned parquet files with non-primitive types #17479

Closed
asfimport opened this issue Sep 4, 2017 · 6 comments

When reading partitioned parquet files that contain list columns (tested with files produced by Spark), the resulting table appears to contain list data loaded from only one partition. Primitive types seem to be loaded correctly.

It can be reproduced using the following code (arrow 0.6.0, spark 2.1.1):

>>> df = spark.createDataFrame(list(zip(np.arange(10).tolist(), np.arange(20).reshape((10,2)).tolist())))
>>> df.toPandas()
   _1        _2
0   0    [0, 1]
1   1    [2, 3]
2   2    [4, 5]
3   3    [6, 7]
4   4    [8, 9]
5   5  [10, 11]
6   6  [12, 13]
7   7  [14, 15]
8   8  [16, 17]
9   9  [18, 19]
>>> df.repartition(2).write.parquet('df_parts.parquet')
>>> pq.read_table('df_parts.parquet').to_pandas()
   _1        _2
0   0    [0, 1]
1   2    [4, 5]
2   4    [8, 9]
3   6  [12, 13]
4   8  [16, 17]
5   1    [0, 1]
6   3    [4, 5]
7   5    [8, 9]
8   7  [12, 13]
9   9  [16, 17]

When the data is loaded using Spark or coalesced into one partition, everything works as expected:

>>> spark.read.parquet('df_parts.parquet').toPandas()
   _1        _2
0   1    [2, 3]
1   3    [6, 7]
2   5  [10, 11]
3   7  [14, 15]
4   9  [18, 19]
5   0    [0, 1]
6   2    [4, 5]
7   4    [8, 9]
8   6  [12, 13]
9   8  [16, 17]
>>> df.coalesce(1).write.parquet('df_single.parquet')
>>> pq.read_table('df_single.parquet').to_pandas()
   _1        _2
0   0    [0, 1]
1   1    [2, 3]
2   2    [4, 5]
3   3    [6, 7]
4   4    [8, 9]
5   5  [10, 11]
6   6  [12, 13]
7   7  [14, 15]
8   8  [16, 17]
9   9  [18, 19]

Reporter: Jonas Amrich
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-1459. Please see the migration documentation for further details.


Wes McKinney / @wesm:
I think this is ARROW-1357, which has already been fixed on trunk. If you are on Linux, can you try out a nightly build and confirm?

conda install pyarrow -c twosigma


Jonas Amrich:
Yes, that looks very similar - I must have overlooked that issue before. However, it seems that the fix doesn't solve the problem. Using 0.6.1.dev64+g9968d95d only makes things stranger:

>>> pq.read_table('df_parts.parquet').to_pandas()
   _1        _2
0   5  [10, 11]
1   7  [14, 15]
2   9  [18, 19]
3   0    [0, 1]
4   2    [4, 5]
5   4    [8, 9]
6   6    [0, 1]
7   8    [4, 5]
8   1    [8, 9]
9   3  [12, 13]


Wes McKinney / @wesm:
OK, one of us (@xhochy or me) will have to take a look so we can resolve this before 0.7.0 final goes out. If you find the problem, feel free to submit a patch.


Jonas Amrich:
I'll try to look deeper into this. However, I'm not so familiar with Arrow's internals, so I expect it will take some time.


Wes McKinney / @wesm:
PR: #1090


Uwe Korn / @xhochy:
Issue resolved by pull request #1090
