[Python] ParquetDataset().read columns argument always returns partition column #20409
Comments
Wes McKinney / @wesm:
Joris Van den Bossche / @jorisvandenbossche:
Christian Thiel / @c-thiel:

Yes, my intention for "new_column" is for it to be added. This is, however, not primarily related to this issue: the code example above is just my usual test case for my own code, which modifies the dataframe to match a schema beforehand.

In my opinion the schema should be the single source of truth. Columns of the dataframe which are not part of the schema should therefore be dropped (or raise an error), and columns which are not in the dataframe should be added with the missing value corresponding to the schema dtype (or, again, raise an error).

I am not sure how the index should be handled. I really do not like that we cannot specify the dtype there. I believe this is because the index is saved in the Parquet metadata, which also implies that the information in the index, presumably the most important column, is not as easily available across platforms as a regular column. For all my applications I stopped writing the index to the Parquet file and use a regular Parquet column instead. If you make sure that column is the first column, the performance implications when using S3 are minimal, as no seek needs to be performed. This is also supported by the fact that

The only other major thing that bothers me is that ints can't be NaN. I really like the pandas Int64 columns; however, as far as I know this is not yet supported by Parquet, so that is a problem for another day.
Joris Van den Bossche / @jorisvandenbossche:

```python
In [26]: pq.ParquetDataset(PATH_PYARROW_MANUAL).read(columns=['DPRD_ID', 'strings']).to_pandas()
Out[26]:
   DPRD_ID strings  partition_column
0        0     nan                 0
1        1     nan                 0
2        2       a                 1
3        3       b                 1
```

vs

```python
In [28]: import pyarrow.dataset as ds

In [29]: ds.dataset(PATH_PYARROW_MANUAL).to_table(columns=['DPRD_ID', 'strings']).to_pandas()
Out[29]:
   DPRD_ID strings
0        0     nan
1        1     nan
2        2       a
3        3       b
```

So once we use the datasets API under the hood in `pyarrow.parquet` (ARROW-8039), this issue should be solved (we might want to add a test for it).
Francois Saint-Jacques / @fsaintjacques:
I just noticed that no matter which columns are specified when loading a dataset, the partition column is always returned. This can lead to strange behaviour, as the resulting dataframe has more columns than expected:

```
df_pq has column partition_column
```
Reporter: Christian Thiel / @c-thiel
Assignee: Joris Van den Bossche / @jorisvandenbossche
Related issues:
PRs and other links:
Note: This issue was originally created as ARROW-3861. Please see the migration documentation for further details.