[Python][Dataset] Expose schema inference / validation options in the factory #24418

asfimport opened this issue Mar 25, 2020 · 4 comments
asfimport commented Mar 25, 2020

ARROW-8058 added options related to schema inference / validation for the Dataset factory. We should expose these options in Python in the dataset(...) factory function:

  • Add ability to pass a user-specified schema with a schema keyword, instead of inferring the schema from (one of) the files (to be passed to the factory finish method)

  • Add validate_schema option to toggle whether the schema is validated against the actual files or not.

  • Expose in some way the number of fragments to be inspected when inferring or validating the schema. Not sure yet what the best API for this would be.
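To make the trade-off in the last bullet concrete, here is a minimal stdlib-Python sketch, not pyarrow's actual implementation: schemas are modeled as plain dicts mapping column name to type name, and all function and parameter names (infer_schema, validate_schema, fragments_to_inspect) are hypothetical.

```python
# Hypothetical sketch of factory-style schema inference/validation.
# Schemas are plain dicts {column_name: type_name}; this is NOT the
# pyarrow Dataset API, just an illustration of the trade-off between
# inspecting few fragments (fast) and inspecting all of them (complete).

def infer_schema(fragments, fragments_to_inspect=1):
    """Unify the schemas of the first `fragments_to_inspect` fragments."""
    unified = {}
    for frag in fragments[:fragments_to_inspect]:
        for name, typ in frag.items():
            if unified.setdefault(name, typ) != typ:
                raise ValueError(f"conflicting types for column {name!r}")
    return unified

def validate_schema(schema, fragments):
    """Check that every fragment's columns are a compatible subset of `schema`."""
    for i, frag in enumerate(fragments):
        for name, typ in frag.items():
            if schema.get(name) != typ:
                raise ValueError(f"fragment {i} does not match schema: {name!r}")

fragments = [{"a": "int64"}, {"a": "int64", "b": "float64"}]

# Inspecting only the first fragment misses column 'b' ...
print(infer_schema(fragments, fragments_to_inspect=1))  # {'a': 'int64'}
# ... while inspecting both fragments picks it up.
print(infer_schema(fragments, fragments_to_inspect=2))
```

The sketch shows why "how many fragments to inspect" matters: inferring from a single fragment is cheap but can miss columns, whereas validating against all fragments catches mismatches at the cost of touching every file.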

    Some relevant notes from the original PR: ARROW-8058: [Dataset] Relax DatasetFactory discovery validation #6687 (comment)

Reporter: Joris Van den Bossche / @jorisvandenbossche


Note: This issue was originally created as ARROW-8221. Please see the migration documentation for further details.


Weston Pace / @westonpace:
I think another thing that would be included in this work is the ability to specify columns that exist in some, but not all, of the items in the dataset. Today, if I specify column names, I get an error when the first table doesn't contain a column, even if the other tables do.


Joris Van den Bossche / @jorisvandenbossche:
@westonpace do you have an example? Normally, missing columns in one of the files is something that should be supported already (they get filled with nulls).

In [1]: import pyarrow as pa

In [2]: import pyarrow.dataset as ds

In [3]: table = pa.table({'a': [1, 2, 3]})

In [4]: ds.write_dataset(table, "test_columns", format="parquet")

In [5]: ds.dataset("test_columns").to_table().to_pandas()
Out[5]: 
   a
0  1
1  2
2  3

In [6]: ds.dataset("test_columns", schema=pa.schema([('a', 'int64'), ('b', 'float64')])).to_table().to_pandas()
Out[6]: 
   a   b
0  1 NaN
1  2 NaN
2  3 NaN
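The null-filling behavior in the session above can be sketched without pyarrow. This is a simplified stand-in, assuming tables as dicts of equal-length lists; the function name to_table and its schema parameter are illustrative, not pyarrow's implementation.

```python
# Conceptual sketch of projecting a table onto a user-specified schema,
# padding absent columns with nulls (dict-of-lists stand-in for a table;
# NOT the actual pyarrow implementation).

def to_table(table, schema):
    """Project `table` onto `schema`, padding missing columns with None."""
    n_rows = len(next(iter(table.values()), []))
    return {name: table.get(name, [None] * n_rows) for name in schema}

table = {"a": [1, 2, 3]}
print(to_table(table, schema=["a", "b"]))
# {'a': [1, 2, 3], 'b': [None, None, None]}
```

The point is the same as in the pyarrow session: a column present in the requested schema but absent from a given file is materialized as a null column of the right length, rather than raising an error.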


Krisztian Szucs / @kszucs:
Postponing it to 7.0


Apache Arrow JIRA Bot:
This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.
