[Python][Dataset] Expose schema inference / validation options in the factory #24418

asfimport opened this issue Mar 25, 2020 · 4 comments
asfimport commented Mar 25, 2020

ARROW-8058 added options related to schema inference / validation for the Dataset factory. We should expose these options in Python in the dataset(...) factory function:

  • Add ability to pass a user-specified schema with a schema keyword, instead of inferring the schema from (one of) the files (to be passed to the factory finish method)

  • Add validate_schema option to toggle whether the schema is validated against the actual files or not.

  • Expose in some way the number of fragments to be inspected when inferring or validating the schema. Not sure yet what the best API for this would be.
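To make the trade-off in the last bullet concrete, here is a minimal stdlib-Python sketch, not pyarrow's actual implementation: schemas are modeled as plain dicts mapping column name to type name, and all function and parameter names (infer_schema, validate_schema, fragments_to_inspect) are hypothetical.

```python
# Hypothetical sketch of factory-style schema inference/validation.
# Schemas are plain dicts {column_name: type_name}; this is NOT the
# pyarrow Dataset API, just an illustration of the trade-off between
# inspecting few fragments (fast) and inspecting all of them (complete).

def infer_schema(fragments, fragments_to_inspect=1):
    """Unify the schemas of the first `fragments_to_inspect` fragments."""
    unified = {}
    for frag in fragments[:fragments_to_inspect]:
        for name, typ in frag.items():
            if unified.setdefault(name, typ) != typ:
                raise ValueError(f"conflicting types for column {name!r}")
    return unified

def validate_schema(schema, fragments):
    """Check that every fragment's columns are a compatible subset of `schema`."""
    for i, frag in enumerate(fragments):
        for name, typ in frag.items():
            if schema.get(name) != typ:
                raise ValueError(f"fragment {i} does not match schema: {name!r}")

fragments = [{"a": "int64"}, {"a": "int64", "b": "float64"}]

# Inspecting only the first fragment misses column 'b' ...
print(infer_schema(fragments, fragments_to_inspect=1))  # {'a': 'int64'}
# ... while inspecting both fragments picks it up.
print(infer_schema(fragments, fragments_to_inspect=2))
```

The sketch shows why "how many fragments to inspect" matters: inferring from a single fragment is cheap but can miss columns, whereas validating against all fragments catches mismatches at the cost of touching every file.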

    Some relevant notes from the original PR: ARROW-8058: [Dataset] Relax DatasetFactory discovery validation #6687 (comment)

Reporter: Joris Van den Bossche / @jorisvandenbossche


Note: This issue was originally created as ARROW-8221. Please see the migration documentation for further details.


Weston Pace / @westonpace:
I think another thing that would be included in this work is the ability to specify columns that exist in some, but not all, of the items in the dataset. Today, if I specify column names, I get an error when the first table doesn't contain a column, even if the other tables do.


Joris Van den Bossche / @jorisvandenbossche:
@westonpace do you have an example? Normally, missing columns in one of the files is something that should be supported already (they get filled with nulls).

In [1]: import pyarrow as pa

In [2]: import pyarrow.dataset as ds

In [3]: table = pa.table({'a': [1, 2, 3]})

In [4]: ds.write_dataset(table, "test_columns", format="parquet")

In [5]: ds.dataset("test_columns").to_table().to_pandas()
Out[5]: 
   a
0  1
1  2
2  3

In [6]: ds.dataset("test_columns", schema=pa.schema([('a', 'int64'), ('b', 'float64')])).to_table().to_pandas()
Out[6]: 
   a   b
0  1 NaN
1  2 NaN
2  3 NaN
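The null-filling behavior in the session above can be sketched without pyarrow. This is a simplified stand-in, assuming tables as dicts of equal-length lists; the function name to_table and its schema parameter are illustrative, not pyarrow's implementation.

```python
# Conceptual sketch of projecting a table onto a user-specified schema,
# padding absent columns with nulls (dict-of-lists stand-in for a table;
# NOT the actual pyarrow implementation).

def to_table(table, schema):
    """Project `table` onto `schema`, padding missing columns with None."""
    n_rows = len(next(iter(table.values()), []))
    return {name: table.get(name, [None] * n_rows) for name in schema}

table = {"a": [1, 2, 3]}
print(to_table(table, schema=["a", "b"]))
# {'a': [1, 2, 3], 'b': [None, None, None]}
```

The point is the same as in the pyarrow session: a column present in the requested schema but absent from a given file is materialized as a null column of the right length, rather than raising an error.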


Krisztian Szucs / @kszucs:
Postponing it to 7.0


Apache Arrow JIRA Bot:
This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.
