pyarrow.csv.ParseOptions (https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html#pyarrow.csv.ParseOptions) allows skipping invalid rows by means of the invalid_row_handler option.
With pyarrow.csv.ConvertOptions (https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions), one can supply a schema (via column_types) to get correct types in the resulting table.
I have a data source that almost always follows a specific schema, but its data isn't validated beforehand. In practice, it's possible for a field which is int16 99.9% of the time to have an out-of-range value in a few rows.
I'd like to handle those cases in a way similar to invalid_row_handler, for example with an option to set failing conversions to NULL, or by supplying a handler that applies a more specific operation.
Reporter: Tim Loderhose
Note: This issue was originally created as ARROW-16834. Please see the migration documentation for further details.