CSV schema inference assumes `Utf8` for empty columns #4903

kskalski · 2023-10-09T07:23:17Z

Describe the bug
When the rows processed by schema inference contain no data for a given column (every value is empty), that column is inferred as nullable `DataType::Utf8`. Utf8 is a "catch-all" type that accepts any values that show up later, but returning it here is limiting and to a degree incorrect: it leads the user to assume the column did contain data, and that the data was a string or something else that forced the string type.

To Reproduce

int_column,null_column,string_column
1,,"a"
2,,"b"

Expected behavior
Inference should return int*, null, utf8

Additional context
My algorithm uses inference with a limited number of rows as a best-effort, incremental performance improvement: when I read some data and see that the inferred schema has nulls, I may repeat inference with more rows or without a row limit. If inference wrongly returns a data type that isn't there, I end up with an unnecessarily widened Utf8 data type, when in fact the column later contains just ints or booleans.

Another use case is that I have several files of the same shape (or several random offsets into the same file) and I want to infer a schema for each of them, then merge the schemas to see if any columns are still null.
With #4901 and a fix for the behavior described in this issue, I can implement the above strategy correctly.
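
The merge strategy described above could look roughly like the following. This is a minimal standalone sketch, not arrow-rs code: `ColumnType`, `merge_types`, `merge_schemas`, and `unresolved_columns` are hypothetical names, and the widening rules (int plus float gives float, any other conflict gives Utf8) are assumptions made for illustration. The key point is that `Null` acts as "no evidence yet" and yields to whatever another chunk found.

```rust
/// Hypothetical stand-in for Arrow's `DataType`, reduced to what the sketch needs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnType {
    Null,
    Boolean,
    Int64,
    Float64,
    Utf8,
}

/// Merge the types inferred for one column in two different chunks.
/// `Null` means "only empty values seen", so it never widens the other side.
fn merge_types(a: ColumnType, b: ColumnType) -> ColumnType {
    use ColumnType::*;
    match (a, b) {
        (Null, other) | (other, Null) => other,
        (x, y) if x == y => x,
        // Assumed widening rule: mixed ints and floats become floats.
        (Int64, Float64) | (Float64, Int64) => Float64,
        // Assumed fallback: any other conflict widens to the catch-all type.
        _ => Utf8,
    }
}

/// Merge per-chunk schemas column by column.
/// Returns `None` if there are no schemas or the column counts differ.
fn merge_schemas(schemas: &[Vec<ColumnType>]) -> Option<Vec<ColumnType>> {
    let mut iter = schemas.iter();
    let mut merged = iter.next()?.clone();
    for schema in iter {
        if schema.len() != merged.len() {
            return None;
        }
        for (m, t) in merged.iter_mut().zip(schema) {
            *m = merge_types(*m, *t);
        }
    }
    Some(merged)
}

/// Indices of columns that are still `Null` after merging, i.e. the columns
/// for which inference may need to be repeated with more rows.
fn unresolved_columns(schema: &[ColumnType]) -> Vec<usize> {
    schema
        .iter()
        .enumerate()
        .filter(|(_, t)| **t == ColumnType::Null)
        .map(|(i, _)| i)
        .collect()
}
```

If an all-empty column were reported as Utf8 instead of Null, `merge_types` would have no way to tell "no data" from "strings", and every such column would stay Utf8 after the merge.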


tustvold · 2023-10-09T16:33:22Z

This should be a relatively straightforward change to the logic in `InferredDataType::get` to return `Null` if `self.packed == 0`.
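
To illustrate the idea, here is a simplified, standalone model of a bit-packed inferred type. It is only a sketch under assumptions: the real `InferredDataType` in the arrow crate tracks more types and uses a different layout, and the `DataType` enum, the bit constants, `update`, and `infer_column` below are hypothetical stand-ins. What it shows is the suggested rule: if no bit was ever set (`packed == 0`), every value seen was empty, so `get` reports `Null` rather than falling through to `Utf8`.

```rust
/// Hypothetical stand-in for Arrow's `DataType`, reduced to what the sketch needs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Null,
    Boolean,
    Int64,
    Float64,
    Utf8,
}

// One bit per kind of value observed in the column (assumed layout).
const BOOLEAN: u8 = 1 << 0;
const INTEGER: u8 = 1 << 1;
const FLOAT: u8 = 1 << 2;
const STRING: u8 = 1 << 3;

/// Accumulates which kinds of values a column has contained so far.
#[derive(Default)]
struct InferredDataType {
    packed: u8,
}

impl InferredDataType {
    /// Record one raw CSV field. Empty fields carry no type information,
    /// so they leave `packed` untouched.
    fn update(&mut self, value: &str) {
        if value.is_empty() {
            return;
        }
        self.packed |= if value == "true" || value == "false" {
            BOOLEAN
        } else if value.parse::<i64>().is_ok() {
            INTEGER
        } else if value.parse::<f64>().is_ok() {
            FLOAT
        } else {
            STRING
        };
    }

    /// Resolve the accumulated bits to a single type.
    fn get(&self) -> DataType {
        match self.packed {
            // No non-empty value was seen: report Null, not Utf8.
            0 => DataType::Null,
            BOOLEAN => DataType::Boolean,
            INTEGER => DataType::Int64,
            b if b == FLOAT || b == (FLOAT | INTEGER) => DataType::Float64,
            // Any other mix falls back to the catch-all string type.
            _ => DataType::Utf8,
        }
    }
}

/// Infer the type of one column from its raw field values.
fn infer_column(values: &[&str]) -> DataType {
    let mut inferred = InferredDataType::default();
    for value in values {
        inferred.update(value);
    }
    inferred.get()
}
```

Applied to the three columns of the reproduction above (with CSV quoting already stripped), this model yields an integer type, `Null`, and `Utf8`, which matches the expected behavior in the report.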

tustvold · 2023-10-18T09:42:57Z

label_issue.py automatically added labels {'arrow'} from #4910

kskalski added the bug label Oct 9, 2023

kskalski mentioned this issue Oct 9, 2023

fix(csv)!: infer null for empty column. #4910

Merged

tustvold closed this as completed in #4910 Oct 10, 2023

tustvold added the arrow Changes to the arrow crate label Oct 18, 2023
