CSV schema inference assumes Utf8 for empty columns #4903

Closed
kskalski opened this issue Oct 9, 2023 · 2 comments · Fixed by #4910
Labels
arrow Changes to the arrow crate bug

Comments

Contributor

kskalski commented Oct 9, 2023

Describe the bug
When the rows processed by schema inference contain no data for a given column (every value is empty), that column is inferred as nullable DataType::Utf8. Utf8 is a "catch-all" type that accepts any value read later, but this behavior is limiting and, to a degree, incorrect: the user is led to assume that the column did contain some data, and that the data was a string or something else that forced the string type.

To Reproduce

int_column,null_column,string_column
1,,"a"
2,,"b"

Expected behavior
Inference should return int*, null, utf8

Additional context
My algorithm uses inference with a limited number of rows as a kind of best-effort, incremental performance improvement: when I read some data and see that the inferred schema has nulls, I may repeat inference with more rows or without a row limit. If inference wrongly returns a data type that isn't there, I end up with an unnecessarily widened Utf8 data type, while later on this column actually contains just ints or booleans.
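
The retry strategy above can be sketched roughly as follows. This is a simplified, std-only illustration rather than arrow code: `columns_with_data` and `infer_incrementally` are hypothetical names, rows are assumed to be already split into unquoted fields, and "the column is still null" is modeled as a plain `false` flag. It only shows why the caller needs an honest "no data seen" signal; if empty columns were reported as Utf8, the loop would have no reason to retry.

```rust
/// Hypothetical helper: for each column, reports whether any of the first
/// `limit` rows (all rows when `limit` is `None`) holds a non-empty value.
fn columns_with_data(rows: &[Vec<&str>], limit: Option<usize>) -> Vec<bool> {
    let take = limit.unwrap_or(rows.len()).min(rows.len());
    let width = rows.first().map_or(0, |r| r.len());
    let mut seen = vec![false; width];
    for row in &rows[..take] {
        for (s, field) in seen.iter_mut().zip(row) {
            // An empty field carries no type information.
            *s |= !field.is_empty();
        }
    }
    seen
}

/// Hypothetical driver: tries each row limit in turn and stops as soon as
/// every column has produced at least one value. Returns the final flags
/// and the number of inference passes that were needed.
fn infer_incrementally(rows: &[Vec<&str>], limits: &[Option<usize>]) -> (Vec<bool>, usize) {
    let mut attempts = 0;
    let mut seen = Vec::new();
    for &limit in limits {
        attempts += 1;
        seen = columns_with_data(rows, limit);
        if seen.iter().all(|&s| s) {
            break;
        }
    }
    (seen, attempts)
}
```

Calling `infer_incrementally(&rows, &[Some(2), None])` would first look at two rows and fall back to a full scan only if some column is still empty.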

Another use case: I have several files of the same shape (or several random offsets into the same file) and I want to infer a schema for each of them, then merge the schemas to see whether any columns are still null.
With #4901 and a fix for the behavior described in this issue, I can implement the above strategy correctly.
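
A minimal sketch of that merge step, under stated assumptions: `ColType`, `merge_types`, `merge_schemas` and `unresolved_columns` are hypothetical names, not arrow APIs, and the widening rules (Null yields to any other type, Int64 widens to Float64, any other conflict falls back to Utf8) are chosen for illustration only.

```rust
/// Hypothetical stand-in for the handful of data types discussed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColType {
    Null,
    Boolean,
    Int64,
    Float64,
    Utf8,
}

/// Merges two per-column guesses. `Null` means "no data seen", so it
/// yields to whatever the other chunk found (assumed rule).
fn merge_types(a: ColType, b: ColType) -> ColType {
    use ColType::*;
    match (a, b) {
        (Null, other) | (other, Null) => other,
        (x, y) if x == y => x,
        (Int64, Float64) | (Float64, Int64) => Float64,
        _ => Utf8,
    }
}

/// Merges schemas column by column. Returns `None` when there is nothing
/// to merge or the schemas differ in width.
fn merge_schemas(schemas: &[Vec<ColType>]) -> Option<Vec<ColType>> {
    let mut iter = schemas.iter();
    let mut merged = iter.next()?.clone();
    for schema in iter {
        if schema.len() != merged.len() {
            return None;
        }
        for (m, t) in merged.iter_mut().zip(schema) {
            *m = merge_types(*m, *t);
        }
    }
    Some(merged)
}

/// Indices of columns for which no chunk has produced any data yet.
fn unresolved_columns(schema: &[ColType]) -> Vec<usize> {
    schema
        .iter()
        .enumerate()
        .filter(|(_, t)| **t == ColType::Null)
        .map(|(i, _)| i)
        .collect()
}
```

If empty columns were reported as Utf8 instead of Null, `merge_types` would have to widen an all-integer column to Utf8 as soon as one chunk happened to contain only empty values for it.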

@kskalski kskalski added the bug label Oct 9, 2023
Contributor

tustvold commented Oct 9, 2023

This should be a relatively straightforward change to the logic in InferredDataType::get: return Null if self.packed == 0.
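
The idea can be illustrated with a small std-only sketch. It is not the arrow-csv implementation: `SketchType`, `InferredSketch`, the bit layout, and the parsing rules in `update` are all assumptions made for this example. The only part taken from the comment above is that a bitmask with no bits set (`packed == 0`) should map to Null instead of falling through to the Utf8 catch-all.

```rust
/// Hypothetical stand-in for the Arrow data types relevant here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchType {
    Null,
    Boolean,
    Int64,
    Float64,
    Utf8,
}

// One bit per kind of value observed in a column (assumed layout).
const BOOLEAN: u8 = 1 << 0;
const INTEGER: u8 = 1 << 1;
const FLOAT: u8 = 1 << 2;
const STRING: u8 = 1 << 3;

/// Accumulates which kinds of values a column has contained so far.
#[derive(Debug, Default, Clone, Copy)]
struct InferredSketch {
    packed: u8,
}

impl InferredSketch {
    /// Records one already-unquoted CSV field. Empty fields set no bit.
    fn update(&mut self, field: &str) {
        if field.is_empty() {
            return;
        }
        self.packed |= if field.eq_ignore_ascii_case("true")
            || field.eq_ignore_ascii_case("false")
        {
            BOOLEAN
        } else if field.parse::<i64>().is_ok() {
            INTEGER
        } else if field.parse::<f64>().is_ok() {
            FLOAT
        } else {
            STRING
        };
    }

    /// Resolves the accumulated bits to a single type.
    fn get(&self) -> SketchType {
        match self.packed {
            // No bits set: the column never held a value.
            0 => SketchType::Null,
            BOOLEAN => SketchType::Boolean,
            INTEGER => SketchType::Int64,
            FLOAT => SketchType::Float64,
            // Mixed integers and floats widen to Float64.
            p if p == (INTEGER | FLOAT) => SketchType::Float64,
            // Any other mixture falls back to the catch-all.
            _ => SketchType::Utf8,
        }
    }
}
```

Feeding the three columns of the reproduction above through `update` and then calling `get` yields Int64, Null, and Utf8 in this sketch, which matches the expected behavior in the report.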

Contributor

label_issue.py automatically added labels {'arrow'} from #4910

@tustvold tustvold added the arrow Changes to the arrow crate label Oct 18, 2023