Describe the bug
When the rows processed by schema inference contain no data for a given column (every value is empty), that column is inferred as nullable DataType::Utf8. This data type is a "catch-all" that permits any values later on, but the behavior is limiting and to a degree incorrect: the user is led to assume that the column did contain some data, and that it was a string or something else that forced the string type.
Expected behavior
Inference should return int*, null, utf8
Additional context
My algorithm uses inference with a limited number of rows as a best-effort, incremental performance improvement: when I read some data and see that the inferred schema has nulls, I may repeat inference with more rows or without a row limit. If inference wrongly returns a data type that isn't there, I end up with an unnecessarily widened Utf8 data type, while the column actually contains only ints or booleans later on.
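The retry strategy above can be sketched roughly as follows. This is a minimal, std-only illustration and not the arrow API: the `Inferred` enum, `infer_cell`, `widen`, `infer_column`, and `infer_incrementally` are all hypothetical names, and the type rules (only booleans, 64-bit ints, and strings) are a deliberate simplification. The point it tries to show is that a column with no data must come back as `Null` for the caller to know a retry is worthwhile.

```rust
/// Hypothetical, simplified stand-in for an inferred column type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Inferred {
    /// No non-empty value has been seen yet.
    Null,
    Boolean,
    Int64,
    Utf8,
}

/// Classify one raw cell; an empty cell carries no type information.
fn infer_cell(cell: &str) -> Inferred {
    let cell = cell.trim();
    if cell.is_empty() {
        Inferred::Null
    } else if cell.parse::<i64>().is_ok() {
        Inferred::Int64
    } else if cell.eq_ignore_ascii_case("true") || cell.eq_ignore_ascii_case("false") {
        Inferred::Boolean
    } else {
        Inferred::Utf8
    }
}

/// Combine two observations: Null yields to anything, equal types stay,
/// and any real conflict falls back to Utf8.
fn widen(a: Inferred, b: Inferred) -> Inferred {
    match (a, b) {
        (Inferred::Null, other) | (other, Inferred::Null) => other,
        (x, y) if x == y => x,
        _ => Inferred::Utf8,
    }
}

/// Infer a column type from at most `limit` cells (all cells if `None`).
pub fn infer_column(cells: &[&str], limit: Option<usize>) -> Inferred {
    let n = limit.unwrap_or(cells.len()).min(cells.len());
    cells[..n].iter().map(|c| infer_cell(c)).fold(Inferred::Null, widen)
}

/// Best-effort inference: try a cheap prefix first and rescan without a
/// limit only when the prefix contained no data at all.
pub fn infer_incrementally(cells: &[&str], first_limit: usize) -> Inferred {
    match infer_column(cells, Some(first_limit)) {
        Inferred::Null => infer_column(cells, None),
        resolved => resolved,
    }
}
```

If `infer_column` reported `Utf8` instead of `Null` for an all-empty prefix, `infer_incrementally` would have no signal to rescan and the column would stay typed as a string.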
Another use case is that I have several files of the same shape (or several random offsets into the same file) and I want to infer a schema for each of them, then merge the schemas to see whether any columns are still null.
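A rough sketch of that merge step, again std-only and under assumptions: `ColType`, `merge_types`, `merge_schemas`, and `unresolved_columns` are hypothetical names, columns are matched by position, and the widening rules are simplified. It is not how arrow merges schemas.

```rust
/// Hypothetical per-column inference result for one file or offset.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColType {
    /// The sampled rows held no data for this column.
    Null,
    Boolean,
    Int64,
    Utf8,
}

/// Null yields to any concrete type; conflicting concrete types widen to Utf8.
fn merge_types(a: ColType, b: ColType) -> ColType {
    match (a, b) {
        (ColType::Null, other) | (other, ColType::Null) => other,
        (x, y) if x == y => x,
        _ => ColType::Utf8,
    }
}

/// Merge several positional schemas column by column.
/// A column missing from a shorter schema is treated as Null.
pub fn merge_schemas(schemas: &[Vec<ColType>]) -> Vec<ColType> {
    let width = schemas.iter().map(|s| s.len()).max().unwrap_or(0);
    (0..width)
        .map(|i| {
            schemas
                .iter()
                .map(|s| s.get(i).copied().unwrap_or(ColType::Null))
                .fold(ColType::Null, merge_types)
        })
        .collect()
}

/// Indices of columns that are still Null after merging, i.e. columns
/// for which no sampled chunk contained any data.
pub fn unresolved_columns(schema: &[ColType]) -> Vec<usize> {
    schema
        .iter()
        .enumerate()
        .filter(|(_, t)| **t == ColType::Null)
        .map(|(i, _)| i)
        .collect()
}
```

For example, merging a chunk inferred as `[Null, Int64, Null]` with one inferred as `[Boolean, Int64, Null]` gives `[Boolean, Int64, Null]`, and only the last column remains unresolved. Had the first chunk reported `Utf8` for its empty first column, the merged type would have been widened to `Utf8` even though the only real data was boolean.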
With #4901 and a fix for the behavior described in this issue, I can implement the above strategy correctly.