Frames: Ensure nulls are read as default values when appropriate.#14020
LakshSingla merged 4 commits into apache:master from
Conversation
Fixes a bug where LongFieldWriter didn't write a properly transformed zero when writing out a null. This had no meaningful effect in SQL-compatible null handling mode, because the field would get treated as a null anyway. But it does have an effect in default-value mode: it would cause Long.MIN_VALUE to get read out instead of zero. Also adds NullHandling checks to the various frame-based column selectors, allowing reading of nullable frames by servers in default-value mode.
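The described symptom can be modeled with a minimal sketch. This assumes a sign-flip style transform (XOR with `Long.MIN_VALUE`), which is a common way to make signed longs byte-comparable; the actual transform in Druid's frame field writers may differ in detail, so treat the names and constants here as illustrative only:

```java
// Minimal model of why writing an untransformed zero for a null
// reads back as Long.MIN_VALUE. Frame long fields are assumed here to be
// stored sign-flipped so byte-wise comparison matches signed ordering.
public class LongTransformSketch
{
    static long transform(long n)   { return n ^ Long.MIN_VALUE; }
    static long detransform(long n) { return n ^ Long.MIN_VALUE; }

    public static void main(String[] args)
    {
        // Correct path: write transform(0), read back 0.
        assert detransform(transform(0L)) == 0L;

        // Buggy path: write a raw (untransformed) 0 for a null. The reader
        // applies the inverse transform and sees Long.MIN_VALUE, not 0.
        assert detransform(0L) == Long.MIN_VALUE;
    }
}
```

In SQL-compatible mode the bad stored value is masked because the field is treated as null anyway; only default-value mode actually reads the long back out, which is why the bug surfaced there.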
```java
final String sql = "INSERT INTO foo1\n"
    + "SELECT TIME_PARSE(dim1) AS __time, dim1 as cnt\n"
    + "FROM foo\n"
    + "PARTITIONED BY DAY\n"
    + "CLUSTERED BY dim1";
```
If you were to add a `WHERE TIME_PARSE(dim1) IS NOT NULL` to the query, do the results of the two queries become the same regardless of mode? I think they should, but curious.
I just tried it, and yes, they are the same in that case. Both queries ingest the two rows with parseable timestamps, and ignore the four with unparseable timestamps.
LakshSingla left a comment
Should the null handling be done in the `StringFieldWriter`s as well? Consider the following test case in CalciteQueryTest (permalink), which works with the native engine but doesn't with MSQ. Also, due to the ORDER BY, the order of the results produced can differ when comparing nulls versus comparing null with `""`:
SELECT dim1, dim2, SUM(cnt) AS thecnt
FROM druid.foo
GROUP BY dim1, dim2
HAVING SUM(cnt) = 1
ORDER BY dim2
LIMIT 4
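The ordering concern above can be sketched concretely. This is a minimal model (not Druid code): in SQL-compatible mode, null and the empty string are distinct sort keys, while default-value mode coerces null to `""`, so rows that previously sorted apart become ties:

```java
import java.util.Comparator;

// Minimal model of how ORDER BY dim2 can differ between null-handling modes.
public class NullOrderingSketch
{
    // SQL-compatible mode: nulls sort first, distinct from "".
    static final Comparator<String> ORDER =
        Comparator.nullsFirst(Comparator.naturalOrder());

    // Default-value mode: null is coerced to the empty string.
    static String coerce(String s) { return s == null ? "" : s; }

    public static void main(String[] args)
    {
        // SQL-compatible mode: null < "" < "a" — three distinct positions.
        assert ORDER.compare(null, "") < 0;
        assert ORDER.compare("", "a") < 0;

        // Default-value mode: null and "" collapse into a tie, so the
        // relative order of those rows under ORDER BY is no longer fixed.
        assert ORDER.compare(coerce(null), coerce("")) == 0;
    }
}
```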
@LakshSingla I didn't update the `StringFieldWriter` (or other writers, generally), but I did update the `StringFieldReader` to return

However, this didn't help with the test

I did update the test to run with MSQ in SQL-compat mode though, by adding:
The failing test has to do with branch coverage in
- To disambiguate the `coerce` mentioned above, I think you mean the one that is present in the `NativeQueryMaker`, right? If so, this will also fix a few other test cases that I was seeing fail because of a type mismatch between `DOUBLE` and `FLOAT`, so I think that would be pretty cool.

- Do we not require the changes in the `StringFrameColumnReader` for appropriate null handling as well? I see that there is something done in the `getStringUtf8` method, so I might be wrong.

- I think we should handle the nulls appropriately in the following line (permalink) as well, when the `StringFieldReader` has figured out that the field is a `NULL_BYTE`. WDYT?

- Unrelated to the change, while digging through that piece of code, I found the following condition (permalink):

  ```java
  if ((dataLength == 0 && NullHandling.replaceWithDefault()) ||
      (dataLength == 1 && memory.getByte(dataStart) == FrameWriterUtils.NULL_STRING_MARKER)) {
    return null;
  }
  ```

  Should it instead be:

  ```java
  if ((dataLength == 0 && NullHandling.replaceWithDefault()) ||
      (dataLength == 1 && memory.getByte(dataStart) == FrameWriterUtils.NULL_STRING_MARKER && NullHandling.replaceWithDefault())) {
    return null;
  }
  ```

Rest of the changes LGTM 🚀
Yes, that's the one I mean. In a future PR I'm planning to use this same logic for MSQ results too.
It's already there:
There's nothing special to do there, since in default-value mode, the convention is that nulls and empty strings are both returned as nulls from selectors. So the
I believe it's correct as-is. It's saying that we should return
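The behavior of the condition under discussion can be sketched with the memory read replaced by a plain byte array. The marker value here is a placeholder (the real constant lives in `FrameWriterUtils`), and this is a simplified model, not the actual reader code:

```java
// Minimal model of the null-detection condition discussed above.
public class NullMarkerSketch
{
    // Placeholder value; the real constant is FrameWriterUtils.NULL_STRING_MARKER.
    static final byte NULL_STRING_MARKER = (byte) 0xFF;

    static boolean readsAsNull(byte[] data, boolean replaceWithDefault)
    {
        return (data.length == 0 && replaceWithDefault)
            || (data.length == 1 && data[0] == NULL_STRING_MARKER);
    }

    public static void main(String[] args)
    {
        // The marker byte denotes a true null and should read as null in
        // BOTH modes, so no extra replaceWithDefault() check on that clause.
        assert readsAsNull(new byte[]{NULL_STRING_MARKER}, false);
        assert readsAsNull(new byte[]{NULL_STRING_MARKER}, true);

        // An empty string reads as null only in default-value mode.
        assert readsAsNull(new byte[]{}, true);
        assert !readsAsNull(new byte[]{}, false);
    }
}
```

Adding `NullHandling.replaceWithDefault()` to the marker clause, as the suggested change does, would make a genuinely-null field read as non-null in SQL-compatible mode, which is why the condition is correct as written.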
Thanks for the PR! Merging since codecov failures can be ignored due to #14020 (comment)
…ache#14020) * Frames: Ensure nulls are read as default values when appropriate. (cherry picked from commit d52bc33)
* MSQ: Subclass CalciteJoinQueryTest, other supporting changes.

The main change is the new tests: we now subclass CalciteJoinQueryTest in CalciteSelectJoinQueryMSQTest twice, once for Broadcast and once for SortMerge.

Three supporting production changes for default-value mode:

1) InputNumberDataSource is marked as concrete, to allow leftFilter to be pushed down to it.
2) In default-value mode, numeric frame field readers can now return nulls. This is necessary when stacking joins on top of joins: nulls must be preserved for semantics that match broadcast joins and native queries.
3) In default-value mode, StringFieldReader.isNull returns true on empty strings in addition to nulls. This is more consistent with the behavior of the selectors, which map empty strings to null as well in that mode.

As an effect of change (2), the InsertTimeNull change from #14020 (to replace null timestamps with default timestamps) is reverted. IMO, this is fine, as either behavior is defensible, and the change from #14020 hasn't been released yet.

* Adjust tests.
* Style fix.
* Additional tests.
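Change (3) above can be sketched as a simple predicate. This is a hypothetical simplification, not the actual `StringFieldReader` code, with the mode passed as a boolean rather than read from `NullHandling`:

```java
// Minimal model of StringFieldReader.isNull semantics per mode.
public class StringIsNullSketch
{
    static boolean isNull(String value, boolean replaceWithDefault)
    {
        return value == null || (replaceWithDefault && value.isEmpty());
    }

    public static void main(String[] args)
    {
        // null is null in both modes.
        assert isNull(null, true) && isNull(null, false);
        // Default-value mode: "" counts as null, matching selector behavior.
        assert isNull("", true);
        // SQL-compatible mode: "" is a real, non-null value.
        assert !isNull("", false);
    }
}
```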