
Add query context parameter to remove null bytes when writing frames #16579

Merged
merged 9 commits into apache:master on Jun 26, 2024

Conversation

LakshSingla
Contributor

@LakshSingla LakshSingla commented Jun 10, 2024

Description

This PR adds a new query context parameter, removeNullBytes, which users can set to remove null bytes while writing string and string-array data to frames. Today, to remove null bytes, users wrap the string columns in a REPLACE function. The query planner can sometimes optimize the query and apply the function in later stages, especially when subqueries are involved. This can cause the query to fail when run with the MSQ engine, which cannot work with null bytes in string fields. Adding the context parameter ensures that the null bytes are removed in the first stage itself, when the data is read.
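As a rough illustration (not the actual frame-writing code), the sanitization the parameter enables amounts to stripping \0 characters from string values before they are written to frames. The class and method names below are hypothetical; the real logic lives in Druid's frame writers:

```java
// Hypothetical sketch of the sanitization enabled by removeNullBytes=true;
// the real implementation operates on byte buffers inside Druid's frame writers.
public final class NullByteSanitizer
{
  // Strip all \0 characters from a string value before it is written to a frame.
  public static String removeNullBytes(final String value)
  {
    if (value == null || value.indexOf('\0') < 0) {
      return value; // fast path: most values contain no null bytes
    }
    return value.replace("\u0000", "");
  }
}
```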

Upgrade-Downgrade considerations

A mismatch of controller and worker versions shouldn't cause any backward incompatibilities. If a worker on an older version encounters this flag, it will ignore it and throw an error on encountering \0. Likewise, if the controller is on an older version, it won't pass this flag to the workers, and they will throw on encountering \0 (the default behavior). Therefore, in the presence of older versions, we behave in the default way and throw an exception, which is the original behavior of the code (without this change).

Release note

MSQ cannot process null bytes in string fields, and the current workaround is to remove them using the REPLACE function. The `removeNullBytes` context parameter has been added, which sanitizes input string fields by removing these null bytes.



This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Jun 10, 2024
@kgyrtkirk
Member

I wonder why we add an option to remove null bytes on writes instead of an option to remove them during reads. The main difference between the two is that if normalization is done during the write, the pass over the data may still have to tolerate/handle these differences; but after the write they disappear.

...so I wonder if there are any drawbacks to doing the normalization during the read instead?

@LakshSingla
Contributor Author

@kgyrtkirk I didn't fully understand this -

in case normalization is done during write - the pass on the data may have to tolerate/handle these differences; but after a write it disappears

We read the external data and write it to frames almost simultaneously. I am not sure if normalising it while reading would change much, unless I am misinterpreting the comment.

@kgyrtkirk
Member

I wonder if the following is true:

  • suppose there is a table with a column that contains a string with a \0
  • based on my interpretation of the PR, normalization happens at write time
  • the 1st stage will see the field containing the \0; so if it computes some function, say char_length, the \0 will be counted
  • in further stages, or if the data is persisted, the \0 will no longer be there

That's why I thought that normalizing at read time might be a better way to do this, as it would provide consistent behaviour for the 1st usage as well.

Now that I've thought about it a bit more: I guess in that case it would be harder to identify which columns should be normalized at read time (and I guess a \0 could possibly be added by a function as well). As this might be more complicated to do, maybe it isn't worth the effort.

@cryptoe cryptoe left a comment


Accidentally approved. How will it work if we directly stream stuff from the reader to the super sorter?

@cryptoe cryptoe left a comment


One comment; rest all LGTM. Thank you for adding tests.

)
{
if (allowNullBytes && removeNullBytes) {
Contributor

This is quite a hot piece of code. Do we want to have this check here?

Contributor

Instead, can we make public static methods which never allow this condition to happen?
We can make this a private method.

@@ -242,12 +241,36 @@ public static void verifySortColumns(
}
}

public static void copyByteBufferToMemoryAllowingNullBytes(
Contributor

Please add some Javadocs to both these public methods.
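A minimal sketch of the refactor suggested above, with hypothetical simplified signatures (the real `FrameWriterUtils` methods operate on byte buffers and memory, not `String`): the two public entry points each fix one mode and delegate to a private worker, so the invalid `allowNullBytes && removeNullBytes` combination can no longer be expressed by callers, and the hot-path check disappears.

```java
// Simplified, hypothetical sketch; real signatures in FrameWriterUtils differ.
public final class CopyUtils
{
  /**
   * Copies the value, permitting null bytes to pass through unchanged.
   */
  public static String copyAllowingNullBytes(final String value)
  {
    return copy(value, true, false);
  }

  /**
   * Copies the value, either removing null bytes or throwing when one is found,
   * depending on removeNullBytes.
   */
  public static String copyDisallowingNullBytes(final String value, final boolean removeNullBytes)
  {
    return copy(value, false, removeNullBytes);
  }

  // Private worker: allowNullBytes && removeNullBytes is unreachable from outside.
  private static String copy(final String value, final boolean allowNullBytes, final boolean removeNullBytes)
  {
    if (allowNullBytes) {
      return value;
    }
    if (removeNullBytes) {
      return value.replace("\u0000", "");
    }
    if (value.indexOf('\0') >= 0) {
      throw new IllegalArgumentException("Invalid null byte in value");
    }
    return value;
  }
}
```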

@@ -64,7 +64,8 @@ public InvalidNullByteFault(
super(
CODE,
"Invalid null byte at source[%s], rowNumber[%d], column[%s], value[%s], position[%d]. "
+ "Consider sanitizing the input string column using REPLACE(\"%s\", U&'\\0000', '') AS %s",
+ "Consider sanitizing the input string column using \"REPLACE(\"%s\", U&'\\0000', '') AS %s\" or setting 'removeNullBytes' "
Contributor

Very nice change. Thank you.

@@ -410,6 +410,7 @@ The following table lists the context parameters for the MSQ task engine:
| `skipTypeVerification` | INSERT or REPLACE<br /><br />During query validation, Druid validates that [string arrays](../querying/arrays.md) and [multi-value dimensions](../querying/multi-value-dimensions.md) are not mixed in the same column. If you are intentionally migrating from one to the other, use this context parameter to disable type validation.<br /><br />Provide the column list as comma-separated values or as a JSON array in string form.| empty list |
| `failOnEmptyInsert` | INSERT or REPLACE<br /><br /> When set to false (the default), an INSERT query generating no output rows will be no-op, and a REPLACE query generating no output rows will delete all data that matches the OVERWRITE clause. When set to true, an ingest query generating no output rows will throw an `InsertCannotBeEmpty` fault. | `false` |
| `storeCompactionState` | REPLACE<br /><br /> When set to true, a REPLACE query stores as part of each segment's metadata a `lastCompactionState` field that captures the various specs used to create the segment. Future compaction jobs skip segments whose `lastCompactionState` matches the desired compaction state. Works the same as [`storeCompactionState`](../ingestion/tasks.md#context-parameters) task context flag. | `false` |
| `removeNullBytes` | SELECT, INSERT or REPLACE<br /><br /> The MSQ engine cannot process null bytes in strings and throws `InvalidNullByteFault` if it encounters them in the source data. If the parameter is set to true, the MSQ engine removes null bytes from string fields when reading the data. | `false` |
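For illustration, assuming the standard SQL-based ingestion API, the parameter would be passed in the request's query context like any other MSQ context parameter (the query text here is a made-up example):

```json
{
  "query": "INSERT INTO wiki SELECT * FROM TABLE(EXTERN(...)) PARTITIONED BY DAY",
  "context": {
    "removeNullBytes": true
  }
}
```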
Contributor

Should we document this? cc @gianm?

Contributor Author

I think so, as users would need it if REPLACE(...) isn't working.

Contributor

If we decide to leave it undocumented, we can have a follow-up patch.

@LakshSingla LakshSingla merged commit 71b3b5a into apache:master Jun 26, 2024
88 checks passed
@LakshSingla LakshSingla deleted the strip-null-bytes branch June 26, 2024 09:30
@LakshSingla
Copy link
Contributor Author

Thanks for the review @cryptoe!

@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024
Labels
Area - Batch Ingestion Area - Documentation Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Release Notes