Add query context parameter to remove null bytes when writing frames #16579
I wonder why we add an option to remove null bytes on writes instead of adding an option to remove them during read; ...so I wonder if there are any drawbacks to doing the normalization during read instead?
@kgyrtkirk I didn't fully understand this -
We read the external data and write it to frames almost simultaneously. I am not sure if normalising it while reading would change much, unless I am misinterpreting the comment.
This reverts commit db2abac.
I wonder if the following is true:
that's why I thought that normalizing at read time might be a better way to do this...as that will provide consistent behaviour even for the 1st usage as well. Now that I've thought about it a bit more: I guess in that case it will be harder to identify which columns should be normalized at read time (and I guess a
Accidentally approved. How will it work if we directly stream stuff from the reader to the super sorter?
One comment. Rest all lgtm. Thank you for adding tests.
)
{
if (allowNullBytes && removeNullBytes) {
This is quite a hot piece of code. Do we want to have this check here?
Instead, can we make public static methods which never allow this condition to happen?
We can make this a private method.
@@ -242,12 +241,36 @@ public static void verifySortColumns(
}
}

public static void copyByteBufferToMemoryAllowingNullBytes(
Please add Javadocs to both of these public methods.
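For readers following the thread, here is a minimal sketch of the shape being suggested: two documented public entry points that delegate to one private method, so the contradictory combination of "allow null bytes" and "remove null bytes" can never be passed in and needs no check in the hot loop. This is not the PR's code. The class name, both method names, the `byte[]` destination, and the use of `IllegalArgumentException` are assumptions made for illustration only.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the utility class under review; all names are illustrative.
final class NullByteCopySketch
{
  /**
   * Copies {@code len} bytes from {@code src} into {@code dst}, keeping null bytes as they are.
   *
   * @return the number of bytes written to {@code dst}
   */
  public static int copyAllowingNullBytes(ByteBuffer src, int srcPos, byte[] dst, int dstPos, int len)
  {
    return copy(src, srcPos, dst, dstPos, len, true, false);
  }

  /**
   * Copies up to {@code len} bytes from {@code src} into {@code dst}. If {@code removeNullBytes}
   * is true, null bytes are skipped; otherwise the first null byte causes an exception.
   *
   * @return the number of bytes written to {@code dst}
   */
  public static int copyDisallowingNullBytes(
      ByteBuffer src,
      int srcPos,
      byte[] dst,
      int dstPos,
      int len,
      boolean removeNullBytes
  )
  {
    return copy(src, srcPos, dst, dstPos, len, false, removeNullBytes);
  }

  // Private worker: callers above guarantee allowNullBytes and removeNullBytes are never both true.
  private static int copy(
      ByteBuffer src,
      int srcPos,
      byte[] dst,
      int dstPos,
      int len,
      boolean allowNullBytes,
      boolean removeNullBytes
  )
  {
    int written = 0;
    for (int i = 0; i < len; i++) {
      final byte b = src.get(srcPos + i);
      if (b == 0 && !allowNullBytes) {
        if (removeNullBytes) {
          continue; // drop the null byte
        }
        // Stand-in for the fault the engine would raise.
        throw new IllegalArgumentException("Null byte at position " + i);
      }
      dst[dstPos + written] = b;
      written++;
    }
    return written;
  }
}
```

With this shape the invalid flag combination is unrepresentable at the public API, which is the point of the review comment.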
@@ -64,7 +64,8 @@ public InvalidNullByteFault(
super(
CODE,
"Invalid null byte at source[%s], rowNumber[%d], column[%s], value[%s], position[%d]. "
+ "Consider sanitizing the input string column using REPLACE(\"%s\", U&'\\0000', '') AS %s",
+ "Consider sanitizing the input string column using \"REPLACE(\"%s\", U&'\\0000', '') AS %s\" or setting 'removeNullBytes' "
Very nice change. Thank you.
@@ -410,6 +410,7 @@ The following table lists the context parameters for the MSQ task engine:
| `skipTypeVerification` | INSERT or REPLACE<br /><br />During query validation, Druid validates that [string arrays](../querying/arrays.md) and [multi-value dimensions](../querying/multi-value-dimensions.md) are not mixed in the same column. If you are intentionally migrating from one to the other, use this context parameter to disable type validation.<br /><br />Provide the column list as comma-separated values or as a JSON array in string form.| empty list |
| `failOnEmptyInsert` | INSERT or REPLACE<br /><br /> When set to false (the default), an INSERT query generating no output rows will be no-op, and a REPLACE query generating no output rows will delete all data that matches the OVERWRITE clause. When set to true, an ingest query generating no output rows will throw an `InsertCannotBeEmpty` fault. | `false` |
| `storeCompactionState` | REPLACE<br /><br /> When set to true, a REPLACE query stores as part of each segment's metadata a `lastCompactionState` field that captures the various specs used to create the segment. Future compaction jobs skip segments whose `lastCompactionState` matches the desired compaction state. Works the same as [`storeCompactionState`](../ingestion/tasks.md#context-parameters) task context flag. | `false` |
| `removeNullBytes` | SELECT, INSERT or REPLACE<br /><br /> The MSQ engine cannot process null bytes in strings and throws `InvalidNullByteFault` if it encounters them in the source data. If the parameter is set to true, The MSQ engine will remove the null bytes in string fields when reading the data. | `false` |
Should we document this?
cc @gianm
I think so, as the users would require it if REPLACE(...) isn't working.
If we decide to undocument it, we can do that in a follow-up patch.
Thanks for the review @cryptoe!
Description
This PR adds a new query context parameter, removeNullBytes, which users can set to remove null bytes while writing string and string array data to frames. Currently, to remove null bytes, users wrap the string columns in a REPLACE function. The query planner can sometimes optimize the query and apply the function only in later stages, especially when subqueries are involved. This can cause the query to fail when run with the MSQ engine, which cannot work with null bytes in string fields. Adding the context parameter ensures that the null bytes are removed in the first stage itself, when the data is read.
Upgrade-Downgrade considerations
A mismatch between controller and worker versions shouldn't cause any backward incompatibilities. If a worker on an older version encounters this flag, it will ignore it and throw an error on encountering \0. Likewise, if the controller is on an older version, it won't pass this flag to the workers, and they will throw on encountering \0 (the default behavior). Therefore, in the presence of older versions, the query falls back to the default and throws an exception, which is the original behavior of the code (without this change).
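The fallback described above follows from treating an absent flag as false. A minimal sketch of that reading logic, assuming the context arrives as a plain `Map<String, Object>`; the class and method names are hypothetical and Druid's own context-handling classes are deliberately not used here:

```java
import java.util.Map;

// Hypothetical helper; illustrates the "absent means false" default only.
final class ContextFlagSketch
{
  static final String REMOVE_NULL_BYTES = "removeNullBytes";

  static boolean isRemoveNullBytes(Map<String, Object> context)
  {
    final Object value = context.get(REMOVE_NULL_BYTES);
    if (value == null) {
      // Flag never sent (for example, by an older controller): keep the default and throw on \0.
      return false;
    }
    if (value instanceof Boolean) {
      return (Boolean) value;
    }
    // Accept string forms such as "true", since context values often arrive as JSON strings.
    return Boolean.parseBoolean(String.valueOf(value));
  }
}
```

Because the default is false, a context that never carries the key behaves exactly like the code before this change.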
Release note
MSQ cannot process null bytes in string fields, and the current workaround is to remove them using the REPLACE function. A 'removeNullBytes' context parameter has been added, which sanitizes the input string fields by removing these null bytes.