
Add query context parameter to remove null bytes when writing frames #16579

Merged
merged 9 commits into apache:master on Jun 26, 2024

Conversation

LakshSingla
Contributor

@LakshSingla LakshSingla commented Jun 10, 2024

Description

This PR adds a new query context parameter, removeNullBytes, which users can set to remove null bytes while writing string and string-array data to frames. Today, to remove null bytes, users wrap the string columns in a REPLACE function. The query planner can sometimes optimize the query and apply the function in later stages, especially when subqueries are involved. This can cause the query to fail when run with the MSQ engine, which cannot work with null bytes in string fields. Adding the context parameter ensures that the null bytes are removed in the first stage itself, when the data is read.
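As a rough illustration (not the actual frame-writing code), the sanitization the parameter enables amounts to stripping \0 characters from string values before they are written to frames. The class and method names below are hypothetical; the real logic lives in Druid's frame writers:

```java
// Hypothetical sketch of the sanitization enabled by removeNullBytes=true;
// the real implementation operates on byte buffers inside Druid's frame writers.
public final class NullByteSanitizer
{
  // Strip all \0 characters from a string value before it is written to a frame.
  public static String removeNullBytes(final String value)
  {
    if (value == null || value.indexOf('\0') < 0) {
      return value; // fast path: most values contain no null bytes
    }
    return value.replace("\u0000", "");
  }
}
```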

Upgrade-Downgrade considerations

A mismatch of controller and worker versions shouldn't cause any backward incompatibilities. If a worker on an older version encounters this flag, it will ignore it and throw an error on encountering \0. Likewise, if the controller is on an older version, it won't pass this flag to the workers, and they will throw on encountering \0 (the default behavior). Therefore, in the presence of older versions, we behave in the default way and throw an exception, which is the original behavior of the code (without this change).

Release note

MSQ cannot process null bytes in string fields, and the current workaround is to remove them using the REPLACE function. The `removeNullBytes` context parameter has been added, which sanitizes input string fields by removing these null bytes.



This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Jun 10, 2024
@kgyrtkirk
Member

I wonder why we add an option to remove null bytes on writes instead of an option to remove them during reads. The main difference between the two is that if normalization is done during the write, the pass over the data may still have to tolerate/handle these differences; but after the write they disappear.

...so I wonder if there are any drawbacks to doing the normalization during the read instead?

@LakshSingla
Contributor Author

@kgyrtkirk I didn't fully understand this -

in case normalization is done during write - the pass on the data may have to tolerate/handle these differences; but after a write it disappears

We read the external data and write it to frames almost simultaneously. I am not sure if normalising it while reading would change much, unless I am misinterpreting the comment.

@kgyrtkirk
Member

I wonder if the following is true:

  • suppose there is a table with a column that contains a string with a \0
  • based on my interpretation of the PR, normalization happens at write time
  • the 1st stage will see the field containing the \0; so if it computes some function, say char_length, the \0 will be counted
  • in further stages, or if the data is persisted, the \0 will no longer be there

That's why I thought that normalizing at read time might be a better way to do this, as it would provide consistent behaviour for the 1st usage as well.

Now that I've thought about it a bit more: I guess in that case it would be harder to identify which columns should be normalized at read time (and I guess a \0 could possibly be added by a function as well). As this might be more complicated to do, maybe it isn't worth the effort.

@cryptoe cryptoe left a comment


Accidentally approved. How will it work if we directly stream stuff from the reader to the super sorter?

@cryptoe cryptoe left a comment


One comment; rest all LGTM. Thank you for adding tests.

)
{
if (allowNullBytes && removeNullBytes) {
Contributor

This is quite a hot piece of code. Do we want to have this check here?

Contributor

Instead, can we make public static methods which never allow this condition to happen?
We can make this a private method.

@@ -242,12 +241,36 @@ public static void verifySortColumns(
}
}

public static void copyByteBufferToMemoryAllowingNullBytes(
Contributor

Please add some Javadocs to both these public methods.
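A minimal sketch of the refactor suggested above, with hypothetical simplified signatures (the real `FrameWriterUtils` methods operate on byte buffers and memory, not `String`): the two public entry points each fix one mode and delegate to a private worker, so the invalid `allowNullBytes && removeNullBytes` combination can no longer be expressed by callers, and the hot-path check disappears.

```java
// Simplified, hypothetical sketch; real signatures in FrameWriterUtils differ.
public final class CopyUtils
{
  /**
   * Copies the value, permitting null bytes to pass through unchanged.
   */
  public static String copyAllowingNullBytes(final String value)
  {
    return copy(value, true, false);
  }

  /**
   * Copies the value, either removing null bytes or throwing when one is found,
   * depending on removeNullBytes.
   */
  public static String copyDisallowingNullBytes(final String value, final boolean removeNullBytes)
  {
    return copy(value, false, removeNullBytes);
  }

  // Private worker: allowNullBytes && removeNullBytes is unreachable from outside.
  private static String copy(final String value, final boolean allowNullBytes, final boolean removeNullBytes)
  {
    if (allowNullBytes) {
      return value;
    }
    if (removeNullBytes) {
      return value.replace("\u0000", "");
    }
    if (value.indexOf('\0') >= 0) {
      throw new IllegalArgumentException("Invalid null byte in value");
    }
    return value;
  }
}
```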

@@ -64,7 +64,8 @@ public InvalidNullByteFault(
super(
CODE,
"Invalid null byte at source[%s], rowNumber[%d], column[%s], value[%s], position[%d]. "
+ "Consider sanitizing the input string column using REPLACE(\"%s\", U&'\\0000', '') AS %s",
+ "Consider sanitizing the input string column using \"REPLACE(\"%s\", U&'\\0000', '') AS %s\" or setting 'removeNullBytes' "
Contributor

Very nice change. Thank you.

@@ -410,6 +410,7 @@ The following table lists the context parameters for the MSQ task engine:
| `skipTypeVerification` | INSERT or REPLACE<br /><br />During query validation, Druid validates that [string arrays](../querying/arrays.md) and [multi-value dimensions](../querying/multi-value-dimensions.md) are not mixed in the same column. If you are intentionally migrating from one to the other, use this context parameter to disable type validation.<br /><br />Provide the column list as comma-separated values or as a JSON array in string form.| empty list |
| `failOnEmptyInsert` | INSERT or REPLACE<br /><br /> When set to false (the default), an INSERT query generating no output rows will be no-op, and a REPLACE query generating no output rows will delete all data that matches the OVERWRITE clause. When set to true, an ingest query generating no output rows will throw an `InsertCannotBeEmpty` fault. | `false` |
| `storeCompactionState` | REPLACE<br /><br /> When set to true, a REPLACE query stores as part of each segment's metadata a `lastCompactionState` field that captures the various specs used to create the segment. Future compaction jobs skip segments whose `lastCompactionState` matches the desired compaction state. Works the same as [`storeCompactionState`](../ingestion/tasks.md#context-parameters) task context flag. | `false` |
| `removeNullBytes` | SELECT, INSERT or REPLACE<br /><br /> The MSQ engine cannot process null bytes in strings and throws `InvalidNullByteFault` if it encounters them in the source data. If the parameter is set to true, the MSQ engine removes null bytes from string fields when reading the data. | `false` |
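For illustration, assuming the standard SQL-based ingestion API, the parameter would be passed in the request's query context like any other MSQ context parameter (the query text here is a made-up example):

```json
{
  "query": "INSERT INTO wiki SELECT * FROM TABLE(EXTERN(...)) PARTITIONED BY DAY",
  "context": {
    "removeNullBytes": true
  }
}
```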
Contributor

Should we document this? cc @gianm?

Contributor Author

I think so, as users would need it if REPLACE(...) isn't working.

Contributor

If we decide to leave it undocumented, we can have a follow-up patch.

@LakshSingla LakshSingla merged commit 71b3b5a into apache:master Jun 26, 2024
88 checks passed
@LakshSingla LakshSingla deleted the strip-null-bytes branch June 26, 2024 09:30
@LakshSingla
Copy link
Contributor Author

Thanks for the review @cryptoe!

@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024
Labels
Area - Batch Ingestion Area - Documentation Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Release Notes