Add row based schema validation code to detect schema mismatch #5984
High-level question: do we really need row-level validation? What is the overhead of this validation?
@jackjlli Could you describe a case that is not caught by the schema-level check?
@@ -586,22 +586,32 @@ public boolean isSingleValue() {

  public PinotDataType getSingleValueType() {
    switch (this) {
      case BYTE:
If this PR is just adding a check, why does it need to modify existing functionality?
This change lets the method be called on a single-value type as well. For a single-value type, the method should return that same type.
This method is called in https://github.com/apache/incubator-pinot/blob/2cfaed37cf581362b87a36e924cdd5744d430e03/pinot-core/src/main/java/org/apache/pinot/core/data/recordtransformer/DataTypeTransformer.java#L112
If the single-value types of the source and the destination are the same, the data types are considered the same. E.g. if the source is string_array and the destination is string, the data type is the same, even though we should still set the single-value/multi-value mismatch flag.
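To make the distinction concrete, here is a minimal sketch of the two checks. The `DataType` enum is a hypothetical, trimmed-down stand-in for `PinotDataType` (only four values), and the helper names `isDataTypeMismatch` and `isSingleValueMultiValueMismatch` are assumptions for illustration, not methods from the Pinot code base.

```java
class SingleValueTypeSketch {
  // Hypothetical stand-in for PinotDataType; the real enum has many more values.
  enum DataType {
    INTEGER, STRING, INTEGER_ARRAY, STRING_ARRAY;

    boolean isSingleValue() {
      return this == INTEGER || this == STRING;
    }

    // Array types map to their element type; single-value types map to themselves.
    DataType getSingleValueType() {
      switch (this) {
        case INTEGER_ARRAY:
          return INTEGER;
        case STRING_ARRAY:
          return STRING;
        default:
          return this;
      }
    }
  }

  // Hypothetical helper: compares only the underlying single-value types.
  static boolean isDataTypeMismatch(DataType source, DataType dest) {
    return source.getSingleValueType() != dest.getSingleValueType();
  }

  // Hypothetical helper: flags a single-value vs. multi-value difference.
  static boolean isSingleValueMultiValueMismatch(DataType source, DataType dest) {
    return source.isSingleValue() != dest.isSingleValue();
  }

  public static void main(String[] args) {
    DataType source = DataType.STRING_ARRAY;
    DataType dest = DataType.STRING;
    System.out.println("data type mismatch: " + isDataTypeMismatch(source, dest));
    System.out.println("SV/MV mismatch: " + isSingleValueMultiValueMismatch(source, dest));
  }
}
```

With string_array as the source and string as the destination, the first check reports no mismatch while the second one does, which is the case described above.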
Most cases can be covered by validating the Pinot schema against the Avro schema. One tricky case is when all fields are to be fetched: we convert the Avro generic record to a string first and then parse it as JSON, so the data type of a value in the resulting key-value pairs might change.
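A rough illustration of how a string round trip can change a value's Java type. This is a sketch under assumptions: `parseNumberToken` is a hypothetical helper that mimics the way many JSON parsers choose a Java type for an untyped number token (smallest fitting integer type, or double for anything with a fraction or exponent). It is not the parser used by the record extractor.

```java
class JsonRoundTripSketch {
  // Hypothetical helper: picks a Java type for a numeric token the way many
  // JSON parsers do when no target type is given.
  static Object parseNumberToken(String token) {
    if (token.contains(".") || token.contains("e") || token.contains("E")) {
      return Double.parseDouble(token); // boxed as Double
    }
    long value = Long.parseLong(token);
    if (value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE) {
      return (int) value; // boxed as Integer
    }
    return value; // boxed as Long
  }

  // Simulates "convert to string, then parse": the original Java type is lost.
  static Object roundTrip(Object original) {
    return parseNumberToken(String.valueOf(original));
  }

  public static void main(String[] args) {
    Object[] samples = {5L, 1.5f, 5_000_000_000L};
    for (Object sample : samples) {
      Object parsed = roundTrip(sample);
      System.out.println(sample.getClass().getSimpleName() + " -> " + parsed.getClass().getSimpleName());
    }
  }
}
```

Under these assumptions a small `Long` comes back as an `Integer` and a `Float` comes back as a `Double`, which is the kind of change a schema-level check alone would not see.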
I don't think we ever init the record extractor without the fields. Also, converting to a string and then serializing as JSON doesn't seem correct. We should fix that instead of adding row-based validation.
This PR can be closed, as we already have schema validation from another PR (#5873).
Description
This PR adds row-based schema validation code to detect schema mismatches.
The validation captures schema mismatches on a per-row basis, so that we don't miss any cases when converting the raw data to Pinot data.
Sample mismatch details: