Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add row based schema validation code to detect schema mismatch #5984

Closed
wants to merge 1 commit into from

Conversation

jackjlli
Copy link
Contributor

@jackjlli jackjlli commented Sep 4, 2020

Description

This PR adds row based schema validation code to detect schema mismatch.

This validation will capture the schema mismatch on row basis, so that we don't miss any cases when converting the raw data to pinot data.

Sample mismatch detailed information:

Found data type mismatch from the following Pinot columns: [column2]
Found single-value multi-value field mismatch from the following Pinot columns: [column3]
Found multi-value structure mismatch from the following Pinot columns: [column3, column16]

@jackjlli jackjlli force-pushed the add-row-based-schema-validator branch 3 times, most recently from ae94529 to d7cd676 Compare September 6, 2020 00:34
@jackjlli jackjlli force-pushed the add-row-based-schema-validator branch from d7cd676 to 57acbde Compare September 6, 2020 03:41
@Jackie-Jiang
Copy link
Contributor

High level question, do we really need row level validation? What is the overhead of this validation?
Seems like the validation only applies to the Avro reader. In that case, validating Pinot schema and Avro schema should be enough. IMO we are paying a lot of overhead for very little gain.

@mayankshriv
Copy link
Contributor

@jackjlli Could you describe the case which is not caught by schema level check?

@@ -586,22 +586,32 @@ public boolean isSingleValue() {

public PinotDataType getSingleValueType() {
switch (this) {
case BYTE:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this PR is just adding a check, why does it need to modify the functionality?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is to get the single value type for single value type. This method should return the same single value type for single value type itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this method would be called in https://github.com/apache/incubator-pinot/blob/2cfaed37cf581362b87a36e924cdd5744d430e03/pinot-core/src/main/java/org/apache/pinot/core/data/recordtransformer/DataTypeTransformer.java#L112
If the single value type between source and dest are the same, the data type are the same. E.g. if source is string_array and dest is string, the data type is the same, even though we should mark the flag of single-value multi-value mismatch.

@jackjlli
Copy link
Contributor Author

jackjlli commented Sep 9, 2020

Most of the cases can be covered by validating pinot schema and avro schema. One tricky thing is that when all the fields are required to be fetched, we convert the avro generic record to string first, then parse it as a json:
https://github.com/apache/incubator-pinot/blob/d54b04a2562f86dfb3adaa02ff400951d8108738/pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractor.java#L49

  /**
   * Converts from a GenericRecord to a json map
   */
  public static Map<String, Object> genericRecordToJson(GenericRecord genericRecord) {
    try {
      String jsonString = genericRecord.toString();
      return DEFAULT_MAPPER.readValue(jsonString, new TypeReference<Map<String, Object>>() {
      });
    } catch (IOException e) {
      throw new IllegalStateException("Caught exception when converting generic record " + genericRecord + " to JSON");
    }
  }

The data type of the value from the k-v pair might get changed.

@Jackie-Jiang
Copy link
Contributor

Most of the cases can be covered by validating pinot schema and avro schema. One tricky thing is that when all the fields are required to be fetched, we convert the avro generic record to string first, then parse it as a json:
https://github.com/apache/incubator-pinot/blob/d54b04a2562f86dfb3adaa02ff400951d8108738/pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractor.java#L49

  /**
   * Converts from a GenericRecord to a json map
   */
  public static Map<String, Object> genericRecordToJson(GenericRecord genericRecord) {
    try {
      String jsonString = genericRecord.toString();
      return DEFAULT_MAPPER.readValue(jsonString, new TypeReference<Map<String, Object>>() {
      });
    } catch (IOException e) {
      throw new IllegalStateException("Caught exception when converting generic record " + genericRecord + " to JSON");
    }
  }

The data type of the value from the k-v pair might get changed.

I don't think we ever init record extractor without the fields. Also, converting to string then serializing as json doesn't seem correct. We should fix that instead of adding the row based validation.

@jackjlli
Copy link
Contributor Author

This PR can be closed as we've already had the schema validation in another PR(#5873).

@jackjlli jackjlli closed this Feb 24, 2021
@jackjlli jackjlli deleted the add-row-based-schema-validator branch February 24, 2021 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants