
Add Hadoop counters for detecting schema mismatch #5873

Merged

merged 5 commits into master on Sep 2, 2020

Conversation

jackjlli
Member

@jackjlli jackjlli commented Aug 16, 2020

Description

This PR adds Hadoop counters for detecting schema mismatch (AVRO only).

The counters include:

  • data types don't match (int <-> long)
  • single value vs multi value
  • multi-value column uses non-array structure
  • some pinot column is missing in the raw data

Here is the sample detailed information reported when schemas mismatch:

The Pinot column: (extra_column: STRING) is missing in the AVRO schema of input data.
The Pinot column: (column1: STRING) doesn't match with the column (union: LONG) in input AVRO schema.
The Pinot column: column2 is 'multi-value' column but the column: union from input AVRO schema is 'single-value' column.
The Pinot column: column2 is 'multi-value' column but the column: union from input AVRO schema is of 'int' type, which should have been of 'array' type.
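The checks behind these messages can be sketched in plain Java. This is a minimal illustration of the comparison logic only, assuming schemas reduced to per-column (type, single-value) pairs; `ColumnSpec` and `checkColumns` are hypothetical names, not the PR's actual classes, and the real code compares a Pinot Schema against an Avro schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaMismatchSketch {
    // Simplified column description: type name and single- vs multi-value.
    record ColumnSpec(String type, boolean singleValue) {}

    // Compares every Pinot column against the input schema and collects
    // one message per mismatch, mirroring the sample messages above.
    static List<String> checkColumns(Map<String, ColumnSpec> pinot,
                                     Map<String, ColumnSpec> input) {
        List<String> messages = new ArrayList<>();
        for (Map.Entry<String, ColumnSpec> e : pinot.entrySet()) {
            String column = e.getKey();
            ColumnSpec p = e.getValue();
            ColumnSpec i = input.get(column);
            if (i == null) {
                // Pinot column missing from the raw data.
                messages.add("The Pinot column: (" + column + ": " + p.type()
                    + ") is missing in the input schema.");
            } else if (p.singleValue() != i.singleValue()) {
                // Single-value vs multi-value mismatch.
                messages.add("The Pinot column: " + column + " is '"
                    + (p.singleValue() ? "single-value" : "multi-value")
                    + "' but the input column is not.");
            } else if (!p.type().equals(i.type())) {
                // Data types don't match (e.g. INT vs LONG).
                messages.add("The Pinot column: (" + column + ": " + p.type()
                    + ") doesn't match with the input column of type "
                    + i.type() + ".");
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        Map<String, ColumnSpec> pinot = new LinkedHashMap<>();
        pinot.put("extra_column", new ColumnSpec("STRING", true));
        pinot.put("column1", new ColumnSpec("STRING", true));
        Map<String, ColumnSpec> input = new LinkedHashMap<>();
        input.put("column1", new ColumnSpec("LONG", true));
        checkColumns(pinot, input).forEach(System.out::println);
    }
}
```

Each non-empty result would increment the corresponding Hadoop counter, so operators can spot schema drift from job counters without scanning logs.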

Member

@kishoreg kishoreg left a comment

Why is this specific to Hadoop? This could be part of the segment creator and used across all record readers, right?

@jackjlli jackjlli force-pushed the add-counter-for-detecting-schema-mismatch branch from b87c6cf to 8b0b778 Compare August 16, 2020 20:28
@jackjlli
Member Author

Why is this specific to Hadoop? This could be part of the segment creator and used across all record readers, right?

That makes sense. I've moved the counters to SegmentCreationJob from HadoopSegmentCreationJob.

@kishoreg
Member

Most of these are already tracked in transformers. Invalid columns etc


@@ -243,14 +257,15 @@ protected void map(LongWritable key, Text value, Context context)
       addAdditionalSegmentGeneratorConfigs(segmentGeneratorConfig, hdfsInputFile, sequenceId);

       _logger.info("Start creating segment with sequence id: {}", sequenceId);
-      SegmentIndexCreationDriver driver = new SegmentIndexCreationDriverImpl();
+      SegmentIndexCreationDriverImpl driver = new SegmentIndexCreationDriverImpl();
Contributor

Seems like we are breaking the interface here; what's the reasoning for that? Either the API should be justified to be part of the interface, or the design is broken somehow.

Contributor

+1
It has been on my radar to add a columnar segment creation driver (for realtime), and this will break completely

Member Author

Added a class called SchemaValidator to validate the Pinot schema against the input data schema.

@@ -353,8 +368,71 @@ protected void addAdditionalSegmentGeneratorConfigs(SegmentGeneratorConfig segme
       int sequenceId) {
   }

+  public void validateSchema(SegmentGeneratorConfig segmentGeneratorConfig, RecordReader recordReader) {
+    if (recordReader instanceof AvroRecordReader) {
Contributor

Seems like we will have to either write pair-wise validators (pinot-avro, pinot-orc, pinot-json, etc.), or write pair-wise schema converters (avro->pinot, orc->pinot, json->pinot), in which case the schema validator would only compare two Pinot schemas (one provided as input, the other derived from the format). At this point, I see pros/cons in both, but I'm leaning towards the former as it provides dedicated validation between formats.

However, with either approach, I'd recommend creating interfaces/impls. For example, an interface for the validator (with pair-wise impls), or an interface for the converter (with pair-wise converters, where the validator just works over the interface).
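The reviewer's suggestion (a common interface with pair-wise, per-format implementations) can be sketched as follows. This is a minimal illustration of the proposed shape only, not the PR's actual code; the names `IngestionSchemaValidator` and `AvroIngestionSchemaValidator` follow the naming discussed later in this thread, and the constructor/return types are assumptions.

```java
import java.util.List;

// Common interface; one pair-wise implementation per input format
// (Avro, ORC, JSON, ...), as suggested in the review comment above.
interface IngestionSchemaValidator {
    // Runs all checks; returns human-readable mismatch messages (empty = OK).
    List<String> validate();
}

// Hypothetical pair-wise implementation for the Avro format.
class AvroIngestionSchemaValidator implements IngestionSchemaValidator {
    private final String pinotSchemaJson;
    private final String avroSchemaJson;

    AvroIngestionSchemaValidator(String pinotSchemaJson, String avroSchemaJson) {
        this.pinotSchemaJson = pinotSchemaJson;
        this.avroSchemaJson = avroSchemaJson;
    }

    @Override
    public List<String> validate() {
        // A real implementation would parse both schemas and compare them
        // field by field; here we only illustrate the interface shape.
        return List.of();
    }
}
```

Callers would then depend only on `IngestionSchemaValidator`, so adding an ORC or JSON validator later requires no change to the segment-creation code.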

@jackjlli jackjlli force-pushed the add-counter-for-detecting-schema-mismatch branch from 8b0b778 to d638096 Compare August 20, 2020 04:27
@jackjlli
Member Author

Most of these are already tracked in transformers. Invalid columns etc

Correct, but it's encapsulated in RecordReader, and it would dirty the RecordReader interface if such statistics required more methods on the interface. Added a class called SchemaValidator to track this.

@jackjlli jackjlli force-pushed the add-counter-for-detecting-schema-mismatch branch 2 times, most recently from 1cf73d1 to 0fb58c0 Compare August 20, 2020 05:29
@jackjlli
Member Author

Most of these are already tracked in transformers. Invalid columns etc

Plus, the transformers are applied to each record. We don't have to do the schema validation on every record; all we need to do is validate the schemas once, when a segment is about to be built.
@kishoreg let me know if you have any other concerns.

/**
* Validator to validate the schema between Pinot schema and input raw data schema
*/
public interface SchemaValidator {
Contributor

Suggest renaming this to IngestionSchemaValidator? Or InputSchemaValidator (still confusing, I think). Otherwise it reads as if we are validating the Pinot schema.

If we add a validator in PinotSchemaRestlet when the REST API call updates the schema, what would we call it?

Contributor

Do you agree with this? Or do you want to leave it as SchemaValidator?

Member Author

Renamed it to IngestionSchemaValidator in the latest push


void init(Schema pinotSchema, String inputFilePath);

boolean isDataTypeMismatch();
Contributor

If we return the field names that have this type of mismatch, and also an error message (e.g. "input type 'List' does not match with Pinot data type 'int' for field 'X'"), that would be awesome.
We can then log this as an error message during segment creation.

Member Author

@jackjlli jackjlli Sep 2, 2020

Added a class called SchemaValidatorResult to provide detailed information. I've added the sample detailed info in the description of this PR.
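A result holder like the one described (a mismatch flag plus detailed, loggable messages) might look like the following. This is an illustrative sketch, not the PR's actual SchemaValidatorResult; the method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical result holder: records whether any mismatch was found and
// accumulates human-readable reasons that can be logged and counted.
public class ValidationResultSketch {
    private boolean mismatchDetected = false;
    private final List<String> mismatchReasons = new ArrayList<>();

    // Adding a reason both records the message and flips the mismatch flag.
    public void addMismatchReason(String reason) {
        mismatchDetected = true;
        mismatchReasons.add(reason);
    }

    public boolean isMismatchDetected() {
        return mismatchDetected;
    }

    // Joins all reasons into one string suitable for a log line.
    public String getMismatchReason() {
        return String.join(", ", mismatchReasons);
    }
}
```

The segment-creation job can then check `isMismatchDetected()` once per validator, bump the corresponding Hadoop counter, and log `getMismatchReason()` as a single error line.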

@jackjlli jackjlli force-pushed the add-counter-for-detecting-schema-mismatch branch from 0fb58c0 to 200b4bf Compare September 2, 2020 17:52
@jackjlli jackjlli force-pushed the add-counter-for-detecting-schema-mismatch branch from 200b4bf to 33094df Compare September 2, 2020 17:58
@jackjlli jackjlli force-pushed the add-counter-for-detecting-schema-mismatch branch from cd92952 to 266f6b1 Compare September 2, 2020 20:35
@jackjlli jackjlli force-pushed the add-counter-for-detecting-schema-mismatch branch from 4de10c0 to cdad5aa Compare September 2, 2020 21:52
Contributor

@mcvsubbu mcvsubbu left a comment

Thanks for doing this.

@jackjlli jackjlli force-pushed the add-counter-for-detecting-schema-mismatch branch from cdad5aa to 4380465 Compare September 2, 2020 22:32
@jackjlli jackjlli merged commit 8a31bf7 into master Sep 2, 2020
@jackjlli jackjlli deleted the add-counter-for-detecting-schema-mismatch branch September 2, 2020 22:32
4 participants