-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Hadoop counters for detecting schema mismatch #5873
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this specific to Hadoop. This could be part of segment creator and used across all recordreader right?
b87c6cf
to
8b0b778
Compare
That makes sense. I've moved the counters to |
Most of these are already tracked in transformers. Invalid columns etc |
1 similar comment
Most of these are already tracked in transformers. Invalid columns etc |
@@ -243,14 +257,15 @@ protected void map(LongWritable key, Text value, Context context) | |||
addAdditionalSegmentGeneratorConfigs(segmentGeneratorConfig, hdfsInputFile, sequenceId); | |||
|
|||
_logger.info("Start creating segment with sequence id: {}", sequenceId); | |||
SegmentIndexCreationDriver driver = new SegmentIndexCreationDriverImpl(); | |||
SegmentIndexCreationDriverImpl driver = new SegmentIndexCreationDriverImpl(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we are breaking interface here, what' the reasoning for that? Either the api should be justified to be part of the interface, or the design is broken somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
It has been on my radar to add a columnar segment creation driver (for realtime), and this will break completely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a class called SchemaValidator
to validate Pinot schema and the input data schema
@@ -353,8 +368,71 @@ protected void addAdditionalSegmentGeneratorConfigs(SegmentGeneratorConfig segme | |||
int sequenceId) { | |||
} | |||
|
|||
public void validateSchema(SegmentGeneratorConfig segmentGeneratorConfig, RecordReader recordReader) { | |||
if (recordReader instanceof AvroRecordReader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we will have to either write pairwise validators (pinot-avro, pinot-orc, pinot-json, etc). Or can write pair-wise schema converters (avro->pinot, orc->pinot, json->pinot), and then the schema validator will only compare two pinot schemas (one provided as input, other derived from format). At this point, I see pros/cons in both, but leaning towards former as it provides dedicated validation between formats.
However, in either of the approaches, I'd recommend creating interfaces/impls. For example, an interface for validator (with pair-wise impls), or an interaface for converter (with pair-wise converters, and validator just works over interface).
8b0b778
to
d638096
Compare
Correct, but it's encapsulated in RecordRecorder and it will dirty the |
1cf73d1
to
0fb58c0
Compare
Plus, the transformers are applied on each of the records. We don't have to do the schema validation at every record. All we need to do is to validate the schemas only once when a segment is about to be built. |
/** | ||
* Validator to validate the schema between Pinot schema and input raw data schema | ||
*/ | ||
public interface SchemaValidator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest to rename this as IngestionSchemaValidator
? Or, InputSchemaValidator
(still confusing I think). Otherwise it reads as if we are validating pinot schema.
If we add a validator in PinotSchemaRestlet when the rest api call updates the schema, what would we call it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do u agree to this? Or, do you want to leave it as SchemaValidator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed it to IngestionSchemaValidator
in the latest push
|
||
void init(Schema pinotSchema, String inputFilePath); | ||
|
||
boolean isDataTypeMismatch(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we return the field names that have this type of mismatch, and also an error message (e.g. "input type 'List' does not match with pinot data type 'int' for field 'X') that will be awesome.
We can then log this as an error message during segment creation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a class called SchemaValidatorResult
to provide detailed information. I've added the sample detailed info in the description of this PR.
0fb58c0
to
200b4bf
Compare
200b4bf
to
33094df
Compare
cd92952
to
266f6b1
Compare
/** | ||
* Validator to validate the schema between Pinot schema and input raw data schema | ||
*/ | ||
public interface SchemaValidator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do u agree to this? Or, do you want to leave it as SchemaValidator?
pinot-spi/src/main/java/org/apache/pinot/spi/data/SchemaValidatorResult.java
Outdated
Show resolved
Hide resolved
pinot-spi/src/main/java/org/apache/pinot/spi/data/SchemaValidatorResult.java
Outdated
Show resolved
Hide resolved
4de10c0
to
cdad5aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this.
cdad5aa
to
4380465
Compare
Description
This PR adds Hadoop counters for detecting schema mismatch (AVRO only).
The counters include:
Here are the sample detailed information when schemas mismatch: