
Add protobuf schema registry #10839

Merged (12 commits) on Mar 9, 2021

Conversation

bananaaggle
Contributor

Confluent's Schema Registry now supports deserializing Protobuf, so I added this feature to the druid-protobuf-extension module. Previously, this module only worked when the user provided a .desc file; now it can fetch the schema from the schema registry dynamically.
This submission adds the io.confluent:kafka-schema-registry-client:6.0.1 and io.confluent:kafka-protobuf-provider:6.0.1 dependencies to the druid-protobuf-extension module.
The old spec is:

"parser": {
      "type": "protobuf",
      "descriptor": "file:///tmp/metrics.desc",
      "protoMessageType": "Metrics"
      ......
}

The new spec is:

"parser": {
      "type": "protobuf",
      "protoBytesDecoder": {
           "type": "schema_registry",
          "url": "http://XXXXXX:8081",
          "capacity": 10,
        }
      ......
}

and

"parser": {
      "type": "protobuf",
      "protoBytesDecoder": {
          "type": "file",
          "descriptor": "file:///tmp/metrics.desc",
          "protoMessageType": "Metrics"
        }
      ......
}

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.


@clintropolis
Member

Hi @bananaaggle, thanks for the contribution. It seems like a useful addition and I would like to review it, but that is impossible if you keep closing this and opening a new PR. Could you please stick to one PR and make any updates in place so that we can review it?

@bananaaggle
Contributor Author

Hi @bananaaggle, thanks for the contribution. It seems like a useful addition and I would like to review it, but that is impossible if you keep closing this and opening a new PR. Could you please stick to one PR and make any updates in place so that we can review it?

Sorry, I will stick to one PR.

Member

@clintropolis clintropolis left a comment


overall lgtm

@@ -225,6 +225,9 @@ def build_compatible_license_names():
compatible_licenses['Apache License v2.0'] = 'Apache License version 2.0'
compatible_licenses['Apache License, version 2.0'] = 'Apache License version 2.0'
compatible_licenses['Apache 2.0 License'] = 'Apache License version 2.0'
compatible_licenses['Apache License, 2.0'] = 'Apache License version 2.0'
compatible_licenses['Confluent Community License'] = 'Confluent Community License'
Member


hmm, what requires the Confluent Community License? I don't see anything in licenses.yaml mentioning it. Last I knew, the clients for Confluent things were still Apache licensed (I'm not sure we can distribute anything with this license in Druid packages, would have to check into that, but I suspect not since it has field of use restrictions)

Contributor Author


Screen Shot 2021-02-07 at 10 48 58 AM

This is the dependency report generated by Druid. I think it means kafka-protobuf-provider is licensed under both the Confluent Community License and the Apache License. I changed this Python script because it cannot work if this map is missing the key 'Confluent Community License'.

Screen Shot 2021-02-07 at 10 56 51 AM

Screen Shot 2021-02-07 at 10 57 03 AM

In the MVN Repository it is marked as Apache licensed. So can I use it in Druid packages? I'm not very sure about that.

Member


I looked at the source files for kafka-protobuf-provider here, https://github.com/confluentinc/schema-registry/tree/master/protobuf-provider/src/main/java/io/confluent/kafka/schemaregistry/protobuf

and there appears to be a mix of Apache and Confluent Community License headers, which I assume means both licenses apply.

Unfortunately, I think this means that we cannot distribute this jar, but I think that instead we can provide instructions in the documentation on how users can download it (see https://www.apache.org/legal/resolved.html#optional), similar to how we do with the MySQL connector.

The pom.xml will need to mark this jar as provided, and you will need to remove this license from this Python script, and remove kafka-protobuf-provider from licenses.yaml. I think maybe it is ok if tests still depend on it, since it wouldn't cause packaging to distribute it? But I'm unsure exactly...

private Parser<String, Object> parser;
private final List<String> dimensions;

@JsonCreator
public ProtobufInputRowParser(
@JsonProperty("parseSpec") ParseSpec parseSpec,
@JsonProperty("descriptor") String descriptorFilePath,
@JsonProperty("protoMessageType") String protoMessageType
@JsonProperty("protoBytesDecoder") ProtobufBytesDecoder protobufBytesDecoder
Member


So that this change is not incompatible with previous Druid versions, I think we should probably consider leaving the old constructor properties in place and marking them as deprecated instead of removing them, then enforcing that only one of protoBytesDecoder or descriptor/protoMessageType can be set, and automatically constructing a 'FileBasedProtobufBytesDecoder' if the old properties are set and the new one is null.

Then in a future version we can consider dropping them.
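For illustration, a minimal sketch of what such a backward-compatible constructor could look like; the validation style, field names, and the FileBasedProtobufBytesDecoder constructor arguments here are assumptions, not the final implementation:

```java
@JsonCreator
public ProtobufInputRowParser(
    @JsonProperty("parseSpec") ParseSpec parseSpec,
    @JsonProperty("protoBytesDecoder") ProtobufBytesDecoder protobufBytesDecoder,
    @Deprecated @JsonProperty("descriptor") String descriptorFilePath,
    @Deprecated @JsonProperty("protoMessageType") String protoMessageType
)
{
  this.parseSpec = parseSpec;
  if (protobufBytesDecoder != null) {
    // New-style spec: a decoder is supplied directly, so the deprecated properties must not be set.
    if (descriptorFilePath != null || protoMessageType != null) {
      throw new IllegalArgumentException(
          "Cannot use both protoBytesDecoder and the deprecated descriptor/protoMessageType properties"
      );
    }
    this.protobufBytesDecoder = protobufBytesDecoder;
  } else {
    // Old-style spec: fall back to a file-based decoder built from the deprecated properties.
    this.protobufBytesDecoder = new FileBasedProtobufBytesDecoder(descriptorFilePath, protoMessageType);
  }
}
```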

Contributor Author


Good idea, I'll implement it.

}
}

@SuppressWarnings("checkstyle:RightCurly")
Member


what is this suppression for?

Contributor Author


I'll delete it.

DynamicMessage message = DynamicMessage.parseFrom(descriptor, rawMessage);
return message;
}
catch (Exception e) {
Member


maybe this "try" block should be split up so that the error handling for failing to fetch the schema from the registry, which doesn't exactly seem like a parse exception, can be distinguished from an actual parse exception
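For illustration, a rough sketch of how the two failure modes could be separated; the getDescriptor helper, the surrounding fields, and the use of Druid's ParseException here are assumptions, not the actual implementation:

```java
Descriptors.Descriptor descriptor;
try {
  // Failing to fetch or resolve the schema from the registry is a registry/connectivity problem,
  // not a problem with the record itself.
  descriptor = getDescriptor(registry, bytes);  // hypothetical helper that looks up the schema
}
catch (RestClientException | IOException e) {
  throw new ParseException(e, "Failed to fetch Protobuf schema from the registry");
}

try {
  // Only an actual decoding failure is reported as a parse error for the message.
  return DynamicMessage.parseFrom(descriptor, rawMessage);
}
catch (InvalidProtocolBufferException e) {
  throw new ParseException(e, "Failed to parse Protobuf message");
}
```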

Contributor Author


I'll split it.

@bananaaggle
Contributor Author

@clintropolis Hi, I changed the code following your suggestions two weeks ago. Could you please help review this change?

Member

@clintropolis clintropolis left a comment


Very sorry for the delay @bananaaggle! The code changes lgtm 👍, but I think we need to fix the license stuff. One minor nitpick is that some of the exception handling could maybe be a bit more specific than catching Exception.

Also, could you update the documentation for the protobuf extension, https://github.com/apache/druid/blob/master/docs/development/extensions-core/protobuf.md, to include the new stuff, and instructions to get the jar to use it? You can do something similar to here: https://github.com/apache/druid/blob/master/docs/development/extensions-core/mysql.md#installing-the-mysql-connector-library.


@bananaaggle
Contributor Author

bananaaggle commented Feb 23, 2021

Very sorry for the delay @bananaaggle! The code changes lgtm 👍, but I think we need to fix the license stuff. One minor nitpick is that some of the exception handling could maybe be a bit more specific than catching Exception.

Also, could you update the documentation for the protobuf extension, https://github.com/apache/druid/blob/master/docs/development/extensions-core/protobuf.md, to include the new stuff, and instructions to get the jar to use it? You can do something similar to here: https://github.com/apache/druid/blob/master/docs/development/extensions-core/mysql.md#installing-the-mysql-connector-library.

It's a good idea. I've separated this jar and written documentation for its usage.

@bananaaggle
Contributor Author

@clintropolis Hi, I've written documentation for this feature. Can you help me refine it? Thanks.

Member

@clintropolis clintropolis left a comment


lgtm, just a few minor suggestions on docs, thanks 👍

@@ -72,22 +72,47 @@ message Metrics {
}
```

### Descriptor file
### When use descriptor file
Member


Suggested change
### When use descriptor file
### When using a descriptor file

@@ -56,7 +56,7 @@ Here is a JSON example of the 'metrics' data schema used in the example.

### Proto file

The corresponding proto file for our 'metrics' dataset looks like this.
The corresponding proto file for our 'metrics' dataset looks like this. You can use Protobuf parser with a proto file or Confluent schema registry.
Member


Suggested change
The corresponding proto file for our 'metrics' dataset looks like this. You can use Protobuf parser with a proto file or Confluent schema registry.
The corresponding proto file for our 'metrics' dataset looks like this. You can use Protobuf parser with a proto file or [Confluent Schema Registry](https://docs.confluent.io/platform/current/schema-registry/index.html).

}
```

This feature uses Confluent's Protobuf provider which is not included in the Druid distribution and must be installed separately. You can fetch it and its dependencies from Maven Central at:
Member


Suggested change
This feature uses Confluent's Protobuf provider which is not included in the Druid distribution and must be installed separately. You can fetch it and its dependencies from Maven Central at:
This feature uses Confluent's Protobuf provider which is not included in the Druid distribution and must be installed separately. You can fetch it and its dependencies from the Confluent repository and Maven Central at:

}
```

### When use schema registry
Member


Suggested change
### When use schema registry
### When using Schema Registry

## Create Kafka Supervisor

Below is the complete Supervisor spec JSON to be submitted to the Overlord.
Make sure these keys are properly configured for successful ingestion.

### When use descriptor file
Member


Suggested change
### When use descriptor file
### When using a descriptor file


Next, we use the `protoc` Protobuf compiler to generate the descriptor file and save it as `metrics.desc`. The descriptor file must be either in the classpath or reachable by URL. In this example the descriptor file was saved at `/tmp/metrics.desc`, however this file is also available in the example files. From your Druid install directory:

```
protoc -o /tmp/metrics.desc ./quickstart/protobuf/metrics.proto
```

### When use schema registry
Member


Suggested change
### When use schema registry
### When using Schema Registry


Next, we use the `protoc` Protobuf compiler to generate the descriptor file and save it as `metrics.desc`. The descriptor file must be either in the classpath or reachable by URL. In this example the descriptor file was saved at `/tmp/metrics.desc`, however this file is also available in the example files. From your Druid install directory:

```
protoc -o /tmp/metrics.desc ./quickstart/protobuf/metrics.proto
```

### When use schema registry

At first make sure your schema registry version is later than 5.5. Next, we post this schema to schema registry.
Member


Suggested change
At first make sure your schema registry version is later than 5.5. Next, we post this schema to schema registry.
Make sure your Schema Registry version is later than 5.5. Next, we can post a schema to add it to the registry:
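Purely as an illustration (the registry URL, subject name, and file path below are placeholders, and the docs may use the registry's REST API instead), registering a Protobuf schema programmatically with the Confluent client could look roughly like this:

```java
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import io.confluent.kafka.schemaregistry.protobuf.ProtobufSchema;
import io.confluent.kafka.schemaregistry.protobuf.ProtobufSchemaProvider;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;

public class RegisterMetricsSchema
{
  public static void main(String[] args) throws Exception
  {
    // Placeholder registry URL and subject name; adjust to your environment.
    String registryUrl = "http://localhost:8081";
    String subject = "metrics-value";

    // Read the .proto definition as a string (path is a placeholder).
    String protoDefinition = new String(
        Files.readAllBytes(Paths.get("./quickstart/protobuf/metrics.proto")),
        StandardCharsets.UTF_8
    );

    SchemaRegistryClient client = new CachedSchemaRegistryClient(
        Collections.singletonList(registryUrl),
        10,
        Collections.singletonList(new ProtobufSchemaProvider()),
        null
    );

    // Post the schema to the registry under the given subject.
    int schemaId = client.register(subject, new ProtobufSchema(protoDefinition));
    System.out.println("Registered schema id: " + schemaId);
  }
}
```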

@bananaaggle
Contributor Author

@clintropolis Hi, thanks for your review. I've refined the documentation for this feature. What else should I do before merging?

Member

@clintropolis clintropolis left a comment


lgtm, thanks 👍

@clintropolis
Member

clintropolis commented Mar 4, 2021

@clintropolis Hi, thanks for your review. I've refined the documentation for this feature. What else should I do before merging?

I think this is g2g after CI.

If you are interested in doing a follow-up PR or two, the protobuf extension is one of the few ingestion formats that has not been ported to the newer InputFormat interface, which has replaced parsers. There are a handful of implementations that could be used as reference (JSON; Avro, though it is missing an InputFormat for streaming support and for using the schema registry itself; and many others). I think this should be relatively straightforward with the protobuf extension since most of the work is being done in the decoder implementations, which could also be re-used by the InputFormat implementation.

Another thing: after #10929 goes in it would probably be possible to add an integration test for the functionality added in this PR. A fair warning, our integration tests are a bit rough around the edges to work with, but I think most of the parts would be there (I guess the protobuf jar we can't include would need to be added to the Dockerfile, similar to mysql). I'm not sure we actually have any integration tests using protobuf currently.

@bananaaggle
Contributor Author

lgtm, thanks 👍

@clintropolis Hi, thanks for your review. I've refined the documentation for this feature. What else should I do before merging?

I think this is g2g after CI.

If you are interested in doing a follow-up PR or two, the protobuf extension is one of the few ingestion formats that has not been ported to the newer InputFormat interface, which has replaced parsers. There are a handful of implementations that could be used as reference (JSON; Avro, though it is missing an InputFormat for streaming support and for using the schema registry itself; and many others). I think this should be relatively straightforward with the protobuf extension since most of the work is being done in the decoder implementations, which could also be re-used by the InputFormat implementation.

Another thing: after #10929 goes in it would probably be possible to add an integration test for the functionality added in this PR. A fair warning, our integration tests are a bit rough around the edges to work with, but I think most of the parts would be there (I guess the protobuf jar we can't include would need to be added to the Dockerfile, similar to mysql). I'm not sure we actually have any integration tests using protobuf currently.

Yeah, I will learn about the InputFormat interface and implement it for protobuf. As for the integration test, I'm not very familiar with that component, but I think I can finish it.

@JsonCreator
public SchemaRegistryBasedProtobufBytesDecoder(
@JsonProperty("url") String url,
@JsonProperty("capacity") Integer capacity
Member

@clintropolis clintropolis Mar 5, 2021


Oops, I forgot about one thing: could you add support for supplying config and headers to the schema registry client? See https://github.com/apache/druid/pull/10314/files#diff-29d909859e829594d32e7d39a72d233c9947cedf9b35ddf9e4566547799adb0fR51-R52 for details (and the docs could probably be more or less copied too: https://github.com/apache/druid/pull/10314/files#diff-b5e5f4cf38f24309777ff1750a3ed7d5f55492afe2e8eded4f7f71337b2c4a90R1041)
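For illustration, a hedged sketch of what accepting config and headers might look like in the decoder's constructor; the property names, the capacity default, and the exact CachedSchemaRegistryClient constructor overload are assumptions modeled on the Avro schema registry decoder linked above, not the final implementation:

```java
@JsonCreator
public SchemaRegistryBasedProtobufBytesDecoder(
    @JsonProperty("url") String url,
    @JsonProperty("capacity") Integer capacity,
    @JsonProperty("config") Map<String, ?> config,
    @JsonProperty("headers") Map<String, String> headers
)
{
  int identityMapCapacity = capacity == null ? Integer.MAX_VALUE : capacity;
  // Pass the optional client config and HTTP headers straight through to the Confluent client.
  this.registry = new CachedSchemaRegistryClient(
      Collections.singletonList(url),
      identityMapCapacity,
      Collections.singletonList(new ProtobufSchemaProvider()),
      config,
      headers
  );
}
```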

@bananaaggle
Contributor Author

Hi, @clintropolis. I've added support for supplying config and headers, and I've updated the documentation about it. Can you review it? Thanks.

Member

@clintropolis clintropolis left a comment


changes lgtm, thanks 👍

CI is failing, asking for druid-protobuf-extensions to have

```xml
<dependency>
  <groupId>com.google.code.findbugs</groupId>
  <artifactId>jsr305</artifactId>
  <version>2.0.1</version>
  <scope>provided</scope>
</dependency>
```

added to the pom.xml, not really sure which of the changes caused that to happen...
