[FLINK-4861] [build] Package optional project artifacts #2664


greghogan
Contributor

Package the Flink connectors, metrics, and libraries into subdirectories of a new opt directory in the release/snapshot tarballs.

The resultant directory with Flink jars and transitive dependencies:

opt/connectors
opt/connectors/batch
opt/connectors/batch/antlr-3.4.jar
opt/connectors/batch/asm-3.1.jar
opt/connectors/batch/avro-ipc-1.7.6.jar
opt/connectors/batch/avro-mapred-1.7.1.jar
opt/connectors/batch/bonecp-0.7.1.RELEASE.jar
opt/connectors/batch/commons-httpclient-3.1.jar
opt/connectors/batch/commons-logging-api-1.0.4.jar
opt/connectors/batch/commons-math-2.2.jar
opt/connectors/batch/commons-pool-1.5.4.jar
opt/connectors/batch/datanucleus-api-jdo-3.2.1.jar
opt/connectors/batch/datanucleus-core-3.2.2.jar
opt/connectors/batch/datanucleus-rdbms-3.2.1.jar
opt/connectors/batch/derby-10.4.2.0.jar
opt/connectors/batch/disruptor-3.3.0.jar
opt/connectors/batch/findbugs-annotations-1.3.9-1.jar
opt/connectors/batch/flink-hadoop-compatibility_2.10-1.2-SNAPSHOT.jar
opt/connectors/batch/flink-hbase_2.10-1.2-SNAPSHOT.jar
opt/connectors/batch/flink-hcatalog-1.2-SNAPSHOT.jar
opt/connectors/batch/flink-jdbc-1.2-SNAPSHOT.jar
opt/connectors/batch/guava-12.0.1.jar
opt/connectors/batch/hbase-client-1.2.3.jar
opt/connectors/batch/hbase-common-1.2.3.jar
opt/connectors/batch/hbase-common-1.2.3-tests.jar
opt/connectors/batch/hbase-hadoop2-compat-1.2.3.jar
opt/connectors/batch/hbase-hadoop-compat-1.2.3.jar
opt/connectors/batch/hbase-prefix-tree-1.2.3.jar
opt/connectors/batch/hbase-procedure-1.2.3.jar
opt/connectors/batch/hbase-protocol-1.2.3.jar
opt/connectors/batch/hbase-server-1.2.3.jar
opt/connectors/batch/hcatalog-core-0.12.0.jar
opt/connectors/batch/hive-cli-0.12.0.jar
opt/connectors/batch/hive-common-0.12.0.jar
opt/connectors/batch/hive-exec-0.12.0.jar
opt/connectors/batch/hive-metastore-0.12.0.jar
opt/connectors/batch/hive-serde-0.12.0.jar
opt/connectors/batch/hive-service-0.12.0.jar
opt/connectors/batch/hive-shims-0.12.0.jar
opt/connectors/batch/htrace-core-3.1.0-incubating.jar
opt/connectors/batch/httpclient-4.1.3.jar
opt/connectors/batch/httpcore-4.1.3.jar
opt/connectors/batch/jackson-jaxrs-1.9.13.jar
opt/connectors/batch/jamon-runtime-2.4.1.jar
opt/connectors/batch/jasper-compiler-5.5.23.jar
opt/connectors/batch/jasper-runtime-5.5.23.jar
opt/connectors/batch/JavaEWAH-0.3.2.jar
opt/connectors/batch/javolution-5.5.1.jar
opt/connectors/batch/jcodings-1.0.8.jar
opt/connectors/batch/jdo-api-3.0.1.jar
opt/connectors/batch/jersey-server-1.9.jar
opt/connectors/batch/jetty-6.1.26.jar
opt/connectors/batch/joni-2.1.2.jar
opt/connectors/batch/json-20090211.jar
opt/connectors/batch/jta-1.1.jar
opt/connectors/batch/libfb303-0.9.0.jar
opt/connectors/batch/libthrift-0.9.0.jar
opt/connectors/batch/metrics-core-2.2.0.jar
opt/connectors/batch/servlet-api-2.5-20081211.jar
opt/connectors/batch/snappy-0.2.jar
opt/connectors/batch/ST4-4.0.4.jar
opt/connectors/batch/velocity-1.7.jar
opt/connectors/streaming
opt/connectors/streaming/amqp-client-3.3.1.jar
opt/connectors/streaming/antlr-runtime-3.5.jar
opt/connectors/streaming/asm-4.1.jar
opt/connectors/streaming/asm-commons-4.1.jar
opt/connectors/streaming/commons-pool2-2.3.jar
opt/connectors/streaming/elasticsearch-1.7.1.jar
opt/connectors/streaming/flink-connector-cassandra_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/flink-connector-elasticsearch_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/flink-connector-elasticsearch2_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/flink-connector-filesystem_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/flink-connector-flume_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/flink-connector-kafka-0.10_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/flink-connector-kafka-0.8_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/flink-connector-kafka-0.9_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/flink-connector-kafka-base_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/flink-connector-nifi_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/flink-connector-rabbitmq_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/flink-connector-redis_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/flink-connector-twitter_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/jackson-core-2.7.4.jar
opt/connectors/streaming/jedis-2.8.0.jar
opt/connectors/streaming/kafka_2.10-0.8.2.2.jar
opt/connectors/streaming/kafka-clients-0.9.0.1.jar
opt/connectors/streaming/lucene-analyzers-common-4.10.4.jar
opt/connectors/streaming/lucene-core-4.10.4.jar
opt/connectors/streaming/lucene-grouping-4.10.4.jar
opt/connectors/streaming/lucene-highlighter-4.10.4.jar
opt/connectors/streaming/lucene-join-4.10.4.jar
opt/connectors/streaming/lucene-memory-4.10.4.jar
opt/connectors/streaming/lucene-misc-4.10.4.jar
opt/connectors/streaming/lucene-queries-4.10.4.jar
opt/connectors/streaming/lucene-queryparser-4.10.4.jar
opt/connectors/streaming/lucene-sandbox-4.10.4.jar
opt/connectors/streaming/lucene-spatial-4.10.4.jar
opt/connectors/streaming/lucene-suggest-4.10.4.jar
opt/connectors/streaming/lz4-1.2.0.jar
opt/connectors/streaming/nifi-api-0.6.1.jar
opt/connectors/streaming/nifi-client-dto-0.6.1.jar
opt/connectors/streaming/nifi-site-to-site-client-0.6.1.jar
opt/connectors/streaming/nifi-utils-0.6.1.jar
opt/connectors/streaming/snakeyaml-1.12.jar
opt/connectors/streaming/spatial4j-0.4.1.jar
opt/connectors/streaming/swagger-annotations-1.5.3-M1.jar
opt/connectors/streaming/zkclient-0.3.jar
opt/lib
opt/lib/cep
opt/lib/cep/flink-cep_2.10-1.2-SNAPSHOT.jar
opt/lib/cep/flink-cep-scala_2.10-1.2-SNAPSHOT.jar
opt/lib/gelly
opt/lib/gelly/flink-gelly_2.10-1.2-SNAPSHOT.jar
opt/lib/gelly/flink-gelly-examples_2.10-1.2-SNAPSHOT.jar
opt/lib/gelly/flink-gelly-scala_2.10-1.2-SNAPSHOT.jar
opt/lib/ml
opt/lib/ml/arpack_combined_all-0.1.jar
opt/lib/ml/breeze_2.10-0.12.jar
opt/lib/ml/breeze-macros_2.10-0.12.jar
opt/lib/ml/core-1.1.2.jar
opt/lib/ml/flink-ml_2.10-1.2-SNAPSHOT.jar
opt/lib/ml/jtransforms-2.4.0.jar
opt/lib/ml/opencsv-2.3.jar
opt/lib/ml/shapeless_2.10.4-2.0.0.jar
opt/lib/ml/spire_2.10-0.7.4.jar
opt/lib/ml/spire-macros_2.10-0.7.4.jar
opt/lib/storm
opt/lib/storm/carbonite-1.4.0.jar
opt/lib/storm/clj-stacktrace-0.2.2.jar
opt/lib/storm/clj-time-0.4.1.jar
opt/lib/storm/clojure-1.5.1.jar
opt/lib/storm/commons-exec-1.1.jar
opt/lib/storm/core.incubator-0.1.0.jar
opt/lib/storm/disruptor-2.10.1.jar
opt/lib/storm/flink-examples-batch_2.10-1.2-SNAPSHOT.jar
opt/lib/storm/flink-storm_2.10-1.2-SNAPSHOT.jar
opt/lib/storm/flink-storm-examples_2.10-1.2-SNAPSHOT.jar
opt/lib/storm/joda-time-2.5.jar
opt/lib/storm/json-simple-1.1.jar
opt/lib/storm/logback-core-1.0.13.jar
opt/lib/storm/math.numeric-tower-0.0.1.jar
opt/lib/storm/storm-core-0.9.4.jar
opt/lib/storm/storm-starter-0.9.4.jar
opt/lib/storm/tools.cli-0.2.4.jar
opt/lib/storm/tools.logging-0.2.3.jar
opt/lib/storm/tools.macro-0.1.0.jar
opt/lib/storm/twitter4j-core-3.0.3.jar
opt/lib/storm/twitter4j-stream-3.0.3.jar
opt/metrics
opt/metrics/flink-metrics-dropwizard-1.2-SNAPSHOT.jar
opt/metrics/flink-metrics-ganglia-1.2-SNAPSHOT.jar
opt/metrics/flink-metrics-graphite-1.2-SNAPSHOT.jar
opt/metrics/flink-metrics-statsd-1.2-SNAPSHOT.jar
opt/metrics/gmetric4j-1.0.7.jar
opt/metrics/metrics-ganglia-3.1.0.jar
opt/metrics/metrics-graphite-3.1.0.jar
opt/metrics/oncrpc-1.0.7.jar

@StephanEwen
Contributor

I think this is a good idea.
However, having all the individual dependency jar files in one folder makes it very hard to use. To actually use the libs (from an IDE without Maven), it is very hard to tell which jar file belongs to which connector.

Would it make sense to have a sub directory for each connector, with its specific dependencies?

I am also wondering if there are maybe some unwanted test dependencies in the binary build. For example opt/connectors/batch/derby-10.4.2.0.jar is a test dependency from the JDBC in/out formats.

@greghogan
Contributor Author

@StephanEwen thanks for looking at this.

In this first cut I included all the build artifacts. Should connectors be included?

We should also consider allowing jars to be placed in subdirectories of Flink's lib folder.

@StephanEwen
Contributor

Connectors are already included, correct?

@StephanEwen
Contributor

What do you mean by "We should also consider allowing jars to be placed in subdirectories of Flink's lib folder."?

@greghogan
Contributor Author

The mailing list conversation only included an indirect reference to including connectors in this opt directory. I was just trying to verify that we did in fact wish to include these.

I'm not seeing Flink load jar files from a subdirectory of lib. For example, if I copy flink-gelly into lib/gelly then I am unable to run the Gelly examples. It should be possible to copy a subdirectory of opt into lib.

@StephanEwen
Contributor

Subdirectories of lib are currently not considered. That would require a bunch of changes, in the shell scripts and in the Yarn/Mesos setup code. Do you think that is important?
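To make the limitation concrete, here is a minimal sketch of the difference between collecting jars from the top level of `lib` only and collecting them from `lib` and its subdirectories. It is not Flink's actual startup logic (that lives in the shell scripts and the Yarn/Mesos setup code mentioned above); the function name `build_classpath` and its `recursive` flag are hypothetical and exist only for illustration.

```python
import os


def build_classpath(lib_dir, recursive=False):
    """Return a classpath string made of the jar files found under lib_dir.

    Hypothetical helper for illustration only; it is not part of Flink.
    With recursive=False only jars directly inside lib_dir are used, which
    mirrors the behavior described in this thread. With recursive=True,
    jars in subdirectories such as lib/gelly are picked up as well.
    """
    jars = []
    if recursive:
        # Walk the whole tree below lib_dir and keep every *.jar file.
        for root, _dirs, files in os.walk(lib_dir):
            jars.extend(os.path.join(root, f) for f in files if f.endswith(".jar"))
    else:
        # Only look at regular files directly inside lib_dir.
        for name in os.listdir(lib_dir):
            path = os.path.join(lib_dir, name)
            if name.endswith(".jar") and os.path.isfile(path):
                jars.append(path)
    # Sort for a deterministic ordering of classpath entries.
    return os.pathsep.join(sorted(jars))
```

With the flat variant, a jar copied to `lib/gelly` never reaches the classpath, which would explain why the Gelly examples fail to run in that setup.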

If we package any optional dependencies, I think the connectors would be very worthwhile.

@greghogan greghogan force-pushed the 4861_package_optional_project_artifacts branch from 4fe6870 to d61be19 Compare October 27, 2016 18:10
@greghogan
Contributor Author

@StephanEwen latest commit assembles connectors into separate directories.

I'll create a ticket for loading jar files from subdirectories of lib.

@greghogan
Contributor Author

I need to rework this since additional classes are being pulled into the uber jar.

Package the Flink connectors, metrics, and libraries into subdirectories
of a new opt directory in the release/snapshot tarballs.
This new module assembles optional Flink packages with dependencies
which are then copied by flink-dist into build-target. A separate
module is required to prevent these packages from inclusion in the uber
jar (or otherwise requiring a tremendous, brittle exclusion list).
@greghogan greghogan force-pushed the 4861_package_optional_project_artifacts branch from d61be19 to d2170ba Compare November 1, 2016 16:37
@greghogan
Contributor Author

@StephanEwen the latest commit uses a separate module to prevent the optional packages and dependencies from being included in the flink-dist uber jar. flink-dist-opt assembles the dependencies, which are then simply copied by flink-dist.

@rmetzger
Contributor

I tried out the change and I like the idea.
One issue I found is that transitive dependencies are not properly added: Kafka 0.10 depends on the Kafka 0.9 code, but that code (and its dependencies) is not added to the kafka-0.10 directory. I'm not sure if the assembly plugin supports that.

@greghogan
Contributor Author

@rmetzger it appears that project artifacts are not included as transitive dependencies and I had overlooked 0.9 as a dependency for the 0.10 connector. After correcting this the provided hierarchy is as follows:

opt/connectors/streaming/kafka-0.10
opt/connectors/streaming/kafka-0.10/flink-connector-kafka-0.10_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/kafka-0.10/flink-connector-kafka-0.9_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/kafka-0.10/flink-connector-kafka-base_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/kafka-0.10/kafka-clients-0.9.0.1.jar
opt/connectors/streaming/kafka-0.10/lz4-1.2.0.jar
opt/connectors/streaming/kafka-0.8
opt/connectors/streaming/kafka-0.8/flink-connector-kafka-0.8_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/kafka-0.8/flink-connector-kafka-base_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/kafka-0.8/kafka_2.10-0.8.2.2.jar
opt/connectors/streaming/kafka-0.8/scala-library-2.10.4.jar
opt/connectors/streaming/kafka-0.8/zkclient-0.3.jar
opt/connectors/streaming/kafka-0.9
opt/connectors/streaming/kafka-0.9/flink-connector-kafka-0.9_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/kafka-0.9/flink-connector-kafka-base_2.10-1.2-SNAPSHOT.jar
opt/connectors/streaming/kafka-0.9/kafka-clients-0.9.0.1.jar
opt/connectors/streaming/kafka-0.9/lz4-1.2.0.jar

@rmetzger
Contributor

Thank you for fixing the issue so quickly.
I'm wondering whether the current approach is a good idea, because it requires manually checking all transitive dependencies. We have something similar (also manual) in the quickstart, where we exclude some artifacts, and in practice people rarely update that list of excludes.
So I fear that if somebody changes a connector for example, it will not be added here.
I guess adding an assembly descriptor for each connector / module would solve the problem with the transitive dependencies. I don't know if there's a more efficient approach.

What do you think?

@greghogan
Contributor Author

@rmetzger, I have not been able to improve on the current configuration. useTransitiveFiltering seems to work as the inverse of what the plugin documentation describes (true prevents transitive dependencies from being filtered), but project artifacts are still ignored as transitive dependencies.

This implementation only lists artifacts to include (there is no exclusion list), so there shouldn't be the same risk of failing to exclude an unneeded dependency.

How would creating separate assembly descriptors be beneficial?

@greghogan greghogan force-pushed the 4861_package_optional_project_artifacts branch from a1cf913 to 428f01d Compare November 23, 2016 17:57
@rmetzger
Contributor

I thought that transitive dependencies are resolved in the scope of assembly descriptors. But I'm not so sure about that anymore.

@StephanEwen
Contributor

How about just building a fat jar for each connector / library?
That way it becomes quite easy for users - they simply refer to one jar.

@rmetzger
Contributor

I had the same thought. We could add the maven assembly plugin / shade plugin to each connector / library to build a fat jar, and then add some logic to flink-dist to collect these fat jars into the final dist.
I'm not sure how easy it is to pull build outputs from other modules into the dist module, but we are doing that for the examples as well.

@greghogan
Contributor Author

@StephanEwen @rmetzger, why would a user copy an optional fat jar rather than having it included in their uber jar?

By creating fat jars, do we not have the potential for duplicate dependencies if more than one fat jar is included on the classpath? I don't think we can shade since the user may be depending on the transitive dependencies.
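The duplicate-dependency concern could be checked mechanically. The sketch below scans a set of jar files and reports every class entry that appears in more than one of them. It is only an illustration: the function name `find_duplicate_classes` is hypothetical, and it compares entry names only, so it cannot tell whether two copies of a class are identical or come from conflicting versions.

```python
import zipfile
from collections import defaultdict


def find_duplicate_classes(jar_paths):
    """Map each .class entry to the jars containing it, keeping only duplicates.

    Hypothetical helper for illustration; jar files are read as plain zip
    archives and only entry names are compared, not the class contents.
    """
    owners = defaultdict(list)
    for jar in jar_paths:
        with zipfile.ZipFile(jar) as archive:
            # Use a set so a repeated entry inside one jar is counted once.
            for name in sorted(set(archive.namelist())):
                if name.endswith(".class"):
                    owners[name].append(jar)
    # Keep only classes provided by two or more jars.
    return {name: jars for name, jars in owners.items() if len(jars) > 1}
```

Running this over the fat jars a user intends to place on the classpath together would list the classes that more than one jar provides.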

@greghogan
Contributor Author

@StephanEwen @rmetzger, as I revisit this, I'm still questioning the viability of shading an uber jar. Take, for example, a user depending on a Kafka connector. Normally the dependency would be packaged in the user's uber jar. Instead, the user could mark the dependency as provided and copy the connector jar into lib/. If the user makes use of any transitive dependencies (and I assume Kafka classes would be used), then because of the shading, would the user not be required to add Kafka as a dependency to their pom?

@StephanEwen
Contributor

@greghogan I think that when an immediate dependency is "provided", then its transitive dependencies are not pulled. If a user adds a "provided" flink-streaming-connector-kafka-0.9_2.10, then no Kafka dependency will be pulled into the user program uber jar.

Consequently, all transitive dependencies should be added to the lib folder, which is what we would get by uber-jaring the Kafka connector.
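The rule described here matches the dependency scope table in Maven's documentation: when the direct dependency is "provided", its compile and runtime transitive dependencies become "provided" too. A minimal sketch of that table follows. The function names are hypothetical, and treating only compile and runtime scopes as "bundled into the uber jar" is an assumption about a typical shade or assembly setup, not something stated in this thread.

```python
# Effective scope of a transitive dependency, keyed by
# (scope of the direct dependency, scope declared on the transitive one).
# Pairs that are absent mean the transitive dependency is omitted entirely.
_SCOPE_TABLE = {
    ("compile", "compile"): "compile",
    ("compile", "runtime"): "runtime",
    ("provided", "compile"): "provided",
    ("provided", "runtime"): "provided",
    ("runtime", "compile"): "runtime",
    ("runtime", "runtime"): "runtime",
    ("test", "compile"): "test",
    ("test", "runtime"): "test",
}

# Assumption: an uber jar build bundles compile and runtime scoped artifacts only.
_BUNDLED_SCOPES = {"compile", "runtime"}


def effective_scope(direct_scope, transitive_scope):
    """Return the effective scope of a transitive dependency, or None if omitted."""
    return _SCOPE_TABLE.get((direct_scope, transitive_scope))


def is_bundled(direct_scope, transitive_scope):
    """True if the transitive dependency would end up in the user's uber jar."""
    return effective_scope(direct_scope, transitive_scope) in _BUNDLED_SCOPES
```

For a connector declared as "provided", `is_bundled("provided", "compile")` is false, so the Kafka client classes would have to come from the lib folder instead.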

@greghogan
Contributor Author

New implementation using maven-shade-plugin in #3000.

@rmetzger
Contributor

rmetzger commented Jan 3, 2017

@greghogan I guess this PR is invalid by now?
