Skip to content

Conversation

srowen
Copy link
Member

@srowen srowen commented Aug 11, 2018

What changes were proposed in this pull request?

Remove jets3t dependency, and bouncy castle which it brings in; update licenses and deps
Note this just takes over #21146

How was this patch tested?

Existing tests.

@srowen
Copy link
Member Author

srowen commented Aug 11, 2018

CC @steveloughran

@SparkQA
Copy link

SparkQA commented Aug 12, 2018

Test build #94632 has finished for PR 22081 at commit a222cb8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

pom.xml Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this changes from jets3t.version>0.9.4?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah really the change is to remove jets3t; I think Steve also thought it necessary to add back javax.activation that it brought in but wasn't otherwise depended-on by Hadoop. That is I think this patches a gap in the Hadoop pom?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason for the activation is that it still comes in from somewhere and the version pulled in drops from 1.1.1 to 1.1; meaning it'd be an accidental downgrade of a JAR. I don't know exactly what uses javax.activation: it is one of those historical artifacts whose main role, mapping mime types, is potentially used somewhere important.

@SparkQA
Copy link

SparkQA commented Aug 12, 2018

Test build #4243 has finished for PR 22081 at commit a222cb8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member Author

srowen commented Aug 12, 2018

Hm @steveloughran looks like the Kinesis tests fail reliably. That makes me suspicious that jets3t is needed, given that's the AWS dependency here. WDYT?

@steveloughran
Copy link
Contributor

steveloughran commented Aug 13, 2018

I've just pushed up my PR which is ~ in sync with this one; I'll close that one now and this can be the one to use.

Assume: kinesis uses bouncy castle somewhere. There's some hints in the AWS docs

Encrypt and Decrypt Amazon Kinesis Records Using AWS KMS covers end-to-end encryption of Kinesis records. For this you need the AWS encryption SDK, whose docs say you need bouncy castle.

And it looks like the AWS encryption SDK does explicitly depend on bouncy castle.

Imagine if somehow the removal of bouncy castle as a java crypto provider was stopping that round trip working with some of the encrypt/decrypt not happening. In which case adding bouncy castle should fix things. It worked before because jets3t in spark-core added bouncy castle, and the last bouncy-castle version update made it in sync with kinesis (and broke jets3t, but nobody has noticed...)

But

  • There's no refs to javax.crypto, the aws crypto libs or calls to the class KinesisEncryptionUtilsreferenced in the blog post in the spark kinesis module (it's not in the latest SDKs either(
  • There's no build-time dependency on the aws-sdk encryption, which would transitively pull in the bouncy castle stuff.
  • Looking through the aws-sdk-bundle: no refs to javax.crypto in the kinesis code; encryption refs limited to the PUT request where you can request server-side encryption with a given KMS key.
  • Nor is there any com.aws.encryptionsdk in that bundle, or shaded bouncy castle (which is good, as otherwise I'd have to deal with the fact that some ASF projects were shipping a shaded version of it unknowingly)

It could just be a strong java crypto provided is needed, and in the absence of the unlimited java crypto JAR in the JDK lib dir (where it's needed for kerberos to work), bouncy-castle needs to be on the CP.

What to do?

  1. you can remove jets3t independent of the bouncy castle changes, because Kinesis isn't going to be using jets3t. The aws-s3 module significantly supercedes the jets3t client's functionality, and is the only one you'd expect the other parts of the AWS SDK to pick up.
  2. the bouncy-castle dependency could be upgraded to a later version in the kinesis module(s) alone, and explicitly added to kinesis-asl.
  3. Someone needs to do some experiments with what happens to the test suite with/without the full JCE and bouncy castle, maybe including more details on whats not matching up in the round trip tests
  4. Maybe including some new test which somehow explores what encryption algorithms/keys you get with/without the BC and JCE-unlimited JARs

@srowen
Copy link
Member Author

srowen commented Aug 14, 2018

Hm, I wonder, does the (newer) Kinesis SDK pull in bouncy castle? that's fine if so, that would make sense. If #22099 works, then we'll see if this then passes.

I suppose it's possible it's not actually related, and the Kinesis bit hasn't worked for a while because it's just never changed and so not tested, and needed an SDK update for some other reason. Who knows.

@steveloughran
Copy link
Contributor

No, the SDKs dont pull in bouncy-castle. Checked via mvnrepo

  • core sdk pulls in jackson & httpclient; historically v. fussy about httpclient versions
  • kinesis SDK: adds protobuf-2.6, guava 18.0, others.

I've checked with Shane: the jenkins systems do not have the unlimited javax crypto in, so suspect that bouncy-castle is just needed for testing

@srowen
Copy link
Member Author

srowen commented Aug 14, 2018

I see. I guess I'm trying to figure out whether it's reasonable or not to pull in bouncy castle -- just in the Kinesis module I guess -- on behalf of the user then?

That's the default conservative thing to do, but having just gone through the ECCN process, I realize it's not trivial for us to redistribute bouncy castle. Worth trimming if it's really about all the same to users.

@steveloughran
Copy link
Contributor

I think it should be stripped; maybe add a note to the docs "you need unlimited JCE for kinesis to work"

@srowen
Copy link
Member Author

srowen commented Aug 14, 2018

Makes sense, but then I wonder how the tests work? we need a test dependency on it?

@steveloughran
Copy link
Contributor

making a test-time option is a reasonable idea -getting the unlimited JCE on the test machines (they don't right now) would remove the need for this

shaneknapp pushed a commit to shaneknapp/spark that referenced this pull request Aug 15, 2018
…ions

This PR has been superceded by apache#22081

## What changes were proposed in this pull request?

Increment the kinesis client, producer and transient AWS SDK versions to a more recent release.

This is to help with the move off bouncy castle of apache#21146 and apache#22081; the goal is that moving up to the new SDK will allow a JVM with unlimited JCE but without bouncy castle to work with Kinesis endpoints.

Why this specific set of artifacts? it syncs up with the 1.11.271 AWS SDK used by hadoop 3.0.3, hadoop-3.1. and hadoop 3.1.1; that's been stable for the uses there (s3, STS, dynamo).

## How was this patch tested?

Running all the external/kinesis-asl tests via maven with java 8.121 & unlimited JCE, without bouncy castle (apache#21146); default endpoint of us-west.2. Without this SDK update I was getting http cert validation errors, with it they went away.

# This PR is not ready without

* Jenkins test runs to see what it is happy with
* more testing: repeated runs, another endpoint
* looking at the new deprecation warnings and selectively addressing them (the AWS SDKs are pretty aggressive about deprecation, but sometimes they increase the complexity of the client code or block some codepaths off completely)

Closes apache#22099 from steveloughran/cloud/SPARK-25111-kinesis.

Authored-by: Steve Loughran <stevel@hortonworks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@SparkQA
Copy link

SparkQA commented Aug 16, 2018

Test build #94849 has finished for PR 22081 at commit d0334c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member Author

srowen commented Aug 16, 2018

Merged to master

@asfgit asfgit closed this in b3e6fe7 Aug 16, 2018
@steveloughran
Copy link
Contributor

Thanks. Two less JARs on the CP to keep up to date —what more can anyone want?

@srowen srowen deleted the SPARK-23654 branch August 20, 2018 21:08
sumwale pushed a commit to TIBCOSoftware/snappy-spark that referenced this pull request Jun 10, 2022
Remove jets3t dependency, and bouncy castle which it brings in; update licenses and deps
Note this just takes over apache#21146

Existing tests.

Closes apache#22081 from srowen/SPARK-23654.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants