
[SPARK-30874][SQL] Support Postgres Kerberos login in JDBC connector #27637

Closed
wants to merge 10 commits

Conversation

gaborgsomogyi
Contributor

What changes were proposed in this pull request?

When loading DataFrames from a JDBC data source with Kerberos authentication, remote executors (yarn-client/cluster etc. modes) fail to establish a connection due to the lack of a Kerberos ticket or the ability to generate one.

This is a real issue when trying to ingest data from kerberized data sources (SQL Server, Oracle) in enterprise environments where exposing simple-authentication access is not an option due to IT policy.

In this PR I've added Postgres support (other supported databases will come in later PRs).

What this PR contains:

  • Added keytab and principal JDBC options
  • Added ConnectionProvider trait and its implementations (a rough sketch follows this list):
    • BasicConnectionProvider => insecure connection
    • PostgresConnectionProvider => Postgres secure connection
  • Added ConnectionProvider tests
  • Added PostgresKrbIntegrationSuite docker integration test
  • Created SecurityUtils to centralize reusable security-related functionality
  • Documentation
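
As a rough illustration of the provider abstraction above, a minimal sketch could look like the following (the trait shape and constructor parameters are assumptions for illustration; the actual Spark internals may differ):

```scala
import java.sql.{Connection, Driver}
import java.util.Properties

// Hypothetical sketch of the provider abstraction; not the exact Spark API.
trait ConnectionProvider {
  // Returns a ready-to-use JDBC connection, performing any
  // authentication (e.g. a Kerberos login) first if needed.
  def getConnection(): Connection
}

// Insecure (non-Kerberos) case: delegate straight to the JDBC driver.
class BasicConnectionProvider(driver: Driver, url: String, props: Properties)
    extends ConnectionProvider {
  override def getConnection(): Connection = driver.connect(url, props)
}
```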

Why are the changes needed?

Missing JDBC Kerberos support.

Does this PR introduce any user-facing change?

Yes, two additional JDBC options are added:

  • keytab
  • principal

If both are provided, Spark performs Kerberos authentication.
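
A minimal usage sketch (the URL, table, keytab path, and principal below are placeholders, not values from this PR):

```scala
// Hypothetical read from a kerberized Postgres instance; all option
// values here are illustrative. Assumes an existing SparkSession `spark`.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com/mydb")
  .option("dbtable", "public.accounts")
  .option("keytab", "/path/to/client.keytab")
  .option("principal", "client@EXAMPLE.COM")
  .load()
```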

How was this patch tested?

To demonstrate the functionality with a standalone application I've created this repository: https://github.com/gaborgsomogyi/docker-kerberos

  • Additional + existing unit tests
  • Additional docker integration test
  • Test on cluster manually
  • SKIP_API=1 jekyll build

@SparkQA

SparkQA commented Feb 19, 2020

Test build #118676 has finished for PR 27637 at commit b7275e6.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PGJDBCConfiguration(

@SparkQA

SparkQA commented Feb 19, 2020

Test build #118677 has finished for PR 27637 at commit e8d8f2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

cc @HeartSaVioR

@dongjoon-hyun
Member

Thank you for doing this, @gaborgsomogyi !

@gaborgsomogyi
Contributor Author

@dongjoon-hyun thanks for investing your time! Finishing up some testing with MySQL, then going to resolve the suggestions...

Contributor

@HeartSaVioR HeartSaVioR left a comment

Just finished a first pass of review.

@SparkQA

SparkQA commented Feb 21, 2020

Test build #118792 has finished for PR 27637 at commit e3f6200.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2020

Test build #118797 has finished for PR 27637 at commit 9c50a75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@HeartSaVioR HeartSaVioR left a comment

The code change looks good.

Looks like the integration tests added here weren't run on Jenkins. I'm not sure whether it requires some tag in the PR title, or whether it's completely manual.

@HeartSaVioR
Contributor

HeartSaVioR commented Feb 26, 2020

./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 -Dtest=none -DwildcardSuites="org.apache.spark.sql.jdbc.*"

This has been failing with the following error (after a super slow download):

[ERROR] Failed to execute goal on project spark-docker-integration-tests_2.12: Could not resolve dependencies for project org.apache.spark:spark-docker-integration-tests_2.12:jar:3.0.0-SNAPSHOT: Could not transfer artifact com.ibm.db2.jcc:db2jcc4:jar:10.5.0.5 from/to db (https://app.camunda.com/nexus/content/repositories/public/): GET request of: com/ibm/db2/jcc/db2jcc4/10.5.0.5/db2jcc4-10.5.0.5.jar from db failed: Premature end of Content-Length delimited message body (expected: 3,411,524; received: 1,968,865)

It's not related to the change, but I cannot run this to make sure the tests pass. Unfortunately, I haven't found any alternative repository that hosts db2jcc4 at that version.

@gaborgsomogyi
Contributor Author

Looks like the integration tests added here weren't run on Jenkins. I'm not sure whether it requires some tag in the PR title, or whether it's completely manual.

AFAIK there is no Jenkins integration for this. What I mainly do is compile Spark with the docker integration tests enabled and then execute them just like unit tests.

Maybe the problem is a corrupted local m2 repository? Not sure, but it's worth deleting it and retrying...

@gaborgsomogyi
Contributor Author

Another idea: try it with sbt...
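
For reference, the sbt equivalent should be something like the following (the sbt project name is assumed from the Maven module id above; this exact invocation is not from this thread):

./build/sbt -Pdocker-integration-tests "docker-integration-tests/testOnly org.apache.spark.sql.jdbc.*"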

@SparkQA

SparkQA commented Feb 27, 2020

Test build #119037 has finished for PR 27637 at commit febf39a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 28, 2020

Test build #119053 has finished for PR 27637 at commit 1f318d9.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 28, 2020

Test build #119069 has finished for PR 27637 at commit 1f318d9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 28, 2020

Test build #119101 has finished for PR 27637 at commit 1f318d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 2, 2020

Test build #119181 has finished for PR 27637 at commit 9affc4f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

retest this please

@SparkQA

SparkQA commented Mar 3, 2020

Test build #119216 has finished for PR 27637 at commit 9affc4f.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

retest this please

@SparkQA

SparkQA commented Mar 3, 2020

Test build #119217 has finished for PR 27637 at commit 9affc4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

val driverClass = "org.postgresql.Driver"
val appEntry = "pgjdbc"
Contributor

Following a previous comment of yours... if you have a Spark app that needs to connect to 2 different pgsql data sources, each using different credentials, will you have a problem here?

Contributor Author

That's a really good point, which the current code doesn't cover. Postgres supports configuring jaasApplicationName, which can be used to overcome this issue. Adding the support...
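
For example, the application name can be set in the connection URL (the name here is illustrative; the same URL appears in the discussion below):

jdbc:postgresql://localhost/postgres?jaasApplicationName=custompgjdbc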

Contributor Author

In order to support this, the JDBC URL must be parsed, which can be done in several ways:

  • Use the Postgres driver's URL-parsing API through reflection => no ugly dependency
  • Use the Postgres driver's URL-parsing API through a provided dependency => no ugly reflection
  • Re-implement it => overkill

I've chosen the first approach, but if you think the second is better we can change it.

@SparkQA

SparkQA commented Mar 4, 2020

Test build #119300 has finished for PR 27637 at commit 87c84ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 4, 2020

Test build #119311 has finished for PR 27637 at commit 8a2cc85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@vanzin vanzin left a comment

Looks ok, just a small comment.

Comment on lines 47 to 48
val parseURL = driver.getClass.getMethod("parseURL", classOf[String], classOf[Properties])
val properties = parseURL.invoke(driver, options.url, null).asInstanceOf[Properties]
Contributor

This feels a bit yucky but I couldn't find an easily available method to parse the query string... (you could use URI.getQuery().split("&").find(...) though.)

Contributor Author

I've tried this before, but it doesn't parse the URL properly, so I abandoned it:

scala> new java.net.URI("jdbc:postgresql://localhost/postgres?jaasApplicationName=custompgjdbc").getQuery()
res1: String = null

I was able to solve this with more code, but I concluded that a custom implementation isn't worth it.
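
For reference, one workaround along those lines (an illustration, not what the PR ended up doing) is to strip the jdbc: prefix so java.net.URI sees a hierarchical URL and can return the query string:

```scala
// Illustrative workaround: java.net.URI treats "jdbc:..." as an opaque URI
// (the scheme-specific part doesn't start with "/"), so getQuery() is null.
// Stripping the "jdbc:" prefix yields a hierarchical URI that parses fine.
val url = "jdbc:postgresql://localhost/postgres?jaasApplicationName=custompgjdbc"
val query = new java.net.URI(url.stripPrefix("jdbc:")).getQuery
// query == "jaasApplicationName=custompgjdbc"
val appName = Option(query).toSeq
  .flatMap(_.split("&"))
  .map(_.split("=", 2))
  .collectFirst { case Array("jaasApplicationName", v) => v }
// appName == Some("custompgjdbc")
```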

Contributor Author

This feels a bit yucky

I agree, but I don't have a less yucky idea...
...split("?")...split("&").find(...) would be one possibility, but I think it would be brittle in a different way...

Contributor

Ok... even though using reflection to access private methods in external libraries is always a brittle solution, if that becomes a problem this can be rewritten easily.

@SparkQA

SparkQA commented Mar 9, 2020

Test build #119559 has finished for PR 27637 at commit 301cc1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Mar 12, 2020

retest this please

@SparkQA

SparkQA commented Mar 13, 2020

Test build #119730 has finished for PR 27637 at commit 301cc1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Mar 13, 2020

Merging to master.

@vanzin vanzin closed this in 231e650 Mar 13, 2020
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#27637 from gaborgsomogyi/SPARK-30874.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@apache.org>
@gatorsmile
Member

@gaborgsomogyi Has anyone discussed Kerberos ticket renewal and expiration?
