
[SPARK-30874][SQL] Support Postgres Kerberos login in JDBC connector #27637

Closed
wants to merge 10 commits

Conversation

gaborgsomogyi
Contributor

What changes were proposed in this pull request?

When loading DataFrames from a JDBC data source with Kerberos authentication, remote executors (yarn-client/cluster etc. modes) fail to establish a connection due to the lack of a Kerberos ticket or the ability to generate one.

This is a real issue when trying to ingest data from kerberized data sources (SQL Server, Oracle) in enterprise environments where exposing simple-authentication access is not an option due to IT policy.

In this PR I've added Postgres support (other supported databases will come in later PRs).

What this PR contains:

  • Added keytab and principal JDBC options
  • Added ConnectionProvider trait and its implementations (a rough sketch follows this list):
    • BasicConnectionProvider => insecure connection
    • PostgresConnectionProvider => Postgres secure connection
  • Added ConnectionProvider tests
  • Added PostgresKrbIntegrationSuite docker integration test
  • Created SecurityUtils to centralize reusable security-related functionality
  • Documentation
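
As a rough illustration of the provider abstraction above, a minimal sketch could look like the following (the trait shape and constructor parameters are assumptions for illustration; the actual Spark internals may differ):

```scala
import java.sql.{Connection, Driver}
import java.util.Properties

// Hypothetical sketch of the provider abstraction; not the exact Spark API.
trait ConnectionProvider {
  // Returns a ready-to-use JDBC connection, performing any
  // authentication (e.g. a Kerberos login) first if needed.
  def getConnection(): Connection
}

// Insecure (non-Kerberos) case: delegate straight to the JDBC driver.
class BasicConnectionProvider(driver: Driver, url: String, props: Properties)
    extends ConnectionProvider {
  override def getConnection(): Connection = driver.connect(url, props)
}
```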

Why are the changes needed?

Missing JDBC Kerberos support.

Does this PR introduce any user-facing change?

Yes, two additional JDBC options are added:

  • keytab
  • principal

If both are provided, Spark performs Kerberos authentication.
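
A minimal usage sketch (the URL, table, keytab path, and principal below are placeholders, not values from this PR):

```scala
// Hypothetical read from a kerberized Postgres instance; all option
// values here are illustrative. Assumes an existing SparkSession `spark`.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com/mydb")
  .option("dbtable", "public.accounts")
  .option("keytab", "/path/to/client.keytab")
  .option("principal", "client@EXAMPLE.COM")
  .load()
```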

How was this patch tested?

To demonstrate the functionality with a standalone application I've created this repository: https://github.com/gaborgsomogyi/docker-kerberos

  • Additional + existing unit tests
  • Additional docker integration test
  • Test on cluster manually
  • SKIP_API=1 jekyll build

@SparkQA

SparkQA commented Feb 19, 2020

Test build #118676 has finished for PR 27637 at commit b7275e6.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PGJDBCConfiguration(

@SparkQA

SparkQA commented Feb 19, 2020

Test build #118677 has finished for PR 27637 at commit e8d8f2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

cc @HeartSaVioR

@dongjoon-hyun
Member

Thank you for doing this, @gaborgsomogyi !

@gaborgsomogyi
Contributor Author

@dongjoon-hyun thanks for investing your time! Finishing up some testing with MySQL, then going to resolve the suggestions...

Contributor

@HeartSaVioR HeartSaVioR left a comment

Just finished a first pass of review.

@SparkQA

SparkQA commented Feb 21, 2020

Test build #118792 has finished for PR 27637 at commit e3f6200.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2020

Test build #118797 has finished for PR 27637 at commit 9c50a75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@HeartSaVioR HeartSaVioR left a comment

The code change looks good.

Looks like the integration tests added here weren't run on Jenkins. I'm not sure whether it requires some tag in the PR title, or whether it's completely manual.

@HeartSaVioR
Contributor

HeartSaVioR commented Feb 26, 2020

./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 -Dtest=none -DwildcardSuites="org.apache.spark.sql.jdbc.*"

This has been failing with the following error (after a super slow download):

[ERROR] Failed to execute goal on project spark-docker-integration-tests_2.12: Could not resolve dependencies for project org.apache.spark:spark-docker-integration-tests_2.12:jar:3.0.0-SNAPSHOT: Could not transfer artifact com.ibm.db2.jcc:db2jcc4:jar:10.5.0.5 from/to db (https://app.camunda.com/nexus/content/repositories/public/): GET request of: com/ibm/db2/jcc/db2jcc4/10.5.0.5/db2jcc4-10.5.0.5.jar from db failed: Premature end of Content-Length delimited message body (expected: 3,411,524; received: 1,968,865)

It's not related to the change, but I cannot run this to make sure the tests pass. Unfortunately, I haven't found any alternative repository that hosts db2jcc4 at that version.

@gaborgsomogyi
Contributor Author

Looks like the integration tests added here weren't run on Jenkins. I'm not sure whether it requires some tag in the PR title, or whether it's completely manual.

AFAIK there is no Jenkins integration for this. What I mainly do is compile Spark with the docker integration tests enabled and then execute them just like unit tests.

Maybe the problem is a corrupted local m2 repository? Not sure, but it's worth deleting it and retrying...

@gaborgsomogyi
Contributor Author

Another idea: try it with sbt...
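
For reference, the sbt equivalent should be something like the following (the sbt project name is assumed from the Maven module id above; this exact invocation is not from this thread):

./build/sbt -Pdocker-integration-tests "docker-integration-tests/testOnly org.apache.spark.sql.jdbc.*"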

@SparkQA

SparkQA commented Feb 27, 2020

Test build #119037 has finished for PR 27637 at commit febf39a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 28, 2020

Test build #119053 has finished for PR 27637 at commit 1f318d9.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 28, 2020

Test build #119069 has finished for PR 27637 at commit 1f318d9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 28, 2020

Test build #119101 has finished for PR 27637 at commit 1f318d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 2, 2020

Test build #119181 has finished for PR 27637 at commit 9affc4f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

retest this please

@SparkQA

SparkQA commented Mar 3, 2020

Test build #119216 has finished for PR 27637 at commit 9affc4f.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

retest this please

@SparkQA

SparkQA commented Mar 3, 2020

Test build #119217 has finished for PR 27637 at commit 9affc4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

val driverClass = "org.postgresql.Driver"
val appEntry = "pgjdbc"
Contributor

Following a previous comment of yours... if you have a Spark app that needs to connect to 2 different pgsql data sources, each using different credentials, will you have a problem here?

Contributor Author

That's a really good point, which the current code doesn't cover. Postgres supports configuring jaasApplicationName, which can be used to overcome this issue. Adding the support...
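
For example, the application name can be set in the connection URL (the name here is illustrative; the same URL appears in the discussion below):

jdbc:postgresql://localhost/postgres?jaasApplicationName=custompgjdbc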

Contributor Author

In order to support this, the JDBC URL must be parsed, which can be done in several ways:

  • Use the Postgres driver's URL-parsing API through reflection => no ugly dependency
  • Use the Postgres driver's URL-parsing API through a provided dependency => no ugly reflection
  • Re-implement it => overkill

I've chosen the first approach, but if you think the second is better we can change it.

@SparkQA

SparkQA commented Mar 4, 2020

Test build #119300 has finished for PR 27637 at commit 87c84ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 4, 2020

Test build #119311 has finished for PR 27637 at commit 8a2cc85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@vanzin vanzin left a comment

Looks ok, just a small comment.

Comment on lines 47 to 48
val parseURL = driver.getClass.getMethod("parseURL", classOf[String], classOf[Properties])
val properties = parseURL.invoke(driver, options.url, null).asInstanceOf[Properties]
Contributor

This feels a bit yucky but I couldn't find an easily available method to parse the query string... (you could use URI.getQuery().split("&").find(...) though.)

Contributor Author

I've tried this before, but it doesn't parse the URL properly, so I abandoned it:

scala> new java.net.URI("jdbc:postgresql://localhost/postgres?jaasApplicationName=custompgjdbc").getQuery()
res1: String = null

I was able to solve this with more code, but I concluded that a custom implementation isn't worth it.
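
For reference, one workaround along those lines (an illustration, not what the PR ended up doing) is to strip the jdbc: prefix so java.net.URI sees a hierarchical URL and can return the query string:

```scala
// Illustrative workaround: java.net.URI treats "jdbc:..." as an opaque URI
// (the scheme-specific part doesn't start with "/"), so getQuery() is null.
// Stripping the "jdbc:" prefix yields a hierarchical URI that parses fine.
val url = "jdbc:postgresql://localhost/postgres?jaasApplicationName=custompgjdbc"
val query = new java.net.URI(url.stripPrefix("jdbc:")).getQuery
// query == "jaasApplicationName=custompgjdbc"
val appName = Option(query).toSeq
  .flatMap(_.split("&"))
  .map(_.split("=", 2))
  .collectFirst { case Array("jaasApplicationName", v) => v }
// appName == Some("custompgjdbc")
```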

Contributor Author

This feels a bit yucky

I agree, but I don't have a less yucky idea...
...split("?")...split("&").find(...) would be one possibility, but I think it would be brittle in a different way...

Contributor

Ok... even though using reflection to access private methods in external libraries is always a brittle solution, if that becomes a problem this can be rewritten easily.

@SparkQA

SparkQA commented Mar 9, 2020

Test build #119559 has finished for PR 27637 at commit 301cc1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Mar 12, 2020

retest this please

@SparkQA

SparkQA commented Mar 13, 2020

Test build #119730 has finished for PR 27637 at commit 301cc1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Mar 13, 2020

Merging to master.

@vanzin vanzin closed this in 231e650 Mar 13, 2020
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#27637 from gaborgsomogyi/SPARK-30874.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@apache.org>
@gatorsmile
Member

@gaborgsomogyi Has anyone discussed Kerberos ticket renewal and expiration?
