[SPARK-30874][SQL] Support Postgres Kerberos login in JDBC connector #27637
Conversation
Test build #118676 has finished for PR 27637 at commit

Test build #118677 has finished for PR 27637 at commit
cc @HeartSaVioR
...-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala
...integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresKrbIntegrationSuite.scala
...in/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/ConnectionProvider.scala
Thank you for doing this, @gaborgsomogyi !

@dongjoon-hyun thanks for investing your time! Finishing up some testing with MySQL and going to resolve the suggestions...
Just done the first pass of reviewing.
...integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresKrbIntegrationSuite.scala
...tegration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerKrbJDBCIntegrationSuite.scala
.../org/apache/spark/sql/execution/datasources/jdbc/connection/PostgresConnectionProvider.scala
...apache/spark/sql/execution/datasources/jdbc/connection/PostgresConnectionProviderSuite.scala
Test build #118792 has finished for PR 27637 at commit

Test build #118797 has finished for PR 27637 at commit
The code change looks good.
It looks like the added integration tests weren't running on Jenkins. I'm not sure whether that requires some tag in the PR title, or whether it's completely manual.

This has been failing with the following error (and a super slow download):

It's not related to the change, but I cannot run this to make sure the tests pass. Unfortunately I haven't found any alternative repo for db2jcc4 with that version.

AFAIK there is no Jenkins integration for this. What I'm mainly doing is compiling Spark with the docker integration tests and then executing them just like unit tests. The problem is maybe a corrupted m2 repo?! Not sure, but it's worth a try to delete it and retry...

Another idea to try with
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
Test build #119037 has finished for PR 27637 at commit

Test build #119053 has finished for PR 27637 at commit

retest this please

Test build #119069 has finished for PR 27637 at commit

retest this please

Test build #119101 has finished for PR 27637 at commit
.../org/apache/spark/sql/execution/datasources/jdbc/connection/PostgresConnectionProvider.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
Test build #119181 has finished for PR 27637 at commit

retest this please

Test build #119216 has finished for PR 27637 at commit

retest this please

Test build #119217 has finished for PR 27637 at commit
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
...in/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/ConnectionProvider.scala
```scala
val driverClass = "org.postgresql.Driver"
val appEntry = "pgjdbc"
```
Following a previous comment of yours... if you have a Spark app that needs to connect to 2 different pgsql data sources, each using different credentials, will you have a problem here?
That's a really good point, which the actual code doesn't cover. Postgres supports configuring `jaasApplicationName`, which must be used to overcome this issue. Adding the support...
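As an illustrative sketch of the scenario (the hostnames, database names, JAAS entry names, and the parsing helper below are all made up for the example, not the PR's implementation): two kerberized Postgres sources can be told apart by the `jaasApplicationName` URL parameter, roughly like this:

```scala
// Illustrative URLs: two data sources mapped to different JAAS entries.
val salesUrl = "jdbc:postgresql://sales-db/sales?jaasApplicationName=pgjdbc-sales"
val hrUrl    = "jdbc:postgresql://hr-db/hr?jaasApplicationName=pgjdbc-hr"

// Stand-in parser: extract the JAAS entry name from the query string,
// falling back to the driver's default entry name "pgjdbc" when absent.
def appEntry(url: String): String = {
  val parts = url.split('?')
  if (parts.length < 2) "pgjdbc"
  else parts(1).split('&')
    .collectFirst { case p if p.startsWith("jaasApplicationName=") => p.split("=", 2)(1) }
    .getOrElse("pgjdbc")
}

println(appEntry(salesUrl)) // pgjdbc-sales
println(appEntry(hrUrl))    // pgjdbc-hr
```

With distinct entry names, each data source can resolve its own keytab/principal pair from the JAAS configuration instead of colliding on a single shared entry.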
In order to support this, the JDBC URL must be parsed, which can be done in several ways:

- Using the Postgres driver by calling the appropriate API function through reflection => no ugly dependency
- Using the Postgres driver by calling the appropriate API function through a provided dependency => no ugly reflection
- Re-implementing it => overkill

I've chosen the first approach, but if you think the second is better we can change it.
Test build #119300 has finished for PR 27637 at commit

Test build #119311 has finished for PR 27637 at commit
Looks ok, just a small comment.
```scala
val parseURL = driver.getClass.getMethod("parseURL", classOf[String], classOf[Properties])
val properties = parseURL.invoke(driver, options.url, null).asInstanceOf[Properties]
```
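The `getMethod`/`invoke` pattern itself can be shown self-contained against a JDK class (a stand-in target, since the Postgres driver isn't on the classpath here; only the pattern, not the class, matches the PR's code):

```scala
import java.util.Properties

// Same reflection pattern, demonstrated on java.util.Properties:
// look the method up by name and parameter types, then invoke it.
val props = new Properties()
props.setProperty("jaasApplicationName", "custompgjdbc")

val getProperty = props.getClass.getMethod("getProperty", classOf[String])
val value = getProperty.invoke(props, "jaasApplicationName").asInstanceOf[String]
println(value) // custompgjdbc
```

The trade-off is the usual one for reflection: no compile-time dependency on the driver, at the cost of breaking silently if the method's name or signature ever changes.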
This feels a bit yucky, but I couldn't find an easily available method to parse the query string... (you could use `URI.getQuery().split("&").find(...)` though.)
I've tried this before, but it didn't parse the URL properly, so I abandoned it:

```scala
scala> new java.net.URI("jdbc:postgresql://localhost/postgres?jaasApplicationName=custompgjdbc").getQuery()
res1: String = null
```

I was able to solve it with more code, but concluded that a custom implementation isn't worth it.
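For context on why `getQuery()` comes back null (a side note, not part of the PR): the `jdbc:` scheme makes the whole URL an opaque URI, so `java.net.URI` never parses an authority, path, or query out of it. Stripping the `jdbc:` prefix yields a hierarchical URI whose query string is accessible:

```scala
val jdbcUrl = "jdbc:postgresql://localhost/postgres?jaasApplicationName=custompgjdbc"

// Scheme "jdbc" with a non-"/" scheme-specific part => opaque URI, no query.
val opaque = new java.net.URI(jdbcUrl)
println(opaque.isOpaque)  // true
println(opaque.getQuery)  // null

// Dropping the prefix gives a hierarchical URI with a usable query string.
val hierarchical = new java.net.URI(jdbcUrl.stripPrefix("jdbc:"))
println(hierarchical.getQuery) // jaasApplicationName=custompgjdbc
```

This is the "more code" workaround alluded to above; it still needs manual splitting of the query string afterwards, which is why the driver's own parser was preferred.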
> This feels a bit yucky

I agree, but I don't have a less yucky idea... `...split("?")...split("&").find(...)` would be one possibility, but I think it would be brittle in a different way...
Ok... even though using reflection to access private methods in external libraries is always a brittle solution, if that becomes a problem this can be rewritten easily.
.../org/apache/spark/sql/execution/datasources/jdbc/connection/PostgresConnectionProvider.scala
Test build #119559 has finished for PR 27637 at commit

retest this please

Test build #119730 has finished for PR 27637 at commit

Merging to master.
### What changes were proposed in this pull request?

When loading DataFrames from a JDBC data source with Kerberos authentication, remote executors (yarn-client/cluster etc. modes) fail to establish a connection due to the lack of a Kerberos ticket or the ability to generate one.

This is a real issue when trying to ingest data from kerberized data sources (SQL Server, Oracle) in enterprise environments where exposing simple authentication access is not an option due to IT policy. In this PR I've added Postgres support (other supported databases will come in later PRs).

What this PR contains:
* Added `keytab` and `principal` JDBC options
* Added a `ConnectionProvider` trait and its implementations:
  * `BasicConnectionProvider` => insecure connection
  * `PostgresConnectionProvider` => Postgres secure connection
* Added `ConnectionProvider` tests
* Added a `PostgresKrbIntegrationSuite` docker integration test
* Created `SecurityUtils` to concentrate re-usable security related functionality
* Documentation

### Why are the changes needed?

Missing JDBC Kerberos support.

### Does this PR introduce any user-facing change?

Yes, 2 additional JDBC options were added:
* keytab
* principal

If both are provided, Spark performs Kerberos authentication.

### How was this patch tested?

To demonstrate the functionality with a standalone application I've created this repository: https://github.com/gaborgsomogyi/docker-kerberos
* Additional + existing unit tests
* Additional docker integration test
* Test on cluster manually
* `SKIP_API=1 jekyll build`

Closes apache#27637 from gaborgsomogyi/SPARK-30874.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@apache.org>
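To illustrate the user-facing change, here is a minimal sketch of the new options (the host, table, keytab path, and principal are all made up for the example; in a real app this map would be passed to `spark.read.format("jdbc").options(...)`):

```scala
// Hypothetical option map using the two options added by this PR.
val jdbcOptions = Map(
  "url"       -> "jdbc:postgresql://postgres:5432/postgres",
  "dbtable"   -> "public.example_table",               // illustrative table
  "keytab"    -> "/etc/security/keytabs/user.keytab",  // illustrative path
  "principal" -> "user@EXAMPLE.COM"                    // illustrative principal
)

// Kerberos authentication is attempted only when both options are present;
// otherwise the basic (insecure) connection provider is used.
val useKerberos = jdbcOptions.contains("keytab") && jdbcOptions.contains("principal")
println(useKerberos) // true
```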
@gaborgsomogyi Has anyone discussed the Kerberos renewal and expiration?