
[Feature] Improve key-pair auth performance #1082

Open
3 tasks done

colin-rogers-dbt opened this issue Jun 12, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@colin-rogers-dbt
Contributor

colin-rogers-dbt commented Jun 12, 2024

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-snowflake functionality, rather than a Big Idea better suited to a discussion

Describe the feature

@VersusFacit's analysis:

Params:

  • Unix time command -- acceptable imprecision for the orders of magnitude we're dealing with here
  • 5,000-node project
  • Average of 2 runs for each authentication method

Results:

  dbt-snowflake, user/pass: 427.83s user 42.43s system 16% cpu 46:49.12 total
  dbt-snowflake, key pair:  1011.76s user 44.96s system 32% cpu 53:42.99 total

~400s vs ~1000s of user CPU time is quite a dramatic difference!

Avenues for investigation:

  • We want to first try caching the keyfile contents
  • Redo timing analysis to see if there's any benefit
  • Look at any extra calls being made to snowflake
  • If that fails, create a ticket for further investigation
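The caching avenue above could be sketched roughly like this (a minimal illustration using the stdlib's functools.lru_cache; read_cached_key is a hypothetical helper, not actual dbt-snowflake code):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def read_cached_key(path: str) -> bytes:
    """Read the private key file once per path and reuse the bytes
    for every subsequent connection instead of re-reading per node."""
    with open(path, "rb") as f:
        return f.read()
```

Because the cache is keyed by path, every node after the first gets the already-read bytes without touching disk.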
@colin-rogers-dbt colin-rogers-dbt added enhancement New feature or request triage labels Jun 12, 2024
@VersusFacit
Contributor

VersusFacit commented Jun 12, 2024

Useful docs I used for testing: https://docs.snowflake.com/en/user-guide/key-pair-auth

openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
ALTER USER <you> SET RSA_PUBLIC_KEY=''; -- yes, this will require admin help; don't include the header/footer lines of the .pub file

@amardatar

Hey team! I found this issue after switching from a 2048-bit key-pair to a 4096-bit one, and found my dbt run times increasing from ~5 minutes to ~15 minutes.

I had a bit of a dig on this, and figured I'd share some findings (and can put together some suggestions for changes as well if there's a preference on how this is handled in the project).

The core issue seems to be that the private key is being read (and validated - I'll get to that) on every dbt node, which eats up the time.

First - this took me too long to find, but the easiest solution seems to be simply enabling the reuse_connections profile config. Maybe it could be suggested in the key-pair authentication section of the docs?
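For reference, this is roughly what that looks like in profiles.yml (a sketch; the profile/target names and the account/user placeholders are mine, adjust to your own setup):

```yaml
my_profile:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <account>
      user: <user>
      private_key_path: /path/to/rsa_key.p8
      reuse_connections: true  # reuse open connections across nodes
```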

Anyway - in terms of testing what was going on, I did a few tests using the Snowflake Connector for Python and found that (with a fixed private key) execution times were virtually the same across either password, 2048-bit key, or 4096-bit key options.

I had a look at dbt-snowflake and found the above, i.e. that the private key was being read on each node. Adding a bit of caching somewhat resolved the issue and substantially reduced run times.

I was a bit surprised by this, so I decided to check how long loading keys actually took. My test script looked like this:

import time

from cryptography.hazmat.primitives import serialization


def benchmark_key_import(key: str, unsafe_skip_rsa_key_validation: bool = False, n_tries: int = 1000):
    start = time.time()
    for _ in range(n_tries):
        # Load the PEM-encoded key, optionally skipping RSA consistency checks.
        private_key = serialization.load_pem_private_key(
            data=bytes(key, 'utf-8'),
            password=None,
            unsafe_skip_rsa_key_validation=unsafe_skip_rsa_key_validation,
        )
        # Re-serialize to DER/PKCS8, mirroring what gets handed to the connector.
        key_bytes = private_key.private_bytes(
            encoding=serialization.Encoding.DER,
            format=serialization.PrivateFormat.PKCS8,
            encryption_algorithm=serialization.NoEncryption(),
        )
    end = time.time()
    print(end - start)

Some results from that:

  • Loading a 2048-bit key 1000 times with validation (which is enabled by default) took 73s
  • Loading a 4096-bit key 1000 times with validation took 375s
  • Loading a 2048-bit key 1000 times with no validation took 0.15s

That validation is pretty substantial, which is why I didn't want to immediately put a PR together.

The cryptography docs don't provide much detail on exactly what's unsafe about skipping validation, and I don't know nearly enough about the security elements to say for sure. However, the Snowflake Connector for Python is also using cryptography, and either requires bytes (which it reads with validation) or an instance of RSAPrivateKey (which would already be validated). Essentially, this means that dbt-snowflake can (and perhaps should) skip validation since it's already being done later by the Snowflake Connector and there's no value in doing it twice.

Caching would of course help as well; I imagine there aren't any cases where the key changes mid-execution such that a cached result becomes invalid (and if there were, the results could be stored in a dictionary keyed by their inputs instead). My sense based on the above, though, is that skipping validation is the more sensible fix, and it would effectively remove the need for caching.

Beyond that, and as mentioned at the top, I think enabling the reuse_connections config is the ideal option, since it also skips the re-validation happening inside the Snowflake Connector. Enabling this config gave the shortest run times in my testing (and run times that were largely equal across the password and private-key auth methods). This might be academic, but I'd be interested to know whether there's any particular reason it's disabled by default, and whether there's any telemetry on how often it's enabled.
