
[Feature] Improve key-pair auth performance #1082

Open
3 tasks done

colin-rogers-dbt opened this issue Jun 12, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@colin-rogers-dbt
Contributor

colin-rogers-dbt commented Jun 12, 2024

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-snowflake functionality, rather than a Big Idea better suited to a discussion

Describe the feature

@VersusFacit's analysis:

Params:

  • Unix time command -- acceptable imprecision for the orders of magnitude we're dealing with here
  • 5,000-node project
  • Average of 2 runs for each authentication method

Results:

  dbt-snowflake, user/pass: 427.83s user 42.43s system 16% cpu 46:49.12 total
  dbt-snowflake, key pair:  1011.76s user 44.96s system 32% cpu 53:42.99 total

~400s vs ~1000s of user CPU time is quite a dramatic difference!

Avenues for investigation:

  • We want to first try caching the keyfile contents
  • Redo timing analysis to see if there's any benefit
  • Look at any extra calls being made to snowflake
  • If that fails, create a ticket for further investigation
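The caching avenue above could be sketched roughly like this (a minimal illustration using the stdlib's functools.lru_cache; read_cached_key is a hypothetical helper, not actual dbt-snowflake code):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def read_cached_key(path: str) -> bytes:
    """Read the private key file once per path and reuse the bytes
    for every subsequent connection instead of re-reading per node."""
    with open(path, "rb") as f:
        return f.read()
```

Because the cache is keyed by path, every node after the first gets the already-read bytes without touching disk.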
@colin-rogers-dbt colin-rogers-dbt added enhancement New feature or request triage labels Jun 12, 2024
@VersusFacit
Contributor

VersusFacit commented Jun 12, 2024

Useful docs I used for testing: https://docs.snowflake.com/en/user-guide/key-pair-auth

openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
ALTER USER <you> SET RSA_PUBLIC_KEY=''; -- yes, this will require admin help; don't include the header/footer lines of the .pub file

@amardatar

Hey team! I found this issue after switching from a 2048-bit key-pair to a 4096-bit one, and found my dbt run times increasing from ~5 minutes to ~15 minutes.

I had a bit of a dig on this, and figured I'd share some findings (and can put together some suggestions for changes as well if there's a preference on how this is handled in the project).

The core issue seems to be that the private key is being read (and validated - I'll get to that) on every dbt node, which eats up the time.

First - this took me too long to find, but the easiest solution seems to be simply enabling the reuse_connections profile config. Maybe it could be suggested in the key-pair authentication section of the docs?
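For reference, this is roughly what that looks like in profiles.yml (a sketch; the profile/target names and the account/user placeholders are mine, adjust to your own setup):

```yaml
my_profile:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <account>
      user: <user>
      private_key_path: /path/to/rsa_key.p8
      reuse_connections: true  # reuse open connections across nodes
```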

Anyway - in terms of testing what was going on, I did a few tests using the Snowflake Connector for Python and found that (with a fixed private key) execution times were virtually the same across either password, 2048-bit key, or 4096-bit key options.

I had a look at dbt-snowflake and found the above, i.e. that the private key was being read on each node. Adding a bit of caching somewhat resolved the issue and substantially reduced run times.

I was a bit surprised by this, so I decided to check how long loading keys actually took. My test script looked like this:

import time

from cryptography.hazmat.primitives import serialization


def benchmark_key_import(key: str, unsafe_skip_rsa_key_validation: bool = False, n_tries: int = 1000):
    start = time.time()
    for _ in range(n_tries):
        # Load the PEM-encoded key, optionally skipping RSA consistency checks.
        private_key = serialization.load_pem_private_key(
            data=bytes(key, 'utf-8'),
            password=None,
            unsafe_skip_rsa_key_validation=unsafe_skip_rsa_key_validation,
        )
        # Re-serialize to DER/PKCS8, mirroring what gets handed to the connector.
        key_bytes = private_key.private_bytes(
            encoding=serialization.Encoding.DER,
            format=serialization.PrivateFormat.PKCS8,
            encryption_algorithm=serialization.NoEncryption(),
        )
    end = time.time()
    print(end - start)

Some results from that:

  • Loading a 2048-bit key 1000 times with validation (which is enabled by default) took 73s
  • Loading a 4096-bit key 1000 times with validation took 375s
  • Loading a 2048-bit key 1000 times with no validation took 0.15s

That validation is pretty substantial, which is why I didn't want to immediately put a PR together.

The cryptography docs don't provide much detail on exactly what's unsafe about skipping validation, and I don't know nearly enough about the security elements to say for sure. However, the Snowflake Connector for Python is also using cryptography, and either requires bytes (which it reads with validation) or an instance of RSAPrivateKey (which would already be validated). Essentially, this means that dbt-snowflake can (and perhaps should) skip validation since it's already being done later by the Snowflake Connector and there's no value in doing it twice.

Caching would of course help as well; I imagine there aren't any cases where the key changes mid-execution such that a cached result becomes invalid (and if there were, the results could be stored in a dictionary keyed by their inputs instead). My sense based on the above, though, is that skipping validation is the more sensible fix, and it would effectively remove the need for caching.

Beyond that, and as mentioned at the top, I think enabling the reuse_connections config is the ideal option, since it also skips the re-validation happening inside the Snowflake Connector. Enabling this config gave the shortest run times in my testing (and run times that were largely equal across the password and private-key auth methods). This might be academic, but I'd be interested to know whether there's any particular reason it's disabled by default, and whether there's any telemetry on how often it's enabled.
