[CT-472] postgres: Retry n times on connection timeout #5022

Closed
1 task done
barberscott opened this issue Apr 9, 2022 · 6 comments · Fixed by #5432
Labels
enhancement (New feature or request) · postgres · Team:Adapters (Issues designated for the adapter area of the code)

Comments

@barberscott

barberscott commented Apr 9, 2022

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

Depending on the size of the project, runs/builds/tests can create hundreds of independent database connections. A single connection timeout, whether due to a transient network issue, EC2 load (e.g. when connecting to RDS through EC2), or Postgres load, can cause an entire run to fail. Connection timeouts can be, and most often are, transient, and a retry will usually succeed.

Describe alternatives you've considered

No response

Who will this benefit?

Anyone using the postgres connector.

Are you interested in contributing this feature?

No response

Anything else?

This would be similar to connect_retries on Snowflake.
See: https://github.com/dbt-labs/dbt-snowflake/pull/6/files

@barberscott barberscott added enhancement New feature or request triage labels Apr 9, 2022
@github-actions github-actions bot changed the title Retry n times on connection timeout [CT-472] Retry n times on connection timeout Apr 9, 2022
@barberscott barberscott changed the title [CT-472] Retry n times on connection timeout [CT-472] postgres: Retry n times on connection timeout Apr 10, 2022
@jtcohen6 jtcohen6 added Team:Adapters Issues designated for the adapter area of the code postgres and removed triage labels Apr 11, 2022
@jtcohen6
Contributor

Thanks @barberscott! I think this will be the place for the change, for both Postgres and dbt-labs/dbt-redshift#96, since they both use dbt-postgres's psycopg2 connection.

There are related recommendations in #3303 around:

  • Specific error codes/messages to capture as transient, and likely to succeed on retry
  • A proposal for user-specified "retryable" errors, as another config in profiles.yml

At this point, given precedent in other adapters (Snowflake + Spark), I'm not opposed to a naive retry_all parameter. It makes sense to turn it on in production (best possible chance of query success) and leave it off in development (quicker iteration on errors).

I sense we'd need to add both complementary configs (connect_retries + connect_timeout), with reasonable defaults. Perhaps we also want to add our attempt at retry_on_database_errors, based on the psycopg2 errors we strongly believe to be transient.

@tomasfarias
Contributor

Hey all, since this issue is significantly hindering us, I went ahead and modified the Postgres adapter to take a stab at fixing it.

The basic premise was modifying PostgresCredentials to take in the new connect_retries and retry_on_errors parameters:

@dataclass
class PostgresCredentials(Credentials):
    host: str
    user: str
    port: Port
    password: str  # on postgres the password is mandatory
    connect_retries: int = 3  # 3 seemed like a sensible default number of retries
    connect_timeout: int = 10
    role: Optional[str] = None
    search_path: Optional[str] = None
    keepalives_idle: int = 0  # 0 means to use the default value
    sslmode: Optional[str] = None
    sslcert: Optional[str] = None
    sslkey: Optional[str] = None
    sslrootcert: Optional[str] = None
    application_name: Optional[str] = "dbt"
    # a mutable default needs default_factory on a dataclass field
    # (requires: from dataclasses import field; from typing import List)
    retry_on_errors: List[str] = field(default_factory=lambda: [
        "08006",  # ConnectionFailure
    ])
...

Users can set SQLSTATE codes that they wish to retry on. I'm uncertain whether taking codes is the right approach instead of exception class names (or allowing both).
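For reference, psycopg2 (2.8+) can resolve either form, so supporting both would be straightforward. A minimal sketch, illustrative only and not part of the proposed change:

import psycopg2.errors

# By SQLSTATE code, as in the snippet above:
by_code = psycopg2.errors.lookup("08006")  # -> psycopg2.errors.ConnectionFailure
# By exception class name, if names were accepted instead (or as well):
by_name = getattr(psycopg2.errors, "ConnectionFailure")
assert by_code is by_name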

Afterwards, I added an except clause for these errors in a connection loop added to PostgresConnectionManager.open:

    @classmethod
    def open(cls, connection):
        ...

        retryable_errors = tuple(
            psycopg2.errors.lookup(error_code) for error_code in credentials.retry_on_errors
        )

        attempt = 0
        while True:
            try:
                handle = psycopg2.connect(
                    dbname=credentials.database,
                    user=credentials.user,
                    host=credentials.host,
                    password=credentials.password,
                    port=credentials.port,
                    connect_timeout=credentials.connect_timeout,
                    **kwargs,
                )

                if credentials.role:
                    handle.cursor().execute("set role {}".format(credentials.role))

                connection.handle = handle
                connection.state = "open"
                break  # success: leave the retry loop

            except retryable_errors as e:
                attempt += 1
                logger.debug(
                    "Got a retryable error on attempt {} to open a postgres connection: '{}'".format(attempt, e)
                )

                if attempt < credentials.connect_retries:
                    continue

                logger.debug(
                    "Reached or exceeded {} retries when opening a postgres connection".format(
                        credentials.connect_retries
                    )
                )

                connection.handle = None
                connection.state = "fail"

                raise dbt.exceptions.FailedToConnectException(str(e))

            except psycopg2.Error as e:
                logger.debug(
                    "Got an unknown error on attempt {} to open a postgres connection: '{}'".format(attempt, e)
                )

                connection.handle = None
                connection.state = "fail"

                raise dbt.exceptions.FailedToConnectException(str(e))

        return connection

Some iteration on the code is most likely needed. However, would this work as a PR? I'd be glad to add any necessary testing and documentation and contribute this.

Thanks.

@jtcohen6
Contributor

jtcohen6 commented Jun 2, 2022

@tomasfarias An approach in this vein makes a lot of sense to me! Thanks for sharing your code.

SQLSTATE errors offer a concise way of specifying retryable errors (if a bit arcane). I think we need to establish a reasonable default set, so that 99% of users never need to override (or even know about) that config. We could also consider a brute-force "retry all" option.

psycopg2 doesn't offer us built-in retry. Would it be better to look at an existing OSS retry library (e.g. tenacity), rather than rolling our own? It's not very complex code, but we now have it in a few places (snowflake, spark, bigquery sorta), and I think there's a lot of benefit in standardizing this in core / across adapters, both for users and for us as maintainers.
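For illustration, a minimal sketch of what a tenacity-based connect could look like; the function name and retry parameters are hypothetical, not the adapter's actual implementation:

import psycopg2
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed

@retry(
    retry=retry_if_exception_type(psycopg2.OperationalError),  # connection-level failures, incl. timeouts
    stop=stop_after_attempt(3),                                # give up after three attempts
    wait=wait_fixed(1),                                        # wait one second between attempts
    reraise=True,                                              # re-raise the original exception on final failure
)
def open_postgres_connection(credentials):
    return psycopg2.connect(
        dbname=credentials.database,
        user=credentials.user,
        host=credentials.host,
        password=credentials.password,
        port=credentials.port,
        connect_timeout=credentials.connect_timeout,
    )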

cc @nathaniel-may — I know retry is a thing you know / care a lot about, so I'd be curious to hear any thoughts you might have here.

@nathaniel-may
Contributor

Thanks for tagging me in, @jtcohen6. This is one of those cases where I would lean toward "correctness above all else" rather than asking users to know enough about warehouse internals to list retryable error codes. Thankfully, most dbt operations are retryable because they're idempotent in nature, but we do have a few places with stateful operations, such as incremental models, where we may need to be abundantly careful about retrying.

My preference is to write our own retry logic unless a library solves a particularly difficult problem for us. Personally, I don't see much value-add in a general-purpose "retry library," but I could be convinced if I'm missing something.

@jtcohen6
Contributor

@tomasfarias I'd be very supportive of a PR for this, if you're open to submitting one! My main recommendation, in keeping with Nate's suggestion: rather than asking users to list which error codes are retryable, let's try to develop a good default set to start with. I'm particularly interested in connection errors / timeouts.
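As a sketch of what such a default set might contain (a hypothetical starting point, not a decided list), the Class 08 "connection exception" SQLSTATEs in psycopg2.errorcodes cover server-reported connection failures; note that a client-side connect timeout typically surfaces as a plain psycopg2.OperationalError with no SQLSTATE, so that base class likely belongs in the set too:

from psycopg2 import errorcodes, errors

# Hypothetical default set of retryable connection-related SQLSTATEs (Class 08).
DEFAULT_RETRYABLE_SQLSTATES = [
    errorcodes.CONNECTION_EXCEPTION,                               # 08000
    errorcodes.SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION,        # 08001
    errorcodes.CONNECTION_DOES_NOT_EXIST,                          # 08003
    errorcodes.SQLSERVER_REJECTED_ESTABLISHMENT_OF_SQLCONNECTION,  # 08004
    errorcodes.CONNECTION_FAILURE,                                 # 08006
]

# Resolve codes to exception classes once, for use in an `except` clause.
RETRYABLE_ERRORS = tuple(errors.lookup(code) for code in DEFAULT_RETRYABLE_SQLSTATES)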

@tomasfarias
Contributor

I'll submit a PR for this within the week. I've also been looking into dbt-bigquery and dbt-snowflake to make sure the retry implementation can be reused across adapters, and isn't just a dbt-postgres/dbt-redshift thing.

Thanks for your patience; I've been on vacation for the last couple of weeks.
