
support simple query cycle and therefore multi-statement .execute() calls #162

Open · dataders opened this issue Jun 1, 2023 · 3 comments
Labels: enhancement (New feature or request)

dataders commented Jun 1, 2023

ask

related to: dbt-labs/dbt-core#39 dbt-labs/dbt-core#46 dbt-labs/dbt-core#82

allow for dbt (and other consumers/users) to make use of this driver in a simple query flow manner. two specific places that introduce friction are:

  1. strings with multiple, ;-delimited SQL statements cannot be sent to `cursor.execute()` (see the example below), and
  2. by default, the driver today interprets every SQL operation as a prepared statement, even when it isn't one.

I love @Brooke-white's summary in #39 (comment). I've been looking for this terminology all week!

> yes I agree it'd be much cleaner (and simpler) to support this in a single call to execute.
> I believe psycopg supports this due to the approach taken to executing the statement. There are two approaches a driver can take to execute a statement -- simple query cycle and extended query cycle. Simple query cycle supports executing multiple statements, while extended query protocol does not. redshift_connector uses extended query protocol, which is the root cause for our lack of support. If you're interested, the above links give an interesting overview of the two approaches and their trade-offs.
> When I have extra time I am working to add support for simple query cycle (i.e. multiple statement execution), but there's no timeline on when this would be available as we are prioritizing other tasks at this time. Ideally, I'd like to provide support for both extended and simple query cycles so users have the option to fine-tune how statements are executed :)
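
to make that distinction concrete, here's a rough sketch (untested; `dsn` and `creds` are placeholder connection details, not real config) of the same multi-statement string going through each driver:

```python
import psycopg2
import redshift_connector

sql = "create temp table t(x int); insert into t values (1); select x from t;"

# psycopg2's execute() rides the simple query protocol: the whole
# multi-statement string goes to the server in a single Query message.
with psycopg2.connect(dsn) as conn:          # dsn: placeholder connection string
    with conn.cursor() as cur:
        cur.execute(sql)                     # succeeds; results come from the last statement
        print(cur.fetchall())                # [(1,)]

# redshift_connector's execute() rides the extended query protocol: the
# string is sent in a Parse message to create a prepared statement, which
# the server rejects for multi-statement input.
conn = redshift_connector.connect(**creds)   # creds: placeholder connection kwargs
conn.cursor().execute(sql)                   # raises ProgrammingError 42601
```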

background

dbt is a framework that effectively takes SELECT statements and wraps them within boilerplate DDL to create a DAG of tables and views. Only a somewhat auxiliary feature, seeds, makes use of prepared statements, via an INSERT INTO ... VALUES statement.

In some ways, dbt is a stand-in for the parsing and preparation that a database might do in an extended query flow pattern. Using both complicates things greatly.

recently, in dbt-labs/dbt-redshift#251, dbt-redshift migrated from psycopg2 to redshift_connector in order to enable:

  1. IAM auth to redshift serverless, and
  2. the redshift team to more proactively accommodate and fix issues upstream.

however, this change has introduced a number of regressions for our users. for example:

example

```python
cursor = redshift_connector.connect(**vals).cursor()
cursor.execute("select 1 as id; select 2 as id")
records_rs = cursor.fetchall()
```

Error message (full traceback)

```
ProgrammingError: {'S': 'ERROR', 'C': '42601',
 'M': 'cannot insert multiple commands into a prepared statement',
 'F': '../src/pg/src/backend/tcop/postgres.c', 'L': '3916', 'R': 'exec_parse_message'}
```

counter-counter-argument

@Brooke-white I saw you mention somewhere that the spec for .execute() in PEP 249 – Python Database API Specification v2.0 lays out that the input may only be a single operation, and that .executemany() should be used for multi-operation/-statement queries. However, if I'm reading it right, the purpose of .executemany() is to execute the operation over a list of parameter sets. The first argument is still a single operation.
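
in other words, my read of PEP 249 is that `.executemany()` is for something like this (sketch; `my_table` is made up):

```python
# PEP 249 .executemany(): ONE operation, executed once per parameter set.
# It is not a mechanism for batching distinct, ;-delimited statements.
cursor.executemany(
    "INSERT INTO my_table (id) VALUES (%s)",  # a single operation
    [(1,), (2,), (3,)],                       # a sequence of parameter sets
)
```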

sidebar

I'm very unclear on the exact meanings of the following terms here: query, statement, operation.

further complications

all of this confusion is making it even harder to determine how transactions are handled by default under the simple vs. extended query flows.
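
the only way I've found to sidestep part of that ambiguity is to manage transactions explicitly rather than lean on driver defaults; a minimal sketch (`creds` again a placeholder, `my_table` hypothetical):

```python
import redshift_connector

conn = redshift_connector.connect(**creds)  # creds: placeholder connection kwargs
conn.autocommit = False                     # the PEP 249 default, stated explicitly
cursor = conn.cursor()
cursor.execute("insert into my_table (id) values (1)")  # my_table is hypothetical
conn.commit()                               # commit explicitly instead of guessing
                                            # what each query flow does implicitly
```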

@Brooke-white (Contributor) commented

Hi @dataders

thanks for opening this issue. i appreciate you taking the time to do so. As discussed in dbt-labs/dbt-core#39, moving to support simple query protocol requires a non-trivial amount of work on the redshift-connector side and carries some pitfalls related to text transfer protocol. Given that, I will raise this request with our Redshift driver team and let you know if it can be prioritized and added to our roadmap, as we are currently prioritizing other features.

In the meantime, my best recommendation would be to use a SQL parsing utility like sqlparse to split multi-statement strings into individual statements that can be executed by redshift-connector, to help immediately relieve the pain caused by redshift-connector's sole use of extended query protocol.
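
for example, something like the following (untested sketch) splits a multi-statement string and executes each piece individually:

```python
import sqlparse

# sqlparse.split returns each ;-delimited statement as its own string
statements = sqlparse.split("select 1 as id; select 2 as id")
print(statements)  # ['select 1 as id;', 'select 2 as id']

for stmt in statements:
    cursor.execute(stmt)  # each statement is now a single operation
```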

I apologize we haven't been able to incorporate this functionality into the project earlier; it's something I've always wanted to see come to fruition, since multi-statement execution has been a long-running ask from folks on GitHub. 😢

I will provide an update here after discussing with the team! :)

Brooke-white added the enhancement (New feature or request) label on Jun 6, 2023
@merubhanot commented

Hey Brooke and OP,

I'm glad to find this discussion here. I'm switching to redshift_connector for similar reasons as described (IAM auth), and my main concern with multiple statements not being possible (with sqlparse as the workaround) is the case of a longer set of SQL commands that generate temporary tables along the way. For instance, in this oversimplified example:

query_tmp = """
CREATE TEMP TABLE t1 AS (
SELECT a,b
FROM table1
);

SELECT a
FROM t1;
"""

My understanding is that sqlparse will split this into two statements and execute the first, and then the second will fail because it doesn't see the temporary table. Similarly, submitting this to redshift_connector directly will lead to the 'cannot insert multiple commands into a prepared statement' error.

Am I missing something here? Is there a way to get this to run with some combination of sqlparse and redshift_connector without rewriting the query to use non-temporary tables (and drop them at the end), which may be a time-consuming refactor depending on the size of the codebase?

Appreciate all the work on this!

@Brooke-white (Contributor) commented

Hey @merubhanot ,

Here's an example of how sqlparse could be used. Please note I expanded your example a bit to include the creation of table1 and the insertion of some test data, so it can be run to completion.

This is possible because a temporary table remains in scope for the duration of your session. You can read a bit more about this in the official AWS docs for CREATE TABLE, but the short of it is that these split-up SQL statements are all executed in the same session:

> Keyword that creates a temporary table that is visible only within the current session. The table is automatically dropped at the end of the session in which it is created. The temporary table can have the same name as a permanent table. The temporary table is created in a separate, session-specific schema. (You can't specify a name for this schema.) This temporary schema becomes the first schema in the search path, so the temporary table takes precedence over the permanent table unless you qualify the table name with the schema name to access the permanent table. For more information about schemas and precedence, see search_path.

```python
import sqlparse
import redshift_connector

query_tmp = sqlparse.split("""
CREATE TEMP TABLE table1(a int, b int);

INSERT INTO table1 (a, b) VALUES (1,2);

CREATE TEMP TABLE t1 AS (
SELECT a,b
FROM table1
);

SELECT a FROM t1;
""")

with redshift_connector.connect(...) as conn:
    with conn.cursor() as cursor:
        for stmt in query_tmp:
            cursor.execute(stmt)

        print(cursor.fetchall())
        # >>> ([1],)
```
