Fix common sql DbApiHook fetch_all_handler #25430

FanatoniQ · 2022-07-31T15:53:45Z

This PR fixes fetch_all_handler mentioned in issues #25388 and possibly linked to #25412

Ref:

Detailed explanation on the issue: #25429

uranusjr · 2022-08-01T02:37:45Z

airflow/providers/common/sql/hooks/sql.py

-    if cursor.returns_rows:
+    if cursor.description is not None:
        return cursor.fetchall()


This doesn’t make sense. description only returns some information about the cursor and has nothing to do with whether the cursor returns data or not.

According to PEP 249, whether a cursor returns information can be checked by

if cursor.rowcount is not None and cursor.rowcount >= 0

@uranusjr This doesn't look true to me, I am using the following as reference:

https://peps.python.org/pep-0249/#description

https://docs.sqlalchemy.org/en/14/core/connections.html?highlight=returns_#sqlalchemy.engine.CursorResult.returns_rows

Also:

>>> import pymssql
>>> c = pymssql.connect(host, login, password)
>>> cur = c.cursor()
>>> cur.execute("SELECT SUSER_SNAME();")
>>> cur.rowcount
-1
>>> cur.description
(('', 1, None, None, None, None, None),)
>>> cur.execute("PRINT('1');")
>>> cur.rowcount
-1
>>> cur.description

Edit: I have the same behaviour with sqlite3 and jaydebeapi
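
The sqlite3 case mentioned above can be checked with the standard library alone. This is a minimal sketch, assuming an in-memory database and a throwaway table named t (both are illustrative and not taken from the PR):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: no result columns, so description is None.
cur.execute("CREATE TABLE t (x INTEGER)")
ddl_description = cur.description

# DML: rowcount reports affected rows, but there is no result set.
cur.execute("INSERT INTO t (x) VALUES (1)")
insert_state = (cur.description, cur.rowcount)

# Query: description is populated, while sqlite3 leaves rowcount at -1.
cur.execute("SELECT x AS value FROM t")
select_state = (cur.description is not None, cur.rowcount)
rows = cur.fetchall()

conn.close()
```

With this driver, a rowcount >= 0 check would skip the SELECT and accept the INSERT, which is the opposite of what a fetch handler needs.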

I am not sure how pymssql does things, but according to PEP 249, description does not offer the same functionality as SQLAlchemy’s returns_rows. If rowcount does not either, you need to find another way that actually has a backing standard. Since DbApiHook should work for all standard-compliant databases, we can’t rely on individual database behaviours, but must refer to the standard.

@uranusjr Totally agree on the "all standard-compliant" part. However, this would mean that sqlalchemy's documentation is wrong, since for returns_rows it only mentions description. Do you have an example where description is not None and no rows were returned?

@uranusjr I find sqlalchemy notes on rowcount: https://docs.sqlalchemy.org/en/14/core/connections.html?highlight=returns_#sqlalchemy.engine.CursorResult.rowcount very interesting

Quoting sqlalchemy:

about CursorResult.returns_rows "Overall, the value of CursorResult.returns_rows should always be synonymous with whether or not the DBAPI cursor had a .description attribute, indicating the presence of result columns, noting that a cursor that returns zero rows still has a .description if a row-returning statement was emitted."

about rowcount (aforementioned link) "Statements that use RETURNING may not return a correct rowcount."

about rowcount (aforementioned link) "Contrary to what the Python DBAPI says, it does not return the number of rows available from the results of a SELECT statement as DBAPIs cannot support this functionality when rows are unbuffered."

Quoting PEP-249:

about cursor.rowcount https://peps.python.org/pep-0249/#id48 "The term number of affected rows generally refers to the number of rows deleted, updated or inserted by the last statement run on the database cursor."

about cursor.description https://peps.python.org/pep-0249/#description "This attribute will be None for operations that do not return rows or if the cursor has not had an operation invoked via the .execute*() method yet."

about cursor.description https://peps.python.org/pep-0249/#description "This attribute will be None for operations that do not return rows or if the cursor has not had an operation invoked via the [.execute*()](https://peps.python.org/pep-0249/#id14) method yet."

This is indeed part of the standard, so I do not see why we should not base the decision on that, @uranusjr? It's quite explicitly stated in the PEP that description is only present when there are some rows potentially to be returned (and it can be 0 rows as well).

What I particularly do not like about rowcount is this note about -1:

The attribute is -1 in case no [.execute*()](https://peps.python.org/pep-0249/#id14) has been performed on the cursor or the rowcount of the last operation is cannot be determined by the interface. [[7]](https://peps.python.org/pep-0249/#id46) Note Future versions of the DB API specification could redefine the latter case to have the object return None instead of -1.

I think the mere presence of that note indicates that we should avoid it, and rowcount gives us absolutely no more guarantees than description:

This attribute will be None for operations that do not return rows or if the cursor has not had an operation invoked via the [.execute*()](https://peps.python.org/pep-0249/#id14) method yet.

This is the same, only less ambiguous IMHO.
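
To make the proposal concrete, here is a minimal sketch of a description-based handler, exercised against the standard-library sqlite3 driver. The type hints, the explicit return None, and the table t are assumptions added for illustration; the function in airflow/providers/common/sql/hooks/sql.py may differ in its details.

```python
import sqlite3
from typing import Any, List, Optional


def fetch_all_handler(cursor: Any) -> Optional[List[tuple]]:
    """Return all rows if the last statement produced a result set, else None."""
    # PEP 249: description is None for operations that do not return rows.
    if cursor.description is not None:
        return cursor.fetchall()
    return None


conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE t (x INTEGER)")
no_rows = fetch_all_handler(cur)        # statement has no result set

cur.execute("SELECT x FROM t")
empty_result = fetch_all_handler(cur)   # result set exists but is empty

cur.execute("INSERT INTO t (x) VALUES (42)")
cur.execute("SELECT x FROM t")
one_row = fetch_all_handler(cur)
conn.close()
```

The sketch distinguishes "no result set" (None) from "a result set with zero rows" (an empty list), which is the distinction the PEP wording draws.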

kazanzhy · 2022-08-01T22:17:28Z

First of all, I made one more error here.
cursor.execute() is called almost everywhere, except in ExasolHook, where conn.execute() is called.
Only the latter returns a CursorResult; in the other cases there are different cursors for different databases.

So if I understand correctly, we have to determine whether .fetchall() can be called on the cursor.
I see the following solutions:

1. try ... except: straightforward, but it can slow down the process.
2. Use one of the cursor attributes (https://peps.python.org/pep-0249/#cursor-attributes):

   - description is not None
     "... will be None for operations that do not return rows or if the cursor has not had an operation invoked via the .execute*() method yet."
     We could guarantee that the handler will be called only after .execute in DbApiHook. But in DbApiHook we call .execute many times on the same cursor.
   - cursor.rowcount > 0
     "... specifies the number of rows that the last .execute*() produced (for DQL statements like SELECT) or affected (for DML statements like UPDATE or INSERT)."
     We need to fetch results, and there is no guarantee that DML statements return any.
   - rownumber is not None
     "... should provide the current 0-based index of the cursor in the result set or None if the index cannot be determined.
     The index can be seen as the index of the cursor in a sequence (the result set). The next fetch operation will fetch the row indexed by .rownumber in that sequence."
     It probably can't be used.

Here's the implementation for Postgres, and it seems we could use description.
https://github.com/psycopg/psycopg2/search?q=notuples

And here are some experiments:

from sqlalchemy import create_engine
engine = create_engine('postgresql://reader:NWDMCE5xdipIjRrp@hh-pgsql-public.ebi.ac.uk:5432/pfmegrnargs')
connection = engine.connect().execution_options(autocommit=True).connection

cursor = connection.cursor()
cursor.execute('SELECT 1;') 
print(cursor.description) # (Column(name='?column?', type_code=23),)
print(cursor.rowcount) # 1
print(cursor.rownumber) # 0
cursor.fetchall() # [(1,)]

cursor = connection.cursor()
query = """
CREATE TEMP TABLE IF NOT EXISTS tmp (field TEXT);
"""
cursor.execute(query) 
print(cursor.description) # None
print(cursor.rowcount) # -1
print(cursor.rownumber) # 0
cursor.fetchall() # psycopg2.ProgrammingError: no results to fetch

cursor = connection.cursor()
query = """
CREATE TEMP TABLE IF NOT EXISTS tmp (field TEXT); INSERT INTO tmp (field) VALUES ('test');
"""
cursor.execute(query) 
print(cursor.description) # None
print(cursor.rowcount) # -1
print(cursor.rownumber) # 0
cursor.fetchall() # psycopg2.ProgrammingError: no results to fetch

cursor = connection.cursor()
query = """
CREATE TEMP TABLE IF NOT EXISTS tmp (field TEXT); INSERT INTO tmp (field) VALUES ('test') RETURNING *;
"""
cursor.execute(query) 
print(cursor.description) # (Column(name='field', type_code=25),)
print(cursor.rowcount) # 1
print(cursor.rownumber) # 0
cursor.fetchall() # psycopg2.ProgrammingError: no results to fetch

cursor = connection.cursor()
query = """
CREATE TEMP TABLE IF NOT EXISTS tmp (field TEXT); INSERT INTO tmp (field) VALUES ('test'); SELECT 1;
"""
cursor.execute(query) 
print(cursor.description) # (Column(name='?column?', type_code=23),)
print(cursor.rowcount) # 1
print(cursor.rownumber) # 0
cursor.fetchall() # [(1,)]

uranusjr · 2022-08-02T03:00:43Z

I think the problem with try .. except is less about the performance (the difference should be minimal compared to the actual SQL database connection time), but we don’t know what to catch in the first place (the exception class is different for each cursor implementation). Otherwise I’d use that.
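
The concern about not knowing what to catch can be illustrated with stand-in classes. In this sketch DriverAError, DriverBError, and FakeCursor are hypothetical names invented for the example; they only imitate two drivers that raise unrelated exception types when there is nothing to fetch.

```python
class DriverAError(Exception):
    """Stand-in for one driver's 'no results to fetch' error."""


class DriverBError(Exception):
    """Stand-in for a different driver's error in the same situation."""


class FakeCursor:
    """Hypothetical cursor that just ran a statement with no result set."""

    description = None

    def __init__(self, error_cls):
        self._error_cls = error_cls

    def fetchall(self):
        raise self._error_cls("no results to fetch")


def fetch_with_try(cursor):
    # Only knows about driver A, so driver B's exception escapes.
    try:
        return cursor.fetchall()
    except DriverAError:
        return None


def fetch_with_description(cursor):
    # Driver-agnostic: relies only on the PEP 249 attribute.
    if cursor.description is not None:
        return cursor.fetchall()
    return None


result_a = fetch_with_try(FakeCursor(DriverAError))
try:
    fetch_with_try(FakeCursor(DriverBError))
    escaped = False
except DriverBError:
    escaped = True
result_b = fetch_with_description(FakeCursor(DriverBError))
```

Catching a broad Exception would avoid the escape, but it would also hide genuine failures, which is why an attribute check is attractive here.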

FanatoniQ · 2022-08-02T07:06:44Z

PEP 249 and sqlalchemy both state that rowcount may be misleading and is only usable with statements such as UPDATE or DELETE.

Regarding rownumber, it's listed among the optional dbapi2 extensions.

I don't think that the fact that we are using the same cursor in the run for loop causes an issue.

Sqlalchemy's implementation of returns_rows is simply like they say on the documentation : cursor.description is not None.

Sqlalchemy's approach to dbapi2 compliance should be the one to follow.

@kazanzhy thanks for the experiments. As I said, I think that if you use the same cursor, like we do in the run loop, you'll get the correct description. Could you check with pgsql? I have already seen this for other drivers in the past as well.
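
The claim about reusing one cursor can be tried with sqlite3, which ships with Python. This is only a sketch of the pattern with an illustrative statement list; it does not replace the requested check against pgsql, and other drivers may behave differently.

```python
import sqlite3

statements = [
    "CREATE TABLE t (x INTEGER)",
    "INSERT INTO t (x) VALUES (1)",
    "SELECT x FROM t",
    "DELETE FROM t",
]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

has_result_set = []
for sql in statements:
    cur.execute(sql)
    # description reflects only the most recent execute() on this cursor.
    has_result_set.append(cur.description is not None)
    if cur.description is not None:
        cur.fetchall()

conn.close()
```

Here the flag goes back to False for the DELETE that follows the SELECT, so the earlier result set does not leak into the later check.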

kazanzhy · 2022-08-02T12:09:28Z

@FanatoniQ I don't get it. You're saying that

Sqlalchemy's implementation of returns_rows is simply like they say on the documentation : cursor.description is not None.

But in this PR you're changing if cursor.returns_rows: to if cursor.description is not None:

FanatoniQ · 2022-08-02T13:13:21Z

> @FanatoniQ I don't get it. You're saying that
>
> > Sqlalchemy's implementation of returns_rows is simply like they say on the documentation : cursor.description is not None.
>
> But in this PR you're changing if cursor.returns_rows: to if cursor.description is not None:

@kazanzhy

The sqlalchemy implementation of returns_rows is correct: as the documentation says, the underlying implementation is cursor.description is not None.

The fact is that DbApiHook.run uses the stock driver (connections and cursors), not sqlalchemy. To be clearer: if you change fetch_all_handler to add a strong runtime type check such as if not isinstance(cursor, sqlalchemy.engine.CursorResult): raise AttributeError("the cursor isn't an sqlalchemy cursor"), it will fail with that AttributeError.

If you look at my duplicated issue, it explains why the tests passed when they should have failed: only sqlalchemy has the returns_rows attribute and tests passed because the mock did not have a spec: getattr(cursor, "anyattrname") returns a Mock instead of raising AttributeError.
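
The mock behaviour described above is easy to reproduce. In this sketch sqlite3.Cursor is used as the spec purely for illustration; it is an assumption, and the Airflow tests may use a different class.

```python
import sqlite3
from unittest import mock

# A bare Mock fabricates any attribute on access, so a check on a
# non-existent attribute such as returns_rows is silently truthy.
loose_cursor = mock.Mock()
loose_is_truthy = bool(loose_cursor.returns_rows)

# With a spec, attributes the real class lacks raise AttributeError.
strict_cursor = mock.Mock(spec=sqlite3.Cursor)
try:
    strict_cursor.returns_rows
    raised = False
except AttributeError:
    raised = True

# Attributes that do exist on the spec can still be configured.
strict_cursor.description = None
```

A spec'd mock would have made the original test fail in the same way the real driver cursor does.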

I hope this is clear.

I don't see why we would go another route to be dbapi2 compliant than to follow sqlalchemy: cursor.description is not None...

potiuk

This generally LGTM, but I would like to hear what @uranusjr and @kazanzhy have to say still. Maybe there are other reasons why using description is bad (but I can't see why).

uranusjr

Given what we have, this change is good. I kind of wonder whether the more fundamental problem here is how fetch_all_handler is designed to be used in the first place, but it’s likely difficult to rewrite things.

FanatoniQ · 2022-08-05T08:32:13Z

@potiuk @uranusjr @kazanzhy I force pushed so that the cursor values in the tests are not misleading:
https://github.com/apache/airflow/compare/1ad0e8b4180777b2e24b6b0f75fee9a901251a72..fb1c513a454620b3e336edefaed20afa1184bd6b

We should be good now 😉

potiuk · 2022-08-05T11:00:45Z

Yep. I definitely want to merge that one before the next provider's wave :)

potiuk · 2022-08-05T11:43:57Z

🎉

boring-cyborg bot added area:providers provider:common-sql labels Jul 31, 2022

FanatoniQ changed the title ~~fixed fetch_all_handler~~ fixed common sql DbApiHook fetch_all_handler Jul 31, 2022

This was referenced Jul 31, 2022

common sql fetch_all_handler bug #25429

Closed

apache-airflow-providers-jdbc fails with jaydebeapi.Error #25388

Closed

uranusjr reviewed Aug 1, 2022

FanatoniQ force-pushed the fix-sql-common-fetch_all_handler branch 2 times, most recently from d4ff834 to 1ad0e8b Compare August 1, 2022 17:15

FanatoniQ changed the title ~~fixed common sql DbApiHook fetch_all_handler~~ Fix common sql DbApiHook fetch_all_handler Aug 1, 2022

potiuk approved these changes Aug 4, 2022

uranusjr approved these changes Aug 5, 2022

Fix fetch_all_handler & db-api tests for it

fb1c513

FanatoniQ force-pushed the fix-sql-common-fetch_all_handler branch from 1ad0e8b to fb1c513 Compare August 5, 2022 08:28

potiuk merged commit d82436b into apache:main Aug 5, 2022

FanatoniQ mentioned this pull request Aug 7, 2022

Fixing JdbcOperator non-SELECT statement run #25412

Merged

potiuk linked an issue Aug 7, 2022 that may be closed by this pull request

apache-airflow-providers-jdbc fails with jaydebeapi.Error #25388

Closed

2 tasks

This was referenced Aug 10, 2022

Status of testing Providers that were prepared on August 10, 2022 #25634

Closed

Status of testing Providers that were prepared on August 10, 2022 #25640

Closed

Status of testing Providers that were prepared on August 15, 2022 #25721

Closed

potiuk mentioned this pull request Mar 4, 2023

Support for Python 3.11 for Google Provider (upgrading all dependencies) #27292

Closed

13 tasks

FanatoniQ commented Jul 31, 2022 (edited)
uranusjr Aug 1, 2022
FanatoniQ Aug 1, 2022 (edited)
uranusjr Aug 1, 2022
FanatoniQ Aug 1, 2022 (edited)
FanatoniQ Aug 1, 2022
FanatoniQ Aug 1, 2022 (edited)
potiuk Aug 4, 2022 (edited)
potiuk Aug 4, 2022 (edited)
kazanzhy commented Aug 1, 2022
uranusjr commented Aug 2, 2022 (edited)
FanatoniQ commented Aug 2, 2022 (edited)
kazanzhy commented Aug 2, 2022 (edited)
FanatoniQ commented Aug 2, 2022 (edited)
potiuk left a comment
uranusjr left a comment (edited)
FanatoniQ commented Aug 5, 2022 (edited)
potiuk commented Aug 5, 2022
potiuk commented Aug 5, 2022
