Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix common sql DbApiHook fetch_all_handler #25430

Merged
merged 1 commit into from
Aug 5, 2022

Conversation

FanatoniQ
Copy link
Contributor

@FanatoniQ FanatoniQ commented Jul 31, 2022

This PR fixes fetch_all_handler mentioned in issues #25388 and possibly linked to #25412

Ref:

Detailed explanation on the issue: #25429


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@FanatoniQ FanatoniQ changed the title fixed fetch_all_handler fixed common sql DbApiHook fetch_all_handler Jul 31, 2022
Comment on lines -39 to 37
if cursor.returns_rows:
if cursor.description is not None:
return cursor.fetchall()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn’t make sense. description only returns some information of the cursor and has nothing to do to whether the cursor returns data or not.

According to PEP 249, whether a cursor returns information can be checked by

if cursor.rowcount is not None and cursor.rowcount >= 0

Copy link
Contributor Author

@FanatoniQ FanatoniQ Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@uranusjr This doesn't look true to me, I am using the following as reference:

Also:

>>> import pymssql
>>> c = pymssql.connect(host, login, password)
>>> cur = c.cursor()
>>> cur.execute("SELECT SUSER_SNAME();")
>>> cur.rowcount
-1
>>> cur.description
(('', 1, None, None, None, None, None),)
>>> cur.execute("PRINT('1');")
>>> cur.rowcount
-1
>>> cur.description

Edit: I have the same behaviour with sqlite3 and jaydebeapi

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how pymssql does things, but according to PEP 249, description does not offer the same functionality as SQLAlchemy’s return_rows. If rowcount does not either, you need to find another way that actually has a backing standard. Since DbApiHook should work for all standard-compliant databases, we can’t rely on individual database behaviours, but must refer to the standard.

Copy link
Contributor Author

@FanatoniQ FanatoniQ Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@uranusjr Totally agree on the all standard-compliant part. However then this would mean the sqlalchemy's documentation is wrong since for returns_rows it only mentions description. Do you have an example where description is not None and no rows where returned ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@FanatoniQ FanatoniQ Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quoting sqlalchemy:

  • about CursorResult.return_rows "Overall, the value of CursorResult.returns_rows should always be synonymous with whether or not the DBAPI cursor had a .description attribute, indicating the presence of result columns, noting that a cursor that returns zero rows still has a .description if a row-returning statement was emitted."
  • about row_count (aforementioned link) "Statements that use RETURNING may not return a correct rowcount."
  • about row_count (aforementioned link) "Contrary to what the Python DBAPI says, it does not return the number of rows available from the results of a SELECT statement as DBAPIs cannot support this functionality when rows are unbuffered."

Quoting PEP-249:

Copy link
Member

@potiuk potiuk Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

about cursor.description https://peps.python.org/pep-0249/#description 
"This attribute will be None for operations that do not return rows
or if the cursor has not had an operation
invoked via the [.execute*()](https://peps.python.org/pep-0249/#id14) method yet."

This is indeed part of the standardm, so I do no see why we should not base the decision on that @uranusjr ? It's quite explicitly stated in the PEP that description is only present when there are some rows potentially to be returned (and it can be 0 rows as well).

Copy link
Member

@potiuk potiuk Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I particularly do not like about rowcount is this noe about -1:

The attribute is -1 in case no [.execute*()](https://peps.python.org/pep-0249/#id14) 
has been performed on the cursor or the rowcount of the last operation 
is cannot be determined by the interface. [[7]](https://peps.python.org/pep-0249/#id46)

Note
Future versions of the DB API specification could redefine the latter case to have the object return None instead of -1.

I think just having the note indicate that we should avoid it, and there is absolutely no more guarantees rowcount gives us than description:

This attribute will be None for operations that do not return rows
or if the cursor has not had an operation
invoked via the [.execute*()](https://peps.python.org/pep-0249/#id14) method yet."

This is the same, only less ambiguous IMHO.

@FanatoniQ FanatoniQ force-pushed the fix-sql-common-fetch_all_handler branch 2 times, most recently from d4ff834 to 1ad0e8b Compare August 1, 2022 17:15
@FanatoniQ FanatoniQ changed the title fixed common sql DbApiHook fetch_all_handler Fix common sql DbApiHook fetch_all_handler Aug 1, 2022
@kazanzhy
Copy link
Contributor

kazanzhy commented Aug 1, 2022

First of all, I made one more error here.
There is cursor.execute() almost everywhere except ExasolHook where conn.execute() is called.
And only last one is returning CursorResult. for other cases there are different cursors for different databases.

So if I correctly understand we have to determine how to figure out if .fetchall() could be called in the cursor.
I see the next solutions:

  1. try ... except straight but can slow down the process.
  2. Use one of the cursor attributes (https://peps.python.org/pep-0249/#cursor-attributes):
  • description is not None
    ... will be None for operations that do not return rows or if the cursor has not had an operation invoked via the .execute*() method yet.
    We could guarantee that the handler will be called only after the .execute in DbApiHook. But in DbApiHook we're calling .execute many times in the same cursor.
  • cursor.rowcount > 0
    ... specifies the number of rows that the last .execute*() produced (for DQL statements like SELECT) or affected (for DML statements like UPDATE or INSERT).
    We have to fetch results so there is no guarantee that DML statements will return some results
  • rownumber is not None
    ... should provide the current 0-based index of the cursor in the result set or None if the index cannot be determined.
    The index can be seen as the index of the cursor in a sequence (the result set). The next fetch operation will fetch the row indexed by .rownumber in that sequence.
    It probably couldn't be used

Here's the implementation for Postgres and seems we could use description.
https://github.com/psycopg/psycopg2/search?q=notuples

And here are some experiments:

from sqlalchemy import create_engine
engine = create_engine('postgresql://reader:NWDMCE5xdipIjRrp@hh-pgsql-public.ebi.ac.uk:5432/pfmegrnargs')
connection = engine.connect().execution_options(autocommit=True).connection

cursor = connection.cursor()
cursor.execute('SELECT 1;') 
print(cursor.description) # (Column(name='?column?', type_code=23),)
print(cursor.rowcount) # 1
print(cursor.rownumber) # 0
cursor.fetchall() # [(1,)]

cursor = connection.cursor()
query = """
CREATE TEMP TABLE IF NOT EXISTS tmp (field TEXT);
"""
cursor.execute(query) 
print(cursor.description) # None
print(cursor.rowcount) # -1
print(cursor.rownumber) # 0
cursor.fetchall() # psycopg2.ProgrammingError: no results to fetch

cursor = connection.cursor()
query = """
CREATE TEMP TABLE IF NOT EXISTS tmp (field TEXT); INSERT INTO tmp (field) VALUES ('test');
"""
cursor.execute(query) 
print(cursor.description) # None
print(cursor.rowcount) # -1
print(cursor.rownumber) # 0
cursor.fetchall() # psycopg2.ProgrammingError: no results to fetch

cursor = connection.cursor()
query = """
CREATE TEMP TABLE IF NOT EXISTS tmp (field TEXT); INSERT INTO tmp (field) VALUES ('test') RETURNING *;
"""
cursor.execute(query) 
print(cursor.description) # (Column(name='field', type_code=25),)
print(cursor.rowcount) # 1
print(cursor.rownumber) # 0
cursor.fetchall() # psycopg2.ProgrammingError: no results to fetch

cursor = connection.cursor()
query = """
CREATE TEMP TABLE IF NOT EXISTS tmp (field TEXT); INSERT INTO tmp (field) VALUES ('test'); SELECT 1;
"""
cursor.execute(query) 
print(cursor.description) # (Column(name='?column?', type_code=23),)
print(cursor.rowcount) # 1
print(cursor.rownumber) # 0
cursor.fetchall() # [(1,)]

@uranusjr
Copy link
Member

uranusjr commented Aug 2, 2022

I think the problem with try .. except is less about the performance (the difference should be minimal compared to the actual SQL database connection time), but we don’t know what to catch in the first place (the exception class is different for each cursor implementation). Otherwise I’d use that.

@FanatoniQ
Copy link
Contributor Author

FanatoniQ commented Aug 2, 2022

Pep and sqlalchemy states rowcount may be missleading and is only usable with UPDATE DELETE...

Regarding rownumber it's listed in optional dbapi2 extensions...

I don't think that the fact that we are using the same cursor in the run for loop causes an issue.

Sqlalchemy's implementation of returns_rows is simply like they say on the documentation : cursor.description is not None.

Sqlalchemy's underlying implementation compliancy with dbapi2 should be the one to follow.

@kazanzhy thanks for the experiments, as I am saying I think that if you use the same cursor like we do in the run loop you'll get correct description. If you could check with pgsql, I have already seen this for other drivers in the past as well.

@kazanzhy
Copy link
Contributor

kazanzhy commented Aug 2, 2022

@FanatoniQ I don't get it. You're saying that

Sqlalchemy's implementation of returns_rows is simply like they say on the documentation : cursor.description is not None.

But in this PR you're changing if cursor.returns_rows: to if cursor.description is not None:

@FanatoniQ
Copy link
Contributor Author

FanatoniQ commented Aug 2, 2022

@FanatoniQ I don't get it. You're saying that

Sqlalchemy's implementation of returns_rows is simply like they say on the documentation : cursor.description is not None.

But in this PR you're changing if cursor.returns_rows: to if cursor.description is not None:

@kazanzhy

The sqlalchemy's implementation of returns_rows is correct, like the documentation says sqlalchemy's underlying implementation is cursor.description is not None.

The fact is that in DbApiHook.run uses stock driver (connections and cursors) and not sqlalchemy. To be clearer if you change fetch_all_handler to add a strong runtime type check like so: if not isinstance(cursor, sqlalchemy.engine.CursorResult): raise AttributeError("the cursor isn't an sqlalchemy cursor") it will fail with the attribute error.

If you look at my duplicated issue, it explains why the tests passed when they should have failed: only sqlalchemy has the returns_rows attribute and tests passed because the mock did not have a spec: getattr(cursor, "anyattrname") returns a Mock instead of raising AttributeError.

I hope this is clear.

I don't see why we would go another route to be dbapi2 compliant than to follow sqlalchemy: cursor.description is not None...

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally LGTM, but I would like to hear what @uranusjr and @kazanzhy have to say still. Maybe there are other reasons why using description is bad (but I can't see why).

Copy link
Member

@uranusjr uranusjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given what we have, this change is good. I kind of wonder perhaps the more fundamental problem here is how fetch_all_handler is designed to be used in the first place, but it’s likely difficult to rewrite things.

@FanatoniQ FanatoniQ force-pushed the fix-sql-common-fetch_all_handler branch from 1ad0e8b to fb1c513 Compare August 5, 2022 08:28
@FanatoniQ
Copy link
Contributor Author

FanatoniQ commented Aug 5, 2022

@potiuk @uranusjr @kazanzhy I force pushed so that the cursor values in the tests are not misleading:
https://github.com/apache/airflow/compare/1ad0e8b4180777b2e24b6b0f75fee9a901251a72..fb1c513a454620b3e336edefaed20afa1184bd6b

We should be good now 😉

@potiuk
Copy link
Member

potiuk commented Aug 5, 2022

Yep. I definitely want to merge that one before the next provider's wave :)

@potiuk potiuk merged commit d82436b into apache:main Aug 5, 2022
@potiuk
Copy link
Member

potiuk commented Aug 5, 2022

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

apache-airflow-providers-jdbc fails with jaydebeapi.Error
4 participants