[AIRFLOW-922] Update PrestoHook to enable synchronous execution#2206
patrickmckenna wants to merge 8 commits into apache:master
|
@patrickmckenna, thanks for your PR! By analyzing the history of the files in this pull request, we identified @mistercrunch, @artwr and @smarden1 to be potential reviewers. |
|
Sorry for letting this PR languish with broken tests! I had assumed using The previous version of this hook included a fair amount of error message reformatting. But there was no logic to actually recover from the errors; they were just re-raised with the original error hidden. I thought it might be better to remove that reformatting (in ead2f33). However, I see that 6dd4b3b added more error message reformatting (though the new method doesn't appear to be called anywhere). @artwr @mistercrunch Would you prefer that I revert ead2f33b9c6051e65c147f8d34c7ec92c11d4544 and incorporate 6dd4b3b?
|
The tests are passing. |
|
@patrickmckenna are you still working on this PR? |
|
As a Presto user, I would love to see this one merged. I have several use cases for it.
a106623 to 62edae7
|
I think giving For now, I've updated it to incorporate (most of) the latest commits on I'm a bit confused by the partial test failures, which occur only on some Python 2 builds. Does anyone else have insights here? (The test suite does seem a bit flaky; e.g. this is a recent successful build of /cc @SamWildmo @Rotemlofer (as interested users)
        return cursor.poll() is None
    except Exception as ex:
        msg = "Couldn't determine statement execution status: ".format(ex)
        self.log.error(msg)
You should pass unformatted text to the logger.
- self.log.error(msg)
+ self.log.error("Couldn't determine statement execution status: %s", ex)
@mik-laj ah, is that the agreed upon style preference? Happy to change it, just wasn't aware (didn't see anything in the docs or linting tests enforcing that, but may very well have missed it 😄).
This is not a matter of code style, but a principle of using loggers. If you format the message before passing it to the logger, you create an unnecessary object that the logger may ignore when the level is too low. One of the ways to increase application performance is to reduce the number of logs collected.
In special cases, the logger does not format the text at all. It can store the message and the data separately to make analysis easier. If you format the data before passing it to the logger, this functionality disappears.
These notes apply to any programming language.
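To make the two points above concrete, here is a minimal sketch using only the standard library `logging` module. The `Expensive` class, the `CaptureHandler` class, and the logger name are made up for illustration and are not part of this PR or of Airflow.

```python
import logging


class Expensive:
    """Illustrative object that counts how often it is rendered to text."""

    def __init__(self):
        self.renders = 0

    def __str__(self):
        self.renders += 1
        return "expensive value"


class CaptureHandler(logging.Handler):
    """Illustrative handler that keeps raw records instead of printing them."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


log = logging.getLogger("presto_hook_example")  # hypothetical logger name
log.setLevel(logging.ERROR)
log.propagate = False
handler = CaptureHandler()
log.addHandler(handler)

# Eager formatting: the string is built even though DEBUG is disabled.
value = Expensive()
log.debug("status: {}".format(value))
eager_renders = value.renders

# Deferred formatting: the logger drops the call before rendering anything.
value = Expensive()
log.debug("status: %s", value)
lazy_renders = value.renders

# With deferred formatting, the template and the data stay separate on the record.
ex = ValueError("connection reset")
log.error("Couldn't determine statement execution status: %s", ex)
record = handler.records[-1]
```

Here `record.msg` still holds the unformatted template and `record.args` holds the exception, so a handler can group messages by template; `record.getMessage()` produces the combined text only when it is actually needed.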
|
It looks like Python2 isn't passing on the tests: |
|
Apologies for my slow reply! All of the currently failing builds show the same 2 errors, which appear unrelated to this PR (failing tests). I'm not sure how to understand this (and am all the more confused because the latest build of
This will allow PrestoHook to be used by Operators derived from BaseOperator, which (https://git.io/fhamQ) should perform or trigger certain tasks synchronously (wait for completion)

Notes on the other differences between this and DbApiHook.run:
- no need for utf-8 encoding (https://git.io/vD9LI) b/c PyHive does it automatically (https://git.io/vD9Lm)
- no closing/committing the cursor/conn (https://git.io/vD9Lc), because those are no-ops w/ PyHive (https://git.io/vD9L6, https://git.io/vD9L1)

presto.Cursor does have a _poll_interval attribute, but it has no public accessor, so it seemed safer to make that value a parameter to pass to PrestoHook.run.
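The blocking behaviour described in this commit message can be sketched roughly as follows. This is not the code from the PR: `run_blocking` and `FakeCursor` are hypothetical names, and the fake cursor only mimics the one convention the diff relies on, namely that `poll()` returns `None` once the statement has finished.

```python
import time


class FakeCursor:
    """Stand-in for a Presto cursor (hypothetical, for illustration only)."""

    def __init__(self, polls_until_done):
        self._remaining = polls_until_done
        self.executed = []

    def execute(self, stmt, parameters=None):
        self.executed.append((stmt, parameters))

    def poll(self):
        # Return status data while running, None once the statement is done.
        if self._remaining > 0:
            self._remaining -= 1
            return {"stats": {"state": "RUNNING"}}
        return None


def run_blocking(cursor, stmt, parameters=None, poll_interval=None, sleep=time.sleep):
    """Send a statement; if poll_interval is given, wait until it finishes.

    Returns the number of times the caller had to wait between polls.
    """
    cursor.execute(stmt, parameters)
    waits = 0
    if poll_interval is not None:
        while cursor.poll() is not None:
            sleep(poll_interval)
            waits += 1
    return waits


# Synchronous use: block until the fake statement reports completion.
sleeps = []
sync_cursor = FakeCursor(polls_until_done=3)
sync_waits = run_blocking(sync_cursor, "SELECT 1", poll_interval=0.5, sleep=sleeps.append)

# Default use: return right after sending the statement, without polling.
async_cursor = FakeCursor(polls_until_done=3)
async_waits = run_blocking(async_cursor, "SELECT 1")
```

Passing the interval as a parameter, rather than reading a private cursor attribute, keeps the caller in control of how often the server is polled. The `sleep` argument exists only so the sketch can be exercised without real delays.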
Catch only network-related exceptions when polling Presto. And make str handling work in Python 2, too.
fffe2aa to bd1c4d6
Codecov Report
@@ Coverage Diff @@
## master #2206 +/- ##
==========================================
+ Coverage 74.3% 74.34% +0.03%
==========================================
Files 426 426
Lines 27867 27888 +21
==========================================
+ Hits 20706 20732 +26
+ Misses 7161 7156 -5
Continue to review full report at Codecov.
|
    cursor.execute(stmt, parameters)

    if poll_interval is not None:
        while not self.execution_finished(cursor):
It's definitely A Bad Idea™️ to poll indefinitely, but I'm assuming all the timeout logic is expected to live in the operators using this hook. @Fokko please LMK if that assumption's inaccurate, and if there ought to be some minimal safeguards here, e.g. an upper bound on the number of pings to send or time to wait.
Other hooks use similar constructions that define an upper bound and throw an exception when the maximum time is exceeded: https://github.com/apache/airflow/blob/master/airflow/hooks/druid_hook.py#L90
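For illustration, one way such an upper bound could look is sketched below. This is not taken from the Druid hook or from this PR: `wait_until_finished`, `PollTimeout`, and `FakeClock` are hypothetical names, and the clock is injected only so the example runs without real waiting.

```python
import time


class PollTimeout(Exception):
    """Raised when polling exceeds the allowed time (hypothetical name)."""


def wait_until_finished(is_finished, poll_interval, max_wait,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll is_finished() until it returns True or max_wait seconds have passed.

    Returns the number of sleeps performed; raises PollTimeout on expiry.
    """
    deadline = clock() + max_wait
    sleeps = 0
    while not is_finished():
        if clock() >= deadline:
            raise PollTimeout("statement still running after %s seconds" % max_wait)
        sleep(poll_interval)
        sleeps += 1
    return sleeps


class FakeClock:
    """Deterministic clock so the example does not actually wait."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


# A statement that reports completion on the third check.
clock = FakeClock()
states = iter([False, False, True])
sleeps_needed = wait_until_finished(lambda: next(states), poll_interval=1, max_wait=10,
                                    clock=clock.time, sleep=clock.sleep)

# A statement that never finishes hits the upper bound instead of looping forever.
stuck_clock = FakeClock()
timed_out = False
try:
    wait_until_finished(lambda: False, poll_interval=1, max_wait=3,
                        clock=stuck_clock.time, sleep=stuck_clock.sleep)
except PollTimeout:
    timed_out = True
```

Whether this check belongs in the hook or in the operators that call it is the open question in this thread; the sketch only shows that the safeguard itself is small.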
|
@patrickmckenna Can you pick up the latest suggestions from the PR? |
|
@patrickmckenna Closing this for lack of activity. |
|
@patrickmckenna is there a chance you will continue to work on this? It would be a shame for this amazing work to go to waste. This PR is important: without it we cannot schedule Presto jobs on Airflow, because every task is considered a success and we cannot set dependencies.
JIRA
https://issues.apache.org/jira/browse/AIRFLOW-922
Description
This updates PrestoHook so that it can block until a statement finishes executing. Currently, PrestoHook.run returns as soon as it sends a statement, so Operators that use it won't run synchronously.
There are other, minor changes too that seemed worth making (hence the separate commits), though they're unrelated to the primary goal of this PR. Of course if you'd rather jettison those, or put them in a separate PR, that's fine by me.
Tests
I added some under tests/contrib/hooks, though I wasn't sure if that was the proper location. (PrestoHook is built in, not user-contributed, but AFAICT there are no existing tests for it.)
Commits
I haven't squashed commits yet, because I wanted to keep the history easy to see until the code is in a satisfactory state. Once it is, I'm happy to rewrite the commit history (unless you plan to just squash and merge to take care of that?).