[BEAM-11736] Propagate pipeline options to direct runner #13723
Conversation
@pabloem it seems like you might have touched some nearby code, could you suggest who to add for review? |
I've looked at the failing tests and as far as I can tell they are unrelated to the changes |
I've fixed the problem. The issue was with the snippets tests, which were creating multiple PipelineOptions subclasses with conflicting/duplicated arguments. Calling PipelineOptions.get_all_options inspects the subclasses of the PipelineOptions class, and even though the temporary PipelineOptions classes had fallen out of scope, that had not yet been reflected in the PipelineOptions state. It seems that the calling |
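For readers unfamiliar with the Python behaviour the comment above relies on: a class that has fallen out of scope can still appear in its parent's `__subclasses__()` until the garbage collector runs a cycle collection. This is a minimal, Beam-free sketch of that effect (the `Base`/`Temp` names are illustrative, not Beam code):

```python
import gc

class Base:
    pass

def make_temp_subclass():
    # Temp falls out of scope when this function returns, but the class
    # object is part of a reference cycle and is not freed immediately.
    class Temp(Base):
        pass

make_temp_subclass()

# Before a collection cycle, the "dead" subclass is still visible.
names_before = [c.__name__ for c in Base.__subclasses__()]
gc.collect()
# After collection, it is gone.
names_after = [c.__name__ for c in Base.__subclasses__()]
print(names_before, names_after)
```

This is why inspecting subclasses can pick up temporary option classes created by earlier tests, as described above.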
Codecov Report
@@            Coverage Diff             @@
##           master   #13723      +/-   ##
==========================================
+ Coverage   82.74%   82.77%   +0.02%
==========================================
  Files         466      466
  Lines       57525    57540      +15
==========================================
+ Hits        47600    47628      +28
+ Misses       9925     9912      -13
Continue to review full report at Codecov.
|
retest this please |
@pabloem is there a command for rerunning a specific unit test suite? I've run the failing test for macos-latest-py37 locally and it seems to pass, so I suspect it is flaky (given all others pass) |
@dandy10 Great catch and work with this one! I believe that I've also stumbled upon this issue. I'm trying to get s3 (Minio) to work for TFX and currently this issue is blocking me. These are my supplied pipeline options:
and this is the error that I'm facing:
@pabloem Any chance this could be reviewed and merged ASAP? This would be extremely helpful. |
@dandy10 I tried to use beam from this PR but I ended up with the same error still. I simply cloned the repo, installed dependencies from build-requirements.txt and ran setup.py which completed successfully. Did I miss something or do I have to build anything else as well? |
@ConverJens forgive the silly question, but when you say you cloned the repo, do you mean that you cloned my fork and checked out this branch locally? I've just successfully rerun a pipeline from this branch (installing into a clean virtual environment) which requires setting a custom s3_endpoint, using the direct runner with multi_processing, so I think it should work as is. Perhaps you could place a print statement into your sdk_worker_main.py to verify whether the PIPELINE_OPTIONS environment variable is set and the expected options are present? It's also worth checking whether you have another version of apache beam installed locally. The subprocess runs |
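As a concrete version of the suggestion above, the check could be as small as this (a hypothetical debug snippet to drop near the top of sdk_worker_main, not code from this PR):

```python
import os

# Hypothetical debug snippet: print whether the worker subprocess
# received any pipeline options via the environment variable that
# sdk_worker_main reads its configuration from.
raw = os.environ.get('PIPELINE_OPTIONS')
if raw is None:
    print('PIPELINE_OPTIONS is not set in this worker')
else:
    print('PIPELINE_OPTIONS =', raw)
```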
@dandy10 Sorry, I forgot to check out your branch from your fork. Pipeline options now seem to be passed on, but I still get an error. I've passed endpoint args but beam is still trying to connect to aws rather than the specified endpoint. Do you know why that is?
The error I'm getting:
Do I need to specify s3_region, tokens etc as well even when I try to access a local Minio instance? Which beam args are you testing with? Any help would be appreciated! |
@ConverJens no problem. Are you sure the endpoint url has been passed correctly? It looks like it's still trying to connect to aws. I think there might be a typo in one of the options: what happens if you change --s3_access_key to --s3_access_key_id? I'm just using s3_endpoint_url, s3_access_key_id and s3_secret_access_key (I'm not setting --s3_verify=False, although I don't think that would be causing the issue you're seeing). I did have some timeouts when testing when I didn't set the cert bundle correctly; maybe something similar is happening here? You can override the default with the AWS_CA_BUNDLE environment variable. |
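Putting the options from this exchange together, an invocation against a local Minio could look roughly like the following. The script name, endpoint and credential values are placeholders; only the option names are taken from the discussion above:

```shell
python my_pipeline.py \
  --runner=DirectRunner \
  --direct_running_mode=multi_processing \
  --s3_endpoint_url=http://localhost:9000 \
  --s3_access_key_id=<minio-access-key> \
  --s3_secret_access_key=<minio-secret-key>
```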
@dandy10 I corrected my s3_access_key argument but still got the same error. The weird thing is that I believe I have passed them on correctly, because if I switch to in_memory and one worker it works as expected and beam writes to minio, so something seems slightly off. I think there is a discrepancy in how the options are passed on to the filesystem itself. Looking at the first line of the error message, it first says:
which is the correct endpoint. Later on the same line it says:
which seems to indicate that the s3 client is still trying to reach amazon.com, which I assume is a default value. Do you have any idea why that is? Below is the full first line again.
|
@ConverJens can you add a print log in the boto3_client.Client initialiser to verify that the parameters have made it all the way up to that point? If they have, then this PR has done all it aims to do, and it must be a different issue. Another thing to verify: are you running your test locally with the apache_beam package built from my fork, or are you running your tests by submitting a job to kubeflow? If it's the latter, have you ensured that the worker pod which is running the job has the patched apache_beam installed? |
I will try that to verify the parameters, compare it to when using in_memory, and see if there is a difference.
I'm using a custom image where I have installed your fork, so I'm positive that I'm running the correct beam code. I'll let you know next week when I have checked the params in the boto3_client. |
@dandy10
From the output, it seems that the pipeline options are indeed empty:
Any idea where to continue troubleshooting? |
That's strange, I'm really not sure why it's not working for you. I think the quickest way to troubleshoot would be to add logging at the points in the PR which I've changed, to verify that you're going down the same path, and see where exactly the options are lost. Could you log them as they are passed through the |
I'll start reviewing this tomorrow. Thanks! |
@ConverJens did you make any progress on finding what the issue was? @pabloem did you get a chance to look yet? |
@dandy10 No, not much, unfortunately. I added logging to both classes that you suggested; from SubprocessSdkWorker I get empty pipeline options, but even more strange is that I get no output at all from my logs in SwitchingDirectRunner. If this is indeed the case, it would explain why there are no options present downstream, since they are exported in SwitchingDirectRunner. Is there any other path to initialize the pipeline that doesn't go through SwitchingDirectRunner? This is the first part of the beam logs, where the subprocesses are initialized and each subprocess calls sdk_worker_main.
|
@dandy10 For reference, this is how the beam pipeline is initiated in TFX:
|
@ConverJens there's the problem. You're not specifying the runner type so TFX is defaulting to go straight to the FnApiRunner (without specifying the provision info, which is what this PR is fixing). If you explicitly specify '--runner=DirectRunner' as an extra beam arg I think it should solve the issue. |
@dandy10 That solved it! I was previously only specifying --direct_runner but actually setting --runner=DirectRunner did it. Thank you for your help and this great contribution! Consider this PR third party tested! |
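For anyone landing here with the same symptom, the working configuration from this exchange can be summarised as follows. The values are illustrative placeholders; the key line is the explicit runner flag:

```python
# Illustrative TFX-style beam pipeline args. '--runner=DirectRunner' must be
# passed explicitly; without it TFX goes straight to the FnApiRunner and the
# options discussed in this PR are never exported to the worker subprocesses.
beam_pipeline_args = [
    '--runner=DirectRunner',
    '--direct_running_mode=multi_processing',
    '--s3_endpoint_url=http://localhost:9000',
    '--s3_access_key_id=<minio-access-key>',
    '--s3_secret_access_key=<minio-secret-key>',
]
print(beam_pipeline_args[0])
```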
Excellent, glad to help, and thanks for testing! |
This LGTM. Thank you for the contribution! |
Propagate pipeline options to direct runner
* propagate pipeline options to direct runner
* delay new import
* fix import reference
* format fix
* remove mypy comment
* fix import order error
* only serialize known args
* fix snippets tests
Co-authored-by: John Doe <john@doe.org>
TL;DR: Propagates pipeline options to the python DirectRunner when running in multi_processing mode.
When running a test pipeline with the DirectRunner and direct_running_mode=multi_processing, I found that pipeline options were not set as expected in the worker processes, causing the pipeline to fail. Specifically, I required filesystem-related options to be set in order to access a remote filesystem. It seems that the sdk_worker_main.py runner script expects the options to be populated into an environment variable (PIPELINE_OPTIONS) prior to running, which was not the case for the SubprocessSdkWorker.
It also seems that sdk_worker_main.py needs to set the loaded pipeline options using the global
filesystems.FileSystems.set_options
which wasn't the case before. This is more surprising to me, because I would've thought it would have blocked other filesystems running through the various portable runners which go through this script. I don't have a great understanding of all of the different paths through which to configure the workers, though, so perhaps this can also occur through different paths (such as the grpc control channels?). It would be great to get some guidance if my understanding is incorrect.