[BEAM-3418] Send worker_id in all grpc channels to runner harness #4587
aaltay merged 1 commit into apache:master
Conversation
    'avro>=1.8.1,<2.0.0',
    'crcmod>=1.7,<2.0',
    'dill==0.2.6',
    'grpcio>=1.0,<2',
What is the reason for this change? AFAIK, some runners (Dataflow) require grpcio 1.3? Also, it becomes more restrictive for users each time we narrow the list of allowed versions for a dependency.
grpcio 1.8 is the first Python version that allows passing/receiving client headers.
We use a header to send worker_id on all channels, and gRPC headers are only supported from 1.8 onward.
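As a rough sketch of the mechanism being discussed (names are illustrative stand-ins, not the exact PR code), an interceptor rewrites the client call details to carry the worker_id header before each RPC:

```python
import collections

# Minimal stand-in for grpc.ClientCallDetails (available from grpcio >= 1.8).
# A real interceptor would subclass grpc.StreamStreamClientInterceptor and
# return a details object like this from intercept_stream_stream.
CallDetails = collections.namedtuple(
    'CallDetails', ['method', 'timeout', 'metadata', 'credentials'])

def add_worker_id_header(details, worker_id):
    """Return new call details with a ('worker_id', ...) metadata entry."""
    metadata = list(details.metadata or [])
    metadata.append(('worker_id', worker_id))
    return details._replace(metadata=metadata)
```

The original metadata tuple is copied rather than mutated, matching how gRPC expects interceptors to produce new call details.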
Sounds good. Just try running the Dataflow runner tests before finishing this PR.
Sure. I tried running apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest.
I meant running a job on dataflow using: https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_beam_PostCommit_Python_Verify.groovy
Commenting "Run Python PostCommit" on the PR should trigger one.
    class WorkerIdInterceptor(grpc.StreamStreamClientInterceptor):

      # Unique worker Id for this worker.
      _worker_id = os.environ['WORKER_ID'] if os.environ.has_key(
The ID I think you want is this one: https://github.com/apache/beam/blob/master/sdks/python/container/boot.go#L41
Yes. For testing I am modifying the internal version of boot.go. Currently it's a bit of a hack, as I artificially generate the id in boot.go for testing. The hack will be removed once we have the multi-container setup in place.
I was using worker_id to be more explicit, but we can change it to just id.
Sorry if I was unclear: this PR should also change the referenced boot.go code to set the WORKER_ID. The code here should also fail if that ID is not found -- the ID is set by the runner so that the gRPC servers can know who's who; they are not just unique IDs.
Added the relevant boot.go changes
Do we want to fail if WORKER_ID is not found?
For backward compatibility of containers, I would like to assign a UUID if worker_id is not provided.
Do we really need to be backward compatible? This is mostly new code with no production usage. I would prefer to not have it succeed like this. But if you think this is necessary in the interim, we can add a TODO to remove the UUID generation.
Created a JIRA issue, BEAM-3904, to clean it up.
I want to keep it around to decouple SDK changes from internal container changes.
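The fallback agreed on above might look roughly like this (a sketch; the helper name is hypothetical, and the TODO mirrors BEAM-3904):

```python
import os
import uuid

def get_worker_id(environ=None):
    """Prefer the runner-provided WORKER_ID; otherwise generate a UUID.

    The UUID fallback keeps older containers (which do not set WORKER_ID)
    working in the interim.
    TODO(BEAM-3904): Remove the UUID fallback once all containers set the id.
    """
    environ = os.environ if environ is None else environ
    return environ.get('WORKER_ID') or str(uuid.uuid4())
```

Note that `dict.get` with an `or` fallback also covers the case where WORKER_ID is set to an empty string.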
@herohde Please have a look. I have updated the PR to include the boot.go changes. However, I am not sure how the actual id is passed to boot.go as a flag. I made a similar change in the internal version of boot.go and am using the new worker image.

retest this please
    # the flag if 'NO_MULTIPLE_SDK_CONTAINERS' is present.
    # TODO: Clean up MULTIPLE_SDK_CONTAINERS once we deprecate Python SDK
    # versions 2.4 and older.
    if ('MULTIPLE_SDK_CONTAINERS' not in self.proto.experiments and
How confident are we to make this a default behavior for 2.5?
I expect this CL to get into 2.5.
In a way this flag is required to help the router distinguish between old SDKs (up to 2.4) and new SDKs (2.5 onward). Once no SDK older than 2.5 remains in use, we will not need to distinguish between SDKs, at least for the MultiSdk functionality, and it automatically becomes the default behavior.
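The gating described in the diff can be sketched as follows (flag names from the diff; the helper name is hypothetical):

```python
def add_multiple_sdk_containers_flag(experiments):
    """Enable MULTIPLE_SDK_CONTAINERS unless it is already present or the
    explicit opt-out NO_MULTIPLE_SDK_CONTAINERS was passed."""
    if ('MULTIPLE_SDK_CONTAINERS' not in experiments and
            'NO_MULTIPLE_SDK_CONTAINERS' not in experiments):
        experiments.append('MULTIPLE_SDK_CONTAINERS')
    return experiments
```

The opt-out flag exists so that a user (or test) can suppress the default behavior even after 2.5 makes it automatic.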
    self.proto.experiments.append(experiment)
    # Add MULTIPLE_SDK_CONTAINERS flag if it's not already present. Do not add
    # the flag if 'NO_MULTIPLE_SDK_CONTAINERS' is present.
    # TODO: Clean up MULTIPLE_SDK_CONTAINERS once we deprecate Python SDK
@aaltay Can you please take a look?
    #
    """Client Interceptor to inject worker_id"""
    from __future__ import absolute_import
    from __future__ import division
Do we need print_function and division imports?
No, we don't need these imports.
I added them to resolve compatibility issues between Python 2 and 3, based on https://docs.python.org/3/howto/pyporting.html
Should I remove them?
Let's remove them if they are not needed now; I do not see print() or / being used here.
    metadata = []
    if client_call_details.metadata is not None:
      metadata = list(client_call_details.metadata)
    metadata.append(('worker_id', self._worker_id))
Would it be an error (or expected) for client_call_details to already have worker_id?
It should be an error.
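Treating a pre-existing worker_id header as an error could be sketched like this (the helper name and exception type are illustrative, not the PR's final code):

```python
def append_worker_id(metadata, worker_id):
    """Append a worker_id header, rejecting metadata that already has one."""
    metadata = list(metadata) if metadata is not None else []
    # worker_id should only ever be set by this interceptor; finding one
    # already present suggests a misconfigured or doubly-intercepted channel.
    if any(key == 'worker_id' for key, _ in metadata):
        raise RuntimeError('worker_id metadata already present')
    metadata.append(('worker_id', worker_id))
    return metadata
```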
    self._worker_count = worker_count
    self._worker_index = 0
    self._control_channel = grpc.insecure_channel(control_address)
    self._control_channel = grpc.intercept_channel(self._control_channel,
Should we simplify this as:

    self._control_channel = grpc.intercept_channel(
        grpc.insecure_channel(control_address), WorkerIdInterceptor())
    def __init__(self, log_service_descriptor):
      super(FnApiLogRecordHandler, self).__init__()
      self._log_channel = grpc.insecure_channel(log_service_descriptor.url)
      self._log_channel = grpc.intercept_channel(self._log_channel,
(Same simplification comment applies here.)
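The suggested simplification is purely stylistic; the two forms are behaviorally equivalent, which pure-Python stand-ins for grpc.insecure_channel and grpc.intercept_channel (names borrowed from the real API, bodies invented for illustration) can demonstrate:

```python
def insecure_channel(address):
    # Stand-in for grpc.insecure_channel: builds a channel object.
    return {'address': address, 'interceptors': []}

def intercept_channel(channel, *interceptors):
    # Stand-in for grpc.intercept_channel: wraps a channel with interceptors.
    wrapped = dict(channel)
    wrapped['interceptors'] = wrapped['interceptors'] + list(interceptors)
    return wrapped

# Two-step style from the diff:
channel = insecure_channel('localhost:50051')
channel = intercept_channel(channel, 'WorkerIdInterceptor')

# Suggested single-expression style:
channel2 = intercept_channel(
    insecure_channel('localhost:50051'), 'WorkerIdInterceptor')
```

Since the intermediate bare channel is never used on its own, folding the two calls into one expression loses nothing.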
    if (job_type.startswith('FNAPI_') and
        'use_multiple_sdk_containers' not in self.proto.experiments and
        'no_use_multiple_sdk_containers' not in self.proto.experiments):
      self.proto.experiments.append('use_multiple_sdk_containers')
It is preferable to modify debug_options.experiments (as done above for runner_harness_override). This also properly updates the user-visible pipeline options in the UI, and the experiment will be added to the proto automatically by the loop above.
It would also help combine everything related to if job_type.startswith('FNAPI_'): in a single place.
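The reviewer's preference, sketched with a minimal stand-in for the SDK's DebugOptions class (the real pipeline-options plumbing copies experiments into the proto elsewhere, so only the options object is mutated here):

```python
class DebugOptions(object):
    # Minimal stand-in for the SDK's DebugOptions pipeline-options class.
    def __init__(self):
        self.experiments = []

def apply_fnapi_experiments(debug_options, job_type):
    """Keep all FNAPI_-specific tweaks in one place.

    Mutating debug_options.experiments (rather than the proto directly)
    keeps the user-visible pipeline options in sync; the proto then picks
    the experiment up when experiments are copied over.
    """
    if job_type.startswith('FNAPI_'):
        if ('use_multiple_sdk_containers' not in debug_options.experiments
                and 'no_use_multiple_sdk_containers'
                not in debug_options.experiments):
            debug_options.experiments.append('use_multiple_sdk_containers')
```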
    options=[("grpc.max_receive_message_length", -1),
             ("grpc.max_send_message_length", -1)])
    # Add workerId to the grpc channel
    grpc_channel = grpc.intercept_channel(grpc_channel,

Not simplifying here, to keep readability.
Could you squash your commits?

Sure, I will squash them.
Adding use_multiple_sdk_containers flag for FNAPI pipelines.