When running TFX's Evaluator component (which uses tensorflow_model_analysis to create Beam jobs), the following error occurs in one of the workers, while all the other workers complete successfully:
Traceback:
Error message from worker: Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 287, in _execute
response = task()
File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 360, in
lambda: self.create_worker().do_instruction(request), request)
File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 596, in do_instruction
return getattr(self, request_type)(
File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 635, in process_bundle
monitoring_infos = bundle_processor.monitoring_infos()
File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1139, in monitoring_infos
op.monitoring_infos(transform_id, dict(tag_to_pcollection_id)))
File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/operations.py", line 543, in monitoring_infos
all_monitoring_infos.update(self.user_monitoring_infos(transform_id))
File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/operations.py", line 584, in user_monitoring_infos
return self.metrics_container.to_runner_api_monitoring_infos(transform_id)
File "/usr/local/lib/python3.9/site-packages/apache_beam/metrics/execution.py", line 309, in to_runner_api_monitoring_infos
all_metrics = [
File "/usr/local/lib/python3.9/site-packages/apache_beam/metrics/execution.py", line 310, in
cell.to_runner_api_monitoring_info(key.metric_name, transform_id)
File "/usr/local/lib/python3.9/site-packages/apache_beam/metrics/cells.py", line 76, in to_runner_api_monitoring_info
mi = self.to_runner_api_monitoring_info_impl(name, transform_id)
File "/usr/local/lib/python3.9/site-packages/apache_beam/metrics/cells.py", line 150, in to_runner_api_monitoring_info_impl
return monitoring_infos.int64_user_counter(
File "/usr/local/lib/python3.9/site-packages/apache_beam/metrics/monitoring_infos.py", line 185, in int64_user_counter
return create_monitoring_info(
File "/usr/local/lib/python3.9/site-packages/apache_beam/metrics/monitoring_infos.py", line 302, in create_monitoring_info
return metrics_pb2.MonitoringInfo(
TypeError: 7006 has type numpy.int64, but expected one of: bytes
This error happens only occasionally, but frequently enough to break a production pipeline running on a recurring schedule. I would like to understand the root cause of this error to prevent issues in production.
The root of the problem here is that numpy.int64 is not an instance of int, so the numpy.int64 value isn't getting encoded before being dropped into the MonitoringInfo proto. That logic check is here:
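A quick way to see why such a check fails: on Python 3, NumPy's fixed-width integer types do not subclass the built-in int, so an isinstance(value, int) gate (a sketch of the kind of check involved, not Beam's exact code) rejects them even though the value is numerically an integer:

```python
import numpy as np

value = np.int64(7006)

# numpy.int64 is not a subclass of the built-in int on Python 3,
# so an isinstance-based gate skips the int encoding path for it.
print(isinstance(value, int))         # False
print(isinstance(value, np.integer))  # True

# The raw numpy scalar then reaches the proto layer unencoded, producing
# "TypeError: 7006 has type numpy.int64, but expected one of: bytes".
```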
It looks like, end to end, we expect only int values to be aggregated within a Beam metric, but we don't hit a hard check for that until we try to encode the metric into a proto. The inc() and dec() methods get a type hint through their default parameters, but update() doesn't enforce a type hint yet. Would you know if that is the method being used to update the metric here?
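A minimal sketch (hypothetical class, not Beam's actual counter implementation) of how a counter whose inc() implies int via its default argument, but whose update() performs no type check, can silently accept a numpy scalar:

```python
import numpy as np

class Counter:
    """Toy counter illustrating the gap: inc() hints at int through its
    default argument, but update() accepts whatever it is given."""

    def __init__(self):
        self.value = 0

    def inc(self, n=1):      # the default of 1 hints that n is an int
        self.update(n)

    def update(self, n):     # no type check: numpy.int64 flows through
        self.value += n

c = Counter()
c.inc()                      # value is still a Python int
c.update(np.int64(7005))     # value silently becomes numpy.int64
print(type(c.value))         # <class 'numpy.int64'>
```

Since int + numpy.int64 promotes to numpy.int64, a single bad update() call is enough to change the accumulated type, which then surfaces much later when the metric is serialized.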
It is possible that the update() calls are happening within Beam internals, but it sounds like somewhere in the application an incorrect type is passed as a counter value. It may be a bug in the TFMA codebase. Perhaps adding logs to print the offending counter name could help.
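Until the source of the numpy value is pinned down, one defensive workaround in application code is to coerce counter increments to a plain int before handing them to the metrics API, logging the offending counter name whenever the type is unexpected. This is a hypothetical helper, not part of Beam or TFMA:

```python
import logging
import numbers

def safe_counter_value(name, value):
    """Coerce a metric increment to a plain Python int, logging the
    offending counter name when the value has an unexpected type."""
    if type(value) is not int:
        logging.warning(
            "Counter %r received non-int value %r of type %s",
            name, value, type(value).__name__)
    # numbers.Integral covers numpy integer scalars, which register
    # with that ABC, so int(value) is a lossless conversion here.
    if isinstance(value, numbers.Integral):
        return int(value)
    raise TypeError(f"Counter {name!r} got non-integral value: {value!r}")

# Usage sketch: counter.update(safe_counter_value("num_instances", value))
```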
Related issue in TFMA: tensorflow/model-analysis#171
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components