
[Bug]: 16177 has type numpy.int64, but expected one of: bytes #27469

Open
2 of 15 tasks
EdwardCuiPeacock opened this issue Jul 12, 2023 · 3 comments
Comments
EdwardCuiPeacock commented Jul 12, 2023

What happened?

When running TFX's Evaluator component (which uses tensorflow_model_analysis to create Beam jobs), the following error occurs in one of the workers, while all the other workers were able to complete successfully:

Traceback:

Error message from worker: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 287, in _execute
    response = task()
  File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 360, in <lambda>
    lambda: self.create_worker().do_instruction(request), request)
  File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 596, in do_instruction
    return getattr(self, request_type)(
  File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 635, in process_bundle
    monitoring_infos = bundle_processor.monitoring_infos()
  File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1139, in monitoring_infos
    op.monitoring_infos(transform_id, dict(tag_to_pcollection_id)))
  File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/operations.py", line 543, in monitoring_infos
    all_monitoring_infos.update(self.user_monitoring_infos(transform_id))
  File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/operations.py", line 584, in user_monitoring_infos
    return self.metrics_container.to_runner_api_monitoring_infos(transform_id)
  File "/usr/local/lib/python3.9/site-packages/apache_beam/metrics/execution.py", line 309, in to_runner_api_monitoring_infos
    all_metrics = [
  File "/usr/local/lib/python3.9/site-packages/apache_beam/metrics/execution.py", line 310, in <listcomp>
    cell.to_runner_api_monitoring_info(key.metric_name, transform_id)
  File "/usr/local/lib/python3.9/site-packages/apache_beam/metrics/cells.py", line 76, in to_runner_api_monitoring_info
    mi = self.to_runner_api_monitoring_info_impl(name, transform_id)
  File "/usr/local/lib/python3.9/site-packages/apache_beam/metrics/cells.py", line 150, in to_runner_api_monitoring_info_impl
    return monitoring_infos.int64_user_counter(
  File "/usr/local/lib/python3.9/site-packages/apache_beam/metrics/monitoring_infos.py", line 185, in int64_user_counter
    return create_monitoring_info(
  File "/usr/local/lib/python3.9/site-packages/apache_beam/metrics/monitoring_infos.py", line 302, in create_monitoring_info
    return metrics_pb2.MonitoringInfo(

TypeError: 7006 has type numpy.int64, but expected one of: bytes

This error happens only occasionally, but frequently enough to break a production pipeline running on a recurring schedule. I would like to understand the root cause of this error to prevent issues in production.

Related issue in TFMA: tensorflow/model-analysis#171

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@jrmccluskey (Contributor) commented:
The root of the problem here is that numpy.int64 is not an instance of int, so the numpy.int64 value isn't getting encoded before being dropped into the MonitoringInfo proto. That logic check is here:

if isinstance(metric, int):

End-to-end we expect only int values to be aggregated within a Beam metric, but there is no hard check for that until we try to encode the metric into a proto. The inc() and dec() methods get a type hint through their default params, but update() doesn't enforce a type hint yet. Would you know if that is the method being used to work on the metric here?
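The mismatch described above can be demonstrated directly, since numpy.int64 does not subclass the built-in int in Python 3 (a minimal sketch, assuming NumPy is installed; not Beam's actual encoding code):

```python
import numpy as np

value = np.int64(7006)

# numpy.int64 is not a subclass of the built-in int, so it fails
# the isinstance(metric, int) check quoted above and falls through
# to the proto constructor unconverted.
print(isinstance(value, int))       # False

# An explicit cast restores the type the check expects.
print(isinstance(int(value), int))  # True
```

This is why the value reaches metrics_pb2.MonitoringInfo unencoded and the proto raises the TypeError seen in the traceback.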

@tvalentyn (Contributor) commented:

Thanks for taking a look, @jrmccluskey.

It is possible that update() calls are happening within Beam internals, but it sounds like somewhere in the application an incorrect type is passed as a counter value. It may be a bug in the TFMA codebase. Perhaps adding logs to print the offending counter name could help.
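A defensive workaround at the application level might look like the following sketch: cast metric values to a built-in int before handing them to a Beam counter, and log the counter name when a non-int slips through. The helper name `coerce_counter_value` is hypothetical, not part of Beam or TFMA:

```python
import logging
import numpy as np

def coerce_counter_value(counter_name, value):
    """Hypothetical helper: coerce a metric value to a built-in int,
    logging the offending counter name (as suggested above) when a
    non-int type such as numpy.int64 is seen."""
    if not isinstance(value, int):
        logging.warning(
            "counter %r received %s value %r; casting to int",
            counter_name, type(value).__name__, value)
        value = int(value)
    return value

# e.g. clean the value before passing it to a counter's inc()/update()
clean = coerce_counter_value("example_count", np.int64(16177))
print(type(clean).__name__)  # int
```

This does not fix the root cause, but it would both keep the pipeline running and surface which counter is being fed numpy values.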

@tvalentyn (Contributor) commented:

Let's continue the discussion in tensorflow/model-analysis#171.

3 participants