[Bug]: Reading from BigQuery provides inconsistent schemas #28151

Open
2 of 15 tasks
robertwb opened this issue Aug 25, 2023 · 4 comments

Comments

@robertwb
Contributor

What happened?

When doing a BigQuery Read like

p | beam.io.ReadFromBigQuery(
    table='apache-beam-testing:beam_bigquery_io_test.taxi_small',
    output_type='BEAM_ROW')

the TIMESTAMP fields are converted to fields of schema type Field{name=event_timestamp, description=, type=LOGICAL_TYPE<beam:logical_type:micros_instant:v1>, options={{}}} whereas in Java they are converted into (incompatible) fields of schema type Field{name=event_timestamp, description=, type=DATETIME, options={{}}}.

The Python one is probably the one that is wrong here. In addition, one cannot write elements of this type to another BigQuery table as one gets

  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py", line 261, in process
    writer.write(row)
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/io/gcp/bigquery_tools.py", line 1432, in write
    return self._file_handle.write(self._coder.encode(row) + b'\n')
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/io/gcp/bigquery_tools.py", line 1379, in encode
    return json.dumps(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/__init__.py", line 234, in dumps
    return cls(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/io/gcp/bigquery_tools.py", line 152, in default_encoder
    raise TypeError(
TypeError: Object of type 'Timestamp' is not JSON serializable [while running 'WriteToBigQueryHandlingErrors/WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)']
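The traceback ends in the standard `json` encoder, which has no rule for a timestamp object. The sketch below reproduces that failure mode using only the standard library and shows one possible stopgap: converting such values to strings before encoding. `FakeTimestamp` and `to_json_safe` are hypothetical names for illustration, not Beam classes or functions, and the ISO-8601 output format is an assumption rather than something this issue specifies.

```python
import json
from datetime import datetime, timezone


class FakeTimestamp:
    """Hypothetical stand-in for a timestamp value: microseconds since the epoch."""

    def __init__(self, micros):
        self.micros = micros


def to_json_safe(value):
    """`default` hook for json.dumps: render FakeTimestamp as an ISO-8601 UTC string."""
    if isinstance(value, FakeTimestamp):
        seconds, micros = divmod(value.micros, 1_000_000)
        dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
        return dt.replace(microsecond=micros).isoformat()
    raise TypeError(
        f"Object of type {type(value).__name__!r} is not JSON serializable")


row = {"event_timestamp": FakeTimestamp(1_692_921_600_000_123)}

# Without a hook, encoding fails in the same way as the traceback above.
try:
    json.dumps(row)
    failed = False
except TypeError:
    failed = True

# With the hook, the timestamp is written as a string.
encoded = json.dumps(row, default=to_json_safe)
```

Whether a string in this format is accepted by the destination table depends on the column type and load format, so treat this as a debugging aid rather than a fix for the schema mismatch itself.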

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@JoeCMoore

JoeCMoore commented Aug 25, 2023

This also occurs with the BigQuery streaming write API. My bigquery.Timestamp schema field is translated to DATETIME, which causes a mismatch error between the input and output schemas.

bigquery.Timestamp seems to stem from Python's datetime.datetime, but even when timestamps are formatted that way, LOGICAL_TYPE<beam:logical_type:micros_instant:v1> is still translated to DATETIME.

The thread here seems to suggest it has been fixed, but the fix may only be in a pre-release rather than the current release (2.49.0 at the time of writing). Do we have an ETA on this fix?

We can use the legacy streaming API, but data in the streaming buffer is not queryable for ~2-3 minutes, which rules out any sort of real-time queries.

@Abacn
Contributor

Abacn commented Aug 25, 2023

@JoeCMoore For the BigQuery streaming write API use case, please try adding

LogicalType.register_logical_type(MillisInstant)

before pipeline creation. This would solve the issue.
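For readers who want to try the suggestion, here is a minimal sketch of the registration call with its imports. The import path `apache_beam.typehints.schemas` is an assumption based on where the Python SDK defines these classes, and the helper name `register_millis_instant` is hypothetical. The import is guarded so that the snippet still runs where Beam is not installed.

```python
def register_millis_instant():
    """Apply the suggested workaround; return False if Beam is not installed."""
    try:
        # Assumed import path in the Beam Python SDK.
        from apache_beam.typehints.schemas import LogicalType, MillisInstant
    except ImportError:
        return False
    # The one-line workaround from the comment above.
    LogicalType.register_logical_type(MillisInstant)
    return True


# Call this before constructing the pipeline, as the comment advises.
registered = register_millis_instant()
```

The registration affects process-wide state, so it may also change how other timestamp fields in the same pipeline are translated.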

This is because #22679 made the Python side the source of truth for schema translation, so logical types that share the same language type now conflict (MillisInstant, MicrosInstant). This workaround was used to fix tests: b0484e7

I have seen this reported a couple of times. We do need to figure out a long-term fix.
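To illustrate the kind of conflict described here, the toy model below assumes a registry keyed by language type in which the most recent registration wins. `ToyRegistry` and the stand-in `Timestamp` class are hypothetical, and the registration order is chosen only to match the behavior reported in this issue; Beam's actual internals may differ.

```python
class Timestamp:
    """Stand-in for the single Python type that both logical types map to."""


class ToyRegistry:
    """Toy registry: one logical type per language type, last registration wins."""

    def __init__(self):
        self.by_language_type = {}

    def register(self, logical_name, language_type):
        # A second registration for the same language type overwrites the first.
        self.by_language_type[language_type] = logical_name

    def lookup(self, language_type):
        return self.by_language_type[language_type]


registry = ToyRegistry()
registry.register("millis_instant", Timestamp)
registry.register("micros_instant", Timestamp)

# Only one logical type can be chosen for Timestamp; here it is micros_instant.
default_choice = registry.lookup(Timestamp)

# Re-registering millis_instant, as in the workaround, flips the choice.
registry.register("millis_instant", Timestamp)
after_workaround = registry.lookup(Timestamp)
```

In this model the workaround does not resolve the ambiguity; it only changes which of the two types wins, which is consistent with the call for a longer-term fix.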

@johnjcasey
Contributor

@Abacn is there a way to fix this long term? it seems like we should just have millisinstant

@Abacn
Contributor

Abacn commented Dec 6, 2023

@Abacn is there a way to fix this long term? it seems like we should just have millisinstant

I tried to make it the default, but it broke many coder unit tests: #29182

I do not have a good idea for a long-term fix unless we remove Joda-Time from the Java library, or at least use java.time.Instant in all IOs.
