
[Bug]: BigqueryAvroUtils builds invalid Avro schema which prevents insert #25526

@ghost

Description

What happened?

I will add a failing test below, but basically we have a structure in our system that looks something like this:

class1 { identifier: record1 }
class2 { identifier: record2, class1: class1 }

That is, we have two separate members with the name "identifier" in two different parts of the type we're trying to write to BigQuery.

When BigQueryIO calls BigQueryAvroUtils.toGenericAvroSchema() on the type, it generates a schema for the structure, but calling toString() on the resulting Avro schema fails with:

Method threw 'org.apache.avro.SchemaParseException' exception.

This seems to have two causes:

  • BigQueryAvroUtils.toGenericAvroSchema uses a static namespace of "org.apache.beam.sdk.io.gcp.bigquery" for all types, no matter where in the type structure they are located. If it instead added the enclosing type to the namespace (for example org.apache.beam.sdk.io.gcp.bigquery.class1.identifier), there should be no problem.

  • It seems to treat the member name (identifier) as a type name in the schema, so it thinks the two members with the same name are trying to redefine a type.

Not quite clear on the terminology here so I may be using it wrong, but basically it tries to register org.apache.beam.sdk.io.gcp.bigquery.identifier twice in org.apache.avro.Schema$Names.put and that crashes the write to BQ.
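To illustrate the namespace point, here is a minimal sketch that only computes the full names the nested records would get, first with one static namespace and then with the enclosing record name appended. It does not call Beam or Avro: `Field`, `recordFullNames` and `duplicates` are hypothetical helpers made up for this example, and the root record name is left out of the namespace for simplicity.

```kotlin
// Hypothetical stand-in for a BigQuery field: a name plus nested fields (empty for leaf types).
data class Field(val name: String, val fields: List<Field> = emptyList())

const val BASE_NAMESPACE = "org.apache.beam.sdk.io.gcp.bigquery"

// Collect the full name each nested RECORD would get.
// nested = false: every record shares the same static namespace (the behaviour described above).
// nested = true: the namespace is extended with the enclosing record names (the suggested change).
fun recordFullNames(fields: List<Field>, namespace: String, nested: Boolean): List<String> =
    fields.filter { it.fields.isNotEmpty() }.flatMap { record ->
        val childNamespace = if (nested) "$namespace.${record.name}" else namespace
        listOf("$namespace.${record.name}") + recordFullNames(record.fields, childNamespace, nested)
    }

// Full names that occur more than once.
fun duplicates(names: List<String>): Set<String> =
    names.groupingBy { it }.eachCount().filterValues { it > 1 }.keys

// Same shape as the report: root { record { identifier { id1 } }, identifier { id2 } }
val rootFields = listOf(
    Field("record", listOf(Field("identifier", listOf(Field("id1"))))),
    Field("identifier", listOf(Field("id2")))
)

fun main() {
    // Static namespace: both "identifier" records end up with the same full name.
    println(duplicates(recordFullNames(rootFields, BASE_NAMESPACE, nested = false)))
    // prints [org.apache.beam.sdk.io.gcp.bigquery.identifier]

    // Path-based namespace: the full names differ, so nothing is duplicated.
    println(duplicates(recordFullNames(rootFields, BASE_NAMESPACE, nested = true)))
    // prints []
}
```

With the path-based variant the two records become ...bigquery.record.identifier and ...bigquery.identifier, which is the kind of disambiguation suggested in the first bullet.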

The structure is working without any issues up to Beam 2.42 but fails on 2.43, 2.44 and now also 2.45.

To make this clearer, here is a very basic unit test (in Kotlin, but it should translate to Java fairly easily) that fails on the toString() call. It builds the TableSchema manually, but with the same structure that BigQueryIO seems to build for our type.

package org.apache.beam.sdk.io.gcp.bigquery

import com.google.api.services.bigquery.model.TableFieldSchema
import org.junit.jupiter.api.Test

class SchemaTest {

    @Test
    fun test() {

        val stringSchema1 = TableFieldSchema().setName("id1").setType("STRING")
        val stringSchema2 = TableFieldSchema().setName("id2").setType("STRING")

        val identifier1Schema = TableFieldSchema().setName("identifier").setType("RECORD")
            .setFields(listOf(stringSchema1))

        val identifier2Schema = TableFieldSchema().setName("identifier").setType("RECORD")
            .setFields(listOf(stringSchema2))

        val recordSchema = TableFieldSchema().setName("record").setType("RECORD")
            .setFields(listOf(identifier1Schema))

        val rootSchema = TableFieldSchema().setName("root").setType("RECORD")
            .setFields(listOf(recordSchema, identifier2Schema))

        // Schema generation itself succeeds.
        val output = BigQueryAvroUtils.toGenericAvroSchema("root", rootSchema.fields)

        // Serializing the schema is what throws org.apache.avro.SchemaParseException.
        output.toString()
    }
}

The test fails as is, but renaming the member id2 to id1 so that both instances of the member with the name "identifier" are seen as the same type makes the test pass.
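One way to read this pass/fail difference is that the name table used during serialization accepts a repeated full name only when the definition is identical. The sketch below models that reading with plain Kotlin. It is an assumption about the behaviour, not Avro's actual code, and `RecordDef`, `NameTable` and `serializes` are hypothetical names.

```kotlin
// Hypothetical stand-in for a named record schema: its full name plus its field names.
data class RecordDef(val fullName: String, val fieldNames: List<String>)

// Simplified model of a name table that is filled while a schema is written out.
class NameTable {
    private val known = mutableMapOf<String, RecordDef>()

    // Returns true if an identical definition is already registered (it could be written
    // as a reference), registers the definition and returns false if the name is new,
    // and throws when the same full name is reused for a different definition.
    fun visit(def: RecordDef): Boolean {
        val existing = known[def.fullName]
        if (existing == def) return true
        check(existing == null) { "name already defined differently: ${def.fullName}" }
        known[def.fullName] = def
        return false
    }
}

// True if all definitions can be visited in order without a name conflict.
fun serializes(vararg defs: RecordDef): Boolean =
    try {
        val table = NameTable()
        defs.forEach { table.visit(it) }
        true
    } catch (e: IllegalStateException) {
        false
    }

fun main() {
    val name = "org.apache.beam.sdk.io.gcp.bigquery.identifier"

    // Same full name, different fields (id1 vs id2): conflict, as in the failing test.
    println(serializes(RecordDef(name, listOf("id1")), RecordDef(name, listOf("id2")))) // false

    // Same full name, same fields: treated as one type, as in the renamed test.
    println(serializes(RecordDef(name, listOf("id1")), RecordDef(name, listOf("id1")))) // true
}
```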

If it helps, I'll try to make a more complete example that builds the TableSchema from the type in the same way BigQueryIO does, but I hope this makes the problem clear.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

