[BEAM-14035] Convert BigQuery SchemaIO to SchemaTransform #17135

Conversation

@damondouglas (Contributor) commented on Mar 20, 2022

This PR, currently a work in progress, closes BEAM-14035 by:

  • Creating a BigQuerySchemaIOConfiguration configuration that models Query, Extract, or Load BigQuery jobs
  • Creating a BigQuerySchemaTransform implementation of SchemaTransform
  • Refactoring BigQuerySchemaIOProvider as an extension of TypedSchemaTransformProvider<BigQuerySchemaIOConfiguration>
  • Creating a BigQueryRowReader implementation of PTransform<PBegin, PCollectionRowTuple>

Remaining work is:

  • Update BigQueryRowReader to handle an Extract BigQuery job
  • Create a BigQueryRowWriter implementation of PTransform<PCollectionRowTuple, PDone>
  • Update BigQuerySchemaIOConfiguration with Load Job configuration properties
  • Tests using FakeBigQuery services (initial attempt failed)

I would like to request @laraschmidt to review this PR or delegate where relevant.
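
As a rough, non-final sketch of how these pieces are intended to fit together (method visibilities, the identifier value, and the BigQuerySchemaTransform.of factory call are placeholders rather than the final API):

import java.util.Collections;
import java.util.List;
import org.apache.beam.sdk.schemas.transforms.SchemaTransform;
import org.apache.beam.sdk.schemas.transforms.TypedSchemaTransformProvider;

public class BigQuerySchemaIOProvider
    extends TypedSchemaTransformProvider<BigQuerySchemaIOConfiguration> {

  @Override
  public Class<BigQuerySchemaIOConfiguration> configurationClass() {
    return BigQuerySchemaIOConfiguration.class;
  }

  @Override
  public SchemaTransform from(BigQuerySchemaIOConfiguration configuration) {
    // Placeholder factory call; the transform wraps the configuration and expands to the read.
    return BigQuerySchemaTransform.of(configuration);
  }

  @Override
  public String identifier() {
    return "bigquery"; // placeholder; mirrors the existing SchemaIO identifier
  }

  @Override
  public List<String> inputCollectionNames() {
    return Collections.emptyList(); // a Query/Extract read consumes no input rows
  }

  @Override
  public List<String> outputCollectionNames() {
    return Collections.singletonList("output");
  }
}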


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

See CI.md for more information about GitHub Actions CI.

@@ -42,13 +42,13 @@
@Experimental(Kind.SCHEMAS)
public abstract class TypedSchemaTransformProvider<ConfigT> implements SchemaTransformProvider {

abstract Class<ConfigT> configurationClass();
public abstract Class<ConfigT> configurationClass();
Comment:
Why do these need to be public?
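
(Background, as a general Java note rather than a claim about this PR's intent: an abstract method with package-private visibility cannot be implemented by a subclass in another package, so a provider under org.apache.beam.sdk.io.gcp.bigquery could not extend TypedSchemaTransformProvider unless the abstract methods are widened. A minimal illustration with made-up package and class names:)

// pkg/a/Base.java
package pkg.a;

public abstract class Base {
  abstract String name(); // package-private: not visible outside pkg.a
}

// pkg/b/Impl.java
package pkg.b;

public class Impl extends pkg.a.Base {
  // Does not compile: "Impl is not abstract and does not override abstract method
  // name() in Base", because the package-private abstract method cannot be overridden here.
  public String name() {
    return "impl";
  }
}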

@laraschmidt (Contributor) left a comment:

I was a little confused about the structure of this change, so I left some comments about that. We don't need to support new features / all features in the transform; we should just be able to mimic the behavior of SchemaIO, and ideally this would only be a structural change, e.g. I wouldn't expect a different way of reading or generating transforms in this change. Let me know if I misunderstood or missed something.

*/
@Internal
@Experimental
@AutoService(SchemaIOProvider.class)
Comment:
We probably still need AutoService, just for the SchemaTransformProvider instead of SchemaIOProvider.
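
(A generic sketch of why the registration matters, not code from this PR: SchemaTransformProvider implementations are typically discovered through ServiceLoader, which is what @AutoService(SchemaTransformProvider.class) wires up. The class name below is made up.)

import java.util.ServiceLoader;
import org.apache.beam.sdk.schemas.transforms.SchemaTransformProvider;

public class ListSchemaTransformProviders {
  public static void main(String[] args) {
    // Providers annotated with @AutoService(SchemaTransformProvider.class) show up here.
    for (SchemaTransformProvider provider : ServiceLoader.load(SchemaTransformProvider.class)) {
      System.out.println(provider.identifier());
    }
  }
}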

public String identifier() {
return "bigquery";
public BigQuerySchemaIOProvider() {
super();
Comment:
The base class doesn't have a constructor. Is it normal to call super in this case?
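
(General Java aside: a class with no declared constructor gets an implicit no-argument constructor, and every constructor implicitly begins with a super() call, so writing it out compiles but is redundant.)

class Base {} // the compiler supplies: Base() { super(); }

class Derived extends Base {
  Derived() {
    super(); // legal but redundant; the compiler inserts this call when omitted
  }
}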

*/
@Override
public BigQuerySchemaIO from(String location, Row configuration, @Nullable Schema dataSchema) {
return new BigQuerySchemaIO(location, configuration);
public SchemaTransform from(BigQuerySchemaIOConfiguration configuration) {
Comment:
SchemaTransform, but maybe BigQueryConfig is good enough?

public boolean requiresDataSchema() {
return false;
public String identifier() {
return BigQuerySchemaIOConfiguration.IDENTIFIER;
Comment:
Conceptually this doesn't really belong in the configuration class. I'd either use a string literal or a constant defined in this file.
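
A small sketch of that suggestion (the constant name is illustrative; the value mirrors the existing SchemaIO identifier shown earlier):

// In the provider class, rather than on the configuration class:
private static final String IDENTIFIER = "bigquery";

@Override
public String identifier() {
  return IDENTIFIER;
}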

*/
@Internal
@Experimental
@AutoService(SchemaIOProvider.class)
Comment:
We need one of these for read and write, right? Shouldn't that be in the name?

Comment:
Typically I've seen the SchemaProvider + the SchemaTransform/IO in one file. So we'd have two sets of those. I don't know if that fits with the classes here or not though.

public PCollection.IsBounded isBounded() {
return PCollection.IsBounded.BOUNDED;
public List<String> inputCollectionNames() {
// TODO: determine valid input collection names for JobType
Comment:
You can also just put "input" or "output" if there's just one input and one output. I think this is more useful when there are several inputs, e.g. a join.
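
For example (a hedged sketch assuming a read-style transform with no input PCollections and a single tagged output):

@Override
public List<String> inputCollectionNames() {
  return Collections.emptyList(); // a read takes no input rows
}

@Override
public List<String> outputCollectionNames() {
  return Collections.singletonList("output"); // multiple names mostly matter for joins
}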

*/
@DefaultSchema(AutoValueSchema.class)
@AutoValue
public abstract class BigQuerySchemaIOConfiguration {
Comment:
Where did this list of configs come from? For the first version we can probably just stick with the set of configurations the SchemaIO provided.

return Schema.builder()
.addNullableField("table", FieldType.STRING)
.addNullableField("query", FieldType.STRING)
.addNullableField("queryLocation", FieldType.STRING)
.addNullableField("createDisposition", FieldType.STRING)
.build();

Comment:
The SchemaTransform doesn't have to expose everything the base class does, though eventually it may. When it comes to converting it to the new form, we can just mimic what was in the original form.

import org.apache.beam.sdk.values.PCollectionRowTuple;

@AutoValue
public abstract class BigQuerySchemaTransform implements SchemaTransform {
Comment:
Need one for write and one for read
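
A hedged sketch of what a write-side expand could look like (the getTable() accessor and the "input" tag are illustrative, and load-job options and error handling are omitted):

@Override
public PCollectionRowTuple expand(PCollectionRowTuple input) {
  input
      .get("input") // tag name is illustrative
      .apply(
          "RowToTableRow",
          MapElements.into(TypeDescriptor.of(TableRow.class))
              .via((Row row) -> BigQueryUtils.toTableRow(row)))
      .apply(BigQueryIO.writeTableRows().to(getConfiguration().getTable()));
  // A write produces no output rows.
  return PCollectionRowTuple.empty(input.getPipeline());
}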

@Override
public PCollectionRowTuple expand(PCollectionRowTuple input) {
if (input.getAll().isEmpty()) {
BigQueryRowReader.Builder builder = BigQueryRowReader.builderOf(getConfiguration());
Comment:
Why not combine BigQueryRowReader and the SchemaTransform?
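
A hedged sketch of that combination, with the read expansion folded directly into the transform (getTableSpec() and getRowSchema() are illustrative stand-ins for whatever the configuration actually exposes):

@Override
public PCollectionRowTuple expand(PCollectionRowTuple input) {
  Schema rowSchema = getConfiguration().getRowSchema(); // illustrative accessor
  PCollection<Row> rows =
      input
          .getPipeline()
          .apply(BigQueryIO.readTableRows().from(getConfiguration().getTableSpec()))
          .apply(
              "TableRowToRow",
              MapElements.into(TypeDescriptors.rows())
                  .via((TableRow tableRow) -> BigQueryUtils.toBeamRow(rowSchema, tableRow)))
          .setRowSchema(rowSchema);
  return PCollectionRowTuple.of("output", rows);
}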

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.Row;

class TableRowToBeamRowFn extends DoFn<TableRow, Row> {
Comment:
Why do we need to add this? Just moving it to a shared location?
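
For reference, a hedged sketch of what such a shared DoFn might look like (the constructor and field are illustrative and may differ from the PR's version); it just delegates the per-field conversion to the existing BigQueryUtils helper:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.Row;

class TableRowToBeamRowFn extends DoFn<TableRow, Row> {
  private final Schema rowSchema;

  TableRowToBeamRowFn(Schema rowSchema) {
    this.rowSchema = rowSchema;
  }

  @ProcessElement
  public void processElement(@Element TableRow tableRow, OutputReceiver<Row> out) {
    // Convert the BigQuery TableRow into a Beam Row with the configured schema.
    out.output(BigQueryUtils.toBeamRow(rowSchema, tableRow));
  }
}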

@damondouglas deleted the beam-14035-refactor-BigQuerySchemaIOProvider branch on March 24, 2022 at 23:24.