[BEAM-14035] Convert BigQuery SchemaIO to SchemaTransform #17135

Conversation

@damondouglas (Contributor) commented on Mar 20, 2022

This PR, currently a work in progress, closes BEAM-14035 by:

  • Creating a BigQuerySchemaIOConfiguration configuration that models Query, Extract, or Load BigQuery jobs
  • Creating a BigQuerySchemaTransform implementation of SchemaTransform
  • Refactoring BigQuerySchemaIOProvider as an extension of TypedSchemaTransformProvider<BigQuerySchemaIOConfiguration>
  • Creating a BigQueryRowReader implementation of PTransform<PBegin, PCollectionRowTuple>

Remaining work is:

  • Update BigQueryRowReader to handle an Extract BigQuery job
  • Create a BigQueryRowWriter implementation of PTransform<PCollectionRowTuple, PDone>
  • Update BigQuerySchemaIOConfiguration with Load Job configuration properties
  • Tests using FakeBigQuery services (initial attempt failed)

I would like to request @laraschmidt to review this PR or delegate where relevant.
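
As a rough, non-final sketch of how these pieces are intended to fit together (method visibilities, the identifier value, and the BigQuerySchemaTransform.of factory call are placeholders rather than the final API):

import java.util.Collections;
import java.util.List;
import org.apache.beam.sdk.schemas.transforms.SchemaTransform;
import org.apache.beam.sdk.schemas.transforms.TypedSchemaTransformProvider;

public class BigQuerySchemaIOProvider
    extends TypedSchemaTransformProvider<BigQuerySchemaIOConfiguration> {

  @Override
  public Class<BigQuerySchemaIOConfiguration> configurationClass() {
    return BigQuerySchemaIOConfiguration.class;
  }

  @Override
  public SchemaTransform from(BigQuerySchemaIOConfiguration configuration) {
    // Placeholder factory call; the transform wraps the configuration and expands to the read.
    return BigQuerySchemaTransform.of(configuration);
  }

  @Override
  public String identifier() {
    return "bigquery"; // placeholder; mirrors the existing SchemaIO identifier
  }

  @Override
  public List<String> inputCollectionNames() {
    return Collections.emptyList(); // a Query/Extract read consumes no input rows
  }

  @Override
  public List<String> outputCollectionNames() {
    return Collections.singletonList("output");
  }
}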


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

See CI.md for more information about GitHub Actions CI.

@@ -42,13 +42,13 @@
@Experimental(Kind.SCHEMAS)
public abstract class TypedSchemaTransformProvider<ConfigT> implements SchemaTransformProvider {

abstract Class<ConfigT> configurationClass();
public abstract Class<ConfigT> configurationClass();
Comment:
Why do these need to be public?
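
(Background, as a general Java note rather than a claim about this PR's intent: an abstract method with package-private visibility cannot be implemented by a subclass in another package, so a provider under org.apache.beam.sdk.io.gcp.bigquery could not extend TypedSchemaTransformProvider unless the abstract methods are widened. A minimal illustration with made-up package and class names:)

// pkg/a/Base.java
package pkg.a;

public abstract class Base {
  abstract String name(); // package-private: not visible outside pkg.a
}

// pkg/b/Impl.java
package pkg.b;

public class Impl extends pkg.a.Base {
  // Does not compile: "Impl is not abstract and does not override abstract method
  // name() in Base", because the package-private abstract method cannot be overridden here.
  public String name() {
    return "impl";
  }
}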

@laraschmidt (Contributor) left a comment:

I was a little confused about the structure of this change, so I left some comments about that. We don't need to support new features / all features in the transform; we should just be able to mimic the behavior of SchemaIO, and ideally this would only be a structural change, e.g. I wouldn't expect a different way of reading or generating transforms in this change. Let me know if I misunderstood or missed something.

*/
@Internal
@Experimental
@AutoService(SchemaIOProvider.class)
Comment:
We probably still need AutoService, just for the SchemaTransformProvider instead of SchemaIOProvider.
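
(A generic sketch of why the registration matters, not code from this PR: SchemaTransformProvider implementations are typically discovered through ServiceLoader, which is what @AutoService(SchemaTransformProvider.class) wires up. The class name below is made up.)

import java.util.ServiceLoader;
import org.apache.beam.sdk.schemas.transforms.SchemaTransformProvider;

public class ListSchemaTransformProviders {
  public static void main(String[] args) {
    // Providers annotated with @AutoService(SchemaTransformProvider.class) show up here.
    for (SchemaTransformProvider provider : ServiceLoader.load(SchemaTransformProvider.class)) {
      System.out.println(provider.identifier());
    }
  }
}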

public String identifier() {
return "bigquery";
public BigQuerySchemaIOProvider() {
super();
Comment:
The base class doesn't have a constructor. Is it normal to call super in this case?
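
(General Java aside: a class with no declared constructor gets an implicit no-argument constructor, and every constructor implicitly begins with a super() call, so writing it out compiles but is redundant.)

class Base {} // the compiler supplies: Base() { super(); }

class Derived extends Base {
  Derived() {
    super(); // legal but redundant; the compiler inserts this call when omitted
  }
}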

*/
@Override
public BigQuerySchemaIO from(String location, Row configuration, @Nullable Schema dataSchema) {
return new BigQuerySchemaIO(location, configuration);
public SchemaTransform from(BigQuerySchemaIOConfiguration configuration) {
Comment:
SchemaTransform, but maybe BigQueryConfig is good enough?

public boolean requiresDataSchema() {
return false;
public String identifier() {
return BigQuerySchemaIOConfiguration.IDENTIFIER;
Comment:
Conceptually this doesn't really belong in the configuration class. I'd either use a string literal or a constant defined in this file.
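
A small sketch of that suggestion (the constant name is illustrative; the value mirrors the existing SchemaIO identifier shown earlier):

// In the provider class, rather than on the configuration class:
private static final String IDENTIFIER = "bigquery";

@Override
public String identifier() {
  return IDENTIFIER;
}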

*/
@Internal
@Experimental
@AutoService(SchemaIOProvider.class)
Comment:
We need one of these for read and write, right? Shouldn't that be in the name?

Comment:
Typically I've seen the SchemaProvider + the SchemaTransform/IO in one file. So we'd have two sets of those. I don't know if that fits with the classes here or not though.

public PCollection.IsBounded isBounded() {
return PCollection.IsBounded.BOUNDED;
public List<String> inputCollectionNames() {
// TODO: determine valid input collection names for JobType
Comment:
You can also just put "input" or "output" if there's just one input and one output. I think this is more useful when there are several inputs, e.g. a join.
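
For example (a hedged sketch assuming a read-style transform with no input PCollections and a single tagged output):

@Override
public List<String> inputCollectionNames() {
  return Collections.emptyList(); // a read takes no input rows
}

@Override
public List<String> outputCollectionNames() {
  return Collections.singletonList("output"); // multiple names mostly matter for joins
}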

*/
@DefaultSchema(AutoValueSchema.class)
@AutoValue
public abstract class BigQuerySchemaIOConfiguration {
Comment:
Where did this list of configs come from? For the first version we can probably just stick with the set of configurations the SchemaIO provided.

return Schema.builder()
.addNullableField("table", FieldType.STRING)
.addNullableField("query", FieldType.STRING)
.addNullableField("queryLocation", FieldType.STRING)
.addNullableField("createDisposition", FieldType.STRING)
.build();

Comment:
The SchemaTransform doesn't have to expose everything the base class does, though eventually it may. When it comes to converting it to the new form, we can just mimic what was in the original form.

import org.apache.beam.sdk.values.PCollectionRowTuple;

@AutoValue
public abstract class BigQuerySchemaTransform implements SchemaTransform {
Comment:
Need one for write and one for read
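
A hedged sketch of what a write-side expand could look like (the getTable() accessor and the "input" tag are illustrative, and load-job options and error handling are omitted):

@Override
public PCollectionRowTuple expand(PCollectionRowTuple input) {
  input
      .get("input") // tag name is illustrative
      .apply(
          "RowToTableRow",
          MapElements.into(TypeDescriptor.of(TableRow.class))
              .via((Row row) -> BigQueryUtils.toTableRow(row)))
      .apply(BigQueryIO.writeTableRows().to(getConfiguration().getTable()));
  // A write produces no output rows.
  return PCollectionRowTuple.empty(input.getPipeline());
}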

@Override
public PCollectionRowTuple expand(PCollectionRowTuple input) {
if (input.getAll().isEmpty()) {
BigQueryRowReader.Builder builder = BigQueryRowReader.builderOf(getConfiguration());
Comment:
Why not combine BigQueryRowReader and the SchemaTransform?
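
A hedged sketch of that combination, with the read expansion folded directly into the transform (getTableSpec() and getRowSchema() are illustrative stand-ins for whatever the configuration actually exposes):

@Override
public PCollectionRowTuple expand(PCollectionRowTuple input) {
  Schema rowSchema = getConfiguration().getRowSchema(); // illustrative accessor
  PCollection<Row> rows =
      input
          .getPipeline()
          .apply(BigQueryIO.readTableRows().from(getConfiguration().getTableSpec()))
          .apply(
              "TableRowToRow",
              MapElements.into(TypeDescriptors.rows())
                  .via((TableRow tableRow) -> BigQueryUtils.toBeamRow(rowSchema, tableRow)))
          .setRowSchema(rowSchema);
  return PCollectionRowTuple.of("output", rows);
}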

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.Row;

class TableRowToBeamRowFn extends DoFn<TableRow, Row> {
Comment:
Why do we need to add this? Just moving it to a shared location?
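
For reference, a hedged sketch of what such a shared DoFn might look like (the constructor and field are illustrative and may differ from the PR's version); it just delegates the per-field conversion to the existing BigQueryUtils helper:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.Row;

class TableRowToBeamRowFn extends DoFn<TableRow, Row> {
  private final Schema rowSchema;

  TableRowToBeamRowFn(Schema rowSchema) {
    this.rowSchema = rowSchema;
  }

  @ProcessElement
  public void processElement(@Element TableRow tableRow, OutputReceiver<Row> out) {
    // Convert the BigQuery TableRow into a Beam Row with the configured schema.
    out.output(BigQueryUtils.toBeamRow(rowSchema, tableRow));
  }
}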

@damondouglas deleted the beam-14035-refactor-BigQuerySchemaIOProvider branch on March 24, 2022 at 23:24.