Skip to content

[BEAM-8203] Add support for AvroTable in SQL#9597

Merged
amaliujia merged 1 commit intoapache:masterfrom
bmv126:avro-table-provider-support-in-sql
Sep 20, 2019
Merged

[BEAM-8203] Add support for AvroTable in SQL#9597
amaliujia merged 1 commit intoapache:masterfrom
bmv126:avro-table-provider-support-in-sql

Conversation

@bmv126
Copy link
Contributor

@bmv126 bmv126 commented Sep 17, 2019

Adding support Avro Table Provider.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status --- --- Build Status
XLang --- --- --- Build Status --- --- ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@bmv126
Copy link
Contributor Author

bmv126 commented Sep 17, 2019

R: @amaliujia R: @reuvenlax

@amaliujia
Copy link
Contributor

Run JavaPortabilityApi PreCommit

Copy link
Contributor

@amaliujia amaliujia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good! Left some comments.


/** A {@link PTransform} to convert {@link GenericRecord} to {@link Row}. */
@AutoValue
public abstract class GenericRecordReadConverter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this class be moved to AvroTable.java to make this change more compact?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per @reuvenlax comment this class is removed now.


/** A {@link PTransform} to convert {@link Row} to {@link GenericRecord}. */
@AutoValue
public abstract class GenericRecordWriteConverter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this write converter be moved as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the package org.apache.beam.sdk.extensions.sql.meta.provider.parquet there is also a ReadConverter class but these are not specific to Parquet.
I was thinking of a separate PR to move these converter classes to some other package. Can you let me know, what is the best place to move these classes ?

.apply("GenericRecordToRow", writeConverter)
.apply(
"AvroIOWrite",
AvroIO.writeGenericRecords(AvroUtils.toAvroSchema(schema, tableName, null))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am less familiar with Avro code: why we need use tableName as parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AvroUtils.toAvroSchema() method provides an avroSchema from the BeamSchema. But the avroSchema geenrated does not have the name of the avroSchema.

Now when I use this avroSchema (without the name) to write/read records it fails with error
SchemaParseException: No name in schema: {"type":"record","fields": ... }

So I had to provide the name for the schema, I have provided the same using the tableName.


/** {@link AvroTable} is a {@link org.apache.beam.sdk.extensions.sql.BeamSqlTable}. */
public class AvroTable extends BaseBeamTable implements Serializable {
private final String filePattern;
Copy link
Contributor

@amaliujia amaliujia Sep 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a concern to keep only one file pattern and read/write uses different patterns (a folder for read and a file for write). It limits this table to a read-only table or write-only table.

The better way might be two parameters for read and write separately. A not-set parameter will fail a read or write operation.

However I can see where this pattern comes from: text table has already used it. So it's up to you if you want to change it in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had followed the TextTableProvider approach over here. May be if needed a seperate PR can be used to handle this read/write using different patterns.

@bmv126 bmv126 force-pushed the avro-table-provider-support-in-sql branch from e0402fc to 94a8b26 Compare September 18, 2019 16:58
@bmv126
Copy link
Contributor Author

bmv126 commented Sep 20, 2019

Run JavaPortabilityApi PreCommit

@amaliujia amaliujia merged commit b8b7bce into apache:master Sep 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants