
[BEAM-4601][SQL] Support BigQuery read from SQL. #5830

Merged: kennknowles merged 1 commit into apache:master from amaliujia:rui_wang-read_bigquery on Jul 2, 2018

Conversation

@amaliujia (Contributor) commented Jun 29, 2018:

Adds BigQuery read support to Beam SQL. Beam SQL can now read all SQL types that it can write.
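For context, with a BigQuery table provider registered, a read like this would be expressed from the SQL shell roughly as follows. This is an illustrative sketch, not from the PR: the exact DDL keywords vary across Beam versions, and the project, dataset, and table names here are made up.

```sql
-- Hypothetical table registration and read; names are illustrative.
CREATE EXTERNAL TABLE users (id BIGINT, name VARCHAR)
TYPE bigquery
LOCATION 'my-project:my_dataset.users';

SELECT id, name FROM users;
```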


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

It will help us expedite review of your Pull Request if you tag someone (e.g. @username) to look at it.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Spark
Go Build Status --- --- --- --- ---
Java Build Status Build Status Build Status Build Status Build Status Build Status
Python Build Status --- Build Status
Build Status
--- --- ---

@amaliujia force-pushed the rui_wang-read_bigquery branch from 4e9683f to b4879e5 on June 29, 2018 08:14
@amaliujia (Contributor, Author):

R: @kennknowles
CC: @apilloud @akedin

@amaliujia force-pushed the rui_wang-read_bigquery branch from b4879e5 to fcd58d3 on June 29, 2018 17:34
@apilloud (Member) left a comment:

Haven't done a full review yet (it mostly looks good). Here are two early comments on code location.

* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.beam.sdk.extensions.sql.meta.provider.avro;
@apilloud (Member):

This is not the right location for this package. I'm not sure what the best location would be, but @reuvenlax says he knows.

@amaliujia (Contributor, Author) commented Jun 29, 2018:

I am putting this class here as a first step toward supporting Avro-format tables.

I am not sure whether utils that convert Avro to Beam Row should live somewhere else. I would also like to hear @reuvenlax's input.

return begin
.apply(
BigQueryIO.read(
new SerializableFunction<SchemaAndRecord, Row>() {
@apilloud (Member) commented Jun 29, 2018:

This should definitely live in sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtils.java

@amaliujia (Contributor, Author):

If I wrap it into BigQueryUtils.java, then AvroUtils.java should also move into the IO component, so that I don't add a dependency from IO on SQL (IO would need to call AvroUtils otherwise).

@apilloud (Member):

I believe AvroUtils.java is going to end up somewhere like org.apache.beam.sdk.schemas.utils, but until you hear from Reuven where it should go, the BigQuery IO is a good location.
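For reference, the diff hunk above shows the parse-function pattern: BigQueryIO.read(...) takes a serializable callback that converts each SchemaAndRecord into a Row. A minimal self-contained sketch of that pattern, with SerializableFn and FakeRecord as hypothetical stand-ins for Beam's SerializableFunction and SchemaAndRecord:

```java
import java.io.Serializable;

// Simplified sketch of the parse-function pattern used by BigQueryIO.read(...):
// the read takes a serializable callback that converts each source record into
// the pipeline's element type. SerializableFn is a hypothetical stand-in for
// Beam's org.apache.beam.sdk.transforms.SerializableFunction.
interface SerializableFn<InputT, OutputT> extends Serializable {
  OutputT apply(InputT input);
}

public class ParseFnSketch {
  // A toy "record": pretend it carries one string column.
  static final class FakeRecord {
    final String value;
    FakeRecord(String value) { this.value = value; }
  }

  // The parse function: record in, converted value out.
  static final SerializableFn<FakeRecord, Integer> PARSE_FN =
      record -> record.value.length();

  public static void main(String[] args) {
    System.out.println(PARSE_FN.apply(new FakeRecord("bigquery")));
  }
}
```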

@amaliujia force-pushed the rui_wang-read_bigquery branch from fcd58d3 to a11997e on June 29, 2018 19:15
@amaliujia (Contributor, Author) commented Jun 29, 2018:

I refactored the code a little to move it into the BigQuery IO. If there is a better place, I will be happy to refactor again.


/** Utils to help convert Apache Avro types to Beam types. */
public class AvroUtils {
  public static Object convertAvroFormat(Field beamField, Object value) throws RuntimeException {
Reviewer (Contributor):

so we don't support recursive rows or array types?

@amaliujia (Contributor, Author):

Yes, because I didn't find a nested structure on Schema.Field. To convert data in Avro format, the Beam type (INT32, etc.) has to be known, but Schema.Field does not provide a recursive structure to get that type information.

Reviewer (Contributor):

I'm not sure I understand - Schema.Field can be of type ROW, in which case the field has a nested schema.

@amaliujia (Contributor, Author):

Can we get at the STRING inside ROW&lt;ROW&lt;ROW&lt;STRING&gt;&gt;&gt; from Schema.Field?

@amaliujia (Contributor, Author):

Oh, actually you are right. I missed this:

    // For ROW types, returns the schema for the row.
    @Nullable
    public abstract Schema getRowSchema();

I can indeed get the nested schema/fields.

@kennknowles (Member):

It is better to get something working and then revise. So I think it is OK to start with non-nested data, though getting the nested stuff done is mandatory too.
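The recursive descent discussed above (using getRowSchema() to reach types nested inside ROW fields) could be sketched roughly as follows. FieldSketch here is a hypothetical, simplified stand-in for Beam's Schema.Field, not the real API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of a schema field: either a primitive type
// name (e.g. "STRING", "INT64") or a ROW carrying a nested list of sub-fields,
// mirroring how Beam's Schema.Field exposes getRowSchema() for ROW fields.
public class NestedSchemaSketch {
  static final class FieldSketch {
    final String typeName;             // "STRING", "INT64", or "ROW"
    final List<FieldSketch> rowSchema; // non-null only when typeName is "ROW"

    FieldSketch(String typeName, List<FieldSketch> rowSchema) {
      this.typeName = typeName;
      this.rowSchema = rowSchema;
    }
  }

  // Recursively collect the primitive leaf types of a (possibly nested) field,
  // e.g. ROW<ROW<STRING, INT64>> yields [STRING, INT64].
  static List<String> leafTypes(FieldSketch field) {
    List<String> result = new ArrayList<>();
    if ("ROW".equals(field.typeName)) {
      for (FieldSketch sub : field.rowSchema) {
        result.addAll(leafTypes(sub)); // descend into the nested schema
      }
    } else {
      result.add(field.typeName);      // primitive leaf: record its type
    }
    return result;
  }

  public static void main(String[] args) {
    FieldSketch nested =
        new FieldSketch("ROW", Arrays.asList(
            new FieldSketch("ROW", Arrays.asList(
                new FieldSketch("STRING", null),
                new FieldSketch("INT64", null)))));
    System.out.println(leafTypes(nested)); // [STRING, INT64]
  }
}
```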

@amaliujia (Contributor, Author):

run java postcommit

@amaliujia (Contributor, Author):

@kennknowles can you take a look?

@kennknowles (Member):

Looking now.

@kennknowles (Member):

LGTM. Is there a JIRA about adding complex types?

@kennknowles kennknowles merged commit a58f1ff into apache:master Jul 2, 2018
@amaliujia amaliujia deleted the rui_wang-read_bigquery branch July 2, 2018 18:05
@amaliujia (Contributor, Author):

Yep, a JIRA was created to track it: https://issues.apache.org/jira/browse/BEAM-4710

@lukecwik (Member) commented Jul 6, 2018.

@apilloud (Member) commented Jul 6, 2018:

I think that is what #5892 is trying to address?
