[BEAM-4601][SQL] Support BigQuery read from SQL. #5830
kennknowles merged 1 commit into apache:master
Conversation
Force-pushed from 4e9683f to b4879e5
R: @kennknowles
Force-pushed from b4879e5 to fcd58d3
apilloud left a comment:
Haven't done a full review yet (it mostly looks good). Here are two early comments on code location.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.beam.sdk.extensions.sql.meta.provider.avro;
This is not the right location for this package. I'm not sure what the best location would be, but @reuvenlax says he knows.
I am putting this class here as a first step toward supporting Avro-format tables.
I'm not sure whether utils that convert Avro to Beam Row should live somewhere else. I'd also like to hear if @reuvenlax has input.
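For readers following along, the conversion being discussed can be sketched in plain Java. This is an illustrative, simplified stand-in (the class, enum, and method names below are hypothetical, not Beam's actual AvroUtils API): Avro decodes strings as org.apache.avro.util.Utf8, which is a CharSequence rather than a String, so STRING fields need an explicit toString(), while numeric primitives already match the Java types a Beam Row field expects.

```java
/**
 * Illustrative sketch only -- a simplified stand-in for the AvroUtils idea,
 * not Beam's actual API. Avro returns strings as a CharSequence
 * (org.apache.avro.util.Utf8), so STRING fields need toString();
 * numeric primitives map directly to the expected Java types.
 */
public class AvroConvertSketch {
  enum FieldType { INT32, INT64, DOUBLE, STRING }

  static Object convertAvroFormat(FieldType beamType, Object avroValue) {
    switch (beamType) {
      case STRING:
        return avroValue.toString(); // Utf8 (CharSequence) -> String
      case INT32:
      case INT64:
      case DOUBLE:
        return avroValue; // primitives pass through unchanged
      default:
        throw new UnsupportedOperationException("Unsupported type: " + beamType);
    }
  }

  public static void main(String[] args) {
    // A StringBuilder stands in for Avro's Utf8 here: both are CharSequence.
    CharSequence utf8Like = new StringBuilder("hello");
    System.out.println(convertAvroFormat(FieldType.STRING, utf8Like)); // hello
    System.out.println(convertAvroFormat(FieldType.INT64, 42L));       // 42
  }
}
```

Note that without a recursive structure in the field type, this flat switch cannot descend into nested rows or arrays, which is exactly the limitation raised later in this review.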
return begin
    .apply(
        BigQueryIO.read(
            new SerializableFunction<SchemaAndRecord, Row>() {
This should definitely live in sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtils.java
If I should wrap it into BigQueryUtils.java, then AvroUtils.java should also live in the IO component, so I don't need to add a dependency between SQL and IO (since IO will need to call AvroUtils otherwise).
I believe AvroUtils.java is going to end up in something like org.apache.beam.sdk.schemas.utils, but until you hear from Reuven where it should go, the BigQuery IO is a good location.
Force-pushed from fcd58d3 to a11997e
I refactored the code a little to move it into the BigQuery IO. If there is a better place, I will be happy to refactor again.
/** Utils to help convert Apache Avro types to Beam types. */
public class AvroUtils {
  public static Object convertAvroFormat(Field beamField, Object value) throws RuntimeException {
so we don't support recursive rows or array types?
Yes, because I didn't find a nested structure in Schema.Field. In order to convert data in Avro format, the Beam type (INT32, etc.) has to be known. However, Schema.Field does not expose a recursive structure to get that type information.
I'm not sure I understand - Schema.Field can be of type ROW, in which case the field has a nested schema.
Can we get the STRING out of ROW<ROW<ROW<STRING>>> from Schema.Field?
Oh, actually you are right, I missed this:
// For ROW types, returns the schema for the row.
@Nullable
public abstract Schema getRowSchema();
So I can indeed get the nested schema/fields.
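Given a getRowSchema()-style hook, the recursion is straightforward. The sketch below is a hypothetical, self-contained model (its Field class and TypeName enum are stand-ins, not Beam's Schema classes): a field that carries a nested row schema lets a converter walk ROW<ROW<ROW<STRING>>> all the way down to the leaf STRING type.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only -- a hypothetical model of a schema field that,
 * like Beam's Schema.Field with getRowSchema(), carries a nested schema
 * when its type is ROW. This is what makes recursive conversion possible.
 */
public class NestedSchemaSketch {
  enum TypeName { STRING, INT64, ROW }

  static class Field {
    final String name;
    final TypeName type;
    final List<Field> rowSchema; // non-null only when type == ROW

    Field(String name, TypeName type, List<Field> rowSchema) {
      this.name = name;
      this.type = type;
      this.rowSchema = rowSchema;
    }
  }

  /** Collects the leaf (non-ROW) types reachable from a field, depth-first. */
  static void collectLeafTypes(Field field, List<TypeName> out) {
    if (field.type == TypeName.ROW) {
      for (Field nested : field.rowSchema) {
        collectLeafTypes(nested, out); // recurse into the nested row schema
      }
    } else {
      out.add(field.type);
    }
  }

  public static void main(String[] args) {
    // Build ROW<ROW<ROW<STRING>>>
    Field leaf = new Field("s", TypeName.STRING, null);
    Field inner = new Field("r2", TypeName.ROW, List.of(leaf));
    Field mid = new Field("r1", TypeName.ROW, List.of(inner));
    Field top = new Field("r0", TypeName.ROW, List.of(mid));

    List<TypeName> leaves = new ArrayList<>();
    collectLeafTypes(top, leaves);
    System.out.println(leaves); // [STRING]
  }
}
```

The same depth-first pattern would drive a real converter: on a ROW field, recurse with the nested schema; on a leaf field, apply the flat type conversion.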
It is better to get something working and then revise, so I think it is OK to start with non-nested data, though getting the nested stuff done is mandatory too.
run java postcommit

@kennknowles can you take a look?

Looking now.

LGTM. Is there a JIRA about adding complex types?

Yep, a JIRA was created to track it: https://issues.apache.org/jira/browse/BEAM-4710

org.apache.beam.sdk.extensions.sql.meta.provider.bigquery.BigQueryReadWriteIT.testSQLRead fails a lot:

I think that is what #5892 is trying to address?
Add BigQuery read from SQL. Beam SQL can read all SQL types that Beam SQL can write.