Skip to content

[BEAM-2436] table is not regeisted in BeamSql.query#3342

Closed
mingmxu wants to merge 1 commit intoapache:DSL_SQLfrom
mingmxu:BEAM-2436
Closed

[BEAM-2436] table is not regeisted in BeamSql.query#3342
mingmxu wants to merge 1 commit intoapache:DSL_SQLfrom
mingmxu:BEAM-2436

Conversation

@mingmxu
Copy link

@mingmxu mingmxu commented Jun 10, 2017

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 0950a8d on XuMingmin:BEAM-2436 into ** on apache:DSL_SQL**.

PCollection<BeamSQLRow> outputStream = inputTable.apply(BeamSql.simpleQuery(sql));
//Case 1. run a simple SQL query over input PCollection with BeamSql.simpleQuery;
PCollection<BeamSQLRow> outputStream = inputTable.apply(
BeamSql.simpleQuery("select c2, c3 from TABLE_A where c1=1"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage of BeamSql.simpleQuery seems a little weird: user have never specified the table name: TABLE_A.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simpleQuery is added for single table SQL. As it's not possible to name a PCollection here, it's automately set to table name used in query.

This brings a potential issue of table name conflict, the solution may be limiting the scope of table schema, will open a new task to talk.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's talk further in the new task.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to James' concerns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its extremely convenient to not have to name your tables in the simpleQuery approach.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@takidau as talking about the design of interface, the default method is public static PTransform<PCollectionTuple, PCollection<BeamSqlRow>> query(String sqlQuery), which relies on the named TupleTag to specify table name.

public static PTransform<PCollection<BeamSqlRow>, PCollection<BeamSqlRow>> simpleQuery(String sqlQuery) is a special case which runs on single table/PCollection. There's no exiting method to name it, so the table name in query is take as granted.

For both methods, potentially the table name would be mixed up, that's why I said a further task is needed to have a separated schema namespace for each query.

@lukecwik any comments?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lukecwik do you mean pcollection.simpleQuery('SELECT C1, C2'), to replace pcollection.simpleQuery('SELECT C1, C2 FROM TABLE_NAME')?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xumingmin Having the ability to use 'SELECT C1, C2 FROM TABLE_NAME' or 'SELECT C1, C2' doesn't matter to me. What matters to me is that they don't need to use a PCollectionTuple with a single PCollection and TupleTag pair.

I was thinking that the whole idea of global registration whenever someone calls query or simpleQuery will impact the users future pipeline construction depending on the order in which they apply parts of their pipeline is not a good idea. I can see how its useful in a CLI where they aren't building a pipeline programmatically but should be limited to paths which the CLI code handles.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, a global schema namespace doesn't sound good in DSL, would address the issue in a new task.

@xumingming
Copy link
Contributor

LGTM

@mingmxu
Copy link
Author

mingmxu commented Jun 12, 2017

rebase to fix conflict.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling e859df4 on XuMingmin:BEAM-2436 into ** on apache:DSL_SQL**.

@mingmxu
Copy link
Author

mingmxu commented Jun 13, 2017

@lukecwik @takidau any pending requests here?
If not, could you help to merge it?

@lukecwik
Copy link
Member

I don't think you want to register tables with BeamSqlEnv within expand(). BeamSqlEnv is currently the global namespace.

@mingmxu
Copy link
Author

mingmxu commented Jun 13, 2017

As discussed above, there should be a BeamSqlEnv per query, to limit the scope of resisted tables. I would prefer to do it in another task, as it impacts both BeamSql and BeamSqlCli.

@lukecwik
Copy link
Member

lukecwik commented Jun 14, 2017

SGTM
Merged

asfgit pushed a commit that referenced this pull request Jun 14, 2017
@mingmxu
Copy link
Author

mingmxu commented Jun 14, 2017

Thanks @lukecwik, @xumingming

Created BEAM2446 to track the next step.

@mingmxu mingmxu closed this Jun 14, 2017
@mingmxu mingmxu deleted the BEAM-2436 branch June 14, 2017 00:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants