Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-6861] Add support to specify a query in CassandraIO #8090

Closed
wants to merge 5 commits into from

Conversation

stankiewicz
Copy link
Contributor

@stankiewicz stankiewicz commented Mar 19, 2019

CassandraIO.Read currently selects all fields and maps them to POJO.
With this change it is possible to select only subset of fields or added computed fields to allow reading things like write Timestamp or other functions.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- --- --- --- ---
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Python Build Status
Build Status
--- Build Status
Build Status
Build Status --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@stankiewicz stankiewicz changed the title [BEAM-XXXX] Select fields and computed fields in CassandraIO.Read [BEAM-6861] Select fields and computed fields in CassandraIO.Read Mar 19, 2019
@stankiewicz
Copy link
Contributor Author

@iemejia - this is very simple PR, would you have time check it? Or can you recommend someone who would have time?
It's already green.

Copy link
Member

@iemejia iemejia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm I hadn't seen how this has evolved. It is not good to keep adding more options and methods to the main API for something that in the end is only building a query. Maybe worth to take a look if we can generalize this to allow to pass the full query akin to how JdbcIO does withQuery(String query) and refactor the existing code to use this, we can also provide some external builder class to build the query (we are doing this in a recent PR on MongoDbIO), but well this last part (the helper builder) is probably worth a different PR.

@@ -483,6 +508,11 @@ private CassandraIO() {}
.collect(Collectors.joining(","));

List<BoundedSource<T>> sources = new ArrayList<>();
String selectFields = "*";
if (spec.selectFields() != null) {
selectFields = String.join(",", spec.selectFields().get());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please ensure that this produces a valid string e.g. "*" if selectFields is an empty List. I know this won't happen because of the precondition but better to make the code more robust.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

for (Scientist sci : input) {
assertNotNull(sci.id);
assertNull(sci.name);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you read nameTs here? Since objects are mapped I was wondering if it will be zero, probably worth asserting that to validate that the values were not 'filled' (given that they were not projected).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nameTs will be null as it is not returned by Row so it's not mapped.
Agree on assert.

}

@Test
public void testSelectFieldsWithFunction() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this method, it is almost identical to the previous one or is there something I am missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is testing UDFs (writetime(person_name)), but I can merge this into one test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes better do only one test

@iemejia
Copy link
Member

iemejia commented Mar 20, 2019

@stankiewicz WDYT about the withQuery idea, I think we need to better generalize this to have a nicer API and don't add new methods everytime a new requirement comes into place.

@stankiewicz
Copy link
Contributor Author

stankiewicz commented Mar 20, 2019

@iemejia , thanks for feedback - challenge with withQuery in JdbcIO is that it is expected to be full query, like select a from b or select a from b where c='d'.
In CassandraIO select is generated so it can be split based on key ranges so 'where' part is partially dynamic so where part is set via separate field and I want select fields also set via separate field/list of fields.
That's why external (generic to all APIs) builder can be challenge to support every case, unless it gives placeholder for each of those (select fields, where fields).
So I generally agree about cleaning it up but I don't think I could generalize mentioned MongoDB (BEAM-6241) or JdbcIO and reuse it in CassandraIO query builder because they are totally different but definitely I could try to create such builder for Cassandra that will gather projection/select fields and filter and will produce query for CassandraIO

@iemejia
Copy link
Member

iemejia commented Mar 20, 2019

Select is generated in Cassandra only to add the ranges no? I mean I don't see a huge difference with MongoDbIO who does something similar. And about the field selection well it could be more user friendly to have the method, but it also adds more maintenance work that will be completely fixed just by supporting the full query, We will even support unforeseen cases.

@stankiewicz
Copy link
Contributor Author

stankiewicz commented Mar 20, 2019

so what do you thing about withQuery like in sqoop where user have to provide $CONDITIONS tag so library can insert range filters?

I'm happy to rewrite it with such approach to have full select statement with placeholder for internal filters and I will support old where method as well and mark it as deprecated.

@stankiewicz
Copy link
Contributor Author

to summarise change - if user provides withQuery, I will support where field, otherwise I will produce query based on schema, table name and where field (if exists).
WDYT @iemejia ?

@iemejia
Copy link
Member

iemejia commented Mar 20, 2019

if user provides withQuery we respect the user query, otherwise we build it (as it is the case right now). To test that the select fields works, we provide a SELECT with specific fields using the new withQuery and we don't add an specific method for the list of fields.

@stankiewicz
Copy link
Contributor Author

ok, @iemejia, I've done the change. Still I fill it's not perfect with this $CONDTITONS part but I don't find more ellegant way.
In JdbcIO it's not there causing IO to be not scalable across many machines - one worker is doing the query. In Sqoop it was possible to split it based on same mechanism like in CassandraIO but for custom query it required the trick with $CONDTITONS.

@srfrnk
Copy link
Contributor

srfrnk commented Mar 20, 2019

Sorry for barging in on this.
If there is a need for a more robust query building mechanism we might consider using Datastux QueryBuilder.
I tryed to go for that approach when I added withWhere but since their classes are not Serializable I could not.
I attempted to fix this problem here but it was rejected.
Maybe with enough influence we can manage to make the change?
If not maybe this repo can be forked into Beam and so a working QueryBuilder can be used without having to rebuild it?

WDYT?

@stankiewicz
Copy link
Contributor Author

Hi @srfrnk!
I don't like the idea of forking driver as we will end up maintaining it.
I haven't thought about it but I like the idea of programmatic generation of query instead of doing String joins and formats. Challenge here is what we want to expose instead of withWhere() or withQuery() - withQueryBuilder()? If yes it would be nice alternative because we can then just add key ranges dynamically. But then the challenge is how to support value provider for it.

If we figure out serialisation then we can clean this up a little bit. In the meantime I would try to merge this.

@stankiewicz
Copy link
Contributor Author

@iemejia , WDYT about current implementation?

@iemejia
Copy link
Member

iemejia commented Mar 25, 2019

@stankiewicz Sorry for slow review, I will check ASAP probably today or tuesday at latest. Thanks.

@stankiewicz
Copy link
Contributor Author

@iemejia , no problem :) Thanks for all the effort!

@iemejia
Copy link
Member

iemejia commented Mar 26, 2019

I am reviewing this. Mostly LGTM, I will probably do the missing minor fixes and merge, will confirm later on. Thanks.

@stankiewicz
Copy link
Contributor Author

@iemejia - thanks a lot!

@iemejia iemejia changed the title [BEAM-6861] Select fields and computed fields in CassandraIO.Read [BEAM-6861] Add support to specify a query in CassandraIO Mar 27, 2019
@stankiewicz
Copy link
Contributor Author

@iemejia , Ismael, any chance closing that before cut-off or are we already too late? Thanks, Radek

Copy link
Member

@iemejia iemejia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Merging manually to do some final touches.

@iemejia
Copy link
Member

iemejia commented Mar 27, 2019

Thanks a lot @stankiewicz !

iemejia added a commit that referenced this pull request Mar 27, 2019
@iemejia iemejia closed this Mar 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants