How to specify or change configuration #13

Open
DALDEI opened this issue Feb 29, 2020 · 4 comments

Comments

@DALDEI

DALDEI commented Feb 29, 2020

I cannot find how to specify or change the configuration, specifically the loading strategy.
All the members are private and only a few mutators are exposed.
The only place I can see to override is in the Connection() constructor -- but my existing app uses libraries that call the generic JDBC connect().
Would be useful to either expose all the configuration properties or to allow a global override of the factory.

@iconara
Collaborator

iconara commented Mar 2, 2020

Currently the loading strategy is not configurable. We've prepared for it to be configurable, that's why it's in ConnectionConfiguration. We just haven't figured out whether or not it should be configurable, and if it's going to be configurable, should it be possible to switch between the current default (bypass GetQueryResults and load straight from S3) and the GetQueryResults strategy, or if you're supposed to be able to plug in your own strategies.

The same goes for the polling strategy, which could be useful to be able to configure, but also useful to be able to replace with your own implementation in some cases.

What's your use case? Do you want to switch to using GetQueryResults or do you want to provide your own loading strategy?

@DALDEI
Author

DALDEI commented Apr 23, 2020

First, it is still unclear to me what the 'current' loading strategy is -- the docs and the implementation seem misaligned. When I went to look, I discovered it's not changeable.
What led to this was trying to debug why results did not seem to be cached even though I had set the query token.

`By setting a client request token on a query execution you can make Athena reuse a previous result set if the exact same query has already been run. If you run the same query multiple times this can save money and improve performance.`

I was rerunning the same query and specifying the query token, but it still took as long.
What I believe now is that I (and maybe you?) misunderstood what this does.
It's not, as it first seemed, a way to retrieve the results of a prior query from a cache. Rather, it seems more like a deduplication mechanism, much like SQS receipt tokens, to prevent accidentally issuing the same query twice when a failure on the client side hides the fact that the server side had succeeded.
In none of the cases I tried does it do what I hoped: bypass the query and go right to the results from the last one that worked.

So, on the path to see if I could implement that, I thought maybe if I could get the S3 URL of the results then I could fetch them myself. And yes, I can, but it's not so useful without the metadata, and I don't want to parse that when you have a parser already.
But I can't use it without also issuing a new query...

So, back to step one: there is no obvious way to cache query results without making another copy first. I was hoping to deduce a way to reuse the polling or results code via the various configurable features, but found that in fact they are not configurable.
But that's where I stopped. I don't know if it is a short step or a long hike from there to being able to reuse previous query results, and didn't see an easy way to tell.


So, to answer your question: my use case is neither, although it might be both if they were configurable. Hard to tell.

@grddev
Contributor

grddev commented Apr 28, 2020

> First it is unclear to me still what the 'current' loadings strategy is -- the docs and the implementation seem misaligned.

If there is a misalignment, it would be great to fix it. As far as I can see, though, the documentation specifies that the default loading strategy is to load from S3, bypassing the Athena API, and the implementation selects the S3 loading strategy, which implements this behavior.

> By setting a client request token on a query execution you can make Athena reuse a previous result set if the exact same query has already been run. If you run the same query multiple times this can save money and improve performance.

> I was rerunnign the same query and specifying the query token but it still took as long. What I believe now is I (and maybe you ? ) misunderstood what this does.
> Its not as it at first seemed -- a way to retrieve the results from cache from a prior query -- rather it seems more like a deduplication solution much like SQS receipt tokens -- to prevent accidentally issuing the same query due to failure on the client side when the server side had succeeded.
> IN none of the cases I tried does it do what i hoped and bypass the query and go right to the results from the last one that worked.

The feature is indeed intended, and described, as a way to ensure exactly-once processing, but we are actively using it for the caching benefit described in the documentation. Arguably, the README could be a bit clearer on what the original purpose of the token is, and that it can also be used for caching. We could probably also make it clearer that you must provide the same token every time you execute the query, including the first time. The way it is written now, it sounds as if you could provide a token to gain access to a previously executed query, which is not true.
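
One way to satisfy the "same token every time, including the first" requirement is to derive the token from the query text itself, so that no state has to be kept between runs. The following is only a minimal sketch under that assumption: `RequestTokens` and `forQuery` are hypothetical names, not part of the driver, and how the token is handed to the driver is left out. The result is a 64-character hex string; whether that fits Athena's token length limits should be checked against the Athena API reference.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not part of the driver): derives a stable
// client request token from the SQL text alone.
final class RequestTokens {
    private RequestTokens() {}

    // Returns the lowercase hex SHA-256 digest of the SQL text.
    // Identical SQL always yields the identical token, so the first
    // execution and every later one present the same token.
    static String forQuery(String sql) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(sql.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // Mask to avoid sign extension of negative bytes.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Note that any difference in the SQL text, even whitespace, produces a different token, which matches the "exact same query" wording quoted above.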

@dbarvitsky

Here is a specific use case for overloading configuration:

The application must assume a specific role to access Athena and S3 (one that is different from the default role the process is running with).

The way to make it sort of work with 4.0:

  • create a custom class io.burt.athena.configuration.CustomConnectionConfigurationFactory extending ConnectionConfigurationFactory, overriding the createConnectionConfiguration method, and inlining the ConnectionConfiguration interface there.
  • create a custom class io.burt.athena.CustomDataSource extending AthenaDataSource that takes a ConnectionConfigurationFactory as an argument and passes it to the super constructor.
  • now you can create a CustomDataSource instead of an AthenaDataSource and pass your custom connection configuration to it.
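
The shape of the workaround in the steps above can be sketched roughly as follows. This is not the library's code: every `Sketch...` type is a hypothetical stand-in written only to make the example self-contained, the `roleArn` property is an assumption chosen to match the role use case, and the real signatures of ConnectionConfigurationFactory, ConnectionConfiguration and AthenaDataSource may well differ.

```java
// Hypothetical stand-in for the configuration interface.
interface SketchConfiguration {
    String roleArn();
}

// Hypothetical stand-in for the default factory.
class SketchConfigurationFactory {
    SketchConfiguration createConnectionConfiguration() {
        return () -> "default";
    }
}

// Hypothetical stand-in for the data source; assumes a constructor
// that accepts a factory is reachable from subclasses.
class SketchDataSource {
    private final SketchConfigurationFactory factory;

    SketchDataSource() {
        this(new SketchConfigurationFactory());
    }

    protected SketchDataSource(SketchConfigurationFactory factory) {
        this.factory = factory;
    }

    SketchConfiguration configuration() {
        return factory.createConnectionConfiguration();
    }
}

// Step 1: a factory subclass that overrides the creation method and
// implements the configuration inline.
class CustomSketchConfigurationFactory extends SketchConfigurationFactory {
    private final String roleArn;

    CustomSketchConfigurationFactory(String roleArn) {
        this.roleArn = roleArn;
    }

    @Override
    SketchConfiguration createConnectionConfiguration() {
        return () -> roleArn;
    }
}

// Step 2: a data source subclass that passes the factory to super.
class CustomSketchDataSource extends SketchDataSource {
    CustomSketchDataSource(SketchConfigurationFactory factory) {
        super(factory);
    }
}
```

Step 3 then amounts to constructing `new CustomSketchDataSource(new CustomSketchConfigurationFactory(someRoleArn))` wherever the application would otherwise create the default data source.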

The default Athena driver, unfortunately, auto-registers itself with the default configuration on class load and therefore leaves no opportunity to inject a custom configuration. The original non-open-source Athena driver sort of dealt with this problem by having a configuration parameter that holds the fully-qualified name of a class that does the configuration work. I'd argue this is pretty nasty and not a good way to do these things. There are many ways of dealing with configuration injection here, but none of them are decent. I'd say a half-bad solution would be to have a base non-self-registering driver.
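
To make the "base non-self-registering driver" idea concrete, here is a minimal sketch using only the standard `java.sql.Driver` and `DriverManager` APIs. The class names, the `jdbc:sketch:` URL prefix and the `configLabel` field are hypothetical placeholders, not the project's actual driver; `connect` is deliberately left unimplemented.

```java
import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Properties;
import java.util.logging.Logger;

// Base driver: carries its configuration and does NOT register itself,
// so an application can construct it with custom settings and register
// that instance explicitly.
class BaseSketchDriver implements Driver {
    static final String URL_PREFIX = "jdbc:sketch:"; // hypothetical prefix
    private final String configLabel;                // placeholder for real configuration

    BaseSketchDriver(String configLabel) {
        this.configLabel = configLabel;
    }

    String configLabel() {
        return configLabel;
    }

    @Override
    public boolean acceptsURL(String url) {
        return url != null && url.startsWith(URL_PREFIX);
    }

    @Override
    public Connection connect(String url, Properties info) throws SQLException {
        if (!acceptsURL(url)) {
            return null; // JDBC contract: return null for URLs of other drivers
        }
        throw new SQLFeatureNotSupportedException("sketch only: no real connection");
    }

    @Override
    public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) {
        return new DriverPropertyInfo[0];
    }

    @Override
    public int getMajorVersion() { return 0; }

    @Override
    public int getMinorVersion() { return 1; }

    @Override
    public boolean jdbcCompliant() { return false; }

    @Override
    public Logger getParentLogger() throws SQLFeatureNotSupportedException {
        throw new SQLFeatureNotSupportedException("no java.util.logging support");
    }
}

// Thin subclass that keeps the convenient auto-registration with
// default settings for users who do not need to customize anything.
final class SelfRegisteringSketchDriver extends BaseSketchDriver {
    static {
        try {
            DriverManager.registerDriver(new SelfRegisteringSketchDriver());
        } catch (SQLException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    SelfRegisteringSketchDriver() {
        super("default");
    }
}
```

With this split, code that goes through the generic JDBC `DriverManager` path could call `DriverManager.registerDriver(new BaseSketchDriver(...))` once at startup with its own configuration, while the self-registering subclass preserves today's zero-setup behavior.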

LMK if you want an MR for this.
