Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk API in High Level Rest Client does not support global parameters #26026

Closed
dadoonet opened this issue Aug 2, 2017 · 18 comments · Fixed by #34528
Closed

Bulk API in High Level Rest Client does not support global parameters #26026

dadoonet opened this issue Aug 2, 2017 · 18 comments · Fixed by #34528
Assignees
Labels
good first issue low hanging fruit

Comments

@dadoonet
Copy link
Member

dadoonet commented Aug 2, 2017

When using the REST API we can use:

POST test/doc/_bulk?pipeline=xyz
{ "index" : {  } }
{ "field1" : "value1" }
{ "index" : {  } }
{ "field1" : "value1" }

But with the High Level REST API, this global pipeline option is not available.

Which means that we must use:

POST test/doc/_bulk
{ "index" : { "pipeline": "xyz" } }
{ "field1" : "value1" }
{ "index" : { "pipeline": "xyz" } }
{ "field1" : "value1" }

Which will lead to have much more data having to pass on the network.

I suggest that we add a field pipeline in BulkRequest class. The same is true for default index and type.

@javanna
Copy link
Member

javanna commented Aug 3, 2017

This is a side-effect of reusing java API objects in the high-level REST client, and it applies also to default index, type and routing value. These parameters are applied at parsing time, when the request comes in as bytes, that go through a parsing phase. In that parsing phase the default values are copied to each items that don't have specific values set.

When providing the items as proper request objects (java API or high-level client), such parameters need to be specified as part of each item directly. The other option would be to make them all settable to the BulkRequest (and also BulkProcessor) but then when setting an item we would have to check and eventually modify its state, which doesn't seem like a great idea.

Also, users have been using the java API this way for a while, hence this should not cause problems and the migration is straight-forward. We should probably consider making these changes if we'll ever move away from sharing classes with Elasticsearch core.

@dadoonet
Copy link
Member Author

dadoonet commented Aug 3, 2017

but then when setting an item we would have to check and eventually modify its state, which doesn't seem like a great idea.

Isn't that the same level of complexity when we have to deal with index name or type? Like the following (untested though):

POST foo/bar/_bulk
{ "index" : { "_index": "test", "_type": "doc" } }
{ "field1" : "value1" }
{ "index" : { } }
{ "field1" : "value1" }

Also, users have been using the java API this way for a while, hence this should not cause problems and the migration is straight-forward. We should probably consider making these changes if we'll ever move away from sharing classes with Elasticsearch core.

I agree that this can totally wait. That's more a concern for people like me who are moving from their own made REST Client to the official one as the REST API supports this global pipeline parameter. That said, this option does not seem to be documented anywhere. 😛

@javanna
Copy link
Member

javanna commented Aug 3, 2017

Isn't that the same level of complexity when we have to deal with index name or type?

yes, in fact that is not supported by BulkRequest either, as far as I can see.

@dadoonet
Copy link
Member Author

dadoonet commented Aug 3, 2017

Ha! Good point.

@tlrx
Copy link
Member

tlrx commented Aug 4, 2017

We discussed about this in Fix-it Friday and we agreed that it would be nice to be able to set a default pipeline/index/routing etc on BulkRequest similar to what the other clients are doing.

But now I've read again @javanna's comments I agree that we won't be able to do this with the core classes except by modifying their state (and copy them over) which is really not good.

I suggest that we add a field pipeline in BulkRequest class.

I don't think this is how we should do that. Instead we could add methods in the RestHighLevelClient that would allow to pass default values for the index/type/routing/pipeline/etc

RestHighLevelClient.bulk(BulkRequest bulkRequest, String defaultIndex, Header... headers)
RestHighLevelClient.bulk(BulkRequest bulkRequest, String defaultIndex, String defaultType, Header... headers)
RestHighLevelClient.bulk(BulkRequest bulkRequest, String defaultIndex, String defaultType, String defaultRouting, Header... headers)

The values could be passed to the org.elasticsearch.client.Request.bulk() method in charge of creating the request query string and body, and additional query parameters could be set there.

@javanna
Copy link
Member

javanna commented Aug 4, 2017

What is the problem that we are trying to solve here: ease of use or performance? If it is the latter it should be measured, rather than assuming it is a problem. The transport client always worked like this, although it uses the transport layer rather than the REST layer which is quite a difference. On the ease of use aspect, I agree this is not ideal, yet not a huge issue, something that transport client users got used to. I am against adding specialized methods to the high level client, as it will complicate things and set a precedent for doing the same in similar cases. We should take this into account as a reason to have our own request objects one day, or accept such limitations as a consequence of depending from Elasticsearch core and reusing its classes. It saves us a lot of work, but it does come with a cost.

@dadoonet
Copy link
Member Author

dadoonet commented Aug 4, 2017

What is the problem that we are trying to solve here?

To me, it's the potential network usage overhead. Not really about performance more about network "costs".
Might be a non brainer though.

The second thing is that I'd love that our Java REST Client exposes all options that we have in the REST API.
I'm fine BTW removing the support for /index/type/_bulk and only support /_bulk if this is what we want. I'd just like to have that consistent. I mean that the reference guide explains the REST API. I'd expect all same options from a Java (or whatever else) client).

@javanna
Copy link
Member

javanna commented Aug 4, 2017

I'd just like to have that consistent. I mean that the reference guide explains the REST API. I'd expect all same options from a Java (or whatever else) client).

I agree with this ideally, but we gave priority to migrating from transport client and reusing existing classes, this is the compromise we made and it's not possible to have 100% consistency at the same time. I think we will encounter more similar issues, and that is kind of expected.

@javanna javanna removed the :Bulk label Oct 26, 2017
@javanna
Copy link
Member

javanna commented Mar 22, 2018

Thinking more about this, we have encountered a few parameters that are only supported at REST, and the way we dealt with them was to add support for them to the request so that the high-level REST client could expose and use them, although the transport client doesn't do anything with them. We could do the same here by adding index, pipeline etc to BulkRequest and document that only the high-level REST client does something with them.

@javanna javanna added good first issue low hanging fruit help wanted adoptme and removed discuss labels Mar 22, 2018
@javanna javanna changed the title Bulk API in High Level Rest Client should support the pipeline option Bulk API in High Level Rest Client does not support global parameters Jun 7, 2018
@koalalam
Copy link

koalalam commented Jun 7, 2018

While the plan is to depreciate TransportClient and to use High Level REST client instead, any solid roadmap for this enhancement?

@javanna
Copy link
Member

javanna commented Jun 7, 2018

Note that transport client is already deprecated. This is something that we want to address, yet not high priority as there is a work-around, meaning that you can provide the info on each inner request like you'd do with the transport client.

@koalalam
Copy link

koalalam commented Jun 7, 2018

That will work only if this is not turned on: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/url-access-control.html

@pushpavanthar
Copy link

@javanna Has anyone already started working on this? If not can I pick this up?

@javanna
Copy link
Member

javanna commented Jun 25, 2018

@pushpavanthar nobody is working on this, feel free to take it. Thanks!

@pgomulka pgomulka removed the help wanted adoptme label Oct 3, 2018
@pgomulka pgomulka self-assigned this Oct 3, 2018
@pgomulka
Copy link
Contributor

pgomulka commented Oct 4, 2018

We could do the same here by adding index, pipeline etc to BulkRequest and document that only the high-level REST client does something with them.

@javanna indeed we could just add these to the bulk request, and then apply defaults when converting in RequestConverters, but index and type are missing it would fail validation per request (IndexRequeset.validation for instance)

@javanna
Copy link
Member

javanna commented Oct 10, 2018

@pgomulka I see the problem, the request gets converted after validation, and validate fails if we don't have all of the items with the required parameters set. Let's add some logic to RestHighLevelClient#bulk that goes through the items and sets the default values taken from the main request so that validate is happy? cc @hub-cap

@hub-cap
Copy link
Contributor

hub-cap commented Oct 15, 2018

im ++ to this. Thanks for adding good first issue, as this is a great first issue.

@pgomulka
Copy link
Contributor

pgomulka commented Oct 16, 2018

Things that need to be covered:

  1. mandatory global parameters: index, type. These I think is better to force user to provide during BulkRequest or BulkdProcessor creation. Later when a SubRequest is added without a required parameter, it can be checked and overridden with default global.
  2. optional parameter pipeline, routing
  3. global parameter should be overridden when the same parameter is provided on a SubRequest

1 & 2 & 3 should be supported when using with all following styles:

  • sync BulkRequest
  • async Bulk Request
  • BulkProcessor

pgomulka added a commit that referenced this issue Oct 30, 2018
Bulk Request in High level rest client should be consistent with what is
possible in Rest API, therefore should support global parameters. Global
parameters are passed in URL in Rest API.

Some parameters are mandatory - index, type - and would fail validation
if not provided before before the bulk is executed.
Optional parameters - routing, pipeline.

The usage of these should be consistent across sync/async execution,
bulk processor and BulkRequestBuilder

closes #26026
pgomulka added a commit to pgomulka/elasticsearch that referenced this issue Nov 19, 2018
Bulk Request in High level rest client should be consistent with what is
possible in Rest API, therefore should support global parameters. Global
parameters are passed in URL in Rest API.

Some parameters are mandatory - index, type - and would fail validation
if not provided before before the bulk is executed.
Optional parameters - routing, pipeline.

The usage of these should be consistent across sync/async execution,
bulk processor and BulkRequestBuilder

closes elastic#26026
pgomulka added a commit to pgomulka/elasticsearch that referenced this issue Nov 19, 2018
Bulk Request in High level rest client should be consistent with what is
possible in Rest API, therefore should support global parameters. Global
parameters are passed in URL in Rest API.

Some parameters are mandatory - index, type - and would fail validation
if not provided before before the bulk is executed.
Optional parameters - routing, pipeline.

The usage of these should be consistent across sync/async execution,
bulk processor and BulkRequestBuilder

closes elastic#26026
pgomulka added a commit that referenced this issue Nov 20, 2018
Bulk Request in High level rest client should be consistent with what is
possible in Rest API, therefore should support global parameters. Global
parameters are passed in URL in Rest API.

Some parameters are mandatory - index, type - and would fail validation
if not provided before before the bulk is executed.
Optional parameters - routing, pipeline.

The usage of these should be consistent across sync/async execution,
bulk processor and BulkRequestBuilder

closes #26026
backport of #34528
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue low hanging fruit
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants