Update API: update by query #1607

Closed
monken opened this Issue Jan 12, 2012 · 160 comments

monken commented Jan 12, 2012

#1583 allows updating individual documents. Update by query would radically reduce network round trips when you want to update a number of documents, and would push work from the client to ES.

curl -XPOST localhost:9200/index/type/_update -d '{
    "query" : { "constant_score" : { "filter" : { "term" : { "counter" : 0 } } } },
    "script" : "ctx._source.counter += count",
    "params" : {
        "count" : 4
    }
}'
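
To make the intended semantics of the request above concrete, here is a minimal in-memory sketch in Python. It is not Elasticsearch code: `update_by_query`, `counter_is_zero` and `add_count` are hypothetical names that stand in for the endpoint, the term filter and the script, and a plain list of dicts stands in for the index.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]


def update_by_query(docs: List[Doc],
                    matches: Callable[[Doc], bool],
                    script: Callable[[Doc, Dict[str, Any]], None],
                    params: Dict[str, Any]) -> int:
    """Apply `script` to every document that satisfies `matches`.

    Returns the number of documents that were updated.
    """
    updated = 0
    for doc in docs:
        if matches(doc):
            script(doc, params)
            updated += 1
    return updated


def counter_is_zero(doc: Doc) -> bool:
    # Stand-in for the term filter { "counter" : 0 }.
    return doc.get("counter") == 0


def add_count(doc: Doc, params: Dict[str, Any]) -> None:
    # Stand-in for the script "ctx._source.counter += count".
    doc["counter"] += params["count"]


docs = [{"counter": 0}, {"counter": 7}, {"counter": 0}]
updated = update_by_query(docs, counter_is_zero, add_count, {"count": 4})
print(updated, docs)
```

Without such a server-side operation, the client has to run the query itself and send one update per matching document, which is where the extra round trips come from.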
Member

Would really love this feature too!

r10r commented Jun 3, 2012

+1

darklow commented Jun 5, 2012

+1

gc20 commented Sep 25, 2012

+1

+1

+1

+1

Aoseala commented Jan 3, 2013

+1

timotta commented Jan 24, 2013

I really need this feature

burzum commented Feb 4, 2013

👍

Contributor
ofavre commented Feb 13, 2013

While waiting for this feature to be officially finished and released, I've packaged the pull request #2231 as a plugin: yakaz/elasticsearch-action-updatebyquery.
Have fun.

neogenix commented Mar 4, 2013

+1

scriby commented Mar 4, 2013

+1

gnurag commented Mar 7, 2013

+1

Contributor

+1

+1

+1

steegi commented Apr 11, 2013

+1

+1

oowl commented Apr 27, 2013

+1

ttghr commented May 15, 2013

+1

Contributor
acerb commented May 23, 2013

+1

qw3r commented Jun 5, 2013

👍 🙏

+1

@greatwitenorth greatwitenorth referenced this issue in parisholley/wordpress-fantastic-elasticsearch Jun 13, 2013
Closed

added ability to search by author display name #17

Is there a way to pass the score of the query as a parameter to the update script? I need to update entries with scores computed from the fields of their children.

+1

gboivin commented Jul 18, 2013

@scottc52 Did you manage to do it? I am also looking for a way to do this.

jstray commented Jul 18, 2013

+1

@gboivin Nope. I'm doing a has_child query and sending a separate update request, but it's slow.

Waiting for this feature too.

YannBrrd commented Aug 1, 2013

+1

khsibr commented Aug 1, 2013

+1

theorm commented Aug 14, 2013

+1

+1

Contributor

+1

scoolen commented Aug 27, 2013

+1

YannBrrd commented Sep 2, 2013

Just wrote a little script to help wait for something... more "production ready" ;-)

https://github.com/YannBrrd/esNodeUpdater

Feel free to comment/update...

Is there an official status on this feature from the dev team? I don't see any input from them. Are there plans to add this feature to the core or is the preference to have users use a plugin like the one listed above?

Owner
kimchy commented Sep 25, 2013

We plan to get back to this one. The main reason we put it on hold is that we need a way to stop running update by queries, as they can be executed by mistake on a large amount of data, causing problems...
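
One way to picture the "stop" requirement is cooperative cancellation: the update runs in batches and checks a flag between them. The sketch below is only an in-memory illustration under that assumption, not how Elasticsearch implements it; `run_update`, `after_batch` and the increment applied to each document are all hypothetical.

```python
import threading
from typing import Any, Callable, Dict, List, Optional

Doc = Dict[str, Any]


def run_update(docs: List[Doc],
               batch_size: int,
               cancel: threading.Event,
               after_batch: Optional[Callable[[int], None]] = None) -> int:
    """Increment `counter` on every document, one batch at a time.

    The cancel flag is checked before each batch, so a stop request
    takes effect at the next batch boundary. Returns the number of
    documents processed before finishing or being cancelled.
    """
    processed = 0
    for start in range(0, len(docs), batch_size):
        if cancel.is_set():
            break
        batch = docs[start:start + batch_size]
        for doc in batch:
            doc["counter"] = doc.get("counter", 0) + 1
        processed += len(batch)
        if after_batch is not None:
            after_batch(processed)
    return processed


docs = [{"counter": 0} for _ in range(10)]
cancel = threading.Event()

# Simulate an operator noticing the mistake right after the first batch.
processed = run_update(docs, 3, cancel, after_batch=lambda n: cancel.set())
print(processed, [d["counter"] for d in docs])
```

Note that the documents touched before the cancellation stay updated; stopping does not roll anything back.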

@martijnvg martijnvg closed this Sep 25, 2013
@martijnvg martijnvg reopened this Sep 25, 2013

+1. Thanks for the update and working on this.

kapso commented Oct 16, 2013

+1

+1

+1

+1, sounds useful

olsp commented Nov 6, 2013

+1

pionize commented Nov 8, 2013

+1

+1

+1

mishu- commented Dec 9, 2013

+1

MrHash commented Dec 11, 2013

+2

+1

Plasma commented Dec 28, 2013

+1

+1

vaidik commented Jan 10, 2014

+1

shulard commented Jan 23, 2014

Have you ever thought of implementing this feature with a double HTTP call? I'm thinking of warmers, which make it possible to store a query and then execute it (it's not really the same thing, but it made me think of this).

@kimchy you said you are looking for a way to stop the update if it was launched on a large amount of data by mistake. If you stop it, the indexed data may be left in an invalid state (maybe a rollback is possible...?). Perhaps a better approach would be to prevent the mistake.

You could require two HTTP calls before triggering the real mass update (one to prepare it and one to actually trigger it, with a transaction id passed between them), and then provide an update status handler (like the DataImportHandler in Solr) to know when the query is really done.

I'm not sure I'm being really clear, but I think it could be a solution to prevent mistaken calls...
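
The prepare/trigger idea described above can be sketched as a small in-memory model. Everything here is an assumption made for illustration: the `TwoStepUpdater` class, its `prepare` and `trigger` methods and the status dictionary are hypothetical and do not correspond to any Elasticsearch API.

```python
import uuid
from typing import Any, Callable, Dict, List, Tuple

Doc = Dict[str, Any]


class TwoStepUpdater:
    """Mass update that only runs after an explicit second call."""

    def __init__(self, docs: List[Doc]) -> None:
        self.docs = docs
        self.pending: Dict[str, Tuple[Callable[[Doc], bool],
                                      Callable[[Doc], None]]] = {}
        self.status: Dict[str, Dict[str, Any]] = {}

    def prepare(self, matches: Callable[[Doc], bool],
                script: Callable[[Doc], None]) -> Dict[str, Any]:
        """First call: store the update and report how many docs match."""
        token = uuid.uuid4().hex
        matched = sum(1 for doc in self.docs if matches(doc))
        self.pending[token] = (matches, script)
        self.status[token] = {"state": "prepared",
                              "matched": matched, "updated": 0}
        return {"token": token, "matched": matched}

    def trigger(self, token: str) -> Dict[str, Any]:
        """Second call: run the stored update; a token works only once."""
        matches, script = self.pending.pop(token)  # KeyError if unknown
        updated = 0
        for doc in self.docs:
            if matches(doc):
                script(doc)
                updated += 1
        self.status[token].update(state="done", updated=updated)
        return self.status[token]


docs = [{"counter": 0}, {"counter": 5}]
updater = TwoStepUpdater(docs)
prep = updater.prepare(lambda d: d["counter"] == 0,
                       lambda d: d.update(counter=d["counter"] + 4))
print(prep["matched"])            # caller can sanity-check this number
result = updater.trigger(prep["token"])
print(result, docs)
```

The match count returned by the first call is what gives the caller a chance to notice a mistake before any document is changed.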

+1

yarinm commented Jan 29, 2014

+1

I'd also like to upvote this.

seti123 commented Jan 31, 2014

+1

@kimchy: Performance can't be the question. Currently I'm running thousands of queries to look up data (e.g. OSM index address lookups for GPS locations; lookups are fast, hey, I've got ElasticSearch!) and updating each document in another index (e.g. to add a clear-text address). My updates add new fields. A bulk update inside ES must be more efficient than 10.000 lookup queries + 10.000 update requests (even when using bulk updates ...). From a coding and runtime point of view it would be more efficient: e.g. the bulk update file gets 20.000 lines and could have only 2 with the new feature, instead of all that data being moved over the network and keeping ES busy reading bulk update files ...

Maybe you would agree to add limits to the update operation, e.g. _update/_query=some_conditions&size=1000. That way it avoids updating a million docs, and we as developers can decide whether to run 1000*1000 updates to update a million records... It should return the number of docs updated, to give some control over whether another update call is required.
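
The size-limited variant proposed above could be driven by a simple client loop. The following is a minimal in-memory sketch, assuming the update makes a document stop matching the query (here, by adding the missing field); `update_limited`, `missing_address` and `add_address` are hypothetical names, not an existing API.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]


def update_limited(docs: List[Doc],
                   matches: Callable[[Doc], bool],
                   script: Callable[[Doc], None],
                   size: int) -> int:
    """Update at most `size` matching documents; return how many changed."""
    updated = 0
    for doc in docs:
        if updated >= size:
            break
        if matches(doc):
            script(doc)
            updated += 1
    return updated


def missing_address(doc: Doc) -> bool:
    return "address" not in doc


def add_address(doc: Doc) -> None:
    # Placeholder for the real lookup result.
    doc["address"] = "address for %d" % doc["gps"]


docs = [{"gps": i} for i in range(2500)]

calls = 0
total = 0
while True:
    n = update_limited(docs, missing_address, add_address, size=1000)
    calls += 1
    total += n
    if n == 0:  # nothing left to update, so no further call is required
        break
print(calls, total)
```

If the update did not change what the query matches, a loop like this would never terminate, so the returned count alone is only a sufficient stop signal for updates of this kind.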

seti123 commented Feb 3, 2014

For my scenario (enriching records after lookups in other indices) I might do it another way: insert the data into MongoDB first, do the lookups in ElasticSearch, update the records in Mongo, and use the Mongo river to get the final results into ElasticSearch to show them in the GUI (built on top of ES). Does anybody have experience with such scenarios? I had hoped I could go the ES-only way ... until now, I had rejected using a DB in my project.

YannBrrd commented Feb 3, 2014

Hi,

you could simply use Couchbase + Elasticsearch for this, as Couchbase
offers an interface with Elasticsearch

Regards,
Yann Barraud


Contributor
weeyum commented Feb 11, 2014

+1

girak commented Feb 11, 2014

+100

seti123 commented Feb 23, 2014

Is there an alternative in ElasticSearch, e.g. triggering a script that performs an action when new data is inserted or updated? Some kind of before-index trigger could help me remove the pre-processing chain (we now run message queues with Redis and 0MQ as a processing chain before we insert data into ES; all of it costs network bandwidth to shuffle data for parallel processing ...)

I would like to see
http://localhost:9200/index/type/_preprocessBeforeIndex?script=myDataAnalysisScript
http://localhost:9200/index/type/_preprocessBeforeUpdate?script=myDataAnalysisScript
The script must be able to add new fields to the current record before ES stores/indexes it (to avoid a second index action after changes). As we work a lot with node.js, the scripts should work in the language required (in our case JavaScript).

Even better if we could define the script in the mapping per type of data instead of on generated indices.
Is any plug-in available that is able to trigger such scripts? Any documentation on using the ES API in scripts?

xreal commented Feb 24, 2014

+1

+1

geirhha commented Mar 17, 2014

+1

+1

demockk commented Mar 18, 2014

+1

+1

ibagui commented Mar 19, 2014

+1

nkhare commented Apr 4, 2014

+1

gvolpe commented Apr 8, 2014

Waiting for this feature... (+1)

+1

+1

JGailor commented Apr 23, 2014

+1

alari commented Apr 24, 2014

+1

+1

pmishev commented May 16, 2014

Is this feature under development at all?
This would solve so many problems that are almost impossible to handle reliably on the application level right now.

fuyp commented May 23, 2014

+1

+1

Contributor
ofavre commented May 30, 2014

Just to remind you that since mid February 2013 I've packaged, and maintained ever since, the "official pull request" #2231 via @martijnvg's branch as a plugin: yakaz/elasticsearch-action-updatebyquery.

+1

xawiers commented Jun 5, 2014

+1

jrots commented Jun 18, 2014

+1

+1
How is it possible that this feature, available since February 2013, is still not merged to master?

efuquen commented Jul 7, 2014

+1
Ditto on @KrzysztofWilczek's comment. Why has the PR been left to stagnate over the past year with no updates? This is by far the most commented-on issue.

+1

We hit this issue several months ago (see my posts as @seti123 in January/February) and I would like to share our results: after giving up on DB + ES River (too many worries about version dependencies) we evaluated our use case successfully with Crate Data (which uses ES as a library and adds a SQL interface for mapping & query, including "update by query" https://crate.io/docs/stable/sql/dml.html#updating-data ).
A good starting point to read about similarities & differences: https://crate.io/blog/crate_data_elasticsearch

Owner

Closed in favour of #2230

@ofavre ofavre referenced this issue in yakaz/elasticsearch Nov 13, 2014
@ofavre @ofavre ofavre + ofavre Provide more context variables in update scripts
In addition to `_source`, the following variables are available through
the `ctx` map: `_index`, `_type`, `_id`, `_version`, `_routing`,
`_parent`, `_timestamp`, `_ttl`.

Some of these fields are more useful still within the context of an
Update By Query, see #1607, #2230, #2231.
8038920
@dakrone dakrone added a commit that referenced this issue Nov 14, 2014
@ofavre @dakrone ofavre + dakrone Provide more context variables in update scripts
4d68d3d
@dakrone dakrone added a commit that referenced this issue Nov 14, 2014
@ofavre @dakrone ofavre + dakrone Provide more context variables in update scripts
f4750d6
leibale commented Jan 1, 2015

+1

Contributor

+1

+1

+1

+1

+1

binque commented May 15, 2015

+1

saval commented May 26, 2015

+1

artild commented May 29, 2015

+1

bn96 commented Jun 1, 2015

+1

vkopitsa commented Jun 7, 2015

+1

ifgh commented Jun 12, 2015

+1

Will update by query support setPostFilter?
See issue #12295.

xelllee commented Aug 4, 2015

+1

+1

fiserro commented Sep 4, 2015

+1

marioeu commented Sep 15, 2015

+1

ogorun commented Sep 16, 2015

+1

ron521 commented Nov 3, 2015

+1

Can someone review this and give feedback?
https://discuss.elastic.co/t/updatebyqueryresponse-throwing-timeout/29176

Update by query fails when updating more than 20+ million records.

rayward commented Nov 3, 2015

@Praveen82 you are using a 3rd party plugin. This isn't the right place to be requesting support, you should post that as an issue on that plugin's repository.

Contributor
nik9000 commented Dec 19, 2015

#15125 is implementing a syntax that will look a little like

curl -XPOST localhost:9200/index/type/_update_by_query -d '{
    "query" : { "term" : { "counter" : 0 } },
    "script" : {
        "inline" : "ctx._source.counter += count",
        "params" : {
            "count" : 4
        }
    }
}'

The reason this was stalled for so long is those timeouts: up until now there hasn't been a way to launch long-running jobs in Elasticsearch and report on their status and things. With the task management API (#15347) imminent, I picked up the torch on "reindex" and "update-by-query" style things and started them again with the intent to integrate with task management ASAP.

Anyway, #15125 and any followup PRs are the place to look for this feature.
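
For clients that assemble such a request programmatically, here is a small Python sketch that builds the URL and JSON body shown above without sending anything. It mirrors only the syntax quoted in this comment, which is described as what the request "will look a little like", so the released API may differ; `build_update_by_query` and its parameters are hypothetical names.

```python
import json
from typing import Any, Dict, Tuple


def build_update_by_query(index: str,
                          doc_type: str,
                          query: Dict[str, Any],
                          inline: str,
                          params: Dict[str, Any],
                          host: str = "localhost:9200") -> Tuple[str, str]:
    """Return (url, json_body) for the request shape sketched above."""
    url = "http://%s/%s/%s/_update_by_query" % (host, index, doc_type)
    body = {
        "query": query,
        "script": {
            "inline": inline,   # script source, as in the curl example
            "params": params,   # values referenced by the script
        },
    }
    return url, json.dumps(body)


url, body = build_update_by_query(
    "index", "type",
    {"term": {"counter": 0}},
    "ctx._source.counter += count",
    {"count": 4},
)
print(url)
print(body)
```

Keeping the script source and its parameters separate, as in the curl example, means the same script text can be reused with different values.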

+1

+1

@nik9000 nik9000 added v2.3.0 v5.0.0 and removed v2.3.0 v5.0.0 labels Apr 20, 2016
Contributor
nik9000 commented Apr 20, 2016

Update by query is live in 2.3.0 and 5.0.0-alpha-1. The docs are here.

Does update by query in 2.3.+ or 5.+ support the javascript plugin?

Contributor
nik9000 commented Aug 9, 2016

Does update by query in 2.3.+ or 5.+ support the javascript plugin?

If you really want it, sure. In 2.3+ we test update-by-query against groovy and in 5.+ we test against painless. We used to test against groovy and it worked there as well. I expect javascript will work fine.

JS support would be pretty slick.

Contributor
nik9000 commented Aug 9, 2016

JS support would be pretty slick.

As I said, it exists, you just have to install the plugin.

The trouble with all of these languages is that their implementations on the JVM aren't properly oriented for embedding. That is why we don't include them by default.

Anyway, if you want to talk more about it I think discuss.elastic.co is a more appropriate place for it.
