Add delete-by-query plugin #11516

Merged
merged 1 commit into from Jun 17, 2015

tlrx
Member

tlrx commented Jun 5, 2015

This pull request adds a new plugin called "delete-by-query" which implements the now deprecated delete-by-query feature using scan/scroll/bulk requests.

Notes:

  • size parameter controls the scroll shard_size and the number of actions in bulk requests (defaults to 1000)
  • timeout parameter can be used to stop scrolling documents after a given time
  • response now looks like this (here a node is killed during the DBQ execution):
{  
   "took":60866,
   "timed_out":false,
   "_indices":{  
      "_all":{  
         "found":531046,
         "deleted":79901,
         "missing":0,
         "failed":301702
      },
      "disposants-2014":{  
         "found":375702,
         "deleted":74000,
         "missing":0,
         "failed":301702
      },
      "beer":{  
         "found":5901,
         "deleted":5901,
         "missing":0,
         "failed":0
      }
   },
   "failures":[  
      {  
         "shard":-1,
         "index":null,
         "reason":{  
            "type":"node_not_connected_exception",
            "reason":"[Puck][inet[/192.168.1.16:9300]] Node not connected"
         }
      }
   ]
}

Since the process involves executing a scan request (which can fail) followed by successive async scroll requests (which can also fail), we could imagine better failure reporting. When a scroll request succeeds, the scrolled documents are added to a bulk request that is executed asynchronously. If the bulk request fails, all of its documents are counted as failed in the response counters.
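
The counting rule described here (scroll a page, delete it with one bulk request, and count every document of a failed bulk as failed) could be sketched roughly as follows. This is a simplified, synchronous illustration and not the plugin's code: `PageSource`, `BulkDeleter`, `Counters` and `DeleteByQuerySketch` are hypothetical names, and treating "missing" as "scrolled but no longer present when the bulk ran" is an assumption.

```java
import java.util.List;

/** Hypothetical stand-in for a scroll: returns the next page of ids, empty when exhausted. */
interface PageSource {
    List<String> nextPage(int size);
}

/** Hypothetical stand-in for a bulk delete: returns how many ids were deleted, or throws. */
interface BulkDeleter {
    int deleteAll(List<String> ids) throws Exception;
}

/** Counters mirroring the found/deleted/missing/failed fields of the response. */
final class Counters {
    long found;
    long deleted;
    long missing;
    long failed;
}

final class DeleteByQuerySketch {
    /** Scrolls page by page and issues one bulk delete per page until no documents are left. */
    static Counters run(PageSource source, BulkDeleter deleter, int size) {
        Counters counters = new Counters();
        List<String> page;
        while (!(page = source.nextPage(size)).isEmpty()) {
            counters.found += page.size();
            try {
                int deleted = deleter.deleteAll(page);
                counters.deleted += deleted;
                // Assumption: documents not deleted by a successful bulk were already gone.
                counters.missing += page.size() - deleted;
            } catch (Exception e) {
                // A failed bulk counts every document of that page as failed.
                counters.failed += page.size();
            }
        }
        return counters;
    }
}
```

With a page size of 1000 (the default mentioned in the notes), a single failed bulk request would add up to 1000 to the `failed` counter in this sketch.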

REST API documentation and tests will be added later.

}
scanRequest.source(source);

logger.debug("executing scan request");
Member

this probably needs to be trace; the action package has DEBUG enabled by default

Member

applies to other logging statements here

Member Author

Done

@kimchy
Member

kimchy commented Jun 6, 2015

I left some minor comments around logging and usage of thread pool (not needed I think).

I couldn't follow why we need a semaphore and such; I think I am missing something. My thought was that we do search -> bulk -> search -> ... until there are no more results, so it is always an async callback chain until we are done.
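
The search -> bulk -> search chain suggested here could look roughly like the sketch below. It is only an illustration under assumptions: `AsyncScroll`, `AsyncBulk` and `CallbackChainSketch` are hypothetical names, not Elasticsearch classes, and error handling is left out.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical async scroll: hands the next page of ids to the callback (empty page = done). */
interface AsyncScroll {
    void next(Consumer<List<String>> onPage);
}

/** Hypothetical async bulk delete: runs the callback once the whole page has been processed. */
interface AsyncBulk {
    void delete(List<String> ids, Runnable onDone);
}

final class CallbackChainSketch {
    private final AsyncScroll scroll;
    private final AsyncBulk bulk;
    private final Runnable onFinished;

    CallbackChainSketch(AsyncScroll scroll, AsyncBulk bulk, Runnable onFinished) {
        this.scroll = scroll;
        this.bulk = bulk;
        this.onFinished = onFinished;
    }

    void start() {
        scrollNext();
    }

    /** Each step schedules the next one from its own callback, so only one request is in flight. */
    private void scrollNext() {
        scroll.next(page -> {
            if (page.isEmpty()) {
                onFinished.run(); // no more results: the chain ends here
                return;
            }
            bulk.delete(page, this::scrollNext); // bulk finished -> fetch the next page
        });
    }
}
```

Because the next scroll is only requested from the bulk callback, there is never more than one outstanding request in this sketch, which is why no semaphore or extra thread pool is needed to coordinate the steps.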

@tlrx
Member Author

tlrx commented Jun 9, 2015

@kimchy thanks for your review! Your comments make sense, no need to use semaphore stuff... I rebased and updated the code, it is way simpler now.

I'll add some rest tests too.

final String nextScrollId = scrollResponse.getScrollId();
addShardFailures(scrollResponse.getShardFailures());

if (logger.isDebugEnabled()) {
Member

this should be trace?

Member Author

raaaah yes

@tlrx
Member Author

tlrx commented Jun 10, 2015

@s1monw thanks for your review. I updated the code following your comment and added a REST test. Can you please have another look if possible? Thanks :)

Documentation will be added in another PR.

out.writeBoolean(false);
} else {
out.writeBoolean(true);
out.writeVLong(timeout);
Contributor

if you write a VLong make sure it's not negative!

Member Author

Ok, timeout has been changed to TimeValue
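
To illustrate the concern with negative values, here is a small sketch of a generic variable-length encoding (7 payload bits per byte, high bit set when more bytes follow). `VLongSketch` is a hypothetical, self-written encoder and not the Elasticsearch stream implementation; it only shows that, in such a scheme, small non-negative values are compact while any negative value costs the maximum length.

```java
import java.io.ByteArrayOutputStream;

final class VLongSketch {
    /** Encodes the value 7 bits at a time, low bits first; the high bit marks "more bytes follow". */
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7; // unsigned shift, so the loop also terminates for negative input
        }
        out.write((int) value);
        return out.toByteArray();
    }
}
```

In this sketch `VLongSketch.encode(1000L)` takes 2 bytes, while `VLongSketch.encode(-1L)` takes 10 bytes, because the sign bit forces every 7-bit group to be written.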

}

private boolean isTimedOut() {
return request.timeout() != null && (System.currentTimeMillis() >= (startTime + request.timeout().millis()));
Contributor

also use the Threadpool estimations here maybe?

Member Author

Yes
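
One way to make the time source swappable, so that the check can read a cached time estimate instead of calling `System.currentTimeMillis()` on every invocation, is to pass the clock in. The sketch below keeps the same comparison as the snippet under review; `TimeoutSketch` and its `clockMillis` parameter are hypothetical and not part of the plugin.

```java
import java.util.function.LongSupplier;

final class TimeoutSketch {
    private final LongSupplier clockMillis; // hypothetical: any source of "current time" in ms
    private final long startTime;
    private final Long timeoutMillis;       // null means "no timeout"

    TimeoutSketch(LongSupplier clockMillis, Long timeoutMillis) {
        this.clockMillis = clockMillis;
        this.timeoutMillis = timeoutMillis;
        this.startTime = clockMillis.getAsLong();
    }

    /** True once the configured timeout has elapsed since construction. */
    boolean isTimedOut() {
        return timeoutMillis != null && clockMillis.getAsLong() >= startTime + timeoutMillis;
    }
}
```

Passing `System::currentTimeMillis` reproduces the original behaviour, while a supplier backed by a cached estimate trades some precision for a cheaper call.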

@s1monw
Contributor

s1monw commented Jun 12, 2015

looks pretty good though. I left a bunch of comments

@tlrx
Member Author

tlrx commented Jun 16, 2015

@s1monw thanks a lot for your review, very valuable. I updated the code following your comments, please let me know if there are still things to improve.

I'd love to have your help on writing documentation for this plugin, since I'm not sure I'll be able to explain all the fallacies of the previous implementation.

@s1monw
Contributor

s1monw commented Jun 16, 2015

I'd love to have your help on writing documentation for this plugin, since I'm not sure I'll be able to explain all the fallacies of the previous implementation.

Let's get this in as is and open an issue for the documentation. I will take a look and comment on it with the aspects I would take into account.

@s1monw
Contributor

s1monw commented Jun 16, 2015

oh yeah so here is my LGTM ;)

The delete by query plugin adds support for deleting all of the documents (from one or more indices) which match the specified query. It is a replacement for the problematic delete-by-query functionality which has been removed from Elasticsearch core in 2.0. Internally, it uses the Scan/Scroll and Bulk APIs to delete documents in an efficient and safe manner. It is slower than the old delete-by-query functionality, but fixes the problems with the previous implementation.

Closes elastic#7052
@tlrx
Member Author

tlrx commented Jun 17, 2015

@s1monw thanks!

I created #11723 for the java doc aspect.

@clintongormley clintongormley removed the :Core/Infra/Plugins Plugin API and infrastructure label Jun 18, 2015
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Jun 23, 2015
This page is placed in a /plugins directory until we figure where to place all plugins documentation.
@tlrx tlrx deleted the delete-by-query branch May 19, 2016 09:52
@lcawl lcawl added :Distributed/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. and removed :Plugin Delete By Query labels Feb 13, 2018
Labels
:Distributed/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. >feature v2.0.0-beta1

7 participants