Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic reindex implementation #15125

Merged
merged 129 commits into from Jan 13, 2016

Conversation

Projects
None yet
@nik9000
Copy link
Contributor

commented Nov 30, 2015

Note: this has been edited from my first comment.

This creates basic reindex and update-by-query implementations that work like delete-by-query. They scroll documents and then turn around and issue bulk requests for each.

This isn't the last word on reindex. Before we merge the feature branch down to master we'll need to fix it to retry on bulk rejection and integrate it with the task management api.

For those of you reading the issues, realize that all the discussion about APIs is discussion. The API we're going to merge is the result of those discussions. You can read about it by reading the asciidoc files in the PR.

@nik9000

This comment has been minimized.

Copy link
Contributor Author

commented Nov 30, 2015

@tlrx this is my start. I'm intentionally not copying things from delete-by-query because I'm trying to understand why it has the pieces it has.

@clintongormley

View changes

docs/plugins/index-by-search.asciidoc Outdated
The index-by-search plugin adds support for indexing all documents that match
a query. Internally it uses {ref}/search-request-scroll.html[Scroll] and
{ref}/docs-bulk.html[Bulk] APIs much like
{ref}/plusin-delete-by-query[Delete by Query] though it might not use that

This comment has been minimized.

Copy link
@clintongormley

clintongormley Dec 1, 2015

Member

plusin -> plugin

@clintongormley

This comment has been minimized.

Copy link
Member

commented Dec 1, 2015

Some thoughts...

Initially I was a fan of specifying the index and type in the URL, but now I think it is confusing. For instance, what if the source index is in a remote cluster? I'm leaning more towards specifying the source and destination explicitly, eg:

POST _index_by_search
{
  "src": {
    "index": ["index_1", "index_2"],
    "query": {
      "match": {
        "status": "published"
      }
    }
  },
  "dest": {
    "index": "new_index"
  }
}

Essentially, the src would accept anything that a search request would accept, and dest would be used to configure the bulk request.

Routing could be changed or removed as follows:

POST _index_by_search
{
  "src": {
    "index": ["index_1", "index_2"],
    "routing": ["foo", "bar"],
    "query": {
      "match": {
        "status": "published"
      }
    }
  },
  "dest": {
    "index": "new_index",
    "routing": null
  }
}

The version_type parameter could be used as follows:

  • external: set the new version to the same value as the existing version (and fail if the document already exists)
  • internal: update the document only if it exists and is the same version as retrieved by the scroll request
  • none: write the new document regardless of whether it exists or not, or whether it has already been updated or not
@nik9000

This comment has been minimized.

Copy link
Contributor Author

commented Dec 1, 2015

Initially I was a fan of specifying the index and type in the URL, but now I think it is confusing.

I think I'm with you. I'll move it to the single endpoint.

none: write the new document regardless of whether it exists or not, or whether it has already been updated or not

We actually have force for that now.

"routing": null

Right now it copies routing values if they are set. I figure that should be the default. Maybe call it "routing": "keep" and make "routing": "discard" cause it to just throw the routing away.

external: set the new version to the same value as the existing version (and fail if the document already exists)

Can we follow the usual external semantics and write the document only if the version is newer?

internal: update the document only if it exists and is the same version as retrieved by the scroll request

I figured internal would just work like internal versioning works in bulk requests and write the document with version 1 if it doesn't exist.

One thing that comes up when I play with this is that its simpler to just support "index" requests in the bulk, at least for now. I'll think about it some more because update requests would be useful because they'd let you use scripts for free. You could probably get away with only returning smaller portions of the document during the query too.

@clintongormley

This comment has been minimized.

Copy link
Member

commented Dec 2, 2015

none: write the new document regardless of whether it exists or not, or whether it has already been updated or not
We actually have force for that now.

force is a slightly different, in that it will forcibly set the version to the specified value (which could cause the version number to drop), while what I was envisaging was just incrementing the version number, the same way we would if a version hadn't been specified.

"routing": null
Right now it copies routing values if they are set. I figure that should be the default. Maybe call it "routing": "keep" and make "routing": "discard" cause it to just throw the routing away.

The nice thing about routing: null is that you could also set it to routing: foo.

Can we follow the usual external semantics and write the document only if the version is newer?

Not if you're indexing into the same index - nothing would be updated. The use case I'm seeing here is: we use external versioning so eg a db is the source of the version number, now we want to add some multi-field with a new mapping and backfill existing docs, so we reindex but keep the version number the same.

We could use force to index the doc regardless and to keep the same version number, but that wouldn't allow for skipping newer docs... not sure if this is an issue.

I figured internal would just work like internal versioning works in bulk requests and write the document with version 1 if it doesn't exist.

This isn't how it works. Try this:

PUT t/t/1?version=4
{}

You'll get a conflict exception because the document doesn't exist. Instead we could use none (which should just delete the version number before indexing).

This versioning thing is complex. We should probably have a brainstorming session to make sure that we nail it down correctly.

@bleskes

This comment has been minimized.

Copy link
Member

commented Dec 2, 2015

when re-indexing into the same index, internal versioning is what I think we need, i.e., only re-index if it has changed.

For another index it's tricky. If people use external versioning in general, then it's a good fit. If they don't and we concurrently index into a remote while other processes also index it then external version semantics doesn't really make sense . It doesn't mean much that the version in the source index is higher than the one in the target index. In that case I think the only two options are either create (i.e., only reindex if the document doesn't exist) or override (i.e., always index, no guarantees over which copy survives).

In all cases I don't think force makes sense. It can be used as a measure of last resort where people use one index as a source of truth to another and want to override existing docs, but it's super expert and dangerous.

@clintongormley

This comment has been minimized.

Copy link
Member

commented Dec 2, 2015

when re-indexing into the same index, internal versioning is what I think we need, i.e., only re-index if it has changed.

We probably also need the ability to specify what should happen if there is a version conflict: ignore and continue, or throw an exception.

@nik9000

This comment has been minimized.

Copy link
Contributor Author

commented Dec 2, 2015

The nice thing about routing: null is that you could also set it to routing: foo.

I implemented routing: keep, routing: discard and routing: =foo last night. It feels a bit better than relying unset being different from null. I'm happy to iterate on it. Its simple to change.

when re-indexing into the same index, internal versioning is what I think we need, i.e., only re-index if it has changed.

But that'll bump the version number which isn't what someone using external versioning wants. I wonder if those folks are just out of luck here?

either create (i.e., only reindex if the document doesn't exist) or override (i.e., always index, no guarantees over which copy survives).

I think this'd work fine for now. The big use case for writing to an other index is when someone wants to change analysis or sharding.

@bleskes

This comment has been minimized.

Copy link
Member

commented Dec 2, 2015

when re-indexing into the same index, internal versioning is what I think we need, i.e., only re-index if it has changed.

But that'll bump the version number which isn't what someone using external versioning wants. I wonder if those folks are just out of luck here?

Good point and the first valid use for EXTERNAL_GTE I heard :) .In general I was talking about the defaults for the operations that make sense. In general I think we should just allow people to specify what ever versioning support they want.

@rjernst

View changes

plugins/index-by-search/licenses/no_deps.txt Outdated
@@ -0,0 +1 @@
This plugin has no third party dependencies

This comment has been minimized.

Copy link
@rjernst

rjernst Dec 2, 2015

Member

This directory/file is no longer needed.

This comment has been minimized.

Copy link
@nik9000

nik9000 Dec 2, 2015

Author Contributor

Will remove.

nik9000 added some commits Nov 25, 2015

Basic index-by-search
This creates an index-by-search plugin with a very basic, shell of an
implementation of that is very like delete-by-query. At some point we'll
integrate it with the task managament work, but for now this works.
More docs
Fixes to make the example work
Copy routing
Adds a test for routing in general and a test for the grandparent case.
More docs and tests
Small fix around sorts, validation, etc.

@nik9000 nik9000 force-pushed the nik9000:first_reindex branch to 222c986 Dec 2, 2015

@nik9000

This comment has been minimized.

Copy link
Contributor Author

commented Dec 2, 2015

Rebased so I could remove the file. I figure rebase is ok here because no one is actively reviewing the code.

}
--------------------------------------------------

The `src` parameter can also be specified as `source` for those that like that

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jan 12, 2016

Member

Let's keep the naming simple : src and dest

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 12, 2016

Author Contributor

See, I think "source" is more simple.

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jan 12, 2016

Member

I'd also be happy with source and dest

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 12, 2016

Author Contributor

Done.

--------------------------------------------------
POST /_reindex
{
"size": 1,

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jan 12, 2016

Member

Hmm interesting... so size here acts as "max docs in total" or per shard or what? Does this make any sense? Can you even control it over multiple shards?

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jan 12, 2016

Member

I can imagine using size to take a sample of the source index, but wondering if it should be named differently as it is confusing with other uses of size (eg in search/scroll)

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 12, 2016

Author Contributor

limit?

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 12, 2016

Author Contributor

That'd be very sql-ish. But it'd prevent the confusion in _update_by_query where we have size and scroll_size. They'd become limit and size.


Think of the possibilities! Just be careful! With great power.... Anyway! You
can also change the "_type" and "_id" and even "_index" to change the
destination as much as you need.

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jan 12, 2016

Member

and _routing i presume? what about other metafields? _parent?

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 12, 2016

Author Contributor

I'll make this a list of all the fields you can change.


By default if `_reindex` sees a document with routing then the routing is
preserved unless it's changed by scripting. You can set `routing` on the `dest`
request to change this:

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jan 12, 2016

Member

or in a script i presume?

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 12, 2016

Author Contributor

Right. You can set this if you don't want to/can't use a script. But you can use a script to get more power if you want.

@nik9000

This comment has been minimized.

Copy link
Contributor Author

commented Jan 12, 2016

@s1monw I've pushed a few test changes that move some of the tests from integration to unit tests - tests for failure states and tests for how scripts interact with the request. I think some things should remain integration tests - some smoke tests for scripts, the test that update_by_query never reverts changes from another updater, some of the basic smoke tests. But I'll work to make more things unit tests where possible. How far do you think I should go before I merge this into the feature branch and get to work on integrating it with task management? Task management will involve lots more unit tests for things like cancel-ability, status reporting, and eventually throttling.

preserved unless it's changed by scripting. You can set `routing` on the `dest`
request to change this:

`keep`::

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jan 12, 2016

Member

meh, i don't like these options. I'd prefer to go with:

  • routing is preserved by default
  • set it to null to clear
  • set it to another string to set it to that string
  • use a script to make a decision per doc

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 12, 2016

Author Contributor

set it to null to clear

This means null is different from the default. Are we ok with that?


In addition to the standard parameters like `pretty`, all APIs in this plugin
support `refresh`, `consistency`, and `timeout`.

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jan 12, 2016

Member

Is timeout for the whole reindex request, or per batch?

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 12, 2016

Author Contributor

per batch. Will update.

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 12, 2016

Author Contributor

per batch. Will update.

Docs updated.

{
"mappings": {
"test": {
"dynamic": false,

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jan 12, 2016

Member

Use a callout (<1>) on this line to highlight the fact that the flag field is not added to the mapping dynamically, and so is ignored. I missed that in the example at first.

Another example could be adding a multi-field to an existing field?

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 12, 2016

Author Contributor

Another example could be adding a multi-field to an existing field?

Right. I could add that too but its pretty much the same?

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 12, 2016

Author Contributor

Use a callout (<1>)

Done.

--------------------------------------------------

Hurray! You can do the exact same thing when adding a field to a multifield.

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jan 12, 2016

Member

Ah, I see the multi-field example here :)

@clintongormley

This comment has been minimized.

Copy link
Member

commented Jan 12, 2016

Just looked through the docs - looking awesome! I assume you're planning to move this to a module rather than a plugin before merging.

nik9000 added some commits Jan 12, 2016

@nik9000

This comment has been minimized.

Copy link
Contributor Author

commented Jan 12, 2016

I assume you're planning to move this to a module rather than a plugin before merging.

I dunno what the plan is. I'm happy to module-ify it before merging it to master. Right now I just want to get this into the feature branch so I can start working on task management.

nik9000 added some commits Jan 12, 2016

Remove src and destination
Now its just source and dest.
@s1monw

This comment has been minimized.

Copy link
Contributor

commented Jan 13, 2016

@nik9000 thanks for working on the tests - I think we can improve over time so if nobody objects lets get this going and move it into master?

@nik9000

This comment has been minimized.

Copy link
Contributor Author

commented Jan 13, 2016

move it into master

I believe the plan is to integrate it with task management, probably move it to a module, and then move it to master. But I'll certainly merge to the feature branch.

@nik9000 nik9000 merged this pull request into elastic:feature/reindex Jan 13, 2016

@nik9000

This comment has been minimized.

Copy link
Contributor Author

commented Jan 13, 2016

Now that I've merged this to feature/reindex I'm actually going to squash it to one commit. I should have done that before merging but I got excited to click the big green button.

@danielmitterdorfer

This comment has been minimized.

Copy link
Member

commented Jan 13, 2016

Incredible, the first huge step is done! Congratulations! :)

@tlrx

This comment has been minimized.

Copy link
Member

commented Jan 14, 2016

Well done @nik9000 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.