Managing Elasticsearch types #6

Closed
dadoonet opened this Issue Mar 20, 2013 · 27 comments

Contributor

dadoonet commented Mar 20, 2013

Heya,

One of the best practices you guys encourage is to set document IDs of the form type_id, where type corresponds to an Elasticsearch type, _ is a separator, and id is the real document ID.

So, let's imagine that a user who wants to manage types in Elasticsearch has followed your best practices. It shouldn't be too complicated to add one new option to the Couchbase transport plugin: couchbase.typeSeparator.

Then, if couchbase.defaultDocumentType is set to empty, you can send the document to the right type by using this separator to split the type from the id.

What do you think?

Owner

mschoch commented Mar 21, 2013

We started experimenting with something similar on this branch:

https://github.com/couchbaselabs/elasticsearch-transport-couchbase/tree/multi-type

It allows you to configure regular expressions that determine the type from the document id. I didn't add any documentation for how it works yet, but below is the configuration we use internally for our cbugg project:

couchbase.documentTypes.bugid: ^.bugid$
couchbase.documentTypes.ddocVersion: ^/@cbuggddocversion$
couchbase.documentTypes.bug: ^bug-(\d)+$
couchbase.documentTypes.bughistory: ^bug-(\d)+-.$
couchbase.documentTypes.comment: ^c-.$
couchbase.documentTypes.attachment: ^att-.$
couchbase.documentTypes.ping: ^ping-.$
couchbase.documentTypes.tag: ^tag-.$
couchbase.documentTypes.user: ^u-.$

couchbase.documentTypeParentFields.comment: doc.bugId
couchbase.documentTypeParentFields.attachment: doc.bugId
couchbase.documentTypeParentFields.bughistory: doc.id
couchbase.documentTypeParentFields.ping: doc.bugId

couchbase.documentTypeRoutingFields.comment: doc.bugId
couchbase.documentTypeRoutingFields.attachment: doc.bugId
couchbase.documentTypeRoutingFields.bughistory: doc.id
couchbase.documentTypeRoutingFields.ping: doc.bugId

As you can see, you first specify a regular expression for each type you want to support. If none of the regular expressions match the ID, then the default document type is used. Further, if you're setting up any parent/child relationships or want control over how each document is routed, you can use the additional configuration items you see here to refer to elements within the document that contain the parent/routing information.
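A minimal Python sketch of the selection rule described above may help readers. It is not the plugin's code: the function names, the first-match-wins ordering, and the `^u-.*$` pattern are assumptions made for illustration. Only `^bug-(\d)+$` and the `couchbaseDocument` default type come from this thread.

```python
import re

# Default type used when no pattern matches (the type name seen in this thread).
DEFAULT_TYPE = "couchbaseDocument"


def compile_type_patterns(settings):
    """Compile {type_name: pattern_string} into {type_name: compiled_regex}."""
    return {name: re.compile(pattern) for name, pattern in settings.items()}


def select_type(doc_id, patterns, default_type=DEFAULT_TYPE):
    """Return the first type whose pattern matches the whole ID, else the default.

    First-match-wins is an assumption; the thread does not say how ties resolve.
    """
    for name, regex in patterns.items():
        if regex.fullmatch(doc_id):
            return name
    return default_type


patterns = compile_type_patterns({
    "bug": r"^bug-(\d)+$",   # pattern taken from the configuration above
    "user": r"^u-.*$",       # illustrative pattern, not from the configuration
})

print(select_type("bug-42", patterns))
print(select_type("u-22223223", patterns))
print(select_type("misc-1", patterns))
```

The three calls print `bug`, `user`, and `couchbaseDocument` respectively, showing that the type name and the ID prefix do not have to be the same string.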

We're actively using this for one of our internal projects. Would love to get feedback on whether or not this approach works for you.

tdk541 commented Aug 19, 2014

I am trying to use multiple types in Elasticsearch, using the plugin compiled from master. My IDs are of the form book::company::1. Adding this to elasticsearch.yml does not change the doc type in Elasticsearch:

couchbase.documentTypes.book: ^book::.$

What should I change?

Owner

mschoch commented Aug 19, 2014

You also must set:

couchbase.typeSelector: org.elasticsearch.transport.couchbase.capi.RegexTypeSelector

Hi all
Any update on this issue? I have different document types in Couchbase and I need a different mapping for each type in Elasticsearch.
How can I do that?


What I have in couchbase:

bucket: "drinks", 
beer_1234: 
{
  "type": "beer",
  "name": "leffe"
}

How it's indexed in Elasticsearch:

{
  "_index": "drinks",
  "_type": "couchbaseDocument", // <======================== ????
  "_id": "beer_1234",
  "_version": 1,
  "_source": {
    "doc": {
      "type": "beer",
      "name": "leffe"
    },
    "meta": {
      "id": "beer_1234",
      "rev": "9-000049e945bd62fa0000000000000000",
      "expiration": 0,
      "flags": 0
    }
  }
}

What I need:

{
  "_index": "drinks",
  "_type": "beer",  // <======================== NICE TYPE
  "_id": "beer_1234",
  "_version": 1,
  "_source": {
    "doc": {
      "type": "beer",
      "name": "leffe"
    },
    "meta": {
      "id": "beer_1234",
      "rev": "9-000049e945bd62fa0000000000000000",
      "expiration": 0,
      "flags": 0
    }
  }
}

Thanks

Owner

mschoch commented Aug 27, 2014

The feature has been implemented to set the type in Elasticsearch when it can be determined from the document ID. What you and many others have asked for is a way to set the type based on a field in the body of the document (usually type). It turns out this is hard and not desirable for several reasons. Allow me to explain.

  1. In Couchbase documents are uniquely identified by their ID.
  2. In Elasticsearch documents are uniquely identified by the ID and type combination.
  3. The Couchbase XDCR transport has some optimizations that work by only sending the ID, not the body of the document. We wouldn't be able to implement these optimizations, because we wouldn't know how to find the document in Elasticsearch. These are nice to have, but they're just optimizations, we could omit them.
  4. The index operations we perform are expected to overwrite the previous versions of a document. In order to ensure this, the ID AND type must be the same for the old and new versions of the document. If the type is static, or derived from the ID, we can ensure this property. If the type is determined from the body of the document, we cannot guarantee that, because the user could have changed the type.
  5. Why is it such a problem if the type changes? Well, the real problem is that we don't know what the old type was. We'd have to actually perform a search to find the old version. Introducing a search into the indexing path seemed like a really bad idea.

So, as an alternative, we've allowed you to set the type field in Elasticsearch based on the ID of the document. As an example, if your IDs were beer_25 and brewery_27, you could index them into separate beer and brewery types in Elasticsearch. This avoids all the problems noted above, because the type is still immutable.
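The overwrite problem can be illustrated with a toy in-memory index keyed by (type, ID), which is how Elasticsearch identifies a document according to the points above. This is only a sketch under stated assumptions: the dictionary stands in for an index, every function name is hypothetical, and IDs are assumed to have the form `<type>_<number>`.

```python
# Toy "index": a dict keyed by (type, id). Not the plugin's code.


def type_from_body(doc_id, body):
    """Derive the type from the body's "type" field (the problematic approach)."""
    return body.get("type", "couchbaseDocument")


def type_from_id(doc_id, body):
    """Derive the type from the ID, assuming IDs look like "beer_25"."""
    return doc_id.split("_", 1)[0]


def replicate(index, doc_id, body, type_selector):
    """Index one revision; it overwrites the old one only if (type, id) is unchanged."""
    index[(type_selector(doc_id, body), doc_id)] = body


# Type taken from the body: changing the "type" field leaves a stale copy behind.
by_body = {}
replicate(by_body, "beer_25", {"type": "beer", "name": "leffe"}, type_from_body)
replicate(by_body, "beer_25", {"type": "ale", "name": "leffe"}, type_from_body)
print(sorted(by_body))

# Type taken from the ID: the second revision overwrites the first.
by_id = {}
replicate(by_id, "beer_25", {"type": "beer", "name": "leffe"}, type_from_id)
replicate(by_id, "beer_25", {"type": "ale", "name": "leffe"}, type_from_id)
print(sorted(by_id))
```

The first index ends up with two entries for the same Couchbase document, one under `ale` and one under `beer`, while the second holds a single entry under `beer`. Cleaning up the stale entry in the first case would require knowing the old type, which is the search-in-the-indexing-path problem described in point 5.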

This feature has been implemented in source, but is still not available in any of the released versions. It is being tested by QA now, and will be in the next released version.

I hope this explanation helps.

jrm2194 commented Sep 25, 2014

This is great. Thank you for this. Quick question though: how would I set up a parent/child relationship between documents, then?

Something like this:
curl -XPUT localhost:9200/posts/post/1 -d '{ "title": "bolivia rated 4" }'
curl -XPUT localhost:9200/posts/post/2 -d '{ "title": "bolivia rated 2" }'
curl -XPUT localhost:9200/posts/post/3 -d '{ "title": "another country rated 4" }'
curl -XPUT 'localhost:9200/posts/rating/1?parent=1' -d '{ "user_id": 1234, "rating": 4}'
curl -XPUT 'localhost:9200/posts/rating/2?parent=2' -d '{ "user_id": 1234, "rating": 2}'
curl -XPUT 'localhost:9200/posts/rating/3?parent=2' -d '{ "user_id": 4234, "rating": 5}'
curl -XPUT 'localhost:9200/posts/rating/4?parent=3' -d '{ "user_id": 1234, "rating": 5}'

Owner

mschoch commented Sep 25, 2014

If your child document contains the value of the parent document id, you can configure it as shown in the examples above. So, if your rating documents contain a field "post_id", you can configure something like:

couchbase.documentTypes.post: ^p-.$
couchbase.documentTypes.rating: ^r-.$

couchbase.documentTypeParentFields.rating: doc.post_id

couchbase.documentTypeRoutingFields.rating: doc.post_id

Again, this functionality has been merged to the master branch and will be available in the NEXT release, but it is not available in the currently released versions.
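As a rough illustration of what a setting such as `doc.post_id` refers to, the sketch below walks a dotted path through a nested document. It assumes the path is resolved against the `{"doc": ..., "meta": ...}` envelope shown earlier in this thread. The `resolve_field` helper and the sample values are hypothetical and do not reflect the plugin's actual implementation.

```python
def resolve_field(envelope, dotted_path):
    """Walk a dotted path such as "doc.post_id" through nested dicts.

    Returns None if any segment of the path is missing.
    """
    current = envelope
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical rating document, wrapped the way indexed documents appear above.
envelope = {
    "doc": {"post_id": "p-1", "user_id": 1234, "rating": 4},
    "meta": {"id": "r-1"},
}

parent = resolve_field(envelope, "doc.post_id")   # value used as the parent ID
routing = resolve_field(envelope, "doc.post_id")  # value used for routing
print(parent, routing)
```

Pointing both settings at the same field, as in the configuration above, keeps each rating on the same shard as its parent post.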

See also: #44

jrm2194 commented Sep 25, 2014

Super. Good to know. Can't wait for it 👍

awkysam commented Sep 25, 2014

Great...
We're doing a POC, so we have checked out the latest code from the branch.
I guess the above configuration needs to go in es-plugin.properties?

Owner

mschoch commented Sep 25, 2014

No, these settings go in the main Elasticsearch config file, usually config/elasticsearch.yml

awkysam commented Sep 26, 2014

I have added the settings below to elasticsearch.yml, but the mapping types are still not getting updated. I have tried Couchbase IDs like p-1, p_1, p1, p.1 and other combinations.
Please let me know if I have missed something.

couchbase.password: password
couchbase.username: Administrator
couchbase.maxConcurrentRequests: 1024
couchbase.typeSelector: org.elasticsearch.transport.couchbase.capi.RegexTypeSelector
couchbase.documentTypes.post: ^p-.$
couchbase.documentTypes.rating: ^r-.$
couchbase.documentTypeParentFields.rating: doc.post_id
couchbase.documentTypeRoutingFields.rating: doc.post_id

Owner

mschoch commented Sep 26, 2014

Can you post your logs? One thing to look for is a message of the form: "See document type: {} with pattern: {} compiling..."

This will confirm that it sees your regular expressions.

awkysam commented Sep 26, 2014

If I set this property, the default mapping type in ES changes:
couchbase.defaultDocumentType : ratings

logs below....

[2014-09-26 16:25:52,201][INFO ][cluster.metadata ] [Frank Simpson] [posts] update_mapping couchbaseDocument
[2014-09-26 16:25:52,322][INFO ][cluster.metadata ] [Frank Simpson] [posts] update_mapping couchbaseDocument
[2014-09-26 16:25:52,689][INFO ][cluster.metadata ] [Frank Simpson] [posts] update_mapping couchbaseDocument
[2014-09-26 16:25:52,776][INFO ][cluster.metadata ] [Frank Simpson] [posts] update_mapping couchbaseDocument
[2014-09-26 16:25:53,011][INFO ][cluster.metadata ] [Frank Simpson] [posts] update_mapping couchbaseDocument

Owner

mschoch commented Sep 26, 2014

Changing the default document type won't help accomplish what you want. We need to see why the regular expression based rules aren't working.

Also, these log entries are unrelated. There are some additional log levels you can enable; I don't remember them off the top of my head and my plane is about to take off. Try searching the other issues here for instructions to increase logging.

Instead of looking at the body content to identify the index type, standardize on a field called "_type" and index in Elasticsearch based on this field (_type) from the Couchbase document. Irrespective of old version vs. new version, that field is never going to change. E.g. with _type: user, when the doc is sent to ES, it will be indexed under the "user" type.
Let's say I need to maintain "_type" in my Couchbase doc for quick Map/Reduce/View functionality in Couchbase. Then for ES I need to build an id that contains the type, e.g. "user-2223223". With this approach I'm increasing the overall size of the DB in the long run, since the id has the whole type "user" in it instead of something like "u", because ES needs to get the type from the document id. Does that make sense?

Owner

mschoch commented Oct 15, 2014

I'm not sure I understand your proposal. In Couchbase, documents have an ID, a value or body, and metadata. You could add a _type field to the body, which has the problems I've already outlined. Or you could add _type to the metadata. Adding new metadata to Couchbase is beyond the scope of this adapter.

I'm not exactly sure I follow your second paragraph.

  1. You could use document ids like u-22223223 to map to ES type "user" if you want to save space. We have to be able to map from the ID to the type, but they don't have to be the same exact value.
  2. Your view could also key on the same naming convention in the ID. Your view just checks meta.id instead of doc._type.

Ah!.. So I misunderstood. So you are saying

"You could use document ids like u-22223223 to map to ES type "user" if you want to save space. We have to be able to map from the ID to the type, but they don't have to be the same exact value."

Which means that with the id u-22223223 in my Couchbase doc, I can create the ES index type "user". I'm guessing that uses a couchbase.documentTypesXXXX definition in elasticsearch.yml. Can you give me a quick example of how to configure the above use case?

-Thanks.

Owner

mschoch commented Oct 15, 2014

Yes I think the example above already shows this. The following line would go in your elasticsearch.yml along with any other settings.

couchbase.documentTypes.user: ^u-.$

The regular expression matches documents with id starting with "u-" but the Elasticsearch type they map to is "user".

@mschoch perfect! That makes sense now. Thanks for such a quick response.

hefarouk commented Nov 4, 2014

Has this feature been released?

Owner

mschoch commented Nov 4, 2014

Yes, it is available in the 2.0.0 release at http://www.couchbase.com/downloads

hefarouk commented Nov 5, 2014

Thanks for your prompt response. However, I am trying to get the type resolved based on the ID field and it is not working!
I've tried with the beer sample bucket from Couchbase, passing beer_123 as the doc id, but the type is still resolving to couchbaseDocument in Elasticsearch.
I am using elasticsearch-transport-couchbase-2.0.0.
Do I have to do any configuration tweaks to get it working?
I've also tried using a regex, but it didn't work:
couchbase.password: ******
couchbase.username: ******
couchbase.typeSelector: org.elasticsearch.transport.couchbase.capi.RegexTypeSelector
couchbase.documentTypes.beer: ^beer.$

Owner

mschoch commented Nov 5, 2014

The problem is the regex does not match the document ID. (yes the examples above are wrong and/or confusing)

I tried your exact regex, and you are right "beer_123" still comes through as type "couchbaseDocument".

I then changed it to ^beer.*$ and then it correctly mapped "beer_123" to type "beer".

Something like ^beer_(\d)+$ might also work for you.
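The difference between the three patterns can be checked with any regex engine. The sketch below uses Python's `re` module rather than the Java engine the plugin runs on. That substitution is an assumption, but these simple constructs (`.`, `.*`, `\d`, `+`, anchors) behave the same way in both. The `matches` helper is hypothetical.

```python
import re


def matches(pattern, doc_id):
    """Return True if the pattern matches the entire document ID."""
    return re.fullmatch(pattern, doc_id) is not None


doc_id = "beer_123"

# "." matches exactly one character, so ^beer.$ only accepts 5-character IDs.
print(matches(r"^beer.$", doc_id))

# ".*" matches any number of characters after the "beer" prefix.
print(matches(r"^beer.*$", doc_id))

# A stricter alternative: "beer_" followed by one or more digits.
print(matches(r"^beer_(\d)+$", doc_id))
```

The first call prints `False` and the other two print `True`, which matches the behaviour reported above: `^beer.$` would accept an ID such as `beer_` but not `beer_123`.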

hefarouk commented Nov 5, 2014

That's amazing indeed, it does work now... thanks for your feedback, saved my day!

Thanks for this. However, this requires a full cluster restart on the ES side. I've opened #63 and would be happy if you could address that.

I have been playing around with this plugin for a couple of days now and I have come across a problem (maybe a design problem, a plugin problem, or both) which I cannot figure out. Can anyone help me handle my multi-tenancy use case (I would also like to mention the comment I made on the post)? The question is posted here:

http://stackoverflow.com/questions/28314272/couchbase-elastic-search-plugin-with-multi-tenancy/28322772#28322772
and here's the description:
This is more of a design question about integrating Couchbase with the Elasticsearch plugin. I have used Couchbase with multi-tenancy in our previous product, and we followed the very first suggestion we found on the Couchbase blog, "Single Couchbase Bucket for All Tenants".

Now we are researching how to exploit Elasticsearch features on Couchbase data using the Couchbase Elasticsearch plugin. Going through the plugin documentation (installation and setup), I realized that you can map only one Couchbase bucket to one Elasticsearch index. That documentation can be found under "Elasticsearch plugin configuration" and "Connecting to Cluster". In that case, just as with the Couchbase bucket, all the documents (regardless of the tenant) will reside in the same index.

Now here's my question. Regardless of how the documents are stored in Couchbase, I would like the Elasticsearch index to be per tenant. I am still quite new to the integration between these 2 systems, but I am assuming that having a separate search index per tenant (with each tenant/index having many different types of its own) would 1) increase search performance per tenant, and 2) ensure that a search query on a tenant with a minimal data set is not impacted by a huge data set for some other tenant in the same index (although not plausible, assume that the data sets between tenants differ by a factor of 50x).

What I am wondering is whether my concerns are valid. Will performance of search queries be impacted by having all the tenants indexed together? And if so, does anyone have a solution for achieving this with the Couchbase Elasticsearch plugin?

Collaborator

Branor commented Feb 14, 2015

The plugin doesn't allow replicating a bucket to more than one index right now. In theory, it's possible to add, but it would be a very complicated and fragile feature, because you could easily end up with the same document in multiple indexes just by changing some configurations.
HOWEVER! You don't necessarily have to send different tenants to different indexes. You can use the existing feature of mapping documents to types, together with custom routing to ensure that all documents from a single tenant end up under a specific type, and are routed to their own shard. That way you can specify the routing data in the query for improved performance, because ES will only query that particular shard.

You can try a preview build of the new version that includes all these features here: https://github.com/Branor/elasticsearch-transport-couchbase/releases/tag/v2.1.0-SNAPSHOT

PS. You could also try using index aliases with custom filtering and routing to separate tenants without any intervention in the plugin: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html
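A sketch of the alias approach: the snippet below builds the JSON body for the Elasticsearch `_aliases` endpoint, with one filtered and routed alias per tenant. The tenant field name `doc.tenant_id`, the alias naming scheme, and the tenant IDs are assumptions for illustration only. The thread does not say how tenants are identified inside a document.

```python
import json


def tenant_alias_action(index, tenant_id, tenant_field="doc.tenant_id"):
    """Build one "add" action: an alias filtered and routed to a single tenant.

    tenant_field is a hypothetical field name; adjust it to the real documents.
    """
    return {
        "add": {
            "index": index,
            "alias": f"{index}-{tenant_id}",
            "filter": {"term": {tenant_field: tenant_id}},
            "routing": tenant_id,
        }
    }


def build_alias_request(index, tenant_ids):
    """Build the full request body containing one alias action per tenant."""
    return {"actions": [tenant_alias_action(index, t) for t in tenant_ids]}


body = build_alias_request("drinks", ["acme", "globex"])
print(json.dumps(body, indent=2))
```

The printed body would be sent with a POST to `/_aliases`. Queries against `drinks-acme` would then see only that tenant's documents. For the alias routing to restrict searches to the right shard, documents would also have to be indexed with the same routing value, for example via a `couchbase.documentTypeRoutingFields` setting like the ones shown earlier in this thread.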

@Branor Branor closed this Aug 13, 2015
