Queries cannot be sorted by a field using its defined "index_name" #8980

Closed
InfinitiesLoop opened this issue Dec 17, 2014 · 11 comments

Comments

@InfinitiesLoop

If you try to sort by "fieldname", where "fieldname" is the name assigned to a mapped field via "index_name", you get a parse error stating that no mapping was found for that field. It doesn't seem right that you can search by the index_name but not sort by it, hence this bug. Without a fix, I may be forced to put my data in the document twice so that I have an actual field with the correct name instead of using index_name.

Discussion on the forum:
https://groups.google.com/forum/#!topic/elasticsearch/6-BWdQTPTH0

Reproduced in 1.4 via this gist:
https://gist.github.com/pmishev/11375297

@dadoonet
Member

Indeed.

Setting index_name in a sub field is totally ignored:

DELETE /index1
POST /index1/
{
  "mappings": {
     "people": {
        "properties": {
           "work_email": {
              "type": "string",
              "index_name": "email",
              "fields": {
                 "raw": {
                    "index_name": "raw",
                    "type": "string",
                    "index": "not_analyzed"
                 }
              }
           }
        }
     }
  }
}
GET /index1/_mapping

It gives:

{
   "index1": {
      "mappings": {
         "people": {
            "properties": {
               "work_email": {
                  "type": "string",
                  "index_name": "email",
                  "fields": {
                     "raw": {
                        "type": "string",
                        "index": "not_analyzed"
                     }
                  }
               }
            }
         }
      }
   }
}

That said, I wonder whether you would be better off looking at the copy_to feature.

@InfinitiesLoop
Author

Well it doesn't seem like it's ignored in another scenario I have. That one was a gist I found that seemed to reproduce my same problem, but my original problem is this. I have a dynamic template like this (sorry for lack of valid json here, I just copied it from the head plugin which has formatted it):

dynamic_templates: [{
  textfields: {
    mapping: {
      type: multi_field,
      fields: {
        sort: {
          index_name: "{name}_sort",
          analyzer: "keyword_lowercase",
          type: "string"
        },
        "{name}": {
          index_name: "{name}",
          index: "analyzed",
          analyzer: "letterordigit",
          type: "string"
        }
      }
    },
    path_match: "textfields.*"
  }
}]

That template is used with a document that looks like this:

{ id: 123, textfields: { summary: "hello" } }

And it yields a mapping like this (note the index_name of summary_sort seems to be working):

  properties: {
    textfields: {
      properties: {
        summary: {
          analyzer: "letterordigit",
          type: "string"
          fields: {
            sort: {
              index_name: "summary_sort",
              analyzer: "keyword_lowercase",
              type: "string"
            }
          }
        }
      }
    }
 }

But it does not like it when I sort by "summary_sort". I have this 'textfields' container for the fields I want to have analyzed, and doing that allows my dynamic template to target them easily by path (I have other string fields that are not analyzed that go into a string fields container). But I don't want to search for them by that nested path, hence my attempt to use index_name to hide the fact they are nested fields from searches.

So anyway, your finding that index_name is ignored in a sub-field doesn't seem to hold in my example here, which seems odd.

I thought I had seen somewhere that copy_to was obsolete. I'd be happy to use it though, as long as I can accomplish my goal of 'hiding' the nested path of "textfields.summary_sort". I want it to be just "summary_sort"... can copy_to do that?

Thanks so much for your help!

@dadoonet
Member

> I thought I had seen somewhere that copy_to was obsolete.

The _all field is IMO obsolete now that we have the copy_to feature. So yes, I believe you can copy your summary_sort sub-field to any whatever_name field and define that field as not analyzed.

Should work.
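
A minimal sketch of the copy_to workaround suggested above, written as a Python dict that could serve as the body of a create-index request. The field names `summary` and `summary_sort` and the `letterordigit` analyzer come from this thread; the type name `doc`, the helper name `copy_to_mapping`, and the choice to put `copy_to` on the main field rather than on the `sort` sub-field are assumptions for illustration, and nothing here has been run against a cluster.

```python
import json


def copy_to_mapping(type_name="doc"):
    """Build a hypothetical explicit mapping that uses copy_to.

    Assumes 1.x-era string mappings and that the custom "letterordigit"
    analyzer is defined in the index settings (not shown here).
    """
    return {
        "mappings": {
            type_name: {
                "properties": {
                    "textfields": {
                        "properties": {
                            "summary": {
                                "type": "string",
                                "analyzer": "letterordigit",
                                # Copy the raw value into a top-level field.
                                "copy_to": "summary_sort",
                            }
                        }
                    },
                    # Target of copy_to, not analyzed so it can be sorted on.
                    "summary_sort": {"type": "string", "index": "not_analyzed"},
                }
            }
        }
    }


body = copy_to_mapping()
print(json.dumps(body, indent=2))
```

With a mapping along these lines, a sort would reference `summary_sort` directly, at the cost of indexing the value a second time.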

@InfinitiesLoop
Author

Well that'd be awesome! Is there any concern about it being less performant though since copy_to seems to imply we'll have two instances of the data? Or is it just another reference to the same data?

Not in a position to try it at the moment, so I will try it out tomorrow and update this issue :) If that works though is this still a legit bug? I'm curious if you were able to reproduce my 2nd mapping having index_name appear in the mapping via the dynamic template, and then still not being able to sort on it (but search/sort on the analyzed version works).

@dadoonet
Member

It's a copy of data. So you can index it in a different way, using another analyzer.

It's more of a workaround, I think. IMO what you described initially looks like a bug, but I'd love to hear @clintongormley's thoughts as well to confirm or refute this. :)

@clintongormley

Hi @InfinitiesLoop

First, multi_field is deprecated in favour of the simpler form (see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html). Then, index_name is also deprecated (#6677) in favour of copy_to.
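
For illustration, the dynamic template posted earlier in this thread could be rewritten with the simpler `fields` form roughly as follows. This is a sketch, not a tested mapping: it drops `index_name` entirely, assumes the custom analyzers `letterordigit` and `keyword_lowercase` are defined in the index settings, and the helper `sort_path` is a hypothetical name introduced here.

```python
import json

# Dynamic template from the thread, using "fields" on a plain string field
# instead of the deprecated multi_field type, and without index_name.
DYNAMIC_TEMPLATES = [
    {
        "textfields": {
            "path_match": "textfields.*",
            "mapping": {
                "type": "string",
                "analyzer": "letterordigit",
                "fields": {
                    "sort": {"type": "string", "analyzer": "keyword_lowercase"}
                },
            },
        }
    }
]


def sort_path(field_name):
    """Return the full path of the sort sub-field under 'textfields'."""
    return "textfields.{0}.sort".format(field_name)


print(json.dumps(DYNAMIC_TEMPLATES, indent=2))
print(sort_path("summary"))  # textfields.summary.sort
```

Without `index_name`, the sub-field has to be addressed by its full path, which is the constraint the rest of this comment discusses.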

Note: multi-fields (or the new sub-fields) and copy-to do duplicate content as they create an index for each field name. That said, you can set the original (source) field to index: no, which removes the index you aren't using.

However, your intended goal is to be able to refer to these fields without the textfields. prefix, and I'm afraid this is a non-starter. Dynamic templates don't allow you to rename the field or to add more than one field (or to move the source values to another field).

While you could have a template which matches textfield.foo and set that to index:no and copy_to:foo, you wouldn't be able to control the mapping of the foo field at that point. You'd have to have a separate template to handle that (which means that it wouldn't recognise your textfield. prefix which you're essentially using as a flag to indicate the content type).
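
The template described above might look roughly like the sketch below. It rests on an unverified assumption: that the `{name}` placeholder is expanded inside `copy_to` the same way it is elsewhere in a dynamic template mapping. The template name `textfields_copy` and the `expand` helper are hypothetical; `expand` only imitates placeholder substitution locally so the result can be inspected.

```python
import json

# Hypothetical dynamic template: stop indexing textfields.<name> itself and
# copy its value to a top-level field called <name>.
# Assumption (unverified): {name} is expanded inside copy_to.
COPY_TEMPLATE = {
    "textfields_copy": {
        "path_match": "textfields.*",
        "mapping": {"type": "string", "index": "no", "copy_to": "{name}"},
    }
}


def expand(mapping, name):
    """Local stand-in for placeholder expansion, for illustration only."""
    return {
        key: value.replace("{name}", name) if isinstance(value, str) else value
        for key, value in mapping.items()
    }


expanded = expand(COPY_TEMPLATE["textfields_copy"]["mapping"], "summary")
print(json.dumps(expanded, sort_keys=True))
```

Note what the sketch does not contain: any mapping for the top-level `summary` field. Because the template only matches `textfields.*`, that target field would fall back to default dynamic mapping, which is the limitation described above.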

I think you're stuck...

@InfinitiesLoop
Author

I see. So, I have a document with a mix of fields of different types, and of the string fields some should be analyzed and some should not. I also do not know the names of the fields ahead of time, hence the dynamic template (and there could also be more fields added to documents at any time). I also have containers for numeric fields, date fields, etc, because I want that explicit mapping that "this is a date" without relying on date detection, etc.

If there were a way to create a dynamic mapping that let me differentiate fields I want analyzed and fields I do not, even though they are the same type (string), and that didn't force me to use a field prefix/suffix in searches (in the document is ok), then I could maybe get away without having the 'textfields' container or any other container. Index_name seemed to be the perfect solution to that. Basically, index_name and dynamic templates were a great pair, because it meant you could use something to hang your template on, without having to dictate the structure of the fields in queries.

I'm quite unsure what to do now... I've invested many sprints in rewriting a search system from using RavenDB to ElasticSearch, and now I'm not even sure it's possible to support our needs. Not only is index_name not being honored in my case, but it's being completely removed... I beg you to reconsider or at least validate that my scenario is a valid one that you want to support so I have hope for future versions.

Is there a way I can dictate within the document what the mapping should be? I'll do anything at this point... :(

@clintongormley

> I see. So, I have a document with a mix of fields of different types, and of the string fields some should be analyzed and some should not. I also do not know the names of the fields ahead of time, hence the dynamic template (and there could also be more fields added to documents at any time). I also have containers for numeric fields, date fields, etc, because I want that explicit mapping that "this is a date" without relying on date detection, etc.

How do you know on the RavenDB side which fields you want analyzed and which fields you don't? What happens if you have duplicate field names, but with different mapping requirements?

With #8870 you are going to have to use the full path to reference fields, no longer the short name, but you can still use wildcards, eg *.some_field. However that won't work for sorting, as we need to know the single field that you want to sort on.
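
As a rough illustration of that difference, the sketch below builds a search body that queries through a wildcard field pattern but sorts on one explicit full path. The helper name `search_body` is hypothetical, the `textfields.summary.sort` path assumes a `sort` sub-field like the one in the mappings earlier in this thread, and the request has not been run against a cluster.

```python
import json


def search_body(short_name, text, sort_field):
    """Query any field ending in short_name, but sort on one exact path."""
    return {
        "query": {
            "query_string": {
                # Wildcard pattern: may match several fully qualified fields.
                "fields": ["*." + short_name],
                "query": text,
            }
        },
        # Sorting needs a single concrete field, so no wildcard here.
        "sort": [{sort_field: {"order": "asc"}}],
    }


body = search_body("summary", "hello", "textfields.summary.sort")
print(json.dumps(body, indent=2))
```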

Are you allowing your users to specify their own queries using the query DSL, or are you providing your own API and generating the DSL for them? If the latter, then all you have to do is to maintain a field to namespace.field mapping in your application (which can be refreshed on restart with a GET mapping request) and then rewriting fieldnames to their namespaced variety will be easy. If you're exposing the whole DSL then it is still possible, but will take a lot more work to get it right.
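
One way the application-side lookup could be sketched: walk the response of a GET mapping request (same shape as the response shown earlier in this thread) and build a table from short field name to full path. The function names, the sample mapping, and the `stringfields` container are assumptions for illustration; sub-fields declared under `fields` are ignored to keep the sketch short.

```python
def leaf_paths(properties, prefix=""):
    """Yield the full dotted path of every non-object field in a mapping."""
    for name, spec in properties.items():
        path = prefix + name
        if "properties" in spec:
            # Object field: descend into its children.
            yield from leaf_paths(spec["properties"], path + ".")
        else:
            yield path


def field_lookup(mapping_response, index, doc_type):
    """Map each short field name to the list of full paths that end in it."""
    props = mapping_response[index]["mappings"][doc_type]["properties"]
    lookup = {}
    for path in leaf_paths(props):
        short = path.rsplit(".", 1)[-1]
        lookup.setdefault(short, []).append(path)
    return lookup


def resolve(lookup, short_name):
    """Return the single full path for short_name, or raise if unclear."""
    paths = lookup.get(short_name, [])
    if len(paths) != 1:
        raise KeyError("unknown or ambiguous field: %r" % short_name)
    return paths[0]


# Hypothetical mapping response, shaped like the GET /index1/_mapping output.
sample = {
    "index1": {
        "mappings": {
            "people": {
                "properties": {
                    "textfields": {
                        "properties": {"summary": {"type": "string"}}
                    },
                    "stringfields": {
                        "properties": {"status": {"type": "string"}}
                    },
                }
            }
        }
    }
}

lookup = field_lookup(sample, "index1", "people")
print(resolve(lookup, "summary"))  # textfields.summary
```

Raising on ambiguous names is a deliberate choice in this sketch: if two containers hold a field with the same short name, the application has to decide which one the user meant rather than guess.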

I think that your current design will prove to be flawed in the long term - while it may work with your current requirements, later you'll want to do other stuff like retrieve the docs from Elasticsearch, or run aggregations, or highlight on fields etc, and you'll end up with this complicated scheme where the fields in your docs have no relationship with the fields in Elasticsearch.

@InfinitiesLoop
Author

> How do you know on the RavenDB side which fields you want analyzed and which fields you don't?

I don't. In fact, we have a hard-coded list of full text fields in the index definition, and we often need to manually fix it when someone needs a new one. Part of rewriting it towards ES was to hopefully get rid of that.

> What happens if you have duplicate field names, but with different mapping requirements

Right now there's absolutely a problem if different projects have the same field name with different mapping types. Thankfully, that just hasn't been an actual problem we ran into. But I was hoping to solve that by separating each project into its own Type in ES. They each get their own mapping. It's only a problem then if they actually try to search or do something with that field across all the types, which is ok, we can live with that. But most searches are within a single type, so that's ok. I've been assuming that ES is ok with two types in an index having the same field name with different mappings. Is that not the case? (Again though, I understand it may have issues with cross-type searches etc.)

> With #8870 you are going to have to use the full path to reference fields

Since the only reason I had the structure in the document was to hang a path match onto it in a dynamic template, I shall have to switch away from dynamic templates. I think I can do that, but it means that I will have to generate a specific tailored mapping dynamically (and know when I need to amend it). It would be really awesome if dynamic templates were more flexible though, it would save a lot of complexity.

> Are you allowing your users to specify their own queries using the query DSL, or are you providing your own API and generating the DSL for them

We abstract Lucene away from the user. They basically write T-SQL-like where clauses using a custom syntax we defined. We take that string, tokenize it, and generate a Lucene query string.

> all you have to do is to maintain a field to namespace.field mapping in your application

Yeah, I could have a flat list of all the field names across all the types and expand the namespace like you said. If they are searching across all types and there's a conflict, I can't really do that, but we've already established that just can't logically work so that's ok. I will think on this...

> end up with this complicated scheme where the fields in your docs have no relationship with the fields in Elasticsearch

Fair enough, I don't want that. All I want is to define a mapping that works for my dynamic schema :) I think I can either (1) generate a non-dynamic mapping from each project configuration (and maintain the mapping by amending to it if a field is added -- which is a lot of complexity because there are constant reindexes occurring as documents change and I will need to coordinate the mapping update), or (2) expand field names as you suggested.

Either solution will get me out from being stuck, but do add complexity that I didn't realize I would end up having due to index_name being removed.

In short, I hope that ES could improve on the options we have for dynamic schemas and dynamic mappings. It doesn't have to be index_name, just something that can allow me to map my fields correctly without having to introduce search/sort-breaking structure to my documents. Perhaps a hint field in the document, or the ability to use a prefix on the field name that can be stripped off but matched on... or what have you. I'd be open to anything that makes it easier. Please! :)

Thanks for your time and attention, I greatly appreciate it.

@clintongormley

> > How do you know on the RavenDB side which fields you want analyzed and which fields you don't?
>
> I don't. In fact, we have a hard-coded list of full text fields in the index definition, and we often need to manually fix it when someone needs a new one. Part of rewriting it towards ES was to hopefully get rid of that.

Are you planning on moving off RavenDB completely, or using Elasticsearch in conjunction with it? Either way, the manual list of fields is a good approach: that way you have complete control over the mapping, rather than having to try to munge things with dynamic mapping.

> Right now there's absolutely a problem if different projects have the same field name with different mapping types. Thankfully, that just hasn't been an actual problem we ran into. But I was hoping to solve that by separating each project into its own Type in ES. They each get their own mapping. It's only a problem then if they actually try to search or do something with that field across all the types, which is ok, we can live with that. But most searches are within a single type, so that's ok. I've been assuming that ES is ok with two types in an index having the same field name with different mappings. Is that not the case? (Again though, I understand it may have issues with cross-type searches etc.)

This is a problem: fields with the same name in different types are the same field! This is the source of numerous problems, just see how many tickets are linked to #4081. With #8870 we are planning to enforce the requirement that fields with the same name in the same index have the same mapping.

You will have to use a separate index for these different projects, rather than separate types.

> generate a non-dynamic mapping from each project configuration (and maintain the mapping by amending to it if a field is added -- which is a lot of complexity because there are constant reindexes occurring as documents change and I will need to coordinate the mapping update)

Actually this isn't very complex at all. You will need to create a new index with the appropriate mappings when you reindex anyway, so it should be very easy to generate the mappings for each field as part of the same process. (The requirement to have separate projects in separate indices actually makes this step easier too)
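
A sketch of what generating the mappings as part of the reindex process could look like. The project configuration format (field name mapped to one of `text`, `keyword`, `date`, `number`), the type name `doc`, and the function name are all hypothetical; the analyzers are the custom ones named earlier in this thread and are assumed to exist in the index settings.

```python
import copy
import json

# Hypothetical field kinds an application might track per project, each
# translated to a 1.x-style field mapping.
FIELD_KINDS = {
    "text": {
        "type": "string",
        "analyzer": "letterordigit",
        "fields": {"sort": {"type": "string", "analyzer": "keyword_lowercase"}},
    },
    "keyword": {"type": "string", "index": "not_analyzed"},
    "date": {"type": "date"},
    "number": {"type": "double"},
}


def build_index_body(project_fields, type_name="doc"):
    """Turn {field name: kind} into a create-index body with explicit mappings."""
    properties = {
        name: copy.deepcopy(FIELD_KINDS[kind])
        for name, kind in project_fields.items()
    }
    return {"mappings": {type_name: {"properties": properties}}}


# One index per project, as recommended above, so each call is independent.
body = build_index_body({"summary": "text", "status": "keyword", "created": "date"})
print(json.dumps(body, indent=2, sort_keys=True))
```

Because the fields sit at the top level of the document in this sketch, queries and sorts can use `summary` and `summary.sort` without any container prefix.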

@clintongormley

I don't think there is any more to do here. Closing this ticket


3 participants