Index time boost in multi_field ignored? #4108

Closed
roytmana opened this issue Nov 6, 2013 · 13 comments

@roytmana

roytmana commented Nov 6, 2013

I use multi_field to index the content of multiple properties into a single _all-like multi_field. Each contributing property may define a different boost when defining its instance of the multi_field. These boosts, however, are ignored when searching against such a multi_field, while they are honored when searching against _all.

A full recreation is here: https://gist.github.com/roytmana/7330956

It compares the same query against my multi_field "all" and the ES _all field.

@roytmana
Author

roytmana commented Nov 7, 2013

Any comment? It's a real showstopper for us!

I did a big rework (considering the size of our mappings) to get away from using _all, because I needed a combined field like _all but analyzed differently (stemmed, shingled, regular...), only to find that multi_field is unusable for us because it does not take the boost per contributing field into account the way _all does.

Please let me know whether this is a bug that can be fixed, or whether it is not possible and I must go back to using _all :-(

@javanna
Member

javanna commented Nov 7, 2013

This is again related to the use of "path":"just_name". The boost is usually taken into account when using an ordinary multi_field, but if you use "just_name" for fields with the same name, you merge their content into the same Lucene field. That said, applying different boosts to the same Lucene field doesn't make much sense to me, and that is why your boost gets ignored. If you want to give different weights to those fields, you need to keep them distinct.

I would keep a variation with a unique name so that the boost will be taken into account, as they will actually be different Lucene fields. Otherwise you could just drop index time boosting and use a multi_match query against multiple fields, giving a different weight to each field.
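The multi_match alternative mentioned above can be sketched as follows. This is only an illustration: the field names `first_name` and `last_name` and the helper `multi_match_body` are hypothetical, and the snippet only builds the request body (using the `field^boost` notation that multi_match accepts) rather than sending it to a cluster.

```python
import json


def multi_match_body(text, field_boosts):
    """Build a multi_match request body with one query-time boost per field.

    field_boosts is a list of (field_name, boost) pairs; the names used
    below are hypothetical and not taken from the original mapping.
    """
    fields = ["{}^{}".format(name, boost) for name, boost in field_boosts]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


# Weight last_name twice as much as first_name at query time.
body = multi_match_body("john smith", [("first_name", 1), ("last_name", 2)])
print(json.dumps(body, indent=2))
```

The printed JSON could then be posted to the index's `_search` endpoint; each field stays a distinct Lucene field, so the weights are applied per field.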

@ghost ghost assigned javanna Nov 7, 2013
@roytmana
Author

roytmana commented Nov 7, 2013

Thank you @javanna but in this case my intention is to combine data from multiple properties into a single field to act like _all.

_all does support boost based on which field contributes to it:

"One of the nice features of the _all field is that it takes into account specific fields boost levels. Meaning that if a title field is boosted more than content, the title (part) in the _all field will mean more than the content (part) in the _all field."

The boost can be applied to individual tokens, right? What I expected is that each individual property contributing to the shared (collapsed) multi_field would mark its content with its defined boost.

@clintongormley

I agree with @roytmana - the field boosts are retained when indexing into _all, and I assumed that the same thing would apply when indexing from multiple fields into a single index_name.

In fact, I'd say that this is the one place where field-level index time boosting has a purpose.

@javanna
Member

javanna commented Nov 7, 2013

Yeah I see your point guys, I agree, looking into it :)

@roytmana
Author

roytmana commented Nov 7, 2013

More so, I expected analyzers and position offset gaps to be honored per contributing field, so that we have fine-grained control over how such a combined field gets put together.

For example, I use phrase searches and I want to make sure that phrases spanning content from different contributing fields are not matched. I would use position_offset_gap for such fields.

Or I want some fields to contribute stemmed content and a few others (say people's names or some codes) to contribute unstemmed content, etc.

This is what makes it so powerful
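To make the phrase-search point concrete, here is a toy model of what a position gap is meant to do. It is not Elasticsearch or Lucene code: `index_positions` and `phrase_match` are hypothetical helpers, tokenization is a plain whitespace split, and the sample values are invented. Whether gaps are actually honored across different source fields sharing one index_name is the open question in this thread; the sketch only shows the intended effect.

```python
def index_positions(values, gap):
    """Assign token positions to several values merged into one field.

    A gap of `gap` positions is inserted between consecutive values,
    mimicking what a position offset gap is intended to achieve.
    """
    positions = {}
    pos = 0
    for i, value in enumerate(values):
        if i > 0:
            pos += gap
        for token in value.lower().split():
            positions.setdefault(token, []).append(pos)
            pos += 1
    return positions


def phrase_match(positions, phrase):
    """Return True if the phrase tokens occur at consecutive positions."""
    tokens = phrase.lower().split()
    if not tokens:
        return False
    return any(
        all((start + k) in positions.get(tok, []) for k, tok in enumerate(tokens))
        for start in positions.get(tokens[0], [])
    )


values = ["John Smith", "Acme Corp"]  # hypothetical contributing values
no_gap = index_positions(values, gap=0)
with_gap = index_positions(values, gap=100)

print(phrase_match(no_gap, "smith acme"))    # phrase spans both values
print(phrase_match(with_gap, "smith acme"))  # gap keeps the values apart
print(phrase_match(with_gap, "john smith"))  # phrase inside one value
```

With no gap the phrase "smith acme" matches across the two values; with a gap of 100 it no longer does, while phrases inside a single value still match.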

And thanks for looking into it. I was getting kind of desperate about this issue being "ignored"; I banked a lot of my design on multi_field's power.

Now, when is it going to be fixed? :-) ha ha

@clintongormley

"I expected analyzers and position offset gaps to be honored per contributing field so we have a fine-grained control over how such combined field get put together"

Are they not? I was pretty sure they were.

If not, could you provide a recreation?

@roytmana
Author

roytmana commented Nov 8, 2013

Maybe they are working; I guess I was dramatizing it a bit :-) after struggling with multi_field and the related highlighting issues. (I feel the current naming of the multi_field primary field versus index_name when using 'just_name' is very unintuitive, see #4123: it lumps all primary fields into one Lucene field, which in 99% of cases is not what I would expect.)

I will test it later tonight or tomorrow and report back.

@javanna
Member

javanna commented Nov 8, 2013

I checked how the _all field works and it has a special treatment, which is why we specifically mention that it keeps the boost from the original fields.

The reason why I said in the first place that it doesn't make sense to have more than one boost for the same field is that index time boosting is per field, using field norms, thus only one value per field. To work around this the _all field contains a payload per term with the original boost, that is used at query time for scoring. This is something that we do only for the _all field and I'm not even sure if we should do it whenever using the same index_name for different fields.

But we should definitely take this into account in the discussion on #4099 regarding future improvements.
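The difference between a single per-field boost and per-term payloads can be illustrated with a toy model. This is not Lucene's scoring formula: the `Token` type, both scoring functions, and the sample documents are hypothetical, and the only point is that one boost value per field cannot tell apart terms that came from differently boosted source fields, while a boost stored with each term can.

```python
from collections import namedtuple

# Each token remembers the boost of the source field it came from.
Token = namedtuple("Token", ["term", "boost"])


def score_single_boost(tokens, term, field_boost=1.0):
    """One boost for the whole merged field: source boosts are lost."""
    return sum(field_boost for tok in tokens if tok.term == term)


def score_per_term(tokens, term):
    """Payload-style scoring: every occurrence carries its own boost."""
    return sum(tok.boost for tok in tokens if tok.term == term)


# Hypothetical documents: in doc_a "smith" came from a field boosted 3.0,
# in doc_b it came from an unboosted field.
doc_a = [Token("smith", 3.0), Token("report", 1.0)]
doc_b = [Token("smith", 1.0), Token("report", 1.0)]

print(score_single_boost(doc_a, "smith"), score_single_boost(doc_b, "smith"))
print(score_per_term(doc_a, "smith"), score_per_term(doc_b, "smith"))
```

In the first line both documents get the same score; in the second, the document whose match came from the boosted source ranks higher.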

@roytmana
Author

roytmana commented Nov 8, 2013

Oh no! Back to using the _all field then :-(
So much time wasted, and none of the improvements I hoped for.
I should have tested multi_field boosts before basing my design on them.

Oh well, I hope you will be able to pull a miracle out of the hat :-)

@clintongormley

@roytmana you can achieve what you are after (i.e. the effect of per-field index time boosts) by querying both your custom _all field and the individual fields. For example, let's say you wanted to search first_name and last_name using the single full_name field; your mapping would look like this:

curl -XPUT "http://localhost:9200/myindex" -d'
{
    "mappings": {
        "person": {
            "properties": {
                "first_name": {
                    "type": "multi_field",
                    "path": "just_name",
                    "fields": {
                        "first_name": { "type": "string"},
                        "full_name": { "type": "string"}
                    }
                },
                "last_name": {
                    "type": "multi_field",
                    "path": "just_name",
                    "fields": {
                        "last_name": { "type": "string"},
                        "full_name": { "type": "string"}
                    }
                }
            }
        }
    }
}'

Then if you wanted to give a slight boost to the last_name field, you could do:

curl -XPOST "http://localhost:9200/myindex/person/_search" -d'
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "full_name": "john smith"
            }
         },
         "should": {
            "match": {
               "last_name": {
                  "query": "john smith",
                  "boost": 0.5
               }
            }
         }
      }
   }
}'

You could even use the rescore functionality to achieve something similar:

curl -XPOST "http://localhost:9200/myindex/person/_search" -d'
{
   "query": {
      "match": {
         "full_name": "john smith"
      }
   },
   "rescore": {
      "window_size": 50,
      "query": {
         "rescore_query_weight": 0.5,
         "rescore_query": {
            "match": {
               "last_name": {
                  "query": "john smith"
               }
            }
         }
      }
   }
}'

And this would probably be more efficient (and certainly more flexible) than using payloads to implement field-level index time boosts on custom _all fields.

@roytmana
Author

roytmana commented Dec 3, 2013

Thank you @clintongormley. I did not think of rescore, but I did use the first approach. My issue is that I have highly structured data: over 100 fields, and that is just the beginning. Half of them are not very useful, but I can't afford not to include them in my search; I need to massively de-emphasize them or they will drown out the useful results. The other half provides a good search corpus, with some fields being more important than others. Out of this half there is a handful of highly relevant fields that get a special boost.
I find that managing it all in queries will bloat the queries to no end, require making sure it is consistent across all the queries, require third parties who may develop against the index to have the same knowledge, and require the same analysis on the individual fields as on the _all field (e.g. stemming).

I do not know. I guess I could go this route, with a should clause listing all "important" fields (over 70 now) with various boosts, making sure they are analyzed the same as _all (or my _all-like multi_field), but I do not know if it will scale in terms of complexity as the number of my fields triples in the next version. I am also not sure whether it could cause performance issues.
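One way to keep such a query consistent across callers is to generate the should clauses from a single boost table instead of writing them by hand. The sketch below assumes a hypothetical `FIELD_BOOSTS` table and helper `boosted_query`; it only builds the bool query body in the shape shown earlier in the thread and says nothing about how a query with 70+ clauses performs.

```python
import json

# Hypothetical central table of query-time boosts, one entry per field.
FIELD_BOOSTS = {
    "last_name": 0.5,
    "first_name": 0.2,
}


def boosted_query(text, combined_field, field_boosts):
    """Build a bool query: must-match on the combined field, plus one
    boosted should clause per individual field in the table."""
    should = [
        {"match": {field: {"query": text, "boost": boost}}}
        for field, boost in sorted(field_boosts.items())
    ]
    return {
        "query": {
            "bool": {
                "must": {"match": {combined_field: text}},
                "should": should,
            }
        }
    }


body = boosted_query("john smith", "full_name", FIELD_BOOSTS)
print(json.dumps(body, indent=2))
```

Third parties would then only need the helper (or the table), not the individual boost values, although the analysis of the individual fields would still have to match that of the combined field.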

@javanna
Member

javanna commented Jan 9, 2014

Closing this issue in favour of #4520, which will take care of custom _all fields.

@javanna javanna closed this as completed Jan 9, 2014