Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make _index optional for pinned query docs #97450

Merged
merged 19 commits into from Sep 7, 2023

Conversation

ioanatia
Copy link
Contributor

@ioanatia ioanatia commented Jul 7, 2023

Currently pinned queries require either the ids or docs parameter.
docs allows pinning documents from specific indices.
However for docs the _index field is always required:

GET test/_search
{
  "query": {
    "pinned": {
      "organic": {
        "query_string": {
          "query": "something"
        }
      },
      "docs": [
        { "_id": "1" }
      ]
    }
  }
}

returns an error:

{
  "error": {
    "root_cause": [
      {
        "type": "parsing_exception",
        "reason": "[10:22] [pinned] failed to parse field [docs]",
        "line": 10,
        "col": 22
      }
    ],
    "type": "parsing_exception",
    "reason": "[10:22] [pinned] failed to parse field [docs]",
    "line": 10,
    "col": 22,
    "caused_by": {
      "type": "x_content_parse_exception",
      "reason": "[10:22] [pinned] failed to parse field [docs]",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "Required [_index]"
      }
    }
  },
  "status": 400
}

The proposal here is to make _index optional. I don't think we have a strong requirement for making _index required, when it was initially introduced in #74873, we mostly wanted the ability to pin docs from specific indices.

Making _index optional can give more flexibility to use a combination of pinned documents from specific indices or just document ids.
This change can also help with pinned query rules. Currently pinned query rules can accept either ids or docs.
If multiple pinned query rules match and they use a combination of ids and docs, we cannot build a pinned query and we would need to return an error. This is because a pinned query cannot accept both ids and docs.
By making _index optional we would no longer need to return an error when pinned query rules use a combination of ids and docs, because we can easily translate ids in docs.

The following pinned queries would be equivalent:

GET test/_search
{
  "query": {
    "pinned": {
      "organic": {
        "query_string": {
          "query": "something"
        }
      },
      "docs": [
        { "_id": "1" }
      ]
    }
  }
}

GET test/_search
{
  "query": {
    "pinned": {
      "organic": {
        "query_string": {
          "query": "something"
        }
      },
      "ids": [1]
    }
  }
}

The scores should be consistent when using a combination of _docs that might use _index or not - see example

Example
PUT test-1/_doc/1
{
 "title": "doc 1"
}

PUT test-1/_doc/2
{
 "title": "doc 2"
}

PUT test-2/_doc/1
{
 "title": "doc 1"
}

PUT test-2/_doc/3
{
 "title": "lalala"
}


POST test-1,test-2/_search
{
 "query": {
   "pinned": {
     "organic": {
       "query_string": {
         "query": "lalala"
       }
     },
     "docs": [
       { "_id": "2", "_index": "test-1" },
       { "_id": "1" }
     ]
   }
 }
}

response:

{
 "took": 1,
 "timed_out": false,
 "_shards": {
   "total": 2,
   "successful": 2,
   "skipped": 0,
   "failed": 0
 },
 "hits": {
   "total": {
     "value": 4,
     "relation": "eq"
   },
   "max_score": 1.7014124e+38,
   "hits": [
     {
       "_index": "test-1",
       "_id": "2",
       "_score": 1.7014124e+38,
       "_source": {
         "title": "doc 2"
       }
     },
     {
       "_index": "test-1",
       "_id": "1",
       "_score": 1.7014122e+38, // same score as doc with id 1 from test-2
       "_source": {
         "title": "doc 1"
       }
     },
     {
       "_index": "test-2",
       "_id": "1",
       "_score": 1.7014122e+38, // same score as doc with id 1 from test-1
       "_source": {
         "title": "doc 1"
       }
     },
     {
       "_index": "test-2",
       "_id": "3",
       "_score": 0.8025915, // organic result
       "_source": {
         "title": "lalala"
       }
     }
   ]
 }
}

For query rules, if we have two query rules that both match and use a combination of ids and pinned:

PUT _query_rules/test-ruleset
{
  "ruleset_id": "test-ruleset",
  "rules": [
    {
      "rule_id": "1",
      "type": "pinned",
      "criteria": [
        {
          "type": "exact",
          "metadata": "query_string",
          "value": "country"
        }
      ],
      "actions": {
        "docs": [
          { "_index": "singers", "_id": "1" }
        ]
      }
    },
    {
      "rule_id": "2",
      "type": "pinned",
      "criteria": [
        {
          "type": "exact",
          "metadata": "query_string",
          "value": "country"
        }
      ],
      "actions": {
        "ids": [
          2
        ]
      }
    }
  ]
}

and the following query:

POST singers/_search
{
    "query": {
        "rule_query": {
            "organic": {
                "query_string": {
                  "default_field": "name",
                  "query": "country"
                }
            },
            "match_criteria": {
                "query_string": "country"
            },
            "ruleset_id": "test-ruleset"
        }
    }
}

then this would get translated into the following pinned query:

POST singers/_search
{
    "query": {
        "pinned": {
            "organic": {
                "query_string": {
                  "default_field": "name",
                  "query": "country"
                }
            },
            "docs": [
               { "_index": "singers", "_id": "1" },
               {"_id": 2 }
            ]
        }
    }
}

I think we can also simplify the pinned query rule so that it only receives docs:

PUT _query_rules/test-ruleset
{
  "ruleset_id": "test-ruleset",
  "rules": [
    {
      "rule_id": "1",
      "type": "pinned",
      "criteria": [
        {
          "type": "exact",
          "metadata": "query_string",
          "value": "country"
        }
      ],
      "actions": {
        "docs": [
          { "_id": "1" },
          { "_id": "2", "_index": "singers" }
        ]
      }
    }
  ]
}

@kderusso
Copy link
Member

kderusso commented Jul 7, 2023

I think this change makes sense, but we should probably update the pinned query docs as well if we make this change:

  • Denote docs as the preferred way to query pinned documents
  • Specify that _index is optional
  • Note that colliding _ids in both indices will both be pinned if _index is not specified

@ioanatia
Copy link
Contributor Author

There is one side effect that I want to call out here, it is possible to pin the same documents multiple times with the pinned query - however the results will not contain duplicate pinned documents. This example should explain it better:

PUT test1/_doc/1
{
  "title": "document 1"
}

PUT test2/_doc/1
{
  "title": "document 1"
}

GET test1,test2/_search
{
  "query": {
    "pinned": {
      "docs": [
        {
          "_index": "test1",
          "_id": "1"
        },
        {
          "_id": "1" // this pins document with id `id` from test1 again
        }
      ],
      "organic": {
        "match": {
          "description": "document"
        }
      }
    }
  }
}

results:

 "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.7014124e+38,
    "hits": [
      {
        "_index": "test1",
        "_id": "1",
        "_score": 1.7014124e+38, // notice that the first hit has a higher score and it's the right pinned result
        "_source": {
          "title": "document 1"
        }
      },
      {
        "_index": "test2",
        "_id": "1",
        "_score": 1.7014122e+38,
        "_source": {
          "title": "document 1"
        }
      }
    ]
  }

I think this should be okay, because a similar behaviour can be achieved on main with the use of aliases:

POST _aliases
{
  "actions": [
    {
      "add": {
        "indices": ["test1", "test2"],
        "alias": "my-alias"
      }
    }
  ]
}

GET my-alias/_search
{
  "query": {
    "pinned": {
      "docs": [
        {
          "_index": "test1",
          "_id": "1"
        },
        {
          "_id": "1",
          "_index": "my-alias" // this also pins the document from test1
        } 
      ],
      "organic": {
        "match": {
          "description": "document"
        }
      }
    }
  }
}

results:

 "hits": [
      {
        "_index": "test1",
        "_id": "1",
        "_score": 1.7014124e+38, // again, the first pinned document has the highest score
        "_source": {
          "title": "document 1"
        }
      },
      {
        "_index": "test2",
        "_id": "1",
        "_score": 1.7014122e+38,
        "_source": {
          "title": "document 1"
        }
      }
    ]

@ioanatia ioanatia marked this pull request as ready for review July 24, 2023 14:14
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jul 24, 2023
@ioanatia ioanatia added the Team:Enterprise Search Meta label for Enterprise Search team label Jul 24, 2023
@ioanatia ioanatia requested review from jimczi and a team July 24, 2023 14:15
@elasticsearchmachine elasticsearchmachine removed the Team:Enterprise Search Meta label for Enterprise Search team label Jul 24, 2023
@ioanatia ioanatia added :Search/Search Search-related issues that do not fall into other categories Team:Enterprise Search Meta label for Enterprise Search team labels Jul 24, 2023
@elasticsearchmachine elasticsearchmachine added Team:Search Meta label for search team and removed needs:triage Requires assignment of a team area label Team:Enterprise Search Meta label for Enterprise Search team labels Jul 24, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@ioanatia ioanatia added >enhancement needs:triage Requires assignment of a team area label Team:Enterprise Search Meta label for Enterprise Search team and removed Team:Search Meta label for search team labels Jul 24, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @ioanatia, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added Team:Search Meta label for search team and removed needs:triage Requires assignment of a team area label Team:Enterprise Search Meta label for Enterprise Search team labels Jul 24, 2023
@saikatsarkar056
Copy link
Contributor

I've noticed that the actual GitHub issue isn't mentioned in this PR. Is this because it's a public repository? 🤔

@saikatsarkar056
Copy link
Contributor

What does organic match mean? I am trying to understand the meaning by reading this doc on pinned query. Is there any other related documentation?

@kderusso
Copy link
Member

kderusso commented Aug 17, 2023

@saikatsarkar056

I've noticed that the actual GitHub issue isn't mentioned in this PR. Is this because it's a public repository? 🤔

Yes.

What does organic match mean? I am trying to understand the meaning by reading this doc on pinned query. Is there any other related documentation?

The organic query is the query that is run, the results are underneath all pinned documents.

@@ -105,13 +102,13 @@ private Item(String id) {
* Read from a stream.
*/
public Item(StreamInput in) throws IOException {
index = in.readString();
index = in.readOptionalString();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BWC serialization must be handled on input and output.

Comment on lines 83 to 85
- skip:
version: " - 8.9.99"
reason: "'_index' was made optional in 8.10"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs bumped to 8.10.99 if this is still going to be merged.

@carlosdelest carlosdelest self-assigned this Aug 29, 2023
Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PinnedQueryBuilderTests should be updated to account for the index being randomly null or not when generating a query.

if (out.getTransportVersion().onOrAfter(OPTIONAL_INDEX_IN_DOCS_VERSION)) {
out.writeOptionalString(index);
} else {
out.writeString(index);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will throw an NPE. It is possible that the coordinator is upgraded to the latest version of Elasticsearch but the relevant data nodes are not. Consequently, if the user does not provide an _index, the coordinator will accept the call, attempt to serialize, and an NPE will throw.

We should instead be nice here and throw a better error message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - I've included an IllegalArgumentException that hopefully makes sense 👍

@@ -129,7 +126,7 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
);

static {
PARSER.declareString(constructorArg(), INDEX_FIELD);
PARSER.declareStringOrNull(optionalConstructorArg(), INDEX_FIELD);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

declareStringOrNull

This means we allow the value to be string or explicitly null. I think we instead allow it to be string or not provided at all. Consequently, it should just be PARSER.declareString(optionalConstructorArg(), INDEX_FIELD);

out.writeOptionalString(index);
} else {
out.writeString(index);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below in the toXContent method, we should not write the INDEX_FIELD if this.index == null.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

@carlosdelest
Copy link
Member

PinnedQueryBuilderTests should be updated to account for the index being randomly null or not when generating a query.

Thanks for checking this - I've added that to the tests 👍

@carlosdelest
Copy link
Member

@elasticmachine update branch

@elasticmachine
Copy link
Collaborator

merge conflict between base and head

Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thanks for taking it over the line!

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersion.java
@carlosdelest carlosdelest added the auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 6, 2023
# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersion.java
@elasticsearchmachine elasticsearchmachine merged commit 7cd1987 into elastic:main Sep 7, 2023
12 checks passed
@ioanatia ioanatia deleted the ioana/pinned_query branch September 7, 2023 08:40
alex-spies pushed a commit to dreamquster/elasticsearch that referenced this pull request Sep 8, 2023
Currently pinned queries require either the `ids` or `docs` parameter.
`docs` allows pinning documents from specific indices. However for
`docs` the `_index` field is always required:

```
GET test/_search
{
  "query": {
    "pinned": {
      "organic": {
        "query_string": {
          "query": "something"
        }
      },
      "docs": [
        { "_id": "1" }
      ]
    }
  }
}
```

returns an error:

```
{
  "error": {
    "root_cause": [
      {
        "type": "parsing_exception",
        "reason": "[10:22] [pinned] failed to parse field [docs]",
        "line": 10,
        "col": 22
      }
    ],
    "type": "parsing_exception",
    "reason": "[10:22] [pinned] failed to parse field [docs]",
    "line": 10,
    "col": 22,
    "caused_by": {
      "type": "x_content_parse_exception",
      "reason": "[10:22] [pinned] failed to parse field [docs]",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "Required [_index]"
      }
    }
  },
  "status": 400
}
```

The proposal here is to make `_index` optional. I don't think we have a
strong requirement for making `_index` required, when it was initially
introduced in elastic#74873, we
mostly wanted the ability to pin docs from specific indices.

Making `_index` optional can give more flexibility to use a combination
of pinned documents from specific indices or just document ids. This
change can also help with pinned query rules. Currently pinned query
rules can accept either `ids` or `docs`. If multiple pinned query rules
match and they use a combination of `ids` and `docs`, we cannot build a
pinned query and we would need to return an error. This is because a
pinned query cannot accept both `ids` and `docs`. By making `_index`
optional we would no longer need to return an error when pinned query
rules use a combination of `ids` and `docs`, because we can easily
translate `ids` in `docs`.

The following pinned queries would be equivalent:

```
GET test/_search
{
  "query": {
    "pinned": {
      "organic": {
        "query_string": {
          "query": "something"
        }
      },
      "docs": [
        { "_id": "1" }
      ]
    }
  }
}

GET test/_search
{
  "query": {
    "pinned": {
      "organic": {
        "query_string": {
          "query": "something"
        }
      },
      "ids": [1]
    }
  }
}
```

The scores should be consistent when using a combination of _docs that
might use _index or not - see example

<details>   <summary>Example </summary>

```

PUT test-1/_doc/1 {   "title": "doc 1" }

PUT test-1/_doc/2 {   "title": "doc 2" }

PUT test-2/_doc/1 {   "title": "doc 1" }

PUT test-2/_doc/3 {   "title": "lalala" }

POST test-1,test-2/_search {   "query": {     "pinned": {      
"organic": {         "query_string": {           "query": "lalala"      
}       },       "docs": [         { "_id": "2", "_index": "test-1" },  
{ "_id": "1" }       ]     }   } }

```

response:

```

{   "took": 1,   "timed_out": false,   "_shards": {     "total": 2,    
"successful": 2,     "skipped": 0,     "failed": 0   },   "hits": {    
"total": {       "value": 4,       "relation": "eq"     },    
"max_score": 1.7014124e+38,     "hits": [       {         "_index":
"test-1",         "_id": "2",         "_score": 1.7014124e+38,        
"_source": {           "title": "doc 2"         }       },       {      
"_index": "test-1",         "_id": "1",         "_score": 1.7014122e+38,
// same score as doc with id 1 from test-2         "_source": {         
"title": "doc 1"         }       },       {         "_index": "test-2", 
"_id": "1",         "_score": 1.7014122e+38, // same score as doc with
id 1 from test-1         "_source": {           "title": "doc 1"        
}       },       {         "_index": "test-2",         "_id": "3",      
"_score": 0.8025915, // organic result         "_source": {          
"title": "lalala"         }       }     ]   } }

```

</details>

For query rules, if we have two query rules that both match and use a
combination of `ids` and `pinned`:

```
PUT _query_rules/test-ruleset
{
  "ruleset_id": "test-ruleset",
  "rules": [
    {
      "rule_id": "1",
      "type": "pinned",
      "criteria": [
        {
          "type": "exact",
          "metadata": "query_string",
          "value": "country"
        }
      ],
      "actions": {
        "docs": [
          { "_index": "singers", "_id": "1" }
        ]
      }
    },
    {
      "rule_id": "2",
      "type": "pinned",
      "criteria": [
        {
          "type": "exact",
          "metadata": "query_string",
          "value": "country"
        }
      ],
      "actions": {
        "ids": [
          2
        ]
      }
    }
  ]
}
```

and the following query:

```
POST singers/_search
{
    "query": {
        "rule_query": {
            "organic": {
                "query_string": {
                  "default_field": "name",
                  "query": "country"
                }
            },
            "match_criteria": {
                "query_string": "country"
            },
            "ruleset_id": "test-ruleset"
        }
    }
}
```

then this would get translated into the following pinned query:

```
POST singers/_search
{
    "query": {
        "pinned": {
            "organic": {
                "query_string": {
                  "default_field": "name",
                  "query": "country"
                }
            },
            "docs": [
               { "_index": "singers", "_id": "1" },
               {"_id": 2 }
            ]
        }
    }
}
```

I think we can also simplify the pinned query rule so that it only
receives `docs`:

```
PUT _query_rules/test-ruleset
{
  "ruleset_id": "test-ruleset",
  "rules": [
    {
      "rule_id": "1",
      "type": "pinned",
      "criteria": [
        {
          "type": "exact",
          "metadata": "query_string",
          "value": "country"
        }
      ],
      "actions": {
        "docs": [
          { "_id": "1" },
          { "_id": "2", "_index": "singers" }
        ]
      }
    }
  ]
}
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v8.11.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants