
Max bytes length exceeded when percolating a document with huge geo_shape property #83418

Closed
OlivierRo opened this issue Feb 2, 2022 · 6 comments
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes >bug :Search/Percolator Reverse search: find queries that match a document Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Meta label for search team

Comments


OlivierRo commented Feb 2, 2022

Elasticsearch Version

7.16.3

Installed Plugins

No response

Java Version

bundled

OS Version

Linux 1f98d58db442 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

I have a system using three types of documents: data documents with a geo_shape property, workflows that use some of that data, and search requests that associate data with the workflows using it, through a percolate query.

A workflow uses an Elasticsearch query for each piece of data it needs, so when a workflow is created it can easily find all the data it needs.

The aim is also to know which workflows are waiting for newly created data: when a new data document is created, we use the search request index and a percolate query to retrieve all workflows that need it.

The problem occurs when a data document with a huge geo_shape is submitted for percolation (even if the search request index is empty): max_bytes_length_exceeded_exception.
If I remove the geo_shape property from the data mapping, the problem goes away.
I submitted this case to discuss.elastic.co but didn't get any response:
(https://discuss.elastic.co/t/percolation-with-document-containing-huge-geo-shape/294557/2)
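For reference, the same percolate lookup can be issued from a client. Below is a minimal sketch that builds the request body used in this report (a plain Python function; the commented-out client call assumes the `elasticsearch` Python package and a reachable cluster, and the deprecated `document_type` parameter is omitted):

```python
# Build the percolate query that asks: which stored workflow queries in the
# search request index match the data document already indexed elsewhere?
def build_percolate_query(doc_index, doc_id, query_field="query"):
    return {
        "query": {
            "constant_score": {
                "filter": {
                    "percolate": {
                        "field": query_field,   # the percolator field in the search request index
                        "index": doc_index,     # index holding the already-indexed data document
                        "id": doc_id,           # id of that document
                    }
                }
            }
        }
    }

if __name__ == "__main__":
    body = build_percolate_query("dt_product_polygon", "SWOT_ProductPolygon_SA.kml")
    # With a running cluster this would be executed as, e.g.:
    #   from elasticsearch import Elasticsearch
    #   es = Elasticsearch("http://localhost:9200")
    #   es.search(index="sr_product_polygon", body=body)
    print(body["query"]["constant_score"]["filter"]["percolate"]["index"])
    # prints "dt_product_polygon"
```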

Steps to Reproduce

The data type is "product_polygon".
The mapping of dt_product_polygon (i.e. the data index) is:

{
  "dt_product_polygon" : {
    "aliases" : { },
    "mappings" : {
      "dynamic_templates" : [
        {
          "doubles" : {
            "match_mapping_type" : "double",
            "mapping" : {
              "type" : "double"
            }
          }
        },
        {
          "texts" : {
            "match_mapping_type" : "string",
            "mapping" : {
              "type" : "keyword"
            }
          }
        }
      ],
      "properties" : {
        "@chronosInstance" : {
          "type" : "keyword"
        },
        "@ingestDate" : {
          "type" : "date"
        },
        "creationTime" : {
          "type" : "date"
        },
        "dataType" : {
          "type" : "keyword"
        },
        "granules" : {
          "properties" : {
            "continent" : {
              "properties" : {
                "@chronosInstance" : {
                  "type" : "keyword"
                },
                "@ingestDate" : {
                  "type" : "date"
                },
                "ascendIds" : {
                  "type" : "keyword"
                },
                "code" : {
                  "type" : "keyword"
                },
                "id" : {
                  "type" : "keyword"
                },
                "name" : {
                  "type" : "keyword"
                },
                "nextId" : {
                  "type" : "keyword"
                },
                "passes" : {
                  "type" : "keyword"
                },
                "placemark" : {
                  "type" : "geo_shape"
                },
                "previousId" : {
                  "type" : "keyword"
                },
                "tiles" : {
                  "type" : "keyword"
                },
                "toBeProcessed" : {
                  "type" : "boolean"
                },
                "type" : {
                  "type" : "keyword"
                }
              }
            },
            "segment" : {
              "properties" : {
                "id" : {
                  "type" : "keyword"
                },
                "segmentEndTime" : {
                  "type" : "date"
                },
                "segmentStartTime" : {
                  "type" : "date"
                },
                "toBeProcessed" : {
                  "type" : "boolean"
                },
                "type" : {
                  "type" : "keyword"
                }
              }
            }
          }
        },
        "id" : {
          "type" : "keyword"
        },
        "manuallyImported" : {
          "type" : "boolean"
        },
        "nature" : {
          "type" : "keyword"
        },
        "uri" : {
          "type" : "keyword"
        },
        "valid" : {
          "type" : "boolean"
        }
      }
    },
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "provided_name" : "dt_product_polygon",
        "creation_date" : "1643799069519",
        "number_of_replicas" : "1",
        "uuid" : "3jqOBLZfQ6C5uDAPkJVfcw",
        "version" : {
          "created" : "7160399"
        }
      }
    }
  }
}

The mapping of sr_product_polygon (i.e. the search request index) is:

{
  "sr_product_polygon" : {
    "aliases" : { },
    "mappings" : {
      "dynamic_templates" : [
        {
          "doubles" : {
            "match_mapping_type" : "double",
            "mapping" : {
              "type" : "double"
            }
          }
        },
        {
          "texts" : {
            "match_mapping_type" : "string",
            "mapping" : {
              "type" : "keyword"
            }
          }
        }
      ],
      "properties" : {
        "@ingestDate" : {
          "type" : "date"
        },
        "granules" : {
          "properties" : {
            "continent" : {
              "properties" : {
                "ascendIds" : {
                  "type" : "keyword"
                },
                "code" : {
                  "type" : "keyword"
                },
                "id" : {
                  "type" : "keyword"
                },
                "name" : {
                  "type" : "keyword"
                },
                "nextId" : {
                  "type" : "keyword"
                },
                "passes" : {
                  "type" : "keyword"
                },
                "placemark" : {
                  "type" : "geo_shape"
                },
                "previousId" : {
                  "type" : "keyword"
                },
                "tiles" : {
                  "type" : "keyword"
                },
                "toBeProcessed" : {
                  "type" : "boolean"
                },
                "type" : {
                  "type" : "keyword"
                }
              }
            }
          }
        },
        "id" : {
          "type" : "keyword"
        },
        "query" : {
          "type" : "percolator"
        },
        "uri" : {
          "type" : "keyword"
        },
        "valid" : {
          "type" : "boolean"
        }
      }
    },
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "provided_name" : "sr_product_polygon",
        "creation_date" : "1643799069677",
        "number_of_replicas" : "1",
        "uuid" : "20v18mGPRwa0Xdya8b4gGQ",
        "version" : {
          "created" : "7160399"
        }
      }
    }
  }
}

Attached is one document that contains a huge geo_shape: data.zip

The request that triggers the problem (even when sr_product_polygon is empty):

GET sr_product_polygon/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "percolate": {
          "document_type": null,
          "field": "query",
          "index": "dt_product_polygon",
          "id": "SWOT_ProductPolygon_SA.kml"
        }
      }
    }
  }
}

with the result:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "query_shard_exception",
        "reason" : "failed to create query: bytes can be at most 32766 in length; got 35987",
        "index_uuid" : "20v18mGPRwa0Xdya8b4gGQ",
        "index" : "sr_product_polygon"
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "sr_product_polygon",
        "node" : "gl2TPy5SQd263YOIx4UDZw",
        "reason" : {
          "type" : "query_shard_exception",
          "reason" : "failed to create query: bytes can be at most 32766 in length; got 35987",
          "index_uuid" : "20v18mGPRwa0Xdya8b4gGQ",
          "index" : "sr_product_polygon",
          "caused_by" : {
            "type" : "max_bytes_length_exceeded_exception",
            "reason" : "bytes can be at most 32766 in length; got 35987"
          }
        }
      }
    ]
  },
  "status" : 400
}

Logs (if relevant)

No response

@OlivierRo OlivierRo added >bug needs:triage Requires assignment of a team area label labels Feb 2, 2022
@pugnascotia pugnascotia added the :Search/Percolator Reverse search: find queries that match a document label Feb 3, 2022
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Feb 3, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@pugnascotia pugnascotia removed the needs:triage Requires assignment of a team area label label Feb 3, 2022
@cbuescher cbuescher added the :Analytics/Geo Indexing, search aggregations of geo points and shapes label Feb 3, 2022
@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Feb 3, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@imotov
Contributor

imotov commented Feb 3, 2022

I was able to reproduce it and the error that is generated is

org.elasticsearch.index.query.QueryShardException: failed to create query: bytes can be at most 32766 in length; got 75750
    at org.elasticsearch.index.query.SearchExecutionContext.toQuery(SearchExecutionContext.java:508)
    at org.elasticsearch.index.query.SearchExecutionContext.toQuery(SearchExecutionContext.java:491)
    at org.elasticsearch.search.SearchService.parseSource(SearchService.java:1159)
    at org.elasticsearch.search.SearchService.createContext(SearchService.java:971)
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:621)
    at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:487)
    at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
    at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:776)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 75750
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:258)
    at org.apache.lucene.index.memory.MemoryIndex.storeDocValues(MemoryIndex.java:590)
    at org.apache.lucene.index.memory.MemoryIndex.addField(MemoryIndex.java:411)
    at org.apache.lucene.index.memory.MemoryIndex.fromDocument(MemoryIndex.java:324)
    at org.apache.lucene.index.memory.MemoryIndex.fromDocument(MemoryIndex.java:302)
    at org.elasticsearch.percolator.PercolateQueryBuilder.doToQuery(PercolateQueryBuilder.java:526)
    at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:90)
    at org.elasticsearch.index.query.ConstantScoreQueryBuilder.doToQuery(ConstantScoreQueryBuilder.java:125)
    at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:90)
    at org.elasticsearch.index.query.SearchExecutionContext.lambda$toQuery$3(SearchExecutionContext.java:492)
    at org.elasticsearch.index.query.SearchExecutionContext.toQuery(SearchExecutionContext.java:504)
    ... 14 more

It basically fails in PercolateQueryBuilder when we try to build the MemoryIndex:

MemoryIndex memoryIndex = MemoryIndex.fromDocument(docs.get(0).rootDoc(), analyzer, true, false);

which seems to be unable to work with large geo_shape fields. @iverase WDYT?

@iverase
Contributor

iverase commented Feb 4, 2022

I don't have much knowledge of the inner workings of the percolator, but the issue seems related to geo_shape doc values. In this case the binary doc value is very big, and MemoryIndex has a limit on the size it can store, so when it tries to store the value in MemoryIndex#storeDocValues it throws an error.
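The 32766-byte figure in the error comes from Lucene's internals: a value stored through BytesRefHash must fit in one byte block of BYTE_BLOCK_SIZE = 32768 bytes, minus a 2-byte length header. A quick sanity check of the sizes reported in this issue (the constants are Lucene's; the rest is plain arithmetic):

```python
# Lucene's byte-block size and the per-entry length header that together
# yield the "bytes can be at most 32766 in length" limit.
BYTE_BLOCK_SIZE = 32768  # org.apache.lucene.util.ByteBlockPool.BYTE_BLOCK_SIZE
LENGTH_HEADER = 2        # the value's length prefix occupies up to 2 bytes

MAX_BYTES_LENGTH = BYTE_BLOCK_SIZE - LENGTH_HEADER  # 32766, as in the error

# Doc-value sizes reported in this issue: one from the search response,
# one from the server-side stack trace.
reported = [35987, 75750]
for size in reported:
    print(f"{size} bytes exceeds the limit by {size - MAX_BYTES_LENGTH}")
```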

As a workaround, you just need to disable doc_values for the geo_shape field (granules.continent.placemark) in sr_product_polygon; the error should then be gone:

{
  "sr_product_polygon" : {
    "aliases" : { },
    "mappings" : {
      "dynamic_templates" : [
        {
          "doubles" : {
            "match_mapping_type" : "double",
            "mapping" : {
              "type" : "double"
            }
          }
        },
        {
          "texts" : {
            "match_mapping_type" : "string",
            "mapping" : {
              "type" : "keyword"
            }
          }
        }
      ],
      "properties" : {
        "@ingestDate" : {
          "type" : "date"
        },
        "granules" : {
          "properties" : {
            "continent" : {
              "properties" : {
                "ascendIds" : {
                  "type" : "keyword"
                },
                "code" : {
                  "type" : "keyword"
                },
                "id" : {
                  "type" : "keyword"
                },
                "name" : {
                  "type" : "keyword"
                },
                "nextId" : {
                  "type" : "keyword"
                },
                "passes" : {
                  "type" : "keyword"
                },
                "placemark" : {
                  "type" : "geo_shape",
                  "doc_values" : false
                },
                "previousId" : {
                  "type" : "keyword"
                },
                "tiles" : {
                  "type" : "keyword"
                },
                "toBeProcessed" : {
                  "type" : "boolean"
                },
                "type" : {
                  "type" : "keyword"
                }
              }
            }
          }
        },
        "id" : {
          "type" : "keyword"
        },
        "query" : {
          "type" : "percolator"
        },
        "uri" : {
          "type" : "keyword"
        },
        "valid" : {
          "type" : "boolean"
        }
      }
    },
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "provided_name" : "sr_product_polygon",
        "creation_date" : "1643799069677",
        "number_of_replicas" : "1",
        "uuid" : "20v18mGPRwa0Xdya8b4gGQ",
        "version" : {
          "created" : "7160399"
        }
      }
    }
  }
}

I need to dig more to see whether we can disable this automatically, as I don't think the doc values are needed here.
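Note that doc_values cannot be changed on an existing field, so applying this workaround to an already-created index means creating a new index with the corrected mapping and reindexing into it. A sketch in the same console style as the requests above (the name sr_product_polygon_v2 is made up for illustration):

```
PUT sr_product_polygon_v2
{
  "mappings": {
    ... same mappings as above, with "doc_values": false on granules.continent.placemark ...
  }
}

POST _reindex
{
  "source": { "index": "sr_product_polygon" },
  "dest":   { "index": "sr_product_polygon_v2" }
}
```

After reindexing, point clients (or an alias) at the new index.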

@iverase
Contributor

iverase commented Feb 4, 2022

It turns out that this is probably a Lucene bug, so I opened https://issues.apache.org/jira/browse/LUCENE-10405 to address it.

@iverase
Contributor

iverase commented Feb 7, 2022

The issue has been fixed upstream, and the fix will be released in Lucene 9.1. For the time being, the workaround is to disable doc values.

As there is nothing more to do here, I hope you don't mind if I close the issue.

Thanks for reporting!

@iverase iverase closed this as completed Feb 7, 2022