From 5434828c42c45647f59e63dd37a786d77c044283 Mon Sep 17 00:00:00 2001
From: Max Hniebergall <137079448+maxhniebergall@users.noreply.github.com>
Date: Fri, 22 Nov 2024 10:35:49 -0500
Subject: [PATCH 1/3] Update ml-nlp-limitations.asciidoc

---
 docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc b/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc
index 8673cdb19..ce6031e0b 100644
--- a/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc
+++ b/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc
@@ -9,6 +9,15 @@
 The following limitations and known problems apply to the {version} release of
 the Elastic {nlp} trained models feature.
 
+[discrete]
+[[ml-nlp-large-documents-limit-10k-10mb]]
+== Semantic text fields are limited at 10k chunks, limiting ingested document size to under ~10MB
+
+When using semantic text to ingest documents chunking takes place automatically. The number
+of chunks is limited by the cluster setting (index.mapping.nested_objects.limit)[https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-settings-limit.html]
+which defaults to 10k. This means that documents which are too large will cause errors during
+ingestion. To avoid this issue, please split your documents into roughly 1MB parts before ingestion.
+
 [discrete]
 [[ml-nlp-elser-v1-limit-512]]
 == ELSER semantic search is limited to 512 tokens per field that inference is applied to
@@ -17,4 +26,4 @@ When you use ELSER for semantic search, only the first 512 extracted tokens
 from each field of the ingested documents that ELSER is applied to are taken
 into account for the search process. If your data set contains long documents,
 divide them into smaller segments before ingestion if you need the full text to be
-searchable.
\ No newline at end of file
+searchable.

From 44f0edc5360235b11ad4a0248bb8d14cebcfc354 Mon Sep 17 00:00:00 2001
From: Max Hniebergall <137079448+maxhniebergall@users.noreply.github.com>
Date: Fri, 22 Nov 2024 11:20:10 -0500
Subject: [PATCH 2/3] change link

---
 docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc b/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc
index ce6031e0b..b9c32ea86 100644
--- a/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc
+++ b/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc
@@ -14,7 +14,8 @@
 == Semantic text fields are limited at 10k chunks, limiting ingested document size to under ~10MB
 
 When using semantic text to ingest documents chunking takes place automatically. The number
-of chunks is limited by the cluster setting (index.mapping.nested_objects.limit)[https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-settings-limit.html]
+of chunks is limited by the cluster setting index.mapping.nested_objects.limit
+https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-settings-limit.html
 which defaults to 10k. This means that documents which are too large will cause errors during
 ingestion. To avoid this issue, please split your documents into roughly 1MB parts before ingestion.
 
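A note on the setting these patches link to: the effective value of `index.mapping.nested_objects.limit` can be checked before deciding whether documents need splitting. Below is a minimal sketch using the official Python client; the cluster URL and the index name `my-index` are illustrative assumptions, not part of the docs change.

[source,python]
----
from elasticsearch import Elasticsearch

# Assumption: a locally reachable cluster and an existing index named "my-index".
es = Elasticsearch("http://localhost:9200")

# include_defaults=True returns default values alongside any explicit settings,
# so the nested-objects limit is visible even when it was never overridden.
resp = es.indices.get_settings(index="my-index", include_defaults=True)

defaults = resp["my-index"]["defaults"]
print(defaults["index"]["mapping"]["nested_objects"]["limit"])  # "10000" unless raised
----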
From 336026e04a18886de84426473152ba84fc8ce5e2 Mon Sep 17 00:00:00 2001
From: Max Hniebergall <137079448+maxhniebergall@users.noreply.github.com>
Date: Wed, 27 Nov 2024 13:17:12 -0500
Subject: [PATCH 3/3] Update docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
---
 docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc b/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc
index b9c32ea86..e505bb63b 100644
--- a/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc
+++ b/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc
@@ -11,13 +11,9 @@ the Elastic {nlp} trained models feature.
 
 [discrete]
 [[ml-nlp-large-documents-limit-10k-10mb]]
-== Semantic text fields are limited at 10k chunks, limiting ingested document size to under ~10MB
+== Document size limitations when using `semantic_text` fields
 
-When using semantic text to ingest documents chunking takes place automatically. The number
-of chunks is limited by the cluster setting index.mapping.nested_objects.limit
-https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-settings-limit.html
-which defaults to 10k. This means that documents which are too large will cause errors during
-ingestion. To avoid this issue, please split your documents into roughly 1MB parts before ingestion.
+When using semantic text to ingest documents, chunking takes place automatically. The number of chunks is limited by the {ref}/mapping-settings-limit.html[`index.mapping.nested_objects.limit`] cluster setting, which defaults to 10k. Documents that are too large will cause errors during ingestion. To avoid this issue, please split your documents into roughly 1MB parts before ingestion.
 
 [discrete]
 [[ml-nlp-elser-v1-limit-512]]
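Since both limitations end in the same advice (split oversized documents before ingestion), a short sketch may help readers act on it. This assumes the official Python client, a local cluster, and an index named `my-semantic-index` whose `content` field accepts the text; the splitting is paragraph-aligned and all names are illustrative only, not part of the docs change.

[source,python]
----
from elasticsearch import Elasticsearch, helpers

MAX_PART_BYTES = 1_000_000  # roughly 1MB per part, per the guidance above

def split_document(text, max_bytes=MAX_PART_BYTES):
    """Yield parts of `text` of at most ~max_bytes, split on paragraph breaks.

    Assumes individual paragraphs are themselves smaller than max_bytes.
    """
    part, size = [], 0
    for paragraph in text.split("\n\n"):
        p_size = len(paragraph.encode("utf-8")) + 2  # +2 for the rejoined "\n\n"
        if part and size + p_size > max_bytes:
            yield "\n\n".join(part)
            part, size = [], 0
        part.append(paragraph)
        size += p_size
    if part:
        yield "\n\n".join(part)

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

with open("large_document.txt") as f:
    text = f.read()

# Index each part as its own document so no single document exceeds the
# ~10k chunk ceiling (or, for ELSER, loses text past the 512-token window).
helpers.bulk(
    es,
    (
        {
            "_index": "my-semantic-index",
            "_id": f"large-doc-part-{i}",
            "_source": {"content": part},
        }
        for i, part in enumerate(split_document(text))
    ),
)
----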