From 2a85b0bf374b75f1288ab34e8e8acbba45dac91e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?=
Date: Wed, 18 Sep 2024 10:10:06 +0200
Subject: [PATCH 1/3] [DOCS] Gives more details to the load data step of the semantic search tutorials.

---
 .../semantic-search-elser.asciidoc         | 23 ++++++++++++++-----
 .../semantic-search-inference.asciidoc     | 16 +++++++------
 .../semantic-search-semantic-text.asciidoc | 16 +++++++------
 3 files changed, 35 insertions(+), 20 deletions(-)

diff --git a/docs/reference/search/search-your-data/semantic-search-elser.asciidoc b/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
index 11aec59a00b30..424b1fd30dcfe 100644
--- a/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
+++ b/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
@@ -120,12 +120,12 @@ IMPORTANT: The `msmarco-passagetest2019-top1000` dataset was not utilized to tra
 It is only used in this tutorial as a sample dataset that is easily accessible for demonstration purposes.
 You can use a different data set to test the workflow and become familiar with it.
 
-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI.
-Assign the name `id` to the first column and `content` to the second column.
-The index name is `test-data`.
-Once the upload is complete, you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+Once the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 [discrete]
 [[reindexing-data-elser]]
@@ -161,6 +161,17 @@ GET _tasks/
 
 You can also open the Trained Models UI, select the Pipelines tab under ELSER to follow the progress.
 
+Following this tutorial, you can also cancel the reindexing process if you don't want to wait until the reindexing process is fully complete which might take hours for large data sets.
+You can test the feature even if you reindex only a subset of the data set - a few thousand data points for example - and generate embeddings for the subset.
+The following API request will cancel the reindexing task:
+
+[source,console]
+----
+POST _tasks//_cancel
+----
+// TEST[skip:TBD]
+
+
 [discrete]
 [[text-expansion-query]]
 ==== Semantic search by using the `sparse_vector` query
diff --git a/docs/reference/search/search-your-data/semantic-search-inference.asciidoc b/docs/reference/search/search-your-data/semantic-search-inference.asciidoc
index a92a62cf46d67..63631d09e81a3 100644
--- a/docs/reference/search/search-your-data/semantic-search-inference.asciidoc
+++ b/docs/reference/search/search-your-data/semantic-search-inference.asciidoc
@@ -68,12 +68,12 @@ It consists of 200 queries, each accompanied by a list of relevant text passages
 All unique passages, along with their IDs, have been extracted from that data set and compiled into a
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].
 
-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI.
-Assign the name `id` to the first column and `content` to the second column.
-The index name is `test-data`.
-Once the upload is complete, you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+Once the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 [discrete]
 [[reindexing-data-infer]]
@@ -92,7 +92,9 @@ GET _tasks/
 ----
 // TEST[skip:TBD]
 
-You can also cancel the reindexing process if you don't want to wait until the reindexing process is fully complete which might take hours for large data sets:
+Following this tutorial, you can also cancel the reindexing process if you don't want to wait until the reindexing process is fully complete which might take hours for large data sets.
+You can test the feature even if you reindex only a subset of the data set - a few thousand data points for example - and generate embeddings for the subset.
+The following API request will cancel the reindexing task:
 
 [source,console]
 ----
diff --git a/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc b/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc
index e2cc2d8c62219..2bf1c2dfa46ae 100644
--- a/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc
+++ b/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc
@@ -96,11 +96,12 @@ a list of relevant text passages.
 All unique passages, along with their IDs, have been extracted from that data set and compiled into a
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].
 
-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI. Assign the name `id` to the first column and `content` to
-the second column. The index name is `test-data`. Once the upload is complete,
-you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+Once the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 
 [discrete]
@@ -137,8 +138,9 @@ GET _tasks/
 ------------------------------------------------------------
 // TEST[skip:TBD]
 
-It is recommended to cancel the reindexing process if you don't want to wait
-until it is fully complete which might take a long time for an inference endpoint with few assigned resources:
+Following this tutorial, it is recommended to cancel the reindexing process if you don't want to wait until it is fully complete which might take a long time for an inference endpoint with few assigned resources.
+You can test the feature even if you reindex only a subset of the data set - a few thousand data points for example - and generate embeddings for the subset.
+The following API request will cancel the reindexing task:
 
 [source,console]
 ------------------------------------------------------------

From 6eace80a75e3c7da65025e1058e884f037d6ee01 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?=
Date: Wed, 18 Sep 2024 10:44:22 +0200
Subject: [PATCH 2/3] Apply suggestions from code review

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
---
 .../search-your-data/semantic-search-elser.asciidoc | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/docs/reference/search/search-your-data/semantic-search-elser.asciidoc b/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
index 424b1fd30dcfe..7ec072089d11f 100644
--- a/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
+++ b/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
@@ -117,15 +117,15 @@ All unique passages, along with their IDs, have been extracted from that data se
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].
 
 IMPORTANT: The `msmarco-passagetest2019-top1000` dataset was not utilized to train the model.
-It is only used in this tutorial as a sample dataset that is easily accessible for demonstration purposes.
+We use this sample dataset in the tutorial because it is easily accessible for demonstration purposes.
 You can use a different data set to test the workflow and become familiar with it.
 
-Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[File Uploader] in the UI.
 After your data is analyzed, click **Override settings**.
 Under **Edit field names**, assign `id` to the first column and `content` to the second.
 Click **Apply**, then **Import**.
 Name the index `test-data`, and click **Import**.
-Once the upload is complete, you will see an index named `test-data` with 182,469 documents.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 [discrete]
 [[reindexing-data-elser]]
@@ -161,8 +161,9 @@ GET _tasks/
 
 You can also open the Trained Models UI, select the Pipelines tab under ELSER to follow the progress.
 
-Following this tutorial, you can also cancel the reindexing process if you don't want to wait until the reindexing process is fully complete which might take hours for large data sets.
-You can test the feature even if you reindex only a subset of the data set - a few thousand data points for example - and generate embeddings for the subset.
+Reindexing large datasets can take a long time. You can test this workflow using
+only a subset of the dataset. Do this by cancelling the reindexing process, and
+only generating embeddings for the subset that was reindexed.
 The following API request will cancel the reindexing task:
 
 [source,console]

From 78c3408812230e3626e4d9d308005e68d0c07c8a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?=
Date: Wed, 18 Sep 2024 10:46:26 +0200
Subject: [PATCH 3/3] [DOCS] Addresses feedback.

---
 .../search/search-your-data/semantic-search-elser.asciidoc | 6 +++---
 .../search-your-data/semantic-search-inference.asciidoc    | 7 ++++---
 .../semantic-search-semantic-text.asciidoc                 | 7 ++++---
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/docs/reference/search/search-your-data/semantic-search-elser.asciidoc b/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
index 7ec072089d11f..5309b24fa37c9 100644
--- a/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
+++ b/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
@@ -161,9 +161,9 @@ GET _tasks/
 
 You can also open the Trained Models UI, select the Pipelines tab under ELSER to follow the progress.
 
-Reindexing large datasets can take a long time. You can test this workflow using
-only a subset of the dataset. Do this by cancelling the reindexing process, and
-only generating embeddings for the subset that was reindexed.
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
 The following API request will cancel the reindexing task:
 
 [source,console]
diff --git a/docs/reference/search/search-your-data/semantic-search-inference.asciidoc b/docs/reference/search/search-your-data/semantic-search-inference.asciidoc
index 63631d09e81a3..0abc44c809d08 100644
--- a/docs/reference/search/search-your-data/semantic-search-inference.asciidoc
+++ b/docs/reference/search/search-your-data/semantic-search-inference.asciidoc
@@ -73,7 +73,7 @@ After your data is analyzed, click **Override settings**.
 Under **Edit field names**, assign `id` to the first column and `content` to the second.
 Click **Apply**, then **Import**.
 Name the index `test-data`, and click **Import**.
-Once the upload is complete, you will see an index named `test-data` with 182,469 documents.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 [discrete]
 [[reindexing-data-infer]]
@@ -92,8 +92,9 @@ GET _tasks/
 ----
 // TEST[skip:TBD]
 
-Following this tutorial, you can also cancel the reindexing process if you don't want to wait until the reindexing process is fully complete which might take hours for large data sets.
-You can test the feature even if you reindex only a subset of the data set - a few thousand data points for example - and generate embeddings for the subset.
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
 The following API request will cancel the reindexing task:
 
 [source,console]
 ----
diff --git a/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc b/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc
index 2bf1c2dfa46ae..709d17091164c 100644
--- a/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc
+++ b/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc
@@ -101,7 +101,7 @@ After your data is analyzed, click **Override settings**.
 Under **Edit field names**, assign `id` to the first column and `content` to the second.
 Click **Apply**, then **Import**.
 Name the index `test-data`, and click **Import**.
-Once the upload is complete, you will see an index named `test-data` with 182,469 documents.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 
 [discrete]
@@ -138,8 +138,9 @@ GET _tasks/
 ------------------------------------------------------------
 // TEST[skip:TBD]
 
-Following this tutorial, it is recommended to cancel the reindexing process if you don't want to wait until it is fully complete which might take a long time for an inference endpoint with few assigned resources.
-You can test the feature even if you reindex only a subset of the data set - a few thousand data points for example - and generate embeddings for the subset.
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
 The following API request will cancel the reindexing task:
 
 [source,console]