From 217566da3e425d050a2a093907ceea51fff108cf Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?= Date: Wed, 5 Feb 2025 13:50:27 +0100 Subject: [PATCH 1/6] [E&A] Refined DFA advanced concepts. --- .../data-frame-analytics/ml-dfa-concepts.md | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-concepts.md b/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-concepts.md index 386c19e116..8d4b7f023e 100644 --- a/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-concepts.md +++ b/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-concepts.md @@ -16,13 +16,3 @@ This section explains the more complex concepts of the Elastic {{ml}} {dfanalyti * [Loss functions for {{regression}} analyses](dfa-regression-lossfunction.md) * [Hyperparameter optimization](hyperparameters.md) * [Trained models](ml-trained-models.md) - - - - - - - - - - From ef0ee666dec651d9da224ac600b552f55f072c83 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?= Date: Wed, 5 Feb 2025 13:54:06 +0100 Subject: [PATCH 2/6] [E&A] Refines DFA phases and DFA at scale. --- .../data-frame-analytics/ml-dfa-phases.md | 19 ++++++------------- .../data-frame-analytics/ml-dfa-scale.md | 9 --------- 2 files changed, 6 insertions(+), 22 deletions(-) diff --git a/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-phases.md b/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-phases.md index 9d8b575869..c216b34a80 100644 --- a/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-phases.md +++ b/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-phases.md @@ -8,7 +8,6 @@ mapped_pages: # How data frame analytics jobs work [ml-dfa-phases] - A {{dfanalytics-job}} is essentially a persistent {{es}} task. 
During its life cycle, it goes through four or five main phases depending on the analysis type: * reindexing, @@ -19,20 +18,17 @@ A {{dfanalytics-job}} is essentially a persistent {{es}} task. During its life c Let’s take a look at the phases one-by-one. - -## Reindexing [ml-dfa-phases-reindex] +## Reindexing [ml-dfa-phases-reindex] During the reindexing phase the documents from the source index or indices are copied to the destination index. If you want to define settings or mappings, create the index before you start the job. Otherwise, the job creates it using default settings. Once the destination index is built, the {{dfanalytics-job}} task calls the {{es}} [Reindex API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html) to launch the reindexing task. - -## Loading data [ml-dfa-phases-load] +## Loading data [ml-dfa-phases-load] After the reindexing is finished, the job fetches the needed data from the destination index. It converts the data into the format that the analysis process expects, then sends it to the analysis process. - -## Analyzing [ml-dfa-phases-analyze] +## Analyzing [ml-dfa-phases-analyze] In this phase, the job generates a {{ml}} model for analyzing the data. The specific phases of analysis vary depending on the type of {{dfanalytics-job}}. @@ -45,15 +41,12 @@ In this phase, the job generates a {{ml}} model for analyzing the data. The spec 3. `fine_tuning_parameters`: Identifies final values for undefined hyperparameters. See [hyperparameter optimization](hyperparameters.md). 4. `final_training`: Trains the {{ml}} model. - -## Writing results [ml-dfa-phases-write] +## Writing results [ml-dfa-phases-write] After the loaded data is analyzed, the analysis process sends back the results. Only the additional fields that the analysis calculated are written back, the ones that have been loaded in the loading data phase are not. 
The {{dfanalytics-job}} matches the results with the data rows in the destination index, merges them, and indexes them back to the destination index. - -## {{infer-cap}} [ml-dfa-phases-inference] +## {{infer-cap}} [ml-dfa-phases-inference] This phase exists only for {{regression}} and {{classification}} jobs. In this phase, the job validates the trained model against the test split of the data set. -Finally, after all phases are completed, the task is marked as completed and the {{dfanalytics-job}} stops. Your data is ready to be evaluated. - +Finally, after all phases are completed, the task is marked as completed and the {{dfanalytics-job}} stops. Your data is ready to be evaluated. \ No newline at end of file diff --git a/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-scale.md b/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-scale.md index fa3a2be29a..685a8876bf 100644 --- a/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-scale.md +++ b/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-scale.md @@ -22,14 +22,12 @@ It is important to note that there is a correlation between the training time, t The following recommendations are not sequential – the numbers just help to navigate between the list items; you can take action on one or more of them in any order. - ## 0. Start small and iterate rapidly [rapid-iteration] Training is an iterative process. Experiment with different settings and configuration options (including but not limited to hyperparameters and feature importance), then evaluate the results and decide whether they are good enough or need further experimentation. Every iteration takes time, so it is useful to start with a small set of data so you can iterate rapidly and then build up from here. - ## 1. Set a small training percent [small-training-percent] (This step only applies to {{regression}} and {{classification}} jobs.) 
@@ -38,7 +36,6 @@ The number of documents used for training a model has an effect on the training Consider starting with a small percentage of training data so you can complete iterations more quickly. Once you are happy with your configuration, increase the training percent. As a rule of thumb, if you have a data set with more than 100,000 data points, start with a training percent of 5 or 10. - ## 2. Disable {{feat-imp}} calculation [disable-feature-importance] (This step only applies to {{regression}} and {{classification}} jobs.) @@ -47,7 +44,6 @@ Consider starting with a small percentage of training data so you can complete i For a shorter runtime, consider disabling {{feat-imp}} for some or all iterations if you do not require it. - ## 3. Optimize the number of included fields [optimize-included-fields] You can speed up runtime by only analyzing relevant fields. @@ -58,8 +54,6 @@ By default, all the fields that are supported by the analysis type are included {{feat-imp-cap}} can help you determine the fields that contribute most to the prediction. However, as calculating {{feat-imp}} increases training time, this is a trade-off that can be evaluated during an iterative training process. :::: - - ## 4. Increase the maximum number of threads [increase-threads] You can set the maximum number of threads that are used during the analysis. The default value of `max_num_threads` is 1. Depending on the characteristics of the data, using more threads may decrease the training time at the cost of increased CPU usage. Note that trying to use more threads than the number of CPU cores has no advantage. @@ -72,15 +66,12 @@ To learn more about the individual phases, refer to [How {{dfanalytics-jobs}} wo If your {{ml}} nodes are running concurrent {{anomaly-detect}} or {{dfanalytics-jobs}}, then you may want to keep the maximum number of threads set to a low number – for example the default 1 – to prevent jobs competing for resources. :::: - - ## 5. 
Optimize the size of the source index [optimize-source-index] Even if the training percent is low, reindexing the source index – which is a mandatory step in the job creation process – may take a long time. During reindexing, the documents from the source index or indices are copied to the destination index, so you have a static copy of the analyzed data. If your data is large and you do not need to test and train on the whole source index or indices, then reduce the cost of reindexing by using a subset of your source data. This can be done by either defining a filter for the source index in the {{dfanalytics-job}} configuration, or by manually reindexing a subset of this data to use as an alternate source index. - ## 6. Configure hyperparameters [configure-hyperparameters] (This step only applies to {{regression}} and {{classification}} jobs.) From 865afbb23265f3dcff5d750e8a813fe7f29373c2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?= Date: Wed, 5 Feb 2025 13:56:58 +0100 Subject: [PATCH 3/6] [E&A] Refines feature encoding and custom URLs. --- .../data-frame-analytics/ml-dfa-custom-urls.md | 3 +-- .../data-frame-analytics/ml-feature-encoding.md | 1 - 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-custom-urls.md b/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-custom-urls.md index 7c4fa27788..bf6d9ef83b 100644 --- a/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-custom-urls.md +++ b/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-custom-urls.md @@ -21,7 +21,6 @@ When you create or edit an {{dfanalytics-job}} in {{kib}}, it simplifies the cre For each custom URL, you must supply a label. You can also optionally supply a time range. When you link to **Discover** or a {{kib}} dashboard, you’ll have additional options for specifying the pertinent {{data-source}} or dashboard name and query entities. 
- ## String substitution in custom URLs [ml-dfa-url-strings] You can use dollar sign ($) delimited tokens in a custom URL. These tokens are substituted for the values of the corresponding fields in the result index. For example, a custom URL might resolve to `discover#/?_g=(time:(from:'$earliest$',mode:absolute,to:'$latest$'))&_a=(filters:!(),index:'4b899bcb-fb10-4094-ae70-207d43183ffc',query:(language:kuery,query:'Carrier:"$Carrier$"'))`. In this case, the pertinent value of the `Carrier` field is passed to the target page when you click the link. @@ -30,7 +29,6 @@ You can use dollar sign ($) delimited tokens in a custom URL. These tokens are s When you create your custom URL in {{kib}}, the **Query entities** option is shown only when there are appropriate fields in the index. :::: - The `$earliest$` and `$latest$` tokens pass the beginning and end of the time span of the data to the target page. The tokens are substituted with date-time strings in ISO-8601 format. For example, the following API updates a job to add a custom URL that uses `$earliest$` and `$latest$` tokens: ```console @@ -51,6 +49,7 @@ POST _ml/data_frame/analytics/flight-delay-regression/_update When you click this custom URL, it opens up the **Discover** page and displays source data for the period one hour before and after the date of the default global settings. ::::{tip} + * The custom URL links use pop-ups. You must configure your web browser so that it does not block pop-up windows or create an exception for your {{kib}} URL. * When creating a link to a {{kib}} dashboard, the URLs for dashboards can be very long. Be careful of typos, end of line characters, and URL encoding. Also ensure you use the appropriate index ID for the target {{kib}} {data-source}. * The dates substituted for `$earliest$` and `$latest$` tokens are in ISO-8601 format and the target system must understand this format. 
diff --git a/explore-analyze/machine-learning/data-frame-analytics/ml-feature-encoding.md b/explore-analyze/machine-learning/data-frame-analytics/ml-feature-encoding.md index b511a1743a..e6f0feb6a3 100644 --- a/explore-analyze/machine-learning/data-frame-analytics/ml-feature-encoding.md +++ b/explore-analyze/machine-learning/data-frame-analytics/ml-feature-encoding.md @@ -16,4 +16,3 @@ mapped_pages: When the model makes predictions on new data, the data needs to be processed in the same way it was trained. {{ml-cap}} model inference in the {{stack}} does this automatically, so the automatically applied encodings are used in each call for inference. Refer to {{infer}} for [{{classification}}](ml-dfa-classification.md#ml-inference-class) and [{{regression}}](ml-dfa-regression.md#ml-inference-reg). [{{feat-imp-cap}}](ml-feature-importance.md) is calculated for the original categorical fields, not the automatically encoded features. - From ff4fdc8148e6e3d7bd4f4eb0442213edb6bceba8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?= Date: Wed, 5 Feb 2025 14:00:07 +0100 Subject: [PATCH 4/6] [E&A] Refines hyperparameter optimization and loss functions. 
--- .../data-frame-analytics/dfa-regression-lossfunction.md | 4 +--- .../data-frame-analytics/hyperparameters.md | 4 +--- .../data-frame-analytics/ml-feature-importance.md | 2 -- .../data-frame-analytics/ml-feature-processors.md | 9 --------- 4 files changed, 2 insertions(+), 17 deletions(-) diff --git a/explore-analyze/machine-learning/data-frame-analytics/dfa-regression-lossfunction.md b/explore-analyze/machine-learning/data-frame-analytics/dfa-regression-lossfunction.md index d508fd5480..3562cfa927 100644 --- a/explore-analyze/machine-learning/data-frame-analytics/dfa-regression-lossfunction.md +++ b/explore-analyze/machine-learning/data-frame-analytics/dfa-regression-lossfunction.md @@ -19,8 +19,6 @@ You can specify the loss function to be used during {{reganalysis}} when you cre Consult [the Jupyter notebook on regression loss functions](https://github.com/elastic/examples/tree/master/Machine%20Learning/Regression%20Loss%20Functions) to learn more. -::::{tip} +::::{tip} The default loss function parameter values work fine for most of the cases. It is highly recommended to use the default values, unless you fully understand the impact of the different loss function parameters. :::: - - diff --git a/explore-analyze/machine-learning/data-frame-analytics/hyperparameters.md b/explore-analyze/machine-learning/data-frame-analytics/hyperparameters.md index ce532e1c25..36f8fdda6b 100644 --- a/explore-analyze/machine-learning/data-frame-analytics/hyperparameters.md +++ b/explore-analyze/machine-learning/data-frame-analytics/hyperparameters.md @@ -13,8 +13,6 @@ You can view the hyperparameter values that were ultimately chosen by expanding Different hyperparameters may affect the model performance to a different degree. To estimate the importance of the optimized hyperparameters, analysis of variance decomposition is used. The resulting `absolute importance` shows how much the variation of a hyperparameter impacts the variation in the validation loss. 
Additionally, `relative importance` is also computed which gives the importance of the hyperparameter compared to the rest of the tuneable hyperparameters. The sum of all relative importances is 1. You can check these results in the response of the [get {{dfanalytics-job}} stats API](https://www.elastic.co/guide/en/elasticsearch/reference/current/get-dfanalytics-stats.html). -::::{tip} +::::{tip} Unless you fully understand the purpose of a hyperparameter, it is highly recommended that you leave it unset and allow hyperparameter optimization to occur. :::: - - diff --git a/explore-analyze/machine-learning/data-frame-analytics/ml-feature-importance.md b/explore-analyze/machine-learning/data-frame-analytics/ml-feature-importance.md index 922b9619da..f6463c4642 100644 --- a/explore-analyze/machine-learning/data-frame-analytics/ml-feature-importance.md +++ b/explore-analyze/machine-learning/data-frame-analytics/ml-feature-importance.md @@ -40,5 +40,3 @@ For {{classanalysis}}, the sum of the {{feat-imp}} values approximates the predi By default, {{feat-imp}} values are not calculated. To generate this information, when you create a {{dfanalytics-job}} you must specify the `num_top_feature_importance_values` property. For example, see [Performing {{reganalysis}} in the sample flight data set](ml-dfa-regression.md#performing-regression) and [Performing {{classanalysis}} in the sample flight data set](ml-dfa-classification.md#performing-classification). The {{feat-imp}} values are stored in the {{ml}} results field for each document in the destination index. The number of {{feat-imp}} values for each document might be less than the `num_top_feature_importance_values` property value. For example, it returns only features that had a positive or negative effect on the prediction. 
- - diff --git a/explore-analyze/machine-learning/data-frame-analytics/ml-feature-processors.md b/explore-analyze/machine-learning/data-frame-analytics/ml-feature-processors.md index 89bae43d25..2c4ea1f2a3 100644 --- a/explore-analyze/machine-learning/data-frame-analytics/ml-feature-processors.md +++ b/explore-analyze/machine-learning/data-frame-analytics/ml-feature-processors.md @@ -4,11 +4,8 @@ mapped_pages: - https://www.elastic.co/guide/en/machine-learning/current/ml-feature-processors.html --- - - # Feature processors [ml-feature-processors] - {{dfanalytics-cap}} automatically includes a [Feature encoding](ml-feature-encoding.md) phase, which transforms categorical features into numerical ones. If you want to have more control over the encoding methods that are used for specific fields, however, you can define feature processors. If there are any remaining categorical features after your processors run, they are addressed in the automatic feature encoding phase. The feature processors that you defined are the part of the analytics process, when data comes through the aggregation or pipeline, the processors run against the new data. The resulting features are ephemeral; they are not stored in the index. This provides a mechanism to create features that can be used at search and ingest time and don’t take up space in the index. @@ -22,9 +19,3 @@ Available feature processors: * [n-gram encoding](https://www.elastic.co/guide/en/machine-learning/current/ngram-encoding.html) * [One hot encoding](https://www.elastic.co/guide/en/machine-learning/current/one-hot-encoding.html) * [Target mean encoding](https://www.elastic.co/guide/en/machine-learning/current/target-mean-encoding.html) - - - - - - From 47997ed7f40366c4ee841f64be3319f0eb29bfab Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?= Date: Wed, 5 Feb 2025 14:04:53 +0100 Subject: [PATCH 5/6] [E&A] Refines trained models page. 
--- .../data-frame-analytics/ml-trained-models.md | 61 +++++++++---------- 1 file changed, 28 insertions(+), 33 deletions(-) diff --git a/explore-analyze/machine-learning/data-frame-analytics/ml-trained-models.md b/explore-analyze/machine-learning/data-frame-analytics/ml-trained-models.md index 0912c9a377..69639ecfc4 100644 --- a/explore-analyze/machine-learning/data-frame-analytics/ml-trained-models.md +++ b/explore-analyze/machine-learning/data-frame-analytics/ml-trained-models.md @@ -11,33 +11,32 @@ In {{kib}}, you can view and manage your trained models in **{{stack-manage-app} Alternatively, you can use APIs like [get trained models](https://www.elastic.co/guide/en/elasticsearch/reference/current/get-trained-models.html) and [delete trained models](https://www.elastic.co/guide/en/elasticsearch/reference/current/delete-trained-models.html). - ## Deploying trained models [deploy-dfa-trained-models] - ### Models trained by {{dfanalytics}} [_models_trained_by_dfanalytics] 1. To deploy {{dfanalytics}} model in a pipeline, navigate to **Machine Learning** > **Model Management** > **Trained models** in the main menu, or use the [global search field](../../overview/kibana-quickstart.md#_finding_your_apps_and_objects) in {{kib}}. + 2. Find the model you want to deploy in the list and click **Deploy model** in the **Actions** menu. - :::{image} ../../../images/machine-learning-ml-dfa-trained-models-ui.png - :alt: The trained models UI in {kib} - :class: screenshot - ::: +:::{image} ../../../images/machine-learning-ml-dfa-trained-models-ui.png +:alt: The trained models UI in {kib} +:class: screenshot +::: 3. Create an {{infer}} pipeline to be able to use the model against new data through the pipeline. Add a name and a description or use the default values. 
- :::{image} ../../../images/machine-learning-ml-dfa-inference-pipeline.png - :alt: Creating an inference pipeline - :class: screenshot - ::: +:::{image} ../../../images/machine-learning-ml-dfa-inference-pipeline.png +:alt: Creating an inference pipeline +:class: screenshot +::: 4. Configure the pipeline processors or use the default settings. - :::{image} ../../../images/machine-learning-ml-dfa-inference-processor.png - :alt: Configuring an inference processor - :class: screenshot - ::: +:::{image} ../../../images/machine-learning-ml-dfa-inference-processor.png +:alt: Configuring an inference processor +:class: screenshot +::: 5. Configure to handle ingest failures or use the default settings. 6. (Optional) Test your pipeline by running a simulation of the pipeline to confirm it produces the anticipated results. @@ -45,76 +44,72 @@ Alternatively, you can use APIs like [get trained models](https://www.elastic.co The model is deployed and ready to use through the {{infer}} pipeline. - ### Models trained by other methods [_models_trained_by_other_methods] You can also supply trained models that are not created by {{dfanalytics-job}} but adhere to the appropriate [JSON schema](https://github.com/elastic/ml-json-schemas). Likewise, you can use third-party models to perform natural language processing (NLP) tasks. If you want to use these trained models in the {{stack}}, you must store them in {{es}} documents by using the [create trained models API](https://www.elastic.co/guide/en/elasticsearch/reference/current/put-trained-models.html). For more information about NLP models, refer to [*Deploy trained models*](../nlp/ml-nlp-deploy-models.md). - ## Exporting and importing models [export-import] Models trained in Elasticsearch are portable and can be transferred between clusters. This is particularly useful when models are trained in isolation from the cluster where they are used for inference. 
The following instructions show how to use [`curl`](https://curl.se/) and [`jq`](https://stedolan.github.io/jq/) to export a model as JSON and import it to another cluster. 1. Given a model *name*, find the model *ID*. You can use `curl` to call the [get trained model API](https://www.elastic.co/guide/en/elasticsearch/reference/current/get-trained-models.html) to list all models with their IDs. - ```bash +```bash curl -s -u username:password \ -X GET "http://localhost:9200/_ml/trained_models" \ | jq . -C \ | more - ``` +``` If you want to show just the model IDs available, use `jq` to select a subset. - ```bash +```bash curl -s -u username:password \ -X GET "http://localhost:9200/_ml/trained_models" \ | jq -C -r '.trained_model_configs[].model_id' - ``` +``` - ```bash +```bash flights1-1607953694065 flights0-1607953585123 lang_ident_model_1 - ``` +``` In this example, you are exporting the model with ID `flights1-1607953694065`. 2. Using `curl` from the command line, again use the [get trained models API](https://www.elastic.co/guide/en/elasticsearch/reference/current/get-trained-models.html) to export the entire model definition and save it to a JSON file. - ```bash +```bash curl -u username:password \ -X GET "http://localhost:9200/_ml/trained_models/flights1-1607953694065?exclude_generated=true&include=definition&decompress_definition=false" \ | jq '.trained_model_configs[0] | del(.model_id)' \ > flights1.json - ``` +``` - A few observations: +A few observations: - * Exporting models requires using `curl` or a similar tool that can **stream** the model over HTTP into a file. If you use the {{kib}} Console, the browser might be unresponsive due to the size of exported models. - * Note the query parameters that are used during export. These parameters are necessary to export the model in a way that it can later be imported again and used for inference. - * You must unnest the JSON object by one level to extract just the model definition. 
You must also remove the existing model ID in order to not have ID collisions when you import again. You can do these steps using `jq` inline or alternatively it can be done to the resulting JSON file after downloading using `jq` or other tools. + * Exporting models requires using `curl` or a similar tool that can **stream** the model over HTTP into a file. If you use the {{kib}} Console, the browser might be unresponsive due to the size of exported models. + * Note the query parameters that are used during export. These parameters are necessary to export the model in a way that it can later be imported again and used for inference. + * You must unnest the JSON object by one level to extract just the model definition. You must also remove the existing model ID in order to not have ID collisions when you import again. You can do these steps using `jq` inline or alternatively it can be done to the resulting JSON file after downloading using `jq` or other tools. 3. Import the saved model using `curl` to upload the JSON file to the [created trained model API](https://www.elastic.co/guide/en/elasticsearch/reference/current/put-trained-models.html). When you specify the URL, you can also set the model ID to something new using the last path part of the URL. - ```bash +```bash curl -u username:password \ -H 'Content-Type: application/json' \ -X PUT "http://localhost:9200/_ml/trained_models/flights1-imported" \ --data-binary @flights1.json - ``` - +``` ::::{note} + * Models exported from the [get trained models API](https://www.elastic.co/guide/en/elasticsearch/reference/current/get-trained-models.html) are limited in size by the [http.max_content_length](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html) global configuration value in {{es}}. The default value is `100mb` and may need to be increased depending on the size of model being exported. 
* Connection timeouts can occur, for example, when model sizes are very large or your cluster is under load. If needed, you can increase [timeout configurations](https://ec.haxx.se/usingcurl/usingcurl-timeouts) for `curl` (for example, `curl --max-time 600`) or your client of choice. :::: - If you also want to copy the {{dfanalytics-job}} to the new cluster, you can export and import jobs in the **{{stack-manage-app}}** app in {{kib}}. Refer to [Exporting and importing {{ml}} jobs](../anomaly-detection/move-jobs.md). - ## Importing an external model to the {{stack}} [import-external-model-to-es] It is possible to import a model to your {{es}} cluster even if the model is not trained by Elastic {{dfanalytics}}. Eland supports [importing models](https://www.elastic.co/guide/en/elasticsearch/client/eland/current/machine-learning.html) directly through its APIs. Please refer to the latest [Eland documentation](https://eland.readthedocs.io/en/latest/index.md) for more information on supported model types and other details of using Eland to import models with. From 7e01267d1e2bdefcc4790e6e99cffe22abfd95ef Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?= Date: Wed, 5 Feb 2025 14:16:09 +0100 Subject: [PATCH 6/6] [E&A] Refines limitations and resources. 
--- .../ml-dfa-limitations.md | 64 ++++++------------- .../data-frame-analytics/ml-dfa-resources.md | 2 - .../ml-dfanalytics-apis.md | 1 - 3 files changed, 20 insertions(+), 47 deletions(-) diff --git a/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-limitations.md b/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-limitations.md index 2f0bdfb7cb..1cabc73050 100644 --- a/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-limitations.md +++ b/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-limitations.md @@ -4,77 +4,61 @@ mapped_pages: - https://www.elastic.co/guide/en/machine-learning/current/ml-dfa-limitations.html --- - - # Limitations [ml-dfa-limitations] - The following limitations and known problems apply to the 9.0.0-beta1 release of the Elastic {{dfanalytics}} feature. The limitations are grouped into the following categories: * [Platform limitations](#dfa-platform-limitations) are related to the platform that hosts the {{ml}} feature of the {{stack}}. * [Configuration limitations](#dfa-config-limitations) apply to the configuration process of the {{dfanalytics-jobs}}. * [Operational limitations](#dfa-operational-limitations) affect the behavior of the {{dfanalytics-jobs}} that are running. +## Platform limitations [dfa-platform-limitations] -## Platform limitations [dfa-platform-limitations] - - -### CPU scheduling improvements apply to Linux and MacOS only [dfa-scheduling-priority] +### CPU scheduling improvements apply to Linux and MacOS only [dfa-scheduling-priority] When there are many {{ml}} jobs running at the same time and there are insufficient CPU resources, the JVM performance must be prioritized so search and indexing latency remain acceptable. To that end, when CPU is constrained on Linux and MacOS environments, the CPU scheduling priority of native analysis processes is reduced to favor the {{es}} JVM. This improvement does not apply to Windows environments. 
+## Configuration limitations [dfa-config-limitations] -## Configuration limitations [dfa-config-limitations] - - -### {{ccs-cap}} is not supported [dfa-ccs-limitations] +### {{ccs-cap}} is not supported [dfa-ccs-limitations] {{ccs-cap}} is not supported for {{dfanalytics}}. - -### Nested fields are not supported [dfa-nested-fields-limitations] +### Nested fields are not supported [dfa-nested-fields-limitations] Nested fields are not supported for {{dfanalytics-jobs}}. These fields are ignored during the analysis. If a nested field is selected as the dependent variable for {{classification}} or {{reganalysis}}, an error occurs. - -### {{dfanalytics-jobs-cap}} cannot be updated [dfa-update-limitations] +### {{dfanalytics-jobs-cap}} cannot be updated [dfa-update-limitations] You cannot update {{dfanalytics}} configurations. Instead, delete the {{dfanalytics-job}} and create a new one. - -### {{dfanalytics-cap}} memory limitation [dfa-dataframe-size-limitations] +### {{dfanalytics-cap}} memory limitation [dfa-dataframe-size-limitations] {{dfanalytics-cap}} can only perform analyses that fit into the memory available for {{ml}}. Overspill to disk is not currently possible. For general {{ml}} settings, see [{{ml-cap}} settings in {{es}}](https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-settings.html). When you create a {{dfanalytics-job}} and the inference step of the process fails due to the model is too large to fit into JVM, follow the steps in [this GitHub issue](https://github.com/elastic/elasticsearch/issues/76093) for a workaround. - -### {{dfanalytics-jobs-cap}} cannot use more than 232 documents for training [dfa-training-docs] +### {{dfanalytics-jobs-cap}} cannot use more than 232 documents for training [dfa-training-docs] A {{dfanalytics-job}} that would use more than 232 documents for training cannot be started. The limitation applies only for documents participating in training the model. 
If your source index contains more than 232 documents, set the `training_percent` to a value that represents less than 232 documents. - -### Trained models created in 7.8 are not backwards compatible [dfa-inference-bwc] +### Trained models created in 7.8 are not backwards compatible [dfa-inference-bwc] Trained models created in version 7.8.0 are not backwards compatible with older node versions. In a mixed cluster environment, all nodes must be at least 7.8.0 to use a model created on a 7.8.0 node. +## Operational limitations [dfa-operational-limitations] -## Operational limitations [dfa-operational-limitations] - - -### Deleting a {{dfanalytics-job}} does not delete the destination index [dfa-deletion-limitations] +### Deleting a {{dfanalytics-job}} does not delete the destination index [dfa-deletion-limitations] The [delete {{dfanalytics-job}} API](https://www.elastic.co/guide/en/elasticsearch/reference/current/delete-dfanalytics.html) does not delete the destination index that contains the annotated data of the {{dfanalytics}}. That index must be deleted separately. - -### {{dfanalytics-jobs-cap}} runtime may vary [dfa-time-limitations] +### {{dfanalytics-jobs-cap}} runtime may vary [dfa-time-limitations] The runtime of {{dfanalytics-jobs}} depends on numerous factors, such as the number of data points in the data set, the type of analytics, the number of fields that are included in the analysis, the supplied [hyperparameters](hyperparameters.md), the type of analyzed fields, and so on. For this reason, a general runtime value that applies to all or most of the situations does not exist. The runtime of a {{dfanalytics-job}} may take from a couple of minutes up to many hours in extreme cases. The runtime increases with an increasing number of analyzed fields in a nearly linear fashion. For data sets of more than 100,000 points, start with a low training percent. 
Run a few {{dfanalytics-jobs}} to see how the runtime scales with the increased number of data points and how the quality of results scales with an increased training percentage. - -### {{dfanalytics-jobs-cap}} may restart after an {{es}} upgrade [dfa-restart] +### {{dfanalytics-jobs-cap}} may restart after an {{es}} upgrade [dfa-restart] A {{dfanalytics-job}} may be restarted from the beginning in the following cases: @@ -84,38 +68,30 @@ A {{dfanalytics-job}} may be restarted from the beginning in the following cases If any of these conditions applies, the destination index of the {{dfanalytics-job}} is deleted and the job starts again from the beginning – regardless of the phase the job was in. - -### Documents with values of multi-element arrays in analyzed fields are skipped [dfa-multi-arrays-limitations] +### Documents with values of multi-element arrays in analyzed fields are skipped [dfa-multi-arrays-limitations] If the value of an analyzed field (a field that is the subject of the {{dfanalytics}}) in a document is an array with more than one element, the document that contains this field is skipped during the analysis. - -### {{oldetection-cap}} field types [dfa-od-field-type-docs-limitations] +### {{oldetection-cap}} field types [dfa-od-field-type-docs-limitations] {{oldetection-cap}} requires numeric or boolean data to analyze. The algorithms don’t support missing values, therefore fields that have data types other than numeric or boolean are ignored. Documents where included fields contain missing values, null values, or an array are also ignored. Therefore a destination index may contain documents that don’t have an {{olscore}}. These documents are still reindexed from the source index to the destination index, but they are not included in the {{oldetection}} analysis and therefore no {{olscore}} is computed.
- -### {{regression-cap}} field types [dfa-regression-field-type-docs-limitations] +### {{regression-cap}} field types [dfa-regression-field-type-docs-limitations] {{regression-cap}} supports fields that are numeric, boolean, text, keyword and ip. It is also tolerant of missing values. Fields that are supported are included in the analysis, other fields are ignored. Documents where included fields contain an array are also ignored. Documents in the destination index that don’t contain a results field are not included in the {{reganalysis}}. - -### {{classification-cap}} field types [dfa-classification-field-type-docs-limitations] +### {{classification-cap}} field types [dfa-classification-field-type-docs-limitations] {{classification-cap}} supports fields that have numeric, boolean, text, keyword, or ip data types. It is also tolerant of missing values. Fields that are supported are included in the analysis, other fields are ignored. Documents where included fields contain an array are also ignored. Documents in the destination index that don’t contain a results field are not included in the {{classanalysis}}. - -### Imbalanced class sizes affect {{classification}} performance [dfa-classification-imbalanced-classes] +### Imbalanced class sizes affect {{classification}} performance [dfa-classification-imbalanced-classes] If your training data is very imbalanced, {{classanalysis}} may not provide good predictions. Try to avoid highly imbalanced situations. We recommend having at least 50 examples of each class and a ratio of no more than 10 to 1 for the majority to minority class labels in the training data. If your training data set is very imbalanced, consider downsampling the majority class, upsampling the minority class, or gathering more data. 
- -### Deeply nested objects affect {{infer}} performance [dfa-inference-nested-limitation] +### Deeply nested objects affect {{infer}} performance [dfa-inference-nested-limitation] If the data that you run inference against contains documents that have a series of combinations of dot-delimited and nested fields (for example: `{"a.b": "c", "a": {"b": "c"},...}`), the operation might be slightly slower. Consider using as simple a mapping as possible for the best performance profile. - -### Analytics runtime performance may significantly slow down with {{feat-imp}} computation [dfa-feature-importance-limitation] +### Analytics runtime performance may significantly slow down with {{feat-imp}} computation [dfa-feature-importance-limitation] For complex models (such as those with many deep trees), the calculation of {{feat-imp}} takes significantly more time. If a reduction in runtime is important to you, try strategies such as disabling {{feat-imp}}, reducing the amount of training data (for example by decreasing the training percentage), setting [hyperparameter](hyperparameters.md) values, or only selecting fields that are relevant for analysis. - diff --git a/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-resources.md b/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-resources.md index 03323d9491..b11e79db6a 100644 --- a/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-resources.md +++ b/explore-analyze/machine-learning/data-frame-analytics/ml-dfa-resources.md @@ -8,5 +8,3 @@ mapped_pages: This section contains further resources for using {{dfanalytics}}.
* [Limitations](ml-dfa-limitations.md) - - diff --git a/explore-analyze/machine-learning/data-frame-analytics/ml-dfanalytics-apis.md b/explore-analyze/machine-learning/data-frame-analytics/ml-dfanalytics-apis.md index caacf22387..caf0eb40db 100644 --- a/explore-analyze/machine-learning/data-frame-analytics/ml-dfanalytics-apis.md +++ b/explore-analyze/machine-learning/data-frame-analytics/ml-dfanalytics-apis.md @@ -29,4 +29,3 @@ The evaluation API endpoint has the following base: * [Update {{dfanalytics-jobs}}](https://www.elastic.co/guide/en/elasticsearch/reference/current/update-dfanalytics.html) For information about the APIs related to trained models, refer to [*API quick reference*](../nlp/ml-nlp-apis.md). -