From 33b98870483213089bbd9ed480c94fffc830a3d1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?=
Date: Tue, 29 Jun 2021 10:59:02 +0200
Subject: [PATCH 1/6] [DOCS] Amends data frame analytics overview.

---
 docs/en/stack/ml/df-analytics/index.asciidoc |   6 +-
 .../ml/df-analytics/ml-dfa-overview.asciidoc | 138 ++++++++++++++++++
 ...ses.asciidoc => ml-how-dfa-works.asciidoc} |  17 +--
 .../ml-supervised-workflow.asciidoc           | 135 -----------------
 4 files changed, 147 insertions(+), 149 deletions(-)
 rename docs/en/stack/ml/df-analytics/{ml-dfa-phases.asciidoc => ml-how-dfa-works.asciidoc} (88%)
 delete mode 100644 docs/en/stack/ml/df-analytics/ml-supervised-workflow.asciidoc

diff --git a/docs/en/stack/ml/df-analytics/index.asciidoc b/docs/en/stack/ml/df-analytics/index.asciidoc
index 081e6ff22..5097b5716 100644
--- a/docs/en/stack/ml/df-analytics/index.asciidoc
+++ b/docs/en/stack/ml/df-analytics/index.asciidoc
@@ -1,11 +1,11 @@
 include::ml-dfanalytics.asciidoc[]
 
 include::ml-dfa-overview.asciidoc[leveloffset=+1]
-include::ml-supervised-workflow.asciidoc[leveloffset=+2]
-include::ml-dfa-phases.asciidoc[leveloffset=+2]
-include::ml-dfa-scale.asciidoc[leveloffset=+2]
+
 include::ml-dfa-concepts.asciidoc[leveloffset=+1]
+include::ml-how-dfa-works.asciidoc[leveloffset=+2]
+include::ml-dfa-scale.asciidoc[leveloffset=+2]
 include::dfa-outlier-detection.asciidoc[leveloffset=+2]
 include::dfa-regression.asciidoc[leveloffset=+2]
 include::dfa-classification.asciidoc[leveloffset=+2]
diff --git a/docs/en/stack/ml/df-analytics/ml-dfa-overview.asciidoc b/docs/en/stack/ml/df-analytics/ml-dfa-overview.asciidoc
index 630d6fd94..e3f4e3f2d 100644
--- a/docs/en/stack/ml/df-analytics/ml-dfa-overview.asciidoc
+++ b/docs/en/stack/ml/df-analytics/ml-dfa-overview.asciidoc
@@ -43,3 +43,141 @@ with supervised learning.
 | {regression} | supervised
 | {classification} | supervised
 |===
+
+[discrete]
+[[ml-supervised-workflow]]
+== Introduction to supervised learning
+
+
+Elastic supervised learning enables you to train a {ml} model based on training
+examples that you provide. You can then use your model to make predictions on
+new data. This page summarizes the end-to-end workflow for training, evaluating,
+and deploying a model. It gives a high-level overview of the steps required to
+identify and implement a solution using supervised learning.
+
+The workflow for supervised learning consists of the following stages:
+
+image::images/ml-dfa-lifecycle-diagram.png["Supervised learning workflow"]
+
+These are iterative stages, meaning that after evaluating each step, you might
+need to make adjustments before you move further.
+
+[discrete]
+[[define-problem]]
+=== Define the problem
+
+It’s important to take a moment and think about where {ml} can be most
+impactful. Consider what type of data you have available and what value it
+holds. The better you know the data, the quicker you will be able to create {ml}
+models that generate useful insights. What kinds of patterns do you want to
+discover in your data? What type of value do you want to predict: a category, or
+a numerical value? The answers help you choose the type of analysis that fits
+your use case.
+
+After you identify the problem, consider which of the {ml-features} are most
+likely to help you solve it. Supervised learning requires a data set that
+contains known values that the model can be trained on. Unsupervised learning –
+like {anomaly-detect} or {oldetection} – does not have this requirement.
+
+{stack} provides the following types of supervised learning:
+
+* {regression}: predicts **continuous, numerical values** like the response time
+  of a web request.
+* {classification}: predicts **discrete, categorical values** like whether a
+  https://www.elastic.co/blog/machine-learning-in-cybersecurity-training-supervised-models-to-detect-dga-activity[DNS request originates from a malicious or benign domain].
+
+
+[discrete]
+[[prepare-transform-data]]
+=== Prepare and transform data
+
+You have defined the problem and selected an appropriate type of analysis. The
+next step is to produce a high-quality data set in {es} with a clear
+relationship to your training objectives. If your data is not already in {es},
+this is the stage where you develop your data pipeline. If you want to learn
+more about how to ingest data into {es}, refer to the
+{ref}/ingest.html[Ingest node documentation].
+
+{regression-cap} and {classification} are supervised {ml} techniques; therefore,
+you must supply a labeled data set for training. This is often called the
+"ground truth". The training process uses this information to identify
+relationships among the various characteristics of the data and the predicted
+value. It also plays a critical role in model evaluation.
+
+An important requirement is a data set that is large enough to train a model.
+For example, if you would like to train a {classification} model that decides
+whether an email is spam or not, you need a labeled data set that contains
+enough data points from each possible category to train the model. What counts
+as "enough" depends on various factors like the complexity of the problem or
+the {ml} solution you have chosen. There is no exact number that fits every
+use case; deciding how much data is acceptable is rather a heuristic process
+that might involve iterative trials.
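+
+For instance, a single labeled data point for the spam example above might look
+like the following sketch. The index name and all of the fields are
+illustrative only; your own data set determines them.
+
+[source,console]
+----
+POST spam-training-data/_doc
+{
+  "subject_length": 42,
+  "num_links": 7,
+  "sender_domain_age_days": 2,
+  "is_spam": true <1>
+}
+----
+<1> The label field that stores the ground truth for this data point.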
+
+Before you train the model, consider preprocessing the data. In practice, the
+type of preprocessing depends on the nature of the data set. Preprocessing can
+include, but is not limited to, mitigating redundancy, reducing biases, applying
+standards and conventions, and normalizing data.
+
+{regression-cap} and {classification} require specifically structured source
+data: a two-dimensional tabular data structure. For this reason, you might need
+to {ref}/transforms.html[{transform}] your data to create a {dataframe} that
+can be used as the source for these types of {dfanalytics}.
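+
+For example, a {transform} along the lines of the following minimal sketch
+pivots a hypothetical web log index into an entity-centric {dataframe} with one
+row per client IP. The index, field, and {transform} names are placeholders,
+not a recommended configuration.
+
+[source,console]
+----
+PUT _transform/example-client-stats
+{
+  "source": { "index": "web-logs" },
+  "dest": { "index": "web-log-clients" },
+  "pivot": {
+    "group_by": {
+      "clientip": { "terms": { "field": "clientip" } } <1>
+    },
+    "aggregations": {
+      "response_time_avg": { "avg": { "field": "response_time" } } <2>
+    }
+  }
+}
+----
+<1> Each unique client IP becomes one row in the destination index.
+<2> Aggregated values form the columns – the features – of the {dataframe}.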
+
+[discrete]
+[[train-test-iterate]]
+=== Train, test, iterate
+
+After your data is prepared and transformed into the right format, it is time to
+train the model. Training is an iterative process: every iteration is followed
+by an evaluation to see how the model performs.
+
+The first step is defining the features – the relevant fields in the data set –
+that will be used for training the model. By default, all the fields with
+supported types are included in {regression} and {classification}.
+However, you can optionally exclude irrelevant fields from the process. Doing so
+makes a large data set more manageable, reducing the computing resources and
+time required for training.
+
+Next, you must define how to split your data into a training and a test set. The
+test set won’t be used to train the model; it is used to evaluate how the model
+performs. There is no optimal percentage that fits all use cases; it depends on
+the amount of data and the time you have to train. For large data sets, you may
+want to start with a low training percent to complete an end-to-end iteration in
+a short time.
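+
+In a {classification} job configuration, these settings could look like the
+following sketch. The job, index, and field names are hypothetical, and most of
+the available options are omitted.
+
+[source,console]
+----
+PUT _ml/data_frame/analytics/example-spam-classifier
+{
+  "source": { "index": "spam-training-data" },
+  "dest": { "index": "spam-training-results" },
+  "analysis": {
+    "classification": {
+      "dependent_variable": "is_spam", <1>
+      "training_percent": 10 <2>
+    }
+  },
+  "analyzed_fields": {
+    "excludes": ["message_id"] <3>
+  }
+}
+----
+<1> The field that contains the ground truth labels.
+<2> Only 10% of the eligible documents are used for training; the rest serve
+    as the test set.
+<3> Irrelevant fields – like a hypothetical message ID – can be excluded from
+    the analysis.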
+
+During the training process, the training data is fed through the learning
+algorithm. The model predicts the value and compares it to the ground truth;
+the model is then fine-tuned to make the predictions more accurate.
+
+Once the model is trained, you can evaluate how well it predicts previously
+unseen data by using the model generalization error. There are further
+evaluation types for both {regression} and {classification} analysis that
+provide metrics about training performance. When you are satisfied with the
+results, you are ready to deploy the model. Otherwise, you may want to adjust
+the training configuration or consider alternative ways to preprocess and
+represent your data.
+
+[discrete]
+[[deploy-model]]
+=== Deploy model
+
+You have trained the model and are satisfied with the performance. The last step
+is to deploy your trained model and start using it on new data.
+
+The Elastic {ml} feature called {infer} enables you to make predictions for new
+data by using the model as a processor in an ingest pipeline, in a continuous
+{transform}, or as an aggregation at search time. When new data comes into your
+ingest pipeline or you run a search on your data with an {infer} aggregation,
+the model is used to infer against the data and make predictions on it.
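+
+For example, a trained model could be applied to incoming documents with an
+ingest pipeline along these lines. The pipeline and model IDs are hypothetical;
+a real model ID can be looked up with the get trained models API.
+
+[source,console]
+----
+PUT _ingest/pipeline/example-spam-inference
+{
+  "processors": [
+    {
+      "inference": {
+        "model_id": "example-spam-classifier-0001" <1>
+      }
+    }
+  ]
+}
+----
+<1> The ID of the trained model to apply. By default, the results are written
+    under the `ml.inference` field of each ingested document.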
+
+[discrete]
+[[next-steps]]
+=== Next steps
+
+* Read more about how to {ref}/transforms.html[transform your data] into an
+  entity-centric index.
+* Consult the documentation to learn more about <>
+  and <>.
+* Learn how to <> regression and
+classification models.
+* Find out how to deploy your model by using <>
\ No newline at end of file
diff --git a/docs/en/stack/ml/df-analytics/ml-dfa-phases.asciidoc b/docs/en/stack/ml/df-analytics/ml-how-dfa-works.asciidoc
similarity index 88%
rename from docs/en/stack/ml/df-analytics/ml-dfa-phases.asciidoc
rename to docs/en/stack/ml/df-analytics/ml-how-dfa-works.asciidoc
index 38546dbeb..1248f7089 100644
--- a/docs/en/stack/ml/df-analytics/ml-dfa-phases.asciidoc
+++ b/docs/en/stack/ml/df-analytics/ml-how-dfa-works.asciidoc
@@ -1,9 +1,6 @@
 [role="xpack"]
 [[ml-dfa-phases]]
 = How a {dfanalytics-job} works
-++++
-How it works
-++++
 
 
 A {dfanalytics-job} is essentially a persistent {es} task. During its life
@@ -17,6 +14,7 @@ cycle, it goes through four or five main phases depending on the analysis type:
 
 Let's take a look at the phases one-by-one.
 
+[discrete]
 [[ml-dfa-phases-reindex]]
 == Reindexing
 
@@ -28,6 +26,7 @@ default settings.
 
 Once the destination index is built, the {dfanalytics-job} task calls the
 {es} {ref}/docs-reindex.html[Reindex API] to launch the reindexing task.
 
+[discrete]
 [[ml-dfa-phases-load]]
 == Loading data
 
@@ -35,6 +34,7 @@ After the reindexing is finished, the job fetches the needed data from the
 destination index. It converts the data into the format that the analysis
 process expects, then sends it to the analysis process.
 
+[discrete]
 [[ml-dfa-phases-analyze]]
 == Analyzing
 
@@ -54,6 +54,7 @@ in which they identify outliers in the data.
 hyperparameters. See <>.
 . `final_training`: Trains the {ml} model.
 
+[discrete]
 [[ml-dfa-phases-write]]
 == Writing results
 
@@ -63,6 +64,7 @@ ones that have been loaded in the loading data phase are not. The
 {dfanalytics-job} matches the results with the data rows in the destination
 index, merges them, and indexes them back to the destination index.
 
+[discrete]
 [[ml-dfa-phases-inference]]
 == {infer-cap}
 
@@ -72,11 +74,4 @@ set.
 
 Finally, after all phases are completed, the
-{dfanalytics-job} stops. Your data is ready to be evaluated.
-
-
-Check the <> section if you'd like to know more about the
-various {dfanalytics} types.
-
-Check the <> section if you are interested in the
-evaluation of the {dfanalytics} results.
+{dfanalytics-job} stops. Your data is ready to be evaluated.
\ No newline at end of file
diff --git a/docs/en/stack/ml/df-analytics/ml-supervised-workflow.asciidoc b/docs/en/stack/ml/df-analytics/ml-supervised-workflow.asciidoc
deleted file mode 100644
index ff9d70ddb..000000000
--- a/docs/en/stack/ml/df-analytics/ml-supervised-workflow.asciidoc
+++ /dev/null
@@ -1,135 +0,0 @@
-[role="xpack"]
-[[ml-supervised-workflow]]
-= Introduction to supervised learning
-
-
-Elastic supervised learning enables you to train a {ml} model based on training
-examples that you provide. You can then use your model to make predictions on
-new data. This page summarizes the end-to-end workflow for training, evaluating
-and deploying a model. It gives a high-level overview of the steps required to
-identify and implement a solution using supervised learning.
-
-The workflow for supervised learning consists of the following stages:
-
-image::images/ml-dfa-lifecycle-diagram.png["Supervised learning workflow"]
-
-These are iterative stages, meaning that after evaluating each step, you might
-need to make adjustments before you move further.
-
-
-[[define-problem]]
-== Define the problem
-
-It’s important to take a moment and think about where {ml} can be most
-impactful. Consider what type of data you have available and what value it
-holds. The better you know the data, the quicker you will be able to create {ml}
-models that generate useful insights. What kinds of patterns do you want to
-discover in your data? What type of value do you want to predict: a category, or
-a numerical value? The answers help you choose the type of analysis that fits
-your use case.
-
-After you identify the problem, consider which of the {ml-features} are most
-likely to help you solve it. Supervised learning requires a data set that
-contains known values that the model can be trained on. Unsupervised learning –
-like {anomaly-detect} or {oldetection} – does not have this requirement.
-
-{stack} provides the following types of supervised learning:
-
-* {regression}: predicts **continuous, numerical values** like the response time
-  of a web request.
-* {classification}: predicts **discrete, categorical values** like whether a
-  https://www.elastic.co/blog/machine-learning-in-cybersecurity-training-supervised-models-to-detect-dga-activity[DNS request originates from a malicious or benign domain].
-
-
-[[prepare-transform-data]]
-== Prepare and transform data
-
-You have defined the problem and selected an appropriate type of analysis. The
-next step is to produce a high-quality data set in {es} with a clear
-relationship to your training objectives. If your data is not already in {es},
-this is the stage where you develop your data pipeline. If you want to learn
-more about how to ingest data into {es}, refer to the
-{ref}/ingest.html[Ingest node documentation].
-
-{regression-cap} and {classification} are supervised {ml} techniques, therefore
-you must supply a labelled data set for training. This is often called the
-"ground truth". The training process uses this information to identify
-relationships among the various characteristics of the data and the predicted
-value. It also plays a critical role in model evaluation.
-
-An important requirement is a data set that is large enough to train a model.
-For example, if you would like to train a {classification} model that decides
-whether an email is a spam or not, you need a labelled data set that contains
-enough data points from each possible category to train the model. What counts
-as "enough" depends on various factors like the complexity of the problem or
-the {ml} solution you have chosen. There is no exact number that fits every
-use case; deciding how much data is acceptable is rather a heuristic process
-that might involve iterative trials.
-
-Before you train the model, consider preprocessing the data. In practice, the
-type of preprocessing depends on the nature of the data set. Preprocessing can
-include, but is not limited to, mitigating redundancy, reducing biases, applying
-standards and/or conventions, data normalization, and so on.
-
-{regression-cap} and {classification} require specifically structured source
-data: a two dimensional tabular data structure. For this reason, you might need
-to {ref}/transforms.html[{transform}] your data to create a {dataframe} which
-can be used as the source for these types of {dfanalytics}.
-
-[[train-test-iterate]]
-== Train, test, iterate
-
-After your data is prepared and transformed into the right format, it is time to
-train the model. Training is an iterative process — every iteration is followed
-by an evaluation to see how the model performs.
-
-The first step is defining the features – the relevant fields in the data set –
-that will be used for training the model. By default, all the fields with
-supported types are included in {regression} and {classification} automatically.
-However, you can optionally exclude irrelevant fields from the process. Doing so
-makes a large data set more manageable, reducing the computing resources and
-time required for training.
-
-Next you must define how to split your data into a training and a test set. The
-test set won’t be used to train the model; it is used to evaluate how the model
-performs. There is no optimal percentage that fits all use cases, it depends on
-the amount of data and the time you have to train. For large data sets, you may
-want to start with a low training percent to complete an end-to-end iteration in
-a short time.
-
-During the training process, the training data is fed through the learning
-algorithm. The model predicts the value and compares it to the ground truth then
-the model is fine-tuned to make the predictions more accurate.
-
-Once the model is trained, you can evaluate how well it predicts previously
-unseen data with the model generalization error. There are further
-<> for both {regression} and
-{classification} analysis which provide metrics about training performance.
-When you are satisfied with the results, you are ready to deploy the model.
-Otherwise, you may want to adjust the training configuration or consider
-alternative ways to preprocess and represent your data.
-
-
-[[deploy-model]]
-== Deploy model
-
-You have trained the model and are satisfied with the performance. The last step
-is to deploy your trained model and start using it on new data.
-
-The Elastic {ml} feature called {infer} enables you to make predictions for new
-data either by using it as a processor in an ingest pipeline, in a continuous
-{transform} or as an aggregation at search time. When new data comes into your
-ingest pipeline or you run a search on your data with an {infer} aggregation,
-the model is used to infer against the data and make predictions on it.
-
-
-[[next-steps]]
-== Next steps
-
-* Read more about how to {ref}/transforms.html[transform you data] into an
-  entity-centric index.
-* Consult the documentation to learn more about <>
-  and <>.
-* Learn how to <> regression and
-classification models.
-* Find out how to deploy your model by using <>.

From 3b9c7e741d6c751471479268fcd5845456f3e4a3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?=
Date: Tue, 29 Jun 2021 11:23:36 +0200
Subject: [PATCH 2/6] [DOCS] Adds metadata to How DFA works page.

---
 docs/en/stack/ml/df-analytics/ml-how-dfa-works.asciidoc | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/en/stack/ml/df-analytics/ml-how-dfa-works.asciidoc b/docs/en/stack/ml/df-analytics/ml-how-dfa-works.asciidoc
index 1248f7089..5f2683c3e 100644
--- a/docs/en/stack/ml/df-analytics/ml-how-dfa-works.asciidoc
+++ b/docs/en/stack/ml/df-analytics/ml-how-dfa-works.asciidoc
@@ -1,6 +1,13 @@
 [role="xpack"]
 [[ml-dfa-phases]]
 = How a {dfanalytics-job} works
+[subs="attributes"]
+++++
+How {dfanalytics-jobs} work
+++++
+:keywords: {ml-init}, {stack}, {dfanalytics}, advanced,
+:description: An explanation of how the {dfanalytics-jobs} work. Every job has \
+  four or five main phases depending on its analysis type.
 
 
 A {dfanalytics-job} is essentially a persistent {es} task. During its life

From 233d5d4f6c8f3b4819fd4648054fc151bd54e3a6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?=
Date: Tue, 29 Jun 2021 11:26:27 +0200
Subject: [PATCH 3/6] [DOCS] Renames Concepts to Advanced concepts.

---
 docs/en/stack/ml/df-analytics/ml-dfa-concepts.asciidoc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/en/stack/ml/df-analytics/ml-dfa-concepts.asciidoc b/docs/en/stack/ml/df-analytics/ml-dfa-concepts.asciidoc
index 16145849d..f0a876159 100644
--- a/docs/en/stack/ml/df-analytics/ml-dfa-concepts.asciidoc
+++ b/docs/en/stack/ml/df-analytics/ml-dfa-concepts.asciidoc
@@ -1,10 +1,11 @@
 [role="xpack"]
 [[ml-dfa-concepts]]
-= Concepts
+= Advanced concepts
 
 This section explains the fundamental concepts of the Elastic {ml} {dfanalytics}
 feature and the corresponding {evaluatedf-api}.
 
+* <>
 * <>
 * <>
 * <>

From 5a0950556555c2b22208b5b8aa6bf263da8b0dd4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?=
Date: Tue, 29 Jun 2021 11:39:27 +0200
Subject: [PATCH 4/6] [DOCS] Adds DFA at scale link to Advanced concepts.

---
 docs/en/stack/ml/df-analytics/ml-dfa-concepts.asciidoc | 5 +++--
 docs/en/stack/ml/df-analytics/ml-dfa-overview.asciidoc | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/en/stack/ml/df-analytics/ml-dfa-concepts.asciidoc b/docs/en/stack/ml/df-analytics/ml-dfa-concepts.asciidoc
index f0a876159..f6fce3422 100644
--- a/docs/en/stack/ml/df-analytics/ml-dfa-concepts.asciidoc
+++ b/docs/en/stack/ml/df-analytics/ml-dfa-concepts.asciidoc
@@ -2,10 +2,11 @@
 [[ml-dfa-concepts]]
 = Advanced concepts
 
-This section explains the fundamental concepts of the Elastic {ml} {dfanalytics}
-feature and the corresponding {evaluatedf-api}.
+This section explains the more complex concepts of the Elastic {ml} +{dfanalytics} feature. * <> +* <> * <> * <> * <> diff --git a/docs/en/stack/ml/df-analytics/ml-dfa-overview.asciidoc b/docs/en/stack/ml/df-analytics/ml-dfa-overview.asciidoc index e3f4e3f2d..247ea8625 100644 --- a/docs/en/stack/ml/df-analytics/ml-dfa-overview.asciidoc +++ b/docs/en/stack/ml/df-analytics/ml-dfa-overview.asciidoc @@ -180,4 +180,4 @@ the model is used to infer against the data and make predictions on it. and <>. * Learn how to <> regression and classification models. -* Find out how to deploy your model by using <> \ No newline at end of file +* Find out how to deploy your model by using <>. \ No newline at end of file From 7c3cd4dcc49ac7bd850a72026592f1c27a293c1b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?= Date: Tue, 29 Jun 2021 15:45:05 +0200 Subject: [PATCH 5/6] [DOCS] Adds Resources section. --- docs/en/stack/ml/df-analytics/index.asciidoc | 3 ++- docs/en/stack/ml/df-analytics/ml-dfa-resources.asciidoc | 7 +++++++ 2 files changed, 9 insertions(+), 1 deletion(-) create mode 100644 docs/en/stack/ml/df-analytics/ml-dfa-resources.asciidoc diff --git a/docs/en/stack/ml/df-analytics/index.asciidoc b/docs/en/stack/ml/df-analytics/index.asciidoc index 5097b5716..11907e8a2 100644 --- a/docs/en/stack/ml/df-analytics/index.asciidoc +++ b/docs/en/stack/ml/df-analytics/index.asciidoc @@ -25,4 +25,5 @@ include::flightdata-regression.asciidoc[leveloffset=+2] include::flightdata-classification.asciidoc[leveloffset=+2] include::ml-lang-ident.asciidoc[leveloffset=+2] -include::ml-dfa-limitations.asciidoc[leveloffset=+1] \ No newline at end of file +include::ml-dfa-resources.asciidoc[leveloffset=+1] +include::ml-dfa-limitations.asciidoc[leveloffset=+2] \ No newline at end of file diff --git a/docs/en/stack/ml/df-analytics/ml-dfa-resources.asciidoc b/docs/en/stack/ml/df-analytics/ml-dfa-resources.asciidoc new file mode 100644 index 000000000..a7cbb199c --- /dev/null +++ b/docs/en/stack/ml/df-analytics/ml-dfa-resources.asciidoc @@ -0,0 +1,7 @@ +[role="xpack"] +[[ml-dfa-resources]] += Resources + +This section contains further resources for using {dfanalytics}. + +* <> \ No newline at end of file From 5d3f918dd93c9f7876acc189bc63174bdcb33bb6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?= Date: Tue, 29 Jun 2021 16:08:40 +0200 Subject: [PATCH 6/6] [DOCS] Changes link on DFA main page. --- docs/en/stack/ml/df-analytics/ml-dfanalytics.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/stack/ml/df-analytics/ml-dfanalytics.asciidoc b/docs/en/stack/ml/df-analytics/ml-dfanalytics.asciidoc index 7e4165f5f..f4134eb59 100644 --- a/docs/en/stack/ml/df-analytics/ml-dfanalytics.asciidoc +++ b/docs/en/stack/ml/df-analytics/ml-dfanalytics.asciidoc @@ -20,6 +20,6 @@ and the security privileges that are required to use {dfanalytics}. * <> * <> * <> -* <> +* <> -- \ No newline at end of file