[DOCS] Amends data frame analytics overview and adds resources section #1726

Merged
merged 7 commits into from
Jun 29, 2021
9 changes: 5 additions & 4 deletions docs/en/stack/ml/df-analytics/index.asciidoc
@@ -1,11 +1,11 @@
include::ml-dfanalytics.asciidoc[]

include::ml-dfa-overview.asciidoc[leveloffset=+1]
include::ml-supervised-workflow.asciidoc[leveloffset=+2]
include::ml-dfa-phases.asciidoc[leveloffset=+2]
include::ml-dfa-scale.asciidoc[leveloffset=+2]


include::ml-dfa-concepts.asciidoc[leveloffset=+1]
include::ml-how-dfa-works.asciidoc[leveloffset=+2]
include::ml-dfa-scale.asciidoc[leveloffset=+2]
include::dfa-outlier-detection.asciidoc[leveloffset=+2]
include::dfa-regression.asciidoc[leveloffset=+2]
include::dfa-classification.asciidoc[leveloffset=+2]
@@ -25,4 +25,5 @@ include::flightdata-regression.asciidoc[leveloffset=+2]
include::flightdata-classification.asciidoc[leveloffset=+2]
include::ml-lang-ident.asciidoc[leveloffset=+2]

include::ml-dfa-limitations.asciidoc[leveloffset=+1]
include::ml-dfa-resources.asciidoc[leveloffset=+1]
include::ml-dfa-limitations.asciidoc[leveloffset=+2]
8 changes: 5 additions & 3 deletions docs/en/stack/ml/df-analytics/ml-dfa-concepts.asciidoc
@@ -1,10 +1,12 @@
[role="xpack"]
[[ml-dfa-concepts]]
= Concepts
= Advanced concepts

This section explains the fundamental concepts of the Elastic {ml} {dfanalytics}
feature and the corresponding {evaluatedf-api}.
This section explains the more complex concepts of the Elastic {ml}
{dfanalytics} feature.

* <<ml-dfa-phases>>
* <<ml-dfa-scale>>
* <<dfa-outlier-detection>>
* <<dfa-regression>>
* <<dfa-classification>>
138 changes: 138 additions & 0 deletions docs/en/stack/ml/df-analytics/ml-dfa-overview.asciidoc
@@ -43,3 +43,141 @@ with supervised learning.
| {regression} | supervised
| {classification} | supervised
|===

[discrete]
[[ml-supervised-workflow]]
== Introduction to supervised learning


Elastic supervised learning enables you to train a {ml} model based on training
examples that you provide. You can then use your model to make predictions on
new data. This page summarizes the end-to-end workflow for training,
evaluating, and deploying a model. It gives a high-level overview of the steps
required to
identify and implement a solution using supervised learning.

The workflow for supervised learning consists of the following stages:

image::images/ml-dfa-lifecycle-diagram.png["Supervised learning workflow"]

These are iterative stages, meaning that after evaluating each step, you might
need to make adjustments before you move further.

[discrete]
[[define-problem]]
=== Define the problem

It’s important to take a moment and think about where {ml} can be most
impactful. Consider what type of data you have available and what value it
holds. The better you know the data, the quicker you will be able to create {ml}
models that generate useful insights. What kinds of patterns do you want to
discover in your data? What type of value do you want to predict: a category, or
a numerical value? The answers help you choose the type of analysis that fits
your use case.

After you identify the problem, consider which of the {ml-features} are most
likely to help you solve it. Supervised learning requires a data set that
contains known values that the model can be trained on. Unsupervised learning –
like {anomaly-detect} or {oldetection} – does not have this requirement.

{stack} provides the following types of supervised learning:

* {regression}: predicts **continuous, numerical values** like the response time
of a web request.
* {classification}: predicts **discrete, categorical values** like whether a
https://www.elastic.co/blog/machine-learning-in-cybersecurity-training-supervised-models-to-detect-dga-activity[DNS request originates from a malicious or benign domain].


[discrete]
[[prepare-transform-data]]
=== Prepare and transform data

You have defined the problem and selected an appropriate type of analysis. The
next step is to produce a high-quality data set in {es} with a clear
relationship to your training objectives. If your data is not already in {es},
this is the stage where you develop your data pipeline. If you want to learn
more about how to ingest data into {es}, refer to the
{ref}/ingest.html[Ingest node documentation].

{regression-cap} and {classification} are supervised {ml} techniques; therefore,
you must supply a labeled data set for training. This is often called the
"ground truth". The training process uses this information to identify
relationships among the various characteristics of the data and the predicted
value. It also plays a critical role in model evaluation.

An important requirement is a data set that is large enough to train a model.
For example, if you would like to train a {classification} model that decides
whether an email is spam or not, you need a labeled data set that contains
enough data points from each possible category to train the model. What counts
as "enough" depends on various factors like the complexity of the problem or
the {ml} solution you have chosen. There is no exact number that fits every
use case; deciding how much data is acceptable is a heuristic process
that might involve iterative trials.

Before you train the model, consider preprocessing the data. In practice, the
type of preprocessing depends on the nature of the data set. Preprocessing can
include, but is not limited to, reducing redundancy, mitigating bias, applying
standards or conventions, and normalizing data.
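
In practice, a preprocessing step can be as small as an ingest pipeline that
normalizes a unit before training. The following sketch assumes a hypothetical
`response_time` field recorded in seconds:

[source,console]
----
PUT _ingest/pipeline/normalize-response-time
{
  "description": "Convert response_time from seconds to milliseconds",
  "processors": [
    {
      "script": {
        "source": "ctx.response_time = ctx.response_time * 1000" <1>
      }
    }
  ]
}
----
<1> `response_time` is a placeholder field name; adapt the script to your own data.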

{regression-cap} and {classification} require specifically structured source
data: a two-dimensional tabular data structure. For this reason, you might need
to {ref}/transforms.html[{transform}] your data to create a {dataframe} which
can be used as the source for these types of {dfanalytics}.
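
For example, the following {transform} sketch pivots a hypothetical `web-logs`
index into an entity-centric {dataframe} with one row per client IP. All index
and field names are placeholders:

[source,console]
----
PUT _transform/web-logs-by-client
{
  "source": { "index": "web-logs" },
  "dest": { "index": "web-logs-by-client" },
  "pivot": {
    "group_by": {
      "client_ip": { "terms": { "field": "client_ip" } }
    },
    "aggregations": {
      "avg_response_time": { "avg": { "field": "response_time" } },
      "request_count": { "value_count": { "field": "request" } }
    }
  }
}
----

The destination index then contains one row per entity and can serve as the
source of a {dfanalytics-job}.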

[discrete]
[[train-test-iterate]]
=== Train, test, iterate

After your data is prepared and transformed into the right format, it is time to
train the model. Training is an iterative process — every iteration is followed
by an evaluation to see how the model performs.

The first step is defining the features – the relevant fields in the data set –
that will be used for training the model. By default, all the fields with
supported types are included in {regression} and {classification}.
However, you can exclude irrelevant fields from the process. Doing so
makes a large data set more manageable, reducing the computing resources and
time required for training.

Next, you must define how to split your data into a training and a test set. The
test set won’t be used to train the model; it is used to evaluate how the model
performs. There is no optimal percentage that fits all use cases; it depends on
the amount of data and the time you have to train. For large data sets, you may
want to start with a low training percent to complete an end-to-end iteration in
a short time.
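
Both choices – the analyzed fields and the training percentage – are part of
the job configuration. The following {regression} job is a sketch that builds
on the hypothetical `web-logs-by-client` {dataframe} from the earlier
{transform} example:

[source,console]
----
PUT _ml/data_frame/analytics/web-logs-response-time
{
  "source": { "index": "web-logs-by-client" },
  "dest": { "index": "web-logs-predictions" },
  "analysis": {
    "regression": {
      "dependent_variable": "avg_response_time",
      "training_percent": 10 <1>
    }
  },
  "analyzed_fields": {
    "excludes": [ "client_ip" ] <2>
  }
}
----
<1> Trains on 10% of the eligible documents; the rest are held out as the test set.
<2> Excludes a field that is irrelevant for the prediction.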

During the training process, the training data is fed through the learning
algorithm. The model predicts the value and compares it to the ground truth,
then it is fine-tuned to make the predictions more accurate.
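
For the hypothetical job sketched above, this process kicks off when you start
the job:

[source,console]
----
POST _ml/data_frame/analytics/web-logs-response-time/_start
----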

Once the model is trained, you can evaluate how well it predicts previously
unseen data with the model generalization error. There are further
evaluation types for both {regression} and {classification} analysis which
provide metrics about training performance. When you are satisfied with the
results, you are ready to deploy the model. Otherwise, you may want to adjust
the training configuration or consider alternative ways to preprocess and
represent your data.
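
As an illustration, assuming the hypothetical job above wrote its results to
`web-logs-predictions`, the {evaluatedf-api} can estimate the generalization
error by restricting the evaluation to the held-out documents:

[source,console]
----
POST _ml/data_frame/_evaluate
{
  "index": "web-logs-predictions",
  "query": {
    "term": { "ml.is_training": { "value": false } } <1>
  },
  "evaluation": {
    "regression": {
      "actual_field": "avg_response_time",
      "predicted_field": "ml.avg_response_time_prediction",
      "metrics": { "mse": {}, "r_squared": {} }
    }
  }
}
----
<1> Considers only the documents that were not used for training.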

[discrete]
[[deploy-model]]
=== Deploy model

You have trained the model and are satisfied with the performance. The last step
is to deploy your trained model and start using it on new data.

The Elastic {ml} feature called {infer} enables you to make predictions for new
data either by using it as a processor in an ingest pipeline, in a continuous
{transform}, or as an aggregation at search time. When new data comes into your
ingest pipeline or you run a search on your data with an {infer} aggregation,
the model is used to infer against the data and make predictions on it.
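
A minimal sketch of the ingest pipeline route, reusing the hypothetical example
from the previous sections (the model ID is a placeholder; use the ID of your
own trained model):

[source,console]
----
PUT _ingest/pipeline/predict-response-time
{
  "processors": [
    {
      "inference": {
        "model_id": "web-logs-response-time-1621284683000" <1>
      }
    }
  ]
}
----
<1> Placeholder model ID.

Documents indexed through this pipeline receive the prediction under the
`ml.inference` target field by default.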

[discrete]
[[next-steps]]
=== Next steps

* Read more about how to {ref}/transforms.html[transform your data] into an
entity-centric index.
* Consult the documentation to learn more about <<dfa-regression,regression>>
and <<dfa-classification,classification>>.
* Learn how to <<ml-dfanalytics-evaluate,evaluate>> regression and
classification models.
* Find out how to deploy your model by using <<ml-inference,inference>>.
7 changes: 7 additions & 0 deletions docs/en/stack/ml/df-analytics/ml-dfa-resources.asciidoc
@@ -0,0 +1,7 @@
[role="xpack"]
[[ml-dfa-resources]]
= Resources

This section contains further resources for using {dfanalytics}.

* <<ml-dfa-limitations>>
2 changes: 1 addition & 1 deletion docs/en/stack/ml/df-analytics/ml-dfanalytics.asciidoc
@@ -20,6 +20,6 @@ and the security privileges that are required to use {dfanalytics}.
* <<ml-dfa-concepts>>
* <<ml-dfanalytics-apis>>
* <<dfanalytics-examples>>
* <<ml-dfa-limitations>>
* <<ml-dfa-resources>>

--
docs/en/stack/ml/df-analytics/ml-dfa-phases.asciidoc
@@ -1,9 +1,13 @@
[role="xpack"]
[[ml-dfa-phases]]
= How a {dfanalytics-job} works
[subs="attributes"]
++++
<titleabbrev>How it works</titleabbrev>
<titleabbrev>How {dfanalytics-jobs} work</titleabbrev>
++++
:keywords: {ml-init}, {stack}, {dfanalytics}, advanced,
:description: An explanation of how the {dfanalytics-jobs} work. Every job has \
four or five main phases depending on its analysis type.


A {dfanalytics-job} is essentially a persistent {es} task. During its life
@@ -17,6 +21,7 @@ cycle, it goes through four or five main phases depending on the analysis type:

Let's take a look at the phases one-by-one.
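
You can follow a job's progress through these phases with the stats API. The
job name below is a placeholder:

[source,console]
----
GET _ml/data_frame/analytics/web-logs-response-time/_stats
----

The response contains a `progress` array that reports a `progress_percent`
value for each phase.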

[discrete]
[[ml-dfa-phases-reindex]]
== Reindexing

@@ -28,13 +33,15 @@ default settings.
Once the destination index is built, the {dfanalytics-job} task calls the {es}
{ref}/docs-reindex.html[Reindex API] to launch the reindexing task.

[discrete]
[[ml-dfa-phases-load]]
== Loading data

After the reindexing is finished, the job fetches the needed data from the
destination index. It converts the data into the format that the analysis
process expects, then sends it to the analysis process.

[discrete]
[[ml-dfa-phases-analyze]]
== Analyzing

@@ -54,6 +61,7 @@ in which they identify outliers in the data.
hyperparameters. See <<hyperparameters,hyperparameter optimization>>.
. `final_training`: Trains the {ml} model.

[discrete]
[[ml-dfa-phases-write]]
== Writing results

@@ -63,6 +71,7 @@ ones that have been loaded in the loading data phase are not. The
{dfanalytics-job} matches the results with the data rows in the destination
index, merges them, and indexes them back to the destination index.

[discrete]
[[ml-dfa-phases-inference]]
== {infer-cap}

@@ -72,11 +81,4 @@ set.


Finally, after all phases are completed, the task is marked as completed and the
{dfanalytics-job} stops. Your data is ready to be evaluated.


Check the <<ml-dfa-concepts>> section if you'd like to know more about the
various {dfanalytics} types.

Check the <<ml-dfanalytics-evaluate>> section if you are interested in the
evaluation of the {dfanalytics} results.
{dfanalytics-job} stops. Your data is ready to be evaluated.