Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Initial creation of APM module #18805

Merged
merged 7 commits into from May 16, 2018

Conversation

@jgowdyelastic
Copy link
Member

jgowdyelastic commented May 4, 2018

APM module is called apm_transaction and can be created via the endpoint call:

POST api/ml/modules/setup/apm_transaction
(optional payload arguments)
{
  "prefix":"test-",
  "indexPatternName":"apm-*",
  "groups": ["apm"],
  "startDatafeed": true
  "query": {
      "bool": {
      "must": [{
         .........
   }
}

This will create an ML job called test-apm_transaction which will use the supplied query to get data from apm-*

The ML job is based on the suggestion in this comment: #18477 (comment)
With the model memory limit reduced to 10mb

It also contains a query to allow ML to recognise the apm-* index and so create the job via our Recognizer page:

"query": {
    "bool": {
      "filter": [
        {
          "term": {
            "processor.name": "transaction"
          }
        },
        {
          "exists": {
            "field": "context.service.name"
          }
        },
        {
          "exists": {
            "field": "context.service.agent.name"
          }
        },
        {
          "exists": {
            "field": "context.service.agent.version"
          }
        }
      ]
    }
  }

Having the recognizer work with this APM module is not in the original requirement, but it's easy to implement (as long as the search is correct) and so I think worth adding.
The query was my best guess at what we should be looking for, so I'd like some input from the APM team please.
Basically the query needs to return some results in order for it to be a match, the query i've added just looks for some fields, if they exist it assumes it's an APM index that is applicable to this job.

cc @makwarth @blaklaybul @sophiec20

@elasticmachine

This comment has been minimized.

Copy link

elasticmachine commented May 4, 2018

Pinging @elastic/ml-ui

@elasticmachine

This comment has been minimized.

Copy link

elasticmachine commented May 4, 2018

@sqren

This comment has been minimized.

Copy link
Member

sqren commented May 7, 2018

Hi James, thanks for this. Looks great.

Module name
I think the module name apm_transaction is a bit misleading, since APM uses the term "transaction" differently.
Would it be possible to give it a very generic name, eg. avg_single_metric. This would make it useful for other teams also.

If the job contains logic that is specific to apm or any of the metrics are hardcoded, we should name it after that, eg. apm-avg-service-response-time.

I'm not 100% clear on the API. Is the query part of the request to api/ml/modules/setup/apm_transaction? Or is it hardcoded in the job?

This is how I think about the query: When creating a job we should be able to supply an ES query that filters the data, so the ML model only works on that subset. The filter will not only look at whether a field exists or not, but on actual values, eg:

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "context.service.name": "my-service-name" } },
        { "term": { "transaction.name": "my-transaction-name" } }
      ]
    }
  }
}

API

This is how I suggest we structure the API

Create job
POST api/ml/modules/<module-name>/create

Payload:

  • indexPatternName (string) : Pattern for indices to query
  • query (object) : ES query to filter data
  • start (bool) : Indicate whether to start the job immediately after creating it
  • prefix (string) : Not really sure what prefix does

Get job status
GET api/ml/modules/<module-name>/status

I'm ready to jump on a zoom today if it can help clear up anything :)

@blaklaybul

This comment has been minimized.

Copy link

blaklaybul commented May 7, 2018

Can we call the job <service_name>_high_mean_response_time ?

@sqren

This comment has been minimized.

Copy link
Member

sqren commented May 8, 2018

@jgowdyelastic

"query": {
    "bool": {
      "filter": [
        {
          "term": {
            "processor.name": "transaction"
          }
        },
        {
          "term": {
            "processor.event": "transaction"
          }
        },
      ]
    }
  }
@elasticmachine

This comment has been minimized.

Copy link

elasticmachine commented May 8, 2018

@jgowdyelastic

This comment has been minimized.

Copy link
Member Author

jgowdyelastic commented May 8, 2018

Based on notes here and a conversation I had with @sqren, i've updated this PR with some changes:

  • Job id changed to <prefix>high_mean_response_time
  • Recogniser query shortened to just look for processor.name and processor.action

Still needed:

  • Ability to start the jobs in a module after they have been successfully created.
    As suggested, this can be achieved with a start flag being added to the setup endpoint.
  • An endpoint to retrieve the status of a job or multiple jobs.
    This can be added to our kibana enpoints.
    The originally it was suggested to use the module name for this status check. This would not be possible if the jobs were created with a prefix as there is no history kept of which jobs were created using the module setup endpoint.

I'm still unsure about the choice of name for the module.
@sophiec20 @blaklaybul can you think of something better than apm_transaction ?
My original reason for naming it such, was because it creates jobs for APM's transaction data.

@jgowdyelastic jgowdyelastic force-pushed the jgowdyelastic:creating-apm-module branch from a3293a8 to 5f7ccc9 May 8, 2018

@blaklaybul

This comment has been minimized.

Copy link

blaklaybul commented May 8, 2018

@jgowdyelastic For module name, is there a reason we can't call it apm ? Much like the nginx and apache2 modules, the module name is generic and the jobs names specify the details of the configuration.

@jgowdyelastic

This comment has been minimized.

Copy link
Member Author

jgowdyelastic commented May 8, 2018

the plan is to rename nginx and apache2 to something more specific, like nginx_web_access_logs
In case we want to create a new set of nginx jobs which focus on a different aspect of the data.

For APM we may want to create a job to look at the error data in the future. This could go in the existing apm module, in which case just apm would be ok.

@blaklaybul

This comment has been minimized.

Copy link

blaklaybul commented May 8, 2018

Gotcha.

@sqren What other meaning does transaction have in APM? Our thinking was to use processor event names: [transaction, span, error] as indicators in module names. If we are indeed planning to offer multiple recognizer modules for the same dataset then I'm happy using these as prefixes/suffixes.

However, do we intend to offer different recognizer modules for the same dataset? @stevedodson @sophiec20

@elasticmachine

This comment has been minimized.

Copy link

elasticmachine commented May 8, 2018

@sqren

This comment has been minimized.

Copy link
Member

sqren commented May 8, 2018

What other meaning does transaction have in APM?

Turns out it's the same (I misunderstood).

For APM we may want to create a job to look at the error data in the future. This could go in the existing apm module, in which case just apm would be ok.

I think having a generic module called apm is a good solution as long as it doesn't contain any transaction specific things. And then we can have specific jobs eg high_mean_response_time.

As I mentioned to @jgowdyelastic it sound like we don't need the module abstraction for APM. Being able to create/start jobs directly (without running setup module) would be ideal.

jgowdyelastic added some commits May 4, 2018

@jgowdyelastic jgowdyelastic requested review from walterra and peteharverson May 16, 2018

@walterra
Copy link
Contributor

walterra left a comment

LGTM

@peteharverson
Copy link
Contributor

peteharverson left a comment

LGTM

@elasticmachine

This comment has been minimized.

Copy link

elasticmachine commented May 16, 2018

@jgowdyelastic

This comment has been minimized.

Copy link
Member Author

jgowdyelastic commented May 16, 2018

jenkins test this

@elasticmachine

This comment has been minimized.

Copy link

elasticmachine commented May 16, 2018

@jgowdyelastic jgowdyelastic merged commit 329894f into elastic:master May 16, 2018

2 checks passed

CLA Commit author is a member of Elasticsearch
Details
kibana-ci Build finished.
Details

jgowdyelastic added a commit to jgowdyelastic/kibana that referenced this pull request May 16, 2018

[ML] Initial creation of APM module (elastic#18805)
* [ML] Initial creation of APM module

* updating query

* adding processor.name check to query

* removing processor.name exists check

* updating manifest with suggested changes

* renaming-job-id

* updating job description
@sqren

This comment has been minimized.

Copy link
Member

sqren commented May 16, 2018

Great work @jgowdyelastic !

mistic added a commit to mistic/kibana that referenced this pull request May 16, 2018

[ML] Initial creation of APM module (elastic#18805)
* [ML] Initial creation of APM module

* updating query

* adding processor.name check to query

* removing processor.name exists check

* updating manifest with suggested changes

* renaming-job-id

* updating job description

jgowdyelastic added a commit that referenced this pull request May 18, 2018

[ML] Initial creation of APM module (#18805) (#19133)
* [ML] Initial creation of APM module

* updating query

* adding processor.name check to query

* removing processor.name exists check

* updating manifest with suggested changes

* renaming-job-id

* updating job description
@makwarth

This comment has been minimized.

Copy link
Member

makwarth commented Jun 6, 2018

In summary for @sqren:

To create and start job for an APM service, call following endpoint with following payload.

POST api/ml/modules/setup/apm_transaction

{
  "prefix":"<apm_service_name>-", //the prefix allows us to create a module per service 
  "groups": ["apm", "<apm_service_name"], //groups are used to group jobs in the ML UI
  "indexPatternName":"apm-*", 
  "startDatafeed": true, //starts the job upon creation
  "query": { } //query to filter APM data by service name data
}

Notes:

  • This module (<prefix>apm_transaction) implicitly created the ML job called <prefix>high_mean_response_time.
  • The startDatafeed flag was merged in #19254
@elasticmachine

This comment has been minimized.

Copy link

elasticmachine commented Jun 6, 2018

@makwarth

This comment has been minimized.

Copy link
Member

makwarth commented Jun 11, 2018

@blaklaybul @jgowdyelastic

@sqren made good progress Friday and Monday. Here's a screenshot of the WIP implementation :)
screen shot 2018-06-11 at 13 28 43

Questions for ML team

  • What's result_type? Should we only show result_type: bucket?
  • What's the time range for the ML job? All data in the given index?
  • We thought results would be in .ml-results, but looks like they're in .ml-anomalies-shared - can you confirm?
  • Bucket resolution is 900 (15 min)
    • As far as I remember, the reason for not setting this to 1min was that Watcher could get noisy if alerting on anomaly score > 75, or similar. However, when we do the native Watcher integration, we could set the Watch to look at past 15min and use avg. of anomaly score to determine, if we should alert or not? 1m resolution would allow us to show ML bounds on all zoom levels.

APM UI list (for 6.4 MVP implementation)

  • Add transaction name (e.g. request) to prefix
  • Don't show n/a results
  • Flyout to start the ML job.
    • Copy should say that this feature is in Beta and that it can take for the ML job to produce and finalize results, so there'll be some delay before boudns are shown on the graph.
    • To delete the job, user will have to go find it in the ML UI

cc @formgeist

@jgowdyelastic

This comment has been minimized.

Copy link
Member Author

jgowdyelastic commented Jun 11, 2018

What's result_type? Should we only show result_type: bucket?

It depends on what are you information you are trying to plot. Is this for marking anomalies on the chart?

What's the time range for the ML job? All data in the given index?

Yes, it is all data in the given index unless you specify the optional start or end arguments when calling modules/setup endpoint.
Note, supplying an end argument will cause ML's datafeed to stop at that point, which isn't what you want I believe.
Omitting start and end will run the datafeed over the whole index and then keep it running for any new data added to the index.

We thought results would be in .ml-results, but looks like they're in .ml-anomalies-shared - can you confirm?

yes, by default the results will be written to .ml-anomalies-shared

As far as I remember, the reason for not setting this to 1min was that Watcher could get noisy if alerting on anomaly score

This isn't the only reason, 15min bucket span is more likely to give the best results.
1 min will be far to small to provide decent anomaly detection.

@sqren

This comment has been minimized.

Copy link
Member

sqren commented Jun 11, 2018

It depends on what are you information you are trying to plot. Is this for marking anomalies on the chart?

I'm interested in the lower- and upper boundary, and anomaly score. This is how I currently do it:

      aggs: {
        ml_avg_response_times: {
          date_histogram: {
            field: 'timestamp',
            interval: intervalString,
            min_doc_count: 0,
            extended_bounds: {
              min: start,
              max: end
            }
          },
          aggs: {
            anomaly_score: { avg: { field: 'anomaly_score' } },
            lower: { avg: { field: 'model_lower' } },
            upper: { avg: { field: 'model_upper' } }
          }
        }
      }

Currently I just query the entire index. Should I filter by result_type?

@jgowdyelastic

This comment has been minimized.

Copy link
Member Author

jgowdyelastic commented Jun 11, 2018

ah ok,
yeah, adding a filter would be a good idea.
the model values (model_lower and model_upper) come from model_plot result documents.
The anomaly_score values come from bucket result documents.

@sqren

This comment has been minimized.

Copy link
Member

sqren commented Jun 11, 2018

yeah, adding a filter would be a good idea.
the model values (model_lower and model_upper) come from model_plot result documents.
The anomaly_score values come from bucket result documents.

Can I still get the result in a single query then? Does it makes sense to filter by result_type: 'bucket' OR result_type: 'model_plot'?

@sqren

This comment has been minimized.

Copy link
Member

sqren commented Jun 11, 2018

This isn't the only reason, 15min bucket span is more likely to give the best results.
1 min will be far to small to provide decent anomaly detection.

This makes sense. In APM UI we display the ML boundaries on top of the response time data which has a much higher resolution. The way our visualizations work doesn't allow for series with different resolutions (different number of values of x-axis). So to avoid gaps in the ML boundary area I replace null values with the previous value. It is a super crude work-around (a bit better would be to use regression analysis of some kind) but works well enough. Example:

// before:
[null, null, 4, null, 5, null, null, 3, null, null, null, 4]

// after:
[null, null, 4, 4, 5, 5, 5, 3, 3, 3, 3, 4]

I was hoping elasticsearch could do the extrapolation for me, but seems I have to do it manually. Any better ideas?

@jgowdyelastic

This comment has been minimized.

Copy link
Member Author

jgowdyelastic commented Jun 11, 2018

Yes, you should be able to use an OR or perhaps a terms based filter

terms: {
  result_type: [
    'bucket',
    'model_plot'
  ]
}

Regarding filling in the null values, does the charting lib you're using not cater for null values like this?
We use D3 and it does this for us.
In fact we have to go out of our way to stop it doing it when we want to draw the gaps in the data.

@sqren

This comment has been minimized.

Copy link
Member

sqren commented Jun 11, 2018

Regarding filling in the null values, does the charting lib you're using not cater for null values like this?
We use D3 and it does this for us

We also use d3 but I still have this problem. By default null values are interpreted as 0, and looks like this:
screen shot 2018-06-11 at 23 28 53

Those "dips" are clearly no good, so I added d3's defined which skips null values. This create a non-continuous area-chart which doesn't look great either:
screen shot 2018-06-11 at 23 28 19

Finally I replaced the null gaps with last-known-value as mentioned before and it looks ok:
screen shot 2018-06-11 at 23 29 27

Apart from filtering null values out (which is not an option due to how I architected it unfortunately) I don't know any better solutions. Maybe you have a better solution that I could use?

@formgeist

This comment has been minimized.

Copy link
Contributor

formgeist commented Jun 12, 2018

@sqren I think we discussed in the design phase that we'd use last-know-value, when it was a dotted line graph, and I still think that's better than restricting the zoom level and skipping null values. In the skipping null values example, we're still showing last-know-value for each bucket, so it looks even more strange IMO.

@jgowdyelastic

This comment has been minimized.

Copy link
Member Author

jgowdyelastic commented Jun 12, 2018

Apart from filtering null values out (which is not an option due to how I architected it unfortunately)

Ah, this is how we do it. We keep the different types of data in separate arrays and draw them independently to avoid the problems caused by mismatching data points

@elasticmachine

This comment has been minimized.

Copy link

elasticmachine commented Jan 30, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.