Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monitoring telemetry collection interval #34609

Merged
merged 21 commits into from Apr 27, 2019

Conversation

@Bamieh
Copy link
Member

commented Apr 5, 2019

Work on @chrisronline draft PR #33705


Relates to #32519

This is a straight copy and paste from this comment. This PR implements this idea.

In order for this PR to work, you'll need to update the .monitoring-kibana index template in ES.

PUT _template/.monitoring-kibana

Copy and paste the json from here -> https://gist.github.com/chrisronline/fdf4f549d4e96cdda06e7b20cf88ce2a

Okay I think I have a plan for this and want to run it by everyone. This solution will ensure that kibana monitoring collection will only poll for usage collector data at the same interval as the normal telemetry service.

  • Add a flag (isUsageCollector) to each usage collector, as an actual property similar to ignoreForInternalUploader so monitoring can know it's a usage collector (see why we have to do it this way)
  • Modify the kibana monitoring collection code to ONLY collect from usage collectors at the same rate as default telemetry (This will be kept in sync by exposing the interval through the xpack_main plugin). However, the monitoring code will pull from stats collectors at the current interval (default to 10s) to ensure the monitoring service and UIs continue to function properly. This will mean that some .monitoring-kibana-* documents will contain a usage field, and some will not.
  • Update the .monitoring-kibana index template to include mappings for the usage field (it currently does not) - Doing this will require a version dance as we want to ensure that folks don't run into a situation where Kibana is assuming this mapping exists but it doesn't yet in ES. This shouldn't happen as we explicitly recommend the monitoring cluster to be at least the same version as all monitored clusters
  • Update the query for collecting usage data from monitoring indices for telemetry to filter out documents that do not contain the usage field to ensure it only fetches documents that actually contain usage data (which will now occur at the same interval as default telemetry)

chrisronline and others added some commits Mar 21, 2019

@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Apr 5, 2019

@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Apr 5, 2019

@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Apr 5, 2019

@Bamieh

This comment has been minimized.

Copy link
Member Author

commented Apr 6, 2019

retest

@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Apr 6, 2019

@peterschretlen

This comment has been minimized.

Copy link
Contributor

commented Apr 8, 2019

@Bamieh I think the test failure here might be legitimate? When we call the telemetry stats API /api/telemetry/v1/clusters/_stats makes a query filtering on {"exists":{"field":"kibana_stats.usage.index"}}. For me this is always returning no results, so the telemetry data does not have usage data.

(I have updated the monitoring-kibana template)

If I remove that filter clause, I can see the result set has documents with kibana_stats.usage.index

@peterschretlen

This comment has been minimized.

Copy link
Contributor

commented Apr 8, 2019

This query returns results:

GET .monitoring-kibana-7-2019.04.08/_search?filter_path=hits.hits._source.timestamp,hits.hits._source.kibana_stats.usage.index
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "type": "kibana_stats"
                }
              }
            ]
          }
        },
        {
          "exists": {
            "field": "kibana_stats"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "timestamp": "desc"
    }
  ]
}

it will return the following (can see the mix of usage data and no usage data)

{
  "hits" : {
    "hits" : [
      {
        "_source" : {
          "timestamp" : "2019-04-08T18:16:06.754Z"
        }
      },
      {
        "_source" : {
          "timestamp" : "2019-04-08T18:15:56.750Z"
        }
      },
      {
        "_source" : {
          "timestamp" : "2019-04-08T18:15:46.786Z",
          "kibana_stats" : {
            "usage" : {
              "index" : ".kibana"
            }
          }
        }
      }
   ]
 }
}

But updating the filter to kibana_stats.usage.index gives back empty result set:

GET .monitoring-kibana-7-2019.04.08/_search?filter_path=hits.hits._source.timestamp,hits.hits._source.kibana_stats.usage.index
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "type": "kibana_stats"
                }
              }
            ]
          }
        },
        {
          "exists": {
            "field": "kibana_stats.usage.index"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "timestamp": "desc"
    }
  ]
}
@peterschretlen

This comment has been minimized.

Copy link
Contributor

commented Apr 8, 2019

Never mind ... after deleting the index and recreating it, the queries work fine. It must have been created before I updated the template.

@peterschretlen peterschretlen self-requested a review Apr 8, 2019

@peterschretlen
Copy link
Contributor

left a comment

LGTM - I tested both the collection and the telemetry locally, and working as expected. I think the integration test is failing for the same reason my local testing was failing for a while, indices weren't created with the new template. Once the ES changes show up in snapshots this should be good to go.

@peterschretlen

This comment has been minimized.

Copy link
Contributor

commented Apr 8, 2019

retest

@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Apr 8, 2019

@peterschretlen

This comment has been minimized.

Copy link
Contributor

commented Apr 8, 2019

looks like the snapshot already has the new template. This is captured from the same ES snapshot running locally ( https://snapshots.elastic.co/8.0.0-55bab4fb/downloads/elasticsearch/elasticsearch-8.0.0-SNAPSHOT-darwin-x86_64.tar.gz )

$ curl -s --user elastic:changeme localhost:9220/_template/.monitoring-kibana | jq '.[".monitoring-kibana"].mappings.properties.kibana_stats.properties.usage'
{
  "properties": {
    "index": {
      "type": "keyword"
    }
  }
}

The failures are reproducible locally by running the api tests, so it is something other than the template.

@ycombinator

This comment has been minimized.

Copy link
Contributor

commented Apr 12, 2019

Hi, AFAICT the changes in this PR (and the original draft PR proposed by @chrisronline) will only affect internal collection, that is when Kibana is collecting usage metrics internally and bulk uploads them to the Elasticsearch POST _monitoring/bulk API endpoint.

However, the long term plan for Monitoring has been for Metricbeat to externally collect monitoring metrics from Kibana. Metricbeat does this by calling the GET api/stats?extended=true Kibana API and then sending the collected metrics to the Elasticsearch POST _bulk API endpoint.

If this PR is merged, the internal collection will work as we want it to. That is, usage will only be collected every telemetryCollectionInterval and bulk uploaded to the Elasticsearch POST _monitoring/bulk API endpoint. However, if the user sets up Metricbeat collection, I think usage will be collected every time Metricbeat calls the GET api/stats?extended=true Kibana API (default: every 10s), which then would defeat the goal of this PR.

@chrisronline Please correct me if I got any of the above wrong.

@chrisronline

This comment has been minimized.

Copy link
Contributor

commented Apr 12, 2019

@ycombinator That is correct. This is a general and reoccurring issue between MB and internal collection. We opened a ticket to discuss how to make this better: #34940

Bamieh added some commits Apr 23, 2019

@peterschretlen

This comment has been minimized.

Copy link
Contributor

commented Apr 23, 2019

@Bamieh I was able to fix the functional test failure with these changes:
5dc9aa6

Bamieh added some commits Apr 26, 2019

@pickypg
Copy link
Member

left a comment

LGTM, pending CI passing.

Update x-pack/plugins/xpack_main/server/lib/telemetry/monitoring/get_…
…high_level_stats.js

Co-Authored-By: Bamieh <ahmadbamieh@gmail.com>

@Bamieh Bamieh requested a review from peterschretlen Apr 26, 2019

@peterschretlen
Copy link
Contributor

left a comment

LGTM

@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Apr 26, 2019

@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Apr 26, 2019

Bamieh added some commits Apr 26, 2019

Merge branch 'monitoring-telemetry-collection-interval' of github.com…
…:Bamieh/kibana into monitoring-telemetry-collection-interval
@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Apr 27, 2019

@Bamieh Bamieh merged commit 27de17a into elastic:master Apr 27, 2019

2 checks passed

CLA All commits in pull request signed
Details
kibana-ci Build finished.
Details

Bamieh added a commit that referenced this pull request Apr 27, 2019

Monitoring telemetry collection interval (#34609)
* Only collect usage data once a day from kibana monitoring

* isUsageCollector

* bulk uploader tests

* linting

* condition filter

* checkout yarn.lock from master

* update mappings for functional tests

* debugging for refactoring

* support legacy mappings

* self code review

* isUsageCollector

* fix mock

* live coding session with pickypg

* self code review

* Update x-pack/plugins/xpack_main/server/lib/telemetry/monitoring/get_high_level_stats.js

Co-Authored-By: Bamieh <ahmadbamieh@gmail.com>

* update mappings

Bamieh added a commit that referenced this pull request Apr 27, 2019

Monitoring telemetry collection interval (#34609)
* Only collect usage data once a day from kibana monitoring

* isUsageCollector

* bulk uploader tests

* linting

* condition filter

* checkout yarn.lock from master

* update mappings for functional tests

* debugging for refactoring

* support legacy mappings

* self code review

* isUsageCollector

* fix mock

* live coding session with pickypg

* self code review

* Update x-pack/plugins/xpack_main/server/lib/telemetry/monitoring/get_high_level_stats.js

Co-Authored-By: Bamieh <ahmadbamieh@gmail.com>

* update mappings

Bamieh added a commit that referenced this pull request Apr 27, 2019

Monitoring telemetry collection interval (#34609)
* Only collect usage data once a day from kibana monitoring

* isUsageCollector

* bulk uploader tests

* linting

* condition filter

* checkout yarn.lock from master

* update mappings for functional tests

* debugging for refactoring

* support legacy mappings

* self code review

* isUsageCollector

* fix mock

* live coding session with pickypg

* self code review

* Update x-pack/plugins/xpack_main/server/lib/telemetry/monitoring/get_high_level_stats.js

Co-Authored-By: Bamieh <ahmadbamieh@gmail.com>

* update mappings

peterschretlen added a commit that referenced this pull request Apr 27, 2019

Monitoring telemetry collection interval (#34609) (#35693)
* Only collect usage data once a day from kibana monitoring

peterschretlen added a commit that referenced this pull request Apr 27, 2019

[6.7] Monitoring telemetry collection interval (#34609) (#35695)
* Monitoring telemetry collection interval (#34609)

* Only collect usage data once a day from kibana monitoring

Bamieh added a commit that referenced this pull request Apr 27, 2019

Monitoring telemetry collection interval (#34609) (#35692)
* Only collect usage data once a day from kibana monitoring

* isUsageCollector

* bulk uploader tests

* linting

* condition filter

* checkout yarn.lock from master

* update mappings for functional tests

* debugging for refactoring

* support legacy mappings

* self code review

* isUsageCollector

* fix mock

* live coding session with pickypg

* self code review

* Update x-pack/plugins/xpack_main/server/lib/telemetry/monitoring/get_high_level_stats.js

Co-Authored-By: Bamieh <ahmadbamieh@gmail.com>

* update mappings
@ycombinator

This comment has been minimized.

Copy link
Contributor

commented Apr 28, 2019

I noticed that this PR has been labeled with v7.0.1. The 7.0.0 branch backport of this PR was merged a day ago, however the latest BC for 7.0.1 was made 2 days ago. Are we expecting a respin of 7.0.1 or should this PR be relabeled with v7.0.2 instead of v7.0.1?

@pickypg

This comment has been minimized.

Copy link
Member

commented Apr 28, 2019

Assuming we get a response respin (autocorrect!) of 6.7.2, we’ll get one for 7.0.1 as well.

lizozom added a commit to lizozom/kibana that referenced this pull request Apr 29, 2019

Monitoring telemetry collection interval (elastic#34609)
* Only collect usage data once a day from kibana monitoring

* isUsageCollector

* bulk uploader tests

* linting

* condition filter

* checkout yarn.lock from master

* update mappings for functional tests

* debugging for refactoring

* support legacy mappings

* self code review

* isUsageCollector

* fix mock

* live coding session with pickypg

* self code review

* Update x-pack/plugins/xpack_main/server/lib/telemetry/monitoring/get_high_level_stats.js

Co-Authored-By: Bamieh <ahmadbamieh@gmail.com>

* update mappings
@Bamieh

This comment has been minimized.

Copy link
Member Author

commented Apr 29, 2019

Notes for QA

  1. Monitoring was collecting kibana usage from telemetry usage stats every 10 seconds, now it collects it every 1 day and saves it into the monitoring index every 1 day.
  2. Monitoring still saves other stats every 10 seconds. Which means you should see a new monitoring document every 10seconds but without the kibana usage stats. you should only see a doc containing kibana usage stats once per day
  3. when telemetry wants to report data while monitoring is enabled, it hits the monitoring index and filters for the document that contains the usage stats, and uses that to report the usage data.

Some cases for when monitoring will put the kibana_stats.usage along with the rest of the monitoring data.

  1. server off + monitoring enabled: when it turns on, it puts kibana_stats.usage into a doc, next time after 24 hours
  2. server on + monitoring disabled: when monitoring is turned on, it puts kibana_stats.usage into the first doc, 24 hours
  3. server restart + monitoring enabled: puts a doc with kibana_stats.usage regardless of last insert, next time after 24 hours
  4. server on + monitoring enabled + 24 hours passed: monitoring will put a new doc with kibana_stats.usage and restart the clock for the next 24 hours.
  5. From the browser telemetry will report kibana_stats.usage from last doc in monitoring index that contains it (you can check by going to the advanced settings > example of what we collect instead of waiting 24 hours).
  6. The token for browser tracking last time reported to telemetry is different from the monitoring token for tracking last time monitoring reported usage data to monitoring index (time is not synced between the two as they not related)

@Bamieh Bamieh deleted the Bamieh:monitoring-telemetry-collection-interval branch Apr 29, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.