
Add EC2, GCE, or DigitalOcean metadata to events #2728

Merged

Conversation

@andrewkroh (Member) commented Oct 7, 2016

This introduces a new processor called `add_cloud_metadata` that detects the hosting provider, caches instance metadata, and enriches each event with the data. There is one configuration option named `timeout` that sets the maximum amount of time the metadata fetch may run. The default value is 3s.

Sample config:

```
processors:
- add_cloud_metadata:
    #timeout: 3s
```

Sample data from the providers:

```
{
  "meta": {
    "cloud": {
      "availability_zone": "us-east-1c",
      "instance_id": "i-4e123456",
      "machine_type": "t2.medium",
      "provider": "ec2",
      "region": "us-east-1"
    }
  }
}
{
  "meta": {
    "cloud": {
      "instance_id": "1234567",
      "provider": "digitalocean",
      "region": "nyc2"
    }
  }
}
{
  "meta": {
    "cloud": {
      "availability_zone": "projects/1234567890/zones/us-east1-b",
      "instance_id": "1234556778987654321",
      "machine_type": "projects/1234567890/machineTypes/f1-micro",
      "project_id": "my-dev",
      "provider": "gce"
    }
  }
}
```

go func() { c <- fetchJSON("gce", gceHeaders, gceMetadataURL, gceSchema) }()

var results []result
timeout := time.After(5 * time.Second)
Contributor:

I would use a variable like ProviderTimeout instead of using the number 5, in order to make it somehow configurable.

@andrewkroh (Member, Author):

The timeout is now configurable and defaults to 3s. In practice, when running on EC2, the request completes in about 2 ms.

Timeout: 2 * time.Second,
KeepAlive: 0, // We are only making a single request.
}).Dial,
ResponseHeaderTimeout: 2 * time.Second,
Contributor:

@andrewkroh How did you choose 2 here? Should be 2 < 5, right?

@andrewkroh (Member, Author) commented Oct 8, 2016:

I chose 5 to allow the individual requests to either complete or time out on their own. In the worst case, a request could take ~4 seconds (2 seconds for the connect timeout + 2 seconds for the response header timeout). I left a 1 second buffer since it's executing three requests in parallel and they might not all be scheduled immediately.
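To make that arithmetic concrete, here is a minimal sketch (not the PR's actual code) of how per-phase transport timeouts bound the worst case under an overall deadline. It is simplified to a single probe, and the endpoint URL is only a placeholder:

```
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	// Per-request limits: up to 2s to connect plus up to 2s waiting for
	// response headers, so a single probe takes at most ~4s.
	client := &http.Client{
		Transport: &http.Transport{
			Dial: (&net.Dialer{
				Timeout:   2 * time.Second,
				KeepAlive: 0, // only a single request is made
			}).Dial,
			ResponseHeaderTimeout: 2 * time.Second,
		},
	}

	done := make(chan error, 1)
	go func() {
		// Placeholder URL; the real processor probes the providers' metadata services.
		resp, err := client.Get("http://169.254.169.254/")
		if err == nil {
			resp.Body.Close()
		}
		done <- err
	}()

	// An overall budget of 5s leaves ~1s of headroom over the ~4s worst case,
	// since the goroutines might not all be scheduled immediately.
	select {
	case err := <-done:
		fmt.Println("probe finished:", err)
	case <-time.After(5 * time.Second):
		fmt.Println("gave up after overall deadline")
	}
}
```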

@andrewkroh (Member, Author):

I changed the timeout implementation to make it configurable, so this code is now different and relies on http.Client.Timeout.
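Roughly the idea, as a sketch that assumes the configured timeout is plumbed through as a time.Duration (the config struct and field names here are illustrative, not the PR's exact code):

```
package example

import (
	"net/http"
	"time"
)

// config is a hypothetical stand-in for the processor's configuration.
type config struct {
	Timeout time.Duration // maximum time metadata fetching may run; default 3s
}

// newClient builds a client whose overall deadline comes from the config,
// relying on http.Client.Timeout instead of separate per-phase transport timeouts.
func newClient(c config) *http.Client {
	timeout := c.Timeout
	if timeout <= 0 {
		timeout = 3 * time.Second
	}
	return &http.Client{Timeout: timeout}
}
```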

@monicasarbu (Contributor) commented Oct 7, 2016

@andrewkroh This is a great PR 👍

I only have a few minor comments:

  • I would suggest adding an example with the cloud_metadata processor to the .full.yml, and maybe mentioning that it is experimental.
  • As the other actions are drop_* or include_*, starting with a verb, I suggest changing the name from cloud_metadata to something similar, maybe add_metadata with an option cloud: true, or just add_cloud_metadata?

@andrewkroh force-pushed the feature/aws-metadata-processor branch from 0ca4c19 to cd00d89 on October 8, 2016 16:33
@andrewkroh (Member, Author):

I renamed the processor to add_cloud_metadata. I added it to the config.full.yml.

@ruflin (Member) left a comment:

Great addition. A few thoughts:

  • I would suggest we either split this one processor up into 4 processors or make the type configurable. We could have a processor for each cloud type plus one which does the auto-detection. Currently, as far as I understand, it checks all 3 options every time? We could also have a config option inside the processor which defines the type.
  • Is there a specific reason for the 3s default timeout? If not, I suggest we use the same default as we use for other "metrics" in Metricbeat, which is 10s.
  • Namespace: This processor reserves the namespace cloud. I'm kind of worried that this could conflict with other things like a potential cloud module in Metricbeat or some data in other Beats. As we will face the same problem with other processors, we could use a general namespace for added metadata to prevent similar future conflicts. This namespace could be meta. The same namespace could be used in Filebeat for data added by readers, as we face a similar issue there: Add line number counter for multiline #2279. In addition, we could allow processors to define which field the data should be added under.
  • This definitely needs a CHANGELOG.md entry :-)
  • Do you plan to add docs for this in another PR?

  description: >
    Name of the cloud provider. Possible values are ec2, gce, or digitalocean.

- name: cloud.instance_id
Member:

We should use cloud.instance.id and cloud.instance.type here to follow our naming schema.

@andrewkroh (Member, Author):

After further thought, I changed instance_type to machine_type (following GCE's naming). So I don't think we need to make it instance.id now.


digitaloceanMetadataURL = "http://" + digitaloceanMetadataHost + digitaloceanMetadataURI
digitaloceanSchema = s.Schema{
"instance_id": c.StrFromNum("droplet_id"),
Member:

instance.id


if instance, ok := m["instance"].(map[string]interface{}); ok {
s.Schema{
"instance_id": c.StrFromNum("id"),
Member:

See above for the naming.

@@ -4,6 +4,8 @@ import (
"testing"
"time"

"encoding/json"
Member:

remove newline on top

@andrewkroh (Member, Author):

Fixed.

# provider about the host machine. It works on EC2, GCE, and DigitalOcean.
#
#processors:
#-add_cloud_metadata:
Member:

space after -

@andrewkroh (Member, Author):

Added.

@monicasarbu (Contributor):

I agree with @ruflin: have either a config option under add_cloud_metadata to choose the provider, or do autodiscovery where the provider is detected automatically. Also, I agree that exporting all this information under cloud is a bit too generic, and I suggest meta.cloud.

@urso commented Oct 10, 2016:

This PR introduces a kind of lookup processor, as the x-exec-lookup branch does. For this we introduced some namespacing on filters; e.g. x-exec-lookup registers the processor as lookup.exec. For consistency reasons I'd propose registering this one as lookup.cloud_metadata.

@andrewkroh (Member, Author):

@monicasarbu @ruflin thanks for reviewing.

Is there a specific reason for the 3s default timeout? If not, I suggest we use the same default as we use for other "metrics" in Metricbeat, which is 10s.

This processor runs once at startup and auto-detects the cloud provider using three HTTP requests executed in parallel. If the Beat is running in the cloud, it can usually reach a disposition in ~2ms. If it's not running in the cloud, these requests will usually time out because there is no route to the metadata services, which run on a special link-local IP address.

I don't think there is a need to increase the timeout. It should be able to reliably reach a disposition within that 3s window (probably even shorter would be fine). This would allow you to add the processor to all of your deployments, whether they are on-prem or in the cloud, without much of a penalty.
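As a rough illustration of that pattern (parallel probes racing against a short window), here is a sketch; the provider URLs and the result handling are placeholders and omit the real processor's schema and caching:

```
package main

import (
	"fmt"
	"net/http"
	"time"
)

type result struct {
	provider string
	err      error
}

// probe performs one metadata request; the URLs below stand in for the
// providers' link-local metadata endpoints.
func probe(client *http.Client, provider, url string, out chan<- result) {
	resp, err := client.Get(url)
	if err == nil {
		resp.Body.Close()
	}
	out <- result{provider: provider, err: err}
}

func main() {
	client := &http.Client{Timeout: 3 * time.Second}
	out := make(chan result, 3)

	// All three probes run in parallel; on a cloud host one answers in
	// milliseconds, elsewhere they fail fast or hit the timeout.
	go probe(client, "ec2", "http://169.254.169.254/", out)
	go probe(client, "gce", "http://metadata.google.internal/", out)
	go probe(client, "digitalocean", "http://169.254.169.254/metadata/v1.json", out)

	deadline := time.After(3 * time.Second)
	for i := 0; i < 3; i++ {
		select {
		case r := <-out:
			if r.err == nil {
				fmt.Println("detected provider:", r.provider)
				return
			}
		case <-deadline:
			fmt.Println("no cloud provider detected")
			return
		}
	}
	fmt.Println("no cloud provider detected")
}
```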

I would suggest we either split this one processor up into 4 processors or make the type configurable. We could have a processor for each cloud type plus one which does the auto-detection. Currently, as far as I understand, it checks all 3 options every time? We could also have a config option inside the processor which defines the type.

I really don't think this is necessary. Can we put this into master without these features, let it get used a bit, and then see if they are necessary?

Namespace: This processor reserves the namespace cloud. I'm kind of worried that this could conflict with other things like a potential cloud module in Metricbeat or some data in other Beats.

This can definitely be a problem. After looking at the exec lookup feature, I think I will default this to writing the data under fields, add a fields_under_root option, and provide a way to configure the key, which defaults to cloud. I am also going to namespace the processor as lookup.cloud_metadata, as suggested by @urso.
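To make the intent of those options concrete, here is a sketch using plain maps. The option names mirror the plan described above (which changed later in this thread); the helper and event shape are hypothetical, not libbeat's actual API:

```
package example

// addMeta is a hypothetical helper showing where the looked-up metadata
// would land depending on the planned options.
func addMeta(event, meta map[string]interface{}, key string, underRoot bool) {
	if underRoot {
		// fields_under_root: true — merge directly into the event.
		for k, v := range meta {
			event[k] = v
		}
		return
	}
	if key == "" {
		key = "cloud" // planned default key
	}
	fields, ok := event["fields"].(map[string]interface{})
	if !ok {
		fields = map[string]interface{}{}
		event["fields"] = fields
	}
	fields[key] = meta
}
```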

Do you plan to add docs for this in an other PR?

I'll write the docs in a second PR after the code and behavior are finalized.

@ruflin (Member) commented Oct 11, 2016:

@andrewkroh Thanks for the details.

This processor runs once at startup

That is the part I missed. I confused the timeout with how often it runs. I thought it updated the metadata every 3s. In this case 3s or lower makes total sense.

This also answers the second part about the config options. If it only runs once, there is very little overhead in running all 3.

I will default this to writing the data under fields

I don't think we should mix manually added data from the user and machine-generated data. That is why I would prefer NOT to put it under fields but to find another namespace. Also, I would not provide a fields_under_root option, as this will only lead to problems with overwriting fields and will invalidate our predefined templates. I don't think I fully understand the advantage of having a fields_under_root option. We could put it under the lookup.cloud_metadata namespace. This makes a logical connection between the processor and the data itself. Having auto-generation of templates from processors in mind (a long-term idea), this would make things easier ;-)

For me the only blocker to discuss for this PR is the namespace where the data will be written to.

@andrewkroh force-pushed the feature/aws-metadata-processor branch from cd00d89 to 97b7db1 on October 11, 2016 21:34
@andrewkroh (Member, Author):

I pushed a change to rename the processor to lookup.cloud_metadata. We just need to discuss where the data should go.

@andrewkroh force-pushed the feature/aws-metadata-processor branch 2 times, most recently from 259afb3 to b48acfa, on October 12, 2016 21:28
@andrewkroh (Member, Author):

This PR has been updated based on our discussions.

  • Processor name is add_cloud_metadata.
  • The data is added to events under meta.cloud. See PR description (top) for full examples.

@ruflin (Member) commented Oct 13, 2016:

LGTM. I think we should also add this change to the CHANGELOG.

@andrewkroh force-pushed the feature/aws-metadata-processor branch from b48acfa to 2918621 on October 13, 2016 13:03
@andrewkroh (Member, Author):

Added this to the CHANGELOG.

@monicasarbu merged commit f24f925 into elastic:master on Oct 13, 2016
@monicasarbu deleted the feature/aws-metadata-processor branch on October 13, 2016 13:06
@urso commented Oct 17, 2016:

I'm not sure I agree with the chosen namespaces here.

Why use add_cloud_metadata instead of lookup.cloud_metadata?

Why choose meta.cloud? I'd consider the meta namespace to be quite common with regard to filebeat+json or possibly some future metricset.

Considering more processors being added in the future, plus some more options to add custom fields in Beats (e.g. the filebeat prospector, packetbeat modules, processors), I'd opt for some general guidelines regarding namespacing here. I'm not saying one naming is better or worse than another, but mostly striving for consistency and some general agreement here.

Instead of using fields_under_root or some non-configurable top-level name, how about making the namespace for additional fields configurable? This option could be reused for the fields settings as well as for lookup-like processors.

@tsg (Contributor) commented Oct 17, 2016:

We discussed these things on Wednesday, and the discussion was about to go long (like it usually does on things like this), so we decided to go with one of the options, knowing that we still have time to change it before this sees the light of day in 5.1.

So let's continue the discussion, although this PR is probably not the best place for it.
