Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use domain and registered_domain for the domain breakdown. #163

Closed
wants to merge 13 commits into from

Conversation

webmat
Copy link
Contributor

@webmat webmat commented Nov 1, 2018

This PR removes fields subdomain and hostname from destination and source.

View the readme on my branch

Closes #84

@webmat
Copy link
Contributor Author

webmat commented Nov 1, 2018

Sorry I've pinged everyone as a reviewer. But it's been so long before we reached a decision on this, I want to make sure everyone has a look before merging :-)

@webmat webmat requested a review from robgil November 1, 2018 15:24
schemas/url.yml Outdated
@@ -24,15 +24,25 @@
Note: The `:` is not part of the scheme.
example: https

- name: hostname
level: extended
- name: domain
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@webmat Sorry, that slipped through on my end. As mentioned before this is a very common standard to use hostname for url parsers. We can't do this for url.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you suggest we do here?

Copy link
Contributor Author

@webmat webmat Nov 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since a URL's host can either be a domain name or an IP, I agree that domain may not work here. Is that what you're referring to?

The correct wording to represent this would actually be host, not hostname, however (see http://pretty-rfc.herokuapp.com/RFC3986#host).

So if we go to url.host, I would be fine to word it like:

If 'host' is a domain name, it should be copied to source.domain or destination.domain, whichever is appropriate.

So the domain breakdown could happen there, and domains would be expected in two places instead of 3 (a plus, in my mind).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

host would contain the port, but it doesn't. hostname is correct here.

Just leave it hostname and we are good.

@webmat
Copy link
Contributor Author

webmat commented Nov 1, 2018

I undid the changes under url. for now. This PR is now only about source and destination.

I still think if we're parsing URLs hostname is not the correct terminology (it should be host, which can be an IP or a domain name), so there's still work to be done in url.

Copy link
Contributor

@MikePaquette MikePaquette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM. Slightly off topic but re: #163 (comment) we can't actually call it url.host due to the conflict with the host.* object. Idea: Maybe we should remove url.hostname and just map it to destination.domain

fields.yml Outdated
description: >
Destination subdomain.
Destination domain, with the subdomain stripped out.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could also include that this is the “effective top level domain plus one additional level” as secondary description to remove any ambiguity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me working on converting the modules and with grok has brought some clarity here. I think the conflict we had discussing this that we looked at it from 2 different perspective:

  • Making ingest easy
  • Making consumption easy

ECS should help with both aspects.

  • ip: Fields we know are an ipv4 or ipv6. geoip enrichment can only be done on this field.
  • hostname: On ingest time we often don't know if a field is in IP or Hostname format. In grok this is this the IPORHOST pattern. We use hostname on the ingest side to simplify the ingestion.
  • domain: The recommendation here is to use the registered domain. Often this is a post processing task from the hostname field.

In summary, hostname is the field that is often used on ingest time as we don't know the exact content, it could even be a mix of ip and hostnames in an array.

Why hostname and not host?

hostname is used in the grok patterns and also for example on Linux or OS X it's also the hostname command. This makes it very intuitive.

Also as mentioned by @MikePaquette it conflicts with our host object.

@webmat
Copy link
Contributor Author

webmat commented Nov 2, 2018

Ok I will remove the mentions of hostname in this PR, and only keep the domain and registered_domain fields. Let's move the discussion to #166 for the hostname field.

Nic's point that in some of the web server logs, it's possible for the server to perform a reverse lookup on the source IP of each request, and log a hostname instead of an IP (although it's disabled by default). So I'm now convinced that we do need a field for this value that can either be a hostname or an IP. But I'm also still convinced that hostname is not the right name here, after reading RFCs 2396, 3986 and 7230. So let's hash this out in #166 ;-)

@webmat webmat force-pushed the domain_breakdown branch 2 times, most recently from d7ecc80 to 4456d91 Compare November 2, 2018 13:51
Copy link
Contributor Author

@webmat webmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicolas submitted a laser-focused PR earlier today, strictly about removing subdomain. His PR is now merged in master.

I've now rebased this PR on top of it.

Starting now, this PR is 100% about the clarification of the field domain (full thing) and the introduction of registered_domain (subdomain stripped out).

It's now ready for review again.

@@ -25,7 +25,7 @@ All notable changes to this project will be documented in this file based on the
* Update definition of `service.type` and `service.name`.
* Redefine purpose of `agent.name` field to be user defined field.
* Rename `url.href` to `url.original`.
* Remove `source.subdomain` and `destination.subdomain` fields.
* Remove `source.subdomain` and `destination.subdomain` fields. #165
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is from another PR, I'm just adding the PR link here.

@ruflin
Copy link
Member

ruflin commented Nov 5, 2018

@webmat I wonder if we go with source.name in #166 if we can use this also for what we use source.domain here and then source.registred_domain becomes source.domain.

@webmat
Copy link
Contributor Author

webmat commented Nov 5, 2018

@ruflin I'd be worried that .name is too generic to represent the domain name of a source / destination.

As I've said elsewhere, I would expect the registered_domain (with subdomain stripped out) value to not be populated as often by integrations, since it requires additional processing based on a TLD list & so on.

This would leave people seeing their domain names in most cases only in source.name and destination.name, which will feel strange, IMO.

I'm totally open to changing the name registered_name to something else, if you feel the name is clumsy. After the RFC review leading to #166, I no longer feel like it's a perfect name. "Registered name", afaik, could mean a short machine name that's in an internal DNS server. It doesn't actually mean the highest level of domain you can purchase (TLD+1 level).

Actually, based on the above paragraph, I'd be tempted to keep this PR for after beta1. This would mean the usage of the existing field domain is still ambiguous for the release. And we can take the time to find the perfect pair of names for this breakdown after beta1. I could go with that.

@webmat
Copy link
Contributor Author

webmat commented Dec 11, 2018

Closing this. This is based on a version of ECS that's too outdated.

We will start fresh on this later, in #267.

@webmat webmat closed this Dec 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants