Fingerprint/composite field types #84282

Open
dgieselaar opened this issue Feb 23, 2022 · 11 comments
Labels
:Analytics/Aggregations Aggregations >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@dgieselaar
Member

In the APM app (and probably in Observability in general) we sometimes use the composite of multiple fields as "keys" for a certain timeseries. E.g., we might use a nested terms aggregation on service.name + service.environment. There are several downsides to this approach currently:

  • Nested terms aggregations require us to understand the relative cardinality of each field up front, e.g. we set a size of 5000 for service.name, and then 10 for service.environment. If those sizes are too low, some time series will not be included.
  • Composite aggregations are not sortable and often slower than a terms agg (nested, but especially single).
  • Multi-terms aggregations are even slower.
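For reference, the nested terms and multi-terms request shapes compared above could look roughly like this. This is a minimal sketch in Python that only builds the request bodies; the helper names and the aggregation names ("services", "environments") are hypothetical, and the sizes 5000 and 10 are the ones mentioned in the first bullet.

```python
def nested_terms_body(name_size=5000, env_size=10):
    """Nested terms aggregation: one level per field, each with its own size."""
    return {
        "size": 0,  # no hits needed, only aggregation results
        "aggs": {
            "services": {
                "terms": {"field": "service.name", "size": name_size},
                "aggs": {
                    "environments": {
                        "terms": {
                            "field": "service.environment",
                            "size": env_size,
                        }
                    }
                },
            }
        },
    }


def multi_terms_body(size=5000):
    """Multi-terms aggregation: a single level keyed on both fields."""
    return {
        "size": 0,
        "aggs": {
            "services": {
                "multi_terms": {
                    "size": size,
                    "terms": [
                        {"field": "service.name"},
                        {"field": "service.environment"},
                    ],
                }
            }
        },
    }
```

The nested variant needs two sizes that have to be guessed per field, while the multi-terms variant takes one size for the combined key.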

We also use the terms enum API to get a list of service names fast. However, we cannot use this for multiple fields.
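To illustrate the single-field limitation, here is a minimal sketch of a terms enum request body, which would be sent to the `_terms_enum` endpoint of an index. The helper name is hypothetical. The body takes exactly one `field`, so there is no way to ask for service.name and service.environment together.

```python
def terms_enum_body(field, prefix="", size=10):
    """Build a terms enum request body for a single field.

    `prefix` is sent as the `string` parameter, which restricts the
    returned terms to those starting with that prefix.
    """
    body = {"field": field, "size": size}
    if prefix:
        body["string"] = prefix
    return body
```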

One workaround would be to add an ingest processor that "fingerprints" values from multiple fields into a single keyword field, and to aggregate over that field instead. However, this comes with the downside that we would have to come up with our own serialization/deserialization logic.
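As a rough idea of what that serialization/deserialization logic might involve, here is a minimal client-side sketch in Python. It is not an ingest processor; the function names are hypothetical, the document is assumed to be a flat dict with dotted field names, and encoding the values as a JSON array is just one possible choice that avoids ambiguity when a value contains a separator such as "/".

```python
import json

# Assumption: the fields that make up the composite key, in a fixed order.
FIELDS = ("service.name", "service.environment")


def encode_fingerprint(doc, fields=FIELDS):
    """Serialize the values of `fields` into a single keyword-friendly string."""
    values = [doc.get(field) for field in fields]  # missing fields become None
    return json.dumps(values, separators=(",", ":"), ensure_ascii=False)


def decode_fingerprint(key, fields=FIELDS):
    """Turn a fingerprint string back into a field -> value mapping."""
    values = json.loads(key)
    if len(values) != len(fields):
        raise ValueError("fingerprint does not match the expected fields")
    return dict(zip(fields, values))
```

Both sides have to agree on the field order and the encoding, which is the maintenance burden a built-in field type would remove.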

Ideally, ES can help us here by adding a field type for this purpose - I'm using fingerprint here because a composite field type is already a thing in ES, but the name is probably not the best. The mapping could look as follows:

{
  "properties": {
    "service": {
      "properties": {
        "name": {
          "type": "keyword"
        },
        "environment": {
          "type": "keyword"
        },
        "id": {
          "type": "fingerprint",
          "fields": [
            "service.name",
            "service.environment"
          ]
        }
      }
    }
  }
}

Suppose that we run a terms aggregation on service.id:

{
  "aggs": {
    "service.id": {
      "terms": {
        "field": "service.id"
      }
    }
  }
}

Elasticsearch would return the composite values as follows:

{
  "aggs": {
    "service.id": {
      "buckets": [
        {
          "key": "opbeans-java/production",
          "key_as_value": {
            "service.name": "opbeans-java",
            "service.environment": "production"
          }
        }
      ]
    }
  }
}

Or, when we call the terms enum API (which would have to be a breaking change, I guess?):

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "terms" : [
    { "service.name": "apm-server", "service.environment": "development" },
    { "service.name": "apm-server", "service.environment": "production" }
  ],
  "complete" : true
}
@ywelsch ywelsch added :Analytics/Aggregations Aggregations :Search/Search Search-related issues that do not fall into other categories labels Feb 23, 2022
@elasticmachine elasticmachine added Team:Search Meta label for search team Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Feb 23, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@ywelsch
Contributor

ywelsch commented Feb 23, 2022

I've tagged both search and analytics teams as this touches on areas covered by both. Each team can discuss and leave their thoughts here.

@imotov
Contributor

imotov commented Feb 23, 2022

Do you anticipate a need for multiple fingerprints per document? If not, this is basically what we are doing with _tsid in time series indices.

@dgieselaar
Member Author

@imotov yeah, I think so. E.g., for the service inventory we might only need service name + env, but then when drilling down into the service detail page, we'd like to add transaction type and maybe host name.

@nik9000
Member

nik9000 commented Feb 23, 2022

I'd be curious to see a picture of the thing you are building with the results here. We sure can build fingerprint fields if it's the right thing. But maybe the right thing is to make the multi-field terms agg faster.

@dgieselaar
Member Author

@nik9000 the thing that started this discussion was that we are experimenting with populating the service inventory (our landing page that has a list of all APM services) with the terms enum API to speed up perceived performance. However, one drawback there is that we'd like to filter on/group by environment, and the terms enum API will only return values for a single field. That is something that the multi terms agg cannot solve, I think, though I am all in favor of a speed boost for the multi terms agg. We do have some cases where we use a nested terms agg instead of multi terms because the former is a lot faster, even though multi terms should be the more appropriate agg in theory.

@dgieselaar
Member Author

dgieselaar commented Mar 1, 2022

Another thing I'm wondering about: suppose we have such a field defined over three different fields, e.g. service.name, service.environment and transaction.type. I'd like to run a terms agg on only two of those fields, which would mean that ES would have to merge buckets in the reduce phase. Is something like that a reasonable thing to do with a field type like this?

Maybe that's more of a TSDB thing though.
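To make the bucket-merging question concrete, here is a minimal sketch in Python of what such a reduce step would have to do, written as client-side post-processing. It assumes buckets shaped like the proposed response earlier in this issue (the hypothetical `key_as_value` object) plus the usual `doc_count`; the function name is hypothetical, and this says nothing about how Elasticsearch would implement it internally.

```python
from collections import Counter


def merge_buckets(buckets, keep):
    """Collapse buckets keyed on many fields into buckets keyed on `keep`.

    Buckets that agree on all fields in `keep` are merged by summing
    their doc_count. Results are ordered by doc_count, descending.
    """
    merged = Counter()
    for bucket in buckets:
        key = tuple(bucket["key_as_value"][field] for field in keep)
        merged[key] += bucket["doc_count"]
    return [
        {"key_as_value": dict(zip(keep, key)), "doc_count": count}
        for key, count in merged.most_common()
    ]
```

Summing works for doc_count and other additive metrics; sub-aggregations that are not additive would need their own merge logic.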

@nik9000
Member

nik9000 commented Mar 1, 2022 via email

@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@javanna javanna added :Search Foundations/Mapping Index mappings, including merging and defining field types and removed :Search/Search Search-related issues that do not fall into other categories labels Jul 17, 2024
@elasticsearchmachine elasticsearchmachine added Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 17, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)


8 participants