Add ID processor #14524

ycombinator · 2019-11-14T20:55:58Z

This PR introduces a new `add_id` processor that generates a unique ID for each event.

The processor will take the following configuration options:

| Name | Required? | Default | Description |
|---|---|---|---|
| `target_field` | No | `@metadata.id` | Field in which the generated ID should be stored. |
| `type` | No | `elasticsearch` | Type of ID to generate. Determines the ID generation algorithm. |
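
As a rough illustration of how these two options and their defaults could be modeled in Go, here is a minimal sketch. The `config` struct, its field names, the struct tags, and the `validate` method are assumptions made for this example and are not taken from the PR's code.

```go
package main

import (
	"fmt"
)

// config mirrors the two options listed above. The struct, its field
// names, and its tags are illustrative assumptions, not the PR's code.
type config struct {
	TargetField string `config:"target_field"`
	Type        string `config:"type"`
}

// defaultConfig returns the defaults given in the options table.
func defaultConfig() config {
	return config{
		TargetField: "@metadata.id",
		Type:        "elasticsearch",
	}
}

// validate rejects ID types that the processor cannot generate.
// "elasticsearch" is the only type the PR description mentions.
func (c config) validate() error {
	switch c.Type {
	case "elasticsearch":
		return nil
	default:
		return fmt.Errorf("invalid ID generator type [%s]", c.Type)
	}
}

func main() {
	c := defaultConfig()
	fmt.Println(c.TargetField, c.Type, c.validate())

	c.Type = "bogus" // hypothetical unsupported type
	fmt.Println(c.validate())
}
```

Validating `type` when the configuration is loaded means a typo is reported at startup rather than when the first event is processed.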

Currently, the only type of ID that this processor can generate is `elasticsearch`. IDs of this type are generated using the same algorithm that Elasticsearch uses for its auto-generated document IDs. They are conceptually similar to Flake IDs in that the generated IDs are roughly ordered over time. However, the bytes within each ID are reordered to give Elasticsearch a better chance of compressing the IDs.
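
To make the Flake-style idea concrete, here is a simplified sketch in Go. It is not the algorithm from this PR or from Elasticsearch: it packs a millisecond timestamp, a node identifier, and a sequence number into 15 bytes in plain order, and skips the compression-oriented byte reordering described above. The `idGenerator` type, the random node identifier (standing in for a MAC address), and the 6/6/3 byte split are assumptions for illustration.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
	"sync"
	"time"
)

// idGenerator is a simplified, hypothetical Flake-style generator.
type idGenerator struct {
	mu       sync.Mutex
	node     [6]byte // random stand-in for a MAC address (assumption)
	sequence uint32  // only the low 3 bytes end up in the ID
}

func newIDGenerator() (*idGenerator, error) {
	g := &idGenerator{}
	if _, err := rand.Read(g.node[:]); err != nil {
		return nil, err
	}
	return g, nil
}

// next builds a 15-byte ID (6 bytes of timestamp, 6 bytes of node ID,
// 3 bytes of sequence) and encodes it as unpadded URL-safe base64.
func (g *idGenerator) next(nowMillis uint64) string {
	g.mu.Lock()
	g.sequence++
	seq := g.sequence
	g.mu.Unlock()

	var buf [15]byte
	// Timestamp goes first, most significant byte first, so the leading
	// bytes grow as time progresses.
	for i := 0; i < 6; i++ {
		buf[i] = byte(nowMillis >> (8 * uint(5-i)))
	}
	copy(buf[6:12], g.node[:])
	buf[12] = byte(seq >> 16)
	buf[13] = byte(seq >> 8)
	buf[14] = byte(seq)

	return base64.RawURLEncoding.EncodeToString(buf[:])
}

func main() {
	g, err := newIDGenerator()
	if err != nil {
		panic(err)
	}
	nowMillis := uint64(time.Now().UnixNano() / int64(time.Millisecond))
	for i := 0; i < 3; i++ {
		id := g.next(nowMillis)
		fmt.Println(id, len(id)) // each ID is 20 characters long
	}
}
```

The sequence number is what keeps two IDs generated in the same millisecond on the same node distinct; the node identifier does the same job across machines.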

Related: #14363.

libbeat/processors/elasticsearch_id/mac.go

libbeat/processors/elasticsearch_id/generator.go

libbeat/processors/elasticsearch_id/config.go

libbeat/processors/elasticsearch_id/errors.go

libbeat/processors/elasticsearch_id/elasticsearch_id.go

libbeat/processors/elasticsearch_id/elasticsearch_id_test.go

libbeat/processors/elasticsearch_id/mac_test.go

libbeat/processors/elasticsearch_id/generator_test.go

libbeat/processors/uuid/generator/elasticsearch/generator.go

libbeat/processors/uuid/generator/generator.go

ycombinator · 2019-11-15T19:24:54Z

@urso I'm torn about the name of this processor.

Originally I had named it `elasticsearch_id` because that's the most accurate description of the IDs it generates. But then I decided to generalize it a bit (by giving it an optional `type` setting) in case we want to teach this processor to generate different types of IDs later. So I renamed it to `uuid` because conceptually this processor is generating universally unique IDs. But that could be misleading because actual UUIDs are expected to be 16 bytes long, whereas the ES time-based ID generation algorithm generates 15-byte IDs.

Do you have any suggestions about this?

dedemorton

Suggested a couple of minor changes plus one that will fix the build.

Would you mind adding fingerprint to the link list? It'll save you having to rebase.

I need to rethink how we organize the lists because it means a different order if you organize by processor name vs topic title.

dedemorton · 2019-11-15T19:50:43Z

libbeat/processors/uuid/docs/uuid.asciidoc

+[[uuid]]
+=== Generate UUID for an event
+
+The `uuid` processor generates a random but roughly ordered UUID for an event.


Might be worth adding a sentence that explains what a UUID is (for novice users).

I'm going to hold off on this change until we've finalized the name of this processor. See #14524 (comment).

I think this is moot now, since we renamed the processor to Add ID (add_id) processor.

libbeat/processors/uuid/docs/uuid.asciidoc

libbeat/docs/processors-list.asciidoc

dedemorton · 2019-11-15T20:07:00Z

Re: your question about conditional coding: we only need to add conditions if the processor isn't available to all Beats. If it is, then no extra coding is required.

libbeat/processors/uuid/generator/generator.go

ycombinator · 2019-12-09T21:03:15Z

@urso I believe I've addressed all your feedback from the last round of review now. Please re-review when you get a chance.

In particular, I'd like your thoughts on #14524 (comment) since my change there differs from the implementation you had proposed but I believe solves the underlying problem nevertheless.

I do need to update/add tests in this PR but I will wait to do that based on your feedback, to avoid churn.

libbeat/processors/add_id/generator/es_generator.go

ycombinator · 2019-12-10T20:25:16Z

Chatted with @urso off PR about the monotonic timestamp issue. We came up with a simpler approach, which I've implemented in f2faf14, along with unit tests in 51f8eed.

@urso Could you give this PR another review when you get a chance, please? Thanks.
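
For readers unfamiliar with the monotonic timestamp issue: a time-based ID generator can produce out-of-order or repeated prefixes if the system clock is moved backwards. The thread does not show the approach taken in f2faf14, so the following is only one common way to guard against it, sketched under that caveat. The `clock` type and its method are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// clock remembers the last timestamp it handed out. The type and method
// names are hypothetical and not taken from the PR.
type clock struct {
	mu   sync.Mutex
	last uint64
}

// timestamp returns a millisecond timestamp that never decreases, even
// when the supplied wall-clock reading is lower than an earlier one.
// Taking the reading as a parameter keeps the function easy to test.
func (c *clock) timestamp(now uint64) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if now > c.last {
		c.last = now
	}
	return c.last
}

func main() {
	var c clock
	// The third reading simulates the system time going backwards.
	for _, now := range []uint64{1000, 1005, 998, 1006} {
		fmt.Println(now, "->", c.timestamp(now))
	}
	// Output:
	// 1000 -> 1000
	// 1005 -> 1005
	// 998 -> 1005
	// 1006 -> 1006
}
```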

libbeat/processors/add_id/generator/es_generator.go

ycombinator · 2019-12-10T23:55:10Z

Travis CI is green. Jenkins CI failures are unrelated. Merging.

dedemorton

LGTM

* WIP: Flake ID processor
* Fleshing out implementation of generator
* Rename package
* Unexport const
* Use increment operator
* Adding processor scaffolding
* Fixing default field
* Adding CHANGELOG entry
* Fixing compile errors
* WIP: unit tests
* Fixing byte copy
* Fixing up tests
* Adding test TODOs
* Adding non-default target field unit test
* Adding one more test TODO
* Adding TODO for post-benchmarking
* Introduce type
* Adding unit test for factory
* Adding unit test for mac
* Adding unit test for mac
* Fleshing out remaining mac unit tests
* Adding tests for ES ID generator
* Remove TODO after experimenting with IIFE (perf was worse)
* Moving doc
* Adding UUID processor to list in docs
* Apply suggestions from docs code review
  Co-Authored-By: DeDe Morton <dede.morton@elastic.co>
* Adding godoc
* Rename generator function type
* Exporting and adding godoc
* Adding godoc
* Updating godoc
* Adding Unwrap error methods
* Moving ES ID generator into generators package + singleton construction
* Addressing Hound feedback
* Renaming processor to `add_id`
* Updating processor name in CHANGELOG entry
* More refactoring updates
* Fixing more vet errors
* Unexport config struct as it's only used within this package
* Fixing doc anchor
* Moving generator construction to processor constructor; simplifying factory
* Fixing compile error
* Validate ID generator type in config
* Finer-grained locking to reduce mutex contention
* Initialize package global variables that depend on randomness, later
* Compute last timestamp while accounting for system time going backwards
* Simpler and testable timestamp() function
* Adding unit test for timestamp function
* Re-implementing ES timestamp algorithm
* Removing unused variable

daqqad · 2020-04-03T05:40:23Z

@ycombinator This may be a dumb question (and the wrong place to ask it), but with a time-based UUID generator combined with high volume and multiple Filebeat servers, what are the chances of duplicate UUIDs?

I'm only asking because the 20-character ID that `add_id` generates is considerably shorter than what the `fingerprint` Logstash filter plugin generates (36 chars), which `add_id` replaced in our environment.
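
Part of the length difference comes from the text encoding rather than from the amount of data. The sketch below assumes the 20-character ID is 15 bytes in unpadded URL-safe base64 and that the 36-character value is a 16-byte UUID in its hyphenated hex form; the second assumption is a guess about the `fingerprint` setup, and the helper names are illustrative.

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// base64Len returns the length of n raw bytes in unpadded URL-safe base64.
func base64Len(n int) int {
	return base64.RawURLEncoding.EncodedLen(n)
}

// uuidTextLen returns the length of n raw bytes written as hex digits
// plus the four hyphens of the canonical UUID text form.
func uuidTextLen(n int) int {
	return hex.EncodedLen(n) + 4
}

func main() {
	fmt.Println(base64Len(15), uuidTextLen(16)) // prints "20 36"
}
```

Under these assumptions the two forms carry 120 and 128 bits respectively, so the character counts overstate the gap. This says nothing by itself about collision probability, which depends on how those bits are filled.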

houndci-bot reviewed Nov 14, 2019

urso mentioned this pull request Nov 14, 2019

Support for setting the document ID #14363

Closed

3 tasks

urso added [zube]: In Progress Team:Beats v7.6.0 labels Nov 14, 2019

urso self-assigned this Nov 14, 2019