
User-defined lookup enrichment #5221

Open
2 of 7 tasks
acchen97 opened this issue Apr 29, 2016 · 16 comments

Comments

@acchen97
Contributor

acchen97 commented Apr 29, 2016

Logstash should have more dynamic ways to look up and enrich events, especially with external user-defined datasets. Currently, the main avenue for lookup enrichment is the translate filter, which primarily does basic key/value lookups and only supports YAML. Here are some ideas:

Use cases

  • Simple key/value lookup enrichment
    • Look up a user name from a user ID
    • Tag/classify bad actors or blacklisted IP addresses
  • Multi-field lookup enrichment
    • “Join” external table data with the event
    • Add multiple user fields (name, address, phone #, birthday, etc.) to an event
    • In more traditional BI, relevant lookup data is sourced from dimension tables in star schemas or CMDB tables.
  • Use an RDBMS, Elasticsearch, or another document store for the lookup dataset

Filter plugin additions and enhancements for user-defined data lookup

Ignore the section below; it is retained for reference.

Lookup source file formats (for file/http)

  • CSV
    • Multi-field lookup enrichment (Phase 1)
    • Popular format for tabular data; enables RDBMS/Excel table exports to be used for CSV lookups
  • JSON and YAML
    • Simple key/value lookups (Phase 1)
    • Multi-field lookup enrichment (Future)

The lookup data should be cached:

  • O(1) lookups
  • Configurable max memory size allocated
  • Periodic reloading of the cache - no need to bounce the pipeline to refresh the lookup cache after changes

Multi-field lookup

CSV Format

  • Must contain a header line
  • Must contain at least two columns
  • Looks up on a single or compound key. The lookup key to use should be unique and must be defined at config time.
code,status_description,status_type,color
200,OK,Successful,Green
201,Created,Successful,Green
202,Accepted,Successful,Green
300,Multiple Choices,Redirection,Yellow

Example

#Config
filter {
  lookup_file {
    path => "~/conf/lookup.csv"
    format => "csv"
    cache_size => "1MB"
    refresh_interval => 10000
    event_fields => ["http_code", "color"]  # (required) 1+ event keys to match with.  event_fields.length() == lookup_fields.length()
    lookup_fields => ["code", "color"]  # (required for csv) 1+ lookup keys to match against
    target_fields => ["status_description", "status_type"]  # (optional) whitelist of 1+ looked up fields to add to event.  If not defined, adds all fields (not including lookup key fields e.g. "code" and "color") to event top level.
  }
}

#Event in
Event {
  http_code => "202"
  color => "Green"
}

#Event out
Event {
  http_code => "202"
  color => "Green"
  status_description => "Accepted"
  status_type => "Successful"
}

Simple key/value lookup

JSON Format

{
  "200": "Green",
  "201": "Green",
  "202": "Green",
  "300": "Yellow",
  "elastic": true,
  "version": 5.0
}

YAML Format

200: 'Green'
201: 'Green'
202: 'Green'
300: 'Yellow'
elastic: true
version: 5.0

Example

#Config
filter {
  lookup_file {
    path => "~/conf/lookup.json"
    format => ["json" | "yaml"]
    cache_size => "1MB"
    refresh_interval => 10000
    event_fields => "key"
    target_fields => "lookup_value"  # (optional) new field name of looked up value.  If not defined, new field name defaults to "lookup_value".
  }
}

#Event in
Event {
  key => "elastic"
  product => "logstash"
}

#Event out
Event {
  key => "elastic"
  product => "logstash"
  lookup_value => true
}

HTTP example

Very similar to the file counterpart, except it uses 'url' instead of 'path'.

filter {
  lookup_http {
    url => "localhost:9200/lookup1/"
    tls => false
    # other fields are the same...
  }
}

Ref: #5087, #3633, #3446, #4510

P.S. - open to suggestions on new plugin names...

@ph
Contributor

ph commented Apr 29, 2016

Could we use ES as a backend store for the lookup? (Just reread it carefully.)

@ph
Contributor

ph commented Apr 29, 2016

format => ["json" | "yaml"] # This could be auto-detected by the file name or when reading?
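For illustration, auto-detection could look like this in the proposed config, with format omitted and inferred from the file extension or content (the lookup_file plugin and this behavior are hypothetical at this point):

filter {
  lookup_file {
    path => "~/conf/lookup.yaml"   # no explicit format; inferred from the ".yaml" extension (or by sniffing the content)
    cache_size => "1MB"
    refresh_interval => 10000
    event_fields => "key"
  }
}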

@acchen97
Contributor Author

@ph you're right, it could be and we should consider it when implementing.

@purbon
Contributor

purbon commented Jun 16, 2016

@acchen97 I love the idea of enhanced lookups for the Logstash pipeline. What about prioritizing a Redis lookup, especially when the lookup is dynamic? Having lookups against both ES and Redis could be very helpful for enriching events at runtime, especially when two flows are somehow connected.

@purbon
Contributor

purbon commented Jun 16, 2016

@ph it could also be detected by the parser. The filename might be tricky, but I agree that a .yml extension usually indicates a YAML file :-P

@vnadgir-ef

+1
Are there any timelines for this feature?

@suyograo suyograo added the v5.2.0 label Dec 1, 2016
@jordansissel
Contributor

@vnadgir-ef file lookups are already supported by the translate filter (it supports YAML, CSV, and JSON formats). There is no timeline. We have tentatively set this for Logstash 5.2.0 but do not have a release date (and Logstash 5.1.0 isn't out yet, either).
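For reference, a typical translate filter file lookup looks like this (option names as documented for the translate filter; the path and field names are placeholders):

filter {
  translate {
    field => "http_code"                                # event field used as the lookup key
    destination => "status_description"                 # field to store the looked-up value
    dictionary_path => "/etc/logstash/http_codes.yml"   # YAML, CSV, or JSON dictionary file
    refresh_interval => 300                             # reload the dictionary periodically (seconds)
    fallback => "unknown"                               # value to use when no match is found
  }
}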

@alcanzar

alcanzar commented Jan 16, 2017

I've created a plugin that does a lot of what's requested here. I call the plugin logstash-filter-augment. It allows joining multiple fields from a CSV/JSON/YAML file onto an event. I based it initially on the translate filter.

The gem is published to RubyGems: https://rubygems.org/gems/logstash-filter-augment
And it's public on GitHub: https://github.com/alcanzar/logstash-filter-augment

I'd appreciate any feedback/bug fixes/enhancement requests.
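A rough sketch of how such a filter could be configured (the option names below are assumptions based on the description above, not the plugin's documented options; check the README linked above for the real ones):

filter {
  augment {
    field => "http_code"                                 # assumed option: event field used as the lookup key
    dictionary_path => "/etc/logstash/http_codes.csv"    # assumed option: CSV/JSON/YAML file whose columns are joined onto the event
  }
}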

@jordansissel
Contributor

@acchen97 can you update the description of this ticket (or close it and open a new one) to reflect some of the recent work in this area? I remember us having some discussions on slack/zoom about features we've already got implemented in the translate filter, for example.

@acchen97
Contributor Author

acchen97 commented Feb 4, 2017

@jordansissel updated this based on our most recent discussions. Let me know if I missed anything.

@pmusa

pmusa commented Feb 27, 2017

It would be nice to allow not only the Elasticsearch _search endpoint but the _analyze endpoint as well.

@elvarb

elvarb commented Mar 17, 2017

I use the Translate filter heavily, and the Ruby filter as well for the same reasons, so this is a very welcome addition.

In one case I am using the Translate filter to look up certain values; if nothing matches, I have the Ruby filter execute a Go program that queries an HTTP API and appends the results to the translate dictionary. The issue is that if there are, for example, 100 incoming messages with the same value missing from the dictionary, the HTTP API will be hit 100 times. If there were some way to trigger a reload of the dictionary when the file changes, that would be extremely valuable.

Instead of reloading the file every X seconds, have it watch the file for modifications and reload when it changes. To prevent constant reloads when the dictionary changes rapidly, add a setting to wait at least X seconds before reloading again.
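For illustration, the request could look like this in config form (watch_file and min_reload_interval are hypothetical options, not part of the translate filter today):

filter {
  translate {
    field => "lookup_key"
    destination => "lookup_value"
    dictionary_path => "/etc/logstash/dictionary.yml"
    watch_file => true              # hypothetical: reload when the file is modified instead of on a fixed timer
    min_reload_interval => 30       # hypothetical: debounce, wait at least 30 seconds between reloads
  }
}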

@acchen97
Contributor Author

@jordansissel @suyograo just updated this based on our recent discussions with specific action items for translate, elasticsearch, and jdbc filters. One thing we should discuss is the design for better integrating the ES filter with ES percolations.

@suyograo suyograo added v5.4.0 and removed v5.3.0 labels Mar 24, 2017
@suyograo suyograo added v5.5.0 and removed v5.4.0 labels Apr 24, 2017
@pmusa

pmusa commented Jul 12, 2017

Any news here, or other issues to follow for this work?

@devinbfergy

Is this still a planned feature?

@acchen97
Contributor Author

acchen97 commented Mar 6, 2018

Database lookup enrichment is now generally available with the JDBC static and JDBC streaming filters.
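For example, a streaming database lookup with jdbc_streaming looks roughly like this (the driver path, connection string, credentials, and query are placeholders):

filter {
  jdbc_streaming {
    jdbc_driver_library => "/opt/jdbc/mysql-connector-java.jar"       # placeholder driver path
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/lookupdb"  # placeholder connection string
    jdbc_user => "logstash"
    jdbc_password => "changeme"
    statement => "SELECT name, address FROM users WHERE id = :user_id"
    parameters => { "user_id" => "user_id" }   # bind the event field "user_id" to the :user_id placeholder
    target => "user"                           # looked-up rows are stored under the [user] field
  }
}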
