Redact Ingest Processor #92951

Merged
merged 28 commits into from Feb 7, 2023

Conversation

davidkyle
Member

@davidkyle davidkyle commented Jan 16, 2023

The redact processor uses the Grok rules engine to redact text that matches Grok patterns. Existing patterns from the Grok pattern bank can be referenced directly, and new patterns can be added inline in the pattern_definitions option.

One application of the redact processor is to obscure Personally Identifiable Information (PII) by configuring the processor to detect known patterns such as email or IP addresses.

In an ingest pipeline the redact processor could be augmented by a Named Entity Recognition model to detect and remove names and places.

PUT _ingest/pipeline/redact
{
  "processors": [
    {
      "redact": {
        "field": "to_redact",
        "patterns": ["%{EMAILADDRESS:EMAIL}", "%{IP:IP_ADDRESS}", "%{CREDIT_CARD:CREDIT_CARD}"],
        "pattern_definitions": {
          "CREDIT_CARD": "\\d{4}[ -]\\d{4}[ -]\\d{4}[ -]\\d{4}"
        }
      }
    }
  ]
}

Given an input document with the field to_redact:

{
  "to_redact": "test@elastic.co sent an email from the IP 192.168.0.1."
}

The redact processor, as configured above, will emit:

{
  "to_redact": "<EMAIL> sent an email from the IP <IP_ADDRESS>."
}

The matched text is replaced by the Grok pattern name. The < and > tokens surrounding the replaced text are configurable via the prefix and suffix options.
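The replacement behaviour described above can be sketched in plain Java. This is a minimal illustration, not the processor's implementation: `java.util.regex` stands in for the Grok engine, the `RedactSketch` class and its methods are hypothetical names, and the EMAIL and IP_ADDRESS regexes are simplified assumptions rather than the Grok pattern bank definitions. Only the CREDIT_CARD regex is taken from the pipeline above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: java.util.regex stands in for the Grok engine,
// and the EMAIL and IP_ADDRESS regexes are simplified assumptions.
final class RedactSketch {

    // A matched span of the input plus the pattern name that replaces it.
    static final class Region {
        final int start;
        final int end;
        final String patternName;

        Region(int start, int end, String patternName) {
            this.start = start;
            this.end = end;
            this.patternName = patternName;
        }
    }

    static Map<String, Pattern> examplePatterns() {
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("EMAIL", Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"));
        patterns.put("IP_ADDRESS", Pattern.compile("\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b"));
        // Same regex as the CREDIT_CARD pattern definition in the pipeline above.
        patterns.put("CREDIT_CARD", Pattern.compile("\\d{4}[ -]\\d{4}[ -]\\d{4}[ -]\\d{4}"));
        return patterns;
    }

    static String redact(String input, Map<String, Pattern> patterns, String prefix, String suffix) {
        // Collect every matching region, remembering which pattern produced it.
        List<Region> regions = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            Matcher matcher = entry.getValue().matcher(input);
            while (matcher.find()) {
                regions.add(new Region(matcher.start(), matcher.end(), entry.getKey()));
            }
        }
        regions.sort(Comparator.comparingInt((Region r) -> r.start));

        // Rebuild the string, swapping each region for prefix + name + suffix.
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Region region : regions) {
            if (region.start < cursor) {
                continue; // overlaps a region that was already redacted
            }
            out.append(input, cursor, region.start);
            out.append(prefix).append(region.patternName).append(suffix);
            cursor = region.end;
        }
        out.append(input, cursor, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String doc = "test@elastic.co sent an email from the IP 192.168.0.1.";
        // prints: <EMAIL> sent an email from the IP <IP_ADDRESS>.
        System.out.println(redact(doc, examplePatterns(), "<", ">"));
    }
}
```

Passing different `prefix` and `suffix` arguments, for example `"["` and `"]"`, mirrors what the prefix and suffix options are described as doing. How the real processor resolves overlapping matches is not stated in this thread; skipping the later region is just one simple choice for the sketch.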

@davidkyle davidkyle added cloud-deploy Publish cloud docker image for Cloud-First-Testing v8.7.0 labels Jan 16, 2023
@davidkyle davidkyle force-pushed the redact-processor branch 3 times, most recently from a3c1f09 to ecb7b0d Compare February 1, 2023 11:28
@davidkyle davidkyle marked this pull request as ready for review February 1, 2023 11:51
@davidkyle davidkyle added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Feb 1, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Feb 1, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@davidkyle davidkyle added >feature and removed Team:Data Management Meta label for data/management team labels Feb 1, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Feb 1, 2023
@elasticsearchmachine
Collaborator

Hi @davidkyle, I've created a changelog YAML for you.

@joegallo joegallo self-assigned this Feb 2, 2023
@joegallo joegallo self-requested a review February 2, 2023 17:09
@davidkyle davidkyle changed the title [ML] Redact Ingest Processor Redact Ingest Processor Feb 6, 2023
@davidkyle davidkyle added the >docs General docs changes label Feb 6, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-docs (Team:Docs)

@elasticsearchmachine elasticsearchmachine added the Team:Docs Meta label for docs team label Feb 6, 2023
@@ -86,7 +86,8 @@ public Map<String, Processor.Factory> getProcessors(Processor.Parameters paramet
 entry(NetworkDirectionProcessor.TYPE, new NetworkDirectionProcessor.Factory(parameters.scriptService)),
 entry(CommunityIdProcessor.TYPE, new CommunityIdProcessor.Factory()),
 entry(FingerprintProcessor.TYPE, new FingerprintProcessor.Factory()),
-entry(RegisteredDomainProcessor.TYPE, new RegisteredDomainProcessor.Factory())
+entry(RegisteredDomainProcessor.TYPE, new RegisteredDomainProcessor.Factory()),
+entry(RedactProcessor.TYPE, new RedactProcessor.Factory(createGrokThreadWatchdog(parameters)))
Contributor

I'm not sure whether we should be invoking createGrokThreadWatchdog twice, or merely creating a single value once and passing it to both the grok and redact processor factories. I'll figure out the answer and let you know.

Member Author

Good catch! The MatchWatchdog is used by all the Grok processors on this node and by all the Groks that those processors create, so it makes sense to use the same watchdog for the Groks created by the redact processor.

I pushed 940c3c1
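The difference between calling the watchdog constructor once per factory and sharing one instance can be sketched as follows. All class and method names here (`SharedWatchdogSketch`, `Watchdog`, `Factory`, `buildFactories`) are hypothetical stand-ins, not the Elasticsearch classes, and the sketch does not claim to show what commit 940c3c1 contains. It only illustrates the idea of creating a single value and passing it to both factories.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins; these are not the Elasticsearch classes.
final class SharedWatchdogSketch {

    // Plays the role of a node-wide watchdog that tracks registered matchers.
    static final class Watchdog {
        private final AtomicInteger registered = new AtomicInteger();

        void register() {
            registered.incrementAndGet();
        }

        int registeredCount() {
            return registered.get();
        }
    }

    // Plays the role of a processor factory that hands its watchdog to each processor it builds.
    static final class Factory {
        final String type;
        final Watchdog watchdog;

        Factory(String type, Watchdog watchdog) {
            this.type = type;
            this.watchdog = watchdog;
        }

        void createProcessor() {
            watchdog.register();
        }
    }

    static Factory[] buildFactories() {
        // Created once, then passed to both factories.
        Watchdog shared = new Watchdog();
        return new Factory[] { new Factory("grok", shared), new Factory("redact", shared) };
    }

    public static void main(String[] args) {
        Factory[] factories = buildFactories();
        for (Factory factory : factories) {
            factory.createProcessor();
        }
        // Both factories report to the same watchdog instance.
        System.out.println(factories[0].watchdog == factories[1].watchdog); // prints: true
        System.out.println(factories[0].watchdog.registeredCount());       // prints: 2
    }
}
```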


If one of the existing Grok https://github.com/elastic/elasticsearch/blob/{branch}/libs/grok/src/main/resources/patterns/ecs-v1[patterns]
does not fit your requirements, extend the patterns with the `pattern_definitions` option.
New patterns can be defined with a regular expression or combine
Contributor

New patterns can be defined with a regular expression or combine Grok patterns from the base definitions to build complex patterns.

This sentence seems a bit stilted to me.

Member Author

@davidkyle davidkyle Feb 7, 2023

👍 Updated with an example of defining a custom pattern
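To make the idea of building a pattern out of other patterns concrete, here is a small sketch of how `%{NAME}` and `%{NAME:label}` references could be expanded into a single regular expression. It is an assumption-laden illustration, not the Grok library's algorithm: `PatternExpansionSketch`, `expand`, and the `OCTET` and `SIMPLE_IP` definitions are invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not how the Grok library resolves patterns.
final class PatternExpansionSketch {

    // Matches a reference such as %{NAME} or %{NAME:label}.
    private static final Pattern REFERENCE = Pattern.compile("%\\{(\\w+)(?::\\w+)?\\}");

    // Recursively replaces each reference with its definition, wrapped in a
    // non-capturing group. Circular definitions are not detected here.
    static String expand(String pattern, Map<String, String> definitions) {
        Matcher matcher = REFERENCE.matcher(pattern);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String name = matcher.group(1);
            String definition = definitions.get(name);
            if (definition == null) {
                throw new IllegalArgumentException("Unknown pattern: " + name);
            }
            String expanded = "(?:" + expand(definition, definitions) + ")";
            matcher.appendReplacement(out, Matcher.quoteReplacement(expanded));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    // Hypothetical definitions: one plain regex and one built from it.
    static Map<String, String> exampleDefinitions() {
        Map<String, String> definitions = new LinkedHashMap<>();
        definitions.put("OCTET", "\\d{1,3}");
        definitions.put("SIMPLE_IP", "%{OCTET}(?:\\.%{OCTET}){3}");
        return definitions;
    }

    public static void main(String[] args) {
        String regex = expand("%{SIMPLE_IP:IP_ADDRESS}", exampleDefinitions());
        System.out.println(Pattern.compile(regex).matcher("192.168.0.1").matches()); // prints: true
    }
}
```

Here `OCTET` is defined with a regular expression and `SIMPLE_IP` combines it, which are the two ways of defining a pattern that the documentation sentence under review describes.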

RegionTrackingMatchExtractor extractor = new RegionTrackingMatchExtractor();
for (var grok : groks) {
    String className = grok.captureConfig().get(0).name();
    extractor.setCurrentClass(className);
Contributor

re: setCurrentClass(className) -- is className the best name for what this is (the variable, the setter, the field of the extractor that the setter is controlling)? Perhaps this is better as patternName throughout? (I noticed a comment where you referenced it as "Grok pattern name".) I'm open to other ideas here, but I don't love className (there's a bit of a garden path problem with 'classes' and 'class names' on the JVM having a particular meaning).

Anyway, open to ideas, or counter arguments.

Member Author

patternName is better; I've used it consistently throughout now.

@joegallo
Contributor

joegallo commented Feb 7, 2023

Generally I think this looks great, please take my comments as relatively minor notes along the way towards me getting to a ✅.

Contributor

@abdonpijpelink abdonpijpelink left a comment

LGTM from a docs perspective. I've left some minor suggestions.

docs/reference/ingest/processors/redact.asciidoc (outdated, resolved)
davidkyle and others added 10 commits February 7, 2023 11:38
Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
Contributor

@joegallo joegallo left a comment

Great feature! Thanks for writing it.

Labels
cloud-deploy Publish cloud docker image for Cloud-First-Testing :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >docs General docs changes >feature Team:Data Management Meta label for data/management team Team:Docs Meta label for docs team v8.7.0

5 participants