
METRON-1569 Allow user to change field name conversion when indexing … #1022

Closed

Conversation

nickwallen
Contributor

@nickwallen nickwallen commented May 22, 2018

The ElasticsearchWriter has a mechanism to transform the field names of a message before it is written to Elasticsearch. Right now this mechanism is hard-coded to replace all '.' dots with ':' colons.
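
For example (illustrative only; the field names come from the worker logs shown later in the testing steps), the current hard-coded behavior renames fields like so:

    // illustration of the existing behavior, not the actual ElasticsearchWriter code
    FieldNameConverter converter = new ElasticsearchFieldNameConverter();
    converter.convert("source.type");   // -> "source:type"
    converter.convert("enrichments.geo.ip_dst_addr.location_point");
                                        // -> "enrichments:geo:ip_dst_addr:location_point"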

This mechanism was needed for Elasticsearch 2.x, which did not allow dots in field names. Now that Metron supports Elasticsearch 5.x, this is no longer a problem. A user should be able to configure the field name transformation when writing to Elasticsearch, as needed.

While it might have been simpler to just remove the de-dotting mechanism, this would break backwards compatibility. Taking this approach provides users with an upgrade path.

Changes

This change allows the user to configure the field name converter as part of the index writer configuration.

Acceptable values include the following.

  • DEDOT: Replaces all '.' with ':'; this is the default, backwards-compatible behavior.
  • NOOP: No field name change.

If no "fieldNameConverter" is defined, it defaults to using DEDOT which maintains backwards compatibility.

A cache of FieldNameConverters is maintained since the index writer configuration can be changed at run-time and each sensor has its own index writer configuration.
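
A rough sketch of what that cached factory might look like (this reflects the caching approach described here; later commits in this PR remove the cache in favor of shared enum instances). The Guava cache and the getFieldNameConverter accessor on WriterConfiguration are assumptions, not the actual implementation:

    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    public class CachedFieldNameConverterFactory implements FieldNameConverterFactory {

      // one converter per sensor type; entries expire so that a change to the
      // writer configuration is eventually picked up (see the ~5 minute delay
      // mentioned in the manual testing steps)
      private final Cache<String, FieldNameConverter> cache = CacheBuilder.newBuilder()
              .expireAfterWrite(5, TimeUnit.MINUTES)
              .build();

      @Override
      public FieldNameConverter create(String sensorType, WriterConfiguration config) {
        try {
          return cache.get(sensorType, () -> createInstance(sensorType, config));
        } catch (ExecutionException e) {
          throw new IllegalStateException("Unable to create field name converter", e);
        }
      }

      private FieldNameConverter createInstance(String sensorType, WriterConfiguration config) {
        // 'getFieldNameConverter' is an assumed accessor for the new configuration value
        String converterName = config.getFieldNameConverter(sensorType);

        // default to 'DEDOT' to maintain backwards compatibility
        if (converterName == null || converterName.isEmpty()) {
          return new DeDotFieldNameConverter();
        }
        return FieldNameConverters.valueOf(converterName.toUpperCase()).get();
      }
    }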

An example configuration looks like the following.

    {
      "hdfs" : {
        "enabled" : false
      },
      "elasticsearch" : {
        "index" : "bro",
        "batchSize" : 5,
        "enabled" : true,
        "fieldNameConverter": "NOOP"
      },
      "solr" : {
        "enabled" : false
      }
    } 

Code Changes

  • Added the fieldNameConverter parameter to the Index writer configuration.

  • Moved the FieldNameConverter implementations to a dedicated package in metron-common.

  • Renamed ElasticsearchFieldNameConverter to DeDotFieldNameConverter.

  • Implemented the NoopFieldNameConverter which does not modify the field name (both converter implementations are sketched after this list).

  • Created FieldNameConverters class that allows a user to specify either DEDOT or NOOP to choose the appropriate implementation.

  • Implemented a CachedFieldNameConverterFactory that encapsulates all the logic for choosing and instantiating the appropriate FieldNameConverter.

  • Updated ElasticsearchWriter to use the CachedFieldNameConverterFactory.

  • Updated the README to document the new configuration parameter.
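
For reference, a minimal sketch of the converter types named above (each class in its own file; the single-argument convert signature follows the enum snippet later in this thread, but the real interfaces may differ slightly):

    // FieldNameConverter.java
    public interface FieldNameConverter {

      /**
       * Convert a field name before the message is written to the index.
       */
      String convert(String originalField);
    }

    // DeDotFieldNameConverter.java - the default, backwards-compatible behavior
    public class DeDotFieldNameConverter implements FieldNameConverter {

      @Override
      public String convert(String originalField) {
        // replace all '.' with ':' as required by Elasticsearch 2.x
        return originalField.replace(".", ":");
      }
    }

    // NoopFieldNameConverter.java - leaves the field name unchanged
    public class NoopFieldNameConverter implements FieldNameConverter {

      @Override
      public String convert(String originalField) {
        return originalField;
      }
    }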

Manual Testing

  1. Launch a development environment and log in.

    vagrant ssh
    sudo su -
    source /etc/default/metron
    
  2. Validate the environment by ensuring alerts are visible in the Alerts UI and that the Ambari Service Check completes successfully. This ensures that the change is backwards compatible.

  3. Login to the Storm UI and enable DEBUG logging for org.apache.metron.common and org.apache.metron.elasticsearch.

  4. The Storm worker logs in /var/log/storm/worker-artifacts/random_access_indexing*/worker.log should contain the following log statements, if you have enabled DEBUG logging correctly. This shows that the default DEDOT converter is in use.

    2018-05-22 14:38:... [DEBUG] Renamed dotted field; original=source.type, new=source:type
    2018-05-22 14:38:... [DEBUG] Renamed dotted field; original=adapter.geoadapter.end.ts, new=adapter:geoadapter:end:ts
    2018-05-22 14:38:... [DEBUG] Renamed dotted field; original=threatintelsplitterbolt.splitter.end.ts, new=threatintelsplitterbolt:splitter:end:ts
    2018-05-22 14:38:... [DEBUG] Renamed dotted field; original=adapter.threatinteladapter.begin.ts, new=adapter:threatinteladapter:begin:ts
    2018-05-22 14:38:... [DEBUG] Renamed dotted field; original=enrichments.geo.ip_dst_addr.location_point, new=enrichments:geo:ip_dst_addr:location_point
    2018-05-22 14:38:... [DEBUG] Renamed dotted field; original=adapter.threatinteladapter.end.ts, new=adapter:threatinteladapter:end:ts
    2018-05-22 14:38:... [DEBUG] Renamed dotted field; original=enrichmentsplitterbolt.splitter.end.ts, new=enrichmentsplitterbolt:splitter:end:ts
    
  5. Launch the REPL.

    ./bin/stellar -z $ZOOKEEPER
    
  6. Change the field name converter to NOOP.

    [Stellar]>>> conf := SHELL_EDIT()
    {
      "hdfs" : {
        "enabled" : false
      },
      "elasticsearch" : {
        "index" : "bro",
        "batchSize" : 5,
        "enabled" : true,
        "fieldNameConverter": "NOOP"
      },
      "solr" : {
        "enabled" : false
      }
    }
    
    [Stellar]>>> CONFIG_PUT("INDEXING", conf, "bro")
    
  7. It can take up to 5 minutes for the topology to pick up this change. The old FieldNameConverter needs to expire from the cache first.

  8. Go back to the Storm worker logs. When the change takes effect, we should see a log like the following indicating that the NoopFieldNameConverter was created.

    2018-05-22 16:... [DEBUG] Created field name converter; sensorType=bro, configuredName=NOOP, class=NoopFieldNameConverter
    
  9. In the same logs, we will start to see tuples fail to be indexed. Elasticsearch complains because the templates have been created to expect source:type, but that field no longer exists because the FieldNameConverter was changed.

    2018-05-22 16:0...[ERROR] Failing 1 tuples
    org.elasticsearch.index.mapper.MapperParsingException: Could not dynamically add mapping for field [source.type]. Existing mapping for [source] must be of type object but found [keyword].
    	at org.elasticsearch.index.mapper.DocumentParser.getDynamicParentMapper(DocumentParser.java:876) ~[stormjar.jar:?]
    	at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:596) ~[stormjar.jar:?]
    	at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:396) ~[stormjar.jar:?]
    	at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:373) ~[stormjar.jar:?]
    	at org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:93) ~[stormjar.jar:?]
    	at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:66) ~[stormjar.jar:?]
    
    

Pull Request Checklist

  • Is there a JIRA ticket associated with this PR? If not, one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?
  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?
  • Have you included steps or a guide to how the change may be verified and tested manually?
  • Have you ensured that the full suite of tests and checks has been executed in the root metron folder?
  • Have you written or updated unit tests and/or integration tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • Have you verified the basic functionality of the build by building and running locally with the Vagrant full-dev environment or the equivalent?

Contributor

@mmiklavc mmiklavc left a comment

Thanks for the contribution @nickwallen. I made a first pass and have a couple of questions about the caching implementation in the context of the existing Zookeeper Curator configuration management.

private FieldNameConverter createInstance(String sensorType, WriterConfiguration config) {

// default to the 'DEDOT' field name converter to maintain backwards compatibility
FieldNameConverter result = new DeDotFieldNameConverter();
Contributor

Might we want to call the enum when we create these objects?

FieldNameConverters.DEDOT.get();

Contributor Author

Sure, that makes sense.

* after they are created. Once they expire, the {@link WriterConfiguration} is used to
* reload the {@link FieldNameConverter}.
*
* <p>The user can change the {@link FieldNameConverter} in use at runtime. A change
Contributor

I want to say that this is handled through the ConfigurationsUpdater classes already. I believe the writer passes in an updated config on each call.

https://github.com/apache/metron/blob/master/metron-platform/metron-common/src/main/java/org/apache/metron/common/zookeeper/configurations/ConfigurationsUpdater.java

Contributor Author

Maybe I am misunderstanding you, but the ConfigurationsUpdater ensures that our configuration classes stay in-sync with Zk. That's not what this class does.

This creates the appropriate FieldNameConverter instance for each source type that is then used by the ElasticsearchWriter to do the field name conversion. Are you proposing a different way to do this?

Contributor Author

To help clarify, I could have implemented a different FieldNameConverterFactory; maybe one called SimpleFieldNameConverterFactory.

This 'simple' factory wouldn't even have to use a cache. Every time ElasticsearchWriter calls create(String sensorType, WriterConfiguration config), it could just instantiate a new instance of the right FieldNameConverter.

But of course, I didn't do that because this occurs for every field in every message that is written to Elasticsearch. The cache is there for performance.

I offer this only as an example to help clarify the purpose of the CachedFieldNameConverterFactory.
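
A rough sketch of that hypothetical, uncached factory (SimpleFieldNameConverterFactory is only a name used for illustration in this discussion, and getFieldNameConverter is an assumed accessor):

public class SimpleFieldNameConverterFactory implements FieldNameConverterFactory {

  @Override
  public FieldNameConverter create(String sensorType, WriterConfiguration config) {
    // hypothetical accessor for the "fieldNameConverter" writer config value
    String converterName = config.getFieldNameConverter(sensorType);

    // a new converter is instantiated on every call; no caching at all
    if ("NOOP".equals(converterName)) {
      return new NoopFieldNameConverter();
    }

    // default to 'DEDOT' to maintain backwards compatibility
    return new DeDotFieldNameConverter();
  }
}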

Contributor

I think I'm just not fully grokking why we need additional caching here rather than FieldNameConverters.DEDOT.get();. It seems like the enum handles caching by virtue of the constructors in the enum itself? NOOP(new NoopFieldNameConverter())

Contributor Author

So you're saying, hey why not just...

  1. Take out the caching and just have something like a SimpleFieldNameConverterFactory. This would translate the (sourceType, WriterConfiguration) to the right FieldNameConverter.
  2. The caching would be handled by FieldNameConverters in the sense that they are effectively singletons.

I think that would work if all FieldNameConverter implementations are stateless. And they all are stateless today, but I tried not to make that a constraint. I actually tried to make this work whether they are stateless or not.

But alas there is a bug in what I did because I need each call to FieldNameConverters.get() to return a new instance. (If we want to stick with this approach.)

I'd be open to removing the cache if it's not needed and we think FieldNameConverters would always remain stateless.

Contributor Author

The prior commit 'b18c3bf' was just me fixing my original caching implementation, so at least I had a version of that with the bug fixed.

The latest commit 'cee3994' removes the cache and relies on the FieldNameConverters class to hand out shared instances. This is what we have discussed.

Contributor

@mmiklavc mmiklavc May 25, 2018

Yes, agreed on the possible need for handling other writers. Is there a reason we wouldn't put that logic in the enum factory rather than creating another new class for it? It wouldn't even need to depend on the WriterConfiguration, actually.

EDIT - would need that create to be static

import java.util.Optional;

public enum FieldNameConverters implements FieldNameConverter {

  NOOP(new NoopFieldNameConverter()),
  DEDOT(new DeDotFieldNameConverter());

  private FieldNameConverter converter;

  FieldNameConverters(FieldNameConverter converter) {
    this.converter = converter;
  }

  public FieldNameConverter create(String sensorType, Optional<String> converterName) {
    // default to the 'DEDOT' field name converter to maintain backwards compatibility
    return FieldNameConverters.valueOf(converterName.orElse("DEDOT"));
  }

  @Override
  public String convert(String originalField) {
    return converter.convert(originalField);
  }
}

I'm not sure I follow about the testing - do you mean integration testing? We have an ElasticsearchWriterTest class that @justinleet wrote - https://github.com/apache/metron/blob/master/metron-platform/metron-elasticsearch/src/test/java/org/apache/metron/elasticsearch/writer/ElasticsearchWriterTest.java

Contributor Author

I'm not sure I follow about the testing

If you look closely at those tests, all that class tests is the response using buildWriteReponse. It does not actually test the class using the write method. When you try to do that, it blows up because of dependency issues.

Contributor

Ok, I see what you mean. Agreed, it would be good to have some tests around that write method at some point, but I think it's ok to keep that out of scope for this particular PR. We have a WriterBoltIntegrationTest, but it's only covering parsers right now. At a later time we could look at expanding it to cover multiple writers and topologies and provide hooks on those builders to enable one-time setup and teardown across a suite of tests.

Contributor Author

I can add the create method to FieldNameConverters. That makes sense here. It's going to be slightly different from what you have above, but it is what you were going for.

*/
private FieldNameConverter createInstance(String sensorType, WriterConfiguration config) {

// default to the 'DEDOT' field name converter to maintain backwards compatibility
Member

This looks interesting, but one bit of functionality has changed as part of doing this PR. Currently, we specify the field converter at the writer implementation level, so:

  • ES uses dedot
  • HDFS uses noop
  • solr uses noop

By doing this, we're actually not maintaining backwards compatibility; we're changing the behavior for the HDFS and Solr writers to dedot. What I'd suggest doing is adding a method to the BulkMessageWriter interface like so:

default FieldNameConverter defaultFieldNameConverter() {
  return FieldNameConverters.NOOP.get();
}

and in ElasticsearchWriter specify DEDOT as the default. Also, here, you probably want to pass in the default field name converter if unspecified as a 3rd argument.

This would allow us to maintain backwards compatibility and enable users to override.

Contributor Author

Remember that only the ElasticsearchWriter does field name conversion. HDFS and Solr do not do field name conversion and so are backwards compatible.

In the README, I think I specified that this field is only applicable for 'elasticsearch'.

In the future, we could certainly roll this functionality out to the other writers, but right now it's not there, AFAIK.

Member

OH! yes, you're right. Sorry for jumping to conclusions, I assumed you generalized this to the bulk message writer level, rather than keeping it at the ES writer level. My bad, I retract the objection.

@nickwallen
Contributor Author

FYI - I also confirmed that the configuration value can be changed using the Advanced mode in the Management UI.

(screenshot: Advanced mode in the Management UI, 2018-05-25 11:31 am)

* <p>The instances are created and managed by the {@link FieldNameConverters} class.
*/
public class SharedFieldNameConverterFactory implements FieldNameConverterFactory {

Member

This looks like it's moving in the right direction, but as @mmiklavc gave an example for, why would we not replace this class with a static create(String sensorType, WriterConfig config) method on the enum?

@nickwallen
Contributor Author

All review feedback should have been addressed. Let me know if there is anything else @mmiklavc @cestella

* <p>Allows the field name converter to be specified using a short-hand
* name, rather than the entire fully-qualified class name.
*/
public enum FieldNameConverters {
Member

Minor nit, can we have this implement FieldNameConverter so the enum can be used directly without calling get()?

Contributor Author

Ok. Check 'er out.
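
With the enum implementing FieldNameConverter directly, a caller can use the constants without get(); a small usage sketch, assuming the enum shown earlier in this thread:

FieldNameConverter converter = FieldNameConverters.DEDOT;
String renamed = converter.convert("source.type");   // yields "source:type"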

@cestella
Member

Ok, I'm comfortable with the change now. +1 by inspection, pending @mmiklavc

@mmiklavc
Contributor

+1 by inspection. This is great @nickwallen, thanks for the contribution.

@asfgit asfgit closed this in 3a3bba4 May 30, 2018
@nickwallen nickwallen deleted the METRON-1569 branch September 17, 2018 19:32