
METRON-1569 Allow user to change field name conversion when indexing … #1022

Closed

Conversation

nickwallen
Contributor

@nickwallen nickwallen commented May 22, 2018

The ElasticsearchWriter has a mechanism to transform the field names of a message before it is written to Elasticsearch. Right now this mechanism is hard-coded to replace all '.' dots with ':' colons.
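
For example (illustrative only; the field names come from the worker logs shown later in the testing steps), the current hard-coded behavior renames fields like so:

    // illustration of the existing behavior, not the actual ElasticsearchWriter code
    FieldNameConverter converter = new ElasticsearchFieldNameConverter();
    converter.convert("source.type");   // -> "source:type"
    converter.convert("enrichments.geo.ip_dst_addr.location_point");
                                        // -> "enrichments:geo:ip_dst_addr:location_point"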

This mechanism was needed for Elasticsearch 2.x, which did not allow dots in field names. Now that Metron supports Elasticsearch 5.x, this is no longer a problem. A user should be able to configure the field name transformation when writing to Elasticsearch, as needed.

While it might have been simpler to just remove the de-dotting mechanism, this would break backwards compatibility. Taking this approach provides users with an upgrade path.

Changes

This change allows the user to configure the field name converter as part of the index writer configuration.

Acceptable values include the following.

  • DEDOT: Replaces all '.' with ':'; this is the default, backwards-compatible behavior.
  • NOOP: No field name change.

If no "fieldNameConverter" is defined, it defaults to using DEDOT which maintains backwards compatibility.

A cache of FieldNameConverters is maintained since the index writer configuration can be changed at run-time and each sensor has its own index writer configuration.
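
A rough sketch of what that cached factory might look like (this reflects the caching approach described here; later commits in this PR remove the cache in favor of shared enum instances). The Guava cache and the getFieldNameConverter accessor on WriterConfiguration are assumptions, not the actual implementation:

    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    public class CachedFieldNameConverterFactory implements FieldNameConverterFactory {

      // one converter per sensor type; entries expire so that a change to the
      // writer configuration is eventually picked up (see the ~5 minute delay
      // mentioned in the manual testing steps)
      private final Cache<String, FieldNameConverter> cache = CacheBuilder.newBuilder()
              .expireAfterWrite(5, TimeUnit.MINUTES)
              .build();

      @Override
      public FieldNameConverter create(String sensorType, WriterConfiguration config) {
        try {
          return cache.get(sensorType, () -> createInstance(sensorType, config));
        } catch (ExecutionException e) {
          throw new IllegalStateException("Unable to create field name converter", e);
        }
      }

      private FieldNameConverter createInstance(String sensorType, WriterConfiguration config) {
        // 'getFieldNameConverter' is an assumed accessor for the new configuration value
        String converterName = config.getFieldNameConverter(sensorType);

        // default to 'DEDOT' to maintain backwards compatibility
        if (converterName == null || converterName.isEmpty()) {
          return new DeDotFieldNameConverter();
        }
        return FieldNameConverters.valueOf(converterName.toUpperCase()).get();
      }
    }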

An example configuration looks like the following.

    {
      "hdfs" : {
        "enabled" : false
      },
      "elasticsearch" : {
        "index" : "bro",
        "batchSize" : 5,
        "enabled" : true,
        "fieldNameConverter": "NOOP"
      },
      "solr" : {
        "enabled" : false
      }
    } 

Code Changes

  • Added the fieldNameConverter parameter to the Index writer configuration.

  • Moved the FieldNameConverter implementations to a dedicated package in metron-common.

  • Renamed ElasticsearchFieldNameConverter to DeDotFieldNameConverter.

  • Implemented the NoopFieldNameConverter which does not modify the field name (both converter implementations are sketched after this list).

  • Created FieldNameConverters class that allows a user to specify either DEDOT or NOOP to choose the appropriate implementation.

  • Implemented a CachedFieldNameConverterFactory that encapsulates all the logic for choosing and instantiating the appropriate FieldNameConverter.

  • Updated ElasticsearchWriter to use the CachedFieldNameConverterFactory.

  • Updated the README to document the new configuration parameter.
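
For reference, a minimal sketch of the converter types named above (each class in its own file; the single-argument convert signature follows the enum snippet later in this thread, but the real interfaces may differ slightly):

    // FieldNameConverter.java
    public interface FieldNameConverter {

      /**
       * Convert a field name before the message is written to the index.
       */
      String convert(String originalField);
    }

    // DeDotFieldNameConverter.java - the default, backwards-compatible behavior
    public class DeDotFieldNameConverter implements FieldNameConverter {

      @Override
      public String convert(String originalField) {
        // replace all '.' with ':' as required by Elasticsearch 2.x
        return originalField.replace(".", ":");
      }
    }

    // NoopFieldNameConverter.java - leaves the field name unchanged
    public class NoopFieldNameConverter implements FieldNameConverter {

      @Override
      public String convert(String originalField) {
        return originalField;
      }
    }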

Manual Testing

  1. Launch a development environment and log in.

    vagrant ssh
    sudo su -
    source /etc/default/metron
    
  2. Validate the environment by ensuring alerts are visible in the Alerts UI and that the Ambari Service Check completes successfully. This ensures that the change is backwards compatible.

  3. Login to the Storm UI and enable DEBUG logging for org.apache.metron.common and org.apache.metron.elasticsearch.

  4. The Storm worker logs in /var/log/storm/worker-artifacts/random_access_indexing*/worker.log should contain the following log statements, if you have enabled DEBUG logging correctly. This shows that the default DEDOT converter is in use.

    2018-05-22 14:38:... [DEBUG] Renamed dotted field; original=source.type, new=source:type
    2018-05-22 14:38:... [DEBUG] Renamed dotted field; original=adapter.geoadapter.end.ts, new=adapter:geoadapter:end:ts
    2018-05-22 14:38:... [DEBUG] Renamed dotted field; original=threatintelsplitterbolt.splitter.end.ts, new=threatintelsplitterbolt:splitter:end:ts
    2018-05-22 14:38:... [DEBUG] Renamed dotted field; original=adapter.threatinteladapter.begin.ts, new=adapter:threatinteladapter:begin:ts
    2018-05-22 14:38:... [DEBUG] Renamed dotted field; original=enrichments.geo.ip_dst_addr.location_point, new=enrichments:geo:ip_dst_addr:location_point
    2018-05-22 14:38:... [DEBUG] Renamed dotted field; original=adapter.threatinteladapter.end.ts, new=adapter:threatinteladapter:end:ts
    2018-05-22 14:38:... [DEBUG] Renamed dotted field; original=enrichmentsplitterbolt.splitter.end.ts, new=enrichmentsplitterbolt:splitter:end:ts
    
  5. Launch the REPL.

    ./bin/stellar -z $ZOOKEEPER
    
  6. Change the field name converter to NOOP.

    [Stellar]>>> conf := SHELL_EDIT()
    {
      "hdfs" : {
        "enabled" : false
      },
      "elasticsearch" : {
        "index" : "bro",
        "batchSize" : 5,
        "enabled" : true,
        "fieldNameConverter": "NOOP"
      },
      "solr" : {
        "enabled" : false
      }
    }
    
    [Stellar]>>> CONFIG_PUT("INDEXING", conf, "bro")
    
  7. It can take up to 5 minutes for the topology to pick up this change. The old FieldNameConverter needs to expire from the cache first.

  8. Go back to the Storm worker logs. When the change takes effect, we should see a log like the following indicating that the NoopFieldNameConverter was created.

    2018-05-22 16:... [DEBUG] Created field name converter; sensorType=bro, configuredName=NOOP, class=NoopFieldNameConverter
    
  9. In the same logs, we will start to see tuples fail to be indexed. Elasticsearch complains because the templates have been created to expect source:type, but that field no longer exists because the FieldNameConverter was changed.

    2018-05-22 16:0...[ERROR] Failing 1 tuples
    org.elasticsearch.index.mapper.MapperParsingException: Could not dynamically add mapping for field [source.type]. Existing mapping for [source] must be of type object but found [keyword].
    	at org.elasticsearch.index.mapper.DocumentParser.getDynamicParentMapper(DocumentParser.java:876) ~[stormjar.jar:?]
    	at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:596) ~[stormjar.jar:?]
    	at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:396) ~[stormjar.jar:?]
    	at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:373) ~[stormjar.jar:?]
    	at org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:93) ~[stormjar.jar:?]
    	at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:66) ~[stormjar.jar:?]
    
    

Pull Request Checklist

  • Is there a JIRA ticket associated with this PR? If not, one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?
  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?
  • Have you included steps or a guide to how the change may be verified and tested manually?
  • Have you ensured that the full suite of tests and checks has been executed in the root metron folder?
  • Have you written or updated unit tests and/or integration tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • Have you verified the basic functionality of the build by building and running locally with the Vagrant full-dev environment or the equivalent?

Contributor

@mmiklavc mmiklavc left a comment

Thanks for the contribution @nickwallen. I made a first pass and have a couple of questions about the caching implementation in the context of the existing Zookeeper Curator configuration management.

private FieldNameConverter createInstance(String sensorType, WriterConfiguration config) {

// default to the 'DEDOT' field name converter to maintain backwards compatibility
FieldNameConverter result = new DeDotFieldNameConverter();
Contributor

Might we want to call the enum when we create these objects?

FieldNameConverters.DEDOT.get();

Contributor Author

Sure, that makes sense.

* after they are created. Once they expire, the {@link WriterConfiguration} is used to
* reload the {@link FieldNameConverter}.
*
* <p>The user can change the {@link FieldNameConverter} in use at runtime. A change
Contributor

I want to say that this is handled through the ConfigurationsUpdater classes already. I believe the writer passes in an updated config on each call.

https://github.com/apache/metron/blob/master/metron-platform/metron-common/src/main/java/org/apache/metron/common/zookeeper/configurations/ConfigurationsUpdater.java

Contributor Author

Maybe I am misunderstanding you, but the ConfigurationsUpdater ensures that our configuration classes stay in-sync with Zk. That's not what this class does.

This creates the appropriate FieldNameConverter instance for each source type that is then used by the ElasticsearchWriter to do the field name conversion. Are you proposing a different way to do this?

Contributor Author

To help clarify, I could have implemented a different FieldNameConverterFactory; maybe one called SimpleFieldNameConverterFactory.

This 'simple' factory wouldn't even have to use a cache. Every time ElasticsearchWriter calls create(String sensorType, WriterConfiguration config), it could just instantiate a new instance of the right FieldNameConverter.

But of course, I didn't do that because this occurs for every field in every message that is written to Elasticsearch. The cache is there for performance.

I offer this only as an example to help clarify the purpose of the CachedFieldNameConverterFactory.
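
A rough sketch of that hypothetical, uncached factory (SimpleFieldNameConverterFactory is only a name used for illustration in this discussion, and getFieldNameConverter is an assumed accessor):

public class SimpleFieldNameConverterFactory implements FieldNameConverterFactory {

  @Override
  public FieldNameConverter create(String sensorType, WriterConfiguration config) {
    // hypothetical accessor for the "fieldNameConverter" writer config value
    String converterName = config.getFieldNameConverter(sensorType);

    // a new converter is instantiated on every call; no caching at all
    if ("NOOP".equals(converterName)) {
      return new NoopFieldNameConverter();
    }

    // default to 'DEDOT' to maintain backwards compatibility
    return new DeDotFieldNameConverter();
  }
}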

Contributor

I think I'm just not fully grokking why we need additional caching here rather than FieldNameConverters.DEDOT.get();. It seems like the enum handles caching by virtue of the constructors in the enum itself? NOOP(new NoopFieldNameConverter())

Contributor Author

So you're saying, hey why not just...

  1. Take out the caching and just have something like a SimpleFieldNameConverterFactory. This would translate the (sourceType, WriterConfiguration) to the right FieldNameConverter.
  2. The caching would be handled by FieldNameConverters in the sense that they are effectively singletons.

I think that would work if all FieldNameConverter implementations are stateless. And they all are stateless today, but I tried not to make that a constraint. I actually tried to make this work whether they are stateless or not.

But alas there is a bug in what I did because I need each call to FieldNameConverters.get() to return a new instance. (If we want to stick with this approach.)

I'd be open to removing the cache if it's not needed and we think FieldNameConverters would always remain stateless.

Contributor Author

The prior commit 'b18c3bf' was just me fixing my original caching implementation, so at least I had a version of that with the bug fixed.

The latest commit 'cee3994' removes the cache and relies on the FieldNameConverters class to hand out shared instances. This is what we have discussed.

Contributor

@mmiklavc mmiklavc May 25, 2018

Yes, agreed on the possible need for handling other writers. Is there a reason we wouldn't put that logic in the enum factory rather than creating another new class for it? It wouldn't even need to depend on the WriterConfiguration, actually.

EDIT - would need that create to be static

import java.util.Optional;

public enum FieldNameConverters implements FieldNameConverter {

  NOOP(new NoopFieldNameConverter()),
  DEDOT(new DeDotFieldNameConverter());

  private FieldNameConverter converter;

  FieldNameConverters(FieldNameConverter converter) {
    this.converter = converter;
  }

  public FieldNameConverter create(String sensorType, Optional<String> converterName) {
    // default to the 'DEDOT' field name converter to maintain backwards compatibility
    return FieldNameConverters.valueOf(converterName.orElse("DEDOT"));
  }

  @Override
  public String convert(String originalField) {
    return converter.convert(originalField);
  }
}

I'm not sure I follow about the testing - do you mean integration testing? We have an ElasticsearchWriterTest class that @justinleet wrote - https://github.com/apache/metron/blob/master/metron-platform/metron-elasticsearch/src/test/java/org/apache/metron/elasticsearch/writer/ElasticsearchWriterTest.java

Contributor Author

I'm not sure I follow about the testing

If you look closely at those tests, all that class tests is the response using buildWriteReponse. It does not actually test the class using the write method. When you try to do that, it blows up because of dependency issues.

Contributor

Ok, I see what you mean. Agreed, it would be good to have some tests around that write method at some point, but I think it's ok to keep that out of scope for this particular PR. We have a WriterBoltIntegrationTest, but it's only covering parsers right now. At a later time we could look at expanding it to cover multiple writers and topologies and provide hooks on those builders to enable one-time setup and teardown across a suite of tests.

Contributor Author

I can add the create method to FieldNameConverters. That makes sense here. It's going to be slightly different from what you have above, but it is what you were going for.

*/
private FieldNameConverter createInstance(String sensorType, WriterConfiguration config) {

// default to the 'DEDOT' field name converter to maintain backwards compatibility
Member

This looks interesting, but one bit of functionality has changed as part of doing this PR. Currently, we specify the field converter at the writer implementation level, so:

  • ES uses dedot
  • HDFS uses noop
  • solr uses noop

By doing this, we're actually not maintaining backwards compatibility; we're changing the behavior for the HDFS and Solr writers to dedot. What I'd suggest doing is adding a method to the BulkMessageWriter interface like so:

default FieldNameConverter defaultFieldNameConverter() {
  return FieldNameConverters.NOOP.get();
}

and in ElasticsearchWriter specify DEDOT as the default. Also, here, you probably want to pass in the default field name converter if unspecified as a 3rd argument.

This would allow us to maintain backwards compatibility and enable users to override.

Contributor Author

Remember that only the ElasticsearchWriter does field name conversion. HDFS and Solr do not do field name conversion and so are backwards compatible.

In the README, I think I specified that this field is only applicable for 'elasticsearch'.

In the future, we could certainly roll this functionality out to the other writers, but right now it's not there, AFAIK.

Member

OH! yes, you're right. Sorry for jumping to conclusions, I assumed you generalized this to the bulk message writer level, rather than keeping it at the ES writer level. My bad, I retract the objection.

@nickwallen
Contributor Author

FYI - I also confirmed that the configuration value can be changed using the Advanced mode in the Management UI.

(screenshot: Advanced mode in the Management UI, 2018-05-25 11:31 am)

* <p>The instances are created and managed by the {@link FieldNameConverters} class.
*/
public class SharedFieldNameConverterFactory implements FieldNameConverterFactory {

Member

This looks like it's moving in the right direction, but as @mmiklavc gave an example for, why would we not replace this class with a static create(String sensorType, WriterConfig config) method on the enum?

@nickwallen
Contributor Author

All review feedback should have been addressed. Let me know if there is anything else @mmiklavc @cestella

* <p>Allows the field name converter to be specified using a short-hand
* name, rather than the entire fully-qualified class name.
*/
public enum FieldNameConverters {
Member

Minor nit, can we have this implement FieldNameConverter so the enum can be used directly without calling get()?

Contributor Author

Ok. Check 'er out.
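
With the enum implementing FieldNameConverter directly, a caller can use the constants without get(); a small usage sketch, assuming the enum shown earlier in this thread:

FieldNameConverter converter = FieldNameConverters.DEDOT;
String renamed = converter.convert("source.type");   // yields "source:type"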

@cestella
Member

Ok, I'm comfortable with the change now. +1 by inspection, pending @mmiklavc

@mmiklavc
Contributor

+1 by inspection. This is great @nickwallen, thanks for the contribution.

@asfgit asfgit closed this in 3a3bba4 May 30, 2018
@nickwallen nickwallen deleted the METRON-1569 branch September 17, 2018 19:32