METRON-1772 Support alternative input formats in the Batch Profiler [Feature Branch] #1191
…-profiler' into METRON-1772
Crank. CI failure because...
After merging master, I ran through the test steps again to re-validate.
+1, I spun it up and ran through the tests. Thanks for the contribution.
It's unrelated to getting this PR in (and probably not the right spot for this question), but should the default batch profiler config live in Ambari? Given that other profiles can be provided as desired, having the default managed by Ambari might be a reasonable thing to do.
Thanks @justinleet.
That's the feedback I am looking for on the FB, so thanks.
Just to level set, right now the Mpack installs the Batch Profiler and that is it. There is no configuration, starting, stopping, or any other integration with the Mpack.
I thought it wasn't worth it to add the configuration to Ambari. The Batch Profiler isn't a service that needs to remain running like the others; you wouldn't stop, start, or check the status of it. You have to run it from the command line.
If I add just the configuration elements to Ambari, then a user has to go into Ambari to change the configuration, use the command line to define their profile, and then run it from the command line. Based on this, I feel adding it to Ambari makes it more difficult to use.
Eventually, I'd like to get a proper UI around the Profilers. At that point, we would be able to determine how it should integrate with the Mpack and that UI.
That being said, I will own opening a proper discuss thread on the mailing list to see if there are other features (like Mpack integration) that I should take care of before merging this feature branch into master.
…-profiler' into METRON-1772
Thanks! This has been merged into feature/METRON-1699-create-batch-profiler
By default, the Batch Profiler supports the text/json that Metron lands in HDFS as the source of the archived telemetry. Of course, this is not the best option for archiving telemetry in many cases and users may choose to store it in alternative formats.
Alternatives like ORC should be supported when reading the input telemetry in the Batch Profiler. The user should be able to customize the profiler based on how they have chosen to archive their telemetry.
Updated README to describe how to read alternative input formats.
Added an additional command line option that allows the user to pass custom options to the DataFrameReader. This may be needed by a user depending on how the telemetry is archived. For example, quote, nullValue, etc. needed by csv, or allowSingleQuote, allowComments needed by json.
Added an integration test that validates that the Batch Profiler can read ORC data.
Added an integration test that validates that the Batch Profiler can read CSV data. I added CSV as a test so that I could validate the user providing custom options to the DataFrameReader.
This is a pull request against the METRON-1699-create-batch-profiler feature branch.
This is dependent on the following PRs. By filtering on the last commit, this PR can be reviewed before the others are reviewed and merged.
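The custom reader options mentioned in the change list are plain key/value pairs that end up on Spark's DataFrameReader. As a rough sketch only, assuming the options are kept in a simple key=value properties file (that file layout and the helper names parse_reader_options and apply_reader_options are hypothetical and are not taken from the Batch Profiler's code), the idea looks like this in Python:

```python
def parse_reader_options(text):
    """Parse key=value lines into a dict, skipping blank lines and # comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {raw!r}")
        options[key.strip()] = value.strip()
    return options


def apply_reader_options(reader, fmt, options):
    """Set the input format and each option on a DataFrameReader-like object.

    `reader` is assumed to expose format() and option() methods that return
    the reader, as pyspark.sql.DataFrameReader does.
    """
    reader = reader.format(fmt)
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader


# Hypothetical option values; header and nullValue are options of Spark's CSV source.
csv_options = parse_reader_options("""
# options for reading CSV telemetry
header=true
nullValue=NA
""")
```

With a live Spark session, the configured reader would then be used along the lines of apply_reader_options(spark.read, "csv", csv_options).load(path), where path is wherever the telemetry was archived.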
Testing
Stand up a development environment.
Validate the environment by ensuring alerts are visible within the Alerts UI and that the Metron Service Check in Ambari passes.
Allow some telemetry to be archived in HDFS.
Shut down the Metron topologies, Storm, Elasticsearch, Kibana, and MapReduce2 to free up some resources on the VM.
Use Ambari to install Spark (version 2.3+). Actions > Add Service > Spark2
Make sure Spark can talk to HBase.
Follow the Getting Started section of the README to seed a basic profile using the text/json telemetry that is archived in HDFS.
Create the Profile.
Edit the Batch Profiler properties to point it at the correct input path (changed localhost:9000 to localhost:8020).
Edit logging as you see fit. For example, set Spark logging to WARN and Profiler logging to DEBUG. This is described in the README.
Run the Batch Profiler.
Launch the Stellar REPL and retrieve the profile data. Save this result as it will be used for validation in subsequent steps.
Delete the profiler data.
Create a new directory in HDFS for the ORC data that we are about to generate.
You may need to also create this directory for Spark.
Launch the Spark shell
Use the Spark Shell to transform the text/json telemetry to ORC.
Edit $METRON_HOME/config/batch-profiler.properties so that the Batch Profiler consumes the telemetry stored as ORC.
Run the Batch Profiler again. It will now consume the ORC data.
Again, launch the Stellar REPL and retrieve the profile data. The data should match the previous profile data that was generated using the text/json telemetry.
Notice that the output is exactly the same no matter which input format we have used.
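For the step that transforms the text/json telemetry to ORC, the commands themselves are not shown. As a minimal sketch of that conversion, written for the PySpark shell rather than the Scala Spark shell used in the steps (the same DataFrameReader and DataFrameWriter calls exist in both), the function below reads JSON and writes it back out as ORC. The function name json_to_orc is made up for illustration.

```python
def json_to_orc(spark, json_path, orc_path):
    """Read JSON telemetry from json_path and rewrite it as ORC at orc_path.

    `spark` is assumed to be an active SparkSession, such as the one the
    pyspark shell provides. Existing data at orc_path is overwritten.
    """
    messages = spark.read.json(json_path)
    messages.write.mode("overwrite").format("orc").save(orc_path)
    return messages
```

In a pyspark shell this might be called as json_to_orc(spark, json_dir, orc_dir), where json_dir is the HDFS directory holding the archived text/json telemetry and orc_dir is the new directory created for the ORC data.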
Pull Request Checklist