
Conversation


@nickwallen nickwallen commented Sep 7, 2018

By default, the Batch Profiler reads the text/json that Metron lands in HDFS as the source of archived telemetry. In many cases this is not the best format for archiving telemetry, and users may choose to store it in alternative formats.

The Batch Profiler should support alternatives like ORC when reading the input telemetry, and the user should be able to customize the Profiler based on how they have chosen to archive their telemetry.

  • Updated README to describe how to read alternative input formats.

  • Added a command line option that allows the user to pass custom options to Spark's DataFrameReader; this may be needed depending on how the telemetry is archived.

    • For example, this allows the user to pass reader options like quote and nullValue needed by CSV, or allowSingleQuotes and allowComments needed by JSON (see the spark-shell sketch after this list).
  • Added an integration test that validates that the Batch Profiler can read ORC data.

  • Added an integration test that validates that the Batch Profiler can read CSV data. CSV was added as a test case to validate a user providing custom options to the DataFrameReader.
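
To make those reader options concrete, here is a minimal spark-shell sketch (not part of this PR) of the kinds of DataFrameReader options the new command line option is meant to pass through; the input paths are placeholders:

    // CSV telemetry: quote and nullValue are standard Spark CSV reader options.
    val csv = spark.read
      .format("csv")
      .option("header", "true")
      .option("quote", "\"")
      .option("nullValue", "")
      .load("/path/to/archived/csv")    // placeholder path

    // JSON telemetry: allowSingleQuotes and allowComments are standard Spark JSON reader options.
    val json = spark.read
      .format("json")
      .option("allowSingleQuotes", "true")
      .option("allowComments", "true")
      .load("/path/to/archived/json")   // placeholder path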

This is a pull request against the METRON-1699-create-batch-profiler feature branch.

This is dependent on the following PRs. By filtering on the last commit, this PR can be reviewed before the others are reviewed and merged.

Testing

  1. Stand-up a development environment.

    cd metron-deployment/development/centos6
    vagrant up
    vagrant ssh
    sudo su -
    
  2. Validate the environment by ensuring alerts are visible within the Alerts UI and that the Metron Service Check in Ambari passes.

  3. Allow some telemetry to be archived in HDFS.

    [root@node1 ~]# hdfs dfs -cat /apps/metron/indexing/indexed/*/* | wc -l
    6916
    
  4. Shut down the Metron topologies, Storm, Elasticsearch, Kibana, and MapReduce2 to free up some resources on the VM.

  5. Use Ambari to install Spark (version 2.3+). Actions > Add Service > Spark2

  6. Make sure Spark can talk to HBase by putting the HBase client configuration on Spark's classpath; the Profiler needs this to write its results to HBase.

    SPARK_HOME=/usr/hdp/current/spark2-client
    cp  /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
    
  7. Follow the Getting Started section of the README to seed a basic profile using the text/json telemetry that is archived in HDFS.

    1. Create the profile. The hello-world profile below simply counts the number of messages seen in each period, across all of the telemetry.

      [root@node1 ~]# source /etc/default/metron
      [root@node1 ~]# cat $METRON_HOME/config/zookeeper/profiler.json
      {
        "profiles": [
          {
            "profile": "hello-world",
            "foreach": "'global'",
            "init":    { "count": "0" },
            "update":  { "count": "count + 1" },
            "result":  "count"
          }
        ],
        "timestampField": "timestamp"
      }
      
    2. Edit the Batch Profiler properties to point at the correct input path (changed localhost:9000 to localhost:8020).

      [root@node1 ~]# cat /usr/metron/0.5.1/config/batch-profiler.properties
      
      spark.app.name=Batch Profiler
      spark.master=local
      spark.sql.shuffle.partitions=8
      
      profiler.batch.input.path=hdfs://localhost:8020/apps/metron/indexing/indexed/*/*
      profiler.batch.input.format=text
      
      profiler.period.duration=15
      profiler.period.duration.units=MINUTES
      
    3. Edit logging as you see fit. For example, set Spark logging to WARN and Profiler logging to DEBUG. This is described in the README.
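
    As a rough sketch of what that could look like (the README is authoritative; this assumes the Profiler classes live under the org.apache.metron.profiler package and that Spark picks up $SPARK_HOME/conf/log4j.properties):

      # quiet Spark, verbose Profiler (hypothetical example)
      log4j.logger.org.apache.spark=WARN
      log4j.logger.org.apache.metron.profiler=DEBUG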

    4. Run the Batch Profiler.

      $METRON_HOME/bin/start_batch_profiler.sh
      
  8. Launch the Stellar REPL and retrieve the profile data. Save this result as it will be used for validation in subsequent steps.

    [root@node1 ~]# $METRON_HOME/bin/stellar -z $ZOOKEEPER
    ...
    Stellar, Go!
    Functions are loading lazily in the background and will be unavailable until loaded fully.
    ...
    [Stellar]>>> window := PROFILE_FIXED(2, "HOURS")
    [ProfilePeriod{period=1707332, durationMillis=900000}, ProfilePeriod{period=1707333, durationMillis=900000}, ProfilePeriod{period=1707334, durationMillis=900000}, ProfilePeriod{period=1707335, durationMillis=900000}, ProfilePeriod{period=1707336, durationMillis=900000}, ProfilePeriod{period=1707337, durationMillis=900000}, ProfilePeriod{period=1707338, durationMillis=900000}, ProfilePeriod{period=1707339, durationMillis=900000}, ProfilePeriod{period=1707340, durationMillis=900000}]
    
    [Stellar]>>> PROFILE_GET("hello-world","global", window)
    [1020, 5066, 830]
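
    A quick sanity check on the output above: durationMillis=900000 ms is 15 minutes, which matches profiler.period.duration=15 MINUTES, and a 2-hour fixed window covers 2 hours / 15 minutes = 8 full periods (nine period ids appear because the partial periods at each end of the window are included). The three values returned by PROFILE_GET sum to 1020 + 5066 + 830 = 6916, the same number of messages archived in step 3; presumably the remaining periods in the window contain no telemetry.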
    
  9. Delete the profiler data.

    echo "truncate 'profiler'" |  hbase shell
    
  10. Create a new directory in HDFS for the ORC data that we are about to generate.

    export HADOOP_USER_NAME=hdfs
    hdfs dfs -mkdir /apps/metron/indexing/orc
    hdfs dfs -chown metron:hadoop /apps/metron/indexing/orc
    
  11. You may also need to create this directory for Spark; on HDP, Spark writes its event logs to /spark2-history by default and may fail to start if the directory does not exist.

     export HADOOP_USER_NAME=hdfs
     hdfs dfs -mkdir /spark2-history
    
  12. Launch the Spark shell.

    export SPARK_MAJOR_VERSION=2
    export HADOOP_USER_NAME=hdfs
    spark-shell
    
  13. Use the Spark Shell to transform the text/json telemetry to ORC.

    scala> val jsonPath = "hdfs://localhost:8020/apps/metron/indexing/indexed/*/*"
    jsonPath: String = hdfs://localhost:8020/apps/metron/indexing/indexed/*/*
    
    scala> val orcPath = "hdfs://localhost:8020/apps/metron/orc/"
    orcPath: String = hdfs://localhost:8020/apps/metron/orc/
    
    scala> val msgs = spark.read.format("text").load(jsonPath).as[String]
    msgs: org.apache.spark.sql.Dataset[String] = [value: string]
    
    scala> msgs.count
    res0: Long = 6916
    
    scala> msgs.write.mode("overwrite").format("org.apache.spark.sql.execution.datasources.orc").save(orcPath)
    
    scala> spark.read.format("org.apache.spark.sql.execution.datasources.orc").load(orcPath).as[String].count
    res3: Long = 6916
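
    As an optional spot check (not part of the original steps), one record can be printed to confirm the JSON strings survived the round trip intact:

    scala> spark.read.format("org.apache.spark.sql.execution.datasources.orc").load(orcPath).as[String].show(1, false)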
    
  14. Edit $METRON_HOME/config/batch-profiler.properties so that the Batch Profiler consumes the telemetry stored as ORC.

    [root@node1 ~]# cat /usr/metron/0.5.1/config/batch-profiler.properties
    
    spark.app.name=Batch Profiler
    spark.master=local
    spark.sql.shuffle.partitions=8
    
    profiler.batch.input.path=hdfs://localhost:8020/apps/metron/orc/
    profiler.batch.input.format=org.apache.spark.sql.execution.datasources.orc
    
    profiler.period.duration=15
    profiler.period.duration.units=MINUTES
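
    Presumably the fully-qualified class name is used here (rather than a short alias like orc) to pin the Profiler to Spark's native ORC data source; on Spark 2.3 the orc alias may still resolve to the older Hive-based ORC implementation unless spark.sql.orc.impl is set to native.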
    
  15. Run the Batch Profiler again. It will now consume the ORC data.

    $METRON_HOME/bin/start_batch_profiler.sh
    
  16. Launch the Stellar REPL again and retrieve the profile data. The data should match the profile data previously generated from the text/json telemetry.

    [root@node1 ~]# $METRON_HOME/bin/stellar -z $ZOOKEEPER
    ...
    Stellar, Go!
    Functions are loading lazily in the background and will be unavailable until loaded fully.
    ...
    [Stellar]>>> window := PROFILE_FIXED(2, "HOURS")
    [ProfilePeriod{period=1707332, durationMillis=900000}, ProfilePeriod{period=1707333, durationMillis=900000}, ProfilePeriod{period=1707334, durationMillis=900000}, ProfilePeriod{period=1707335, durationMillis=900000}, ProfilePeriod{period=1707336, durationMillis=900000}, ProfilePeriod{period=1707337, durationMillis=900000}, ProfilePeriod{period=1707338, durationMillis=900000}, ProfilePeriod{period=1707339, durationMillis=900000}, ProfilePeriod{period=1707340, durationMillis=900000}]
    
    [Stellar]>>> PROFILE_GET("hello-world","global", window)
    [1020, 5066, 830]
    
  17. Notice that the output is exactly the same no matter which input format we have used.

Pull Request Checklist

  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?
  • Have you included steps or a guide to how the change may be verified and tested manually?
  • Have you ensured that the full suite of tests and checks has been executed in the root metron folder via:
  • Have you written or updated unit tests and/or integration tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

@nickwallen
Contributor Author

Crank. CI failure because...

The job exceeded the maximum time limit for jobs, and has been terminated.

@nickwallen nickwallen closed this Sep 18, 2018
@nickwallen nickwallen reopened this Sep 18, 2018
@nickwallen
Contributor Author

After merging master, I ran through the test steps again to re-validate.

@nickwallen
Contributor Author

Failed tests: 
  ZKConfigurationsCacheIntegrationTest.validateUpdate:230->lambda$validateUpdate$9:230 expected:<{hdfs={index=yaf, batchSize=1, enabled=true}, elasticsearch={index=yaf, batchSize=25, batchTimeout=7, enabled=false}, solr={index=yaf, batchSize=5, enabled=false}}> but was:<{}>

@nickwallen nickwallen closed this Sep 18, 2018
@nickwallen nickwallen reopened this Sep 18, 2018
@justinleet
Contributor

+1, I spun it up and ran through the tests. Thanks for the contribution

It's unrelated to getting this PR in (and probably not the right spot for this question), but should the default batch profiler config live in Ambari? Given that other profiles can be provided as desired, having the default managed by Ambari might be a reasonable thing to do.

@nickwallen
Contributor Author

Thanks @justinleet.

@justinleet: ... should the default batch profiler config live in Ambari?

That's the feedback I am looking for on the feature branch, so thanks.

Just to level set, right now the Mpack installs the Batch Profiler and that is it. There is no configuration, starting, stopping or any other integration with the Mpack.

I didn't think it was worth adding the configuration to Ambari. The Batch Profiler isn't a service that needs to remain running like the others; you wouldn't stop, start, or check the status of it.

You have to run it from the command line. If I add just the configuration elements to Ambari, then a user has to go into Ambari to change the configuration, use the command line to define their profile, and then run it from the command line. Based on this, I feel adding it to Ambari makes it more difficult to use.

Eventually, I'd like to get a proper UI around the Profilers, and I think at that point we would be able to determine how it should integrate with the Mpack and that UI.

@nickwallen
Contributor Author

That being said, I will own opening a proper discuss thread on the mailing list to see if there are other features (like Mpack integration) that I should take care of before merging this feature branch into master.

asfgit pushed a commit that referenced this pull request Sep 19, 2018
@nickwallen
Contributor Author

Thanks! This has been merged into feature/METRON-1699-create-batch-profiler
