Conversation
experimental branch for the tuning of netflow:
* shall we incorporate protocol information?
* shall we bin time by hour?
* shall we bin bytes by exponential buckets (e.g. log of bytes)?
* shall we bin packet counts by exponential buckets (e.g. log of packet counts)?
some changes
Changed flow word creation logic:
* protocol is now part of the flow word
* time is binned by the hour
* byte count is binned by its logarithm
* packet count is binned by its logarithm

On our synthetic BP dataset, this led to a substantial improvement in model effectiveness. It also removes two full dataset sorting passes from the model construction.
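The new word construction described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the field order, the separator, and the helper names are assumptions.

```scala
// Illustrative sketch only: the PR's real flow-word layout may differ.
// Shows protocol + hourly time bin + log-scale byte and packet bins
// combined into a single "word".
object FlowWordSketch {
  private val lnOf2 = math.log(2)

  // Base-2 logarithmic binning; log(1 + x) keeps x = 0 well-defined.
  def logBin(x: Long): Long =
    math.ceil(math.log(1 + x.toDouble) / lnOf2).toLong

  def flowWord(protocol: String, hour: Int, ibyt: Long, ipkt: Long): String =
    Seq(protocol, hour, logBin(ibyt), logBin(ipkt)).mkString("_")
}
```

With this scheme a flow's numeric features collapse into a handful of discrete bins, so no quantile (sorting) pass over the dataset is needed.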
removed minute and second from the model columns since they are no longer being used
srcIP: String,
dstIP: String,
srcPort: Int,
dstPort: Int,
ipkt: Long,
protocol: String,
ibyt: Long,
It was inconsistent throughout whether ibyt came before ipkt or ipkt before ibyt, and this was causing some frustrating errors. We need a good way to catch slip-ups like these; the only way I know of is very good unit test coverage that includes test cases hitting each feature in isolation.
Okay, to be precise about what was inconsistent: the function signatures with the cutoffs always listed bytes first, but everywhere else packets came first, and that threw me off when writing the code. The point is, good unit test coverage and a saner design to keep these things straight could prevent wasted time.
Overall it's a good change.
There are only a few things to clarify before approving.
Please add JIRA-# to the description.
val timeOfDay: Double = hour.toDouble + minute.toDouble / 60 + second.toDouble / 3600
val lnOf2 = scala.math.log(2) // natural log of 2
val ibytBin: Long =
  scala.math.ceil(scala.math.log(ibyt) / lnOf2).toLong // 0 values should never ever happen
We need to have a conversation with the network specialists about this rule; it seems there can be 0s depending on how users collect their data. We need to decide whether we are going to ignore those records.
@vgonzale78.
I used floor instead of ceiling for the exponential/logarithmic binning. I think we need to choose one or the other for consistency, though in the end it may not matter much, since everything is just shifted up or down.
I've settled on ceil(log2(1 + x)).
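A quick numeric check of this rule (a standalone sketch; `ceilLog2` is an illustrative name, not the project's helper):

```scala
// ceil(log2(1 + x)): the +1 guards against log(0) when a count is zero,
// so zero-valued records no longer need a "should never happen" caveat.
val lnOf2 = scala.math.log(2)
def ceilLog2(x: Long): Long =
  scala.math.ceil(scala.math.log(1 + x.toDouble) / lnOf2).toLong
```

Bin 0 is reserved exactly for zero counts, and each subsequent bin covers a doubling of the count range.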
val timeBin = Quantiles.bin(timeOfDay, timeCuts)
val ibytBin = Quantiles.bin(ibyt, ibytCuts)
val ipktBin = Quantiles.bin(ipkt, ipktCuts)
val ipktBin: Long = scala.math.ceil(scala.math.log(ipkt) / lnOf2).toLong // 0 values should never ever happen
We need to have a conversation with the network specialists about this rule; it seems there can be 0s depending on how users collect their data. We need to decide whether we are going to ignore those records.
@vgonzale78.
The same applies here regarding org.apache.spot.utilities.MathUtils: we can add a method for Long.
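A possible shape for such a helper (hypothetical: the existing MathUtils.logBaseXInt signature is not shown in this thread, so the name and parameters here are guesses, mirroring the floor behaviour attributed to the Int version later in the thread):

```scala
// Hypothetical Long analogue of MathUtils.logBaseXInt: floor of the
// base-`base` logarithm, returned as a Long bin index.
object MathUtilsSketch {
  def logBaseXLong(x: Double, base: Double): Long =
    math.floor(math.log(x) / math.log(base)).toLong
}
```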
I will use log2(1 + x) to be safe at zero values.
val timeOfDay: Double = hour.toDouble + minute.toDouble / 60 + second.toDouble / 3600
val lnOf2 = scala.math.log(2) // natural log of 2
val ibytBin: Long =
  scala.math.ceil(scala.math.log(ibyt) / lnOf2).toLong // 0 values should never ever happen
@lujangus created an object called MathUtils in the org.apache.spot.utilities package with a function logBaseXInt. I think it would be a good idea to create a similar function for Long and keep it there, in case we need similar code in the future.
Yes, MathUtils recreates this step, except that it works on Int and automatically applies the floor function.
It's pretty crazy for the bin to be a Long, since that would correspond to byte/packet counts of size 2 raised to (about) the 4 billionth power...
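To put the range in perspective (a standalone check, not project code): even the largest possible Long-valued count yields a bin index of only about 63, so an Int return type is more than sufficient.

```scala
// The largest log2 bin a Long-valued count can produce is roughly 63
// (Long.MaxValue = 2^63 - 1), far inside Int range; a Long bin index
// would only be needed for counts around 2^(2^31), which cannot occur.
val maxBin = math.ceil(math.log(Long.MaxValue.toDouble) / math.log(2)).toInt
```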
logger.info("Fitting probabilistic model to data")
val model =
  FlowSuspiciousConnectsModel.trainModel(sparkContext, sqlContext, logger, config, flows.select(InSchema: _*))
Why move the select onto this line when flows can already be both filtered and column-selected? Is there any other use of flows that needs all the columns?
It was an artifact of a test that I did; I will remove it.
inputRecords: DataFrame,
topicCount: Int): FlowSuspiciousConnectsModel = {

def cleanData(flows: DataFrame): DataFrame = {
This is a duplicate of FlowSuspiciousConnectsAnalysis.cleanFlowRecords, except that it does not check minutes and seconds. What is the reason for adding this function if it is only used in unit tests and the data is already clean by the time the caller invokes the trainModel function?
we can move the change to SPOT-128 if you like, but I think that the cleaning code should be packaged with the model because it is enforcing preconditions necessary to build the model
@@ -0,0 +1,65 @@
This script has many hard-coded values. To make it available to everyone, we should leave values unassigned or accept parameters, and document exactly what this test does (functional/integration test). Otherwise, I'd leave it for internal use only.
Also, the name could be something more descriptive.
this should not have been checked in, thanks for the catch
@@ -61,39 +68,18 @@ object FlowSuspiciousConnectsAnalysis {
val invalidFlowRecords = filterAndSelectInvalidFlowRecords(inputFlowRecords)
dataValidation.showAndSaveInvalidRecords(invalidFlowRecords, config.hdfsScoredConnect, logger)

outputFlowRecords
This change looks very similar to the one requested in SPOT-128, but feels a little incomplete.
This function originally returned Unit, and the change seems to serve only the TestFlow script. Do we want to make this change now and complete the desired functionality in SPOT-128, or remove it and let SPOT-128 take care of it?
It's an artifact of something I was doing during testing; I will remove it and leave these changes to SPOT-128.
incorporated code review feedback
Looks good to me. Just need a small clarification but overall great change.
@@ -108,8 +93,6 @@ object FlowSuspiciousConnectsAnalysis {

inputFlowRecords
  .filter(cleanFlowRecordsFilter)
What was the reason to move the "select" step out of this function?
I just want to keep consistency between pipelines; changing the way it works here should also change the way we do it for DNS and Proxy. The same question applies to the renaming of cleanFlowRecords and the removal of detectFlowAnomalies.
As we discussed on the phone: as we move towards a library of routines that people can compose for their own experiments, it is natural that people will want to run the suspicious connects analysis on a data frame in a way that acts like an "add columns" operation. It should not drop a bunch of columns (which might contain useful side information, like class labels) just because they are not consumed by the suspicious connects model.
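This contract can be illustrated in miniature with plain Scala, using a Map to stand in for a DataFrame row (`addScore` and the column name are illustrative, not the project's API):

```scala
// "Add columns" semantics: scoring appends a field and preserves all
// existing columns, including side information (e.g. class labels) that
// the model itself never reads.
def addScore(row: Map[String, Any], score: Double): Map[String, Any] =
  row + ("score" -> score)
```

The equivalent DataFrame-level idea is a `withColumn`-style operation rather than a `select` onto a fixed model schema.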
// simplify netflow log entries into "words"

val dataWithWords = totalRecords.withColumn(SourceWord, FlowWordCreator.srcWordUDF(ModelColumns: _*))
  .withColumn(DestinationWord, FlowWordCreator.dstWordUDF(ModelColumns: _*))
I really like how this section is looking much cleaner (also in Proxy).
ibyt: Int,
opkt: Int,
obyt: Int)
Nice, looks cleaner.
Good work.
* @param logger
* @return
*/
def detectFlowAnomalies(data: DataFrame,
revert
/**
 *
 * @param inputFlowRecords raw flow records
 * @return
 */
def filterAndSelectCleanFlowRecords(inputFlowRecords: DataFrame): DataFrame = {
revert
@@ -108,8 +93,6 @@ object FlowSuspiciousConnectsAnalysis {

inputFlowRecords
  .filter(cleanFlowRecordsFilter)
  .select(InSchema: _*)
revert
Working fine; runtime shows performance improvements.
+1
reverted some changes to defer them for a larger overhaul
LGTM
+1
This PR is to change the netflow Suspicious Connects analysis word formation step in the following ways:
The motivation for this is to remove the computationally expensive sorting steps used to calculate quantiles for binning numerical values in the old code, to incorporate the often meaningful protocol data, and to use a binning scheme more appropriate to the nature of the data being collected.
Testing on synthetic data has consistently shown modest increases in the AUC with these changes.