
SPOT-183 Schema validation for input data #72

Merged
merged 3 commits into from
Jul 27, 2017

Conversation


@rabarona rabarona commented Jun 29, 2017

This PR implements the changes requested in SPOT-183 and aims to address the issues reported in SPOT-174 and SPOT-149. It validates the input dataset (by reading the dataframe schema) and checks whether it contains the schema required for model training.

Main changes

  • Added schema validation based on columns required for model training.

  • Updated each pipeline (DNS, Flow, Proxy) so the schema is validated before performing any other activity.

  • Updated the main activity to show an error if the schema is invalid.

  • The main application will now print which fields do not match the expected schema.

  • Added unit test for schema validation.
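To make the change concrete, here is a minimal sketch of what column-based validation can look like, built around the validate(inSchema, expectedSchema): Seq[String] signature and the InputSchema.ResponseDefaultSize constant that appear in the diff below; the log messages and the type check are illustrative assumptions, not the exact code in this PR:

import org.apache.spark.sql.types.StructType

object InputSchema {

  // The validator always returns at least this one header message; anything beyond
  // it describes a required column that is missing or has an unexpected type.
  val ResponseDefaultSize = 1

  def validate(inSchema: StructType, expectedSchema: StructType): Seq[String] = {
    val header = Seq("Validating input schema against the schema required for model training")
    val errors = expectedSchema.fields.flatMap { required =>
      inSchema.fields.find(_.name == required.name) match {
        case None =>
          Some(s"Column ${required.name} is missing from the input data")
        case Some(actual) if actual.dataType != required.dataType =>
          Some(s"Column ${required.name} has type ${actual.dataType}, expected ${required.dataType}")
        case _ => None
      }
    }
    header ++ errors
  }
}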

logger.info("Fitting probabilistic model to data")
val model =
DNSSuspiciousConnectsModel.trainModel(sparkSession, logger, config, dnsRecords)
if (schemaValidationResults.length > InputSchema.ResponseDefaultSize) {
Contributor

this is a somewhat cryptic test condition... what is being checked here?

Author

So, the validator returns a Seq of messages (String). If everything went well, there will be exactly one message (the initializing message), but if there are schema errors there will be more than one.
If the length is > InputSchema.ResponseDefaultSize (meaning 1), it indicates there are some errors; if it's exactly one, then everything is good.
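For reference, this is roughly how the caller-side check reads with the Seq[String] return value (names taken from the diff above; the logging call is an assumption):

// One header message is always present; anything past ResponseDefaultSize (1) is an error.
if (schemaValidationResults.length > InputSchema.ResponseDefaultSize) {
  schemaValidationResults.foreach(message => logger.error(message))
} else {
  // schema is valid, continue with model training
}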

Contributor

could we just return a binary Pass/Fail value?

Author

Yeah, it can be; I just want to keep the list of columns that aren't working. I can return a tuple or a case class with Pass/Fail and a Seq of messages. What do you think?

Contributor

i like the idea of a pair or case class
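For reference, the response type eventually adopted (InputSchemaValidationResponse, per the commit notes further down) might look roughly like this; the second field name is an assumption:

// Pairs a pass/fail flag with the detailed messages about invalid columns.
case class InputSchemaValidationResponse(isValid: Boolean, errorMessages: Seq[String])

Callers can then branch on isValid and only log the messages when validation fails, which avoids the length comparison entirely.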


logger.info("Fitting probabilistic model to data")
val model = ProxySuspiciousConnectsModel.trainModel(sparkSession, logger, config, proxyRecords)
if (schemaValidationResults.length > InputSchema.ResponseDefaultSize) {
Contributor

cryptic condition

logger.info("Fitting probabilistic model to data")
val model =
FlowSuspiciousConnectsModel.trainModel(sparkSession, logger, config, flowRecords)
if (schemaValidationResults.length > InputSchema.ResponseDefaultSize) {
Contributor

cryptic condition

* @param expectedSchema schema expected by model training and scoring methods
* @return
*/
def validate(inSchema: StructType, expectedSchema: StructType): Seq[String] = {
Contributor

to be explicit... this test should pass if there are extra columns in the dataframe, not just the ones used by the model schema?

this should be commented on and tested

Author

Yeah, the validation is only going to check that the columns required for model training are present; if they are, it's good to go, no matter if there are more columns.
Will do that.
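A small illustration of that behavior, with made-up column names: extra columns in the input schema do not cause a failure as long as every column the model needs is present.

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Columns required for model training (illustrative names only).
val expectedSchema = StructType(Seq(
  StructField("queryName", StringType),
  StructField("timeStamp", StringType)))

// Input data carries an extra column; validation should still pass.
val inSchema = StructType(Seq(
  StructField("queryName", StringType),
  StructField("timeStamp", StringType),
  StructField("extraColumn", LongType)))

// True: every expected field is present, extra columns are ignored.
val isValid = expectedSchema.fields.forall(f => inSchema.fieldNames.contains(f.name))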

@NathanSegerlind
Contributor

LGTM

@NathanSegerlind
Contributor

see you in 9 and 1/2 weeks

@lujacab

lujacab commented Jul 21, 2017

+1

@@ -103,7 +103,11 @@ object SuspiciousConnects {
InvalidDataHandler.showAndSaveInvalidRecords(invalidRecords, config.hdfsScoredConnect, logger)
}

case None => logger.error("Unsupported (or misspelled) analysis: " + analysis)
case None => logger.error(s"Something went wrong while trying to run Suspicious Connects Analysis")
logger.error(s"The value of parameter analysis (provided: $analysis) is any of the valid analysis types? " +
Contributor

Suggested change?: "Is the value of the analysis parameter (provided: $analysis) any of the valid analysis types?"

Author

#BadEnglish
Thanks!

@brandon-edwards
Contributor

I gave one of my picky edit suggestions in a file toward the beginning of the list of changed files. All else looks good to me: +1

Ricardo Barona added 3 commits July 27, 2017 12:51
Added schema validation based on columns required for model training.
Updated each pipeline (DNS, Flow, Proxy) so the schema is validated before performing any other activity.
Added unit test for schema validation.
Updated main activity to show error about bad schema if any.
Application now will print what fields are not matching the expected schema.
Made changes after code review from @NathanSegerlind
- ValidateSchema will return case class with flag isValid and Seq[String] for a list of invalid columns.
Changed flow, dns and proxy pipelines to handle validateSchema response InputSchemaValidationResponse
- Updated unit tests
@rabarona rabarona force-pushed the SPOT-183-Schema_validation_input_data branch from c30854a to 781ccf7 Compare July 27, 2017 18:00
@rabarona rabarona changed the title SPOT-166 Schema validation for input data SPOT-183 Schema validation for input data Jul 27, 2017
@asfgit asfgit merged commit 781ccf7 into apache:master Jul 27, 2017
@anilreddydonthireddy

After merging the changes as part of this PR, I am still getting the issue while running ML for proxy data.

18/06/20 07:45:16 INFO SuspiciousConnectsAnalysis: Running Spark LDA with params alpha = 1.02 beta = 1.001 Max iterations = 20 Optimizer = em
Exception in thread "main" java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike$class.head(IterableLike.scala:107)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
at org.apache.spark.mllib.clustering.EMLDAOptimizer.initialize(LDAOptimizer.scala:166)
at org.apache.spark.mllib.clustering.EMLDAOptimizer.initialize(LDAOptimizer.scala:80)
at org.apache.spark.mllib.clustering.LDA.run(LDA.scala:331)
at org.apache.spot.lda.SpotLDAWrapper$.runLDA(SpotLDAWrapper.scala:132)
at org.apache.spot.proxy.ProxySuspiciousConnectsModel$.trainModel(ProxySuspiciousConnectsModel.scala:155)
at org.apache.spot.proxy.ProxySuspiciousConnectsAnalysis$.run(ProxySuspiciousConnectsAnalysis.scala:106)
at org.apache.spot.SuspiciousConnects$.main(SuspiciousConnects.scala:112)
at org.apache.spot.SuspiciousConnects.main(SuspiciousConnects.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
