SPOT-183 Schema validation for input data #72
Conversation
logger.info("Fitting probabilistic model to data")
val model =
  DNSSuspiciousConnectsModel.trainModel(sparkSession, logger, config, dnsRecords)
if (schemaValidationResults.length > InputSchema.ResponseDefaultSize) {
this is a somewhat cryptic test condition... what is being checked here?
So, the validator returns a Seq of messages (String). If everything went well, there will be one message (the initializing message), but if there are schema errors there will be more than one.
If the length is > InputSchema.ResponseDefaultSize (meaning 1), it indicates there are errors; if it's exactly one, everything is good.
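A minimal sketch of that contract, for illustration only — `ResponseDefaultSize` is taken from the diff above, but the message texts, the `validate` signature over plain column-name sets, and the sample columns are all hypothetical stand-ins for the actual Spot code:

```scala
object InputSchema {
  // The validator always emits one initializing message, so a result
  // longer than this indicates at least one schema error.
  val ResponseDefaultSize = 1

  // Hypothetical simplification: compare plain column-name sets instead
  // of Spark StructTypes.
  def validate(actual: Set[String], required: Set[String]): Seq[String] = {
    val init = Seq("Validating schema...")
    val missing = (required -- actual).toSeq.sorted.map(c => s"Column missing: $c")
    init ++ missing
  }
}

val results = InputSchema.validate(Set("ip", "port"), Set("ip", "port", "query"))
// The condition under review: more messages than the default means errors.
val hasErrors = results.length > InputSchema.ResponseDefaultSize
```

This illustrates why the condition works but also why it reads as cryptic: the pass/fail signal is encoded in the list length rather than stated directly.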
could we just return a binary Pass/Fail value?
Yeah, it can be; I just want to keep the list of columns that aren't working. I can return a tuple or a case class with Pass/Fail and a Seq of messages. What do you think?
i like the idea of a pair or case class
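A sketch of the case-class idea. The name `InputSchemaValidationResponse` is the one adopted later in this thread, but the field names and the simplified `validate` over column-name sets are assumptions, not the actual Spot implementation:

```scala
// Case class making the pass/fail result explicit while still carrying
// the list of problem columns.
case class InputSchemaValidationResponse(isValid: Boolean, errorMessages: Seq[String])

// Hypothetical validator over plain column-name sets.
def validate(actual: Set[String], required: Set[String]): InputSchemaValidationResponse = {
  val missing = (required -- actual).toSeq.sorted.map(c => s"Column missing: $c")
  InputSchemaValidationResponse(isValid = missing.isEmpty, errorMessages = missing)
}

val response = validate(Set("ip", "port"), Set("ip", "port", "query"))
// Callers can now branch on response.isValid instead of comparing list lengths.
```

Compared with the length check above, a caller writes `if (!response.isValid)` and logs `response.errorMessages`, which removes the cryptic comparison entirely.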
logger.info("Fitting probabilistic model to data")
val model = ProxySuspiciousConnectsModel.trainModel(sparkSession, logger, config, proxyRecords)
if (schemaValidationResults.length > InputSchema.ResponseDefaultSize) {
cryptic condition
logger.info("Fitting probabilistic model to data")
val model =
  FlowSuspiciousConnectsModel.trainModel(sparkSession, logger, config, flowRecords)
if (schemaValidationResults.length > InputSchema.ResponseDefaultSize) {
cryptic condition
 * @param expectedSchema schema expected by model training and scoring methods
 * @return
 */
def validate(inSchema: StructType, expectedSchema: StructType): Seq[String] = {
To be explicit... this test should pass if there are extra columns in the dataframe, not just the ones used by the model schema, right?
This should be commented on and tested.
Yeah, the validation checks that the columns required for model training are present; if they are, it's good to go, no matter if there are more columns.
Will do that.
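The intended behavior can be sketched as follows. This is a simplified stand-in, assuming a `Map` of field name to type name rather than Spark's `StructType`, and the message text is illustrative:

```scala
// Simplified stand-in for Spark's StructType: field name -> type name.
type SimpleSchema = Map[String, String]

// Validation passes as long as every required field is present with the
// expected type; extra columns in the input dataframe are ignored.
def validate(inSchema: SimpleSchema, expectedSchema: SimpleSchema): Seq[String] =
  expectedSchema.toSeq.sorted.collect {
    case (name, tpe) if !inSchema.get(name).contains(tpe) =>
      s"Field $name of type $tpe is missing from the input schema"
  }

val expected   = Map("ip" -> "string", "port" -> "int")
val withExtras = Map("ip" -> "string", "port" -> "int", "bytes" -> "long")

val okResult  = validate(withExtras, expected)        // extra "bytes" column: passes
val badResult = validate(Map("ip" -> "string"), expected) // "port" missing: reports it
```

Only missing or mistyped required fields produce messages, which is exactly the "extra columns are fine" semantics discussed above.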
LGTM
see you in 9 and 1/2 weeks
+1
@@ -103,7 +103,11 @@ object SuspiciousConnects {
      InvalidDataHandler.showAndSaveInvalidRecords(invalidRecords, config.hdfsScoredConnect, logger)
    }

-     case None => logger.error("Unsupported (or misspelled) analysis: " + analysis)
+     case None => logger.error(s"Something went wrong while trying to run Suspicious Connects Analysis")
+       logger.error(s"The value of parameter analysis (provided: $analysis) is any of the valid analysis types? " +
Suggested change?: "Is the value of the analysis parameter (provided: $analysis) any of the valid analysis types?"
#BadEnglish
Thanks!
I gave one of my picky edit suggestions in a file toward the beginning of the list of changed files. All else looks good to me: +1
Added schema validation based on columns required for model training. Updated each pipeline (DNS, Flow, Proxy) so the schema is validated before performing any other activity. Added unit test for schema validation. Updated main activity to show error about bad schema if any. Application now will print what fields are not matching the expected schema.
Made changes after code review from @NathanSegerlind - ValidateSchema will return case class with flag isValid and Seq[String] for a list of invalid columns. Changed flow, dns and proxy pipelines to handle validateSchema response InputSchemaValidationResponse - Updated unit tests
analysis type is passed.
c30854a to 781ccf7
After merging the changes as part of this PR, I am still getting the issue while running ML for proxy data. 18/06/20 07:45:16 INFO SuspiciousConnectsAnalysis: Running Spark LDA with params alpha = 1.02 beta = 1.001 Max iterations = 20 Optimizer = em
This PR implements changes requested in SPOT-183 and aims to resolve issues reported in SPOT-174 and SPOT-149. It validates the input dataset (by reading the dataframe schema) and checks whether it contains the schema required for model training.
Main changes
Added schema validation based on columns required for model training.
Updated each pipeline (DNS, Flow, Proxy) so the schema is validated before performing any other activity.
Updated main activity to show an error about a bad schema, if any.
The main application will now print which fields do not match the expected schema.
Added unit test for schema validation.
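Putting the changes above together, a pipeline's pre-training guard can be sketched like this. The function name, the `Either` shape, and the message text are illustrative assumptions, not the actual Spot API:

```scala
// Hypothetical guard mirroring the described flow: validate the input
// columns before any other activity, and report non-matching fields.
def runPipeline(inputColumns: Set[String],
                requiredColumns: Set[String]): Either[Seq[String], String] = {
  val missing = (requiredColumns -- inputColumns).toSeq.sorted
  if (missing.nonEmpty)
    Left(missing.map(c => s"Field $c does not match the expected schema"))
  else
    Right("training model") // placeholder for the actual trainModel call
}

val failed = runPipeline(Set("ip"), Set("ip", "port"))          // bad schema
val passed = runPipeline(Set("ip", "port", "extra"), Set("ip", "port")) // extras ok
```

On `Left`, the main activity logs each message and stops; on `Right`, training proceeds. Extra input columns never block the run.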