Spot-128 Make every pipeline to return results to main program #47
Conversation
…oke a single save to HDFS method. This change mainly includes the elimination of any call to save/write to the file system anywhere other than the main application. Also includes some code cleanup and refactoring. - Modified the run method in each pipeline so it returns two data frames: one with the final (scored) results and one with the invalid records. - Removed any remaining call to save, so the only code writing to the file system is the main application. - Modified the SuspiciousConnects main application to write the results from whichever pipeline is being analyzed. - Modified the filter and select methods in each pipeline to return only clean or invalid records; removed the select step so users keep all columns for further analysis. - Removed the concept of corrupt records, leaving only invalid records, and updated filtering/validations so only valid records are processed.
…oke a single save to HDFS method. This change mainly includes the elimination of any call to save/write to the file system anywhere other than the main application. Also includes some code cleanup and refactoring. - Code refactoring of unit tests - Code refactoring/consistency in *SuspiciousConnectsModel.scala, changing trainNewModel to trainModel - Fixed a couple of typos in DNSSuspiciousConnectsAnalysisTest.scala - Fixed merge issues in FlowSuspiciousConnectsModelTest.scala - Rebased this branch onto the latest version of incubator-spot/master
val orderedDNSRecords = filteredDNSRecords.orderBy(Score)
val filteredScored = filterScoredRecords(scoredDNSRecords, config.threshold).orderBy(Score)
there are two calls to orderBy(Score) here ... pretty sure that it gets optimized out, but still...
I removed that, but it got reverted when I rebased with master. Will change it again, thanks.
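The redundancy flagged in this thread is that the records are ordered both before and after the threshold filter. A minimal plain-Scala sketch of the intended fix, ordering exactly once (the record type and score field are hypothetical stand-ins for the scored DataFrame):

```scala
// Hypothetical scored record; a stand-in for a row of the scored DataFrame.
case class ScoredRecord(ip: String, score: Double)

// Filter by the threshold first, then order exactly once.
// Sorting both before and after the filter duplicates work,
// which is the redundancy pointed out in this review thread.
def filterAndOrder(records: Seq[ScoredRecord], threshold: Double): Seq[ScoredRecord] =
  records.filter(_.score <= threshold).sortBy(_.score)
```

Even if Spark's optimizer can eliminate the duplicate sort, keeping a single explicit ordering makes the intent clear.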
val proxyRecords = filterRecords(inputProxyRecords)
  .select(InSchema: _*)
  .na.fill(DefaultUserAgent, Seq(UserAgent))
  .na.fill(DefaultResponseContentType, Seq(ResponseContentType))
is this imputation behavior documented somewhere?
I don't think so; it has been there for a while, but there's no documentation. I'm going to add a JIRA issue for that documentation.
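The imputation under discussion replaces null UserAgent and ResponseContentType values with defaults before scoring. A minimal plain-Scala sketch of the same idea, using Option fields in place of nullable DataFrame columns (the record type and default values below are illustrative, not the project's actual constants):

```scala
// Illustrative record type; Option fields stand in for nullable DataFrame columns.
case class ProxyRecord(host: String,
                       userAgent: Option[String],
                       responseContentType: Option[String])

// Hypothetical defaults; the real values live in the proxy schema constants.
val DefaultUserAgent = "-"
val DefaultResponseContentType = "-"

// Replace missing values with defaults, mirroring what
// DataFrame.na.fill does for the UserAgent and ResponseContentType columns.
def fillDefaults(r: ProxyRecord): ProxyRecord =
  r.copy(
    userAgent = r.userAgent.orElse(Some(DefaultUserAgent)),
    responseContentType = r.responseContentType.orElse(Some(DefaultResponseContentType)))
```

Present values are left untouched; only missing ones are filled, which is why the behavior is easy to miss without documentation.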
looks pretty good, a couple small changes requested
…oke a single save to HDFS method. This change mainly includes the elimination of any call to save/write to the file system anywhere other than the main application. Also includes some code cleanup and refactoring. - Code fix: removed the double call to orderBy.
+1
Good.
+1
1 similar comment
+1
This PR implements the changes requested in JIRA issue SPOT-128: it eliminates every call that saves/writes to the file system anywhere other than the main application.
It also includes some code cleanup and refactoring.
Main changes
Modified the run method in each pipeline so that it returns two data frames, one with the final results (scored) and another with the invalid records.
Removed any remaining call to save so that the only part of code writing to the file system is the main application.
Modified the SuspiciousConnects main application to write the results from whichever pipeline is being analyzed.
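The new contract between the pipelines and the main application can be sketched as follows. The names and types below are illustrative stand-ins, not the exact project API: each pipeline's run hands back both result sets, and only the caller persists anything.

```scala
// Illustrative stand-ins for the two DataFrames each pipeline now returns.
case class PipelineResults(scored: Seq[(String, Double)], invalid: Seq[String])

trait Pipeline {
  // run no longer writes anything; it returns scored and invalid records.
  def run(input: Seq[String]): PipelineResults
}

// Toy pipeline: records that parse as "host,score" are scored, the rest are invalid.
object ToyPipeline extends Pipeline {
  def run(input: Seq[String]): PipelineResults = {
    val (valid, invalid) = input.partition(_.split(",").length == 2)
    val scored = valid.map { line =>
      val Array(host, score) = line.split(",")
      (host, score.toDouble)
    }
    PipelineResults(scored, invalid)
  }
}

// Only the main application persists results; in the real code this is
// the single save-to-HDFS step in the SuspiciousConnects application.
def saveResults(results: PipelineResults): Unit = {
  // e.g. write results.scored and results.invalid to their output paths
}
```

Centralizing the write in one place means every pipeline stays side-effect free and the output format is controlled by a single code path.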
Other code refactoring and cleanup:
Modified the filter and select methods in each pipeline (*SuspiciousConnectsAnalysis.scala) to return only a set of clean or invalid records. Removed the select call so that users who want to do further analysis keep all columns.
Removed the concept of corrupt records, leaving only invalid records. Updated filtering/validations so only valid records are processed.
Code refactoring/consistency in *SuspiciousConnectsModel.scala, changing trainNewModel to trainModel.
Fixed a couple of typos in DNSSuspiciousConnectsAnalysisTest.scala
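The filtering change above reduces record validation to a single two-way split: every record is either valid or invalid, with no separate "corrupt" category, and all columns are kept in both sets. A toy illustration using Scala's partition (the record type and validation rule are hypothetical):

```scala
// Illustrative DNS record; the fields and validity rule are hypothetical.
case class DnsRecord(queryName: String, queryType: Option[Int])

// One split, two outcomes: valid or invalid. The former third category,
// "corrupt", is gone; such records simply land in the invalid set,
// and all fields are kept so users can inspect them later.
def splitRecords(records: Seq[DnsRecord]): (Seq[DnsRecord], Seq[DnsRecord]) =
  records.partition(r => r.queryName.nonEmpty && r.queryType.isDefined)
```

Because both sets are returned with their full columns, downstream users can analyze why a record was rejected instead of losing it silently.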