Spot-128 Make every pipeline to return results to main program #47
Conversation
…oke a single save to HDFS method. This change mainly includes the elimination of any call to save/write to the file system anywhere other than the main application. Also includes some code cleanup and refactoring. - Modified the run method in each pipeline so it returns two data frames: one with the final (scored) results and one with the invalid records. - Removed any remaining call to save, so the only code writing to the file system is the main application. - Modified the SuspiciousConnects main application to write the results from whichever pipeline is being analyzed. - Modified the filter and select methods in each pipeline to return only clean or invalid records; removed the select step so users keep all columns for further analysis. - Removed the concept of corrupt records, leaving only invalid records, and updated filtering/validations so only valid records are processed.
…oke a single save to HDFS method. This change mainly includes the elimination of any call to save/write to the file system anywhere other than the main application. Also includes some code cleanup and refactoring. - Code refactoring of unit tests - Code refactoring/consistency in *SuspiciousConnectsModel.scala, changing trainNewModel to trainModel - Fixed a couple of typos in DNSSuspiciousConnectsAnalysisTest.scala - Fixed merge issues in FlowSuspiciousConnectsModelTest.scala - Rebased this branch onto the latest version of incubator-spot/master
val orderedDNSRecords = filteredDNSRecords.orderBy(Score)
val filteredScored = filterScoredRecords(scoredDNSRecords, config.threshold).orderBy(Score)
there are two calls to orderBy(Score) here ... pretty sure that it gets optimized out, but still...
I removed that, but it got reverted when I rebased with master. Will change it again, thanks.
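The redundancy flagged in this thread is that the records are ordered both before and after the threshold filter. A minimal plain-Scala sketch of the intended fix, ordering exactly once (the record type and score field are hypothetical stand-ins for the scored DataFrame):

```scala
// Hypothetical scored record; a stand-in for a row of the scored DataFrame.
case class ScoredRecord(ip: String, score: Double)

// Filter by the threshold first, then order exactly once.
// Sorting both before and after the filter duplicates work,
// which is the redundancy pointed out in this review thread.
def filterAndOrder(records: Seq[ScoredRecord], threshold: Double): Seq[ScoredRecord] =
  records.filter(_.score <= threshold).sortBy(_.score)
```

Even if Spark's optimizer can eliminate the duplicate sort, keeping a single explicit ordering makes the intent clear.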
val proxyRecords = filterRecords(inputProxyRecords)
  .select(InSchema: _*)
  .na.fill(DefaultUserAgent, Seq(UserAgent))
  .na.fill(DefaultResponseContentType, Seq(ResponseContentType))
is this imputation behavior documented somewhere?
I don't think so; it has been there for a while, but there's no documentation. I'm going to add a JIRA issue for that documentation.
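The imputation under discussion replaces null UserAgent and ResponseContentType values with defaults before scoring. A minimal plain-Scala sketch of the same idea, using Option fields in place of nullable DataFrame columns (the record type and default values below are illustrative, not the project's actual constants):

```scala
// Illustrative record type; Option fields stand in for nullable DataFrame columns.
case class ProxyRecord(host: String,
                       userAgent: Option[String],
                       responseContentType: Option[String])

// Hypothetical defaults; the real values live in the proxy schema constants.
val DefaultUserAgent = "-"
val DefaultResponseContentType = "-"

// Replace missing values with defaults, mirroring what
// DataFrame.na.fill does for the UserAgent and ResponseContentType columns.
def fillDefaults(r: ProxyRecord): ProxyRecord =
  r.copy(
    userAgent = r.userAgent.orElse(Some(DefaultUserAgent)),
    responseContentType = r.responseContentType.orElse(Some(DefaultResponseContentType)))
```

Present values are left untouched; only missing ones are filled, which is why the behavior is easy to miss without documentation.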
looks pretty good, a couple small changes requested
…oke a single save to HDFS method. This change mainly includes the elimination of any call to save/write to the file system anywhere other than the main application. Also includes some code cleanup and refactoring. - Code fix: removed the double call to orderBy.
+1
Good.
+1
1 similar comment
+1
This PR implements the changes requested in JIRA issue SPOT-128: it eliminates every call that saves/writes to the file system anywhere other than the main application.
It also includes some code cleanup and refactoring.
Main changes
Modified the run method in each pipeline so that it returns two data frames, one with the final results (scored) and another with the invalid records.
Removed any remaining call to save so that the only part of code writing to the file system is the main application.
Modified the SuspiciousConnects main application to write the results from whichever pipeline is being analyzed.
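The new contract between the pipelines and the main application can be sketched as follows. The names and types below are illustrative stand-ins, not the exact project API: each pipeline's run hands back both result sets, and only the caller persists anything.

```scala
// Illustrative stand-ins for the two DataFrames each pipeline now returns.
case class PipelineResults(scored: Seq[(String, Double)], invalid: Seq[String])

trait Pipeline {
  // run no longer writes anything; it returns scored and invalid records.
  def run(input: Seq[String]): PipelineResults
}

// Toy pipeline: records that parse as "host,score" are scored, the rest are invalid.
object ToyPipeline extends Pipeline {
  def run(input: Seq[String]): PipelineResults = {
    val (valid, invalid) = input.partition(_.split(",").length == 2)
    val scored = valid.map { line =>
      val Array(host, score) = line.split(",")
      (host, score.toDouble)
    }
    PipelineResults(scored, invalid)
  }
}

// Only the main application persists results; in the real code this is
// the single save-to-HDFS step in the SuspiciousConnects application.
def saveResults(results: PipelineResults): Unit = {
  // e.g. write results.scored and results.invalid to their output paths
}
```

Centralizing the write in one place means every pipeline stays side-effect free and the output format is controlled by a single code path.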
Other code refactoring and cleanup:
Modified the filter and select methods in each pipeline (*SuspiciousConnectsAnalysis.scala) to return only a set of clean or invalid records. Removed the select call so that users who want to do further analysis keep all columns.
Removed the concept of corrupt records, leaving only invalid records. Updated filtering/validations so only valid records are processed.
Code refactoring/consistency in *SuspiciousConnectsModel.scala, changing trainNewModel to trainModel.
Fixed a couple of typos in DNSSuspiciousConnectsAnalysisTest.scala
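The filtering change above reduces record validation to a single two-way split: every record is either valid or invalid, with no separate "corrupt" category, and all columns are kept in both sets. A toy illustration using Scala's partition (the record type and validation rule are hypothetical):

```scala
// Illustrative DNS record; the fields and validity rule are hypothetical.
case class DnsRecord(queryName: String, queryType: Option[Int])

// One split, two outcomes: valid or invalid. The former third category,
// "corrupt", is gone; such records simply land in the invalid set,
// and all fields are kept so users can inspect them later.
def splitRecords(records: Seq[DnsRecord]): (Seq[DnsRecord], Seq[DnsRecord]) =
  records.partition(r => r.queryName.nonEmpty && r.queryType.isDefined)
```

Because both sets are returned with their full columns, downstream users can analyze why a record was rejected instead of losing it silently.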