Skip to content
This repository has been archived by the owner on Apr 21, 2023. It is now read-only.

Spot-128 Make every pipeline to return results to main program #47

Merged

Conversation

rabarona
Copy link

@rabarona rabarona commented Jun 7, 2017

This PR implements changes requested on JIRA issue SPOT-128. Includes the elimination of any call to save/write to file system in any place other than the main application.
Also, includes some code cleanup and refactoring.

Main changes

  • Modified the run method in each pipeline so that it returns two data frames, one with the final results (scored) and another with the invalid records.

  • Removed any remaining call to save so that the only part of code writing to the file system is the main application.

  • Modified SuspiciousConnects main application to write the results from whatever pipeline is being analyzed.

Other code refactoring and cleanup:

  • Modified filter and select methods in each pipeline (*SuspiciousConnectsAnalysis.scala) to just return a set of clean or invalid records removing the select call in case users want to do further analysis and do not drop columns.

  • Removed the concept of corrupt records leaving only invalid records. Updated filtering/validations so only valid records are processed.

  • Code refactoring/consistency on *SuspiciousConnectsModel.scala changing trainNewModel to trainModel

  • Fixed a couple of typos in DNSSuspiciousConnectsAnalysisTest.scala

Ricardo Barona added 2 commits June 7, 2017 12:46
…oke a single save to HDSF method.

This change mainly includes the elimination of any call to save/write to file system in any place other than the main application.
Also includes some code cleanup and refactoring.

Modified the run method in each pipeline so it returns two data frames, one with the final results (scored) and another with the invalid records. Removed any remaining call to save so that the only part of code writing to the file system is the main application.
Modified SuspiciousConnects main application to write the results from whatever pipeline is being analyzed.
Modified filter and select methods in each pipeline to just return a set of clean or invalid records; removed the select part so users can do further analysis if they want to see all the columns.
Removed the concept of corrupt records leaving only invalid records. Updated filtering/validations so only valid records are processed.
…oke a single save to HDSF method.

This change mainly includes the elimination of any call to save/write to file system in any place other than the main application.
Also includes some code cleanup and refactoring.

- Code refactoring to unit tests
- Code refactoring/consistency on *SuspiciousConnectsModel.scala changing trainNewModel to trainModel
- Fixed a couple of typos in DNSSuspiciousConnectsAnalysisTest.scala
- Fixed merge issues in FlosSuspiciousConnectsModelTest.scala
- Rebased this branch with latest version on incubator-spot/master

val orderedDNSRecords = filteredDNSRecords.orderBy(Score)
val filteredScored = filterScoredRecords(scoredDNSRecords, config.threshold).orderBy(Score)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are two calls to orderBy(Score) here ... pretty sure that it gets optimized out, but still...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed that but then it got reverted when I did rebase with master. Will change again, thanks.

val proxyRecords = filterRecords(inputProxyRecords)
.select(InSchema: _*)
.na.fill(DefaultUserAgent, Seq(UserAgent))
.na.fill(DefaultResponseContentType, Seq(ResponseContentType))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this imputation behavior documented somewhere?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, it has been there for a while but no documentation. I'm going to add a JIRA issue for that documentation.

Copy link
Contributor

@NathanSegerlind NathanSegerlind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good, a couple small changes requested

…oke a single save to HDSF method.

This change mainly includes the elimination of any call to save/write to file system in any place other than the main application.
Also includes some code cleanup and refactoring.

- Code fix, removed double call to orderBy.
@NathanSegerlind
Copy link
Contributor

+1

Copy link
Contributor

@lujangus lujangus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good.

@lujacab
Copy link

lujacab commented Jun 9, 2017

+1

1 similar comment
@raypanduro
Copy link
Contributor

+1

@asfgit asfgit merged commit 856f04f into apache:master Jun 9, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
6 participants