Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16697][ML][MLLib] improve LDA submitMiniBatch method to avoid redundant RDD computation #14335

Closed
wants to merge 2 commits into from

Conversation

WeichenXu123
Copy link
Contributor

What changes were proposed in this pull request?

In LDAOptimizer.submitMiniBatch, do persist on stats: RDD[(BDM[Double], List[BDV[Double]])]
and also move the place of unpersisting expElogbetaBc broadcast variable,
to avoid the expElogbetaBc broadcast variable to be unpersisted too early,
and update previous expElogbetaBc.unpersist() into expElogbetaBc.destroy(false)

How was this patch tested?

Existing test.

@SparkQA
Copy link

SparkQA commented Jul 24, 2016

Test build #62771 has finished for PR 14335 at commit e5ed33b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
gammaPart = gammad :: gammaPart
}
Iterator((stat, gammaPart))
}
}.persist(StorageLevel.MEMORY_AND_DISK)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little risky to assume this can be persisted, but I think it's good idea unless it can be optimized into one pass

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen The type of the RDD to be persisted here is fixed to RDD[(BDM[Double], List[BDV[Double]])] so what's the risk of cannot be persisted ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I mean, it's not always clear that it's OK to use the memory/disk, but, I think it's justified here. Yes it will serialize.

@WeichenXu123
Copy link
Contributor Author

@srowen
stats.unpersist(false) ==> stats.unpersist() updated.
is there anything else need to update ?

@SparkQA
Copy link

SparkQA commented Jul 26, 2016

Test build #62853 has finished for PR 14335 at commit 3d8c60d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jul 26, 2016

Merged to master

@asfgit asfgit closed this in 4c96955 Jul 26, 2016
@WeichenXu123 WeichenXu123 deleted the improve_LDA branch July 31, 2016 09:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants