
(LDA): scalability issue in updateCounter. #30

Open
hucheng opened this issue Aug 26, 2015 · 2 comments

Comments

@hucheng
Contributor

hucheng commented Aug 26, 2015

The current updateCounter is implemented via aggregateMessages. It creates a dense array whose length equals the number of vertices in each partition, where each element is a sparseVector. When the number of vertices in one partition is huge (consider 1B vertices), that array cannot be held in memory.
This can be solved by:

  1. A better graph partitioning approach, so that the number of vertices per partition stays bounded even when the graph (input data) is very large.
  2. Use aggregateByKey. It has several advantages over aggregateMessages:
    (1). For each edge, aggregateMessages must allocate a new sparseVector (the edge attribute is an array), plus another new sparseVector holding the result of sparseVector + sparseVector.
    (2). The reason not to use reduceByKey is that aggregateByKey's seqOp, (U, V) => U, does not require V to have the same type as U, so there is no need to build a sparseVector from the edge attribute array. Besides, to avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.
    (3). The side effect of aggregateByKey is that it needs a sort phase when sort-based shuffle is used.
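For illustration, here is a minimal stand-alone Python sketch of the aggregateByKey pattern described above (this is a toy single-machine simulation, not the project's Scala code; all names are hypothetical): seqOp folds an edge's topic-assignment array directly into a mutable sparse counter, and combOp merges two counters in place, so no per-edge sparseVector is allocated.

```python
from collections import defaultdict

def seq_op(counter, edge_topics):
    # Fold one edge's topic assignments (a plain int array) into the
    # mutable sparse counter; no sparse vector is built for the edge.
    for t in edge_topics:
        counter[t] += 1
    return counter  # mutate and return the first argument, no new U

def comb_op(c1, c2):
    # Merge two partial counters in place into the first one.
    for t, n in c2.items():
        c1[t] += n
    return c1

def aggregate_by_key(pairs, zero_factory, seq_op, comb_op):
    # Toy single-machine stand-in for RDD.aggregateByKey.
    per_key = {}
    for key, value in pairs:
        acc = per_key.setdefault(key, zero_factory())
        per_key[key] = seq_op(acc, value)
    return per_key

# (vertexId, edge topic-assignment array) pairs
edges = [(1, [0, 2]), (1, [2]), (2, [1, 1, 0])]
counters = aggregate_by_key(edges, lambda: defaultdict(int), seq_op, comb_op)
print(dict(counters[1]))  # {0: 1, 2: 2}
```

The key point is the asymmetric signature (U, V) => U: the value side stays a raw int array, and the accumulator is reused across edges rather than reallocated.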
@bhoppi
Contributor

bhoppi commented Sep 11, 2015

The 2nd issue is solved now. We use neither aggregateMessages nor aggregateByKey; we do the aggregation ourselves. The implementation is similar to aggregateMessages, so no sorting is needed, and it works in-place, so it needs little extra memory and does not create many new sparseVectors.
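A rough single-machine Python sketch of such an in-place scheme (an assumption about the approach, not the actual implementation; the scatter-to-both-endpoints shape and all names are hypothetical): each vertex in a partition gets one reusable mutable counter, and every edge's topic assignments are added directly into it, so there is no sort phase and no per-edge allocation.

```python
from collections import defaultdict

def update_counters(partition_edges):
    # One mutable sparse counter per vertex in this partition; edge
    # contributions are folded in directly, so no per-edge sparse
    # vectors are allocated and no shuffle sort is required.
    counters = defaultdict(lambda: defaultdict(int))
    for src, dst, topics in partition_edges:
        for t in topics:
            counters[src][t] += 1  # scatter to both endpoints, in place
            counters[dst][t] += 1
    return counters

# (srcId, dstId, topic-assignment array) triples for one partition
edges = [(1, 2, [0, 0, 3]), (2, 3, [1])]
c = update_counters(edges)
print(dict(c[2]))  # {0: 2, 3: 1, 1: 1}
```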

@hucheng
Contributor Author

hucheng commented Sep 12, 2015

Great. As before, please share the PR here.
