
Approach handling 4 billion nodes on modern hardware #3

Open
xeoncross opened this issue Oct 24, 2016 · 4 comments

@xeoncross commented Oct 24, 2016

The uint type tops out at 4,294,967,295, so this system can handle graphs with up to roughly 4.3 billion nodes. Currently, a graph with only 1,000,000 nodes (with 100,000,000 links) takes 2.6 GB of memory to process. Approaching 1 billion nodes (with 10 or 100 links each) would require a massive server.

Is there any way we could split the processing up by sharding the graph into smaller parts rather than processing the whole thing in memory? Perhaps we could even combine the graph.edges and graph.nodes weight values since they contain duplicate weights (or swap them for pointers to a shared float64 value).
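For illustration, here's a minimal sketch of that deduplication idea in Go (the types and field names are hypothetical, not this library's actual ones): each distinct weight is stored once, and nodes and edges reference it by index instead of each holding their own float64.

```go
// Hypothetical sketch: intern weights so that nodes and edges with
// the same value share one stored float64 instead of duplicating it.
type Graph struct {
	weights []float64                    // one entry per distinct weight
	index   map[float64]uint32           // weight -> position in weights
	nodes   map[uint32]uint32            // node id -> weight index
	edges   map[uint32]map[uint32]uint32 // from -> to -> weight index
}

// intern returns the index of w, adding it on first sight.
func (g *Graph) intern(w float64) uint32 {
	if i, ok := g.index[w]; ok {
		return i
	}
	i := uint32(len(g.weights))
	g.weights = append(g.weights, w)
	g.index[w] = i
	return i
}
```

Whether this actually saves memory depends on how many duplicate weights the graph really contains, since the indices themselves cost 4 bytes each.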

@alixaxel (Owner) commented Nov 2, 2016

It's definitely possible, but it would involve preprocessing the graph to analyze its connectivity and identify disconnected sub-graphs that could be calculated independently. Also, with this approach I think you cannot normalize the edge values so that their sum amounts to 1, though I'm not 100% sure.

That would also mean you would have to know beforehand after how many iterations to stop.
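As a rough illustration of the preprocessing step mentioned above (a sketch only, not part of this library), the weakly connected components can be found with a union-find pass over the edge list; each component could then be ranked independently, subject to the normalization caveat:

```go
// Union-find sketch for splitting a graph into weakly connected
// components that can be processed independently.
type DSU struct{ parent []uint32 }

func NewDSU(n uint32) *DSU {
	p := make([]uint32, n)
	for i := range p {
		p[i] = uint32(i)
	}
	return &DSU{parent: p}
}

// Find returns the representative of x's component.
func (d *DSU) Find(x uint32) uint32 {
	for d.parent[x] != x {
		d.parent[x] = d.parent[d.parent[x]] // path halving
		x = d.parent[x]
	}
	return x
}

// Union merges the components containing a and b;
// call it once per edge (from, to) in the graph.
func (d *DSU) Union(a, b uint32) {
	if ra, rb := d.Find(a), d.Find(b); ra != rb {
		d.parent[ra] = rb
	}
}
```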

My current needs involve processing a little over 10 million nodes and for that 32GB of RAM is enough.

I suggest you look into https://github.com/GraphChi/graphchi-cpp and the respective papers; it's an amazing piece of software, and the parallel sliding window is a very nice idea. Another alternative could be something like InfiniMem, but from my benchmarks it's nowhere near as efficient.

If you want to add graph sharding to this project, though, PRs are more than welcome. :)

@xeoncross (Author)
GraphChi looks like a great project; thanks for those links. Unfortunately, I haven't touched C++ in a long time and don't remember much. I guess I'll just need to rent an EC2 r3.2xlarge for a couple of hours to fit a larger dataset in memory.

@alixaxel (Owner) commented Nov 7, 2016

@xeoncross Throwing money at the problem is usually the easiest route. :)

But there's also a GraphChi version in Java (see https://github.com/GraphChi/graphchi-java/blob/master/src/main/java/edu/cmu/graphchi/apps/WeightedPagerank.java) if you feel more comfortable with that and want to tackle the problem. ;)

@ckingdev
I know this is pretty old, but I thought I'd comment in case it helps anyone:

There are a couple of ways you could improve the ability to scale, but at the size of problem you're considering, you're running into fundamental memory requirements. I see you're storing the nodes in a map and the edges in a two-level map. That results in a good deal of space and time overhead from the hash maps. The bare minimum you could get away with for a 1 billion x 1 billion matrix with 100 nonzero elements per node is about 1.2 TB: that's using a CSR matrix with 32-bit values, where the two index arrays have to be 64-bit because the number of edges exceeds the 32-bit range. With hash maps, you should expect a roughly constant multiple of that for memory usage.
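As a back-of-the-envelope check of that 1.2 TB figure (same assumptions: 32-bit weights, 64-bit column indices and row pointers):

```go
// Rough CSR memory estimate for 1e9 nodes with 100 edges each,
// assuming 32-bit weights and 64-bit column indices / row pointers.
const (
	nodes = 1_000_000_000
	nnz   = nodes * 100        // 1e11 nonzero entries (edges)
	vals  = nnz * 4            // float32 weights:       ~400 GB
	cols  = nnz * 8            // 64-bit column indices: ~800 GB
	rows  = (nodes + 1) * 8    // 64-bit row pointers:   ~8 GB
	total = vals + cols + rows // ~1.2 TB
)
```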

If it's important to speed this up, though, switching to a sparse matrix and using BLAS for the sparse matrix - dense vector dot product would do wonders. It would immediately reduce memory usage and increase performance, and you'd get parallelism for free, if that would actually benefit your problem; parallel sparse matrix multiplication is very complex, and the communication required often outweighs the benefit.
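A minimal sketch of that sparse matrix - dense vector step in Go (a serial CSR kernel; a BLAS-backed or parallel version follows the same access pattern):

```go
// CSR (compressed sparse row) matrix with 32-bit weights and
// 64-bit indices, matching the estimate above.
type CSR struct {
	rowPtr []uint64  // len n+1; rowPtr[i]..rowPtr[i+1] bounds row i's entries
	cols   []uint64  // column index of each nonzero
	vals   []float32 // weight of each nonzero
}

// MulVec computes y = M*x, the core of one PageRank iteration.
func (m *CSR) MulVec(x, y []float64) {
	for i := 0; i+1 < len(m.rowPtr); i++ {
		var sum float64
		for k := m.rowPtr[i]; k < m.rowPtr[i+1]; k++ {
			sum += float64(m.vals[k]) * x[m.cols[k]]
		}
		y[i] = sum
	}
}
```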

Lastly, if you need to work on matrices larger than main memory, this can be done as a block matrix multiplication. You could keep only the block of the matrix that you're working on in memory, load the next, combine results, etc.
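And a rough shape of that out-of-core block approach, reusing the CSR type above (loadBlock is a hypothetical function that reads a range of rows from disk):

```go
// Out-of-core sketch: only one row block of the matrix is resident
// at a time. loadBlock is a hypothetical loader that returns rows
// [lo, hi) of the full matrix as a CSR block.
func BlockedMulVec(n, blockRows int, x, y []float64, loadBlock func(lo, hi int) *CSR) {
	for lo := 0; lo < n; lo += blockRows {
		hi := lo + blockRows
		if hi > n {
			hi = n
		}
		block := loadBlock(lo, hi) // read the next row block from disk
		block.MulVec(x, y[lo:hi])  // results for those rows only
	}
}
```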

"I/O-efficient techniques for computing PageRank" gives an overview of a few methods for doing that (full text PDF).
