This repo contains the core of the FlashX project, which provides big data analytics tools that operate on graphs and matrices. As such, FlashX covers a large range of data analysis tasks. All tools in FlashX use solid-state drives (SSDs) to scale data analysis to large datasets on a single machine, while SSD-based solutions run almost as fast as in-memory solutions. The main components of FlashX are FlashGraph and FlashMatrix.
FlashGraph is a general-purpose graph analysis framework that exposes a vertex-centric programming interface for expressing a wide variety of graph algorithms. FlashGraph scales graph computation to large graphs by keeping the edges of a graph on SSDs and the computation state in memory. With smart I/O scheduling, FlashGraph achieves performance comparable to state-of-the-art in-memory graph analysis frameworks and significantly outperforms state-of-the-art distributed graph analysis frameworks, while scaling to graphs with billions of vertices and hundreds of billions of edges. Please see the performance results.
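To make the vertex-centric model more concrete, here is a minimal, self-contained Python sketch of breadth-first search written in that style. It is not FlashGraph's API: the names `BFSVertex`, `run_bfs`, and the `activate` callback are hypothetical and chosen only for illustration. The per-vertex objects play the role of the computation state kept in memory, and the `edges` dictionary stands in for the edge lists that FlashGraph would read from SSDs.

```python
class BFSVertex:
    """Per-vertex computation state (hypothetical; not a FlashGraph class)."""

    def __init__(self):
        self.distance = None  # None means the vertex has not been reached

    def run(self, level, neighbors, activate):
        """Vertex program: executed once for each active vertex per iteration."""
        if self.distance is not None:
            return  # already visited in an earlier iteration
        self.distance = level
        for neighbor in neighbors:
            activate(neighbor)  # ask the engine to run this neighbor next


def run_bfs(edges, num_vertices, source):
    """Toy engine: runs the vertex program level by level from `source`.

    `edges` maps a vertex id to its list of neighbor ids. It is an in-memory
    stand-in for edge lists stored on SSDs (an assumption of this sketch).
    """
    state = [BFSVertex() for _ in range(num_vertices)]
    active = {source}
    level = 0
    while active:
        next_active = set()
        for vid in sorted(active):
            state[vid].run(level, edges.get(vid, []), next_active.add)
        # Only vertices that are still unvisited need to run again.
        active = {v for v in next_active if state[v].distance is None}
        level += 1
    return [vertex.distance for vertex in state]


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
    print(run_bfs(graph, 5, 0))
```

The point of the sketch is the division of labor: the algorithm is expressed only as what a single vertex does when it is active, while the engine decides which vertices run and when their edges are fetched.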
FlashMatrix is a matrix computation engine that provides a small set of generalized matrix operations on sparse and dense matrices for expressing a wide variety of data mining and machine learning algorithms. For certain graph algorithms such as PageRank, which can be formulated as sparse matrix multiplication, FlashMatrix is able to significantly outperform FlashGraph.
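The formulation of PageRank as sparse matrix multiplication can be sketched in plain Python as follows. This is an illustrative sketch, not FlashMatrix code: the function name `pagerank`, the coordinate-list matrix layout, and the default damping factor of 0.85 are assumptions made for the example. Each iteration computes `rank = (1 - d) / n + d * (M @ rank)`, where `M` is the column-stochastic transition matrix and the rank held by vertices without outgoing edges is redistributed uniformly.

```python
def pagerank(edges, num_vertices, damping=0.85, iterations=50):
    """Power-iteration PageRank expressed as repeated sparse mat-vec products.

    `edges` is a list of (src, dst) pairs. The transition matrix is stored in
    coordinate form as (row, col, value) triples with value 1 / outdeg(src).
    """
    out_degree = [0] * num_vertices
    for src, _dst in edges:
        out_degree[src] += 1

    # Sparse transition matrix M: M[dst][src] = 1 / outdeg(src).
    entries = [(dst, src, 1.0 / out_degree[src]) for src, dst in edges]

    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        # Sparse matrix-vector multiplication: product = M @ rank.
        product = [0.0] * num_vertices
        for row, col, value in entries:
            product[row] += value * rank[col]

        # Rank held by dangling vertices (no out-edges) is spread uniformly.
        dangling = sum(rank[v] for v in range(num_vertices) if out_degree[v] == 0)
        base = (1.0 - damping) / num_vertices + damping * dangling / num_vertices
        rank = [base + damping * p for p in product]
    return rank


if __name__ == "__main__":
    cycle = [(0, 1), (1, 2), (2, 0)]
    print(pagerank(cycle, 3))
```

Because the whole iteration reduces to one sparse matrix-vector product plus a few vector operations, an engine optimized for sparse matrix multiplication can execute it without per-vertex scheduling overhead.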
FlashX exposes C++, R and Python programming interfaces. The R and Python interfaces are highly compatible with the R base package and NumPy, respectively. As such, users can execute R and Python machine learning code on FlashX with little or no modification. Our goal is to eventually make the R and Python interfaces fully compatible with those of native R and NumPy.
- FlashR provides many matrix operations in the R base package.
- FlashGraphR exposes many graph algorithms in FlashGraph to R.
- FlashR-learn is a machine learning library implemented completely with FlashR.
- FlashPy provides many array operations in NumPy.
Da Zheng, Disa Mhembere, Joshua T. Vogelstein, Carey E. Priebe, and Randal Burns, “FlashMatrix: Parallel, Scalable Data Analysis with Generalized Matrix Operations using Commodity SSDs,” arXiv preprint arXiv:1604.06414, 2016. [pdf]
Da Zheng, Disa Mhembere, Vince Lyzinski, Joshua Vogelstein, Carey E. Priebe, and Randal Burns, “Semi-External Memory Sparse Matrix Multiplication on Billion-Node Graphs,” Transactions on Parallel and Distributed Systems, 2016. [pdf]
Heng Wang, Da Zheng, Randal Burns, and Carey Priebe, “Active Community Detection in Massive Graphs,” SDM-Networks, 2015. [pdf]
Da Zheng, Randal Burns, and Alexander S. Szalay, “Toward Millions of File System IOPS on Low-Cost, Commodity Hardware,” in SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2013. [pdf][bib]
Mailing list: email@example.com