- Author: Ben Du
- Date: 2021-04-21 11:10:17
- Title: DataFrame Implementations in Python
- Slug: scaling-pandas
- Category: Computer Science
- Tags: Computer Science, programming, Python, DataFrame, pandas, PySpark, Vaex, Modin, Dask, RAPIDS, cudf, cylon

**Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!**

## pandas DataFrame

## [dask.DataFrame](https://github.com/dask/dask)

Dask combines a low-level task scheduler with high-level collections that partially replace pandas, 
and it is geared toward running code on compute clusters.
Its `dask.dataframe` module is a higher-level, pandas-like library 
that splits a dataset into partitions so that you can work with out-of-core datasets (data that does not fit in memory).


## [vaex](https://github.com/vaexio/vaex)

Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to pandas), 
designed to visualize and explore big tabular datasets. 
According to the project, it calculates statistics such as mean, sum, count, and standard deviation 
on an N-dimensional grid at more than a billion (10^9) rows per second. 
Visualization is done using histograms, density plots, and 3D volume rendering, 
allowing interactive exploration of big data. 
Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for best performance (no memory wasted).



## [cylon](https://github.com/cylondata/cylon)

Cylon is a fast, scalable distributed memory data parallel library for processing structured data. 
Cylon implements a set of relational operators to process data. 
While "Core Cylon" is implemented in system-level C/C++, 
multiple language interfaces (Python and Java) are provided to seamlessly integrate with existing applications, 
enabling both data and AI/ML engineers to invoke data processing operators in a familiar programming language. 
By default, it uses MPI to distribute applications.
Internally, Cylon uses Apache Arrow to represent data in a columnar format.

## [cudf](https://github.com/rapidsai/cudf)

cuDF (developed by RAPIDS) is a GPU DataFrame library, built on the Apache Arrow columnar memory format, 
for loading, joining, aggregating, filtering, and otherwise manipulating data.
cuDF provides a pandas-like API that will be familiar to data engineers and data scientists, 
so they can use it to accelerate their workflows without going into the details of CUDA programming.
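
To make the "pandas-like API" point concrete, here is a minimal sketch in which the same function runs on a pandas DataFrame and, if cuDF happens to be installed on a machine with an NVIDIA GPU, on a cuDF DataFrame as well. The function name, column names, and data are hypothetical, and the optional import is just a convenience for this example.

```python
import pandas as pd

try:
    import cudf  # needs a RAPIDS installation and an NVIDIA GPU
except ImportError:
    cudf = None  # the CPU part of the sketch still runs without it


def mean_price_by_fruit(df):
    """Filter and aggregate using calls that exist in both pandas and cuDF."""
    return df[df["price"] > 1.0].groupby("fruit")["price"].mean()


# Toy data (hypothetical).
pdf = pd.DataFrame({
    "fruit": ["apple", "banana", "apple", "cherry"],
    "price": [1.5, 0.5, 2.5, 3.0],
})

# CPU: plain pandas.
cpu_result = mean_price_by_fruit(pdf)
print(cpu_result)

# GPU: copy the data to device memory, run the same code, copy the result back.
if cudf is not None:
    gdf = cudf.from_pandas(pdf)
    gpu_result = mean_price_by_fruit(gdf).to_pandas()
    print(gpu_result)
```

For data this small the host-to-device copy dominates, so any speedup would only be expected on much larger tables.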

## modin

Modin is a drop-in replacement for pandas, powered by either Ray or Dask as its backend. After installing Modin and one of these backends, you might see a significant speedup by changing just a single line (`import pandas as pd` to `import modin.pandas as pd`). Unlike the other tools, Modin aims for full compatibility with the pandas API.
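
A minimal sketch of the single-line change might look like the following. The fallback to plain pandas is an assumption added here so the snippet also runs where Modin is not installed; the column names and data are made up.

```python
try:
    import modin.pandas as pd  # parallel implementation of the pandas API
except ImportError:
    import pandas as pd  # fallback (assumption): plain pandas, same code below

# From here on, the code is written exactly as it would be for pandas.
df = pd.DataFrame({
    "group": ["a", "b", "a", "b"],
    "value": [1, 2, 3, 4],
})
totals = df.groupby("group")["value"].sum()
print(totals)
```

Because only the import differs, existing pandas code can be tried on Modin without touching the rest of the script.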



## PySpark DataFrame

## TODO

1. compare PySpark DataFrame vs Vaex on a single machine ...

## References

- [Scaling Pandas: Comparing Dask, Ray, Modin, Vaex, and RAPIDS](https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray)

- [RIP Pandas 2.0: Time For DASK After VAEX !!!](https://towardsdatascience.com/dask-vs-vaex-for-big-data-38cb66728747)

- [High performance Computing in Python](http://www.legendu.net/misc/blog/high-performance-computing-in-python)