https://towardsdatascience.com/speed-up-your-pandas-workflow-with-modin-9a61acff0076

Soner Yıldırım
Mar 10
Speed Up Your Pandas Workflow with Modin
Make use of the power of distributed computation

Install dependencies

In [None]:
!pip install -r requirements.txt

Import what we need

In [None]:
import numpy as np
import pandas as pd
import modin.pandas as pdm
import ray
ray.init(ignore_reinit_error=True)

We will first create a sample DataFrame with 10 million rows and save it in csv format. Then, we will perform some common operations with both Pandas and Modin.

In [None]:
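n_rows = 10**7  # 10 million rows
df = pd.DataFrame(np.random.randint(1,100,size=(n_rows,50)))
df = df.add_prefix("column_")
# Repeat the four labels so the column length equals the row count
df["group"] = ["A","B","C","D"]*(n_rows//4)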
df.shape

The DataFrame contains 10 million rows and 50 columns with integer values from 1 to 99 (the upper bound passed to randint is exclusive). I have also added a categorical column to be able to test the groupby function.

We can now save this DataFrame as a csv file.

In [None]:
df.to_csv("large_dataset.csv",index=False)

The size of the csv file is 1.47 GB.
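
As a rough sanity check on that figure, the expected file size can be estimated from the shape of the data. The sketch below assumes 10 million rows, integers drawn uniformly from 1 to 99, one byte per character, and a single-byte newline, and it ignores the header row. The function name is only for illustration.

```python
def estimate_csv_bytes(n_rows: int, n_int_cols: int = 50) -> float:
    """Estimate the size of the csv file written above (header row ignored)."""
    # Values 1..99: nine 1-digit values and ninety 2-digit values.
    avg_digits = (9 * 1 + 90 * 2) / 99
    # Each row has the integer columns plus the one-letter "group" column.
    n_fields = n_int_cols + 1
    # Every field is followed by one separator byte: a comma or the final newline.
    bytes_per_row = n_int_cols * avg_digits + 1 + n_fields
    return n_rows * bytes_per_row


size_gb = estimate_csv_bytes(10**7) / 1e9
print(f"{size_gb:.2f} GB")  # 1.47 GB
```

The estimate comes out at about 147 bytes per row, which agrees with the measured 1.47 GB.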

We now have our “large” dataset. It is time to do operations and time them. I have a 2020 MacBook Pro with an M1 chip. The times you measure on your machine might differ, but you should still see an improvement with Modin over Pandas in some of the operations.
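
The cells that follow use the %%time magic, which is only available in Jupyter/IPython. Outside a notebook, a comparable wall-clock measurement can be taken with the standard library. This is a minimal sketch: `timed` is a hypothetical helper, not part of Pandas or Modin.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f} s")


# Example: time an arbitrary piece of work.
with timed("sum of squares"):
    total = sum(i * i for i in range(1_000_000))
```

The same pattern would wrap a call such as reading the csv file, with the label naming the library being measured.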

The first operation we will do is read the csv file.

In [None]:
%%time
df_pandas = pd.read_csv("large_dataset.csv")

It took 26 seconds to read the file with Pandas.

In [None]:
%%time
df_modin = pdm.read_csv("large_dataset.csv")

With Modin, we are able to read the same file in 6.66 seconds, which is a 74% reduction in read time.
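
For reference, the improvement figure follows directly from the two timings. The short sketch below reproduces the arithmetic; the helper name is an assumption made for illustration.

```python
def improvement_pct(baseline_s: float, new_s: float) -> float:
    """Percentage reduction in run time relative to the baseline."""
    return (baseline_s - new_s) / baseline_s * 100


pandas_s, modin_s = 26.0, 6.66  # read times reported above, in seconds
print(f"{improvement_pct(pandas_s, modin_s):.0f}% less time")  # 74% less time
print(f"{pandas_s / modin_s:.1f}x faster")                     # 3.9x faster
```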

Although we can apply the same operations using the same syntax, the types of “df_pandas” and “df_modin” are different.

In [None]:
type(df_pandas)

In [None]:
type(df_modin)

We select the rows whose group value is A or B. The operation took 2.14 seconds with Pandas.

In [None]:
%%time
df_filtered = df_pandas[df_pandas.group.isin(["A","B"])]

In [None]:
%%time
df_filtered = df_modin[df_modin.group.isin(["A","B"])]

We do not see any improvement. In fact, Pandas is slightly faster than Modin in this operation.

**Note:** Modin uses either Ray or Dask as its execution engine. How to choose one of them is explained in the documentation.
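
As a minimal sketch of one way to do this, the engine can be selected through the MODIN_ENGINE environment variable, set before Modin is imported. In this notebook that would mean running it ahead of the import cell at the top. The Modin import is left as a comment here so the snippet stands on its own.

```python
import os

# Select the execution engine before Modin is first imported.
# "ray" and "dask" are the two engines mentioned in the note above.
os.environ["MODIN_ENGINE"] = "ray"

# import modin.pandas as pdm  # Modin would now use Ray as its engine
```

According to the Modin documentation, the same choice can also be made from Python with `modin.config.Engine.put("dask")`.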

Another common task in data processing is combining multiple DataFrames. Let’s do an example by combining the filtered DataFrame and the original one with the concat function.

In [None]:
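# df_filtered now holds the Modin result, so rebuild the Pandas one outside the timing
df_filtered_pandas = df_pandas[df_pandas.group.isin(["A","B"])]
%time df_combined = pd.concat([df_pandas, df_filtered_pandas])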

In [None]:
%%time
df_combined = pdm.concat([df_modin, df_filtered])

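The last comparison is an aggregation: we group the rows by the group column and take the mean of column_1.

In [None]: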
%%time
df_pandas.groupby("group")["column_1"].mean()

In [None]:
%%time
df_modin.groupby("group")["column_1"].mean()

We have done some examples to compare the performance of Pandas and Modin on a 1.47 GB csv file. Modin outperforms Pandas in reading the file and combining DataFrames. On the other hand, Pandas performed better than Modin in filtering and aggregations with groupby.
I think the performance difference between Modin and Pandas will become more noticeable as the data size increases. Another important factor that will reveal the speed of Modin compared to Pandas is the number of machines in the cluster. I have done the examples on a single machine (i.e. my laptop). It might be a more accurate comparison to run these examples with a larger dataset on a multi-node cluster.