# Converting a Dask DataFrame to a Pandas DataFrame

This notebook shows how to convert a Dask DataFrame to a Pandas DataFrame, discusses when the conversion is appropriate, and demonstrates a case where it fails.

It first runs small examples locally and then works with a much larger dataset on a cloud cluster.

In [1]:
import dask.dataframe as dd
import pandas as pd

In [8]:
import warnings

warnings.filterwarnings("ignore")

## Convert on localhost

In [27]:
df = pd.DataFrame(
    {"nums": [1, 2, 3, 4, 5, 6], "letters": ["a", "b", "c", "d", "e", "f"]}
)
ddf = dd.from_pandas(df, npartitions=2)

In [28]:
type(ddf)

dask.dataframe.core.DataFrame

In [29]:
pandas_df = ddf.compute()

In [36]:
print(pandas_df)

   nums letters
0     1       a
1     2       b
2     3       c
3     4       d
4     5       e
5     6       f


In [31]:
type(pandas_df)

pandas.core.frame.DataFrame

In [33]:
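# Selecting a partition does not convert anything: the result is still a lazy Dask DataFrame.
pdf2 = ddf.repartition(npartitions=1).partitions[0]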

In [35]:
type(pdf2)

dask.dataframe.core.DataFrame

## Convert on cloud

In [None]:
import coiled
import dask

In [7]:
cluster = coiled.Cluster(name="demo-cluster", n_workers=5)

In [10]:
client = dask.distributed.Client(cluster)

In [13]:
ddf = dd.read_parquet(
    "s3://coiled-datasets/timeseries/20-years/parquet",
    storage_options={"anon": True, "use_ssl": True},
    engine="pyarrow",
)

In [14]:
len(ddf)

662256000

In [17]:
ddf.memory_usage(deep=True).sum().compute()

62481295692

In [37]:
from dask.utils import format_bytes

In [38]:
format_bytes(62481295692)

'58.19 GiB'
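
Once the in-memory size is known, it can be compared with the memory of the machine that would receive the Pandas DataFrame. The sketch below uses only the standard library. The helper `fits_in_memory`, the 16 GiB machine size, and the safety factor of 2 are assumptions made for illustration, not values taken from this notebook.

```python
GIB = 2**30


def fits_in_memory(estimated_bytes, available_bytes, safety_factor=2.0):
    """Return True if data of `estimated_bytes` is likely to fit in memory.

    `safety_factor` is an assumed multiplier that leaves headroom for the
    temporary copies that are often made while working with a DataFrame.
    """
    return estimated_bytes * safety_factor <= available_bytes


# Value reported by ddf.memory_usage(deep=True).sum().compute() above.
full_dataset_bytes = 62_481_295_692

# Assumed client machine size; adjust to the machine actually in use.
client_ram_bytes = 16 * GIB

print(full_dataset_bytes / GIB)
print(fits_in_memory(full_dataset_bytes, client_ram_bytes))
```

Dividing by `2**30` reproduces the GiB figure that `format_bytes` reports, which makes the unit conversion explicit.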

## Inspecting individual partitions

In [18]:
ddf.partitions[0]

                  id    name        x        y
npartitions=1
2000-01-01     int64  object  float64  float64
2000-01-08       ...     ...      ...      ...


In [19]:
type(ddf.partitions[0])

dask.dataframe.core.DataFrame

In [20]:
ddf.memory_usage_per_partition(deep=True).compute()

0       57061027
1       57060857
2       57059768
3       57059342
4       57060737
          ...   
1090    57059834
1091    57061111
1092    57061001
1093    57058404
1094    57061989
Length: 1095, dtype: int64
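
The per-partition figures can be cross-checked against the total reported earlier. The sketch below is plain arithmetic on the ten sizes visible in the truncated output, so the result is an estimate: it assumes the hidden partitions are similar in size to the visible ones.

```python
# The ten per-partition sizes visible in the truncated output above (bytes).
shown_sizes = [
    57061027, 57060857, 57059768, 57059342, 57060737,
    57059834, 57061111, 57061001, 57058404, 57061989,
]
n_partitions = 1095
reported_total = 62_481_295_692  # from ddf.memory_usage(deep=True).sum().compute()

# Extrapolate the total from the mean of the visible partitions.
mean_size = sum(shown_sizes) / len(shown_sizes)
estimated_total = mean_size * n_partitions
relative_gap = abs(estimated_total - reported_total) / reported_total

print(f"mean partition size: {mean_size:,.0f} bytes")
print(f"estimated total:     {estimated_total:,.0f} bytes")
print(f"relative gap:        {relative_gap:.2e}")
```

Each partition is roughly 57 MB, so any single partition fits comfortably in memory even though all 1095 together do not.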

## Why conversion becomes feasible after heavy filtering

In [21]:
filtered_ddf = ddf.loc[ddf["id"] > 1150]

In [22]:
len(filtered_ddf)

1103

In [23]:
filtered_ddf.memory_usage(deep=True).sum().compute()

104151

In [39]:
format_bytes(104151)

'101.71 kiB'

In [24]:
pdf = filtered_ddf.compute()

In [25]:
print(pdf)

                       id    name         x         y
timestamp                                            
2000-01-09 01:52:30  1152   Edith  0.273674  0.997075
2000-01-29 17:22:59  1175  Oliver -0.909065  0.017086
2000-01-29 20:34:37  1158     Bob -0.910895  0.652333
2000-02-06 00:13:44  1152   Sarah  0.080475  0.855420
2000-02-08 05:23:59  1153   Kevin  0.258087 -0.144844
...                   ...     ...       ...       ...
2020-10-25 18:15:35  1175  George  0.060843  0.229963
2020-11-01 14:25:07  1153   Alice -0.537439 -0.544084
2020-11-20 02:30:17  1158  Oliver  0.733396  0.227974
2020-11-30 03:01:06  1155   Kevin -0.963094 -0.638443
2020-11-30 07:11:46  1163  Yvonne -0.671973 -0.700749

[1103 rows x 4 columns]


## Convert entire Dask DataFrame to Pandas DataFrame

In [26]:
pandas_df = ddf.compute()



CommClosedError: in <TLS (closed) ConnectionPool.gather local=tls://192.168.1.226:53124 remote=tls://ec2-3-80-139-218.compute-1.amazonaws.com:8786>: Stream is closed

distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
