# Write Multiple Parquet Files to a Single CSV using Python

This post shows you how to read multiple Parquet files into your Python session and write them to a single CSV.

We will:
- create multiple dummy Parquet files
- read these Parquet files into a Dask DataFrame with a single `read_parquet` command
- write the Dask DataFrame to a single CSV file

Using Dask to convert your Parquet files to CSV means:
1. You don't have to worry about file sizes or running out of memory, and
2. You can read and write easily from cloud data storage like Amazon S3. 

Let's get to it.

## Create Multiple Parquet Files

We'll start by creating some dummy dataframes and writing them to 3 separate Parquet files using a simple `for` loop. Each dataframe will contain 3 rows and 4 columns, populated with random integers from 0 to 99 (the upper bound of 100 is exclusive).

In [20]:
import pandas as pd
import numpy as np

# use the recommended method for generating random integers with NumPy
rng = np.random.default_rng()

# generate 3 dummy dataframes with similar filenames
for i in range(3):
    df = pd.DataFrame(rng.integers(0, 100, size=(3, 4)), columns=list('ABCD'))
    df.to_parquet(f"dummy_df_{i}.parquet")
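
A quick aside on the random integers: `rng.integers(0, 100)` treats the upper bound as exclusive by default, so the values land in 0 to 99. The sketch below illustrates this and the `endpoint=True` option that makes the bound inclusive. The fixed seed of 42 and the sample size are assumptions made only so the sketch is repeatable; the cell above is unseeded.

```python
import numpy as np

# A fixed seed (an assumption for this sketch) makes the draws repeatable.
rng = np.random.default_rng(42)

# Default: the upper bound is exclusive, so values lie in [0, 100).
exclusive = rng.integers(0, 100, size=10_000)

# endpoint=True makes the upper bound inclusive, so values lie in [0, 100].
inclusive = rng.integers(0, 100, size=10_000, endpoint=True)

print(exclusive.min(), exclusive.max())
print(inclusive.min(), inclusive.max())
```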

If you're working in an IPython session on Mac or Linux, you can run `! ls` to confirm that the files have been created.

In [23]:
! ls

dummy_df_0.parquet        dummy_df_2.parquet        parquet-to-csv-dask.ipynb
dummy_df_1.parquet        dummy_df_all.csv
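
If you are not on Mac or Linux, or you are running a plain Python script, a portable check is possible with the standard library. The helper below is a minimal sketch; `list_parquet_files` is a hypothetical name used only for illustration, and the default pattern assumes the filenames created above.

```python
from pathlib import Path


def list_parquet_files(directory=".", pattern="dummy_df_*.parquet"):
    """Return the sorted names of files in `directory` that match `pattern`."""
    return sorted(p.name for p in Path(directory).glob(pattern))


# Lists whatever matching files exist in the current working directory.
print(list_parquet_files())
```

Because the pattern ends in `.parquet`, other files in the same folder (a notebook or a CSV, for example) are left out of the result.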


## Load Multiple Parquet Files with Dask DataFrame

We now have all the ingredients to run our experiment. 

Let's see how we can load multiple Parquet files into a DataFrame and write them to a single CSV file using the Dask DataFrame API. Dask is a library for distributed computing that scales familiar Python APIs like pandas, NumPy and scikit-learn to arbitrarily large datasets. Read more about the basics of Dask here.

We'll start by importing `dask.dataframe`.

In [6]:
import dask.dataframe as dd

We'll then use the `read_parquet` method to read all of our Parquet files at once. 

We can do this because Dask accepts an asterisk `*` as a glob / wildcard character that will match related filenames.

In [25]:
ddf = dd.read_parquet('dummy_df_*.parquet')

Let's have a look at the contents of our DataFrame by calling `ddf.compute()`.

In [26]:
ddf.compute()

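| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0 | 41 | 85 | 11 |
| 1 | 92 | 53 | 30 | 72 |
| 2 | 6 | 61 | 79 | 29 |
| 0 | 13 | 74 | 5 | 25 |
| 1 | 19 | 30 | 5 | 89 |
| 2 | 34 | 13 | 26 | 70 |
| 0 | 80 | 12 | 84 | 40 |
| 1 | 23 | 57 | 7 | 44 |
| 2 | 44 | 2 | 18 | 64 |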


Our Dask DataFrame now contains all the data from our 3 separate Parquet files.

## Write Parquet Files to CSV

We can now write our multiple Parquet files out to a single CSV file using the `to_csv` method. Make sure to set `single_file` to `True` and `index` to `False`.

In [29]:
ddf.to_csv("dummy_df_all.csv", 
           single_file=True, 
           index=False
)

['/Users/rpelgrim/Documents/git/coiled-resources/parquet-csv/dummy_df_all.csv']
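
Why `index=False`? The sketch below uses plain pandas and a made-up two-column frame to show what happens to the index when a CSV is written with and without it. It is an illustration of the pandas behaviour only, on the assumption that Dask hands this keyword through to pandas for each partition.

```python
import io

import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({"A": [0, 92, 6], "B": [41, 53, 61]})

# to_csv() with no path returns the CSV text as a string.
csv_with_index = df.to_csv()                # index=True is the pandas default
csv_without_index = df.to_csv(index=False)

# Reading back: the unnamed index column comes back as "Unnamed: 0".
cols_with = list(pd.read_csv(io.StringIO(csv_with_index)).columns)
cols_without = list(pd.read_csv(io.StringIO(csv_without_index)).columns)

print(cols_with)     # ['Unnamed: 0', 'A', 'B']
print(cols_without)  # ['A', 'B']
```

Dropping the index keeps the repeating per-partition labels out of the CSV, so reading the file back gives a clean set of data columns.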

Let's verify that this actually worked by reading the CSV file into a pandas DataFrame.

In [30]:
df_csv = pd.read_csv("dummy_df_all.csv")

In [31]:
df_csv

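| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0 | 41 | 85 | 11 |
| 1 | 92 | 53 | 30 | 72 |
| 2 | 6 | 61 | 79 | 29 |
| 3 | 13 | 74 | 5 | 25 |
| 4 | 19 | 30 | 5 | 89 |
| 5 | 34 | 13 | 26 | 70 |
| 6 | 80 | 12 | 84 | 40 |
| 7 | 23 | 57 | 7 | 44 |
| 8 | 44 | 2 | 18 | 64 |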


## About the Dask API

Dask follows the pandas API as much as possible, but there are two important differences to note in the example above:
1. using the `.compute()` method to inspect the DataFrame, and
2. how the index behaves

### 1. Using .compute()

Dask uses 'lazy evaluation' to optimize performance. This means that results are not computed until you explicitly tell Dask to do so. This allows Dask to find the quickest way to get you your result *when you actually need it*. 

Simply calling `ddf` will get you some basic information *about* the DataFrame but not the actual contents. To view the content of the DataFrame, tell Dask to go and run the computation by calling `ddf.compute()`.

In [12]:
# ddf without .compute()
ddf

| npartitions=3 | A | B | C | D |
|---|---|---|---|---|
| | int64 | int64 | int64 | int64 |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |


In [32]:
# ddf with .compute()
ddf.compute()

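| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0 | 41 | 85 | 11 |
| 1 | 92 | 53 | 30 | 72 |
| 2 | 6 | 61 | 79 | 29 |
| 0 | 13 | 74 | 5 | 25 |
| 1 | 19 | 30 | 5 | 89 |
| 2 | 34 | 13 | 26 | 70 |
| 0 | 80 | 12 | 84 | 40 |
| 1 | 23 | 57 | 7 | 44 |
| 2 | 44 | 2 | 18 | 64 |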


### 2. Dask DataFrame Index

You might have noticed that the index for the Dask DataFrame runs from 0 to 2 and then repeats. This is because a Dask DataFrame is divided into partitions (3 in this case). Each partition is a pandas DataFrame that has its own index starting from 0. This may seem odd but helps Dask speed up your indexing operations when working with very large datasets.

For most purposes, including ours, this is not a problem at all, just something to be aware of. If you want to learn more about setting indexes in Dask, read this blog: https://coiled.io/blog/dask-set-index-dataframe/
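
To see where the repeating index comes from, we can mimic the three partitions with plain pandas. This is a sketch with made-up values, not Dask itself: three small frames stand in for the partitions and `pd.concat` stands in for viewing them together.

```python
import pandas as pd

# Three small frames stand in for the three partitions; each has index 0..2.
parts = [pd.DataFrame({"A": [i * 10 + j for j in range(3)]}) for i in range(3)]

combined = pd.concat(parts)
print(list(combined.index))  # [0, 1, 2, 0, 1, 2, 0, 1, 2]

# In pandas, a continuous index can be rebuilt after the fact.
continuous = combined.reset_index(drop=True)
print(list(continuous.index))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Note that this shows the pandas behaviour. On a Dask DataFrame, resetting the index may still restart at 0 within each partition, so check the Dask documentation before relying on a continuous index.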

## Why Use Dask to Write Parquet Files to CSV

The benefits of using Dask to write your Parquet files to CSV are that you:
1. Don't have to worry about file sizes or running out of memory
2. Can easily read and write data to/from cloud-based data storage

To demonstrate, let's load 2750 Parquet files (104 GB) into a Dask DataFrame and write them to an S3 bucket as CSV. We'll use a Coiled cluster to access additional hardware resources in the cloud.

In [1]:
import coiled

In [2]:
coiled.__version__

'0.0.60'

In [None]:
cluster = coiled.Cluster(
    name="parquet-to-csv",
    software="coiled-examples/numpy-zarr",
    n_workers=25,
    scheduler_options={'idle_timeout':'1 hour'},
    backend_options={'spot': True}
)

In [11]:
from distributed import Client

In [17]:
client = Client(cluster)
client

- Connection method: Cluster object
- Cluster type: coiled.Cluster
- Dashboard: http://3.234.205.253:8787
- Workers: 20
- Total threads: 40
- Total memory: 153.34 GiB
- Scheduler comm: tls://10.4.6.65:8786
- Scheduler dashboard: http://10.4.6.65:8787/status
- Started: 1 minute ago


In [13]:
ddf = dd.read_parquet(
    "s3://coiled-datasets/synthetic-data/synth-reg-104GB.parquet/",
    # pass real booleans rather than the strings 'True'
    storage_options={'anon': True, 'use_ssl': True}
)

In [14]:
ddf

| npartitions=2750 | 0 | 1 | 2 | ... | 48 | 49 | target |
|---|---|---|---|---|---|---|---|
| 0 | float64 | float64 | float64 | ... | float64 | float64 | float64 |
| 100000 | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 274900000 | ... | ... | ... | ... | ... | ... | ... |
| 274999999 | ... | ... | ... | ... | ... | ... | ... |
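
The repr above already tells us roughly how big this DataFrame is in memory. The back-of-the-envelope sketch below counts raw `float64` values only, ignoring the index and any pandas overhead, so treat the results as estimates rather than measurements.

```python
# Figures read off the DataFrame repr above.
n_rows = 275_000_000   # divisions run from 0 to 274,999,999
n_partitions = 2750
n_columns = 51         # columns "0" to "49" plus "target"
bytes_per_value = 8    # float64

total_bytes = n_rows * n_columns * bytes_per_value
rows_per_partition = n_rows // n_partitions
partition_bytes = rows_per_partition * n_columns * bytes_per_value

print(f"total: {total_bytes / 2**30:.1f} GiB")           # total: 104.5 GiB
print(f"per partition: {partition_bytes / 1e6:.1f} MB")  # per partition: 40.8 MB
```

The estimate lines up with the 104 GB in the dataset name and fits within the 153.34 GiB of total memory the cluster reports. Each partition is small enough for a worker to handle one at a time.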


In [15]:
ddf.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,41,42,43,44,45,46,47,48,49,target
0,-1.083516,0.173372,-0.973546,-1.465443,1.973955,-0.922526,1.058072,0.302878,1.160762,-0.690999,...,0.478698,-1.286906,0.037474,-0.448159,-0.652509,-1.205982,0.166634,2.526275,-0.890744,223.602485
1,2.077819,-0.507675,1.188347,-0.958974,0.666332,0.699718,0.416365,-0.006916,-0.561665,-0.535323,...,-0.406144,-0.122424,1.623143,0.438106,-1.510411,-0.909098,-0.416044,0.16966,-1.343285,-63.876627
2,-1.545396,-1.001309,-0.185548,-0.507883,1.223005,0.405486,-0.838138,-0.521867,1.16429,0.566665,...,1.341402,-0.206474,-1.203585,0.7965,-2.083753,0.670345,1.243194,-0.513658,-1.388109,182.856379
3,-0.548436,-0.754629,1.62849,0.954295,0.190117,-0.359459,1.901831,-0.137075,-0.005027,0.918249,...,1.214883,-0.115838,0.287735,-0.115192,-0.49933,0.349165,-1.618127,1.421938,-0.43924,-211.527657
4,-0.981102,0.993449,-0.173022,0.503123,0.823864,0.083351,0.242027,0.661806,0.463781,-0.799858,...,-0.98889,-0.541225,-0.298992,0.306095,0.351885,2.269911,0.465673,0.909917,0.513545,-165.464021


We'll use the same call to `ddf.to_csv()` we used earlier, just changing the file path to point to our S3 bucket. We'll also set `single_file` to `False` this time to make the best use of Dask's parallel writing capabilities. Technically you can write the Parquet files to a single CSV file, but it will take much longer, the resulting CSV file will be quite unwieldy, and you'll lose many of the performance benefits that come with parallel read/write.

In [19]:
%%time
ddf.to_csv("s3://coiled-datasets/synthetic-data/synth-reg-104GB.csv", 
           single_file=False, 
           index=False
)

CPU times: user 13 s, sys: 1.21 s, total: 14.2 s
Wall time: 21min 54s


['coiled-datasets/synthetic-data/synth-reg-104GB.csv/0000.part',
 'coiled-datasets/synthetic-data/synth-reg-104GB.csv/0001.part',
 'coiled-datasets/synthetic-data/synth-reg-104GB.csv/0002.part',
 'coiled-datasets/synthetic-data/synth-reg-104GB.csv/0003.part',
 'coiled-datasets/synthetic-data/synth-reg-104GB.csv/0004.part',
 'coiled-datasets/synthetic-data/synth-reg-104GB.csv/0005.part',
 'coiled-datasets/synthetic-data/synth-reg-104GB.csv/0006.part',
 'coiled-datasets/synthetic-data/synth-reg-104GB.csv/0007.part',
 'coiled-datasets/synthetic-data/synth-reg-104GB.csv/0008.part',
 'coiled-datasets/synthetic-data/synth-reg-104GB.csv/0009.part',
 'coiled-datasets/synthetic-data/synth-reg-104GB.csv/0010.part',
 'coiled-datasets/synthetic-data/synth-reg-104GB.csv/0011.part',
 'coiled-datasets/synthetic-data/synth-reg-104GB.csv/0012.part',
 'coiled-datasets/synthetic-data/synth-reg-104GB.csv/0013.part',
 'coiled-datasets/synthetic-data/synth-reg-104GB.csv/0014.part',
 'coiled-datasets/synthet

Note that writing this DataFrame to Parquet is much faster: in the run below it took about 4.5 minutes, compared with almost 22 minutes for CSV. Generally speaking, we recommend working with Parquet files when using Dask unless you have very strong reasons not to do so. Read this blog to learn more about the benefits of the Parquet file format.

In [21]:
%%time
ddf.to_parquet("s3://coiled-datasets/synthetic-data/synth-reg-104GB_2.parquet")

CPU times: user 2.27 s, sys: 243 ms, total: 2.51 s
Wall time: 4min 35s


[None]
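
To put a number on the difference, we can compare the two wall times reported in this post. This is simple arithmetic on a single run of each write, not a benchmark, and `to_seconds` is a helper defined here only for the calculation.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds wall time into seconds."""
    return minutes * 60 + seconds


csv_wall = to_seconds(21, 54)     # wall time of the CSV write
parquet_wall = to_seconds(4, 35)  # wall time of the Parquet write

speedup = csv_wall / parquet_wall
print(f"{speedup:.1f}x")  # 4.8x
```

In this run, the Parquet write finished almost five times faster than the CSV write on the same cluster.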


## Summary: Writing Parquet Files to CSV with Dask

- You can use Dask to convert multiple Parquet files into a single CSV file
- With Dask you don’t have to worry about file sizes or memory errors
- Dask supports reading and writing from cloud-based data storage
- Dask enables you to work faster by reading and writing in parallel
