Add parallel gz decompression like pigz #624

vchemla · 2024-04-03T08:30:05Z

Hi,

In our case, we would like to read a big CSV file compressed in .gz format.

We would like to use the read_csv function like this:

ctx.read_csv('myfile.csv.gz', file_extension=".csv.gz", delimiter=';', has_header=True, schema_infer_max_records=0, file_compression_type='gzip')

However, the decompression done by read_csv does not seem to run in parallel the way pigz does: decompressing the file with pigz takes 54 seconds, whereas reading it with read_csv takes 800 seconds.
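Until something like this exists in read_csv, one possible workaround is to decompress the file outside the reader and then read the plain CSV. The sketch below is only an illustration under a few assumptions: pigz is installed and on PATH (otherwise it falls back to Python's single-threaded gzip module), there is enough disk space for the uncompressed file, and the helper names decompress_gz and read_gz_csv are made up for this example and are not part of the library. I have not measured whether it closes the gap in our case.

```python
import gzip
import shutil
import subprocess
from pathlib import Path


def decompress_gz(src: str, dst: str) -> str:
    """Decompress src (.gz) into dst, using pigz when it is on PATH.

    Hypothetical helper for this example, not part of any library.
    """
    if shutil.which("pigz"):
        # pigz -d decompresses; -c writes to stdout, so src is left untouched.
        with open(dst, "wb") as out:
            subprocess.run(["pigz", "-dc", src], stdout=out, check=True)
    else:
        # Fallback: single-threaded decompression from the standard library.
        with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
    return dst


def read_gz_csv(ctx, gz_path: str):
    """Decompress gz_path next to itself, then read the plain CSV.

    ctx is assumed to be the same context object used in the call above;
    the read_csv options are the ones from that call, minus the
    compression-related ones.
    """
    csv_path = str(Path(gz_path).with_suffix(""))  # 'myfile.csv.gz' -> 'myfile.csv'
    decompress_gz(gz_path, csv_path)
    return ctx.read_csv(
        csv_path,
        delimiter=";",
        has_header=True,
        schema_infer_max_records=0,
    )
```

With this, the original call would become read_gz_csv(ctx, 'myfile.csv.gz'). The trade-off is an extra uncompressed copy on disk, so it is a stopgap rather than a replacement for the requested feature.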

If you could take a look...


vchemla added the enhancement New feature or request label Apr 3, 2024
