
Investigate if we should skip zipping of parquet dependency table #397

Closed

hagenw opened this issue May 7, 2024 · 7 comments

hagenw (Member) commented May 7, 2024

In #372 we introduced storing the dependency table as a PARQUET file instead of a CSV file.
When the file is uploaded to the server, a ZIP file is still created first. As PARQUET already comes with built-in compression, we should check:

  1. Is the file size still reduced by using ZIP?
  2. How much code is affected if we skip zipping before uploading?
  3. If we use a similar compression algorithm directly in PARQUET, do we lose speed compared to the current approach?
hagenw self-assigned this May 7, 2024
frankenjoe (Collaborator) commented:

Related to #181

hagenw (Member, Author) commented May 7, 2024

To answer the first question, we create a PARQUET file and a corresponding ZIP file and compare their sizes.
NOTE: the following example currently requires the dev branch of audb.

import os

import audb
import audeer

# Load the dependency table of the musan dataset
deps = audb.dependencies("musan", version="1.0.0")

# Store it as a PARQUET file and additionally as a ZIP archive
parquet_file = "deps.parquet"
zip_file = "deps.zip"
deps.save(parquet_file)
audeer.create_archive(".", parquet_file, zip_file)

# Compare the resulting file sizes
parquet_size = os.stat(parquet_file).st_size
zip_size = os.stat(zip_file).st_size
print(f"Parquet file size: {parquet_size >> 10:.0f} kB")
print(f"Zip file size: {zip_size >> 10:.0f} kB")

returns

Parquet file size: 175 kB
Zip file size: 130 kB

I repeated it with librispeech 3.1.0 from our internal repository to have an example of a bigger dataset:

Parquet file size: 21848 kB
Zip file size: 16163 kB

hagenw (Member, Author) commented May 7, 2024

Regarding the second question, we would need to change the following code block in audb/core/publish.py:

audb/core/publish.py, lines 753 to 759 in fa14acc:

archive_file = backend.join("/", db.name, define.DB + ".zip")
backend.put_archive(
    db_root,
    archive_file,
    version,
    files=define.DEPENDENCIES_FILE,
)

There we could simply use put_file() instead of put_archive() to avoid zipping the file.
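
A minimal sketch of what that change could look like, assuming the backend offers a put_file(src_path, dst_path, version) method with roughly this signature (not verified against the actual audbackend API) and that define.DEPENDENCIES_FILE names the local PARQUET file:

# Sketch only: mirrors the snippet above, but uploads the file as-is
local_file = os.path.join(db_root, define.DEPENDENCIES_FILE)
remote_file = backend.join("/", db.name, define.DEPENDENCIES_FILE)
backend.put_file(local_file, remote_file, version)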

Loading the dependency table will be slightly more complicated, as the server might hold either a ZIP file or a PARQUET file, which is not ideal. The affected code block is in audb/core/api.py, in the definition of audb.dependencies():

audb/core/api.py, lines 275 to 282 in fa14acc:

with tempfile.TemporaryDirectory() as tmp_root:
    archive = backend.join("/", name, define.DB + ".zip")
    backend.get_archive(
        archive,
        tmp_root,
        version,
        verbose=verbose,
    )

There we could first try to load the PARQUET file (or check if it exists) and fall back to the ZIP file otherwise.
An alternative approach would be to still use ZIP, but without compression, as proposed in #181 (comment).
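
A hedged sketch of that fallback in audb.dependencies(), assuming the backend provides exists() and get_file() methods with the signatures used below (the actual method names and arguments would need to be checked) and that the PARQUET file would be stored as define.DB + ".parquet" on the server:

with tempfile.TemporaryDirectory() as tmp_root:
    # Prefer the plain PARQUET file, fall back to the legacy ZIP archive
    remote_parquet = backend.join("/", name, define.DB + ".parquet")
    remote_zip = backend.join("/", name, define.DB + ".zip")
    if backend.exists(remote_parquet, version):
        local_file = os.path.join(tmp_root, define.DB + ".parquet")
        backend.get_file(remote_parquet, local_file, version, verbose=verbose)
    else:
        backend.get_archive(remote_zip, tmp_root, version, verbose=verbose)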

Then there are also two affected parts in audb/core/api.py, inside remove_media():

audb/core/api.py, lines 492 to 500 in fa14acc:

with tempfile.TemporaryDirectory() as db_root:
    # download dependencies
    archive = backend.join("/", name, define.DB + ".zip")
    deps_path = backend.get_archive(
        archive,
        db_root,
        version,
        verbose=verbose,
    )[0]

audb/core/api.py, lines 550 to 560 in fa14acc:

# upload dependencies
if upload:
    deps.save(deps_path)
    remote_archive = backend.join("/", name, define.DB + ".zip")
    backend.put_archive(
        db_root,
        remote_archive,
        version,
        files=define.DEPENDENCIES_FILE,
        verbose=verbose,
    )
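
The same pattern would apply in remove_media(): download the PARQUET file directly (falling back to ZIP for old versions) and upload it again with put_file(). A rough sketch, under the same assumptions about the backend methods as above:

with tempfile.TemporaryDirectory() as db_root:
    # download dependencies as plain PARQUET file
    remote_file = backend.join("/", name, define.DB + ".parquet")
    deps_path = os.path.join(db_root, define.DB + ".parquet")
    backend.get_file(remote_file, deps_path, version, verbose=verbose)

    # ... remove media entries ...

    # upload dependencies without zipping
    if upload:
        deps.save(deps_path)
        backend.put_file(deps_path, remote_file, version, verbose=verbose)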

hagenw (Member, Author) commented May 7, 2024

To answer the third question, I created the benchmark script shown below, which tests different ways to store and load the dependency table for a dataset containing 292,381 files. Running the script returns:

parquet snappy
Writing time: 0.2501 s
Reading time: 0.1112 s
File size: 21848 kB

parquet snappy + zip no compression
Writing time: 0.2985 s
Reading time: 0.1290 s
File size: 21848 kB

parquet snappy + zip
Writing time: 1.1113 s
Reading time: 0.2630 s
File size: 16163 kB

parquet gzip
Writing time: 1.5897 s
Reading time: 0.1205 s
File size: 13524 kB

The zipped CSV file that is currently used to store the dependency table of the same dataset has a size of 14390 kB.

Benchmark script:

import os
import time
import zipfile

import pyarrow.parquet

import audb
import audeer


parquet_file = "deps.parquet"
zip_file = "deps.zip"


def clear():
    for file in [parquet_file, zip_file]:
        if os.path.exists(file):
            os.remove(file)


deps = audb.dependencies("librispeech", version="3.1.0")

print("parquet snappy")
clear()
t0 = time.time()
table = deps._dataframe_to_table(deps._df, file_column=True)
pyarrow.parquet.write_table(table, parquet_file, compression="snappy")
t = time.time() - t0
print(f"Writing time: {t:.4f} s")
t0 = time.time()
table = pyarrow.parquet.read_table(parquet_file)
df = deps._table_to_dataframe(table)
t = time.time() - t0
print(f"Reading time: {t:.4f} s")
size = os.stat(parquet_file).st_size
print(f"File size: {size >> 10:.0f} kB")
print()

print("parquet snappy + zip no compression")
clear()
t0 = time.time()
table = deps._dataframe_to_table(deps._df, file_column=True)
pyarrow.parquet.write_table(table, parquet_file, compression="snappy")
with zipfile.ZipFile(zip_file, "w", zipfile.ZIP_STORED) as zf:
    full_file = audeer.path(".", parquet_file)
    zf.write(full_file, arcname=parquet_file)
t = time.time() - t0
print(f"Writing time: {t:.4f} s")
t0 = time.time()
audeer.extract_archive(zip_file, ".")
table = pyarrow.parquet.read_table(parquet_file)
df = deps._table_to_dataframe(table)
t = time.time() - t0
print(f"Reading time: {t:.4f} s")
size = os.stat(zip_file).st_size
print(f"File size: {size >> 10:.0f} kB")
print()

print("parquet snappy + zip")
clear()
t0 = time.time()
table = deps._dataframe_to_table(deps._df, file_column=True)
pyarrow.parquet.write_table(table, parquet_file, compression="snappy")
with zipfile.ZipFile(zip_file, "w", zipfile.ZIP_DEFLATED) as zf:
    full_file = audeer.path(".", parquet_file)
    zf.write(full_file, arcname=parquet_file)
t = time.time() - t0
print(f"Writing time: {t:.4f} s")
os.remove(parquet_file)
t0 = time.time()
audeer.extract_archive(zip_file, ".")
table = pyarrow.parquet.read_table(parquet_file)
df = deps._table_to_dataframe(table)
t = time.time() - t0
print(f"Reading time: {t:.4f} s")
size = os.stat(zip_file).st_size
print(f"File size: {size >> 10:.0f} kB")
print()

print("parquet gzip")
clear()
t0 = time.time()
table = deps._dataframe_to_table(deps._df, file_column=True)
pyarrow.parquet.write_table(table, parquet_file, compression="GZIP")
t = time.time() - t0
print(f"Writing time: {t:.4f} s")
t0 = time.time()
table = pyarrow.parquet.read_table(parquet_file)
df = deps._table_to_dataframe(table)
t = time.time() - t0
print(f"Reading time: {t:.4f} s")
size = os.stat(parquet_file).st_size
print(f"File size: {size >> 10:.0f} kB")

"zip no compression" is referring to the solution proposed in #181, to still be able to upload the files as ZIP files to the server. In #181 we discuss media files, for which it is important to store them in a ZIP file, as we also have to preserve the underlying folder structure. This is not the case for the dependency table, and also the file extension will always be the same for the dependency table.

Our current approach is "parquet snappy + zip". Switching to any of the other approaches would roughly halve the reading time.
We can either use GZIP compression directly when creating the PARQUET file, which increases writing time but reduces the file size, or we could stick with SNAPPY compression and skip the ZIP step, which keeps writing time low but results in a larger file.
@ChristianGeng any preference?

ChristianGeng (Member) commented:

> Or we could stick with SNAPPY compression and skip the ZIP step, which keeps writing time low but results in a larger file. @ChristianGeng any preference?

In general, disk storage is quite cheap, so I would find it a good move to be able to read data faster. I would therefore be open to departing from "parquet snappy + zip" and optimizing for reading time by going in the SNAPPY direction.

The Stack Overflow post linked here also suggests that heavy zipping is meant for cold data. I think we have something in between (lukewarm data), but CPU is normally more expensive. Apart from that, the post discusses "splittability". The question of determinism, i.e. whether the file can be reliably md5summed, I have not been able to answer.
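
One way to check the determinism question locally would be to write the same table twice and compare checksums. A small sketch (just an experiment, not part of audb; it reuses the private helpers from the benchmark script above, and whether the checksums match may depend on the pyarrow version and the metadata it embeds):

import hashlib

import pyarrow.parquet

import audb


def md5(path):
    # Compute the MD5 checksum of a file in chunks
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


# Write the same dependency table twice and compare checksums
deps = audb.dependencies("musan", version="1.0.0")
table = deps._dataframe_to_table(deps._df, file_column=True)
pyarrow.parquet.write_table(table, "a.parquet", compression="snappy")
pyarrow.parquet.write_table(table, "b.parquet", compression="snappy")
print(md5("a.parquet") == md5("b.parquet"))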

hagenw (Member, Author) commented May 8, 2024

I agree that compressing the PARQUET file with SNAPPY and storing it directly on the backend seems to be the best solution.
I created #398, which implements this proposal.

hagenw (Member, Author) commented May 28, 2024

We decided to no longer zip the dependency table and to store it directly on the server instead, as implemented in #398.

hagenw closed this as completed May 28, 2024