
Store Dependencies as parquet file #372

Merged
merged 22 commits into dev from dependencies-pyarrow-dtypes on Apr 12, 2024
Conversation

hagenw (Member) commented Feb 13, 2024

Closes #300

Summary:

  • Speed up loading and saving of the dependency table
  • Store dependency table as parquet files on the server and in cache

This pull request uses pyarrow dtypes in the columns of the dataframe representing the dependency table. This, in combination with pyarrow.Table as an intermediate representation for CSV and parquet files, results in faster reading and writing of CSV, pickle, and parquet files:

| task            | before  | pull request |
| --------------- | ------- | ------------ |
| reading csv     | 1.158 s | 0.113 s      |
| reading pickle  | 0.255 s | 0.092 s      |
| reading parquet | -       | 0.085 s      |
| writing csv     | 2.026 s | 0.277 s      |
| writing pickle  | 0.649 s | 0.294 s      |
| writing parquet | -       | 0.273 s      |

Results are for a dependency table holding 1,000,000 entries; see the "Loading and writing to a file" benchmark for full results.
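For illustration, a minimal sketch of the loading approach described above, assuming a dependency table stored in a hypothetical `deps.csv` (this is not the actual audb code):

```python
import pandas as pd
import pyarrow.csv

# Read the CSV with pyarrow's multi-threaded reader into a pyarrow.Table ...
table = pyarrow.csv.read_csv("deps.csv")

# ... and convert it to a dataframe whose columns use pyarrow dtypes.
# types_mapper=pd.ArrowDtype avoids an expensive conversion to NumPy dtypes.
df = table.to_pandas(types_mapper=pd.ArrowDtype)
```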

In addition, this pull request stores dependency tables as parquet files on the server, as reading/writing them is not slower than for pickle files. Parquet files are also smaller, which means faster transfers, and we can later add support for loading only parts of the file for very large datasets. Unfortunately, it is still faster to load from pickle in cache, so we continue storing the dependency table as a pickle file in cache.
For parquet support we add read abilities to audb.Dependencies.load(), write abilities to audb.Dependencies.save(), and extra code that looks for legacy CSV files coming from the server (on the server it is always stored in a ZIP file, so the filename on the server does not change). As loading a dependency table from a parquet file and writing it again in another location might change its MD5 sum, I also adjusted the code to compare dependency tables by directly comparing their dataframes.
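A hedged sketch of what the save/load paths could look like; the function bodies and file handling are simplified assumptions, not the actual implementation of audb.Dependencies:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.csv
import pyarrow.parquet

def save(df: pd.DataFrame, path: str) -> None:
    # Convert the dataframe to a pyarrow.Table and write it as parquet.
    table = pa.Table.from_pandas(df, preserve_index=True)
    pyarrow.parquet.write_table(table, path)

def load(path: str) -> pd.DataFrame:
    if path.endswith(".parquet"):
        table = pyarrow.parquet.read_table(path)
    else:
        # Legacy dataset: fall back to the CSV file coming from the server.
        table = pyarrow.csv.read_csv(path)
    return table.to_pandas(types_mapper=pd.ArrowDtype)
```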

This pull request adds pyarrow as a dependency of audb, which has the downside of adding a package that is 70 MB in size and might not be easy to install on all systems. As pandas is also planning to add pyarrow as a dependency in 3.0, you can find a long discussion about the pros and cons at pandas-dev/pandas#54466 and pandas-dev/pandas#57073. One likely outcome is that there will be a smaller package that can be used as a replacement for pyarrow. At the moment I would argue that the speed improvements we gain and the addition of parquet as a file format are a big enough advantage to justify adding pyarrow as a dependency of audb.


Real world benchmark

I also tested the speedup for loading real world datasets from the cache.
The parquet column corresponds to the case that we use parquet instead of pickle for storing files in cache.
The results are given as the average execution time over 10 runs, with standard deviation.

| dataset                       | before        | pull request  | parquet       |
| ----------------------------- | ------------- | ------------- | ------------- |
| mozillacommonvoice            | 0.465±0.027 s | 0.257±0.033 s | 0.613±0.025 s |
| voxceleb2                     | 0.234±0.008 s | 0.127±0.008 s | 0.280±0.011 s |
| librispeech                   | 0.071±0.007 s | 0.034±0.005 s | 0.086±0.007 s |
| imda-nsc-read-speech-balanced | 0.494±0.018 s | 0.283±0.036 s | 0.552±0.016 s |
| emodb                         | 0.008±0.018 s | 0.006±0.011 s | 0.010±0.012 s |

As can be seen, the implementation in this pull request, which uses pickle files in cache, is the fastest solution, whereas using parquet files in cache would be slower than loading was before.


I updated the "Database dependencies" section in the documentation, as we mention the dependency file there.

[image: updated "Database dependencies" documentation section]

And the docstrings of audb.Dependencies.load() and audb.Dependencies.save().

[images: updated docstrings of audb.Dependencies.load() and audb.Dependencies.save()]


NOTE: the speed improvement for loading and saving CSV files with the help of pyarrow can also easily be added to audformat, without the need to use pyarrow dtypes in the pandas dataframes representing the tables of a dataset.
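As a sketch of that idea: pandas can use pyarrow's CSV parser while keeping the default column dtypes (the file name below is a placeholder):

```python
import pandas as pd

# engine="pyarrow" uses pyarrow's multi-threaded CSV parser, but the
# resulting dataframe still gets the default NumPy-backed dtypes.
df = pd.read_csv("db.files.csv", engine="pyarrow")
```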

/cc @frankenjoe

codecov bot commented Feb 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.0%. Comparing base (44de33f) to head (30cdd3a).

Additional details and impacted files

| Files                     | Coverage        | Δ |
| ------------------------- | --------------- | - |
| audb/core/api.py          | 100.0% <100.0%> | ø |
| audb/core/define.py       | 100.0% <100.0%> | ø |
| audb/core/dependencies.py | 100.0% <100.0%> | ø |
| audb/core/publish.py      | 100.0% <100.0%> | ø |

@hagenw hagenw marked this pull request as ready for review February 14, 2024 13:48
@hagenw hagenw changed the title Use pyarrow for save/load/dtypes in Dependencies Store Dependencies as parquet file Feb 14, 2024
@hagenw hagenw marked this pull request as draft February 14, 2024 13:58
hagenw (Member, Author) commented Feb 14, 2024

@ChristianGeng I marked the pull request as draft, as I realized in the real-world tests that loading a dependency table from cache is actually slightly slower when also using the parquet file in cache (even though the benchmarks suggested that it should be on par).

I'm changing the code at the moment to store the dependency table again as a pickle file in cache.

@hagenw hagenw marked this pull request as ready for review February 14, 2024 14:30
hagenw (Member, Author) commented Feb 14, 2024

I adjusted the code, so that we now again store the dependency table as a pickle file in cache.

It's ready for review.

hagenw (Member, Author) commented Feb 14, 2024

This pull request combines two things: a general speed-up and storing dependencies as parquet files on the server.

If we are worried about backward compatibility of published datasets, we might also consider adding the general speed-up and the read/write support for parquet files in this pull request, but still store the dependency tables as CSV files on the server and only switch to parquet files in a few months. This would ensure that users with older versions of audb are still able to load datasets published with the newest version of audb, once it stores dependency tables as parquet files.

But I would not be too afraid of breaking old code. If users included the version argument when using audb.load(), it will always work, even if newer versions of the same dataset have been published in the meantime that store the dependency table as a parquet file. And when writing new code to access new datasets, it should be fine to tell users that they need at least audb >=1.7.
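For example (dataset name and version are placeholders):

```python
import audb

# Pinning the version guarantees the request resolves to a dataset
# version whose dependency table the installed audb can still read.
db = audb.load("emodb", version="1.3.0")
```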

hagenw (Member, Author) commented Feb 15, 2024

BTW, I also tried to store the dependency table directly as a pyarrow.Table instead of a pandas.DataFrame, as this is by far the fastest solution for reading CSV and parquet files. But the downside is that row-based lookups are much slower that way (compare #356), and we need good performance for both column-based and row-based lookups. I tried to improve row-based performance by adding an index based on a dict, which indeed provides row-based performance similar to what we get with pandas.DataFrame. But then reading the file and building the index is slightly slower than the method proposed in this pull request.
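A minimal sketch of that rejected alternative, with hypothetical column names:

```python
import pyarrow as pa

table = pa.table(
    {
        "file": ["a.wav", "b.wav"],
        "duration": [1.2, 3.4],
    }
)

# Build a dict-based index once after loading:
# file name -> row position in the table.
index = {file: row for row, file in enumerate(table.column("file").to_pylist())}

# A row lookup is then a dict access plus a zero-copy slice
# instead of a scan over the whole column.
row = table.slice(index["b.wav"], length=1)
```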

hagenw (Member, Author) commented Feb 16, 2024

To make it even more confusing than it already is, there seem to be not only the two pyarrow string dtypes currently discussed in the pandas documentation (pandas.StringDtype("pyarrow") and pandas.ArrowDtype(pa.string())), but also a third one (pandas-dev/pandas#54466 (comment)):

  • "string[pyarrow_numpy]" aka pandas.StringDtype("pyarrow_numpy"): introduced in pandas 2.1. Uses np.nan as its missing value to be more backward compatible with the existing default NumPy dtypes, and is the proposed default string type for pandas 3.0

So maybe we should also check here how the performance would be when using this new dtype.
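For reference, the three dtypes side by side (the last one requires pandas >= 2.1):

```python
import pandas as pd
import pyarrow as pa

data = ["a", None]

s1 = pd.Series(data, dtype=pd.StringDtype("pyarrow"))   # "string[pyarrow]", missing value is pd.NA
s2 = pd.Series(data, dtype=pd.ArrowDtype(pa.string()))  # arrow-backed, missing value is pd.NA
s3 = pd.Series(data, dtype="string[pyarrow_numpy]")     # missing value is np.nan
```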

ChristianGeng (Member) left a comment

The review consists of several mostly minor comments.
The main question I have raised is whether it would be possible to consistently use parquet and fully do away with pickle.

ChristianGeng (Member) left a comment

First round of comments.

hagenw (Member, Author) commented Feb 23, 2024

Thanks for the comments. As I have a few more important TODOs, I will come back to this when I have time.

@hagenw hagenw force-pushed the dependencies-pyarrow-dtypes branch from a132186 to 9ee1f24 on March 19, 2024 13:44
@hagenw hagenw merged commit cc1f4d9 into dev Apr 12, 2024
9 checks passed
@hagenw hagenw deleted the dependencies-pyarrow-dtypes branch April 12, 2024 10:06
hagenw added a commit that referenced this pull request May 3, 2024
* Use pyarrow for save/load/dtypes in Dependencies

* Fix dtype mapping

* Fix expected str representation output

* Add pyarrow as dependency

* Add parquet format to save()/load()

* Add tests for parquet files

* Fix docstring of Dependencies.save()

* Publish dependency table as parquet file

* Fix cache handling for docs/publish.rst

* Compare dependency tables instead of MD5 sums

* Store always as parquet in cache

* Fix skipping of old audb caches

* Add LEGACY to old depedendency cache file name

* Use pickle in cache

* Remove debug print statement

* Mention correct dependency file in docs

* Add docstring to test

* Fix comment for errors test

* Simplify dependency file loading code

* Only convert dtype if needed during loading

* Add test for backward compatibility

* Remove unneeded line