Store Dependencies as parquet file #372

Merged
merged 22 commits into dev from dependencies-pyarrow-dtypes on Apr 12, 2024

Conversation

@hagenw (Member) commented Feb 13, 2024

Closes #300

Summary:

  • Speed up loading and saving of the dependency table
  • Store dependency table as parquet files on the server and in cache

This pull request uses pyarrow dtypes in the columns of the dataframe representing the dependency table. In combination with pyarrow.Table as an intermediate representation for CSV and parquet files, this results in faster reading and writing of CSV, pickle, and parquet files:

| task | before | pull request |
| --- | --- | --- |
| reading csv | 1.158 s | 0.113 s |
| reading pickle | 0.255 s | 0.092 s |
| reading parquet | - | 0.085 s |
| writing csv | 2.026 s | 0.277 s |
| writing pickle | 0.649 s | 0.294 s |
| writing parquet | - | 0.273 s |

Results are for a dependency table holding 1,000,000 entries; see the "Loading and writing to a file" benchmark for full results.
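For illustration, the underlying pattern looks roughly like this (a minimal sketch, not the exact audb code; the file name is a placeholder):

```python
import pandas as pd
import pyarrow.csv

# Read the CSV with pyarrow's fast reader, then convert to pandas
# while keeping pyarrow-backed dtypes in the resulting dataframe.
table = pyarrow.csv.read_csv("deps.csv")
df = table.to_pandas(types_mapper=pd.ArrowDtype)
```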

In addition, this pull request stores dependency tables as parquet files on the server, as reading/writing them is not slower than for pickle files. Parquet files are also smaller, which means faster transfers, and we can later add support for loading only parts of the file for very huge datasets. Unfortunately, it is still faster to load from pickle in cache, so we continue storing the dependency table as a pickle file in cache.
For parquet support we add read abilities to audb.Dependencies.load(), write abilities to audb.Dependencies.save(), and extra code that looks for legacy CSV files coming from the server (on the server the table is always stored in a ZIP file, so the filename on the server does not change). As loading a dependency table from a parquet file and writing it again at another place might change its MD5 sum, I also adjusted the code to compare dependency tables by directly comparing their dataframes.
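A minimal sketch of what the extension-based dispatch could look like (a hypothetical helper, not the actual audb.Dependencies.load() implementation):

```python
import os

import pandas as pd
import pyarrow.csv
import pyarrow.parquet


def load_dependency_table(path: str) -> pd.DataFrame:
    r"""Load a dependency table, dispatching on the file extension."""
    ext = os.path.splitext(path)[1]
    if ext == ".pkl":
        # Fast path, used for the cache.
        return pd.read_pickle(path)
    if ext == ".parquet":
        table = pyarrow.parquet.read_table(path)
    elif ext == ".csv":
        # Legacy files coming from the server.
        table = pyarrow.csv.read_csv(path)
    else:
        raise ValueError(f"Unsupported file format: {ext}")
    return table.to_pandas(types_mapper=pd.ArrowDtype)
```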

This pull request adds pyarrow as a dependency of audb, which has the downside of adding a package that is 70 MB in size and might not be easy to install on all systems. As pandas is also planning to add pyarrow as a dependency in 3.0, you can find a long discussion about pros and cons at pandas-dev/pandas#54466 and pandas-dev/pandas#57073. One likely outcome is that there will be a smaller package that can be used as a replacement for pyarrow. At the moment I would argue that the speed improvements we gain and the addition of parquet as a file format are a big enough advantage to justify adding pyarrow as a dependency of audb.


Real world benchmark

I also tested the speedup for loading real world datasets from the cache.
The parquet column corresponds to the case where we use parquet instead of pickle for storing files in cache.
The results are given as execution time averaged over 10 runs, with standard deviation.

| dataset | before | pull request | parquet |
| --- | --- | --- | --- |
| mozillacommonvoice | 0.465±0.027 s | 0.257±0.033 s | 0.613±0.025 s |
| voxceleb2 | 0.234±0.008 s | 0.127±0.008 s | 0.280±0.011 s |
| librispeech | 0.071±0.007 s | 0.034±0.005 s | 0.086±0.007 s |
| imda-nsc-read-speech-balanced | 0.494±0.018 s | 0.283±0.036 s | 0.552±0.016 s |
| emodb | 0.008±0.018 s | 0.006±0.011 s | 0.010±0.012 s |

As can be seen, the current implementation of using pickle files in cache is the fastest solution, whereas using parquet files in cache would be slower than loading was before this pull request.


I updated the "Database dependencies" section in the documentation as we mention the dependency file there.

[screenshot: updated "Database dependencies" documentation section]

And the docstrings of audb.Dependencies.load() and audb.Dependencies.save().

[screenshots: updated docstrings of audb.Dependencies.load() and audb.Dependencies.save()]


NOTE: the speed improvement for loading and saving CSV files with the help of pyarrow can also easily be added to audformat, without the need to use pyarrow dtypes in the pandas dataframes representing the tables of a dataset.
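As a sketch of that idea (assuming pandas >= 1.4, which added the pyarrow CSV engine; the file name is a placeholder):

```python
import pandas as pd

# Use pyarrow's multithreaded CSV parser while keeping the default
# numpy-backed dtypes in the resulting dataframe.
df = pd.read_csv("table.csv", engine="pyarrow")
```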

/cc @frankenjoe

codecov bot commented Feb 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.0%. Comparing base (44de33f) to head (30cdd3a).

Additional details and impacted files

| Files | Coverage Δ |
| --- | --- |
| audb/core/api.py | 100.0% <100.0%> (ø) |
| audb/core/define.py | 100.0% <100.0%> (ø) |
| audb/core/dependencies.py | 100.0% <100.0%> (ø) |
| audb/core/publish.py | 100.0% <100.0%> (ø) |

@hagenw hagenw marked this pull request as ready for review February 14, 2024 13:48
@hagenw hagenw changed the title Use pyarrow for save/load/dtypes in Dependencies Store Dependencies as parquet file Feb 14, 2024
@hagenw hagenw marked this pull request as draft February 14, 2024 13:58
@hagenw (Member, Author) commented Feb 14, 2024

@ChristianGeng I marked the pull request as draft, as I realized in the real world tests that loading a dependency table from cache is actually slightly slower when the parquet file is also used in cache (even though the benchmarks suggested it should be on par).

I'm changing the code at the moment to store the dependency table again as pickle in cache.

@hagenw hagenw marked this pull request as ready for review February 14, 2024 14:30
@hagenw (Member, Author) commented Feb 14, 2024

I adjusted the code so that we now store the dependency table as a pickle file in cache again.

It's ready for review.

@hagenw (Member, Author) commented Feb 14, 2024

This pull request combines two things: a general speed up, and storing dependencies as parquet files on the server.

If we are worried about backward compatibility of published datasets, we might also consider adding the general speed up and read/write support for parquet files in this pull request, but still store the dependency tables as CSV files on the server and only switch to parquet files in a few months. This would make sure that users with older versions of audb are still able to load datasets published with the newest version of audb once it stores dependency tables as parquet files.

But I would not be too afraid of breaking old code. If users included the version argument when using audb.load(), it will always work, even if newer versions of the same dataset that store the dependency table as a parquet file have been published in the meantime. And when writing new code to access new datasets, it should be fine to tell users that they need at least audb >=1.7.
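For example (dataset name and version are illustrative):

```python
import audb

# Pinning the version keeps old code working, even after newer
# dataset versions switch the dependency table to parquet.
db = audb.load("emodb", version="1.4.1")
```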

@hagenw (Member, Author) commented Feb 15, 2024

BTW, I also tried to store the dependency table directly as pyarrow.Table instead of pandas.DataFrame, as this is by far the fastest solution for reading CSV and parquet files. But the downside is that row-based lookups are much slower that way (compare #356), and we need good performance for both column-based and row-based lookups. I tried to improve row-based performance by adding an index based on a dict, which indeed provides row-based performance similar to what we get with pandas.DataFrame. But then reading the file and building the index is slightly slower than the method proposed in this pull request.
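To illustrate the dict-based index idea (a hypothetical sketch, not the code that was benchmarked):

```python
import pyarrow as pa

table = pa.table(
    {
        "file": ["wav/a.wav", "wav/b.wav"],
        "duration": [1.2, 3.4],
    }
)

# Build a dict that maps file name -> row position,
# giving O(1) row lookups on the pyarrow.Table.
index = {file: n for n, file in enumerate(table.column("file").to_pylist())}

row = table.slice(index["wav/b.wav"], 1)  # single-row lookup via the index
```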

@hagenw (Member, Author) commented Feb 16, 2024

To make it even more confusing than it already is, there seem to be not only the two pyarrow string dtypes currently discussed in the pandas documentation (pandas.StringDtype("pyarrow") and pandas.ArrowDtype(pa.string())), but also a third one (pandas-dev/pandas#54466 (comment)):

  • "string[pyarrow_numpy]" aka pandas.StringDtype("pyarrow_numpy"): Introduced in pandas 2.1. Uses np.nan as its missing value to be more backward compatible with existing default NumPy dtypes and is the proposed default string type in pandas 3.0

So maybe we also need to check here how the performance would be when using this new dtype.
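For reference, the three dtypes side by side (assuming pandas >= 2.1 with pyarrow installed):

```python
import pandas as pd
import pyarrow as pa

data = ["a", None]

s1 = pd.array(data, dtype=pd.StringDtype("pyarrow"))   # "string[pyarrow]"
s2 = pd.array(data, dtype=pd.ArrowDtype(pa.string()))  # pandas.ArrowDtype string
s3 = pd.array(data, dtype="string[pyarrow_numpy]")     # new in pandas 2.1; uses
                                                       # np.nan as missing value
```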

@ChristianGeng (Member) left a comment

The review consists of several mostly minor comments.
The main question I have raised is whether it would be possible to consistently use parquet and fully do away with pickle.

@ChristianGeng (Member) left a comment

First round of comments.

@hagenw (Member, Author) commented Feb 23, 2024

Thanks for the comments. As I have a few more important TODOs, I will come back to this when I have time.

@hagenw hagenw force-pushed the dependencies-pyarrow-dtypes branch from a132186 to 9ee1f24 on March 19, 2024 13:44
@hagenw hagenw merged commit cc1f4d9 into dev Apr 12, 2024
9 checks passed
@hagenw hagenw deleted the dependencies-pyarrow-dtypes branch April 12, 2024 10:06
hagenw added a commit that referenced this pull request May 3, 2024
* Use pyarrow for save/load/dtypes in Dependencies

* Fix dtype mapping

* Fix expected str representation output

* Add pyarrow as dependency

* Add parquet format to save()/load()

* Add tests for parquet files

* Fix docstring of Dependencies.save()

* Publish dependency table as parquet file

* Fix cache handling for docs/publish.rst

* Compare dependency tables instead of MD5 sums

* Store always as parquet in cache

* Fix skipping of old audb caches

* Add LEGACY to old depedendency cache file name

* Use pickle in cache

* Remove debug print statement

* Mention correct dependency file in docs

* Add docstring to test

* Fix comment for errors test

* Simplify dependency file loading code

* Only convert dtype if needed during loading

* Add test for backward compatibility

* Remove unneeded line
hagenw added a commit that referenced this pull request May 8, 2024