Store Dependencies as parquet file #372
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
@ChristianGeng I marked the pull request as draft, as I realized in the real world tests that loading a dependency table from cache is actually slightly slower when the parquet file is also used in cache (even though the benchmarks suggested that it should be on par). I'm currently changing the code to store the dependency table as pickle in cache again.
I adjusted the code so that we now again store the dependency table as a pickle file in cache. It's ready for review.
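For reference, a minimal, hypothetical sketch of how such a cache-format comparison can be timed; the file names, table content, and number of repetitions are illustrative and not the actual benchmark code:

```python
import time

import pandas as pd


def average_load_time(load, path, repeats=10):
    """Average loading time in seconds over several runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        load(path)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


# Illustrative dependency-like table
df = pd.DataFrame({"archive": ["a"] * 1_000_000, "duration": [1.0] * 1_000_000})
df.to_pickle("deps.pkl")
df.to_parquet("deps.parquet")  # requires pyarrow (or fastparquet)

print("pickle: ", average_load_time(pd.read_pickle, "deps.pkl"))
print("parquet:", average_load_time(pd.read_parquet, "deps.parquet"))
```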
This pull request combines two things: a general speed up and storing dependencies as parquet files on the server. If we are afraid about backward compatibility of published datasets, we might consider adding the general speed up and the read/write support for parquet files in this pull request, but still store the dependency tables as CSV files on the server and only switch to parquet files in a few months. This would then make sure that also users with older versions of `audb` can still load newly published datasets. But I would not be too afraid about breaking old code. If users included the …
BTW, I also tried to store the dependency table directly as …
To make it even more confusing than it already is, there seems to be not only the two …
So maybe we also need to check here how the performance would be when using this new …
The review consists of several, mostly minor, comments.
The main question I have raised is whether it would be possible to consistently use parquet and fully do away with pickle.
First round of comments.
Thanks for the comments. As I have a few more important TODOs, I will come back to this when I have time.
* Use pyarrow for save/load/dtypes in Dependencies
* Fix dtype mapping
* Fix expected str representation output
* Add pyarrow as dependency
* Add parquet format to save()/load()
* Add tests for parquet files
* Fix docstring of Dependencies.save()
* Publish dependency table as parquet file
* Fix cache handling for docs/publish.rst
* Compare dependency tables instead of MD5 sums
* Store always as parquet in cache
* Fix skipping of old audb caches
* Add LEGACY to old dependency cache file name
* Use pickle in cache
* Remove debug print statement
* Mention correct dependency file in docs
* Add docstring to test
* Fix comment for errors test
* Simplify dependency file loading code
* Only convert dtype if needed during loading
* Add test for backward compatibility
* Remove unneeded line
Closes #300
Summary: store dependencies as `parquet` files on the server and in cache.

This pull request uses `pyarrow` dtypes in the columns of the dataframe representing the dependency table. This, in combination with `pyarrow.Table` as an intermediate representation for CSV and parquet files, results in faster reading and writing of CSV, pickle, and parquet files. Results are for a dependency table holding 1,000,000 entries; see the "Loading and writing to a file" benchmark for full results.
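The CSV reading path can be sketched roughly as follows; the file name, column names, and types are illustrative stand-ins, not the actual dependency table schema:

```python
import pandas as pd
import pyarrow as pa
from pyarrow import csv

# Read the CSV into a pyarrow.Table (much faster than the default pandas parser)
table = csv.read_csv(
    "deps.csv",
    convert_options=csv.ConvertOptions(
        column_types={"archive": pa.string(), "duration": pa.float64()},
    ),
)

# Convert to a pandas dataframe whose columns use pyarrow-backed dtypes
df = table.to_pandas(types_mapper=pd.ArrowDtype)
```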
In addition, this pull request stores dependency tables as parquet files, as reading/writing them is not slower than for pickle files. Parquet files are also smaller, which means faster transfers, and we can later add support for loading only parts of the file for very large datasets. Unfortunately, it is still faster to load from pickle in cache, so we continue storing the dependency table as a pickle file in cache.
For parquet support we add read abilities to `audb.Dependencies.load()`, write abilities to `audb.Dependencies.save()`, and extra code that looks for legacy CSV files coming from the server (on the server the table is always stored in a ZIP file, so the filename on the server does not change). As loading a dependency table from a parquet file and writing it again at another place might change its MD5 sum, I also adjusted the code to compare dependency tables by directly comparing their dataframes.
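A rough sketch of the parquet save/load path and of the dataframe-based comparison, using an illustrative dataframe rather than the actual `audb.Dependencies` internals:

```python
import pandas as pd
import pyarrow as pa
from pyarrow import parquet

df = pd.DataFrame({"archive": ["a1", "a2"], "duration": [1.0, 2.5]})

# Save: go through a pyarrow.Table and write it as a parquet file
table = pa.Table.from_pandas(df, preserve_index=False)
parquet.write_table(table, "deps.parquet")

# Load: read the parquet file back into a dataframe
df_loaded = parquet.read_table("deps.parquet").to_pandas()

# Loading and re-writing the parquet file elsewhere may change its MD5 sum,
# so compare the dataframes directly instead of comparing checksums
print(df.equals(df_loaded))  # True
```

This pull request adds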
`pyarrow` as a dependency of `audb`, which has the downside of adding a package that is 70 MB in size and might not be easy to install on all systems. As `pandas` is also planning to add `pyarrow` as a dependency in 3.0, you can find a long discussion about the pros and cons at pandas-dev/pandas#54466 and pandas-dev/pandas#57073. One likely outcome is that there will be a smaller package than `pyarrow` that can be used as a replacement for it. At the moment I would argue that the speed improvements we gain and the addition of parquet as a file format provide a big enough advantage to justify adding `pyarrow` as a dependency of `audb`.

Real world benchmark
I also tested the speedup for loading real world datasets from the cache.
The parquet column corresponds to the case that we use `parquet` instead of `pickle` for storing files in cache.
The results are given as the average execution time over 10 runs, with standard deviation.
As can be seen, the current implementation of using pickle files in cache is the fastest solution,
whereas using parquet files in cache is slower than loading was before.
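A possible way to obtain this kind of measurement; the dataset name is only an example, and after the first run the data is served from the cache:

```python
import statistics
import timeit

# Time 10 (cached) loading calls and report mean ± standard deviation
runs = timeit.repeat(
    "audb.load('emodb', verbose=False)",
    setup="import audb",
    repeat=10,
    number=1,
)
print(f"{statistics.mean(runs):.2f}s ± {statistics.stdev(runs):.2f}s")
```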
I updated the "Database dependencies" section in the documentation, as we mention the dependency file there, and the docstrings of `audb.Dependencies.load()` and `audb.Dependencies.save()`.
NOTE: the speed improvement for loading and saving CSV files with the help of `pyarrow` can also easily be added to `audformat`, without the need of using `pyarrow` dtypes in the `pandas` dataframes representing the tables of a dataset.

/cc @frankenjoe
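For illustration, one hypothetical way such a change could look in principle, reading the CSV through `pyarrow` for speed but keeping the default numpy-backed dtypes (the file name is a placeholder):

```python
from pyarrow import csv

table = csv.read_csv("db.files.csv")  # fast pyarrow CSV parser
df = table.to_pandas()                # default conversion, numpy-backed dtypes
print(df.dtypes)                      # no pyarrow dtypes end up in the table
```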