
Store dependency table in different format #300

Closed

hagenw opened this issue Apr 28, 2023 · 3 comments

Comments

@hagenw
Member

hagenw commented Apr 28, 2023

For tables we support CSV to provide them in a human-readable format, but this is not necessary for the dependency table. In addition, the dependency table is frequently accessed to gather basic information about a database.

I think it would make sense to switch to another format when storing it for new databases. It should be fast to read, and maybe support reading only parts of it, such as single columns or rows, to make sure it always fits in memory.
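
As a minimal sketch of what such partial reading could look like, for example with a columnar format like parquet (discussed below); the file name and column names here are illustrative placeholders, not audb's actual schema:

```python
import pyarrow.parquet as pq

# Read only the columns we need, instead of the whole table;
# parquet is columnar, so unselected columns are never loaded.
# "deps.parquet" and the column names are placeholders.
table = pq.read_table("deps.parquet", columns=["archive", "duration"])
df = table.to_pandas()
```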

@hagenw
Member Author

hagenw commented Jan 22, 2024

As a side note, pyarrow will become a dependency of pandas anyway: https://github.com/pandas-dev/pandas/blob/main/web/pandas/pdeps/0010-required-pyarrow-dependency.md
So it should be fine if we start integrating pyarrow-based approaches here as well, e.g., storing dependencies as parquet files.
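
A minimal sketch of what storing as parquet could look like, assuming a pandas DataFrame as the internal representation (the columns, dtypes, and file name are toy placeholders, not audb's actual schema):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Toy stand-in for the dependency table;
# the real columns and dtypes are defined by audb.
df = pd.DataFrame(
    {
        "archive": ["a1", "a2"],
        "duration": [1.5, 2.0],
    },
    index=pd.Index(["f1.wav", "f2.wav"], name="file"),
)

# Convert to a pyarrow.Table and store it as a parquet file.
table = pa.Table.from_pandas(df, preserve_index=True)
pq.write_table(table, "deps.parquet")
```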

@hagenw
Member Author

hagenw commented Feb 14, 2024

We now have benchmark results comparing csv, pickle, and parquet files for storing the dependency table, available at https://github.com/audeering/audb/tree/a8bb3367a37fae79601e189ccac76a1a12105bae/benchmarks#audbdependencies-loadingwriting-to-file.

We first focus on the results for reading, as it is performed more often than writing.

  • The fastest results are achieved when using pyarrow dtypes for the internal dataframe representation of the dependency table, together with pyarrow.Table for reading from csv and parquet files (see the sketch after this list)
  • Reading speed is nearly identical for csv, pickle, and parquet files, and takes around 0.1 s for a dependency table with 1,000,000 entries
  • The parquet file needs only around 20% of the disk space (20 MB for 1,000,000 entries)
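
As a rough sketch of that fastest reading variant (the file name is a placeholder; pd.ArrowDtype requires pandas >= 2.0):

```python
import pandas as pd
import pyarrow.parquet as pq

# Read the parquet file into a pyarrow.Table, then convert it
# to a dataframe with pyarrow-backed dtypes instead of numpy ones.
# "deps.parquet" is an illustrative file name.
table = pq.read_table("deps.parquet")
df = table.to_pandas(types_mapper=pd.ArrowDtype)
```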

When looking at writing performance, we get:

  • Writing speed is nearly identical for csv, pickle, and parquet, and takes around 0.25 s for a dependency table with 1,000,000 entries

With those results in mind, it seems reasonable to switch to storing the dependency table directly as parquet files, both on the server and in the cache.

@hagenw
Member Author

hagenw commented Jun 21, 2024

Solved by #372.

@hagenw hagenw closed this as completed Jun 21, 2024