
Implementation discussion - parallelization #2

Closed
bethac07 opened this issue Apr 2, 2021 · 5 comments


@bethac07 (Member) commented Apr 2, 2021

The following is off-topic (not usage-related) and is getting into implementation, so feel free to ignore for now, or bump to another thread.

It currently takes ~3-5 hours to ingest a file. Our current choice of SQLite does not allow us to do parallel writes, so there's no way to parallelize this.

But now we can, because we can store the output as a Parquet dataset, which can span multiple files.

So for a 384-well dataset, we can save the output as a Parquet dataset with, say, 24 files (one for each column of the 384-well plate). This will also make parallel reads faster, so that, e.g., aggregation runs faster.

Originally posted by @shntnu in #1 (comment)
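A minimal sketch of what per-plate-column sharding could look like with pandas/pyarrow (the input path and the derived `Metadata_PlateColumn` helper column are illustrative, not an existing convention; `Metadata_Well` follows the naming mentioned later in this thread):

```python
import pandas as pd

# Hypothetical per-cell measurements with a Metadata_Well column ("A01".."P24").
df = pd.read_csv("per_cell_measurements.csv")

# Derive the plate column (1-24) from the well label.
df["Metadata_PlateColumn"] = df["Metadata_Well"].str[1:].astype(int)

# pandas delegates to pyarrow: partition_cols produces one subdirectory
# per plate column, i.e. 24 independently writable/readable shards.
df.to_parquet("plate.parquet", partition_cols=["Metadata_PlateColumn"])
```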

@bethac07 (Member, Author) commented Apr 2, 2021

Per @0x00b1's comment in #1 (comment):

24 is the number of columns in a "standard" plate (384, which is 24 columns x 16 rows).

@shntnu (Member) commented Apr 2, 2021

It would be fantastic if we could make use of the way Parquet structures data, so that it is trivial to read an entire batch of data as a single Parquet dataset, even if the individual plates were created separately.

The ParquetDataset class accepts either a directory name or a list of file paths, and can discover and infer some common partition structures, such as those produced by Hive:

https://arrow.apache.org/docs/python/parquet.html#reading-from-partitioned-datasets
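As a minimal sketch (assuming each plate was written to its own partitioned directory under a common `batch/` folder; the path and feature column name are hypothetical):

```python
import pyarrow.parquet as pq

# Discovers Hive-style partition directories (e.g. batch/plate=.../col=.../...)
# and exposes the whole batch as one logical dataset.
dataset = pq.ParquetDataset("batch/")

# Column projection keeps a batch-wide read cheap.
table = dataset.read(columns=["Metadata_Well", "Cells_AreaShape_Area"])
df = table.to_pandas()
```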

@shntnu mentioned this issue Apr 2, 2021
@0x00b1 (Contributor) commented Apr 8, 2021

Hi, @bethac07 and @shntnu, I'm working on sharding now. What column has the well column index? And would you prefer column × row (i.e. n = 384)?

@shntnu (Member) commented Apr 8, 2021

> What column has the well column index?

It is typically called `Metadata_Well`.

> And would you prefer column × row (i.e. n = 384)?

I think per-column sharding is sufficient, but can you tell us what tradeoffs we should consider between too much and too little sharding?

Each well will typically have 2,000 to 4,000 rows (= number of cells) × 2,000 to 5,000 columns.
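For scale, a back-of-envelope estimate of shard sizes under those figures (a sketch: float64 values and midpoint counts are my assumptions, and Parquet compression is ignored):

```python
# Rough uncompressed shard sizes for per-well vs. per-plate-column sharding.
cells_per_well = 3_000   # midpoint of the 2,000-4,000 range above
features = 3_500         # midpoint of the 2,000-5,000 range above
bytes_per_value = 8      # assuming float64 features

per_well = cells_per_well * features * bytes_per_value
per_plate_column = 16 * per_well  # 16 wells per column of a 384-well plate

print(f"per well:         {per_well / 1e6:.0f} MB")          # ~84 MB
print(f"per plate column: {per_plate_column / 1e9:.2f} GB")  # ~1.34 GB
```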

@0x00b1 (Contributor) commented Apr 8, 2021

Great. This is merged.

@0x00b1 closed this as completed Apr 8, 2021