
Implementation discussion - parallelization #2

Closed
bethac07 opened this issue Apr 2, 2021 · 5 comments


@bethac07 (Member) commented Apr 2, 2021

The following is off-topic (not usage-related) and is getting into implementation, so feel free to ignore for now, or bump to another thread.

It currently takes ~3-5 hours to ingest a file. Our current choice of SQLite does not allow us to do parallel writes, so there's no way to parallelize this.

But now we can, because we can store the output as a Parquet dataset, which can span multiple files.

So for a 384-well dataset, we can save the output as a Parquet dataset with, say, 24 files (one for each column of the 384-well plate). This will also make parallel reads faster, so that, e.g., aggregation runs faster.

Originally posted by @shntnu in #1 (comment)
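A minimal sketch of what per-plate-column sharding could look like with pandas/pyarrow (the input path and the derived `Metadata_PlateColumn` helper column are illustrative, not an existing convention; `Metadata_Well` follows the naming mentioned later in this thread):

```python
import pandas as pd

# Hypothetical per-cell measurements with a Metadata_Well column ("A01".."P24").
df = pd.read_csv("per_cell_measurements.csv")

# Derive the plate column (1-24) from the well label.
df["Metadata_PlateColumn"] = df["Metadata_Well"].str[1:].astype(int)

# pandas delegates to pyarrow: partition_cols produces one subdirectory
# per plate column, i.e. 24 independently writable/readable shards.
df.to_parquet("plate.parquet", partition_cols=["Metadata_PlateColumn"])
```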

@bethac07 (Member, Author) commented Apr 2, 2021

Per @0x00b1's comment in #1 (comment):

24 is the number of columns in a "standard" plate (384, which is 24 columns x 16 rows).

@shntnu (Member) commented Apr 2, 2021

It would be fantastic if we could make use of the way Parquet structures data, so that it is trivial to read an entire batch of data as a single Parquet dataset, even if the individual plates were created separately.

The ParquetDataset class accepts either a directory name or a list of file paths, and can discover and infer some common partition structures, such as those produced by Hive:

https://arrow.apache.org/docs/python/parquet.html#reading-from-partitioned-datasets
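As a minimal sketch (assuming each plate was written to its own partitioned directory under a common `batch/` folder; the path and feature column name are hypothetical):

```python
import pyarrow.parquet as pq

# Discovers Hive-style partition directories (e.g. batch/plate=.../col=.../...)
# and exposes the whole batch as one logical dataset.
dataset = pq.ParquetDataset("batch/")

# Column projection keeps a batch-wide read cheap.
table = dataset.read(columns=["Metadata_Well", "Cells_AreaShape_Area"])
df = table.to_pandas()
```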

@shntnu mentioned this issue Apr 2, 2021
@0x00b1 (Contributor) commented Apr 8, 2021

Hi, @bethac07 and @shntnu, I'm working on sharding now. What column has the well column index? And would you prefer column × row (i.e. n = 384)?

@shntnu (Member) commented Apr 8, 2021

> What column has the well column index?

It is typically called `Metadata_Well`.

> And would you prefer column × row (i.e. n = 384)?

I think per-column sharding is sufficient, but can you tell us what tradeoffs we should consider between too much and too little sharding?

Each well will typically have 2,000 to 4,000 rows (= number of cells) × 2,000 to 5,000 columns.
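For scale, a back-of-envelope estimate of shard sizes under those figures (a sketch: float64 values and midpoint counts are my assumptions, and Parquet compression is ignored):

```python
# Rough uncompressed shard sizes for per-well vs. per-plate-column sharding.
cells_per_well = 3_000   # midpoint of the 2,000-4,000 range above
features = 3_500         # midpoint of the 2,000-5,000 range above
bytes_per_value = 8      # assuming float64 features

per_well = cells_per_well * features * bytes_per_value
per_plate_column = 16 * per_well  # 16 wells per column of a 384-well plate

print(f"per well:         {per_well / 1e6:.0f} MB")          # ~84 MB
print(f"per plate column: {per_plate_column / 1e9:.2f} GB")  # ~1.34 GB
```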

@0x00b1 (Contributor) commented Apr 8, 2021

Great. This is merged.

@0x00b1 closed this as completed Apr 8, 2021