
Usage decisions #1

Open
gwaybio opened this issue Apr 2, 2021 · 19 comments

Comments

@gwaybio
Member

gwaybio commented Apr 2, 2021

How shall users interact with the codebase? Let's track our thoughts and decide here.

@0x00b1
Contributor

0x00b1 commented Apr 2, 2021

My current thinking for the command line interface:

Usage: cytominer-transport [OPTIONS] SOURCE OUTPUT

Options:
  --compartment FILENAME
  --experiment FILENAME
  --images FILENAME

--compartment can be passed multiple times in the same command, e.g.

$ cytominer-transport . experiment.parquet \
    --compartment foo.csv                  \
    --compartment bar.csv                  \
    --compartment baz.csv                  \
    --experiment experiment.csv            \
    --images images.csv

If SOURCE includes the following files:

./source/foo.csv
./source/bar.csv
./source/baz.csv
./source/experiment.csv
./source/image.csv

the command can be shortened to:

$ cytominer-transport ./source experiment.parquet
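A minimal sketch of how the shorthand could work: when no file options are given, infer the tables from SOURCE. The fixed names experiment.csv and image.csv are assumptions taken from the listing above, and the helper name is hypothetical, not part of cytominer-transport.

```python
from pathlib import Path

# Filenames assumed to be special; everything else *.csv is a compartment.
KNOWN = {"experiment.csv", "image.csv"}

def discover(source):
    """Infer (experiment, images, compartments) paths from a source directory."""
    source = Path(source)
    experiment = source / "experiment.csv"
    images = source / "image.csv"
    # Every remaining CSV (foo.csv, bar.csv, baz.csv, ...) is a compartment.
    compartments = sorted(
        p for p in source.glob("*.csv") if p.name not in KNOWN
    )
    return experiment, images, compartments
```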

@shntnu
Member

shntnu commented Apr 2, 2021

I like it!

tagging @bethac07

@0x00b1
Contributor

0x00b1 commented Apr 2, 2021

@gwaygenomics @shntnu @bethac07 Would you like a Python API (e.g. to_pandas and to_numpy functions in addition to a to_parquet function)?

@gwaybio
Member Author

gwaybio commented Apr 2, 2021

looks good to me.

In cytominer-database, we had an --engine option, in which one could specify either parquet or sqlite. Will cytominer-transport infer the engine from the file extension? I think scrapping sqlite entirely will be good for this package, but it might be worth splitting out an engine option for the inevitability that better data structures become available.

Also, if we have noncanonical compartments, would your compartment-naked option be:

$ cytominer-transport ./source experiment.parquet --compartment whatever.csv
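Inferring the engine from the extension could look like the sketch below. The mapping and function name are assumptions for illustration, not a committed design; a future format would just add an entry to the table.

```python
from pathlib import Path

# Hypothetical extension-to-engine table; only parquet for now,
# extensible if better data structures become available.
ENGINES = {".parquet": "parquet"}

def infer_engine(output):
    """Pick the output engine from OUTPUT's file extension."""
    suffix = Path(output).suffix.lower()
    try:
        return ENGINES[suffix]
    except KeyError:
        raise ValueError(f"unsupported output format: {suffix!r}")
```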

@0x00b1
Contributor

0x00b1 commented Apr 2, 2021

Also, if we have noncanonical compartments, would your compartment-naked option be:

Yes. The user can specify any filename they'd like for compartment (experiment and image too).

@gwaybio
Member Author

gwaybio commented Apr 2, 2021

Would you like a Python API (e.g. to_pandas and to_numpy functions in addition to a to_parquet function)?

parquet is 100% compatible with pandas, so no need IMO

@shntnu
Member

shntnu commented Apr 2, 2021

I'm very much in favor of scrapping SQLite

parquet is 100% compatible with pandas, so no need IMO

I agree

@0x00b1
Contributor

0x00b1 commented Apr 2, 2021

@gwaygenomics I was thinking something like:

to_pandas(experiment: Path, image: Path, compartments: List[Path]) -> pandas.DataFrame
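One possible shape for that signature, sketched under assumptions: compartment tables join to the image table on an ImageNumber key (CellProfiler-style output), compartment columns get prefixed by filename stem, and experiment.csv holds run-level metadata that this sketch leaves aside. None of this is a confirmed design.

```python
from pathlib import Path
from typing import List

import pandas as pd

def to_pandas(experiment: Path, image: Path,
              compartments: List[Path]) -> pd.DataFrame:
    """Join image and compartment CSVs into one DataFrame (sketch)."""
    # experiment.csv typically holds run-level metadata; not merged here.
    merged = pd.read_csv(image)
    for compartment in compartments:
        table = pd.read_csv(compartment)
        prefix = Path(compartment).stem
        # Prefix columns by source table, but keep the join key bare.
        table = table.add_prefix(f"{prefix}_").rename(
            columns={f"{prefix}_ImageNumber": "ImageNumber"}
        )
        merged = merged.merge(table, on="ImageNumber")
    return merged
```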

@0x00b1
Contributor

0x00b1 commented Apr 2, 2021

@gwaygenomics and @shntnu Would you still like a public to_parquet function?

@shntnu
Member

shntnu commented Apr 2, 2021

Usage: cytominer-transport [OPTIONS] SOURCE OUTPUT

Oh – and this is a big one because listdir will not be easy:

SOURCE and OUTPUT can be S3 URLs.

@gwaybio
Member Author

gwaybio commented Apr 2, 2021

Would you still like a public to_parquet function?

Yeah, I think so. I think our lab will mostly use this package via command line (assay devs final step before handoff), but I can imagine a scenario in which someone would want to run an image-based profiling pipeline end-to-end in python

@shntnu
Member

shntnu commented Apr 2, 2021

Would you still like a public to_parquet function?

That's wise to have a public API. One use case I can think of: future profiling recipes can use it directly.

@shntnu
Member

shntnu commented Apr 2, 2021

I can imagine a scenario in which someone would want to run an image-based profiling pipeline end-to-end in python

ditto

@shntnu
Member

shntnu commented Apr 2, 2021

SOURCE and OUTPUT can be S3 URLs.

Enabling this will likely result in a 5x performance improvement off the bat because our current approach is to mount the bucket using s3fs and then access it :D

@shntnu

This comment has been minimized.

@0x00b1
Contributor

0x00b1 commented Apr 2, 2021

@shntnu Good to know. Where does your n (i.e., 24) come from? It can be arbitrary, but ideally, it would correspond to some structural detail of the experiment.

Would you mind putting this in a separate issue for tracking purposes?

@bethac07
Member

bethac07 commented Apr 2, 2021

So is the idea that you would call this on a folder of data, or on one subfolder at a time? AKA if our structure is the below, is SOURCE plate or A01 - Site 1 or either?

-- plate
   |
   -- A01 - Site 1
     | 
     -- Image.csv
     -- Experiment.csv
     -- Foo.csv
   -- A01 - Site 2
     | 
     -- Image.csv
     -- Experiment.csv
     -- Foo.csv

@bethac07
Member

bethac07 commented Apr 2, 2021

(I'm fine with either behavior, as long as we're all on the same page as to what it is. In cytominer-database we'd be calling it on plate, but my reading of the suggested usage is that going forward we'd be calling it on each site, with those dumping back into a common experiment.parquet.)

@shntnu
Member

shntnu commented Apr 2, 2021

So is the idea that you would call this on a folder of data, or on one subfolder at a time?

Not sure if the question is for @0x00b1, but I think we should call it on a folder of data; SOURCE is plate.

I have a related implementation comment here #2 (comment)
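Calling it on the plate folder would mean walking the site subdirectories and concatenating their tables. A sketch under assumptions: each site folder contributes one Image.csv, and a site index column (TableNumber-style naming, not confirmed) records which site each row came from.

```python
from pathlib import Path

import pandas as pd

def load_plate_images(plate):
    """Concatenate Image.csv from every site subfolder of a plate (sketch)."""
    sites = sorted(p for p in Path(plate).iterdir() if p.is_dir())
    frames = []
    for i, site in enumerate(sites):
        frame = pd.read_csv(site / "Image.csv")
        # Tag each row with its site index so provenance survives the concat.
        frame.insert(0, "TableNumber", i)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```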
