
Usage decisions #1

Open
gwaybio opened this issue Apr 2, 2021 · 19 comments

Comments

@gwaybio
Member

gwaybio commented Apr 2, 2021

How shall users interact with the codebase? Let's track our thoughts and decide here.

@0x00b1
Contributor

0x00b1 commented Apr 2, 2021

My current thinking for the command line interface:

Usage: cytominer-transport [OPTIONS] SOURCE OUTPUT

Options:
  --compartment FILENAME
  --experiment FILENAME
  --images FILENAME

--compartment can be passed multiple times in the same command, e.g.

$ cytominer-transport . experiment.parquet \
    --compartment foo.csv                  \
    --compartment bar.csv                  \
    --compartment baz.csv                  \
    --experiment experiment.csv            \
    --images images.csv

If SOURCE includes the following files:

./source/foo.csv
./source/bar.csv
./source/baz.csv
./source/experiment.csv
./source/image.csv

the command can be shortened to:

$ cytominer-transport ./source experiment.parquet
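A minimal sketch of how the shorthand could work: when no file options are given, infer the tables from SOURCE. The fixed names experiment.csv and image.csv are assumptions taken from the listing above, and the helper name is hypothetical, not part of cytominer-transport.

```python
from pathlib import Path

# Filenames assumed to be special; everything else *.csv is a compartment.
KNOWN = {"experiment.csv", "image.csv"}

def discover(source):
    """Infer (experiment, images, compartments) paths from a source directory."""
    source = Path(source)
    experiment = source / "experiment.csv"
    images = source / "image.csv"
    # Every remaining CSV (foo.csv, bar.csv, baz.csv, ...) is a compartment.
    compartments = sorted(
        p for p in source.glob("*.csv") if p.name not in KNOWN
    )
    return experiment, images, compartments
```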

@shntnu
Member

shntnu commented Apr 2, 2021

I like it!

tagging @bethac07

@0x00b1
Contributor

0x00b1 commented Apr 2, 2021

@gwaygenomics @shntnu @bethac07 Would you like a Python API (e.g. to_pandas and to_numpy functions in addition to a to_parquet function)?

@gwaybio
Member Author

gwaybio commented Apr 2, 2021

looks good to me.

In cytominer-database, we had an --engine option, in which one could specify either parquet or sqlite. Will cytominer-transport infer the engine from the file extension? I think scrapping sqlite entirely will be good for this package, but it might be worth splitting out an engine option for the inevitability that better data structures become available.

Also, if we have noncanonical compartments, would your compartment-naked option be:

$ cytominer-transport ./source experiment.parquet --compartment whatever.csv
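Inferring the engine from the extension could look like the sketch below. The mapping and function name are assumptions for illustration, not a committed design; a future format would just add an entry to the table.

```python
from pathlib import Path

# Hypothetical extension-to-engine table; only parquet for now,
# extensible if better data structures become available.
ENGINES = {".parquet": "parquet"}

def infer_engine(output):
    """Pick the output engine from OUTPUT's file extension."""
    suffix = Path(output).suffix.lower()
    try:
        return ENGINES[suffix]
    except KeyError:
        raise ValueError(f"unsupported output format: {suffix!r}")
```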

@0x00b1
Contributor

0x00b1 commented Apr 2, 2021

Also, if we have noncanonical compartments, would your compartment-naked option be:

Yes. The user can specify any filename they'd like for compartment (experiment and image too).

@gwaybio
Member Author

gwaybio commented Apr 2, 2021

Would you like a Python API (e.g. to_pandas and to_numpy functions in addition to a to_parquet function)?

parquet is 100% compatible with pandas, so no need IMO

@shntnu
Member

shntnu commented Apr 2, 2021

I'm very much in favor of scrapping SQLite

parquet is 100% compatible with pandas, so no need IMO

I agree

@0x00b1
Contributor

0x00b1 commented Apr 2, 2021

@gwaygenomics I was thinking something like:

to_pandas(experiment: Path, image: Path, compartments: List[Path]) -> pandas.DataFrame
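One possible shape for that signature, sketched under assumptions: compartment tables join to the image table on an ImageNumber key (CellProfiler-style output), compartment columns get prefixed by filename stem, and experiment.csv holds run-level metadata that this sketch leaves aside. None of this is a confirmed design.

```python
from pathlib import Path
from typing import List

import pandas as pd

def to_pandas(experiment: Path, image: Path,
              compartments: List[Path]) -> pd.DataFrame:
    """Join image and compartment CSVs into one DataFrame (sketch)."""
    # experiment.csv typically holds run-level metadata; not merged here.
    merged = pd.read_csv(image)
    for compartment in compartments:
        table = pd.read_csv(compartment)
        prefix = Path(compartment).stem
        # Prefix columns by source table, but keep the join key bare.
        table = table.add_prefix(f"{prefix}_").rename(
            columns={f"{prefix}_ImageNumber": "ImageNumber"}
        )
        merged = merged.merge(table, on="ImageNumber")
    return merged
```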

@0x00b1
Contributor

0x00b1 commented Apr 2, 2021

@gwaygenomics and @shntnu Would you still like a public to_parquet function?

@shntnu
Member

shntnu commented Apr 2, 2021

Usage: cytominer-transport [OPTIONS] SOURCE OUTPUT

Oh – and this is a big one because listdir will not be easy:

SOURCE and OUTPUT can be S3 URLs.

@gwaybio
Member Author

gwaybio commented Apr 2, 2021

Would you still like a public to_parquet function?

Yeah, I think so. I think our lab will mostly use this package via command line (assay devs final step before handoff), but I can imagine a scenario in which someone would want to run an image-based profiling pipeline end-to-end in python

@shntnu
Member

shntnu commented Apr 2, 2021

Would you still like a public to_parquet function?

That's wise to have a public API. One use case I can think of: future profiling recipes can use it directly.

@shntnu
Member

shntnu commented Apr 2, 2021

I can imagine a scenario in which someone would want to run an image-based profiling pipeline end-to-end in python

ditto

@shntnu
Member

shntnu commented Apr 2, 2021

SOURCE and OUTPUT can be S3 URLs.

Enabling this will likely result in a 5x performance improvement off the bat because our current approach is to mount the bucket using s3fs and then access it :D

@shntnu

This comment has been minimized.

@0x00b1
Contributor

0x00b1 commented Apr 2, 2021

@shntnu Good to know. Where does your n (i.e., 24) come from? It can be arbitrary, but ideally, it would correspond to some structural detail of the experiment.

Would you mind putting this in a separate issue for tracking purposes?

@bethac07
Member

bethac07 commented Apr 2, 2021

So is the idea that you would call this on a folder of data, or on one subfolder at a time? AKA if our structure is the below, is SOURCE plate or A01 - Site 1 or either?

-- plate
   |
   -- A01 - Site 1
     | 
     -- Image.csv
     -- Experiment.csv
     -- Foo.csv
   -- A01 - Site 2
     | 
     -- Image.csv
     -- Experiment.csv
     -- Foo.csv

@bethac07
Member

bethac07 commented Apr 2, 2021

(I'm fine with either behavior, as long as we're all on the same page as to what it is. In cytominer-database we'd be calling it on plate, but my reading of the suggested usage is that going forward we'd be calling it on each site, with those dumping back into a common experiment.parquet.)

@shntnu
Member

shntnu commented Apr 2, 2021

So is the idea that you would call this on a folder of data, or on one subfolder at a time?

Not sure if the question is for @0x00b1, but I think we should call it on a folder of data; SOURCE is plate.

I have a related implementation comment here #2 (comment)
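Calling it on the plate folder would mean walking the site subdirectories and concatenating their tables. A sketch under assumptions: each site folder contributes one Image.csv, and a site index column (TableNumber-style naming, not confirmed) records which site each row came from.

```python
from pathlib import Path

import pandas as pd

def load_plate_images(plate):
    """Concatenate Image.csv from every site subfolder of a plate (sketch)."""
    sites = sorted(p for p in Path(plate).iterdir() if p.is_dir())
    frames = []
    for i, site in enumerate(sites):
        frame = pd.read_csv(site / "Image.csv")
        # Tag each row with its site index so provenance survives the concat.
        frame.insert(0, "TableNumber", i)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```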
