Define folder structures and implement data versioning #54

shntnu · 2020-05-01T17:52:09Z

We want to address two issues here:

Define a new folder structure for profiling experiments.
Identify which of the components will be version controlled.

I will update this comment periodically as the strategy evolves. I realize this is not ideal because it upsets the chronology of discussions.

This is our current folder structure specified in the Profiling Handbook. This differs slightly from the folder structure specified in the Cell Painting Gallery. For this level of nesting (under workspace) the only discrepancy is metadata/platemaps (see #70); consensus and collated are currently missing in the Gallery, but that is not a discrepancy per se.

This is the proposed folder structure in the Profiling Handbook:

├── profiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167_augmented.csv
│           ├── SQ00015167_normalized.csv
│           ├── SQ00015167_normalized_feature_select.csv
│           └── SQ00015167_spherized.csv
├── collated (*)
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized_feature_select.parquet
│       └── 2016_04_01_a549_48hr_batch1_spherized.parquet
├── consensus (*)
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       └── 2016_04_01_a549_48hr_batch1_spherized.parquet
├── backend
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167.csv
│           └── SQ00015167.sqlite 
├── load_data_csv
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── load_data.csv
│           └── load_data_with_illum.csv
├── log 
├── metadata
│   └── 2016_04_01_a549_48hr_batch1
│       ├── barcode_platemap.csv
│       └── platemap
│           └── C-7161-01-LM6-006.txt
└── pipelines

* collated and consensus files are saved as parquet to allow fast loading.
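
As a rough illustration of how the tree above maps onto file paths, here is a minimal Python sketch. The helper names (`plate_profile_paths`, `collated_paths`) and the `workspace` root argument are hypothetical and are not part of pycytominer or the Handbook; only the folder names and file suffixes are taken from the tree.

```python
from pathlib import PurePosixPath

# Suffixes copied from the tree above; everything else here is illustrative.
PROFILE_SUFFIXES = ["augmented", "normalized", "normalized_feature_select", "spherized"]


def plate_profile_paths(workspace, batch, plate):
    """Return the per-plate CSV paths expected under profiles/<batch>/<plate>/."""
    plate_dir = PurePosixPath(workspace) / "profiles" / batch / plate
    return [plate_dir / f"{plate}_{suffix}.csv" for suffix in PROFILE_SUFFIXES]


def collated_paths(workspace, batch):
    """Return the per-batch Parquet paths expected under collated/<batch>/."""
    batch_dir = PurePosixPath(workspace) / "collated" / batch
    return [batch_dir / f"{batch}_{suffix}.parquet" for suffix in PROFILE_SUFFIXES]


if __name__ == "__main__":
    for path in plate_profile_paths("workspace", "2016_04_01_a549_48hr_batch1", "SQ00015167"):
        print(path)
```

Note that the consensus folder in the tree lists only three of these suffixes (no `normalized_feature_select`), so it would need its own suffix list.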

We will version these folders by placing them inside the project repo:

| folder | generator |
| --- | --- |
| profiles | pycytominer |
| collated | pycytominer |
| consensus | pycytominer |
| load_data_csv | pe2loaddata |
| log | GNU parallel (when running various commands) |
| metadata | manual |
| pipelines | manual |

We will not version these folders:

| folder | generator | reason |
| --- | --- | --- |
| backend | cytominer-database | |
| analysis | CellProfiler, Distributed-CellProfiler | redundant with SQLite backend |
| images | Microscope | Never changes, and too big! |
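
The two tables above can be read as a simple lookup from top-level folder to "versioned or not". A minimal sketch follows, assuming paths are given relative to the workspace; the function name `is_versioned` is hypothetical and the folder sets are copied from the tables as they stand at this point in the discussion.

```python
# Folder names copied from the two tables above.
VERSIONED = {"profiles", "collated", "consensus", "load_data_csv", "log", "metadata", "pipelines"}
NOT_VERSIONED = {"backend", "analysis", "images"}


def is_versioned(relative_path):
    """Return True if the top-level folder of relative_path is version controlled."""
    top = relative_path.strip("/").split("/")[0]
    if top in VERSIONED:
        return True
    if top in NOT_VERSIONED:
        return False
    # Fail loudly on folders the tables do not cover rather than guessing.
    raise ValueError(f"unknown top-level folder: {top!r}")
```

For example, `is_versioned("metadata/2016_04_01_a549_48hr_batch1/barcode_platemap.csv")` returns `True`, while any path under `backend` returns `False`.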


shntnu · 2020-05-10T23:57:54Z

I propose we split backend into ~~backend~~ single_cell and profiles.

single_cell will have only Level 2b i.e. the SQLite / Parquet file
profiles will have Level 3 upwards

├── single_cell
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           └── SQ00015167.sqlite 
├── profiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167.csv
│           ├── SQ00015167_augmented.csv
│           ├── SQ00015167_normalized.csv
│           └── SQ00015167_normalized_variable_selected.csv

This is for two reasons:

File size: SQLite is ~300 times larger than the rest of the files combined. Keeping this large file in a separate folder structure will make maintaining data easier.
Frequency of access: SQLite is not touched as often as the downstream data, at least not so far.

SQLite files should likely not be versioned given their size. Instead, we should store a hyperlink to their location on S3 or some other permanent storage (like Figshare).
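
One way to store such a hyperlink is a small pointer file that is versioned in place of the SQLite file. The sketch below is only an illustration: the JSON layout, the function names, and the idea of recording a size and SHA-256 checksum next to the URL are assumptions, not something decided in this thread, and the remote URL is supplied by the caller.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so a very large SQLite file is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(sqlite_path, remote_url, pointer_path):
    """Write a small JSON record that can be versioned instead of the SQLite file."""
    sqlite_path = Path(sqlite_path)
    record = {
        "file": sqlite_path.name,
        "url": remote_url,  # hypothetical: wherever the file lives on permanent storage
        "size_bytes": sqlite_path.stat().st_size,
        "sha256": sha256_of(sqlite_path),
    }
    Path(pointer_path).write_text(json.dumps(record, indent=2) + "\n")
    return record
```

The checksum lets a reader confirm that the file they downloaded is the one the pointer refers to, which a bare link cannot do.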

shntnu · 2020-05-12T20:59:53Z

I propose we split backend into backend and profiles.

backend will have only Level 2b i.e. the SQLite / Parquet file – this was in fact my original intention (and thus the name :D)

profiles will have Level 3 upwards

@gwaygenomics did you see this? Does that work? (they are at the same level)

gwaybio · 2020-05-12T21:14:11Z

just saw it now - yes it can work. Any thought to renaming backend? If all that lives there is going to be SQLite/Parquet then isn't single_cell_profiles (or just single_cell) better?

shntnu · 2020-05-12T21:40:30Z

single_cell sounds good to me.

shntnu · 2021-03-18T23:27:38Z

I added this

├── collated
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       └── 2016_04_01_a549_48hr_batch1_normalized_feature_select.parquet

and dropped all cytotools as a data generator; only pycytominer going forward.

shntnu · 2022-07-13T18:13:24Z

I dropped batchfiles and audit because we no longer produce these

├── batchfiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── analysis
│           │   ├── Batch_data.h5
│           │   ├── dcp_config.json
│           │   ├── cp_docker_commands.txt
│           │   └── cpgroups.csv
│           └── illum
│               ├── Batch_data.h5
│               ├── dcp_config.json
│               ├── cp_docker_commands.txt
│               └── cpgroups.csv

├── audit
│   └── 2016_04_01_a549_48hr_batch1
│       ├── C-7161-01-LM6-006_audit.csv
│       └── C-7161-01-LM6-006_audit_detailed.csv

I renamed single_cell to backend because that became the de facto standard via JUMP (although I wish I had gone with single_cell; I lost track of this discussion), and moved SQ00015167.csv to backend (from profiles).
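
To make the resulting split concrete, here is a minimal sketch of where a per-plate file would land after this change. The function name `folder_for` is hypothetical, and the rule is inferred only from the file names shown in this thread.

```python
def folder_for(filename, plate):
    """Return 'backend' or 'profiles' for a per-plate file name."""
    # The SQLite file and the unsuffixed <plate>.csv both live in backend.
    if filename in (f"{plate}.sqlite", f"{plate}.csv"):
        return "backend"
    # Suffixed CSVs such as <plate>_augmented.csv live in profiles.
    if filename.startswith(f"{plate}_") and filename.endswith(".csv"):
        return "profiles"
    raise ValueError(f"unrecognized file for plate {plate!r}: {filename!r}")
```

For plate `SQ00015167`, `SQ00015167.csv` maps to `backend` and `SQ00015167_normalized.csv` maps to `profiles`.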

shntnu mentioned this issue May 1, 2020

Define folder structure broadinstitute/lincs-cell-painting#23

Closed

gwaybio mentioned this issue May 12, 2020

Adding Dual License broadinstitute/lincs-cell-painting#32

Merged

gwaybio mentioned this issue May 13, 2020

Reorganize Configuration Processing broadinstitute/pooled-cell-painting-profiling-recipe#6

Merged

shntnu changed the title ~~Implement data versioning~~ Implement data versioning ande define folder structures May 15, 2020

shntnu changed the title ~~Implement data versioning ande define folder structures~~ Implement data versioning and define folder structures May 18, 2020

shntnu changed the title ~~Implement data versioning and define folder structures~~ Define folder structures and implement data versioning May 19, 2020

shntnu pinned this issue May 19, 2020

gwaybio mentioned this issue May 29, 2020

Adding normalize step to recipe broadinstitute/pooled-cell-painting-profiling-recipe#16

Merged

shntnu mentioned this issue Jul 15, 2021

Make decisions up to consensus clearer broadinstitute/lincs-cell-painting#73

Closed

shntnu mentioned this issue Apr 5, 2022

Document the folder structure broadinstitute/cellpainting-gallery#1

Closed

shntnu mentioned this issue May 19, 2023

2022_09_DD_DeepProfiler (cpg0019) broadinstitute/cellpainting-gallery#20

Closed
