Define folder structures and implement data versioning #54

Open
shntnu opened this issue May 1, 2020 · 7 comments

shntnu commented May 1, 2020

We want to address two issues here:

  1. Define a new folder structure for profiling experiments.
  2. Identify which of the components will be version controlled.

I will update this comment periodically as the strategy evolves. I realize this is not ideal because it upsets the chronology of discussions.

This is our current folder structure specified in the Profiling Handbook. This differs slightly from the folder structure specified in the Cell Painting Gallery. For this level of nesting (under workspace) the only discrepancy is metadata/platemaps (see #70); consensus and collated are currently missing in the Gallery, but that is not a discrepancy per se.

This is the proposed folder structure in the Profiling Handbook:

├── profiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167_augmented.csv
│           ├── SQ00015167_normalized.csv
│           ├── SQ00015167_normalized_feature_select.csv
│           └── SQ00015167_spherized.csv
├── collated (*)
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized_feature_select.parquet
│       └── 2016_04_01_a549_48hr_batch1_spherized.parquet
├── consensus (*)
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       └── 2016_04_01_a549_48hr_batch1_spherized.parquet
├── backend
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167.csv
│           └── SQ00015167.sqlite 
├── load_data_csv
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── load_data.csv
│           └── load_data_with_illum.csv
├── log 
├── metadata
│   └── 2016_04_01_a549_48hr_batch1
│       ├── barcode_platemap.csv
│       └── platemap
│           └── C-7161-01-LM6-006.txt
└── pipelines

* collated and consensus files are saved as parquet to allow fast loading.
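
The naming convention in the tree above can be sketched as a pair of path builders. This is only an illustration, not part of pycytominer or the Profiling Handbook: the function names and the `workspace` argument are hypothetical, and the list of levels is copied from the example files shown above.

```python
from pathlib import PurePosixPath

# Processing levels that appear as file suffixes in the tree above.
LEVELS = ("augmented", "normalized", "normalized_feature_select", "spherized")


def plate_profile_path(workspace, batch, plate, level):
    """Per-plate CSV under profiles/<batch>/<plate>/ (hypothetical helper)."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    return PurePosixPath(workspace) / "profiles" / batch / plate / f"{plate}_{level}.csv"


def batch_level_path(workspace, folder, batch, level):
    """Per-batch parquet under collated/<batch>/ or consensus/<batch>/ (hypothetical helper)."""
    if folder not in ("collated", "consensus"):
        raise ValueError(f"unknown folder: {folder}")
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    return PurePosixPath(workspace) / folder / batch / f"{batch}_{level}.parquet"
```

For example, `plate_profile_path("workspace", "2016_04_01_a549_48hr_batch1", "SQ00015167", "normalized")` yields the path of `SQ00015167_normalized.csv` in the tree above.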

We will version these folders by placing them inside the project repo:

| folder        | generator                                      |
|---------------|------------------------------------------------|
| profiles      | pycytominer                                    |
| collated      | pycytominer                                    |
| consensus     | pycytominer                                    |
| load_data_csv | pe2loaddata                                    |
| log           | GNU parallel (when running various commands)   |
| metadata      | manual                                         |
| pipelines     | manual                                         |

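We will not version these folders:

| folder   | generator                              | reason                         |
|----------|----------------------------------------|--------------------------------|
| backend  | cytominer-database                     |                                |
| analysis | CellProfiler, Distributed-CellProfiler | redundant with SQLite backend  |
| images   | Microscope                             | Never changes, and too big!    |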
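
As a rough sketch, the two tables above can be encoded as a lookup that answers whether a given workspace path belongs in the project repo. The dictionary contents come from the tables; the function name and the choice to raise on unknown folders are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Top-level folders and their generators, as listed in the tables above.
VERSIONED = {
    "profiles": "pycytominer",
    "collated": "pycytominer",
    "consensus": "pycytominer",
    "load_data_csv": "pe2loaddata",
    "log": "GNU parallel",
    "metadata": "manual",
    "pipelines": "manual",
}
NOT_VERSIONED = {
    "backend": "cytominer-database",
    "analysis": "CellProfiler, Distributed-CellProfiler",
    "images": "Microscope",
}


def is_versioned(relative_path):
    """Return True if the path's top-level folder is version controlled (hypothetical helper)."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        raise KeyError("empty path")
    top = parts[0]
    if top in VERSIONED:
        return True
    if top in NOT_VERSIONED:
        return False
    # Folders not mentioned in either table are treated as an error here.
    raise KeyError(f"folder not covered by the tables: {top}")
```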

shntnu commented May 10, 2020

I propose we split backend into single_cell and profiles.

  • single_cell will have only Level 2b, i.e. the SQLite / Parquet file
  • profiles will have Level 3 upwards

├── single_cell
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           └── SQ00015167.sqlite 
├── profiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167.csv
│           ├── SQ00015167_augmented.csv
│           ├── SQ00015167_normalized.csv
│           └── SQ00015167_normalized_variable_selected.csv

This is for two reasons:

  • File size: SQLite is ~300 times larger than the rest of the files combined. Keeping this large file in a separate folder structure will make maintaining data easier.
  • Frequency of access: SQLite is not touched as often as the downstream data, at least not so far.

SQLite files should likely not be versioned given the file size. Instead, we should store a hyperlink to their location on S3 or some other permanent storage (like Figshare).
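
One way to store such a hyperlink is a small pointer file that is versioned in place of the SQLite file. The sketch below is an assumption about how that could look, not an agreed format: the function name, the JSON fields, and the idea of recording a SHA-256 checksum alongside the URL are all illustrative choices.

```python
import hashlib
import json
from pathlib import Path


def write_pointer(sqlite_path, remote_url, pointer_path):
    """Write a small JSON file recording where a large file lives (hypothetical format)."""
    digest = hashlib.sha256()
    with open(sqlite_path, "rb") as fh:
        # Read in 1 MiB chunks so a multi-gigabyte SQLite file is never fully in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    record = {
        "url": remote_url,  # e.g. an S3 or Figshare location supplied by the caller
        "sha256": digest.hexdigest(),
        "size_bytes": Path(sqlite_path).stat().st_size,
    }
    Path(pointer_path).write_text(json.dumps(record, indent=2) + "\n")
    return record
```

The pointer file is a few hundred bytes, so it can sit in the repo next to the profiles, while the checksum lets anyone confirm that a downloaded copy matches the file that produced them.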

shntnu commented May 12, 2020

I propose we split backend into backend and profiles.

  • backend will have only Level 2b i.e. the SQLite / Parquet file – this was in fact my original intention (and thus the name :D)
  • profiles will have Level 3 upwards

@gwaygenomics did you see this? Does that work? (they are at the same level)

gwaybio commented May 12, 2020

Just saw it now. Yes, it can work. Any thoughts on renaming backend? If all that lives there is going to be SQLite/Parquet, then isn't single_cell_profiles (or just single_cell) better?

shntnu commented May 12, 2020

single_cell sounds good to me.

shntnu changed the title from "Implement data versioning" to "Implement data versioning ande define folder structures" on May 15, 2020
shntnu changed the title from "Implement data versioning ande define folder structures" to "Implement data versioning and define folder structures" on May 18, 2020
shntnu changed the title from "Implement data versioning and define folder structures" to "Define folder structures and implement data versioning" on May 19, 2020
shntnu pinned this issue on May 19, 2020

shntnu commented Mar 18, 2021

I added this:

├── collated
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       └── 2016_04_01_a549_48hr_batch1_normalized_feature_select.parquet

and dropped all cytotools as a data generator; only pycytominer going forward.

shntnu commented Jul 13, 2022

I dropped batchfiles and audit because we no longer produce these:

├── batchfiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── analysis
│           │   ├── Batch_data.h5
│           │   ├── dcp_config.json
│           │   ├── cp_docker_commands.txt
│           │   └── cpgroups.csv
│           └── illum
│               ├── Batch_data.h5
│               ├── dcp_config.json
│               ├── cp_docker_commands.txt
│               └── cpgroups.csv
├── audit
│   └── 2016_04_01_a549_48hr_batch1
│       ├── C-7161-01-LM6-006_audit.csv
│       └── C-7161-01-LM6-006_audit_detailed.csv

I renamed single_cell to backend because that became the de facto standard via JUMP (although I wish I had gone with single_cell; I lost track of this discussion), and moved SQ00015167.csv to backend (from profiles).
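
For anyone holding data in the earlier single_cell layout, the rename and move described above could be expressed as a path mapping. This is a sketch under the assumption that paths are relative to the workspace and follow the `<folder>/<batch>/<plate>/<file>` pattern shown in this thread; the function name is hypothetical.

```python
from pathlib import PurePosixPath


def migrate_path(old):
    """Map a path in the single_cell layout to the backend layout (hypothetical helper)."""
    parts = PurePosixPath(old).parts
    if not parts:
        return old
    # single_cell/<batch>/<plate>/<file> -> backend/<batch>/<plate>/<file>
    if parts[0] == "single_cell":
        return str(PurePosixPath("backend", *parts[1:]))
    # profiles/<batch>/<plate>/<plate>.csv -> backend/<batch>/<plate>/<plate>.csv
    if parts[0] == "profiles" and len(parts) == 4 and parts[3] == f"{parts[2]}.csv":
        return str(PurePosixPath("backend", *parts[1:]))
    # Everything else, including <plate>_augmented.csv and later levels, stays put.
    return old
```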
