Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handbook discrepancy: metadata/<batch> vs metadata/<batch>/platemaps #70

Open
bethac07 opened this issue Sep 9, 2022 · 8 comments
Open

Comments

@bethac07
Copy link
Member

bethac07 commented Sep 9, 2022

Having the metadata broken out into platemaps and external is still a relatively new thing, and the image analysis team to date has typically only ever dealt with the platemap level metadata. Thus, on AWS, metadata is stored as s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}, as evidenced in this sync command.

BUT, for the recipe, we eventually want it in s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/platemaps/${BATCH_ID}, as depicted in this tree structure. We currently handle that with just adding that extra platemaps in during the sync command above.

BUT, that's not what the handbook implies - it implies that it should be uploaded in the tree structure. So we basically have one of a couple of choices - 1) Explain everything I just wrote here in the handbook and show both tree structures there or 2) From now on, change how the analysts structure things and update the sync command, because presumably in places like the gallery in the future (but I don't know that anyone has checked for current bits of metadata over there), we want this new structure.

I'm guessing the preference here is 2, but we should make a decision and update the handbook accordingly.

@shntnu
Copy link
Member

shntnu commented Sep 9, 2022

BUT, that's not what the handbook implies - it implies that it should be uploaded in the tree structure. So we basically have one of a couple of choices - 1) Explain everything I just wrote here in the handbook and show both tree structures there or 2) From now on, change how the analysts structure things and update the sync command, because presumably in places like the gallery in the future (but I don't know that anyone has checked for current bits of metadata over there), we want this new structure.

I didn't quite follow (I bet it is me, not you; your note seems super clear so I'm probably just a bit of out the loop wrt the handbook)

That said, perhaps this would help:

Our cellpainting-gallery tree structure is consistent with the profiling-recipe
https://github.com/broadinstitute/cellpainting-gallery/blob/849abb85939dfb9172a24269cd0c612cb6221730/folder_structure.md#metadata-folder-structure

Our instructions for data upload say this
https://github.com/broadinstitute/cellpainting-gallery/blob/849abb85939dfb9172a24269cd0c612cb6221730/upload.md#upload-metadata-folder

(i.e. sync metadata as a whole)

So to keep things consistent, I think it is better to change the sync command

aws s3 sync s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/${BATCH_ID} metadata/platemaps/${BATCH_ID}

to

aws s3 sync s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/ metadata/

(i.e. just download the whole thing)

and then modify the handbook as needed so that the platemaps are expected to be found at metadata/platemaps

@bethac07
Copy link
Member Author

bethac07 commented Sep 9, 2022

Is it fair to say that your vote then is for the image analysts to stop storing the metadata as workspace/metadata/${BATCH_ID} on AWS, but instead do workspace/metadata/platemaps/${BATCH_ID} ?

Our cellpainting-gallery tree structure is consistent with the profiling-recipe
https://github.com/broadinstitute/cellpainting-gallery/blob/849abb85939dfb9172a24269cd0c612cb6221730/folder_structure.md#metadata-folder-structure

Right, that's what you want the structure to be on the gallery, but I'm saying that unless someone was careful in doing the uploads (and notably did NOT do the sync you suggest there), they may not be, because historically, we've always on AWS stored metadata in workspace/metadata/${BATCH_ID} rather than workspace/metadata/platemaps/${BATCH_ID}.

Evidence - here is the suggested tree structure at commit 07e7691 (June 2021)

└── metadata
    └── 2016_04_01_a549_48hr_batch1
        ├── barcode_platemap.csv
        └── platemap
            └── C-7161-01-LM6-006.txt

@bethac07
Copy link
Member Author

bethac07 commented Sep 9, 2022

Tracing historically - as of 2835588 (June 19 2021), we were still on the old structure. Sometimes in the giant batch of commits on June 22-23, by the end of it (8ffd6c9), there are no metadata instructions at all. They get added back in as part of #63 August 2021, but with platemaps now in the path.

@bethac07
Copy link
Member Author

bethac07 commented Sep 9, 2022

@shntnu and I chatted and here are the decisions we want to make going forward

  • New projects should use the new /platemaps containing structure. Old projects can continue to use the old structure, because it's likely better to be parallel (so that if we do want to move them to the new structure, we can do it en masse for the whole project rather than batch-by-batch).
  • Notes should be added to the handbook in the metadata section and at the metadata sync command [todo: me] and to the gallery sync instructions [todo: someone?] noting that either structure might exist; specifically for the gallery, it should instruct users to either fix it at the source to include platemaps or add it in as part of the sync command
  • At some point, somebody [todo: ?] should check the projects that already exist in the gallery to see whether they are old structure or new, and fix them if old. While we're still at the state where there are only 18, this is probably only a ~30 minute TODO. We think JUMP-Production is likely fine due to the validation script but is TBD for others.

Did I miss anything?

@rsenft1
Copy link
Member

rsenft1 commented Sep 9, 2022

Will the profiling recipe be modified to look in metadata/platemaps/batch/platemaps? (I'm assuming this is the only place where that metadata is used).

@bethac07
Copy link
Member Author

bethac07 commented Sep 9, 2022

That is already where the recipe looks!

@rsenft1
Copy link
Member

rsenft1 commented Sep 9, 2022

🤯 Oh wait, there's not actually anywhere where it 'looks' for platemaps on s3 right? We just download them to the backends machine and then from there is what's used in generating profiles?

@bethac07
Copy link
Member Author

bethac07 commented Sep 9, 2022

Yup

@shntnu shntnu changed the title Handbook discrepancy Handbook discrepancy: metadata/${BATCH_ID} vs metadata/${BATCH_ID}/platemaps Oct 16, 2022
@shntnu shntnu changed the title Handbook discrepancy: metadata/${BATCH_ID} vs metadata/${BATCH_ID}/platemaps Handbook discrepancy: metadata/<batch> vs metadata/<batch>/platemaps Oct 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants