Use newer (2015) data #23
@gwaybio This is ready for review!
gwaybio
left a comment
LGTM! well done.
A good strategy for next time is to separate code changes from data changes - this will reduce pain in sifting code changes through massive amounts of data changes. Next time!
d33bs
left a comment
Nice work! I left a few comments for your consideration. As @gwaybio mentioned, I'd recommend splitting data from code updates in the future.
Some additional thoughts below:
- The volume of raw data within this repo might be better handled using data-specific tools. GitHub, git, or other constraints may become barriers to progress with large datasets. I'd recommend looking into tools like DVC, which are designed to help you version your data and store it in [sometimes] less constrained ways.
- The inclusion of `shapely` and use of `pandas` might benefit from using `geopandas`, which is specialized for handling similar data elements together (only if it makes sense).
Thank you @gwaybio and @d33bs for the review! @d33bs regarding your additional thoughts:
After this PR, `mitocheck_data` will use a newer trainingset from Mitocheck (2015 version) to create a more up-to-date labeled training dataset for `phenotypic_profiling_model`.
The main change in this PR is the association of `IDR_stream`-derived features with Mitocheck labels.
In the last version of `mitocheck_data`, we associated the center coords derived with `IDR_stream` with the bounding boxes given by Mitocheck to determine which features belong to which cells.
In this PR, we associate the center coords given by Mitocheck with the single-cell outlines derived with `IDR_stream`.
The final result is the same: a pandas DataFrame of single-cell info including cell plate, well, frame, gene perturbation, center coordinates, features, and phenotypic label (as assigned by Mitocheck).
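For reviewers who cannot load the full diff, the new association step can be pictured roughly as follows. This is only a minimal sketch, not the code in this PR: it uses a plain-Python ray-casting test in place of `shapely`, and the names `point_in_polygon`, `label_outlines`, the cell IDs, the coordinates, and the labels are all hypothetical, chosen for illustration.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: return True if `point` lies inside `polygon`.

    `polygon` is a list of (x, y) vertices in order. Points exactly on
    an edge are not handled specially in this sketch.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def label_outlines(outlines, labeled_centers):
    """Give each outline the label of the first labeled center inside it.

    outlines: dict of hypothetical cell ID -> list of (x, y) vertices
    labeled_centers: list of ((x, y), label) pairs
    Outlines that contain no labeled center are left out of the result.
    """
    labels = {}
    for cell_id, outline in outlines.items():
        for center, label in labeled_centers:
            if point_in_polygon(center, outline):
                labels[cell_id] = label
                break
    return labels


# Hypothetical data: two square outlines and two labeled centers.
outlines = {
    "cell_0": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "cell_1": [(20, 20), (30, 20), (30, 30), (20, 30)],
}
labeled_centers = [((5, 5), "Prometaphase"), ((50, 50), "Apoptosis")]

labels = label_outlines(outlines, labeled_centers)
print(labels)
```

In this toy case only `cell_0` receives a label, because the second center falls outside both outlines. The previous approach was the mirror image: the centers came from `IDR_stream` and the containing regions were the Mitocheck bounding boxes.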
Note: Because the entire Mitocheck trainingset is being changed, this PR has too many files being changed to load every file diff. The two files with notable changes that do not get loaded as a diff are training_data_utils.py and locate_utils.py. These two files might need to be reviewed outside of the default GitHub review UI.