@roshankern
Member

@roshankern roshankern commented Oct 31, 2022

After this PR, mitocheck_data will use a newer training set from Mitocheck (the 2015 version) to create a more up-to-date labeled training dataset for phenotypic_profiling_model.

The main change in this PR is the association of IDR_stream-derived features with Mitocheck labels.
In the previous version of mitocheck_data, we matched the center coordinates derived with IDR_stream to the bounding boxes given by Mitocheck to determine which features belong to which cells.
In this PR, we instead match the center coordinates given by Mitocheck to the single-cell outlines derived with IDR_stream.

The final result is the same: a pandas dataframe of single-cell info including cell plate, well, frame, gene perturbation, center coordinates, features, and phenotypic label (as assigned by Mitocheck).
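
As a rough illustration of the two association strategies described above, here is a minimal sketch. It is not the code in this PR: the function names, cell IDs, coordinates, and labels are all hypothetical, and it uses a plain-Python ray-casting test instead of shapely so that it runs without extra dependencies.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def point_in_box(point: Point, box: Tuple[float, float, float, float]) -> bool:
    """Previous approach: is a derived center inside a labeled bounding box?

    `box` is assumed to be (x_min, y_min, x_max, y_max).
    """
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def point_in_polygon(point: Point, outline: Sequence[Point]) -> bool:
    """New approach: is a labeled center inside a derived cell outline?

    Ray casting: count how many outline edges a horizontal ray starting at
    the point crosses. An odd count means the point is inside. Points lying
    exactly on the boundary are not handled specially.
    """
    x, y = point
    inside = False
    n = len(outline)
    for i in range(n):
        x1, y1 = outline[i]
        x2, y2 = outline[(i + 1) % n]
        # Only edges that straddle the ray's y value can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def label_for_outline(outline, labeled_centers) -> Optional[str]:
    """Return the label of the first labeled center that falls in the outline."""
    for center, label in labeled_centers:
        if point_in_polygon(center, outline):
            return label
    return None


# Hypothetical data: two segmented outlines and two labeled centers.
outlines = {
    "cell_a": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "cell_b": [(20, 20), (30, 20), (25, 30)],
}
labeled_centers = [((5, 5), "phenotype_A"), ((25, 24), "phenotype_B")]

assignments = {
    cell_id: label_for_outline(outline, labeled_centers)
    for cell_id, outline in outlines.items()
}
print(assignments)
```

Outlines that contain no labeled center get `None`, which is one way such cells could be dropped from the labeled training set.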

Note: Because the entire Mitocheck training set is being replaced, this PR changes too many files for GitHub to load every file diff. The two files with notable changes whose diffs do not load are training_data_utils.py and locate_utils.py. These two files might need to be reviewed outside of the default GitHub review UI.

@roshankern
Member Author

@gwaybio This is ready for review!

@roshankern roshankern requested a review from gwaybio November 1, 2022 15:23
Member

@gwaybio gwaybio left a comment

LGTM! well done.

A good strategy for next time is to separate code changes from data changes. This makes it much less painful to sift the code changes out of a massive amount of data changes.

Member

@d33bs d33bs left a comment

Nice work! I left a few comments for your consideration. As @gwaybio mentioned, I'd recommend splitting data from code updates in the future.

Some additional thoughts below:

  • The volume of raw data within this repo might be better handled using data-specific tools. Limits imposed by GitHub, git, or other tooling may become barriers to progress with large datasets. I'd recommend looking into tools like DVC, which are designed to help you version your data and store it in [sometimes] less constrained ways.
  • The inclusion of shapely and use of pandas might benefit from using geopandas, which is specialized for handling similar data elements together (only if it makes sense).

@roshankern
Member Author

Thank you @gwaybio and @d33bs for the review!

@d33bs regarding your additional thoughts:

  • I think DVC is definitely worth using for future repo changes, but I'm not sure I can justify making that change in this PR, especially with phenotypic_profiling_model referencing the data directly from this repo (with the hash used for version control).
  • I have never heard of geopandas, thank you for bringing that to my attention! I think it would be worth making this change in IDR_stream before this repo, though, as that is where the cell object outlines are first derived and saved. Definitely something to add to my list of IDR_stream development plans!

@roshankern roshankern merged commit 3ebd0ca into WayScience:main Nov 2, 2022
@roshankern roshankern deleted the use-2015-data branch November 2, 2022 21:13