Add JUMP Mitocheck feature analysis comparison with KS tests #42
Great PR! I left a few comments to address but LGTM!
full_summary_boxplot <- (
    ggplot(ks_test_df, aes(x=feature, y=ks_stat))
    + geom_boxplot(aes(color = feature_group), outlier.size = 0.1)
What does `outlier.size` do?
This resource does a good job explaining: https://www.geeksforgeeks.org/change-size-of-outlier-labels-on-boxplot-in-r/
ggplot lets you customize a lot of things in your plots. In a boxplot (aka box-and-whisker plot), the box means something, as do the lines coming from the box (whiskers). The edges of the box (hinges) mark the 25th and 75th percentiles of the data, so the box spans the IQR (interquartile range). By default, the whiskers extend to the most extreme points within 1.5 * IQR of the hinges, and the outliers are points that fall beyond the whiskers: https://ggplot2.tidyverse.org/reference/geom_boxplot.html
We can control attributes of the outliers using this API (e.g., `outlier.size = ...`)!
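Here is a minimal sketch of what `outlier.size` changes, using made-up data rather than the PR's `ks_test_df`. The data frame `toy_df` and the plot names are assumptions for illustration only; the two plots are identical except for how large the outlier points are drawn.

```r
library(ggplot2)

# Toy data (hypothetical, not from this PR): two groups with one forced outlier each
set.seed(42)
toy_df <- data.frame(
  feature = rep(c("a", "b"), each = 100),
  ks_stat = c(rnorm(100), rnorm(100, mean = 1))
)
toy_df$ks_stat[c(1, 101)] <- c(8, -6)

# Same boxplot twice; only the drawn size of the outlier points differs
small_outliers <- ggplot(toy_df, aes(x = feature, y = ks_stat)) +
  geom_boxplot(outlier.size = 0.1)

large_outliers <- ggplot(toy_df, aes(x = feature, y = ks_stat)) +
  geom_boxplot(outlier.size = 3)
```

Printing `small_outliers` and `large_outliers` side by side shows that the boxes and whiskers do not move; a small value such as 0.1 just keeps many outlier points from visually dominating a crowded plot.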
Some comments I have:
- For panel C, can you confirm whether "other" is the rest of the AreaShape features or, based on the legend title, the other feature groups? I think this plot would be more convincing if more than just Zernike were highlighted, but maybe not.
- Can you remind me what the dotted blue lines are in panels C and D? Are you able to add them to the legend, or are they self-explanatory enough that this isn't necessary for most audiences?
In relation to my comment in 2, I can see in the legend that you addressed this.
But now that I take a second look, is there a reason why you don't see any red in the raw data in panel C?
Thank you for this comment!
I think that, based on our recent discussions together with @MattsonCam, we should shift the focus of this plot slightly. I agree that it would be more convincing if we show more than just AreaShape in C, and it would help to highlight the Zernike features in D when we focus. Essentially, the update is to focus less sharply.
The dotted blue lines are definitely not standard, but a reader will likely intuit their importance. The legend addresses this fully.
> But now that I take a second look, is there a reason why you don't see any red in the raw data in panel C?
All these features have very low variance. I can add this to the legend, thanks!
# Define function for loading data
load_process_data <- function(file, normalized_or_raw) {
    ks_test_df <- readr::read_tsv(
        results_file,
This is probably just my inexperience with R functions, but I am confused why you have `function(file, normalized_or_raw)` and then in this line you have `results_file`. I would assume that if you are loading a file in the function, the variable name `file` would be used here and not a new name called `results_file`.
I just don't see the variable `file` used in this function, so I am a bit confused, but please let me know if I just missed it!
This is a major typo! Thanks for catching this! Kudos!
I've fixed it in the next commit. (Note: it doesn't actually have any impact in this present script, but could be devastating if used elsewhere.)
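For anyone reading along, here is a small sketch of why an unused argument like this can go unnoticed in R. The function names `load_buggy` and `load_fixed` and the file names are hypothetical stand-ins, not the PR's actual code. When a name is not found inside a function, R looks it up in the enclosing environment, so a global `results_file` silently takes the place of the `file` argument.

```r
# Hypothetical stand-ins for illustration; not the PR's actual function
results_file <- "global.tsv"  # a global variable that happens to exist

load_buggy <- function(file) {
  # `file` is never used; `results_file` is resolved from the enclosing
  # (here: global) environment, so the argument is silently ignored
  results_file
}

load_fixed <- function(file) {
  # Uses the argument that the caller actually passed
  file
}

buggy_result <- load_buggy("other.tsv")
fixed_result <- load_fixed("other.tsv")
```

Here `buggy_result` is "global.tsv" no matter what is passed in, while `fixed_result` is "other.tsv". If the global did not exist, the buggy version would instead fail with an "object not found" error.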
Thanks again for the review @jenna-tomkinson! I will go ahead and merge now.
Related to the analysis in WayScience/JUMP-single-cell#13
Legend:
Comparing JUMP and Mitocheck nuclei feature spaces. (A) Kolmogorov-Smirnov (KS) test results comparing JUMP and Mitocheck per common CellProfiler feature, colored by specific CellProfiler feature group. The boxplot whiskers represent the interquartile range of 1,000 permutations of randomly subsampled JUMP single cells from a single plate (JUMP Pilot plate BR00116991) compared to Mitocheck. Mitocheck and JUMP sample sizes are the same (n = 2,916). We show both raw and z-score normalized comparisons. (B) The same KS test results focused on AreaShape measurements, which showed the lowest differences in feature distributions across datasets. (C) Comparing variance of JUMP and Mitocheck for all CellProfiler features. The dotted lines are the function y=x (anything below is a feature with higher variance in Mitocheck). Note that low variance features group together near zero and obscure colors. (D) The same variance plot as panel C, except focused only on the AreaShape features.
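As a rough sketch of the subsample-and-test procedure the legend describes, the snippet below repeats a two-sample KS test on random subsamples of one feature. The data are simulated stand-ins, not real JUMP or Mitocheck measurements, and the variable names and the reduced permutation count are assumptions for illustration; the PR's actual analysis may differ in its details.

```r
set.seed(123)

# Simulated stand-ins for one feature in each dataset (not real data)
jump_feature <- rnorm(10000, mean = 0.2, sd = 1)
mitocheck_feature <- rnorm(2916, mean = 0, sd = 1)

# The legend reports 1,000 permutations; reduced here to keep the sketch fast
n_permutations <- 100

ks_stats <- vapply(seq_len(n_permutations), function(i) {
  # Subsample JUMP cells to match the Mitocheck sample size
  jump_subsample <- sample(jump_feature, size = length(mitocheck_feature))
  # Two-sample KS statistic: largest gap between the two empirical CDFs
  unname(ks.test(jump_subsample, mitocheck_feature)$statistic)
}, numeric(1))

summary(ks_stats)
```

Each permutation yields one KS statistic between 0 and 1, and the spread of those values across permutations is what a per-feature boxplot would summarize.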