This repository has been archived by the owner on Nov 5, 2020. It is now read-only.

Classification for analysis #222

Closed
Alexandru-Emil opened this issue Jul 12, 2018 · 16 comments

@Alexandru-Emil

I have one suggestion. It would be useful for the analysis to be able to classify in control and treatment according to the name given to each measurement, for example, to write that if the measurement name contains "co" then it should be recognized as control for the analysis. The same for repetition (example: if the measurement name contains 95, then this is repeat number 3). It would save a lot of time when working with many measurements; now you have to define everything one by one.
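The rule being requested could be sketched roughly as follows. This is a purely illustrative sketch, not ShapeOut's actual API; the function name, the `control_phrase` parameter, and the `repeat_map` mapping are all made up for this example.

```python
def classify(name, control_phrase="co", repeat_map=None):
    """Return (role, repeat) for a measurement name.

    A measurement counts as "control" if its name contains
    `control_phrase`; the repeat number is looked up from a
    user-defined substring-to-repeat mapping (e.g. {"95": 3}).
    All names here are hypothetical.
    """
    repeat_map = repeat_map or {}
    role = "control" if control_phrase in name.lower() else "treatment"
    repeat = next((r for key, r in repeat_map.items() if key in name), None)
    return role, repeat

print(classify("donor 95 co", repeat_map={"95": 3}))  # ('control', 3)
```

With such a rule, a whole batch of measurements could be classified in one pass instead of assigning each one by hand.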

@paulmueller
Member

Thanks for getting involved.
Currently, only measurement names that contain "control" are identified as such.

Just for clarification, how is the number "95" in a measurement name related to repetition number "3"?
To be absolutely clear, it would be helpful if you could give me example measurement names and the corresponding mapping.

I guess it would also make sense to discuss the mapping here, if anyone else would like to get involved @phidahl @MartaUrb @chrherold.

@Alexandru-Emil
Author

For the control measurements I had names such as "co" or "ctrl", and in this case just the first measurement is identified as control and the others are none, and everything is identified as repeat 1.
But, depending on the analysis, one might want to identify as control measurements with a different name.
And the number 95 was a random example of labeling, for example of a cell donor.

Let's say I have 3 donors:

  • donor 15 (15 here could be an identifier for the donor in another database)
  • donor 52
  • donor 60

And from each of the three donors I have untreated cells (labeled "co"), cells treated with x (labeled "x"), and cells treated with x+y (labeled "x+y").
First I would like to assign "15" to repeat 1, "52" to repeat 2 and "60" to repeat 3. Then I would like to identify everything that is labeled "co" as control and "x" as treatment. After I finish this analysis, I might be interested in identifying "x" as control and "x+y" as treatment, or "co" as control and "x+y" as treatment.

Hope I was more clear now and thank you for your interest!

@chrherold
Contributor

chrherold commented Jul 23, 2018

@Alexandru-Emil: Thank you for participating in improving ShapeOut!

We are currently in the process of planning and designing bigger changes for the user interface and the more feedback we get the better.

A (semi)-automatic identification of control/treatment and repeats for the linear mixed models analysis seems like a good idea also for the "general" user.

I think it should not be too strictly derived from the name though.
Reasons:

  • different users will always use different patterns for names

  • different experiments of the same user may require different naming patterns

  • experiments may be a treatment in one sense and a control to a different treatment in another sense

So my suggestion would be as follows - and I am more than happy to discuss this in terms of practicability to implement (@paulmueller ) and to use (@Alexandru-Emil ).

Let's assume a dialogue box that opens when pushing a button (named e.g. "auto assign").

  • automatic identification will be carried out on all data sets that were selected and analyzed

  • in the dialogue, you will be asked to enter a phrase that identifies a control (which is then searched for in the titles of the plots/sample names)

  • you will have the option to select all other plots as treatment OR to enter a phrase that identifies treatment

  • similarly, you will have the option to additionally identify the reservoir measurements of control and treatment

Which leaves the identification of the repeat. Here, no fixed phrase will be helpful so one would need to tell the dialogue at which position to look for similarities to build a repeat group.
Say a box where you can specify to look at the first x characters of the title (in case you choose 8 characters, one would see):
"donor 01"
and compare it to
"donor 99"
of another title.
Likely it would be good to give another option to look at the last x characters instead.

After applying this search the drop down menus should all be set and manual re-assignment is still possible. And another automatic assignment would be possible to completely reset and redo the assignments.
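The prefix-based grouping proposed above might look roughly like this in Python. The function name and parameters are illustrative only; nothing here reflects ShapeOut's internals.

```python
def group_by_prefix(titles, n=8, from_end=False):
    """Group titles whose first (or last) `n` characters agree.

    Titles sharing the same n-character prefix land in the same
    repeat group, as in the "donor 01" / "donor 99" example.
    """
    groups = {}
    for title in titles:
        key = title[-n:] if from_end else title[:n]
        groups.setdefault(key, []).append(title)
    return groups

titles = ["donor 15 co", "donor 15 x", "donor 52 co", "donor 52 x"]
print(group_by_prefix(titles, n=8))
# {'donor 15': ['donor 15 co', 'donor 15 x'],
#  'donor 52': ['donor 52 co', 'donor 52 x']}
```

Each resulting group would then be assigned one repeat number; `from_end=True` covers the "last x characters" variant.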

BTW: @paulmueller, is there a reason why "repetition" ends at 9?

@paulmueller
Member

paulmueller commented Jul 23, 2018

BTW: @paulmueller, is there a reason why "repetition" ends at 9?

No, I think this was just the initial design.

@chrherold
Contributor

OK, I guess we should lift that limit with the automated detection, if you think this detection can be implemented without big trouble.
I see the advantage in the manual use because one would not really want a terribly long drop down.

@paulmueller
Member

The repeat selection could also be done with an integer spin box control.

@chrherold
Contributor

Spin box sounds good!

@maikherbig
Member

maikherbig commented Jul 30, 2018

I just talked to a colleague who faced the issue of having only 9 repetition numbers and now I'm happy to see that this issue is already being discussed :) Thanks again to @Alexandru-Emil for participating!
One more thing: The LinMixMod algorithm used in ShapeOut is explained in a publication, which could be cited if someone wants to use it for their own publication. Wouldn't it make sense to write at the bottom of the result (.txt popup) something like:
If you use this statistical test in a scientific publication, a citation of the following paper would be appreciated:
https://doi.org/10.1063/1.5027197
?

@paulmueller
Member

@maikherbig I am planning an extensive online documentation on readthedocs.org that will include a section "how to cite". In general, I think asking for a citation in the data output by a program is not a good idea.

@paulmueller
Member

paulmueller commented Aug 2, 2018

I implemented an automated classification (with similarity analysis to determine the repeat number).

Please test the installer of the development version:
https://ci.appveyor.com/project/paulmueller/shapeout/build/1.0.928/job/ntashbbdq1iqnh3t/artifacts

[EDIT]
direct link: https://ci.appveyor.com/api/buildjobs/ntashbbdq1iqnh3t/artifacts/.appveyor%2FOutput%2FShapeOut_0.8.6.dev25_win_64bit_setup.exe

@chrherold
Contributor

Thanks @paulmueller for implementing the automated classification. I have played a little bit with the new feature and I like the concept a lot.

I have the following feedback, which should be discussed by users who will use the feature more frequently than I do.

  1. I am not quite sure how the identification of the repetition works at the moment. I guess it looks for two titles that are similar outside of what was entered in control and treatment?
    If that is the case, I think it makes a lot of sense. But I have some questions for the more frequent users (e.g. @maikherbig):
    1a) Is it possible/of any use for the analysis that two treatment or control measurements have the same repetition number? If yes, this should be possible; if no, it should not be. This leads to
    1b) Is it possible/useful for the analysis to have mixed situations like: repetition 1 has control and treatment; repetition 2 only has control; repetition 3 only has treatment (...and similar)? What would it mean to have this mixed situation? In other words, how should numbering be handled if control and treatment measurements are not (strictly, or at all) pairwise correlated? Do you avoid numbering of repeats and give all the measurements the same number? Do you give a different number to every experiment? Currently it seems to sort all measurements that do not exist as pairs to the ID 1, even if a pair with that ID already exists. This seems strange in any case. The suggestion on how to handle such cases, however, depends on what the consequences are for the analysis.

  2. Currently, it seems mandatory to fill some phrase into the control field. I think it should also be possible to enter something only into the treatment field (and leave control empty) and make all the rest control. This would make situations where samples are named "...celltypeA" and "...celltypeA + drug" more intuitive to handle. (I guess one could just switch the meanings of control and treatment and make "+ drug" the control group to get the same result, just with an inverse effect. But this will not be the intuitive thing to do and might cause confusion.)

@paulmueller
Member

  1. To identify repetitions, a similarity analysis is done, i.e. all titles are compared to all other titles and matched in descending order of similarity. E.g. "sample 01 ctl" matches "sample 01" better than "sample 02". I am using Python's built-in difflib for that.

  2. Yes, this is currently mandatory. Please let me know whether this should be changed.
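Point 1 can be illustrated with `difflib` directly. This is only a sketch of the general idea; the exact scoring and matching order inside ShapeOut may differ.

```python
import difflib

def best_match(title, candidates):
    """Return the candidate title most similar to `title`,
    scored with difflib's SequenceMatcher ratio (0..1)."""
    ratios = {c: difflib.SequenceMatcher(None, title, c).ratio()
              for c in candidates}
    return max(ratios, key=ratios.get)

print(best_match("sample 01 ctl", ["sample 01", "sample 02"]))
# sample 01
```

Because "sample 01 ctl" shares nine consecutive characters with "sample 01" but only eight with "sample 02", the first candidate scores higher and the two measurements end up in the same repeat group.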

@maikherbig
Member

@chrherold
1a) Setting the state and repetition number of two experiments to the same value (e.g. two times Control No. 2) means that the data of these two experiments will be pooled. I think it is a nice feature and it should be allowed.
1b) In mixed situations, lme4 approximates the missing data by maximum likelihood estimation. This means in your example that you have to skip repetition number 2 for Treatment and repetition number 3 for Control and lme4 would then estimate these data-points.
The LMM based test should preferably be used as a paired test and the experimental design should be chosen accordingly, but in principle it is possible to do an unpaired test by giving each experiment a different repetition number. Having a different repetition number for each experiment (e.g. Control goes from 1 to 3 and Treatment from 4 to 6) means lots of data-points have to be estimated, and the model will likely not converge. If the LMM does not converge, it is documented in the output .txt like that:
"convergence code: 0
unable to evaluate scaled gradient
Model failed to converge: degenerate Hessian with 1 negative eigenvalues"
To answer your questions one by one:
Do you avoid numbering of repeats and give all the measurements the same number?: No. For LMM we actually need repeats. Otherwise it is not possible to get any information about the random error. Hence there would be an error message from lmer: "grouping factors must have > 1 sampled level"
Do you give a different number to every experiment?: This would equal an unpaired test, which is possible, but often the LMM does not converge. Therefore, I suggest thinking about an experimental design beforehand that permits pairing.
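The paired versus unpaired distinction above can be made concrete with two hypothetical assignments (the experiment names and the `repeats` helper are made up for this example, not ShapeOut code):

```python
# Paired design: each repeat number occurs for both states.
paired = {
    "donor15_co": ("control", 1), "donor15_x": ("treatment", 1),
    "donor52_co": ("control", 2), "donor52_x": ("treatment", 2),
    "donor60_co": ("control", 3), "donor60_x": ("treatment", 3),
}
# Unpaired design: every experiment gets its own repeat number.
unpaired = {
    "donor15_co": ("control", 1), "donor52_co": ("control", 2),
    "donor60_co": ("control", 3), "donor15_x": ("treatment", 4),
    "donor52_x": ("treatment", 5), "donor60_x": ("treatment", 6),
}

def repeats(assignment, state):
    """Set of repeat numbers used for one state (control/treatment)."""
    return {rep for s, rep in assignment.values() if s == state}

# Paired: every repeat has a measured counterpart in the other state.
print(repeats(paired, "control") == repeats(paired, "treatment"))    # True
# Unpaired: no overlap, so lme4 must estimate the missing
# counterparts and, as described above, may fail to converge.
print(repeats(unpaired, "control") & repeats(unpaired, "treatment"))  # set()
```

In the unpaired layout every treatment data point lacks a measured control with the same repeat number, which is exactly the situation where convergence problems appear.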

@phidahl
Contributor

phidahl commented Aug 7, 2018 via email

@chrherold
Contributor

@paulmueller and @maikherbig : thank you for the detailed explanations.

I guess the following changes to the system might be useful:

  1. As a minimum input information, make it mandatory to have an entry either to the control field OR to the treatment field. (Instead of an always mandatory entry to the control)

  2. Check if it is possible to not sort all "single" non-paired events to ID 1. Either make an ID 0 (if possible to handle in analysis) for all of those events or increase IDs even if non-paired events are found. Given @maikherbig's explanations, I would favor ID 0.

In general it seems to work nicely for paired data - which is the purpose. So especially 2) is just cosmetic. I think it is OK to demand manual corrections if the experimental design is not ideal.
