This repository has been archived by the owner on Nov 5, 2020. It is now read-only.

Classification for analysis #222

Closed
Alexandru-Emil opened this issue Jul 12, 2018 · 16 comments

@Alexandru-Emil

I have one suggestion. It would be useful for the analysis to be able to classify in control and treatment according to the name given to each measurement, for example, to write that if the measurement name contains "co" then it should be recognized as control for the analysis. The same for repetition (example: if the measurement name contains 95, then this is repeat number 3). It would save a lot of time when working with many measurements; now you have to define everything one by one.
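The rule being requested could be sketched roughly as follows. This is a purely illustrative sketch, not ShapeOut's actual API; the function name, the `control_phrase` parameter, and the `repeat_map` mapping are all made up for this example.

```python
def classify(name, control_phrase="co", repeat_map=None):
    """Return (role, repeat) for a measurement name.

    A measurement counts as "control" if its name contains
    `control_phrase`; the repeat number is looked up from a
    user-defined substring-to-repeat mapping (e.g. {"95": 3}).
    All names here are hypothetical.
    """
    repeat_map = repeat_map or {}
    role = "control" if control_phrase in name.lower() else "treatment"
    repeat = next((r for key, r in repeat_map.items() if key in name), None)
    return role, repeat

print(classify("donor 95 co", repeat_map={"95": 3}))  # ('control', 3)
```

With such a rule, a whole batch of measurements could be classified in one pass instead of assigning each one by hand.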

@paulmueller
Member

Thanks for getting involved.
Currently, only measurement names that contain "control" are identified as such.

Just for clarification, how is the number "95" in a measurement name related to repetition number "3"?
To be absolutely clear, it would be helpful if you could give me example measurement names and the corresponding mapping.

I guess it would also make sense to discuss the mapping here, if anyone else would like to get involved @phidahl @MartaUrb @chrherold.

@Alexandru-Emil
Author

For the control measurements I had names such as "co" or "ctrl", and in this case just the first measurement is identified as control and the others are none, and everything is identified as repeat 1.
But, depending on the analysis, one might want to identify as control measurements with a different name.
And the number 95 was a random example of labeling, for example of a cell donor.

Let's say I have 3 donors:

  • donor 15 (15 here could be an identifier for the donor in another database)
  • donor 52
  • donor 60

And from each of the three donors I have untreated cells (labeled "co"), cells treated with x (labeled "x"), and cells treated with x+y (labeled "x+y").
First I would like to assign "15" to repeat 1, "52" to repeat 2 and "60" to repeat 3. Then I would like to identify everything that is labeled "co" as control and "x" as treatment. After I finish this analysis, I might be interested in identifying "x" as control and "x+y" as treatment, or "co" as control and "x+y" as treatment.

Hope I was more clear now and thank you for your interest!

@chrherold
Contributor

chrherold commented Jul 23, 2018

@Alexandru-Emil: Thank you for participating in improving ShapeOut!

We are currently in the process of planning and designing bigger changes for the user interface and the more feedback we get the better.

A (semi)-automatic identification of control/treatment and repeats for the linear mixed models analysis seems like a good idea also for the "general" user.

I think it should not be too strictly derived from the name though.
Reasons:

  • different users will always use different patterns for names

  • different experiments of the same user may require different naming patterns

  • experiments may be a treatment in one sense and a control to a different treatment in another sense

So my suggestion would be as follows - and I am more than happy to discuss this in terms of practicability to implement (@paulmueller ) and to use (@Alexandru-Emil ).

Let's assume a dialogue box that opens when pushing a button (named e.g. "auto assign").

  • automatic identification will be carried out on all data sets that were selected and analyzed

  • in the dialogue, you will be asked to enter a phrase that identifies a control (which is then searched for in the titles of the plots/sample names)

  • you will have the option to select all other plots as treatment OR to enter a phrase that identifies treatment

  • similarly, you will have the option to additionally identify the reservoir measurements of control and treatment

Which leaves the identification of the repeat. Here, no fixed phrase will be helpful so one would need to tell the dialogue at which position to look for similarities to build a repeat group.
Say a box where you can specify to look at the first x characters of the title (in case you choose 8 characters, one would see):
"donor 01"
and compare it to
"donor 99"
of another title.
Likely it would be good to give another option to look at the last x characters instead.

After applying this search the drop down menus should all be set and manual re-assignment is still possible. And another automatic assignment would be possible to completely reset and redo the assignments.
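The prefix-based grouping proposed above might look roughly like this in Python. The function name and parameters are illustrative only; nothing here reflects ShapeOut's internals.

```python
def group_by_prefix(titles, n=8, from_end=False):
    """Group titles whose first (or last) `n` characters agree.

    Titles sharing the same n-character prefix land in the same
    repeat group, as in the "donor 01" / "donor 99" example.
    """
    groups = {}
    for title in titles:
        key = title[-n:] if from_end else title[:n]
        groups.setdefault(key, []).append(title)
    return groups

titles = ["donor 15 co", "donor 15 x", "donor 52 co", "donor 52 x"]
print(group_by_prefix(titles, n=8))
# {'donor 15': ['donor 15 co', 'donor 15 x'],
#  'donor 52': ['donor 52 co', 'donor 52 x']}
```

Each resulting group would then be assigned one repeat number; `from_end=True` covers the "last x characters" variant.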

BTW: @paulmueller, is there a reason why "repetition" ends at 9?

@paulmueller
Member

paulmueller commented Jul 23, 2018

BTW: @paulmueller, is there a reason why "repetition" ends at 9?

No, I think this was just the initial design.

@chrherold
Contributor

OK, I guess we should lift that limit with the automated detection, if you think this detection can be implemented without big trouble.
I see the advantage in the manual use because one would not really want a terribly long drop down.

@paulmueller
Member

The repeat selection could also be done with an integer spin box control.

@chrherold
Contributor

Spin box sounds good!

@maikherbig
Member

maikherbig commented Jul 30, 2018

I just talked to a colleague who faced the issue of having only 9 repetition numbers and now I'm happy to see that this issue is already being discussed :) Thanks again to @Alexandru-Emil for participating!
One more thing: The LinMixMod algorithm used in ShapeOut is explained in a publication, which could be cited if someone wants to use it for their own publication. Wouldn't it make sense to write at the bottom of the result (.txt popup) something like:
If you use this statistical test in a scientific publication, a citation of the following paper would be appreciated:
https://doi.org/10.1063/1.5027197
?

@paulmueller
Member

@maikherbig I am planning an extensive online documentation on readthedocs.org that will include a section "how to cite". In general, I think asking for a citation in the data output by a program is not a good idea.

@paulmueller
Member

paulmueller commented Aug 2, 2018

I implemented an automated classification (with similarity analysis to determine the repeat number).

Please test the installer of the development version:
https://ci.appveyor.com/project/paulmueller/shapeout/build/1.0.928/job/ntashbbdq1iqnh3t/artifacts

[EDIT]
direct link: https://ci.appveyor.com/api/buildjobs/ntashbbdq1iqnh3t/artifacts/.appveyor%2FOutput%2FShapeOut_0.8.6.dev25_win_64bit_setup.exe

@chrherold
Contributor

Thanks @paulmueller for implementing the automated classification. I have played a little bit with the new feature and I like the concept a lot.

I have the following feedback, which should be discussed by users who will use the feature more frequently than I do.

  1. I am not quite sure how the identification of the repetition works at the moment. I guess it looks for two titles that are similar outside of what was entered in control and treatment?
    If that is the case, I think it makes a lot of sense. But I have some questions for the more frequent users (e.g. @maikherbig):
    1a) Is it possible/of any use for the analysis that two treatment or control measurements have the same repetition number? If yes, this should be possible; if no, it should not be. This leads to
    1b) Is it possible/useful for the analysis to have mixed situations like: repetition 1 has control and treatment; repetition 2 only has control; repetition 3 only has treatment (...and similar)? What would it mean to have this mixed situation? In other words, how should numbering be handled if control and treatment measurements are not (strictly, or at all) pairwise correlated? Do you avoid numbering of repeats and give all the measurements the same number? Do you give a different number to every experiment? Currently it seems to sort all measurements that do not exist as pairs to the ID 1, even if a pair with that ID already exists. This seems strange in any case. The suggestion on how to handle such cases, however, depends on what the consequences are for the analysis.

  2. Currently, it seems mandatory to fill some phrase into the control field. I think it should also be possible to enter something only into the treatment field (and leave control empty) and make all the rest control. This would make situations where samples are named "...celltypeA" and "...celltypeA + drug" more intuitive to handle. (I guess one could just switch the meanings of control and treatment and make "+ drug" the control group to get the same result, just with an inverse effect. But this will not be the intuitive thing to do and might cause confusion.)

@paulmueller
Member

  1. To identify repetitions, a similarity analysis is done, i.e. all titles are compared to all other titles and matched in descending order of similarity. E.g. "sample 01 ctl" matches "sample 01" better than "sample 02". I am using Python's built-in difflib for that.

  2. Yes, this is currently mandatory. Please let me know whether this should be changed.
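Point 1 can be illustrated with `difflib` directly. This is only a sketch of the general idea; the exact scoring and matching order inside ShapeOut may differ.

```python
import difflib

def best_match(title, candidates):
    """Return the candidate title most similar to `title`,
    scored with difflib's SequenceMatcher ratio (0..1)."""
    ratios = {c: difflib.SequenceMatcher(None, title, c).ratio()
              for c in candidates}
    return max(ratios, key=ratios.get)

print(best_match("sample 01 ctl", ["sample 01", "sample 02"]))
# sample 01
```

Because "sample 01 ctl" shares nine consecutive characters with "sample 01" but only eight with "sample 02", the first candidate scores higher and the two measurements end up in the same repeat group.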

@maikherbig
Member

@chrherold
1a) Setting the state and repetition number of two experiments to the same value (e.g. two times Control No. 2) means that the data of these two experiments will be pooled. I think it is a nice feature and it should be allowed.
1b) In mixed situations, lme4 approximates the missing data by maximum likelihood estimation. This means in your example that you have to skip repetition number 2 for Treatment and repetition number 3 for Control and lme4 would then estimate these data-points.
The LMM based test should preferably be used as a paired test and the experimental design should be chosen accordingly, but in principle it is possible to do an unpaired test by giving each experiment a different repetition number. Having a different repetition number for each experiment (e.g. Control goes from 1 to 3 and Treatment from 4 to 6) means lots of data-points have to be estimated, and the model will likely not converge. If the LMM does not converge, it is documented in the output .txt like that:
"convergence code: 0
unable to evaluate scaled gradient
Model failed to converge: degenerate Hessian with 1 negative eigenvalues"
To answer your questions one by one:
Do you avoid numbering of repeats and give all the measurements the same number?: No. For LMM we actually need repeats. Otherwise it is not possible to get any information about the random error. Hence there would be an error message from lmer: "grouping factors must have > 1 sampled level"
Do you give a different number to every experiment?: This would equal an unpaired test, which is possible, but often the LMM does not converge. Therefore, I suggest thinking about an experimental design beforehand that permits pairing.
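The paired versus unpaired distinction above can be made concrete with two hypothetical assignments (the experiment names and the `repeats` helper are made up for this example, not ShapeOut code):

```python
# Paired design: each repeat number occurs for both states.
paired = {
    "donor15_co": ("control", 1), "donor15_x": ("treatment", 1),
    "donor52_co": ("control", 2), "donor52_x": ("treatment", 2),
    "donor60_co": ("control", 3), "donor60_x": ("treatment", 3),
}
# Unpaired design: every experiment gets its own repeat number.
unpaired = {
    "donor15_co": ("control", 1), "donor52_co": ("control", 2),
    "donor60_co": ("control", 3), "donor15_x": ("treatment", 4),
    "donor52_x": ("treatment", 5), "donor60_x": ("treatment", 6),
}

def repeats(assignment, state):
    """Set of repeat numbers used for one state (control/treatment)."""
    return {rep for s, rep in assignment.values() if s == state}

# Paired: every repeat has a measured counterpart in the other state.
print(repeats(paired, "control") == repeats(paired, "treatment"))    # True
# Unpaired: no overlap, so lme4 must estimate the missing
# counterparts and, as described above, may fail to converge.
print(repeats(unpaired, "control") & repeats(unpaired, "treatment"))  # set()
```

In the unpaired layout every treatment data point lacks a measured control with the same repeat number, which is exactly the situation where convergence problems appear.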

@phidahl
Contributor

phidahl commented Aug 7, 2018 via email

@chrherold
Contributor

@paulmueller and @maikherbig : thank you for the detailed explanations.

I guess the following changes to the system might be useful:

  1. As a minimum input information, make it mandatory to have an entry either to the control field OR to the treatment field. (Instead of an always mandatory entry to the control)

  2. Check if it is possible to not sort all "single" non-paired events to ID 1. Either make an ID 0 (if possible to handle in analysis) for all of those events or increase IDs even if non-paired events are found. Given @maikherbig's explanations, I would favor ID 0.

In general it seems to work nicely for paired data - which is the purpose. So especially 2) is just cosmetic. I think it is OK to demand manual corrections if the experimental design is not ideal.
