
CRAN Task View Proposal: CompositionalData #58

Closed
statlink opened this issue Sep 27, 2023 · 40 comments

Comments

@statlink

Hello,

I would like to propose a new CTV named CompositionalData. The CTV is about packages dedicated to compositional data analysis.
The relevant github link is
https://github.com/statlink/CompositionalData

Michail Tsagris

@dutangc

dutangc commented Sep 27, 2023

Dear Michail,
I think your proposal is excellent, and compositional data analysis deserves a CRAN task view.
Christophe

@pkR-pkR

pkR-pkR commented Sep 27, 2023

The list prepared by Michail includes all packages I know on compositional data analysis and even more. It is really a comprehensive list. I support it.

@zeileis
Contributor

zeileis commented Sep 27, 2023

Michail, I agree with the others, very nice proposal! I will read a few things in more detail but a couple of quick comments:

  • The scope of the task view is very clear and I think it would be a nice addition to the list of available views.
  • Connections to other task views could be brought out more clearly by adding cross-references with view(...) to other task views, where appropriate.
  • Have you reached out to potential co-maintainers? As the proposal guidelines explain: Task views should have teams of additional 1-5 co-maintainers to share the workload and reflect different perspectives. Ideally the co-maintainers should be a diverse group in terms of gender, origin, scientific field, etc.
  • The DirichletReg package would be another useful package, I think.
  • I'm tagging @matthias-da here because he might have further thoughts/additions.
  • Currently, there are a couple of links at the end. The first two (gR, Bioc) seem irrelevant. The others might be relevant - if so, they should ideally be worked into the main text (as explained above) using the view(...) tag (and not hard links to the CRAN master site).
  • Hyperlinks to books/journals publications should ideally be made using the DOI URL (https://doi.org/10...) to reduce the risk of getting outdated links in the future.

@statlink
Author

statlink commented Sep 28, 2023

Achim thanks for your nice comments.

  • I will make the cross-relations to other CTVs as you suggested, using the view(...).
  • I would like to add as a co-maintainer Patrice, because his package "RWsearch" helped me a lot in creating this CTV, plus he knows the field of compositional data.
  • The DirichletReg did not appear in my search with the "RWsearch" because I was using the keyword "compositional". Of course I will add this package.
  • I will remove those two links you mentioned. I included them only because I had seen them in other CTVs ().
  • I am using the journal links or the arXiv links of the papers. Do you still want me to change them?

@zeileis
Contributor

zeileis commented Sep 28, 2023

Patrice is definitely a good addition, welcome on board. Additionally, it would be good to increase diversity a bit and maybe find two more co-maintainers, ideally a female person and/or someone from a different region/field/application area etc.

For the links: DOIs will be more persistent and always resolve to the journal links (which may change over time). arXiv also added DOIs recently.

@statlink
Author

Ok, in that case I will add Christophe as well. He is an expert in the CTVs and from a different field, but not a female.
I will change the links with the DOIs everywhere, later on today. What shall I do about the books?

@matthias-da

matthias-da commented Sep 28, 2023 via email

@zeileis
Contributor

zeileis commented Sep 28, 2023

Sure, Matthias, that's perfectly fine. The other CRAN Task View Editors haven't reacted, yet, either.

So we're still in the review process stage and are still collecting feedback.

@zeileis
Contributor

zeileis commented Sep 28, 2023

Re: Michail.

Christophe is, of course, a great collaborator...but he is already the principal maintainer of two task views. So if possible I would ask you to reach out to other persons in order to distribute the workload better. Moreover, it would also be good to team up with people that really bring in a different perspective who might be aware of packages/activities/etc that you don't know, yet. So I would encourage you to think about potential co-maintainers and then reach out to them.

@pkR-pkR

pkR-pkR commented Sep 28, 2023 via email

@zeileis
Contributor

zeileis commented Sep 28, 2023

Patrice, thanks for the input! Two quick comments:

  • I wasn't necessarily thinking about Matthias becoming a co-maintainer as he is also already a principal maintainer for another task view. But being on two task views is also fine. Up to Matthias, really. My intention was mainly to get his feedback.
  • Rather than waiting for almost a year and making a call then, I would recommend that you think of the community and identify one or two persons that could add to the diversity (gender, ethnicity, scientific background, geographical location, ...) and who are reasonably active/familiar with the R package landscape. And then just try to reach out to them directly.

@statlink
Author

Achim hello.

  • Up to now I have included Patrice as the co-maintainer. Why do we need someone who is active with the R package landscape? A CTV requires 2 or 3 maintainers?

  • I changed almost all links to the doi style.

@zeileis
Contributor

zeileis commented Sep 28, 2023

Thanks for the DOIs!

Regarding the maintenance: Maintainers should pick up new and interesting packages, tutorials, etc. that are relevant for the task view. So it's good to have people with different backgrounds (scientific field, methodology vs. applications, geographical region, etc.) who follow what is going on in the R world from their perspective. Then they will notice different relevant innovations.

@statlink
Author

In that case can I ask someone from the bioinformatics field to join us?
Further, how can we add bioconductor packages, if I find any?

@zeileis
Contributor

zeileis commented Sep 28, 2023

Bioinformatics sounds good to me. But maybe wait for Matthias in case he has further suggestions.

Adding Bioconductor packages can be done via bioc(...). See: https://github.com/cran-task-views/ctv/blob/main/Documentation.md#main-text
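
As a rough illustration of the markup discussed in this thread, a passage of a task view's main text might combine `pkg(...)`, `view(...)` and `bioc(...)` as shown below. The sentences themselves and the choice of cross-referenced views (`Robust`, `MissingData`) are assumptions made for this sketch, not text from the proposal.

```md
General-purpose log-ratio methods are available in `r pkg("compositions")` and
`r pkg("robCompositions")`. Robust estimation in general is covered by the
`r view("Robust")` task view, and missing values by `r view("MissingData")`.
For sequencing data, the Bioconductor package `r bioc("coseq")` may be relevant.
```

Using these helpers instead of hard links means the links are generated when the view is rendered, which is the point of the earlier comment about avoiding hard links to the CRAN master site.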

@statlink
Author

Thanks Achim,
I will add more packages from there if I find any.

@tuxette
Contributor

tuxette commented Sep 28, 2023

Sorry to join the conversation this late! Thanks for the proposal, which is indeed very useful, @statlink. I agree that the "Bioiformatics/ecology related packages" section (be careful: there is a typo) could maybe be improved. The use of methods explicitly using the compositional nature of the data is the standard in metagenomics, and this could be a subsection of this part (I can help you find some packages of interest in addition to the ones that you already cited). For other types of omics, such as sequencing data in general, it is less standard but sometimes useful for some tasks (for instance, the Bioconductor package coseq is one of them). The maintainer of this package (Andrea) might be someone to contact to get some help (I think that biostatisticians are more likely to have interesting clues than bioinformaticians in this area, but I might be wrong).

@tuxette
Contributor

tuxette commented Sep 28, 2023

And, in addition, I am far from being an expert, but some omics data obtained from spectrometers (proteomics, metabolomics) are also often compositional (you cite some packages related to this in your current proposal) and, in my opinion, this should be a different subsection (because the reason for the compositional nature of the data is very different from metagenomics and other kinds of sequencing data).

@statlink
Author

Tuxette hi. I am not an expert either, and I included all of them in one section because this is a different field to mine.

@tuxette
Contributor

tuxette commented Sep 28, 2023

I can help you sort this section (if you agree of course). I'll try to do that next week if that works for you?

@statlink
Author

statlink commented Sep 28, 2023

We need Achim to agree with this also. Because in that case you would have to be a co-maintainer.

@zeileis
Contributor

zeileis commented Sep 28, 2023

Nathalie @tuxette is a CRAN Task View Editor - like myself. And we help to improve task view proposals while they are under review, so that we can eventually approve them. (See also the proposal guidelines.)

@statlink
Author

Achim I am happy if she joins us alongside Patrice.

@matthias-da

Dear all

Thanks for this initiative. Really great to see so many (new) packages in this field and you did a great job finding and listing them.
However, I am afraid that such a CTV needs (a lot of) further discussion.

In short:
I think the current version of the task view needs re-writing almost from scratch. And I strongly recommend asking people from the inner core of the field of compositional data analysis to participate. So in my point of view, a CTV needs a mixture of enthusiastic guys and guys from the inner circle of the CoDa community.

In long:

  1. Theoretical aspects:
    I do not agree with the first sentence: "Compositional data are positive multivariate data where the sum of the values of each vector sums to the same constant," since this is a very old and outdated view of the topic from the 1990s and it is not needed. Think of household expenditures or chemical concentrations in ppm of soil samples. Each composition can thus have a different constant, and closing to an arbitrary constant like 1 or 100 is not needed, since any multiple of a composition belongs to the same equivalence class. I thus recommend writing instead:
    "Compositional data are positive multivariate data where the sum of the values of each vector sums to a whole." or similar.

I also disagree with the second sentence: "The most popular approach is to use the logarithm transformation applied to ratios of the variables, initially suggested by Aitchison (1982). However this approach has drawbacks and for this many alternative transformations have been developed throughout the years.", since you most probably meant the additive log-ratio and centered log-ratio, and with "many" you most probably mean only the isometric log-ratio transformation (plus some "exotic" ones), since most of the other power transformations do not fulfill the principles of compositional data analysis. I would thus recommend: "The most popular approach is to apply a log-ratio analysis, initially suggested by Aitchison (1982)".

From these, you may see that I propose that at least one guy from the inner circle of CoDa should be included in the task view. This could be, e.g. Karel Hron or - in case you need a woman: Kamila Facevicova. They could be the CoDa police ;-)
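
The equivalence-class argument can be illustrated with a small base-R sketch. It is only an illustration under stated assumptions: the helper names `closure` and `clr` and the expenditure values are made up here and are not taken from any of the packages discussed.

```r
# Rescale a positive vector so that its parts sum to `total` (the "closure").
closure <- function(x, total = 1) total * x / sum(x)

# Centered log-ratio: log of each part relative to the geometric mean.
clr <- function(x) log(x) - mean(log(x))

# Made-up household expenditures in some currency unit.
expenditures <- c(food = 300, housing = 900, other = 600)

as_share   <- closure(expenditures)       # parts sum to 1
as_percent <- closure(expenditures, 100)  # parts sum to 100

# All three representations carry the same relative information,
# so their centered log-ratios coincide.
same_raw_vs_share     <- isTRUE(all.equal(clr(expenditures), clr(as_share)))
same_share_vs_percent <- isTRUE(all.equal(clr(as_share), clr(as_percent)))
```

Both comparisons hold because multiplying every part by the same constant shifts each log and the mean of the logs by the same amount, which cancels in the difference. This is why the choice of closing constant does not matter for a log-ratio analysis.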

  2. It is not good practice to list your own package - which I did not even know until now - in first place in the list of packages. I think those packages that are used most, which are well-known and used by the community, and which include the widest variety of methods could be listed first.

  3. I think the general purpose ones are: compositions, robCompositions, Compositional, easyCoDa.
    (zCompositions is special; robCompositions always includes both non-robust and robust alternatives and a lot of non-robust methods, e.g. for the analysis of compositional tables).

  4. The robust ones are (listed by order of methods provided): robCompositions, complmrob, robregcc, rrcov3way.

  5. Some topics are missing, e.g. the problem of rounded zeros and structural zeros is one of the most important problems for practitioners in the field. E.g. all data from chemometrics, biomics and chemistry come with rounded zeros. Package zCompositions and package robCompositions can be listed on this subject matter.

  6. I would recommend structuring "Other packages" and "Bioinformatics/ecology related packages" into more specific fields, e.g. Biomics, Chemometrics, Ecology/Biology. And also give some ideas on how to deal with high-dimensional data.

  7. It would be good to include somebody from among the authors of the main packages, because they play a central role in the community and have pushed the topic in recent years. This could be Raimon Tolosana-Delgado from the compositions package, it could be myself from the robCompositions package, it could be Javier Palarea-Albaladejo from the zCompositions package. All these guys are well-connected with the compositional community.

  8. A more complicated question is whether one should categorize methods in packages according to whether or not they fulfill the three principles of a compositional analysis. For example, the Dirichlet distribution and Dirichlet regression are helpful in many situations, but they are not subcompositionally coherent, which can cause trouble: dependencies between parts are not modelled in a compositional sense, and results can contradict those obtained from a subcomposition that does not include all parts. I have no answer to this question about distinguishing compositional methods fulfilling the main principles of CoDa from those which relax/violate some of the principles (sometimes for good reasons), but want to point out that you may think about this matter.

  9. The description of packages is often a 1:1 copy from the package description, and thus much too long. As an example:
    r pkg("ArArRedux"): Processes noble gas mass spectrometer data to determine the isotopic composition of argon (comprised of Ar36, Ar37, Ar38, Ar39 and Ar40) released from neutron-irradiated potassium-bearing minerals. Then uses these compositions to calculate precise and accurate geochronological ages for multiple samples as well as the covariances between them. Error propagation is done in matrix form, which jointly treats all samples and all isotopes simultaneously at every step of the data reduction process. Includes methods for regression of the time-resolved mass spectrometer signals to t=0 ('time zero') for both single- and multi-collector instruments, blank correction, mass fractionation correction, detector intercalibration, decay corrections, interference corrections, interpolation of the irradiation parameter between neutron fluence monitors, and (weighted mean) age calculation. All operations are performed on the logs of the ratios between the different argon isotopes so as to properly treat them as 'compositional data'. --> 1:1 from CRAN https://cran.r-project.org/web/packages/ArArRedux/index.html
    It is - at least in my point of view - not the aim of the CTV to copy and paste package descriptions, but to shorten them and bring only the main message.

  10. Instead of ternary diagrams (which are of limited use) I would recommend either deleting this section or making a section "Visualisation" (and trying hard to think which other visualizations should be listed), where ternary diagrams are only one of the visualizations.

I am sorry to be so critical. Despite the criticism, I really look forward to such a task view, but my impression is that the current version needs a lot of discussion and re-writing and also needs people from the inner circle (e.g. some of those I mentioned in my points (1) and (7)).

@tuxette
Contributor

tuxette commented Sep 29, 2023

Achim I am happy if she joins us alongside Patrice.

I am afraid that would be too much for me but I can help in organizing things with the "bioinformatics" part. However, maybe first, I think that Matthias's comments above have to be accounted for. I agree with most of them (but I am not an expert in CoDa), especially with comment 6 (which is in line with my previous comment) and also with the fact that the description of packages is too long.

@zeileis
Contributor

zeileis commented Sep 29, 2023

Matthias @matthias-da, thank you for the thorough feedback, this is very much appreciated...and exactly what I would have hoped for. I agree with Nathalie @tuxette that this feedback should be incorporated first.

Michail @statlink, Matthias' feedback reflects why we push for a diverse team of co-maintainers. What feels completely obvious and natural for some readers might feel awkward for others. So rather than pushing for one side or the other, we try to make the task view accessible for all sorts of different readers from different backgrounds. Hence, establishing a mixed team is a good idea.

@pkR-pkR

pkR-pkR commented Sep 29, 2023 via email

@dutangc

dutangc commented Sep 29, 2023

Thanks @matthias-da .

Regarding your points:

  1. the view would benefit from any contribution, in particular women, but I don’t think that a contributor of the view has to endorse the role of policeman.
  2. do you suggest to create subsections for rounded-zeros or true zeros?
  3. for people who do not know compositional analysis, like me, we cannot expect readers of the view to know what the three principles are unless they are stated at the beginning. Is that what you want in the introduction?

regarding your point 3., 5.,..., do you have a proposal for the structure/outline of the view?

@matthias-da

ad 1. I wrote the police with a ;-) The theory is not that simple in compositional data analysis and I outlined pitfalls in the intro. Thus - in my point of view - the CTV would benefit from someone who has outstanding theoretical knowledge (and is using R in daily business and is from the "inner circle" of the compositional data community). In my point of view, somebody from the "Viennese/Czech group" (Peter Filzmoser, Karel Hron, Matthias Templ) or/and from the "German group" (best suited from this group is Raimon Tolosana-Delgado) or/and from the "Girona group" (best suited from this group is Javier Palarea-Albaladejo) should be part of it; at least this would be natural looking at their achievements in the field. This doesn't mean that you are not experts, it's just to have somebody on board with the traditional (log-ratio analysis) view.

ad 2. Personally, I would create one section "Rounded zeros, structural zeros, count zeros and missing values" and make paragraphs for all these issues. One might also give the name "Preprocessing of compositional data" as an alternative.

ad 3. As already written in my point (8), unfortunately, I have no answer to this question, but it should be discussed. Whenever the principles are introduced in the beginning, one probably should give a mark on methods that do not fulfill the three key principles of CoDa (scale invariance, sub-compositional coherence (including subcompositional dominance and ratio preservingness), and permutation invariance). I see this as an open question of how to deal with this. I tend to not discuss this matter in the CTV, because it would involve a deep dive into all methods listed.
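
To make the three principles a bit more concrete, here is a minimal base-R sketch. The four-part composition and the helper `closure` are made up for illustration and do not come from any package in the thread. The sketch only checks the simplest consequence of each principle, namely that the ratio between two parts is unaffected.

```r
# Rescale a positive vector so that its parts sum to 1.
closure <- function(x) x / sum(x)

# Made-up four-part composition.
x <- closure(c(a = 2, b = 6, c = 12, d = 20))

# Scale invariance: the ratio a/b does not depend on the measurement unit.
ratio_full   <- unname(x["a"] / x["b"])
ratio_scaled <- unname((1000 * x["a"]) / (1000 * x["b"]))

# Subcompositional coherence: drop part d and re-close the remaining parts.
sub       <- closure(x[c("a", "b", "c")])
ratio_sub <- unname(sub["a"] / sub["b"])  # the ratio a/b is preserved ...
share_full <- unname(x["a"])              # ... but the raw share of a
share_sub  <- unname(sub["a"])            #     changes after re-closing

# Permutation invariance: reordering the parts leaves the ratio unchanged.
perm       <- x[c("d", "c", "b", "a")]
ratio_perm <- unname(perm["a"] / perm["b"])
```

Here the raw share of part `a` doubles when `d` is dropped, while the ratio `a/b` stays the same. Methods built on raw shares can therefore give different answers on a subcomposition, which is the kind of contradiction described for non-coherent methods above.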

ad. regarding the outline: I am not sure about the structure. I see several possibilities. One lists packages according to the type of methods (such as regression methods, compositional tables, robust methods, visualization, high-dimensional data, ...), and the other one lists packages (also) based on applicational fields (such as omics science and bioinformatics, chemometrics, ecology, ...).
Personally, I think the main categories should be built based on the kind of methods and there could be some extra sections with very specialized fields (or even subsume them in the previous sections and have 8) High-dimensional data as the last section).

Maybe something like this?

  1. General purpose packages
  2. Robust methods
  3. Rounded zeros, structural zeros, count zeros, and missing values
  4. Regression modelling
  5. Functional data analysis and probability density functions
  6. Contingency tables and compositional tables
  7. Visualization (?)
  8. Special applications in Omics science and bioinformatics (?) including high-dimensional data (?)
  9. Special applications in ecology (?)

However, there are other methods like cluster analysis, discriminant analysis and classification methods, principal component analysis, and correlation analysis. Why would they be less important than "regression analysis", for example? So should one extend the above list with another (at least) 4 sections on these methods? And why not also have a section on log-ratio (and other) transformations in the beginning? Another problem may be that packages such as compositions and robCompositions could be listed in almost all sections. I think this all needs further discussion, and I am afraid that it might need time to find a good solution.

Another idea is to have a similar structure on sections like the sections in the books of CoDa:

@pkR-pkR

pkR-pkR commented Oct 2, 2023 via email

@zeileis
Contributor

zeileis commented Oct 2, 2023

Patrice @pkR-pkR thanks for this! There's no rush.

@tuxette
Contributor

tuxette commented Oct 2, 2023

@pkR-pkR : No rush indeed. Since Matthias has suggested deep modifications, tell me when you have a first version and I'll make my suggestion on that basis (next week at best probably).

@tuxette
Contributor

tuxette commented Mar 14, 2024

@pkR-pkR : There has been no activity in this discussion since last October. There is no rush but I'm checking whether you still plan to submit this proposal?

@matthias-da

Alternative: I can imagine completely rewriting this CTV from scratch together with Raimon Tolosana-Delgado and Javier Palarea-Albaladejo. Both are experts in compositional data analysis and R and well-known in the community. What do you think?

@zeileis
Contributor

zeileis commented Mar 14, 2024

From the viewpoint of the CRAN Task View Editors it would be best if the different approaches to this topic could be resolved unanimously - with contributors from both sides! So maybe - now that some time has passed since the original proposal - you can coordinate a revision that you do jointly and that encompasses ideas from both sides?

That would be much preferred over a decision between two different teams of co-maintainers with different ideas.

@matthias-da

Agreed. I offered my participation and listed other potential co-authors in October, and that is surely still a good way to proceed.

Best
Matthias

@zeileis
Contributor

zeileis commented Mar 14, 2024

Thanks, Matthias, very much appreciated!

@zeileis
Contributor

zeileis commented Sep 22, 2024

Michail @statlink and Patrice @pkR-pkR, we haven't had any update from you in almost a year. Hence, it's time to close it.

Matthias @matthias-da, if you still want to propose something on the same topic, feel free to create a new issue. If you do so, I would ask you to consider including some of the ideas of Michail, Patrice, and Christophe.

@zeileis zeileis closed this as completed Sep 22, 2024
@matthias-da

Achim @zeileis, I would then proceed with some co-authors and specialists in the field mentioned earlier. However, we surely need 1-2 months from now to have a version to share.

@zeileis
Contributor

zeileis commented Oct 22, 2024

Matthias, ok, good, thanks for the follow-up. Just open a new issue for the proposal when you are ready to do so. In the issue please also briefly discuss how you incorporated the ideas from this first proposal. Thanks!
