Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discuss: TIGER implementations are confusing #103

Closed
roblanf opened this issue May 18, 2016 · 9 comments
Closed

Discuss: TIGER implementations are confusing #103

roblanf opened this issue May 18, 2016 · 9 comments
Milestone

Comments

@roblanf
Copy link
Collaborator

roblanf commented May 18, 2016

Hey @brettc @pbfrandsen @wrightaprilm,

Here's something to figure out. Right now we have THREE implementations of TIGER, and this is confusing. They are:

  1. Paul's TIGER code in C
  2. Brett's TIGGER code in C
  3. Paul's pure python TIGER code

Right now these can all be called from the commandline with options like --kmeans tiger or --kmeans fast_tiger. It's kinda confusing. My question is, what is the best thing for users here? We don't officially support TIGER for DNA or protein any more (entropy is just as good, and ~infinitely quicker), but TIGER might still be useful for morphology datasets, which are much smaller.

Here are two options, let me know which you prefer, or if there are others you think are better.

Option 1

Delete all C code, retain Paul's pure python implementation, and only have TIGER work with morphology (these datasets are small, and pure python is OK here). If users call --kmeans tiger from the commandline with DNA or protein, they get back an error.

Pros: v. simple for users, and for us to maintain.
Cons: users don't get to do TIGER rates on DNA/Protein any more (do they care?)

Option 2

Retain one of the C implementations (in which case Brett/Paul need to be willing to maintain it if there are bugs, which of course there aren't, and also answer inevitable installation/compilation questions on the forum), and use that for DNA/Protein, but pure python for morphology (so that we avoid problems in the most common use case).

My vote is for Option 1. Because it's the simplest, and doesn't require users to ever compile C code, which is I think a step too far for most users, and will just put people off. We can keep the code in place for running fastTIGER and TIGGER, but just remove the options to do so, and mentions of them in the docs.

Thoughts?

R

@brettc
Copy link
Owner

brettc commented May 18, 2016

I vote for option 1 as well.

The python / cython code for tigger is available in a separate repository here: https://github.com/brettc/tigger in any case. This would be a good starting place for future Tiger/python projects.

A comment though: I think the reason to remove it is that the entropy stuff works better. I don't think a good reason to remove it is because: "users find compiling C hard". We already distribute compiled C code (raxml and phyml). If we didn't have the entropy solution, we'd simply have to distribute some more compiled c code (which, of course, would make our lives more difficult).

I should add that compiling tigger is, in fact, very easy. You just need to right dependencies installed. The hard bit is making sure you have all the right dependencies installed (ie. the reason Anaconda is so good).

Cheers,
Brett

On 19 May 2016, 9:55 AM +1200, roblanfnotifications@github.com, wrote:

Hey@brettc(https://github.com/brettc)@pbfrandsen(https://github.com/pbfrandsen)@wrightaprilm(https://github.com/wrightaprilm),

Here's something to figure out. Right now we have THREE implementations of TIGER, and this is confusing. They are:

Paul's TIGER code in C
Brett's TIGGER code in C
Paul's pure python TIGER code

Right now these can all be called from the commandline with options like--kmeans tiger`or--kmeans fast_tiger``. It's kinda confusing. My question is, what is the best thing for users here? We don't officially support TIGER for DNA or protein any more (entropy is just as good, and ~infinitely quicker), but TIGER might still be useful for morphology datasets, which are much smaller.

Here are two options, let me know which you prefer, or if there are others you think are better.

Option 1

Delete all C code, retain Paul's pure python implementation, and only have TIGER work with morphology (these datasets are small, and pure python is OK here). If users call --kmeans tiger from the commandline with DNA or protein, they get back an error.

Pros: v. simple for users, and for us to maintain.
Cons: users don't get to do TIGER rates on DNA/Protein any more (do they care?)

Option 2

Retain one of the C implementations (in which case Brett/Paul need to be willing to maintain it if there are bugs, which of course there aren't, and also answer inevitable installation/compilation questions on the forum), and use that for DNA/Protein, but pure python for morphology (so that we avoid problems in the most common use case).

My vote is forOption 1. Because it's the simplest, and doesn't require users to ever compile C code, which is I think a step too far for most users, and will just put people off. We can keep the code in place for running fastTIGER and TIGGER, but just remove the options to do so, and mentions of them in the docs.

Thoughts?

R


You are receiving this because you were mentioned.
Reply to this email directly orview it on GitHub(#103)

@roblanf
Copy link
Collaborator Author

roblanf commented May 18, 2016

Fair enough.

To continue the argument, I would say that the compiled C code we
distribute right now is code that is regularly compiled by 1000's of people
on lots of systems, and has dedicated support forums not run by us. My
concern is that our C code has not (yet) been widely used, and we'd be
doing all the support. Which is something I'd bet none of us are super
excited about.

R

On 19 May 2016 at 08:35, Brett Calcott notifications@github.com wrote:

I vote for option 1 as well.

The python / cython code for tigger is available in a separate repository
here: https://github.com/brettc/tigger in any case. This would be a good
starting place for future Tiger/python projects.

A comment though: I think the reason to remove it is that the entropy
stuff works better. I don't think a good reason to remove it is because:
"users find compiling C hard". We already distribute compiled C code (raxml
and phyml). If we didn't have the entropy solution, we'd simply have to
distribute some more compiled c code (which, of course, would make our
lives more difficult).

I should add that compiling tigger is, in fact, very easy. You just need
to right dependencies installed. The hard bit is making sure you have all
the right dependencies installed (ie. the reason Anaconda is so good).

Cheers,
Brett

On 19 May 2016, 9:55 AM +1200, roblanfnotifications@github.com, wrote:

Hey@brettc(
https://github.com/brettc)@pbfrandsen(https://github.com/pbfrandsen)@wrightaprilm(https://github.com/wrightaprilm
),

Here's something to figure out. Right now we have THREE implementations
of TIGER, and this is confusing. They are:

Paul's TIGER code in C
Brett's TIGGER code in C
Paul's pure python TIGER code

Right now these can all be called from the commandline with options
like--kmeans tiger`or--kmeans fast_tiger``. It's kinda confusing. My
question is, what is the best thing for users here? We don't officially
support TIGER for DNA or protein any more (entropy is just as good, and
~infinitely quicker), but TIGER might still be useful for morphology
datasets, which are much smaller.

Here are two options, let me know which you prefer, or if there are
others you think are better.

Option 1

Delete all C code, retain Paul's pure python implementation, and only
have TIGER work with morphology (these datasets are small, and pure python
is OK here). If users call --kmeans tiger from the commandline with DNA or
protein, they get back an error.

Pros: v. simple for users, and for us to maintain.
Cons: users don't get to do TIGER rates on DNA/Protein any more (do they
care?)

Option 2

Retain one of the C implementations (in which case Brett/Paul need to be
willing to maintain it if there are bugs, which of course there aren't, and
also answer inevitable installation/compilation questions on the forum),
and use that for DNA/Protein, but pure python for morphology (so that we
avoid problems in the most common use case).

My vote is forOption 1. Because it's the simplest, and doesn't require
users to ever compile C code, which is I think a step too far for most
users, and will just put people off. We can keep the code in place for
running fastTIGER and TIGGER, but just remove the options to do so, and
mentions of them in the docs.

Thoughts?

R


You are receiving this because you were mentioned.
Reply to this email directly orview it on GitHub(
#103)


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#103 (comment)

Rob Lanfear
School of Biological Sciences,
Macquarie University,
Sydney

phone: +61 (0)2 9850 8204

www.robertlanfear.com

@brettc
Copy link
Owner

brettc commented May 18, 2016

Yep!

On 19/05/2016, at 11:22 AM, roblanf notifications@github.com wrote:

Fair enough.

To continue the argument, I would say that the compiled C code we
distribute right now is code that is regularly compiled by 1000's of people
on lots of systems, and has dedicated support forums not run by us. My
concern is that our C code has not (yet) been widely used, and we'd be
doing all the support. Which is something I'd bet none of us are super
excited about.

R

On 19 May 2016 at 08:35, Brett Calcott notifications@github.com wrote:

I vote for option 1 as well.

The python / cython code for tigger is available in a separate repository
here: https://github.com/brettc/tigger in any case. This would be a good
starting place for future Tiger/python projects.

A comment though: I think the reason to remove it is that the entropy
stuff works better. I don't think a good reason to remove it is because:
"users find compiling C hard". We already distribute compiled C code (raxml
and phyml). If we didn't have the entropy solution, we'd simply have to
distribute some more compiled c code (which, of course, would make our
lives more difficult).

I should add that compiling tigger is, in fact, very easy. You just need
to right dependencies installed. The hard bit is making sure you have all
the right dependencies installed (ie. the reason Anaconda is so good).

Cheers,
Brett

On 19 May 2016, 9:55 AM +1200, roblanfnotifications@github.com, wrote:

Hey@brettc(
https://github.com/brettc)@pbfrandsen(https://github.com/pbfrandsen)@wrightaprilm(https://github.com/wrightaprilm
),

Here's something to figure out. Right now we have THREE implementations
of TIGER, and this is confusing. They are:

Paul's TIGER code in C
Brett's TIGGER code in C
Paul's pure python TIGER code

Right now these can all be called from the commandline with options
like--kmeans tiger`or--kmeans fast_tiger``. It's kinda confusing. My
question is, what is the best thing for users here? We don't officially
support TIGER for DNA or protein any more (entropy is just as good, and
~infinitely quicker), but TIGER might still be useful for morphology
datasets, which are much smaller.

Here are two options, let me know which you prefer, or if there are
others you think are better.

Option 1

Delete all C code, retain Paul's pure python implementation, and only
have TIGER work with morphology (these datasets are small, and pure python
is OK here). If users call --kmeans tiger from the commandline with DNA or
protein, they get back an error.

Pros: v. simple for users, and for us to maintain.
Cons: users don't get to do TIGER rates on DNA/Protein any more (do they
care?)

Option 2

Retain one of the C implementations (in which case Brett/Paul need to be
willing to maintain it if there are bugs, which of course there aren't, and
also answer inevitable installation/compilation questions on the forum),
and use that for DNA/Protein, but pure python for morphology (so that we
avoid problems in the most common use case).

My vote is forOption 1. Because it's the simplest, and doesn't require
users to ever compile C code, which is I think a step too far for most
users, and will just put people off. We can keep the code in place for
running fastTIGER and TIGGER, but just remove the options to do so, and
mentions of them in the docs.

Thoughts?

R


You are receiving this because you were mentioned.
Reply to this email directly orview it on GitHub(
#103)


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#103 (comment)

Rob Lanfear
School of Biological Sciences,
Macquarie University,
Sydney

phone: +61 (0)2 9850 8204

www.robertlanfear.com

You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub #103 (comment)

@wrightaprilm
Copy link
Contributor

Option 1. Whatever streamlines maintenance and support.

@pbfrandsen
Copy link
Collaborator

I think we definitely need to simplify things with the commands (maybe just
--tiger when using morphology?). I would vote to leave the code in for
people that really want to use it, but not offer any support.

Paul

On Thu, May 19, 2016 at 11:46 AM April Wright notifications@github.com
wrote:

Option 1. Whatever streamlines maintenance and support.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#103 (comment)

@cmayer
Copy link

cmayer commented May 20, 2016

I think Paul made a good suggestion.
Some people might want to use the code, including Paul.

Best
Christoph

Am 20.05.2016 um 14:53 schrieb Paul Frandsen notifications@github.com:

I think we definitely need to simplify things with the commands (maybe just
--tiger when using morphology?). I would vote to leave the code in for
people that really want to use it, but not offer any support.

Paul

On Thu, May 19, 2016 at 11:46 AM April Wright notifications@github.com
wrote:

Option 1. Whatever streamlines maintenance and support.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#103 (comment)


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub


Dr. Christoph Mayer
Email: c.mayer.zfmk@uni-bonn.de
Tel.: +49 (0)228 9122 403

Zoologisches Forschungsmuseum Alexander Koenig

  • Leibniz Institut für Biodiversität der Tiere -
    Adenauerallee 160
    53113 Bonn, Germany
    www.zfmk.de

Stiftung des öffentlichen Rechts; Direktor: Prof. J. W. Wägele
Sitz: Bonn


@roblanf
Copy link
Collaborator Author

roblanf commented Jun 1, 2016

Hey @pbfrandsen do you have your pure python TIGER code running and tested yet? If so, can you pull in changes to master, then submit a pull request to master?

Once we have it in there and working, I'll clean up the C code issues as follows:

  1. Remove all C code
  2. Leave the hooks in the Python code so that, in principle, someone (us!) could use TIGGER or fast_Tiger if we wanted.
  3. Set it up so that k-means always works with entropy by default. But that if people want to, then can call --kmeans tiger at the commandline, and this will use your Python implementation.

@roblanf roblanf added this to the PF2.0 Beta milestone Jun 1, 2016
@pbfrandsen
Copy link
Collaborator

Hi Rob,

It's all implemented and running fine. Just need to write a couple of
tests. I'm in Panama and taking off on a boat tomorrow, but I'll do my best
to write a few tests this week.

Paul

sent from my mobile phone, apologies for any typos
On May 31, 2016 8:46 PM, "roblanf" notifications@github.com wrote:

Hey @pbfrandsen https://github.com/pbfrandsen do you have your pure
python TIGER code running and tested yet? If so, can you pull in changes to
master, then submit a pull request to master?

Once we have it in there and working, I'll clean up the C code issues as
follows:

  1. Remove all C code
  2. Leave the hooks in the Python code so that, in principle, someone
    (us!) could use TIGGER or fast_Tiger if we wanted.
  3. Set it up so that k-means always works with entropy by default. But
    that if people want to, then can call --kmeans tiger at the commandline,
    and this will use your Python implementation.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#103 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/ABvScfBY5ZmrzhDqggzeGk2DLZG3tm6Nks5qHOQZgaJpZM4IhvHP
.

@roblanf
Copy link
Collaborator Author

roblanf commented Sep 18, 2016

Done.

Entropy is now the only option for DNA and protein - TIGER is just too slow, especially since lots of people have datasets with of the order of ~1m sites these days. Also Entropy seems to just work better.

Morphology can use entropy (default) or TIGER ('--kmeans tiger' at the command line).

@roblanf roblanf closed this as completed Sep 18, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants