evomics2019

Lab1: RAxML-NG Basics

Exercise 0 : Getting ready

Check input data:

cd /home/phylogenomics/workshop_materials/ng-tutorial/
ls

you should see roughly the following:

5.phy   dna.map    multi199.part  multi5.fa   prim2.part  prim.part  prim.phy.raxml.log  prot140.phy  prot21.fa.ckp  prot21.fa.out   prot21.part  README.md   terrace.part
bad.fa  fusob.phy  multi199.phy   multi5.map  prim.cons   prim.phy   prot140.part        prot21.fa    prot21.fa.log  prot21.fa.tree  rbcl.phy     terrace.fa

Run RAxML-NG binary without parameters

raxml-ng

this will show RAxML-NG version and short usage help:

RAxML-NG v. 0.8.0 BETA released on 11.01.2019 by The Exelixis Lab.
[...]
Usage: raxml-ng [OPTIONS]

Commands (mutually exclusive):
  --help                                     display help information
[...]

It is always a good idea to run alignment sanity check before starting the actual analysis:

raxml-ng --check --msa prim.phy --model GTR+G

This alignment is clean, so you should see:

Alignment can be successfully read by RAxML-NG.

4*. Run --check command again with bad.fa and examine error and warning messages.

Exercise 1 : Tree search

Infer ML tree with default parameters (10 random + 10 parsimony starting trees):

raxml-ng --msa prim.phy --model GTR+G --prefix S1

Check log-likelihoods for all 20 resulting trees:

grep "logLikelihood:" S1.raxml.log

Result:

[00:00:00] ML tree search #1, logLikelihood: -5708.961164
[00:00:01] ML tree search #2, logLikelihood: -5709.001321
[00:00:02] ML tree search #3, logLikelihood: -5708.928444
[00:00:03] ML tree search #4, logLikelihood: -5708.958315
[00:00:03] ML tree search #5, logLikelihood: -5708.932260
[00:00:04] ML tree search #6, logLikelihood: -5708.941449
[00:00:05] ML tree search #7, logLikelihood: -5708.959505
[00:00:05] ML tree search #8, logLikelihood: -5708.951658
[00:00:06] ML tree search #9, logLikelihood: -5709.022061
[00:00:07] ML tree search #10, logLikelihood: -5708.926872
[00:00:08] ML tree search #11, logLikelihood: -5709.016549
[00:00:08] ML tree search #12, logLikelihood: -5709.022648
[00:00:09] ML tree search #13, logLikelihood: -5709.009746
[00:00:10] ML tree search #14, logLikelihood: -5709.012081
[00:00:10] ML tree search #15, logLikelihood: -5709.017948
[00:00:11] ML tree search #16, logLikelihood: -5709.017067
[00:00:11] ML tree search #17, logLikelihood: -5709.030238
[00:00:12] ML tree search #18, logLikelihood: -5709.014300
[00:00:13] ML tree search #19, logLikelihood: -5709.018029
[00:00:13] ML tree search #20, logLikelihood: -5709.017251

Check topological distances between trees (so-called Robinson-Foulds or RF distance):

raxml-ng --rfdist --tree S1.raxml.mlTrees --prefix RF1

Result:

Average absolute RF distance in this tree set: 0.000000
Average relative RF distance in this tree set: 0.000000
Number of unique topologies in this tree set: 1

4*. Repeat steps 1 and 2 for fusob.phy.

raxml-ng --msa fusob.phy --model GTR+G --prefix S2

grep "logLikelihood:" S2.raxml.log

[00:00:07] ML tree search #1, logLikelihood: -9974.666846
[00:00:13] ML tree search #2, logLikelihood: -9974.669424
[00:00:20] ML tree search #3, logLikelihood: -9974.673880
[00:00:25] ML tree search #4, logLikelihood: -9980.602445
[00:00:30] ML tree search #5, logLikelihood: -9974.670042
[00:00:36] ML tree search #6, logLikelihood: -9980.601596

raxml-ng --rfdist --tree S2.raxml.mlTrees --prefix RF2

Reading input trees from file: S2.raxml.mlTrees
Loaded 6 trees with 38 taxa.

Average absolute RF distance in this tree set: 4.266667
Average relative RF distance in this tree set: 0.060952
Number of unique topologies in this tree set: 2

Exercise 2 : Bootstrapping

Infer bootstrap replicate trees:

raxml-ng --bootstrap --msa prim.phy --model GTR+G --prefix B1

Map bootstrap support values to the best ML trees:

raxml-ng --support --tree S1.raxml.bestTree --bs-trees B1.raxml.bootstraps --prefix B2

Open B2.raxml.support in the tree viewer of your choice.

4*. Repeat bootstrapping using a fixed number of replicates (100). Did this changed the support values?

Lazy "all-in-one" analysis

ML tree search, bootstrapping and branch support mapping can be performed with a single raxml-ng --all invocation:

raxml-ng --all --msa prim.phy --model GTR+G --prefix A1

However, this is not always possible/efficient for large datasets!

Exercise 3: Tree likelihood evaluation

The --evaluate command computes the likelihood of a fixed tree topology after re-optimizing all model parameters and branch lengths. We will use it to compare different evolutionary models:

raxml-ng --evaluate --msa prim.phy --tree S1.raxml.bestTree --model GTR+G --prefix E_GTRG

Repeat the above command for GTR+R4, GTR, JC and JC+G models. Don't forget to change the --prefix!
Check the log-likelihood scores:

grep "Final LogLikelihood:" E*.raxml.log

Which model yields the highest log-likelihood? Does it mean that this model should be preferred?

Parameter-rich models are more prone to overfitting the data. Thus, special criteria have been developed to compare models with different number of free parameters. In phylogenetics, AIC/AICc and BIC are most commonly used.

Check AIC/BIC scores for our runs (lower values are better):

grep "AIC score" E*.raxml.log

Which model should be preferred according to information theoretical criteria ?

Results:

grep "Final LogLikelihood:" E*.raxml.log

E_GTR.raxml.log:Final LogLikelihood: -5934.159081
E_GTRG.raxml.log:Final LogLikelihood: -5709.005399
E_GTRR.raxml.log:Final LogLikelihood: -5706.012286
E_JC.raxml.log:Final LogLikelihood: -6424.203377
E_JCG.raxml.log:Final LogLikelihood: -6272.469065

$ grep "AIC score" E*.raxml.log

E_GTR.raxml.log:AIC score: 11926.318162 / AICc score: 11928.322770 / BIC score: 12065.523094
E_GTRG.raxml.log:AIC score: 11478.010798 / AICc score: 11480.156127 / BIC score: 11622.015900
E_GTRR.raxml.log:AIC score: 11482.024572 / AICc score: 11484.948006 / BIC score: 11650.030524
E_JC.raxml.log:AIC score: 12890.406755 / AICc score: 12891.461549 / BIC score: 12991.210326
E_JCG.raxml.log:AIC score: 12588.938129 / AICc score: 12590.094701 / BIC score: 12694.541871

Exercise 4: Protein data and ModelTest-NG

Check online help

modeltest-ng --help

Important options are:

-i ALIGNMENT
-d DATATYPE, where DATATYPE is nt (default) or aa

Run model selection for prot21.fa (protein alignment):

modeltest-ng -i prot21.fa -d aa

Run tree inference with the best-scoring model determined by ModelTest-NG.

4*. Re-run ModelTest-NG including the LG4M and LG4X models which are not tested by default.

Results:

modeltest-ng -i prot21.fa -d aa

Partition 1/1:
                         Model         Score        Weight
----------------------------------------------------------
       BIC               LG+G4     6005.4554        0.5062
       AIC             LG+I+G4     5893.6825        0.7923
      AICc               LG+G4     5941.3599        0.5402

$ raxml-ng --msa prot21.fa --model LG+G4 --prefix S6

Final LogLikelihood: -2872.979205```

Exercise 5: Partitioned models

Partitioned model is defined in a file, e.g. --model prim2.part.

cat prim2.part

GTR+G+FO, NADH4=1-504/3,2-504/3
JC+I, tRNA=505-656
GTR+R4+FC, NADH5=657-898
HKY, NADH4p3=3-504/3

Re-run tree inference for prim.phy using partitioned model in prim2.part.
Compare the results (log-likelihood and tree topology) to the Exercise 1.

raxml-ng --msa prim.phy --model prim2.part --prefix P1

Likelihoods:

grep "Final LogLikelihood:" {S,P}1.raxml.log

S1.raxml.log:Final LogLikelihood: -5708.926872
P1.raxml.log:Final LogLikelihood: -5673.806570

RF distances:

cat S1.raxml.bestTree P1.raxml.bestTree > S1P1.trees 
raxml-ng --rfdist --tree S1P1.trees --prefix RF5

Reading input trees from file: S1P1.trees
Loaded 2 trees with 12 taxa.

Average absolute RF distance in this tree set: 0.000000
Average relative RF distance in this tree set: 0.000000
Number of unique topologies in this tree set: 1

Exercise 6 : RAxML-NG Web Server

Vital-IT/SIB kindly developed and maintain the RAxML-NG web server available here:

https://raxml-ng.vital-it.ch/

No registration or payment required. However, allocated computational resources are limited, so it is not a good option for large datasets.

If you have time, please feel free to play around with the web server (e.g., re-run Exercises 1 - 5).

Lab2: RAxML-NG Parallelization / Analyzing Large Datasets

Exercise 7: Alignment compression

Before analyzing a large dataset, it is highly recommended to convert alignment into RAxML binary format (RBA) using the --parse command. Doing this pre-processing step has two advantages:

binary alignments are faster to load than PHYLIP/FASTA
you will get estimated memory requirements and the recommended number of threads for this dataset

1, 1. Compress alignment file fusob.phy

raxml-ng --parse --msa fusob.phy --model GTR+G --prefix fusob

Partition 0: noname
Model: GTR+FO+G4m
Alignment sites / patterns: 1602 / 635
Gaps: 10.13 %
Invariant sites: 9.61 %

NOTE: Binary MSA file created: fusob.raxml.rba

* Estimated memory requirements                : 6 MB

* Recommended number of threads / MPI processes: 3

2*. Explore how resource estimates change depending on the selected --model (e.g., GTR, GTR+R8)

Exercise 8: Fine-grained parallelization

Use --threads option to re-run the same analysis with varying number of threads:

raxml-ng -search -msa fusob.raxml.rba -tree rand{10} -seed 1 -threads 1 -prefix T1
raxml-ng -search -msa fusob.raxml.rba -tree rand{10} -seed 1 -threads 2 -prefix T2

Compare runtimes. Which number of threads yields the lowest runtime? Which number of threads is “optimal”?

grep "Elapsed time:" T*.raxml.log

T1.raxml.log:Elapsed time: 68.191 seconds
T2.raxml.log:Elapsed time: 48.026 seconds

3*. Try to oversubscribe CPU cores by using 3 or 4 threads. What do you observe?

  parallelization: PTHREADS (3 threads), thread pinning: OFF

[00:00:00 -18709.360432] Initial branch length optimization
[00:00:30 -16006.630258] Model parameter optimization (eps = 10.000000)
[00:01:53 -14327.870042] AUTODETECT spr round 1 (radius: 5)

Exercise 9: Coarse-grained parallelization

1, Alternatively, we can parallelize across starting trees by running multiple RAxML-NG instances simultaneously:

for i in {1..2}; do (raxml-ng -msa fusob.raxml.rba -tree rand{5} -seed $i -threads 1 -prefix CT$i >CTlog$i &); done

This will start 2 RAxML-NG instances, each doing 5 independent tree searches on 1 CPU core. RAxML-NG will run in background, so use htop to monitor progress (look at per-core CPU load!).
Once all searches have finished, check the runtimes and compare them to fine-grained parallelization with 2 threads:

grep "Elapsed time:" CT*.raxml.log T*.raxml.log

CT1.raxml.log:Elapsed time: 40.091 seconds
CT2.raxml.log:Elapsed time: 37.830 seconds
T1.raxml.log:Elapsed time: 68.191 seconds
T2.raxml.log:Elapsed time: 48.026 seconds

Is it worth to use coarse-grained parallelization on this dataset?

Check the likelihood scores and pick the best tree. Does it differ topologically from the best-scoring tree we obtained in Exercise 2? Do the likelihood scores differ? Why?

grep "Final LogLikelihood" CT*.raxml.log | sort -k 3

CT1.raxml.log:Final LogLikelihood: -9974.663429
CT2.raxml.log:Final LogLikelihood: -9974.663779

Exercise 10: ParGenes

Check ParGenes options

python ~/software/ParGenes/pargenes/pargenes.py  --help

Analyze all alignments in the ~/software/ParGenes/examples/data/small/fasta_files/ folder () using default ParGenes settings

python ~/software/ParGenes/pargenes/pargenes.py  -a ~/software/ParGenes/examples/data/small/fasta_files/  -o parout -c 2 -m --scheduler openmp

2*. Run model testing and tree inference from 1 parsimony + 5 random starting trees for the prot21.fa alignment.

python ~/software/ParGenes/pargenes/pargenes.py  -a ~/workshop_materials/ng-tutorial/ --msa-filter msa_filter -o parout2 -c 2 -m --scheduler openmp -d aa -s 5 -p 1

Examine the results from both runs

Results 1:

########################
#    PARGENES v1.0.1   #
########################
ParGenes was called as follow:
/home/phylogenomics/software/ParGenes/pargenes/pargenes.py -a /home/phylogenomics/software/ParGenes/examples/data/small/fasta_files/ -o parout -c 2 -m --scheduler openmp

########################
#    PARGENES v1.0.1   #
########################
ParGenes was called as follow:
/home/phylogenomics/software/ParGenes/pargenes/pargenes.py -a /home/phylogenomics/software/ParGenes/examples/data/small/fasta_files/ -o parout -c 2 -m --scheduler openmp

[0:00:00] end of MSAs initializations
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../raxml-ng/bin/raxml-ng parout/parse_run/parse_command.txt parout/parse_run 0
Logs will be redirected to parout/parse_run/logs.txt
[Warning] 2/9 commands failed
  Average number of taxa: 9
  Max number of taxa: 22
  Average number of sites: 1711
  Max number of sites: 6489
  Recommended MAXIMUM number of cores: 1
[Warning] Found 2 invalid MSAs (see parout/invalid_msas.txt)
[0:00:00] end of parsing mpi-scheduler run
[0:00:00] end of anlysing parsing results
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../modeltest/bin/modeltest-ng parout/modeltest_run/modeltest_command.txt parout/modeltest_run 0
Logs will be redirected to parout/modeltest_run/logs.txt
[0:00:51] end of modeltest mpi-scheduler run
[0:00:51] end of parsing  modeltest results
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../raxml-ng/bin/raxml-ng parout/parse_run/parse_command.txt parout/parse_run 0
Logs will be redirected to parout/parse_run/logs.txt
[Warning] 2/9 commands failed
  Average number of taxa: 9
  Max number of taxa: 22
  Average number of sites: 1711
  Max number of sites: 6489
  Recommended MAXIMUM number of cores: 1
[Warning] Found 2 invalid MSAs (see parout/invalid_msas.txt)
[0:00:51] end of the second parsing step
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../raxml-ng/bin/raxml-ng parout/mlsearch_run/mlsearch_command.txt parout/mlsearch_run 0
Logs will be redirected to parout/mlsearch_run/logs.txt
[0:00:54] end of mlsearch mpi-scheduler run
[Warning] Total number of jobs that failed: 3
[Warning] For a detailed list, see parout/failed_commands.txt
[0:00:54] END OF THE RUN OF pargenes.py

Results 2:

########################
#    PARGENES v1.0.1   #
########################
ParGenes was called as follow:
/home/phylogenomics/software/ParGenes/pargenes/pargenes.py -a /home/phylogenomics/workshop_materials/ng-tutorial/ --msa-filter msa_filter -o parout2 -c 2 -m --scheduler openmp -d aa -s 5 -p 1

########################
#    PARGENES v1.0.1   #
########################
ParGenes was called as follow:
/home/phylogenomics/software/ParGenes/pargenes/pargenes.py -a /home/phylogenomics/workshop_materials/ng-tutorial/ --msa-filter msa_filter -o parout2 -c 2 -m --scheduler openmp -d aa -s 5 -p 1

[0:00:00] end of MSAs initializations
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../raxml-ng/bin/raxml-ng parout2/parse_run/parse_command.txt parout2/parse_run 0
Logs will be redirected to parout2/parse_run/logs.txt
  Average number of taxa: 21
  Max number of taxa: 21
  Average number of sites: 111
  Max number of sites: 111
  Recommended MAXIMUM number of cores: 1
[0:00:00] end of parsing mpi-scheduler run
[0:00:00] end of anlysing parsing results
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../modeltest/bin/modeltest-ng parout2/modeltest_run/modeltest_command.txt parout2/modeltest_run 0
Logs will be redirected to parout2/modeltest_run/logs.txt
[0:00:10] end of modeltest mpi-scheduler run
[0:00:10] end of parsing  modeltest results
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../raxml-ng/bin/raxml-ng parout2/parse_run/parse_command.txt parout2/parse_run 0
Logs will be redirected to parout2/parse_run/logs.txt
  Average number of taxa: 21
  Max number of taxa: 21
  Average number of sites: 111
  Max number of sites: 111
  Recommended MAXIMUM number of cores: 3
[0:00:11] end of the second parsing step
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../raxml-ng/bin/raxml-ng parout2/mlsearch_run/mlsearch_command.txt parout2/mlsearch_run 0
Logs will be redirected to parout2/mlsearch_run/logs.txt
[0:00:22] end of mlsearch mpi-scheduler run
[0:00:23] end of selecting the best ML tree
[0:00:23] END OF THE RUN OF pargenes.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly