# Table 1: Regressive vs Progressive MSA

This notebook contains all the code to generate Table 1 from the publication.

### _Input Data_
Our input data consists of 94 datasets from the HOMFAM benchmark. Each of these datasets is a protein family, with the number of sequences ranging from 88 for the smallest family, *seatoxin*, to 93,675 for the largest family, *rvp*.

In [6]:
cat ../data/seqs/seatoxin.fa | grep '>'| wc -l

      88


In [175]:
cat ../data/seqs/rvp.fa | grep '>'| wc -l

93675
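
The same count can be reproduced from Python, which is convenient inside the notebook. This is a minimal sketch: the helper name `count_fasta_records` is an assumption made here, and it counts only lines that start with `>`, whereas `grep '>'` matches the character anywhere on a line.

```python
import io

def count_fasta_records(handle):
    """Count FASTA header lines (lines starting with '>') in an open text stream."""
    return sum(1 for line in handle if line.startswith(">"))

# Tiny in-memory example; for a real file, pass open("../data/seqs/seatoxin.fa")
example = io.StringIO(">seq1\nACAC\n>seq2\nPCAC\n")
print(count_fasta_records(example))
```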


For each dataset, there is also a reference set of sequences.
These references are from the PDB and have been structurally aligned.

In [10]:
cat ../data/refs/seatoxin.ref

>1apf
GVPCLCDSDGPRPRGNTLSGILWFYPSGCPS--GWH-NCKAHGPNIGWCCKK--
>1ahl
GVSCLCDSDGPSVRGNTLSGTLWLYPSGCPS--GWH-NCKAHGPTIGWCCKQ--
>1atx
GAACLCKSDGPNTRGNSMSGTIWVF--GCPS--GWN-NCEGRA-IIGYCCKQ--
>1sh1
-AACKCDDEGPDIRTAPLTGTVDLG--SCNA--GWE-KCASYYTIIADCCRKKK
>1bds
AAPCFCSGKP-------GRGDLWILRGTCPGGYGYTSNCYK--WPNICCYPH--


Each sequence file has been combined with the unaligned reference sequences to create a combined sequence set.

In [16]:
!head ../data/combined_seqs/seatoxin.fa

>B1NWT1_NEMVE/41-81
ACACDSPGIRSASLSGIVWVGSCPSGWKKCKSYYSVVADCC
>B1NWR7_NEMVE/41-83
PCACDSDGPDIRSASLSGIVWMGSCPSGWKKCKSYYSIVADCC
>TXCN2_BUNCN/3-46
ACRCDSDGPTVRGDSLSGTLWLTGGCPSGWHNCRGSGPFIGYCC
>TX5_ANTXA/3-45
SCLCDSDGPSVSGNTLSGIIWLAGCPSGWHNCKAHGPNIGWCC
>TXH7_ANTS7/3-45
PCLCDSDGPSVHGNTLSGTIWLAGCPSGWHNCKAHGPTIGWCC
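
One way such a combined set could be built is to strip the gap characters from the aligned references and add them to the family sequences. The sketch below is an assumption about that step, not the script actually used: the helper names are hypothetical, and appending the references after the family sequences (skipping headers already present) is a choice made here for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def combine_with_references(seqs_text, aligned_refs_text):
    """Append gap-stripped reference sequences to the family sequences."""
    combined = read_fasta(seqs_text)
    seen = {header for header, _ in combined}
    for header, aligned in read_fasta(aligned_refs_text):
        if header not in seen:
            # Remove alignment gaps so the reference becomes an unaligned sequence
            combined.append((header, aligned.replace("-", "")))
    return "".join(f">{h}\n{s}\n" for h, s in combined)

seqs = ">B1NWT1_NEMVE/41-81\nACACDSPGIRSASLSGIVWVGSCPSGWKKCKSYYSVVADCC\n"
refs = ">1apf\nGVPCLCDSDGPRPRGNTLSGILWFYPSGCPS--GWH-NCKAHGPNIGWCCKK--\n"
print(combine_with_references(seqs, refs))
```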


### _Command lines used in workflow to generate guide trees_

#### Clustal Omega Trees (mBed)
```
clustalo -i ${seqs} --guidetree-out ${id}.${clustalo}.dnd
```

#### MAFFT PartTree trees
```
t_coffee -other_pg seq_reformat                 \
          -in ${seqs} -action +seq2dnd parttree \
          -output newick                        \
          >> ${id}.MAFFT_PARTTREE.dnd
```               

### _Command lines used in workflow to generate alignments_

#### Clustal Omega progressive alignments
```
clustalo -i ${seqs} --guidetree-out ${id}.${clustalo}.dnd
```

#### MAFFT-FFTNS1 progressive alignments (3 commands)
```
t_coffee -other_pg seq_reformat \
            -in ${guide_tree} -input newick \
            -in2 ${seqs} -input2 fasta_seq  \
            -action +newick2mafftnewick     \
            >> ${id}.mafftnewick

newick2mafft.rb 1.0 ${id}.mafftnewick > ${id}.mafftbinary

mafft --retree 1 --anysymbol              \
       --treein ${id}.mafftbinary ${seqs} \
       > ${id}.std.MAFFT-FFTNS1.with.${tree_method}.tree.aln
```
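
To make the data flow between the three commands explicit, they can be written out as argument lists with the file each one writes. This is only a sketch: the function name `mafft_progressive_steps` is hypothetical, nothing is executed, and the first step's `>>` append redirection is represented as a plain output file.

```python
def mafft_progressive_steps(fam_id, guide_tree, seqs, tree_method):
    """Return the three steps as (argv, stdout_path) pairs; nothing is run."""
    mafft_newick = f"{fam_id}.mafftnewick"
    mafft_binary = f"{fam_id}.mafftbinary"
    aln = f"{fam_id}.std.MAFFT-FFTNS1.with.{tree_method}.tree.aln"
    return [
        # 1. Convert the Newick guide tree into MAFFT-style Newick
        (["t_coffee", "-other_pg", "seq_reformat",
          "-in", guide_tree, "-input", "newick",
          "-in2", seqs, "-input2", "fasta_seq",
          "-action", "+newick2mafftnewick"], mafft_newick),
        # 2. Convert that tree into the format read by mafft --treein
        (["newick2mafft.rb", "1.0", mafft_newick], mafft_binary),
        # 3. Align the sequences along the supplied tree
        (["mafft", "--retree", "1", "--anysymbol",
          "--treein", mafft_binary, seqs], aln),
    ]

# File names below are placeholders for illustration
for argv, out in mafft_progressive_steps("seatoxin", "seatoxin.CLUSTALO.dnd",
                                         "seatoxin.fa", "CLUSTALO"):
    print(" ".join(argv), ">", out)
```

Each step consumes the file written by the previous one, which is why the three commands have to run in this order.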

#### Clustal Omega regressive (dpa) alignments
```
t_coffee -dpa -dpa_method clustalo_msa \
         -dpa_tree ${guide_tree}       \
         -seq ${seqs}                  \
         -dpa_nseq ${bucket_size}      \
         -outfile ${id}.dpa_${bucket_size}.CLUSTALO.with.${tree_method}.tree.aln
```

#### MAFFT-FFTNS1 regressive (dpa) alignments
```
t_coffee -dpa -dpa_method mafftfftns1_msa \
         -dpa_tree ${guide_tree} \
         -seq ${seqs} \
         -dpa_nseq ${bucket_size} \
         -outfile ${id}.dpa_${bucket_size}.MAFFT-FFTNS1.with.${tree_method}.tree.aln
```

### Workflow

The main workflow for generating the alignments is written in Nextflow (http://nextflow.io).


### Part One: Progressive vs Regressive Alignments
The first table compares __Progressive vs Regressive__ alignment procedures using two of the most common large-scale tree-building and alignment procedures:
* Clustal Omega with Clustal Omega trees
* Clustal Omega with MAFFT PartTree trees
* MAFFT FFT-NS-1 with MAFFT PartTree trees
* MAFFT FFT-NS-1 with Clustal Omega trees
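
The four aligner/tree pairings above, each run in progressive (`std_align`) and regressive (`dpa_align`) mode, can be enumerated as a quick sanity check on how many alignments to expect per family. This is an illustrative sketch; the variable names are not taken from the workflow.

```python
from itertools import product

align_methods = ["CLUSTALO", "MAFFT-FFTNS1"]
tree_methods = ["CLUSTALO", "MAFFT_PARTTREE"]
modes = ["std_align", "dpa_align"]  # progressive and regressive

# Every aligner is paired with every tree method in both modes
combos = list(product(align_methods, tree_methods, modes))
for aligner, tree, mode in combos:
    print(aligner, tree, mode)
print(len(combos), "alignments per family")
```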

The workflow is run with the following command:

In [176]:
!~/bin/nextflow run ../main.nf \
                    --align_method="CLUSTALO,MAFFT-FFTNS1" \
                    --tree_method="CLUSTALO,MAFFT_PARTTREE" \
                    --refs='../data/refs/*.ref' \
                    --seqs='../data/combined_seqs/*.fa' \
                    --dpa_align \
                    --std_align \
                    --default_align=false \
                    --output results/publication_table_1a \
                    -with-docker \
                    -resume

N E X T F L O W  ~  version 0.31.0
Launching `../main.nf` [ecstatic_faggin] - revision: 23d245b53b
R E G R E S S I V E   M S A   A n a l y s i s  ~  version 0.1"
Input sequences (FASTA)                        : ../data/combined_seqs/*.fa
Input references (Aligned FASTA)               : ../data/refs/*.ref
Input trees (NEWICK)                           : false
Output directory (DIRECTORY)                   : results/publication_table_1a
Alignment methods                              : CLUSTALO,MAFFT-FFTNS1
Tree methods                                   : CLUSTALO,MAFFT_PARTTREE
Generate default alignments                    : false
Generate standard alignments                   : true
Generate regressive alignments (DPA)           : true
Bucket Sizes for regressive alignments         : 1000
Perform evaluation? Requires reference         : true
Output directory (DIRECTORY)                   : results/publication_table_1a

[warm up] executor > local
[20/d12238] Submitted process > guide_tr

### Run the same workflow on the reference data to generate the baseline scores
Here the workflow aligns only the reference PDB sequences.

In [1]:
!~/bin/nextflow run ../main.nf \
                    --align_method="CLUSTALO,MAFFT-FFTNS1" \
                    --tree_method="CLUSTALO,MAFFT_PARTTREE" \
                    --refs='../data/refs/*.ref' \
                    --seqs='../data/refs_fasta/*.ref' \
                    --dpa_align \
                    --std_align \
                    --default_align=false \
                    --output results/publication_table_1a/reference_data \
                    -with-docker \
                    -resume

N E X T F L O W  ~  version 0.31.0
Launching `../main.nf` [happy_lamport] - revision: 23d245b53b
R E G R E S S I V E   M S A   A n a l y s i s  ~  version 0.1"
Input sequences (FASTA)                        : ../data/refs_fasta/*.ref
Input references (Aligned FASTA)               : ../data/refs/*.ref
Input trees (NEWICK)                           : false
Output directory (DIRECTORY)                   : results/publication_table_1a/reference_data
Alignment methods                              : CLUSTALO,MAFFT-FFTNS1
Tree methods                                   : CLUSTALO,MAFFT_PARTTREE
Generate default alignments                    : false
Generate standard alignments                   : true
Generate regressive alignments (DPA)           : true
Bucket Sizes for regressive alignments         : 1000
Perform evaluation? Requires reference         : true
Output directory (DIRECTORY)                   : results/publication_table_1a/reference_data

[warm up] executor > local
[b0/c510c2] Su

[fa/dfaba5] Submitted process > guide_trees (DMRL_synthase.MAFFT_PARTTREE)
[45/b8921f] Submitted process > guide_trees (HMG_box.MAFFT_PARTTREE)
[9e/1346df] Submitted process > guide_trees (ghf13.CLUSTALO)
[62/058fea] Submitted process > guide_trees (myb_DNA-binding.CLUSTALO)
[0d/492889] Submitted process > guide_trees (DMRL_synthase.CLUSTALO)
[4e/078129] Submitted process > guide_trees (egf.MAFFT_PARTTREE)
[22/ed936e] Submitted process > guide_trees (egf.CLUSTALO)
[92/2481ed] Submitted process > guide_trees (uce.CLUSTALO)
[4a/8d07de] Submitted process > guide_trees (uce.MAFFT_PARTTREE)
[3a/2db1ad] Submitted process > guide_trees (ltn.MAFFT_PARTTREE)
[1c/fa8c6a] Submitted process > guide_trees (ltn.CLUSTALO)
[dc/84c6b4] Submitted process > guide_trees (GEL.CLUSTALO)
[39/f3f621] Submitted process > guide_trees (scorptoxin.MAFFT_PARTTREE)
[61/491873] Submitted process > guide_trees (GEL.MAFFT_PARTTREE)
[f3/032ed6] Submitted process > guide_trees (scorptoxin.CLUSTALO)
[54/0052db] Submitted

[e2/96e154] Submitted process > regressive_alignment (kringle.CLUSTALO.DPA.1000.CLUSTALO)
[cb/5f44e5] Submitted process > standard_alignment (kringle.MAFFT-FFTNS1.STD.NA.CLUSTALO)
[ca/87e93a] Submitted process > regressive_alignment (kringle.MAFFT-FFTNS1.DPA.1000.CLUSTALO)
[93/3ba8b6] Submitted process > regressive_alignment (cah.CLUSTALO.DPA.1000.MAFFT_PARTTREE)
[7e/6da6f8] Submitted process > standard_alignment (cah.CLUSTALO.STD.NA.MAFFT_PARTTREE)
[1b/d35411] Submitted process > standard_alignment (cah.MAFFT-FFTNS1.STD.NA.MAFFT_PARTTREE)
[0f/e0ac89] Submitted process > regressive_alignment (cah.MAFFT-FFTNS1.DPA.1000.MAFFT_PARTTREE)
[5c/ed810e] Submitted process > standard_alignment (LIM.MAFFT-FFTNS1.STD.NA.MAFFT_PARTTREE)
[73/81fba6] Submitted process > standard_alignment (LIM.CLUSTALO.STD.NA.MAFFT_PARTTREE)
[93/a1fdfa] Submitted process > regressive_alignment (LIM.CLUSTALO.DPA.1000.MAFFT_PARTTREE)
[78/3dc968] Submitted process > regressive_alignment (LIM.MAFFT-FFTNS1.DPA.1000.MAFFT_

[d5/91f766] Submitted process > standard_alignment (ldh.MAFFT-FFTNS1.STD.NA.CLUSTALO)
[15/cd6a98] Submitted process > regressive_alignment (ldh.CLUSTALO.DPA.1000.CLUSTALO)
[49/de50b1] Submitted process > standard_alignment (profilin.CLUSTALO.STD.NA.CLUSTALO)
[80/73cb58] Submitted process > regressive_alignment (profilin.CLUSTALO.DPA.1000.CLUSTALO)
[df/36b971] Submitted process > regressive_alignment (profilin.MAFFT-FFTNS1.DPA.1000.CLUSTALO)
[1d/222681] Submitted process > standard_alignment (profilin.MAFFT-FFTNS1.STD.NA.CLUSTALO)
[59/cc1782] Submitted process > regressive_alignment (cryst.CLUSTALO.DPA.1000.CLUSTALO)
[41/a34192] Submitted process > standard_alignment (cryst.MAFFT-FFTNS1.STD.NA.CLUSTALO)
[6b/d51ce2] Submitted process > regressive_alignment (cryst.MAFFT-FFTNS1.DPA.1000.CLUSTALO)
[7f/3c14cf] Submitted process > standard_alignment (cryst.CLUSTALO.STD.NA.CLUSTALO)
[ea/2f2ac0] Submitted process > standard_alignment (sodfe.CLUSTALO.STD.NA.MAFFT_PARTTREE)
[f8/1b68c2] Submitted 

[19/14694b] Submitted process > standard_alignment (aadh.MAFFT-FFTNS1.STD.NA.CLUSTALO)
[2e/b8fd56] Submitted process > regressive_alignment (aadh.CLUSTALO.DPA.1000.CLUSTALO)
[0f/5f9102] Submitted process > standard_alignment (aadh.CLUSTALO.STD.NA.CLUSTALO)
[7e/c5c024] Submitted process > regressive_alignment (aadh.MAFFT-FFTNS1.DPA.1000.CLUSTALO)
[42/d59971] Submitted process > regressive_alignment (ghf22.CLUSTALO.DPA.1000.MAFFT_PARTTREE)
[35/ba7821] Submitted process > regressive_alignment (ghf22.MAFFT-FFTNS1.DPA.1000.MAFFT_PARTTREE)
[38/831a15] Submitted process > standard_alignment (ghf22.MAFFT-FFTNS1.STD.NA.MAFFT_PARTTREE)
[c5/e8fa8a] Submitted process > standard_alignment (ghf22.CLUSTALO.STD.NA.MAFFT_PARTTREE)
[7f/3eaf84] Submitted process > standard_alignment (zf-CCHH.CLUSTALO.STD.NA.CLUSTALO)
[af/a5ee65] Submitted process > regressive_alignment (zf-CCHH.CLUSTALO.DPA.1000.CLUSTALO)
[f7/cb1fbb] Submitted process > regressive_alignment (zf-CCHH.MAFFT-FFTNS1.DPA.1000.CLUSTALO)
[5c/2f

[f5/df1935] Submitted process > standard_alignment (zf-CCHH.MAFFT-FFTNS1.STD.NA.MAFFT_PARTTREE)
[e2/3ea29a] Submitted process > standard_alignment (PDZ.MAFFT-FFTNS1.STD.NA.MAFFT_PARTTREE)
[90/3b8cfb] Submitted process > regressive_alignment (PDZ.MAFFT-FFTNS1.DPA.1000.MAFFT_PARTTREE)
[12/9c96a9] Submitted process > standard_alignment (PDZ.CLUSTALO.STD.NA.MAFFT_PARTTREE)
[e1/d22ca7] Submitted process > regressive_alignment (PDZ.CLUSTALO.DPA.1000.MAFFT_PARTTREE)
[0c/227891] Submitted process > standard_alignment (rhv.MAFFT-FFTNS1.STD.NA.CLUSTALO)
[e6/ba16ed] Submitted process > standard_alignment (rhv.CLUSTALO.STD.NA.CLUSTALO)
[10/e1ff90] Submitted process > regressive_alignment (rhv.CLUSTALO.DPA.1000.CLUSTALO)
[29/d89b70] Submitted process > regressive_alignment (rhv.MAFFT-FFTNS1.DPA.1000.CLUSTALO)
[06/d72c87] Submitted process > standard_alignment (ghf5.CLUSTALO.STD.NA.CLUSTALO)
[2a/3a17e5] Submitted process > regressive_alignment (ghf5.CLUSTALO.DPA.1000.CLUSTALO)
[c1/bfd1f0] Submitted 

[1c/67a9e2] Submitted process > regressive_alignment (seatoxin.CLUSTALO.DPA.1000.CLUSTALO)
[f3/dc91e1] Submitted process > regressive_alignment (seatoxin.MAFFT-FFTNS1.DPA.1000.CLUSTALO)
[fa/6c985f] Submitted process > standard_alignment (ghf10.CLUSTALO.STD.NA.CLUSTALO)
[c9/74246e] Submitted process > standard_alignment (ghf10.MAFFT-FFTNS1.STD.NA.CLUSTALO)
[f3/419771] Submitted process > regressive_alignment (ghf10.CLUSTALO.DPA.1000.CLUSTALO)
[37/c2406c] Submitted process > regressive_alignment (ghf10.MAFFT-FFTNS1.DPA.1000.CLUSTALO)
[b1/7bf97b] Submitted process > regressive_alignment (phoslip.MAFFT-FFTNS1.DPA.1000.MAFFT_PARTTREE)
[45/fca483] Submitted process > standard_alignment (phoslip.MAFFT-FFTNS1.STD.NA.MAFFT_PARTTREE)
[50/560749] Submitted process > standard_alignment (phoslip.CLUSTALO.STD.NA.MAFFT_PARTTREE)
[d7/576ff0] Submitted process > regressive_alignment (phoslip.CLUSTALO.DPA.1000.MAFFT_PARTTREE)
[37/2b71ec] Submitted process > standard_alignment (sti.CLUSTALO.STD.NA.CLUSTA

[c3/59c698] Submitted process > regressive_alignment (uce.MAFFT-FFTNS1.DPA.1000.MAFFT_PARTTREE)
[f3/a3d641] Submitted process > standard_alignment (GEL.MAFFT-FFTNS1.STD.NA.CLUSTALO)
[a1/98c3d4] Submitted process > standard_alignment (GEL.CLUSTALO.STD.NA.CLUSTALO)
[e1/a93173] Submitted process > regressive_alignment (GEL.CLUSTALO.DPA.1000.CLUSTALO)
[8b/987068] Submitted process > regressive_alignment (GEL.MAFFT-FFTNS1.DPA.1000.CLUSTALO)
[5c/962f40] Submitted process > regressive_alignment (ltn.CLUSTALO.DPA.1000.CLUSTALO)
[74/3c6ea4] Submitted process > standard_alignment (ltn.MAFFT-FFTNS1.STD.NA.CLUSTALO)
[b5/c9352a] Submitted process > standard_alignment (ltn.CLUSTALO.STD.NA.CLUSTALO)
[cf/753d00] Submitted process > regressive_alignment (ltn.MAFFT-FFTNS1.DPA.1000.CLUSTALO)
[e9/9767fc] Submitted process > standard_alignment (GEL.CLUSTALO.STD.NA.MAFFT_PARTTREE)
[6a/653daf] Submitted process > regressive_alignment (GEL.CLUSTALO.DPA.1000.MAFFT_PARTTREE)
[29/ad3471] Submitted process > regr

[cc/d87d32] Submitted process > standard_alignment (hpr.CLUSTALO.STD.NA.CLUSTALO)
[61/2f6a5f] Submitted process > standard_alignment (ghf1.CLUSTALO.STD.NA.CLUSTALO)
[13/b949fb] Submitted process > regressive_alignment (ghf1.CLUSTALO.DPA.1000.CLUSTALO)
[50/f40ac5] Submitted process > standard_alignment (ghf1.MAFFT-FFTNS1.STD.NA.CLUSTALO)
[54/570bac] Submitted process > regressive_alignment (ghf1.MAFFT-FFTNS1.DPA.1000.CLUSTALO)
[98/44e159] Submitted process > standard_alignment (sodcu.MAFFT-FFTNS1.STD.NA.CLUSTALO)
[f3/4e026b] Submitted process > regressive_alignment (sodcu.CLUSTALO.DPA.1000.CLUSTALO)
[7e/dda9d2] Submitted process > standard_alignment (sodcu.CLUSTALO.STD.NA.CLUSTALO)
[b2/f6f2b5] Submitted process > regressive_alignment (sodcu.MAFFT-FFTNS1.DPA.1000.CLUSTALO)
[75/61387d] Submitted process > regressive_alignment (hormone_rec.MAFFT-FFTNS1.DPA.1000.MAFFT_PARTTREE)
[00/73ea09] Submitted process > standard_alignment (hormone_rec.CLUSTALO.STD.NA.MAFFT_PARTTREE)
[8e/e8404e] Submit

[22/7856eb] Submitted process > standard_alignment (gpdh.CLUSTALO.STD.NA.MAFFT_PARTTREE)
[97/f7cadb] Submitted process > regressive_alignment (gpdh.CLUSTALO.DPA.1000.MAFFT_PARTTREE)
[ae/ffe743] Submitted process > regressive_alignment (kunitz.MAFFT-FFTNS1.DPA.1000.CLUSTALO)
[4b/278c08] Submitted process > regressive_alignment (kunitz.CLUSTALO.DPA.1000.CLUSTALO)
[b7/9984e1] Submitted process > standard_alignment (kunitz.MAFFT-FFTNS1.STD.NA.CLUSTALO)
[1a/b52e5d] Submitted process > standard_alignment (kunitz.CLUSTALO.STD.NA.CLUSTALO)
[0b/861bed] Submitted process > standard_alignment (kunitz.MAFFT-FFTNS1.STD.NA.MAFFT_PARTTREE)
[4a/94a455] Submitted process > regressive_alignment (kunitz.MAFFT-FFTNS1.DPA.1000.MAFFT_PARTTREE)
[c9/ec2865] Submitted process > standard_alignment (kunitz.CLUSTALO.STD.NA.MAFFT_PARTTREE)
[ba/103f64] Submitted process > regressive_alignment (kunitz.CLUSTALO.DPA.1000.MAFFT_PARTTREE)
[e1/7fba97] Submitted process > standard_alignment (aat.MAFFT-FFTNS1.STD.NA.CLUSTA

[9e/cca07b] Submitted process > evaluate (glob.CLUSTALO.MAFFT_PARTTREE.dpa_align.1000)
[07/65f73d] Submitted process > evaluate (glob.MAFFT-FFTNS1.MAFFT_PARTTREE.dpa_align.1000)
[27/064ded] Submitted process > evaluate (ace.CLUSTALO.CLUSTALO.std_align.NA)
[2a/dbf2aa] Submitted process > evaluate (glob.MAFFT-FFTNS1.MAFFT_PARTTREE.std_align.NA)
[1c/60e922] Submitted process > evaluate (ace.CLUSTALO.CLUSTALO.dpa_align.1000)
[45/a26d69] Submitted process > evaluate (ace.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[86/fa2b0b] Submitted process > evaluate (ace.MAFFT-FFTNS1.CLUSTALO.dpa_align.1000)
[ac/03fa00] Submitted process > evaluate (glob.CLUSTALO.CLUSTALO.dpa_align.1000)
[f4/e11191] Submitted process > evaluate (glob.CLUSTALO.CLUSTALO.std_align.NA)
[a8/d1f5ef] Submitted process > evaluate (glob.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[f8/aa35d3] Submitted process > evaluate (glob.MAFFT-FFTNS1.CLUSTALO.dpa_align.1000)
[33/502234] Submitted process > evaluate (rub.MAFFT-FFTNS1.CLUSTALO.std_align.NA)

[7b/318489] Submitted process > evaluate (lyase_1.CLUSTALO.CLUSTALO.dpa_align.1000)
[09/4ea2d1] Submitted process > evaluate (lyase_1.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[91/052c9b] Submitted process > evaluate (Ald_Xan_dh_2.CLUSTALO.CLUSTALO.std_align.NA)
[2e/26d0c4] Submitted process > evaluate (lyase_1.MAFFT-FFTNS1.CLUSTALO.dpa_align.1000)
[b7/1ad03f] Submitted process > evaluate (Ald_Xan_dh_2.MAFFT-FFTNS1.CLUSTALO.dpa_align.1000)
[6a/2c72ec] Submitted process > evaluate (Ald_Xan_dh_2.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[9c/22b6b4] Submitted process > evaluate (Ald_Xan_dh_2.CLUSTALO.CLUSTALO.dpa_align.1000)
[bc/55fd88] Submitted process > evaluate (ldh.MAFFT-FFTNS1.MAFFT_PARTTREE.std_align.NA)
[1f/46d4d9] Submitted process > evaluate (ldh.CLUSTALO.MAFFT_PARTTREE.std_align.NA)
[c8/da8c6b] Submitted process > evaluate (peroxidase.CLUSTALO.CLUSTALO.std_align.NA)
[2d/cbf016] Submitted process > evaluate (ldh.MAFFT-FFTNS1.MAFFT_PARTTREE.dpa_align.1000)
[82/dd68ac] Submitted process > eva

[97/484046] Submitted process > evaluate (flav.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[16/155c7d] Submitted process > evaluate (OTCace.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[91/464b87] Submitted process > evaluate (OTCace.MAFFT-FFTNS1.CLUSTALO.dpa_align.1000)
[f3/4212ac] Submitted process > evaluate (OTCace.CLUSTALO.CLUSTALO.dpa_align.1000)
[a9/46e388] Submitted process > evaluate (flav.CLUSTALO.MAFFT_PARTTREE.std_align.NA)
[57/0ad354] Submitted process > evaluate (flav.CLUSTALO.MAFFT_PARTTREE.dpa_align.1000)
[39/2a52d4] Submitted process > evaluate (flav.MAFFT-FFTNS1.MAFFT_PARTTREE.std_align.NA)
[a9/bd6799] Submitted process > evaluate (flav.MAFFT-FFTNS1.MAFFT_PARTTREE.dpa_align.1000)
[e5/c5e6b3] Submitted process > evaluate (cytb.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[e7/01dd13] Submitted process > evaluate (cytb.CLUSTALO.CLUSTALO.dpa_align.1000)
[6e/8fbe49] Submitted process > evaluate (cytb.CLUSTALO.CLUSTALO.std_align.NA)
[1d/6a7635] Submitted process > evaluate (cytb.MAFFT-FFTNS1.CLUSTAL

[a4/733667] Submitted process > evaluate (phoslip.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[80/2d4feb] Submitted process > evaluate (phoslip.CLUSTALO.CLUSTALO.std_align.NA)
[fb/1554a9] Submitted process > evaluate (phoslip.MAFFT-FFTNS1.CLUSTALO.dpa_align.1000)
[1a/c42e35] Submitted process > evaluate (phoslip.CLUSTALO.CLUSTALO.dpa_align.1000)
[9d/a1c51c] Submitted process > evaluate (proteasome.MAFFT-FFTNS1.MAFFT_PARTTREE.std_align.NA)
[ba/ba17d9] Submitted process > evaluate (proteasome.CLUSTALO.MAFFT_PARTTREE.dpa_align.1000)
[bc/1f2f49] Submitted process > evaluate (proteasome.MAFFT-FFTNS1.MAFFT_PARTTREE.dpa_align.1000)
[3a/91cce2] Submitted process > evaluate (PDZ.CLUSTALO.CLUSTALO.std_align.NA)
[f2/4f4e08] Submitted process > evaluate (proteasome.CLUSTALO.MAFFT_PARTTREE.std_align.NA)
[1c/35ec5c] Submitted process > evaluate (PDZ.CLUSTALO.CLUSTALO.dpa_align.1000)
[4f/3362ce] Submitted process > evaluate (PDZ.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[df/ab7db5] Submitted process > evaluate (zf

[ef/0e3ccb] Submitted process > evaluate (rrm.CLUSTALO.CLUSTALO.dpa_align.1000)
[fa/157d52] Submitted process > evaluate (rrm.MAFFT-FFTNS1.CLUSTALO.dpa_align.1000)
[97/5994f4] Submitted process > evaluate (rrm.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[bc/494966] Submitted process > evaluate (rrm.CLUSTALO.CLUSTALO.std_align.NA)
[b6/3a520f] Submitted process > evaluate (rrm.CLUSTALO.MAFFT_PARTTREE.dpa_align.1000)
[df/aebdd7] Submitted process > evaluate (rrm.MAFFT-FFTNS1.MAFFT_PARTTREE.dpa_align.1000)
[e3/c39ce5] Submitted process > evaluate (rrm.CLUSTALO.MAFFT_PARTTREE.std_align.NA)
[f1/fb598b] Submitted process > evaluate (seatoxin.CLUSTALO.CLUSTALO.std_align.NA)
[ef/e1f734] Submitted process > evaluate (rrm.MAFFT-FFTNS1.MAFFT_PARTTREE.std_align.NA)
[b2/1a51e6] Submitted process > evaluate (seatoxin.CLUSTALO.CLUSTALO.dpa_align.1000)
[f6/3089a1] Submitted process > evaluate (seatoxin.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[a6/21f728] Submitted process > evaluate (seatoxin.MAFFT-FFTNS1.CLUSTALO.

[72/e56fab] Submitted process > evaluate (ltn.CLUSTALO.MAFFT_PARTTREE.dpa_align.1000)
[ac/be9a4c] Submitted process > evaluate (uce.CLUSTALO.MAFFT_PARTTREE.dpa_align.1000)
[4f/b6d42e] Submitted process > evaluate (uce.CLUSTALO.MAFFT_PARTTREE.std_align.NA)
[7e/731f98] Submitted process > evaluate (uce.MAFFT-FFTNS1.MAFFT_PARTTREE.std_align.NA)
[05/1db542] Submitted process > evaluate (ltn.MAFFT-FFTNS1.MAFFT_PARTTREE.std_align.NA)
[66/61b708] Submitted process > evaluate (uce.MAFFT-FFTNS1.MAFFT_PARTTREE.dpa_align.1000)
[a0/326468] Submitted process > evaluate (GEL.CLUSTALO.CLUSTALO.std_align.NA)
[c0/00a391] Submitted process > evaluate (GEL.CLUSTALO.CLUSTALO.dpa_align.1000)
[f6/39fdc9] Submitted process > evaluate (GEL.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[4d/cffbc8] Submitted process > evaluate (ltn.CLUSTALO.CLUSTALO.dpa_align.1000)
[a2/df3ebd] Submitted process > evaluate (ltn.CLUSTALO.CLUSTALO.std_align.NA)
[57/b6f40b] Submitted process > evaluate (ltn.MAFFT-FFTNS1.CLUSTALO.std_align.NA

[5e/62369a] Submitted process > evaluate (hpr.CLUSTALO.CLUSTALO.std_align.NA)
[e9/2a12fb] Submitted process > evaluate (ghf1.CLUSTALO.CLUSTALO.std_align.NA)
[3d/59e66e] Submitted process > evaluate (ghf1.CLUSTALO.CLUSTALO.dpa_align.1000)
[7e/2658b0] Submitted process > evaluate (ghf1.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[11/0936d5] Submitted process > evaluate (ghf1.MAFFT-FFTNS1.CLUSTALO.dpa_align.1000)
[3e/8c926b] Submitted process > evaluate (sodcu.CLUSTALO.CLUSTALO.dpa_align.1000)
[df/f0964d] Submitted process > evaluate (sodcu.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[70/f1bf8d] Submitted process > evaluate (sodcu.CLUSTALO.CLUSTALO.std_align.NA)
[48/187ca6] Submitted process > evaluate (sodcu.MAFFT-FFTNS1.CLUSTALO.dpa_align.1000)
[88/be0cf2] Submitted process > evaluate (hormone_rec.CLUSTALO.MAFFT_PARTTREE.std_align.NA)
[22/7b5c3c] Submitted process > evaluate (hormone_rec.MAFFT-FFTNS1.MAFFT_PARTTREE.dpa_align.1000)
[59/01874b] Submitted process > evaluate (hormone_rec.MAFFT-FFTNS1.MAFFT

[8d/956c3e] Submitted process > evaluate (kunitz.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[34/533551] Submitted process > evaluate (kunitz.MAFFT-FFTNS1.MAFFT_PARTTREE.std_align.NA)
[50/263193] Submitted process > evaluate (kunitz.MAFFT-FFTNS1.MAFFT_PARTTREE.dpa_align.1000)
[82/abfcf7] Submitted process > evaluate (kunitz.CLUSTALO.MAFFT_PARTTREE.std_align.NA)
[20/f63796] Submitted process > evaluate (kunitz.CLUSTALO.MAFFT_PARTTREE.dpa_align.1000)
[90/0920b5] Submitted process > evaluate (aat.CLUSTALO.CLUSTALO.std_align.NA)
[40/fa52e5] Submitted process > evaluate (aat.MAFFT-FFTNS1.CLUSTALO.std_align.NA)
[9c/8043e8] Submitted process > evaluate (aat.MAFFT-FFTNS1.CLUSTALO.dpa_align.1000)
[32/5fab86] Submitted process > evaluate (aat.CLUSTALO.CLUSTALO.dpa_align.1000)
[ed/4f732d] Submitted process > evaluate (aat.CLUSTALO.MAFFT_PARTTREE.std_align.NA)
[b5/e98a7c] Submitted process > evaluate (aat.CLUSTALO.MAFFT_PARTTREE.dpa_align.1000)
[b5/1fa571] Submitted process > evaluate (aat.MAFFT-FFTNS1.MA

#### Import required python libraries

In [1]:
import plotly.plotly as py
import plotly.figure_factory as ff
import plotly.graph_objs as go

import numpy as np
import pandas as pd
import os
import csv

#### Create a dictionary mapping each family to the number of sequences in its dataset

In [2]:
with open("../data/num_seqs.csv", mode='r') as infile:
    reader = csv.reader(infile, delimiter='\t')
    sizes_dict = {rows[0]:rows[1] for rows in reader}

#### Define common functions

In [3]:
# Read a directory of score files into a nested dictionary keyed as
# tag -> align_type -> aligner -> tree -> family -> score_type -> score
def scores_to_dict(scores_dir, scores_dict, tag):
    scores_list=[]
    for score_file in os.listdir(scores_dir):
        # File names have six dot-separated fields; the bucket size is not used as a key
        family, align_type, bucket, aligner, tree, score_type = score_file.split('.')
        y = [tag, align_type, aligner, tree, family, score_type]
        with open(os.path.join(scores_dir, score_file), 'r') as infile:
            data = infile.read()
        y.append(data.rstrip())
        scores_list.append(y)
    # Walk each key path, creating nested levels as needed;
    # the score string itself ends up as the innermost key
    for score in scores_list:
        current_level = scores_dict
        for part in score:
            if part not in current_level:
                current_level[part] = {}
            current_level = current_level[part]
    return scores_dict
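
To see the shape of the structure this function builds, it helps to run it on a single score file. The sketch below uses a condensed copy of the function so that it runs on its own; the file name and the score value are made up for illustration.

```python
import os
import tempfile

def scores_to_dict(scores_dir, scores_dict, tag):
    """Condensed copy of the notebook function, kept here so the example runs alone."""
    for score_file in os.listdir(scores_dir):
        family, align_type, bucket, aligner, tree, score_type = score_file.split('.')
        with open(os.path.join(scores_dir, score_file)) as infile:
            value = infile.read().rstrip()
        level = scores_dict
        for part in [tag, align_type, aligner, tree, family, score_type, value]:
            level = level.setdefault(part, {})
    return scores_dict

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name and score, for illustration only
    name = "seatoxin.dpa_align.1000.CLUSTALO.MAFFT_PARTTREE.tc"
    with open(os.path.join(tmp, name), "w") as fh:
        fh.write("42.0\n")
    demo = scores_to_dict(tmp, {}, "full")

print(demo)
# {'full': {'dpa_align': {'CLUSTALO': {'MAFFT_PARTTREE': {'seatoxin': {'tc': {'42.0': {}}}}}}}}
```

Note that the score is stored as the innermost dictionary key (a string), not as a value, so any code reading it back has to iterate over the keys and convert them with `float`.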

#### Read in the scores directory

In [4]:
# Read the full datasets
scores_dict = {}
full_scores_dir="results/edgar_full/"
scores_dict = scores_to_dict(full_scores_dir, scores_dict, "full")

# Read in the reference datasets
ref_scores_dir="results/edgar_ref/"
scores_dict = scores_to_dict(ref_scores_dir, scores_dict, "ref")

#### Calculate the average total column score for families containing > 10,000 sequences

In [5]:
tc_scores_dict={}

for tag, tagValues in scores_dict.items():
    for alignType, v in tagValues.items():
        for alignMethod, v1 in v.items():
            for treeMethod, v2 in v1.items():
                n=0
                sum_dict = {'sp':0, 'tc':0, 'col':0 , 'cpu':0}
                for k3, v3 in v2.items():
                    if int(sizes_dict[k3]) > 10000:               
                        n+=1
                        for k4, v4 in v3.items():
                            for k5, v5 in v4.items():
                                sum_dict[k4]+=float(k5)
                tc_avg = round((sum_dict['tc']/n), 2)
                key=(tag,alignType,alignMethod,treeMethod)
                print(key, tc_avg)
                tc_scores_dict[key] = tc_avg
                
print("Read in",len(tc_scores_dict.items()),"scores")

('full', 'dpa_align', 'CLUSTALO', 'MAFFT_PARTTREE') 42.21
('full', 'dpa_align', 'CLUSTALO', 'CLUSTALO') 41.91
('full', 'dpa_align', 'CLUSTALO', 'MAFFT-FFTNS1') 42.12
('full', 'dpa_align', 'MAFFT-GINSI', 'CLUSTALO') 50.2
('full', 'dpa_align', 'MAFFT-GINSI', 'MAFFT-FFTNS1') 48.47
('full', 'dpa_align', 'MAFFT-GINSI', 'MAFFT_PARTTREE') 47.54
('full', 'dpa_align', 'MAFFT-SPARSECORE', 'MAFFT_PARTTREE') 48.67
('full', 'dpa_align', 'MAFFT-SPARSECORE', 'MAFFT-FFTNS1') 46.87
('full', 'dpa_align', 'MAFFT-SPARSECORE', 'CLUSTALO') 51.06
('full', 'dpa_align', 'UPP', 'MAFFT_PARTTREE') 40.28
('full', 'dpa_align', 'UPP', 'CLUSTALO') 44.18
('full', 'dpa_align', 'UPP', 'MAFFT-FFTNS1') 45.34
('full', 'dpa_align', 'MAFFT-FFTNS1', 'CLUSTALO') 37.94
('full', 'dpa_align', 'MAFFT-FFTNS1', 'MAFFT-FFTNS1') 31.43
('full', 'dpa_align', 'MAFFT-FFTNS1', 'MAFFT_PARTTREE') 35.16
('full', 'std_align', 'MAFFT-FFTNS1', 'MAFFT-FFTNS1') 39.91
('full', 'std_align', 'MAFFT-FFTNS1', 'CLUSTALO') 41.33
('full', 'std_align', 'MA
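
The averaging step for one (tag, alignment type, aligner, tree) combination can be isolated into a small function, which also makes the size filter easier to test. This is a sketch under stated assumptions: `average_score` is a hypothetical helper, the families and scores are made up, and it returns `None` rather than dividing by zero when no family passes the filter. It averages over the scores found, which matches dividing by the number of families when each family has exactly one score of the given type.

```python
def average_score(families, sizes, score_type="tc", min_size=10000):
    """Mean of one score type over families with more than min_size sequences.

    `families` maps family -> {score_type: {score_string: {}}}, the shape
    produced by scores_to_dict; `sizes` maps family -> sequence count.
    """
    values = []
    for family, by_type in families.items():
        if int(sizes[family]) > min_size and score_type in by_type:
            # Scores are stored as dictionary keys, so iterate over the keys
            values.extend(float(s) for s in by_type[score_type])
    return round(sum(values) / len(values), 2) if values else None

# Made-up families and scores, for illustration only
families = {
    "famA": {"tc": {"40.0": {}}},
    "famB": {"tc": {"50.0": {}}},
    "famC": {"tc": {"90.0": {}}},  # too small, excluded by the filter
}
sizes = {"famA": "20000", "famB": "15000", "famC": "88"}
print(average_score(families, sizes))  # 45.0
```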

#### Do the same as above for the reference alignments

Read in 2256 score files
Read in 20 scores


#### Generate Table 1 - Total Column Scores

In [30]:
alignment_methods=['CLUSTALO','MAFFT-FFTNS1']
tree_methods=['CLUSTALO', 'MAFFT_PARTTREE']

progressive_sum=0
regressive_sum=0
reference_sum=0
rows=[['Alignment Method', 'Tree Method', 'Progressive', 'Regressive', 'Reference']]
for a_method in alignment_methods:
    for t_method in tree_methods:
        progressive_sum+=tc_scores_dict["full","std_align",a_method,t_method]
        regressive_sum+=tc_scores_dict["full","dpa_align",a_method,t_method]
        reference_sum+=tc_scores_dict["ref","std_align",a_method,t_method]
        rows.append([a_method,
                     t_method, 
                     tc_scores_dict["full","std_align",a_method,t_method], 
                     tc_scores_dict["full","dpa_align",a_method,t_method],
                     tc_scores_dict["ref","std_align",a_method,t_method]
                     ])
        
rows.append(['AVERAGE', '', round(progressive_sum/4,2),round(regressive_sum/4,2),round(reference_sum/4,2)])
rows.append(['', '', '','',''])


# UPP with ClustalO Trees/Default
rows.append( ['Default/CLUSTALO', 'UPP',
               tc_scores_dict['full', 'default_align', 'UPP', 'DEFAULT'],
               tc_scores_dict['full', 'dpa_align', 'UPP', 'CLUSTALO'],
               tc_scores_dict['ref', 'default_align', 'UPP', 'DEFAULT']])
progressive_sum+=tc_scores_dict['full', 'default_align', 'UPP', 'DEFAULT']
regressive_sum+=tc_scores_dict['full', 'dpa_align', 'UPP', 'CLUSTALO']
reference_sum+=tc_scores_dict['ref', 'default_align', 'UPP', 'DEFAULT']


rows.append( ['Default/CLUSTALO', 'MAFFT-SPARSECORE',
               tc_scores_dict['full', 'default_align', 'MAFFT-SPARSECORE', 'DEFAULT'],
               tc_scores_dict['full', 'dpa_align', 'MAFFT-SPARSECORE', 'CLUSTALO'],
               tc_scores_dict['ref', 'default_align', 'MAFFT-SPARSECORE', 'DEFAULT']])
progressive_sum+=tc_scores_dict['full', 'default_align', 'MAFFT-SPARSECORE', 'DEFAULT']
regressive_sum+=tc_scores_dict['full', 'dpa_align', 'MAFFT-SPARSECORE', 'CLUSTALO']
reference_sum+=tc_scores_dict['ref', 'default_align', 'MAFFT-SPARSECORE', 'DEFAULT']
            
    
rows.append( ['MAFFT_PARTTREE', 'MAFFT-GINSI', '-',
               tc_scores_dict['full', 'dpa_align', 'MAFFT-GINSI', 'MAFFT_PARTTREE'],
               tc_scores_dict['ref', 'std_align', 'MAFFT-GINSI', 'MAFFT_PARTTREE']])
regressive_sum+=tc_scores_dict['full', 'dpa_align', 'MAFFT-GINSI', 'MAFFT_PARTTREE']
reference_sum+=tc_scores_dict['ref', 'std_align', 'MAFFT-GINSI', 'MAFFT_PARTTREE']
            

rows.append( ['CLUSTALO', 'MAFFT-GINSI', '-',
               tc_scores_dict['full', 'dpa_align', 'MAFFT-GINSI', 'CLUSTALO'],
               tc_scores_dict['ref', 'std_align', 'MAFFT-GINSI', 'CLUSTALO']])
regressive_sum+=tc_scores_dict['full', 'dpa_align', 'MAFFT-GINSI', 'CLUSTALO']
reference_sum+=tc_scores_dict['ref', 'std_align', 'MAFFT-GINSI', 'CLUSTALO']

rows.append(['GLOBAL AVERAGE', '', round(progressive_sum/6,2),round(regressive_sum/8,2),round(reference_sum/8,2)])
    
table = ff.create_table(rows)

py.iplot(table, filename='table1')

#### Create CSV of above table

In [28]:
with open('table1_A.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(rows)

# WORK IN PROGRESS BELOW HERE

#### Generate Table 1 - Relative Total Column Scores
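
In this table each score on the full dataset is divided by the reference-only score for the same aligner and tree method. A minimal sketch of that ratio follows; `relative_tc` is a hypothetical helper and the scores are made up, but the keys follow the `(tag, align_type, aligner, tree)` layout of `tc_scores_dict`.

```python
def relative_tc(tc_scores, mode, aligner, tree, ref_mode="std_align"):
    """Ratio of the full-dataset TC score to the reference-only TC score."""
    full = tc_scores["full", mode, aligner, tree]
    ref = tc_scores["ref", ref_mode, aligner, tree]
    return round(full / ref, 2)

# Made-up scores, for illustration only
tc_scores = {
    ("full", "std_align", "CLUSTALO", "CLUSTALO"): 30.0,
    ("full", "dpa_align", "CLUSTALO", "CLUSTALO"): 45.0,
    ("ref", "std_align", "CLUSTALO", "CLUSTALO"): 60.0,
}
print(relative_tc(tc_scores, "std_align", "CLUSTALO", "CLUSTALO"))  # 0.5
print(relative_tc(tc_scores, "dpa_align", "CLUSTALO", "CLUSTALO"))  # 0.75
```

A value near 1 would mean the alignment of the full dataset recovers the reference columns about as well as aligning the references alone.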

In [41]:
alignment_methods=['CLUSTALO','MAFFT-FFTNS1']
tree_methods=['CLUSTALO', 'MAFFT_PARTTREE']

progressive_sum=0
regressive_sum=0
reference_sum=0
rows=[['Alignment Method', 'Tree Method', 'Progressive', 'Regressive']]
for a_method in alignment_methods:
    for t_method in tree_methods:
        progressive_sum+=tc_scores_dict["full","std_align",a_method,t_method]/tc_scores_dict["ref","std_align",a_method,t_method]
        regressive_sum+=tc_scores_dict["full","dpa_align",a_method,t_method]/tc_scores_dict["ref","std_align",a_method,t_method]
        rows.append([a_method,
                     t_method, 
                     round(tc_scores_dict["full","std_align",a_method,t_method]/tc_scores_dict["ref","std_align",a_method,t_method],2), 
                     round(tc_scores_dict["full","dpa_align",a_method,t_method]/tc_scores_dict["ref","std_align",a_method,t_method],2)
                     ])
        
rows.append(['AVERAGE', '', round(progressive_sum/4,2),round(regressive_sum/4,2)])
rows.append(['', '', '',''])


# UPP with ClustalO Trees/Default
rows.append( ['Default/CLUSTALO', 'UPP',
               round(tc_scores_dict['full', 'default_align', 'UPP', 'DEFAULT']/tc_scores_dict['ref', 'default_align', 'UPP', 'DEFAULT'],2),
               round(tc_scores_dict['full', 'dpa_align', 'UPP', 'CLUSTALO']/tc_scores_dict['ref', 'default_align', 'UPP', 'DEFAULT'],2)])
progressive_sum+=tc_scores_dict['full', 'default_align', 'UPP', 'DEFAULT']/tc_scores_dict['ref', 'default_align', 'UPP', 'DEFAULT']
regressive_sum+=tc_scores_dict['full', 'dpa_align', 'UPP', 'CLUSTALO']/tc_scores_dict['ref', 'default_align', 'UPP', 'DEFAULT']

# MAFFT-SPARSECORE with ClustalO Trees or Default
rows.append( ['Default/CLUSTALO', 'MAFFT-SPARSECORE',
               round(tc_scores_dict['full', 'default_align', 'MAFFT-SPARSECORE', 'DEFAULT']/tc_scores_dict['ref', 'default_align', 'MAFFT-SPARSECORE', 'DEFAULT'],2),
               round(tc_scores_dict['full', 'dpa_align', 'MAFFT-SPARSECORE', 'CLUSTALO']/tc_scores_dict['ref', 'default_align', 'MAFFT-SPARSECORE', 'DEFAULT'],2)])
progressive_sum+=tc_scores_dict['full', 'default_align', 'MAFFT-SPARSECORE', 'DEFAULT']/tc_scores_dict['ref', 'default_align', 'MAFFT-SPARSECORE', 'DEFAULT']
regressive_sum+=tc_scores_dict['full', 'dpa_align', 'MAFFT-SPARSECORE', 'CLUSTALO']/tc_scores_dict['ref', 'default_align', 'MAFFT-SPARSECORE', 'DEFAULT']


# MAFFT-GINSI with MAFFT-PARTTREE (regressive only, so only regressive_sum is updated)
# Assumes tc_scores_dict holds MAFFT-GINSI entries for MAFFT_PARTTREE keyed like the CLUSTALO ones in the absolute-score table above
rows.append( ['MAFFT_PARTTREE', 'MAFFT-GINSI', '-',
               round(tc_scores_dict['full', 'dpa_align', 'MAFFT-GINSI', 'MAFFT_PARTTREE']/tc_scores_dict['ref', 'std_align', 'MAFFT-GINSI', 'MAFFT_PARTTREE'],2)])
regressive_sum+=tc_scores_dict['full', 'dpa_align', 'MAFFT-GINSI', 'MAFFT_PARTTREE']/tc_scores_dict['ref', 'std_align', 'MAFFT-GINSI', 'MAFFT_PARTTREE']

# MAFFT-GINSI with CLUSTALO (regressive only, so only regressive_sum is updated)
rows.append( ['CLUSTALO', 'MAFFT-GINSI', '-',
               round(tc_scores_dict['full', 'dpa_align', 'MAFFT-GINSI', 'CLUSTALO']/tc_scores_dict['ref', 'std_align', 'MAFFT-GINSI', 'CLUSTALO'],2)])
regressive_sum+=tc_scores_dict['full', 'dpa_align', 'MAFFT-GINSI', 'CLUSTALO']/tc_scores_dict['ref', 'std_align', 'MAFFT-GINSI', 'CLUSTALO']


rows.append(['GLOBAL AVERAGE', '', round(progressive_sum/6,2),round(regressive_sum/8,2)])


table = ff.create_table(rows)

py.iplot(table, filename='table1')
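
Each relative score in the cell above is the TC score on the full dataset divided by the TC score of the reference-only alignment. The repeated expression could be factored into a small helper; the following is a minimal sketch, assuming `tc_scores_dict` is keyed by `(dataset, align_type, align_method, tree_method)` tuples as in the cells above. The helper name `relative_tc` and the toy values are hypothetical and are not benchmark results.

```python
def relative_tc(scores, full_key, ref_key, ndigits=2):
    """Return scores[full_key] / scores[ref_key] rounded to ndigits.

    Returns None when either score is missing or the reference score is zero.
    """
    full = scores.get(full_key)
    ref = scores.get(ref_key)
    if full is None or not ref:
        return None
    return round(full / ref, ndigits)


# Toy values for illustration only (not benchmark results).
toy_scores = {
    ('full', 'std_align', 'CLUSTALO', 'CLUSTALO'): 40.0,
    ('full', 'dpa_align', 'CLUSTALO', 'CLUSTALO'): 60.0,
    ('ref', 'std_align', 'CLUSTALO', 'CLUSTALO'): 80.0,
}
ref_key = ('ref', 'std_align', 'CLUSTALO', 'CLUSTALO')

progressive = relative_tc(toy_scores, ('full', 'std_align', 'CLUSTALO', 'CLUSTALO'), ref_key)
regressive = relative_tc(toy_scores, ('full', 'dpa_align', 'CLUSTALO', 'CLUSTALO'), ref_key)
missing = relative_tc(toy_scores, ('full', 'dpa_align', 'UNKNOWN', 'CLUSTALO'), ref_key)
print(progressive, regressive, missing)
```

Returning `None` for a missing key lets a row show a placeholder such as `'-'` instead of raising a `KeyError` partway through building the table.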

#### Generate Figure 1: Cumulative Score Progressive vs Regressive

In [174]:
cumlative_score_table=[['Method', 'Family', 'Size', 'TC Score']]
for k, v in score_dict.items(): # for each align type
    for k1, v1 in v.items(): # for each alignment method
        for k2, v2 in v1.items(): # for each tree method
            for k3, v3 in v2.items(): # for each family
                for k4, v4 in v3.items(): # for each score
                    if k4 == 'tc':
                        for k5, v5 in v4.items():
                            cumlative_score_table.append([str(k+'-'+k1+'-'+k2), k3, int(sizes_dict[k3]), float(k5)])
                            
df2 = pd.DataFrame.from_records(cumlative_score_table[1:]) #,columns=[cumlative_score_table[0]])
df3 = df2.sort_values(by=[0,2])

dpa_align_CO_PT = df3[0] == "dpa_align-CLUSTALO-MAFFT_PARTTREE"
std_align_CO_PT = df3[0] == "std_align-CLUSTALO-MAFFT_PARTTREE"

dpa_align_CO_PT_df = df3[dpa_align_CO_PT]
std_align_CO_PT_df = df3[std_align_CO_PT]

dpa_align_CO_PT_cumsum = dpa_align_CO_PT_df[3].cumsum()
std_align_CO_PT_cumsum = std_align_CO_PT_df[3].cumsum()

dpa_align_CO_PT_cumsum = dpa_align_CO_PT_cumsum.reset_index(drop=True)
std_align_CO_PT_cumsum = std_align_CO_PT_cumsum.reset_index(drop=True)

dpa_align_CO_PT_cumavg = dpa_align_CO_PT_cumsum/(dpa_align_CO_PT_cumsum.index+1)
std_align_CO_PT_cumavg = std_align_CO_PT_cumsum/(std_align_CO_PT_cumsum.index+1)

dpa_align_CO_PT_cumavg

log_sizes = dpa_align_CO_PT_df[2].apply(np.log)

# Create a trace
dpa = go.Scatter(
    x = log_sizes,
    y = dpa_align_CO_PT_cumavg,
    name = 'Cumulative Average of Regressive ClustalO PartTree'
)

std = go.Scatter(
    x = log_sizes,
    y = std_align_CO_PT_cumavg,
    name = 'Cumulative Average of Progressive ClustalO PartTree'
)

dpa_points = go.Scatter(
    x = log_sizes,
    y = dpa_align_CO_PT_df[3],
    mode = 'markers',
    name = 'Regressive ClustalO PartTree'
)

std_points = go.Scatter(
    x = log_sizes,
    y = std_align_CO_PT_df[3],
    mode = 'markers',
    name = 'Progressive ClustalO PartTree'
)

layout = dict(title = 'Progressive vs Regressive ClustalO with PartTree',
              xaxis = dict(title = 'log(number of sequences)'),
              yaxis = dict(title = 'total column score')
              )

data = [dpa,std,dpa_points,std_points]
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='basic-line')
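
The cumulative average plotted above is the running mean of the TC scores, with families ordered by size. As a minimal sketch on toy values (hypothetical, not benchmark data), the cumulative-sum-over-count approach used in the cell above can be checked against pandas' built-in `expanding().mean()`:

```python
import pandas as pd

# Toy data: family size and TC score; values are illustrative only.
toy = pd.DataFrame({'size': [300, 100, 200], 'tc': [30.0, 10.0, 20.0]})
toy = toy.sort_values(by='size').reset_index(drop=True)

# Approach used in the cell above: cumulative sum divided by the running count.
cumsum = toy['tc'].cumsum()
cumavg_manual = cumsum / (cumsum.index + 1)

# Equivalent built-in running mean over an expanding window.
cumavg_expanding = toy['tc'].expanding().mean()

print(cumavg_manual.tolist())
print(cumavg_expanding.tolist())
```

Both series should agree element by element. The `expanding()` form does not depend on the index having been reset to 0..n-1 first.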