# Table 1: Regressive vs Progressive MSA

This notebook contains all the code to generate Table 1 from the publication.

## Input Data
Our input data consists of 94 datasets from the HOMFAM benchmark. Each dataset is a protein family, and the number of sequences per family ranges from 88 for the smallest family, *seatoxin*, to 93,675 for the largest, *rvp*.

In [6]:
cat ../data/seqs/seatoxin.fa | grep '>'| wc -l

      88


In [7]:
cat ../data/seqs/rvp.fa | grep '>'| wc -l

   93675
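
The same count can be taken from Python, which is convenient when looping over all 94 families. The sketch below is an illustration, not part of the original workflow; `count_fasta_records` is a hypothetical helper name. It counts lines that start with `>`, which is slightly stricter than `grep '>'` (that matches `>` anywhere on a line).

```python
def count_fasta_records(lines):
    """Count FASTA records in an iterable of lines (e.g. an open file handle).

    A record is any line whose first character is '>'.
    """
    return sum(1 for line in lines if line.startswith(">"))
```

Assuming the paths used above, a call would look like `count_fasta_records(open("../data/seqs/seatoxin.fa"))`.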


For each dataset, there is also a reference set of sequences.
These references are from the PDB and have been structurally aligned.

In [10]:
cat ../data/refs/seatoxin.ref

>1apf
GVPCLCDSDGPRPRGNTLSGILWFYPSGCPS--GWH-NCKAHGPNIGWCCKK--
>1ahl
GVSCLCDSDGPSVRGNTLSGTLWLYPSGCPS--GWH-NCKAHGPTIGWCCKQ--
>1atx
GAACLCKSDGPNTRGNSMSGTIWVF--GCPS--GWN-NCEGRA-IIGYCCKQ--
>1sh1
-AACKCDDEGPDIRTAPLTGTVDLG--SCNA--GWE-KCASYYTIIADCCRKKK
>1bds
AAPCFCSGKP-------GRGDLWILRGTCPGGYGYTSNCYK--WPNICCYPH--


Each sequence file has been combined with the unaligned reference sequences to create a combined sequence set.

In [16]:
!head ../data/combined_seqs/seatoxin.fa

>B1NWT1_NEMVE/41-81
ACACDSPGIRSASLSGIVWVGSCPSGWKKCKSYYSVVADCC
>B1NWR7_NEMVE/41-83
PCACDSDGPDIRSASLSGIVWMGSCPSGWKKCKSYYSIVADCC
>TXCN2_BUNCN/3-46
ACRCDSDGPTVRGDSLSGTLWLTGGCPSGWHNCRGSGPFIGYCC
>TX5_ANTXA/3-45
SCLCDSDGPSVSGNTLSGIIWLAGCPSGWHNCKAHGPNIGWCC
>TXH7_ANTS7/3-45
PCLCDSDGPSVHGNTLSGTIWLAGCPSGWHNCKAHGPTIGWCC
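
A minimal sketch of how such a combined file could be built, assuming the only steps are stripping the gap characters (`-`) from the aligned references and concatenating them with the family sequences. The function names, the ordering (family sequences first, references last), and the absence of de-duplication are all assumptions made for illustration. The notebook does not show how the files in `combined_seqs` were actually produced.

```python
def read_fasta(lines):
    """Parse FASTA lines into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def combine_with_references(ref_lines, seq_lines):
    """Un-align the references (drop '-') and append them to the family sequences.

    Ordering is an assumption; no de-duplication is attempted.
    """
    refs = [(name, seq.replace("-", "")) for name, seq in read_fasta(ref_lines)]
    return read_fasta(seq_lines) + refs


def to_fasta(records):
    """Serialise (name, sequence) tuples as FASTA, one sequence per line."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

Both input arguments accept any iterable of lines, so open file handles for a `.ref` file and a `.fa` file can be passed directly.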


## Generating Alignments

The main workflow for generating the alignments is written in Nextflow.


### Part One: Progressive vs Regressive Alignments
__Progressive Alignment Procedures__
* Clustal Omega with Clustal Omega (mBed) trees: **(Prog-CO-mBed)**
* Clustal Omega with Mafft PartTree trees: **(Prog-CO-PT)**
* Mafft FFT-NS-1 with Mafft PartTree trees: **(Prog-MFF1-PT)**
* Mafft FFT-NS-1 with Clustal Omega (mBed) trees: **(Prog-MFF1-mBed)**

__Regressive Alignment Procedures__
* Clustal Omega with Clustal Omega (mBed) trees: **(Regr-CO-mBed)**
* Clustal Omega with Mafft PartTree trees: **(Regr-CO-PT)**
* Mafft FFT-NS-1 with Mafft PartTree trees: **(Regr-MFF1-PT)**
* Mafft FFT-NS-1 with Clustal Omega (mBed) trees: **(Regr-MFF1-mBed)**

| Name           | Alignment Method | Tree Method     | Type        |
|----------------|------------------|-----------------|-------------|
| Prog-CO-mBed   | ClustalO         | ClustalO (mBed) | Progressive |
| Prog-CO-PT     | ClustalO         | Mafft PartTree  | Progressive |
| Prog-MFF1-PT   | Mafft FFT-NS-1   | Mafft PartTree  | Progressive |
| Prog-MFF1-mBed | Mafft FFT-NS-1   | ClustalO (mBed) | Progressive |
| Regr-CO-mBed   | ClustalO         | ClustalO (mBed) | Regressive  |
| Regr-CO-PT     | ClustalO         | Mafft PartTree  | Regressive  |
| Regr-MFF1-PT   | Mafft FFT-NS-1   | Mafft PartTree  | Regressive  |
| Regr-MFF1-mBed | Mafft FFT-NS-1   | ClustalO (mBed) | Regressive  |
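
The eight procedures are the full cross product of two alignment types, two aligners, and two tree methods. The sketch below regenerates the names in the table from those three axes. The dictionaries and the `procedure_table` helper are illustrative names, not part of the pipeline.

```python
from itertools import product

# Short label -> full name, taken from the table above.
TYPES = {"Prog": "Progressive", "Regr": "Regressive"}
ALIGNERS = {"CO": "ClustalO", "MFF1": "Mafft FFT-NS-1"}
TREES = {"mBed": "ClustalO (mBed)", "PT": "Mafft PartTree"}


def procedure_table():
    """Return one dict per procedure: 2 types x 2 aligners x 2 tree methods."""
    rows = []
    for type_key, aligner_key, tree_key in product(TYPES, ALIGNERS, TREES):
        rows.append({
            "name": f"{type_key}-{aligner_key}-{tree_key}",
            "alignment_method": ALIGNERS[aligner_key],
            "tree_method": TREES[tree_key],
            "type": TYPES[type_key],
        })
    return rows
```

The rows come out in cross-product order, which differs from the row order of the table, but the set of names is the same.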

In [29]:
!~/bin/nextflow run ../main.nf \
                    --align_method="CLUSTALO" \
                    --tree_method="CLUSTALO" \
                    --refs='../data/refs/{seatoxin,rnasemam}.ref' \
                    --combined='../data/combined_seqs/{seatoxin,rnasemam}.fa' \
                    --dpa_align \
                    --std_align \
                    -with-docker

N E X T F L O W  ~  version 0.30.2
Launching `../main.nf` [sharp_swirles] - revision: 880b512e73
D P A   A n a l y s i s  ~  version 0.1"
Name                                                  : DPA_Analysis
Input sequences (FASTA)                               : /Users/efloden/projects/dpa-analysis/data/seqs/*.fa
Input references (Aligned FASTA)                      : ../data/refs/{seatoxin,rnasemam}.ref
Input trees (NEWICK)                                  : false
Input combined sequences (FASTA)		       : ../data/combined_seqs/{seatoxin,rnasemam}.fa
Output directory (DIRECTORY)                          : /Users/efloden/projects/dpa-analysis/results
Alignment methods [CLUSTALO|MAFFT]                    : CLUSTALO
Tree method [CLUSTALO|MAFFT|CLUSTALO_RND|MAFFT_RND]   : CLUSTALO
Perform default alignments                            : false
Perform standard alignments                           : true
Perform double progressive alignments (DPA)           : true
Bucket Sizes for DPA       
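
The run above covers one aligner/tree combination. As a rough sketch, the argument lists for all four combinations could be generated as below, using only the flags that appear in the command above and the `CLUSTALO` and `MAFFT` values listed in the log. That `MAFFT` corresponds to FFT-NS-1 as aligner and PartTree as tree method is an assumption. `build_commands` is a hypothetical helper that only constructs the argument lists and does not launch anything.

```python
from itertools import product


def build_commands(families, nextflow="nextflow", script="../main.nf"):
    """Build one Nextflow argument list per (align_method, tree_method) pair.

    The brace pattern is passed through unexpanded, as in the quoted
    shell command, so that Nextflow resolves the file names itself.
    """
    family_glob = "{" + ",".join(families) + "}"
    commands = []
    for align_method, tree_method in product(["CLUSTALO", "MAFFT"], repeat=2):
        commands.append([
            nextflow, "run", script,
            f"--align_method={align_method}",
            f"--tree_method={tree_method}",
            f"--refs=../data/refs/{family_glob}.ref",
            f"--combined=../data/combined_seqs/{family_glob}.fa",
            "--dpa_align",
            "--std_align",
            "-with-docker",
        ])
    return commands
```

Each list can be handed to `subprocess.run(cmd, check=True)`. Because no shell is involved, the braces reach Nextflow literally, just as the single quotes ensure in the cell above.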

In [31]:
cat ../results/scores/tcScores.sharp_swirles.csv

seatoxin	CLUSTALO	CLUSTALO	dpa_align	1000	57.1	
rnasemam	CLUSTALO	CLUSTALO	dpa_align	1000	73.1	
seatoxin	CLUSTALO	CLUSTALO	std_align	NA	57.1	
rnasemam	CLUSTALO	CLUSTALO	std_align	NA	73.1	
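
To compare the two modes per family, the score file can be loaded into Python. The sketch below assumes the six whitespace-separated columns are family, alignment method, tree method, alignment mode, bucket size, and TC score. Those column names are inferred from the values and are not stated by the pipeline. The helper names are illustrative.

```python
def parse_scores(lines):
    """Parse score lines into dicts; column meanings are assumed, not documented."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        family, align_method, tree_method, mode, bucket, score = fields
        rows.append({
            "family": family,
            "align_method": align_method,
            "tree_method": tree_method,
            "mode": mode,
            "bucket_size": None if bucket == "NA" else int(bucket),
            "tc_score": float(score),
        })
    return rows


def scores_by_family(rows):
    """Map family -> {mode: tc_score} for side-by-side comparison."""
    table = {}
    for row in rows:
        table.setdefault(row["family"], {})[row["mode"]] = row["tc_score"]
    return table
```

With several bucket sizes per family, later rows would overwrite earlier ones in `scores_by_family`, so the key would need to include the bucket size in that case.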
