# ColabAlign

### Fast pairwise protein secondary structure comparisons using multiprocessing

[![DOI](https://zenodo.org/badge/788453062.svg)](https://doi.org/10.5281/zenodo.14169501)

Create a phylogenetic tree that compares the secondary structure of proteins, rather than nucleotide or amino acid sequence. Scoring uses the [_US-align_](https://zhanggroup.org/US-align/) algorithm by [_Zhang et al. (2022)_](https://doi.org/10.1038/s41592-022-01585-1). Based on [_mTM-align_](http://yanglab.nankai.edu.cn/mTM-align/) by [_Dong et al. (2018)_](https://doi.org/10.1093/nar/gky430). Made possible with [US-align](https://github.com/pylelab/USalign) and [Biopython](https://biopython.org).

A score of **<0.17** indicates similarity indistinguishable from that of a random pair of structures, whereas a score of **≥0.50** indicates a pair with broadly the same fold ([_Xu et al., 2010_](https://doi.org/10.1093/bioinformatics/btq066)).
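As a rough illustration of how these two cut-offs can be applied when reading a score matrix, here is a minimal sketch. The function name `interpret_score` and the three labels are hypothetical and are not part of ColabAlign; the sketch also assumes scores lie between 0 and 1, and it labels everything between the two cut-offs simply as "intermediate".

```python
def interpret_score(score: float) -> str:
    """Map a pairwise similarity score onto the two cut-offs quoted above.

    Hypothetical helper for illustration only; the labels are not ColabAlign output.
    """
    # Assumption: scores are normalised to the range 0..1.
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score < 0.17:
        return "random-like"   # indistinguishable from a random pair
    if score >= 0.50:
        return "same fold"     # broadly the same fold
    return "intermediate"      # between the two cut-offs


for example in (0.12, 0.35, 0.50, 0.81):
    print(f"{example:.2f} -> {interpret_score(example)}")
```

Scores in the intermediate range are left uninterpreted here because the text above only defines the two end points.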

##### Usage

* For the best performance, click `Runtime` -> `Change runtime type` -> `TPU v2`

* If `TPU v2` is not available, `L4 GPU` or `T4 GPU` runtimes (paid options) are the next best options. These perform about 10x faster than the default `CPU` runtime. I recommend using `TPU v2`, `L4 GPU` or `T4 GPU` for datasets containing > 250 structures.

* Upload .pdb or .cif format files directly to the Colab instance by clicking the folder icon on the left, then dragging and dropping your structures.

* US-align only considers the first chain in each .pdb or .cif file, so please ensure this is the chain you wish to include in the pairwise alignment.

* Phylogenetic trees generated by ColabAlign should be viewed and analysed as unrooted trees.
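
Because only the first chain of each file is used, it can be worth checking which chain comes first before uploading. The sketch below is one way to do that for .pdb files using only the Python standard library. The helper names `first_chain_id` and `report_first_chains` are hypothetical and not part of ColabAlign, and the sketch assumes that "first chain" means the chain ID on the first `ATOM` record. It does not handle .cif files.

```python
from pathlib import Path
from typing import Dict, Optional


def first_chain_id(pdb_path) -> Optional[str]:
    """Return the chain ID of the first ATOM record in a PDB-format file.

    Returns None if the file contains no ATOM records.
    Hypothetical helper; assumes fixed-column PDB format (chain ID in column 22).
    """
    with open(pdb_path, errors="replace") as handle:
        for line in handle:
            if line.startswith("ATOM  ") and len(line) > 21:
                return line[21]  # column 22 (1-indexed) holds the chain ID
    return None


def report_first_chains(directory) -> Dict[str, Optional[str]]:
    """Map each .pdb file name in `directory` to its first chain ID."""
    return {
        path.name: first_chain_id(path)
        for path in sorted(Path(directory).glob("*.pdb"))
    }
```

For example, `report_first_chains("models")` returns a dictionary of file names and first chain IDs that can be scanned for unexpected chains. A chain ID of `" "` (a single space) means the file leaves the chain column blank.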

In [None]:
# @title Install conda

!pip install -q condacolab
import condacolab
condacolab.install()

In [None]:
# @title Check conda install

import condacolab
condacolab.check()

In [None]:
# @title Set up ColabAlign conda environment and prepare file structure

%%bash

git clone https://github.com/crfield18/ColabAlign.git ColabAlign_git
mv ColabAlign_git/* .
rm -rd ColabAlign_git

mamba env update -n base -f colabalign.yml

# Create necessary directories
mkdir -p results models

# Extract any tar.gz files
find . -name "*.tar.gz" -exec tar -xzf {} \; > /dev/null 2>&1

# Move all pdb and cif files to the models directory to make file handling with colabalign.py easier
# (-exec handles file names containing spaces or quotes, which can trip up xargs)
find . -type f \( -name "*.pdb" -o -name "*.cif" \) -not -path "./models/*" -exec mv {} models/ \; 2>/dev/null || true

# Remove Apple Double files
find ./models/ -name "._*" -delete 2>/dev/null

In [None]:
# @title Align Structures
# @markdown Make sure your .pdb and/or .cif files are uploaded before running this block.

import os
import ipywidgets as widgets
from IPython.display import display

threshold = 0.1 # @param {type:"slider", min:0.01, max:1.00, step:0.01}

os.environ['threshold'] = str(threshold)

# Run colabalign.py with all available cores
# {threshold} is filled in by IPython from the Python variable set above
!echo "Starting alignment with $(nproc --all) cores..."
!python3 colabalign.py -i /content/models -o /content/results -c $(nproc --all) -t {threshold}

In [None]:
# @title Zip and download results

import os
import datetime
from google.colab import files

# Name the zipped results file using the current date and time
# to avoid accidentally overwriting older results files when downloading
current_dt = datetime.datetime.now()
zip_filename = f'colabalign_results_{current_dt.strftime("%Y%m%d-%H%M")}.zip'

# Use the system `zip` command rather than a Python module for improved efficiency
os.system(f'zip -r {zip_filename} results')
files.download(zip_filename)
