# **ColabAlign**

## Fast pairwise protein secondary structure comparisons using multiprocessing

[![Code DOI](https://zenodo.org/badge/788453062.svg)](https://doi.org/10.5281/zenodo.14169501) [![Paper DOI](https://img.shields.io/badge/DOI-10.1101/2025.10.06.677802-green)](https://doi.org/10.1101/2025.10.06.677802)

---

Build a phylogenetic tree from comparisons of protein secondary structure rather than nucleotide or amino acid sequences. Scoring uses the [US-align](https://zhanggroup.org/US-align/) algorithm by [Zhang _et al._ (2022)](https://doi.org/10.1038/s41592-022-01585-1).

A score of **<0.17** indicates similarity indistinguishable from that of a random pair of structures, whereas a score of **≥0.50** indicates a pair with broadly the same fold ([Xu _et al._, 2010](https://doi.org/10.1093/bioinformatics/btq066)).
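
As a rough illustration of how these cut-offs might be applied when reading the results, the sketch below sorts a score into three bands. The helper name `interpret_score` and the band labels are hypothetical and not part of ColabAlign; only the 0.17 and 0.50 cut-offs come from the text above, and the sketch assumes scores lie between 0 and 1.

```python
def interpret_score(score: float) -> str:
    """Bin a pairwise score using the cut-offs quoted above (hypothetical helper)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score < 0.17:
        return "random-like"    # indistinguishable from a random pair
    if score >= 0.50:
        return "same fold"      # broadly the same fold
    return "intermediate"       # between the two cut-offs


for value in (0.12, 0.35, 0.50, 0.83):
    print(f"{value:.2f}: {interpret_score(value)}")
```

Scores between the two cut-offs are labelled "intermediate" here only for convenience; the text above makes no specific claim about that range.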

---

## **Usage**

1) **Click `Runtime` -> `Change runtime type` -> `v5e-1 TPU`**

2) **Upload your structures as `.pdb` or `.cif` files, either individually or bundled in a `.tar.gz` archive, by clicking the folder icon on the left of the Colab instance and dragging and dropping them in.**

3) **Click `Run all`**

## N.B.

* US-align only considers the first chain in each .pdb or .cif file, so please ensure this is the chain you wish to include in the pairwise alignment.

* Structural dendrograms generated by ColabAlign should be viewed/analysed as unrooted trees.

* Other runtimes can be used if you are paying for compute units. Ensure you select a runtime with more than 2 CPU cores, particularly if you plan to analyse a large dataset.
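
Because only the first chain of each file is compared, it can help to check which chain comes first before uploading. The sketch below is a minimal, unofficial check for `.pdb` files only: it reads the chain identifier from column 22 of `ATOM`/`HETATM` records and lists chains in order of first appearance. The helper name `pdb_chain_ids` is hypothetical, `.cif` files are not handled, and it assumes that the first chain listed is the one US-align will read.

```python
from pathlib import Path


def pdb_chain_ids(pdb_path):
    """Return chain IDs in order of first appearance in a PDB-format file."""
    chains = []
    with open(pdb_path) as handle:
        for line in handle:
            # In PDB coordinate records the chain identifier sits in column 22 (index 21)
            if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
                chain = line[21]
                if chain not in chains:
                    chains.append(chain)
    return chains


# Report the first chain of every .pdb file in the current directory
for path in sorted(Path(".").glob("*.pdb")):
    chains = pdb_chain_ids(path)
    first = chains[0] if chains else "none found"
    print(f"{path.name}: first chain = {first}, all chains = {chains}")
```

If the chain of interest is not listed first, the file would need to be edited or re-exported so that it is.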

In [1]:
# @title Set up ColabAlign conda environment and prepare file structure

%%bash

rm -rf sample_data

git clone https://github.com/crfield18/ColabAlign.git ColabAlign_git
mv ColabAlign_git/*.py .
mv ColabAlign_git/*.yml .
rm -rd ColabAlign_git

pip install konda
konda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
konda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
konda env update -n base -f colabalign.yml --quiet

# The version of MView in the bioconda channel doesn't currently work on Colab, so
# we need to install it manually instead
git clone https://github.com/desmid/mview.git MView_git
cd MView_git
perl install.pl



Collecting konda
  Downloading konda-0.1.0-py3-none-any.whl.metadata (3.7 kB)
Downloading konda-0.1.0-py3-none-any.whl (7.3 kB)
Installing collected packages: konda
Successfully installed konda-0.1.0
PREFIX=/usr/local
Unpacking bootstrapper...
Unpacking payload...

Installing base environment...

Preparing transaction: ...working... done
Executing transaction: ...working... done
installation finished.
    You currently have a PYTHONPATH environment variable set. This may cause
    unexpected behavior when running the Python interpreter in Miniconda3.
    For best results, please verify that your PYTHONPATH only points to
    directories of packages that are compatible with the Python interpreter
    in Miniconda3: /usr/local
accepted Terms of Service for https://repo.anaconda.com/pkgs/main
❌ Conda not found. Installing Miniconda first...
Downloading Miniconda installer...
Installing Miniconda to /usr/local...
✅ Miniconda installed successfully!
Run '!conda --version' to check if conda 

Cloning into 'ColabAlign_git'...
Cloning into 'MView_git'...
Use of uninitialized value $ENV{"USER"} in string eq at install.pl line 391.
###########################################################################

                     MView installer

  *********************************************************
  **                                                     **
  **                                                     **
  *********************************************************


>> Press the 'return' or 'enter' key to continue (Ctrl-C to exit). 
###########################################################################

                         MView installer

The installation requires a folder to contain the driver script.

That folder must be (1) writable by the installer and (2) on your PATH if
installing a personal copy, or on the shared PATH for site installations.

You can accept the suggested default, choose a folder from the following
PATH list, or specify another 

In [None]:
# @title Align Structures
# Run colabalign.py with all available cores

threshold = 0.1 # @param {type:"slider", min:0.01, max:1.00, step:0.01}

!mkdir -p results models
!find . -name "*.tar.gz" -exec tar -xzf {} \; > /dev/null 2>&1
!find . -type f \( -name "*.pdb" -o -name "*.cif" \) -not -path "./models/*" | xargs -I {} mv {} models/ 2>/dev/null || true
!find ./models/ -name "._*" -delete 2>/dev/null

!echo "Starting alignment with $(nproc --all) cores..."
!python3 colabalign.py -i /content/models -o /content/results -c $(nproc --all) -t $threshold

Starting alignment with 24 cores...
Setting up file structure.
Starting pairwise alignment.
Creating hardlinks for model files.
Running US-align jobs.
Total pairs:   4% 16881/440391 [01:10<22:29, 313.80pair/s]
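
The pair total in the progress bar grows quadratically with the number of input structures. As a worked example, the sketch below shows that 440,391 matches the number of unordered pairs among 939 structures, and estimates the run time from the reported rate. The input size of 939 is inferred from the total rather than stated in the output, and the sketch assumes ColabAlign aligns each unordered pair exactly once.

```python
import math

n_structures = 939  # assumed input size, inferred from the pair total above
total_pairs = math.comb(n_structures, 2)  # n * (n - 1) / 2 unordered pairs
print(total_pairs)  # 440391

rate = 313.80  # pairs per second, taken from the progress bar
minutes = total_pairs / rate / 60
print(f"Estimated run time at this rate: {minutes:.1f} minutes")
```

Doubling the number of structures roughly quadruples the number of pairs, which is why runtimes with more CPU cores matter for large datasets.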

In [None]:
# @title Zip and download results

import os
import datetime
from google.colab import files

# Name the zipped results file using the current date and time
# to not accidentally overwrite older results files when downloading
current_dt = datetime.datetime.now()
zip_filename = f'colabalign_results_{current_dt.strftime("%Y%m%d-%H%M")}.zip'

# Use the system `zip` command rather than a Python module for improved efficiency
os.system(f'zip -r {zip_filename} results')
files.download(zip_filename)
