# Uni-Fold Notebook

This notebook runs protein structure prediction with [Uni-Fold](https://github.com/dptech-corp/Uni-Fold/) and [UF-Symmetry](https://www.biorxiv.org/content/10.1101/2022.08.30.505833v1). Both protein monomers and multimers are supported. Homology search uses the [MMseqs2](https://github.com/soedinglab/MMseqs2.git) server provided by [ColabFold](https://github.com/sokrypton/ColabFold). For results more consistent with the original AlphaFold(-Multimer), please use the open-source [Uni-Fold](https://github.com/dptech-corp/Uni-Fold/) repository or our web server at [Hermite™](https://hermite.dp.tech/).

Please note that this notebook is provided as an early-access prototype, and is NOT an official product of DP Technology. It is provided for theoretical modeling only and caution should be exercised in its use. 

**Licenses**

This Colab uses the [Uni-Fold model parameters](https://github.com/dptech-corp/Uni-Fold/#model-parameters-license) and its outputs are under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You can find details at: https://creativecommons.org/licenses/by/4.0/legalcode. The Colab itself is provided under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).


**Citations**

Please cite the following papers if you use this notebook:

*   Ziyao Li, Xuyang Liu, Weijie Chen, Fan Shen, Hangrui Bi, Guolin Ke, Linfeng Zhang. "[Uni-Fold: An Open-Source Platform for Developing Protein Folding Models beyond AlphaFold.](https://www.biorxiv.org/content/10.1101/2022.08.04.502811v1)" bioRxiv (2022)
*   Ziyao Li, Shuwen Yang, Xuyang Liu, Weijie Chen, Han Wen, Fan Shen, Guolin Ke, Linfeng Zhang. "[Uni-Fold Symmetry: Harnessing Symmetry in Folding Large Protein Complexes.](https://www.biorxiv.org/content/10.1101/2022.08.30.505833v1)" bioRxiv (2022)
*   Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S and Steinegger M. "[ColabFold: Making protein folding accessible to all.](https://www.nature.com/articles/s41592-022-01488-1)" Nature Methods (2022)

**Acknowledgements**

The model architecture of Uni-Fold is largely based on [AlphaFold](https://doi.org/10.1038/s41586-021-03819-2) and [AlphaFold-Multimer](https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1). The design of this notebook is based directly on [ColabFold](https://www.nature.com/articles/s41592-022-01488-1). We especially thank [@sokrypton](https://twitter.com/sokrypton) for his helpful suggestions on this notebook.

Copyright © 2022 DP Technology. All rights reserved.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
#!wget https://github.com/dptech-corp/Uni-Fold/releases/download/v2.0.0/unifold_params_2022-08-01.tar.gz
#!tar -xzf unifold_params_2022-08-01.tar.gz

In [3]:
#!wget https://github.com/dptech-corp/Uni-Fold/releases/download/v2.2.0/uf_symmetry_params_2022-09-06.tar.gz
#!tar -xzf uf_symmetry_params_2022-09-06.tar.gz

In [4]:
#@title Provide the arguments here and hit `Run` -> `Run All Cells`
jobname = 'unifold_colab' #@param {type:"string"}

sequence_1 = 'LILNLRGGAFVSNTQITMADKQKKFINEIQEGDLVRSYSITDETFQQNAVTSIVKHEADQLCQINFGKQHVVCTVNHRFYDPESKLWKSVCPHPGSGISFLKKYDYLLSEEGEKLQITEIKTFTTKQPVFIYHIQVENNHNFFANGVLAHAMQVSI'  #@param {type:"string"}
sequence_2 = ''  #@param {type:"string"}
sequence_3 = ''  #@param {type:"string"}
sequence_4 = ''  #@param {type:"string"}

#@markdown Use symmetry group `C1` for default Uni-Fold predictions.
#@markdown Or, specify a **cyclic** symmetry group (e.g. `C4`) and
#@markdown the sequences of the asymmetric unit (i.e. **do not copy
#@markdown them multiple times**) to predict with UF-Symmetry.

symmetry_group = 'C1' #@param {type:"string"}

#@markdown Parameters for MSA searching with MMseqs2.

use_templates = True #@param {type:"boolean"}
msa_mode = "MMseqs2" #@param ["MMseqs2","single_sequence"]

#@markdown Parameters for model inference.

max_recycling_iters = 3 #@param {type:"integer"}
num_ensembles = 2 #@param {type:"integer"}
manual_seed = 42 #@param {type:"integer"}
times = 1 #@param {type:"integer"}

#@markdown Plotting parameters.

show_sidechains = False #@param {type:"boolean"}
dpi = 100 #@param {type:"integer"}
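
The form above expects `symmetry_group` to be `C1` or a cyclic group label such as `C4`. As a rough illustration of that naming convention only, the sketch below extracts the fold number from such a label. The helper `parse_cyclic_group` is hypothetical; it is not part of Uni-Fold and does not describe how `validate_input` checks the value.

```python
import re


def parse_cyclic_group(symmetry_group: str) -> int:
    """Return n for a cyclic group label 'Cn' (n >= 1); raise ValueError otherwise.

    Hypothetical helper for illustration; not taken from the Uni-Fold code base.
    """
    match = re.fullmatch(r"C([1-9]\d*)", symmetry_group.strip())
    if match is None:
        raise ValueError(
            f"Expected a cyclic group such as 'C1' or 'C4', got {symmetry_group!r}"
        )
    return int(match.group(1))


print(parse_cyclic_group("C4"))  # 4
```

With a `C4` group, only the sequences of one asymmetric unit are entered in the form, as the form text notes.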


In [5]:
import os
import re
import random
import hashlib
import numpy as np
from pathlib import Path
from prtm.models.unifold.input_validation import validate_input
from prtm.models.unifold.msa.utils import divide_multi_chains

import pickle
import gzip
from prtm.models.unifold.msa import parsers
from prtm.models.unifold.msa import pipeline
from prtm.models.unifold.data.utils import compress_features
from prtm.models.unifold.data.protein import PDB_CHAIN_IDS
from prtm.models.unifold.mmseqs import get_msa_and_templates





[2024-01-26 04:36:32,091] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
PyRosetta-4 2023 [Rosetta PyRosetta4.conda.linux.cxx11thread.serialization.CentOS.python310.Release 2023.47+release.5fe66cd241adb376f3a0af661ea0dcd77ea0dbbe 2023-11-21T10:47:25] retrieved from: http://www.pyrosetta.org
(C) Copyright Rosetta Commons Member Institutions. Created in JHU by Sergey Lyskov and PyRosetta Team.


In [6]:
MIN_SINGLE_SEQUENCE_LENGTH = 6
MAX_SINGLE_SEQUENCE_LENGTH = 3000
MAX_MULTIMER_LENGTH = 3000

input_sequences = [sequence_1, sequence_2, sequence_3, sequence_4]

In [8]:
sequences, is_multimer, symmetry_group = validate_input(
    input_sequences=input_sequences,
    symmetry_group=symmetry_group,
    min_length=MIN_SINGLE_SEQUENCE_LENGTH,
    max_length=MAX_SINGLE_SEQUENCE_LENGTH,
    max_multimer_length=MAX_MULTIMER_LENGTH,
)

Using the single-chain model.


In [12]:
def add_hash(x, y):
    """Append the first five hex digits of the SHA-1 digest of y to x."""
    return x + "_" + hashlib.sha1(y.encode()).hexdigest()[:5]

basejobname = "".join(input_sequences)
basejobname = re.sub(r'\W+', '', basejobname)
target_id = add_hash(jobname, basejobname)
descriptions = ['> '+target_id+' seq'+str(ii) for ii in range(len(sequences))]
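
The identifier built in this cell is deterministic: the same job name and sequences always produce the same `target_id`. A minimal, self-contained sketch of that behaviour follows. The name `make_target_id` and the sample sequences are illustrative and not part of Uni-Fold.

```python
import hashlib
import re


def make_target_id(jobname: str, sequences: list[str]) -> str:
    """Join the sequences, strip non-word characters, and append a short SHA-1 tag."""
    joined = re.sub(r"\W+", "", "".join(sequences))
    return jobname + "_" + hashlib.sha1(joined.encode()).hexdigest()[:5]


# Illustrative inputs only: one short chain and three empty slots.
id_a = make_target_id("demo", ["MKT", "", "", ""])
id_b = make_target_id("demo", ["MKT", "", "", ""])
print(id_a == id_b)  # True
```

Five hex digits carry only 20 bits, so the suffix is a convenient label rather than a collision-free key when many jobs share one results directory.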

In [9]:
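# NOTE: `output_dir_base` is not defined anywhere else in this notebook.
# Pointing it at the results directory used below is an assumption.
output_dir_base = "./unifold_results"
os.makedirs(output_dir_base, exist_ok=True)
if is_multimer:
    divide_multi_chains(target_id, output_dir_base, sequences, descriptions)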

In [17]:
s = []
for des, seq in zip(descriptions, sequences):
    s += [des, seq]

In [27]:
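# De-duplicate while keeping the input chain order; set() ordering is not
# reproducible across runs, and the index is later mapped to a chain ID.
unique_sequences = list(dict.fromkeys(sequences))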

In [28]:
if len(unique_sequences)==1:
    homooligomers_num = len(sequences)
else:
    homooligomers_num = 1
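
The two cells above reduce the chain list to its distinct sequences and count copies only when every chain is identical. The sketch below restates that logic as one function with two small examples. The name `summarize_chains` is illustrative, and it uses `dict.fromkeys` so that the order of the result is reproducible, which is a choice made for this sketch.

```python
def summarize_chains(sequences):
    """Return (unique_sequences, homooligomers_num) for a list of chain sequences."""
    unique = list(dict.fromkeys(sequences))  # order-preserving de-duplication
    # Copies are counted only for a pure homo-oligomer; otherwise the count is 1.
    copies = len(sequences) if len(unique) == 1 else 1
    return unique, copies


print(summarize_chains(["MKT", "MKT", "MKT"]))  # (['MKT'], 3)
print(summarize_chains(["MKT", "GSH"]))         # (['MKT', 'GSH'], 1)
```

Under this rule a mixed complex such as A2B would also get a count of 1, because its chains are not all identical.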

In [29]:
with open(f"{jobname}.fasta", "w") as f:
    f.write("\n".join(s))

In [30]:
os.makedirs("./unifold_results", exist_ok=True)
result_dir = Path("./unifold_results")
output_dir = os.path.join("./unifold_results", target_id)

(
  unpaired_msa,
  paired_msa,
  template_results,
) = get_msa_and_templates(
    target_id,
    unique_sequences,
    result_dir=result_dir,
    msa_mode=msa_mode,
    use_templates=use_templates,
    homooligomers_num=homooligomers_num,
)



Sequence 0 found templates: ['2lwy_A', '2imz_A', '3ifj_B', '4o1r_A', '2l8l_A', '2imz_B', '2in9_A', '6rpq_A', '4e2u_A', '2in8_A', '3igd_A', '7cfv_A', '1mi8_A', '6rpp_A', '6frg_A', '6frh_A', '6frh_B', '6qaz_A', '6riz_A']


In [31]:
idx, seq = 0, unique_sequences[0]

In [36]:
chain_id = PDB_CHAIN_IDS[idx]
sequence_features = pipeline.make_sequence_features(
    sequence=seq, description=f'> {jobname} seq {chain_id}', num_res=len(seq)
)
monomer_msa = parsers.parse_a3m(unpaired_msa[idx])
msa_features = pipeline.make_msa_features([monomer_msa])
template_features = template_results[idx]
feature_dict = {**sequence_features, **msa_features, **template_features}
feature_dict = compress_features(feature_dict)

In [37]:
if is_multimer:
    multimer_msa = parsers.parse_a3m(paired_msa[idx])
    pair_features = pipeline.make_msa_features([multimer_msa])
    pair_feature_dict = compress_features(pair_features)

In [39]:
#features_output_path = os.path.join(
#    output_dir, "{}.feature.pkl.gz".format(chain_id)
#)
#pickle.dump(
#    feature_dict, 
#    gzip.GzipFile(features_output_path, "wb"), 
#    protocol=4
#)
#if is_multimer:
#    uniprot_output_path = os.path.join(
#        output_dir, "{}.uniprot.pkl.gz".format(chain_id)
#    )
#    pickle.dump(
#        pair_feature_dict,
#        gzip.GzipFile(uniprot_output_path, "wb"),
#        protocol=4,
#    )
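
The commented-out cell above would write each feature dictionary as a gzip-compressed pickle. If that caching step is wanted, the round trip can be sketched as below. The helper names, the placeholder dictionary, and the temporary directory are assumptions for illustration; the real feature dictionaries hold arrays produced by the pipeline.

```python
import gzip
import os
import pickle
import tempfile


def save_features(features: dict, path: str) -> None:
    """Write a feature dictionary as a gzip-compressed pickle (protocol 4)."""
    with gzip.open(path, "wb") as f:
        pickle.dump(features, f, protocol=4)


def load_features(path: str) -> dict:
    """Read back a dictionary written by save_features."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Placeholder content, not real model features.
demo_features = {"demo_key": [0, 1, 2], "num_res": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "A.feature.pkl.gz")
    save_features(demo_features, demo_path)
    restored = load_features(demo_path)

print(restored == demo_features)  # True
```

Using `with gzip.open(...)` closes the file handle explicitly, which the one-line `gzip.GzipFile(...)` form in the commented-out cell leaves to garbage collection. Pickle files should only be loaded from sources you trust.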

In [None]:
#@title Uni-Fold prediction on GPU.

from unifold.colab.model import colab_inference

best_result = colab_inference(
    target_id=target_id,
    data_dir=output_dir,
    param_dir=".",
    output_dir=output_dir,
    symmetry_group=symmetry_group,
    is_multimer=is_multimer,
    max_recycling_iters=max_recycling_iters,
    num_ensembles=num_ensembles,
    times=times,
    manual_seed=manual_seed,
    device="cuda:0",                # do not change this on colab.
)


In [None]:
#@title Show the protein structure

from unifold.colab.plot import colab_plot

colab_plot(
    best_result=best_result,
    output_dir=output_dir,
    show_sidechains=show_sidechains,
    dpi=dpi,
)


In [None]:
#@title Download the prediction
#@markdown **Contents of the zip file**:
#@markdown 1. PDB formatted structures
#@markdown 2. JSON file of the model quality (pLDDT and pTM for multimer)
#@markdown 3. Plots of the best model quality (pLDDT and PAE for multimer)

from google.colab import files
import glob

file_lists = glob.glob(os.path.join(output_dir, "*.pdb"))
file_lists += glob.glob(os.path.join(output_dir, "*.json"))
file_lists += glob.glob(os.path.join(output_dir, "*.svg"))

!zip -q {target_id}.zip {" ".join(file_lists)}
files.download(f'{target_id}.zip')
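
The `!zip` shell call and `google.colab.files` are only available in Colab-style environments. When the notebook is run elsewhere, the same archive can be built with the standard library, as in the sketch below. The function name `zip_results` and the placeholder files are illustrative, not part of Uni-Fold.

```python
import glob
import os
import tempfile
import zipfile


def zip_results(result_dir: str, zip_path: str,
                patterns=("*.pdb", "*.json", "*.svg")) -> list:
    """Archive files in result_dir matching the patterns; return the archived names."""
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for pattern in patterns:
            for path in sorted(glob.glob(os.path.join(result_dir, pattern))):
                # Store files flat, without the directory prefix.
                archive.write(path, arcname=os.path.basename(path))
                names.append(os.path.basename(path))
    return names


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("model.pdb", "scores.json", "notes.txt"):
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write("placeholder\n")
    archived = zip_results(tmp_dir, os.path.join(tmp_dir, "demo.zip"))

print(archived)  # ['model.pdb', 'scores.json']
```

Unlike the interpolated shell command, this version is not affected by spaces or shell metacharacters in file names.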