# Uni-Fold Batch Inference Notebook

This notebook provides protein structure prediction with [Uni-Fold](https://github.com/dptech-corp/Uni-Fold/) and [UF-Symmetry](https://www.biorxiv.org/content/10.1101/2022.08.30.505833v1). Both protein monomers and multimers are supported. Homology search in this notebook uses the [MMseqs2](https://github.com/soedinglab/MMseqs2.git) server provided by [ColabFold](https://github.com/sokrypton/ColabFold). For results more consistent with the original AlphaFold(-Multimer), please refer to the open-source [Uni-Fold](https://github.com/dptech-corp/Uni-Fold/) repository or our web server at [Hermite™](https://hermite.dp.tech/).

Please note that this notebook is provided as an early-access prototype, and is NOT an official product of DP Technology. It is provided for theoretical modeling only and caution should be exercised in its use. 

**Licenses**

This Colab uses the [Uni-Fold model parameters](https://github.com/dptech-corp/Uni-Fold/#model-parameters-license) and its outputs are under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You can find details at: https://creativecommons.org/licenses/by/4.0/legalcode. The Colab itself is provided under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).


**Citations**

Please cite the following papers if you use this notebook:

*   Ziyao Li, Xuyang Liu, Weijie Chen, Fan Shen, Hangrui Bi, Guolin Ke, Linfeng Zhang. "[Uni-Fold: An Open-Source Platform for Developing Protein Folding Models beyond AlphaFold.](https://www.biorxiv.org/content/10.1101/2022.08.04.502811v1)" bioRxiv (2022)
*   Ziyao Li, Shuwen Yang, Xuyang Liu, Weijie Chen, Han Wen, Fan Shen, Guolin Ke, Linfeng Zhang. "[Uni-Fold Symmetry: Harnessing Symmetry in Folding Large Protein Complexes.](https://www.biorxiv.org/content/10.1101/2022.08.30.505833v1)" bioRxiv (2022)
*   Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S and Steinegger M. "[ColabFold: Making protein folding accessible to all.](https://www.nature.com/articles/s41592-022-01488-1)" Nature Methods (2022)

**Acknowledgements**

The model architecture of Uni-Fold is largely based on [AlphaFold](https://doi.org/10.1038/s41586-021-03819-2) and [AlphaFold-Multimer](https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1). The design of this notebook draws directly on [ColabFold](https://www.nature.com/articles/s41592-022-01488-1). We especially thank [@sokrypton](https://twitter.com/sokrypton) for his helpful suggestions on this notebook.

Copyright © 2022 DP Technology. All rights reserved.

In [None]:
import warnings
warnings.filterwarnings("ignore")
import os
import json
from unifold.colab.data import validate_input, get_features

is_colab = False
#@title Provide the arguments here and hit `Run` -> `Run All Cells`
jobname = 'unifold_batch_colab' #@param {type:"string"}
use_templates = True #@param {type:"boolean"}
msa_mode = "MMseqs2" #@param ["MMseqs2","single_sequence"]
#@markdown Parameters for model inference.
max_recycling_iters = 3 #@param {type:"integer"}
num_ensembles = 2 #@param {type:"integer"}
manual_seed = 42 #@param {type:"integer"}
times = 1 #@param {type:"integer"}
#@markdown Plotting parameters.
show_sidechains = False #@param {type:"boolean"}
dpi = 100 #@param {type:"integer"}
max_display_cnt = 3  #@param {type:"integer"}

MIN_SINGLE_SEQUENCE_LENGTH = 6
MAX_SINGLE_SEQUENCE_LENGTH = 3000
MAX_MULTIMER_LENGTH = 3000

In [None]:
# #@title Install Uni-Fold and third-party software
# #@markdown Please execute this cell by pressing the _Play_ button 
# #@markdown on the left to download and import third-party software 
# #@markdown in this Colab notebook. (See the [acknowledgements](https://github.com/dptech-corp/Uni-Fold/#acknowledgements) in our readme.)

# #@markdown **Note**: This installs the software on the Colab 
# #@markdown notebook in the cloud and not on your computer.
# %%bash
# if [ ! -f ENV_READY ]; then
#     apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y -qq kalign

#     # Install HHsuite.
#     wget -q https://github.com/soedinglab/hh-suite/releases/download/v3.3.0/hhsuite-3.3.0-AVX2-Linux.tar.gz; tar xfz hhsuite-3.3.0-AVX2-Linux.tar.gz; ln -s $(pwd)/bin/* /usr/bin 

#     pip3 -q install py3dmol gdown

#     pip3 -q install libmsym

#     touch ENV_READY
# fi

# GIT_REPO='https://github.com/dptech-corp/Uni-Fold'
# UNICORE_URL='https://github.com/dptech-corp/Uni-Core/releases/download/0.0.2/unicore-0.0.1+cu118torch2.0.0-cp310-cp310-linux_x86_64.whl'
# PARAM_URL='https://github.com/dptech-corp/Uni-Fold/releases/download/v2.0.0/unifold_params_2022-08-01.tar.gz'
# UF_SYMM_PARAM_URL='https://github.com/dptech-corp/Uni-Fold/releases/download/v2.2.0/uf_symmetry_params_2022-09-06.tar.gz'

# if [ ! -f UNIFOLD_READY ]; then
#     wget  ${UNICORE_URL} 
#     pip3 -q install "unicore-0.0.1+cu118torch2.0.0-cp310-cp310-linux_x86_64.whl"
#     git clone -b main ${GIT_REPO}
#     pip3 -q install ./Uni-Fold
#     wget ${PARAM_URL}
#     tar -xzf "unifold_params_2022-08-01.tar.gz"
#     wget  ${UF_SYMM_PARAM_URL}
#     tar -xzf "uf_symmetry_params_2022-09-06.tar.gz"

#     touch UNIFOLD_READY
# fi

# CONFIGURATION
Set up the input contents (<font color="red">either from a file or by filling in `input_json` directly</font>) and the output path.
- `jobname (str)`: name of the job, used as the prefix of output directories.
- `input_json_path (str)`: path to the input JSON file, which contains a list or dict of proteins. *If it is a list, the list indices are used as IDs.* Each protein is a dict with the keys:
    - `symmetry`: the protein's symmetry group. Defaults to "C1" if omitted.
    - `sequence`: the sequences of the asymmetric unit, separated by ";".
    - `id`: optional. If absent, the index of the entry in the list is used.
    - any other keys you wish to add.
- `output_dir_base (str)`: root directory of output files.


Example of a `list` input:
```python
input_json = [
        {'sequence': 'MGSSHHHHHHSSGLVPRGSHMEDRDPTQFEERHLKFLQQLGKGNFGSVEMCRYDPLQDNTGEVVAVKKLQHSTEEHLRDFEREIEILKSLQHDNIVKYKGVCYSAGRRNLKLIMEYLPYGSLRDYLQKHKERIDHIKLLQYTSQICKGMEYLGTKRYIHRDLATRNILVENENRVKIGDFGLTKVLPQDKEFFKVKEPGESPIFWYAPESLTESKFSVASDVWSFGVVLYELFTYIEKSKSPPAEFMRMIGNDKQGQMIVFHLIELLKNNGRLPRPDGCPDEIYMIMTECWNNNVNQRPSFRDLALRVDQIRDNMAG'},
        {'symmetry': 'C2', 'sequence': 'GSHMKNVLIGVQTNLGVNKTGTEFGPDDLIQAYPDTFDEMELISVERQKEDFNDKKLKFKNTVLDTCEKIAKRVNEAVIDGYRPILVGGDHSISLGSVSGVSLEKEIGVLWISAHGDMNTPESTLTGNIHGMPLALLQGLGDRELVNCFYEGAKLDSRNIVIFGAREIEVEERKIIEKTGVKIVYYDDILRKGIDNVLDEVKDYLKIDNLHISIDMNVFDPEIAPGVSVPVRRGMSYDEMFKSLKFAFKNYSVTSADITEFNPLNDINGKTAELVNGIVQYMMNPDY'},
        {'symmetry': 'C2', 'sequence': 'GGSGGSGGSGGSLFCEQVTTVTNLFEKWNDCERTVVMYALLKRLRYPSLKFLQYSIDSNLTQNLGTSQTNLSSVVIDINANNPVYLQNLLNAYKTARKEDILHEVLNMLPLLKPGNEEAKLIYLTLIPVAVKDTMQQIVPTELVQQIFSYLLIHPAITSEDRRSLNIWLRHLEDHIQ;SVPSYGEDELQQAMRLLNAASRQRTEAANEDFGGT'},
        {'symmetry': 'C3', 'sequence': 'LILNLRGGAFVSNTQITMADKQKKFINEIQEGDLVRSYSITDETFQQNAVTSIVKHEADQLCQINFGKQHVVCTVNHRFYDPESKLWKSVCPHPGSGISFLKKYDYLLSEEGEKLQITEIKTFTTKQPVFIYHIQVENNHNFFANGVLAHAMQVSI'},
    ]
```
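
To run the batch from a file instead of the inline demo, a list like the one above can be saved as JSON and `input_json_path` pointed at it. The following is a minimal sketch: the file name `my_batch.json`, the variable `example_json_path`, and the short placeholder sequences are assumptions for illustration only, not real targets.

```python
import json
import os
import tempfile

# Placeholder entries for illustration; replace the sequences with real ones.
example_tasks = [
    # Monomer; 'symmetry' is omitted, so it would default to C1.
    {"sequence": "MKTAYIAKQRQISFVK"},
    # Two chains in the asymmetric unit, separated by ';'.
    {"symmetry": "C2", "sequence": "MKTAYIAKQR;GSHMKNVLIG"},
]

# Hypothetical location; any writable path works.
example_json_path = os.path.join(tempfile.gettempdir(), "my_batch.json")

with open(example_json_path, "w", encoding="utf-8") as fp:
    json.dump(example_tasks, fp, indent=2)

# Read the file back the same way the notebook does.
with open(example_json_path, encoding="utf-8") as fp:
    loaded_tasks = json.load(fp)

print(f"Wrote {len(loaded_tasks)} tasks to {example_json_path}")
```

Setting `input_json_path` to the resulting file makes the notebook load it in place of the built-in demo case.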

A `dict` input is shown in the demo case of the following cell:

In [None]:
output_dir_base = "./prediction" if is_colab else "./prediction" #@param {type:"string"}
os.makedirs(output_dir_base, exist_ok=True)

input_json_path = 'your_json_path.json'


if os.path.isfile(input_json_path):
    with open(input_json_path, encoding="utf-8") as fp:
        input_json = json.load(fp)
    default_list_case = False
    default_dict_case = False
else:  # A DEMO CASE (DICT). list case is above.
    input_json = {
        '7teu': {'sequence': 'MGSSHHHHHHSSGLVPRGSHMEDRDPTQFEERHLKFLQQLGKGNFGSVEMCRYDPLQDNTGEVVAVKKLQHSTEEHLRDFEREIEILKSLQHDNIVKYKGVCYSAGRRNLKLIMEYLPYGSLRDYLQKHKERIDHIKLLQYTSQICKGMEYLGTKRYIHRDLATRNILVENENRVKIGDFGLTKVLPQDKEFFKVKEPGESPIFWYAPESLTESKFSVASDVWSFGVVLYELFTYIEKSKSPPAEFMRMIGNDKQGQMIVFHLIELLKNNGRLPRPDGCPDEIYMIMTECWNNNVNQRPSFRDLALRVDQIRDNMAG'},
        '8d27': {'symmetry': 'C2', 'sequence': 'GSHMKNVLIGVQTNLGVNKTGTEFGPDDLIQAYPDTFDEMELISVERQKEDFNDKKLKFKNTVLDTCEKIAKRVNEAVIDGYRPILVGGDHSISLGSVSGVSLEKEIGVLWISAHGDMNTPESTLTGNIHGMPLALLQGLGDRELVNCFYEGAKLDSRNIVIFGAREIEVEERKIIEKTGVKIVYYDDILRKGIDNVLDEVKDYLKIDNLHISIDMNVFDPEIAPGVSVPVRRGMSYDEMFKSLKFAFKNYSVTSADITEFNPLNDINGKTAELVNGIVQYMMNPDY'},
        '8oij': {'symmetry': 'C2', 'sequence': 'GGSGGSGGSGGSLFCEQVTTVTNLFEKWNDCERTVVMYALLKRLRYPSLKFLQYSIDSNLTQNLGTSQTNLSSVVIDINANNPVYLQNLLNAYKTARKEDILHEVLNMLPLLKPGNEEAKLIYLTLIPVAVKDTMQQIVPTELVQQIFSYLLIHPAITSEDRRSLNIWLRHLEDHIQ;SVPSYGEDELQQAMRLLNAASRQRTEAANEDFGGT'},
        'c2404': {'symmetry': 'C3', 'sequence': 'LILNLRGGAFVSNTQITMADKQKKFINEIQEGDLVRSYSITDETFQQNAVTSIVKHEADQLCQINFGKQHVVCTVNHRFYDPESKLWKSVCPHPGSGISFLKKYDYLLSEEGEKLQITEIKTFTTKQPVFIYHIQVENNHNFFANGVLAHAMQVSI'},
    }


def process_batch_json(tasks, jobname):
    if isinstance(tasks, dict):
        new_tasks = []
        for k, v in tasks.items():
            v['id'] = k
            new_tasks.append(v)
        tasks = new_tasks
    
    # check the input.
    for idx, task in enumerate(tasks):
        if 'id' not in task.keys():
            task['id'] = idx
            
        if 'sequence' not in task.keys():
            raise KeyError(f"'sequence' not found in task {idx + 1}; available keys: {list(task.keys())}")
        
        target_id = f"{jobname}_{task['id']}"
        input_sequences = task['sequence'].strip().split(';')
        
        task['target_id'] = target_id
        
        if 'symmetry' not in task.keys():
            task['symmetry'] = 'C1'
        
        symmetry_group = task['symmetry'] 
        # check the sequences
        sequences, is_multimer, symmetry_group = validate_input(
            input_sequences=input_sequences,
            symmetry_group=symmetry_group,
            min_length=MIN_SINGLE_SEQUENCE_LENGTH,
            max_length=MAX_SINGLE_SEQUENCE_LENGTH,
            max_multimer_length=MAX_MULTIMER_LENGTH)
        task['is_multimer'] = is_multimer
        
        # save features to `output_dir_base`
        feature_output_dir = get_features(
            jobname=jobname,
            target_id=target_id,
            sequences=sequences,
            output_dir_base=output_dir_base,
            is_multimer=is_multimer,
            msa_mode=msa_mode,
            use_templates=use_templates
            )
        
        task['feature_output_dir'] = feature_output_dir
        task['symmetry'] = task['symmetry'] if task['symmetry'] != 'C1' else None

    return tasks


all_tasks = process_batch_json(input_json, jobname)
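
The ID and symmetry handling in `process_batch_json` can be tried out without running the homology search. The sketch below is a simplified stand-in and not part of Uni-Fold: `normalize_tasks` is a hypothetical helper that mirrors only the bookkeeping steps (dict-to-list conversion, default IDs, `target_id` naming, default symmetry) and skips `validate_input` and `get_features`. The `chains` key is likewise an assumption added for inspection.

```python
import copy


def normalize_tasks(tasks, jobname):
    """Hypothetical helper: mirror the bookkeeping of process_batch_json."""
    tasks = copy.deepcopy(tasks)  # leave the caller's input untouched

    # A dict input maps IDs to task dicts; flatten it into a list.
    if isinstance(tasks, dict):
        tasks = [{**task, "id": key} for key, task in tasks.items()]

    for idx, task in enumerate(tasks):
        if "sequence" not in task:
            raise KeyError(f"task {idx} has no 'sequence' key")
        task.setdefault("id", idx)          # list inputs fall back to the index
        task.setdefault("symmetry", "C1")   # default symmetry group
        task["target_id"] = f"{jobname}_{task['id']}"
        # Chains of the asymmetric unit are separated by ';'.
        task["chains"] = task["sequence"].strip().split(";")
    return tasks


demo = normalize_tasks(
    [{"sequence": "AAAAAA"}, {"symmetry": "C2", "sequence": "AAAAAA;GGGGGG"}],
    jobname="demo",
)
for task in demo:
    print(task["target_id"], task["symmetry"], len(task["chains"]))
# prints:
# demo_0 C1 1
# demo_1 C2 2
```

Working on a deep copy is a design choice of this sketch; the notebook's own function modifies the task dicts in place.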

In [None]:
#@title Uni-Fold prediction on GPU.
from tqdm import tqdm
from unifold.colab.model import colab_inference

def manual_operations():
    # developers may operate on the pickle files here
    # to customize the features for inference.
    pass

manual_operations()

for task in tqdm(all_tasks, desc='Running Uni-Fold'):
    best_result = colab_inference(
        target_id=task['target_id'],
        data_dir=task['feature_output_dir'],
        param_dir='.',
        output_dir=task['feature_output_dir'],
        symmetry_group=task['symmetry'],
        is_multimer=task['is_multimer'],
        max_recycling_iters=max_recycling_iters,
        num_ensembles=num_ensembles,
        times=times,
        manual_seed=manual_seed,
        device="cuda:0",                # do not change this on colab.
    )
    
    task['best_plddt'] = best_result['plddt'].mean().item()
    task['pae'] = best_result['pae'].mean().item() if best_result['pae'] is not None else None
    task['best_results_path'] = best_result['best_results_path']
    task['protein'] = best_result['protein']
    


In [None]:
task_best_proteins = []
with open(os.path.join(output_dir_base, 'all_tasks_summary.json'), 'w') as f:
    # Remove the protein objects to keep the results summary clean.
    for item in all_tasks:
        # Check each task dict (not the list itself) for the 'protein' key.
        if 'protein' in item:
            task_best_proteins.append(item.pop('protein'))
    json.dump(all_tasks, f, indent=2)
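
Once `all_tasks_summary.json` has been written, it can be loaded to compare targets. The sketch below is an illustration under assumptions: it writes a small stand-in summary with made-up placeholder scores to a temporary directory so that it runs on its own. With real results, point `summary_path` at the file in `output_dir_base` instead. `rank_by_plddt` is a hypothetical helper, not part of Uni-Fold.

```python
import json
import os
import tempfile


def rank_by_plddt(summary_path):
    """Hypothetical helper: return (target_id, best_plddt) pairs, best first."""
    with open(summary_path, encoding="utf-8") as fp:
        tasks = json.load(fp)
    pairs = [(task["target_id"], task["best_plddt"]) for task in tasks]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Stand-in summary with placeholder scores (not real predictions).
placeholder_summary = [
    {"target_id": "demo_a", "best_plddt": 0.70},
    {"target_id": "demo_b", "best_plddt": 0.90},
]
summary_path = os.path.join(tempfile.mkdtemp(), "all_tasks_summary.json")
with open(summary_path, "w", encoding="utf-8") as fp:
    json.dump(placeholder_summary, fp, indent=2)

ranking = rank_by_plddt(summary_path)
for target_id, plddt in ranking:
    print(f"{target_id}: mean pLDDT {plddt:.2f}")
```

Only the `target_id` and `best_plddt` keys written by the notebook are read here; any extra keys in the summary are ignored.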

In [None]:
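#@title Show a predicted structure; set `display_cases_number` to choose which task to display.

display_cases_number = -1  # index into the task list; -1 selects the last task

from unifold.colab.plot import colab_plot

colab_plot(
    best_result=task_best_proteins[display_cases_number],
    # Assumption: plots go to the directory the inference results were written to.
    output_dir=all_tasks[display_cases_number]['feature_output_dir'],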
    show_sidechains=show_sidechains,
    dpi=dpi,
)
