<a href="https://colab.research.google.com/github/YaoYinYing/Umol/blob/main/Umol.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Umol** - **U**niversal **mol**ecular framework
**Structure prediction of protein-ligand complexes from sequence information**

The protein is represented with a multiple sequence alignment and the ligand as a SMILES string, allowing for unconstrained flexibility in the protein-ligand interface. At a high accuracy threshold, unseen protein-ligand complexes are predicted more accurately than with RoseTTAFold-AA, and at medium accuracy Umol surpasses even classical docking methods that use known protein structures as input.

For local installation, see: https://github.com/patrickbryant1/Umol \
[Read the paper here](https://www.biorxiv.org/content/10.1101/2023.11.03.565471v1)

Umol has **no built-in size limit**; the maximum input size depends only on the available RAM.

Umol is available under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0). \
The Umol parameters are made available under the terms of the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/legalcode).


In [1]:
#@markdown ###Install OpenMM in a conda environment (this can take a while).

!pip install -q condacolab
import condacolab
condacolab.install()


⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.11.0-0/Mambaforge-23.11.0-0-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:10
🔁 Restarting kernel...


In [2]:
#@title Install dependencies
#@markdown Make sure your runtime is GPU.
#@markdown In the menu above do: Runtime --> Change runtime type --> Hardware accelerator (set to GPU)

#@markdown **Press play.**

#@markdown Simply press play on each cell below and follow the instructions.

#@markdown You will have to restart the runtime after this finishes to include the new packages.
#@markdown In the menu above do: Runtime --> Restart runtime
#@markdown Don't worry about the errors that pip gives below; these are resolved by the end.

!echo "Installing dependencies, please wait..."
!conda install -c conda-forge -c  omnia pdbfixer openff-toolkit  openmm=8.1.0 openmmforcefields=0.11.2 -q  > /dev/null

!echo "Installing Umol package..."
!pip install git+https://github.com/YaoYinYing/Umol.git -q  > /dev/null
!echo "Updating JAX..."
!pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html -q  > /dev/null

!pip install rdkit-pypi -q  > /dev/null
!pip install numpy==1.26.4 -q  > /dev/null
!pip install py3Dmol -q  > /dev/null
!echo "Done!"

Installing dependencies, please wait...
Installing Umol package...
Updating JAX...
Done!
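
Because the install commands above send their output to `/dev/null`, a failed install is easy to miss. The following is a minimal sketch of a check you could run after restarting the runtime. The helper `missing_modules` is hypothetical (not part of Umol), and the import names in `REQUIRED` are assumed from the packages installed above.

```python
import importlib.util

def missing_modules(names):
    """Return the names that cannot be located as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names assumed from the install commands above (illustrative list).
REQUIRED = ["openmm", "pdbfixer", "rdkit", "py3Dmol", "jax", "umol"]

missing = missing_modules(REQUIRED)
print("Missing modules:", missing if missing else "none")
```

If the list is not empty, rerunning the install cell with the `> /dev/null` redirection removed should show the underlying error.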


In [1]:
#@title #Follow all steps outlined below to run Umol
#@markdown To try the **test case** 7NB4, click the box "test_case". Then press the play button to the left.
\
#@markdown If you don't want to run the test case, **leave the box blank**.
\
#@markdown The target positions are visualised on a structure generated from [ESMFold](https://www.science.org/doi/10.1126/science.ade2574).
#@markdown You can opt out of adding target positions, but the accuracy is increased if you do.

#@markdown #Settings
#@markdown - **ID** - name of the run; it is used for the output directory and file names \
#@markdown - **MSA** - currently no MSA search is available directly in the browser, therefore you have to provide your own MSAs in a3m format and upload them here. \
#@markdown - Generating an MSA takes a few minutes: \
#@markdown Go to https://toolkit.tuebingen.mpg.de/tools/hhblits \
#@markdown Paste your protein sequence in the search field in fasta format --> Submit. \
#@markdown When the search is finished, go to the tab "Query MSA" and "Download Full A3M" \
#@markdown Upload the MSAs here: \
#@markdown Click the folder icon (Files) to the left and select the upload file icon. Upload your files.
#@markdown Make sure to name your MSA **"ID".a3m**

#@markdown - LIGAND - SMILES string of your ligand. Make sure it is canonical (e.g. as generated by RDKit).
#@markdown - SEQUENCE - protein sequence (same as used for the MSA). **Limit=400 amino acids**.
#@markdown You can crop the structure around the target region if it is too big - Umol will handle it.
#@markdown - TARGET_POSITIONS - **OPTIONAL** (leave blank if none). Comma-separated positions to target in the protein sequence (the binding pocket, numbered from 1). These should be all residues whose CB atoms lie within 10 Å of the putative ligand position.
#@markdown e.g. "50,51,53,54,55,56,57,58,59,60,61,62,64,65,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,92,93,94,95,96,97,98,99,100,101,103,104,124,127,128"
#@markdown - NUM_RECYCLES - how many recycles to use in the network (increase for harder targets)


import sys, os
from google.colab import files
import pandas as pd
import numpy as np
import py3Dmol

test_case = True #@param {type:"boolean"}

ID = "7NB4" #@param {type:"string"}
LIGAND = "CCc1sc2ncnc(N[C@H](Cc3ccccc3)C(=O)O)c2c1-c1cccc(Cl)c1C" # @param {type:"string"}
SEQUENCE = "SEDELYRQSLEIISRYLREQATGAKDTKPMGRSGATSRKALETLRRVGDGVQRNHETAFQGMLRKLDIKNEDDVKSLSRVMIHVFSDGVTNWGRIVTLISFGAFVAKHLKTINQESCIEPLAESITDVLVRTKRDWLVKQRGWDGFVEFFH" #@param {type:"string"}
TARGET_POSITIONS = "" #@param {type:"string"}


if len(TARGET_POSITIONS)>0:
  TARGET_POSITIONS = [int(x) for x in TARGET_POSITIONS.split(',')]
else:
  TARGET_POSITIONS = []
NUM_RECYCLES = 3 # @param {type:"integer"}
OUTDIR="/content/"+ID+'/'
if not os.path.exists(OUTDIR):
  os.mkdir(OUTDIR)

#Write fasta
with open('/content/'+ID+'.fasta', 'w') as file:
  file.write('>'+ID+'\n')
  file.write(SEQUENCE)
FASTA_FILE='/content/'+ID+'.fasta'

#Check MSA
MSA='/content/'+ID+'.a3m'
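if not test_case:
  from umol.check_msa_colab import process_a3m
  # Build the output name from the path without its extension, so an ID containing '.' is kept intact
  PROCESSED_MSA=os.path.splitext(MSA)[0]+'_processed.a3m'
  process_a3m(MSA, SEQUENCE, PROCESSED_MSA)
  MSA=PROCESSED_MSA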
else:
  if not os.path.exists(MSA):
    ! wget 'https://raw.githubusercontent.com/patrickbryant1/Umol/master/data/test_case/7NB4/7NB4.a3m' -O $MSA -q


print('Using ligand:', LIGAND)
#print('Using MSA:' ,MSA)
print('Using',NUM_RECYCLES,'recycles')

#Visualise with ESMFold
if not os.path.exists(OUTDIR+'/'+ID+'_esmfold.pdb'):
  !curl -k -X POST --data $SEQUENCE  https://api.esmatlas.com/foldSequence/v1/pdb/ >> $OUTDIR/$ID'_esmfold.pdb'


view = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js',)
view.addModel(open(OUTDIR+ID+"_esmfold.pdb",'r').read(),'pdb')
view.setStyle({'chain':'A'},{'cartoon': {'color':'green'}})

#Highlight the target positions as sticks
with open(OUTDIR+ID+"_esmfold.pdb","r") as ifile:
    system = ifile.read()
i = 0
for line in system.split("\n"):
    if not line.startswith("ATOM"):
        continue
    # Read the residue number from fixed PDB columns 23-26;
    # whitespace splitting fails when neighbouring fields run together
    idx = int(line[22:26])
    if idx in TARGET_POSITIONS:
        color = "green"
        view.setStyle({'model': -1, 'serial': i+1}, {"stick": {'color': color}})
    i += 1

view.zoomTo()
view.show()

print('The target structure is in cartoon and the target positions (pocket) in stick format.')

#Subtract 1 from the target positions to make them zero indexed
if len(TARGET_POSITIONS)>0:
  TARGET_POSITIONS=[x-1 for x in TARGET_POSITIONS]

Using ligand: CCc1sc2ncnc(N[C@H](Cc3ccccc3)C(=O)O)c2c1-c1cccc(Cl)c1C
Using 3 recycles
curl: /usr/local/lib/libcurl.so.4: no version information available (required by curl)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   97k  100 99996  100   151   228k    353 --:--:-- --:--:-- --:--:--  229k


The target structure is in cartoon and the target positions (pocket) in stick format.
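
The settings above ask for an MSA in a3m format built from the same sequence as SEQUENCE. A minimal sketch of a sanity check you could run on the file before uploading it follows. It assumes that the first record of the a3m file is the query and that it should equal SEQUENCE once gap characters are removed. The helpers `read_a3m` and `query_matches` are hypothetical and not part of Umol; what `process_a3m` checks internally is not shown in this notebook.

```python
def read_a3m(text):
    """Parse A3M text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and optional comment lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def query_matches(a3m_text, sequence):
    """True if the first A3M record, with gaps removed, equals the sequence."""
    records = read_a3m(a3m_text)
    if not records:
        return False
    query = records[0][1].replace("-", "")
    return query == sequence.strip()
```

For example, `query_matches(open('/content/' + ID + '.a3m').read(), SEQUENCE)` returning `False` suggests that the MSA was generated from a different sequence, or that the query is not the first record.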


In [2]:
#@title Check that the GPU is accessible. The response from this cell should be "gpu".
import jax
# jax.default_backend() reports the platform name without importing the internal xla_bridge module
print(jax.default_backend())

gpu


In [3]:
#@markdown #Get the parameters (if not downloaded)
UMOL_WEIGHTS_DIR='/content/umol_weights/'

! mkdir -p $UMOL_WEIGHTS_DIR

if not os.path.exists(f'{UMOL_WEIGHTS_DIR}/params_pocket.npy'):
  print('Getting Umol-pocket network parameters.')
  #!wget https://zenodo.org/records/10048543/files/params40000.npy
  !wget -q https://zenodo.org/records/10397462/files/params40000.npy -O $UMOL_WEIGHTS_DIR/params_pocket.npy

if not os.path.exists(f'{UMOL_WEIGHTS_DIR}/params_no_pocket.npy'):
  print('Getting Umol network parameters.')
  !wget -q https://zenodo.org/records/10489242/files/params60000.npy -O $UMOL_WEIGHTS_DIR/params_no_pocket.npy


Getting Umol-pocket network parameters.
Getting Umol network parameters.
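
The download cell only tests whether each parameter file exists. Since `wget -O` can leave an empty file behind when a download fails, an existence test alone may let a broken download pass. A minimal sketch of a stricter check follows; `missing_or_empty` is a hypothetical helper, and the two paths are the ones used in the cell above.

```python
import os

def missing_or_empty(paths):
    """Return the paths that are not regular files or that have zero size."""
    return [p for p in paths if not os.path.isfile(p) or os.path.getsize(p) == 0]

# Paths as named in the download cell above.
weights = [
    "/content/umol_weights/params_pocket.npy",
    "/content/umol_weights/params_no_pocket.npy",
]
print("Needs re-download:", missing_or_empty(weights))
```

Deleting any file reported here and rerunning the download cell triggers a fresh download.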


In [4]:
#@markdown #Run Umol pipeline

import os
import shutil
if os.path.exists(OUTDIR):
  shutil.rmtree(OUTDIR)

os.makedirs(os.path.join(OUTDIR, 'msas'), exist_ok=True)
# Copy the MSA chosen in the settings cell (the processed file when test_case is off)
shutil.copy(MSA, os.path.join(OUTDIR, 'msas', 'output.a3m'))

if not os.path.exists('/content/mock/uniref30_uc30/uniclust30_2018_08/'):
  # Mock the UniRef30 database, because the MSA file has already been uploaded.
  ! mkdir -p /content/mock/uniref30_uc30/uniclust30_2018_08/
  ! touch /content/mock/uniref30_uc30/uniclust30_2018_08/uniclust30_2018_08_mock.file



import hydra
import umol.inference
from umol.inference import main
from omegaconf import DictConfig, OmegaConf

cfg_path=os.path.join(os.path.abspath(os.path.dirname(umol.inference.__file__)), 'config')

try:
  hydra.initialize_config_dir(
      version_base=None, config_dir=cfg_path
  )
except ValueError as e :
  print(f'Ignoring Hydra re-initialization: {e}')

def reload_config_file(config_name: str = 'umol') -> DictConfig:
    return hydra.compose(
        config_name=config_name,
        return_hydra_config=False,
    )


cfg=reload_config_file()

OmegaConf.update(cfg, 'input.fasta', FASTA_FILE)
OmegaConf.update(cfg, 'input.ligand.smiles', LIGAND)
OmegaConf.update(cfg, 'input.target_pos', TARGET_POSITIONS)
OmegaConf.update(cfg, 'weights.dir', UMOL_WEIGHTS_DIR)
OmegaConf.update(cfg, 'runtime.recycles', NUM_RECYCLES)
OmegaConf.update(cfg, 'input.id', ID)
OmegaConf.update(cfg, 'output.dir', OUTDIR)


# db mock
OmegaConf.update(cfg, 'database.uc30', '/content/mock/uniref30_uc30/uniclust30_2018_08/uniclust30_2018_08')

main(cfg)



Saved MSA features to /content/7NB4/features/msa_features.pkl
Saved features to /content/7NB4/features/ligand_inp_features.pkl
protein_len
ligand_len




Reading ligand
Added all atoms...
Creating system...
Preparing complex
System has 2442 atoms
Adding ligand...
System has 2495 atoms
Preparing system
Adding restraints on protein CAs and ligand atoms
Final complex with PLDDT scores: /content/7NB4/relax/7NB4_relaxed_plddt.pdb
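
The log above reports a final complex file with pLDDT scores. A minimal sketch for summarising those scores per chain follows. It assumes that the scores are stored in the B-factor column (PDB columns 61-66), which is a common convention for predicted structures but is not confirmed by this notebook. `mean_bfactor_by_chain` is a hypothetical helper, not part of Umol.

```python
def mean_bfactor_by_chain(pdb_text):
    """Average the B-factor column (PDB columns 61-66) per chain ID."""
    totals = {}
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 66:
            continue
        chain = line[21]  # chain ID is column 22
        try:
            value = float(line[60:66])
        except ValueError:
            continue  # skip records with an unreadable B-factor field
        total, count = totals.get(chain, (0.0, 0))
        totals[chain] = (total + value, count + 1)
    return {chain: total / count for chain, (total, count) in totals.items()}
```

Calling it on the text of `/content/7NB4/relax/7NB4_relaxed_plddt.pdb` would, under the assumption above, give one average for the protein chain and one for the ligand chain.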




In [5]:
#@markdown #Visualise Unrelaxed Predictions
import py3Dmol
import pandas as pd
import numpy as np


view = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js',)
view.addModel(open(os.path.join(OUTDIR,'sdf',ID+"_pred_ligand.sdf"),'r').read(),'sdf')
view.setStyle({'stick': {'color':'cyan'}})
view.addModel(open(os.path.join(OUTDIR,'pdb',ID+"_pred_protein.pdb"),'r').read(),'pdb')
view.setStyle({'chain':'A'},{'cartoon': {'color':'green'}})
view.zoomTo()
view.show()

In [6]:
#@markdown #Visualise Relaxed complex
import os
import py3Dmol
# ID and OUTDIR are taken from the settings cell, so this cell works for any run, not only 7NB4
view = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js',)
view.addModel(open(os.path.join(OUTDIR,'relax',ID+"_relaxed_complex.pdb"),'r').read(),'pdb')
view.setStyle({'chain':'A'},{'cartoon': {'color':'green'}})
view.setStyle({'chain':'B'},{'stick': {'color':'cyan'}})


view.zoomTo()
view.show()
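
The settings cell defines the target positions as the residues whose CB atoms lie within 10 Å of the putative ligand position. A minimal sketch of how such a list could be read off a complex structure follows, for example to define a pocket for a rerun. It assumes that the protein is chain A and the ligand chain B, as in the visualisation cell above, and that residue numbers in the file start at 1 and match SEQUENCE. `pocket_positions` is a hypothetical helper, and glycines (which have no CB) are simply skipped, which may differ from how Umol itself defines the pocket.

```python
import math

def pocket_positions(pdb_text, protein_chain="A", ligand_chain="B", cutoff=10.0):
    """Return sorted residue numbers whose CB lies within `cutoff` Å of any ligand atom."""
    cb_atoms, ligand_atoms = [], []
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 54:
            continue
        # Coordinates sit in fixed PDB columns 31-54
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        chain = line[21]
        if chain == ligand_chain:
            ligand_atoms.append(xyz)
        elif chain == protein_chain and line[12:16].strip() == "CB":
            cb_atoms.append((int(line[22:26]), xyz))
    return sorted({
        resi for resi, xyz in cb_atoms
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms)
    })
```

The result can be turned into the comma-separated form expected by the settings cell with `",".join(map(str, positions))`.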

In [7]:
#@title Download relaxed results
from google.colab import files
import glob
for relaxed_complex in glob.glob('/content/*/relax/*_relaxed_complex.pdb'):
  files.download(relaxed_complex)
