# *De novo* design of an allyl hydratase

In this repo, we're designing for a large, aromatic substrate, 2-allylphenol. If we can show that we can get activity on that substrate (and if it's soluble), we'll move on to ethylene. 

## Input file prep 

We prepared our scaffold set, wrote a constraint file, and generated a ligand params file. 

## Match the theozyme into our scaffold set 

### Prep input files 

In [1]:
# make a list of input structures 
from glob import glob 
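# s[6:] strips the leading 'match/' so paths are relative to the run directory
runs = [ '-s {} -match:scaffold_active_site_residues {}'.format( s[6:], s[6:].replace( 'pdb', 'pos' ) ) for s in glob( 'match/scaffold_set/*pdb' ) ] 
with open( 'match/list', 'w' ) as fn:
    # end with a newline so that `wc -l` counts the last line too
    fn.write( '\n'.join( runs ) + '\n' )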

In [3]:
! tail -n 5 match/list

-s scaffold_set/2uvh_11.pdb -match:scaffold_active_site_residues scaffold_set/2uvh_11.pos
-s scaffold_set/2v73_11.pdb -match:scaffold_active_site_residues scaffold_set/2v73_11.pos
-s scaffold_set/3B5L_11.pdb -match:scaffold_active_site_residues scaffold_set/3B5L_11.pos
-s scaffold_set/3B5V_11.pdb -match:scaffold_active_site_residues scaffold_set/3B5V_11.pos
-s scaffold_set/bglb_11.pdb -match:scaffold_active_site_residues scaffold_set/bglb_11.pos
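
The slice `s[6:]` drops the leading `match/`, and `.replace( 'pdb', 'pos' )` swaps the extension, but it would also rewrite `pdb` anywhere else in a path. A slightly more defensive sketch, assuming the same `match/scaffold_set/` layout and that each `.pos` file sits next to its `.pdb`, only touches the extension. `match_args` is a hypothetical helper name, not part of the original notebook.

```python
import os

def match_args(pdb_path, root='match'):
    """Build one line of RosettaMatch arguments for a scaffold PDB.

    Assumes the .pos file has the same base name as the .pdb file.
    """
    rel = os.path.relpath(pdb_path, root)   # path relative to the run directory
    base, _ext = os.path.splitext(rel)      # strip only the final extension
    return '-s {0}.pdb -match:scaffold_active_site_residues {0}.pos'.format(base)

print(match_args('match/scaffold_set/2uvh_11.pdb'))
```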

### RosettaMatch run on scaffold set (207 proteins)

In [41]:
%%bash 

#cat out_of_mem_list > match/list
runs=$( wc -l match/list | cut -d" " -f1 )
echo $runs runs

cd match && sbatch --array=1-$runs sub.sh

40 runs
Submitted batch job 741482
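
`sub.sh` is not shown here. Each array task presumably picks one line of `match/list` using the index that SLURM exports as `SLURM_ARRAY_TASK_ID`, which is 1-based here because the array is `1-$runs`. A minimal Python sketch of that lookup, with `args_for_task` as a hypothetical helper and a two-line stand-in for the real list:

```python
def args_for_task(lines, task_id):
    """Return the argument string for a 1-based SLURM array task index."""
    if not 1 <= task_id <= len(lines):
        raise IndexError('task {} outside 1-{}'.format(task_id, len(lines)))
    return lines[task_id - 1].strip()

# Toy stand-in for the contents of match/list
lines = [
    '-s scaffold_set/2uvh_11.pdb -match:scaffold_active_site_residues scaffold_set/2uvh_11.pos\n',
    '-s scaffold_set/2v73_11.pdb -match:scaffold_active_site_residues scaffold_set/2v73_11.pos\n',
]

# Inside a real job the index would come from the environment:
#   task_id = int(os.environ['SLURM_ARRAY_TASK_ID'])
print(args_for_task(lines, 2))
```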


### Match results 

We used the classic match algorithm for both catalytic residues.

In [4]:
# the ones that still fail at 16 GB
# ! grep -i memory match/logs/slurm-741482_{1..40}.err | cut -d: -f1 | sort | uniq 

Some of the jobs crash with only 8 GB of RAM, so we collect those and run them again with the RAM per core bumped up to 16 GB. 

In [6]:
#! ls -1 match/UM*pdb | wc -l #number of matches 
# ! squeue -u carlin | grep match | wc -l 
#! grep CANCELLED match/logs/*740932* | grep . | wc -l 

In [7]:
#! grep -i memory match/logs/slurm-740932_{1..206}.err | cut -d: -f1 | sort | uniq > out_of_mem 
#! ( for i in $( cat out_of_mem ); do head -2 ${i/err/out} | tail -n 1 | cut -d@ -f2 | cut -c 7-; done ) > out_of_mem_list
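
The commented shell pipeline collects the `.err` logs that mention memory so that those scaffolds can be resubmitted with 16 GB. The same scan could be sketched in Python. The glob pattern follows the `slurm-<job>_<task>.err` naming seen above, and `logs_mentioning` is a hypothetical helper:

```python
import re
from glob import glob

def logs_mentioning(paths, word='memory'):
    """Return the sorted subset of log files containing `word` (case-insensitive)."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    hits = set()
    for path in paths:
        with open(path) as fh:
            if any(pattern.search(line) for line in fh):
                hits.add(path)
    return sorted(hits)

# Example: scan every stderr log from the match run
out_of_mem = logs_mentioning(glob('match/logs/slurm-*.err'))
print('{} logs mention memory'.format(len(out_of_mem)))
```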

## Submit matches to enzyme design 

Now that we have matched the active site residues and the ligand into various positions in the protein backbones of our scaffold set, we'll use an enzyme design protocol to optimize the ligand placement and its shape complementarity with the protein. 

In [297]:
from glob import glob
nstruct = 5 

runs = [ 
    '-s ../{} -suffix _{:04d}\n'.format( i, j ) 
    for j in range( nstruct ) 
    for i in glob( 'match/UM*pdb' ) 
]

with open( 'enzdes/list', 'w' ) as fn:
    fn.write( ''.join( runs ) )

In [298]:
! wc -l enzdes/list
! tail enzdes/list

52230 enzdes/list
-s ../match/UM_8_Y80E52_1_1lbf_11_hydratase_1.pdb -suffix _0004
-s ../match/UM_30_Y189E247_1_1fp2_11_hydratase_1.pdb -suffix _0004
-s ../match/UM_85_Y137D99_1_1h1a_11_hydratase_1.pdb -suffix _0004
-s ../match/UM_56_Y177E109_1_1igs_11_hydratase_1.pdb -suffix _0004
-s ../match/UM_26_Y44E163_1_1ukb_11_hydratase_1.pdb -suffix _0004
-s ../match/UM_44_Y147E231_1_1abe_11_hydratase_1.pdb -suffix _0004
-s ../match/UM_19_Y69E241_1_1ls6_11_hydratase_1.pdb -suffix _0004
-s ../match/UM_33_Y263E304_1_1cgz_11_hydratase_1.pdb -suffix _0004
-s ../match/UM_12_Y74E114_1_1tt8_11_hydratase_1.pdb -suffix _0004
-s ../match/UM_24_Y61D172_1_1re8_11_hydratase_1.pdb -suffix _0004
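
Because the comprehension loops over `nstruct` on the outside and the match files on the inside, the list holds `nstruct` lines for every match, each with its own `-suffix`. With `nstruct = 5`, the 52230 lines correspond to 52230 / 5 = 10446 match files. A toy version with made-up file names shows the ordering:

```python
nstruct = 2
matches = ['match/UM_1_a.pdb', 'match/UM_2_b.pdb']  # hypothetical names for illustration

runs = [
    '-s ../{} -suffix _{:04d}'.format(pdb, n)
    for n in range(nstruct)   # outer loop: replicate index
    for pdb in matches        # inner loop: every match file
]

for line in runs:
    print(line)

# One line per (replicate, match) pair
assert len(runs) == nstruct * len(matches)
```

All of the `_0000` lines come first, then all of the `_0001` lines, and so on, which is why `tail` on the real list shows only `_0004`.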


In [303]:
%%bash

#rm enzdes/logs/* 
cd enzdes && echo sbatch -p gc128 --array=1-$( wc -l list | cut -d" " -f1 ) sub.sh 

# now the rest 
# cd enzdes && sbatch --array=101-$( wc -l list | cut -d" " -f1 ) sub.sh 

# now the real
# cd enzdes && sbatch --array=1-$( wc -l list | cut -d" " -f1 ) sub.sh 

sbatch -p gc128 --array=1-52230 sub.sh


In [304]:
! squeue -u carlin 

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 796740_[986-1111,     gc128   design   carlin PD       0:00      1 (Resources)
       796740_1700     gc128   design   carlin  R       0:10      1 c0-14.hpc.genomecenter.ucdavis.edu
       796740_1701     gc128   design   carlin  R       0:10      1 c7-5.hpc.genomecenter.ucdavis.edu
       796740_1696     gc128   design   carlin  R       0:25      1 c0-9.hpc.genomecenter.ucdavis.edu
       796740_1698     gc128   design   carlin  R       0:25      1 c0-19.hpc.genomecenter.ucdavis.edu
       796740_1567     gc128   design   carlin  R       2:10      1 c7-4.hpc.genomecenter.ucdavis.edu
       796740_1562     gc128   design   carlin  R       3:16      1 c7-4.hpc.genomecenter.ucdavis.edu
       796740_1563     gc128   design   carlin  R       3:16      1 c0-7.hpc.genomecenter.ucdavis.edu
       796740_1564     gc128   design   carlin  R       3:16      1 c0-7.hpc.genomecenter.ucdavis.edu
       796740_156

In [305]:
! ls enzdes/out/*sc 

enzdes/out/score_0000.sc  enzdes/out/score_0002.sc  enzdes/out/score_0004.sc
enzdes/out/score_0001.sc  enzdes/out/score_0003.sc


## Results 

### Filter the RosettaScripts enzyme design output 

Let's start by looking at histograms of the `total_score` of each of the catalytic residues. Because we're looking at a big ligand, let's also check out the interface energy score. 
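
A minimal sketch of those histograms, assuming the scorefile columns `SR_1_total_score`, `SR_2_total_score`, `SR_3_total_score`, and `SR_3_interf_E_1_2` that appear elsewhere in this notebook. The values below are placeholders rather than real scores; with real data, the frame would be the concatenated scorefiles.

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas

columns = ['SR_1_total_score', 'SR_2_total_score', 'SR_3_total_score', 'SR_3_interf_E_1_2']

# Placeholder values standing in for the real scorefile columns
scores = pandas.DataFrame({
    'SR_1_total_score': [-2.1, -1.5, 0.3, -3.0],
    'SR_2_total_score': [-1.2, -0.4, -2.2, -1.9],
    'SR_3_total_score': [-4.0, -1.1, -0.2, -2.5],
    'SR_3_interf_E_1_2': [-5.6, -8.1, -2.3, -4.8],
})

# One panel per score term, side by side
fig, axes = plt.subplots(1, len(columns), figsize=(4 * len(columns), 3))
for ax, col in zip(axes, columns):
    ax.hist(scores[col], bins=10, color='k', linewidth=0)
    ax.set_title(col)
fig.tight_layout()
```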

In [1]:
# download scorefiles from cluster 
! rsync -avz $ep:/share/work/alex/dnh-aro/enzdes/out/*sc results/

receiving file list ... done
score_0000.sc
score_0001.sc
score_0002.sc
score_0003.sc
score_0004.sc

sent 50644 bytes  received 2485233 bytes  563528.22 bytes/sec
total size is 31763430  speedup is 12.53


In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas
from glob import glob 

dfs = [ pandas.read_csv( i, sep='\s+' ) for i in glob( 'results/*sc' ) ]
df = pandas.concat( dfs ).dropna()
df['scaffold'] = df.description.str.split( '_' ).str.get( 4 ) 
df['match'] = df.description.str.split( '_11' ).str[0]

print len( df ), 'enzdes out'
print len( df.scaffold.unique() ), 'scaffolds'
print len( df.match.unique() ), 'unique matches'

# sf = pandas.read_csv( 'enzdes/out/score.sc', sep='\s+' )
# sf.index = sf.description
# sf['scaffold'] = sf.index.str.split( '_' ).str[4]
# sf['match'] = sf.index.str.split( '_11' ).str[0]
# sf.sample( 3 )

# print len( sf ), 'enzdes out'
# print len( sf.scaffold.unique() ), 'scaffolds'
# print len( sf.match.unique() ), 'unique matches'

42180 enzdes out
159 scaffolds
8511 unique matches
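
The two `str.split` calls rely on the naming pattern `UM_<hit>_<residues>_<n>_<scaffold>_11_hydratase_...`. One thing to watch: `split( '_11' )` cuts at the first `_11` anywhere in the name, so a description starting with `UM_11_` or `UM_110_` would be truncated to `UM`. A sketch of a stricter parse with a regular expression follows. The group names `hit`, `residues`, and `copy` are guesses at what each part of the name means, `parse_description` is a hypothetical helper, and the `UM_11_...` name is constructed for illustration.

```python
import re

NAME = re.compile(
    r'^(?P<match>UM_(?P<hit>\d+)_(?P<residues>[A-Za-z0-9]+)_(?P<copy>\d+)'
    r'_(?P<scaffold>[A-Za-z0-9]+))_11_'
)

def parse_description(description):
    """Split an enzdes description into its match name and scaffold ID."""
    m = NAME.match(description)
    if m is None:
        raise ValueError('unexpected description: {}'.format(description))
    return m.group('match'), m.group('scaffold')

name = 'UM_11_Y27E253_1_1p4c_11_hydratase_1_0003_0001'  # constructed example
print(parse_description(name))   # stricter parse keeps the full match name
print(name.split('_11')[0])      # naive split stops at the hit number
```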


In [3]:
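# keep the lowest-scoring decoy from each group of nstruct replicates 

def lowest( df ):
    # DataFrame.sort is gone from newer pandas versions; sort_values replaces it
    return df.sort_values( 'total_score' ).head( 1 )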

grouped = df.groupby( 'match' ).apply( lowest )
print len( grouped ), 'lowest from nstruct groups (should equal number of matches)'

8511 lowest from nstruct groups (should equal number of matches)
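
For each match, the `groupby` keeps the single lowest-`total_score` decoy out of the `nstruct` replicates. An equivalent sketch that avoids `apply`, assuming a pandas version that provides `sort_values`, sorts once and keeps the first row per match. The scores here are invented for illustration:

```python
import pandas

# Invented scores: two matches, three decoys each
decoys = pandas.DataFrame({
    'match': ['UM_1_a', 'UM_1_a', 'UM_1_a', 'UM_2_b', 'UM_2_b', 'UM_2_b'],
    'total_score': [-350.0, -357.9, -340.2, -150.1, -149.0, -153.3],
})

# Sort ascending (lower Rosetta scores are better), then keep the first row per match
lowest = decoys.sort_values('total_score').drop_duplicates('match').set_index('match')

print(lowest)
assert len(lowest) == decoys['match'].nunique()
```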




In [4]:
filtered = grouped[ 
    ( grouped.all_cst < 1 ) & 
    ( grouped.SR_1_total_score < -1 ) & 
    ( grouped.SR_2_total_score < -1 ) & 
    ( grouped.SR_3_total_score < -1 ) & 
    ( grouped.SR_3_dsasa_1_2 > 0.90 ) 
]

print len( filtered ), 'structures meet criteria'
filtered.to_csv( 'results/filtered.csv' )

13 structures meet criteria
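
With only 13 of 8511 matches passing, it can help to see how many structures each criterion admits on its own before combining them. A sketch with invented values, using the same column names and cutoffs as the filter:

```python
import pandas

# Invented values; column names and cutoffs follow the filter in this notebook
toy = pandas.DataFrame({
    'all_cst':          [0.05, 2.30, 0.10, 0.02],
    'SR_1_total_score': [-2.0, -1.5, -0.5, -3.1],
    'SR_2_total_score': [-1.8, -2.2, -1.1, -1.4],
    'SR_3_total_score': [-4.0, -3.0, -2.0, -0.2],
    'SR_3_dsasa_1_2':   [0.94, 0.99, 0.97, 0.85],
})

criteria = {
    'all_cst < 1':           toy.all_cst < 1,
    'SR_1_total_score < -1': toy.SR_1_total_score < -1,
    'SR_2_total_score < -1': toy.SR_2_total_score < -1,
    'SR_3_total_score < -1': toy.SR_3_total_score < -1,
    'SR_3_dsasa_1_2 > 0.90': toy.SR_3_dsasa_1_2 > 0.90,
}

# How many rows pass each criterion on its own
for label, mask in criteria.items():
    print('{:<24} {} pass'.format(label, int(mask.sum())))

# A structure is kept only if every criterion holds
passing = pandas.Series(True, index=toy.index)
for mask in criteria.values():
    passing &= mask

kept = toy[passing]
print('{} pass all criteria'.format(len(kept)))
```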


Histograms of all the columns in the scorefile 

In [5]:
#sf.hist( linewidth=0, color='k', figsize=( 12, 120 ), bins=50, layout=(41,1) )

Per-scaffold histograms! (Why?)

In [6]:
#sf.total_score.hist(by=df.scaffold, figsize=(21,21), color='k', linewidth=0 )
#plt.tight_layout()

In [7]:
filtered.sample( 12 )

| match |  | total_score | fa_rep | hbond_sc | all_cst | tot_pstat_pm | tot_nlpstat_pm | tot_burunsat_pm | tot_hbond_pm | tot_NLconts_pm | tot_nlsurfaceE_pm | ... | SR_3_fa_rep | SR_3_hbond_sc | SR_3_all_cst | SR_3_interf_E_1_2 | SR_3_dsasa_1_2 | SR_3_hbond_pm | SR_3_burunsat_pm | description | scaffold | match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UM_1_Y27E253_1_1p4c | 7873 | -357.93 | 379.53 | -33.88 | 0.05 | 0.62 | 0.61 | 97 | 308 | 70 | 49.76 | ... | 0.55 | -0.69 | 0.02 | -5.58 | 0.94 | 2 | 0 | UM_1_Y27E253_1_1p4c_11_hydratase_1_0003_0001 | 1p4c | UM_1_Y27E253_1_1p4c |
| UM_50_Y66D6_1_1tuv | 7265 | -153.3 | 39.8 | -5.09 | 0.0 | 0.58 | 0.56 | 15 | 78 | 13 | 48.4 | ... | 0.49 | -0.66 | 0.0 | -5.69 | 0.99 | 1 | 0 | UM_50_Y66D6_1_1tuv_11_hydratase_1_0003_0001 | 1tuv | UM_50_Y66D6_1_1tuv |
| UM_71_Y153D114_1_1tt8 | 1287 | -349.02 | 64.79 | -23.07 | 0.09 | 0.66 | 0.65 | 29 | 150 | 49 | 11.72 | ... | 1.0 | -0.93 | 0.05 | -8.05 | 0.98 | 2 | 0 | UM_71_Y153D114_1_1tt8_11_hydratase_1_0003_0001 | 1tt8 | UM_71_Y153D114_1_1tt8 |
| UM_16_Y396E343_1_1HZV | 7907 | 227.42 | 389.97 | -22.86 | 0.1 | 0.67 | 0.65 | 272 | 356 | 108 | 48.17 | ... | 0.55 | -0.73 | 0.05 | -4.91 | 0.92 | 2 | 0 | UM_16_Y396E343_1_1HZV_11_hydratase_1_0000_0001 | 1HZV | UM_16_Y396E343_1_1HZV |
| UM_54_Y229E253_1_1p4c | 2918 | -422.48 | 352.57 | -36.17 | 0.09 | 0.64 | 0.64 | 90 | 315 | 78 | 51.02 | ... | 0.84 | -0.68 | 0.04 | -2.33 | 0.91 | 2 | 0 | UM_54_Y229E253_1_1p4c_11_hydratase_1_0003_0001 | 1p4c | UM_54_Y229E253_1_1p4c |
| UM_4_Y3D45_1_1ikt | 866 | -252.0 | 51.54 | -8.43 | 0.02 | 0.61 | 0.6 | 17 | 97 | 19 | 3.66 | ... | 0.42 | -0.76 | 0.01 | -4.83 | 0.92 | 1 | 0 | UM_4_Y3D45_1_1ikt_11_hydratase_1_0001_0001 | 1ikt | UM_4_Y3D45_1_1ikt |
| UM_55_Y75D6_1_1tuv | 4537 | -137.2 | 35.93 | -9.15 | 0.02 | 0.56 | 0.56 | 16 | 85 | 9 | 43.73 | ... | 0.23 | -1.38 | 0.01 | -6.04 | 0.95 | 2 | 0 | UM_55_Y75D6_1_1tuv_11_hydratase_1_0002_0001 | 1tuv | UM_55_Y75D6_1_1tuv |
| UM_50_Y147D206_1_1abe | 1426 | -374.16 | 199.63 | -25.11 | 0.05 | 0.71 | 0.71 | 78 | 275 | 58 | 4.57 | ... | 0.68 | -0.72 | 0.03 | -4.83 | 0.97 | 2 | 0 | UM_50_Y147D206_1_1abe_11_hydratase_1_0003_0001 | 1abe | UM_50_Y147D206_1_1abe |
| UM_21_Y93D143_1_1r5l | 3395 | -415.76 | 172.85 | -24.47 | 0.13 | 0.6 | 0.59 | 58 | 228 | 43 | 17.42 | ... | 0.99 | -0.26 | 0.06 | -4.17 | 0.97 | 1 | 0 | UM_21_Y93D143_1_1r5l_11_hydratase_1_0002_0001 | 1r5l | UM_21_Y93D143_1_1r5l |
| UM_14_Y102D146_1_1ghp | 2092 | -360.43 | 175.69 | -25.56 | 0.56 | 0.69 | 0.7 | 50 | 240 | 46 | 2.88 | ... | 0.86 | -0.53 | 0.28 | -5.09 | 0.99 | 1 | 0 | UM_14_Y102D146_1_1ghp_11_hydratase_1_0002_0001 | 1ghp | UM_14_Y102D146_1_1ghp |


In [8]:
# use this when working on the cluster 

# from subprocess import call 

# tar_list = [ 'dnh_aro/enzdes/out/{}.pdb'.format( i ) for i in filtered.description ]
# cmd = [ 'tar', '--create', '--verbose', '--file', 'filtered.tar' ] + tar_list 
# #call( cmd )

In [9]:
# use this when working on local machine 

pull_list = [ 'dnh-aro/enzdes/out/{}.pdb'.format( i ) for i in filtered.description ]
with open( 'results/pull_list', 'w' ) as fn:
    fn.write( '\n'.join( pull_list ) )

#! head pull_list   
! rsync -avz --files-from=results/pull_list $ep:. results/

receiving file list ... done
dnh-aro/enzdes/
dnh-aro/enzdes/out/
dnh-aro/enzdes/out/UM_16_Y396E343_1_1HZV_11_hydratase_1_0000_0001.pdb
dnh-aro/enzdes/out/UM_1_Y27E253_1_1p4c_11_hydratase_1_0003_0001.pdb
dnh-aro/enzdes/out/UM_50_Y66D6_1_1tuv_11_hydratase_1_0003_0001.pdb

sent 98 bytes  received 309340 bytes  618876.00 bytes/sec
total size is 4281588  speedup is 13.84


## Manual curation of structures 

Now that we have run Rosetta simulations and picked a set of design criteria to filter by, it's time for manual curation of the structures. We'll download the 50 or so that look the best, and pick some designs to order. 

One thing that I really like to see when looking at a design is the mutations that were made when compared to the wild type enzyme. Let's diff the designs against their wild types (only those in the filtered list though).

Here, let's try to automate the curation step. Let's write a function that

+ takes a design PDB file name in enzdes/out

and makes a nice PyMOL session with 

+ the wild type loaded and overlaid 
+ mutations colored 

In [10]:
import os
from Bio.PDB.Polypeptide import PPBuilder
from Bio.PDB import PDBParser
import string 

def return_seq( pdb ):
    '''
    Utility function to get a FASTA-like sequence from a PDB file 
    '''
    parser = PDBParser()
    structure = parser.get_structure( pdb[:-4], pdb )
    ppb = PPBuilder() 
    peps = ppb.build_peptides( structure )
    seqs = [ str( pp.get_sequence() ) for pp in peps ] 
    return ''.join( seqs ) # join chain fragments into one string (any gaps are closed up) 

def my_map( design ):
    '''
    Utility function that will fetch the design and WT PDBs from disk, diff them, 
    as well as generate a nice PyMOL script for looking at them later 
    '''
    wild_type_pdb = 'match/scaffold_set/{}_11.pdb'.format( design.split( '_' )[4] )
    design_pdb = 'results/dnh-aro/enzdes/out/{}.pdb'.format( design ) 
    #print os.path.isfile( wild_type_pdb ), wild_type_pdb, os.path.isfile( design_pdb ), design_pdb
    wt_seq = return_seq( wild_type_pdb )
    des_seq = return_seq( design_pdb )
    diff = [ '{}{}{}'.format( i, n+1, j ) for n, (i, j) in enumerate( zip( wt_seq, des_seq ) ) if i != j ]
    string_rep = '+'.join( diff )
    
    with open( 'results/{}.pml'.format( design ), 'w' ) as fn:
        filestring = """
load ~/Documents/dnh-aro/{0}, wt; load ~/Documents/dnh-aro/{1}, design; 
select muts, resi {2} and obj design; util.cbaw muts; util.cnc; 
orient muts; show sticks, organic; remove hydrogens; remove solvent;  
        """.format( 
            wild_type_pdb, 
            design_pdb, 
            string_rep.translate( None, string.letters )
        )
        fn.write( filestring )
    
    return string_rep

filtered['mutations'] = filtered.description.map( my_map )

Exception ignored.
Some atoms or residues may be missing in the data structure.
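
The diff inside `my_map` compares the two sequences position by position, which assumes that the design and the wild type have the same length and that residue numbering starts at 1 with no gaps. A standalone sketch of that step follows, written for Python 3, where `str.translate( None, ... )` and `string.letters` are not available. The sequences are made up, and `diff_sequences` and `resi_selection` are hypothetical names.

```python
import re

def diff_sequences(wt_seq, des_seq):
    """List point mutations as strings like 'A3W' (wild type, position, design)."""
    if len(wt_seq) != len(des_seq):
        raise ValueError('sequences differ in length; align them first')
    return [
        '{}{}{}'.format(wt, n, des)
        for n, (wt, des) in enumerate(zip(wt_seq, des_seq), start=1)
        if wt != des
    ]

def resi_selection(mutations):
    """Turn ['A3W', 'K5E'] into the PyMOL-style residue list '3+5'."""
    return '+'.join(re.sub(r'[A-Za-z]', '', m) for m in mutations)

muts = diff_sequences('MKAYKLV', 'MKWYELV')   # made-up sequences
print('+'.join(muts))        # A3W+K5E
print(resi_selection(muts))  # 3+5
```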

## Create a new notebook with design data that can be used to keep notes 

Use this if you want to make a new notebook with separate cells for each design. 

In [11]:
import nbformat.v4 as nbf

nb = nbf.new_notebook()

header_cell = nbf.new_markdown_cell( '# Design notes\n\n {} designs'.format( len( filtered ) ) )

cells = [ header_cell ]

for i, series in filtered.iterrows():
    
    header = '---\n\n### {}'.format( series.description ) 
    
    info = [ 
        '**Scaffold**: {}\n'.format( series.scaffold ), 
        '**Mutation string**: {}\n'.format( series.mutations ), 
        '**Mutations**: \n\n  1. {}\n'.format( series.mutations.replace( '+', '\n  1. ' ) ),
        '**Scores**: \n\n', 
        '  + `total_score`: {}'.format( series.total_score ),
        '  + Tyrosine `total_score`: {}'.format( series.SR_1_total_score ),
        '  + Glu/Asp `total_score`: {}'.format( series.SR_2_total_score ),
        '  + Ligand `total_score`: {}\n'.format( series.SR_3_total_score ),
        '**Notes**: \n\n --- \n' 
    ]
    
    cmd = [ '! open {}.pml'.format( series.description ) ] 
    
    cell_list = [ 
        nbf.new_markdown_cell( header ), 
        nbf.new_code_cell( cmd ), 
        nbf.new_markdown_cell( '\n'.join( info ) ), 
    ]
    cells += cell_list

nb.cells = cells 

from datetime import datetime 
with open( 'results/results_{}.ipynb'.format( datetime.now() ), 'w') as f:
    f.write( nbf.writes_json( nb ) )
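
`datetime.now()` formats with a space between date and time and colons in the time, so the notebook file name ends up containing both. A sketch of a more shell-friendly name using `strftime` follows. The format string is just one reasonable choice, the date in the example is arbitrary, and `results_name` is a hypothetical helper.

```python
from datetime import datetime

def results_name(now=None):
    """Build a notebook file name with a compact timestamp (no spaces or colons)."""
    now = now or datetime.now()
    return 'results/results_{}.ipynb'.format(now.strftime('%Y%m%d_%H%M%S'))

# Arbitrary fixed date so the output is reproducible
print(results_name(datetime(2016, 1, 1, 12, 0, 0)))  # results/results_20160101_120000.ipynb
```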