# *De novo* design of an allyl hydratase

In this repo, we're designing for a large, aromatic substrate, 2-allylphenol. If we can show that we can get activity on that substrate (and if it's soluble), we'll move on to ethylene. 

## Input file prep 

We prepped our scaffold set, wrote a constraint file, and generated a ligand params file. 

## Match the theozyme into our scaffold set 

### RosettaMatch run on scaffold set (207 proteins)

In [21]:
! ls # we're in the matching directory and we've prepared our input files

enzdes	hydratase.enzdes.cst  list     old-out	prep	      sub.sh
flags	LG1.params	      log.txt  out	scaffold_set  Untitled.ipynb


In [1]:
%%bash 

runs=$( wc -l list | cut -d" " -f1 )
echo $runs runs

#sbatch --array=1-$runs sub.sh

207 runs
Submitted batch job 632172


In [100]:
! squeue -u carlin | head 

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 632586_[2058-1021        gc   enzdes   carlin PD       0:00      1 (Priority)
       632586_2055        gc   enzdes   carlin  R       0:23      1 c0-3.hpc.genomecenter.ucdavis.edu
       632586_2057        gc   enzdes   carlin  R       0:23      1 c4-1.hpc.genomecenter.ucdavis.edu
       632586_2054        gc   enzdes   carlin  R       1:22      1 c0-9.hpc.genomecenter.ucdavis.edu
       632586_2053        gc   enzdes   carlin  R       2:14      1 c4-3.hpc.genomecenter.ucdavis.edu
       632586_2052        gc   enzdes   carlin  R       3:00      1 c7-4.hpc.genomecenter.ucdavis.edu
       632586_2051        gc   enzdes   carlin  R       3:19      1 c0-0.hpc.genomecenter.ucdavis.edu
       632586_2049        gc   enzdes   carlin  R       3:28      1 c0-17.hpc.genomecenter.ucdavis.edu
       632586_2050        gc   enzdes   carlin  R       3:28      1 c0-4.hpc.genomecenter.ucdavis.edu


In [13]:
%%bash 
#clean up after matcher 
#ls UM*pdb | wc -l # number of matches
#mkdir out
#mv UM*pdb out/
#ls out/ | wc -l 

## Submitting our matches to the enzyme design protocol 

Now that we have matched the active site residues and the ligand into various positions in the protein backbones of our scaffold set, we'll use an enzyme design protocol to optimize the ligand placement and its shape complementarity with the protein. 

In [132]:
# nstruct sets how many design runs we do per match (set to 5 below); each job only takes a few minutes, so it's OK if they run overnight 

from glob import glob
nstruct = 5 

runs = [ '-s ../{} -suffix _{}'.format( i, j ) for j in range( nstruct ) for i in glob( 'out/*pdb' ) ]
with open( 'enzdes/list', 'w' ) as fn:
    fn.write( '\n'.join( runs ) )

In [133]:
! wc -l enzdes/list
! head enzdes/list

51059 enzdes/list
-s ../out/UM_73_Y174E18_1_1pvx_11_hydratase_1.pdb -suffix _0
-s ../out/UM_4_Y3D45_1_1ikt_11_hydratase_1.pdb -suffix _0
-s ../out/UM_93_Y168E71_1_1bvv_11_hydratase_1.pdb -suffix _0
-s ../out/UM_38_Y121E124_1_1v0a_11_hydratase_1.pdb -suffix _0
-s ../out/UM_41_Y147D231_1_1abe_11_hydratase_1.pdb -suffix _0
-s ../out/UM_68_Y338E186_1_1tfz_11_hydratase_1.pdb -suffix _0
-s ../out/UM_35_Y511E174_1_1HZU_11_hydratase_1.pdb -suffix _0
-s ../out/UM_30_Y155D79_1_1b9i_11_hydratase_1.pdb -suffix _0
-s ../out/UM_1_Y38E299_1_1c1d_11_hydratase_1.pdb -suffix _0
-s ../out/UM_43_Y168E143_1_1w78_11_hydratase_1.pdb -suffix _0


In [96]:
%%bash

# let's start with doing 100 and making sure everything is OK 
# cd enzdes && sbatch --array=1-100 sub.sh 

# now the rest 
cd enzdes && sbatch --array=101-$( wc -l list | cut -d" " -f1 ) sub.sh 

# now the real
# cd enzdes && sbatch --array=1-$( wc -l list | cut -d" " -f1 ) sub.sh 

Submitted batch job 632586
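
The contents of `sub.sh` are not shown here, so the following is only a sketch of one common pattern: each SLURM array task reads its own line of `enzdes/list` and passes those flags to the design run. The function names `build_run_list` and `flags_for_task` are hypothetical, and the assumption that `sub.sh` selects lines this way is ours.

```python
def build_run_list(match_pdbs, nstruct):
    """Build one flag line per (nstruct index, match), as in the cell above."""
    return ['-s ../{} -suffix _{}'.format(pdb, j)
            for j in range(nstruct)
            for pdb in match_pdbs]


def flags_for_task(run_list, task_id):
    """Return the flag line for a 1-based array task ID (--array=1-N).

    Assumption: sub.sh picks the line whose number equals the task ID.
    """
    if not 1 <= task_id <= len(run_list):
        raise IndexError('task ID {} is outside 1-{}'.format(task_id, len(run_list)))
    return run_list[task_id - 1]


# Hypothetical file names, for illustration only
example_runs = build_run_list(['out/a.pdb', 'out/b.pdb'], nstruct=2)
first_task = flags_for_task(example_runs, 1)
last_task = flags_for_task(example_runs, len(example_runs))
```

Because the list is ordered with the `nstruct` index in the outer loop, submitting only the first 100 tasks as a smoke test covers 100 different matches at suffix `_0` rather than repeated runs of a few matches.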


In [120]:
! squeue -u carlin | head

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)


In [121]:
! tail enzdes/log.txt

protocols.jd2.JobDistributor: UM_45_Y96D29_1_1nwz_11_hydratase_1_0010 reported success in 6036 seconds
protocols.jd2.JobDistributor: no more batches to process... 
protocols.jd2.JobDistributor: 10 jobs considered, 10 jobs attempted in 75168 seconds


In [123]:
! ls enzdes/out/*pdb | wc -l  

22032


## Results 

### Filter the RosettaScripts enzyme design output 

Let's start by looking at histograms of the `total_score` of each of the catalytic residues. Because we're looking at a big ligand, let's also check out the interface energy score. 

In [83]:
! rsync -avz $ep:/share/work/alex/dnh_aro/enzdes/out/*sc designs/

receiving file list ... done

sent 20 bytes  received 147 bytes  111.33 bytes/sec
total size is 2084077  speedup is 12479.50


In [84]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas
from glob import glob 

dfs = [ pandas.read_csv( i, sep=r'\s+' ) for i in glob( 'designs/*sc' ) ]
df = pandas.concat( dfs ).dropna()
df['scaffold'] = df.description.str.split( '_' ).str.get( 4 ) 
df['match'] = df.description.str.split( '_11' ).str[0]

print len( df ), 'enzdes out'
print len( df.scaffold.unique() ), 'scaffolds'
print len( df.match.unique() ), 'unique matches'

# sf = pandas.read_csv( 'enzdes/out/score.sc', sep='\s+' )
# sf.index = sf.description
# sf['scaffold'] = sf.index.str.split( '_' ).str[4]
# sf['match'] = sf.index.str.split( '_11' ).str[0]
# sf.sample( 3 )

# print len( sf ), 'enzdes out'
# print len( sf.scaffold.unique() ), 'scaffolds'
# print len( sf.match.unique() ), 'unique matches'

2774 enzdes out
173 scaffolds
2402 unique matches
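
The `scaffold` and `match` columns above come from splitting `description` on `_` and on `_11`. One thing to keep in mind is that `str.split( '_11' )` also splits a name that begins `UM_11_...`, so such names would collapse to `UM`. A regex anchored on the `_11_hydratase` suffix avoids that. The sketch below assumes the name layout inferred from the file names shown in this notebook; the field names (`res1`, `res2`, `suffix`, `index`) and the function `parse_description` are hypothetical.

```python
import re

# Assumed layout, inferred from the names in this notebook:
# UM_<n>_<res1><res2>_<k>_<scaffold>_11_hydratase_<m>[_<suffix>_<index>]
DESCRIPTION_PATTERN = re.compile(
    r'^(?P<match>UM_\d+_(?P<res1>[A-Z]\d+)(?P<res2>[A-Z]\d+)_\d+_'
    r'(?P<scaffold>[0-9A-Za-z]{4}))'
    r'_11_hydratase_\d+(?:_(?P<suffix>\d+)_(?P<index>\d+))?$'
)


def parse_description(description):
    """Split a design or match name into its fields, or raise ValueError."""
    found = DESCRIPTION_PATTERN.match(description)
    if found is None:
        raise ValueError('unrecognized name: {}'.format(description))
    return found.groupdict()


parsed = parse_description('UM_12_Y182E142_1_1AYU_11_hydratase_1_3_0001')

# Hypothetical name starting with UM_11, where the naive split goes wrong
tricky = 'UM_11_Y10D20_1_1abc_11_hydratase_1_0_0001'
naive_match = tricky.split('_11')[0]
regex_match = parse_description(tricky)['match']
```

If any match numbers in the real data do begin with `11`, the unique-match count from the naive split would be an undercount, which is cheap to check by comparing the two approaches on `df.description`.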


In [85]:
# get lowest 1 of each nstruct 

def lowest( df ):
    return df.sort( 'total_score' ).head( 1 )

grouped = df.groupby( 'match' ).apply( lowest )
print len( grouped ), 'lowest from nstruct groups (should equal number of matches)'

2402 lowest from nstruct groups (should equal number of matches)




In [86]:
filtered = grouped[ 
    ( grouped.all_cst < 1 ) & 
    ( grouped.SR_1_total_score < 0 ) & 
    ( grouped.SR_2_total_score < 0 ) & 
    ( grouped.SR_3_total_score < 0 ) & 
    ( grouped.SR_3_dsasa_1_2 > 0.9 ) 
]

print len( filtered ), 'structures meet criteria'

43 structures meet criteria
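
The two-step selection above (keep the lowest `total_score` structure per match, then apply the score cutoffs) can be sketched without pandas as follows. The column names and thresholds are taken from the cells above; the three example rows are made up for illustration and are not real results.

```python
def lowest_per_match(rows):
    """Keep the row with the lowest total_score for each match."""
    best = {}
    for row in rows:
        key = row['match']
        if key not in best or row['total_score'] < best[key]['total_score']:
            best[key] = row
    return list(best.values())


def passes_filters(row):
    """Same cutoffs as the pandas filter above."""
    return (row['all_cst'] < 1
            and row['SR_1_total_score'] < 0
            and row['SR_2_total_score'] < 0
            and row['SR_3_total_score'] < 0
            and row['SR_3_dsasa_1_2'] > 0.9)


def make_row(match, total_score, all_cst):
    # Hypothetical helper that fills in passing values for the other columns
    return {'match': match, 'total_score': total_score, 'all_cst': all_cst,
            'SR_1_total_score': -1.0, 'SR_2_total_score': -1.0,
            'SR_3_total_score': -1.0, 'SR_3_dsasa_1_2': 0.95}


toy_rows = [
    make_row('m1', -10.0, 0.1),  # passes the cutoffs, but is not the lowest for m1
    make_row('m1', -20.0, 2.0),  # lowest for m1, but fails all_cst < 1
    make_row('m2', -5.0, 0.1),
]
toy_filtered = [row for row in lowest_per_match(toy_rows) if passes_filters(row)]
```

In the toy data, match `m1` is dropped even though one of its structures satisfies every cutoff, because the lowest-energy structure is chosen before filtering. Filtering first and then taking the lowest per match would keep it; which order is preferable is a design choice.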


Histograms of all the columns in the scorefile 

In [87]:
#sf.hist( linewidth=0, color='k', figsize=( 12, 120 ), bins=50, layout=(41,1) )

Per-scaffold histograms! (Why?)

In [88]:
#sf.total_score.hist(by=df.scaffold, figsize=(21,21), color='k', linewidth=0 )
#plt.tight_layout()

In [89]:
filtered

| match | | total_score | fa_rep | hbond_sc | all_cst | tot_pstat_pm | tot_nlpstat_pm | tot_burunsat_pm | tot_hbond_pm | tot_NLconts_pm | tot_nlsurfaceE_pm | ... | SR_3_fa_rep | SR_3_hbond_sc | SR_3_all_cst | SR_3_interf_E_1_2 | SR_3_dsasa_1_2 | SR_3_hbond_pm | SR_3_burunsat_pm | description | scaffold | match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UM_12_Y182E142_1_1AYU | 110 | -295.41 | 138.63 | -15.5 | 0.09 | 0.72 | 0.7 | 71 | 178 | 50 | 1.47 | ... | 0.89 | -0.69 | 0.04 | -5.38 | 0.99 | 1 | 0 | UM_12_Y182E142_1_1AYU_11_hydratase_1_3_0001 | 1AYU | UM_12_Y182E142_1_1AYU |
| UM_136_Y165E129_1_1qv0 | 691 | -382.48 | 86.8 | -17.09 | 0.17 | 0.67 | 0.65 | 37 | 188 | 35 | 6.16 | ... | 1.06 | -0.41 | 0.08 | -5.1 | 0.98 | 1 | 0 | UM_136_Y165E129_1_1qv0_11_hydratase_1_0_0001 | 1qv0 | UM_136_Y165E129_1_1qv0 |
| UM_13_Y102D146_1_1ghp | 741 | -351.68 | 172.41 | -25.21 | 0.17 | 0.72 | 0.68 | 51 | 236 | 47 | 3.44 | ... | 0.56 | -0.43 | 0.08 | -5.74 | 0.98 | 1 | 0 | UM_13_Y102D146_1_1ghp_11_hydratase_1_0_0001 | 1ghp | UM_13_Y102D146_1_1ghp |
| UM_13_Y19E52_1_1oho | 840 | -239.01 | 44.67 | -9.54 | 0.08 | 0.67 | 0.65 | 19 | 108 | 19 | 12.87 | ... | 1.17 | -0.48 | 0.04 | -4.71 | 0.97 | 1 | 0 | UM_13_Y19E52_1_1oho_11_hydratase_1_0_0001 | 1oho | UM_13_Y19E52_1_1oho |
| UM_13_Y39E109_1_1nwz | 410 | -237.95 | 45.59 | -9.43 | 0.25 | 0.67 | 0.68 | 24 | 99 | 29 | 3.6 | ... | 0.48 | -0.89 | 0.12 | -5.4 | 0.98 | 2 | 0 | UM_13_Y39E109_1_1nwz_11_hydratase_1_1_0001 | 1nwz | UM_13_Y39E109_1_1nwz |
| UM_13_Y44D175_1_1re8 | 607 | -549.28 | 137.74 | -26.11 | 0.04 | 0.67 | 0.65 | 101 | 275 | 79 | 33.14 | ... | 1.34 | -0.38 | 0.02 | -1.99 | 0.95 | 1 | 0 | UM_13_Y44D175_1_1re8_11_hydratase_1_1_0001 | 1re8 | UM_13_Y44D175_1_1re8 |
| UM_14_Y89E115_1_1r5l | 208 | -277.74 | 328.45 | -20.68 | 0.01 | 0.59 | 0.6 | 56 | 229 | 48 | 15.72 | ... | 0.54 | 0.0 | 0.01 | -5.64 | 0.91 | 0 | 0 | UM_14_Y89E115_1_1r5l_11_hydratase_1_2_0001 | 1r5l | UM_14_Y89E115_1_1r5l |
| UM_15_Y165D247_1_1fp2 | 357 | -509.39 | 216.24 | -19.91 | 0.01 | 0.63 | 0.62 | 66 | 290 | 58 | 132.34 | ... | 0.4 | -0.42 | 0.0 | -5.01 | 0.95 | 1 | 0 | UM_15_Y165D247_1_1fp2_11_hydratase_1_0_0001 | 1fp2 | UM_15_Y165D247_1_1fp2 |
| UM_17_Y24E81_1_1tuv | 313 | -123.79 | 51.38 | -5.22 | 0.01 | 0.61 | 0.63 | 13 | 77 | 11 | 49.41 | ... | 0.75 | -0.02 | 0.0 | -4.87 | 0.91 | 1 | 0 | UM_17_Y24E81_1_1tuv_11_hydratase_1_2_0001 | 1tuv | UM_17_Y24E81_1_1tuv |
| UM_1_Y23D113_1_1d3g | 305 | -603.15 | 193.28 | -39.28 | 0.02 | 0.67 | 0.65 | 84 | 336 | 63 | 8.75 | ... | 0.95 | 0.0 | 0.01 | -3.88 | 0.94 | 0 | 1 | UM_1_Y23D113_1_1d3g_11_hydratase_1_2_0001 | 1d3g | UM_1_Y23D113_1_1d3g |


In [90]:
# use this when working on the cluster 

# from subprocess import call 

# tar_list = [ 'dnh_aro/enzdes/out/{}.pdb'.format( i ) for i in filtered.description ]
# cmd = [ 'tar', '--create', '--verbose', '--file', 'filtered.tar' ] + tar_list 
# #call( cmd )

# use this when working on local machine 

pull_list = [ 'dnh_aro/enzdes/out/{}.pdb'.format( i ) for i in filtered.description ]
with open( 'results/pull_list', 'w' ) as fn:
    fn.write( '\n'.join( pull_list ) )

#! head results/pull_list   
! rsync -avz --files-from=results/pull_list $ep:. results/

receiving file list ... done
dnh_aro/
dnh_aro/enzdes/
dnh_aro/enzdes/out/
dnh_aro/enzdes/out/UM_12_Y182E142_1_1AYU_11_hydratase_1_3_0001.pdb
dnh_aro/enzdes/out/UM_136_Y165E129_1_1qv0_11_hydratase_1_0_0001.pdb
dnh_aro/enzdes/out/UM_13_Y102D146_1_1ghp_11_hydratase_1_0_0001.pdb
dnh_aro/enzdes/out/UM_13_Y19E52_1_1oho_11_hydratase_1_0_0001.pdb
dnh_aro/enzdes/out/UM_13_Y39E109_1_1nwz_11_hydratase_1_1_0001.pdb
dnh_aro/enzdes/out/UM_13_Y44D175_1_1re8_11_hydratase_1_1_0001.pdb
dnh_aro/enzdes/out/UM_14_Y89E115_1_1r5l_11_hydratase_1_2_0001.pdb
dnh_aro/enzdes/out/UM_15_Y165D247_1_1fp2_11_hydratase_1_0_0001.pdb
dnh_aro/enzdes/out/UM_17_Y24E81_1_1tuv_11_hydratase_1_2_0001.pdb
dnh_aro/enzdes/out/UM_1_Y23D113_1_1d3g_11_hydratase_1_2_0001.pdb
dnh_aro/enzdes/out/UM_1_Y27E253_1_1p4c_11_hydratase_1_0_0001.pdb
dnh_aro/enzdes/out/UM_22_Y23D132_1_1qv0_11_hydratase_1_0_0001.pdb
dnh_aro/enzdes/out/UM_24_Y83E128_1_1f5j_11_hydratase_1_1_0001.pdb
dnh_aro/enzdes/out/UM_26_Y46E202_1_1upv_11_hydratase_1_0_0001.pdb
d

## Manual curation of structures 

Now that we have run Rosetta simulations and picked a set of design criteria to filter by, it's time for manual curation of the structures. We'll download the 50 or so that look the best, and pick some designs to order. 

One thing that I really like to see when looking at a design is the mutations that were made when compared to the wild type enzyme. Let's diff the designs against their wild types (only those in the filtered list though).
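
As a minimal sketch of that diff, assuming the design and wild-type sequences are the same length and already aligned position by position (true when the design only substitutes residues), mutations can be written in the usual `K2R` form. The function name `diff_sequences` and the example sequences are hypothetical.

```python
def diff_sequences(wt_seq, design_seq):
    """List substitutions as '<wt><position><design>', with 1-based positions.

    Positions count along the sequence, which is only the same as PDB
    residue numbering if the structure is numbered from 1 without gaps.
    """
    if len(wt_seq) != len(design_seq):
        raise ValueError('sequences differ in length; align them first')
    return ['{}{}{}'.format(wt, position, des)
            for position, (wt, des) in enumerate(zip(wt_seq, design_seq), start=1)
            if wt != des]


# Made-up sequences for illustration
example_mutations = diff_sequences('MKTAY', 'MRTAF')
example_string = '+'.join(example_mutations)
```

Raising on a length mismatch is a deliberate choice here: `zip` alone would silently truncate to the shorter sequence and could hide a chain break or a missing residue.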

## Automated manual curation

Here, let's try to automate the curation step. Let's write a function that

+ takes a design PDB file name in enzdes/out

and makes a nice PyMOL session with 

+ the wild type loaded and overlaid 
+ mutations colored 
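
To color the mutations, the PyMOL script needs a `resi` selection built from the mutated positions. Here is a small sketch that turns a mutation string such as `K2R+Y5F` into a `select` command, using `re.sub` so that it runs on both Python 2 and Python 3. The function name and its `obj` and `name` parameters are hypothetical; the `select muts, resi ... and obj design` form mirrors the script template used in this notebook.

```python
import re


def pymol_mutation_selection(mutations, obj='design', name='muts'):
    """Build a PyMOL select command from a string like 'K2R+Y5F'."""
    # Strip the one-letter residue codes, keeping only the position numbers
    positions = [re.sub(r'[A-Za-z]', '', mutation)
                 for mutation in mutations.split('+') if mutation]
    if not positions:
        # No mutations: select nothing rather than emit an empty 'resi'
        return 'select {}, none'.format(name)
    return 'select {}, resi {} and obj {}'.format(name, '+'.join(positions), obj)


selection_command = pymol_mutation_selection('K2R+Y5F')
empty_command = pymol_mutation_selection('')
```

Handling the empty case explicitly matters for designs that turn out identical to the wild type, where an empty `resi` term would otherwise produce an invalid selection.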

In [91]:
import os
from Bio.PDB.Polypeptide import PPBuilder
from Bio.PDB import PDBParser
import string 

def return_seq( pdb ):
    '''
    Utility function to get a FASTA-like sequence from a PDB file 
    '''
    parser = PDBParser()
    structure = parser.get_structure( pdb[:-4], pdb )
    ppb = PPBuilder() # lol why don't these have PDBParser( 'XZY1.pdb' )-style constructors? 
    for pp in ppb.build_peptides( structure ):
        sequence = pp.get_sequence()
        return sequence 

def my_map( design ):
    '''
    Utility function that will fetch the design and WT PDBs from disk, diff them, 
    as well as generate a nice PyMOL script for looking at them later 
    '''
    wild_type_pdb = 'scaffold_set/{}_11.pdb'.format( design.split( '_' )[4] )
    design_pdb = 'results/dnh_aro/enzdes/out/{}.pdb'.format( design )
    #print os.path.isfile( wild_type_pdb ), wild_type_pdb, os.path.isfile( design_pdb ), design_pdb
    wt_seq = return_seq( wild_type_pdb )
    des_seq = return_seq( design_pdb )
    diff = [ '{}{}{}'.format( i, n+1, j ) for n, (i, j) in enumerate( zip( wt_seq, des_seq ) ) if i != j ]
    string_rep = '+'.join( diff )
    
    with open( 'results/{}.pml'.format( design ), 'w' ) as fn:
        filestring = """
load ~/Documents/dnh-aro/{0}, wt; load ~/Documents/dnh-aro/{1}, design;  
clean; orient resn LG1; stored.design_ca = []; stored.wt_ca = []; 
iterate obj design and name ca, stored.design_ca.append( resn ); 
iterate obj wt and name ca, stored.wt_ca.append( resn ) ; 
select muts, resi {2} and obj design; util.cbaw muts; util.cnc; 
        """.format( 
            wild_type_pdb, 
            design_pdb, 
            string_rep.translate( None, string.letters )
        )
        fn.write( filestring )
    
    # print out some nice markdown we can use in the next cell lol 
#     print '### {}\n'.format( design ) 
#     print '**Scaffold**: {} \n\n **Mutations**: {}\n'.format( design.split( '_' )[4], string_rep )
#     match_series = filtered.loc[ design.split( '_11' )[0] ]
#     print '  + total_score: {}\n'.format( match_series.total_score.values[0] )
#     print '  + Tyrosine score: {}\n'.format( match_series.SR_1_total_score.values[0] )
#     print '  + Glu/Asp score: {}\n'.format( match_series.SR_2_total_score.values[0] )
#     print '  + Ligand score: {}\n'.format( match_series.SR_3_total_score.values[0] )
#     print '**Notes**: \n'
#     print '---'
#     print ''
    
    return string_rep
    
filtered['mutations'] = filtered.description.map( my_map )

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


## Create a new notebook with design data that can be used to keep notes 

Use this if you want to make a new notebook with separate cells for each design 

In [92]:
from nbformat import current as nbf

nb = nbf.new_notebook()

header_cell = nbf.new_text_cell( 'markdown', '# Design notes\n\n {} designs'.format( len( filtered ) ) )

cells = [ header_cell ]

for i, series in filtered.iterrows():
    
    info = [ 
        '### {}'.format( series.description ), 
        '**Scaffold**: {}\n'.format( series.scaffold ), 
        '**Mutations**: {}\n'.format( series.mutations ),
        '  + `total_score`: {}'.format( series.total_score ),
        '  + Tyrosine `total_score`: {}'.format( series.SR_1_total_score ),
        '  + Glu/Asp `total_score`: {}'.format( series.SR_2_total_score ),
        '  + Ligand `total_score`: {}\n'.format( series.SR_3_total_score ),
        '**Notes**: \n\n --- \n' 
    ]
    
    cell = nbf.new_text_cell( 'markdown', '\n'.join( info ) )
    cells.append( cell ) 

nb['worksheets'].append(nbf.new_worksheet(cells=cells))

from datetime import datetime 

with open( 'results/results_{}.ipynb'.format( datetime.now().strftime( '%Y-%m-%d_%H%M%S' ) ), 'w') as f:
    nbf.write(nb, f, 'ipynb')
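
The per-design Markdown built in the loop above can also be factored into a small function, which makes it easy to check that each score line reads the intended column (`SR_1` for the tyrosine, `SR_2` for the Glu/Asp, `SR_3` for the ligand). This is a sketch only: the function name `design_note_markdown` is hypothetical, it takes a plain dict rather than a pandas row, and the values in the example are made up.

```python
def design_note_markdown(design):
    """Return the Markdown text for one design's notes cell."""
    lines = [
        '### {}'.format(design['description']),
        '**Scaffold**: {}\n'.format(design['scaffold']),
        '**Mutations**: {}\n'.format(design['mutations']),
        '  + `total_score`: {}'.format(design['total_score']),
        '  + Tyrosine `total_score`: {}'.format(design['SR_1_total_score']),
        '  + Glu/Asp `total_score`: {}'.format(design['SR_2_total_score']),
        '  + Ligand `total_score`: {}\n'.format(design['SR_3_total_score']),
        '**Notes**: \n\n --- \n',
    ]
    return '\n'.join(lines)


# Made-up values for illustration only
example_design = {
    'description': 'UM_0_Y1E2_1_0xyz_11_hydratase_1_0_0001',
    'scaffold': '0xyz',
    'mutations': 'K2R+Y5F',
    'total_score': -100.0,
    'SR_1_total_score': -1.5,
    'SR_2_total_score': -2.5,
    'SR_3_total_score': -3.5,
}
example_note = design_note_markdown(example_design)
```

The returned string can be passed to whatever cell constructor the installed `nbformat` version provides.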