# This notebook contains the pipeline for docking proteins onto DNA targets

This notebook will take you through the first stage of designing DNA-binding proteins. <br>
The goal here is to create models that have protein backbones docked in reasonable positions onto DNA targets of interest. <br>
The two major steps of this process are Rifgen and Rifdock. <br>
More details about how these programs work and what we have to do to run them are provided throughout the notebook. <br> <br>

## Step 0: Imports and variable initialization (MUST RUN EVERY TIME)

### Imports

In [2]:
### DOCK-00

###########################
#### GENERAL UTILITIES ####
###########################
# sys, os, glob, shutil and subprocess let us work with paths, files and external programs, similar to running commands directly on the bash command line
import sys, os, time, glob, random, shutil, subprocess, math, re
from subprocess import Popen, PIPE
import numpy as np
import pandas as pd

# silent_tools is a custom-built library used for handling silent files, which can store many structures in a single compressed file
sys.path.append('../software/silent_tools/') # This is the path to your silent_tools installation.
import silent_tools

# This is just to ignore some warnings that we don't care about
import warnings
warnings.filterwarnings('ignore')

print('Imported required utilities')
#######################################
#######################################

Imported required utilities


In [6]:
### DOCK-01

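# This is the path to the version of Rosetta to use. For now, we are just using the latest version
rosetta_db = '/software/rosetta/latest/database' # change this to your rosetta installation
rosetta_scripts = '/software/rosetta/latest/bin/rosetta_scripts.hdf5.linuxgccrelease' # change this to your rosetta installation
# Later cells read helper scripts (e.g. align_vdm_w_lig_atoms_A.py) from software_path, so it has to be defined here
software_path = '' # change this to the directory holding your helper scripts, including the trailing slash
print('Set paths to Rosetta')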


#######################################################################################
#### USER INPUT : Update this with some identifier for your binder design campaign ####
#######################################################################################
ID = 'ID'
#######################################################################################
################################### END USER INPUT ####################################
#######################################################################################


# This defines the path to your working directory for rifdock. Everything will go in here
rifdock_path = './rifdock_output/'
# Now we create the directory we need, as well as any other directories above it in the path
os.makedirs(rifdock_path, exist_ok=True)
print(f'Created working directory in scratch space for rifdock: {rifdock_path}')

Set paths to Rosetta
Created working directory in scratch space for rifdock: ./rifdock_output/


### Collect DNA binder scaffolds

When we design DNA-binding proteins, we start with a large set of protein "scaffolds", which are well-folded proteins of a desirable shape for binding DNA. <br>
The mentors have already created many scaffolds, so you don't need to worry about that step. This cell will just collect a list of those pre-existing scaffolds. <br>
If you're interested in how we made these scaffolds, reach out on Mattermost!

In [6]:
### DOCK-02

############################################################################
##### UPDATE this path with the path to your scaffolds #####
############################################################################
master_scaffold_list = glob.glob('scaffold_pdbs/*pdb')
############################################################################

# Write one scaffold path per line, so later steps can read the list from a file
scaffold_list = 'scaffolds.list'
with open(scaffold_list,'w') as f:
    for pdb in master_scaffold_list:
        f.write(pdb+'\n')

# Finally, print our count to the screen to see how many we have. It should be 26216, as of SEP22.
n_scaffolds = len(master_scaffold_list)
print(f'Docking with {n_scaffolds} scaffolds, listed in {scaffold_list}')

Docking with 26216 scaffolds, listed in /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock/scaffolds.list


### Define target sequences:


In [4]:
### DOCK-03

# This notebook is set up to be able to dock proteins onto multiple DNA targets at the same time.
# Here you will need to fill your full sequence in. 

########################################
#### Fill in your assigned sequence ####
########################################
seqs = ['CGCACCGACTCACG']
########################################
########################################

In [5]:
### DOCK-04

# DNA, as you know, forms a double helix -- but you only input the sequence for one strand!
# This cell will automatically compute the reverse complement sequence, based on Watson-Crick base pairing rules
# You'll never need to painfully write out the reverse sequence again!

# This dictionary is an easy way to map out the complementary base to each input
pairings = {'G':'C', 'A':'T', 'C':'G', 'T':'A'}

# We're going to make a list holding the reverse complement of each input sequence, in the same order as `seqs`
# First, we initialize an empty list
seqs_revcomp = []
for seq in seqs:
    
    # Then, we build up the new sequence:
    # We work forward through the original sequence, and
    # at each step we add the complementary base to the *beginning* of the new sequence
    # note that strings can be concatenated in python just using the "+" operator
    new_seq = ''
    for base in seq:
        new_seq = pairings[base] + new_seq
    
    # Finally, we add the new sequence to the new list
    seqs_revcomp.append(new_seq)
       
    # Confirm that this code works and try to understand why it does! Can you code another way to do the same thing?
    print('-'*40)
    print('Forward sequence:\t',seqs)
    print('Reverse sequence:\t',seqs_revcomp)

----------------------------------------
Forward sequence:	 ['CGCACCGACTCACG']
Reverse sequence:	 ['CGTGAGTCGGTGCG']
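
One possible answer to the question in the cell above, as a minimal sketch: translate every base to its complement, then reverse the string with a slice. The name `reverse_complement` is ours for illustration and is not used elsewhere in the pipeline.

```python
# Map each base to its Watson-Crick partner in a single pass
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    # translate() swaps every base, [::-1] reverses the resulting string
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement('CGCACCGACTCACG'))  # CGTGAGTCGGTGCG
```

This gives the same result as the loop in DOCK-04, so `[reverse_complement(s) for s in seqs]` could stand in for `seqs_revcomp`.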


## Step 1: Generate and Prepare Targets (only need to run once)

### Generate target structures

Now, we need to create a 3D structure for the desired DNA sequence. <br>
For this, one option is to use X3DNA. However, you can also substitute any other PDB structure of your DNA target. <br>

In [17]:
### DOCK-05

path_to_x3dna = '' # change this to your X3DNA installation directory (the folder that contains bin/)
os.environ['X3DNA'] = path_to_x3dna
# Prepend the X3DNA bin directory to the PATH environment variable so its executables can be found
os.environ['PATH'] = f'{path_to_x3dna}/bin:' + os.environ.get('PATH', '')
print('Added X3DNA path to the PATH environment variable')
# ...and this line specifies the path to the actual executable, called `fiber`
x3dna_path = f'{path_to_x3dna}/bin/fiber'
# This function generates ideal DNA from a given sequence, based on the original structures
# These structures were solved not from crystals but from fibers, where the DNA sequence was ambiguous
# We will be using the most famous DNA structure, ideal B-form

outdir=f'{rifdock_path}/target_seqs'
os.makedirs(outdir,exist_ok=True)

for seq in seqs:
    cmd = f'{x3dna_path} -seq={seq} -rep=1 -b {outdir}/{seq}.pdb'
    print(f'Running "{cmd}" with a subprocess')
    # -seq is a flag which takes the sequence of the DNA, -rep is the number of repeats of the sequence to generate, 
    # -b specifies B-form, and the final argument is the output path

    # the split() command divides a string into a list of its substrings separated by a space
    args = cmd.split()
    print(f'Output file: {args[-1]}')
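    # subprocess.Popen() runs an external program, such as a C++ executable, as a subprocess
    process = Popen(args,stdout=PIPE, stderr=PIPE)
    out, err = process.communicate()
    # Report a failure instead of silently continuing without a target structure
    if process.returncode != 0:
        print(f'fiber failed for {seq}:\n{err.decode()}')
    # This "process" is a subprocess running the executable command. You could also have copied the command into a bash terminal and run it that way.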

Added X3DNA path to the PATH environment variable
Running "/mnt/home/cjg263/software/x3dna-v2.4/bin/fiber -seq=AATTGGTCTGCGCACCAGCA -rep=1 -b /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock/target_seqs/AATTGGTCTGCGCACCAGCA.pdb" with a subprocess
Output file: /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock/target_seqs/AATTGGTCTGCGCACCAGCA.pdb
Running "/mnt/home/cjg263/software/x3dna-v2.4/bin/fiber -seq=GCACCGCCCGCGAGCCAACC -rep=1 -b /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock/target_seqs/GCACCGCCCGCGAGCCAACC.pdb" with a subprocess
Output file: /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock/target_seqs/GCACCGCCCGCGAGCCAACC.pdb
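
The hotspot cells further down assume that each target pdb is numbered continuously, starting at residue 1. A minimal sketch of how that could be checked is given here; the helper names (`fake_atom`, `residue_numbers`, `is_continuous_from_one`) are hypothetical, and the toy lines stand in for a real file. The residue number is read from columns 23-26 of each ATOM/HETATM record, as laid out in the PDB format.

```python
def fake_atom(serial, name, resname, chain, resnum):
    """Build a toy PDB ATOM line (all coordinates zero) for demonstration only."""
    return (f'ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resnum:4d}'
            f'    {0.0:8.3f}{0.0:8.3f}{0.0:8.3f}')

def residue_numbers(pdb_lines):
    """Collect residue numbers in file order, one entry per run of identical numbers."""
    numbers = []
    for line in pdb_lines:
        if line.startswith(('ATOM', 'HETATM')):
            resnum = int(line[22:26])  # PDB columns 23-26 hold the residue number
            if not numbers or numbers[-1] != resnum:
                numbers.append(resnum)
    return numbers

def is_continuous_from_one(numbers):
    """True if the residues are numbered 1, 2, 3, ... with no gaps or restarts."""
    return numbers == list(range(1, len(numbers) + 1))

# Toy example: two residues on chain A followed by one on chain B
toy = [
    fake_atom(1, 'P', 'DC', 'A', 1),
    fake_atom(2, "C1'", 'DC', 'A', 1),
    fake_atom(3, 'P', 'DG', 'A', 2),
    fake_atom(4, 'P', 'DC', 'B', 3),
]
print(residue_numbers(toy))                           # [1, 2, 3]
print(is_continuous_from_one(residue_numbers(toy)))   # True
```

For a real target you would pass an open file instead of the toy list, e.g. `with open(path) as f: residue_numbers(f)`.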


### Generate hotspots

Now that we have the target structures made, we will also need to specify exactly where on that target we want our proteins to bind. <br>
To do this, look at where your assigned 3mer lies in your assigned target sequence and find the corresponding residue numbers in the pdb. <br>
Note that you need to get the numbers for both the forward and reverse strand. <br>
I recommend opening up your pdbs in pymol to figure this out. You should be able to find the paths in the commands printed above. <br>
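
If you would like to cross-check the numbers you read off in pymol, here is a minimal sketch that computes them from the sequence. It assumes strand 1 is numbered 1..N and strand 2 is numbered N+1..2N, both 5'->3' (the same convention the hotspot cells below rely on when they index `seq + revcomp`). The function name `site_residue_numbers` and the example 3mer `GAC` are hypothetical.

```python
def site_residue_numbers(seq, motif):
    """Return (strand1, strand2) residue numbers for the first occurrence of motif in seq.

    Numbers are 1-indexed. Strand 2 residues are the base-pairing partners of the
    strand 1 residues, assuming strand 2 is numbered N+1..2N in its own 5'->3' direction.
    """
    n = len(seq)
    i = seq.find(motif)  # 0-indexed position of the motif, or -1 if absent
    if i == -1:
        raise ValueError(f'{motif} not found in {seq}')
    strand1 = list(range(i + 1, i + len(motif) + 1))
    # Residue r on strand 1 pairs with residue 2N + 1 - r on strand 2
    strand2 = sorted(2 * n + 1 - r for r in strand1)
    return strand1, strand2

s1, s2 = site_residue_numbers('CGCACCGACTCACG', 'GAC')
print(s1)  # [7, 8, 9]
print(s2)  # [20, 21, 22]
```

Always confirm the result against the actual pdb, since a structure from another source may be numbered differently.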

In [26]:
### HOT-00

# Collect pose info from DNA target PDBs. Make sure that the pdb is numbered continuously, starting at residue 1.
# You need to specify the length of the double-stranded region and range of base-pairs that you want to target. 

pdbs = glob.glob(f'{rifdock_path}/target_seqs/*.pdb')
print(f'List of dsDNA .pdbs: {pdbs}')

strand1_hotspots = {}
strand2_hotspots = {}
######################################################################
#### User input: Fill in the residue numbers for your target site ####
######################################################################
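# If you want to assign different target sites for different sequences,
# then remove the for loop and input them manually.
# Define `start` and `end` (1-indexed residue numbers on strand 1) before running this cell;
# note that range(start, end) includes `start` but excludes `end`.
for seq in seqs:
    n = len(seq)     # strand 1 is residues 1..n, strand 2 is residues n+1..2n
    strand1_hotspots[seq] = list(range(start, end))     # Usually exclude the first and last 2 bases of the target sequence
    # Residue r on strand 1 pairs with residue 2n+1-r on strand 2, because strand 2 is numbered 5'->3' after strand 1
    strand2_hotspots[seq] = sorted(2*n + 1 - r for r in strand1_hotspots[seq])     # Target range is the complementary bases on the other strand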
######################################################################
######################################################################

# Write the target residues to a list, so the code can access it later.
for seq in seqs:
    target_resis = strand1_hotspots[seq] + strand2_hotspots[seq]
    with open(f'{rifdock_path}/target_res_{seq}.list', 'w') as f:
        # Write one residue number per line; look up "list comprehension" if you want to know more about how the next line works!
        f.write('\n'.join([str(x) for x in target_resis]) + '\n')
print(f'List of hotspot target residues: {target_resis}')

List of dsDNA .pdbs: ['/net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock/target_seqs/AATTGGTCTGCGCACCAGCA.pdb', '/net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock/target_seqs/GCACCGCCCGCGAGCCAACC.pdb']
List of hotspot target residues: [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38]


The next few cells are going to generate "hotspots", which are specific locations on the target where we look for special interactions we think will help binding. <br>
Specifically, we consider all ADENINE and GUANINE bases in the target site of the DNA. 
Then, we align bidentate-hydrogen-bonding residues from known DNA-binding proteins onto the target DNA.
The positions where those protein residues end up are the hotspots used by rifdock to place our scaffolds onto the target.

In [27]:
### HOT-01

# First, we just need to identify the residue numbers of all ADE and GUA nucleotides in the structure. 

for i, seq in enumerate(seqs):
    # Combine the sequences of both strands
    full_seq = seqs[i] + seqs_revcomp[i]
    # Look up the target residues for this particular sequence
    target_resis = strand1_hotspots[seq] + strand2_hotspots[seq]
  
    # Use a list comprehension with an "if" clause to pull out just the sites where there's an ADE
    ADE_hotspot_resis = [str(res) for res in target_resis if full_seq[res-1] == 'A']
    print(f'Found hotspots (adenine residues) at the following positions for {seq}: {ADE_hotspot_resis}')
    # python is 0-indexed, while the pdb numbering starts at 1, hence full_seq[res-1]
    # (i.e. to get the first letter of a string, you use `string[0]`)
    
    # Save the list to a file so we can use it down the road
    with open(f'{rifdock_path}/ADE_hotspot_target_res_{seq}.list', 'w') as f:
        f.write('\n'.join(ADE_hotspot_resis) + '\n')
        
    # Same thing for GUA:
    GUA_hotspot_resis = [str(res) for res in target_resis if full_seq[res-1] == 'G']
    print(f'Found hotspots (guanine residues) at the following positions in {seq}: {GUA_hotspot_resis}')
    with open(f'{rifdock_path}/GUA_hotspot_target_res_{seq}.list', 'w') as f:
        f.write('\n'.join(GUA_hotspot_resis) + '\n')

Found hotspots (adenine residues) at the following positions for AATTGGTCTGCGCACCAGCA: ['14', '17', '32', '34', '37', '38']
Found hotspots (guanine residues) at the following positions in AATTGGTCTGCGCACCAGCA: ['5', '6', '10', '12', '18', '25', '26', '28', '30', '33']
Found hotspots (adenine residues) at the following positions for GCACCGCCCGCGAGCCAACC: ['3', '13', '17', '18']
Found hotspots (guanine residues) at the following positions in GCACCGCCCGCGAGCCAACC: ['6', '10', '12', '14', '25', '26', '30', '32', '33', '34', '36', '37']


In [29]:
### HOT-02

# Next, we will align bidentate hydrogen bonding residues from natives onto the target hotspots. 
script = f'{software_path}align_vdm_w_lig_atoms_A.py'

# This cell will construct the inputs and command necessary to run the script for GLN- and ASN-ADE bidentates

# First, collect paths to the native bidentates and save them to a list
bidentate_paths = [os.path.abspath(p) for p in glob.glob('./hotspots/ADE_hotspots/*pdb')]  # absolute paths stay valid from any working directory
with open(f'{rifdock_path}/ADE_hotspots.list', 'w') as f:
    for path in bidentate_paths:
        f.write(path+'\n')
print('Collected bidentate ligand models to adenine residues (GLN and ASN rotamers)')

# Then, assemble the command for the script, which takes the form:
# <script> <directory> <bidentate_list> <protein_residue_types> <DNA_base_types> <target_pdb> <residue,indexes,,,>

for seq in seqs:
    pdb = f'{rifdock_path}/target_seqs/{seq}.pdb'
    with open(f'{rifdock_path}/ADE_hotspot_target_res_{seq}.list') as f_in:
        ADE_hotspot_resis = [line.strip() for line in f_in]

        # We want to skip running the script if there are no relevant hotspots
        if ADE_hotspot_resis == ['']:
            print('There are no adenine hotspots, skipping',seq)
            continue

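    # Again, we are going to run this script using subprocess.
    # Absolute paths keep the command valid no matter which directory it runs from.
    cmd = f'python {os.path.abspath(script)} {os.path.abspath(rifdock_path)} ADE_hotspots.list GLN_ASN DA {os.path.abspath(pdb)} {",".join(ADE_hotspot_resis)}'
    print(f'Running "{cmd}" with a subprocess')

    # Run the script from inside rifdock_path via cwd=... rather than os.chdir(), which would change
    # the notebook's own working directory and break the relative rifdock_path in later cells
    p = subprocess.Popen(cmd.split(), cwd=rifdock_path)
    p.communicate()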

# Afterwards, look at some of the pdbs in `base_aligned_GLN_ASN_hotspots` with the target structure to see what this code did.
print(f'Outputted aligned ligand rotamer .pdbs in {rifdock_path}/base_aligned_GLN_ASN_hotspots/')

Collected bidentate ligand models to adenine residues (glutamine and asparagine rotamers)
Running "python /mnt/home/cjg263/software/align_vdm_w_lig_atoms_A.py /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock ADE_hotspots.list GLN_ASN DA /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock/target_seqs/AATTGGTCTGCGCACCAGCA.pdb 14,17,32,34,37,38" with a subprocess
Running "python /mnt/home/cjg263/software/align_vdm_w_lig_atoms_A.py /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock ADE_hotspots.list GLN_ASN DA /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock/target_seqs/GCACCGCCCGCGAGCCAACC.pdb 3,13,17,18" with a subprocess
Outputted aligned ligand rotamer .pdbs in /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock/base_aligned_GLN_ASN_hotspots/


In [32]:
### HOT-03

# This cell will construct the inputs and command necessary to run the same script for ARG-GUA bidentates

bidentate_paths = [os.path.abspath(p) for p in glob.glob(f'{software_path}/rifdock/GUA_hotspots/*pdb')]  # absolute paths stay valid from any working directory
with open(f'{rifdock_path}/GUA_hotspots.list', 'w') as f:
    for path in bidentate_paths:
        f.write(path+'\n')
print('Collected bidentate ligand models to guanine residues (arginine rotamers)')

script = f'{software_path}align_vdm_w_lig_atoms_G.py'

for seq in seqs:
    pdb = f'{rifdock_path}/target_seqs/{seq}.pdb'
    with open(f'{rifdock_path}/GUA_hotspot_target_res_{seq}.list') as f_in:
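        # Drop blank lines, so a file holding only a newline gives an empty list
        GUA_hotspot_resis = [line.strip() for line in f_in if line.strip()]

        # We want to skip running the script if there are no relevant hotspots
        if not GUA_hotspot_resis:
            print('There are no guanine hotspots, skipping',seq)
            continue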

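    # Again, we are going to run this script using subprocess.
    # Absolute paths keep the command valid no matter which directory it runs from.
    cmd = f'python {os.path.abspath(script)} {os.path.abspath(rifdock_path)} GUA_hotspots.list ARG DG {os.path.abspath(pdb)} {",".join(GUA_hotspot_resis)}'
    print(f'Running "{cmd}" with a subprocess')

    # This script needs to run from inside rifdock_path. Passing cwd=... does that for the subprocess only;
    # os.chdir() (the python equivalent of bash's cd) would also move the notebook and break relative paths later
    p = subprocess.Popen(cmd.split(), cwd=rifdock_path)
    p.communicate()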

print(f'Outputted aligned ligand rotamer .pdbs in {rifdock_path}/base_aligned_ARG_hotspots/')

Collected bidentate ligand models to guanine residues (arginine rotamers)
Running "python /mnt/home/cjg263/software/align_vdm_w_lig_atoms_G.py /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock GUA_hotspots.list ARG DG /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock/target_seqs/AATTGGTCTGCGCACCAGCA.pdb 5,6,10,12,18,25,26,28,30,33" with a subprocess
Running "python /mnt/home/cjg263/software/align_vdm_w_lig_atoms_G.py /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock GUA_hotspots.list ARG DG /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock/target_seqs/GCACCGCCCGCGAGCCAACC.pdb 6,10,12,14,25,26,30,32,33,34,36,37" with a subprocess
Outputted aligned ligand rotamer .pdbs in /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock/base_aligned_ARG_hotspots/


In [33]:
### HOT-04

# This cell groups together hotspot interactions found in the previous step into a single pdb of all the interactions to use at each site

# directory where the output will go
os.makedirs(f'{rifdock_path}/base_aligned_hotspot_groups/', exist_ok=True)

for seq in seqs:
    pdb = f'{rifdock_path}/target_seqs/{seq}.pdb'
    aligned_hotspots = glob.glob(f'{rifdock_path}/base_aligned_*_hotspots/{seq}/*pdb')
    
    hotspot_resis = strand1_hotspots[seq] + strand2_hotspots[seq]
    
    for res in hotspot_resis:
        # Collect every aligned hotspot pdb whose filename mentions this residue number
        matching = [hotspot for hotspot in aligned_hotspots if f'_{res}_' in hotspot]
        num_hotspots = len(matching)
        print(f'Found {num_hotspots} for {seq} resi {res}')
        if num_hotspots == 0:
            continue
        
        # Concatenate them into a single pdb, separating the models with TER records
        with open(f'{rifdock_path}/base_aligned_hotspot_groups/{seq}_group{res}.pdb', 'w') as f_out:
            for hotspot in matching:
                with open(hotspot, 'r') as f_in:
                    f_out.write(f_in.read())
                f_out.write('TER\n')
                    
# Notice any pattern in the output? I see 748 for every GUA and 106 for every ADE
# Look at the outputs again, this time in base_aligned_hotspot_groups. Make sure to also load your target DNA!

Found 0 for AATTGGTCTGCGCACCAGCA resi 3
Found 0 for AATTGGTCTGCGCACCAGCA resi 4
Found 748 for AATTGGTCTGCGCACCAGCA resi 5
Found 748 for AATTGGTCTGCGCACCAGCA resi 6
Found 0 for AATTGGTCTGCGCACCAGCA resi 7
Found 0 for AATTGGTCTGCGCACCAGCA resi 8
Found 0 for AATTGGTCTGCGCACCAGCA resi 9
Found 748 for AATTGGTCTGCGCACCAGCA resi 10
Found 0 for AATTGGTCTGCGCACCAGCA resi 11
Found 748 for AATTGGTCTGCGCACCAGCA resi 12
Found 0 for AATTGGTCTGCGCACCAGCA resi 13
Found 106 for AATTGGTCTGCGCACCAGCA resi 14
Found 0 for AATTGGTCTGCGCACCAGCA resi 15
Found 0 for AATTGGTCTGCGCACCAGCA resi 16
Found 106 for AATTGGTCTGCGCACCAGCA resi 17
Found 748 for AATTGGTCTGCGCACCAGCA resi 18
Found 0 for AATTGGTCTGCGCACCAGCA resi 23
Found 0 for AATTGGTCTGCGCACCAGCA resi 24
Found 748 for AATTGGTCTGCGCACCAGCA resi 25
Found 748 for AATTGGTCTGCGCACCAGCA resi 26
Found 0 for AATTGGTCTGCGCACCAGCA resi 27
Found 748 for AATTGGTCTGCGCACCAGCA resi 28
Found 0 for AATTGGTCTGCGCACCAGCA resi 29
Found 748 for AATTGGTCTGCGCACCAGCA resi 30

## Step 2: Rifgen (Only run once)

Now we're going to run Rifgen. This is a big and powerful program that goes hand in hand with rifdock. <br>
Rifgen's goal is to make a hash table which can be used to very quickly find and evaluate solutions to the rigid-body docking problem. <br>
Basically, it evaluates how good each of the 20 amino acids would be at every point in space around the target, <br>
which can be rapidly extrapolated to know how good it is to place the backbone of a protein at any of those points. <br>
This information is then used by rifdock to quickly find docked positions of the full scaffold that score as well as possible. <br>
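
To make the hash-table idea concrete, here is a toy sketch. It is not how rifgen is implemented (the real program is far more elaborate, and this sketch makes no claim about its internals); it only shows why binning space into grid cells makes lookups fast. All names and numbers below are made up for illustration.

```python
import math

def grid_key(xyz, resl):
    """Bin a 3D point into an integer grid cell with edge length resl."""
    return tuple(math.floor(c / resl) for c in xyz)

def build_table(scored_points, resl=1.0):
    """Store the best (lowest) scoring residue seen in each grid cell."""
    table = {}
    for xyz, residue, score in scored_points:
        key = grid_key(xyz, resl)
        if key not in table or score < table[key][1]:
            table[key] = (residue, score)
    return table

def lookup(table, xyz, resl=1.0):
    """Constant-time query: which residue scored best near this point, if any?"""
    return table.get(grid_key(xyz, resl))

# Hypothetical pre-computed placements: (position, residue type, score)
points = [
    ((0.2, 0.3, 0.1), 'ARG', -1.5),
    ((0.7, 0.4, 0.9), 'LYS', -2.0),   # same cell as ARG, better score
    ((3.1, 0.0, 0.0), 'ASN', -0.8),
]
table = build_table(points)
print(lookup(table, (0.5, 0.5, 0.5)))  # ('LYS', -2.0)
print(lookup(table, (9.0, 9.0, 9.0)))  # None
```

The expensive scoring happens once, when the table is built. Afterwards each query is a single dictionary lookup, which is what makes it feasible to evaluate a very large number of candidate placements.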

In [8]:
### RIFG-00

# This cell, rather than running code itself, defines a function that we will use later.
# This function takes several arguments, which correspond to important input parameters of rifgen.
# The function then returns this large multi-line string, which can be written to a file to define all of rifgen's settings.

# We use mostly the same settings every time, so the variables here are mostly the paths to inputs and outputs, plus the hotspot residues.

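def gen_flag_file(pdb,rifdock_path,name,hotspot_resis):
    # The data_cache_dir flag below needs a username; read it from the environment so the f-string can be formatted
    user = os.environ.get('USER', 'unknown')
    RIFGEN_FLAG_FILE = f"""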
################### File I/O flags ######################################
-rifgen:target           {pdb}
-in:file:extra_res_fa
-rifgen:target_res       {rifdock_path}/target_res_{name}.list

-rifgen:outdir           {rifdock_path}/{name}_rifgen_output/
-rifgen:outfile          {name}_rnd1.rif.gz

############################## RIF Flags #####################################

# What kind of RIF do you want to generate:
#                                    Normal: RotScore64
#            Normal with hbond satisfaction: RotScoreSat (RotScoreSat_2x16 if you have >255 of polar atoms)
# Hotspots:
#    I may want to use require_satisfaction: RotScoreSat_1x16
#  I don't want to use require_satisfaction: RotScore64

-rifgen::rif_type RotScoreSat_1x16

##################### Normal RIF Configuration ##############################

# The next three flags control how the RIF is built (hotspots are separate!!)
# Which rif residues do you want to use?
#  apores are the hydrophobics (h-bonding is not considered when placing these)
#  donres donate hydrogens to form hydrogen bonds
#  accres accept hydrogens to form hydrogen bonds
-rifgen:apores # ALA VAL ILE LEU MET PHE TRP # optionally allow hydrophobic residues for thymine
-rifgen:donres ARG LYS GLN ASN # roughly in decreasing order of sample size. Only seeding canonical hbonding residues in major groove. 
-rifgen:accres GLU GLN ASP ASN

-rifgen:score_threshold -0.5  # the score a rotamer must get in order to be added to the rif (kcal/mol) 

###################### Hotspot configuration #################################
#   (use this either with or without apores, donres, and accres)

# Pick one of the two following methods for hotspot input:

# Hotspot input multiple distinct groups
-hotspot_groups {" ".join(hotspot_resis)}

# Hotspot input every hotspot is a group
# -hotspot_groups all_my_hotspots.pdb
#-single_file_hotspots_insertion single_file_hotspots.pdb

-hotspot_sample_cart_bound 0.5   # How much do you want your hotspots to move left/right/up/down
-hotspot_sample_angle_bound 5   # What angular deviation from your hotspot will you accept

-hotspot_nsamples 3000  # How many times should the random sampling be done. 100000 - 1000000 is good

-hotspot_score_thresh -0.5 # What score must a hotspot produce in order to be added to the RIF
#-hotspot_score_bonus -4    # Be careful, rifdock has a maximum score of -9
                             #  do not exceed this (this gets added to the hotspot score)

###################### General flags #######################################

-rifgen:hbond_weight 2.0           # max score per h-bond (kcal/mol. Rosetta is ~ 2.1)
-rifgen:upweight_multi_hbond 0.0   # extra score factor for bidentate hbonds (this is really sketchy)

# For donres and accres. What's the minimum quality h-bond where we keep the rotamers even if it doesn't pass score_threshold?
# This is on a scale from -1 to 0 where -1 represents a perfect hbond
-min_hb_quality_for_satisfaction -0.5


# Change this depending on what version of rifdock you are using
-database /home/bcov/rifdock/latest/database

###################### Experimental flags ##################################

# -use_rosetta_grid_energies true/false  # Use Frank's grid energies for donres, accres, and user hotspots

##############################################################################
##############################################################################
#################### END OF USER ADJUSTABLE SETTINGS #########################
##############################################################################
##############################################################################

-rifgen:extra_rotamers false          # actually ex1 ex2 
-rifgen:extra_rif_rotamers true       # kinda like ex2

-rif_accum_scratch_size_M 24000

-renumber_pdb

-hash_cart_resl              0.7      # rif cartesian resolution
-hash_angle_resl            14.0      # rif angle resolution

# how exactly should the rosetta energy field be calculated?
# The further down you go, the higher the resolution
# This only affects hydrophobics
-rifgen::rosetta_field_resl 0.25
-rifgen::search_resolutions 3.0 1.5 0.75
#-rifgen::rosetta_field_resl 0.125
#-rifgen::search_resolutions 4.0 2.0 1.0 0.5
#-rifgen::rosetta_field_resl 0.125
#-rifgen::search_resolutions 3.0 1.5 0.75 0.375

# This folder only exists to save time. Set it somewhere you can write if the default doesn't work
#-rifgen:data_cache_dir    /software/rifdock/cache
-rifgen:data_cache_dir    /net/scratch/{user}/rifgen_orbital_test_cache

-rifgen:score_cut_adjust 0.8

-hbond_cart_sample_hack_range 1.00
-hbond_cart_sample_hack_resl  0.33

-rifgen:tip_tol_deg        60.0 # for now, do either 60 or 36
-rifgen:rot_samp_resl       6.0

-rifgen:rif_hbond_dump_fraction  0.000001
-rifgen:rif_apo_dump_fraction    0.000001

-add_orbitals

-rifgen:beam_size_M 10000.0
-rifgen:hash_preallocate_mult 0.125
-rifgen:max_rf_bounding_ratio 4.0

-rifgen:hash_cart_resls   16.0   8.0   4.0   2.0   1.0
-rifgen:hash_cart_bounds   512   512   512   512   512
-rifgen:lever_bounds      16.0   8.0   4.0   2.0   1.0
-rifgen:hash_ang_resls     38.8  24.4  17.2  13.6  11.8 # yes worky worky
-rifgen:lever_radii        23.6 18.785501 13.324600  8.425850  4.855575

"""
    return RIFGEN_FLAG_FILE
print('Set rifgen command flags')

Set rifgen command flags


In [None]:
### RIFG-01

# This cell will now assemble the command(s) to run rifgen on our target(s). You will need rifgen installed on your system. Enter the path below:
path_to_rifgen = '' # this is the path to the rifgen executable

cmds = f'{rifdock_path}/cmds_rifgen'

with open(cmds, 'w') as cmds_file:
    for seq in seqs:
        
        # Get the necessary paths
        pdb = f'{rifdock_path}/target_seqs/{seq}.pdb'
        hotspot_resis = glob.glob(f'{rifdock_path}/base_aligned_hotspot_groups/{seq}*pdb')
        
        # Save a rifgen flag file.
        RIFGEN_FLAG_FILE=gen_flag_file(pdb,rifdock_path,seq,hotspot_resis)
        with open(f'{rifdock_path}/{seq}_rifgen.flag', 'w') as f:
            f.write(RIFGEN_FLAG_FILE)
        
        cmd = f'OMP_NUM_THREADS=16 {path_to_rifgen} @{rifdock_path}/{seq}_rifgen.flag > {rifdock_path}/{seq}_rifgen.log 2>&1'
        cmds_file.write(cmd+'\n')
        
        print("Run the following command on a node with around 100 GB of memory and 16 CPUs: ", cmd)

Once your Rifgen job is done, check to make sure the output is there. <br>
It will be in a directory ending with `rifgen_output` inside your working directory (`rifdock_path`). <br>
There should be a bunch of compressed (`.gz`) files. <br>
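
A minimal sketch of that check is given below. It assumes the output directory follows the `{rifdock_path}/{name}_rifgen_output/` pattern set in the flag file above; the helper name `list_rifgen_outputs` is hypothetical.

```python
import glob
import os

def list_rifgen_outputs(rifdock_path, name):
    """Return the sorted .gz files in <rifdock_path>/<name>_rifgen_output/."""
    outdir = os.path.join(rifdock_path, f'{name}_rifgen_output')
    return sorted(glob.glob(os.path.join(outdir, '*.gz')))

# Example: report what was written for one target sequence
outputs = list_rifgen_outputs('./rifdock_output/', 'CGCACCGACTCACG')
if outputs:
    print(f'Found {len(outputs)} compressed rifgen output file(s)')
else:
    print('No .gz files found yet; check the rifgen log for errors')
```

An empty result usually means the job is still running or failed, in which case the `{seq}_rifgen.log` file written by the command above is the first place to look.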

## Step 3: Rifdock (only run once)

Finally, we are going to run rifdock itself. This program works with the outputs from rifgen we just made, <br>
and uses them to place entire protein scaffolds into good spots on the target structure. <br>
We are going to generate a LOT of docks, so be prepared for this step to take several days to run on the digs.

In [42]:
### RIFD-00

# Once again, we define a function that will generate the text for the flags file used by rifdock.
# This time, there are several parameters we've played around with in the past.
# If you're running the pipeline for the first time, you can probably use our defaults here,
# but you may want to play around with the settings later on.

# Flags that you may consider changing are marked with a ^_^

def gen_rifdock_flags(rifgen_log,name,rifdock_path): # write a rifdock flags file:
    
    # This flag sets a minimum requirement for docks to pass scoring and be output.
    # Specifically, it requires that there be at least two "privileged" interactions between the protein and the target
    # We define privileged interactions as base-specific hydrogen bonds, especially potential bidentates.
    # Increasing this number will likely result in fewer docks, but they may be higher quality.
    n_satisfied = 2             # ^_^
    
    # This flag enables a special behavior in rifdock that imposes a penalty for bringing the protein backbone
    # too close to the target, in addition to the energetics of their interactions
    # We used to have problems with docks being too close, but no longer really use this.
    # If you want, you could turn on this feature and play around with how strong you want the effect to be.
    distance_penalty = False    # ^_^
    CB_too_close_dist = 3.5     # ^_^
    CB_too_close_penalty = 8    # ^_^

    # This feature has a similar story: it provides a bonus or penalty for bringing two types of residues close together
    # Specifically, we used it to force the N-terminal ends of helices to be close to the negatively-charged phosphate backbone
    # The idea was to take advantage of the inherent dipole of alpha-helices, which is positive on the N-terminal end
    # In practice, our scaffolds are designed such that most of the docks produce this effect anyway, without needing to force it.
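    # If you want, you could turn this on and play around with the strengths and distance cutoffs.
    # NOTE: the flag lines built below also use TARGET_ATOMS, which is not defined anywhere in this cell.
    # Define it before setting helix_dipole_bonus = True, or you will get a NameError.
    helix_dipole_bonus = False  # ^_^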
    INFOLABEL_N = 'N-TERM:N'    
    INFOLABEL_C = 'C-TERM:O'    
    BONUS = 1                   # ^_^
    PENALTY = 8                 # ^_^
    RESL = 0.1                  # ^_^
    MIN_DIST = 3.5              # ^_^
    MAX_DIST = 5                # ^_^
    MIN_DIST_PEN = 6            # ^_^
    MAX_DIST_PEN = 8            # ^_^

    # The rest of this is just formatting the rifdock flag file based on the parameters specified.
    
    if distance_penalty:
        distance_penalty_lines = f"""-CB_too_close_dist {CB_too_close_dist}
-CB_too_close_penalty {CB_too_close_penalty}
-CB_too_close_resl 0.1"""
    else:
        distance_penalty_lines = ""

    if helix_dipole_bonus:
        helix_dipole_bonus_lines = f"""-specific_atoms_close_bonus {INFOLABEL_N},-{BONUS},{RESL},{MIN_DIST},{MAX_DIST},{TARGET_ATOMS}
-specific_atoms_close_bonus {INFOLABEL_C},{PENALTY},{RESL},{MIN_DIST_PEN},{MAX_DIST_PEN},{TARGET_ATOMS}"""

    else:
        helix_dipole_bonus_lines = ""

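    # Paste the last 17 lines of the rifgen log at the top of the rifdock flag file
    # (they are expected to be the flags that point rifdock at rifgen's outputs)
    with open(rifgen_log, 'r') as f_in:
        line_list = f_in.readlines()
    # From here on, rifgen_log holds that text rather than the path to the log
    rifgen_log = ''.join(line_list[-17:])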

    # Everything between the """   """ is one long string
    #                   vvvv BEGIN string
    rifdock_flag_file = f"""{rifgen_log}


-rif_dock:num_hotspots 10000

################################### Flags that control distance from DNA atoms ##########################
{distance_penalty_lines}
{helix_dipole_bonus_lines}

################################## Require DNA recognition helix ########################################
-scaffold_res_pdbinfo_labels ss_RH

#################################### Flags that control output ##########################################

-rif_dock:outdir  {rifdock_path}/{name}_rifdock_output          # the output folder for this run
-rif_dock:dokfile all.dok                 # the "score file" for this run

-rif_dock:n_pdb_out 300                   # max number of output pdbs  # ^_^
-rif_dock:redundancy_filter_mag 0.6       # The "RMSD" cluster threshold for the output. Smaller numbers give more, but redundant output  # ^_^

#-rif_dock:target_tag conf01              # optional tag to add to all outputs
-rif_dock:align_output_to_scaffold false  # If this is false, the output is aligned to the target

# Pick either one of the following or none
# (None)                                  # Output target + scaffold. But scaffold may be poly ALA with rifres based on scaffold_to_ala
#-output_scaffold_only                    # Output just the scaffold. But scaffold may be poly ALA with rifres based on scaffold_to_ala
-output_full_scaffold                     # Output target + scaffold. Scaffold retains input sequence plus rifres
#-output_full_scaffold_only               # Output just the scaffold. Scaffold retains input sequence plus rifres


############################ Flags that affect runtime/search space ####################################

-beam_size_M 5                            # The number of search points to use during HSearch
-hsearch_scale_factor 1.2                 # The default search resolution gets multiplied by this. People don't usually change this.

#-rif_dock:tether_to_input_position 3     # Only allow results within this "RMSD" of the input scaffold

-rif_dock:global_score_cut -7.0          # After HSearch and after HackPack, anything worse than this gets thrown out

##################### Flags that only affect the PatchDock/RifDock runs ################################
# Uncomment everything here except seeding_pos if running PatchDock/RifDock

#-rif_dock:seeding_pos ""                 # Either a single file or a list of seeding position files
#-rif_dock:seeding_by_patchdock true       # If true, seeding_pos is literally the PatchDock .out file
                                          # If false, seeding_pos file is list of transforms. 
                                          #   (Each row is 12 numbers. First 9 are rotation matrix and last 3 are translation.)
#-rif_dock:patchdock_min_sasa  1000        # Only take patchdock outputs with more than this sasa
#-rif_dock:patchdock_top_ranks 2000        # Only take the first N patchdock outputs

                                           
#-rif_dock:xform_pos /mnt/home/bcov/sc/scaffold_comparison/data/xform_pos_ang30.x #/home/bcov/sc/random/xform_poss/xform_pos_ang10_0.35A_1.1d.x
                                           # Which xform file do you want to use. Difference is how many degrees do you want to 
                                           #   deviate from the PatchDock outputs. Pick one from here:
                                           #                 /home/bcov/sc/scaffold_comparison/data/xform_pos_ang*



#-rif_dock:cluster_score_cut -5.0          # After HackPack, what results should be thrown out before applying -keep_top_clusters_frac
#-rif_dock:keep_top_clusters_frac 1.0      # After applying the cluster_score_cut, what fraction of remaining seeding positions should survive?
                                         
#-rosetta_score_each_seeding_at_least 1    # When cutting down the results by rosetta_score_fraction, make sure at least this many from each 
                                           #   seeding position survive

#-only_load_highest_resl                   # This will make rifdock use less ram. Highly recommended for the patchdock protocol.

#-n_pdb_out_global 300                      # n_pdb_out controls how many per patchdock output. This is how many total

##################### Advanced seeding position flags ##################################################

#-rif_dock:seed_with_these_pdbs *.pdb      # List of scaffolds floating in space above the target that you would like to use instead
                                           #   of numeric seeding positions. The target shouldn't be present and the scaffold must match exactly.
                                           #   Use this instead of -seeding_pos
#-rif_dock:seed_include_input true         # Include the input pdb as one of the pdbs for -seed_with_these_pdbs

#-rif_dock:write_seed_to_output true       # Use this if you want to know which output came from which seeding position


##################### Flags that affect how things are scored ##########################################

#-use_rosetta_grid_energies true/false     # Your choice. If True, uses Frank's grid energies during Hackpack

-hbond_weight 2.0                          # max score per hbond (Rosetta's max is 2.1)  # ^_^
-upweight_multi_hbond 0.0                  # extra score factor for bidentate hbonds (BrianC recommends don't do this)
-min_hb_quality_for_satisfaction -0.25     # If using require_satisfaction (or buried unsats). How good does a hydrogen bond need to be to "count"?
                                           #   The scale is from -1.0 to 0 where -1.0 is a perfect hydrogen bond.
-scaff_bb_hbond_weight 2.0                 # max score per hbond on the scaffold backbone  # ^_^

-favorable_1body_multiplier 0.2            # Anything with a one-body energy less than favorable_1body_multiplier_cutoff gets multiplied by this
-favorable_1body_multiplier_cutoff 4       # Anything with a one-body energy less than this gets multiplied by favorable_1body_multiplier
-favorable_2body_multiplier 5              # Anything with a two-body energy less than 0 gets multiplied by this

-user_rotamer_bonus_constant 0             # Anything that makes a hydrogen-bond, is a hotspot, or is a "requirement" gets this bonus
-user_rotamer_bonus_per_chi -2             # Anything that makes a hydrogen-bond, is a hotspot, or is a "requirement" gets this bonus * number of chis  # ^_^

-rif_dock:upweight_iface 2.0               # During RifDock and HackPack. rifres-target interactions are multiplied by this number  # ^_^


################ stuff related to picking designable and fixed positions #################

#### if you DO NOT supply scaffold_res files, this will attempt to pick which residues on the scaffold
#### can be mutated based on sasa, internal energy, and bb-sc hbonds
-scaffold_res_use_best_guess true

#### if scaffold_res is NOT used, this option will cause loop residues to be ignored
#### scaffold_res overrides this
-rif_dock::dont_use_scaffold_loops false

#### these cause the non-designable scaffold residues to still contribute sterically
#### and to the 1 body rotamer energies. use these flags if you have a fully-designed scaffold
#-rif_dock:scaffold_to_ala false
#-rif_dock:scaffold_to_ala_selonly true
#-rif_dock:replace_all_with_ala_1bre false
#### if you don't have a fully designed scaffold, treat non-designable positions as alanine
-rif_dock:scaffold_to_ala true            # Brian thinks that converting the whole scaffold to alanine works better during rosetta min
-rif_dock:scaffold_to_ala_selonly false
-rif_dock:replace_all_with_ala_1bre true



#################################### HackPack options #####################################
-hack_pack false                            # Do you want to do HackPack? (Probably a good idea)
-rif_dock:hack_pack_frac  1.0              # What fraction of your HSearch results (that passed global_score_cut) do you want to HackPack?


############################# rosetta re-scoring / min stuff ###################################

#-rif_dock:rosetta_score_cut -10.0                    # After RosettaScore, anything with a score worse than this gets thrown out

#-rif_dock:rosetta_score_fraction 0                   # These two flags greatly affect runtime!!!!!
#-rif_dock:rosetta_min_fraction 0                     # Choose wisely, higher fractions give more, better output at the cost of runtime

#-rif_dock:rosetta_min_at_least 30                    # Make sure at least this many survive the rosetta_min_fraction
#-rif_dock:rosetta_min_at_most 300                    # Make sure no more than this get minned
#-rif_dock:rosetta_score_at_most  3000              # Make sure that no more than this many go to rosetta score

#-rif_dock:replace_orig_scaffold_res false            # If you converted to poly ALA with scaffold_to_ala, this puts the original residues
                                                     #   back before you do rosetta min.
#-rif_dock:override_rosetta_pose true                 # Brian highly recommends this flag. This prevents the minimized pose from being output
#-rif_dock:rosetta_min_scaffoldbb false               # Set BB movemap of scaffold to True
#-rif_dock:rosetta_min_targetbb   false               # Set BB movemap of target to true
#-rif_dock:rosetta_hard_min false                     # Minimize with the "hard" score function (alternative is "soft" score function)

#-rif_dock:rosetta_score_rifres_rifres_weight   0.6   # When evaluating the final score, multiply rifres-rifres interactions by this
#-rif_dock:rosetta_score_rifres_scaffold_weight 0.4   # When evaluating the final score, multiply rifres-scaffold interactions by this
                                                     #  These two flags only get used if the interaction is good. Bad interactions are
                                                     #    full weight.


######################### Special flags that do special things #################################

#-hack_pack_during_hsearch False           # Run HackPack during the HSearch. Doesn't usually help, but who knows.
-require_satisfaction {n_satisfied}        # Require at least this many hbonds, hotspots, or "requirements"
-require_n_rifres  3                      # Require at least this many rifres (not perfect)

#-requirements 0,1,2,8                     # Require that certain satisfactions be required in all outputs
                                           # If one runs a standard RifDock, these will be individual hydrogen bonds to specific atoms
                                           # If one uses hotspots during rifgen, these will correspond to the hotspot groups
                                           #   However, due to some conflicts, these will also overlap with hydrogen bonds to specific atoms
                                           # Finally, if one uses a -tuning_file, these will correspond to the "requirements" set there


######################### Hydrophobic Filters ##################################################
# These are rather experimental flags. You'll have to play with the values.
# Hydrophobic ddG is roughly fa_atr + fa_rep + fa_sol for hydrophobic residues.

#-hydrophobic_ddg_cut -12                  # All outputs must have hydrophobic ddG at least this value
#-require_hydrophobic_residue_contacts 5   # All outputs must make contact with at least this many target hydrophobics
#-hydrophobic_ddg_weight 3 # Overweight hydrophobic interactions so that the HackPack understands they are important, for T hydrophobic interactions
#-one_hydrophobic_better_than -2           # Require that at least one rifres have a hydrophobic ddG better than this
#-two_hydrophobics_better_than -2          # Require that at least two rifres have a hydrophobic ddG better than this
#-three_hydrophobics_better_than -1        # Require that at least three rifres have a hydrophobic ddG better than this

# This next flag affects the *_hydrophobics_better_than flags. A rifres can only be counted towards those flags if it passes this one.
#-hydrophobic_ddg_per_atom_cut -0.3        # Require that hydrophobics for the *_hydrophobics_better_than flags have at least this much 
                                           #  ddG per side-chain heavy atoms.

#-hydrophobic_target_res 1,15,29,35        # If you want your selection of hydrophobic residues to include only a subset of the ones
                                           #  you selected for the target_res, place that selection here with commas.

######################### options to favor existing scaffold residues ##########################
-add_native_scaffold_rots_when_packing 0 # 1
-bonus_to_native_scaffold_res          0 # -0.5


################################# Twobody table caching ####################################

# RifDock caches the twobody tables to save time on later runs. If you use the same scaffolds
#  in the same directory multiple times, this is a good idea. Otherwise, these take up quite
#  a bit of space and it might be smart to turn the caching off.

-rif_dock:cache_scaffold_data true
-rif_dock:data_cache_dir  ./rifdock_v4_scaffdata_025_0_atr1


################################ Rosetta Database ##########################################

-database /software/rifdock/latest/database

############################################################################################
############################################################################################
############################ END OF USER ADJUSTABLE SETTINGS ###############################
############################################################################################
############################################################################################


#### to use -beta, ask Will if you don't want to use -beta
-beta
-score:weights beta_soft
-add_orbitals false

#### HackPack options you probably shouldn't change
-rif_dock:pack_n_iters    2
-rif_dock:pack_iter_mult  2.0
-rif_dock:packing_use_rif_rotamers        true
-rif_dock:extra_rotamers                  false
-rif_dock:always_available_rotamers_level 0


#### details for how twobody rotamer energies are computed and stored, don't change
-rif_dock:rotrf_resl   0.25
-rif_dock:rotrf_spread 0.0
-rif_dock:rotrf_scale_atr 1.0
-rif_dock:rotrf_cache_dir /net/scratch/{user}/rifdock_cache         # This folder only exists to save time. 
                                                          #   Set it somewhere you can write if the default doesn't work

### Brian doesn't know what these flags do
-rif_dock::rf_resl 0.5
-rif_dock::rf_oversample 2
-rif_dock:use_scaffold_bounding_grids 0
-rif_dock:target_rf_oversample 2

 # disulfides seem to cause problems... ignoring them isn't really an issue unless
 # you do bbmin where there should be disulfides
-detect_disulf 0


-mute core.scoring.ScoreFunctionFactory
-mute core.io.pose_from_sfr.PoseFromSFRBuilder

-outputsilent
"""
#^^^ END string

    # for this function, we need to return the output 
    return rifdock_flag_file
print('Created function to set rifdock command flags')

Created function to set rifdock command flags
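
The function above pastes the last 17 lines of the rifgen log at the top of the flag file. In this pipeline those lines are expected to hold the docking flags that rifgen prints at the end of its run. Below is a minimal sketch of that step on its own, so you can check what would be copied before building a flag file. The helper name `tail_lines` and the toy log are assumptions made for illustration; they are not part of the notebook.

```python
import os
import tempfile


def tail_lines(path, n=17):
    """Return the last n lines of a text file joined into one string."""
    with open(path, 'r') as f_in:
        return ''.join(f_in.readlines()[-n:])


# Toy log with 20 numbered lines (stand-in for a real rifgen log)
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as tmp:
    tmp.write(''.join(f'line {i}\n' for i in range(1, 21)))
    toy_log = tmp.name

tail = tail_lines(toy_log, n=17)
os.remove(toy_log)

kept = tail.splitlines()
print(kept[0])     # first line that survives the cut
print(len(kept))   # number of lines kept
```

If a different rifgen build prints a longer or shorter block at the end of its log, the 17 would need to change, so it is worth printing the tail of a real log once and checking it by eye.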


In [39]:
### RIFD-01

# Make a rifdock flag for each target structure, using the above function

rifgen_logs = []
for seq in seqs:
    rifgen_log = f'{rifdock_path}/{seq}_rifgen.log'
    rifgen_logs.append(rifgen_log)
    os.makedirs(f'/net/scratch/{user}/rifdock_cache_{ID}_{seq}/',exist_ok=True)
    rifdock_flag_file = gen_rifdock_flags(rifgen_log, seq, rifdock_path)
    with open(f'{rifdock_path}/{seq}_rifdock.flag', 'w') as f:
        f.write(rifdock_flag_file)
print('Set rifdock command flags')

Set rifdock command flags
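
Because most of the flag file is commented out, it can be hard to see at a glance which flags are actually active. The sketch below lists only the uncommented flags in a flag-file string. The function name `active_flags` is hypothetical, and it assumes that `#` never appears inside a flag value (a path containing `#` would be cut short).

```python
def active_flags(flag_text):
    """Map each uncommented flag to its value string ('' if it takes no value)."""
    flags = {}
    for raw in flag_text.splitlines():
        line = raw.split('#', 1)[0].strip()   # drop trailing comments
        if not line.startswith('-'):
            continue                          # blank or comment-only line
        parts = line.split(None, 1)           # flag name, then the rest
        flags[parts[0]] = parts[1].strip() if len(parts) > 1 else ''
    return flags


# A few lines in the style of the flag file above
example = """
-rif_dock:n_pdb_out 300                   # max number of output pdbs
#-rif_dock:target_tag conf01              # optional tag to add to all outputs
-output_full_scaffold                     # flag without a value
-rif_dock:global_score_cut -7.0
"""

flags = active_flags(example)
for name, value in flags.items():
    print(name, value)
```

To use it on a real file, read one of the `{seq}_rifdock.flag` files written above and pass its text to `active_flags`.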


In [40]:
### RIFD-02

### Get the RIF size (the size of the hash table generated by rifgen. You'll need at least this much memory to run rifdock)
with open(rifgen_logs[0], 'r') as f_in:
    for line in f_in:
        if 'RIF size:' in line:
            rif_size = int(line[10:13])
            break
print(f'RIF size is {rif_size}G')

RIF size is 4G
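
The cell above reads the RIF size from fixed character positions, which breaks if the spacing in the log changes, and leaves `rif_size` undefined if no matching line is found. A slightly more forgiving sketch using a regular expression is shown below. The sample log lines are made up for illustration and the helper name `parse_rif_size` is hypothetical; check a real rifgen log to confirm the pattern fits.

```python
import re


def parse_rif_size(lines):
    """Return the first integer that follows 'RIF size:' in lines, or None."""
    for line in lines:
        match = re.search(r'RIF size:\s*(\d+)', line)
        if match:
            return int(match.group(1))
    return None


# Made-up log lines; the real rifgen log format may differ
sample_log = ['some other output\n', 'RIF size:   4 GB\n']

rif_size = parse_rif_size(sample_log)
if rif_size is None:
    raise ValueError('No "RIF size:" line found in the rifgen log')
print(f'RIF size is {rif_size}G')
```

Note that this only captures the integer part, so a value printed with decimals would be rounded down. Since the number is used for a memory request, rounding up by hand in that case is the safer choice.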


In [41]:
### RIFD-03

# Set up rifdock commands with the following. Then test a job to see how many outputs you get. 
# Take a look at the docks and make sure they look good to you.

#RIFDOCK = '/home/bcov/rifdock/scheme/build/apps/rosetta/rif_dock_test'
RIFDOCK = '/software/rifdock/latest/rif_dock_test'
SCAFFOLDS = f'{rifdock_path}/scaffolds.list'

os.makedirs(f'{rifdock_path}/rifdock_commands/splits', exist_ok=True)
os.chdir(f'{rifdock_path}/rifdock_commands/splits')                     
# os.chdir changes the current working directory for the Jupyter notebook

# the bash `split` command will split the scaffolds list file into a bunch of files with 120 lines (specified by the -l flag below)
split_cmd = f'split -l 120 {SCAFFOLDS}'
p = subprocess.Popen(split_cmd.split())
p.communicate()

splits = glob.glob(f'{rifdock_path}/rifdock_commands/splits/x*')
with open(f'{rifdock_path}/rifdock_commands/commands.list','w') as f:
    for seq in seqs:
        rifdock_flags = f'{rifdock_path}/{seq}_rifdock.flag'
        os.makedirs(f'{rifdock_path}/rifdock_logs_{seq}', exist_ok=True)

        for split in sorted(splits):
            split_name = os.path.basename(split)

            cmd  = f' cd {rifdock_path} ;'
            cmd += f' {RIFDOCK} @{rifdock_flags} -scaffolds $(cat {split})'
            cmd += f' > rifdock_logs_{seq}/{split_name}.log' 
            cmd +=  ' 2>&1'
            
            f.write(cmd+'\n')
                    
# Print test command instructions
print(f'ssh digs')
print(f'qlogin -c 15 --mem={rif_size+46}g -p cpu -q interactive')
print(f'cd {rifdock_path}')
print(f'head -n 1 rifdock_commands/commands.list')

# RUN THE BELOW IN A TERMINAL
# THEN, copy the output of the final step into the command line again to run it. Let it run for a few minutes, then cancel the job with CTRL-C.
# THEN, look at the actual outputs. cd into the output directory (ends in rifdock_output), then pick a silent file and use silentextract to get the pdbs out of it. Look at the pdbs in pymol

ssh digs
qlogin -c 15 --mem=50g -p cpu -q interactive
cd /net/scratch/cjg263/de_novo_dna/HBVa_HSB-2/rifdock
head -n 1 rifdock_commands/commands.list
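
The `split -l 120` call above cuts the scaffold list into files of at most 120 lines each (named `xaa`, `xab`, ... by default, which is why the glob looks for `x*`). The sketch below shows the same chunking in plain Python, which can be handy for checking how many jobs a given scaffold list will produce per target. The scaffold names and the helper `chunk_lines` are hypothetical.

```python
def chunk_lines(lines, chunk_size=120):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


# Hypothetical scaffold list with 250 entries
scaffolds = [f'scaffold_{i:04d}.pdb' for i in range(250)]

chunks = chunk_lines(scaffolds)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)            # sizes of each chunk
print(len(chunks), 'jobs per target')
```

One thing to keep in mind: `split` does not delete old output, so if the cell is rerun with a shorter scaffold list, leftover `x*` files from the earlier run would still be picked up by the glob. Clearing the `splits` folder first avoids that.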


In [10]:
### RIFD-04

# If you wish, go back and adjust the parameters of rifdock.flag and rerun the test command
# When you are satisfied with the number and quality of docks, continue to submit all the jobs