<h2>IDR Design</h2>

231005 - Tian Hao Huang, Julie Forman-Kay, Alan Moses

This notebook generates protein sequences with a target biophysical feature vector, as described in the following papers:
-   Zarin, T., Strome, B., Ba, A. N. N., Alberti, S., Forman‐Kay, J. D., & Moses, A. M. (2019). Proteome-wide signatures of function in highly diverged intrinsically disordered regions. eLife, 8. https://doi.org/10.7554/elife.46883
-   Zarin, T., Strome, B., Peng, G., Pritišanac, I., Forman‐Kay, J. D., & Moses, A. M. (2021). Identifying molecular features that are associated with biological function of intrinsically disordered protein regions. eLife, 10. https://doi.org/10.7554/elife.60220

The vectors consist of bulk biophysical properties and short linear interaction motifs.
Using the example of the human helicase DDX3X, let's take a look at the biophysical feature vector for its N- and C-terminal IDRs:

In [1]:
from idr_design.feature_calculators.main import SequenceFeatureCalculator
from pandas import DataFrame

feature_calculator = SequenceFeatureCalculator()
ddx3x_n_idr = "MSHVAVENALGLDQQFAGLDLNSSDNQSGGSTASKGRYIPPHLRNREATKGFYDKDSSGWSSSKDKDAYSSFGSRSDSRGKSSFFSDRGSGSRGRFDDRGRSDYDGIGSRGDRSGFGKFERGGNSRWCDKSDEDDWSKPLPPSERLEQELFSGGNTGINFEKYDDIP"
ddx3x_c_idr = "YEHHYKGSSRGRSKSSRFSGGFGARDYRQSSGASSSSFSSSRASSSRSGGGGHGSSRGFGGGGYGGFYNSDGYGGNYNSQGVDWWGN"
features_unlabelled = feature_calculator.run_feats_multiple_seqs([ddx3x_n_idr, ddx3x_c_idr])
columns = feature_calculator.supported_features
features_labelled = DataFrame(features_unlabelled.values(), columns=columns, index=["N IDR", "C IDR"])

features_labelled

,my_kappa,my_omega,SCD,isoelectric_point,FCR,DEG_APCC_DBOX_1,DEG_APCC_KENBOX_2,DOC_MAPK_MEF2A_6,DOC_CYCLIN_RxL_1,DOC_MAPK_FxFP_2,...,P_repeats,PTS_repeats,Q_repeats,QN_repeats,R_repeats,RG_repeats,S_repeats,SG_repeats,SR_repeats,complexity
N IDR,-1.844929,-0.545011,-0.670259,5.45991,0.317365,0,0,0,0,0,...,2,10,1,2,0,12,6,19,14,2.448322
C IDR,0.09022,-0.095982,0.093023,10.07898,0.16092,0,0,0,0,0,...,0,11,0,0,0,12,11,25,18,1.918661
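To make one of these columns concrete, here is a minimal sketch of how a bulk feature such as FCR could be computed by hand. It assumes FCR means the fraction of charged residues (D, E, K, R); the function name `fraction_charged_residues` is hypothetical and is not part of `idr_design`.

```python
# Assumption: FCR = fraction of residues that are D, E, K or R.
# This helper is illustrative only; it is not the idr_design implementation.
CHARGED = frozenset("DEKR")


def fraction_charged_residues(seq: str) -> float:
    """Return the fraction of residues in `seq` that are charged."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(residue in CHARGED for residue in seq) / len(seq)


ddx3x_c_idr = "YEHHYKGSSRGRSKSSRFSGGFGARDYRQSSGASSSSFSSSRASSSRSGGGGHGSSRGFGGGGYGGFYNSDGYGGNYNSQGVDWWGN"
fcr_c = fraction_charged_residues(ddx3x_c_idr)
print(round(fcr_c, 5))
```

On the C-terminal IDR this counts 14 charged residues out of 87, i.e. about 0.1609, which agrees with the FCR entry in the "C IDR" row.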


We can determine how similar these vectors are by calculating their Euclidean distance.

However, each feature has a different underlying variance, so we rescale every feature to unit variance before calculating the distance.

First, we need to calculate the variance of each feature over all DisProt IDRs (as of Oct 3rd, 2023):

In [2]:
from idr_design.feature_calculators.main import DistanceCalculator

dist_calculator = DistanceCalculator(feature_calculator, proteome_path="./tests/disprot_idrs_clean.fasta")
dist_calculator.proteome_variance

my_kappa              1.815060
my_omega              0.489790
SCD                  99.730140
isoelectric_point     7.293132
FCR                   0.018291
                       ...    
RG_repeats           16.225385
S_repeats             4.811465
SG_repeats           30.241691
SR_repeats           17.775914
complexity            0.210320
Length: 94, dtype: float64
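As a rough illustration of what a per-feature variance over many IDRs means, the sketch below computes one variance per column from a small table of made-up feature rows. The helper `per_feature_variance` and the toy numbers are hypothetical, and the notebook does not say whether the package uses the population or the sample variance; population variance is assumed here.

```python
from statistics import pvariance


def per_feature_variance(feature_rows):
    """Return one variance per feature (column) across all rows.

    Assumption: population variance; the real package may differ.
    """
    columns = zip(*feature_rows)  # transpose rows into columns
    return [pvariance(column) for column in columns]


# Toy data: three "sequences", two features each (values are made up).
toy_rows = [
    [0.0, 10.0],
    [1.0, 10.0],
    [2.0, 13.0],
]
toy_variances = per_feature_variance(toy_rows)
print(toy_variances)
```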

Then, we calculate the distance between the N-terminal IDR and the C-terminal IDR:

In [3]:
from math import sqrt

feats_n_idr = features_labelled.loc["N IDR"]
feats_c_idr = features_labelled.loc["C IDR"]
distance = sqrt(dist_calculator.sqr_distance(feats_n_idr, feats_c_idr))

distance

11.660992763802641
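The rescaling described above can be sketched as follows: dividing each squared difference by that feature's variance is equivalent to rescaling every feature to unit variance first. This is only an assumption about what `sqr_distance` computes; the function `scaled_sqr_distance` and its handling of zero-variance features are hypothetical.

```python
from math import sqrt


def scaled_sqr_distance(a, b, variances):
    """Squared Euclidean distance after scaling each feature to unit variance.

    Assumption: features with zero variance are skipped to avoid dividing by zero.
    """
    if not (len(a) == len(b) == len(variances)):
        raise ValueError("all inputs must have the same length")
    return sum(
        (x - y) ** 2 / var
        for x, y, var in zip(a, b, variances)
        if var > 0
    )


# Toy example: two features with variances 9 and 4.
toy_sqr = scaled_sqr_distance([0.0, 0.0], [3.0, 4.0], [9.0, 4.0])
print(sqrt(toy_sqr))  # square root of 9/9 + 16/4
```

Without the rescaling, a high-variance feature such as SCD (variance near 100 above) would dominate the distance and low-variance features such as FCR would barely register.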

Now, we're ready to design sequences with a target biophysical feature vector, even if the primary sequence is completely different!

We do this by starting with a random sequence and making iterative sequence changes that greedily minimize the distance to the target vector.
Let's try a brute-force approach first: substitute every amino acid at every position, and keep the best of all those guesses.

(Skip this cell if you are in a hurry to design sequences)

In [4]:
from idr_design.design_models.iter_guess_model import BruteForce
from pandas import Series

designer_brute = BruteForce(dist_calculator, "2023")
designed_seq = designer_brute.design_similar(1, ddx3x_c_idr, verbose=True)[0]
Series(feature_calculator.run_feats(designed_seq), index=feature_calculator.supported_features)

seq                                                                                                 	dist_to_target      	time                
SSSSANDCSSFGLTWLNFQMSCNNTDIDHVENGGYAPGGGFGMTHRGFGGCWCWMMSWFNPIHTSSSHHPPEVQVVNCRRPGWNWYF             	10.26705203188014   	8.849781036376953   

KeyboardInterrupt: 

Now that's pretty slow! I don't even want to show you what happens on the longer N-terminal IDR.
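To see where the time goes, here is a toy sketch of the brute-force idea described above. It is not the package's `BruteForce` class: the two-feature `toy_features` function, the unscaled `sqr_distance`, and all other names are hypothetical stand-ins. Each greedy step evaluates `len(seq) * 19` candidates, so for the 87-residue C-terminal IDR that would be 1653 feature-vector calculations per accepted substitution.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_features(seq):
    """Hypothetical 2-feature vector: fraction charged, fraction glycine."""
    n = len(seq)
    return [sum(r in "DEKR" for r in seq) / n, seq.count("G") / n]


def sqr_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_step(seq, target):
    """Try every single substitution and return the best one found."""
    best_seq, best_dist = seq, sqr_distance(toy_features(seq), target)
    evaluations = 0
    for pos in range(len(seq)):
        for aa in AMINO_ACIDS:
            if aa == seq[pos]:
                continue
            candidate = seq[:pos] + aa + seq[pos + 1:]
            evaluations += 1
            dist = sqr_distance(toy_features(candidate), target)
            if dist < best_dist:
                best_seq, best_dist = candidate, dist
    return best_seq, best_dist, evaluations


def brute_force_design(start, target, max_iters=100):
    """Repeat greedy steps until no substitution improves the distance."""
    seq, dist = start, sqr_distance(toy_features(start), target)
    for _ in range(max_iters):
        new_seq, new_dist, _ = brute_force_step(seq, target)
        if new_dist >= dist:
            break
        seq, dist = new_seq, new_dist
    return seq, dist


target = toy_features("GGRGGDGGKG")
start = "AAAAAAAAAA"
designed, final_dist = brute_force_design(start, target)
_, _, n_evals = brute_force_step(start, target)
print(designed, final_dist, n_evals)
```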

Let's try a better approach, which I have called the Random Multiple Changes approach. This approach guesses point mutations randomly, finds multiple changes which decrease the distance, and then combines them to form guesses with multiple changes. Hence the name.

(Again, skip if you just want to design sequences, although this one's fast.)

In [5]:
from idr_design.design_models.iter_guess_model import RandMultiChange
from pandas import Series

designer_rand = RandMultiChange(dist_calculator, "2023")
designed_seq = designer_rand.design_similar(1, ddx3x_c_idr, verbose=True)[0]
Series(feature_calculator.run_feats(designed_seq), index=feature_calculator.supported_features)

seq                                                                                                 	dist_to_target      	time                
YGGGGRKGGENGGGSSSFGGSSYSSYVDSSYGGDYGWGFGFGSSSGRAYGKHARSRSRFNSQFSSSHHAQGGWNSRDSRGYGSSSRN             	0.34271289145477046 	3.60129976272583    


my_kappa              0.090220
my_omega             -0.095982
SCD                   0.516144
isoelectric_point    10.078980
FCR                   0.160920
                       ...    
RG_repeats           12.000000
S_repeats            11.000000
SG_repeats           25.000000
SR_repeats           18.000000
complexity            1.918661
Length: 94, dtype: float64
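The Random Multiple Changes idea can be sketched on a toy problem as below. This is one possible reading of the description (random point mutations, keep the ones that help, then stack them), not the actual `RandMultiChange` implementation; `toy_features`, the unscaled `sqr_distance`, the number of guesses, and the way changes are combined are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_features(seq):
    """Hypothetical 2-feature vector: fraction charged, fraction glycine."""
    n = len(seq)
    return [sum(r in "DEKR" for r in seq) / n, seq.count("G") / n]


def sqr_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def apply_changes(seq, changes):
    """Apply a {position: amino_acid} mapping to a sequence."""
    residues = list(seq)
    for pos, aa in changes.items():
        residues[pos] = aa
    return "".join(residues)


def rand_multi_change_step(seq, target, rng, n_guesses=50):
    current = sqr_distance(toy_features(seq), target)
    improving = {}  # position -> (distance, amino acid), best guess per position
    for _ in range(n_guesses):
        pos, aa = rng.randrange(len(seq)), rng.choice(AMINO_ACIDS)
        if aa == seq[pos]:
            continue
        dist = sqr_distance(toy_features(apply_changes(seq, {pos: aa})), target)
        if dist < current and (pos not in improving or dist < improving[pos][0]):
            improving[pos] = (dist, aa)
    # Stack the improving changes, best first, keeping each one only if the
    # combined sequence is still closer to the target.
    best_seq, best_dist, combined = seq, current, {}
    for pos, (_, aa) in sorted(improving.items(), key=lambda item: item[1][0]):
        trial = {**combined, pos: aa}
        trial_seq = apply_changes(seq, trial)
        trial_dist = sqr_distance(toy_features(trial_seq), target)
        if trial_dist < best_dist:
            combined, best_seq, best_dist = trial, trial_seq, trial_dist
    return best_seq, best_dist


def rand_multi_change_design(start, target, seed=0, max_iters=200):
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    seq, dist = start, sqr_distance(toy_features(start), target)
    for _ in range(max_iters):
        seq, dist = rand_multi_change_step(seq, target, rng)
    return seq, dist


target = toy_features("GGRGGDGGKG")
start = "AAAAAAAAAA"
start_dist = sqr_distance(toy_features(start), target)
designed, final_dist = rand_multi_change_design(start, target)
print(designed, final_dist)
```

Each step here costs a fixed number of random guesses rather than one evaluation per position and amino acid, and a single step can accept several substitutions at once, which is the intuition behind the speed-up seen above.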

If designing sequences is all you're here for, here's the code to design sequences similar to any user-supplied sequence:

(When you run this cell, look for the input prompts where you enter the sequence and the number of designs.)

In [8]:
from idr_design.design_models.iter_guess_model import RandMultiChange
from pandas import Series

designer_rand = RandMultiChange(dist_calculator, "2023")
user_seq = input("Give a sequence! (Defaults to ddx3x_n_idr)")
if user_seq == "":
    user_seq = ddx3x_n_idr
n = int(input("How many do you want to design?"))
# Check up front that the sequence is valid; this call errors if it is not
feature_calculator.run_feats(user_seq)
designed_seqs = designer_rand.design_similar(n, user_seq, verbose=True)

designed_seqs

seq                                                                                                 	dist_to_target      	time                
QLPPLQYRPIGRASMTEGRTRRLIDQRFGCLRYCLLALRALPAPSGQLVPLASKPLRQSM                                        	0.40096914414786927 	2.930323839187622   
seq                                                                                                 	dist_to_target      	time                
AYRGLYSGVGRVTRQASCSAIVIPPQAAPQVPCQVWRILSTRKRLQAVVRRPDGRVPEPA                                        	0.02447803930453163 	3.515394926071167   
seq                                                                                                 	dist_to_target      	time                
YSAISIRPSIQEQPPRREIDFIQIRPLAIPQASGAAATGRVLLQYIAIRTLRMGRDRGKA                                        	0.43012348355615493 	3.3333680629730225  


['QLPPLQYRPIGRASMTEGRTRRLIDQRFGCLRYCLLALRALPAPSGQLVPLASKPLRQSM',
 'AYRGLYSGVGRVTRQASCSAIVIPPQAAPQVPCQVWRILSTRKRLQAVVRRPDGRVPEPA',
 'YSAISIRPSIQEQPPRREIDFIQIRPLAIPQASGAAATGRVLLQYIAIRTLRMGRDRGKA']
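If you would rather catch typos before handing input to the designer, a small pre-check like the following could be run on the raw strings first. This is a hypothetical helper, not part of `idr_design`: it assumes only the 20 standard amino acids are supported, and it normalizes whitespace and case, which the package itself may or may not do.

```python
# Assumption: only the 20 standard amino acids are accepted.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Strip whitespace, upper-case, and reject non-standard residues."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("sequence is empty")
    invalid = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"unsupported residue(s): {', '.join(invalid)}")
    return seq


def parse_count(raw: str) -> int:
    """Parse the number of designs and require it to be at least 1."""
    n = int(raw)
    if n < 1:
        raise ValueError("number of designs must be at least 1")
    return n


print(clean_sequence(" mshv\nAVEN "), parse_count("3"))
```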

And that's it! Have fun designing sequences!