<h2>IDR Design</h2>

231005 - Tian Hao Huang, Julie Forman-Kay, Alan Moses

This notebook generates protein sequences with a target biophysical feature vector, as described in the following papers:
-   Zarin, T., Strome, B., Ba, A. N. N., Alberti, S., Forman‐Kay, J. D., & Moses, A. M. (2019). Proteome-wide signatures of function in highly diverged intrinsically disordered regions. eLife, 8. https://doi.org/10.7554/elife.46883
-   Zarin, T., Strome, B., Peng, G., Pritišanac, I., Forman‐Kay, J. D., & Moses, A. M. (2021). Identifying molecular features that are associated with biological function of intrinsically disordered protein regions. eLife, 10. https://doi.org/10.7554/elife.60220

The vectors consist of bulk molecular properties and short linear interaction motifs.
Using the example of the human helicase DDX3X, let's take a look at the biophysical feature vector for its N- and C-terminal IDRs:

In [None]:
from idr_design.feature_calculators.main import SequenceFeatureCalculator
from pandas import DataFrame

feature_calculator = SequenceFeatureCalculator()
ddx3x_n_idr = "MSHVAVENALGLDQQFAGLDLNSSDNQSGGSTASKGRYIPPHLRNREATKGFYDKDSSGWSSSKDKDAYSSFGSRSDSRGKSSFFSDRGSGSRGRFDDRGRSDYDGIGSRGDRSGFGKFERGGNSRWCDKSDEDDWSKPLPPSERLEQELFSGGNTGINFEKYDDIP"
ddx3x_c_idr = "YEHHYKGSSRGRSKSSRFSGGFGARDYRQSSGASSSSFSSSRASSSRSGGGGHGSSRGFGGGGYGGFYNSDGYGGNYNSQGVDWWGN"
features_unlabelled = feature_calculator.run_feats_multiple_seqs([ddx3x_n_idr, ddx3x_c_idr])
columns = feature_calculator.supported_features
features_labelled = DataFrame(features_unlabelled.values(), columns=columns, index=["N IDR", "C IDR"])

features_labelled

We can determine how similar these vectors are by calculating their Euclidean distance.

However, each feature has a different underlying variance, so we first rescale every feature to have variance 1 before calculating the distance.

First, we need to calculate the variance of each feature over a set of DisProt IDRs (downloaded on October 3rd, 2023):

In [None]:
from idr_design.feature_calculators.main import DistanceCalculator

dist_calculator = DistanceCalculator(feature_calculator, proteome_path="./tests/disprot_idrs_clean.fasta")
dist_calculator.proteome_variance

Then, we calculate the distance between the N-terminal IDR and the C-terminal IDR:

In [None]:
from math import sqrt

feats_n_idr = features_labelled.loc["N IDR"]
feats_c_idr = features_labelled.loc["C IDR"]
distance = sqrt(dist_calculator.sqr_distance(feats_n_idr, feats_c_idr))

distance
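
For intuition, here is a minimal NumPy sketch of a variance-scaled distance. It assumes that rescaling to variance 1 amounts to dividing each squared feature difference by that feature's variance. The function `scaled_sqr_distance` and the toy reference data are hypothetical illustrations, not the internals of `DistanceCalculator`, and the library may estimate variance differently (for example, sample rather than population variance).

```python
import numpy as np

def scaled_sqr_distance(a, b, variance):
    """Squared Euclidean distance with each feature rescaled to unit variance.

    Hypothetical helper for illustration; not part of idr_design.
    """
    a, b, variance = (np.asarray(x, dtype=float) for x in (a, b, variance))
    return float(np.sum((a - b) ** 2 / variance))

# Toy reference set: 4 sequences x 2 features on very different scales.
reference = np.array([
    [0.0, 100.0],
    [1.0, 300.0],
    [2.0, 500.0],
    [3.0, 700.0],
])
variance = reference.var(axis=0)  # population variance of each feature

x = reference[0]
y = reference[3]

# Without rescaling, the second feature dominates the distance.
unscaled = float(np.sqrt(np.sum((x - y) ** 2)))
# With rescaling, both features contribute on the same footing.
scaled = float(np.sqrt(scaled_sqr_distance(x, y, variance)))
print(unscaled, scaled)
```

In this toy example each feature contributes 7.2 to the scaled squared distance, whereas the unscaled distance is determined almost entirely by the second feature.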

Now, we're ready to design sequences with a target biophysical feature vector, even if the primary sequence is completely different!

We do this by starting with a random sequence and iteratively making sequence changes that greedily minimize the distance to the target vector.
Let's try a brute-force approach first: at each iteration, we try substituting every amino acid at every position and keep the best substitution out of all of those.

In [None]:
from idr_design.design_models.iter_guess_model import BruteForce
from idr_design.design_models.progress_logger import DisplayToStdout
from pandas import Series

designer_brute = BruteForce(dist_calculator, "2023", log=DisplayToStdout())
designed_seq = designer_brute.design_similar(1, ddx3x_c_idr, verbose=True)[0]
Series(feature_calculator.run_feats(designed_seq), index=feature_calculator.supported_features)

Now that's pretty slow! I don't even want to show you what happens on the longer N-terminal IDR.
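
To see why the brute-force search is slow, here is a self-contained sketch of the greedy loop described above. It is a sketch under stated assumptions, not the `BruteForce` implementation: `toy_features`, `brute_force_step` and `design_brute_force` are hypothetical names, and the two toy features (fraction of charged and of aromatic residues) stand in for the real feature vector. Each iteration evaluates 19 substitutions at every position, so the cost per step grows with sequence length.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_features(seq):
    """Hypothetical 2-feature vector: fraction of charged and aromatic residues."""
    n = len(seq)
    charged = sum(seq.count(aa) for aa in "DEKR") / n
    aromatic = sum(seq.count(aa) for aa in "FWY") / n
    return (charged, aromatic)

def sqr_distance(u, v):
    """Plain squared Euclidean distance (no variance scaling in this toy)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def brute_force_step(seq, target):
    """Try every single-residue substitution and return the best result found."""
    best_seq, best_dist = seq, sqr_distance(toy_features(seq), target)
    for i in range(len(seq)):
        for aa in AMINO_ACIDS:
            if aa == seq[i]:
                continue
            candidate = seq[:i] + aa + seq[i + 1:]
            dist = sqr_distance(toy_features(candidate), target)
            if dist < best_dist:
                best_seq, best_dist = candidate, dist
    return best_seq, best_dist

def design_brute_force(target, length, seed=0, max_steps=1000):
    """Start from a random sequence and apply greedy steps until none improves."""
    rng = random.Random(seed)
    seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    dist = sqr_distance(toy_features(seq), target)
    for _ in range(max_steps):
        new_seq, new_dist = brute_force_step(seq, target)
        if new_dist >= dist:  # no substitution improves: local minimum reached
            break
        seq, dist = new_seq, new_dist
    return seq, dist

# Use the first 21 residues of the C-terminal IDR as a toy target.
target = toy_features("YEHHYKGSSRGRSKSSRFSGG")
designed, final_dist = design_brute_force(target, length=21)
print(designed, final_dist)
```

The designed sequence matches the toy target features even though its primary sequence is unrelated to the target's.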

Let's try a better approach, which I have called the Random Multiple Changes approach. It guesses point mutations at random, finds several that decrease the distance, and then combines them to form guesses with multiple changes. Hence the name.

In [None]:
from idr_design.design_models.iter_guess_model import RandMultiChange

designer_rand = RandMultiChange(dist_calculator, "2023", log=DisplayToStdout())
designed_seq = designer_rand.design_similar(1, ddx3x_c_idr, verbose=True)[0]
Series(feature_calculator.run_feats(designed_seq), index=feature_calculator.supported_features)
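
The idea behind Random Multiple Changes can be sketched in the same toy setting. This is one possible reading of the description above, not the `RandMultiChange` implementation: the names `rand_multi_change_step` and `design_rand_multi`, the parameters `n_guesses` and `max_combined`, and the rule for combining mutations (apply the best few improving mutations cumulatively and keep the best combination) are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_features(seq):
    """Hypothetical 2-feature vector: fraction of charged and aromatic residues."""
    n = len(seq)
    charged = sum(seq.count(aa) for aa in "DEKR") / n
    aromatic = sum(seq.count(aa) for aa in "FWY") / n
    return (charged, aromatic)

def sqr_distance(u, v):
    """Plain squared Euclidean distance (no variance scaling in this toy)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def rand_multi_change_step(seq, target, rng, n_guesses=50, max_combined=3):
    """Sample random point mutations, then combine the ones that improve."""
    base = sqr_distance(toy_features(seq), target)

    # 1. Guess point mutations at random; keep the best improving one per position.
    improving = {}
    for _ in range(n_guesses):
        i = rng.randrange(len(seq))
        aa = rng.choice(AMINO_ACIDS)
        if aa == seq[i]:
            continue
        dist = sqr_distance(toy_features(seq[:i] + aa + seq[i + 1:]), target)
        if dist < base and (i not in improving or dist < improving[i][1]):
            improving[i] = (aa, dist)

    # 2. Apply the top mutations cumulatively and keep the best combination.
    ranked = sorted(improving.items(), key=lambda item: item[1][1])[:max_combined]
    best_seq, best_dist = seq, base
    residues = list(seq)
    for i, (aa, _) in ranked:
        residues[i] = aa
        candidate = "".join(residues)
        dist = sqr_distance(toy_features(candidate), target)
        if dist < best_dist:
            best_seq, best_dist = candidate, dist
    return best_seq, best_dist

def design_rand_multi(target, length, seed=0, max_steps=500):
    """Start from a random sequence and repeat random multi-change steps."""
    rng = random.Random(seed)
    seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    start_dist = dist = sqr_distance(toy_features(seq), target)
    for _ in range(max_steps):
        if dist == 0.0:
            break
        seq, dist = rand_multi_change_step(seq, target, rng)
    return seq, start_dist, dist

target = toy_features("YEHHYKGSSRGRSKSSRFSGG")
designed, start_dist, final_dist = design_rand_multi(target, length=21)
print(designed, start_dist, final_dist)
```

Each step here evaluates at most `n_guesses + max_combined` candidates regardless of sequence length, rather than 19 per position, and a single step can change several residues at once.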

And that's it! Have fun designing sequences with this notebook!