# Predict Helix Capping Residues #

The goal is to identify the residues just before an alpha helix begins and the residues just after the helix ends. This should improve secondary structure predictors, because they often extend helices too far or do not start them at the right place.
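
To make the target concrete, here is a minimal sketch that marks the flanking positions on a DSSP-style secondary structure string. It assumes that `H` denotes an alpha-helix residue and treats the single residue immediately before and after each helix as a cap. The function name `find_helix_caps` is hypothetical, and CapsDB's own annotations may define caps differently.

```python
def find_helix_caps(ss):
    """Return (n_caps, c_caps): indices of residues flanking each run of 'H'."""
    n_caps, c_caps = [], []
    for i, state in enumerate(ss):
        prev_is_helix = i > 0 and ss[i - 1] == 'H'
        next_is_helix = i + 1 < len(ss) and ss[i + 1] == 'H'
        if state != 'H' and next_is_helix:
            n_caps.append(i)  # residue just before a helix begins
        if state != 'H' and prev_is_helix:
            c_caps.append(i)  # residue just after a helix ends
    return n_caps, c_caps


# Two helices: positions 2-5 and 12-14 (0-based)
n_caps, c_caps = find_helix_caps("CCHHHHCCEEECHHHC")
```

A helix that starts at the first residue or ends at the last one has no flanking residue on that side, so no cap is reported there.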

CapsDB has annotated sequences of structures with helix capping residues that can be used to train a deep neural net. We will use a bidirectional LSTM with phi/psi features to see whether those features are good predictors.
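
Torsion angles are periodic, so -180° and 180° describe the same conformation even though they are numerically far apart. One common way to feed such angles to a network, shown here as an assumption rather than the encoding this notebook necessarily uses, is to replace each angle with its sine and cosine. The helper name `encode_torsion` is hypothetical.

```python
import math


def encode_torsion(phi_deg, psi_deg):
    """Map a (phi, psi) pair in degrees to [sin(phi), cos(phi), sin(psi), cos(psi)]."""
    phi, psi = math.radians(phi_deg), math.radians(psi_deg)
    return [math.sin(phi), math.cos(phi), math.sin(psi), math.cos(psi)]


# Roughly alpha-helical angles, used only as an illustration
features = encode_torsion(-60.0, -45.0)
```

With this encoding each residue contributes four features instead of two, and angles on either side of the ±180° wrap map to nearly identical vectors.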

## 1. Download data ##

## 2. Generate Features ##
### MMTF Pyspark Imports ###

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.webfilters import Pisces
from mmtfPyspark.filters import ContainsLProteinChain
from mmtfPyspark.mappers import StructureToPolymerChains
from mmtfPyspark.ml import ProteinSequenceEncoder

### Custom imports ###

In [None]:
import secondaryStructureExtractorFull

### Configure Spark Context ###

In [None]:
spark = SparkSession.builder.master("local[4]").appName("1-Features").getOrCreate()

### Read MMTF File and get a set of L-protein chains ###

In [None]:
pdb = mmtfReader.read_sequence_file('../resources/mmtf_reduced_sample/') \
                .flatMap(StructureToPolymerChains()) \
                .filter(ContainsLProteinChain())

### Get Torsion angle and secondary structure info ###

In [None]:
data = secondaryStructureExtractorFull.get_dataset(pdb)

### Filter out chains not in CapsDB ###

In [None]:
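# Requires the spark-hdf5 package: https://github.com/LLNL/spark-hdf5
# The package documents a Scala API; reading through the generic DataFrame
# reader with the same data source name and options is an assumption here.
capsdb = spark.read.format("gov.llnl.spark.hdf") \
              .option("dataset", "/dataset") \
              .load("path/to/file.h5")

# TODO: write intersect_caps_db_pdb() so that only chains present in CapsDB are kept
caps_pdb = pdb.filter(intersect_caps_db_pdb())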
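
The filter left to write above could take roughly the following shape. This is a sketch under assumptions: it supposes that each RDD element is a `(key, structure)` pair whose key looks like `1ABC.A` (PDB ID, dot, chain ID), and that the CapsDB entries have already been collected into a plain Python collection of identifiers in that same format. Neither the key format nor the CapsDB layout is confirmed by this notebook, and `make_capsdb_filter` is a hypothetical name.

```python
def make_capsdb_filter(capsdb_chain_ids):
    """Build a predicate that keeps (key, structure) pairs whose key is in CapsDB."""
    # A set gives constant-time membership tests. Matching is exact because
    # chain IDs can be case-sensitive.
    wanted = set(capsdb_chain_ids)

    def keep(pair):
        key, _structure = pair
        return key in wanted

    return keep


# Plain-Python demonstration; with Spark the same predicate would be passed to pdb.filter(...)
keep = make_capsdb_filter(["1ABC.A", "2XYZ.B"])
chains = [("1ABC.A", None), ("1ABC.B", None), ("2XYZ.B", None)]
kept = [key for key, _ in filter(keep, chains)]
```

Returning a closure keeps the CapsDB identifiers out of global state, so the predicate can be shipped to Spark workers as a single object.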

### Write features to H5 file ###

In [None]:
caps_pdb.write.mode('overwrite').format('hdf').save('./features.h5')

### Get truth labels and Save to H5 ###

In [None]:
#Write out truth.h5...

### Terminate Spark ###

In [None]:
spark.stop()

## 3. Build Bidirectional LSTM ##

In [None]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense

def create_model(num_features, num_outputs=2, latent_dim=100):
    """Create a Seq2Seq LSTM encoder-decoder
    From: https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
    
    Note: as in the blog post, the encoder below is a plain LSTM; it is not
    yet wrapped in a Bidirectional layer.
    
    Parameters
    ----------
    num_features : int
        The number of features in your training data
    num_outputs : int
        Number of outputs to predict, i.e. number of classes or 2 for binary
    latent_dim : int
        Dimensionality of the LSTM hidden state
        
    Returns
    -------
    A new Keras Seq2Seq LSTM model
    """
    
    # Define an input sequence and process it.
    encoder_inputs = Input(shape=(None, num_features))
    encoder = LSTM(latent_dim, return_state=True)
    encoder_outputs, state_h, state_c = encoder(encoder_inputs)
    
    # We discard `encoder_outputs` and only keep the states.
    encoder_states = [state_h, state_c]

    # Set up the decoder, using `encoder_states` as initial state.
    decoder_inputs = Input(shape=(None, num_outputs))
    
    # We set up our decoder to return full output sequences,
    # and to return internal states as well. We don't use the
    # return states in the training model, but we will use them in inference.
    decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                         initial_state=encoder_states)
    decoder_dense = Dense(num_outputs, activation='softmax')
    decoder_outputs = decoder_dense(decoder_outputs)
    
    # The training model maps [encoder_inputs, decoder_inputs] to decoder_outputs.
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    
    return model

In [None]:
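import numpy as np
from keras.utils import HDF5Matrix  # available in Keras 2.x

def train():
    # Features and labels written by the Spark steps above.
    X_data = HDF5Matrix('features.h5', 'data')
    y_data = HDF5Matrix('truth.h5', 'data')
    # Take the feature and class counts from the last axis of each dataset.
    model = create_model(X_data.shape[-1], y_data.shape[-1])
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    # TODO: automatically determine batch sizes and train/test splits.
    # The seq2seq model expects [encoder_inputs, decoder_inputs]. As an assumption,
    # the decoder input is the label sequence shifted by one step (teacher forcing).
    y = y_data[:]
    decoder_inputs = np.zeros_like(y)
    decoder_inputs[:, 1:] = y[:, :-1]
    model.fit([X_data[:], decoder_inputs], y)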