# Example Usage of sumonet

# Loading Data #

You can load data in two ways:

1) With the Encoding class -> Takes a data path or a list of sequences and outputs encoded vectors (one-hot, nlf, or blosum62)

2) With the Data class -> Takes no input and returns our dbPTM data, either in full or as a sample

### Data Class ###

#### You can use our data directly through the Data class ####

- The Data class returns X_train and X_test as raw sequences, so you need to encode them.
- y_train and y_test are lists, so you need to convert them to 2-d arrays.
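To make the label conversion concrete, here is a minimal NumPy sketch of turning a flat label list into a 2-d one-hot array. It assumes the labels are the integers 0 (negative) and 1 (positive); the helper name `labels_to_one_hot` is hypothetical and not part of sumonet. In this tutorial the Encoding class does the conversion for you.

```python
import numpy as np

def labels_to_one_hot(labels, num_classes=2):
    """Turn a flat list of integer class labels into a 2-d one-hot array."""
    labels = np.asarray(labels, dtype=int)
    one_hot = np.zeros((labels.size, num_classes), dtype=np.float32)
    # Row i gets a 1 in the column given by labels[i]
    one_hot[np.arange(labels.size), labels] = 1.0
    return one_hot

# Made-up labels for illustration: 1 = positive, 0 = negative (assumed convention)
toy_labels = [1, 0, 0, 1]
encoded_labels = labels_to_one_hot(toy_labels)
print(encoded_labels.shape)  # (4, 2)
```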

In [1]:
from sumonet.utils.load_data import Data

In [2]:
data = Data()

In [3]:
X_train, y_train, X_test, y_test = data.sample_data(ratio = 0.2) # ratio defaults to 0.4 in the class
# To use the entire data set, as we did, set ratio to 1.

In [4]:
print(f'A sample from X_train: {X_train[0]}')

A sample from X_train: LLPPSATASVKMEPENKYLPE


### Encode samples and convert label lists to 2-d vectors ###

### Encoding Class ###

In [5]:
from sumonet.utils.encodings import Encoding

#### Define Encoding class ####

The Encoding class takes 2 parameters: encoderType and scale.

- encoderType defaults to blosum62, based on our experiments, but you can also use one-hot or nlf.
- scale defaults to True, based on our experiments. It means that the data is passed through a min-max scaler. You can disable it if you want.
- You can change the encoder type with the set_encoder_type(encoderType) function.
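As a rough illustration of what a one-hot encoder and a min-max scaler do, here is a small NumPy sketch. It is not the library's implementation. The 21-letter alphabet (20 standard amino acids plus `X` for padding or unknown residues), its ordering, and the choice to scale over the whole matrix are all assumptions, and `one_hot_encode` and `min_max_scale` are hypothetical helper names.

```python
import numpy as np

# Assumed alphabet: 20 standard amino acids plus 'X' for padding/unknown.
# The ordering used inside sumonet may differ.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot_encode(sequence):
    """Encode a peptide as a (length, 21) matrix with one 1 per row."""
    matrix = np.zeros((len(sequence), len(ALPHABET)), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        # Residues outside the alphabet fall back to the 'X' column
        matrix[pos, INDEX.get(aa, INDEX["X"])] = 1.0
    return matrix

def min_max_scale(matrix):
    """Rescale all values to [0, 1] using the matrix-wide min and max (assumed)."""
    lo, hi = matrix.min(), matrix.max()
    if hi == lo:
        return np.zeros_like(matrix)
    return (matrix - lo) / (hi - lo)

peptide = "LLPPSATASVKMEPENKYLPE"  # the sample printed earlier in this tutorial
encoded_peptide = one_hot_encode(peptide)
print(encoded_peptide.shape)  # (21, 21)

# Scaling matters for signed scores such as substitution-matrix values (toy numbers)
toy_scores = np.array([[-4.0, 0.0], [4.0, 11.0]])
scaled_scores = min_max_scale(toy_scores)
```

A one-hot matrix already lies in [0, 1], so scaling leaves it unchanged. Scaling only has an effect on encodings with a wider value range.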

In [6]:
encoder = Encoding(encoderType='one-hot') # defaults: Encoding(encoderType = 'blosum62', scale = True)

In [7]:
X_train, y_train = encoder.get_encoded_vectors_from_data(X_train, y_train)
X_test, y_test = encoder.get_encoded_vectors_from_data(X_test, y_test)

In [8]:
print(f"Shape of the train and test samples are: X_train = {X_train.shape} || X_test = {X_test.shape}")
print(f"Shape of the train and test labels are: y_train = {y_train.shape} || y_test = {y_test.shape}")

Shape of the train and test samples are: X_train = (1912, 21, 21) || X_test = (211, 21, 21)
Shape of the train and test labels are: y_train = (1912, 2) || y_test = (211, 2)


### Or you can pass data paths to get encoded vectors (we use our own files in this tutorial) ###

#### Give the data paths ####

In [9]:
trainDataPath = "sumonet/data/train"
testDataPath = "sumonet/data/test"

dataPathPositiveTrain = trainDataPath+'/Sumoylation_pos_Train.fasta'
dataPathNegativeTrain = trainDataPath+'/Sumoylation_neg_Train.fasta'

dataPathPositiveTest = testDataPath+'/Sumoylation_pos_Test.fasta'
dataPathNegativeTest = testDataPath+'/Sumoylation_neg_Test.fasta'

In [10]:
# Let's first change the encoding type
encoder.set_encoder_type('blosum62')

### !! The order of the paths is important !! The positive path must come first, then the negative path ###

In [11]:
X_train, y_train = encoder.get_encoded_vectors_from_path(dataPathPositiveTrain,dataPathNegativeTrain)

In [12]:
X_test, y_test = encoder.get_encoded_vectors_from_path(dataPathPositiveTest,dataPathNegativeTest)

In [13]:
print(f"Shape of the train and test samples are: X_train = {X_train.shape} || X_test = {X_test.shape}")
print(f"Shape of the train and test labels are: y_train = {y_train.shape} || y_test = {y_test.shape}")

Shape of the train and test samples are: X_train = (19131, 21, 24) || X_test = (2126, 21, 24)
Shape of the train and test labels are: y_train = (19131, 2) || y_test = (2126, 2)
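To see why the order of the two paths matters, here is a minimal sketch of path-based loading: sequences are read from a positive and a negative FASTA source, and labels are assigned purely by which source a sequence came from. This is an assumption about the general approach, not the library's code. `read_fasta` and `load_labeled` are hypothetical helpers, the records are made up, and in-memory streams stand in for files.

```python
import io
import numpy as np

def read_fasta(handle):
    """Collect the sequences from a FASTA stream, skipping header lines."""
    sequences, current = [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences

def load_labeled(positive_handle, negative_handle):
    """Label every sequence by the argument position of its source."""
    positives = read_fasta(positive_handle)
    negatives = read_fasta(negative_handle)
    samples = positives + negatives
    # Assumed convention: 1 = positive, 0 = negative
    labels = np.array([1] * len(positives) + [0] * len(negatives))
    return samples, labels

# Toy in-memory "files" with made-up records
toy_pos = io.StringIO(">p1\nLLPPSATASVKMEPENKYLPE\n")
toy_neg = io.StringIO(">n1\nAAAAAAAAAAKAAAAAAAAAA\n>n2\nGGGGGGGGGGKGGGGGGGGGG\n")
samples, labels = load_labeled(toy_pos, toy_neg)
print(labels)  # [1 0 0]
```

Swapping the two arguments would flip every label without raising an error, which is why the positive path has to come first.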


### Now our data is ready ###

## SUMOnet Model ##

- You can use our architecture with randomly initialized weights

- You can also use our pre-trained model

#### Let's import SUMOnet ####

In [15]:
from sumonet.model.architecture import SUMOnet


#### You can use our architecture with randomly initialized weights ####

In [16]:
model = SUMOnet()

### To see a summary of the model, you first need to build it with an input shape ###

In [17]:
input_shape = X_train.shape

##### The build function takes the entire shape because its first dimension is the batch size #####

In [18]:
model.build(input_shape)

##### model.summary() does not show output shapes because SUMOnet is a subclassed model #####

In [19]:
model.summary()

Model: "sum_onet"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d (Conv1D)              multiple                  6272      
_________________________________________________________________
bidirectional (Bidirectional multiple                  14016     
_________________________________________________________________
global_average_pooling1d (Gl multiple                  0         
_________________________________________________________________
dense (Dense)                multiple                  2112      
_________________________________________________________________
dropout (Dropout)            multiple                  0         
_________________________________________________________________
activation (Activation)      multiple                  0         
_________________________________________________________________
dense_1 (Dense)              multiple                  832

#### Let's compile and train our model ####

In [20]:
model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])


In [21]:
model.fit(X_train,y_train,epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7feb3a3f1240>

### You can use the pre-trained model ###

- Calling the load_weights function on SUMOnet loads the weights of our provided model, SUMOnet-3.
- Again, you need to build the model with an input shape first.

In [22]:
from sumonet.model.architecture import SUMOnet


In [24]:
SUMOnet3_model = SUMOnet()
SUMOnet3_model.build(input_shape)

#### Let's load weights of pre-trained model ####

In [25]:
SUMOnet3_model.load_weights()

#### Now we can predict ####

In [30]:
y_preds = SUMOnet3_model.predict(X_test)

### Let's evaluate results ###

#### Import the evaluate function, which is organized according to our evaluation set-up ####

In [28]:
from sumonet.evaluation.metrics import evaluate

The evaluate function takes 3 arguments:
- y_test -> Gold labels must be 1-d, so if yours are 2-d like ours, use argmax(-1)
- y_pred -> Predictions, passed as the 2-d array returned by predict
- A string or an array of metric names
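The argmax(-1) step can be illustrated with plain NumPy. The arrays below are made-up toy values, not outputs of the model. Taking the argmax over the last axis turns each one-hot row into the index of its class.

```python
import numpy as np

# Toy 2-d one-hot gold labels: column 0 = negative, column 1 = positive (assumed)
y_true_2d = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
y_true_1d = y_true_2d.argmax(-1)
print(y_true_1d)  # [0 1 1 0]

# The same call turns 2-d class scores into hard class predictions
toy_scores = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.8, 0.2]])
y_hat = toy_scores.argmax(-1)
print(y_hat)  # [0 1 0 0]
```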


#### You can calculate results one-by-one ####

In [31]:
f1_score = evaluate(y_test.argmax(-1),y_preds,'f1')
mcc = evaluate(y_test.argmax(-1),y_preds,'mcc')
roc = evaluate(y_test.argmax(-1),y_preds,'roc')
aupr = evaluate(y_test.argmax(-1),y_preds,'aupr')

In [32]:
print("F1 score: ", f1_score)
print("MCC score: ", mcc)
print("ROC score: ", roc)
print("AUPR score: ", aupr)

F1 score:  {'f1': 0.6580921757770631}
MCC score:  {'mcc': 0.5694399870602478}
ROC score:  {'roc': 0.8713018549625735}
AUPR score:  {'aupr': 0.7598319565641193}


#### You can calculate all results at once ####

- This call returns a single dictionary with all requested metrics

In [29]:
evaluate(y_test.argmax(-1),y_preds,['f1','mcc','roc','aupr'])

{'aupr': 0.7598319565641193,
 'f1': 0.6580921757770631,
 'mcc': 0.5694399870602478,
 'roc': 0.8713018549625735}
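For readers who want to see what the f1 and mcc numbers measure, here is a small sketch that computes both from confusion-matrix counts using their standard definitions. It is not the evaluate function's implementation. The helper names are hypothetical and the labels are toy values.

```python
import math
import numpy as np

def confusion_counts(y_true, y_hat):
    """Count true/false positives and negatives for binary 0/1 labels."""
    y_true, y_hat = np.asarray(y_true), np.asarray(y_hat)
    tp = int(np.sum((y_true == 1) & (y_hat == 1)))
    tn = int(np.sum((y_true == 0) & (y_hat == 0)))
    fp = int(np.sum((y_true == 0) & (y_hat == 1)))
    fn = int(np.sum((y_true == 1) & (y_hat == 0)))
    return tp, tn, fp, fn

def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy gold labels and hard predictions (made up for illustration)
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(toy_true, toy_pred)
toy_f1 = f1_from_counts(tp, fp, fn)
toy_mcc = mcc_from_counts(tp, tn, fp, fn)
print(tp, tn, fp, fn)  # 2 4 1 1
```

Both metrics work on hard class predictions, whereas roc and aupr are computed from the predicted scores.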