# Lab 1: Perceptron
In this tutorial we introduce some of the concepts for working with neural networks using [PyTorch](https://pytorch.org/tutorials/recipes/recipes_index.html). Because time is limited in this first lab session, the entire notebook can be executed as-is. We encourage you to explore the code yourselves to get comfortable with the concepts of deep learning in the context of biology. A few questions at the end challenge you to play around with the code and try things for yourselves.

In this session, you will create a simple neural network that classifies any given DNA sequence as protein coding or not. As positive examples, we use the coding DNA sequences from humans (Homo sapiens (HS)). As negatives, we use random sequences of DNA where each nucleotide is drawn from a uniform distribution over the possible nucleotides. We then train a neural network on the [codon frequencies](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables) of these sequences.

In addition to human DNA sequences, we also take a look at coding sequences from mice ([Mus musculus (MM)](https://en.wikipedia.org/wiki/House_mouse)) and yeast ([Saccharomyces cerevisiae (SC)](https://en.wikipedia.org/wiki/Saccharomyces_cerevisiae)). There are subtle differences between the codon frequencies of these species. You will test how well your human-trained model is able to recover the coding sequences for mice and yeast (think of your results in the context of evolutionary distances between species).


In [None]:

# import pytorch
import torch
import torch.nn as nn
from torch import Tensor
from torch import optim
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from sklearn.model_selection import KFold

# import basic functionality
import random
import numpy as np
import pandas as pd

# libraries for plotting
import seaborn as sns
import matplotlib.pyplot as plt


import Bio
from Bio import SeqIO

# Step 1: Pre-processing the data
Here we download and pre-process the dataset. We only consider DNA sequences that are protein coding, contain an integer number of codons, have a start and stop codon, do not contain any uncertain nucleotides, and are at least 300 nucleotides (100 codons) long. Finally, we remove duplicates and randomly shuffle the sequences.
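
The filters described above can be sketched on plain strings as follows. This is only an illustration: `passes_cds_filters` and `STOP_CODONS` are hypothetical names that do not appear in the notebook, the sketch assumes the standard genetic code, and it leaves out the minimum-length filter. The notebook itself checks the start and stop by translating each sequence with Biopython.

```python
# Hypothetical helper (not part of the notebook): applies the filters
# described above to a single DNA string, without Biopython.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code

def passes_cds_filters(seq):
    """Return True if seq looks like a complete, unambiguous coding sequence."""
    seq = seq.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False          # not an integer number of codons
    if not seq.startswith("ATG"):
        return False          # no start codon
    if seq[-3:] not in STOP_CODONS:
        return False          # no stop codon
    if "N" in seq:
        return False          # contains an uncertain nucleotide
    return True

# made-up example sequences: valid, wrong length, no start, uncertain base, duplicate
candidates = ["ATGGCCTAA", "ATGGCCTA", "GCCGCCTAA", "ATGNCCTAA", "ATGGCCTAA"]

# keep valid sequences and drop duplicates by collecting them in a set
unique_valid = sorted({s for s in candidates if passes_cds_filters(s)})
print(unique_valid)  # ['ATGGCCTAA']
```

Only the first candidate survives: the others fail one filter each, and the final entry is a duplicate of the first.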

In [None]:

# download and unpack DNA coding sequences for human, mouse and yeast
############################

# store the files in all_seqs/ inside the current working directory,
# which is where the gzip commands and the loading code below look for them
!mkdir -p all_seqs

!wget -P all_seqs/ https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
!gzip -df "all_seqs/Homo_sapiens.GRCh38.cds.all.fa.gz"

!wget -P all_seqs/ https://ftp.ensembl.org/pub/current_fasta/saccharomyces_cerevisiae/cds/Saccharomyces_cerevisiae.R64-1-1.cds.all.fa.gz
!gzip -df "all_seqs/Saccharomyces_cerevisiae.R64-1-1.cds.all.fa.gz"

!wget -P all_seqs/ https://ftp.ensembl.org/pub/current_fasta/mus_musculus/cds/Mus_musculus.GRCm39.cds.all.fa.gz
!gzip -df "all_seqs/Mus_musculus.GRCm39.cds.all.fa.gz"


In [None]:

# function that loads and processes a FASTA file containing coding sequences
def load_species_cds(file_name):
    train_cds = []
    train_prot = []
    for record in SeqIO.parse(file_name, "fasta"):
        # ensure that sequences are protein coding
        if 'gene_biotype:protein_coding' in record.description:
            if 'transcript_biotype:protein_coding' in record.description:
                if ' cds ' in record.description:
                    if len(record.seq) % 3 == 0:
                        train_cds.append(str(record.seq))
                        train_prot.append(str(record.seq.translate()))
                        
    # keep sequences whose translation starts with a start codon (M) and ends with a stop codon (*)
    train_cds_filtered = []
    train_prot_filtered = []
    for i in range(len(train_prot)):
        if (train_prot[i][0]=='M') and (train_prot[i][-1]=='*'):
            train_cds_filtered.append(train_cds[i])
            train_prot_filtered.append(train_prot[i])

    # avoid sequences with undetermined/uncertain nucleotides
    # restrict to sequences with at least 100 aa for codon frequency estimation
    train_prot_filtered = [train_prot_filtered[i] for i in range(len(train_cds_filtered)) if ('N' not in train_cds_filtered[i]) and (len(train_cds_filtered[i])>=300)]
    train_cds_filtered = [train_cds_filtered[i] for i in range(len(train_cds_filtered)) if ('N' not in train_cds_filtered[i]) and (len(train_cds_filtered[i])>=300)]

    
    # remove duplicates and randomly mix the list of sequences
    seqs = list(zip(train_cds_filtered, train_prot_filtered))
    seqs = list(set(seqs))
    random.shuffle(seqs)
    train_cds_filtered, train_prot_filtered = zip(*seqs)
    
    return list(train_cds_filtered), list(train_prot_filtered)


In [None]:

# load coding sequences for different species
print('loading human proteins')
cds_hs, prot_hs = load_species_cds("all_seqs/Homo_sapiens.GRCh38.cds.all.fa")

print('loading yeast proteins')
cds_sc, prot_sc = load_species_cds("all_seqs/Saccharomyces_cerevisiae.R64-1-1.cds.all.fa")

print('loading mouse proteins')
cds_mm, prot_mm = load_species_cds("all_seqs/Mus_musculus.GRCm39.cds.all.fa")

# take a look at some sequences
[cds_hs[i][0:40]+'...' for i in range(5)]

# Step 2: Encoding sequences as codon frequencies
The steps above give us a set of unique coding sequences for humans, mice and yeast. To train a neural network on the codon frequencies of these sequences, we encode each sequence as an array of frequencies, one per possible codon. Each codon is assigned an index in the array. We first create a dictionary that contains all possible codons for a given codon length and the alphabet of the input sequences. This dictionary then allows us to convert between codons (e.g., 'ATG') and indices in the array (e.g., 'ATG' -> 0) to keep track of the codon frequencies.

Biologically, [DNA codons](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables) consist of three nucleotides, encoding amino acids. However, since we are training a neural network to classify a sequence as protein coding or not, we can choose any number of nucleotides to represent a 'codon'. For example, we can choose a "codon length" (codon_length) of a single nucleotide (which would result in us training the model on the [frequencies of nucleotides in DNA](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403801/)), or a codon length of two nucleotides (no biological meaning as this does not represent a biological unit - we do not expect a model to learn any biology at all), or a codon length of 6 nucleotides (representing pairs of amino acids - would this yield a model that "learns" any biology?). You can play around with the codon_length yourself, but we start with a codon length of 3 nucleotides - representing exactly one amino acid. Finally, you could also train the model on the frequencies of amino acids by using protein sequences as input. You can try this yourself by changing the training_seqs variable to prot_hs that was defined above (protein sequences for humans).
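
To get a feeling for what the codon length changes, the sketch below splits a made-up sequence into chunks of different lengths and prints how many distinct 'codons' (and therefore input features) each choice gives. It assumes a four-letter alphabet (A, C, G, T); `split_into_codons` and `toy_seq` are illustrative names, not part of the notebook.

```python
# Illustration only: how the choice of codon length changes the features.
def split_into_codons(seq, codon_length):
    """Cut seq into consecutive, non-overlapping chunks of codon_length."""
    chunks = [seq[i:i + codon_length] for i in range(0, len(seq), codon_length)]
    # discard a trailing chunk that is shorter than codon_length
    return [c for c in chunks if len(c) == codon_length]

toy_seq = "ATGGCCAAGTAA"  # made-up 12-nucleotide sequence

for k in (1, 2, 3, 6):
    n_features = 4 ** k   # number of possible 'codons' over A, C, G, T
    print(k, n_features, split_into_codons(toy_seq, k))
```

The number of features grows as 4 to the power of the codon length (4, 16, 64, 4096 here), while the number of chunks per sequence shrinks, so longer 'codons' give the model many more inputs that are each estimated from fewer observations.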

In [None]:
training_seqs = cds_hs.copy()
training_seqs_mm = cds_mm.copy()
training_seqs_sc = cds_sc.copy()

In [None]:
# extract the alphabet for the current set of sequences
# (sorted, so that the codon-to-index mapping is the same in every run)
import itertools
alphabet = sorted(set(itertools.chain.from_iterable(set(cur_seq) for cur_seq in training_seqs)))

# number of nucleotides per codon
codon_length = 3
codons = [''.join(cur_tok) for cur_tok in list(itertools.product(alphabet, repeat = codon_length))]

# define dictionary for encoding the codons
codon_to_int = dict((c, i) for i, c in enumerate(codons))
int_to_codon = dict((i, c) for i, c in enumerate(codons))

# take a look at the alphabet, codons, and encoding
alphabet, codons[0:5],int_to_codon

Having established a dictionary to convert between codons and array indices, we can encode our DNA sequences as codon frequencies. Here, we define two functions to do that for us. The first converts DNA sequences to arrays of codon frequencies using the defined dictionary. The second creates, for each DNA sequence, a random sequence with the same number of codons, so that the random samples follow the same length distribution as the DNA sequences. The codons of these random sequences are drawn from a uniform distribution over the possible codons.

Finally, for both functions, each sequence gets assigned a label that we can use to train our model (i.e., label = 1 tells the model that a sequence is protein coding, label = 0 tells the model that a sequence is not protein coding). 
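
A toy version of this encoding is sketched below, assuming the alphabet A, C, G, T and a codon length of 3. The names `toy_codons`, `toy_index` and `toy_codon_freq` are made up for this illustration and are kept separate from the notebook's own variables.

```python
import itertools
import random

# Toy version of the encoding, assuming the alphabet A, C, G, T and codon length 3.
toy_codons = ["".join(c) for c in itertools.product("ACGT", repeat=3)]
toy_index = {codon: i for i, codon in enumerate(toy_codons)}

def toy_codon_freq(seq):
    """Return a list of 64 relative codon frequencies for seq."""
    counts = [0] * len(toy_codons)
    n = len(seq) // 3
    for i in range(n):
        counts[toy_index[seq[3 * i:3 * i + 3]]] += 1
    return [c / n for c in counts]

# a made-up coding sequence (would get label 1)
coding = toy_codon_freq("ATGGCCGCCTAA")

# a random sequence with the same number of codons (would get label 0)
rng = random.Random(0)
random_seq = "".join(rng.choice(toy_codons) for _ in range(4))
noise = toy_codon_freq(random_seq)

print(len(coding))                 # 64 features, one per codon
print(coding[toy_index["GCC"]])    # 0.5, because GCC makes up 2 of the 4 codons
```

Both `coding` and `noise` are vectors of the same length whose entries sum to 1, regardless of how long the underlying sequence is.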

In [None]:

# function to encode DNA sequence into codons and convert to codon frequency using defined codon dictionary
def encode_seq_to_codon_freq(sequences, use_label):
    sequences_freqs = []
    sequences_labels = []

    for cur_seq in sequences:
        # initialize array for codon counts
        seq_to_codon_freq = np.zeros(len(codons))

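        # split sequence into codons, discarding an incomplete trailing codon
        seq_codons = [cur_seq[i:i+codon_length] for i in range(0, len(cur_seq), codon_length)]
        seq_codons = [cur_codon for cur_codon in seq_codons if len(cur_codon)==codon_length]

        # count codon frequencies and normalize by the number of counted codons
        for cur_codon in seq_codons:
            seq_to_codon_freq[codon_to_int[cur_codon]]+=1
        seq_to_codon_freq /= max(len(seq_codons), 1)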

        # add codon frequencies and label for current sequence to collection
        sequences_freqs.append( seq_to_codon_freq )
        sequences_labels.append(use_label)
    return sequences_freqs, sequences_labels


# function to generate random codon frequency sample given nr of codons in sequences
def random_codon_freq(sequences, use_label):
    sequences_freqs = []
    sequences_labels = []

    for cur_seq in sequences:
        # initialize array for codon counts
        seq_to_codon_freq = np.zeros(len(codons))
        
        # define nr of codons in sequence
        nr_codons = round(len(cur_seq)/codon_length)

        # generate random sequence of codons given the current sequence length
        seq_rnd = list(np.random.randint(low=0, high=len(codons), size = nr_codons, dtype=int))

        # count codon frequencies
        for i in seq_rnd:
            seq_to_codon_freq[i]+=1
        seq_to_codon_freq /= nr_codons

        # add codon frequencies and label for current sequence to collection
        sequences_freqs.append( seq_to_codon_freq )
        sequences_labels.append(use_label)
    return sequences_freqs, sequences_labels

In [None]:

# using functions and codons as defined above, encode sequences with their codon frequencies
# arguments: sequences, label (1 = protein coding, 0 = not protein coding)
seqs_codon_freqs_hs, seqs_codon_labels_hs = encode_seq_to_codon_freq(training_seqs, 1)
rnd_codon_freqs_hs,  rnd_codon_labels_hs = random_codon_freq(training_seqs, 0)

# mouse and yeast sequences are assigned label 0, the label they need when used as negative samples (see the questions)
seqs_codon_freqs_mm, seqs_codon_labels_mm = encode_seq_to_codon_freq(training_seqs_mm, 0)
seqs_codon_freqs_sc, seqs_codon_labels_sc = encode_seq_to_codon_freq(training_seqs_sc, 0)

In [None]:
# show us one sample (codon frequencies and label)
seqs_codon_freqs_hs[0], seqs_codon_labels_hs[0]

We have now encoded all DNA sequences in the same format: an array of codon frequencies of length 64 and a label. In other words, we converted the DNA sequences (variable-length text) into a consistent format (a fixed-length array of numbers) in which each position has the same meaning across sequences, so the sequences can be used as input for a model. To illustrate this, we plot the codon frequencies across species and our random sequences.

In [None]:

# plot codon frequencies for different species
# initialize figure
sns.set(rc={'figure.figsize':(5,15)})
sns.set(font="Arial")
sns.set(style="whitegrid")

# average and merge codon frequencies of the different species and random sequences
codon_freqs = pd.concat([pd.DataFrame(np.mean(np.array(seqs_codon_freqs_hs),axis=0)),
                         pd.DataFrame(np.mean(np.array(seqs_codon_freqs_mm),axis=0)),
                         pd.DataFrame(np.mean(np.array(seqs_codon_freqs_sc),axis=0)),
                         pd.DataFrame(np.mean(np.array(rnd_codon_freqs_hs),axis=0))
                         ],axis=1)

# label codons and sort by human frequency
codon_freqs.index = [int_to_codon[cur_codon] for cur_codon in list(codon_freqs.index)]
codon_freqs.reset_index(inplace=True)
codon_freqs.columns = ['codon','human','mouse','yeast','random']
codon_freqs.sort_values(by='human',ascending=False,inplace=True)

# stack dataframe of frequencies
plot_freqs = codon_freqs.set_index('codon').stack().reset_index()
plot_freqs.columns = ['codon','origin','frequency']

# plot codon frequencies for different species
sns.barplot(data=plot_freqs,x='frequency',y='codon',hue='origin')
plt.tight_layout()
plt.show()


# Step 3: Creating a dataloader
Having encoded our DNA sequences as codon frequencies, we are ready to prepare the data for training a neural network. We will create a 'dataloader' that converts the arrays into tensors (the data format PyTorch works with) and takes care of shuffling the data and splitting it into batches.
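
What the dataloader does with the samples can be sketched in plain Python. This is not how PyTorch implements it; `iterate_batches` is a made-up helper that mimics random sampling and the `drop_last=True` option, using the 5000 training samples and batch size of 300 that the notebook sets further down.

```python
import random

def iterate_batches(samples, batch_size, drop_last=True, seed=0):
    """Yield shuffled batches; optionally drop a final, smaller batch."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)          # visit the samples in random order
    for start in range(0, len(order), batch_size):
        batch = [samples[i] for i in order[start:start + batch_size]]
        if drop_last and len(batch) < batch_size:
            break                               # skip the incomplete last batch
        yield batch

samples = list(range(5000))          # stand-in for 5000 encoded sequences
batches = list(iterate_batches(samples, batch_size=300))
print(len(batches), sum(len(b) for b in batches))  # 16 4800
```

With these numbers, each epoch consists of 16 batches and 200 samples are left out. Because the order is reshuffled by the real dataloader every epoch, different samples are left out each time.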

In [None]:

# Function defining the data loader
def get_dataloader_simple(train_samples_x, train_samples_y, batch_size):
    train_x = np.array(train_samples_x)
    train_y = np.array(train_samples_y)
    train_data = TensorDataset(torch.from_numpy(train_x).float(), torch.from_numpy(train_y).float())
    
    train_sampler = RandomSampler(train_data)
    train_dl = DataLoader(
        dataset=train_data,
        batch_size=batch_size,
        sampler=train_sampler,
        drop_last=True
    )
    return train_dl


In [None]:

# create a set of positive and negative samples, print number of available samples
positive_set = (seqs_codon_freqs_hs, seqs_codon_labels_hs)
negative_set = (rnd_codon_freqs_hs, rnd_codon_labels_hs)
len(positive_set[0]), len(negative_set[0])


In [None]:

# define number of samples to take for training and validation. We pick small numbers to speed up the training process
batch_size = 300
nr_train_samples = 5000
nr_val_samples = 2000

# define training (train+test) and validation data by mixing a balanced number of positive and negative samples
nr_train_samples = round(nr_train_samples/2)
nr_val_samples = round(nr_val_samples/2)
train_dl = get_dataloader_simple(positive_set[0][0:nr_train_samples]+negative_set[0][0:nr_train_samples], 
                                 positive_set[1][0:nr_train_samples]+negative_set[1][0:nr_train_samples],
                                 batch_size)

val_dl = get_dataloader_simple(positive_set[0][nr_train_samples:(nr_train_samples+nr_val_samples)]+negative_set[0][nr_train_samples:(nr_train_samples+nr_val_samples)],
                               positive_set[1][nr_train_samples:(nr_train_samples+nr_val_samples)]+negative_set[1][nr_train_samples:(nr_train_samples+nr_val_samples)],
                               1)


# Step 4: Define model
As a final preparation, we define our [model](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). We create a class that tells PyTorch which architecture the model follows. The class initializes all relevant parts (init function) and defines how to compute the output for a given input (forward function). For our perceptron, we use a single dense linear layer with a sigmoid activation function. You can play around with the model architecture later - several options are left as comments.
We are training a model to solve a classification problem. The labels for our samples are binary (ones and zeros). We therefore use a binary loss function (binary cross-entropy).
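
To make the computation concrete, here is a minimal plain-Python sketch of what a single linear layer followed by a sigmoid computes for one sample, together with the binary cross-entropy loss. The function names and the three-feature input are made up for illustration; the notebook's model does the same with 64 inputs and PyTorch tensors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_forward(x, weights, bias):
    """Weighted sum of the inputs followed by a sigmoid, giving a value in (0, 1)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

def binary_cross_entropy(p, y):
    """Loss for one sample with predicted probability p and label y (0 or 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

x = [0.5, 0.25, 0.25]          # made-up 3-feature input (the notebook uses 64)
weights = [0.0, 0.0, 0.0]      # a model whose parameters are all zero
p = perceptron_forward(x, weights, bias=0.0)

print(p)                            # 0.5: the model is undecided
print(binary_cross_entropy(p, 1))   # log(2), about 0.693
```

A loss of about 0.693 is therefore what an uninformative model scores; a trained model should get well below that value.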

In [None]:

# Define the model architecture
class myModel(nn.Module):
    def __init__(self, input_param, hidden_param, output_param, dropout_prob):
        super(myModel, self).__init__()
        
        self.input_param = input_param
        self.hidden_param = hidden_param
        self.output_param = output_param
        
#        self.dropout = nn.Dropout(dropout_prob)
        self.linear0 = nn.Linear(input_param, output_param)
#        self.linear1 = nn.Linear(input_param, hidden_param)
#        self.relu = nn.ReLU()
#        self.linear2 = nn.Linear(hidden_param, output_param)
        self.sigmoid = nn.Sigmoid()

    def forward(self, inp):
#        inp_drop = self.dropout(inp)
        layer0 = self.linear0(inp)
        return self.sigmoid(layer0)

#        layer1 = self.linear1(inp)
#        layer1_act = self.relu(layer1)
#        layer2 = self.linear2(layer1_act)
#        return self.sigmoid(layer2)


In [None]:

# Define the device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [None]:

# use binary cross entropy loss for this classification problem
criterion = nn.BCELoss()

# initialize an instance of our model class (a variable that is a model following the architecture we defined above)
model = myModel(len(codon_to_int), 20, 1, 0).to(device)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_parameters(model))
model


# Step 5: Train simple model 
To train our model, we need a "training loop".
1) First, we tell PyTorch that we want to train our model (so it has to keep track of gradients).
2) We then iterate over our data in batches to speed up computations (there is little advantage in computing the gradient on all samples rather than on, say, a few hundred samples).
3) We set the gradients to zero (we don't want to re-use previous computations for our next training step).
4) We compute the output of the model for the given input sequences.
5) We then compute the loss of the model output for the given target labels of the input sequences and [backpropagate](https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html) the loss through the network to compute the gradient.
6) Finally, we instruct the optimizer to use the gradient and perform one appropriately-sized [step](https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.step.html) to update the model weights.
7) To keep track of our progress, we compute the accuracy and training loss for the current samples.

Finally, we train our model using the given data and training loop.
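
Steps 3 to 6 can be written out by hand for a sigmoid perceptron. The sketch below is a simplified stand-in, not the notebook's training loop: it uses plain gradient descent instead of the optimizer used in the notebook, and `sgd_step` and the two-sample batch are made up for illustration.

```python
import math

def sgd_step(weights, bias, batch, lr):
    """One gradient-descent update of a sigmoid perceptron on a batch of (x, y) pairs."""
    grad_w = [0.0] * len(weights)           # start from zero gradients (step 3)
    grad_b = 0.0
    total_loss = 0.0
    for x, y in batch:
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        p = 1.0 / (1.0 + math.exp(-z))      # model output (step 4)
        total_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        # for sigmoid + binary cross-entropy, d(loss)/dz = p - y (step 5)
        for j, xi in enumerate(x):
            grad_w[j] += (p - y) * xi / len(batch)
        grad_b += (p - y) / len(batch)
    # move the parameters against the gradient (step 6)
    new_w = [w - lr * g for w, g in zip(weights, grad_w)]
    new_b = bias - lr * grad_b
    return new_w, new_b, total_loss / len(batch)

batch = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]   # two made-up samples
w, b = [0.0, 0.0], 0.0
losses = []
for _ in range(50):
    w, b, step_loss = sgd_step(w, b, batch, lr=1.0)
    losses.append(step_loss)

print(losses[0], losses[-1])   # the loss starts at log(2) and decreases
```

The weight of the feature that marks the positive sample becomes positive and the other becomes negative, which is exactly the kind of pattern the perceptron learns for codons that are over- or under-represented in coding sequences.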

In [None]:

# Define the training loop
def train(model, train_loader, optimizer, device):
    # training mode
    model.train(True)
    
    # Enabling gradient calculation
    with torch.set_grad_enabled(True):
        collect_loss = 0
        correct = 0
        nr_samples = 0
        for batch_idx, (data, target) in enumerate(train_loader):
            # send features and labels to GPU/CPU
            data, target = data.to(device), target.to(device)

            # zero the gradients
            model.zero_grad()
            optimizer.zero_grad()

            # compute output of model
            output = model(data)

            # compute the loss and update model parameters
            loss = criterion(output, target.unsqueeze(1))
            loss.backward()

            # adjust learning weights
            optimizer.step()
            
            # store training loss
            collect_loss += loss.item()
            
            # compute accuracy of training data
            pred = torch.round(output,decimals=0)
            correct += (pred.eq(target.view_as(pred)).sum().item())
            nr_samples += len(target)
            
        return collect_loss*batch_size/nr_samples, correct/nr_samples
    

In [None]:
n_epochs = 30

# initialize model
model = myModel(len(codon_to_int), 20, 1, 0).to(device)

# use the Adam optimizer (an adaptive variant of stochastic gradient descent) with the given learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# Train the model for n_epochs epochs
for epoch in range(1, n_epochs + 1):
    # train the model and get training loss
    train_loss = train(model, train_dl, optimizer, device)
    print(epoch, train_loss[0], train_loss[1])

Similar to the training loop, we need a "test loop" to get the output of the model for a given set of validation samples on which we do not train the model.
1) First, we tell PyTorch that we do NOT want to train our model (no keeping track of gradients - evaluation mode).
2) We then iterate over our validation data.
3) We compute the output of the model for the given input sequences.
4) We then compute the loss of the model output for the given target labels of the input sequences.
5) We compute the accuracy and validation loss for the current samples.
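
Steps 4 and 5 boil down to the following computation, shown here in plain Python on made-up model outputs. `evaluate` is an illustrative name; the sketch assumes a decision threshold of 0.5, which is what rounding the sigmoid output amounts to.

```python
import math

def evaluate(probabilities, labels):
    """Return (mean loss, accuracy) without changing any model parameters."""
    total_loss = 0.0
    correct = 0
    for p, y in zip(probabilities, labels):
        total_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        predicted_label = 1 if p > 0.5 else 0   # threshold the probability
        correct += int(predicted_label == y)
    n = len(labels)
    return total_loss / n, correct / n

probs = [0.9, 0.8, 0.3, 0.6]   # made-up model outputs
labels = [1, 1, 0, 0]
mean_loss, accuracy = evaluate(probs, labels)
print(accuracy)   # 0.75: three of the four samples are on the correct side of 0.5
```

Note that the loss and the accuracy measure different things: the loss also penalizes correct but unconfident predictions, so it can keep changing while the accuracy stays the same.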

In [None]:

# define the test loop
def test(model, test_loader, device, batched = True):
    # Evaluation mode
    model.eval()
    
    with torch.no_grad():
        collect_loss = 0
        correct = 0
        nr_samples = 0
        for data, target in test_loader:
            # send features and labels to GPU/CPU
            data, target = data.to(device), target.to(device)
            
            # compute output of model
            output = model(data)

            # store test loss
            collect_loss += criterion(output, target.unsqueeze(1)).item()
            
            # compute accuracy for test data
            pred = torch.round(output,decimals=0)
            correct += (pred.eq(target.view_as(pred)).sum().item())
            nr_samples += len(target)
            
        if batched:
            collect_loss *= batch_size/nr_samples
        else:
            collect_loss /= nr_samples
            
        return collect_loss, correct/nr_samples
    

In [None]:
print('Validation error, accuracy:')
test(model, val_dl, device, batched = False)

# Step 6: Cross-validation
For cross-validation, we perform the same steps but create the dataloaders for each split of training and test data. Training and testing are identical to the above, with a slightly extended loop for tracking the errors.
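
The idea behind the splits is sketched below in plain Python. The notebook itself relies on scikit-learn's `KFold`; `kfold_indices` is a made-up helper that only illustrates the principle that every sample ends up in the test set of exactly one fold.

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is in the test set exactly once."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)        # shuffle, in case the data are ordered
    folds = [order[i::k] for i in range(k)]   # k roughly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(kfold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), sorted(test_idx))
```

With 10 samples and 5 folds, each fold trains on 8 samples and tests on the remaining 2, and a fresh model is trained for every fold so that no test sample has influenced the model that is evaluated on it.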

In [None]:

# Function defining the data loaders for k-fold cross-validation
def get_dataloader_kfold(cur_dataset, train_idx, test_idx, batch_size):    
    train_dl = DataLoader(
        dataset=cur_dataset,
        batch_size=batch_size,
        sampler=torch.utils.data.SubsetRandomSampler(train_idx),
        drop_last=True
    )
    test_dl = DataLoader(
        dataset=cur_dataset,
        batch_size=batch_size,
        sampler=torch.utils.data.SubsetRandomSampler(test_idx),
        drop_last=True
    )

    return train_dl, test_dl


# function to package samples into a Tensor dataset
def samples_to_dataset(train_samples_x, train_samples_y):
    train_x = np.array(train_samples_x)
    train_y = np.array(train_samples_y)
    return TensorDataset(torch.from_numpy(train_x).float(), torch.from_numpy(train_y).float())


In [None]:

# create a set of positive and negative samples
positive_set = (seqs_codon_freqs_hs, seqs_codon_labels_hs)
negative_set = (rnd_codon_freqs_hs, rnd_codon_labels_hs)
len(positive_set[0]), len(negative_set[0])


In [None]:

# define number of samples to take for training and validation. We pick small numbers to speed up the process
nr_train_samples = 5000
nr_val_samples = 2000

# define training (train+test) and validation data. 
# (!) These are NOT dataloaders but TensorDatasets because we split them into folds and convert to dataloaders during training
nr_train_samples = round(nr_train_samples/2)
nr_val_samples = round(nr_val_samples/2)
train_dataset = samples_to_dataset(positive_set[0][0:nr_train_samples]+negative_set[0][0:nr_train_samples], 
                                   positive_set[1][0:nr_train_samples]+negative_set[1][0:nr_train_samples])

val_dataset = samples_to_dataset(positive_set[0][nr_train_samples:(nr_train_samples+nr_val_samples)]+negative_set[0][nr_train_samples:(nr_train_samples+nr_val_samples)],
                                 positive_set[1][nr_train_samples:(nr_train_samples+nr_val_samples)]+negative_set[1][nr_train_samples:(nr_train_samples+nr_val_samples)])


In [None]:
n_epochs = 100
k_folds = 5
batch_size = 300

# Initialize the k-fold cross validation
# make sure to shuffle the data before splitting into folds (our input data may be ordered!)
kf = KFold(n_splits=k_folds, shuffle=True)

# prepare dataframe to store training errors
train_error = pd.DataFrame(index=range(n_epochs),columns=range(k_folds))
test_error = pd.DataFrame(index=range(n_epochs),columns=range(k_folds))
train_acc = pd.DataFrame(index=range(n_epochs),columns=range(k_folds))
test_acc = pd.DataFrame(index=range(n_epochs),columns=range(k_folds))

# Loop through each fold
for fold, (train_idx, test_idx) in enumerate(kf.split(train_dataset)):
    print(f"Fold {fold + 1}")
    print("-------")
    print("epoch: ",end='')

    # define training and test data for given fold and batch size
    train_dl, test_dl = get_dataloader_kfold(train_dataset, train_idx, test_idx, batch_size)
    
    # initialize a model
    model = myModel(len(codon_to_int), 20, 1, 0).to(device)

    # define optimizer with the given learning rate
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
        
    # Train the model on the current fold
    for epoch in range(1, n_epochs):
        # train the model and get training loss
        train_loss = train(model, train_dl, optimizer, device)
        
        # test the model on the held-out test data of the current fold
        test_loss = test(model, test_dl, device, batched = True)
        
        # save training error and accuracy
        train_error.loc[epoch, fold] = train_loss[0]
        test_error.loc[epoch, fold] = test_loss[0]
        train_acc.loc[epoch, fold] = train_loss[1]
        test_acc.loc[epoch, fold] = test_loss[1]
        
        print(epoch, end=' ')
    print()

In [None]:
print('Using last model & last epoch from cross-validation; error, accuracy:')
test(model, DataLoader(dataset = val_dataset), device, batched = False)

# Step 7: Plotting training/testing error and accuracy
Using the output from training, we can plot the results for each epoch to look at the learning of our model. For this, we average the errors over the different folds from training. Carefully look at the training and testing error to choose an appropriate number of epochs for training (to avoid overfitting). 

NOTE: the initial model trains very fast and reaches a small error and high accuracy. The error and accuracy become more interesting when training with other species as negative samples, as in the questions.
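
One simple way to pick the number of epochs from such curves is to average the test loss over the folds and take the epoch where it is lowest. The sketch below shows this on made-up numbers; `best_epoch` and `toy_losses` are illustrative and do not use the dataframes filled during training.

```python
def best_epoch(test_loss_per_fold):
    """test_loss_per_fold[f][e] is the test loss of fold f after epoch e + 1."""
    n_epochs = len(test_loss_per_fold[0])
    n_folds = len(test_loss_per_fold)
    mean_loss = [
        sum(fold[e] for fold in test_loss_per_fold) / n_folds
        for e in range(n_epochs)
    ]
    best = min(range(n_epochs), key=lambda e: mean_loss[e])
    return best + 1, mean_loss[best]

# made-up losses for 2 folds and 5 epochs: the test loss rises again after epoch 3
toy_losses = [[0.70, 0.50, 0.40, 0.45, 0.55],
              [0.68, 0.52, 0.42, 0.43, 0.50]]
print(best_epoch(toy_losses)[0])   # 3
```

A test loss that starts to rise while the training loss keeps falling is the typical sign of overfitting, which is why the minimum of the test curve is a reasonable stopping point.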

In [None]:

# initialize figure
sns.set(rc={'figure.figsize':(5,5)})
sns.set(font="Arial")
sns.set(style="whitegrid")

# format loss data
plot_loss_df = pd.concat([train_error.mean(axis=1), test_error.mean(axis=1)],axis=1).iloc[1:,].reset_index()
plot_loss_df.columns = ['epoch','training','test']
plot_loss_df = plot_loss_df.set_index('epoch').stack().reset_index()
plot_loss_df.columns = ['epoch','dataset','loss']

# plot training and test loss as function of epoch
sns.lineplot(data=plot_loss_df, x='epoch', y='loss',hue='dataset')
plt.tight_layout()
plt.show()


In [None]:

# initialize figure
sns.set(rc={'figure.figsize':(5,5)})
sns.set(font="Arial")
sns.set(style="whitegrid")

# format accuracy data
plot_acc_df = pd.concat([train_acc.mean(axis=1), test_acc.mean(axis=1)],axis=1).iloc[1:,].reset_index()
plot_acc_df.columns = ['epoch','training','test']
plot_acc_df = plot_acc_df.set_index('epoch').stack().reset_index()
plot_acc_df.columns = ['epoch','dataset','accuracy']

# plot training and test accuracy as function of epoch
sns.lineplot(data=plot_acc_df, x='epoch', y='accuracy',hue='dataset')
plt.tight_layout()
plt.show()


In [None]:
# finally, using a trained model, we can compute a 'probability' that a given input sequence is encoding a protein
# here we pick a random human sequence that was not used for training, you can change this to any sequence you would like (change 'cds_hs' for human to 'cds_mm' for mice etc).
random_validation_sample = np.random.randint(low=nr_train_samples, high=(nr_train_samples+nr_val_samples), size = 1, dtype=int)[0]
cur_test_seq = cds_hs[random_validation_sample]
print('sequence: ',cur_test_seq[0:50])

# Evaluation mode
model.eval()
with torch.no_grad():
    # send features and labels to GPU/CPU
    data = torch.from_numpy(encode_seq_to_codon_freq([cur_test_seq],0)[0][0]).float().to(device)

    # compute output of model
    output = model(data)
    print('probability: ', output.item())

# Step 8: Questions
1) How many parameters does your model have?
2) What happens if you change the codon length?
3) Adapt the code for computing the probability of a sequence being coding to compute the probability for mouse and yeast DNA sequences. Having trained the model on human DNA sequences, what is the difference in average probabilities between species? Why?
4) Train the model using mouse or yeast sequences as negative samples for coding sequences. This creates a model that learns to classify sequences as being likely from mice/yeast or from humans.
5) Change the model architecture (e.g., more layers, other parameters, dropout parameters) / minimize training time / minimize number of parameters.
6) What happens to the probabilities when sequences are frameshifted? Can you train the model using frameshifted sequences as input?
7) Train the model on amino acid frequencies instead of codons.
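
As a possible starting point for the frameshift question, the sketch below shows one way to shift the reading frame of a sequence. `frameshift` and `to_codons` are made-up helpers, and trimming the shifted sequence to a whole number of codons is an assumption made here, not something the notebook prescribes.

```python
def frameshift(seq, offset):
    """Drop the first `offset` nucleotides and trim to a whole number of codons."""
    shifted = seq[offset:]
    return shifted[:len(shifted) - len(shifted) % 3]

def to_codons(seq):
    """Split seq into consecutive chunks of three nucleotides."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

toy_seq = "ATGGCCAAGTAA"                    # made-up sequence
print(to_codons(toy_seq))                   # ['ATG', 'GCC', 'AAG', 'TAA']
print(to_codons(frameshift(toy_seq, 1)))    # ['TGG', 'CCA', 'AGT']
```

Shifting by a single nucleotide changes every codon, so the codon frequencies of a frameshifted sequence can look very different from those of the original, even though the nucleotide content is almost identical.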