# Lab 1: Introduction to Neural Networks
In this tutorial we introduce some of the concepts for working with neural networks using [PyTorch](https://pytorch.org/tutorials/recipes/recipes_index.html). Because time is limited in this first lab session, the entire notebook can be executed as-is. We encourage you to explore the code yourselves to get comfortable with the concepts of deep learning in the context of biology. A few questions at the end challenge you to play around with the code and try things for yourselves.

In this session, you will create a simple neural network that classifies any given DNA sequence as protein coding or not. As positive examples, we use coding DNA sequences from humans (Homo sapiens (HS)). As negatives, we use random DNA sequences in which each nucleotide is drawn from a uniform distribution over the possible nucleotides. We then train a neural network on the [codon frequencies](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables) of these sequences.

In addition to human DNA sequences, we also take a look at coding sequences from mice ([Mus musculus (MM)](https://en.wikipedia.org/wiki/House_mouse)) and yeast ([Saccharomyces cerevisiae (SC)](https://en.wikipedia.org/wiki/Saccharomyces_cerevisiae)). There are subtle differences between the codon frequencies of these species. You will test how well your human-trained model recovers the coding sequences of mice and yeast (think of your results in the context of the evolutionary distances between species).


In [None]:

# import pytorch
import torch
import torch.nn as nn
from torch import Tensor
from torch import optim
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from sklearn.model_selection import KFold

# import basic functionality
import random
import numpy as np
import pandas as pd

# libraries for plotting
import seaborn as sns
import matplotlib.pyplot as plt

import pickle

import Bio
from Bio import SeqIO

# Step 1: Loading the data
For your convenience, we provide the pre-processed and encoded sequences. We pre-processed the human, mouse and yeast DNA sequences by keeping only coding sequences that contain an integer number of codons, that are long enough for computing codon frequencies (>=300 base pairs), and whose translation encodes a protein (start and stop codon). We removed duplicates and randomly shuffled the sequences.

We then encoded these sequences by computing the codon frequencies for each sequence; here we load the resulting data set. To train a neural network on the codon frequencies, each sequence is converted to an array that holds the frequency of each possible codon. To do so, each codon is assigned an index in the array. This mapping lets us convert between codons and array indices (e.g., 'ATG' -> 0) to keep track of the codon frequencies. We use Tensors, the datatype that PyTorch uses for data, to store the codon frequencies.
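
As a rough illustration of this encoding, the sketch below computes codon frequencies for a toy sequence. The function names `build_codon_index` and `codon_frequencies` are hypothetical, and the alphabetical codon order is an assumption; the provided data uses its own mapping (stored in `dna_lang`), which may order the codons differently.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def build_codon_index(codon_length=3):
    """Map every possible codon to a position in the frequency array (assumed alphabetical order)."""
    codons = ["".join(letters) for letters in product(NUCLEOTIDES, repeat=codon_length)]
    return {codon: idx for idx, codon in enumerate(codons)}

def codon_frequencies(sequence, codon_index, codon_length=3):
    """Count non-overlapping codons and normalise by the number of codons."""
    counts = [0] * len(codon_index)
    n_codons = len(sequence) // codon_length
    for i in range(n_codons):
        codon = sequence[i * codon_length:(i + 1) * codon_length]
        counts[codon_index[codon]] += 1
    return [count / n_codons for count in counts]

codon_index = build_codon_index()
# toy sequence with four codons: ATG, GCC, GCC, TAA
freqs = codon_frequencies("ATGGCCGCCTAA", codon_index)
print(len(freqs))                  # 64 possible codons
print(freqs[codon_index["GCC"]])   # 0.5, since GCC is two of the four codons
```

Every sequence, whatever its length, is turned into an array of the same size this way, which is what allows a single fixed-size input layer in the model.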

Biologically, [DNA codons](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables) consist of three nucleotides and encode amino acids. However, since we are training a neural network to classify a sequence as protein coding or not, we can choose any number of nucleotides to represent a 'codon'. For example, we can choose a "codon length" of a single nucleotide (which would result in us training the model on the [frequencies of nucleotides in DNA](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403801/)), or a codon length of two nucleotides (this has no biological meaning, as it does not represent a biological unit, so we do not expect a model to learn any biology at all), or a codon length of 6 nucleotides (representing pairs of amino acids; would this yield a model that "learns" any biology?). You can play around with this yourself in the notebook that creates the codon frequencies, but we start with a codon length of 3 nucleotides, representing one amino acid.
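
To get a feeling for what the choice of codon length implies, the short calculation below lists the number of possible "codons" (and hence input features) for each codon length mentioned above, together with the number of non-overlapping codons in a sequence of the minimal length of 300 nucleotides. The variable names are illustrative only.

```python
# number of possible "codons" (input features) for different codon lengths,
# and how many non-overlapping codons a 300-nucleotide sequence contains
sequence_length = 300
summary = {}
for codon_length in (1, 2, 3, 6):
    n_features = 4 ** codon_length              # four nucleotides per position
    n_codons = sequence_length // codon_length  # non-overlapping codons
    summary[codon_length] = (n_features, n_codons)
    print(codon_length, n_features, n_codons)
# prints:
# 1 4 300
# 2 16 150
# 3 64 100
# 6 4096 50
```

With a codon length of 6, a sequence of minimal length contributes only 50 codons spread over 4096 possible values, so most of its frequencies are zero.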

Finally, for each human sequence, we also created a random sequence of codons following the same sequence length distribution as the DNA sequences. The codons of these sequences are drawn from a uniform distribution over the possible codons. These random sequences are used as negatives: label = 1 tells the model that a sequence is protein coding, label = 0 tells the model that it is not.
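
A minimal sketch of how such negatives could be generated is shown below. The function `random_dna` and the example lengths are hypothetical and not taken from the pre-processing notebook. Drawing each nucleotide independently and uniformly is equivalent to drawing each codon uniformly from all possible codons.

```python
import random

NUCLEOTIDES = "ACGT"

def random_dna(length, rng):
    """Draw `length` nucleotides independently and uniformly at random."""
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))

# fixed seed so that the example is reproducible
rng = random.Random(0)

# hypothetical lengths of three positive sequences; each gets a negative of the same length
positive_lengths = [300, 450, 1200]
negatives = [random_dna(n, rng) for n in positive_lengths]
print([len(seq) for seq in negatives])  # [300, 450, 1200]
```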

Download the data folder from [here](https://polybox.ethz.ch/index.php/s/qXpo8ZcsuHryxY0). If you would like (not required), you can read how the data was processed in "Lab1_bonus_encoding".


In [None]:

# load the encoded sequence data
with open('lab1_seq_data_human_encoded_pos.obj', 'rb') as handle:
    seq_data_human_encoded_pos = pickle.load(handle)
    
with open('lab1_seq_data_human_encoded_neg.obj', 'rb') as handle:
    seq_data_human_encoded_neg = pickle.load(handle)
    
with open('seq_data_mouse_encoded.obj', 'rb') as handle:
    seq_data_mouse_encoded = pickle.load(handle)
    
with open('seq_data_yeast_encoded.obj', 'rb') as handle:
    seq_data_yeast_encoded = pickle.load(handle)
        
with open('sequence_encoding.obj', 'rb') as handle:
    dna_lang = pickle.load(handle)
    

In [None]:

# take a look at the encoding
dna_lang.word2index


In [None]:

# take a look at a sample
seq_data_human_encoded_pos[0]


In [None]:

# explore the correlation of codon frequencies across species
# first average and merge codon frequencies of the different species and random sequences
codon_freqs = pd.DataFrame([np.mean(np.array([ch['frequencies'] for ch in seq_data_human_encoded_pos if ch['label']==1]),axis=0),
                         np.mean(np.array([ch['frequencies'] for ch in seq_data_mouse_encoded if ch['label']==1]),axis=0),
                         np.mean(np.array([ch['frequencies'] for ch in seq_data_yeast_encoded if ch['label']==1]),axis=0),
                         np.mean(np.array([ch['frequencies'] for ch in seq_data_human_encoded_neg if ch['label']==0]),axis=0)
                         ]).T

# label codons and sort by human frequency
codon_freqs.index = [dna_lang.index2word[idx] for idx in list(codon_freqs.index)]
codon_freqs.reset_index(inplace=True)
codon_freqs.columns = ['codon','human','mouse','yeast','random']
codon_freqs.sort_values(by='human',ascending=False,inplace=True)

print('correlation matrix: ')
print(codon_freqs.set_index('codon').corr())


In [None]:

# plot codon frequencies for different species
# initialize figure (a single call, because every sns.set call resets the style and font set by earlier calls)
sns.set(style="whitegrid", font="Arial", rc={'figure.figsize':(5,15)})

# stack dataframe of frequencies
plot_freqs = codon_freqs.set_index('codon').stack().reset_index()
plot_freqs.columns = ['codon','origin','frequency']

# plot codon frequencies for different species
sns.barplot(data=plot_freqs,x='frequency',y='codon',hue='origin')
plt.tight_layout()
plt.show()


# Step 2: Creating a dataloader
Having encoded our DNA sequences as codon frequencies, we are ready to prepare the data for training a neural network. We will create a 'dataloader' that takes care of splitting the data into batches.
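
To see what batching does, the sketch below builds a toy data set and iterates over it with a `DataLoader`. It assumes that every sample is a dictionary with a `'frequencies'` tensor and a `'label'` tensor, as suggested by how the samples are accessed in this notebook; the exact shapes in the provided data are an assumption.

```python
import torch
from torch.utils.data import DataLoader

# toy data set: 10 samples with 64 random "codon frequencies" and a binary label each
toy_data = [
    {"frequencies": torch.rand(64), "label": torch.tensor([float(i % 2)])}
    for i in range(10)
]

# batches of 4 samples; drop_last discards the incomplete final batch of 2 samples
loader = DataLoader(toy_data, batch_size=4, shuffle=True, drop_last=True)
batches = list(loader)

print(len(batches))                     # 2
print(batches[0]["frequencies"].shape)  # torch.Size([4, 64])
print(batches[0]["label"].shape)        # torch.Size([4, 1])
```

The default collate function stacks the tensors of the individual samples key by key, so each batch is again a dictionary, now with one row per sample.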

In [None]:

# merge encoded human sequences and the matching random samples
seq_data_human_encoded = seq_data_human_encoded_pos + seq_data_human_encoded_neg
random.shuffle(seq_data_human_encoded)

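# split the sequence data that we defined above into training, validation and test sets
#train_set_human, val_set_human, test_set_human = torch.utils.data.random_split(seq_data_human_encoded, [0.5,0.4,0.1])
#train_set_mouse, val_set_mouse, test_set_mouse = torch.utils.data.random_split(seq_data_mouse_encoded, [0.5,0.4,0.1])
#train_set_yeast, val_set_yeast, test_set_yeast = torch.utils.data.random_split(seq_data_yeast_encoded, [0.5,0.4,0.1])

# convert fractions to integer lengths; samples lost to rounding go to the training set,
# because random_split requires the lengths to sum to the size of the dataset
def get_split_lengths(n_samples, fractions=(0.5,0.4,0.1)):
    lengths = [int(frac*n_samples) for frac in fractions]
    lengths[0] += n_samples - sum(lengths)
    return lengths

train_set_human, val_set_human, test_set_human = torch.utils.data.random_split(seq_data_human_encoded, get_split_lengths(len(seq_data_human_encoded)))
train_set_mouse, val_set_mouse, test_set_mouse = torch.utils.data.random_split(seq_data_mouse_encoded, get_split_lengths(len(seq_data_mouse_encoded)))
train_set_yeast, val_set_yeast, test_set_yeast = torch.utils.data.random_split(seq_data_yeast_encoded, get_split_lengths(len(seq_data_yeast_encoded)))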


In [None]:

############################
# define a function to create a dataloader for the encoded sequences
def get_dataloader(dataset, batch_size):
    cur_sampler = RandomSampler(dataset)
    cur_dataloader = DataLoader(dataset=dataset, sampler=cur_sampler, batch_size=batch_size, drop_last=True, num_workers=16)
    return cur_dataloader    
############################


In [None]:

# how many samples should be trained on simultaneously?
batch_size = 300

# define dataloader for training
train_loader_human = get_dataloader(train_set_human, batch_size)
val_loader_human = get_dataloader(val_set_human, batch_size)
test_loader_human = get_dataloader(test_set_human, 1)


# Step 3: Define model
As a final preparation, we define our [model](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). We create a class that instructs PyTorch to follow a certain architecture for the model. Our model initializes all relevant parts (init function) and tells PyTorch how to compute the output for a given input (forward function). For our perceptron, we use a single dense linear layer with a sigmoid activation function. You can play around with the model architecture later.
We are training a model to solve a classification problem. The labels for our samples are binary (ones and zeros). We therefore use a loss function for binary labels, the binary cross-entropy.
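
The sketch below spells out, in plain Python, what a perceptron with a sigmoid activation computes and how the binary cross-entropy scores its output. The three input frequencies and the weights are made up for illustration; the function names are hypothetical and not part of this notebook.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(frequencies, weights, bias):
    """Weighted sum of the inputs plus a bias, followed by a sigmoid."""
    z = sum(w * x for w, x in zip(weights, frequencies)) + bias
    return sigmoid(z)

def binary_cross_entropy(prediction, label):
    """Loss for a single sample with a binary label (0 or 1)."""
    return -(label * math.log(prediction) + (1 - label) * math.log(1 - prediction))

# with all weights at zero, the output is 0.5 regardless of the input
untrained = perceptron([0.5, 0.25, 0.25], [0.0, 0.0, 0.0], 0.0)
print(untrained)                           # 0.5
print(binary_cross_entropy(untrained, 1))  # log(2), about 0.693

# a large positive weight pushes the output towards 1
confident = perceptron([0.5, 0.25, 0.25], [8.0, 0.0, 0.0], 0.0)
print(binary_cross_entropy(confident, 1))  # small loss: confident and correct
print(binary_cross_entropy(confident, 0))  # large loss: confident and wrong
```

By default, `nn.BCELoss` averages this quantity over all samples in a batch.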

In [None]:

# Define the device (CPU or GPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device


In [None]:

# Define the model architecture
class myPerceptron(nn.Module):
    def __init__(self, input_param, output_param):
        super(myPerceptron, self).__init__()
        
        self.input_param = input_param
        self.output_param = output_param
        
        self.linear0 = nn.Linear(input_param, output_param)
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_data):
        output_linear = self.linear0(input_data)
        output_activation = self.sigmoid(output_linear)
        return output_activation


In [None]:

# use binary cross entropy loss for this classification problem
my_loss_function = nn.BCELoss()

# initialize an instance of our model class (a variable that is a model following the architecture we defined above)
my_model = myPerceptron(dna_lang.n_words, # size of input tensors (the number of codons)
                     1, # size of the model's output
                    ).to(device) # send model to device

# show model architecture
my_model


# Step 4: Train simple model 
To train our model, we need a "training loop". 
1) First, we tell PyTorch that we want to train our model (so it has to keep track of gradients). 
2) We then iterate over our data in batches to speed up computations (computing the gradient with all samples has little advantage over using, say, a few hundred samples). 
3) We set the gradients to zero (PyTorch accumulates gradients by default, and we don't want to re-use those of the previous training step).
4) We compute the output of the model for the given input sequences. 
5) We then compute the loss of the model output for the given target labels of the input sequences and [backpropagate](https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html) the loss through the network to compute the gradient. 
6) Finally, we instruct the optimizer to use the gradient and perform one appropriately-sized [step](https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.step.html) to update the model weights. 
7) To keep track of our efforts, we compute the accuracy and training loss for the current samples.

Finally, we train our model using the given data and training loop.

In [None]:

# Define the training loop
def train(model, train_loader, optimizer, device):
    # training mode
    model.train(True)
    
    # Enabling gradient calculation
    with torch.set_grad_enabled(True):
        collect_loss = 0
        correct = 0
        nr_samples = 0
        for batch_idx, data in enumerate(train_loader):
            # send features and labels to GPU/CPU
            model_input = data['frequencies'].to(device)
            target = data['label'].to(device)

            # zero the gradients
            model.zero_grad()
            optimizer.zero_grad()

            # compute output of model
            output = model(model_input)

            # compute the loss and update model parameters
            loss = my_loss_function(output, target)
            loss.backward()

            # adjust learning weights
            optimizer.step()
            
            # store training loss, weighted by the number of samples in this batch
            collect_loss += loss.item()*len(target)
            
            # compute accuracy of training data
            pred = torch.round(output,decimals=0)
            correct += (pred.eq(target.view_as(pred)).sum().item())
            nr_samples += len(target)
            
        return {'train_loss':collect_loss/nr_samples, 'train_accuracy':correct/nr_samples}
    

Similar to the training loop, we need a "test loop" to get the output of the model for a given set of validation samples on which we do not train the model. 
1) First, we tell PyTorch that we do NOT want to train our model (no keeping track of gradients - evaluation mode). 
2) We then iterate over our validation data. 
3) We compute the output of the model for the given input sequences. 
4) We then compute the loss of the model output for the given target labels of the input sequences.
5) We compute the accuracy and validation loss for the current samples.

In [None]:

# define the test loop
def validate(model, test_loader, device):
    # Evaluation mode
    model.eval()
    
    with torch.no_grad():
        collect_loss = 0
        correct = 0
        nr_samples = 0
        for data in test_loader:
            # send features and labels to GPU/CPU
            model_input = data['frequencies'].to(device)
            target = data['label'].to(device)
            
            # compute output of model
            output = model(model_input)

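            # store validation loss, weighted by the number of samples in this batch
            # (the loader passed in may use a batch size other than the global batch_size)
            collect_loss += my_loss_function(output, target).item()*len(target)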
            
            # compute accuracy for test data
            pred = torch.round(output,decimals=0)
            correct += (pred.eq(target.view_as(pred)).sum().item())
            nr_samples += len(target)
            
        return {'val_loss':collect_loss/nr_samples, 'val_accuracy':correct/nr_samples}
    

In [None]:

# define the number of epochs - how often should the model (my_model) see all of the data (train_loader_human)?
n_epochs = 20

# initialize an instance of our model class (a variable that is a model following the architecture we defined above)
my_model = myPerceptron(dna_lang.n_words, # size of input tensors (the number of codons)
                     1, # size of the model's output
                    ).to(device) # send model to device

# use the Adam optimizer (an adaptive variant of stochastic gradient descent) with the given learning rate
optimizer = torch.optim.Adam(my_model.parameters(), lr=0.01)

# Train the model on the current data
stats_tracker = []
for epoch in range(0, n_epochs):
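    # compute validation statistics for the current model, then train for one epoch
    test_stats = validate(my_model, val_loader_human, device)
    train_stats = train(my_model, train_loader_human, optimizer, device)
    # merge both dictionaries (the | operator for dictionaries requires Python >= 3.9)
    stats_tracker.append( train_stats|test_stats )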
    print('epoch: ', epoch, train_stats, test_stats, '\t\t\t\t\t\t\t\t', end='\r')
    

# Step 5: Plotting error and accuracy
Using the output from training, we can plot the results for each epoch to look at the learning of our model. Each curve shows the loss or the accuracy on the training or validation set as a function of the epoch. Carefully compare the training and validation error to choose an appropriate number of epochs for training (to avoid overfitting).

NOTE: the initial model trains very fast, reaching a small error and a high accuracy. The error and accuracy become more interesting when training with other species as negative samples, as in the questions below.
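
One simple way to choose the number of epochs is to pick the epoch with the lowest validation loss. The sketch below shows this for a list of per-epoch dictionaries shaped like the statistics collected during training. The helper `best_epoch` is hypothetical, and the numbers are made up for illustration; they are not results from this notebook.

```python
def best_epoch(stats, key="val_loss"):
    """Return the index of the epoch with the smallest value for `key`."""
    return min(range(len(stats)), key=lambda epoch: stats[epoch][key])

# made-up numbers for illustration only: the training loss keeps decreasing,
# while the validation loss starts to increase again after epoch 2 (overfitting)
toy_stats = [
    {"train_loss": 0.70, "val_loss": 0.68},
    {"train_loss": 0.40, "val_loss": 0.45},
    {"train_loss": 0.25, "val_loss": 0.38},
    {"train_loss": 0.15, "val_loss": 0.41},
    {"train_loss": 0.08, "val_loss": 0.47},
]
print(best_epoch(toy_stats))  # 2
```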

In [None]:

# initialize figure (a single call, because every sns.set call resets the style and font set by earlier calls)
sns.set(style="whitegrid", font="Arial", rc={'figure.figsize':(5,5)})

# format loss data
plot_data = pd.DataFrame(stats_tracker)
plot_data = plot_data.stack().reset_index()
plot_data.columns = ['epoch','dataset','value']

# plot training and validation loss and accuracy as function of epoch
ax=sns.lineplot(data=plot_data, x='epoch', y='value',hue='dataset')
#ax.set_yscale('log')
ax.set_ylim([0,1])
plt.tight_layout()
plt.show()


In [None]:

# define function for evaluating the trained model for a given sample
def evaluate(model, sample):
    # set the model in evaluation mode without computing gradients
    model.eval()
    with torch.no_grad():
        # compute the output of the given model (not the global my_model) for a given sample
        output = model(sample.to(device))
    return output.item()


# finally, using a trained model, we can compute a 'probability' that a given input sequence is encoding a protein
test_sampler = enumerate(test_loader_human)

# here we pick a random sequence that was not used for training, but you can change this to any sequence you would like 
batch_idx, test_sample = next(test_sampler)

# evaluate model for given test sample
output = evaluate(my_model, test_sample['frequencies'])

# print output of the model together with the label of the sample
print('probability: ',output, test_sample['label'])


# Step 6: Questions
1) How many parameters does your model have?
3) Change the model architecture: you can add more layers with hidden units, other activation functions, dropout, etc. Make sure that the model does not overfit.
4) Make predictions for mouse sequences using the model trained on human sequences. How does the accuracy change? What happens if you make predictions for yeast sequences?
5) Now train the model using yeast sequences as negative samples, and then using mouse sequences as negative samples. This creates a model that learns to classify sequences as being likely from mice/yeast or from humans.

# Step 7: Bonus (hard - use encoding notebook)
6) What happens if you change the codon length?
7) What happens to the probabilities when sequences are frameshifted? Can you train the model using frameshifted sequences as input?
8) Train the model on amino acid frequencies instead of codon frequencies.
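
As a possible starting point for the frameshift question, the sketch below shifts the reading frame of a sequence by dropping its first nucleotides and trimming the end to a whole number of codons. The helper `frameshift` is hypothetical and only one of several ways to define a frameshift; the shifted sequence still has to be encoded as codon frequencies before it can be passed to the model.

```python
def frameshift(sequence, offset, codon_length=3):
    """Drop the first `offset` nucleotides and trim to a whole number of codons."""
    shifted = sequence[offset:]
    usable = len(shifted) - len(shifted) % codon_length
    return shifted[:usable]

# toy sequence with the codons ATG, GCC, GCC, TAA
original = "ATGGCCGCCTAA"
print(frameshift(original, 0))  # ATGGCCGCCTAA (unchanged reading frame)
print(frameshift(original, 1))  # TGGCCGCCT (codons TGG, CCG, CCT)
```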