# Lab 1: Introduction to Neural Networks
In this tutorial we introduce some of the concepts for working with neural networks using [PyTorch](https://pytorch.org/tutorials/recipes/recipes_index.html). Because time is limited in this first lab session, the entire notebook can be executed as-is. We encourage you to explore the code yourselves to get comfortable with the concepts of deep learning in the context of biology. A few questions at the end challenge you to play around with the code and try things for yourselves.

In this session, you will create a simple neural network that classifies any given DNA sequence as protein coding or not. As positive examples, we use coding DNA sequences from humans (Homo sapiens (HS)). As negatives, we use random DNA sequences in which each nucleotide is drawn from a uniform distribution over the four possible nucleotides. We then train a neural network on the [codon frequencies](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables) of these sequences. In addition to human DNA sequences, we also take a look at coding sequences from mice ([Mus musculus (MM)](https://en.wikipedia.org/wiki/House_mouse)) and yeast ([Saccharomyces cerevisiae (SC)](https://en.wikipedia.org/wiki/Saccharomyces_cerevisiae)).
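
The negatives described above can be sketched as follows. This is a minimal illustration, not the code used later in the notebook (there, negatives come from `dna_lang.encode_negatives`, whose internals are not shown). The helper `random_dna`, the fixed seed, and the choice to give each negative the same length as a positive are assumptions made for this example.

```python
import random

NUCLEOTIDES = "ACGT"

def random_dna(length, rng=random):
    """Return a DNA string whose nucleotides are drawn uniformly at random."""
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))

# Hypothetical toy positives; in the lab these would be real coding sequences.
positives = ["ATGGCCTAA", "ATGAAACCCGGGTAG"]

# Assumption: one negative per positive, with matching length.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
negatives = [random_dna(len(seq), rng) for seq in positives]
print(negatives)
```

Because every position is sampled independently, such sequences carry no codon usage bias, which is what the classifier can pick up on.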


In [None]:

# import pytorch
import torch
import torch.nn as nn
from torch import Tensor
from torch import optim
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from sklearn.model_selection import KFold

# import basic functionality
import random
import numpy as np
import pandas as pd

# libraries for plotting
import seaborn as sns
import matplotlib.pyplot as plt

# the public pickle module already uses the fast C implementation when available
import pickle

import Bio
from Bio import SeqIO

# Step 1: Pre-processing the data
Here we download and pre-process the dataset. We only consider DNA sequences that are protein coding, contain an integer number of codons, have a start and a stop codon, and are at least 300 nucleotides long. Finally, we remove duplicates and shuffle the sequences.
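
The sequence-level criteria can be illustrated without any external library. The sketch below is a simplified stand-in for the Biopython-based function used later in this notebook: it assumes the standard genetic code (start codon `ATG`, stop codons `TAA`, `TAG`, `TGA`), ignores the FASTA header checks and the minimum length, and uses a hypothetical helper name, `looks_like_complete_cds`.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code

def looks_like_complete_cds(seq):
    """Check for whole codons, a start codon and a terminal stop codon."""
    return (
        len(seq) >= 6               # room for at least a start and a stop codon
        and len(seq) % 3 == 0       # integer number of codons
        and seq[:3] == "ATG"        # start codon
        and seq[-3:] in STOP_CODONS # stop codon at the end
    )

# Hypothetical toy candidates
candidates = [
    "ATGGCCTAA",  # passes all checks
    "ATGGCCTA",   # length is not a multiple of 3
    "GCCGCCTAA",  # no start codon
    "ATGGCCGCC",  # no stop codon
    "ATGGCCTAA",  # duplicate of the first sequence
]

# filter, remove duplicates, then shuffle
kept = list({seq for seq in candidates if looks_like_complete_cds(seq)})
random.shuffle(kept)
print(kept)
```

Only one distinct sequence survives here, since four of the five candidates either fail a check or duplicate the first.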

In [None]:

# download and unpack DNA coding sequences for human, mouse and yeast
############################

!mkdir -p ~/all_seqs
%cd ~/

!wget -P ~/all_seqs/ https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
!gzip -df "all_seqs/Homo_sapiens.GRCh38.cds.all.fa.gz"

!wget -P ~/all_seqs/ https://ftp.ensembl.org/pub/current_fasta/saccharomyces_cerevisiae/cds/Saccharomyces_cerevisiae.R64-1-1.cds.all.fa.gz
!gzip -df "all_seqs/Saccharomyces_cerevisiae.R64-1-1.cds.all.fa.gz"

!wget -P ~/all_seqs/ https://ftp.ensembl.org/pub/current_fasta/mus_musculus/cds/Mus_musculus.GRCm39.cds.all.fa.gz
!gzip -df "all_seqs/Mus_musculus.GRCm39.cds.all.fa.gz"


In [None]:

############################
### some unnecessarily complex code to process the FASTA files for coding sequences
############################

# function that loads and processes a FASTA file containing coding sequences
def load_species_cds(file_name, max_nr_samples):
    seqs = []
    for record in SeqIO.parse(file_name, "fasta"):
        # ensure that sequences are protein coding
        if 'gene_biotype:protein_coding' in record.description:
            if 'transcript_biotype:protein_coding' in record.description:
                if ' cds ' in record.description:
                    if len(record.seq) % 3 == 0:
                        # translate sequence and check for start and stop codons
                        code_translation = str(record.seq.translate())
                        if code_translation[0] == 'M' and code_translation[-1] == '*':
                            seqs.append(str(record.seq))

    # avoid sequences with undetermined/uncertain nucleotides ('N')
    # restrict to sequences with at least 100 aa (300 nt) for codon frequency estimation
    seqs = [seq for seq in seqs if 'N' not in seq and len(seq) >= 300]
    
    # remove duplicates and randomly mix the list of sequences
    seqs = list(set(seqs))
    random.shuffle(seqs)
    
    return list(seqs)[0:max_nr_samples]


In [None]:

# ensure we're in the right directory
%cd ~/

# there are many sequences; given the time constraints, we cap the number per species to speed up processing
max_nr_samples = 20000

# load coding sequences for different species
print('loading human proteins')
seq_data_human = load_species_cds("all_seqs/Homo_sapiens.GRCh38.cds.all.fa", max_nr_samples)

print('loading yeast proteins')
seq_data_yeast = load_species_cds("all_seqs/Saccharomyces_cerevisiae.R64-1-1.cds.all.fa", max_nr_samples)

print('loading mouse proteins')
seq_data_mouse = load_species_cds("all_seqs/Mus_musculus.GRCm39.cds.all.fa", max_nr_samples)

# take a look at some sequences
[seq_data_human[i][0:20]+'...'+seq_data_human[i][-20:] for i in range(5)]


# Step 2: Encoding sequences as codon frequencies
The steps above give us a set of unique coding sequences for humans, mice and yeast. To train a neural network on the codon frequencies of these sequences, we encode each sequence as an array holding the frequency of each possible codon. Each codon is assigned an index in the array. We first create a 'language' that knows all possible words (codons) for a given codon length and set of input sequences. This language then allows us to convert between codons (e.g., 'ATG') and indices in the array (e.g., 'ATG' -> 0) to keep track of the codon frequencies. Here, we use Tensors, the data type PyTorch uses for its data, to store the codon frequencies.
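
To make the encoding concrete, here is a minimal sketch in plain Python. It is not the `Language` class imported below, whose implementation is not shown; the function names `learn_codons` and `codon_frequencies`, the first-seen index order, and the choice to skip unknown codons are assumptions made for illustration.

```python
def learn_codons(seqs, codon_len=3):
    """Assign each distinct codon an index, in order of first appearance."""
    codon_to_index = {}
    for seq in seqs:
        for i in range(0, len(seq) - codon_len + 1, codon_len):
            codon = seq[i:i + codon_len]
            if codon not in codon_to_index:
                codon_to_index[codon] = len(codon_to_index)
    return codon_to_index

def codon_frequencies(seq, codon_to_index, codon_len=3):
    """Return the relative frequency of every known codon in seq."""
    counts = [0] * len(codon_to_index)
    n_codons = 0
    for i in range(0, len(seq) - codon_len + 1, codon_len):
        codon = seq[i:i + codon_len]
        n_codons += 1
        if codon in codon_to_index:  # assumption: unknown codons are skipped
            counts[codon_to_index[codon]] += 1
    return [c / n_codons for c in counts] if n_codons else counts

# Hypothetical toy sequence with codons ATG, GCC, GCC, TAA
vocab = learn_codons(["ATGGCCGCCTAA"])
freqs = codon_frequencies("ATGGCCGCCTAA", vocab)
print(vocab)  # {'ATG': 0, 'GCC': 1, 'TAA': 2}
print(freqs)  # [0.25, 0.5, 0.25]
```

A list like `freqs` could then be wrapped with `torch.tensor(freqs)` to obtain a Tensor for training.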

Biologically, [DNA codons](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables) consist of three nucleotides, each codon encoding an amino acid. However, since we are training a neural network to classify a sequence as protein coding or not, we can choose any number of nucleotides to represent a 'codon'. For example, we can choose a "codon length" (codon_length) of a single nucleotide (which would result in us training the model on the [frequencies of nucleotides in DNA](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403801/)), or a codon length of two nucleotides (no biological meaning as this does not represent a biological unit - we do not expect a model to learn any biology at all), or a codon length of 6 nucleotides (representing pairs of amino acids - would this yield a model that "learns" any biology?). You can play around with the codon_length yourself, but we start with a codon length of 3 nucleotides - representing one amino acid.
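
The effect of the codon length on the input can be sketched as follows. With four nucleotides there are at most 4^codon_length distinct words, so the size of the frequency vector grows quickly. The helper `split_into_words` is hypothetical, and reading the sequence in consecutive, non-overlapping words while dropping an incomplete trailing word is an assumption about how a sequence is tokenized.

```python
def split_into_words(seq, codon_length):
    """Cut a sequence into consecutive, non-overlapping words of codon_length."""
    usable = len(seq) - len(seq) % codon_length  # drop an incomplete trailing word
    return [seq[i:i + codon_length] for i in range(0, usable, codon_length)]

seq = "ATGGCCGCCTAA"  # hypothetical toy sequence
for codon_length in (1, 2, 3, 6):
    max_vocab = 4 ** codon_length  # upper bound on the number of distinct words
    words = split_into_words(seq, codon_length)
    print(codon_length, max_vocab, words)
```

For a codon length of 6, the vector has up to 4096 entries, so many more sequences are needed to estimate each frequency reliably.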

In [None]:

codon_length = 3

# create a language for human DNA sequences
from language import Language
dna_lang = Language(name="dna_human", codon_len=codon_length)

# memorize the dna language by parsing all sequences
for cur_seq in seq_data_human:
    dna_lang.learnWords(cur_seq)

# encode data
seq_data_human_encoded_pos = dna_lang.encode_positives(seq_data_human)
seq_data_human_encoded_neg = dna_lang.encode_negatives(seq_data_human)
seq_data_mouse_encoded = dna_lang.encode_positives(seq_data_mouse)
seq_data_yeast_encoded = dna_lang.encode_positives(seq_data_yeast)


In [None]:

# store the encoded human positives
with open('all_seqs/lab1_seq_data_human_encoded_pos.obj', "wb") as file_handler:
    pickle.dump(seq_data_human_encoded_pos, file_handler)

# store the encoded human negatives
with open('all_seqs/lab1_seq_data_human_encoded_neg.obj', "wb") as file_handler:
    pickle.dump(seq_data_human_encoded_neg, file_handler)

# store the encoded mouse sequences
with open('all_seqs/seq_data_mouse_encoded.obj', "wb") as file_handler:
    pickle.dump(seq_data_mouse_encoded, file_handler)

# store the encoded yeast sequences
with open('all_seqs/seq_data_yeast_encoded.obj', "wb") as file_handler:
    pickle.dump(seq_data_yeast_encoded, file_handler)


In [None]:

# store the language object so the same encoding can be reused later
with open('all_seqs/sequence_encoding.obj', "wb") as file_handler:
    pickle.dump(dna_lang, file_handler)
