## Training Notebook:

Currently handles only the single-task case. It can be adapted for multitask output by summing the losses across tasks and backpropagating through the combined loss. See: https://towardsdatascience.com/tuning-a-multi-task-fate-grand-order-trained-pytorch-network-152cfda2e086
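
A minimal sketch of the multitask adaptation described above, assuming the model returns a list with one logit tensor per task (consistent with how `output[0]` is used later in this notebook). The `multitask_loss` helper, the toy tensors, and the two task sizes are hypothetical and only for illustration:

```python
import torch
import torch.nn as nn

def multitask_loss(outputs, targets, loss_fn):
    """Sum the per-task losses so a single backward() covers every head."""
    return sum(loss_fn(out, tgt) for out, tgt in zip(outputs, targets))

torch.manual_seed(0)
criterion = nn.CrossEntropyLoss()

# Toy stand-ins for two task heads: 3 classes and 2 classes, batch of 4
outputs = [torch.randn(4, 3, requires_grad=True),
           torch.randn(4, 2, requires_grad=True)]
targets = [torch.randint(0, 3, (4,)), torch.randint(0, 2, (4,))]

loss = multitask_loss(outputs, targets, criterion)
loss.backward()  # gradients flow back to every head's logits
```

Summing weights every task equally; per-task weights could be added if one loss dominates the others.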

**Requirements:**

1) PyTorch and its submodules (e.g., `nn` and `F`) <br>
2) DontGetSNPpyWithMe.py in same folder as this notebook<br>
3) EarlyStop.py in same folder as this notebook 

**Import necessary packages:**

In [None]:
import torch
import torch.nn as nn #explicit import: nn.CrossEntropyLoss is used below
import argparse
import os
import numpy as np
from torch.utils.data import DataLoader
from DontGetSNPpyWithMe import * #The neural net
import torch.optim as optim
import torch.nn.functional as F
from EarlyStop import * #Early Stopping object
from data_loader import *

**Define a validation-splitting function (this may instead be handled externally by randomly sampling in a phenotype-balanced way, so the splits can simply be read in here), and read in the training, validation, and test (openSNP) data:**

In [None]:
def ValSplit():
    return None

training = torch.randint(0,8, size = (5, 500000)).float() #DataLoader(...)
trainingLabels = torch.randint(0,3, size= (5,1)) #random numbers between 0 and 2 --> simulate 3 class problem
val = None #DataLoader(...)
test = None #DataLoader(...)
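
One way to sketch the phenotype-balanced split mentioned above, assuming the labels are available as a plain list of class ids. The function name `stratified_split`, the `val_fraction` parameter, and the toy labels are hypothetical; this is not the body of `ValSplit`, only an illustration of the idea:

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # keep at least one validation sample per class
        n_val = max(1, round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

labels = [0] * 10 + [1] * 5 + [2] * 5  # toy 3-class phenotype labels
train_idx, val_idx = stratified_split(labels, val_fraction=0.2)
```

The resulting index lists could then be used to write out separate training and validation files that the notebook simply reads in.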

Build the training and validation data loaders, then define in one place the variables that can easily be changed for model instantiation later:

In [None]:
train_loader = get_loader(genotype_file="../data/2020_05_26/train_set.csv",
                          phenotype_file="../data/2020_05_26/train_labels.csv",
                          batch_size=32,
                          shuffle=False,
                          num_workers=4)
val_loader = get_loader(genotype_file="../data/2020_05_26/val_set.csv",
                          phenotype_file="../data/2020_05_26/val_labels.csv",
                          batch_size=32,
                          shuffle=False,
                          num_workers=4)

In [None]:
batchSize = 5
numSNPs= 69258  # 500000
numLayers = 5
layerWidths = [512, 512,256,128,64]
dropout = 0.3
multitaskOutputs = [3]
learningRate = 5e-3
epochs = 100
patience = 5

Instantiate the model, loss function, and optimizer using the above parameters:

In [None]:
derekZoolander = DiddyKongRacing([batchSize,numSNPs], numLayers, layerWidths, dropout, multitaskOutputs)
criterion = nn.CrossEntropyLoss() 
optimizer = optim.Adam(derekZoolander.parameters(), lr=learningRate) # weight decay can be added later for regularization, e.g. weight_decay=1e-4

Move the model to the GPU if available:

In [None]:
pleaseBeGpu = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(pleaseBeGpu)
derekZoolander = derekZoolander.to(pleaseBeGpu)

Define a function to calculate accuracy on a batch, since the model returns raw outputs (logits), not softmax probabilities:

In [None]:
#Inputs: 2 tensors --> make sure labels is a 1-D tensor whose length == batch size
#Output: total number of matches (accuracy can be returned instead by uncommenting the division below);
# counts are returned so accuracy can be computed per epoch, weighted by the total number of samples, rather than per batch
def AccuracyCalc(raw, labels): 
    probabilities = F.softmax(raw, dim = 1) #computed once and reused for both the print and the argmax
    print(probabilities)
    payMeForMyPredictions = torch.argmax(probabilities, dim = 1)
    print(payMeForMyPredictions)
    return payMeForMyPredictions.eq(labels).sum().item() #/len(labels)
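
Because softmax is monotonic within each row, taking the argmax of the raw logits gives the same predictions as taking it after softmax. The sketch below checks this on a toy batch; `count_correct` is a hypothetical print-free variant of `AccuracyCalc`, and the tensors are made up for illustration:

```python
import torch
import torch.nn.functional as F

def count_correct(logits, labels):
    """Number of correct predictions in a batch; labels has shape (batch,)."""
    predictions = torch.argmax(logits, dim=1)  # softmax would not change the argmax
    return predictions.eq(labels).sum().item()

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0],
                       [1.0, 4.0, 0.0]])
labels = torch.tensor([0, 2, 0])

# Predictions agree whether or not softmax is applied first
same = torch.equal(torch.argmax(logits, dim=1),
                   torch.argmax(F.softmax(logits, dim=1), dim=1))
correct = count_correct(logits, labels)  # the first two rows match their labels
```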

Define a validation (and test) function (also the base case of evaluation for epoch 0):

The print statements can be muted later when running everything, especially with the test set and once plotting is integrated. For now they are there to check that things are working and to give a real-time sense of whether the model is improving.

In [None]:
#inputs: 1) model (derekZoolander) 2) current epoch number 3) validation dataloader 4) Loss function 5) device
#outputs: 1) accuracy 2) loss

def Validate(model, epochNum, data = val_loader, lossFn = criterion, device = pleaseBeGpu):
    model.eval() #shut off batch norm and dropout
    # numBatches = 0 in case want to know how many batches iterating through
    numerator = 0 #running total of correct predictions
    totalSamples = 0
    totalLoss = 0
    with torch.no_grad(): #no gradients are needed for evaluation
        for i, (snpBatch, phenotypeBatch) in enumerate(data):
            totalSamples += len(phenotypeBatch) #running total of number of samples

            #move SNPs and labels to gpu (hopefully)
            snpBatch = snpBatch.to(device)
            #labels are assumed to have shape (batch, 1), as in Train, so squeeze to (batch,)
            phenotypeBatch = phenotypeBatch.to(device).squeeze(1)

            #Run it:
            output = model(snpBatch)
            loss = lossFn(output[0], phenotypeBatch) #recall output returns a list to handle multi-task learning so here the prediction is output[0]

            #Calculate accuracy:
            numerator += AccuracyCalc(output[0], phenotypeBatch)
            totalLoss += loss.item()

    epochAcc = numerator/(totalSamples)
    print("Validation accuracy at epoch {} is: {} .".format(epochNum,epochAcc))
    return epochAcc, totalLoss

Define a training function:

In [None]:
#Inputs: 1) model 2) Number of epochs want to train for 3) training dataloader 4) loss function 
# 5) the device available (cpu or gpu/cuda --> hopefully it is the latter)

#Outputs: 4 lists corresponding to a per epoch basis of accuracies and losses
# 1) Training Accs 2) Training Losses 3) Validation Accuracies 4) Validation Losses

def Train(model, numEpochs = epochs, data = training, lossFn = criterion, device = pleaseBeGpu):
    
    #Lists of training accuracies and losses to return for visualization later
    losses = []
    accuracies = []
    
    #Lists of validation accuracies and losses to return for visualization later
    valAccs = []
    valLosses = []
    
    #Early stopping structures: 
    stop = EarlyStop(patience) #Create an EarlyStop object that will evaluate and save the best model if patience has been exceeded
    areWeThereYet = False #flag for condition of early stopping
    topDog = None #Best model to save if early stopping condition has been met
    
    for epoch in range(numEpochs):
        model.train() # turn on batch norm and dropout
        numerator = 0
        # numBatches = 0 in case want to know how many batches iterating through
        totalSamples = 0
        totalLoss = 0
        for i, (snpBatch, phenotypeBatch) in enumerate(data):
            #numBatches += 1 
            #print(snpBatch)
            #print(phenotypeBatch)
            totalSamples += len(phenotypeBatch) #running total of number of samples
            
            optimizer.zero_grad() #zero the gradient to prevent accumulation; apparently can also do model.zero_grad() as well and does the same thing...
            
            #move SNPs and labels to gpu (hopefully)
            snpBatch = snpBatch.to(device)
            phenotypeBatch = phenotypeBatch.to(device)
            
            #Run it:
            output = model(snpBatch)
            loss = lossFn(output[0], phenotypeBatch.squeeze(1)) #recall output returns a list to handle multi-task learning so here the prediction is output[0]
            
            #Calculate accuracy:
            numerator += AccuracyCalc(output[0], phenotypeBatch.squeeze(1))
            totalLoss += loss.item()
            
            #Train it: 
            loss.backward()
            optimizer.step()

            
        epochAcc = numerator/(totalSamples)
        print("Training accuracy at epoch {} is: {} .".format(epoch+1,epochAcc))
        print("Training loss at epoch {} is : {} .".format(epoch+1, totalLoss))
        accuracies.append(epochAcc)
        losses.append(totalLoss)
        
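        #Run validation at the end of each epoch --> can discuss how we want to do this if not a happy camper w/ this
        #Pass the validation loader explicitly, since the placeholder `val` defined above is None
        valAcc, valLoss = Validate(model, epoch + 1, data = val_loader, lossFn = lossFn, device = device)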
        valAccs.append(valAcc)
        valLosses.append(valLoss)
        
        #Early Stopping: 
        areWeThereYet, topDog = stop(valLoss, model)
        if areWeThereYet:
            #Change this name later so it can include hyperparameters, etc.
            torch.save(topDog.state_dict(), '{}.pt'.format("BestModel_"+str(epoch) +"_epochs_" + str(numLayers) + "_hiddenLayers")) 
            print("Early stopping at epoch:" + str(epoch))
            break
        
        
    #Save final model if no early stopping
    if not areWeThereYet:
        print("Ugh didn't get to go home early, well let's save the final epoch model anyway.")
        finalModel = topDog if topDog is not None else model #fall back to the current model if EarlyStop returned no best model
        torch.save(finalModel.state_dict(), '{}.pt'.format("BestModel_"+str(numEpochs) +"_epochs_" + str(numLayers) + "_hiddenLayers"))
    
    return accuracies, losses, valAccs, valLosses
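
`EarlyStop.py` is not shown in this notebook, so the following is only a sketch of what an object with the same call pattern (`stop(valLoss, model)` returning a stop flag and the best model) might look like. The class name `EarlyStopSketch`, its attributes, and the toy loss sequence are all assumptions, not the actual `EarlyStop` implementation:

```python
import copy

class EarlyStopSketch:
    """Hypothetical stand-in for EarlyStop: tracks the best validation loss seen so far."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_model = None
        self.bad_epochs = 0  # consecutive epochs without improvement

    def __call__(self, val_loss, model):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            # snapshot, so later training cannot alter the saved best model
            self.best_model = copy.deepcopy(model)
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience, self.best_model

stop = EarlyStopSketch(patience=2)
toy_model = {"weights": [0.0]}  # any deep-copyable object works for this demo
history = []
for val_loss in [1.0, 0.8, 0.9, 0.85, 0.7]:
    should_stop, best = stop(val_loss, toy_model)
    history.append(should_stop)
    if should_stop:
        break
```

With a patience of 2, the loop stops after two consecutive epochs that fail to beat the best loss, so the final value in the toy sequence is never reached.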
        
      
        

In [None]:
# Test with dataloader
Train(derekZoolander, numEpochs = epochs, data = train_loader, lossFn = criterion, device = pleaseBeGpu)

Quick test run to check that the skeleton code works:

In [None]:
training = training.to(pleaseBeGpu)
trainingLabels = trainingLabels.to(pleaseBeGpu)
print("Targets are: {}".format(trainingLabels.squeeze(1)))
print("\n")

derekZoolander.train()
optimizer.zero_grad()
print("Summary before simulated training: \n")
output = derekZoolander(training)
loss = criterion(output[0], trainingLabels.squeeze(1)) #recall output returns a list to handle multi-task learning so here the prediction is output[0]
            
#Calculate accuracy:
numerator = AccuracyCalc(output[0], trainingLabels.squeeze(1))
print(loss.item())
    
#Train it: 
loss.backward()
optimizer.step()

print("\n")
print("Summary after simulated training: \n")
output = derekZoolander(training)
loss = criterion(output[0], trainingLabels.squeeze(1)) #recall output returns a list to handle multi-task learning so here the prediction is output[0]


print(loss.item())

#Calculate accuracy:
numerator = AccuracyCalc(output[0], trainingLabels.squeeze(1))

**Seems to run properly!**

Main function (or some sort of cohesive structure) and plotting: TBD

In [None]:
#Make some sort of plotting call later...