<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#CTGAN" data-toc-modified-id="CTGAN-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>CTGAN</a></span><ul class="toc-item"><li><span><a href="#Generate-sythetic-data" data-toc-modified-id="Generate-sythetic-data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Generate synthetic data</a></span></li></ul></li><li><span><a href="#References-(GANs)" data-toc-modified-id="References-(GANs)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>References (GANs)</a></span></li><li><span><a href="#TODO" data-toc-modified-id="TODO-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>TODO</a></span></li></ul></div>

In [1]:
# import PyTorch
import torch

# standard DS stack
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd
# embed static images in the ipynb
%matplotlib inline 

# neural network package
import torch.nn as nn 
import torch.nn.functional as F

# computer vision
import torchvision
from torchvision import transforms
from PIL import Image

# dataset loading
from torch.utils.data import Dataset, DataLoader

# convenient package for plotting loss functions
# from livelossplot import PlotLosses

import copy
import importlib.util # to run outside module
from sklearn.model_selection import train_test_split

In [2]:
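# Retrieve preprocessed titanic dataset
# Raw strings keep the backslashes in the Windows path from being read
# as escape sequences.
system_path = (r"C:\Users\uniqu\Adaptation\github repos"
               r"\Bioinformatics-Neural Networks for Genomic Risk")

# Run the preprocessing script in this namespace (defines preprocess_titanic)
with open(system_path + r"\preprocess_titanic.py") as script:
    exec(script.read())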

X, Y, col_names = preprocess_titanic()

In [3]:
from sklearn.model_selection import train_test_split

# Real (toy) dataset
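def get_real_data(X, Y):
    """Stack targets and features into real, train, and test arrays.

    Args:
        X (np.ndarray): feature matrix, shape (n_samples, n_features).
        Y (np.ndarray): target matrix, shape (n_samples, 1).

    Returns:
        tuple of np.ndarray: (real_data, train_data, test_data), each with
        the target in the first column followed by the features.
    """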
    # feature_names = list(dataset.columns)[1:]
    # target_names = list(dataset.columns)[0]
    
    # Perform train-test split
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.3, random_state=7)

    real_data = np.hstack([Y, X])
    train_data = np.hstack([Y_train, X_train])
    test_data = np.hstack([Y_test, X_test])
        
    return real_data, train_data, test_data

# get_real_data takes only X and Y
real_data, train_data, test_data = get_real_data(X, Y)

print(f"feature names:\n  {list(titanic_data.columns[1:])}\n")
print(f"target names: '{titanic_data.columns[0]}'")
X[0], Y[0]

feature names:
  ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alone']

target names: 'survived'


(array([ 3.  ,  1.  , 22.  ,  1.  ,  0.  ,  7.25,  0.  ,  0.  ,  1.  ,
         0.  ,  0.  ,  0.  ,  1.  ,  0.  ,  0.  ,  1.  ,  0.  ,  1.  ,
         0.  ,  0.  ,  1.  ]),
 array([0.]))

## CTGAN

[TGAN](https://github.com/sdv-dev/TGAN) seemed like exactly the tool I was looking for, but [CTGAN](https://github.com/sdv-dev/CTGAN) turns out to be better. Several major differences make CTGAN outperform TGAN:
- **Preprocessing**: CTGAN uses a more sophisticated variational Gaussian mixture model to detect the modes of continuous columns.
- **Network structure**: TGAN uses an LSTM to generate synthetic data column by column. CTGAN uses fully connected networks, which are more efficient.
- **Features to prevent mode collapse**: CTGAN uses a conditional generator and resamples the training data to prevent mode collapse on discrete columns. It uses WGAN-GP and PacGAN to stabilize GAN training.
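
The resampling mentioned in the last bullet can be sketched roughly as follows. This is a minimal NumPy illustration of the idea as I understand it from the CTGAN paper: pick a discrete column uniformly, pick one of its categories with probability proportional to the log of its frequency, then draw a real row that has that category. The function names (`log_frequency_pmf`, `sample_condition`), the `+ 1` inside the logarithm, and the toy columns are assumptions made for illustration, not the library's actual code.

```python
import numpy as np


def log_frequency_pmf(codes):
    """Return (categories, pmf) with pmf proportional to log(count + 1)."""
    categories, counts = np.unique(codes, return_counts=True)
    # The +1 is an assumption; it keeps singleton categories at non-zero mass.
    log_freq = np.log(counts + 1)
    return categories, log_freq / log_freq.sum()


def sample_condition(discrete_columns, rng):
    """Sample (column index, category, row index, one-hot condition vector).

    discrete_columns: list of 1-D integer arrays, one per discrete column.
    """
    # 1. Choose one discrete column uniformly at random.
    col_idx = int(rng.integers(len(discrete_columns)))
    codes = discrete_columns[col_idx]

    # 2. Choose a category from the log-frequency PMF of that column.
    categories, pmf = log_frequency_pmf(codes)
    cat_pos = int(rng.choice(len(categories), p=pmf))
    category = categories[cat_pos]

    # 3. Build a one-hot condition vector over the categories of all columns.
    sizes = [len(np.unique(c)) for c in discrete_columns]
    cond = np.zeros(sum(sizes))
    cond[sum(sizes[:col_idx]) + cat_pos] = 1.0

    # 4. Draw a real training row that has the chosen category.
    row_idx = int(rng.choice(np.flatnonzero(codes == category)))
    return col_idx, category, row_idx, cond


# Toy discrete columns: an imbalanced binary column and a 3-class column.
rng = np.random.default_rng(0)
sex = np.array([0] * 90 + [1] * 10)
pclass = np.array([1] * 20 + [2] * 30 + [3] * 50)

_, pmf = log_frequency_pmf(sex)
print(pmf)  # the minority class gets well above its raw 10% share

col_idx, category, row_idx, cond = sample_condition([sex, pclass], rng)
print(col_idx, category, row_idx, cond)
```

Sampling on the log scale flattens the category distribution, so rare categories show up in training batches far more often than they would under plain row sampling, while frequent ones still dominate.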

In [6]:
import warnings
warnings.filterwarnings('ignore')

from ctgan import load_demo
data = load_demo()

from ctgan import CTGANSynthesizer
ctgan = CTGANSynthesizer()

# CTGAN Synthesizer models: G & D
# Fit the models to the training data
ctgan.fit(train_data=train_data,
          epochs=100)

### Generate synthetic data

In [7]:
# Sample data similar to training data.
fake_samples = ctgan.sample(n=1000, 
             condition_column=None, 
             condition_value=None)

pd.DataFrame(data=fake_samples, columns=col_names)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,alone,who_child,who_man,...,embark_town_Queenstown,embark_town_Southampton,embarked_C,embarked_Q,embarked_S,adult_male_False,adult_male_True,class_First,class_Second,class_Third
0,1.080237,0.791131,1.063448,28.413202,0.771792,-0.021365,34.495780,1.035639,-0.023513,1.014079,...,0.013898,0.145945,0.935486,0.028479,-0.073765,-0.120157,0.918098,0.867076,-0.033363,1.104807
1,0.917577,2.145518,1.065295,19.750091,2.596427,-0.150105,91.814781,0.891793,0.026567,1.029313,...,0.004099,0.930086,-0.006607,0.998861,0.968039,-0.073928,0.999255,0.000214,0.062596,1.018597
2,0.855402,3.106432,-0.123146,20.296845,3.790539,0.800693,7.423602,0.947184,0.775650,1.025470,...,0.009357,0.931154,-0.045928,1.174743,0.980231,1.004876,-0.096677,-0.038539,0.048083,0.020445
3,1.012989,1.833180,1.018224,33.280963,-0.145543,0.068389,7.426653,-0.132893,-0.024482,0.983342,...,0.008880,1.037112,-0.056452,-0.008572,0.041343,-0.073893,-0.048799,0.772480,0.057497,1.104943
4,0.086709,1.047103,1.022782,19.670918,0.806505,0.065827,7.851000,0.921226,-0.014895,-0.023149,...,-0.009781,0.969639,0.000792,-0.019421,-0.066347,-0.073329,1.073442,1.061139,0.008524,0.060689
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.952774,0.820870,1.023659,26.797144,-0.092427,4.615633,6.082351,0.897115,0.843766,1.062096,...,-0.010419,-0.008431,1.009220,-0.020458,0.976878,-0.099069,0.991255,1.057259,0.046726,0.081935
996,1.025562,0.683903,0.148759,23.072406,-0.112876,0.841941,-1.496806,0.966524,0.624451,1.057840,...,0.014569,-0.020873,0.053306,0.008983,1.008041,-0.041440,0.944608,0.005733,0.015435,0.971664
997,-0.017521,2.971828,0.065092,20.669503,2.625256,5.984285,347.287242,-0.067533,-0.025066,1.071818,...,1.007326,-0.110520,-0.023948,0.020534,0.121246,0.028778,0.055800,-0.010000,-0.051961,0.091571
998,1.080764,0.863658,0.928797,21.933438,0.770049,4.145770,17.517423,0.965627,-0.012321,1.024681,...,0.606789,0.975313,-0.054702,0.006308,0.012312,0.015555,1.004909,-0.038201,0.938499,1.088596
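
The sampled table above has fractional and slightly negative values in columns that are binary or one-hot in the real data (`survived`, `embarked_C`, and so on), which suggests the discrete columns were modeled as continuous. As far as I know, `ctgan` accepts a `discrete_columns` argument at fit time, and declaring them there is probably the cleaner fix. As a stopgap, here is a minimal post-processing sketch that snaps such columns back to valid values. The helper name `snap_discrete`, the 0.5 threshold, and the choice of columns are assumptions for illustration; the three rows reuse values from rows 0, 2, and 4 of the output above.

```python
import numpy as np
import pandas as pd


def snap_discrete(fake_df, binary_cols, onehot_groups):
    """Round binary columns and force each one-hot group to a single 1."""
    out = fake_df.copy()
    for col in binary_cols:
        # Threshold at 0.5 (an assumed cut-off) to recover 0/1 labels.
        out[col] = (out[col] >= 0.5).astype(int)
    for group in onehot_groups:
        # Keep only the largest entry of each group as the active category.
        winners = out[group].to_numpy().argmax(axis=1)
        out[group] = np.eye(len(group), dtype=int)[winners]
    return out


# A few generated values copied from the sample output above.
fake = pd.DataFrame({
    "survived":   [1.080237, 0.855402, 0.086709],
    "embarked_C": [0.935486, -0.045928, 0.000792],
    "embarked_Q": [0.028479, 1.174743, -0.019421],
    "embarked_S": [-0.073765, 0.980231, -0.066347],
})
embarked = ["embarked_C", "embarked_Q", "embarked_S"]

clean = snap_discrete(fake, binary_cols=["survived"], onehot_groups=[embarked])
print(clean)
```

Note that the second row has two `embarked_*` values near 1; the argmax silently picks one, so this step hides rather than fixes inconsistencies in the generated rows.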


In [None]:
# GPU for cloud training
if torch.cuda.is_available(): 
    device = torch.device("cuda") # device = GPU
else:
    device = torch.device("cpu") # device = CPU

```python

# =======================================
# Section commented out until next meeting
# =======================================

# Recreate the split from get_real_data so these names exist at module
# scope (inside that function they are local variables).
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=7)

# Initialize constants
n_features = X_train.shape[1]
BATCH_SIZE, D_in, D_out = 15, n_features, 1
H_0, H_1 = int(0.7*n_features), int(0.4*n_features)


class Generator(nn.Module):
    """Maps a noise vector z (size D_in) to one synthetic feature row."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(D_in, H_0)
        self.fc2 = nn.Linear(H_0, H_1)
        # One output per feature, so D can score generated rows
        self.fc3 = nn.Linear(H_1, n_features)

        self.dropout = nn.Dropout(p=0.2)  # defined but not applied yet

    def forward(self, z):
        z = F.leaky_relu(self.fc1(z))
        z = F.leaky_relu(self.fc2(z))
        z = F.leaky_relu(self.fc3(z))
        return z


class Discriminator(nn.Module):
    """Scores a feature row; returns a raw logit for BCEWithLogitsLoss."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(n_features, H_0)
        self.fc2 = nn.Linear(H_0, H_1)
        self.fc3 = nn.Linear(H_1, D_out)

    def forward(self, x):
        x = F.leaky_relu(self.fc1(x))
        x = F.leaky_relu(self.fc2(x))
        # No final activation: BCEWithLogitsLoss applies the sigmoid itself
        return self.fc3(x)


class Toy_Dataset(Dataset):  # inherit from torch's Dataset class
    def __init__(self, train):
        # Select the split, then convert to float32 tensors
        X_split, Y_split = (X_train, Y_train) if train else (X_test, Y_test)
        self.X = torch.from_numpy(X_split.astype(np.float32))
        self.Y = torch.from_numpy(Y_split.astype(np.float32))

        if self.X.shape[0] != self.Y.shape[0]:
            raise ValueError("Shape mismatch")
        self.n_samples = self.X.shape[0]

    def __getitem__(self, idx):
        return self.X[idx], self.Y[idx]

    def __len__(self):
        return self.n_samples


# Set DataLoaders
train_set = Toy_Dataset(train=True)
test_set = Toy_Dataset(train=False)
train_dl = DataLoader(dataset=train_set, batch_size=BATCH_SIZE, shuffle=True)
test_dl = DataLoader(dataset=test_set, batch_size=BATCH_SIZE, shuffle=True)

# Initialize networks
G = Generator()
D = Discriminator()

loss_fn = nn.BCEWithLogitsLoss()

def check_models():
    """Smoke test: one noise vector through G, one real row through D."""
    x = torch.from_numpy(X_train[0]).float()
    z = torch.randn(D_in)
    print( G(z) )
    print( D(x) )
    print( D(G(z)) )

check_models()
```

## References (GANs)

Conceptual References:
- [*GANS*, Google Developers](https://developers.google.com/machine-learning/gan/gan_structure)
- [Ian Goodfellow on Lex Fridman podcast](https://www.youtube.com/watch?v=Z6rxFNMGdn0&t=2826s&ab_channel=LexFridman)
- [Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. In *Advances in neural information processing systems* (pp. 2672-2680).](https://arxiv.org/pdf/1406.2661.pdf)
- [Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training gans. In *Advances in neural information processing systems* (pp. 2234-2242).](https://arxiv.org/pdf/1606.03498.pdf)
- [DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture](https://www.nature.com/articles/s41598-019-47765-6)

Implementation References:
- [eriklindernoren/PyTorch-GAN](https://github.com/eriklindernoren/PyTorch-GAN/blob/master/implementations/gan/gan.py)
  - [list of successful GAN architectures](https://github.com/eriklindernoren/PyTorch-GAN#gan)
- [fast ai, free GPU w/ Google Colab guide](https://www.kdnuggets.com/2018/02/fast-ai-lesson-1-google-colab-free-gpu.html)
- [PyTorch GAN, github.com/devnag](https://github.com/devnag/pytorch-generative-adversarial-networks/blob/master/gan_pytorch.py)
- [Brownlee, Jason (2020). How to Code the GAN Training Algorithm. *machinelearningmastery.com*](https://machinelearningmastery.com/how-to-code-the-generative-adversarial-network-training-algorithm-and-loss-functions/)
- TGAN: [paper](https://arxiv.org/pdf/1811.11264.pdf), [repo](https://github.com/sdv-dev/TGAN)
- CTGAN: [repo](https://github.com/sdv-dev/CTGAN)

----

# Next steps



## TODO
- Play around with the CTGAN architecture to improve upon it
- Email Itsik to get approval to work with the actual dataset.
- Plan out a hyperparameter optimization scheme for the disease-predicting NNs.
- Implement the NNs for image representation of the genome.
  - Set up a convolutional NN architecture to analyze the images.
- Compare everything empirically in a disease prediction task: linear methods, MLP + GAN, image representation + CNN, and XGBoost (benchmark).

