# Testing the Deep CCA method on the UK Biobank dataset

The data includes 599 cognitively normal males.

We are interested in two views of the data. The first contains the imaging data after dimensionality reduction, and the second is a concatenation of genetic data and health-related information.

### Importing necessary packages:

In [85]:
import pandas as pd
import random
import torch
import torch.nn as nn
import numpy as np
from linear_cca import linear_cca
from torch.utils.data import BatchSampler, SequentialSampler
from DeepCCAModels import DeepCCA
from main import Solver
from utils import load_data, svm_classify
try:
    import cPickle as thepickle
except ImportError:
    import _pickle as thepickle

import gzip

# Use float64 everywhere (set_default_tensor_type is deprecated in recent PyTorch releases)
torch.set_default_dtype(torch.float64)

### Data loading:

In [86]:
"""
    1st view:
    
    This view consists of the imaging data, which have been reduced in
    dimensionality through OPNMF. Initially there were 145 Regions Of
    Interest (ROIs); Orthogonal Projective Non-Negative Matrix
    Factorization reduced them to a set of 18 components. That is, each
    of the 599 samples in the dataset can be described as a linear
    combination of those 18 components, weighted by that sample's own
    set of coefficients.
    
    We now load the 599x18 coefficient matrix, which represents the 145
    ROIs:
"""
components = pd.read_pickle("DATA/ROI_OPNMF_Component_Coefficients.pkl")
view_1 = components
print(view_1.shape)
components.head()

(599, 18)


| | Component 1 | Component 2 | Component 3 | Component 4 | Component 5 | Component 6 | Component 7 | Component 8 | Component 9 | Component 10 | Component 11 | Component 12 | Component 13 | Component 14 | Component 15 | Component 16 | Component 17 | Component 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.459898 | 0.423744 | 0.931475 | 0.0 | 0.840692 | 3.0396 | 0.935822 | 0.485301 | 0.288034 | 2.447542 | 0.0 | 1.193371 | 2.435244 | 2.079515 | 1.530588 | 1.941043 | 1.478721 | 0.364583 |
| 1 | 0.82803 | 2.786958 | 3.553455 | 2.721096 | 1.646939 | 6.717957 | 1.822474 | 1.822576 | 2.516197 | 0.295859 | 2.407094 | 0.95139 | 2.249399 | 0.0 | 2.23457 | 2.560864 | 1.073027 | 1.999624 |
| 2 | 6.796636 | 3.513108 | 2.574221 | 0.0 | 1.022791 | 2.396651 | 0.652889 | 0.588703 | 0.458944 | 0.419344 | 0.0 | 0.0 | 1.605214 | 0.0 | 1.56931 | 3.396855 | 1.636174 | 1.165219 |
| 3 | 5.059488 | 3.198297 | 1.573845 | 1.239967 | 0.574101 | 3.530107 | 2.466629 | 0.599458 | 1.328156 | 2.271253 | 0.204155 | 0.0 | 2.169149 | 2.497214 | 2.349669 | 2.538714 | 0.908674 | 0.34403 |
| 4 | 5.408747 | 3.722463 | 2.912291 | 1.121762 | 0.0 | 3.131542 | 1.542064 | 1.60095 | 0.958946 | 0.564767 | 0.215583 | 0.0 | 3.686751 | 2.30543 | 0.963403 | 1.741627 | 2.921104 | 0.0 |
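
To make the "linear combination of components" description concrete, the sketch below builds a synthetic, hypothetical non-negative component matrix `W` with orthonormal columns (it is not the OPNMF result for this dataset) and shows how per-sample coefficients can be obtained by projection and then used to approximate the ROI values. The 599 samples, 145 ROIs and 18 components mirror the text; all values and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_rois, n_components = 599, 145, 18

# Hypothetical non-negative component matrix W (n_rois x n_components).
# Each ROI is assigned to exactly one component, so the columns do not
# overlap and W becomes orthonormal once every column has unit length.
assignment = np.arange(n_rois) % n_components
W = np.zeros((n_rois, n_components))
W[np.arange(n_rois), assignment] = rng.uniform(0.5, 1.5, size=n_rois)
W /= np.linalg.norm(W, axis=0, keepdims=True)

# Synthetic non-negative ROI measurements (n_samples x n_rois).
X = rng.uniform(0.0, 1.0, size=(n_samples, n_rois))

# Projective factorization: coefficients are the projection of the data on W.
coefficients = X @ W             # shape (599, 18), analogous to view_1
X_approx = coefficients @ W.T    # rank-18 approximation of the ROI data

print(coefficients.shape, X_approx.shape)
```

Because `W` is non-negative and orthonormal here, the coefficients are non-negative as well, which matches the zero and positive entries seen in the table above.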


In [110]:
"""
    2nd view:
    
    This view consists of the genetic data, as well as health related
    information for each individual. A table is created that contains
    the following attributes for each participant:
    Age (number)
    APOE Genotype (categorical code, 1.0-6.0)
    Diabetes (either 0.0 or 1.0)
    Systolic pressure (number)
    Diastolic pressure (number)
    Smoking (either 1.0, 2.0 or 3.0)
    BMI (number)
    
    The APOE Genotype is based on whether the participant has:
    E2/E2 alleles -> 1.0
    E2/E3 alleles -> 2.0
    E2/E4 alleles -> 3.0
    E3/E3 alleles -> 4.0
    E3/E4 alleles -> 5.0
    E4/E4 alleles -> 6.0
    
    The Smoking attribute encodes whether the participant has never
    smoked (1.0), is a former smoker (2.0) or is a current smoker (3.0).
"""
data = pd.read_csv("DATA/ukbb_cn_males_baseline_4550.csv")
# Assign each result back to the column: calling replace(..., inplace=True)
# on a column selection does not update the DataFrame under pandas copy-on-write.
# Note that missing genotypes are mapped to 1.0, i.e. merged with E2/E2.
data['APOE_Genotype'] = data['APOE_Genotype'].replace({"E2/E2":1.0,
                                                       "E2/E3":2.0,
                                                       "E2/E4":3.0,
                                                       "E3/E3":4.0,
                                                       "E3/E4":5.0,
                                                       "E4/E4":6.0,
                                                       np.nan:1.0})
data['Diabetes'] = data['Diabetes'].replace({"Diabetes negative/absent":0.0,
                                             "Diabetes positive/present":1.0,
                                             np.nan:0.0})
data['Smoking'] = data['Smoking'].replace({"Never":1.0,
                                           "Former":2.0,
                                           "Current":3.0,
                                           "unknown":1.0,
                                           np.nan:1.0})

# NaNs must be handled before the data are converted to a tensor.
# Rows containing NaNs should normally be removed; mean imputation is used here
# only because this is a test:
data['Diastole'] = data['Diastole'].fillna(data['Diastole'].mean())
data['Systole'] = data['Systole'].fillna(data['Systole'].mean())
data['BMI'] = data['BMI'].fillna(data['BMI'].mean())


columns_of_interest = ['Age', 'APOE_Genotype','Diabetes', 'Diastole', 'Systole', 'Smoking','BMI']
view_2 = data[columns_of_interest].astype('float64')

print(view_2.shape)
print(view_2.dtypes)
view_2.head()

(599, 7)
Age              float64
APOE_Genotype    float64
Diabetes         float64
Diastole         float64
Systole          float64
Smoking          float64
BMI              float64
dtype: object


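| | Age | APOE_Genotype | Diabetes | Diastole | Systole | Smoking | BMI |
|---|---|---|---|---|---|---|---|
| 0 | 45.0 | 2.0 | 0.0 | 70.0 | 113.0 | 1.0 | 21.7009 |
| 1 | 45.0 | 4.0 | 0.0 | 78.0 | 124.0 | 1.0 | 25.1177 |
| 2 | 46.0 | 4.0 | 0.0 | 78.0 | 138.0 | 2.0 | 27.9676 |
| 3 | 46.0 | 1.0 | 0.0 | 73.0 | 124.0 | 2.0 | 28.1831 |
| 4 | 46.0 | 4.0 | 0.0 | 88.0 | 148.0 | 1.0 | 26.818 |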
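
The encoding and imputation steps above can be sketched on a small hypothetical table (the values below are invented, not taken from the UK Biobank file). This variant uses `Series.map` and `fillna` with assignment back to the column, and it fills missing categorical codes with the most frequent code rather than a fixed one; that choice is an assumption for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Invented toy data reusing three of the notebook's column names.
toy = pd.DataFrame({
    "APOE_Genotype": ["E3/E3", "E3/E4", np.nan, "E3/E3"],
    "Smoking": ["Never", "Former", "unknown", "Current"],
    "BMI": [21.0, np.nan, 27.0, 30.0],
})

apoe_codes = {"E2/E2": 1.0, "E2/E3": 2.0, "E2/E4": 3.0,
              "E3/E3": 4.0, "E3/E4": 5.0, "E4/E4": 6.0}
smoking_codes = {"Never": 1.0, "Former": 2.0, "Current": 3.0}

# map() turns every value without a dictionary entry into NaN.
toy["APOE_Genotype"] = toy["APOE_Genotype"].map(apoe_codes)
toy["Smoking"] = toy["Smoking"].map(smoking_codes)

# Fill missing categorical codes with the most frequent code (an assumption)
# and missing continuous values with the column mean.
for col in ["APOE_Genotype", "Smoking"]:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])
toy["BMI"] = toy["BMI"].fillna(toy["BMI"].mean())

toy = toy.astype("float64")
print(toy)
```

Keeping the mapping dictionaries separate from the missing-value rule makes it explicit which rows were imputed and with what value.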


### Parameters:

In [120]:
# if a GPU is available, torch.device should be 'cuda'
device = torch.device('cpu')
# print("Using", torch.cuda.device_count(), "GPUs")

# the path to save the final learned features
save_to = './new_features.gz'

# the size of the new space learned by the model (number of the new features)
outdim_size = 6

# size of the input for view 1 and view 2
input_shape1 = 18
input_shape2 = 7

# number of nodes in each layer of each network
# the two networks may use different layer sizes, so some experimentation is needed
layer_sizes1 = [1024, 1024, outdim_size]
layer_sizes2 = [1024, 1024, outdim_size]

# the parameters for training the network
learning_rate = 1e-3
epoch_num = 100
batch_size = 2000

# the regularization parameter of the network
# seems necessary to avoid the gradient exploding especially when non-saturating activations are used
reg_par = 1e-5

# specifies if all the singular values should get used to calculate the correlation or just the top 
# outdim_size ones
# if one option does not work for a network or dataset, try the other one
use_all_singular_values = False

# if a linear CCA should get applied on the learned features extracted from the networks
# it does not affect the performance on noisy MNIST significantly
apply_linear_cca = True
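
The roles of `outdim_size`, `reg_par` and `use_all_singular_values` are easier to see in the correlation objective that Deep CCA maximizes. The NumPy sketch below is an illustrative re-implementation under stated assumptions, not the code in `DeepCCAModels`: it takes the outputs of the two networks, regularizes their covariance matrices with `reg_par`, and sums singular values of the whitened cross-covariance. The function names `inv_sqrt` and `total_correlation` are hypothetical.

```python
import numpy as np

def inv_sqrt(sigma):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(sigma)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

def total_correlation(h1, h2, outdim_size, reg_par=1e-5,
                      use_all_singular_values=False):
    """Sum of canonical correlations between two (n_samples x dim) outputs."""
    n = h1.shape[0]
    h1c = h1 - h1.mean(axis=0)
    h2c = h2 - h2.mean(axis=0)
    # Regularized covariances; reg_par keeps them invertible.
    s11 = h1c.T @ h1c / (n - 1) + reg_par * np.eye(h1.shape[1])
    s22 = h2c.T @ h2c / (n - 1) + reg_par * np.eye(h2.shape[1])
    s12 = h1c.T @ h2c / (n - 1)
    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    singular_values = np.linalg.svd(t, compute_uv=False)  # sorted descending
    if not use_all_singular_values:
        singular_values = singular_values[:outdim_size]
    return singular_values.sum()

rng = np.random.default_rng(0)
h = rng.normal(size=(450, 6))
noise = rng.normal(size=(450, 6))
corr_same = total_correlation(h, h, outdim_size=6)       # identical views
corr_noise = total_correlation(h, noise, outdim_size=6)  # unrelated views
print(corr_same, corr_noise)
```

If the training loss is the negative of this quantity, it is bounded below by `-outdim_size` (here -6), which would be consistent with the negative loss values in the training log further down.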

### Building, training, and producing the new features with DCCA:


In [121]:
# Convert the pandas dataframe to numpy arrays for pytorch:
view_1_n = view_1.to_numpy()
view_2_n = view_2.to_numpy()

In [122]:
# Shuffle the samples, applying the same permutation to both views so rows stay paired:
indices = np.arange(view_1_n.shape[0])
np.random.shuffle(indices)
view_1_n = view_1_n[indices]
view_2_n = view_2_n[indices]

print(view_1_n.shape, type(view_1_n), view_1_n.dtype)
print(view_2_n.shape, type(view_2_n), view_2_n.dtype)

view_1_t = torch.from_numpy(view_1_n)
print(view_1_t.shape, type(view_1_t))
view_2_t = torch.from_numpy(view_2_n)
print(view_2_t.shape, type(view_2_t))

(599, 18) <class 'numpy.ndarray'> float64
(599, 7) <class 'numpy.ndarray'> float64
torch.Size([599, 18]) <class 'torch.Tensor'>
torch.Size([599, 7]) <class 'torch.Tensor'>
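
Because `np.random.shuffle` is called without a fixed seed, the train/validation/test split changes between runs. A minimal sketch of a reproducible variant follows, using small stand-in arrays and an arbitrarily chosen seed (both are assumptions, not part of the original notebook):

```python
import numpy as np

# Stand-in arrays with the same shapes as view_1_n and view_2_n;
# every column stores the original row index so pairing can be checked.
n_samples = 599
view_a = np.tile(np.arange(n_samples, dtype=np.float64)[:, None], (1, 18))
view_b = np.tile(np.arange(n_samples, dtype=np.float64)[:, None], (1, 7))

seed = 0  # arbitrary; any fixed integer makes the shuffle repeatable
rng = np.random.default_rng(seed)
perm = rng.permutation(n_samples)

# Index both views with the same permutation so each row still refers
# to the same participant in both views.
view_a_shuffled = view_a[perm]
view_b_shuffled = view_b[perm]

print(perm[:5])
```

Storing `perm` alongside the results also makes it possible to map the learned features back to the original participant order.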


In [126]:
data1 = view_1_t
data2 = view_2_t

model = DeepCCA(layer_sizes1, layer_sizes2, input_shape1,
                input_shape2, outdim_size, use_all_singular_values, device=device).double()
l_cca = None
if apply_linear_cca:
    l_cca = linear_cca()
solver = Solver(model, l_cca, outdim_size, epoch_num, batch_size,
                learning_rate, reg_par, device=device)
# Split the dataset into training, validation and testing (450-100-49):
train1, train2 = data1[0:450], data2[0:450]
val1, val2 = data1[450:550], data2[450:550]
test1, test2 = data1[550:], data2[550:]

solver.fit(train1, train2, val1, val2, test1, test2)
# TODO: Save linear_cca model if needed

# Cumulative boundaries of the train/val/test subsets in the concatenated data
n_train, n_val, n_test = train1.size(0), val1.size(0), test1.size(0)
set_size = [0, n_train, n_train + n_val, n_train + n_val + n_test]
loss, outputs = solver.test(torch.cat([train1, val1, test1], dim=0),
                            torch.cat([train2, val2, test2], dim=0),
                            apply_linear_cca)

[ INFO : 2021-10-30 23:05:22,739 ] - DataParallel(
  (module): DeepCCA(
    (model1): MlpNet(
      (layers): ModuleList(
        (0): Sequential(
          (0): Linear(in_features=18, out_features=1024, bias=True)
          (1): Sigmoid()
          (2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=False, track_running_stats=True)
        )
        (1): Sequential(
          (0): Linear(in_features=1024, out_features=1024, bias=True)
          (1): Sigmoid()
          (2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=False, track_running_stats=True)
        )
        (2): Sequential(
          (0): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=False, track_running_stats=True)
          (1): Linear(in_features=1024, out_features=6, bias=True)
        )
      )
    )
    (model2): MlpNet(
      (layers): ModuleList(
        (0): Sequential(
          (0): Linear(in_features=7, out_features=1024, bias=True)
          (1): Sigmoid()
          (2): BatchNorm1d(1024, eps=1e-05,

[ INFO : 2021-10-30 23:05:32,832 ] - Epoch 33: val_loss did not improve from -1.3322
[ INFO : 2021-10-30 23:05:32,833 ] - Epoch 33/100 - time: 0.30 - training_loss: -2.8401 - val_loss: -1.0676
[ INFO : 2021-10-30 23:05:33,129 ] - Epoch 34: val_loss did not improve from -1.3322
[ INFO : 2021-10-30 23:05:33,130 ] - Epoch 34/100 - time: 0.30 - training_loss: -2.8795 - val_loss: -1.0261
[ INFO : 2021-10-30 23:05:33,430 ] - Epoch 35: val_loss did not improve from -1.3322
[ INFO : 2021-10-30 23:05:33,430 ] - Epoch 35/100 - time: 0.30 - training_loss: -2.9125 - val_loss: -0.9856
[ INFO : 2021-10-30 23:05:33,729 ] - Epoch 36: val_loss did not improve from -1.3322
[ INFO : 2021-10-30 23:05:33,729 ] - Epoch 36/100 - time: 0.30 - training_loss: -2.9385 - val_loss: -1.0911
[ INFO : 2021-10-30 23:05:34,026 ] - Epoch 37: val_loss did not improve from -1.3322
[ INFO : 2021-10-30 23:05:34,027 ] - Epoch 37/100 - time: 0.30 - training_loss: -2.9706 - val_loss: -1.1676
[ INFO : 2021-10-30 23:05:34,325 ] 

[ INFO : 2021-10-30 23:05:45,684 ] - Epoch 76: val_loss did not improve from -1.3322
[ INFO : 2021-10-30 23:05:45,685 ] - Epoch 76/100 - time: 0.30 - training_loss: -3.8664 - val_loss: -1.0548
[ INFO : 2021-10-30 23:05:45,984 ] - Epoch 77: val_loss did not improve from -1.3322
[ INFO : 2021-10-30 23:05:45,984 ] - Epoch 77/100 - time: 0.30 - training_loss: -3.8813 - val_loss: -0.8871
[ INFO : 2021-10-30 23:05:46,283 ] - Epoch 78: val_loss did not improve from -1.3322
[ INFO : 2021-10-30 23:05:46,284 ] - Epoch 78/100 - time: 0.30 - training_loss: -3.8987 - val_loss: -1.2667
[ INFO : 2021-10-30 23:05:46,583 ] - Epoch 79: val_loss did not improve from -1.3322
[ INFO : 2021-10-30 23:05:46,583 ] - Epoch 79/100 - time: 0.30 - training_loss: -3.9145 - val_loss: -1.0138
[ INFO : 2021-10-30 23:05:46,882 ] - Epoch 80: val_loss did not improve from -1.3322
[ INFO : 2021-10-30 23:05:46,882 ] - Epoch 80/100 - time: 0.30 - training_loss: -3.9264 - val_loss: -1.0423
[ INFO : 2021-10-30 23:05:47,181 ] 

Linear CCA started!
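
The `set_size` boundaries computed in the cell above are not used later in the notebook. The sketch below shows how they could be used to split the learned features back into the three subsets. It assumes that `outputs` is a pair of arrays with one row per sample and `outdim_size` columns; random stand-ins are used here, since the actual return type of `solver.test` is not shown.

```python
import numpy as np

outdim_size = 6
n_train, n_val, n_test = 450, 100, 49   # 599 samples in total
set_size = [0, n_train, n_train + n_val, n_train + n_val + n_test]

# Stand-in for the learned features of the two views.
rng = np.random.default_rng(0)
outputs = [rng.normal(size=(set_size[-1], outdim_size)) for _ in range(2)]

# Slice each view between consecutive boundaries.
splits = {}
for idx, name in enumerate(["train", "val", "test"]):
    start, stop = set_size[idx], set_size[idx + 1]
    splits[name] = (outputs[0][start:stop], outputs[1][start:stop])

for name, (feat1, feat2) in splits.items():
    print(name, feat1.shape, feat2.shape)
```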


### Classification test using SVM on the learned features:

In [128]:
# Skipped: the dataset has no class labels, so svm_classify cannot be applied.

### Saving the new features:

In [None]:
# Save the new features to the gzip-compressed pickle file specified by save_to
print('saving new features ...')
with gzip.open(save_to, 'wb') as f1:
    thepickle.dump(outputs, f1)
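
A file written this way can be read back with the same two modules. The sketch below round-trips a stand-in object through a temporary file rather than `./new_features.gz`, and uses the standard `pickle` module; the variable names and the shape of the stand-in features are illustrative assumptions.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np

# Stand-in for the learned features: one array per view.
features = [np.zeros((599, 6)), np.ones((599, 6))]

path = os.path.join(tempfile.mkdtemp(), "new_features.gz")

# Write: the with-block closes the file even if dump() raises.
with gzip.open(path, "wb") as f:
    pickle.dump(features, f)

# Read the features back.
with gzip.open(path, "rb") as f:
    loaded = pickle.load(f)

print(len(loaded), loaded[0].shape)
```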

### Loading the model:

In [132]:
d = torch.load('checkpoint.model')
solver.model.load_state_dict(d)
solver.model.parameters()

<generator object Module.parameters at 0x7ff5443e2970>