# Deep learning model using TensorFlow for predicting the SSRI label
Reference:  https://www.tensorflow.org/tutorials/wide_and_deep

## Introduction
In this project, the raw data are 90*90 brain connectivity matrices from 91 infants, together with an SSRI label for each infant (1 for positive, 0 for negative). The goal is to feed the connectivity data to a deep learning model that predicts the SSRI label, which makes this a binary classification problem. We use TensorFlow to build the deep model and train it on our data.


## Data Conversion
Before we create the deep model, each connectivity matrix must be converted to vector format, so that one row of 8100 columns (90*90) represents one subject. Each column represents an edge.
This step is done in MATLAB.
After conversion, the data are exported to csv files so that they can be read as dataframes in Python.

'infant' -- contains the 90*90 connectivity matrices estimated from 91 infant brains

'infant_white' -- contains the same information as 'infant', but is based on different parameters for the white matter tract estimate
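
The conversion itself was done in MATLAB and that script is not shown here. As a rough illustration only, the same flattening could be sketched in Python with NumPy. The function name `flatten_connectivity` and the random toy data are assumptions made for this example. Note also that NumPy flattens row by row by default, whereas MATLAB orders elements column by column, so the edge numbering produced below may not match the original csv files.

```python
import numpy as np


def flatten_connectivity(matrices):
    """Flatten a stack of square matrices into one row per subject.

    `matrices` has shape (n_subjects, n_regions, n_regions).
    The result has shape (n_subjects, n_regions * n_regions).
    """
    matrices = np.asarray(matrices)
    n_subjects, n_rows, n_cols = matrices.shape
    # Row-major flattening: column index in the output = row * n_cols + col
    return matrices.reshape(n_subjects, n_rows * n_cols)


# Hypothetical stand-in for the real data: 91 subjects, 90 x 90 regions
rng = np.random.RandomState(0)
toy = rng.randint(0, 1000, size=(91, 90, 90))

flat = flatten_connectivity(toy)
print(flat.shape)
```

Each row of `flat` then plays the role of one subject, and each of the 8100 columns plays the role of one edge.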

## Data Import
The generated csv files are imported as dataframes using pandas. 'Variables.txt' contains the SSRI labels; 'infant.csv' and 'infant_white.csv' contain the brain connectivity matrices of the 91 infants. After the SSRI labels and the connectivity matrices are read in, they are combined into a single dataframe for further analysis.

In [272]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import sys
import tempfile

from six.moves import urllib
import tensorflow as tf

In [273]:
# Read in data
import pandas as pd
import scipy.io
variable = pd.read_csv('/Users/posnerlab/Desktop/Project 2/BrainCNN/neonatalinfantbrainstudy/Variables.txt', sep="\t")
infant = pd.read_csv('/Users/posnerlab/Desktop/Project 2/BrainCNN/neonatalinfantbrainstudy/infant.csv', header=None)
infant_white = pd.read_csv('/Users/posnerlab/Desktop/Project 2/BrainCNN/neonatalinfantbrainstudy/infant_white.csv', header=None)
#infant = infant.drop(0, 1)
#infant = infant.drop(8099, 1)
#infant_white = infant_white.drop(0, 1)
#infant_white = infant_white.drop(8099, 1)

# Combine connectivity matrix and SSRI labels (rows are aligned on the index)
data = pd.concat([infant, variable['SSRI']], axis=1)
data_white = pd.concat([infant_white, variable['SSRI']], axis=1)

# Rename edge names
new_cols = dict(zip(data.columns[0:data.shape[1]-1], ['edge_' + str(x+1) for x in data.columns[0:data.shape[1]-1]]))
data.rename(columns= new_cols, inplace=True)
data_white.rename(columns= new_cols, inplace=True)

# Export data into csv files
data.to_csv('data.csv')
data_white.to_csv('data_white.csv')

# Data visualization
data

Unnamed: 0,edge_1,edge_2,edge_3,edge_4,edge_5,edge_6,edge_7,edge_8,edge_9,edge_10,...,edge_8092,edge_8093,edge_8094,edge_8095,edge_8096,edge_8097,edge_8098,edge_8099,edge_8100,SSRI
0,0,4795,6541,1942,2365,2044,7596,1181,803,1301,...,601,541,746,603,11937,1269,3694,2591,0,0
1,0,4265,5010,2839,1844,1857,6875,2165,1132,1640,...,748,407,650,300,15548,1767,3644,2852,0,0
2,0,4887,4858,1472,1349,1446,4551,1532,773,1207,...,273,306,450,243,12996,1953,3142,3085,0,0
3,0,4914,4679,2965,1668,2219,6755,2097,1071,1609,...,687,338,456,311,12487,2078,5100,4112,0,0
4,0,5257,3946,3984,1556,2123,5717,2714,1387,1301,...,548,192,726,274,18218,1632,3584,3483,0,0
5,0,5991,5225,3566,2944,3431,6900,1771,1501,2185,...,221,272,238,378,9325,1123,2631,2016,0,1
6,0,5227,5827,4043,1910,2246,4844,3703,891,1359,...,930,157,262,383,15861,1510,4662,3396,0,0
7,0,4925,6014,3686,2012,1884,5759,2222,1063,1057,...,782,417,820,520,15467,1695,5840,3143,0,0
8,0,4860,4128,3415,1462,2213,5276,2934,983,1133,...,950,437,474,1012,16144,1359,3706,3873,0,0
9,0,4683,4692,3075,1723,1557,6465,1979,1183,848,...,616,259,490,334,14113,1050,3577,3271,0,0


In [274]:
# Split training and testing data
from sklearn.model_selection import train_test_split
train_infant, test_infant = train_test_split(data, test_size = 0.2)
train_infant_white, test_infant_white = train_test_split(data_white, test_size = 0.2)
train_infant.to_csv('train_infant.csv', header=False, index=False)
test_infant.to_csv('test_infant.csv', header=False, index=False)
train_infant_white.to_csv('train_infant_white.csv', header=False, index=False)
test_infant_white.to_csv('test_infant_white.csv', header=False, index=False)

## Create Deep Model 
### (1) Define Base Feature Columns
We need to define the base categorical and continuous feature columns that the model will use. The only features we have are the edges, which are continuous, so no categorical columns need to be defined.

In [275]:
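# Continuous base columns.
# Note: range(data.shape[1]-2) covers edge_1 .. edge_8099, so the last edge
# column (edge_8100) is not passed to the model.
edge = []
for i in range(data.shape[1]-2):
    edge.append(tf.contrib.layers.real_valued_column(data.columns[i]))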

In [276]:
# Deep columns
deep_columns = []
for i in range(data.shape[1]-2):
    deep_columns.append(edge[i])
# Wide columns
wide_columns = []

### (2) Combining Wide and Deep Models into One
The wide and deep models are combined by summing their final output log odds to form the prediction, which is then fed to a logistic loss function. In TensorFlow, this is done by creating a DNNLinearCombinedClassifier.
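
To make the combination step concrete, here is a small NumPy sketch of the arithmetic only; it is not how the classifier is implemented internally. The function names and the two logit values are made up for illustration. Because `wide_columns` is empty in this notebook, the wide logit is set to 0 here as a simplifying assumption.

```python
import numpy as np


def combined_probability(wide_logit, deep_logit):
    """Sum the two log odds and map the result to a probability."""
    logit = wide_logit + deep_logit
    return 1.0 / (1.0 + np.exp(-logit))


def logistic_loss(label, prob, eps=1e-12):
    """Binary cross-entropy for a single example with label 0 or 1."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return -(label * np.log(prob) + (1 - label) * np.log(1.0 - prob))


wide_logit = 0.0   # hypothetical: no wide features, so no contribution
deep_logit = -2.0  # hypothetical output of the deep network

p = combined_probability(wide_logit, deep_logit)
loss_if_positive = logistic_loss(1, p)
loss_if_negative = logistic_loss(0, p)
print(p, loss_if_positive, loss_if_negative)
```

A negative combined logit gives a probability below 0.5, so the loss is small when the true label is 0 and large when it is 1.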

In [277]:
# Combine deep and wide columns
import tempfile
model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.DNNLinearCombinedClassifier(
    model_dir=model_dir,
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50],
    n_classes=2)

Instructions for updating:
Please set fix_global_step_increment_bug=True and update training steps in your pipeline. See pydoc for details.
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x135b4aa50>, '_model_dir': '/var/folders/z1/n4w0rk993zl8406dtsghskvr0000gn/T/tmpVq2Ns7', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': None, '_tf_random_seed': None, '_environment': 'local', '_num_worker_replicas': 0, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_evaluation_master': '', '_master': ''}


## Training and Evaluating the Model

In [283]:
#Import the data to the model
import pandas as pd
import urllib
COLUMNS = list(data)
LABEL_COLUMN = 'label'
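# Use the same edge columns that were passed to the model as feature columns.
# (Slicing data[0:n] selects rows, not columns, and would also include 'SSRI'.)
CONTINUOUS_COLUMNS = list(data.columns[0:data.shape[1]-2])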
CATEGORICAL_COLUMNS = []

train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()
urllib.urlretrieve("/Users/posnerlab/Desktop/BingHan/NYSPI_Tensorflow/train_infant.csv", train_file.name)
urllib.urlretrieve("/Users/posnerlab/Desktop/BingHan/NYSPI_Tensorflow/test_infant.csv", test_file.name)

df_train = pd.read_csv(train_file, names=COLUMNS)
df_test = pd.read_csv(test_file, names=COLUMNS)

# Add a label column
df_train[LABEL_COLUMN] = (df_train['SSRI'].apply(lambda x: 0 != x)).astype(int)
df_test[LABEL_COLUMN] = (df_test['SSRI'].apply(lambda x: 0 != x)).astype(int)

df_train

Unnamed: 0,edge_1,edge_2,edge_3,edge_4,edge_5,edge_6,edge_7,edge_8,edge_9,edge_10,...,edge_8093,edge_8094,edge_8095,edge_8096,edge_8097,edge_8098,edge_8099,edge_8100,SSRI,label
0,0,3078,3062,1884,1174,1518,2381,1838,838,705,...,264,305,261,7221,975,1948,2719,0,0,0
1,0,5227,5827,4043,1910,2246,4844,3703,891,1359,...,157,262,383,15861,1510,4662,3396,0,0,0
2,0,5656,5348,3368,1855,2020,5910,2357,1714,1417,...,554,679,536,14330,1247,3146,4041,0,0,0
3,0,5588,5163,4312,2439,2472,4364,3069,1222,1495,...,240,484,493,13683,2203,5009,3432,0,0,0
4,0,4190,6691,3871,2110,3525,8115,2051,1584,2261,...,331,419,686,18815,1584,2486,3045,0,0,0
5,0,3699,5822,3280,1882,1853,6272,2186,1283,2068,...,136,438,436,11484,2011,3212,5462,0,0,0
6,0,4914,4679,2965,1668,2219,6755,2097,1071,1609,...,338,456,311,12487,2078,5100,4112,0,0,0
7,0,3267,5359,2695,2254,1928,4649,3605,1498,1438,...,223,662,240,12913,1819,5983,3865,0,0,0
8,0,4460,5137,2938,2067,2062,5167,2086,985,901,...,200,589,342,16934,1820,4798,2754,0,0,0
9,0,4072,3695,2517,2008,1847,5717,2927,1126,1211,...,117,162,342,12040,2100,2029,2666,0,0,0


In [280]:
def input_fn(df):
    """Input builder function."""
    # Creates a dictionary mapping from each continuous feature column name (k) to
    # the values of that column stored in a constant Tensor.
    continuous_cols = {k: tf.constant(df[k].values, shape=[df[k].size, 1]) for k in CONTINUOUS_COLUMNS}
    # Creates a dictionary mapping from each categorical feature column name (k)
    # to the values of that column stored in a tf.SparseTensor.
    categorical_cols = {k: tf.SparseTensor(
            indices=[[i, 0] for i in range(df[k].size)],
            values=df[k].values,
            dense_shape=[df[k].size, 1])
                   for k in CATEGORICAL_COLUMNS}
    # Merges the two dictionaries into one (works in both Python 2 and 3).
    feature_cols = dict(continuous_cols)
    feature_cols.update(categorical_cols)
    # Converts the label column into a constant Tensor.
    label = tf.constant(df[LABEL_COLUMN].values)
    # Returns the feature columns and the label.
    return feature_cols, label

In [281]:
def train_input_fn():
    return input_fn(df_train)

def eval_input_fn():
    return input_fn(df_test)

In [282]:
m.fit(input_fn=train_input_fn, steps=200)
results = m.evaluate(input_fn=eval_input_fn, steps=1)
for key in sorted(results):
    print("%s: %s" % (key, results[key]))

Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/z1/n4w0rk993zl8406dtsghskvr0000gn/T/tmpVq2Ns7/model.ckpt.
INFO:tensorflow:loss = 725.396, step = 1
INFO:tensorflow:global_step/sec: 51.7315
INFO:tensorflow:loss = 0.212845, step = 101 (1.934 sec)
INFO:tensorflow:Saving checkpoints for 200 into /var/folders/z1/n4w0rk993zl8406dtsghskvr0000gn/T/tmpVq2Ns7/model.ckpt.
INFO:tensorflow:Loss for final step: 0.212025.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names bas

In [38]:
m.predict(input_fn=lambda: input_fn(df_test), as_iterable=False)

Instructions for updating:
The default behavior of predict() is changing. The default value for
as_iterable will change to True, and then the flag will be removed
altogether. The behavior of this flag is described below.
Instructions for updating:
Please switch to predict_classes, or set `outputs` argument.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
INFO:tensorflow:Restoring parameters from /var/folders/z1/n4w0rk993zl8406dtsghskvr0000gn/T/tmpFSmTG9/model.ckpt-200


array([], dtype=int64)

In [232]:
results

{'accuracy': 0.84210527,
 'accuracy/baseline_label_mean': 0.10526316,
 'accuracy/threshold_0.500000_mean': 0.84210527,
 'auc': 0.83823532,
 'auc_precision_recall': 0.21666694,
 'global_step': 200,
 'labels/actual_label_mean': 0.10526316,
 'labels/prediction_mean': 0.14715943,
 'loss': 4.8720641,
 'precision/positive_threshold_0.500000_mean': 0.33333334,
 'recall/positive_threshold_0.500000_mean': 0.5}

## Results of 12 Runs

{'accuracy': 0.89473683,
 'accuracy/baseline_label_mean': 0.10526316,
 'accuracy/threshold_0.500000_mean': 0.89473683,
 'auc': 0.52941191,
 'auc_precision_recall': 0.55555534,
 'global_step': 200,
 'labels/actual_label_mean': 0.10526316,
 'labels/prediction_mean': 0.20150132,
 'loss': 0.3643176,
 'precision/positive_threshold_0.500000_mean': 0.0,
 'recall/positive_threshold_0.500000_mean': 0.0}

{'accuracy': 0.94736844,
 'accuracy/baseline_label_mean': 0.052631579,
 'accuracy/threshold_0.500000_mean': 0.94736844,
 'auc': 0.47222269,
 'auc_precision_recall': 0.026316285,
 'global_step': 200,
 'labels/actual_label_mean': 0.052631579,
 'labels/prediction_mean': 0.0016874616,
 'loss': 1.7687163,
 'precision/positive_threshold_0.500000_mean': 0.0,
 'recall/positive_threshold_0.500000_mean': 0.0}

{'accuracy': 0.78947371,
 'accuracy/baseline_label_mean': 0.21052632,
 'accuracy/threshold_0.500000_mean': 0.78947371,
 'auc': 0.50000006,
 'auc_precision_recall': 0.60526305,
 'global_step': 200,
 'labels/actual_label_mean': 0.21052632,
 'labels/prediction_mean': 0.28551298,
 'loss': 0.52930146,
 'precision/positive_threshold_0.500000_mean': 0.0,
 'recall/positive_threshold_0.500000_mean': 0.0}
 
{'accuracy': 0.84210527,
 'accuracy/baseline_label_mean': 0.15789473,
 'accuracy/threshold_0.500000_mean': 0.84210527,
 'auc': 0.50000012,
 'auc_precision_recall': 0.57894719,
 'global_step': 200,
 'labels/actual_label_mean': 0.15789473,
 'labels/prediction_mean': 0.29757032,
 'loss': 0.48882499,
 'precision/positive_threshold_0.500000_mean': 0.0,
 'recall/positive_threshold_0.500000_mean': 0.0}
 
{'accuracy': 0.78947371,
 'accuracy/baseline_label_mean': 0.15789473,
 'accuracy/threshold_0.500000_mean': 0.78947371,
 'auc': 0.46875015,
 'auc_precision_recall': 0.078947857,
 'global_step': 200,
 'labels/actual_label_mean': 0.15789473,
 'labels/prediction_mean': 0.052631579,
 'loss': 22.450542,
 'precision/positive_threshold_0.500000_mean': 0.0,
 'recall/positive_threshold_0.500000_mean': 0.0}
 
{'accuracy': 0.68421054,
 'accuracy/baseline_label_mean': 0.2631579,
 'accuracy/threshold_0.500000_mean': 0.68421054,
 'auc': 0.46428579,
 'auc_precision_recall': 0.13157943,
 'global_step': 200,
 'labels/actual_label_mean': 0.2631579,
 'labels/prediction_mean': 0.029362384,
 'loss': 24.203154,
 'precision/positive_threshold_0.500000_mean': 0.0,
 'recall/positive_threshold_0.500000_mean': 0.0}
 
{'accuracy': 0.89473683,
 'accuracy/baseline_label_mean': 0.10526316,
 'accuracy/threshold_0.500000_mean': 0.89473683,
 'auc': 0.50000018,
 'auc_precision_recall': 0.55263138,
 'global_step': 200,
 'labels/actual_label_mean': 0.10526316,
 'labels/prediction_mean': 0.33566472,
 'loss': 0.48082879,
 'precision/positive_threshold_0.500000_mean': 0.0,
 'recall/positive_threshold_0.500000_mean': 0.0}
 
{'accuracy': 0.78947371,
 'accuracy/baseline_label_mean': 0.21052632,
 'accuracy/threshold_0.500000_mean': 0.78947371,
 'auc': 0.66666675,
 'auc_precision_recall': 0.64285702,
 'global_step': 200,
 'labels/actual_label_mean': 0.21052632,
 'labels/prediction_mean': 0.17840759,
 'loss': 0.44450483,
 'precision/positive_threshold_0.500000_mean': 0.0,
 'recall/positive_threshold_0.500000_mean': 0.0}
 
{'accuracy': 0.7368421,
 'accuracy/baseline_label_mean': 0.2631579,
 'accuracy/threshold_0.500000_mean': 0.7368421,
 'auc': 0.62142861,
 'auc_precision_recall': 0.5192982,
 'global_step': 200,
 'labels/actual_label_mean': 0.2631579,
 'labels/prediction_mean': 0.032619666,
 'loss': 1.0176443,
 'precision/positive_threshold_0.500000_mean': 0.0,
 'recall/positive_threshold_0.500000_mean': 0.0}
 
{'accuracy': 0.68421054,
 'accuracy/baseline_label_mean': 0.31578946,
 'accuracy/threshold_0.500000_mean': 0.68421054,
 'auc': 0.51923084,
 'auc_precision_recall': 0.4222393,
 'global_step': 200,
 'labels/actual_label_mean': 0.31578946,
 'labels/prediction_mean': 0.041446891,
 'loss': 1.2817221,
 'precision/positive_threshold_0.500000_mean': 0.0,
 'recall/positive_threshold_0.500000_mean': 0.0}

{'accuracy': 0.84210527,
 'accuracy/baseline_label_mean': 0.10526316,
 'accuracy/threshold_0.500000_mean': 0.84210527,
 'auc': 0.83823532,
 'auc_precision_recall': 0.21666694,
 'global_step': 200,
 'labels/actual_label_mean': 0.10526316,
 'labels/prediction_mean': 0.14715943,
 'loss': 4.8720641,
 'precision/positive_threshold_0.500000_mean': 0.33333334,
 'recall/positive_threshold_0.500000_mean': 0.5}
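
Accuracy can be hard to interpret when positive labels are rare. One way to read the numbers above is to compare each accuracy with what would be obtained by always predicting the majority class, which can be derived from 'labels/actual_label_mean'. The sketch below does this for three of the runs listed above, with values copied from them. The helper name `majority_baseline` and the shortened dictionary keys are illustrative choices, not part of the TensorFlow output.

```python
# Values copied from three of the result dictionaries above.
runs = [
    {"accuracy": 0.89473683, "label_mean": 0.10526316, "recall": 0.0},
    {"accuracy": 0.94736844, "label_mean": 0.052631579, "recall": 0.0},
    {"accuracy": 0.84210527, "label_mean": 0.10526316, "recall": 0.5},
]


def majority_baseline(label_mean):
    """Accuracy of always predicting the more frequent class."""
    return max(label_mean, 1.0 - label_mean)


for run in runs:
    run["baseline"] = majority_baseline(run["label_mean"])
    # Positive gain means the model beats the majority-class guess.
    run["gain"] = run["accuracy"] - run["baseline"]
    print("accuracy=%.4f baseline=%.4f gain=%+.4f recall=%.1f"
          % (run["accuracy"], run["baseline"], run["gain"], run["recall"]))
```

For the first two runs the gain is essentially zero, which is consistent with their reported precision and recall of 0. The third run has a recall of 0.5 but an accuracy below the majority-class baseline.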