## TFRecords creation
The goal of this notebook is to **convert the images preprocessed in the notebook** ***2_MRI_preprocessing***, **which were saved in NumPy's compressed format (.npz) in the folder 'Datasets/Image_files', to TFRecords.**

According to [TensorFlow](https://www.tensorflow.org/tutorials/load_data/tfrecord), the TFRecord format stores data as a sequence of binary records. The main advantage of using TFRecords is faster data reading, because the data is read sequentially from a few large files rather than from many small ones.

This notebook is structured as follows:
   - Initial set-up
   - Import libraries
   - Load features
   - Load labels
   - Create TFRecords

### Initial set-up

#### Google Colab

In [1]:
# Set to True when running on Google Colab (the code is then read from Google Drive)
google_colab = False

In [2]:
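import os

if google_colab:
    # Mount Google Drive and point to the project folder
    from google.colab import drive
    drive.mount('/content/drive')
    path = './drive/MyDrive/TFM/Code/'
else:
    # Point to the folder one level up
    path = '../'

# Make the selected folder the working directory
os.chdir(path)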

Mounted at /content/drive


### Import libraries

In [2]:
import os
import random
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import tqdm
import tensorflow as tf
import glob

### Load features

#### Load 3D images directories (features)

In [6]:
# Specify folder where there are the 3D images in numpy’s compressed format (.npz)
directory_images = './Datasets/Image_files/'

In [7]:
# Get list of filenames from 3D volumes
filenames = os.listdir(directory_images)

# Remove ".DS_Store" file from list
if '.DS_Store' in filenames:
    filenames.remove('.DS_Store')

# Include all the path for each file
filenames = [directory_images + file for file in filenames]

# Check number of images loaded
print('[+] Number of 3D images:', len(filenames))

[+] Number of 3D images: 1146
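
The listing above removes `.DS_Store` by name. As an alternative sketch, `glob` (imported earlier in this notebook) can select only the `.npz` files by pattern, which also skips any other stray file. The helper name `list_npz_files` and the demo file names are hypothetical.

```python
import glob
import os
import tempfile


def list_npz_files(directory):
    # Hypothetical helper: sorted paths of all .npz files in a folder
    return sorted(glob.glob(os.path.join(directory, '*.npz')))


# Demo on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['I2.npz', 'I1.npz', '.DS_Store']:
        open(os.path.join(tmp_dir, name), 'w').close()

    found = [os.path.basename(p) for p in list_npz_files(tmp_dir)]

print(found)
```

Sorting the result also makes the starting order independent of the order in which the operating system lists the folder.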


In [8]:
# Shuffle list of filenames
random.shuffle(filenames)
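
Note that `random.shuffle` is not seeded here, so the file order (and therefore the split that follows) can change between runs. A minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable; the helper `shuffled_copy`, the seed value and the file names are illustrative only.

```python
import random


def shuffled_copy(items, seed):
    # Hypothetical helper: shuffle a sorted copy of the list with a fixed seed
    rng = random.Random(seed)
    result = sorted(items)  # sort first so the input order does not matter
    rng.shuffle(result)
    return result


names = ['I3.npz', 'I1.npz', 'I2.npz', 'I4.npz']
first = shuffled_copy(names, seed=7)
second = shuffled_copy(list(reversed(names)), seed=7)

# Same seed and same set of names give the same order
print(first == second)
```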

#### Split dataset into training, validation and testing

In [None]:
train_size = 0.7
test_size = 0.3

In [None]:
# Hold out 30% of the files; the remaining 70% form the training set
train_filenames, test_filenames = train_test_split(filenames,
                                                   test_size = test_size,
                                                   random_state = 7)

In [None]:
# Split the held-out files in half: 15% for testing and 15% for validation
test_filenames, val_filenames = train_test_split(test_filenames,
                                                 test_size = 0.5,
                                                 random_state = 7)

In [None]:
# Check size of each dataset
print('[+] Training size:', len(train_filenames))
print('[+] Validation size:', len(val_filenames))
print('[+] Testing size:', len(test_filenames))

[+] Training size: 802
[+] Validation size: 172
[+] Testing size: 172
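
The sizes printed above can be checked by hand. The sketch below assumes that a fractional `test_size` is rounded up and that the remaining samples go to the other part, which matches the numbers reported in this notebook; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_fraction):
    # Held-out part is rounded up; the remainder forms the other part
    n_held_out = math.ceil(n_samples * test_fraction)
    return n_samples - n_held_out, n_held_out


# First split: 70% training, 30% held out
n_train, n_held_out = split_sizes(1146, 0.3)

# Second split: the held-out part is divided in half
n_test, n_val = split_sizes(n_held_out, 0.5)

print(n_train, n_val, n_test)
```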


#### Load array for each 3D image (features)

In [None]:
def return_volumes(filenames):
    '''
    Function used to load 3D images stored as compressed numpy files (.npz)
    Input: list with the paths of the .npz files
    Output: array with the 3D images, in the same order as the input list
    '''

    # List where to save the 3D images
    volumes = []

    # Load 3D images from filenames list
    for file in filenames:

        # Read 3D image ('arr_0' is the default key numpy gives to an unnamed array)
        volume = np.load(file, allow_pickle = True)['arr_0']

        # Append 3D image to volumes list 
        volumes.append(volume)
        
        if len(volumes) % 100 == 0:
            print('[+] Number of images loaded:', len(volumes))
    
    print('Total number of images loaded:', len(volumes))
        
    return np.array(volumes)
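
`return_volumes` reads the key `'arr_0'` because that is the name NumPy gives to an array passed positionally to `np.savez_compressed`. A minimal round trip, assuming the preprocessing notebook saved each volume that way; the file name and the random volume are placeholders.

```python
import os
import tempfile

import numpy as np

# Placeholder volume with the same shape as the images in this notebook
volume = np.random.rand(110, 130, 80).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.npz')

    # An array passed without a keyword is stored under the key 'arr_0'
    np.savez_compressed(path, volume)

    with np.load(path) as data:
        keys = list(data.files)
        restored = data['arr_0']

print(keys, restored.shape)
```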

In [None]:
# Load training images
train_dataset = return_volumes(train_filenames)

[+] Number of images loaded: 100
[+] Number of images loaded: 200
[+] Number of images loaded: 300
[+] Number of images loaded: 400
[+] Number of images loaded: 500
[+] Number of images loaded: 600
[+] Number of images loaded: 700
[+] Number of images loaded: 800
Total number of images loaded: 802


In [None]:
# Load validation images
val_dataset = return_volumes(val_filenames)

[+] Number of images loaded: 100
Total number of images loaded: 172


In [None]:
# Load testing images
test_dataset = return_volumes(test_filenames)

[+] Number of images loaded: 100
Total number of images loaded: 172


In [None]:
# Check size of each dataset
print('[+] Training shape:', train_dataset.shape)
print('[+] Validation shape:', val_dataset.shape)
print('[+] Testing shape:', test_dataset.shape)

[+] Training shape: (802, 110, 130, 80)
[+] Validation shape: (172, 110, 130, 80)
[+] Testing shape: (172, 110, 130, 80)


### Load labels

#### Load CSV files with image details: images IDs and class

In [10]:
# Load the individual CSV files with image details
df_1 = pd.read_csv('./Datasets/ADNI1_Complete_1Yr_1.5T.csv')
df_2 = pd.read_csv('./Datasets/ADNI1_Complete_2Yr_1.5T.csv')
df_3 = pd.read_csv('./Datasets/ADNI1_Complete_3Yr_1.5T.csv')

# Concatenate all CSV files into a single dataframe
df = pd.concat([df_1, df_2, df_3])

# Remove extra whitespaces from column names
df.columns = df.columns.str.replace(" ", "")

df.head()

| | ImageDataID | Subject | Group | Sex | Age | Visit | Modality | Description | Type | AcqDate | Format | Downloaded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I125941 | 137_S_1426 | MCI | M | 85 | 4 | MRI | MPR-R; GradWarp; N3; Scaled | Processed | 10/30/2008 | NiFTI | |
| 1 | I121703 | 128_S_1408 | MCI | M | 73 | 4 | MRI | MPR; GradWarp; B1 Correction; N3; Scaled | Processed | 9/19/2008 | NiFTI | |
| 2 | I121637 | 037_S_1421 | MCI | F | 76 | 4 | MRI | MPR; GradWarp; N3; Scaled | Processed | 9/17/2008 | NiFTI | |
| 3 | I122382 | 128_S_1407 | MCI | F | 76 | 4 | MRI | MPR; GradWarp; B1 Correction; N3; Scaled | Processed | 9/05/2008 | NiFTI | |
| 4 | I121689 | 127_S_1427 | MCI | F | 71 | 4 | MRI | MPR; GradWarp; B1 Correction; N3; Scaled | Processed | 9/02/2008 | NiFTI | |


In [None]:
# Retrieve only the image ID and Group (class) columns
df = df[['ImageDataID', 'Group']]
df.head()

| | ImageDataID | Group |
|---|---|---|
| 0 | I125941 | MCI |
| 1 | I121703 | MCI |
| 2 | I121637 | MCI |
| 3 | I122382 | MCI |
| 4 | I121689 | MCI |


#### Load class for each 3D volume (labels)

In [None]:
def return_labels(filenames, df):
    '''
    Function used to retrieve the label for each sample of a dataset
    Input: list with the paths of the images, dataframe with the columns 'ImageDataID' and 'Group'
    Output: array with labels, in the same order as the input list
    ''' 

    # List where to save the class for each image: 0 (CN), 1 (AD)
    labels = []

    # Load labels from filenames list
    for file in filenames:

        # Get image ID from file name
        image_id = file.split('/')[-1].split('.')[0]

        # Retrieve class from dataframe searching by the image ID
        label = df['Group'].loc[df['ImageDataID'] == image_id].values[0]

        # Assign a class numerical value depending on the group: 0 (CN) or 1 (AD)
        if label == 'CN':
            labels.append([0])
        elif label == 'AD':
            labels.append([1])
        else:
            # Stop here: skipping the sample would misalign labels and volumes
            raise ValueError(f'Unexpected group {label} for image ID {image_id}')
    
    return np.array(labels)
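
`return_labels` filters the dataframe once per file. The same idea can be sketched with a single lookup table: the image ID is the file name without its extension, and a dictionary maps each ID to its group. The helper names, the file paths and the two ID/group pairs below are made up for illustration; the real pairs come from the ADNI CSV files.

```python
import os

# Class encoding used in this notebook: 0 (CN), 1 (AD)
CLASS_INDEX = {'CN': 0, 'AD': 1}


def image_id_from_path(path):
    # './Datasets/Image_files/I0001.npz' -> 'I0001'
    return os.path.splitext(os.path.basename(path))[0]


def labels_from_lookup(filenames, group_by_id):
    # Hypothetical helper: group_by_id maps image ID -> 'CN' or 'AD'
    labels = []
    for path in filenames:
        group = group_by_id[image_id_from_path(path)]
        labels.append([CLASS_INDEX[group]])  # KeyError for any other group
    return labels


group_by_id = {'I0001': 'CN', 'I0002': 'AD'}  # made-up example data
paths = ['./Datasets/Image_files/I0002.npz',
         './Datasets/Image_files/I0001.npz']

print(labels_from_lookup(paths, group_by_id))
```

With the dataframe of this notebook, such a table could be built with `dict(zip(df['ImageDataID'], df['Group']))`. For an ID that appears in more than one CSV file this keeps the last entry, whereas `.values[0]` keeps the first.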

In [None]:
# Load training labels
train_labels = return_labels(train_filenames, df)

In [None]:
# Load validation labels
val_labels = return_labels(val_filenames, df)

In [None]:
# Load testing labels
test_labels = return_labels(test_filenames, df)

In [None]:
# Check size of each dataset
print(f'[+] Training labels shape:', train_labels.shape)
print(f'[+] Validation labels shape:', val_labels.shape)
print(f'[+] Testing labels shape:', test_labels.shape)

[+] Training labels shape: (802, 1)
[+] Validation labels shape: (172, 1)
[+] Testing labels shape: (172, 1)


### Create TFRecords
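The feature helpers below (`_bytes_feature`, `_float_feature` and `_int64_feature`) are defined in the TensorFlow [TFRecord tutorial](https://www.tensorflow.org/tutorials/load_data/tfrecord); the remaining functions build on them to write the volumes and labels to TFRecords.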

#### Define functions

In [None]:
def _bytes_feature(value):
    '''
    Returns a bytes_list from a string / byte.
    '''
    
    if isinstance(value, type(tf.constant(0))): # if value is tensor
        value = value.numpy() # get value of tensor
    
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    '''
    Returns a float_list from a float / double.
    '''
    
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    '''
    Returns an int64_list from a bool / enum / int / uint.
    '''
    
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_array(array):
    '''
    Returns the array serialized as a binary string tensor.
    '''
    
    array = tf.io.serialize_tensor(array)
    
    return array

def parse_single_volume(volume, label):
    
    # Get first value of label, as it is an array of length 1
    label = label[0]
    
    # Define the dictionary -- the structure -- of our single example
    data = {'height' : _int64_feature(volume.shape[0]),
            'width' : _int64_feature(volume.shape[1]),
            'depth' : _int64_feature(volume.shape[2]),
            'raw_image' : _bytes_feature(serialize_array(volume)),
            'label' : _int64_feature(label)}
    
    # Create an Example, wrapping the single features
    out = tf.train.Example(features = tf.train.Features(feature = data))

    return out

def write_images_to_tfr(volumes, labels, filename = 'images', max_files = 10, out_dir = './Datasets/TFRecords/'):

    # Determine the number of TFRecords needed
    splits = (len(volumes)//max_files) + 1 
    if len(volumes) % max_files == 0:
        splits-=1   
    
    print(f'[+] Number of TFRecords needed for {len(volumes)} volumes: {splits}')
    print(f'    [-] Number of files per TFRecord: {max_files}')
    
    # Create the output directory (and any missing parent folders) if needed
    os.makedirs(out_dir, exist_ok = True)
    print(f'\n[+] Output directory: {out_dir}\n')
    
    # Write TFRecords
    file_count = 0
    
    for i in tqdm.tqdm(range(splits)):
        
        # Retrieve name of the TFRecord
        tfr_name = '{}{}_{}.tfrecords'.format(out_dir, i+1, filename)
        print(f'[+] Writing TFRecord: {tfr_name}')

        # Start writer
        writer = tf.io.TFRecordWriter(tfr_name)
        current_tfr_count = 0
    
        while current_tfr_count < max_files: 
            
            # Get the index of the file that we want to parse now
            index = i * max_files + current_tfr_count
            
            # Check if all dataset has been added to TFRecords
            if index == len(volumes):
                break
                
            # Retrieve volume and label
            current_volume = volumes[index]
            current_label = labels[index]

            # Create the required example representation
            out = parse_single_volume(volume = current_volume, label = current_label)

            writer.write(out.SerializeToString())
            
            # Update counters
            current_tfr_count+=1
            file_count += 1
       
        # Close writer
        writer.close()
    
    print(f'Number of files wrote to TFRecords: {file_count}')
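
The number of TFRecords computed at the top of `write_images_to_tfr` is a ceiling division, and file `i` covers the indices from `i * max_files` up to (but not including) `(i + 1) * max_files`, clipped to the dataset size. A small sketch of that bookkeeping, with `shard_ranges` as a hypothetical helper; 802 and 172 are the dataset sizes of this notebook, with 30 volumes per file.

```python
def shard_ranges(n_items, max_files):
    # Ceiling division without floats: number of TFRecord files needed
    n_shards = -(-n_items // max_files)

    # (start, stop) index pair covered by each file
    return [(i * max_files, min((i + 1) * max_files, n_items))
            for i in range(n_shards)]


train_ranges = shard_ranges(802, 30)
val_ranges = shard_ranges(172, 30)

print(len(train_ranges), train_ranges[-1])
print(len(val_ranges), val_ranges[-1])
```

The last training file therefore holds 22 volumes rather than 30.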

#### Create training TFRecords

In [None]:
write_images_to_tfr(train_dataset, train_labels, 
                    max_files = 30, 
                    filename = 'train_volumes',
                    out_dir = './Datasets/TFRecords/Train/')

[+] Number of TFRecords needed for 802 volumes: 27
    [-] Number of files per TFRecord: 30

[+] Output directory: ../Datasets/TFRecords/Train/



  0%|          | 0/27 [00:00<?, ?it/s]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/1_train_volumes.tfrecords


  4%|▎         | 1/27 [00:01<00:40,  1.57s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/2_train_volumes.tfrecords


  7%|▋         | 2/27 [00:03<00:38,  1.54s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/3_train_volumes.tfrecords


 11%|█         | 3/27 [00:05<00:43,  1.82s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/4_train_volumes.tfrecords


 15%|█▍        | 4/27 [00:07<00:42,  1.86s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/5_train_volumes.tfrecords


 19%|█▊        | 5/27 [00:09<00:45,  2.07s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/6_train_volumes.tfrecords


 22%|██▏       | 6/27 [00:13<00:55,  2.65s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/7_train_volumes.tfrecords


 26%|██▌       | 7/27 [00:15<00:51,  2.57s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/8_train_volumes.tfrecords


 30%|██▉       | 8/27 [00:17<00:44,  2.32s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/9_train_volumes.tfrecords


 33%|███▎      | 9/27 [00:19<00:40,  2.24s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/10_train_volumes.tfrecords


 37%|███▋      | 10/27 [00:21<00:38,  2.25s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/11_train_volumes.tfrecords


 41%|████      | 11/27 [00:23<00:33,  2.08s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/12_train_volumes.tfrecords


 44%|████▍     | 12/27 [00:25<00:30,  2.06s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/13_train_volumes.tfrecords


 48%|████▊     | 13/27 [00:27<00:29,  2.08s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/14_train_volumes.tfrecords


 52%|█████▏    | 14/27 [00:30<00:30,  2.37s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/15_train_volumes.tfrecords


 56%|█████▌    | 15/27 [00:32<00:24,  2.07s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/16_train_volumes.tfrecords


 59%|█████▉    | 16/27 [00:33<00:20,  1.88s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/17_train_volumes.tfrecords


 63%|██████▎   | 17/27 [00:35<00:17,  1.78s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/18_train_volumes.tfrecords


 67%|██████▋   | 18/27 [00:36<00:15,  1.69s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/19_train_volumes.tfrecords


 70%|███████   | 19/27 [00:38<00:13,  1.68s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/20_train_volumes.tfrecords


 74%|███████▍  | 20/27 [00:39<00:11,  1.60s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/21_train_volumes.tfrecords


 78%|███████▊  | 21/27 [00:41<00:09,  1.65s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/22_train_volumes.tfrecords


 81%|████████▏ | 22/27 [00:42<00:07,  1.60s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/23_train_volumes.tfrecords


 85%|████████▌ | 23/27 [00:44<00:06,  1.60s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/24_train_volumes.tfrecords


 89%|████████▉ | 24/27 [00:46<00:04,  1.61s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/25_train_volumes.tfrecords


 93%|█████████▎| 25/27 [00:47<00:03,  1.60s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/26_train_volumes.tfrecords


 96%|█████████▋| 26/27 [00:49<00:01,  1.58s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Train/27_train_volumes.tfrecords


100%|██████████| 27/27 [00:50<00:00,  1.87s/it]

Number of files wrote to TFRecords: 802





#### Create validation TFRecords

In [None]:
write_images_to_tfr(val_dataset, val_labels, 
                    max_files = 30, 
                    filename = 'val_volumes',
                    out_dir = './Datasets/TFRecords/Validation/')

  0%|          | 0/6 [00:00<?, ?it/s]

[+] Number of TFRecords needed for 172 volumes: 6
    [-] Number of files per TFRecord: 30

[+] Output directory: ../Datasets/TFRecords/Validation/

[+] Writing TFRecord: ../Datasets/TFRecords/Validation/1_val_volumes.tfrecords


 17%|█▋        | 1/6 [00:01<00:06,  1.24s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Validation/2_val_volumes.tfrecords


 33%|███▎      | 2/6 [00:02<00:05,  1.40s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Validation/3_val_volumes.tfrecords


 50%|█████     | 3/6 [00:04<00:04,  1.65s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Validation/4_val_volumes.tfrecords


 67%|██████▋   | 4/6 [00:06<00:03,  1.61s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Validation/5_val_volumes.tfrecords


 83%|████████▎ | 5/6 [00:07<00:01,  1.55s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Validation/6_val_volumes.tfrecords


100%|██████████| 6/6 [00:08<00:00,  1.50s/it]

Number of files wrote to TFRecords: 172





#### Create testing TFRecords

In [None]:
write_images_to_tfr(test_dataset, test_labels, 
                    max_files = 30, 
                    filename = 'test_volumes',
                    out_dir = './Datasets/TFRecords/Test/')

  0%|          | 0/6 [00:00<?, ?it/s]

[+] Number of TFRecords needed for 172 volumes: 6
    [-] Number of files per TFRecord: 30

[+] Output directory: ../Datasets/TFRecords/Test/

[+] Writing TFRecord: ../Datasets/TFRecords/Test/1_test_volumes.tfrecords


 17%|█▋        | 1/6 [00:01<00:07,  1.47s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Test/2_test_volumes.tfrecords


 33%|███▎      | 2/6 [00:03<00:06,  1.52s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Test/3_test_volumes.tfrecords


 50%|█████     | 3/6 [00:04<00:04,  1.52s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Test/4_test_volumes.tfrecords


 67%|██████▋   | 4/6 [00:06<00:03,  1.66s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Test/5_test_volumes.tfrecords


 83%|████████▎ | 5/6 [00:08<00:01,  1.84s/it]

[+] Writing TFRecord: ../Datasets/TFRecords/Test/6_test_volumes.tfrecords


100%|██████████| 6/6 [00:09<00:00,  1.58s/it]

Number of files wrote to TFRecords: 172
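
To use the files written above, each record has to be parsed with the same feature names used in `parse_single_volume`. The following is a sketch of a reader and is not part of the original notebook: `parse_record` and `int64_feature` are hypothetical names, and the `float32` dtype is an assumption that must match the dtype of the arrays that were serialized. To stay self-contained, the sketch writes one small random volume to a temporary file and reads it back.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

VOLUME_DTYPE = tf.float32  # assumption: must match the serialized dtype


def parse_record(record):
    # Same keys as the dictionary built in parse_single_volume
    schema = {'height': tf.io.FixedLenFeature([], tf.int64),
              'width': tf.io.FixedLenFeature([], tf.int64),
              'depth': tf.io.FixedLenFeature([], tf.int64),
              'raw_image': tf.io.FixedLenFeature([], tf.string),
              'label': tf.io.FixedLenFeature([], tf.int64)}
    content = tf.io.parse_single_example(record, schema)

    # Decode the serialized tensor and restore its 3D shape
    volume = tf.io.parse_tensor(content['raw_image'], out_type=VOLUME_DTYPE)
    shape = tf.stack([content['height'], content['width'], content['depth']])
    return tf.reshape(volume, shape), content['label']


def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


# Build one example with the same structure as the records of this notebook
volume = np.random.rand(4, 5, 6).astype(np.float32)
raw = tf.io.serialize_tensor(volume).numpy()
example = tf.train.Example(features=tf.train.Features(feature={
    'height': int64_feature(volume.shape[0]),
    'width': int64_feature(volume.shape[1]),
    'depth': int64_feature(volume.shape[2]),
    'raw_image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[raw])),
    'label': int64_feature(1)}))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.tfrecords')
    with tf.io.TFRecordWriter(path) as writer:
        writer.write(example.SerializeToString())

    # Read the file back and parse its single record
    dataset = tf.data.TFRecordDataset([path]).map(parse_record)
    restored, label = next(iter(dataset))
    restored, label = restored.numpy(), int(label.numpy())

print(restored.shape, label)
```

For the real data, the list of files could come from `glob.glob('./Datasets/TFRecords/Train/*.tfrecords')`.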



