<a href="https://colab.research.google.com/github/ZKingQ/CS598-DLH-SP24/blob/main/DLH_Team_71_Tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS598 Deep Learning for Healthcare - Final Project - DeepMicro: Deep Representation Learning for Disease Prediction based on Microbiome Data

### Lotte Zhu, Kaiqing Zhang, Matthew Trueblood (Team ID: 71)
#### GitHub Link: https://github.com/ZKingQ/CS598-DLH-SP24

# Before you use this template

This template is just a recommended template for the project report. It only considers the general type of research in our paper pool. Feel free to edit it to better fit your project. You will iteratively update the same notebook submission for your draft and the final submission. Please check the project rubrics to get a sense of what is expected in the template.

---

# FAQ and Attentions
* Copy and move this template to your Google Drive. Name your notebook by your team ID (upper-left corner). Don't edit this original file.
* This template covers most questions we want to ask about your reproduction experiment. You don't need to exactly follow the template, however, you should address the questions. Please feel free to customize your report accordingly.
* Any report must have runnable code and the necessary annotations (in text and code comments).
* The notebook is like a demo and only uses small-size data (a subset of the original data or processed data). The entire runtime of the notebook, including data reading, data processing, model training, printing, figure plotting, etc., must be within 8 min; otherwise, you may be penalized on the grade.
  * If the raw dataset is too large to be loaded, you can select a subset of the data and pre-process it, then upload the subset or processed data to Google Drive and load it in this notebook.
  * If the whole training takes too long to run, you can set the number of training epochs to a small number, e.g., 3, just to show that the training is runnable.
  * For model validation results, you can train the model outside this notebook in advance, then load the pretrained model and use it for validation (display the figures, print the metrics).
* Post-processing is important! For post-processing of the results, please use plots/figures. The code to summarize results and plot figures may be tedious; however, it won't be a waste of time, since these figures can be used for the presentation. When plotting in code, the figures should have titles or captions if necessary (e.g., title your figure with "Figure 1. xxxx").
* There is no page limit for your notebook report. You can also use separate notebooks for the report; just make sure your grader can access and run/test them.
* If you use outside resources, please cite them (in any format). Include links to the resources if necessary.

# Mount Notebook to Google Drive
Upload the data, pretrained model, figures, etc. to your Google Drive, then mount this notebook to Google Drive. After that, you can access the resources freely.

Instruction: https://colab.research.google.com/notebooks/io.ipynb

Example: https://colab.research.google.com/drive/1srw_HFWQ2SMgmWIawucXfusGzrj1_U0q

Video: https://www.youtube.com/watch?v=zc8g8lGcwQU

Although we mount Google Drive to Colab here, we will later update the GitHub repository code to load the data directly from public cloud storage. This is to ensure the reproducibility of the project.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Introduction
<!-- This is an introduction to your report, you should edit this text/markdown section to compose. In this text/markdown, you should introduce:

*   Background of the problem
  * what type of problem: disease/readmission/mortality prediction,  feature engineering, data processing, etc
  * what is the importance/meaning of solving the problem
  * what is the difficulty of the problem
  * the state of the art methods and effectiveness.
*   Paper explanation
  * what did the paper propose
  * what is the innovations of the method
  * how well the proposed method work (in its own metrics)
  * what is the contribution to the research regime (referring the Background above, how important the paper is to the problem). -->


The expanding knowledge of the microbiota has uncovered its crucial role in human health [1]. It plays an important role in the immune system, metabolic functions, and even the carcinogenesis of certain cancers; hence, with emerging sequencing technologies, the microbiota can be used to predict various diseases [1,2,3,4,5,6]. However, three major challenges stand in the way of making such predictions in practice [7]. First, the low number of samples, together with the large number of features, leads to the curse of dimensionality. Second, there is a research gap in using strain-level profiles to classify samples into patient and healthy control groups across different diseases. Third, a rigorous validation framework is essential. Prior research has shown that tuning hyperparameters on the test set without a separate validation set may lead to an overestimation of model performance [8,9,10].

The DeepMicro paper proposes using autoencoders to learn low-dimensional representations from microbiota data and then predicting disease with a separate classification model trained on the learned representations, with thorough validation schemes for both training stages [7]. The authors hypothesize that these innovations contribute the following:

1. The appropriate autoencoders can effectively solve the curse of dimensionality.

2. They also reduce latency compared with alternative models without representation learning, while maintaining favorable training metrics.

In this project, our primary objective is to implement and evaluate various autoencoder architectures. We aim to validate the disease prediction capabilities described in the paper. Furthermore, we discuss ablations of the autoencoder component. We chose this paper because of our interest in deep representation learning.


# Scope of Reproducibility:

<!-- List hypotheses from the paper you will test and the corresponding experiments you will run.


1.   Hypothesis 1: xxxxxxx
2.   Hypothesis 2: xxxxxxx

You can insert images in this notebook text, [see this link](https://stackoverflow.com/questions/50670920/how-to-insert-an-inline-image-in-google-colaboratory-from-google-drive) and example below:

![sample_image.png](https://drive.google.com/uc?export=view&id=1g2efvsRJDxTxKz-OY3loMhihrEUdBxbc)


You can also use code to display images, see the code below.

The images must be saved in Google Drive first.

-->

Our team successfully replicated the model using its open-source codebase, obtaining metrics comparable to those reported in the DeepMicro paper. In doing so, we examined and addressed the following claims from the original paper.

1. **Dimensionality reduction engineering with traditional statistical techniques including Principal Component Analysis (PCA) and Gaussian Random Projection (GRP)**

- Principal Component Analysis (PCA) aims to capture the most significant patterns and variations in a dataset by identifying orthogonal axes, known as principal components, that maximize the data variance.
- Gaussian Random Projection (GRP) seeks to preserve pairwise distances between data points by projecting them onto a lower-dimensional space using random projections drawn from a Gaussian distribution.
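
To make the two baselines concrete, here is a minimal sketch using scikit-learn on synthetic data. The matrix `X_demo`, the 99% variance threshold, and the 20 GRP components are illustrative assumptions, not settings taken from the paper or from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

# Synthetic stand-in for a microbiome profile: 60 samples, 200 features (assumed sizes)
rng = np.random.default_rng(0)
X_demo = rng.random((60, 200))

# PCA: keep the leading components that together explain at least 99% of the variance
pca = PCA(n_components=0.99, svd_solver="full")
X_pca = pca.fit_transform(X_demo)

# GRP: project onto a fixed number of random Gaussian directions
grp = GaussianRandomProjection(n_components=20, random_state=0)
X_grp = grp.fit_transform(X_demo)

print(X_pca.shape, X_grp.shape)
```

PCA chooses its output dimension from the data, while GRP needs the target dimension up front, which is why the two calls are parameterized differently.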

2. **Representation learning with four different autoencoders: Shallow Autoencoder (SAE), Deep Autoencoder (DAE), Variational Autoencoder (VAE), and Convolutional Autoencoder (CAE)**

- Shallow Autoencoder (SAE): This is the simplest form of an autoencoder, consisting of a fully connected encoder layer and a decoder layer. The latent representation is obtained from the encoder layer, which is a lower-dimensional space compared to the original input.
- Deep Autoencoder (DAE): In addition to the encoder and decoder layers, DAE introduces hidden layers between the input and latent layers and between the latent and output layers. Rectified Linear Unit (ReLU) activation functions are used in the hidden layers.
- Variational Autoencoder (VAE): VAE learns probabilistic representations by approximating the true posterior distribution of latent embeddings. It assumes that the posterior distribution follows a Gaussian distribution. VAE uses an encoder network to encode the means and variances of the Gaussian distribution and samples the latent representation from this distribution. The decoder network then reconstructs the input based on the sampled latent representation.
- Convolutional Autoencoder (CAE): Instead of fully connected layers, CAE incorporates convolutional layers, where each unit is connected to local regions of the previous layer. Convolutional layers use filters (kernels) to perform convolution operations. CAE employs convolutional transpose layers (deconvolutional layers) to make the decoder symmetric to the encoder. No pooling layers are used in CAE.
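
The sampling step of the VAE can be sketched with the usual reparameterization trick. This is a NumPy illustration, not the notebook's Keras code: the function name `sample_latent` is hypothetical, and it assumes the encoder outputs the log-variance rather than the variance itself, which is a common convention but is not confirmed by the text above.

```python
import numpy as np

def sample_latent(z_mean, z_log_var, rng):
    """Reparameterization: z = mean + std * eps, with eps drawn from N(0, I)."""
    eps = rng.standard_normal(z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * eps

rng = np.random.default_rng(0)
z_mean = np.zeros((4, 2))            # 4 samples, 2 latent dimensions (assumed sizes)
z_log_var = np.full((4, 2), -50.0)   # near-zero variance, so z stays close to the mean
z = sample_latent(z_mean, z_log_var, rng)
print(z.shape)
```

Writing the sample as a deterministic function of the mean, the variance, and external noise is what allows gradients to flow through the encoder during training.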

3. **Classification learning including Support Vector Machine (SVM), Random Forest (RF), and Multi-Layer Perceptron (MLP)**
- Support Vector Machine (SVM) is a supervised learning algorithm that aims to find an optimal hyperplane to classify data by maximizing the margin between different classes.
- Random Forest (RF) is an ensemble learning method that constructs a multitude of decision trees and combines their predictions to make classifications.
- Multi-Layer Perceptron (MLP) is a type of neural network that consists of multiple layers of interconnected nodes, enabling it to learn non-linear relationships and perform classification tasks.


# Methodology

This methodology is the core of your project. It consists of runnable code with the necessary annotations to show the experiments you executed to test the hypotheses.

The methodology at least contains two subsections **data** and **model** in your experiment.

## Environment Set Up and Packages Import

In [None]:
# Python version
!python --version

Python 3.10.12


In [None]:
# set up tensorflow env
# !pip uninstall tensorflow
# !pip install tensorflow==2.12.0
# !pip install keras==2.12.0
!pip install scikeras

Collecting scikeras
  Downloading scikeras-0.13.0-py3-none-any.whl (26 kB)
Collecting keras>=3.2.0 (from scikeras)
  Downloading keras-3.2.1-py3-none-any.whl (1.1 MB)
Collecting scikit-learn>=1.4.2 (from scikeras)
  Downloading scikit_learn-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
Collecting namex (from keras>=3.2.0->scikeras)
  Downloading namex-0.0.7-py3-none-any.whl (5.8 kB)
Collecting optree (from keras>=3.2.0->scikeras)
  Downloading optree-0.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (311 kB)
Installing collected packages: namex, optree, scikit-learn, keras, scike

In [None]:
# import packages needed in this project
import numpy as np
import cv2
import pandas as pd
import os
import time
import json
import datetime
import math
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# importing keras
import keras
import keras.backend as K
from scikeras.wrappers import KerasClassifier
from keras.callbacks import EarlyStopping, ModelCheckpoint, LambdaCallback
from keras.models import Model, load_model, Sequential
from keras.layers import Dense, Dropout, Input, Lambda, Conv2D, Conv2DTranspose, MaxPool2D, UpSampling2D, Flatten, Reshape, Cropping2D
from keras import backend as K
from keras.losses import MeanSquaredError as mse, binary_crossentropy, MeanSquaredError, BinaryCrossentropy

from google.colab.patches import cv2_imshow

import tensorflow as tf

##  Data
Data includes raw data (MIMIC III tables), descriptive statistics (our homework questions), and data processing (feature engineering).
  * Source of the data: where the data is collected from; if data is synthetic or self-generated, explain how. If possible, please provide a link to the raw datasets.
  * Statistics: include basic descriptive statistics of the dataset like size, cross validation split, label distribution, etc.
  * Data process: how do you manipulate the data, e.g., change the class labels, split the dataset to train/valid/test, refining the dataset.
  * Illustration: printing results, plotting figures for illustration.
  * You can upload your raw dataset to Google Drive and mount this Colab to the same directory. If your raw dataset is too large, you can upload the processed dataset and have a code to load the processed dataset.

### Data Path

We downloaded two datasets, abundance and marker from the [DeepMicro codebase](https://github.com/minoh0201/DeepMicro/tree/master/data). They are stored in the following path on Google Drive:

In [None]:
# data dir
raw_data_dir = '/content/drive/My Drive/Colab Notebooks/data/'

### Data Description

Our reproduction uses the same datasets as the original paper, which cover six diseases (Table 1): inflammatory bowel disease (IBD), type 2 diabetes in European women (EW-T2D), type 2 diabetes in Chinese (C-T2D), obesity (Obesity), liver cirrhosis (Cirrhosis), and colorectal cancer (Colorectal).
![](assets/Data_Table1.png)

In each dataset, marker profile and abundance profile of microbiome are used to train our models (Table 2).
![](assets/Data_Table2.png)

All the data are stored in txt format, and the directory structure is as follows.
```
.
├── ...
├── data                              # Data folder
│   ├── marker                        # Marker profile data
│       ├── marker_IBD.txt            # Inflammatory bowel disease (IBD)
│       ├── marker_WT2D.txt           # Type 2 diabetes in European women (EW-T2D)
│       ├── marker_T2D.txt            # Type 2 diabetes in Chinese (C-T2D)
│       ├── marker_Obesity.txt        # Obesity (Obesity)
│       ├── marker_Cirrhosis.txt      # Liver cirrhosis (Cirrhosis)
│       ├── marker_Colorectal.txt     # Colorectal cancer (Colorectal)
│   ├── abundance                     # Abundance profile data
│       ├── abundance_IBD.txt         # Inflammatory bowel disease (IBD)
│       ├── abundance_WT2D.txt        # Type 2 diabetes in European women (EW-T2D)
│       ├── abundance_T2D.txt         # Type 2 diabetes in Chinese (C-T2D)
│       ├── abundance_Obesity.txt     # Obesity (Obesity)
│       ├── abundance_Cirrhosis.txt   # Liver cirrhosis (Cirrhosis)
│       ├── abundance_Colorectal.txt  # Colorectal cancer (Colorectal)
└── ...
```
Each txt file has a different number of features and data points, as shown in Table 1 and Table 2 above. In the sample demonstration below, the colorectal cancer abundance profile has 503 features (excluding the metadata columns) and 121 entries.



In [None]:
# Extract a sample data of colorectal cancer of microbiome abundance profile
sample = pd.read_csv(os.path.join(raw_data_dir, "abundance", "abundance_Colorectal.txt"), sep='\t', index_col=0, header=None)
sample = sample.T
sample.head(10)

Unnamed: 0,dataset_name,sampleID,subjectID,bodysite,disease,age,gender,country,sequencing_technology,pubmedid,...,k__Eukaryota|p__Ascomycota|c__Saccharomycetes|o__Saccharomycetales|f__Saccharomycetaceae|g__Eremothecium|s__Eremothecium_unclassified,k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lactobacillus|s__Lactobacillus_antri,k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillaceae|g__Lysinibacillus|s__Lysinibacillus_fusiformis,k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobacterium|s__Methanobacterium_unclassified,k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillaceae|g__Lysinibacillus|s__Lysinibacillus_boronitolerans,k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Enterococcaceae|g__Bavariicoccus|s__Bavariicoccus_seileri,k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Enterococcaceae|g__Enterococcus|s__Enterococcus_gilvus,k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lactobacillus|s__Lactobacillus_otakiensis,k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptococcaceae|g__Desulfotomaculum|s__Desulfotomaculum_ruminis,k__Bacteria|p__Firmicutes|c__Negativicutes|o__Selenomonadales|f__Veillonellaceae|g__Megasphaera|s__Megasphaera_sp_BV3C16_1
1,Zeller_fecal_colorectal_cancer,CCIS00146684ST-4-0,fr-726,stool,n,72,female,france,Illumina,25432777,...,0,0,0.0,0,0.0,0,0.0,0,0,0
2,Zeller_fecal_colorectal_cancer,CCIS00281083ST-3-0,fr-060,stool,n,53,male,france,Illumina,25432777,...,0,0,0.0,0,0.0,0,0.0,0,0,0
3,Zeller_fecal_colorectal_cancer,CCIS02124300ST-4-0,fr-568,stool,n,35,male,france,Illumina,25432777,...,0,0,0.0,0,0.0,0,0.0,0,0,0
4,Zeller_fecal_colorectal_cancer,CCIS02379307ST-4-0,fr-828,stool,cancer,67,male,france,Illumina,25432777,...,0,0,0.0,0,0.0,0,0.0,0,0,0
5,Zeller_fecal_colorectal_cancer,CCIS03473770ST-4-0,fr-192,stool,n,29,male,france,Illumina,25432777,...,0,0,0.0,0,0.0,0,0.03756,0,0,0
6,Zeller_fecal_colorectal_cancer,CCIS06260551ST-3-0,fr-200,stool,cancer,58,male,france,Illumina,25432777,...,0,0,0.31121,0,0.03562,0,0.0,0,0,0
7,Zeller_fecal_colorectal_cancer,CCIS07539127ST-4-0,fr-460,stool,n,77,female,france,Illumina,25432777,...,0,0,0.0,0,0.0,0,0.0,0,0,0
8,Zeller_fecal_colorectal_cancer,CCIS07648107ST-4-0,fr-053,stool,n,62,female,france,Illumina,25432777,...,0,0,0.0,0,0.0,0,0.0,0,0,0
9,Zeller_fecal_colorectal_cancer,CCIS08668806ST-3-0,fr-214,stool,small_adenoma,63,male,france,Illumina,25432777,...,0,0,0.0,0,0.0,0,0.0,0,0,0
10,Zeller_fecal_colorectal_cancer,CCIS09568613ST-4-0,fr-400,stool,n,67,male,france,Illumina,25432777,...,0,0,0.0,0,0.0,0,0.0,0,0,0


In [None]:
# Count the features excluding patients information
len([ _ for _ in sample.columns if '|' in _ ])

503

In [None]:
# Calculate the dimensions of the sample dataset
sample.shape

(121, 714)

In [None]:
# Generate a general description of the sample dataset
sample.describe().T

Unnamed: 0_level_0,count,unique,top,freq
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
dataset_name,121,1,Zeller_fecal_colorectal_cancer,121
sampleID,121,121,CCIS00146684ST-4-0,1
subjectID,121,121,fr-726,1
bodysite,121,1,stool,121
disease,121,3,cancer,48
...,...,...,...,...
k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Enterococcaceae|g__Bavariicoccus|s__Bavariicoccus_seileri,121,2,0,120
k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Enterococcaceae|g__Enterococcus|s__Enterococcus_gilvus,121,5,0,117
k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lactobacillus|s__Lactobacillus_otakiensis,121,2,0,120
k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptococcaceae|g__Desulfotomaculum|s__Desulfotomaculum_ruminis,121,2,0,120


### Data Process

According to the original paper, the data provided in the codebase have already been preprocessed and cleaned. We only need to select the feature columns, identified by a feature identifier in the column name, to obtain `X`, and take `Y` from the `disease` column. This step is implemented in the `loadData` function of the `DeepMicrobiome` class in the Model section below.
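
The extraction of `X` and `Y` can be illustrated on a toy table. This is only a sketch under stated assumptions: the frame `toy`, its values, and the mapping `toy_labels` are made up for illustration, and feature columns are picked with the same `'|'` heuristic used in the feature-count cell above, which may differ from what `loadData` actually does.

```python
import pandas as pd

# Toy table shaped like the transposed sample above (hypothetical values)
toy = pd.DataFrame({
    "sampleID": ["s1", "s2", "s3"],
    "disease": ["n", "cancer", "small_adenoma"],
    "k__Bacteria|p__Firmicutes": ["0.1", "0.0", "0.3"],
    "k__Archaea|p__Euryarchaeota": ["0.0", "0.2", "0.0"],
})

# Hypothetical subset of a disease-to-label mapping
toy_labels = {"n": 0, "cancer": 1, "small_adenoma": 0}

# Taxonomy-style columns contain '|'; metadata columns do not
feature_cols = [c for c in toy.columns if "|" in c]
X_toy = toy[feature_cols].astype("float64").to_numpy()
Y_toy = toy["disease"].map(toy_labels).to_numpy()

print(X_toy.shape, Y_toy.tolist())
```

The explicit cast matters because the raw files are read without a header row, so every value arrives as a string before the table is transposed.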




### Data Use

Because dimensionality reduction takes a long time to run, this draft uses only the first 10 features of each dataset to keep the total runtime under control.

##   Model
The model includes the model definition, which is usually a class, model training, and other necessary parts.
  * Model architecture: layer number/size/type, activation function, etc
  * Training objectives: loss function, optimizer, weight of each loss term, etc
  * Others: whether the model is pretrained, Monte Carlo simulation for uncertainty analysis, etc
  * The code of model should have classes of the model, functions of model training, model validation, etc.
  * If your model training is done outside of this notebook, please upload the trained model here and develop a function to load and test it.

### Model Architecture

An autoencoder is a neural network trained to reconstruct its input data, denoted as $x$. In its basic form, it comprises an encoder function $f_\phi(\cdot)$ and a decoder function $f'_\theta(\cdot)$, where $\phi$ and $\theta$ are the parameters of the encoder and decoder, respectively.
The training objective of an autoencoder is to minimize the discrepancy between the original input $x$ and its reconstruction $x'$. This discrepancy is typically quantified by a reconstruction loss such as the squared error, $L(x, x') = \|x - x'\|^2 = \|x - f'_\theta(f_\phi(x))\|^2$.
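
As a small numeric illustration of the squared-error reconstruction loss, the sketch below uses a purely linear encoder and decoder in NumPy. The weight matrices `W_enc` and `W_dec` and the sizes (6 input features, 2 latent units) are hypothetical and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(6)                      # one input vector with 6 features
W_enc = rng.standard_normal((6, 2))    # hypothetical encoder weights (6 -> 2)
W_dec = rng.standard_normal((2, 6))    # hypothetical decoder weights (2 -> 6)

z = x @ W_enc                          # latent representation produced by the encoder
x_rec = z @ W_dec                      # reconstruction produced by the decoder
loss = np.sum((x - x_rec) ** 2)        # squared error between input and reconstruction

print(z.shape, x_rec.shape, loss)
```

Training adjusts the encoder and decoder weights to drive this loss down over the training set; afterwards only `z` is kept as the low-dimensional representation.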

In this project, we focus on the utilization of a trained autoencoder to obtain a lower-dimensional latent representation $z = f_\phi(x)$ of the input. There are four autoencoders that we incorporate as below.

1. Shallow Autoencoder (SAE): a fully connected encoder connects the input layer to the latent layer, and a decoder produces the reconstructed input $x'$ as weighted sums of the latent layer's outputs. Both the latent and output layers use a linear activation function.

2. Deep Autoencoder (DAE): extends the SAE by inserting hidden layers with the Rectified Linear Unit (ReLU) activation function and the Glorot uniform initializer between the input and latent layers, with an equal number of hidden layers (either one or two) in the encoder and decoder.

3. Variational Autoencoder (VAE): learns probabilistic representations $z$ by approximating the true posterior distribution with $q_\phi(z|x)$, assumed to be Gaussian. The encoder outputs the means and variances of this Gaussian, from which the latent representation $z$ is sampled. The sampled representation is then fed into the decoder network to generate the reconstructed input $x' \sim g_\theta(x|z)$.

4. Convolutional Autoencoder (CAE): uses convolutional layers in which each unit is connected only to a local region of the previous layer. Each layer consists of multiple filters whose weights are used for the convolution operations. We employ ReLU activation and the Glorot uniform initializer, and avoid pooling layers to prevent excessive generalization. The $n$-dimensional input vector is reshaped into a square image of size $d \times d \times 1$, where $d = \lfloor \sqrt{n} \rfloor + 1$.
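
The reshaping step for the CAE can be sketched as follows. The helper name `to_square_image` is hypothetical, and filling the unused tail of the image with zeros is an assumption made for this illustration.

```python
import math
import numpy as np

def to_square_image(x):
    """Reshape an n-dimensional vector into a d x d x 1 image, d = floor(sqrt(n)) + 1."""
    n = x.shape[0]
    d = math.isqrt(n) + 1
    padded = np.zeros(d * d, dtype=x.dtype)
    padded[:n] = x                     # zero-pad the tail (assumed padding scheme)
    return padded.reshape(d, d, 1)

# 503 features, as in the colorectal abundance profile
img = to_square_image(np.arange(503, dtype=np.float32))
print(img.shape)  # (23, 23, 1)
```

Because $d$ is strictly greater than $\sqrt{n}$, the image always has at least $n$ pixels, so no feature is dropped.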

After obtaining the low-dimensional representations of the profiles, we build classification models.

1. Support Vector Machine (SVM): a grid search using SVM is conducted to explore the hyper-parameter space. We considered both radial basis function (RBF) and linear kernels, adjusting penalty parameter C and kernel coefficient gamma for RBF.

2. Random Forest (RF): We examine two criteria, Gini impurity and information gain, for selecting features to split a node in a decision tree. The maximum number of features considered for the best split at each node is set to either the square root or the base-2 logarithm of the number of features, which is what the `'sqrt'` and `'log2'` options of scikit-learn's `max_features` mean.

3. Multi-Layer Perceptron (MLP): ReLU activations are used for the hidden layers, while the output layer uses a sigmoid activation with a single unit. The number of units in each hidden layer is set to half that of the preceding layer, except for the first hidden layer.
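
The halving rule for the MLP hidden layers can be sketched as a small helper. The function name `hidden_layer_sizes` is hypothetical, and the use of integer division with a floor of one unit is an assumption for odd sizes.

```python
def hidden_layer_sizes(num_units, num_hidden_layers):
    """First hidden layer has num_units units; each later layer has half the previous one."""
    sizes = [num_units]
    for _ in range(num_hidden_layers - 1):
        sizes.append(max(1, sizes[-1] // 2))  # integer halving, never below 1 (assumption)
    return sizes

print(hidden_layer_sizes(100, 3))  # [100, 50, 25]
```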

The overall workflow is shown in Figure 1 below.

![](assets/Model_Figure1.png)

### Training Objectives
- Split each dataset into a training set, validation set, and test set (64% training, 16% validation, and 20% test).
- Exclude the test set from model training.
- Implement early-stopping strategy: train models on the training set, compute reconstruction loss for the validation set after each epoch, stop training if no improvement in validation loss is observed for 20 epochs.
- Select the model with the lowest validation loss as the best model.
- Utilize mean squared error as the reconstruction loss metric.
- Apply adaptive moment estimation (Adam) optimizer with default parameters (learning rate: 0.001, epsilon: 1e-07) as specified in the original paper.
- Utilize the encoder part of the best model to generate low-dimensional representations of microbiome data for subsequent disease prediction tasks.
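
The 64% / 16% / 20% partition can be obtained with two successive calls to `train_test_split`. This is a minimal sketch on synthetic data; the arrays `X_demo` and `y_demo`, the stratification, and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.random.default_rng(0).random((100, 10))  # 100 synthetic samples
y_demo = np.array([0, 1] * 50)                       # balanced binary labels

# Hold out 20% of the data as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# 20% of the remaining 80% becomes the validation set: 0.8 * 0.2 = 16% of the total
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, stratify=y_trainval, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 64 16 20
```

Only `X_train` and `X_val` are seen while the autoencoder is trained and early-stopped; `X_test` is touched once, at the final evaluation.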


### Evaluations

- Conduct 5-fold cross-validation on the reduced training set to tune hyper-parameters. The training set is divided into five subsets, and in each fold four subsets are used for training and one for validation. Every hyper-parameter combination in the grid is scored this way to find the best-performing configuration.
- Evaluate the performance of the models using the area under the receiver operating characteristics curve (AUC). This metric assesses the model's ability to distinguish between different classes and is commonly used for classification tasks.
- Train a final classification model using the entire training set and the best hyper-parameter combination identified during cross-validation. This model aims to achieve optimal performance based on the selected configuration. Then test the final classification model on the separate test set, which was not used during training. This evaluation provides an unbiased assessment of the model's performance on unseen data.
- Repeat the entire procedure five times, each time using a different random partition seed to create new training, validation, and test sets. This helps account for potential variations in performance due to the specific data splits. Average the resulting AUC scores obtained from the five repetitions. This average serves as a summary metric to compare the performance of different models or approaches. It provides a more robust assessment by considering multiple iterations of the evaluation process.
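
The evaluation loop above can be sketched end to end with scikit-learn. Everything here is illustrative: the data come from `make_classification`, the SVM grid is a hypothetical two-value grid rather than the grid used in this notebook, and no dimensionality reduction is applied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)
param_grid = [{"C": [0.25, 1.0], "kernel": ["linear"]}]  # hypothetical small grid

test_aucs = []
for seed in range(5):  # five different random partitions
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(SVC(), param_grid, scoring="roc_auc", cv=cv)
    search.fit(X_tr, y_tr)                   # refit=True retrains on the whole training set
    scores = search.decision_function(X_te)  # continuous scores for the AUC
    test_aucs.append(roc_auc_score(y_te, scores))

mean_auc = float(np.mean(test_aucs))
print(len(test_aucs), mean_auc)
```

The test split never enters `GridSearchCV`, so hyper-parameters are chosen from the training folds only and the reported AUC is not tuned on the test data.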

### Hyperparameters Config

The original implementation uses command-line arguments to run experiments with different configurations. To make this easier to use in a Colab notebook, we replaced them with a configuration object, which offers a more convenient and intuitive way to adjust settings. The config files are organized as follows.

```
.
├── ...
├── data                                      # Data folder
├── experiment_configs                        # Config folder
│       ├── vae_config.json                   # Config uses Variational Autoencoder (VAE)
│       ├── cae_config.json                   # Config uses Convolutional Autoencoder (CAE)
│       ├── ae_config.json                    # Config uses Shallow Autoencoder (SAE) or Deep Autoencoder (DAE)
│       ├── default_experiment_config.json    # Baseline configs
│       ├── test_experiment_config_1.json     # Test config of no autoencoder
└── ...
```
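
As a rough illustration of reading a nested JSON config with attribute access, here is a minimal sketch. The keys (`data`, `ae`, `dims`, `act`) are hypothetical and not taken from the actual config files, and the sketch uses `types.SimpleNamespace` rather than this notebook's own config class.

```python
import json
from types import SimpleNamespace

# Hypothetical config fragment; the real JSON files may use different keys
raw = '{"data": "abundance_Colorectal", "ae": {"dims": [64, 16], "act": "relu"}}'

# object_hook converts every decoded JSON object into a namespace, innermost first
cfg = json.loads(raw, object_hook=lambda d: SimpleNamespace(**d))

print(cfg.data, cfg.ae.dims, cfg.ae.act)
```

Unlike a plain dictionary, this lets experiment code write `cfg.ae.dims` instead of `cfg["ae"]["dims"]`, which reads closer to the original command-line argument names.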

In [None]:
# experiment config dir
experiments_config_dir = '/content/drive/My Drive/Colab Notebooks/experiment_configs'

# set up config class for loading json experiment configs
class DeepMicro_Config(object):
  """Expose a (possibly nested) JSON config as attributes; missing keys read as None."""
  def __init__(self, config_name=None, config_dict=None):
    # config_dict takes precedence; otherwise load <config_name>.json from the config dir
    if config_dict:
      self.load_from_dict(config_dict)
    elif config_name:
      self.load_from_file(os.path.join(experiments_config_dir, config_name + ".json"))

  def load_from_dict(self, dictionary):
    for key, value in dictionary.items():
      # wrap nested dicts so they support attribute access as well
      if isinstance(value, dict):
        value = DeepMicro_Config(config_dict=value)
      self.__dict__[key] = value

  def load_from_file(self, config_path):
    with open(config_path, 'r') as f:
      config_dict = json.load(f)
    self.load_from_dict(config_dict)

  def __getattr__(self, attr):
    # only called when normal lookup fails, so unset options default to None
    return self.__dict__.get(attr, None)

# dict for data_type in config
dtypeDict = {"float16": np.float16, "float32": np.float32, "float64": np.float64}

# set labels for diseases and controls
label_dict = {
  # Controls
  'n': 0,
  # Cirrhosis
  'cirrhosis': 1,
  # Colorectal Cancer
  'cancer': 1, 'small_adenoma': 0,
  # IBD
  'ibd_ulcerative_colitis': 1, 'ibd_crohn_disease': 1,
  # T2D and WT2D
  't2d': 1,
  # Obesity
  'leaness': 0, 'obesity': 1,
}

# hyper-parameter grids for classifiers
# TODO: set n_estimator range(100, 1001, 200), here is only for draft
# TODO: set min_samples_leaf range(1,6), here is only for draft
rf_hyper_parameters = [{'n_estimators': [s for s in range(500, 1000, 500)],
                        'max_features': ['sqrt', 'log2'],
                        'min_samples_leaf': [2], # [1, 2, 3, 4, 5],
                        'criterion': ['gini', 'entropy']
                        }, ]

#svm_hyper_parameters_pasolli = [{'C': [2 ** s for s in range(-5, 16, 2)], 'kernel': ['linear']},
#                        {'C': [2 ** s for s in range(-5, 16, 2)], 'gamma': [2 ** s for s in range(3, -15, -2)],
#                         'kernel': ['rbf']}]

# TODO: set C range(-5, 6, 2), here is only for draft
# TODO: set gamma range(3, -15, -2), here is only for draft
svm_hyper_parameters = [
    {
        'C': [2 ** s for s in [-2]],
        'kernel': ['linear']
        },
    {
        'C': [2 ** s for s in [-2]],
        'gamma': [2 ** s for s in range(1, -4, -2)],
        'kernel': ['rbf']
        }
    ]
# TODO: set numHiddenLayers [1, 2, 3], here is only for draft
# TODO: set epoch larger in final project, 30 is only for draft
# TODO: set numUnits larger in final project, 10 is only for draft
mlp_hyper_parameters = [{'numHiddenLayers': [2],
                         'epochs': [30], # [30, 50, 100, 200, 300],
                         'numUnits': [10], # [10, 30, 50, 100],
                         'dropout_rate': [0.1, 0.3],
                         },]
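
To see how small the draft grids are, the number of hyper-parameter combinations can be counted with scikit-learn's `ParameterGrid`. The grids below are copies of the draft RF and SVM settings above under hypothetical names (`draft_rf_grid`, `draft_svm_grid`) so that the snippet runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Copies of the draft grids defined above (names are hypothetical)
draft_rf_grid = [{"n_estimators": list(range(500, 1000, 500)),  # evaluates to [500]
                  "max_features": ["sqrt", "log2"],
                  "min_samples_leaf": [2],
                  "criterion": ["gini", "entropy"]}]
draft_svm_grid = [{"C": [2 ** -2], "kernel": ["linear"]},
                  {"C": [2 ** -2],
                   "gamma": [2 ** s for s in range(1, -4, -2)],  # exponents 1, -1, -3
                   "kernel": ["rbf"]}]

n_rf = len(ParameterGrid(draft_rf_grid))
n_svm = len(ParameterGrid(draft_svm_grid))
print(n_rf, n_svm)  # 4 4
```

With 5-fold cross-validation, each classifier is therefore fitted 20 times per grid search in the draft, plus one final refit, which keeps the notebook within its runtime budget.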


### Autoencoders Definitions



In [None]:
# Autoencoder
def autoencoder(dims, act='relu', init='glorot_uniform', latent_act = False, output_act = False):
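    """
    Fully connected, symmetric autoencoder.

    Arguments:
        dims: list with the number of units in each encoder layer. dims[0] is the input
            dimension and dims[-1] is the number of units in the latent (bottleneck) layer.
            The decoder mirrors the encoder, so the autoencoder has 2*len(dims)-1 layers.
        act: activation for the internal layers of the encoder and decoder.
        init: kernel initializer for all Dense layers.
        latent_act: if True, apply `act` to the latent layer; otherwise it is linear.
        output_act: if True, apply a sigmoid to the output layer; otherwise it is linear.
    Returns:
        (ae_model, encoder_model): the full autoencoder and the encoder alone.
    """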

    # whether put activation function in latent layer
    if latent_act:
        l_act = act
    else:
        l_act = None

    if output_act:
        o_act = 'sigmoid'
    else:
        o_act = None

    # The number of internal layers: layers between input and latent layer
    n_internal_layers = len(dims) - 2

    # input
    x = Input(shape=(dims[0],), name='input')
    h = x

    # internal layers in encoder
    for i in range(n_internal_layers):
        h = Dense(dims[i + 1], activation=act, kernel_initializer=init, name='encoder_%d' % i)(h)

    # bottle neck layer, features are extracted from here
    h = Dense(dims[-1], activation=l_act, kernel_initializer=init, name='encoder_%d_bottle-neck' % (n_internal_layers))(h)

    y = h

    # internal layers in decoder
    for i in range(n_internal_layers, 0, -1):
        y = Dense(dims[i], activation=act, kernel_initializer=init, name='decoder_%d' % i)(y)

    # output
    y = Dense(dims[0], activation=o_act, kernel_initializer=init, name='decoder_0')(y)

    return Model(inputs=x, outputs=y, name='AE'), Model(inputs=x, outputs=h, name='encoder')
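
To make the `dims` convention of `autoencoder` concrete, the sketch below lists the layer widths that a given `dims` produces. It is a minimal illustration in plain Python and does not build a Keras model; the helper name `symmetric_layer_widths` and the example sizes are assumptions made for this illustration, not part of the notebook.

```python
def symmetric_layer_widths(dims):
    """Return the widths of all layers: input, encoder, bottleneck, decoder, output."""
    dims = list(dims)
    # The decoder mirrors the encoder, excluding the bottleneck itself.
    return dims + list(reversed(dims[:-1]))


# Hypothetical example: 200 input features compressed to a 50-unit bottleneck.
widths = symmetric_layer_widths([200, 100, 50])
print(widths)       # layer widths from input to output
print(len(widths))  # equals 2 * len(dims) - 1
```

The count matches the docstring of `autoencoder`: with three entries in `dims`, the model has five layers when the input layer is included.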

In [None]:
# Convolutional autoencoder
def conv_autoencoder(dims, act='relu', init='glorot_uniform', latent_act = False, output_act = False, rf_rate = 0.1, st_rate = 0.25):
    # whether to apply the activation function to the latent layer
    if latent_act:
        l_act = act
    else:
        l_act = None

    if output_act:
        o_act = 'sigmoid'
    else:
        o_act = None

    # receptive field and stride size
    rf_size = int(dims[0][0] * rf_rate)
    if rf_size <= 0:
        print(f"Warning: Computed rf_size is {rf_size}, which is not valid. Setting to default value 3.")
        rf_size = 3  # Default fallback value
    # remember the first-layer sizes (after the fallback) for the final transposed convolution
    init_rf_size = rf_size
    stride_size = init_stride_size = max(int(rf_size * st_rate), 1)
    print("receptive field (kernel) size: %d" % rf_size)
    print("stride size: %d" % stride_size)

    # The number of internal layers: layers between input and latent layer
    n_internal_layers = len(dims) - 1

    if n_internal_layers < 1:
        print("The number of internal layers for CAE should be greater than or equal to 1")
        exit()

    # input
    x = Input(shape=dims[0], name='input')
    h = x

    rf_size_list = []
    stride_size_list = []
    # internal layers in encoder
    for i in range(n_internal_layers):
        print("rf_size: %d, st_size: %d" % (rf_size, stride_size))
        h = Conv2D(
            dims[i + 1], (rf_size, rf_size),
            strides=(stride_size, stride_size), activation=act,
            padding='same', kernel_initializer=init,
            name='encoder_conv_%d' % i)(h)
        #h = MaxPool2D((2,2), padding='same')(h)
        # kernel and stride for the next layer, clamped to at least 1
        rf_size = max(int(K.int_shape(h)[1] * rf_rate), 1)
        stride_size = max(rf_size // 2, 1)
        rf_size_list.append(rf_size)
        stride_size_list.append(stride_size)

    reshapeDim = K.int_shape(h)[1:]

    # bottle neck layer, features are extracted from h
    h = Flatten()(h)

    y = h

    y = Reshape(reshapeDim)(y)

    print(rf_size_list)
    print(stride_size_list)

    # internal layers in decoder
    for i in range(n_internal_layers - 1, 0, -1):
        y = Conv2DTranspose(dims[i], (rf_size_list[i-1],rf_size_list[i-1]), strides=(stride_size_list[i-1], stride_size_list[i-1]), activation=act, padding='same', kernel_initializer=init, name='decoder_conv_%d' % i)(y)
        #y = UpSampling2D((2,2))(y)

    y = Conv2DTranspose(1, (init_rf_size, init_rf_size), strides=(init_stride_size, init_stride_size), activation=o_act, kernel_initializer=init, padding='same', name='decoder_1')(y)

    # output cropping
    if K.int_shape(x)[1] != K.int_shape(y)[1]:
        cropping_size = K.int_shape(y)[1] - K.int_shape(x)[1]
        y = Cropping2D(cropping=((cropping_size, 0), (cropping_size, 0)), data_format=None)(y)

    #print("dims[0]: %s" % str(dims[0]))

    # output
    # y = Conv2D(1, (rf_size, rf_size), activation=o_act, kernel_initializer=init, padding='same', name='decoder_1')(y)
    #
    # outputDim = reshapeDim * (2 ** n_internal_layers)
    # if outputDim != dims[0][0]:
    #     cropping_size = outputDim - dims[0][0]
    #     #print(outputDim, dims[0][0], cropping_size)
    #     y = Cropping2D(cropping=((cropping_size, 0), (cropping_size, 0)), data_format=None)(y)


    return Model(inputs=x, outputs=y, name='CAE'), Model(inputs=x, outputs=h, name='encoder')
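
The first kernel and stride sizes in `conv_autoencoder` are derived from the input side length, `rf_rate`, and `st_rate`. The sketch below reproduces that arithmetic in plain Python, together with the output side length of a strided convolution with `padding='same'`. The function names and the example input sizes are assumptions made for this illustration.

```python
import math


def first_layer_kernel_and_stride(side, rf_rate=0.1, st_rate=0.25, fallback_rf=3):
    """Kernel (receptive field) and stride for the first conv layer."""
    rf_size = int(side * rf_rate)
    if rf_size <= 0:
        rf_size = fallback_rf  # same fallback value as conv_autoencoder
    stride = max(int(rf_size * st_rate), 1)  # a stride of 0 is not valid
    return rf_size, stride


def same_padding_output_side(side, stride):
    """Output side length of a strided convolution with padding='same'."""
    return math.ceil(side / stride)


# Hypothetical 64x64 input with rf_rate=0.25: kernel 16, stride 4, output side 16.
rf, st = first_layer_kernel_and_stride(64, rf_rate=0.25, st_rate=0.25)
print(rf, st, same_padding_output_side(64, st))

# A very small input triggers the fallback kernel size and a stride of 1.
print(first_layer_kernel_and_stride(5))
```

For small inputs, such as the draft setting that keeps only a few features, the computed kernel size truncates to zero, which is why the fallback matters.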


In [None]:
class VAE(Model):
    def __init__(self, dims, activation='relu', initializer='glorot_uniform', output_activation=False, recon_loss='mse', beta=1):
        super(VAE, self).__init__()
        self.beta = beta
        self.output_activation = 'sigmoid' if output_activation else None
        self.recon_loss_fn = MeanSquaredError() if recon_loss == 'mse' else BinaryCrossentropy()

        # Encoder
        self.encoder = self.build_encoder(dims, activation, initializer)
        # Decoder
        self.decoder = self.build_decoder(dims, activation, initializer, self.output_activation)

    def build_encoder(self, dims, activation, initializer):
        inputs = Input(shape=(dims[0],))
        x = inputs
        for size in dims[1:-1]:
            x = Dense(size, activation=activation, kernel_initializer=initializer)(x)
        z_mean = Dense(dims[-1], name='z_mean')(x)
        z_log_var = Dense(dims[-1], name='z_log_var')(x)
        z = Lambda(self.sampling, output_shape=(dims[-1],), name='z')([z_mean, z_log_var])
        return Model(inputs, [z_mean, z_log_var, z], name='encoder')

    def build_decoder(self, dims, activation, initializer, output_activation):
        latent_inputs = Input(shape=(dims[-1],))
        x = latent_inputs
        for size in reversed(dims[1:-1]):
            x = Dense(size, activation=activation, kernel_initializer=initializer)(x)
        outputs = Dense(dims[0], activation=output_activation)(x)
        return Model(latent_inputs, outputs, name='decoder')

    def sampling(self, args):
        z_mean, z_log_var = args
        batch = tf.shape(z_mean)[0]
        dim = z_mean.shape[1]
        epsilon = tf.random.normal(shape=(batch, dim))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

    def call(self, inputs):
        z_mean, z_log_var, z = self.encoder(inputs)
        reconstructed = self.decoder(z)
        # Add KL divergence regularization loss.
        kl_loss = 1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var)
        kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1)) * -0.5 * self.beta
        self.add_loss(kl_loss)
        # Add reconstruction loss
        reconstruction_loss = self.recon_loss_fn(inputs, reconstructed) * tf.cast(tf.shape(inputs)[1], tf.float32)
        self.add_loss(reconstruction_loss)
        return reconstructed

    def train_step(self, data):
        with tf.GradientTape() as tape:
            pred = self(data, training=True)
            loss = sum(self.losses)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss": loss}
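
The KL term in `VAE.call` measures how far the encoder's Gaussian is from a standard normal. As a sanity check on the formula, the sketch below computes the same quantity for a single sample in plain Python. The function name `kl_to_standard_normal` is an assumption made for this illustration; the class itself averages this value over the batch and scales it by `beta`.

```python
import math


def kl_to_standard_normal(z_mean, z_log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + log_var - mean * mean - math.exp(log_var)
        for mean, log_var in zip(z_mean, z_log_var)
    )


# A posterior that already equals the prior has zero divergence.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))

# Shifting one mean by 1 (unit variance) costs 0.5 nats.
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
```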

### Multi-Layer Perceptron

In [None]:
# create MLP model
def mlp_model(input_dim, numHiddenLayers=3, numUnits=64, dropout_rate=0.5):

    model = Sequential()

    #Check number of hidden layers
    if numHiddenLayers >= 1:
        # First Hidden layer
        model.add(Dense(numUnits, input_dim=input_dim, activation='relu'))
        model.add(Dropout(dropout_rate))

        # Second to the last hidden layers
        for i in range(numHiddenLayers - 1):
            numUnits = numUnits // 2
            model.add(Dense(numUnits, activation='relu'))
            model.add(Dropout(dropout_rate))

        # output layer
        model.add(Dense(1, activation='sigmoid'))

    else:
        # output layer
        model.add(Dense(1, input_dim=input_dim, activation='sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='adam')

    return model
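
`mlp_model` halves the number of units at every hidden layer after the first. The sketch below lists the resulting hidden-layer widths without building a Keras model; the helper name `mlp_hidden_widths` is an assumption made for this illustration.

```python
def mlp_hidden_widths(numUnits, numHiddenLayers):
    """Widths of the hidden Dense layers created by mlp_model."""
    widths = []
    for layer in range(numHiddenLayers):
        if layer > 0:
            numUnits = numUnits // 2  # each later hidden layer is half as wide
        widths.append(numUnits)
    return widths


print(mlp_hidden_widths(64, 3))  # default arguments of mlp_model
print(mlp_hidden_widths(10, 2))  # draft grid: numUnits=10, numHiddenLayers=2
print(mlp_hidden_widths(2, 3))   # integer halving can reach a width of 0
```

The last case suggests pairing small `numUnits` values with few hidden layers in the grid, since a zero-width Dense layer would likely be rejected by Keras.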

### Train Models

In [None]:
class DeepMicrobiome(object):
  # Core experiment class: data loading, representation learning, and classification
  def __init__(self, data, seed, data_dir):
    self.t_start = time.time()
    self.filename = str(data)
    self.data = self.filename.split('.')[0]
    self.seed = seed
    self.data_dir = data_dir
    self.prefix = ''
    self.representation_only = False

  def loadData(self, feature_string, label_string, label_dict, dtype=None):
    # read file
    filename = self.data_dir + self.filename
    if os.path.isfile(filename):
      raw = pd.read_csv(filename, sep='\t', index_col=0, header=None)
    else:
      print("FileNotFoundError: File {} does not exist".format(filename))
      exit()

    #DEBUG
    print("loaded data from " + str(filename))

    # select rows having feature index identifier string
    X = raw.loc[raw.index.str.contains(feature_string, regex=False)].T
    # TODO: the draft only uses the first 10 features; remove this line for the final project
    X = X.iloc[:, :10]

    # get class labels
    Y = raw.loc[label_string] #'disease'
    Y = Y.replace(label_dict)

    # train and test split
    self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(X.values.astype(dtype), Y.values.astype('int'), test_size=0.2, random_state=self.seed, stratify=Y.values)
    self.printDataShapes()

  def loadCustomData(self, dtype=None):
    # read file
    filename = self.data_dir + "data/" + self.filename
    if os.path.isfile(filename):
      raw = pd.read_csv(filename, sep=',', index_col=False, header=None)
    else:
      print("FileNotFoundError: File {} does not exist".format(filename))
      exit()

    # load data
    self.X_train = raw.values.astype(dtype)

    # put nothing or zeros for y_train, y_test, and X_test
    self.y_train = np.zeros(shape=(self.X_train.shape[0])).astype(dtype)
    self.X_test = np.zeros(shape=(1,self.X_train.shape[1])).astype(dtype)
    self.y_test = np.zeros(shape=(1,)).astype(dtype)
    self.printDataShapes(train_only=True)

  def loadCustomDataWithLabels(self, label_data, dtype=None):
    # read file
    filename = self.data_dir + "data/" + self.filename
    label_filename = self.data_dir + "data/" + label_data
    if os.path.isfile(filename) and os.path.isfile(label_filename):
      raw = pd.read_csv(filename, sep=',', index_col=False, header=None)
      label = pd.read_csv(label_filename, sep=',', index_col=False, header=None)
    else:
      if not os.path.isfile(filename):
        print("FileNotFoundError: File {} does not exist".format(filename))
      if not os.path.isfile(label_filename):
        print("FileNotFoundError: File {} does not exist".format(label_filename))
      exit()

    # label data validity check
    if not label.values.shape[1] > 1:
      label_flatten = label.values.reshape((label.values.shape[0]))
    else:
      print('FileSpecificationError: The label file contains more than 1 column.')
      exit()

    # train and test split
    self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(raw.values.astype(dtype),
                                                                            label_flatten.astype('int'), test_size=0.2,
                                                                            random_state=self.seed,
                                                                            stratify=label_flatten)
    self.printDataShapes()

  # Principal Component Analysis
  def pca(self, ratio=0.99):
    # manipulating an experiment identifier in the output file
    self.prefix = self.prefix + 'PCA_'

    # PCA
    pca = PCA()
    pca.fit(self.X_train)
    n_comp = 0
    ratio_sum = 0.0

    for comp in pca.explained_variance_ratio_:
      ratio_sum += comp
      n_comp += 1
      if ratio_sum >= ratio:  # Selecting components explaining 99% of variance
        break

    pca = PCA(n_components=n_comp)
    pca.fit(self.X_train)

    X_train = pca.transform(self.X_train)
    X_test = pca.transform(self.X_test)

    # applying the eigenvectors to the whole training and the test set.
    self.X_train = X_train
    self.X_test = X_test
    self.printDataShapes()

  # Gaussian Random Projection
  def rp(self):
    # manipulating an experiment identifier in the output file
    self.prefix = self.prefix + 'RandP_'
    # GRP
    rf = GaussianRandomProjection(eps=0.5)
    rf.fit(self.X_train)

    # applying GRP to the whole training and the test set.
    self.X_train = rf.transform(self.X_train)
    self.X_test = rf.transform(self.X_test)
    self.printDataShapes()

  # Shallow Autoencoder & Deep Autoencoder
  # Note: the default is already epochs=2000; run_exp passes config.common.max_epochs, so keep that value small for the draft
  def ae(self, dims=[50], epochs=2000, batch_size=100, verbose=2, loss='mean_squared_error',
           latent_act=False, output_act=False, act='relu', patience=20, val_rate=0.2, no_trn=False):
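        # Adjusting experiment identifier
        prefix = 'AE' if len(dims) == 1 else 'DAE'
        suffix = f"{loss[:1]}{'t' if latent_act else ''}{'T' if output_act else ''}_{act[:1]}"
        model_id = f"{prefix}_{suffix}_{str(dims).replace(', ', '-')}"
        # record the identifier so that result rows and loss plots are labeled with this model
        self.prefix += model_id + '_'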

        # File name for the model checkpoint
        modelName = f"{model_id}_{self.data}.keras"

        # Clean up model checkpoint before use
        if os.path.exists(modelName):
            os.remove(modelName)

        # Callbacks for early stopping and model checkpoint
        callbacks = [
            EarlyStopping(monitor='val_loss', patience=patience, mode='min', verbose=1),
            ModelCheckpoint(modelName, monitor='val_loss', mode='min', verbose=1, save_best_only=True)
        ]

        # Prepare data splits
        X_inner_train, X_inner_test, y_inner_train, y_inner_test = train_test_split(
            self.X_train, self.y_train, test_size=val_rate, random_state=self.seed, stratify=self.y_train
        )

        # Autoencoder model architecture
        input_layer = Input(shape=(X_inner_train.shape[1],))
        x = input_layer
        for dim in dims:
            x = Dense(dim, activation=act)(x)
        if latent_act:
            x = Dense(dims[-1], activation='tanh')(x)  # Example latent activation
        reconstructed = Dense(X_inner_train.shape[1], activation='sigmoid' if output_act else 'linear')(x)

        autoencoder = Model(inputs=input_layer, outputs=reconstructed)
        autoencoder.compile(optimizer='adam', loss=loss)
        autoencoder.summary()

        if no_trn:
            return

        # Model training
        self.history = autoencoder.fit(
            X_inner_train, X_inner_train,
            epochs=epochs,
            batch_size=batch_size,
            callbacks=callbacks,
            verbose=verbose,
            validation_data=(X_inner_test, X_inner_test)
        )

        # Load the best model
        autoencoder.load_weights(modelName)

        # Extract the encoder: every layer except the final reconstruction layer,
        # so the output is the last (bottleneck) Dense layer regardless of len(dims)
        encoder = Model(inputs=autoencoder.input, outputs=autoencoder.layers[-2].output)

        # Apply the learned encoder to the training and test sets
        self.X_train = encoder.predict(self.X_train)
        self.X_test = encoder.predict(self.X_test)

        self.saveLossProgress()

  # Variational Autoencoder
  # TODO: set epochs=2000 in final project, 20 is only for draft
  def vae(self, dims = [10], epochs=20, batch_size=100, verbose=2, loss='mse', output_act=False, act='relu', patience=25, beta=1.0, warmup=True, warmup_rate=0.01, val_rate=0.2, no_trn=False):

        # manipulating an experiment identifier in the output file
        if patience != 25:
            self.prefix += 'p' + str(patience) + '_'
        if warmup:
            self.prefix += 'w' + str(warmup_rate) + '_'
        self.prefix += 'VAE'
        if loss == 'binary_crossentropy':
            self.prefix += 'b'
        if output_act:
            self.prefix += 'T'
        if beta != 1:
            self.prefix += 'B' + str(beta)
        self.prefix += str(dims).replace(", ", "-") + '_'
        if act == 'sigmoid':
            self.prefix += 'sig_'

        # filename for temporary model checkpoint
        modelName = self.prefix + self.data + '.weights.h5'

        # clean up model checkpoint before use
        if os.path.isfile(modelName):
            os.remove(modelName)

        # callbacks for each epoch
        callbacks = [EarlyStopping(monitor='val_loss', patience=patience, mode='min', verbose=1),
                     ModelCheckpoint(modelName, monitor='val_loss', mode='min', verbose=1, save_best_only=True,save_weights_only=True)]

        # warm-up callback
        warm_up_cb = LambdaCallback(on_epoch_end=lambda epoch, logs: [warm_up(epoch)])  # , print(epoch), print(K.get_value(beta))])

        # warm-up implementation
        def warm_up(epoch):
            val = epoch * warmup_rate
            if val <= 1.0:
                K.set_value(beta, val)
        # add warm-up callback if requested
        if warmup:
            beta = K.variable(value=0.0)
            callbacks.append(warm_up_cb)

        # splitting the training set into the inner-train and the inner-test set (validation set)
        X_inner_train, X_inner_test, y_inner_train, y_inner_test = train_test_split(self.X_train, self.y_train,
                                                                                    test_size=val_rate,
                                                                                    random_state=self.seed,
                                                                                    stratify=self.y_train)

        # prepend the input dimension without mutating the caller's list (or the default argument)
        dims = [X_inner_train.shape[1]] + list(dims)

        # create vae model
        # self.vae, self.encoder, self.decoder = variational_AE(dims, act=act, recon_loss=loss, output_act=output_act, beta=beta)
        self.vae = VAE(dims, activation=act, recon_loss=loss, output_activation=output_act, beta=beta)
        self.vae.compile(optimizer='adam')
        self.encoder = self.vae.encoder
        self.decoder = self.vae.decoder
        self.vae.summary()
        self.encoder.summary()
        self.decoder.summary()

        if no_trn:
            return

        # fit
        self.history = self.vae.fit(X_inner_train, epochs=epochs, batch_size=batch_size, callbacks=callbacks, verbose=verbose, validation_data=(X_inner_test, None))

        # save loss progress
        self.saveLossProgress()

        # load best model
        self.vae.load_weights(modelName)
        # self.encoder = self.vae.layers[1]
        self.encoder = self.vae.encoder

        # applying the learned encoder into the whole training and the test set.
        _, _, self.X_train = self.encoder.predict(self.X_train)
        _, _, self.X_test = self.encoder.predict(self.X_test)


  # Convolutional Autoencoder
  # TODO: set epochs=2000 in final project, 20 is only for draft
  def cae(self, dims=[32], epochs=20, batch_size=100, verbose=2, loss='mse', output_act=False, act='relu', patience=25, val_rate=0.2, rf_rate=0.1, st_rate=0.25, no_trn=False):
      # Manipulating an experiment identifier in the output file
      self.prefix += 'CAE'
      if loss == 'binary_crossentropy':
          self.prefix += 'b'
      if output_act:
          self.prefix += 'T'
      self.prefix += str(dims).replace(", ", "-") + '_'
      if act == 'sigmoid':
          self.prefix += 'sig_'

      # Filename for temporary model checkpoint
      modelName = self.data_dir + self.prefix + self.data + '.weights.h5'

      # Clean up model checkpoint before use
      if os.path.isfile(modelName):
          os.remove(modelName)

      # Prepare the dataset for convolutional operations
      onesideDim = int(math.sqrt(self.X_train.shape[1])) + 1
      enlargedDim = onesideDim ** 2
      self.X_train = np.pad(self.X_train, ((0, 0), (0, enlargedDim - self.X_train.shape[1])), 'constant')
      self.X_test = np.pad(self.X_test, ((0, 0), (0, enlargedDim - self.X_test.shape[1])), 'constant')

      self.X_train = self.X_train.reshape((-1, onesideDim, onesideDim, 1))
      self.X_test = self.X_test.reshape((-1, onesideDim, onesideDim, 1))

      self.printDataShapes()

      # Callbacks for early stopping and model checkpoint
      callbacks = [
          EarlyStopping(monitor='val_loss', patience=patience, mode='min', verbose=1),
          ModelCheckpoint(modelName, monitor='val_loss', mode='min', verbose=1, save_best_only=True, save_weights_only=True)
      ]

      # Model architecture using functional API
      input_shape = (onesideDim, onesideDim, 1)
      inputs = Input(shape=input_shape)
      x = inputs
      for dim in dims:
          x = Conv2D(dim, (3, 3), activation=act, padding='same')(x)
          x = Conv2D(dim, (3, 3), activation=act, padding='same', strides=(2, 2))(x)  # downsampling
      encoded = Flatten()(x)
      x = Dense(encoded.shape[1], activation=act)(encoded)  # bottleneck layer
      x = Reshape((onesideDim // 2 ** len(dims), onesideDim // 2 ** len(dims), dims[-1]))(x)
      for dim in reversed(dims):
          x = Conv2DTranspose(dim, (3, 3), activation=act, padding='same', strides=(2, 2))(x)  # upsampling
      decoded = Conv2DTranspose(1, (3, 3), activation='sigmoid' if output_act else 'linear', padding='same')(x)

      autoencoder = Model(inputs, decoded)
      autoencoder.compile(optimizer='adam', loss=loss)
      autoencoder.summary()

      if no_trn:
          return

      # Split the training set into the inner-train and the inner-test set (validation set)
      X_inner_train, X_inner_test, _, _ = train_test_split(
          self.X_train, self.y_train, test_size=val_rate, random_state=self.seed, stratify=self.y_train
      )

      # Fit model
      self.history = autoencoder.fit(
          X_inner_train, X_inner_train,
          epochs=epochs,
          batch_size=batch_size,
          callbacks=callbacks,
          verbose=verbose,
          validation_data=(X_inner_test, X_inner_test)
      )

      # Save loss progress
      self.saveLossProgress()

      # Load best model weights
      autoencoder.load_weights(modelName)

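      # Reconstruct the encoder from the autoencoder: its output is the Dense bottleneck layer,
      # which is the only Dense layer in this model (a fixed layer index breaks when len(dims) > 1)
      bottleneck_layer = next(layer for layer in autoencoder.layers if isinstance(layer, Dense))
      encoder = Model(inputs=autoencoder.input, outputs=bottleneck_layer.output)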

      # Apply the learned encoder to the whole training and test set
      self.X_train = encoder.predict(self.X_train)
      self.X_test = encoder.predict(self.X_test)
      self.printDataShapes()


  # Classification
  def classification(self, hyper_parameters, method='svm', cv=5, scoring='roc_auc', n_jobs=1, cache_size=10000):
    clf_start_time = time.time()

    print("# Tuning hyper-parameters")
    print(self.X_train.shape, self.y_train.shape)

    # Support Vector Machine
    if method == 'svm':
      clf = GridSearchCV(SVC(probability=True, cache_size=cache_size), hyper_parameters, cv=StratifiedKFold(cv, shuffle=True), scoring=scoring, n_jobs=n_jobs, verbose=100, )
      clf.fit(self.X_train, self.y_train)

    # Random Forest
    if method == 'rf':
      clf = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=0), hyper_parameters, cv=StratifiedKFold(cv, shuffle=True), scoring=scoring, n_jobs=n_jobs, verbose=100)
      clf.fit(self.X_train, self.y_train)

    # Multi-layer Perceptron
    if method == 'mlp':
      # wrap the Keras MLP so that GridSearchCV can tune its hyper-parameters
      model = KerasClassifier(build_fn=mlp_model, input_dim=self.X_train.shape[1], verbose=0, dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10)
      clf = GridSearchCV(estimator=model, param_grid=hyper_parameters, cv=StratifiedKFold(cv, shuffle=True), scoring=scoring, n_jobs=n_jobs, verbose=100)
      clf.fit(self.X_train, self.y_train, batch_size=32)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)

    # Evaluate performance of the best model on test set
    y_true, y_pred = self.y_test, clf.predict(self.X_test)
    y_prob = clf.predict_proba(self.X_test)

    # Performance Metrics: AUC, ACC, Recall, Precision, F1_score
    metrics = [ round(roc_auc_score(y_true, y_prob[:, 1]), 4),
                round(accuracy_score(y_true, y_pred), 4),
                round(recall_score(y_true, y_pred), 4),
                round(precision_score(y_true, y_pred), 4),
                round(f1_score(y_true, y_pred), 4), ]

    # time stamp
    metrics.append(str(datetime.datetime.now()))

    # running time
    metrics.append(round( (time.time() - self.t_start), 2))

    # classification time
    metrics.append(round( (time.time() - clf_start_time), 2))

    # best hyper-parameter append
    metrics.append(str(clf.best_params_))

    # Write performance metrics to a file (one appended row per run)
    os.makedirs(os.path.join(self.data_dir, "results"), exist_ok=True)
    res = pd.DataFrame([metrics], index=[self.prefix + method])
    with open(self.data_dir + "results/" + self.data + "_result.txt", 'a') as f:
      res.to_csv(f, header=None)

    print('Accuracy metrics')
    print('AUC, ACC, Recall, Precision, F1_score, time-end, runtime(sec), classification time(sec), best hyper-parameter')
    print(metrics)

  # Print debug info on data shape
  def printDataShapes(self, train_only=False):
      print("X_train.shape: ", self.X_train.shape)
      if not train_only:
        print("y_train.shape: ", self.y_train.shape)
        print("X_test.shape: ", self.X_test.shape)
        print("y_test.shape: ", self.y_test.shape)

  # plotting loss progress over epochs
  def saveLossProgress(self):
    #print(self.history.history.keys())
    #print(type(self.history.history['loss']))
    #print(min(self.history.history['loss']))

    loss_collector, loss_max_atTheEnd = self.saveLossProgress_ylim()

    # create saving path
    if not os.path.exists(os.path.join(self.data_dir, 'results')):
      os.mkdir(os.path.join(self.data_dir, 'results'))

    # save loss progress - train and val loss only
    figureName = self.prefix + self.data + '_' + str(self.seed)
    plt.ylim(min(loss_collector)*0.9, loss_max_atTheEnd * 2.0)
    plt.plot(self.history.history['loss'])
    plt.plot(self.history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train loss', 'val loss'],
                loc='upper right')
    plt.savefig(self.data_dir + "results/" + figureName + '.png')
    plt.close()

    if 'recon_loss' in self.history.history:
        figureName = self.prefix + self.data + '_' + str(self.seed) + '_detailed'
        plt.ylim(min(loss_collector) * 0.9, loss_max_atTheEnd * 2.0)
        plt.plot(self.history.history['loss'])
        plt.plot(self.history.history['val_loss'])
        plt.plot(self.history.history['recon_loss'])
        plt.plot(self.history.history['val_recon_loss'])
        plt.plot(self.history.history['kl_loss'])
        plt.plot(self.history.history['val_kl_loss'])
        plt.title('model loss')
        plt.ylabel('loss')
        plt.xlabel('epoch')
        plt.legend(['train loss', 'val loss', 'recon_loss', 'val recon_loss', 'kl_loss', 'val kl_loss'], loc='upper right')
        plt.savefig(self.data_dir + "results/" + figureName + '.png')
        plt.close()


  # supporting loss plot
  def saveLossProgress_ylim(self):
    loss_collector = []
    loss_max_atTheEnd = 0.0
    for hist in self.history.history:
      current = self.history.history[hist]
      loss_collector += current
      if current[-1] >= loss_max_atTheEnd:
        loss_max_atTheEnd = current[-1]
    return loss_collector, loss_max_atTheEnd
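
The `pca` method of `DeepMicrobiome` keeps the smallest number of leading components whose cumulative explained-variance ratio reaches `ratio`. The sketch below isolates that selection rule in plain Python; the function name `components_for_variance` and the example ratios are assumptions made for this illustration.

```python
def components_for_variance(explained_variance_ratio, ratio=0.99):
    """Smallest number of leading components whose cumulative ratio reaches `ratio`."""
    ratio_sum = 0.0
    for n_comp, comp in enumerate(explained_variance_ratio, start=1):
        ratio_sum += comp
        if ratio_sum >= ratio:
            return n_comp
    # Fall back to all components if the target is never reached.
    return len(explained_variance_ratio)


# Hypothetical explained-variance ratios, sorted in decreasing order as PCA reports them.
ratios = [0.5, 0.25, 0.125, 0.125]
print(components_for_variance(ratios, ratio=0.75))
print(components_for_variance(ratios, ratio=0.99))
```

As far as we know, scikit-learn's `PCA` can perform a similar selection in a single fit when `n_components` is a float between 0 and 1, although its threshold rule may differ slightly at exact equality.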


### Run Experiments

In [None]:
# main function for running an experiment
def run_exp_from_config(config):
  try:
    if config.exp_design.repeat > 1:
      for i in range(config.exp_design.repeat):
        run_exp(i, config)
    else:
      run_exp(config.exp_design.seed, config)
  except OSError as error:
    print(error)

def run_exp(seed, config):


  # create an object and load data
  ## no input file specified
  if config.load_data.data is None and config.load_data.custom_data is None:
    print("[Error] Please specify an input file in config.load_data.")
    exit()
  ## provided data
  elif config.load_data.data is not None:
    dm = DeepMicrobiome(data=config.load_data.data + '.txt', seed=seed, data_dir=config.load_data.data_dir)

    ## specify feature string
    feature_string = ''
    data_string = str(config.load_data.data)
    if data_string.split('_')[0] == 'abundance':
      feature_string = "k__"
    if data_string.split('_')[0] == 'marker':
      feature_string = "gi|"

    ## load data into the object
    dm.loadData(feature_string=feature_string, label_string='disease', label_dict=label_dict,
                dtype=dtypeDict[config.load_data.dataType])

  ## user data
  elif config.load_data.custom_data is not None:
    # Custom data loading is not ported to this notebook; the original logic is kept below for reference.
    """
    ### without labels - only conducting representation learning
    if args.custom_data_labels == None:
        dm = DeepMicrobiome(data=args.custom_data, seed=seed, data_dir=args.data_dir)
        dm.loadCustomData(dtype=dtypeDict[args.dataType])

    ### with labels - conducting representation learning + classification
    else:
        dm = DeepMicrobiome(data=args.custom_data, seed=seed, data_dir=args.data_dir)
        dm.loadCustomDataWithLabels(label_data=args.custom_data_labels, dtype=dtypeDict[args.dataType])
    """
    raise NotImplementedError("Custom data is currently unsupported in this notebook.")
  else:
    exit()

  numRLrequired = config.rl.pca + config.rl.ae + config.rl.rp + config.rl.vae + config.rl.cae

  if numRLrequired > 1:
    raise ValueError('Only one representation learning (dimensionality reduction) method can be selected per run.')

  # time check after data has been loaded
  dm.t_start = time.time()

  # Representation learning (Dimensionality reduction)
  if config.rl.pca:
    dm.pca()
  if config.rl.ae:
    dm.ae(dims=[int(i) for i in config.common.dims.split(',')], act=config.common.act, epochs=config.common.max_epochs, loss=config.common.aeloss,
          latent_act=config.AE.ae_lact, output_act=config.common.ae_oact, patience=config.common.patience, no_trn=config.others.no_trn)
  if config.rl.vae:
    dm.vae(dims=[int(i) for i in config.common.dims.split(',')], act=config.common.act, epochs=config.common.max_epochs, loss=config.common.aeloss, output_act=config.common.ae_oact,
           patience= 25 if config.common.patience==20 else config.common.patience, beta=config.VAE.vae_beta, warmup=config.VAE.vae_warmup, warmup_rate=config.VAE.vae_warmup_rate, no_trn=config.others.no_trn)
  if config.rl.cae:
    dm.cae(dims=[int(i) for i in config.common.dims.split(',')], act=config.common.act, epochs=config.common.max_epochs, loss=config.common.aeloss, output_act=config.common.ae_oact,
           patience=config.common.patience, rf_rate = config.CAE.rf_rate, st_rate = config.CAE.st_rate, no_trn=config.others.no_trn)
  if config.rl.rp:
    dm.rp()

  # create saving path (for every run, not only when random projection is selected)
  os.makedirs(os.path.join(dm.data_dir, 'results'), exist_ok=True)

  # write the learned representation of the training set as a file
  if config.rl.save_rep:
    if numRLrequired == 1:
      rep_file = dm.data_dir + "results/" + dm.prefix + dm.data + "_rep.csv"
      pd.DataFrame(dm.X_train).to_csv(rep_file, header=None, index=None)
      print("The learned representation of the training set has been saved in '{}'".format(rep_file))
    else:
      print("Warning: Command option '--save_rep' is not applied as no representation learning or dimensionality reduction has been conducted.")

  # Classification
  if config.others.no_clf or (config.load_data.data is None and config.load_data.custom_data_labels is None):
    print("Classification task has been skipped.")
  else:
    # turn off GPU
    #
    # NOTE FROM MATTHEW:  Can we port this over to pytorch?
    #
    #os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
    #importlib.reload(keras)

    # training classification models
    hyper_parameters_by_method = {'svm': svm_hyper_parameters, 'rf': rf_hyper_parameters, 'mlp': mlp_hyper_parameters}
    methods_by_option = {'svm': ['svm'], 'rf': ['rf'], 'mlp': ['mlp'], 'svm_rf': ['svm', 'rf']}
    # any other option value runs all three classifiers, as before
    methods = methods_by_option.get(config.classification.method, ['svm', 'rf', 'mlp'])
    for method in methods:
      # only the SVM call takes a kernel cache size
      extra_args = {'cache_size': config.classification.svm_cache} if method == 'svm' else {}
      dm.classification(hyper_parameters=hyper_parameters_by_method[method], method=method, cv=config.classification.numFolds,
                        n_jobs=config.classification.numJobs, scoring=config.classification.scoring, **extra_args)


# Results
In this section, you should finish training your model or loading your trained model. That is a great experiment! You should share the results with others, along with the necessary metrics and figures.

Please test and report results for all experiments that you run with:

*   specific numbers (accuracy, AUC, RMSE, etc)
*   figures (loss shrinkage, outputs from GAN, annotation or label of sample pictures, etc)



---


**Current Progress:**

Successfully ran the model with the AE, VAE, PCA, and CAE representations and obtained AUC results.

**To-Do:**

Show figures for each method (by Apr 21).
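
One possible starting point for the loss figures is to parse the per-epoch summary lines that Keras prints during training. The sketch below assumes the log format shown in this notebook's output (`... - loss: X - val_loss: Y`); the helper names `parse_losses` and `plot_losses` are hypothetical and are not part of DeepMicro. The sample text is the first three epochs of the AE run.

```python
import re

# Matches Keras per-epoch summary lines such as
# "2/2 - 1s - 680ms/step - loss: 0.0637 - val_loss: 0.0160"
LOSS_LINE = re.compile(r" - loss: (\S+) - val_loss: (\S+)")


def parse_losses(log_text):
    """Return (train_losses, val_losses), one entry per matched epoch line."""
    train, val = [], []
    for match in LOSS_LINE.finditer(log_text):
        train.append(float(match.group(1)))
        val.append(float(match.group(2)))
    return train, val


def plot_losses(train, val, title):
    """Draw both curves on one axis; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    epochs = range(1, len(train) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, train, label="loss")
    ax.plot(epochs, val, label="val_loss")
    ax.set_xlabel("epoch")
    ax.set_ylabel("loss")
    ax.set_title(title)
    ax.legend()
    return fig


# First three epochs of the AE run, copied from the cell output in this notebook.
sample_log = """
2/2 - 1s - 680ms/step - loss: 0.0637 - val_loss: 0.0160
2/2 - 1s - 263ms/step - loss: 0.0612 - val_loss: 0.0154
2/2 - 0s - 28ms/step - loss: 0.0591 - val_loss: 0.0149
"""
train_loss, val_loss = parse_losses(sample_log)
```

Calling `plot_losses(train_loss, val_loss, "AE")` would then draw the curve. If the training code keeps the `History` object returned by Keras `fit`, reading `history.history["loss"]` and `history.history["val_loss"]` directly is likely simpler than parsing text.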

In [None]:
print("config ae test")
config1 = DeepMicro_Config("ae_config")
run_exp_from_config(config1)


config ae test
loaded data from /content/drive/My Drive/Colab Notebooks/data/abundance/abundance_Cirrhosis.txt
X_train.shape:  (185, 10)
y_train.shape:  (185,)
X_test.shape:  (47, 10)
y_test.shape:  (47,)


Epoch 1/20

Epoch 1: val_loss improved from inf to 0.01602, saving model to AE_m_r_[50]_abundance_Cirrhosis.keras
2/2 - 1s - 680ms/step - loss: 0.0637 - val_loss: 0.0160
Epoch 2/20

Epoch 2: val_loss improved from 0.01602 to 0.01543, saving model to AE_m_r_[50]_abundance_Cirrhosis.keras
2/2 - 1s - 263ms/step - loss: 0.0612 - val_loss: 0.0154
Epoch 3/20

Epoch 3: val_loss improved from 0.01543 to 0.01488, saving model to AE_m_r_[50]_abundance_Cirrhosis.keras
2/2 - 0s - 28ms/step - loss: 0.0591 - val_loss: 0.0149
Epoch 4/20

Epoch 4: val_loss improved from 0.01488 to 0.01437, saving model to AE_m_r_[50]_abundance_Cirrhosis.keras
2/2 - 0s - 29ms/step - loss: 0.0573 - val_loss: 0.0144
Epoch 5/20

Epoch 5: val_loss improved from 0.01437 to 0.01392, saving model to AE_m_r_[50]_abundance_Cirrhosis.keras
2/2 - 0s - 29ms/step - loss: 0.0548 - val_loss: 0.0139
Epoch 6/20

Epoch 6: val_loss improved from 0.01392 to 0.01350, saving model to AE_m_r_[50]_abundance_Cirrhosis.keras
2/2 - 0s - 29ms/ste

[CV 1/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.749 total time=   8.1s
[CV 2/5; 1/2] START dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 2/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.827 total time=  13.4s
[CV 3/5; 1/2] START dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 3/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.599 total time=   8.3s
[CV 4/5; 1/2] START dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 4/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.392 total time=   8.4s
[CV 5/5; 1/2] START dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 5/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.702 total time=   7.5s
[CV 1/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 1/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.579 total time=   8.9s
[CV 2/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 2/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.857 total time=   8.4s
[CV 3/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 3/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.696 total time=   9.3s
[CV 4/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 4/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.623 total time=   8.5s
[CV 5/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 5/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.681 total time=  12.2s

Best parameters set found on development set:

{'dropout_rate': 0.3, 'epochs': 30, 'numHiddenLayers': 2, 'numUnits': 10}
Accuracy metrics
AUC, ACC, Recall, Precision, F1_score, time-end, runtime(sec), classfication time(sec), best hyper-parameter
[0.587, 0.5106, 0.9583, 0.5111, 0.6667, '2024-04-14 16:09:03.687317', 128.08, 103.26, "{'dropout_rate': 0.3, 'epochs': 30, 'numHiddenLayers': 2, 'numUnits': 10}"]
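
The selected `dropout_rate` can be cross-checked from the fold scores printed for the AE run. The sketch below assumes the search ranks candidates by mean score across the five folds, which is how scikit-learn's `GridSearchCV` picks its best candidate for a single metric; the variable names are illustrative only.

```python
from statistics import mean

# Fold scores copied from the AE grid-search log (5 folds, 2 candidates),
# keyed by the dropout_rate of each candidate.
fold_scores = {
    0.1: [0.749, 0.827, 0.599, 0.392, 0.702],
    0.3: [0.579, 0.857, 0.696, 0.623, 0.681],
}

# Mean cross-validation score per candidate, rounded for display.
mean_scores = {rate: round(mean(scores), 4) for rate, scores in fold_scores.items()}

# The candidate with the highest mean score is the one the search would keep.
best_dropout = max(mean_scores, key=mean_scores.get)
print(mean_scores, "best dropout_rate:", best_dropout)
```

The means come out to 0.6538 for `dropout_rate=0.1` and 0.6872 for `dropout_rate=0.3`, which agrees with the best parameter set reported for this run.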


In [None]:
print("config cae test")
config2 = DeepMicro_Config("cae_config")
run_exp_from_config(config2)

config cae test
loaded data from /content/drive/My Drive/Colab Notebooks/data/abundance/abundance_Cirrhosis.txt
X_train.shape:  (185, 10)
y_train.shape:  (185,)
X_test.shape:  (47, 10)
y_test.shape:  (47,)
X_train.shape:  (185, 4, 4, 1)
y_train.shape:  (185,)
X_test.shape:  (47, 4, 4, 1)
y_test.shape:  (47,)


Epoch 1/20

Epoch 1: val_loss improved from inf to 0.01256, saving model to /content/drive/My Drive/Colab Notebooks/data/abundance/CAE[50]_abundance_Cirrhosis.weights.h5
2/2 - 7s - 4s/step - loss: 0.0576 - val_loss: 0.0126
Epoch 2/20

Epoch 2: val_loss improved from 0.01256 to 0.01207, saving model to /content/drive/My Drive/Colab Notebooks/data/abundance/CAE[50]_abundance_Cirrhosis.weights.h5
2/2 - 5s - 2s/step - loss: 0.0558 - val_loss: 0.0121
Epoch 3/20

Epoch 3: val_loss improved from 0.01207 to 0.01145, saving model to /content/drive/My Drive/Colab Notebooks/data/abundance/CAE[50]_abundance_Cirrhosis.weights.h5
2/2 - 0s - 70ms/step - loss: 0.0523 - val_loss: 0.0114
Epoch 4/20

Epoch 4: val_loss improved from 0.01145 to 0.01073, saving model to /content/drive/My Drive/Colab Notebooks/data/abundance/CAE[50]_abundance_Cirrhosis.weights.h5
2/2 - 0s - 48ms/step - loss: 0.0482 - val_loss: 0.0107
Epoch 5/20

Epoch 5: val_loss improved from 0.01073 to 0.00986, saving model to /content/dri

[CV 1/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.618 total time=   9.0s
[CV 2/5; 1/2] START dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 2/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.678 total time=   7.7s
[CV 3/5; 1/2] START dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 3/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.617 total time=   8.3s
[CV 4/5; 1/2] START dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 4/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.825 total time=   9.2s
[CV 5/5; 1/2] START dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 5/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.675 total time=  12.5s
[CV 1/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 1/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.621 total time=   8.6s
[CV 2/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 2/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.731 total time=   8.9s
[CV 3/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 3/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.775 total time=   8.7s
[CV 4/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 4/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.667 total time=   8.4s
[CV 5/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 5/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.635 total time=  13.9s

Best parameters set found on development set:

{'dropout_rate': 0.3, 'epochs': 30, 'numHiddenLayers': 2, 'numUnits': 10}
Accuracy metrics
AUC, ACC, Recall, Precision, F1_score, time-end, runtime(sec), classfication time(sec), best hyper-parameter
[0.6757, 0.5106, 0.9167, 0.5116, 0.6567, '2024-04-14 16:20:21.959959', 141.24, 103.53, "{'dropout_rate': 0.3, 'epochs': 30, 'numHiddenLayers': 2, 'numUnits': 10}"]
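
The metric row for the CAE run can be sanity-checked against the standard definitions of accuracy, recall, precision, and F1. The confusion-matrix counts used below (TP=22, FN=2, FP=21, TN=2) are an assumption: the notebook does not print them, and they are simply one combination over the 47 test samples that reproduces the reported values. `binary_metrics` is a hypothetical helper, not part of DeepMicro.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return ACC, recall, precision and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "ACC": (tp + tn) / total,
        "Recall": recall,
        "Precision": precision,
        "F1": 2 * precision * recall / (precision + recall),
    }


# Inferred counts (assumption, not printed by the notebook) for the CAE run;
# they sum to the 47 test samples reported in y_test.shape.
cae_metrics = {
    name: round(value, 4)
    for name, value in binary_metrics(tp=22, fn=2, fp=21, tn=2).items()
}
print(cae_metrics)
```

If these counts are right, the classifier labels almost every test sample as positive, which would explain a recall above 0.9 alongside an accuracy close to 0.5.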


In [None]:
print("config vae test")
config3 = DeepMicro_Config("vae_config")
run_exp_from_config(config3)

config vae test
loaded data from /content/drive/My Drive/Colab Notebooks/data/abundance/abundance_Cirrhosis.txt
X_train.shape:  (185, 10)
y_train.shape:  (185,)
X_test.shape:  (47, 10)
y_test.shape:  (47,)


Epoch 1/20

Epoch 1: val_loss improved from inf to 17.59176, saving model to VAE[50]_abundance_Cirrhosis.weights.h5
2/2 - 1s - 721ms/step - loss: 0.0000e+00 - val_loss: 17.5918
Epoch 2/20

Epoch 2: val_loss improved from 17.59176 to 15.96755, saving model to VAE[50]_abundance_Cirrhosis.weights.h5
2/2 - 0s - 38ms/step - loss: 0.0000e+00 - val_loss: 15.9676
Epoch 3/20

Epoch 3: val_loss improved from 15.96755 to 15.73457, saving model to VAE[50]_abundance_Cirrhosis.weights.h5
2/2 - 0s - 40ms/step - loss: 0.0000e+00 - val_loss: 15.7346
Epoch 4/20

Epoch 4: val_loss improved from 15.73457 to 14.25566, saving model to VAE[50]_abundance_Cirrhosis.weights.h5
2/2 - 0s - 37ms/step - loss: 0.0000e+00 - val_loss: 14.2557
Epoch 5/20

Epoch 5: val_loss improved from 14.25566 to 13.95582, saving model to VAE[50]_abundance_Cirrhosis.weights.h5
2/2 - 0s - 69ms/step - loss: 0.0000e+00 - val_loss: 13.9558
Epoch 6/20

Epoch 6: val_loss did not improve from 13.95582
2/2 - 0s - 26ms/step - loss: 0.0000e+00

[CV 1/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.395 total time=   4.0s
[CV 2/5; 1/2] START dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 2/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.687 total time=   6.0s
[CV 3/5; 1/2] START dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 3/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.500 total time=   4.9s
[CV 4/5; 1/2] START dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 4/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.478 total time=   2.8s
[CV 5/5; 1/2] START dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 5/5; 1/2] END dropout_rate=0.1, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.601 total time=   2.8s
[CV 1/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 1/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.430 total time=   3.1s
[CV 2/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 2/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.272 total time=   3.0s
[CV 3/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 3/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.523 total time=   2.7s
[CV 4/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 4/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.500 total time=   2.7s
[CV 5/5; 2/2] START dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10.
[CV 5/5; 2/2] END dropout_rate=0.3, epochs=30, numHiddenLayers=2, numUnits=10;, score=0.506 total time=   2.9s

Best parameters set found on development set:

{'dropout_rate': 0.1, 'epochs': 30, 'numHiddenLayers': 2, 'numUnits': 10}
Accuracy metrics
AUC, ACC, Recall, Precision, F1_score, time-end, runtime(sec), classfication time(sec), best hyper-parameter
[0.5362, 0.5319, 0.375, 0.5625, 0.45, '2024-04-14 17:31:44.096194', 96.32, 38.47, "{'dropout_rate': 0.1, 'epochs': 30, 'numHiddenLayers': 2, 'numUnits': 10}"]


In [None]:
# test a full (simple) workflow with PCA and SVM
print("\nconfig 1 test")
config1 = DeepMicro_Config("test_experiment_config_1")
run_exp_from_config(config1)

# Note: You don't have to create a whole new config file when you're just changing one or two parameters.
# You can run a slightly modified experiment like this (e.g. here we run the same experiment on a different dataset).
# Test config with no autoencoder.
# copy.deepcopy leaves config1 untouched; plain assignment would only create a second name for the same object.
import copy
config1_using_marker_colorectal = copy.deepcopy(config1)
config1_using_marker_colorectal.load_data.data_dir = "/content/drive/My Drive/Colab Notebooks/data/marker/"
config1_using_marker_colorectal.load_data.data = "marker_Colorectal"
run_exp_from_config(config1_using_marker_colorectal)
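
When deriving one config from another, note that plain assignment in Python binds a second name to the same object, so edits made through the new name also change the original. The sketch below illustrates this with `types.SimpleNamespace` as a stand-in; it assumes `DeepMicro_Config` exposes a similar nested attribute layout and can be deep-copied, which has not been verified here.

```python
import copy
from types import SimpleNamespace

# Stand-in for a nested config object (hypothetical layout, not the real class).
base = SimpleNamespace(
    load_data=SimpleNamespace(data_dir="data/abundance/", data="abundance_Cirrhosis")
)

# Plain assignment: both names refer to the same object.
alias = base
alias.load_data.data = "marker_Colorectal"
value_seen_through_base = base.load_data.data  # the original changed as well

# Restore the original value, then derive the variant from a deep copy instead.
base.load_data.data = "abundance_Cirrhosis"
variant = copy.deepcopy(base)
variant.load_data.data = "marker_Colorectal"

print(value_seen_through_base, base.load_data.data, variant.load_data.data)
```

With the deep copy, the base config can be rerun later with its original dataset, which matters once several variants are compared in one notebook session.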

## Model comparison

**To-Do:**

Train each model for the full number of epochs and compare the results (by Apr 28). The comparison should cover:
- accuracy, AUC, and the other reported metrics for each model
- the time taken to train each model
- the best hyperparameters found for each model
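
As a first step toward this comparison, the metric rows already printed by the AE, CAE, and VAE runs can be collected into one table. The sketch below uses only the numbers from those three outputs; they come from the short demo runs on the Cirrhosis abundance subset, not from full training, and the dictionary layout is just one possible arrangement.

```python
# Metric rows copied from the AE, CAE and VAE cell outputs in this notebook.
results = {
    "AE": {"AUC": 0.587, "ACC": 0.5106, "Recall": 0.9583, "Precision": 0.5111,
           "F1": 0.6667, "runtime_s": 128.08, "clf_time_s": 103.26, "dropout_rate": 0.3},
    "CAE": {"AUC": 0.6757, "ACC": 0.5106, "Recall": 0.9167, "Precision": 0.5116,
            "F1": 0.6567, "runtime_s": 141.24, "clf_time_s": 103.53, "dropout_rate": 0.3},
    "VAE": {"AUC": 0.5362, "ACC": 0.5319, "Recall": 0.375, "Precision": 0.5625,
            "F1": 0.45, "runtime_s": 96.32, "clf_time_s": 38.47, "dropout_rate": 0.1},
}

# Rank the representations by test AUC, highest first.
ranked = sorted(results, key=lambda name: results[name]["AUC"], reverse=True)

columns = ["AUC", "ACC", "Recall", "Precision", "F1", "runtime_s", "clf_time_s", "dropout_rate"]
print("model".ljust(6) + "".join(col.rjust(13) for col in columns))
for name in ranked:
    row = results[name]
    print(name.ljust(6) + "".join(f"{row[col]:13.4f}" for col in columns))
```

On these demo runs the CAE representation has the highest AUC and the VAE run is the fastest, but with 47 test samples and 20 autoencoder epochs the differences should be treated as provisional.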

# Discussion

In this section, you should discuss your work and make a plan for future work. The discussion should address the following questions:
  * Assess whether or not the paper is reproducible.
  * Explain why it is not reproducible if your results are negative.
  * Describe “What was easy” and “What was difficult” during the reproduction.
  * Make suggestions to the authors or other reproducers on how to improve reproducibility.
  * Describe what you will do in the next phase.

---

**To-Do:**

Add the discussion (by Apr 28):

- Discuss the reproducibility of the paper and how easy the results were to reproduce.
- Discuss what was easy and what was difficult during the reproduction.
- Make suggestions to the authors or other reproducers on how to improve reproducibility.
- Discuss the results of the model comparison and their implications.
- Discuss the limitations of the study and potential future work.



# References

1. Cho, I. & Blaser, M. J. The human microbiome: at the interface of health and disease. Nature Reviews Genetics 13, 260–270 (2012).
2. Huttenhower, C. et al. Structure, function and diversity of the healthy human microbiome. Nature 486, 207 (2012).
3. McQuade, J. L., Daniel, C. R., Helmink, B. A. & Wargo, J. A. Modulating the microbiome to improve therapeutic response in cancer. The Lancet Oncology 20, e77–e91 (2019).
4. Eloe-Fadrosh, E. A. & Rasko, D. A. The human microbiome: from symbiosis to pathogenesis. Annual Review of Medicine 64, 145–163 (2013).
5. Scholz, M. et al. Strain-level microbial epidemiology and population genomics from shotgun metagenomics. Nature Methods 13, 435 (2016).
6. Truong, D. T. et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nature Methods 12, 902 (2015).
7. Oh, M. & Zhang, L. DeepMicro: deep representation learning for disease prediction based on microbiome data. Scientific Reports 10, 6026 (2020).
8. Pasolli, E., Truong, D. T., Malik, F., Waldron, L. & Segata, N. Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Computational Biology 12, e1004977 (2016).
9. Cawley, G. C. & Talbot, N. L. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research 11, 2079–2107 (2010).
10. Varma, S. & Simon, R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 7, 91 (2006).
