# Usage of the genericVAE
## Requirements

This notebook uses Python >= 3.8 with TensorFlow 2.12.0, NumPy 1.23.5, pandas 2.0.0 and h5py 3.8.0.

## Data

The GTEx V8 study provides a dataset that is well suited for demonstration.
First, download the file that combines the gene read counts of all samples, across all tissue types, in the study.
It is available at the following URL: `https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.gz`.
After the download, move the file to the folder `genericVAE/data`, which will also hold all other data.
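
The download and the move into `data` can also be scripted. The following is a minimal sketch using only the standard library; `download_if_missing` is a hypothetical helper that is not part of genericVAE, and the sketch assumes it is run from inside the `genericVAE` folder.

```python
from pathlib import Path
from urllib.request import urlretrieve

GTEX_URL = (
    "https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/"
    "GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.gz"
)


def download_if_missing(url: str, data_dir: str = "data") -> Path:
    """Download `url` into `data_dir` unless the file is already there."""
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # name the local file after the last path component of the URL
    target = target_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        urlretrieve(url, target)
    return target


# usage (this starts a large download):
# path = download_if_missing(GTEX_URL)
```

Skipping the download when the file already exists makes it safe to re-run the cell.
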
A helper function is provided to read the data:

In [1]:
from helpers import reading
data = reading.read_gct_from_GTEx("data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.gz")
print(data)

                   Description  GTEX-1117F-0226-SM-5GZZ7   
Name                                                       
ENSG00000223972.5      DDX11L1                         0  \
ENSG00000227232.5       WASH7P                       187   
ENSG00000278267.1    MIR6859-1                         0   
ENSG00000243485.5  MIR1302-2HG                         1   
ENSG00000237613.2      FAM138A                         0   
...                        ...                       ...   
ENSG00000198695.2       MT-ND6                     32048   
ENSG00000210194.1        MT-TE                        18   
ENSG00000198727.2       MT-CYB                    127194   
ENSG00000210195.2        MT-TT                         1   
ENSG00000210196.2        MT-TP                         0   

                   GTEX-1117F-0426-SM-5EGHI  GTEX-1117F-0526-SM-5EGHJ   
Name                                                                    
ENSG00000223972.5                        

We see that the rows specify genes (by Ensembl gene ID and HUGO symbol) and the columns specify samples.
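
For orientation, a GCT file is a tab-separated table preceded by two header lines (a version line and a line with the table dimensions), with `Name` and `Description` as its first two columns. A minimal sketch of reading such a file with pandas might look as follows; `read_gct_sketch` is a hypothetical stand-in and not necessarily how `reading.read_gct_from_GTEx` is implemented, and the sample columns are made up.

```python
import io

import pandas as pd


def read_gct_sketch(path_or_buffer) -> pd.DataFrame:
    """Read a GCT table, using the gene IDs in `Name` as the row index."""
    # skip the version line and the dimensions line
    return pd.read_csv(path_or_buffer, sep="\t", skiprows=2, index_col="Name")


# tiny in-memory example with two genes and two made-up sample columns
toy_gct = io.StringIO(
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tsample_A\tsample_B\n"
    "ENSG00000223972.5\tDDX11L1\t0\t3\n"
    "ENSG00000227232.5\tWASH7P\t187\t42\n"
)
toy = read_gct_sketch(toy_gct)
print(toy.shape)  # (2, 3)
```

When given a file name ending in `.gz`, `pd.read_csv` infers the gzip compression by default, so the downloaded archive does not need to be unpacked first.
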
Because the dataset is large, we save it for further analysis in a format that loads faster, in our case HDF5 (pandas writes it through the PyTables package, `tables`):

In [3]:
data.to_hdf("data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.h5", key="data")

Next, we filter the dataset for genes that are relevant to our current scope.
To begin with, let's focus on genes that are part of the STRING database.
A list of these genes is provided for this purpose in `h_S_string.txt`.

In [4]:
import pandas as pd
in_string = pd.read_csv("h_S_string.txt", sep="\t")["preferred_name"]
# keep only the genes whose symbol appears in the STRING list
fdat = data.loc[data["Description"].isin(in_string)]
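
To illustrate what this filter does, here is a small self-contained sketch on made-up data; the toy frame and the symbol list are invented for illustration only.

```python
import pandas as pd

# toy expression frame in the same layout: gene IDs as index, symbols in "Description"
toy = pd.DataFrame(
    {"Description": ["GENE_A", "GENE_B", "GENE_C"], "sample_1": [5, 0, 12]},
    index=pd.Index(["id_1", "id_2", "id_3"], name="Name"),
)
# hypothetical list of symbols to keep
keep = pd.Series(["GENE_A", "GENE_C"], name="preferred_name")

# isin() builds a boolean mask; .loc keeps the rows where it is True
filtered = toy.loc[toy["Description"].isin(keep)]
print(list(filtered.index))  # ['id_1', 'id_3']
```
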

The filtered frame now holds 17383 samples with read counts for 18647 genes each.
Some pre-processing is required before training.
Experiments have shown that a log-transformation followed by scaling is sufficient.
For performance reasons we switch to `numpy` here and reconstruct the data-frame structure later.
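
Reconstructing the data-frame structure later amounts to re-attaching the row and column labels to the processed array. A minimal sketch on made-up data (all names and values here are illustrative, and `np.log1p` merely stands in for any array-level processing):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"Description": ["GENE_A", "GENE_B"], "sample_1": [1, 3], "sample_2": [7, 0]},
    index=pd.Index(["id_1", "id_2"], name="Name"),
)

# remember the labels, then work on the bare numeric array
numeric = toy.drop(columns="Description")
values = numeric.to_numpy(dtype=float)
processed = np.log1p(values)  # placeholder for any array-level processing

# re-attach the gene IDs and sample names afterwards
restored = pd.DataFrame(processed, index=numeric.index, columns=numeric.columns)
```
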


In [5]:
import numpy as np
import helpers.data_preprocessing as prep

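# log-transform the read counts; "Name" is the index, so "Description" is the
# only non-numeric column and is dropped by name before converting to floats
counts = fdat.drop(columns="Description").to_numpy(dtype=float)
logdat = prep.log_norm(counts)
# scale each sample (column); 1 is added first to avoid division by zero
sdat = prep.scale_by_sample(logdat + 1)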
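
The implementations of `prep.log_norm` and `prep.scale_by_sample` are not shown here. As a rough idea of what such a pipeline can look like, the sketch below assumes a `log(1 + x)` transform and a per-sample division by the column maximum; both choices and both function names are assumptions for illustration, not the helpers' actual behavior.

```python
import numpy as np


def log_transform_sketch(counts: np.ndarray) -> np.ndarray:
    """Compress the dynamic range of read counts with log(1 + x)."""
    return np.log1p(counts)


def scale_by_sample_sketch(values: np.ndarray) -> np.ndarray:
    """Divide every sample (column) by its own maximum."""
    # keepdims keeps the shape (1, n_samples) so broadcasting works column-wise
    col_max = values.max(axis=0, keepdims=True)
    return values / col_max


# genes in rows, samples in columns; the second sample has no reads at all
counts = np.array([[0.0, 0.0], [9.0, 0.0], [99.0, 0.0]])
logged = log_transform_sketch(counts)
# shifting by 1 keeps every column maximum positive, so an all-zero sample
# does not cause a division by zero
scaled = scale_by_sample_sketch(logged + 1)
```

Under these assumptions every scaled value lies in the interval (0, 1], and the all-zero sample becomes a column of ones instead of `nan`.
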

## Build a model

Now we can build a model and train it on the data we just prepared.
First, we specify the parameters the model is based on:

In [None]:
import class_definitions.generic_VAE as gvae

# samples are the training examples, so the array is transposed to
# (samples, genes); the input size is the number of genes
input_shape = sdat.T.shape[1]

builder = gvae.Builder(input_shape=input_shape,
                       encoder_shape=[186],
                       decoder_shape=[186],
                       latent_dims=18,
                       dropout_rate=.001)

vae = gvae.VAE(builder)

# compile it:
vae.compile()

# and train it:
vae.fit(sdat.T, epochs=100, workers=64, use_multiprocessing=True)

# the reconstruction errors can be read directly from the trained VAE object
recons = vae.fwise_recon_error_tracker().result().numpy()

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
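
To make the parameters above more concrete, the following NumPy sketch pushes a random batch through an untrained network with the same layer sizes (one hidden layer of 186 units on each side, 18 latent dimensions) and computes a per-feature reconstruction error. It is not the genericVAE implementation: the random weights, the ReLU activation, the reduced input size of 1000 genes and the mean-squared-error definition are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes, hidden, latent = 32, 1000, 186, 18


def dense(x, n_out):
    """Apply a randomly initialised, untrained dense layer (no bias)."""
    w = rng.normal(scale=0.01, size=(x.shape[1], n_out))
    return x @ w


x = rng.random((n_samples, n_genes))

# encoder: input -> hidden -> mean and log-variance of the latent Gaussian
h_enc = np.maximum(dense(x, hidden), 0.0)
z_mean = dense(h_enc, latent)
z_log_var = dense(h_enc, latent)

# reparameterisation trick: z = mean + sigma * eps with eps ~ N(0, 1)
eps = rng.standard_normal(z_mean.shape)
z = z_mean + np.exp(0.5 * z_log_var) * eps

# decoder: latent -> hidden -> reconstruction of the input
h_dec = np.maximum(dense(z, hidden), 0.0)
x_hat = dense(h_dec, n_genes)

# feature-wise reconstruction error: mean squared error per gene over the batch
fwise_error = np.mean((x - x_hat) ** 2, axis=0)
print(z.shape, x_hat.shape, fwise_error.shape)  # (32, 18) (32, 1000) (1000,)
```

The result has one error value per gene, which is the shape one would expect from a feature-wise tracker such as the one accessed above.
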