# Model

This script builds a prototype model to validate our idea. 
We take the data from the bulk analysis as training data and generate a matrix of mock cell-type proportions, purely so that the model can be trained end to end. 
In the final version of the script, the bulk training data can easily be replaced with single-cell data, and the bulk data only needs to be added at the bottom of the script to predict the cell-type proportions.

In Google Colab, set the runtime to ==R with TPU==; running it without TPU resulted in errors (possibly solvable with the `reticulate` package; see the commented-out code).

File(s) needed:
    - `bulk_tissue_data.csv` from notebook `01scripts\01_compBiology_bulk.ipynb`

## load libraries and functions

In [None]:
# libraries ####
#devtools::install_github("rstudio/keras")
#install.packages("tidyverse")
# install.packages("gradDescent")
library(keras)
library(tidyverse)
#library(gradDescent)

## custom functions ####
# Generate mock data for the proportion of each cell type; the proportions of each sample (column) sum to 1.
# `n_rows` sets the mock number of cell types and `n_cols` the number of samples.

create_matrix <- function(n_rows, n_cols) {
  # Generate a vector of random numbers between 0 and 1
  data <- runif(n_rows * n_cols)

  # Create a matrix from the vector
  matrix <- matrix(data, nrow = n_rows, ncol = n_cols)

  # Normalize the matrix such that the sum of each column is 1
  for (i in seq_len(n_cols)) {
    matrix[, i] <- matrix[, i] / sum(matrix[, i])
  }

  return(matrix)
}

# as alternative scaling function --- NOT USED
min_max_scaling <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
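
As a quick sanity check of the idea behind `create_matrix()`, the sketch below builds the same kind of mock proportion matrix with a vectorised normalisation and confirms that every column sums to 1. The name `create_matrix_vectorised` and the toy dimensions (4 cell types, 6 samples) are illustrative assumptions, not part of the original script.

```r
# Hypothetical vectorised variant of create_matrix() (illustrative only)
create_matrix_vectorised <- function(n_rows, n_cols) {
  raw <- matrix(runif(n_rows * n_cols), nrow = n_rows, ncol = n_cols)
  # divide every column by its own sum, so that each column sums to 1
  sweep(raw, 2, colSums(raw), "/")
}

set.seed(1)
mock_proportions <- create_matrix_vectorised(4, 6)  # 4 cell types, 6 samples
dim(mock_proportions)
colSums(mock_proportions)  # every entry should be 1 (up to rounding)
```
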

## load data

In the `csv` data, the first column holds the feature names: the genes plus other features such as `age` and `condition`. Note that we did not rename this column, so it is still called `ensembl_gene_id`.
To scale (standardize) the data so that bulk and single-cell RNA-seq data can be used together, we need to split it into gene rows and non-gene rows, because only the genes should be scaled. The other variables are either already numeric or need to be converted to binary.

In [None]:
# load data ####
tissue_bulk_data <- read.csv("bulk_tissue_data.csv")
data <- tissue_bulk_data #[c(1:5,c(17276:17279)),1:5]
tail(data)
## preprocess data  ####
# rows 1 to 17275 are genes; drop the first column (feature names)
split_gene_features <- data[c(1:17275),-1]
tail(split_gene_features)
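
Because gene rows and annotation rows share the same columns, a column that also contains text such as `F` or `control` is read as character, so the gene values need an explicit numeric conversion before scaling. The sketch below shows the split on a toy table. The table `toy`, its values, and the names in `non_gene_features` are made up for illustration, and selecting rows by name is a swapped-in alternative to the hard-coded row numbers used in this notebook.

```r
# Toy stand-in for the csv: two gene rows followed by two annotation rows (made-up values)
toy <- data.frame(
  ensembl_gene_id = c("gene_a", "gene_b", "condition", "sex"),
  sample_1 = c("5.1", "0.3", "control", "F"),
  sample_2 = c("4.8", "0.9", "treated", "M"),
  stringsAsFactors = FALSE
)

# Assumed names of the non-gene rows; adapt them to the real file
non_gene_features <- c("condition", "sex")
is_gene <- !(toy$ensembl_gene_id %in% non_gene_features)

# Convert the gene rows to a numeric matrix (genes in rows, samples in columns)
toy_genes <- sapply(toy[is_gene, -1], as.numeric)
rownames(toy_genes) <- toy$ensembl_gene_id[is_gene]
toy_genes
```

Selecting by name keeps the split valid if the number of genes in the file changes.
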


From the table above we can see that we exported more features than we need in order to use the two data sets together. For example, `condition_specified` is not present in the single-cell RNA-seq data, so it can be ignored in this script. Sex is coded as M/F, so we need to convert it to 0/1.

In [None]:
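### convert to binary
# the row indices are specific to this csv: sex in row 17279, condition in row 17276
binary_sex <- ifelse(data[17279, -1] == "F", 1, 0)             # F = 1, otherwise 0
binary_condition <- ifelse(data[17276, -1] == "control", 0, 1) # control = 0, otherwise 1
binary_sex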
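
The `ifelse()` encoding maps every label other than the reference one to the same value, so misspelt or unexpected labels pass silently. A minimal sketch of a check on toy vectors follows; the values, the label `disease`, and the `toy_` names are assumptions made for illustration.

```r
# Made-up annotation values for five samples
toy_sex <- c("F", "M", "F", NA, "M")
toy_condition <- c("control", "disease", "control", "Control", "disease")

toy_binary_sex <- ifelse(toy_sex == "F", 1, 0)                    # a missing value stays NA
toy_binary_condition <- ifelse(toy_condition == "control", 0, 1)  # "Control" is encoded as 1

# Labels outside the expected set are worth reviewing before training
expected_conditions <- c("control", "disease")
unexpected <- setdiff(unique(toy_condition), expected_conditions)
unexpected
```
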

Now that the genes are split from the other phenoData, we can scale them with the function `scale()`. We decided to use this standard function, but it is worth researching whether there is a better option for this kind of genomic data. 

In [None]:
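# columns mixing numbers and text (e.g. "F", "control") are read as character or factor, and
# data.matrix() would return factor codes for them, so convert the values explicitly
gene_values <- sapply(split_gene_features, function(col) as.numeric(as.character(col)))
# scale() standardizes columns, so transpose to scale each gene across samples
scale_genes <- t(scale(t(gene_values)))
head(scale_genes)
# merge the scaled genes with the other features (sex and condition)
x_original <- rbind(scale_genes, binary_sex, binary_condition)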
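
The double transpose is needed because `scale()` standardizes columns, while here each gene is a row. The sketch below checks the effect on a toy matrix (made-up values and names): after scaling, every gene has mean 0 and standard deviation 1 across samples. It also shows that a gene with identical values in every sample turns into `NaN`, so such genes may need to be filtered out first.

```r
set.seed(42)
toy_expr <- matrix(
  rexp(15, rate = 0.1), nrow = 3,
  dimnames = list(paste0("gene_", 1:3), paste0("sample_", 1:5))
)

# scale() standardizes columns, so transpose before and after to work gene by gene
toy_scaled <- t(scale(t(toy_expr)))
round(rowMeans(toy_scaled), 10)  # per-gene mean
apply(toy_scaled, 1, sd)         # per-gene standard deviation

# A gene with the same value in every sample has sd 0 and becomes NaN
flat_gene <- matrix(7, nrow = 1, ncol = 5)
flat_scaled <- t(scale(t(flat_gene)))
flat_scaled
```
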

Now let's just print the original data to check that the `scale()` worked.

In [None]:
tail(x_original)

It worked. Now we have our `x` data ready to build the model, but since we do not have the cell-type proportions for the bulk data, we generate mock proportions with the function defined above. 
For the sake of simplicity we only consider 4 cell types and, naturally, the number of samples must be the same as in `x` (i.e., the number of patient samples).

In [None]:
# Create a mock matrix of cell-type proportions with random numbers
number_of_cell_types <- 4
number_of_samples <- dim(x_original)[2] # number of patients
y_original <- create_matrix(number_of_cell_types, number_of_samples)

# No further normalization is needed: create_matrix() already makes each column sum to 1.
# (`y_original / colSums(y_original)` would recycle the sums along rows, not columns.)
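
If a matrix ever needs to be renormalized column by column, note that dividing it directly by `colSums()` recycles the vector of sums down the rows, silently, whenever the lengths are compatible. The toy example below (made-up matrix `m`) contrasts that with `sweep()`, which applies each sum to its own column.

```r
m <- matrix(1:6, nrow = 2)  # columns: (1, 2), (3, 4), (5, 6)

recycled <- m / colSums(m)                 # sums are recycled element by element
by_column <- sweep(m, 2, colSums(m), "/")  # each column divided by its own sum

colSums(recycled)   # not all equal to 1
colSums(by_column)  # all equal to 1
```

`t(t(m) / colSums(m))` is an equivalent base R idiom.
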

In [None]:
y_original

# Build model

Ok, now we have our data ready to build the model. For that we use the keras package and, as a starting point, we take the number of features to be the number of rows of `x` (the genes plus the two binary variables); this might need to be optimized. 
But, before we proceed, let's check that the matrix sizes are correct.

In [None]:
# build the model ####
## define feature and targets 
x <- x_original
y <- y_original
# check matrix dims
tissue_composition <- dim(y)[1]
number_of_features <- dim(x)[1]
dim(x)
dim(y)
tissue_composition

We can see that the matrices have the samples in columns, so we need to transpose them to build the model: features should be the columns and observations (patient samples) the rows.
At this stage we also need to split the data into "training" and "testing" sets; we use an arbitrary proportion of $0.8$ for training and $0.2$ for testing.

In [None]:
# Transpose so that rows are samples and columns are features (x) or cell types (y)
x <- t(x)
y <- t(y)
# Set a random seed for reproducibility
set.seed(123)

# Define the proportion of data to use for testing (e.g., 20%)
test_split_ratio <- 0.2

# Generate random indices for splitting the data
num_samples <- nrow(x)
num_test_samples <- round(num_samples * test_split_ratio)
test_indices <- sample(seq_len(num_samples), num_test_samples)

# Split the data into training and test sets
# (drop = FALSE keeps the matrix shape even if a set has a single sample)
x_train <- x[-test_indices, , drop = FALSE]  # Training features
y_train <- y[-test_indices, , drop = FALSE]  # Training labels
x_test <- x[test_indices, , drop = FALSE]    # Test features
y_test <- y[test_indices, , drop = FALSE]    # Test labels
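
The same split can be tried on toy data to confirm that the two sets are disjoint and together cover every sample. All `toy_` names and the dimensions (10 samples, 3 features, 2 cell types) are assumptions made for this sketch.

```r
set.seed(123)
toy_x <- matrix(rnorm(10 * 3), nrow = 10)  # 10 samples, 3 features
toy_y <- matrix(runif(10 * 2), nrow = 10)  # 10 samples, 2 cell types
toy_y <- toy_y / rowSums(toy_y)            # each row (sample) sums to 1

n <- nrow(toy_x)
toy_test_idx <- sample(seq_len(n), round(n * 0.2))
toy_train_idx <- setdiff(seq_len(n), toy_test_idx)

# drop = FALSE keeps a matrix even when only one row is selected
toy_x_train <- toy_x[toy_train_idx, , drop = FALSE]
toy_y_train <- toy_y[toy_train_idx, , drop = FALSE]
toy_x_test <- toy_x[toy_test_idx, , drop = FALSE]
toy_y_test <- toy_y[toy_test_idx, , drop = FALSE]

c(train = nrow(toy_x_train), test = nrow(toy_x_test))
```
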

In [None]:
# Initialize model
# reticulate::use_condaenv("base", conda = "auto") # run on cpu
model <- keras_model_sequential()


In [None]:
# Add layers
feature_factor = 1
model %>%
  layer_dense(units = feature_factor*number_of_features, activation = 'sigmoid', input_shape = dim(x)[2]) %>% # units=265 relu
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = 'sigmoid') %>% #units = 128 relu
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = tissue_composition, activation = 'softmax')


In [None]:
# Compile model
model %>% compile(
  loss = 'categorical_crossentropy', # binary_crossentropy #
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy') # only checks whether the dominant cell type matches
)

In [None]:
# Train model
history <- model %>% fit(
  x_train,
  y_train,
  epochs = 100, #30
  batch_size = 32 # 32
  #validation_split = 0.2
)

In [None]:
plot(history)

In [None]:
# Evaluate model
model %>% evaluate(x_test, y_test)

# Make predictions: index of the cell type with the largest predicted proportion per sample
predictions <- model %>% predict(x_test) %>% k_argmax()

The code runs without errors. The accuracy is poor, but this is expected since we used random data to train the model.  
Now we just want to check that the predicted proportions of each sample sum to approximately 1. If they do not, it might indicate that the softmax activation of the last layer is not being applied properly. 

In [None]:
predictions <- model %>% predict(x_train)
head(predictions)
# each row (sample) should sum to approximately 1
rowSums(head(predictions))
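
The rows sum to 1 because of the softmax activation in the last layer, which exponentiates the raw outputs and divides them by their total. A minimal base R sketch follows; the function `softmax` and the input values in `logits` are written here for illustration only and are not taken from keras.

```r
# Illustrative softmax for one sample; subtracting max(z) avoids overflow in exp()
softmax <- function(z) {
  shifted <- exp(z - max(z))
  shifted / sum(shifted)
}

logits <- c(2.0, -1.0, 0.5, 0.0)  # made-up raw outputs for 4 cell types
proportions <- softmax(logits)
proportions
sum(proportions)  # 1, up to floating point rounding
```
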

Good, the sums are close to 1. Now let's check how different the predicted values are from the training values (bearing in mind that the accuracy is poor).

In [None]:
head(y_train)
head(y_train %>% apply(1, sum))
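
Since the targets are proportions rather than class labels, an error measure on the proportions themselves can be more informative than accuracy when comparing predictions with `y_train`. The sketch below computes the mean absolute error per cell type and an overall RMSE on toy matrices. These metrics are a suggestion, not something used in this notebook, and `observed` and `predicted` hold made-up values.

```r
# Made-up proportions: 3 samples (rows) by 4 cell types (columns)
observed <- rbind(
  c(0.10, 0.20, 0.30, 0.40),
  c(0.25, 0.25, 0.25, 0.25),
  c(0.70, 0.10, 0.10, 0.10)
)
predicted <- rbind(
  c(0.20, 0.20, 0.20, 0.40),
  c(0.25, 0.25, 0.25, 0.25),
  c(0.50, 0.20, 0.20, 0.10)
)

mae_per_cell_type <- colMeans(abs(predicted - observed))  # one value per cell type
rmse_overall <- sqrt(mean((predicted - observed)^2))      # single summary value
mae_per_cell_type
rmse_overall
```
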

Ok, great that is it! The code for the model is running.

# Prepare bulk data to be used by model trained with single-cell-RNAseq data

The steps here are identical to those performed before, with some adjustments such as checking that the genes are the same in both data sets and appear in the same order in the table. 
We added this step because the first layer takes one input per feature and matches inputs by position, so we expected the order of the features to matter: a gene should sit at the same index at prediction time as it did during training. Thus, to avoid introducing unnecessary variability, we made sure that the features are ordered identically (i.e., have the same index).

In [None]:
# process and prepare bulk for pipeline
data_bulk <- read.csv("bulk_tissue_data.csv")
# random rows as a placeholder; replace with the genes from the single-cell csv
genes_in_singleCell <- data_bulk$ensembl_gene_id[c(11,5,9,8)]
# keep only the shared genes, then put them in the same order as the single-cell genes
genes_filter_bulk <- data_bulk %>% filter(ensembl_gene_id %in% genes_in_singleCell)
matched_gene_order_bulk <- genes_filter_bulk %>% arrange(match(ensembl_gene_id,genes_in_singleCell))
# remove ensembl
split_gene_features <- matched_gene_order_bulk[,-1]
### convert to binary
# same rows as in the first part of the script: sex in row 17279, condition in row 17276
binary_sex <- ifelse(data_bulk[17279,] == "F", 1, 0)[,-1]
binary_condition <- ifelse( data_bulk[17276,] == "control", 0, 1)[,-1]
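# scale (convert explicitly first: data.matrix() returns factor codes for text columns)
gene_values <- sapply(split_gene_features, function(col) as.numeric(as.character(col)))
scale_genes <- t(scale(t(gene_values)))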
head(scale_genes)
head(binary_sex)
head(binary_condition)
bulk_genes_filtered <- rbind(scale_genes,binary_sex,binary_condition)
bulk_genes_filtered
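
The gene alignment can be verified explicitly before the bulk matrix is passed to the model. The sketch below uses base R `match()` on a toy table instead of the `filter()` and `arrange()` pipeline above. The names `reference_genes`, `toy_bulk`, and `aligned_bulk` and all values are made up for illustration.

```r
# Gene order expected by the trained model (made-up identifiers)
reference_genes <- c("gene_c", "gene_a", "gene_d")

toy_bulk <- data.frame(
  ensembl_gene_id = c("gene_a", "gene_b", "gene_c", "gene_d"),
  sample_1 = c(1.0, 2.0, 3.0, 4.0),
  sample_2 = c(1.5, 2.5, 3.5, 4.5),
  stringsAsFactors = FALSE
)

# Position of every reference gene in the bulk table; NA would mean a missing gene
row_index <- match(reference_genes, toy_bulk$ensembl_gene_id)
stopifnot(!anyNA(row_index))

aligned_bulk <- toy_bulk[row_index, ]
identical(aligned_bulk$ensembl_gene_id, reference_genes)  # TRUE when the order matches
```

Stopping on a missing gene makes a mismatch between the data sets visible before prediction, rather than silently producing a matrix with fewer features than the model expects.
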