# Model

In this script we build the model using part of the single-cell data and test it with bulk data. For the sake of time, and due to computational limitations, we selected only 22 genes and 3 cell types to build the model. The genes were chosen by feature selection: they are the genes with the highest differential expression in the analysis of the single-cell RNA-seq data. As for the cell types, we considered the three most abundant cell types in the single-cell RNA-seq data: astrocytes, microglia and oligodendrocytes, in this order from highest to lowest proportion.
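
As a rough illustration of this kind of feature selection, the sketch below keeps the genes that pass an adjusted p-value cut-off and ranks them by absolute log2 fold change. The table `de_results`, its columns (`gene`, `log2FC`, `padj`), the toy values, the 0.05 threshold and the helper name `select_top_genes` are all assumptions made for illustration; the actual selection was done in the single-cell notebook and may differ.

```r
# Hypothetical differential-expression table (toy values, not real results)
de_results <- data.frame(
  gene   = paste0("gene_", 1:6),
  log2FC = c(2.5, -3.1, 0.2, 1.8, -0.4, 4.0),
  padj   = c(0.001, 0.0005, 0.8, 0.03, 0.6, 0.2),
  stringsAsFactors = FALSE
)

# Keep the n_genes significant genes with the largest absolute fold change
select_top_genes <- function(de, n_genes = 22, padj_cutoff = 0.05) {
  significant <- de[!is.na(de$padj) & de$padj < padj_cutoff, ]
  ranked <- significant[order(abs(significant$log2FC), decreasing = TRUE), ]
  head(ranked$gene, n_genes)
}

# Use n_genes = 2 here only because the toy table is tiny
selected_genes <- select_top_genes(de_results, n_genes = 2)
print(selected_genes)
```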

For a walkthrough with comments on the model, see the notebook `03_comBiology_mock_model.ipynb`.

In Google Colab, set the runtime to ==R with TPU==; running it without TPU resulted in errors.

Files needed:

- `bulk_tissue_data.csv` from notebook `01scripts\01_compBiology_bulk.ipynb`;
- `sc-RNAseq_genes_input_22.csv` and `sc_RNAseq_output_patients.csv` from notebook `01scripts\02_comBiology_singleCell.ipynb`.

## load libraries and functions

In [None]:
# libraries ####
devtools::install_github("rstudio/keras")
install.packages("tidyverse")
install.packages("gradDescent")
library(keras)
library(tidyverse)
#library(gradDescent)

## custom functions ####
# Generate mock data for the proportion of each cell type; each column sums to 1.
# `n_rows` sets the mock number of cell types and `n_cols` the number of samples.

create_matrix <- function(n_rows, n_cols) {
  # Generate a vector of random numbers between 0 and 1
  data <- runif(n_rows * n_cols)

  # Create a matrix from the vector
  matrix <- matrix(data, nrow = n_rows, ncol = n_cols)

  # Normalize the matrix such that the sum of each column is 1
  for (i in seq_len(n_cols)) {
    matrix[, i] <- matrix[, i] / sum(matrix[, i])
  }

  return(matrix)
}

# as alternative scaling function --- NOT USED
min_max_scaling <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
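
The column-wise normalisation in `create_matrix` can also be written without a loop. The sketch below uses base R's `sweep()` on a small toy matrix; the helper name `normalize_columns` is hypothetical. Note that plain `m / colSums(m)` would not give the same result, because R recycles the vector element by element down each column, so the divisors do not line up with the columns.

```r
# Divide every column of a numeric matrix by its own sum
normalize_columns <- function(m) {
  sweep(m, 2, colSums(m), "/")
}

set.seed(1)
toy <- matrix(runif(3 * 4), nrow = 3, ncol = 4)  # 3 cell types x 4 samples
proportions <- normalize_columns(toy)

# Each column (sample) should now sum to 1
print(colSums(proportions))
```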

# load data

In [None]:
# load data ####
tissue_pseudobulk_data <- read.csv("sc-RNAseq_genes_input_22.csv")
data <- tissue_pseudobulk_data #[c(1:5,c(17276:17279)),1:5]
tail(data)
## preprocess data --- scale ####
split_gene_features <- data[c(1:22),-1]
tail(split_gene_features)


In [None]:
### extract the phenotype rows: sex (row 23) and condition (row 24)
dim(data)
binary_sex <- (data[23,-1])
binary_condition <- (data[24,-1])
binary_sex

In [None]:
scale_genes <- t(scale(t(data.matrix(split_gene_features))))
#scale_genes <- sapply(t(data.matrix(split_gene_features)),min_max_scaling)
# (colnames(split_gene_features) == colnames(binary_sex) ) == colnames(binary_condition)
head(scale_genes)
x_original <- rbind(scale_genes,binary_sex,binary_condition)
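
Since `scale()` standardises columns, the double transpose above is what makes the z-scoring act per gene (row) rather than per sample. A minimal check on a toy matrix with made-up values shows the effect:

```r
# Toy expression matrix: 2 genes (rows) x 4 samples (columns), made-up values
toy_expr <- matrix(c(1, 2, 3, 4,
                     10, 20, 30, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB"), paste0("s", 1:4)))

# Transpose, scale the columns (now genes), and transpose back
scaled <- t(scale(t(toy_expr)))

# Every gene now has mean 0 and standard deviation 1 across samples
print(rowMeans(scaled))
print(apply(scaled, 1, sd))
```

A gene with identical values in every sample has a standard deviation of 0 and would come out as `NaN`, so such genes need to be removed or handled beforehand.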

In [None]:
tail(x_original)

In [None]:
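# Load the cell-type proportions per patient (model output)
number_of_cell_types <- 3
number_of_samples <- dim(x_original)[2] # number of patients
y_original <- read.csv("sc_RNAseq_output_patients.csv")
y_original <- y_original[seq_len(number_of_cell_types), -1]
# Optional: normalize the columns to sum to 1 (sweep divides each column by its own sum;
# a plain `y_original / colSums(y_original)` would recycle the sums down the rows instead)
#y_original <- sweep(y_original, 2, colSums(y_original), "/")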

In [None]:
y_original

In [None]:
# build the model ####
x <- x_original
y <- y_original
tissue_composition <- dim(y)[1]
number_of_features <- dim(x)[1]
dim(x)
dim(y)
tissue_composition

In [None]:
# Transpose so that rows are samples and columns are features (x) or cell types (y)
x <- t(x)
y <- t(y)
# Set a random seed for reproducibility
set.seed(123)

# Define the proportion of data to use for testing (e.g., 20%)
test_split_ratio <- 0.2

# Generate random indices for splitting the data
num_samples <- nrow(x)
num_test_samples <- round(num_samples * test_split_ratio)
test_indices <- sample(1:num_samples, num_test_samples)

# Split the data into training and test sets
x_train <- x[-test_indices, ]  # Training features
y_train <- y[-test_indices, ]  # Training labels
x_test <- x[test_indices, ]    # Test features
y_test <- y[test_indices, ]    # Test labels

# Now you have x_train, y_train for training, and x_test, y_test for testing

In [None]:
test_indices

In [None]:

# Initialize model
model <- keras_model_sequential()


In [None]:
# Add layers
feature_factor <- 1
feature_factor2 <- 0.5 # currently unused
model %>%
  layer_dense(units = feature_factor*number_of_features, activation = 'sigmoid', input_shape = dim(x)[2]) %>% # units=265 relu
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = number_of_features, activation = 'sigmoid') %>% #units = 128 relu
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = tissue_composition, activation = 'softmax')

In [None]:
# Compile model
model %>% compile(
  loss = 'categorical_crossentropy', # alternative: binary_crossentropy
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy')
)

In [None]:
# Train model
history <- model %>% fit(
  x_train,
  y_train,
  epochs = 30,
  batch_size = 4
)

In [None]:
plot(history)

In [None]:
# Evaluate model
model %>% evaluate(x_test,y_test)

In [None]:
# Make predictions
# output layer (softmax vector)
predictions <- model %>% predict(x_test)
head(predictions)
# each row of softmax outputs should sum to 1
head(predictions) %>% apply(1, sum)

In [None]:
head(y_test)
head(y_test %>% apply(1, sum))

In [None]:
head(x_test[,1:2])

# Predict bulk data

In [None]:
# process and prepare bulk for pipeline
data_bulk <- read.csv("bulk_tissue_data.csv")
genes_in_singleCell <- tissue_pseudobulk_data$ENSEMBL[1:22]
genes_filter_bulk <- data_bulk %>% filter(ensembl_gene_id %in% genes_in_singleCell)
matched_gene_order_bulk <- genes_filter_bulk %>% arrange(match(ensembl_gene_id,genes_in_singleCell))
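# remove the ensembl id column
split_gene_features <- matched_gene_order_bulk[,-1]
# the text phenotype rows (e.g. "F", "control") make read.csv() load the sample columns as character;
# convert back to numeric so that data.matrix() does not return factor codes
split_gene_features[] <- lapply(split_gene_features, as.numeric)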
### convert to binary
binary_sex <- ifelse(data_bulk[17276,] == "F", 1, 0)[,-1]
binary_condition <- ifelse( data_bulk[17279,] == "control", 0, 1)[,-1]

# scale
scale_genes <- t(scale(t(data.matrix(split_gene_features))))
head(scale_genes)
head(binary_sex)
head(binary_condition)
bulk_genes_filtered <- rbind(scale_genes,binary_sex,binary_condition)
bulk_genes_filtered
x_test_bulk <- bulk_genes_filtered

In [None]:
# check length, should be 22
length(genes_in_singleCell)

In [None]:
# check dimensions: samples x features; the second value should be 24 (22 genes + 2 phenoData)
dim(t(x_test_bulk))

In [None]:
# Make predictions
# output layer (softmax vector)
predictions_bulk <- model %>% predict(t(x_test_bulk))
head(predictions_bulk)
# each row of softmax outputs should sum to 1
head(predictions_bulk) %>% apply(1, sum)

# conclusions

In less than two weeks we built the base architecture of a machine learning model that predicts tissue composition from gene expression profiles.
Naturally, our model is overfitted because of the low number of samples and features (a limitation of the scRNA-seq data). It therefore requires further training and validation, in particular with more samples and more features than the 22 selected genes. However, we think that considering all genes might introduce noise, so the idea would be to improve feature selection using prior evidence, for example by considering genes described as associated with Huntington's disease as well as markers of specific brain cell types.
Another aspect we did not include is a cell type that represents all the other, non-specified cell types. This would allow the proportions to sum to exactly 1. As is, even if an expression profile does not match a brain cell, the neural network will try to place it in one of the three cell types. We therefore think an additional non-specified cell type would improve the results, since it would allow the algorithm to set aside expression profiles that do not activate nodes for brain cells.
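
One way to add such a catch-all category, sketched below under assumptions, is to append an `other` row holding whatever proportion the named cell types leave unaccounted for. The toy proportions and the helper name `add_other_cell_type` are hypothetical, and the sketch assumes the named cell types sum to at most 1 in every sample.

```r
# Toy proportions: 3 cell types (rows) x 2 samples (columns), made-up values
y_toy <- matrix(c(0.5, 0.2, 0.1,
                  0.4, 0.3, 0.3),
                nrow = 3,
                dimnames = list(c("astrocytes", "microglia", "oligodendrocytes"),
                                c("sample1", "sample2")))

# Append an "other" row with the remaining proportion of each sample
add_other_cell_type <- function(y) {
  other <- pmax(0, 1 - colSums(y))  # clamp tiny negative rounding errors to 0
  rbind(y, other = other)
}

y_with_other <- add_other_cell_type(y_toy)
print(y_with_other)
```

With this row in place, `tissue_composition` (taken from `dim(y)[1]`) would become 4, so the softmax output layer would gain one unit without further changes to the model definition.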