# Summary of the proposed idea

## Example data
The dataset I work with has three parts:

1. A matrix of normalized gene counts: each row is a human tissue sample and each column is the estimated expression of a specific gene.

2. A target array: the quantity we want to predict, e.g. HER2 receptor status (positive vs. negative), tumor stage (early vs. late), or tumor subtype (e.g. ductal, papillary, or metastatic). The target can be binary or multiclass. It can also be numeric, e.g. age at death (survival).

3. An array of gene names (the column names of the matrix in 1).

In [35]:
import numpy as np
import pandas as pd
a = np.load('data/BRCA_HER2_status_data.npy').transpose()
labels = np.load('data/BRCA_HER2_status_class.npy')
gene_names = np.load('data/BRCA_HER2_status_genes.npy')

In [36]:
# 1. example matrix
pd.DataFrame(a).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,60473,60474,60475,60476,60477,60478,60479,60480,60481,60482
0,0.079321,0.0,1.636884,0.0,2.070119,3.909544,0.0,0.007719,7.948489,0.073201,...,0.0,0.089308,0.0,0.0,0.129877,3.370021,0.208246,0.0,0.306455,0.0
1,0.032339,0.004531,1.782667,0.0,1.926196,3.540072,0.0,0.029193,6.887261,0.322512,...,0.0,0.007371,0.0,0.285674,0.0,3.988904,0.18073,0.0,0.709098,0.0
2,0.031695,0.0,1.513331,0.0,1.643069,3.525534,0.0,0.112598,7.534212,0.057845,...,0.0,0.15762,0.0,0.147026,0.151939,3.831975,0.231874,0.0,0.412883,0.0
3,0.0,0.0,2.067434,0.0,1.942403,3.560276,0.0,0.135735,7.743208,0.271353,...,0.0,0.062719,0.021173,0.124921,0.129128,4.067203,0.261164,0.0,0.519763,0.0
4,0.0,0.01917,2.051959,0.0,2.215758,3.365744,0.0,0.257722,6.970411,0.062656,...,0.0,0.046387,0.0,0.158857,0.056815,4.425612,0.167314,0.0,0.571092,0.0


In [39]:
# 2. example target classes: already preprocessed in a separate file to 0 and 1; check the log file for the class names referred to by 0 and 1

labels[1:10], len(labels)

(array([0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=int32), 767)

In [41]:
# gene names
gene_names[1:10], len(gene_names)

(array(['ENSG00000270112.3', 'ENSG00000167578.15', 'ENSG00000273842.1',
        'ENSG00000078237.5', 'ENSG00000146083.10', 'ENSG00000225275.4',
        'ENSG00000158486.12', 'ENSG00000198242.12', 'ENSG00000259883.1'],
       dtype=object), 60483)
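
Since the three parts are stored in separate files, it can help to check that they line up before modelling. The following is a minimal sketch, not part of the original pipeline: `check_dataset` is a hypothetical helper, and the toy arrays and gene IDs are made up for illustration.

```python
import numpy as np

def check_dataset(matrix, labels, gene_names):
    """Return (n_samples, n_genes) if the three parts line up, else raise ValueError."""
    n_samples, n_genes = matrix.shape
    if len(labels) != n_samples:
        raise ValueError(f"{len(labels)} labels for {n_samples} samples")
    if len(gene_names) != n_genes:
        raise ValueError(f"{len(gene_names)} gene names for {n_genes} columns")
    return n_samples, n_genes

# Toy stand-ins for the real arrays (4 samples, 3 genes); the gene IDs are made up.
toy_matrix = np.array([[0.1, 0.0, 1.6],
                       [0.0, 0.0, 1.8],
                       [0.3, 0.2, 1.5],
                       [0.0, 0.0, 2.1]])
toy_labels = np.array([0, 0, 1, 0], dtype=np.int32)
toy_genes = np.array(["GENE_A", "GENE_B", "GENE_C"], dtype=object)

shape = check_dataset(toy_matrix, toy_labels, toy_genes)
print(shape)
```

With the arrays loaded earlier, the equivalent call would be `check_dataset(a, labels, gene_names)`; a forgotten `.transpose()` would show up as a `ValueError`.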

## Previous research


Most previous studies handle each dataset separately. This is problematic because a single dataset is usually small, which leads to overfitting and to losing much information during dimensionality reduction. Below are some examples:

In this study (Wenric, Stephane, and Ruhollah Shemirani. "Using supervised learning methods for gene selection in RNA-Seq case-control studies." Frontiers in genetics 9 (2018).):

They filter the genes using three methods (differential expression, RF, and extreme pseudo-samples) and use the selected genes in the prediction model.

<img src="images/prev1.png">

In this paper (Torres, Rodrigo, et al. "A machine-learning classifier trained with microRNA ratios to distinguish melanomas from nevi." bioRxiv (2018): 507400.):

They use Boruta, a feature selection method built around random forests, and then fit an RF model on the selected features.

<img src="images/bruta.png" width="800">

In this paper (Jabeen, Almas, Nadeem Ahmad, and Khalid Raza. "Machine learning-based state-of-the-art methods for the classification of RNA-seq data." Classification in BioApps. Springer, Cham, 2018. 133-172.):

They summarize machine learning applications to gene expression data and describe, in the following image, the traditional workflow for modelling gene expression data.

<img src="images/traditional.png">

# Problem specification

The problem with the traditional modelling approach above is that it treats each dataset as a new problem, ignoring a priori knowledge from previous datasets. This causes several problems, such as:

1. It is hard to train accurate ML models on very small datasets (the usual case in gene expression data due to the high cost of data collection).

2. Every filtering step reduces dimensionality at the cost of losing information. For example, univariate filters run a correlation test between each feature and the target and discard features with low correlation. But two features may each be weak predictors alone and yet form a strong predictor when combined. Likewise, a feature may be specific to one class but not sensitive: a gene that is rarely expressed but, when expressed, always indicates a tumor will be removed if we filter out lowly expressed genes before modelling.
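
The weak-alone, strong-together case in point 2 can be illustrated with a toy example. The sketch below uses synthetic data, not the BRCA dataset: `g1` and `g2` are two hypothetical binarized genes, and the target is set to their XOR as an extreme assumption. Each gene alone is almost uncorrelated with the target, so a univariate correlation filter would drop both, even though together they determine it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Two hypothetical binarized genes (expressed / not expressed), independent of each other.
g1 = rng.integers(0, 2, size=n)
g2 = rng.integers(0, 2, size=n)
# Toy target: positive only when exactly one of the two genes is expressed (XOR).
y = g1 ^ g2

def abs_corr(x, target):
    """Absolute Pearson correlation between one feature and the target."""
    return abs(np.corrcoef(x.astype(float), target.astype(float))[0, 1])

corr_g1 = abs_corr(g1, y)           # what a univariate filter sees for gene 1
corr_g2 = abs_corr(g2, y)           # what a univariate filter sees for gene 2
corr_pair = abs_corr(g1 != g2, y)   # interaction of the two genes, never tested by the filter

print(f"|corr(g1, y)|       = {corr_g1:.3f}")
print(f"|corr(g2, y)|       = {corr_g2:.3f}")
print(f"|corr(g1 xor g2, y)| = {corr_pair:.3f}")
```

Real gene interactions are unlikely to be this clean, but the example shows why a filter that scores features one at a time can discard jointly informative genes.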


# Proposed solution


One common piece of a priori information is that the genes (the features used for classification) are the same across all datasets. Even if datasets from different experiments or generated by different machines are not directly comparable, we can still learn common patterns from several datasets, which can improve the modelling of new datasets.

We propose a transfer learning approach: build small NNs, each over a subset of genes, that can be trained on several datasets and learn more from each dataset they are trained on. The benefits of this approach are as follows:

1. It makes use of a priori information, which may improve prediction accuracy and generalization.

2. Each small NN might be specific to a biological pathway, which is very important in biology. For example, we might discover a NN with 10 genes that work together very well. If these 10 genes are already known to work together, that is fine, but if this information is new, it might lead to discovering a new biological mechanism behind cancer or whichever disease we are trying to predict.
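
To make the idea concrete, here is a rough sketch of one such small NN, written with NumPy only. Everything in it is an assumption for illustration: the data are synthetic, the gene subset is random, per-dataset standardization stands in for handling platform differences, and `SmallNet` and `make_dataset` are hypothetical names. The same weights are first trained on a larger dataset and then fine-tuned on a few samples from a second, differently scaled dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
N_GENES, SUBSET_SIZE, HIDDEN = 200, 10, 8

# Hypothetical gene subset (e.g. one candidate pathway): column indices shared by all datasets.
subset = rng.choice(N_GENES, size=SUBSET_SIZE, replace=False)
true_w = rng.normal(size=SUBSET_SIZE)  # synthetic signal carried by the subset

def make_dataset(n, scale, shift):
    """Synthetic expression matrix; scale/shift mimic a platform-specific batch effect."""
    latent = rng.normal(size=(n, N_GENES))
    y = (latent[:, subset] @ true_w > 0).astype(float)
    X = latent * scale + shift
    # Assumption: each gene is standardized within its own dataset before training.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return X, y

class SmallNet:
    """One-hidden-layer network that only sees the genes in `cols`."""

    def __init__(self, cols, hidden):
        self.cols = cols
        self.W1 = rng.normal(scale=1 / np.sqrt(len(cols)), size=(len(cols), hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=1 / np.sqrt(hidden), size=hidden)
        self.b2 = 0.0

    def forward(self, X):
        h = np.tanh(X[:, self.cols] @ self.W1 + self.b1)
        p = 1 / (1 + np.exp(-(h @ self.W2 + self.b2)))
        return h, p

    def fit(self, X, y, epochs=300, lr=0.5):
        """Full-batch gradient descent on cross-entropy, continuing from the current weights."""
        for _ in range(epochs):
            h, p = self.forward(X)
            dz2 = (p - y) / len(y)                        # gradient w.r.t. output logit
            dz1 = np.outer(dz2, self.W2) * (1 - h ** 2)   # backprop through tanh
            self.W2 -= lr * (h.T @ dz2)
            self.b2 -= lr * dz2.sum()
            self.W1 -= lr * (X[:, self.cols].T @ dz1)
            self.b1 -= lr * dz1.sum(axis=0)
        return self

    def accuracy(self, X, y):
        return float(((self.forward(X)[1] > 0.5) == y).mean())

X_a, y_a = make_dataset(400, scale=1.0, shift=0.0)  # larger, earlier dataset
X_b, y_b = make_dataset(260, scale=2.0, shift=3.0)  # new dataset from a "different machine"

net = SmallNet(subset, HIDDEN).fit(X_a, y_a)         # pre-train on dataset A
w_pretrained = net.W1.copy()
net.fit(X_b[:60], y_b[:60], epochs=50, lr=0.1)       # fine-tune on 60 samples of dataset B
acc_b = net.accuracy(X_b[60:], y_b[60:])             # evaluate on the rest of dataset B
print(f"held-out accuracy on dataset B: {acc_b:.2f}")
```

In the actual proposal there would be many such networks, one per gene subset, and how the subsets are chosen and how their outputs are combined is left open here.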