# Multi-Omic analysis

Given multiple datasets, each representing a different part of the chain between genomics and proteomics, we
seek to find biological pathways that drive or inhibit certain phenotypes.

The goal is to integrate these datasets such that the whole is more than the sum of the parts.

Ritchie et al. ("Methods of integrating data to uncover genotype-phenotype interactions") describe the following ways to integrate multi-omic data:

* Pathway or knowledge-based integration
* Concatenation-based: combine all datasets
* Model-based: create a model per dataset, then combine the models
* Transformation-based

We can also think of:
* Reduced normalised concatenation
* Model-based inter-omic transformation
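As a minimal sketch of what "reduced normalised concatenation" could look like: each omic is standardised per feature, reduced with PCA (computed through an SVD), and the reduced blocks are then placed side by side. The helper name `reduce_and_concatenate`, the number of components and the toy arrays are assumptions for illustration only; the sketch also assumes that all blocks share the same samples in the same order.

```python
import numpy as np

def reduce_and_concatenate(omics, n_components=2):
    """Standardise each omic, reduce it with PCA (via SVD), then concatenate.

    `omics` is a list of (samples x features) arrays with identical sample order.
    """
    reduced = []
    for X in omics:
        X = np.asarray(X, dtype=float)
        # z-score every feature so that no omic dominates through its scale
        std = X.std(axis=0)
        std[std == 0] = 1.0
        Z = (X - X.mean(axis=0)) / std
        # PCA scores are U * S from the thin SVD of the standardised matrix
        U, S, _ = np.linalg.svd(Z, full_matrices=False)
        k = min(n_components, len(S))
        reduced.append(U[:, :k] * S[:k])
    # one row per sample, one column per retained component of each omic
    return np.hstack(reduced)

rng = np.random.default_rng(0)
rna_toy = rng.normal(size=(20, 50))      # toy stand-in for an expression block
protein_toy = rng.normal(size=(20, 8))   # toy stand-in for a proteomics block
combined = reduce_and_concatenate([rna_toy, protein_toy], n_components=3)
print(combined.shape)
```

Reducing before concatenating keeps a very wide omic (such as methylation) from swamping a narrow one (such as proteomics) purely through its number of columns.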

Another subdivision is given by early, intermediate and late integration of omics with respect
to the identification of clusters/classes.

Per omic, we collect important features by:
* comparing the non-parametric distributions over the different classifications
* simply counting the occurrences and setting a cut-off point
* using the importances of the classification models as filters
* checking the summed weights of linear and non-linear dimensionality reducers
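The first option in the list above, comparing non-parametric distributions between classes, could be sketched with a Mann-Whitney U test per feature. The choice of test, the Bonferroni correction, the threshold `alpha` and the helper name `select_features` are assumptions made for this example, not necessarily what was used in the analysis; the data are random toy values.

```python
import numpy as np
from scipy import stats

def select_features(X, y, alpha=0.01):
    """Return the indices of features whose distribution differs between two classes."""
    classes = np.unique(y)
    if len(classes) != 2:
        raise ValueError("this sketch only handles two classes")
    a, b = X[y == classes[0]], X[y == classes[1]]
    # one two-sided Mann-Whitney U test per feature (column)
    pvals = np.array([
        stats.mannwhitneyu(a[:, j], b[:, j], alternative="two-sided").pvalue
        for j in range(X.shape[1])
    ])
    # Bonferroni correction for the number of tested features
    keep = np.flatnonzero(pvals < alpha / X.shape[1])
    return keep, pvals

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 30))
y_toy = np.repeat([0, 1], 50)
X_toy[y_toy == 1, :3] += 3.0   # only the first three features differ between classes
selected, pvals = select_features(X_toy, y_toy)
print(selected)
```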

We then have the choice to collect these features:
* greedily: all remaining omic features
* non-greedily: only overlapping features (by gene)
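In set terms, the greedy option is a union and the non-greedy option an intersection of the per-omic selections. A small sketch, in which the omic names are taken from the datasets used later and the gene identifiers are placeholders rather than results:

```python
# Per-omic sets of selected genes (placeholder identifiers, not real findings)
selected_per_omic = {
    "RNAex": {"geneA", "geneB", "geneC", "geneD"},
    "CNV": {"geneB", "geneC", "geneE"},
    "methylation": {"geneB", "geneC", "geneF"},
}

# greedy: keep every feature that survived in at least one omic
greedy = set.union(*selected_per_omic.values())

# non-greedy: keep only genes that survived in every omic
non_greedy = set.intersection(*selected_per_omic.values())

print(sorted(greedy))
print(sorted(non_greedy))
```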

To find inter- **and** intra-omic connections we can resort to a similarity measure.
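One possible similarity measure is the Pearson correlation between every feature of one omic and every feature of another, computed over the shared samples. The sketch below assumes two aligned (samples x features) arrays; the function name `cross_correlation` and the toy data are illustrative. Passing the same array twice gives the intra-omic case.

```python
import numpy as np

def cross_correlation(A, B):
    """Pearson correlation between every column of A and every column of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # standardise with the population standard deviation (ddof=0)
    A = (A - A.mean(axis=0)) / A.std(axis=0)
    B = (B - B.mean(axis=0)) / B.std(axis=0)
    # the mean of products of z-scores is the Pearson correlation
    return A.T @ B / A.shape[0]

rng = np.random.default_rng(0)
omic_a = rng.normal(size=(50, 4))
omic_b = rng.normal(size=(50, 3))
# make feature 0 of omic_b depend on feature 1 of omic_a
omic_b[:, 0] = omic_a[:, 1] + 0.1 * rng.normal(size=50)

inter = cross_correlation(omic_a, omic_b)   # inter-omic, shape (4, 3)
intra = cross_correlation(omic_a, omic_a)   # intra-omic, shape (4, 4)
print(inter.round(2))
```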



## Tools 

A zoo of techniques exists:

* sparse CCA (sCCA)
* sparse ICA (sICA)
* sparse PLS (sPLS)
* sparse PCA (sPCA)
* multiple co-inertia analysis (MCIA)
* **Similarity Network Fusion (SNF)**
* **Multi-Omic Factor Analysis (MOFA)**
* joint-NMF (jNMF)
* iNMF
* Affinity Network Fusion (ANF)
* **Weighted Gene Co-expression Network Analysis (WGCNA)**
* joint graphical LASSO 
* Inter-battery factor analysis (IBFA)
* Cross-modal factor analysis (CFA)
* Joint and individual variation explained (JIVE)
* Redundancy analysis (RDA)
* Canonical Correspondence analysis
* Bayesian consensus clustering (BCC)
* joint Affinity Propagation
* iCluster
* PARADIGM
* NEMO
* T-SVD
* Multiple Dataset Integration (MDI)
* Pattern Fusion Analysis (PFA)
* Multiple Factor Analysis (MFA)


Before we can appreciate these techniques and delve into them, we first need to get some basic intuition of
what multi-omics data means in practice.

# Basic intuition



## Similarity Network Fusion

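Unsupervised. Meant for **disease sub-typing**. Builds a patient-similarity network per dataset and iteratively fuses these networks into one, so that patient clusters from the different datasets are mixed.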

![image](../_images/SimilarityNetworkFusion.png)
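To make the fusion idea concrete, here is a deliberately simplified NumPy sketch of the SNF iteration: one affinity matrix per dataset, a row-normalised "full" kernel, a sparse K-nearest-neighbour "local" kernel, and a cross-diffusion update in which each network is smoothed by the average of the others. It loosely follows the published algorithm and is not the reference implementation; the function names, the parameters `K`, `mu` and `n_iter`, and the toy data are assumptions. It needs at least two datasets with the same patients in the same order.

```python
import numpy as np

def affinity(X, K=5, mu=0.5):
    """Scaled exponential similarity between the rows (patients) of X."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    # mean distance of each patient to its K nearest neighbours (self excluded)
    knn_mean = np.sort(d, axis=1)[:, 1:K + 1].mean(axis=1)
    eps = (knn_mean[:, None] + knn_mean[None, :] + d) / 3.0
    eps = np.maximum(eps, np.finfo(float).eps)
    return np.exp(-d ** 2 / (mu * eps))

def normalise_full(W):
    """Off-diagonal entries of each row sum to 1/2; the diagonal is set to 1/2."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def local_kernel(W, K=5):
    """Keep each patient's K strongest neighbours and row-normalise."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    idx = np.argsort(-W, axis=1)[:, :K]
    rows = np.arange(W.shape[0])[:, None]
    S = np.zeros_like(W)
    S[rows, idx] = W[rows, idx]
    return S / S.sum(axis=1, keepdims=True)

def fuse(datasets, K=5, mu=0.5, n_iter=10):
    """Fuse per-dataset patient networks by iterative cross-diffusion."""
    Ws = [affinity(X, K, mu) for X in datasets]
    Ps = [normalise_full(W) for W in Ws]
    Ss = [local_kernel(W, K) for W in Ws]
    for _ in range(n_iter):
        updated = []
        for v in range(len(Ps)):
            # average status matrix of all *other* datasets
            others = [P for k, P in enumerate(Ps) if k != v]
            avg = sum(others) / len(others)
            P_new = Ss[v] @ avg @ Ss[v].T
            updated.append(normalise_full((P_new + P_new.T) / 2.0))
        Ps = updated
    fused = sum(Ps) / len(Ps)
    return (fused + fused.T) / 2.0

# Toy example: 30 patients in two groups, measured in two "omics"
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 15)
X1 = rng.normal(size=(30, 5)) + 3.0 * labels[:, None]
X2 = rng.normal(size=(30, 4)) + 2.0 * labels[:, None]
fused = fuse([X1, X2])

same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(30, dtype=bool)
within = fused[same & off_diag].mean()
between = fused[~same].mean()
print(within > between)
```

The fused matrix can then be handed to a clustering method such as spectral clustering to obtain patient sub-types.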

## Multi-omics Factor Analysis

Unsupervised. Meant for **pathway analysis** and multi-omics dimension reduction.

![image](../_images/mofa_overview.png)

## Weighted Gene Co-expression Network Analysis

Single-omics dimension reduction for high-dimensional data, based on pairwise similarities, with extraction of eigengenes.
It is similar to Affinity Propagation and to some parts of Markov Clustering, but with fuzzy cluster assignment.

WGCNA is best seen as a suite of R tools that aims to give a comprehensive overview of the correlations from the perspective of different ontologies.

![image](../_images/WGCNA.png)
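Two of the core WGCNA ideas can be sketched in a few lines of NumPy: a soft-thresholded adjacency (absolute correlation raised to a power `beta`) and a module eigengene (the first principal component of the module's standardised genes). This is only an illustration of the concepts, not the R package; the function names, the value `beta=6`, the module membership and the toy data are assumptions.

```python
import numpy as np

def soft_threshold_adjacency(X, beta=6):
    """Unsigned co-expression adjacency: |Pearson correlation| ** beta."""
    corr = np.corrcoef(X, rowvar=False)   # genes are columns
    return np.abs(corr) ** beta

def module_eigengene(X, members):
    """First left singular vector of the module's standardised expression."""
    Z = X[:, members]
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, 0]   # one value per sample; the sign is arbitrary

rng = np.random.default_rng(0)
latent = rng.normal(size=40)                 # hidden driver of a toy module
expr = rng.normal(size=(40, 6))              # 40 samples, 6 genes
expr[:, :3] = latent[:, None] + 0.3 * rng.normal(size=(40, 3))

adjacency = soft_threshold_adjacency(expr)
eigengene = module_eigengene(expr, [0, 1, 2])
print(adjacency.round(2))
```

Raising the correlation to a power suppresses weak correlations without imposing a hard cut-off, which is what allows the soft (fuzzy) notion of module membership.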

In [23]:
import pandas as pd
from tqdm import tqdm
import os
from scipy import spatial
from scipy import sparse
import numpy as np

In [3]:
os.chdir('/media/bramiozo/DATA-FAST/genetic_expression')

In [36]:
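def _iterative_self_correlator(X, axis=0, min_corr=0.5, metric='cosine'):
    """Yield (similarity, i, j) for each pair of columns (axis=0) or rows (axis=1)
    of X whose absolute similarity exceeds min_corr."""
    # Always compare columns; transpose when rows should be compared instead.
    X = X if axis == 0 else X.T
    ran = X.shape[1]
    for c1 in tqdm(range(ran)):  # progress bar over the outer loop
        for c2 in range(c1+1, ran):
            # similarity = 1 - distance for the chosen scipy metric
            similarity = 1-spatial.distance.cdist(X[:, c1:c1+1].T,
                                                  X[:, c2:c2+1].T,
                                                  metric)[0][0]
            if abs(similarity) > min_corr:
                yield similarity, c1, c2

def intra_correlator(X, min_corr=0.5, metric='cosine'):
    """Collect the thresholded pairwise column similarities in a sparse COO matrix."""
    get_corrs = _iterative_self_correlator(X, min_corr=min_corr, metric=metric)
    dv, c1v, c2v = [], [], []
    for _c in get_corrs:
        dv.append(_c[0])
        c1v.append(_c[1])
        c2v.append(_c[2])
    n = X.shape[1]
    # only the upper triangle (c1 < c2) is filled
    return sparse.coo_matrix((dv, (c1v, c2v)), shape=(n, n))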


# https://deepgraph.readthedocs.io/en/latest/tutorials/pairwise_correlations.html
# https://cupy.dev/
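The pairwise loop above calls `cdist` once per column pair, which becomes very slow for tens of thousands of features. A possible alternative, sketched here for the cosine metric only, normalises the columns once and computes the similarities block by block with matrix products, keeping only the entries above the threshold. The function name `blockwise_column_similarity` and the `block` size are assumptions; each block still needs a dense (block x n_features) array in memory.

```python
import numpy as np
from scipy import sparse

def blockwise_column_similarity(X, min_corr=0.5, block=1024):
    """Sparse upper-triangular cosine similarity between the columns of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0          # avoid division by zero for empty columns
    Xn = X / norms                   # unit-length columns
    n = Xn.shape[1]
    vals, rows, cols = [], [], []
    for start in range(0, n, block):
        stop = min(start + block, n)
        # cosine similarity of this block of columns against all columns
        sim = Xn[:, start:stop].T @ Xn
        r, c = np.nonzero(np.abs(sim) > min_corr)
        r_global = r + start
        keep = c > r_global          # upper triangle only, no self-pairs
        vals.append(sim[r[keep], c[keep]])
        rows.append(r_global[keep])
        cols.append(c[keep])
    return sparse.coo_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(n, n),
    )

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 12))
S = blockwise_column_similarity(X_toy, min_corr=0.3, block=5)
print(S.shape, S.nnz)
```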

# Case study: lung cancer type differentiation

* Squamous cell carcinoma: 500 patients, age, gender
* Adenocarcinoma: 500 patients, age, gender

Data source: The Cancer Genome Atlas (TCGA).

## DNA mutation

In [15]:
mutation = pd.read_feather('hackathon_2/Lung_mutation.feather')
mutation.set_index('Sample_ID', inplace=True)

## DNA Copy Number Variation

In [7]:
CNV = pd.read_feather('hackathon_2/Lung_CNV.feather')
CNV.set_index('index', inplace=True)
print(CNV.shape)

(1017, 24791)


## RNA expression

In [4]:
RNAex=pd.read_feather('hackathon_2/Lung_RNAex.feather')
RNAex.set_index('index', inplace=True)
print(RNAex.shape)

(1135, 60465)


In [None]:
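# intra_correlator returns a sparse COO matrix; its .row/.col hold the feature index pairs
scorr = intra_correlator(RNAex.values, min_corr=0.5, metric='cosine')
ir, jr = scorr.row, scorr.col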

 32%|███▏      | 19119/60465 [4:28:21<8:24:30,  1.37it/s] 

## microRNA

In [9]:
# The file is named "mRNA", but this section treats it as microRNA (miRNA) data
miRNA = pd.read_feather('hackathon_2/Lung_mRNA.feather')
miRNA.set_index('index', inplace=True)
print(miRNA.shape)

(875, 2153)


## Methylation

In [16]:
methylation=pd.read_feather('hackathon_2/Lung_methylation.feather')
methylation.set_index('index', inplace=True)
print(methylation.shape)

(907, 311381)


## Proteomics


In [11]:
Proteome = pd.read_feather('hackathon_2/Lung_proteome.feather')
Proteome.set_index('index', inplace=True)
print(Proteome.shape)

(693, 276)


## Metadata

In [12]:
Meta = pd.read_feather('hackathon_2/Lung_meta.feather')
Meta.set_index('SampleID', inplace=True)
print(Meta.shape)

(1019, 46)
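The tables loaded above have different numbers of rows (693 to 1135), so any integration first has to restrict them to the samples they share. A small sketch, assuming that each table is indexed by a unique sample identifier that matches across tables (which should be checked for the actual files); the helper name `align_on_shared_samples` and the toy frames are illustrative.

```python
import pandas as pd

def align_on_shared_samples(frames):
    """Restrict every table to the samples that occur in all of them."""
    common = frames[0].index
    for frame in frames[1:]:
        common = common.intersection(frame.index)
    common = common.sort_values()
    # same rows, same order, in every returned table
    return [frame.loc[common] for frame in frames]

rna_toy = pd.DataFrame({"g1": [1.0, 2.0, 3.0]}, index=["s1", "s2", "s3"])
prot_toy = pd.DataFrame({"p1": [0.1, 0.2, 0.3]}, index=["s3", "s2", "s4"])
rna_aligned, prot_aligned = align_on_shared_samples([rna_toy, prot_toy])
print(list(rna_aligned.index))
```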


# Multi-omics analysis

## Reduction

## Concatenation

## SNF

[SNF](https://github.com/rmarkello/snfpy)

## MOFA

[MOFA2](https://github.com/bioFAM/MOFA2)

## Discussion