### Dimensionality Reduction
#### Part II
Determining a metric by which to compare dimensionality reduction techniques

In [31]:
%matplotlib inline
import tSNE_utils # load utility functions from local file

import numpy as np
np.seterr(divide='ignore', invalid='ignore') # allow divide by zero for normalization
import pandas as pd
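import scipy as sc
import scipy.stats, scipy.spatial # explicitly load the submodules used below as sc.stats and sc.spatial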

# scikit packages
import skbio
from sklearn.preprocessing import normalize
from sklearn.preprocessing import maxabs_scale
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# import tsne packages
from sklearn.manifold import TSNE
import sys; sys.path.append('/Users/Cody/git/FIt-SNE')
from fast_tsne import fast_tsne

# plotting packages
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style = 'whitegrid')

Let's load data from three replicates (each containing 8 sets of 375 cells), normalize them in a couple of different ways, perform PCA, and look at correlations.

In [43]:
%%time

# load the hdf5 files into dictionary
reps = {
'r00':tSNE_utils.read_hdf5('inputs/GSE102698ClosenessRep_0.hdf5'),
'r01':tSNE_utils.read_hdf5('inputs/GSE102698ClosenessRep_1.hdf5'),
'r02':tSNE_utils.read_hdf5('inputs/GSE102698ClosenessRep_2.hdf5')
}

CPU times: user 7.48 s, sys: 1.27 s, total: 8.75 s
Wall time: 9.22 s


I want to test two normalization methods:
* Fractional read counts: each gene count divided by the sum of all gene counts for that cell
* `sklearn.preprocessing.normalize`, which scales each vector to unit norm and also accepts sparse matrices. We'll use the default `norm = 'l2'`
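
To make the difference between the two normalizations concrete, here is a minimal sketch on a made-up count matrix. It assumes rows are cells and columns are genes; the `counts` array is hypothetical and is not taken from the replicate files.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical toy counts, assumed shape (n_cells, n_genes): 3 cells x 4 genes
counts = np.array([[10, 0, 5, 5],
                   [ 0, 2, 2, 0],
                   [ 1, 1, 1, 1]], dtype=float)

# Fractional counts: divide each cell (row) by its total count
frac = counts / counts.sum(axis=1, keepdims=True)

# L2 normalization: divide each cell (row) by its Euclidean norm
l2 = normalize(counts, norm='l2', axis=1)

print(frac.sum(axis=1))            # every row sums to 1
print(np.linalg.norm(l2, axis=1))  # every row has unit length
```

Passing `axis=0` to either operation would instead scale each column, so the right choice depends on whether cells are stored as rows or as columns.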
  
And then transform the normalized matrices with two functions:
* arcsinh(norm*1000)
* log2(norm+1)

In [46]:
def arcsinh_norm(norm, scale = 1000):
    '''
    Perform an arcsinh-transformation on a np.ndarray containing normalized data of shape=(n_cells,n_genes).
    Useful for feeding into PCA or tSNE.
        norm = normalized data to transform.
        scale = factor to multiply values by before arcsinh-transform.
        Scales values away from [0,1] in order to make arcsinh more effective.
    '''
    return np.arcsinh(norm * scale)

def log2_norm(norm, frac = True):
    '''
    Perform a log2-transformation on a np.ndarray containing normalized data of shape=(n_cells,n_genes).
    Useful for feeding into PCA or tSNE.
        norm = normalized data to transform.
        frac = currently unused.
    '''
    return np.log2(norm + 1)
    

Using the 0.75 "closeness" set from replicate `r00` as an example, the log2 transformation of the fractional counts gives a smaller __Wasserstein ("Earth Mover's") Distance__ (EMD) from the normalized values than the arcsinh transformation does.

In [57]:
r00 = reps['r00'] # first replicate
# get fractional counts
# note: axis=0 sums down columns; use axis=1 with keepdims=True if cells are stored as rows
r00_075 = np.nan_to_num(r00['Close_0.75'] / np.sum(r00['Close_0.75'], axis=0))

r00_075_sinh = arcsinh_norm(r00_075) # arcsinh transform
# calculate EMD from normalized to normalized-transformed
print('arcsinh EMD: {}'.format(sc.stats.wasserstein_distance(r00_075.flatten(), r00_075_sinh.flatten())))

r00_075_log = log2_norm(r00_075) # log2 transform
# calculate EMD from normalized to normalized-transformed
print('log2 EMD: {}\n'.format(sc.stats.wasserstein_distance(r00_075.flatten(), r00_075_log.flatten())))

arcsinh EMD: 0.23876311381274876
log2 EMD: 0.0006935577783836777
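
One way to read these numbers: when both samples have the same size and one is a monotonic transform of the other, the 1-D EMD reduces to the mean absolute amount by which the transform shifts each value. The sketch below checks this on synthetic fractional counts. The Dirichlet-sampled `frac` matrix is an assumption made for illustration and is not the replicate data.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Synthetic stand-in for fractional counts: 50 cells x 200 genes, each row sums to 1
frac = rng.dirichlet(np.ones(200), size=50)

sinh = np.arcsinh(frac * 1000)  # same transform as arcsinh_norm with scale=1000
log2 = np.log2(frac + 1)        # same transform as log2_norm

emd_sinh = wasserstein_distance(frac.ravel(), sinh.ravel())
emd_log2 = wasserstein_distance(frac.ravel(), log2.ravel())

# Mean absolute shift applied by each transform
shift_sinh = np.mean(np.abs(sinh - frac))
shift_log2 = np.mean(np.abs(log2 - frac))

print('arcsinh EMD: {}, mean shift: {}'.format(emd_sinh, shift_sinh))
print('log2 EMD: {}, mean shift: {}'.format(emd_log2, shift_log2))
```

On values close to zero, `log2(x + 1)` is nearly a linear rescaling, so it moves each value very little, while `arcsinh(1000 * x)` stretches the values considerably. Under this reading, a small EMD says the transform changed the numbers little; it does not by itself say how well the relationships between cells were preserved.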



But I'm not sure if that's the best metric...  
Let's look at the __correlation of distance matrices__ between the raw data and the two normalized/transformed matrices.  
This will tell us how much each sample (or cell) in the matrix was moved relative to its neighbors following the transformation.  
(a higher R value is better here)
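
As a rough illustration of the idea, the sketch below correlates the pairwise cell-cell distances of two matrices. It is a stand-in written for this note: `pairwise_distance_corr` is a hypothetical helper, the data is synthetic, and the actual `tSNE_utils.corr_distances` implementation is not shown here and may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def pairwise_distance_corr(a, b):
    """Pearson R between the pairwise Euclidean distances of a and b (rows are cells)."""
    d_a = pdist(a)  # condensed vector of cell-cell distances in a
    d_b = pdist(b)  # same cell pairs, measured in b
    r, _ = pearsonr(d_a, d_b)
    return r

rng = np.random.default_rng(0)
raw = rng.poisson(5, size=(30, 100)).astype(float)  # synthetic counts, 30 cells x 100 genes

r_scaled = pairwise_distance_corr(raw, raw * 2.0)      # uniform rescaling keeps relative distances
r_log = pairwise_distance_corr(raw, np.log2(raw + 1))  # nonlinear transform distorts them somewhat

print('rescaled R: {}\nlog2 R: {}'.format(r_scaled, r_log))
```

Because the entries of a distance matrix are not independent, the p-value from `pearsonr` is not meaningful here; a Mantel test estimates significance by permutation instead. scikit-bio, imported above, offers `skbio.stats.distance.mantel` for that purpose.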

In [59]:
raw_dist = sc.spatial.distance_matrix(reps['r00']['Close_0.75'], reps['r00']['Close_0.75']) # cell-cell distances in the raw data
sinh_R,sinh_p,sinh_n = tSNE_utils.corr_distances(raw_dist, sc.spatial.distance_matrix(r00_075_sinh,r00_075_sinh), plot_out=False)
log2_R,log2_p,log2_n = tSNE_utils.corr_distances(raw_dist, sc.spatial.distance_matrix(r00_075_log,r00_075_log), plot_out=False)

print('arcsinh Mantel R: {}\nlog2 Mantel R: {}\n'.format(sinh_R, log2_R))

arcsinh Mantel R: 0.33728508594344886
log2 Mantel R: 0.34900162855676653



In this case, the R value of the Mantel test comparing the cell-cell distances from the raw data to those in the transformed data is slightly higher for the log2-transformed counts as well. Seems compelling, but I want to test further.  
  
Let's do the same exercise with the `sklearn.preprocessing.normalize()` function to see whether it behaves differently from (or better than) the manually calculated fractional counts:

In [61]:
# scale to unit L2 norm (square root of sum of squares) along axis=0, i.e. per column
r00_075 = normalize(reps['r00']['Close_0.75'], axis=0, norm='l2')

r00_075_sinh = arcsinh_norm(r00_075) # arcsinh transform
# calculate EMD from normalized to normalized-transformed
print('arcsinh EMD: {}'.format(sc.stats.wasserstein_distance(r00_075.flatten(), r00_075_sinh.flatten())))

r00_075_log = log2_norm(r00_075) # log2 transform
# calculate EMD from normalized to normalized-transformed
print('log2 EMD: {}\n'.format(sc.stats.wasserstein_distance(r00_075.flatten(), r00_075_log.flatten())))

arcsinh EMD: 0.39737249207984604
log2 EMD: 0.002872566345361386



With this normalization, both transformed matrices sit a bit farther from the normalized matrix by EMD. No sweat, let's look at the correlation of the distance matrices to the raw data:

In [62]:
raw_dist = sc.spatial.distance_matrix(reps['r00']['Close_0.75'], reps['r00']['Close_0.75']) # cell-cell distances in the raw data
sinh_R,sinh_p,sinh_n = tSNE_utils.corr_distances(raw_dist, sc.spatial.distance_matrix(r00_075_sinh,r00_075_sinh), plot_out=False)
log2_R,log2_p,log2_n = tSNE_utils.corr_distances(raw_dist, sc.spatial.distance_matrix(r00_075_log,r00_075_log), plot_out=False)

print('arcsinh Mantel R: {}\nlog2 Mantel R: {}\n'.format(sinh_R, log2_R))

arcsinh Mantel R: 0.34263435505647516
log2 Mantel R: 0.3390754271795316



Uh-oh, the ranking by Mantel R flipped: arcsinh now scores slightly higher than log2.  
This tells me that these methods may be roughly equivalent in terms of utility, or maybe it's just this one dataset that's ambiguous...

In [None]:
# initiate df for dumping correlation data into
corr_out = pd.DataFrame()

for rep in reps.keys():
    for key in reps[rep].keys():
        
        frac = np.nan_to_num(reps[rep][key] / np.sum(reps[rep][key], axis=0)) # get fractional counts
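
The loop above stops after computing the fractional counts. One possible way to carry the comparison through every replicate and set is sketched below on synthetic data. Everything here is an assumption made for illustration: `toy_reps`, the set names `set_a` and `set_b`, the column names, and the use of `pdist` plus `pearsonr` in place of `tSNE_utils.corr_distances`. Rows are assumed to be cells, so totals are taken along `axis=1`.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Synthetic stand-in for `reps`: {replicate: {set name: counts of shape (n_cells, n_genes)}}
toy_reps = {
    rep: {key: rng.poisson(3, size=(20, 50)).astype(float) for key in ('set_a', 'set_b')}
    for rep in ('r00', 'r01')
}

# The two transformations under comparison
transforms = {
    'arcsinh': lambda x: np.arcsinh(x * 1000),
    'log2': lambda x: np.log2(x + 1),
}

rows = []
for rep, sets in toy_reps.items():
    for key, counts in sets.items():
        raw_dist = pdist(counts)  # cell-cell distances in the raw data
        # fractional counts per cell (row); nan_to_num guards against all-zero cells
        with np.errstate(divide='ignore', invalid='ignore'):
            frac = np.nan_to_num(counts / counts.sum(axis=1, keepdims=True))
        for name, func in transforms.items():
            r, _ = pearsonr(raw_dist, pdist(func(frac)))
            rows.append({'rep': rep, 'set': key, 'transform': name, 'R': r})

# One row per (replicate, set, transform) combination
corr_toy = pd.DataFrame(rows)
print(corr_toy)
```

Collecting the rows in a list and building the DataFrame once at the end avoids repeatedly appending to a DataFrame inside the loop.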

In [38]:
d1 = sc.stats.wasserstein_distance(test1.flatten(), test2.flatten())
d2 = sc.stats.wasserstein_distance(test1.flatten(), test3.flatten())

print('Wasserstein Distances:\n\n\nNormalize "l2": {}\nMaxabs Scaled: {}'.format(d1,d2))

Wasserstein Distances:


Normalize "l2": 0.0016846641844954862
Maxabs Scaled: 0.026490231334317764
