
Produce robust clr matrix as output. #59

Open · wants to merge 1 commit into master

Conversation

@davised davised commented May 12, 2020

Some users, myself included, would benefit from being able to access the rclr-transformed values.

I wrote this quick addition to return the rclr matrix and write it out as a TSV.

See issue #51 as an example.

I've been unhappy with the clr transforms implemented in R, but so far I'm happy with this one.

Thanks for putting in the time to implement the matrix completion and rclr functions.
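
As a minimal sketch of the usage this enables (following the call pattern used later in this thread; 'table.biom' and 'rclr.tsv' are illustrative names, and the assumption that the returned matrix is a pandas DataFrame comes from how robust_clr is indexed in the code below):

from biom import load_table
from deicode.rpca import auto_rpca

# with this change, auto_rpca also returns the rclr-transformed matrix
table = load_table('table.biom')
ordination, distance, robust_clr = auto_rpca(table)

# write the rclr matrix out as a TSV
robust_clr.to_csv('rclr.tsv', sep='\t')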

@cameronmartino cameronmartino self-requested a review May 30, 2020 03:44
@madeleineernst

Thanks for this addition, very useful. Using the rclr output matrix, I compared imputed vs. non-imputed data for a feature with no missing values. For such features, I would expect no change after imputation. However, all values seem to be different after imputation, with no correlation to the original data. Is this expected behavior, and if so, why? Or could feature IDs get swapped somewhere along the way? Many thanks in advance for any clarifications!

@mortonjt
Collaborator

mortonjt commented Feb 22, 2021 via email

@madeleineernst

Hi Jamie,

Many thanks for your quick reply. Unfortunately, I won't be able to post a subset of the actual data here due to data privacy issues. However, I have created a random subset of features and blank samples that illustrates my question.
For example, if I plot feature '1' of the attached data table before and after imputation, I would expect a perfect correlation; in other words, I don't expect any change in the data structure, since feature '1' did not contain any zero values. I have also tried plotting the rclr-transformed original table (without imputation) against the imputed values, but the same pattern persists. It seems that imputation changes the data structure even for features where no imputation is needed?

Best,
Madeleine
Blanks.txt

import numpy as np
import matplotlib.pyplot as plt
from biom import load_table
from deicode.rpca import auto_rpca

# run RPCA; robust_clr is the rclr matrix returned by this PR
table = load_table('Blanks.biom')
ordination, distance, robust_clr = auto_rpca(table)
table = table.to_dataframe().T

# feature '1' has no zeros, so I expect it to be unchanged by imputation
x = robust_clr['1']
y = table['1']

print(np.corrcoef(x, y))

plt.scatter(x, y)
plt.title('Imputed vs non-imputed data')
plt.xlabel('Imputed')
plt.ylabel('Non-imputed')
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), color='red')
plt.show()

@cameronmartino
Collaborator

Hi @madeleineernst,

This is a great question. In addition to @mortonjt's valid points, there are a few things going on here.

  1. There is a transformation of the data before RPCA is run. To transform only the non-zero values, we use an altered clr transform (see the short sketch after this list). We go into depth on how that transformation is done in the paper (here), in the section Preprocessing with rclr. So we need to compare only the observed (non-zero) values between the unimputed rclr and the output from RPCA.
  2. The RPCA function was not really meant to be used for imputation, so we will need to alter this PR a bit into a standalone function. The lack of correlation likely comes from the re-centering step that runs after the MatrixCompletion function (see here).
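
For reference, a rough numpy sketch of the rclr idea from point 1: take logs of the observed (non-zero) values only and center each sample by the geometric mean of those observed values. This is only an illustration of the idea, not deicode's exact implementation (which lives in deicode.preprocessing.rclr).

import numpy as np

def rclr_sketch(counts):
    # counts: samples x features array of non-negative values
    mat = np.array(counts, dtype=float)
    mat[mat == 0] = np.nan                      # treat zeros as missing, not as values
    log_mat = np.log(mat)                       # log of observed values only
    # subtract each sample's mean log over observed entries,
    # i.e. the log of the geometric mean of its non-zero values
    return log_mat - np.nanmean(log_mat, axis=1, keepdims=True)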

If you wanted to compare the observed (non-zero) values 1:1 as an imputation you would need to do the following:

import numpy as np
import pandas as pd
from biom import load_table
from deicode.preprocessing import rclr
from deicode.matrix_completion import MatrixCompletion
import matplotlib.pyplot as plt


# import table
table = load_table('Blanks.biom')

# Robust-clr (rclr) preprocessing and OptSpace (i.e. RPCA)
# As you increase the rank (n_components) the correlation will get tighter (and vice versa).
# ^ See the low-rank assumption for why. 
opt = MatrixCompletion(n_components=10).fit(rclr(table.to_dataframe().T)) 
# reconstruct imputed data
robust_clr_imputed = opt.sample_weights @ opt.s @ opt.feature_weights.T

# do pre-processing step within RPCA
table = table.to_dataframe().T
robust_clr_unimputed = rclr(table)

# use this to mask any "imputed" values
zero_mask = (table == 0.0)

# re-mask imputed values for comparison
robust_clr_imputed_masked = robust_clr_imputed.copy()
robust_clr_imputed_masked[zero_mask] = np.nan
 
# flatten and keep only observed
x = robust_clr_unimputed.ravel()
y = robust_clr_imputed_masked.ravel()
x = x[np.isfinite(x)]
y = y[np.isfinite(y)]

# print the correlation between values
print(np.corrcoef(x, y))

# plot the observed values
plt.scatter(x, y)
plt.title('Observed values from imputed vs. non-imputed data')
plt.xlabel('Observed Imputed')
plt.ylabel('Observed Non-imputed')
plt.plot(np.unique(x),
         np.poly1d(np.polyfit(x, y, 1))(np.unique(x)),
         color='red')
plt.show()

giving

[[1.         0.96985866]
 [0.96985866 1.        ]]


@madeleineernst

Hi @cameronmartino,

That was very useful, many thanks for the thorough clarifications and code!

My original data matrix consists of approximately 2,000-3,000 features and 600 samples. It turns out that if I increase the n_components parameter to 100, I can reach a correlation between imputed and non-imputed non-zero values of up to 0.88, while at the same time being able to impute more than 65% of the sparse data table. If I use an n_components of 10, the correlation is only 0.7.

Would you recommend optimizing the n_components parameter until a satisfactory correlation is achieved while still imputing a reasonable amount of data, even if n_components gets as high as 200-300 or more? I understand that the rank can be at most 600 if my matrix has dimensions 600 × 3000 (for an m × n matrix A, rank(A) ≤ min(m, n)). In the documentation you suggest n_components to be between 1 and 10.

@cameronmartino
Collaborator

Glad that helped @madeleineernst. Just like all dimensionality reduction methods (e.g. PCoA, PCA, etc.), the OptSpace method used here depends on a low-rank assumption. That means we assume there are some latent categories: in the microbiome, a simple latent category might be sample type, and in the case of movies it might be genre. You are correct that the maximum value the rank (n_components) can take is the smallest dimension of the input matrix. The low-rank assumption is essential for making the problem tractable at scale, and increasing the rank will dramatically increase the runtime (see the rough sweep sketched below). That is why we suggest keeping it low (below 10), but your estimates may get better as you increase the rank. All of this is why we really focused on this tool as a dimensionality reduction rather than an imputation method.
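
To make the rank trade-off concrete, here is a small illustrative sweep over n_components using the same MatrixCompletion pattern as in the earlier snippet. The ranks chosen and the observed-entry comparison are just an example, not a recommended procedure:

import numpy as np
from biom import load_table
from deicode.preprocessing import rclr
from deicode.matrix_completion import MatrixCompletion

table = load_table('Blanks.biom').to_dataframe().T
observed = (table != 0.0).values            # mask of observed (non-zero) entries
unimputed = np.asarray(rclr(table))

for rank in (3, 5, 10, 25, 50):
    # higher rank fits the observed values more closely but runs slower
    opt = MatrixCompletion(n_components=rank).fit(rclr(table))
    imputed = opt.sample_weights @ opt.s @ opt.feature_weights.T
    # correlate only the observed entries, as in the comparison above
    r = np.corrcoef(unimputed[observed], imputed[observed])[0, 1]
    print(rank, round(r, 3))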
