
Produce robust clr matrix as output. #59

Open · wants to merge 1 commit into master

Conversation

@davised davised commented May 12, 2020

Some users, myself included, would benefit from being able to access the rclr-transformed values.

I wrote this quick addition to return the rclr matrix and write it out as a TSV.

See issue #51 as an example.

I've been unhappy with the clr transforms implemented in R, but so far I'm happy with this one.

Thanks for putting in the time to implement the matrix completion and rclr functions.
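
As a minimal sketch of the usage this enables (following the call pattern used later in this thread; 'table.biom' and 'rclr.tsv' are illustrative names, and the assumption that the returned matrix is a pandas DataFrame comes from how robust_clr is indexed in the code below):

from biom import load_table
from deicode.rpca import auto_rpca

# with this change, auto_rpca also returns the rclr-transformed matrix
table = load_table('table.biom')
ordination, distance, robust_clr = auto_rpca(table)

# write the rclr matrix out as a TSV
robust_clr.to_csv('rclr.tsv', sep='\t')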

@cameronmartino cameronmartino self-requested a review May 30, 2020 03:44
@madeleineernst

Thanks for this addition, very useful. Using the rclr output matrix, I compared imputed vs. non-imputed data for a feature with no missing values. For such features, I would expect no change after imputation. However, all values seem to be different after imputation, with no correlation to the original data. Is this expected behavior, and if so, why? Or could feature IDs get swapped somewhere along the way? Many thanks in advance for any clarifications!

@mortonjt
Collaborator

mortonjt commented Feb 22, 2021 via email

@madeleineernst

Hi Jamie,

Many thanks for your quick reply. Unfortunately, I won't be able to post a subset of the actual data here due to data privacy issues. However, I have created a random subset of features and blank samples that illustrates my question.
For example, if I plot feature '1' of the attached data table before and after imputation, I would expect a perfect correlation; in other words, I don't expect any change in the data structure, since feature '1' did not contain any zero values. I have also tried plotting the rclr-transformed original table (without imputation) against the imputed values, but the same pattern persists. It seems that imputation changes the data structure even for features where no imputation is needed?

Best,
Madeleine
Blanks.txt

import numpy as np
import matplotlib.pyplot as plt
from biom import load_table
from deicode.rpca import auto_rpca

# run RPCA; robust_clr is the rclr matrix returned by this PR
table = load_table('Blanks.biom')
ordination, distance, robust_clr = auto_rpca(table)
table = table.to_dataframe().T

# feature '1' has no zeros, so I expect it to be unchanged by imputation
x = robust_clr['1']
y = table['1']

print(np.corrcoef(x, y))

plt.scatter(x, y)
plt.title('Imputed vs non-imputed data')
plt.xlabel('Imputed')
plt.ylabel('Non-imputed')
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), color='red')
plt.show()

@cameronmartino
Collaborator

Hi @madeleineernst,

This is a great question. In addition to @mortonjt's valid points, there are a few things going on here.

  1. There is a transformation of the data before RPCA is run. To transform only the non-zero values, we use an altered clr transform (see the short sketch after this list). We go into depth on how that transformation is done in the paper (here), in the section Preprocessing with rclr. So we need to compare only the observed (non-zero) values between the unimputed rclr and the output from RPCA.
  2. The RPCA function was not really meant to be used for imputation, so we will need to alter this PR a bit into a standalone function. The lack of correlation likely comes from the re-centering step that runs after the MatrixCompletion function (see here).
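
For reference, a rough numpy sketch of the rclr idea from point 1: take logs of the observed (non-zero) values only and center each sample by the geometric mean of those observed values. This is only an illustration of the idea, not deicode's exact implementation (which lives in deicode.preprocessing.rclr).

import numpy as np

def rclr_sketch(counts):
    # counts: samples x features array of non-negative values
    mat = np.array(counts, dtype=float)
    mat[mat == 0] = np.nan                      # treat zeros as missing, not as values
    log_mat = np.log(mat)                       # log of observed values only
    # subtract each sample's mean log over observed entries,
    # i.e. the log of the geometric mean of its non-zero values
    return log_mat - np.nanmean(log_mat, axis=1, keepdims=True)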

If you wanted to compare the observed (non-zero) values 1:1 as an imputation you would need to do the following:

import numpy as np
import pandas as pd
from biom import load_table
from deicode.preprocessing import rclr
from deicode.matrix_completion import MatrixCompletion
import matplotlib.pyplot as plt


# import table
table = load_table('Blanks.biom')

# Robust-clr (rclr) preprocessing and OptSpace (i.e. RPCA)
# As you increase the rank (n_components) the correlation will get tighter (and vice versa).
# ^ See the low-rank assumption for why. 
opt = MatrixCompletion(n_components=10).fit(rclr(table.to_dataframe().T)) 
# reconstruct imputed data
robust_clr_imputed = opt.sample_weights @ opt.s @ opt.feature_weights.T

# do pre-processing step within RPCA
table = table.to_dataframe().T
robust_clr_unimputed = rclr(table)

# use this to mask any "imputed" values
zero_mask = (table == 0.0)

# re-mask imputed values for comparison
robust_clr_imputed_masked = robust_clr_imputed.copy()
robust_clr_imputed_masked[zero_mask] = np.nan
 
# flatten and keep only observed
x = robust_clr_unimputed.ravel()
y = robust_clr_imputed_masked.ravel()
x = x[np.isfinite(x)]
y = y[np.isfinite(y)]

# print the correlation between values
print(np.corrcoef(x, y))

# plot the observed values
plt.scatter(x, y)
plt.title('Observed values from imputed vs. non-imputed data')
plt.xlabel('Observed Imputed')
plt.ylabel('Observed Non-imputed')
plt.plot(np.unique(x),
         np.poly1d(np.polyfit(x, y, 1))(np.unique(x)),
         color='red')
plt.show()

giving

[[1.         0.96985866]
 [0.96985866 1.        ]]


@madeleineernst

Hi @cameronmartino,

That was very useful, many thanks for the thorough clarifications and code!

My original data matrix consists of approximately 2,000-3,000 features and 600 samples. It turns out that if I increase the n_components parameter to 100, I can reach a correlation between imputed and non-imputed non-zero values of up to 0.88, while at the same time being able to impute more than 65% of the sparse data table. If I use an n_components of 10, the correlation is only 0.7.

Would you recommend optimizing the n_components parameter until a satisfactory correlation is achieved while still imputing a reasonable amount of data, even if n_components gets as high as 200-300 or more? I understand that the rank can be at most 600 if my matrix has dimensions 600 × 3000 (for an m × n matrix A, rank(A) ≤ min(m, n)). In the documentation you suggest n_components to be between 1 and 10.

@cameronmartino
Collaborator

Glad that helped @madeleineernst. Just like all dimensionality reduction methods (e.g. PCoA, PCA, etc.), the OptSpace method used here depends on a low-rank assumption. That means we assume there are some latent categories: in the microbiome, a simple latent category might be sample type, and in the case of movies it might be genre. You are correct that the maximum value the rank (n_components) can take is the smallest dimension of the input matrix. The low-rank assumption is essential for making the problem tractable at scale, and increasing the rank will dramatically increase the runtime (see the rough sweep sketched below). That is why we suggest keeping it low (below 10), but your estimates may get better as you increase the rank. All of this is why we really focused on this tool as a dimensionality reduction rather than an imputation method.
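
To make the rank trade-off concrete, here is a small illustrative sweep over n_components using the same MatrixCompletion pattern as in the earlier snippet. The ranks chosen and the observed-entry comparison are just an example, not a recommended procedure:

import numpy as np
from biom import load_table
from deicode.preprocessing import rclr
from deicode.matrix_completion import MatrixCompletion

table = load_table('Blanks.biom').to_dataframe().T
observed = (table != 0.0).values            # mask of observed (non-zero) entries
unimputed = np.asarray(rclr(table))

for rank in (3, 5, 10, 25, 50):
    # higher rank fits the observed values more closely but runs slower
    opt = MatrixCompletion(n_components=rank).fit(rclr(table))
    imputed = opt.sample_weights @ opt.s @ opt.feature_weights.T
    # correlate only the observed entries, as in the comparison above
    r = np.corrcoef(unimputed[observed], imputed[observed])[0, 1]
    print(rank, round(r, 3))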
