Produce robust clr matrix as output. #59
base: master
Conversation
Hi Madeline, could you include a subset of the data to give us a better idea of what we are looking at?
If it helps, RCLR makes a missing-at-random assumption, so it is possible that there are microbes/metabolites at high abundance that aren't detected due to some random process. Depending on how you are simulating / stratifying your data, you could be seeing scenarios where this assumption is clearly violated.
Also, I should probably mention that RCLR performs a log-ratio transformation, which is why there are some negative values here.
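For intuition, here is a minimal sketch of the log-ratio idea (not DEICODE's exact implementation): each observed value is divided by the geometric mean of the observed values and then logged, so anything below that mean comes out negative, while zeros stay missing.

```python
import numpy as np

# Toy composition with a zero (hypothetical values for illustration).
x = np.array([10.0, 100.0, 0.0, 1.0])

# rclr sketch: log of each observed value over the geometric mean
# of the *observed* (non-zero) values; zeros remain missing (NaN).
observed = x[x > 0]
gmean = np.exp(np.log(observed).mean())
with np.errstate(divide='ignore'):
    rclr_x = np.where(x > 0, np.log(x / gmean), np.nan)

# Values below the geometric mean come out negative;
# values above it come out positive; the zero stays NaN.
print(rclr_x)
```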
…On Mon, Feb 22, 2021 at 9:13 AM madeleineernst ***@***.***> wrote:
Thanks for this addition, very useful. Using the rclr output matrix I
compared imputed vs non-imputed data, for a feature with no missing values.
For such features, I would expect no change after imputation. However, all
values seem to be different after imputation, with no correlation to the
original data. Is this an expected behavior, and if yes why? Or could
feature IDs eventually get swapped along the way? Many thanks in advance
for any clarifications!
[image: image]
<https://user-images.githubusercontent.com/2773739/108735717-2a645d80-7531-11eb-8aa3-b66b4965d51e.png>
Hi Jamie, Many thanks for your quick reply. Unfortunately, I won't be able to post a subset of the actual data here, due to data privacy issues. However, I have created a random subset of features and blank samples that illustrates my doubt. Best,

```python
table = load_table('Blanks.biom')
x = robust_clr['1']
print(np.corrcoef(x, y))
plt.scatter(x, y)
```
Hi @madeleineernst, There are a few things going on here, in addition to @mortonjt's valid points, but this is a great question.
If you wanted to compare the observed (non-zero) values 1:1 as an imputation, you would need to do the following:

```python
import numpy as np
import pandas as pd
from biom import load_table
from deicode.preprocessing import rclr
from deicode.matrix_completion import MatrixCompletion
import matplotlib.pyplot as plt

# import table
table = load_table('Blanks.biom')

# Robust-clr (rclr) preprocessing and OptSpace (i.e. RPCA).
# As you increase the rank (n_components) the correlation will
# get tighter (and vice versa); see the low-rank assumption for why.
opt = MatrixCompletion(n_components=10).fit(rclr(table.to_dataframe().T))
# reconstruct imputed data
robust_clr_imputed = opt.sample_weights @ opt.s @ opt.feature_weights.T

# redo the pre-processing step performed within RPCA
table = table.to_dataframe().T
robust_clr_unimputed = rclr(table)

# use this to mask any "imputed" values
zero_mask = (table == 0.0)
# re-mask imputed values for comparison
robust_clr_imputed_masked = robust_clr_imputed.copy()
robust_clr_imputed_masked[zero_mask] = np.nan

# flatten and keep only the observed values
x = robust_clr_unimputed.ravel()
y = robust_clr_imputed_masked.ravel()
x = x[np.isfinite(x)]
y = y[np.isfinite(y)]

# print the correlation between values
print(np.corrcoef(x, y))

# plot the observed values with a fitted line
plt.scatter(x, y)
plt.title('Observed values from imputed vs. non-imputed data')
plt.xlabel('Observed Imputed')
plt.ylabel('Observed Non-imputed')
plt.plot(np.unique(x),
         np.poly1d(np.polyfit(x, y, 1))(np.unique(x)),
         color='red')
plt.show()
```

giving:
Hi @cameronmartino, That was very useful, many thanks for the thorough clarifications and code! My original data matrix consists of approximately 2,000-3,000 features and 600 samples. It turns out that if I increase the n_components parameter to 100, I can reach a correlation between imputed and non-imputed non-zero values of up to 0.88, while at the same time being able to impute more than 65% of the sparse data table. If I use an n_components of 10, the correlation is only 0.7. Would you recommend optimizing the n_components parameter until a satisfactory correlation is achieved while at the same time imputing a reasonable amount of data, even if n_components gets as high as 200-300 or more? I understand that the rank must be at most 600 if my matrix has the dimensions 600×3000 (for an m × n matrix A, rank(A) ≤ min(m, n)). In the documentation you suggest n_components should be between 1 and 10.
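As a quick sanity check on that rank bound, a sketch with a hypothetical scaled-down stand-in for a samples-by-features table:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a 600 x 3000 samples-by-features
# matrix, scaled down for speed.
A = rng.normal(size=(60, 300))

# The rank of an m x n matrix is at most min(m, n), so
# n_components can never meaningfully exceed the smaller dimension.
print(np.linalg.matrix_rank(A))  # at most 60 for a 60 x 300 matrix
```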
Glad that helped @madeleineernst. Just like all dimensionality reduction methods (e.g. PCoA, PCA, etc.), the OptSpace method used here depends on a low-rank assumption. That means we assume there are some latent categories: a simple example in the microbiome might be sample type, or in the case of movies, a latent categorization might be genre. You are correct that the maximum value n_components can take is the smallest dimension of the input matrix. The low-rank assumption is essential for making the problem tractable at scale, and increasing the rank will dramatically increase the runtime. That is why we suggest keeping it low (below 10), although your estimates may get better as you increase the rank. All of this is why we really focused on this tool as a dimensionality reduction rather than an imputation method.
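To illustrate the low-rank assumption, here is a hedged sketch using plain truncated SVD (a stand-in for OptSpace, on made-up synthetic data) where the true signal has rank 5: the reconstruction correlation climbs quickly up to the true rank, and components past it mostly fit noise.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: a true rank-5 signal plus a little noise,
# standing in for a samples-by-features table.
U = rng.normal(size=(100, 5))
V = rng.normal(size=(5, 200))
X = U @ V + 0.1 * rng.normal(size=(100, 200))

# Truncated SVD reconstruction at increasing rank.
u, s, vt = np.linalg.svd(X, full_matrices=False)
corrs = {}
for k in (2, 5, 20):
    Xk = (u[:, :k] * s[:k]) @ vt[:k]
    corrs[k] = np.corrcoef(X.ravel(), Xk.ravel())[0, 1]
    print(k, round(corrs[k], 3))
```

Below the true rank the correlation is noticeably lower; at and above it the gains level off, which matches the diminishing returns you see as n_components grows.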
Some users, including myself, would benefit from being able to access the rclr-transformed values.
I wrote this quick addition to return the rclr matrix and print it as TSV.
See issue #51 as an example.
I've been unhappy with the clr transforms implemented in R, but so far I'm happy with this one.
Thanks for putting in the time to implement the matrix completion and rclr functions.
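For anyone wanting to do this outside the patch, a minimal sketch of such a TSV export with pandas (the file name, sample/feature labels, and values here are all made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical small samples-by-features matrix; in practice this
# would be the rclr-transformed table produced by DEICODE.
rclr_mat = pd.DataFrame(
    np.log([[1.0, 2.0], [0.5, 4.0]]),
    index=['sample1', 'sample2'],
    columns=['featA', 'featB'])

# Write as tab-separated values, keeping row and column labels.
rclr_mat.to_csv('rclr.tsv', sep='\t')
print(open('rclr.tsv').read())
```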