write_matrix_dir does not write rownames and colnames of a matrix transformed from dgCMatrix #29

Closed
realzehuali opened this issue May 16, 2023 · 10 comments

Comments

@realzehuali

g = readRDS("raw.RDS") # older Assay (v3) object whose counts are a dgCMatrix
g[["RNA"]] = as(g[["RNA"]], Class = "Assay5")
g[["RNA"]]$counts = as(g[["RNA"]]$counts, "IterableMatrix", strict = FALSE)
write_matrix_dir(mat = g[["RNA"]]$counts, dir = "BPCell/counts", overwrite = TRUE)
g = CreateSeuratObject(open_matrix_dir("BPCell/counts"))
g[["RNA"]]

The output of g[["RNA"]] is a matrix whose rownames are the placeholders Feature1, Feature2, Feature3, ... and whose colnames are Cell_1, Cell_2, ... instead of the original gene and cell names.

@realzehuali
Author

Also, is there any way to change the rownames and colnames of an IterableMatrix?

@realzehuali
Author

Here is my experience. Because of this, when I called saveRDS on a SeuratV5 object, the RNA assay stored as an IterableMatrix was also written into the RDS file, which took very long, especially for a SeuratV5 object with multiple layers: every layer was saved by writing out the whole IterableMatrix. An object taking up 30GB of RAM ended up taking more than 100GB of disk space. I had to shut down and delete the RDS file. I think all of this could be avoided by successfully writing the original matrix with its rownames and colnames. Thanks for any help!

@realzehuali
Author

g = readRDS("raw.RDS")
g
An object of class Seurat
41029 features across 646765 samples within 1 assay
Active assay: RNA (41029 features, 0 variable features)
2 layers present: counts, data
g[["RNA"]]$counts
41029 x 646765 sparse Matrix of class "dgCMatrix"
... omit the matrix ....
g[["RNA"]] = as(g[["RNA"]], Class = "Assay5")
Warning: Assay RNA changing from Assay to Assay5
g[["RNA"]]
Assay (v5) data with 41029 features for 646765 cells
First 10 features:
Xkr4, Rp1, Sox17, Gm37323, Mrpl15, Lypla1, Gm37988, Tcea1, Atp6v1h, Rb1cc1
Layers:
counts, data
g[["RNA"]]$data = NULL
g[["RNA"]]$counts = as(g[["RNA"]]$counts, "IterableMatrix", strict = F)
g[["RNA"]]
Assay (v5) data with 41029 features for 646765 cells
First 10 features:
Xkr4, Rp1, Sox17, Gm37323, Mrpl15, Lypla1, Gm37988, Tcea1, Atp6v1h, Rb1cc1
Layers:
counts
g[["RNA"]]$counts
41029 x 646765 IterableMatrix object with class MatrixSubset

Row names: Xkr4, Rp1 ... 4933409K07Rik
Col names: AAACCCAAGACCATGG-1_1_1_1, AAACCCAAGCTCATAC-1_1_1_1 ... TTTGTTGTCTGTCCCA-1_15_5

Data type: double
Storage order: column major

Queued Operations:

  1. Load dgCMatrix from memory
  2. Select rows: 1, 2 ... 41029 and cols: 1, 2 ... 646765

write_matrix_dir(mat = g[["RNA"]]$counts, dir = "testBPC", overwrite = T)
Warning: Matrix compression performs poorly with non-integers.
• Consider calling convert_matrix_type if a compressed integer matrix is intended.
This message is displayed once every 8 hours.
41029 x 646765 IterableMatrix object with class MatrixDir

Row names: unknown names
Col names: unknown names

Data type: double
Storage order: column major

Queued Operations:

  1. Load compressed matrix from directory E:\scRNAseq\testBPC

g = CreateSeuratObject(open_matrix_dir("testBPC"))
Counts matrix provided is not sparse. Creating V5 assay in Seurat Object.
g[["RNA"]]
Assay (v5) data with 41029 features for 646765 cells
First 10 features:
Feature1, Feature2, Feature3, Feature4, Feature5, Feature6, Feature7, Feature8, Feature9, Feature10
Layers:
counts
g[["RNA"]]$counts
41029 x 646765 IterableMatrix object with class MatrixSubset

Row names: Feature1, Feature2 ... Feature41029
Col names: Cell_1, Cell_2 ... Cell_646765

Data type: double
Storage order: column major

Queued Operations:

  1. Load compressed matrix from directory E:\scRNAseq\testBPC
  2. Select rows: 1, 2 ... 41029 and cols: 1, 2 ... 646765

@bnprks
Owner

bnprks commented May 16, 2023

BPCells uses rownames() and colnames() as normal in R, and in some quick testing I'm not able to reproduce your issue. For instance, all this works for me:

Code example reading and changing row/col names
library(BPCells)
x <- matrix(1:12, nrow=3)
rownames(x) <- paste0("row", seq_len(nrow(x)))
colnames(x) <- paste0("col", seq_len(ncol(x)))
x
#      col1 col2 col3 col4
# row1    1    4    7   10
# row2    2    5    8   11
# row3    3    6    9   12
x_sparse <- as(x, "dgCMatrix")
x_sparse
# 3 x 4 sparse Matrix of class "dgCMatrix"
#      col1 col2 col3 col4
# row1    1    4    7   10
# row2    2    5    8   11
# row3    3    6    9   12
x_bpcells <- as(x_sparse, "IterableMatrix")
x_bpcells
# 3 x 4 IterableMatrix object with class Iterable_dgCMatrix_wrapper

# Row names: row1, row2, row3
# Col names: col1, col2 ... col4

# Data type: double
# Storage order: column major

# Queued Operations:
# 1. Load dgCMatrix from memory
rownames(x_bpcells) <- paste0("newrow", seq_len(nrow(x)))
colnames(x_bpcells) <- paste0("newcol", seq_len(ncol(x)))
dir_path <- tempfile()
x_bpcells_dir <- write_matrix_dir(x_bpcells, dir_path)
# Warning: Matrix compression performs poorly with non-integers.
# • Consider calling convert_matrix_type if a compressed integer matrix is intended.
# This message is displayed once every 8 hours.
x_bpcells_dir2 <- open_matrix_dir(dir_path)
x_bpcells_dir2
# 3 x 4 IterableMatrix object with class MatrixDir

# Row names: newrow1, newrow2, newrow3
# Col names: newcol1, newcol2 ... newcol4

# Data type: double
# Storage order: column major

# Queued Operations:
# 1. Load compressed matrix from directory /tmp/RtmpNGXgHy/file23f2df836ad

I hope that helps answer your question. The examples you provide are complicated by the fact that I don't have access to the same dataset, and you're also using Seurat objects as an intermediate. If you're still having issues, I'd encourage you to simplify your problem into a reproducible example that I can take a closer look at.

EDIT: one additional note is that changing the row/col names of a BPCells disk-backed object does not alter the data on disk. If you want to save the new row/col names on disk, you'll need to write the matrix again, or change the row/col names prior to importing as a BPCells object
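
A minimal sketch of the note above, assuming a BPCells version that includes the fixes discussed in this thread. The toy data and the variable names (dir_a, dir_b, on_disk) are illustrative only. It renames a disk-backed matrix in R and then writes it to a second directory so the new names are stored on disk, while the first directory keeps the old names:

```r
library(BPCells)

# Small in-memory matrix with known names (illustrative data only)
x <- matrix(as.numeric(1:12), nrow = 3,
            dimnames = list(paste0("row", 1:3), paste0("col", 1:4)))
x_bpcells <- as(as(x, "dgCMatrix"), "IterableMatrix")

# Write once, then reopen the disk-backed copy
dir_a <- tempfile()
write_matrix_dir(x_bpcells, dir_a)
on_disk <- open_matrix_dir(dir_a)

# Renaming changes only the R object; the files in dir_a are untouched
rownames(on_disk) <- paste0("gene", seq_len(nrow(on_disk)))

# To persist the new names, write the renamed matrix to a new directory
dir_b <- tempfile()
write_matrix_dir(on_disk, dir_b)

original <- open_matrix_dir(dir_a)  # still carries the old row names
renamed <- open_matrix_dir(dir_b)   # carries the new row names
```

Writing to a fresh directory rather than overwriting the source directory avoids reading from and writing to the same files at once.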

@realzehuali
Author

I also got the correct result using an example similar to yours (the example in #23, actually). I think the key to the solution might be related to this message: "3 x 4 IterableMatrix object with class Iterable_dgCMatrix_wrapper", which is what I also got in my test after converting from dgCMatrix. In my buggy code, however, I got "41029 x 646765 IterableMatrix object with class MatrixSubset" after converting from dgCMatrix.

@realzehuali
Author

Thank you very much for your example about how to change the row and col names! Actually, the raw.RDS was generated by SeuratV4 in the past. I think to reproduce this bug, you can use SeuratV4 on a small raw matrix.

bnprks added a commit that referenced this issue May 17, 2023
- Previously, changing the dimnames on a transformed matrix would not
  affect the dimnames when writing that matrix to disk.
- Following a similar strategy to cell/chr renaming in fragments, where
  we add a new layer into the delayed operations
- Also added an unrelated fix to properly export `merge_cells()`
@bnprks
Owner

bnprks commented May 17, 2023

Thanks for the bug report! You were right that the issue had to do with the MatrixSubset class (actually any transformation shared the problem). I've fixed this now so updated dimnames should be properly saved when you write a matrix.

Please comment/reopen if this didn't actually solve your problem

@bnprks bnprks closed this as completed May 17, 2023
@realzehuali
Author

Thank you very very much for this quick response.

@Dario-Rocha

Hello again,
You've always been so attentive and I come again to you asking for help.
I am commenting on this closed issue because I am encountering the same problem. I assume the issue again stems from subsetting the matrix with '[,]' when the object is a concatenation of other BPCells matrices.
For extra info, the a_soup_joined matrix is a combination of a list of 92 matrices in temp_all_mats (which are all BPCells matrices), which I need to combine, then filter out some rows and store it in a new on-disk matrix.

a_soup_joined <- do.call(cbind, temp_all_mats)
a_soup_joined <- a_soup_joined[temp_keepindex,]
#save bpcells matrix----
write_matrix_dir(mat = a_soup_joined, 
                 dir = '.../complete_v01_bpcells_matrix', 
                 overwrite = TRUE,
                 compress = TRUE)

temp_check <- open_matrix_dir('.../complete_v01_bpcells_matrix')

a_soup_joined
31251 x 1310011 IterableMatrix object with class MatrixSubset

Row names: MIR1302.2HG, AL627309.1 ... AC007325.2
Col names: s334_cd25lo_t1:AAACCTGAGACACTAA-1, s334_cd25lo_t1:AAACCTGAGACTTGAA-1 ... s392_cd25hi_t2:TTTGTCATCTGCAAGT-1

Data type: uint32_t
Storage order: column major

Queued Operations:

  1. Concatenate cols of 92 matrix objects with classes: RenameDims, RenameDims ... RenameDims (threads=0)
  2. Select rows: 1, 4 ... 36601 and cols: all

temp_check
31251 x 1310011 IterableMatrix object with class MatrixDir

Row names: unknown names
Col names: unknown names

Data type: uint32_t
Storage order: column major

Queued Operations:

  1. Load compressed matrix from directory /Users/dariorocha/mydrive/postdoc/alcina/complete_v01/complete_v01_bpcells_matrix

And here is a reproducible example:

library(BPCells)

temp_matrix1 <- matrix(rnorm(10000), ncol = 100)
temp_matrix1 <- as(temp_matrix1, "dgCMatrix")
temp_matrix1 <- as(temp_matrix1, "IterableMatrix")

colnames(temp_matrix1) <- paste0('col1', seq(1:100))
rownames(temp_matrix1) <- paste0('row1', seq(1:100))

temp_matrix2 <- matrix(rnorm(10000), ncol = 100)
temp_matrix2 <- as(temp_matrix2, "dgCMatrix")
temp_matrix2 <- as(temp_matrix2, "IterableMatrix")

colnames(temp_matrix2) <- paste0('col2', seq(1:100))
rownames(temp_matrix2) <- paste0('row1', seq(1:100))

temp_matrix_join <- cbind(temp_matrix1, temp_matrix2)

temp_matrix_join <- temp_matrix_join[1:50,]

write_matrix_dir(mat = temp_matrix_join, 
                 dir = '/Users/dariorocha/Desktop/temp/temp_matrix', 
                 overwrite = TRUE)

temp_loaded <- open_matrix_dir('/Users/dariorocha/Desktop/temp/temp_matrix')
temp_loaded

In the meantime, while this issue gets resolved, and with all due respect, is there any workaround you'd suggest? Please don't get me wrong, your package has been doing wonders for my job; I'd just really like to be able to move forward with my project.

bnprks added a commit that referenced this issue Sep 29, 2023
@bnprks
Owner

bnprks commented Sep 29, 2023

Hi @Dario-Rocha, thanks for the very clear report with the reproducible example -- it made it quick to reproduce the bug on my end. I've fixed this issue in the current main branch, so you should be set for now.

In the future, similar bugs having to do with dimnames failing to write to disk should be fixed by calling dimnames(mat) <- dimnames(mat) just prior to writing to disk. (I know it's a weird workaround -- the dimnames handling code is a bit more complicated and due for a more complete re-think at some point soon). This case you found was a slight variant that I had not anticipated problems with in my original fix.
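
The workaround could be sketched as follows, on a reduced version of the reproducible example above. The helper make_mat and the variable names are hypothetical and only for illustration. On a BPCells version that already contains the fix, the re-assignment should be harmless:

```r
library(BPCells)

# Hypothetical helper: a small BPCells matrix with explicit row/col names
make_mat <- function(col_prefix) {
  m <- as(as(matrix(rnorm(100), ncol = 10), "dgCMatrix"), "IterableMatrix")
  rownames(m) <- paste0("row", seq_len(10))
  colnames(m) <- paste0(col_prefix, seq_len(10))
  m
}

# Concatenate columns, then subset rows (the pattern that triggered the bug)
joined <- cbind(make_mat("a"), make_mat("b"))[1:5, ]

# Workaround: re-assign the dimnames just before writing to disk
dimnames(joined) <- dimnames(joined)

out_dir <- tempfile()
write_matrix_dir(joined, out_dir)
reloaded <- open_matrix_dir(out_dir)

# Check that the names survived the round trip
stopifnot(identical(rownames(reloaded), rownames(joined)))
stopifnot(identical(colnames(reloaded), colnames(joined)))
```

Checking the names right after reopening the directory catches the "unknown names" case before the matrix is passed on to downstream tools.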

I appreciate you taking the time to bring up this issue -- it helps make BPCells better for everyone
