Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: String representation/RleMatrix #38

Open
jonocarroll opened this issue Feb 28, 2020 · 6 comments
Open

Proposal: String representation/RleMatrix #38

jonocarroll opened this issue Feb 28, 2020 · 6 comments

Comments

@jonocarroll
Copy link

I'd like to propose a couple of potential optimisations, particularly useful for cases where many entries are the same (e.g. AD: NA NA for GT: "./."). I'm working with a file which compresses nicely to .rds (~40MB) but is very large when in memory (17GB). This size becomes difficult to work with on a local machine or interactively.

Looping over the individual assays in a CompressedVcf I see that the largest ones are those which are matrices of lists (e.g. AD, AF, MBQ, ...). In theory (I think) they contain the same scale of information as GT, but their representation makes them much larger.

I appreciate that this list format makes working with the data cleaner than storing the entries as delimited strings (as I believe they natively are in the VCF) which likely requires parsing the string and splitting into a list-like structure anyway, but this is a tradeoff between size and usability, and applies to the entire dataset even if a user is only interested in a subset.

Would there be interest in a representation of CollapsedVcf list-matrices as either delimited strings (in which case they could be stored as RleMatrix for an additional compression boost; or RleListMatrix (which doesn't exist, but here's an open issue: Bioconductor/DelayedArray#62)? I'm not well-versed enough in the Rle side to know if Rle provides a benefit when stored in a list, but it's an option.

Below is a comparison of object sizes for a toy example 'assay' constructed as a matrix of list elements and the size savings are potentially very large (in this case 1/62 the size).

## create a matrix of list elements
x <- matrix(
  replicate(
    2e4, 
    c(
      sample(c(NA, 1, 2), 1, prob = c(100, 1, 1)), 
      sample(c(NA, 1, 2), 1, prob = c(100, 1, 1))
    ), 
    simplify = FALSE), 
  ncol = 10
)
head(x)
#>      [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]     
#> [1,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [2,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [3,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [4,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [5,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [6,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#>      [,8]      [,9]      [,10]    
#> [1,] Numeric,2 Numeric,2 Numeric,2
#> [2,] Numeric,2 Numeric,2 Numeric,2
#> [3,] Numeric,2 Numeric,2 Numeric,2
#> [4,] Numeric,2 Numeric,2 Numeric,2
#> [5,] Numeric,2 Numeric,2 Numeric,2
#> [6,] Numeric,2 Numeric,2 Numeric,2

## collapse to singleton strings (credit: @lawremi)
v <- unlist(x)
v[is.na(v)] <- "."
x_str <- matrix(
  S4Vectors::unstrsplit(
    c(relist(v, x)), ","), 
  nrow(x), ncol(x))
head(x_str)
#>      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10]
#> [1,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [2,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [3,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [4,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [5,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [6,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."

suppressMessages(library(DelayedArray))

## Rle representation of matrix of list doesn't work
as(x, "RleArray")
#> Error in .Call2("Rle_constructor", values, lengths, PACKAGE = "S4Vectors"): Rle of type 'list' is not supported

## Rle representation of string matrix
x_rle <- as(x_str, "RleArray")
x_rle
#> <2000 x 10> matrix of class RleMatrix and type "character":
#>         [,1]  [,2]  [,3]  ... [,9]  [,10]
#> [1,]    ".,." ".,." ".,." .   ".,." ".,."
#> [2,]    ".,." ".,." ".,." .   ".,." ".,."
#> [3,]    ".,." ".,." ".,." .   ".,." ".,."
#> [4,]    ".,." ".,." ".,." .   ".,." ".,."
#> [5,]    ".,." ".,." ".,." .   ".,." ".,."
#> ...     .     .     .     .   .     .    
#> [1996,] ".,." ".,." ".,1" .   ".,." ".,."
#> [1997,] ".,." ".,." ".,." .   "2,." ".,."
#> [1998,] ".,." ".,." ".,." .   ".,." ".,."
#> [1999,] "1,." ".,." "1,." .   ".,." ".,."
#> [2000,] "2,." ".,." ".,." .   ".,." ".,."

## compare sizes
pryr::object_size(x)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> 1.44 MB
pryr::object_size(x_str)
#> 161 kB
pryr::object_size(x_rle)
#> 23 kB

Created on 2020-02-28 by the reprex package (v0.3.0)

I'm not across this package enough to have any insights into implementation issues or how this might affect other aspects of the inner workings (or user-side workings) but this approach reduces a 1.71GB CollapsedVcf into a 20MB object of the same class without destroying any information. Credit to @lawremi for guidance towards optimising the conversion and on structural advice.

@lawremi
Copy link
Contributor

lawremi commented Feb 28, 2020

Ideally we would use a light weight representation (like the proposed RleListMatrix) that is more computable than a delimited string.

@jonocarroll
Copy link
Author

I'm not sure whether or not the Rle benefit will work across lists, but if it can then that may solve all of the problems.

@lawremi
Copy link
Contributor

lawremi commented Feb 29, 2020

I think it would in this case, because of how the Rle compression helps the character case.

@jonocarroll
Copy link
Author

Does it work in that context because a matrix is just a vector with a dim so you can actually compress long runs of sequential elements? I'm not sure I see how that manifests for the 'vector of lists' case but perhaps something clever can be done in the unlisting/relisting sense to take advantage of that.

@lawremi
Copy link
Contributor

lawremi commented Mar 2, 2020

It will be efficient as long as the RleList is a CompressedList, meaning that internally there is just a flat vector that is lazily partitioned into list elements.

@jonocarroll
Copy link
Author

In that case it may even be slightly more performant than the string version as there would be even fewer unique values over which a run may compress. Nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants