Return sparse representation for genotype #310

@quattro

Hi @brentp,

thanks again for developing such a fantastic tool. We use it for nearly every project in my group!

I'm curious whether it would be possible to add a new property/method to VariantInfo that returns a sparse representation of genotypes. Ideally, something like var.sparse_genotypes that returns a (values, indices) pair: the non-zero genotype values and the sample indices where they occur.

This is already achievable with numpy filtering of var.gt_types, but it is somewhat slow, and I'm curious whether doing this in Cython would be faster.
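
For reference, the numpy filtering I have in mind looks roughly like the sketch below. It assumes the array comes from a VCF opened with gts012=True, so that 0, 1, 2 are alt-allele counts and 3 marks a missing call. The helper name sparse_genotypes_np and the missing_code parameter are only illustrative and not part of cyvcf2. A plain array stands in for var.gt_types so the snippet runs on its own.

```python
import numpy as np


def sparse_genotypes_np(gt_types, missing_code=3, include_missing=False):
    """Return (values, indices) for the non-zero entries of a genotype vector.

    Assumes 0 means hom-ref and `missing_code` marks a missing call
    (3 under the assumed gts012=True encoding).
    """
    gt_types = np.asarray(gt_types)
    keep = gt_types != 0  # drop hom-ref calls
    if not include_missing:
        keep &= gt_types != missing_code  # also drop missing calls
    idxs = np.flatnonzero(keep)  # sample indices that survive the filter
    return gt_types[idxs], idxs


# Toy stand-in for var.gt_types across six samples
example = np.array([0, 1, 3, 2, 0, 1])
values, idxs = sparse_genotypes_np(example)
```

Doing this per variant in Python allocates a boolean mask and two new arrays each time, which is the overhead I would hope a Cython implementation could avoid.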

The overall goal is to be able to build a sparse genotype matrix across all variants, which would look something like this:

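import numpy as np
from cyvcf2 import VCF
from scipy.sparse import coo_matrix

vcf = VCF(...)
data = []
indices = []
for vdx, var in enumerate(vcf):
    # proposed API: non-zero genotype values and the sample indices where they occur
    _data, _idxs = var.sparse_genotypes(include_missing=False)
    # pair each sample index (row) with this variant's index (column)
    _idx = np.column_stack((_idxs, np.ones_like(_idxs) * vdx))
    data.append(_data)
    indices.append(_idx)

data = np.concatenate(data)
indices = np.concatenate(indices)

n = len(vcf.samples)
p = vdx + 1  # number of variants; vdx is the last 0-based index
# coo_matrix takes (data, (row, col)); rows are samples, columns are variants
sp_geno_mat = coo_matrix((data, (indices[:, 0], indices[:, 1])), shape=(n, p))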
