Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compute gene scores without GC content annotation #12

Open
marie3003 opened this issue May 19, 2022 · 5 comments
Open

Compute gene scores without GC content annotation #12

marie3003 opened this issue May 19, 2022 · 5 comments
Assignees

Comments

@marie3003
Copy link

Hey,
thanks a lot for your package. It is very useful:).

I'm currently working with a multiome dataset for which I want to calculate gene scores for the ATAC data to compare them to the gene expression data. I'm using the data set from the NeurIPS Competition Open Problems in Single Cell Analysis (https://openproblems.bio/neurips_docs/data/dataset/). However, in my dataset there is no GC annotation. That's why the method genescore.prepare_multiome_anndata() crashes and I'm not able to use the follow up methods to compute the gene scores.
Do you provide a method to calculate this or do you have another easy way to add this annotation to the data? Or is there another way to calculate the gene scores without the GC annotation.

Thanks a lot for your help.
Best regards,
Marie Becker

@ManuSetty
Copy link
Member

At this point, we do not yet provide a function of computing GC content but we will look to add this feature.

GC content represents the fraction of Gs or Cs in the peak sequence. This can be computed using Bioconda or Bioconductor packages. We have been using ArchR for ATAC preprocessing which generates this information automatically. GC content is unfortunately a requirement for computing gene scores since it is necessary for background peak identification for expression-accessibility correlations

@marie3003
Copy link
Author

Thanks a lot for the suggestions. This feature would be very useful.

I was able to calculate the gene score using genomepy and biopython. I extracted the peak sequences with genomepy and calculated the gc content with the GC function from Bio.SeqUtils. It was pretty fast and easy actually:).

@ManuSetty
Copy link
Member

ManuSetty commented May 20, 2022 via email

@marie3003
Copy link
Author

Hi Manu,

sure, no problem. Thank you for adding this! Here is the code I used to download the right genome and to add the GC annotation.

import numpy as np
import anndata as ad

import genomepy
from Bio.SeqUtils import GC

adata_atac = ad.read_h5ad('../write/filtered_data_atac.h5ad')

# download genome from NCBI
genomepy.install_genome(name='GRCh38', provider='NCBI', genomes_dir = ''../data') # took about 9 min
genome = genomepy.Genome(name = 'GRCh38', genomes_dir = '../data')

GC_content_list = []

for i, region in enumerate(adata_atac.var_names):
    chromosome, start, end = region.split('-')
    chromosome = chromosome[3:]

    # get sequence
    sequence = str(genome.get_seq(name = chromosome, start = int(start), end = int(end)))

    # calculate GC content
    GC_content = GC(str(sequence))
    GC_content_list.append(GC_content)

# GC content ranges from 0% - 100%, should be 0 to 1
adata_atac.var['GC'] = GC_content_list
adata_atac.var.GC = adata_atac.var.GC/100

To find the right genome name and provider, I used

for provided_genome in genomepy.search('GRCh38', provider=None):
    print(provided_genome)

Best regards,
Marie

@ManuSetty
Copy link
Member

Thank you! @sitarapersad Can you please add this to the utils.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants