Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ref : Macaca Fascicularis 5.0, missing style #10

Closed
MuShuw opened this issue Jun 12, 2020 · 8 comments
Closed

ref : Macaca Fascicularis 5.0, missing style #10

MuShuw opened this issue Jun 12, 2020 · 8 comments

Comments

@MuShuw
Copy link

MuShuw commented Jun 12, 2020

Hello.

We are working with VCFs of Macaca Fascicularis.
We had an issue with your function extractSeqLevelsByGroup, called by MutationalPatterns::read_vcfs_as_granges .
The chromosomes style notation of Macaca Fascicularis is not available.
Could BSgenome.Mfascicularis.NCBI.5.0 be added to the supported organism ?
Macaca Fascicularis is an alternative to Macaca Mulatta on which research is increasing.

Regards,
Elyas.

@mtmorgan
Copy link
Contributor

Can you please provide a reproducible example of the problem?

@hpages
Copy link
Contributor

hpages commented Jun 12, 2020

I don't see that MutationalPatterns::read_vcfs_as_granges calls "our function" extractSeqLevelsByGroup. I don't know what extractSeqLevelsByGroup is or what it does. AFAICT there is no such function in the GenomeInfoDb package and I don't see one in the MutationalPatterns package either (looking at the master branch only).

So yes please follow Martin's advice as it's hard to help you without having a little bit more details about what's going on.

FWIW I'll take a guess that the problem you're running into is caused by the exotic chromosome names in the Macaca_fascicularis_5.0 assembly:

> library(BSgenome.Mfascicularis.NCBI.5.0)
> seqlevels(BSgenome.Mfascicularis.NCBI.5.0)[1:30]
 [1] "MFA1"         "MFA2"         "MFA3"         "MFA4"         "MFA5"        
 [6] "MFA6"         "MFA7"         "MFA8"         "MFA9"         "MFA10"       
[11] "MFA11"        "MFA12"        "MFA13"        "MFA14"        "MFA15"       
[16] "MFA16"        "MFA17"        "MFA18"        "MFA19"        "MFA20"       
[21] "MFAX"         "MT"           "Scaffold1372" "Scaffold1442" "Scaffold1519"
[26] "Scaffold3048" "Scaffold3501" "Scaffold27"   "Scaffold47"   "Scaffold56"  

These names would certainly confuse GenomeInfoDb::seqlevelsStyle() and maybe MutationalPatterns::read_vcfs_as_granges() makes overly optimistic assumptions about the capabilities of GenomeInfoDb::seqlevelsStyle(). I don't know, I'm not familiar with the MutationalPatterns package.

To rename the chromosomes and scaffolds to the more standard UCSC naming scheme, you can do:

chrominfo <- getChromInfoFromNCBI("Macaca_fascicularis_5.0")

head(chrominfo)
#   SequenceName       SequenceRole AssignedMolecule GenBankAccn Relationship
# 1         MFA1 assembled-molecule             <NA>  CM001919.1            =
# 2         MFA2 assembled-molecule             <NA>  CM001920.1            =
# 3         MFA3 assembled-molecule             <NA>  CM001921.1            =
# 4         MFA4 assembled-molecule             <NA>  CM001922.1            =
# 5         MFA5 assembled-molecule             <NA>  CM001923.1            =
# 6         MFA6 assembled-molecule             <NA>  CM001924.1            =
#    RefSeqAccn     AssemblyUnit SequenceLength UCSCStyleName circular
# 1 NC_022272.1 Primary Assembly      227556264          chr1       NA
# 2 NC_022273.1 Primary Assembly      192460366          chr2       NA
# 3 NC_022274.1 Primary Assembly      192294377          chr3       NA
# 4 NC_022275.1 Primary Assembly      170955103          chr4       NA
# 5 NC_022276.1 Primary Assembly      189454096          chr5       NA
# 6 NC_022277.1 Primary Assembly      181584905          chr6       NA

seqlevels(BSgenome.Mfascicularis.NCBI.5.0) <- setNames(chrominfo$UCSCStyleName,
                                                       chrominfo$SequenceName)

seqlevels(BSgenome.Mfascicularis.NCBI.5.0)[1:30]
#  [1] "chr1"                     "chr2"                    
#  [3] "chr3"                     "chr4"                    
#  [5] "chr5"                     "chr6"                    
#  [7] "chr7"                     "chr8"                    
#  [9] "chr9"                     "chr10"                   
# [11] "chr11"                    "chr12"                   
# [13] "chr13"                    "chr14"                   
# [15] "chr15"                    "chr16"                   
# [17] "chr17"                    "chr18"                   
# [19] "chr19"                    "chr20"                   
# [21] "chrX"                     "chrM"                    
# [23] "chr1_AQIA01037047_random" "chr1_AQIA01037052_random"
# [25] "chr1_AQIA01037053_random" "chr1_AQIA01037091_random"
# [27] "chr1_AQIA01037098_random" "chr1_KE145507_random"    
# [29] "chr1_KE145508_random"     "chr1_KE145509_random"    

Then try MutationalPatterns::read_vcfs_as_granges() again and see if that helps.

@MuShuw
Copy link
Author

MuShuw commented Jun 18, 2020

Hello,
I looked to verify if I didn't make an error, but extractSeqlevelsByGroup is a function of GenomeInfoDb.
It can be found here : https://github.com/Bioconductor/GenomeInfoDb/blob/master/R/seqlevelsStyle.R#L222

I was allowed to share a bit of a vcf file that can be found here : https://github.com/MuShuw/sharable/blob/master/test.vcf

We use R 3.6.5 on the project.

The next lines can reproduce the error :

ref_genome <- "BSgenome.Mfascicularis.NCBI.5.0"
library(MutationalPatterns)
library(GenomeInfoDb)
library(ref_genome, character.only = TRUE)
path_dat <- '.' # wherever are the vcfs
vcf_files <- list.files(path_dat, pattern = ".vcf", full.names = TRUE)
sample_names <- c("test")
vcfs <- read_vcfs_as_granges(vcf_files, sample_names, ref_genome)

At that point appears the next error :

Error in extractSeqlevelsByGroup(species = ref_organism, style = ref_style, :
The style specified by 'NCBIEnsembl' does not have a compatible entry for the species Macaca fascicularis

The traceback give this :

6: stop("The style specified by '", style, "' does not have a compatible entry for the species ", 
       species)
5: extractSeqlevelsByGroup(species = ref_organism, style = ref_style, 
       group = "auto")
4: FUN(X[[i]], ...)
3: lapply(X = X, FUN = FUN, ...)
2: mclapply(seq_along(vcf_files), function(index) {
       file <- vcf_files[index]
       vcf <- rowRanges(readVcf(file, genome_name))
       seqlevelsStyle(vcf) <- ref_style[1]
       groups <- c()
       if (group != "none") {
           if (group == "auto+sex") {
               groups <- c(extractSeqlevelsByGroup(species = ref_organism, 
                   style = ref_style, group = "auto"), extractSeqlevelsByGroup(species = ref_organism, 
                   style = ref_style, group = "sex"))
               groups_names <- names(groups)
               if (!is.null(groups_names)) {
                   unique_names <- unique(groups_names)
                   groups <- llply(unique_names, function(x) groups[groups_names == 
                     x])
                   groups <- llply(groups, unlist, recursive = FALSE)
                   groups <- unique(as.vector(groups[[1]]))
               }
           }
           else {
               groups <- extractSeqlevelsByGroup(species = ref_organism, 
                   style = ref_style, group = group)
               groups <- unique(as.vector(t(groups)))
           }
           groups <- intersect(groups, seqlevels(vcf))
           vcf <- keepSeqlevels(vcf, groups, pruning.mode = "tidy")
       }
       if (check_alleles) {
           rem <- which(all(!(!is.na(match(vcf$ALT, DNA_BASES)) & 
               !is.na(match(vcf$REF, DNA_BASES)) & (lengths(vcf$ALT) == 
               1))))
           if (length(rem) > 0) {
               vcf = vcf[-rem]
               warnings <- c(warnings, paste(length(rem), "position(s) with indels and/or multiple", 
                   "alternative alleles are excluded from", paste(sample_names[[index]], 
                     ".", sep = "")))
           }
       }
       return(list(vcf, warnings))
   }, mc.cores = num_cores)
1: read_vcfs_as_granges(vcf_files, "test", ref_genome)

Concerning the proposed solution using getChromInfoFromNCBI, I have not been able to use that function despite loading the library.

Thanks again for your help.

Elyas.

@hpages
Copy link
Contributor

hpages commented Jun 18, 2020

Yes I see now that GenomeInfoDb has an extractSeqlevelsByGroup function. I originally missed it because you reported it with the incorrect spelling (extractSeqLevelsByGroup) (case matters in R).

If you're using R 3.6.5 that means you're using an old and unsupported version of Bioconductor. The current version (BioC 3.11) is only supported on R 4.0. Bioconductor evolves quickly and getChromInfoFromNCBI is one of the many things that were introduced in BioC 3.11.

FWIW I'm currently working on making the switch between NCBI and UCSC style easier for BSgenome.Mfascicularis.NCBI.5.0 and other BSgenome packages. The goal is to be able to just do:

genome <- BSgenome.Mfascicularis.NCBI.5.0
seqlevelsStyle(genome) <- "UCSC"

This will only be available in BioC >= 3.12 though (BioC 3.12 is the current development version).

I strongly recommend that you update your installation to R 4.0 + BioC 3.11.

Best,
H.

@hpages
Copy link
Contributor

hpages commented Jul 2, 2020

@MuShuw Have you made progress on this?

Heads up: with BSgenome 1.57.3 you can now switch the style of all the sequence names in BSgenome.Mfascicularis.NCBI.5.0 with just:

library(BSgenome.Mfascicularis.NCBI.5.0)
genome <- BSgenome.Mfascicularis.NCBI.5.0

seqinfo(genome)
# Seqinfo object with 7601 sequences (1 circular) from Macaca_fascicularis_5.0 genome:
#   seqnames     seqlengths isCircular                  genome
#   MFA1          227556264      FALSE Macaca_fascicularis_5.0
#   MFA2          192460366      FALSE Macaca_fascicularis_5.0
#   MFA3          192294377      FALSE Macaca_fascicularis_5.0
#   MFA4          170955103      FALSE Macaca_fascicularis_5.0
#   MFA5          189454096      FALSE Macaca_fascicularis_5.0
#   ...                 ...        ...                     ...
#   Scaffold7531       5581      FALSE Macaca_fascicularis_5.0
#   Scaffold7558      21004      FALSE Macaca_fascicularis_5.0
#   Scaffold7563       5184      FALSE Macaca_fascicularis_5.0
#   Scaffold7587       8082      FALSE Macaca_fascicularis_5.0
#   Scaffold7621       7181      FALSE Macaca_fascicularis_5.0

seqlevelsStyle(genome)
# [1] "NCBI"

seqlevelsStyle(genome) <- "UCSC"  # switch style

Then:

seqinfo(genome)
# Seqinfo object with 7601 sequences (1 circular) from macFas5 genome:
#   seqnames       seqlengths isCircular  genome
#   chr1            227556264      FALSE macFas5
#   chr2            192460366      FALSE macFas5
#   chr3            192294377      FALSE macFas5
#   chr4            170955103      FALSE macFas5
#   chr5            189454096      FALSE macFas5
#   ...                   ...        ...     ...
#   chrUn_KE147191       5581      FALSE macFas5
#   chrUn_KE147192      21004      FALSE macFas5
#   chrUn_KE147193       5184      FALSE macFas5
#   chrUn_KE147194       8082      FALSE macFas5
#   chrUn_KE147195       7181      FALSE macFas5

Hope this helps,
H.

@MuShuw
Copy link
Author

MuShuw commented Jul 2, 2020

@hpages Hello,

We haven't had R 4.0 installed on the server yet. I will give you an update once the former solution will be tested.

BSgenome 1.57.3 is the lastest version and that last solution is the switch you were working on if understand porperly ? If so I guess I should go for this solution first once BioC 3.11 is available to me ?

Thank you for your help Hervé.

Regards,
Elyas.

@hpages
Copy link
Contributor

hpages commented Jul 10, 2020

Hi @MuShuw,

Just to clarify the recent improvements to the seqlevelsStyle() setter are in the master branch of BSgenome (in version >= 1.57.3). The master branch is what goes in BioC 3.12 which is the current development version of Bioconductor. Since the improvements also introduce some subtle changes of semantic I'm not planning to backport them to the RELEASE_3_11 branch of BSgenome. In other words, these improvements won't be available in BioC 3.11 (the latest and most current Bioconductor release).

Best,
H.

@hpages
Copy link
Contributor

hpages commented Aug 4, 2020

Hi @MuShuw , is this working for you now? Thanks! H.

@hpages hpages closed this as completed Dec 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants