Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: VCF requirements not clearly stated & wrong seqnames derived #70

Open
KuechlerO opened this issue May 5, 2024 · 1 comment
Open

Comments

@KuechlerO
Copy link

Firstly thx for your amazing work.
I really like the concept behind your tool :)

Now concerning the bug:
I worked with a fresh Colab notebook

  • MMSplice version:
    Installation with pip: mmsplice-2.4.0

  • Python version:
    Python 3.10.13

  • Operating System:
    Colab notebook

Description

I wanted to try out mmsplice, so I've downloaded your example data and gave it a go.
But even though I used your test data, I constantly received the following error:
ValueError: Fasta chrom names do not match with vcf chrom names.

After hours of wrapping my head around and hacking with the package code I found out:

  1. Not only a VCF-file is required as input, but also its index version (i.e. vcf.gz.tbi).
    That was never stated (check your ReadMe)...
    Please fix that.

  2. Also, when hacking around, I realized the parsing of the seqnames seems buggy.

  • It only parses seqnames when an indexed VCF file is provided
  • But it also always includes {'1'} as seqname for the VCF, no matter what is provided?!
# In your vcf_dataloader.py
def _check_chrom_annotation(self):
        # I've added these two lines
        fasta_chroms = set(self.fasta.fasta.keys())
        vcf_chroms = set(self.vcf.seqnames)

        print("fasta: ", fasta_chroms, flush=True)
        print("vcf_chroms", vcf_chroms, flush=True)
        if not fasta_chroms.intersection(vcf_chroms):
            raise ValueError(
                'Fasta chrom names do not match with vcf chrom names')

--> Output:
fasta:  {'17'}
vcf_chroms {'1', '17'}
...

The VCF seqnames should only include 17, since I've just provided your example.

What I Did

Take a look:
https://colab.research.google.com/drive/1hx6PAYT_lKuEYtnHCq0PN2lyvNBNDud1?usp=sharing

@KuechlerO
Copy link
Author

Ahh, the 1 might come from the annotated contig in the vcf file:
##contig=<ID=1,length=16>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant