SNPs in HDF5 files are not recognized #62

Open
asaferali opened this issue Mar 29, 2017 · 7 comments

@asaferali

I was able to get WASP to run successfully when my genotypes are stored in text files. However, I would like to use HDF5 files for maximum accuracy, so I made HDF5 files from VCFs using snp2h5.

When I try to run find_intersecting_snps.py, I get the following for every chromosome:
starting chromosome 1
reading SNPs from file '/udd/resaf/WASP/WASP/data/HRC_VCF_chr22/snp_tab.h5'
WARNING: chromosome 1 is not in snp_tab.h5 file, assuming no SNPs for this chromosome
processing reads
starting chromosome 10
reading SNPs from file '/udd/resaf/WASP/WASP/data/HRC_VCF_chr22/snp_tab.h5'
WARNING: chromosome 10 is not in snp_tab.h5 file, assuming no SNPs for this chromosome

The HDF5 files seem to have been made properly, so I am not sure why the data are not recognized.

@ejh243

ejh243 commented Mar 14, 2018

I have a similar issue. Looking back, it appears that when I ran the snp2h5 program it worked fine for chromosomes 10-22 and 1. With chromosome 2, however, I get the error message included at the end of this comment. As a result, chromosomes 2-9 are not in the snp_tab.h5 file, and the find_intersecting_snps.py script doesn't find any SNPs on these chromosomes.

I don't think there is an issue with the VCF files, as I previously got this command to work with a different version of chromInfo.txt. I needed to rerun this step and change the chromosome labels from chr1 to 1, chr2 to 2, etc. I just did this in the chromInfo.txt file. I don't think that can be the problem, though: why would it work for some chromosomes and not others? Any suggestions?

ERROR MESSAGE:

reading from file chr2.dose.vcf.gz
counting lines in file
total lines: 3392260
reading VCF header
VCF header lines: 14
number of samples: 90
initializing HDF5 matrix with dimension: (3392246, 180)
HDF5-DIAG: Error detected in HDF5 (1.8.12) thread 0:
  #000: ../../src/H5Ddeprec.c line 169 in H5Dcreate1(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: ../../src/H5Dint.c line 439 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: ../../src/H5L.c line 1638 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: ../../src/H5L.c line 1882 in H5L_create_real(): can't insert link
    major: Symbol table
    minor: Unable to insert object
  #004: ../../src/H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: ../../src/H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #006: ../../src/H5L.c line 1674 in H5L_link_cb(): name already exists
    major: Symbol table
    minor: Object already exists
ERROR: snp2h5.c:470 failed to create dataset

END
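
Since the chromosome labels in chromInfo.txt were changed from chr1 to 1, chr2 to 2, etc., one quick sanity check is whether the CHROM column in each VCF uses the same naming. The sketch below is not part of WASP; the function names are hypothetical, and it assumes chromInfo.txt has the chromosome name in its first whitespace-separated column and that the VCFs are plain or gzip-compressed text.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def chrom_info_names(path):
    """Return the set of names in the first column of a chromInfo-style file."""
    with open_text(path) as fh:
        return {
            line.split()[0]
            for line in fh
            if line.strip() and not line.startswith("#")
        }


def first_vcf_chrom(path):
    """Return the CHROM field of the first data line in a VCF, or None if there is none."""
    with open_text(path) as fh:
        for line in fh:
            if line.strip() and not line.startswith("#"):
                return line.split("\t", 1)[0]
    return None


def check_naming(chrom_info_path, vcf_paths):
    """Map each VCF path to (CHROM label, whether that label is listed in chromInfo)."""
    names = chrom_info_names(chrom_info_path)
    report = {}
    for path in vcf_paths:
        chrom = first_vcf_chrom(path)
        report[path] = (chrom, chrom in names)
    return report
```

Calling `check_naming("chromInfo.txt", ["chr2.dose.vcf.gz"])` would return the label found in the VCF together with a flag saying whether chromInfo.txt lists it. A `False` flag would point to a naming mismatch such as `chr2` versus `2`.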

@gmcvicker
Collaborator

I am not sure what the problem is, but I can try to reproduce it if you are able to provide a way to download the VCF files that you used. If you are not able to provide a complete VCF, then perhaps you could create a short VCF with a few SNPs from each chromosome?

Thanks!

Graham

@ejh243

ejh243 commented Mar 19, 2018

Ah, I have just noticed something: after chromosome 22, it reads from the chromosome 2 VCF file but extracts the chromInfo for chromosome 1. See below:

chromosome: 22, length: 51304566bp
reading from file chr22.dose.vcf.gz
counting lines in file
total lines: 524558
reading VCF header
VCF header lines: 14
number of samples: 90
initializing HDF5 matrix with dimension: (524544, 180)
parsing file and writing to HDF5 files
............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
chromosome: 1, length: 249250621bp
reading from file chr2.dose.vcf.gz

Why is it treating chromosome 2 as chromosome 1, even though it has already processed chr1?

@gmcvicker
Collaborator

I am still not sure what the issue is. The matching of filenames to chromosomes is handled by the function chrom_guess_from_file in the file: https://github.com/bmvdgeijn/WASP/blob/master/snp2h5/chrom.c

Can you send me the list of your vcf filenames as well as the chromInfo.txt file that you are using?

Thanks,

Graham

@ejh243

ejh243 commented Mar 20, 2018 via email

@gmcvicker
Collaborator

I think that I have fixed this problem now. snp2h5 tries to automatically guess which VCF input files are for which chromosome; however, this turns out to be error-prone. I have rewritten this part of the code so that the chromosome is read from the first data line in the VCF file, which should be more reliable. The commit for this is 66e9d25.

I have merged this fix into the master branch and will hopefully make a new release soon.
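
For anyone on an older release who wants to check their inputs before running snp2h5, the idea of taking the chromosome from the first data line can be sketched in Python. This is only an illustration of the approach described above, not the C code from the commit. The function names are hypothetical, and it assumes each VCF holds a single chromosome. Two files resolving to the same chromosome would be consistent with the "name already exists" error reported earlier in this thread, although that link is an assumption.

```python
import gzip
from collections import defaultdict


def first_data_chrom(vcf_path):
    """Return the CHROM field from the first non-header line of a VCF, or None."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            # Skip meta lines ("##...") and the column header ("#CHROM ...").
            if line.startswith("#") or not line.strip():
                continue
            return line.split("\t", 1)[0]
    return None


def assign_chromosomes(vcf_paths):
    """Group VCF paths by the chromosome named on their first data line."""
    by_chrom = defaultdict(list)
    for path in vcf_paths:
        by_chrom[first_data_chrom(path)].append(path)
    return dict(by_chrom)


def find_collisions(by_chrom):
    """Return only the chromosomes that more than one file was assigned to."""
    return {chrom: paths for chrom, paths in by_chrom.items() if len(paths) > 1}
```

An empty result from `find_collisions(assign_chromosomes(paths))` means every file maps to a distinct chromosome label. A `None` key in the grouping indicates a file with no data lines.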

@ejh243

ejh243 commented Jun 5, 2018 via email
