`[somalier] found 0 sites' error when extracting from array .vcf.gz #31
Comments
somalier actually looks for depth, not at the genotype. I'll make a fix for this special case and get it into next release. |
Thanks a lot! |
this is fixed and will be out in next release. meanwhile here is a binary of the dev version of somalier. if you have a chance to try it out, that would be great |
Thanks, got further this time but crashed after creating 12 .somalier files (6.5k samples in the full VCF):
|
hmm. thanks for reporting. can you re-run in a different location in a local (not networked) disk to verify that the problem persists and that you still have only 12 samples output? |
Hm, ran with the same config but outputting to a different directory (same disk, same everything). The extraction worked, will check the relatedness now. Fingers crossed! |
ok. hopefully just a network error. that part of the code is doing sane things. I'm interested to hear about run-time (and output) for the 6.5K samples. the largest set I have run is 2K samples. |
Extraction is around 1.5 min for 5098 sites. Will measure the relatedness calculation, but it was pretty quick (5-10 min?). Is there a flag to limit the pairs file output to minimal relatedness value? |
OK. Great. I just ran on 4528 samples and it finished in 59 seconds so your estimate sounds right. there isn't a flag to limit by relatedness, because that's simple with:
I would like to limit the output to html to related pairs (and subset unrelated since nearly all pairs in large cohorts are unrelated), but that's for a later release. thanks again for following up. I'll get out the new release this week. |
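The command originally posted with this comment did not survive in this copy of the thread. As a stand-in, here is a minimal Python sketch of filtering a pairs file by a relatedness threshold. It assumes the file is tab-separated with a header line containing a `relatedness` column (possibly with a leading `#` on the header); the function name `filter_pairs` and the 0.2 cutoff are illustrative choices, not part of somalier.

```python
import csv
import io


def filter_pairs(lines, min_relatedness=0.2, column="relatedness"):
    """Yield the header, then every row whose relatedness >= min_relatedness.

    `lines` is any iterable of tab-separated text lines. The column is
    located by name in the header, ignoring a leading '#'.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    names = [name.lstrip("#") for name in header]
    idx = names.index(column)  # raises ValueError if the column is absent
    yield header
    for row in reader:
        if float(row[idx]) >= min_relatedness:
            yield row


if __name__ == "__main__":
    # Made-up rows for illustration; real input would be the pairs TSV.
    demo = io.StringIO(
        "#sample_a\tsample_b\trelatedness\n"
        "s1\ts2\t0.49\n"
        "s1\ts3\t-0.01\n"
        "s2\ts3\t0.26\n"
    )
    for row in filter_pairs(demo, min_relatedness=0.2):
        print("\t".join(row))
```

Because the rows are streamed one at a time, this should stay cheap in memory even for pairs files with tens of millions of lines.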
Thanks a lot for implementing the genotype extraction! A quick update on the relatedness calculation: n=8584; the number of variants is not very consistent – 6.5k array samples with ~5000 variants, 2k WES samples with ~15k variants.
The pairs file has 36,838,237 lines so that seems correct. head/tail of the file looks ok. Not sure if there are any files affected by the out of memory 😕 |
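The line count can be sanity-checked with a little arithmetic: n samples give n(n-1)/2 unordered pairs. The sketch below assumes the pairs file has one line per pair plus a single header line, which is an assumption about the file layout rather than something stated in the thread.

```python
def expected_pairs_lines(n_samples, header_lines=1):
    """Expected number of lines in a file with one row per unordered pair."""
    pairs = n_samples * (n_samples - 1) // 2
    return pairs + header_lines


if __name__ == "__main__":
    print(expected_pairs_lines(8584))  # prints 36838237
```

For n=8584 that is 36,838,236 pairs plus one header line.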
I think the only way it could be running out of memory is in the step for converting to JSON which is after the pairs file gets written. The fact that the message is printed supports that. my set of 4,258 samples uses ~7.5GB of memory and that jumps right at the end. |
this is out here: https://github.com/brentp/somalier/releases/tag/v0.2.5 thanks for reporting and let me know any other issues with large cohorts. |
Hi, I am using somalier to calculate relatedness for a cohort of 13206 samples. I am running into an The output are The command and standard output are:
I am running the command using
Why is stdout reporting out of memory when the script runs to the end? Do you have any suggestions? |
hi, unfortunately, dumping to json requires the most memory. The easiest way for you to get this running is to use a machine with more memory. |
I could allocate more memory with Slurm. Could you say roughly how much will be needed? Yes, the command is in a wrapper. |
looks like you're using 102GB so try 256GB to be safe or 128GB if you need a lower allocation. |
Hello again, I increased the memory to 256GB and it ran without problems. However, I didn't get the inferred relationships. Would it be another file or a column in the somalier.pairs.tsv file? Also, what is the role of the somalier.sample.tsv file? Thanks! |
In order to get inferred relationships, you have to use an extra flag (see |
Sorry I haven't explained it properly, but I am using the sample.tsv is empty in my case. |
sample.tsv should not be empty if somalier finished running. can you verify you are looking at the correct file and that it was copied fully by whatever wrapper you are using? |
Actually, I was confused by the |
this will be easier to debug if you run somalier directly (on an interactive node or whatever you need). |
I tried on an interactive node but it runs into The script to extract is:
the sites are the one I downloaded from the repository you linked to: Running on an interactive node, the stdout is:
|
ok. so i think the problem is something with the somalier files. How were they extracted? Somalier thinks they are all related or identical so it can't subset the html output. |
I use Is in a for loop and the stdout is:
|
ok. that's the worst way to run somalier. better to run it on a jointly called VCF. But if you do run extract that way, then you need to run |
I tried to run it from the aggregated file but I get a problem which I believe is caused by the VEP annotations in the INFO field. It doesn't find any site.
|
This is not because of VEP, it's because the caller did not add AD field for each sample. what caller did you use? The AD field is pretty standard. |
That bit wasn't done by me, but I can try to find out if you think it would help. I tried with the
|
can you show what format fields are available in your VCF? so that doesn't look like a memory error. are you sure you have enough disk space to write? |
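One way to answer the question about available FORMAT fields is to tally the keys in the FORMAT column (the ninth column of a VCF) over the first records. This is a minimal stdlib sketch; the helper names `open_text` and `format_key_counts` are hypothetical, and the 1000-record cap is an arbitrary choice.

```python
import gzip
import sys
from collections import Counter


def open_text(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def format_key_counts(lines, max_records=1000):
    """Count FORMAT keys over the first max_records data lines of a VCF."""
    counts = Counter()
    seen = 0
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        seen += 1
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:  # sites-only records have no FORMAT column
            counts.update(fields[8].split(":"))
        if seen >= max_records:
            break
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    with open_text(sys.argv[1]) as handle:
        counts = format_key_counts(handle)
    print(dict(counts))
    print("AD present:", "AD" in counts)
```

If `AD` never shows up in the tally, that is consistent with the explanation above that the caller did not write per-sample allelic depths.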
in the single VCFs the format has I don't think it is a disk space problem:
|
Hi,
Following up on the question I've posted on twitter. Briefly, my goal is to check the relatedness between around ten thousand WES samples (no variant calls yet, so somalier's "sketches" would be very helpful) and existing array genotyping datasets (between 5k and 25k samples). I'm using the provided b38 site file and hs38DH.fa. For the WES samples, the sketch extraction works smoothly and the relatedness check produces sensible results.
However, I'm getting puzzling errors when extracting the sites from VCFs. With 0.2.3 it reports `[somalier] found 0 sites` (and produces zero variant .somalier files). With 0.2.4 it's `SIGSEGV: Illegal storage access. (Attempt to read from nil?)`. Following your suggestion to check the REFs and ALTs, I've made a toy VCF with three variants matching the sites file.
In the sites:
In the VCF:
This looks like it should work, but produces the same 0 sites/illegal storage errors. I've also tried setting `--sites` and the extraction file to the three variant VCF itself: both versions report `found 0 sites`. Any suggestions what else to check? Appreciate your help!
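The REF/ALT check described above can be scripted. This is a minimal sketch that compares (chrom, pos, ref, alt) keys between a sites file and a sample VCF, both given as iterables of VCF text lines. It also counts positions that match once a `chr` prefix is ignored, since a naming mismatch is a common reason for zero overlap. The function names are hypothetical and nothing here reflects somalier's internal matching logic.

```python
def variant_keys(lines):
    """Collect (chrom, pos, ref, alt) keys; multi-allelic ALTs are split."""
    keys = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def strip_chr(chrom):
    """Drop a leading 'chr' so 'chr1' and '1' compare equal."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def compare_sites(sites_lines, vcf_lines):
    """Summarise how many site keys are shared between the two inputs."""
    sites = variant_keys(sites_lines)
    calls = variant_keys(vcf_lines)
    sites_pos = {(strip_chr(c), p) for c, p, _, _ in sites}
    calls_pos = {(strip_chr(c), p) for c, p, _, _ in calls}
    return {
        "sites": len(sites),
        "vcf": len(calls),
        "exact_match": len(sites & calls),
        "same_position_ignoring_chr": len(sites_pos & calls_pos),
    }
```

A full exact match rules out allele or contig-name mismatches. As the maintainer's first reply in this thread notes, extraction also depended on per-sample depth, so matching alleles alone did not guarantee that sites were found.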