`[somalier] found 0 sites' error when extracting from array .vcf.gz #31

Closed
asazonov opened this issue Oct 23, 2019 · 31 comments

@asazonov

Hi,

Following up on the question I posted on Twitter. Briefly, my goal is to check the relatedness between around ten thousand WES samples (no variant calls yet, so somalier's "sketches" would be very helpful) and existing array genotyping datasets (between 5k and 25k samples). I'm using the provided b38 sites file and hs38DH.fa. For the WES samples, the sketch extraction works smoothly and the relatedness check produces sensible results.

However, I'm getting puzzling errors when extracting the sites from VCFs. With 0.2.3 it reports [somalier] found 0 sites (and produces zero variant .somalier files). With 0.2.4 it's SIGSEGV: Illegal storage access. (Attempt to read from nil?).

Following your suggestion to check the REFs and ALTs, I've made a toy VCF with three variants matching the sites file.

In the sites:

chr1    33572614        .       T       C       .       PASS    AC=127807;AF=0.509354
chr1    33772682        .       T       C       .       PASS    AC=74131;AF=0.295319
chr1    36307804        .       T       G       .       PASS    AC=78047;AF=0.498041

In the VCF:

chr1	33572614	AX-11402824	T	C	.	.	PR	GT	1/1	0/1 ...
chr1	33772682	AX-11522782	T	C	.	.	PR	GT	0/1	0/1 ...
chr1	36307804	AX-11462332	T	G	.	.	PR	GT	0/1	1/1 ...

This looks like it should work, but it produces the same 0 sites/illegal storage errors. I've also tried setting both --sites and the extraction file to the three-variant VCF itself: both versions report found 0 sites. Any suggestions on what else to check?
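The manual REF/ALT comparison above can be automated with a small helper. This is only a sketch: `count_matching_sites` is a hypothetical name, not part of somalier, and it assumes both inputs are uncompressed, tab-separated VCFs (decompress `.vcf.gz` files first, e.g. with `gzip -dc`).

```shell
# count_matching_sites SITES_VCF SAMPLE_VCF   (hypothetical helper, not a somalier command)
# Prints how many records in SAMPLE_VCF share CHROM, POS, REF and ALT
# with a record in SITES_VCF. Both files must be uncompressed VCFs.
count_matching_sites() {
    awk -F'\t' '
        /^#/ { next }                          # skip header lines
        { key = $1 FS $2 FS $4 FS $5 }         # CHROM, POS, REF, ALT
        NR == FNR { sites[key] = 1; next }     # first file: remember the sites
        key in sites { n++ }                   # second file: count matches
        END { print n + 0 }
    ' "$1" "$2"
}
```

A non-zero count only confirms that positions and alleles line up; it says nothing about which FORMAT fields the VCF carries.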

Appreciate your help!

@brentp
Owner

brentp commented Oct 23, 2019

somalier actually looks for depth, not at the genotype. I'll make a fix for this special case and get it into next release.

@asazonov
Author

Thanks a lot!

@brentp brentp closed this as completed in 06329c2 Oct 23, 2019
@brentp
Owner

brentp commented Oct 23, 2019

this is fixed and will be out in next release. meanwhile here is a binary of the dev version of somalier. if you have a chance to try it out, that would be great
somalier.gz

brentp added a commit that referenced this issue Oct 23, 2019
@asazonov
Author

Thanks, got further this time but crashed after creating 12 .somalier files (6.5k samples in the full VCF):

somalier version: 0.2.5
[somalier] FORMAT field 'AD' not found for depth information. using genotype only
[somalier] found 5098 sites
io.nim(127)              raiseEIO
Error: unhandled exception: errno: 5 `I/O error` [IOError]

@brentp
Owner

brentp commented Oct 23, 2019

hmm. thanks for reporting. can you re-run in a different location in a local (not networked) disk to verify that the problem persists and that you still have only 12 samples output?
even a debug build won't be much help here.

@asazonov
Author

Hm, ran with the same config but outputting to a different directory (same disk, same everything). The extraction worked, will check the relatedness now. Fingers crossed!

@brentp
Owner

brentp commented Oct 23, 2019

ok. hopefully just a network error. that part of the code is doing sane things. I'm interested to hear about run-time (and output) for the 6.5K samples. the largest set I have run is 2K samples.
that's probably too many points for the html to be usable, but the text-output has all the info.

@asazonov
Author

Extraction takes around 1.5 min for 5098 sites. I will measure the relatedness calculation, but it was pretty quick (5-10 min?). Is there a flag to limit the pairs file output to a minimum relatedness value?

@brentp
Owner

brentp commented Oct 23, 2019

OK. Great. I just ran on 4528 samples and it finished in 59 seconds so your estimate sounds right.

there isn't a flag to limit by relatedness, because that's simple with:

awk '$3 > 0.1' $prefix.pairs.tsv

I would like to limit the output to html to related pairs (and subset unrelated since nearly all pairs in large cohorts are unrelated), but that's for a later release.
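For reuse in scripts, the awk one-liner above can be wrapped so that the header line is always kept and the threshold is a parameter. This is a sketch only: `filter_pairs` is a hypothetical helper, and it assumes relatedness is in column 3 of a tab-separated pairs file, as in the one-liner.

```shell
# filter_pairs PAIRS_TSV MIN_RELATEDNESS   (hypothetical helper)
# Prints the header line plus every pair whose relatedness (column 3)
# exceeds MIN_RELATEDNESS. The "+ 0" forces a numeric comparison.
filter_pairs() {
    awk -F'\t' -v min="$2" 'NR == 1 || ($3 + 0) > (min + 0)' "$1"
}
```

For example, `filter_pairs "$prefix.pairs.tsv" 0.1 > related.tsv` would write the header and the related pairs to a new file.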

thanks again for following up. I'll get out the new release this week.

@asazonov
Author

Thanks a lot for implementing the genotype extraction!

A quick update on the relatedness calculation: n=8584; the number of variants is not very consistent – 6.5k array samples with ~5000 variants, 2k WES samples with ~15k variants.

[somalier] time to calculate all vs all relatedness for all 36838236 combinations: 252.55
out of memory
real	4m50.938s
user	3m41.378s
sys	0m40.665s

The pairs file has 36,838,237 lines so that seems correct. head/tail of the file looks ok. Not sure if there are any files affected by the out of memory 😕
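The line-count check used here follows from the number of unordered pairs, n(n-1)/2, plus one header line. A minimal sketch of that check, with hypothetical helper names `expected_pairs` and `pairs_file_complete`:

```shell
# expected_pairs N : number of unordered sample pairs, N * (N - 1) / 2
expected_pairs() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

# pairs_file_complete PAIRS_TSV N
# Succeeds if PAIRS_TSV has exactly one header line plus one line per pair.
pairs_file_complete() {
    lines=$(wc -l < "$1" | tr -d ' ')
    [ "$lines" -eq $(( $(expected_pairs "$2") + 1 )) ]
}
```

With n=8584, `expected_pairs 8584` prints 36838236, which matches the number of combinations somalier reports in the log.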

@brentp
Owner

brentp commented Oct 23, 2019

I think the only way it could be running out of memory is in the step for converting to JSON which is after the pairs file gets written. The fact that the message is printed supports that.

my set of 4,258 samples uses ~7.5GB of memory and that jumps right at the end.
This memory issue will also be fixed if I don't write the JSON for every unrelated pair to the html for large cohorts. I'll open a new issue for that.

@brentp
Owner

brentp commented Oct 24, 2019

this is out here: https://github.com/brentp/somalier/releases/tag/v0.2.5

thanks for reporting and let me know any other issues with large cohorts.

@stefanucci-luca

Hi,

I am using somalier to calculate relatedness for a cohort of 13206 samples. I am running into an OUT OF MEMORY problem as well and I would much appreciate your help.

The outputs are somalier.pairs.tsv and somalier.samples.tsv. The somalier.pairs.tsv file has 87192616 lines, which is the number of pairs I was expecting plus the header. The somalier.samples.tsv file is empty.

The command and standard output are:

somalier relate ${out_dir}/*.somalier

somalier version: 0.2.12
[somalier] starting read of 13206 samples
[somalier] time to read files and get per-sample stats for 13206 samples: 11.07
[somalier] time to get expected relatedness from pedigree graph: 0.02
[somalier] html and text output will have unrelated sample-pairs subset to 0.11% of points
[somalier] time to calculate all vs all relatedness for all 87192615 combinations: 692.50
io.nim(138)              raiseEIO
Error: unhandled exception: cannot write string to file [IOError] 

I am running the command using sbatch, and the script should do its I/O on the same node. The memory info is:

State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 32
CPU Utilized: 00:15:24
CPU Efficiency: 1.91% of 13:24:16 core-walltime
Job Wall-clock time: 00:25:08
Memory Utilized: 102.55 GB
Memory Efficiency: 27.28% of 375.94 GB

Why is stdout reporting out of memory when the script ran to the end? Do you have any suggestions?
I also tried to infer the relationships using the -i flag, but it doesn't output anything.

@brentp
Owner

brentp commented Dec 8, 2020

hi, unfortunately, dumping to json requires the most memory. The easiest way for you to get this running is to use a machine with more memory.
The concerning part is why you appear to have exit code 0 even though it obviously had an error. Was this run within some other wrapper?
If your text output has 87192615 lines (+1 header) then you can trust that data, but the html will not be valid.
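One plausible reason for an exit code of 0 despite the error is a wrapper script that does not pass on the status of the command it runs. A minimal sketch of a wrapper function that does pass it on (`run_checked` is a hypothetical name; nothing here is specific to somalier or Slurm):

```shell
# run_checked CMD [ARGS...]   (hypothetical helper)
# Runs the command, reports a failure on stderr, and returns the
# command's own exit status so the caller (or scheduler) can see it.
run_checked() {
    status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "command failed with exit status $status: $*" >&2
    fi
    return "$status"
}
```

In a batch script this could be used as `run_checked somalier relate ${out_dir}/*.somalier || exit $?`, so that the job itself fails when the command fails.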

@stefanucci-luca

I could allocate more memory with Slurm. Could you say roughly how much will be needed?

Yes, the command is in a wrapper.

@brentp
Owner

brentp commented Dec 8, 2020

looks like you're using 102GB so try 256GB to be safe or 128GB if you need a lower allocation.

@stefanucci-luca

Hello again,

I increased the memory to 256GB and it ran without problems. However, I didn't get the inferred relationships. Would they be in another file or in a column of the somalier.pairs.tsv file?

Also, what is the role of the somalier.samples.tsv file?

Thanks!

@brentp
Owner

brentp commented Dec 8, 2020

In order to get inferred relationships, you have to use an extra flag (see somalier relate --help).
But it's often best to see the html output or find pairs in the .pairs file with a low ibs0 and high ibs2.
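Picking out pairs with a low ibs0 and a high ibs2 can be sketched with awk. The helper name `candidate_related` and its thresholds are placeholders to adjust per dataset, not values recommended by somalier. Columns are looked up by the names `ibs0` and `ibs2` in the header line, on the assumption that the pairs file is tab-separated and uses those column names.

```shell
# candidate_related PAIRS_TSV MAX_IBS0 MIN_IBS2   (hypothetical helper)
# Prints the header and the pairs with ibs0 <= MAX_IBS0 and ibs2 >= MIN_IBS2.
# Column positions are looked up by name in the header line.
candidate_related() {
    awk -F'\t' -v max0="$2" -v min2="$3" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                name = $i
                sub(/^#/, "", name)            # header may start with "#"
                if (name == "ibs0") c0 = i
                if (name == "ibs2") c2 = i
            }
            if (!c0 || !c2) {
                print "ibs0/ibs2 columns not found" > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        ($c0 + 0) <= (max0 + 0) && ($c2 + 0) >= (min2 + 0)
    ' "$1"
}
```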
The sample.tsv file has sample info, like the sex het rates and the other per-sample QC metrics.

@stefanucci-luca

Sorry I haven't explained it properly, but I am using the -i flag already.

sample.tsv is empty in my case.

@brentp
Owner

brentp commented Dec 8, 2020

sample.tsv should not be empty if somalier finished running. can you verify you are looking at the correct file and that it was copied fully by whatever wrapper you are using?

@stefanucci-luca

Actually, I was confused by the COMPLETED output. The script still runs into out_of_memory even if I allocate 256GB of memory. I tried with a small subset of the samples and it runs without problems. Any suggestions?

@brentp
Owner

brentp commented Dec 10, 2020

this will be easier to debug if you run somalier directly (on an interactive node or whatever you need).
I don't think you should need even close to that much memory.
What are these somalier files extracted from? (crams/gvcfs/jointly-called-vcf)?

@stefanucci-luca

stefanucci-luca commented Dec 10, 2020

I tried on an interactive node but it runs into out_of_memory. The script outputs pairs.tsv and the HTML. It also creates samples.tsv, but it is empty. The somalier files are extracted from single VCFs.

The script to extract is:

for i in *.vcf.gz; do somalier extract -d $out_dir --sites $sites_dir/$sites_vcf -f $ref $i; done

The sites file is the one I downloaded from the repository you linked to: sites.GRCh37.vcf.gz

Running on an interactive node, the stdout is:

somalier relate -i somalier_flaghip/*.somalier
somalier version: 0.2.12
[somalier] starting read of 13206 samples
[somalier] time to read files and get per-sample stats for 13206 samples: 15.23
[somalier] time to get expected relatedness from pedigree graph: 0.02
[somalier] html and text output will have unrelated sample-pairs subset to 0.11% of points
[somalier] apparent identical twins or sample duplicate found with LP2000115-DNA_C03 and LP2000747-DNA_H10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000115-DNA_F03 and LP2000856-DNA_H03 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000115-DNA_H06 and LP2000954-DNA_G02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000115-DNA_H09 and LP2000266-DNA_D03 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000247-DNA_F12 and LP2000272-DNA_E12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000247-DNA_G07 and LP2000249-DNA_E11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000248-DNA_A04 and LP2000879-DNA_G09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000248-DNA_B01 and LP2000952-DNA_A01 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000248-DNA_C11 and LP2000269-DNA_H02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000248-DNA_D01 and LP2000882-DNA_A02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000249-DNA_A01 and LP2000865-DNA_G09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000249-DNA_B07 and LP2000254-DNA_D03 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000253-DNA_C05 and LP2000713-DNA_D11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000254-DNA_E06 and LP2000266-DNA_A01 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000254-DNA_E11 and LP2000711-DNA_H04 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000257-DNA_F09 and LP2000257-DNA_H09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000258-DNA_D02 and LP2000978-DNA_A09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000258-DNA_H12 and LP2000260-DNA_G02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000259-DNA_C06 and LP2000717-DNA_E09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000261-DNA_A05 and LP2000981-DNA_F04 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000261-DNA_A12 and LP2000952-DNA_H04 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000261-DNA_E01 and LP2000730-DNA_G07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000261-DNA_G03 and LP2000965-DNA_F08 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000261-DNA_H04 and LP2000266-DNA_C01 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000266-DNA_A06 and LP2000267-DNA_H12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000266-DNA_A11 and LP2000858-DNA_B12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000266-DNA_F01 and LP2000749-DNA_E11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000267-DNA_B05 and LP2000859-DNA_G02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000267-DNA_D09 and LP2000870-DNA_H08 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000267-DNA_G06 and LP2000803-DNA_C08 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000267-DNA_H05 and LP2000748-DNA_A10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000268-DNA_B01 and LP2000730-DNA_H11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000268-DNA_B10 and LP2000965-DNA_F07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000268-DNA_D12 and LP2000271-DNA_B10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000269-DNA_C06 and LP2000270-DNA_F11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000269-DNA_F07 and LP2000730-DNA_E05 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000269-DNA_H09 and LP2000711-DNA_D10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000271-DNA_A05 and LP2000712-DNA_E09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000274-DNA_A06 and LP2000987-DNA_B08 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000274-DNA_C08 and LP2000881-DNA_E02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000274-DNA_F04 and LP2000741-DNA_H06 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000274-DNA_F08 and LP2000858-DNA_D09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000275-DNA_H04 and LP2000867-DNA_F03 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000711-DNA_D05 and LP2000713-DNA_E10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000711-DNA_D12 and LP2000713-DNA_H08 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000711-DNA_H10 and LP2000713-DNA_D10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000713-DNA_A09 and LP2000952-DNA_A05 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000713-DNA_F07 and LP2000991-DNA_E07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000716-DNA_D03 and LP2000873-DNA_H03 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000716-DNA_D08 and LP2000746-DNA_E05 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000716-DNA_H03 and LP2000969-DNA_C02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000717-DNA_A06 and LP2000743-DNA_F07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000717-DNA_H04 and LP2000879-DNA_F01 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000719-DNA_D04 and LP2000965-DNA_B07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000720-DNA_G06 and LP2000864-DNA_B09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000720-DNA_H10 and LP2000795-DNA_E12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000732-DNA_B06 and LP2000738-DNA_G08 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000732-DNA_E04 and LP2000732-DNA_H08 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000732-DNA_F03 and LP2000740-DNA_F02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000732-DNA_G02 and LP2000854-DNA_E10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000735-DNA_E11 and LP2000863-DNA_E07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000738-DNA_H05 and LP2000881-DNA_C10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000739-DNA_E08 and LP2000965-DNA_A07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000740-DNA_C10 and LP2000967-DNA_G07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000740-DNA_H10 and LP2000800-DNA_H12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000742-DNA_B04 and LP2000870-DNA_G06 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000742-DNA_F11 and LP2000878-DNA_C12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000742-DNA_H03 and LP2000863-DNA_H06 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000743-DNA_B11 and LP2000952-DNA_B05 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000743-DNA_E06 and LP2000882-DNA_H11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000744-DNA_G07 and LP2000870-DNA_E06 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000745-DNA_B04 and LP2000794-DNA_A06 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000749-DNA_A06 and LP2000864-DNA_B10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000749-DNA_E02 and LP2000864-DNA_B02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000749-DNA_H07 and LP2000859-DNA_E06 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000789-DNA_D09 and LP2000790-DNA_H10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000790-DNA_A09 and LP2000791-DNA_G01 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000794-DNA_G04 and LP2000965-DNA_C07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000795-DNA_B06 and LP2000795-DNA_C06 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000795-DNA_C09 and LP2000864-DNA_A11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000795-DNA_D06 and LP2000799-DNA_F10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000795-DNA_D09 and LP2000864-DNA_A10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000795-DNA_D10 and LP2000876-DNA_E11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000795-DNA_F08 and LP2000870-DNA_F09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000795-DNA_G08 and LP2000864-DNA_A07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000798-DNA_C11 and LP2000879-DNA_A09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000800-DNA_A10 and LP2000807-DNA_F12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000802-DNA_A08 and LP2000802-DNA_F01 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000803-DNA_A10 and LP2000858-DNA_F07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000803-DNA_G03 and LP2000978-DNA_G02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000804-DNA_E06 and LP2000977-DNA_E02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000804-DNA_G07 and LP2000982-DNA_B01 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000805-DNA_B10 and LP2000870-DNA_A09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000807-DNA_A04 and LP2000870-DNA_A06 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000807-DNA_D04 and LP2000953-DNA_C04 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000807-DNA_F05 and LP2000874-DNA_H07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000807-DNA_G12 and LP2000860-DNA_B10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000854-DNA_D03 and LP2000854-DNA_E03 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000858-DNA_G03 and LP2000870-DNA_F02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000860-DNA_F03 and LP2000861-DNA_H12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000862-DNA_B12 and LP2000875-DNA_C01 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000863-DNA_A05 and LP2000869-DNA_A10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000863-DNA_B05 and LP2000870-DNA_F06 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000863-DNA_E08 and LP2000864-DNA_B07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000863-DNA_F06 and LP2000881-DNA_E12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000863-DNA_G05 and LP2000869-DNA_G10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000863-DNA_H05 and LP2000874-DNA_F07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000863-DNA_H05 and LP2000879-DNA_B03 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000865-DNA_C03 and LP2000882-DNA_B12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000870-DNA_F01 and LP2000870-DNA_G12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000871-DNA_H07 and LP2000968-DNA_E09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000872-DNA_D09 and LP2000954-DNA_A09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000872-DNA_F09 and LP2000877-DNA_E08 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000872-DNA_G10 and LP2000876-DNA_F11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000874-DNA_D02 and LP2000952-DNA_A11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000874-DNA_F05 and LP2000955-DNA_C10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000874-DNA_F07 and LP2000879-DNA_B03 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000875-DNA_B07 and LP2000952-DNA_G06 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000879-DNA_A08 and LP2000955-DNA_D08 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000879-DNA_F03 and LP2000953-DNA_H02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000879-DNA_G03 and LP2000982-DNA_D09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000880-DNA_C07 and LP2000978-DNA_F09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000881-DNA_C09 and LP2000954-DNA_E11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000881-DNA_D10 and LP2000954-DNA_D12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000881-DNA_E10 and LP2000982-DNA_G01 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000882-DNA_A04 and LP2000982-DNA_E09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000882-DNA_G03 and LP2000982-DNA_A09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000882-DNA_H03 and LP2000982-DNA_E02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000951-DNA_A07 and LP2000982-DNA_C08 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000951-DNA_A08 and LP2000970-DNA_F01 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000951-DNA_C07 and LP2000972-DNA_D12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000951-DNA_H07 and LP2000952-DNA_F07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000952-DNA_E06 and LP2000979-DNA_D03 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000953-DNA_B03 and LP2000954-DNA_E08 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000953-DNA_C03 and LP2000982-DNA_B07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000953-DNA_D03 and LP2000954-DNA_H11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000953-DNA_D04 and LP2000982-DNA_E01 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000954-DNA_G08 and LP2000982-DNA_C09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000955-DNA_D10 and LP2000982-DNA_B03 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000966-DNA_H11 and LP2000982-DNA_F08 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000967-DNA_B07 and LP2000988-DNA_F05 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000968-DNA_C09 and LP2000968-DNA_G06 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000969-DNA_G09 and LP2000970-DNA_D12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000972-DNA_E11 and LP2000982-DNA_G07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000977-DNA_H05 and LP2000977-DNA_H06 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000979-DNA_B02 and LP2000986-DNA_G09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000982-DNA_A04 and LP2000982-DNA_F02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000982-DNA_D02 and LP2000986-DNA_F09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000983-DNA_E12 and LP2000983-DNA_G12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000984-DNA_A11 and LP2000984-DNA_G10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000987-DNA_A02 and LP2000987-DNA_G02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000987-DNA_A12 and LP2000987-DNA_H11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000987-DNA_B12 and LP2000987-DNA_C12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000989-DNA_C08 and LP2000991-DNA_F11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000989-DNA_E03 and LP2000989-DNA_H01 NOT assuming siblings
[somalier] time to calculate all vs all relatedness for all 87192615 combinations: 864.14
io.nim(138)              raiseEIO
Error: unhandled exception: cannot write string to file [IOError]

@brentp
Owner

brentp commented Dec 10, 2020

ok. so i think the problem is something with the somalier files. How were they extracted? Somalier thinks they are all related or identical so it can't subset the html output.

@stefanucci-luca

stefanucci-luca commented Dec 10, 2020

I use the somalier extract command:
somalier extract -d $out_dir --sites $sites_dir/$sites_vcf -f $ref $i

It is in a for loop and the stdout is:

somalier version: 0.2.12
[somalier] found 10754 sites
somalier version: 0.2.12
[somalier] found 10764 sites
somalier version: 0.2.12
[somalier] found 10650 sites
somalier version: 0.2.12
[somalier] found 10726 sites
somalier version: 0.2.12
[somalier] found 10680 sites
somalier version: 0.2.12
...

@brentp
Owner

brentp commented Dec 10, 2020

ok. that's the worst way to run somalier. better to run it on a jointly called VCF. But if you do run extract that way, then you need to run somalier relate with the -u flag.

@stefanucci-luca

I tried to run it from the aggregated file but I get a problem which I believe is caused by the VEP annotations in the INFO field. It doesn't find any site.

somalier version: 0.2.12
[somalier] FORMAT field 'AD' not found for depth information. using genotype only
[somalier] found 0 sites

@brentp
Owner

brentp commented Dec 10, 2020

This is not because of VEP, it's because the caller did not add an AD field for each sample. what caller did you use? The AD field is pretty standard.

@stefanucci-luca

That bit wasn't done by me, but I can try to find out if you think it can help.

I tried with the -u flag but still get 'out of memory':

....
[somalier] apparent identical twins or sample duplicate found with LP2000972-DNA_E11 and LP2000982-DNA_G07 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000977-DNA_H05 and LP2000977-DNA_H06 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000979-DNA_B02 and LP2000986-DNA_G09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000982-DNA_A04 and LP2000982-DNA_F02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000982-DNA_D02 and LP2000986-DNA_F09 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000983-DNA_E12 and LP2000983-DNA_G12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000984-DNA_A11 and LP2000984-DNA_G10 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000987-DNA_A02 and LP2000987-DNA_G02 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000987-DNA_A12 and LP2000987-DNA_H11 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000987-DNA_B12 and LP2000987-DNA_C12 NOT assuming siblings
[somalier] apparent identical twins or sample duplicate found with LP2000989-DNA_C08 and LP2000991-DNA_F11 NOT assuming siblings
[somalier] time to calculate all vs all relatedness for all 87192615 combinations: 809.16
io.nim(138)              raiseEIO
Error: unhandled exception: cannot write string to file [IOError]
[1]+  Exit 1                  somalier relate -u -i somalier_flaghip/*.somalier

@brentp
Owner

brentp commented Dec 10, 2020

can you show what format fields are available in your VCF?
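A quick way to list the FORMAT keys is to read column 9 of the first data record. This is a sketch, not part of somalier: `format_fields` and `has_format_field` are hypothetical helpers, and they assume an uncompressed VCF (for a `.vcf.gz`, pipe `gzip -dc file.vcf.gz` into `format_fields -`).

```shell
# format_fields VCF   (hypothetical helper)
# Prints the FORMAT keys (column 9) of the first data record, one per line.
# Pass "-" to read an uncompressed VCF from standard input.
format_fields() {
    awk -F'\t' '
        !/^#/ {
            n = split($9, keys, ":")
            for (i = 1; i <= n; i++) print keys[i]
            exit                               # only the first record is inspected
        }
    ' "$1"
}

# has_format_field VCF KEY : succeeds if KEY is among the FORMAT keys.
has_format_field() {
    format_fields "$1" | grep -qxF "$2"
}
```

For example, `has_format_field sample.vcf AD || echo "no AD field"` would flag a file for which somalier falls back to genotypes only.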

so that doesn't look like a memory error. are you sure you have enough disk space to write?

@stefanucci-luca

In the single VCFs the format has GT:GQ:GQX:DP:DPF:AD. In the pVCF the format is empty.

I don't think it is a disk space problem:

free -h
              total        used        free      shared  buff/cache   available
Mem:           1.5T        235G        1.2T        217M        9.4G        1.2T
Swap:           15G          0B         15G
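Note that `free -h` reports RAM and swap, not disk space. To check the space available on the filesystem that holds the output directory, `df` is the usual tool. A minimal sketch, with hypothetical helper names `free_kb` and `enough_space`:

```shell
# free_kb DIR   (hypothetical helper)
# Prints the available space, in 1024-byte blocks, on the filesystem holding DIR.
# -P selects the portable one-line-per-filesystem output; column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# enough_space DIR NEEDED_KB : succeeds if at least NEEDED_KB blocks are available.
enough_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}
```

Running `df -h .` in the output directory gives the same information in human-readable form. Quotas on shared filesystems are not visible to `df` and would need to be checked separately.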
