Very large SNP report file size #41

Open
Jirapong opened this Issue Sep 18, 2012 · 5 comments

@Jirapong

I'm trying to run FreeBayes on a human genomic sample. Here is my command:

freebayes --no-indels --no-mnps --no-complex  -v snp0.vcf -p 2 -f Homo_sapiens.nucleotide.fa -b sample1.bam

Here is an excerpt of the BAM file, shown in SAM format:

HWI-ST313_0162:7:46:20191:40694#CTTGTA  177 Y   2786195 0   100M    6   73049336    0   CATAGTATTCCATGGTGTATATGTGCCACATTTTCTTCATCCAGTCTATCATTGNTGGACATTTGGGTTGGTTCCAAGTCTTTGCTATTGTGAATAGTGC    ggffedccb_fcadefdcfdfceaedS`ddehdafgfggdgcggddddfd]]]]BbT`]bgaggbggggggeggggegggggggggffgggfgggggggg    XT:A:R  NM:i:1  SM:i:0  AM:i:0  X0:i:38 XM:i:1  XO:i:0  XG:i:0  MD:Z:54T45
HWI-ST313_0162:7:63:14158:88065#CTTGTA  113 Y   2925844 0   100M    =   2925844 0   AAAGGAGGCATCTCAAAGGAAATGGAATTTAGTTGAGCTGAAGGATAAGAATTAGATTGCACTGTATTAAAAGTTGGTGAAGGGCTTCCCAGGCAAAGAA    gdgggcedegfacfcggggggcgggefecffggggfdgeggfgggegggggggffgeeggcgggeggfggggggffgffegfgfggbggggggggggfgf    XT:A:R  NM:i:71 SM:i:0  AM:i:0  X0:i:2  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:0T1T1T0T1C2C0A0G0A1C0T0T0T0T0G0C0A0A0A0T0C1G2T1A0T2A0A0C0T0T0A0A0C0A0C1T0T0G1A0G0T0T0A1T0T0G0T0A1A0C0G2T0T0T2C0T0C1T0T0C0A1A0T1G0A0G1T0A0C2G3G0    XA:Z:X,-88657279,100M,1;
HWI-ST313_0162:7:63:14158:88065#CTTGTA  177 Y   2925844 0   100M    =   2925844 0   TATGTTGCCACAGAACTTTTGCAAATCTGTGTTATAGAACTTAACACATTGTAGTTATTTGTAGACGTATTTGTCTCTTTCAGATTGAGCTACCAGAGAG    eeeeea]acdafffdf\dfdefeefWddfdcgeedfgggggdgggb\ccagec_dgaggeggggegffeggeggbggggggggggggfgggggegggggg    XT:A:R  NM:i:1  SM:i:0  AM:i:0  X0:i:2  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:30A69  XA:Z:X,-88657279,100M,1;
HWI-ST313_0162:7:63:12456:81142#CTTGTA  65  Y   2925864 23  100M    =   2925864 0   TGTCACATATTAACTAAGAAACTACTATCTCTTAATCATACTGATGAGTGCTTCATTCAAGAAATATATAAATGACTAGAATGTCACCCGTTCTCTCTGG    a_ba_acccceadd_abcbafcfeffce__`d_ddfffcf__b_`dddQdZccVbadddaPSPUV_acbc\`_aacdbd\eece_dddcadedaefeffB    XT:A:U  NM:i:76 SM:i:23 AM:i:23 X0:i:1  X1:i:1  XM:i:1  XO:i:0  XG:i:0  MD:Z:0G0C0A0A1T0C1G1A0T0T0A2G0A1C0T0T0A2A0C0A0T1G1A0G0T1A0T1T0G1A0G0A0C0G0T0A0T0T1G0T0C1C0T0T0T0C1G2T0G1G0C0T1C0C1G0A0G0A0G1A0C0G0G0G0T0G0A0C0A1T0C0T0A0G1C0A0  XA:Z:X,+88657299,100M,2;
HWI-ST313_0162:7:63:12456:81142#CTTGTA  129 Y   2925864 23  100M    =   2925864 0   GCAAATCTGTGTTATAGAACTTAACACATTGTAGTTATTTGTAGACGTATTTGTCTCTTTCAGATTGAGCTACCAGAGAGAACGGGTGACATTCTAGTCA    fdddaafdfadddadfefeffdfcfedggeeb[f^dadad^c`a^^a]`acccb^dedeegfgedd_d_dffZfV[_`_^QSY]]aJ_BBBBBBBBBBBB    XT:A:U  NM:i:1  SM:i:23 AM:i:23 X0:i:1  X1:i:1  XM:i:1  XO:i:0  XG:i:0  MD:Z:10A89  XA:Z:X,+88657299,100M,2;
HWI-ST313_0162:7:8:15102:17763#CTTGTA   177 Y   2926340 0   100M    X   88657770    0   GACTGGATTAGAGAATATTAATATTCTAGAAAATAACAAGCTTATGACAGGAATACTATATCAGAGTCAAGAGAAAACAAAAGTATAGGTAAAGACTGAA    abe^eef\cffgagfcggggggfeggffgffgfdgegggdeggdgfffff]bfebggfge^ggfggggggggggggegegggeggffgggfgggcfffhg    XT:A:R  NM:i:0  SM:i:0  AM:i:0  X0:i:2  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:100    XA:Z:X,-88657770,100M,0;
HWI-ST313_0162:7:28:12949:47769#CTTGTA  113 Y   2926340 0   100M    =   2926340 0   CCGTAATTACTTTCCTCCACACTTGACTGACTGATCTCATTTCACCATTCTTGTAGCCTCATAAATTTCACCTAGGCCTTACATAGGAATATTTATTTGA    TbTabdad^d__cddZfdbfeefdfZdcc`^cccbUTgedddeacgee`gdggaeegeggggeggecffbfeegggggfgdggfggggggggfggecggg    XT:A:R  NM:i:72 SM:i:0  AM:i:0  X0:i:2  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:0G0A0C1G0G0A1T0A0G0A0G0A0A1A0T0T0A1T0A1T0C0T0A2A0A0A0T0A0A0C0A1G0C1T1T0G1C0A0G0G0A0A2C0T0A1A0T0C1G1G1C0A1G0A0G1A0A0A1A0A1A0G2T0A0G0G2A0A0G1C1G0A1  XA:Z:X,-88657770,100M,0;
HWI-ST313_0162:7:28:12949:47769#CTTGTA  177 Y   2926340 0   100M    =   2926340 0   GACTGGATTAGAGAATATTAATATTCTAGAAAATAACAAGCTTATGACAGGAATACTATATCAGAGTCAAGAGAAAACAAAAGTATAGGTAAAGACTGAA    ccd\dccc_bgfffgeeefedNhedggcdggdfgfgegggfg_dgffeeeeeeebdffgedgcggggffgggggggdfgggfggg_fffcffffefdfcf    XT:A:R  NM:i:0  SM:i:0  AM:i:0  X0:i:2  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:100    XA:Z:X,-88657770,100M,0;
HWI-ST313_0162:7:65:13614:56621#CTTGTA  113 Y   2928080 0   100M    =   2928080 0   GTAATTTATGTCTGATGTACATTGGCAGTCATCTAATTTTCTTTTTTGTGCTTTTTGTTTATCTGCCTGAATGAGAACCACCTTTTATGATCAAGTGAAT    _d^dda\eabdddaddZddadbddNfffdceeaeZYd`ddf_^Qfeeceed_fffdccedc`c_ccfcffeeeeeddcebffffffffffeeeeececca    XT:A:R  NM:i:74 SM:i:0  AM:i:0  X0:i:2  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:1C1T0A0G0A1C0T1A0A0T0T2G0T0A0G2T0C0T0T0C1G0G0G0T0C0C0T0A0G0G0A0A1A0A0A0G0A0A3A0C1G0A0T0A2C0A0T2A0T0G0C0T0C0A0C0T0A0C0T0A0G1T0T2A0A1G0C1G0A0C0T0A1T1C1  XA:Z:X,-88659510,100M,1;
HWI-ST313_0162:7:65:13614:56621#CTTGTA  177 Y   2928080 0   100M    =   2928080 0   GCATAGAACTTAATGTGGTAGTTTCTTCTGGGTCCTAGGAATAAAGAATGCACTGATATTCATTGATGCTCACTACTAGATTTTAAAGCAGACTATTACT    Bdddc^aYcb`^Z^JddadadedUfeWeceeefefffbcfcfcfggggggfeddfgfggdggaggdddaa^ccc_eeeeedfcageefffe_f_hdgggc    XT:A:R  NM:i:1  SM:i:0  AM:i:0  X0:i:2  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:14T85  XA:Z:X,-88659510,100M,1;

In the resulting VCF, every position appears to be called as a SNP:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  unknown
Y   3044167 .   G   T   14.6236 .   AB=0.25;ABP=5.18177;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1X;DP=4;DPRA=0;EPP=5.18177;EPPR=3.0103;HWE=-0;LEN=1;MEANALT=2;MQM=37;MQMR=37;NS=1;NUMALT=1;ODDS=0;PAIRED=0;PAIREDR=0;RO=2;RPP=5.18177;RPPR=3.0103;RUN=1;SAP=5.18177;SRP=7.35324;TYPE=snp;XAI=0;XAM=0.81;XAS=0.81;XRI=0;XRM=0.4;XRS=0.4;BVAR   GT:DP:RO:QR:AO:QA:GL    0/1:4:2:90:1:33:-6.27,-3.72597,-11.48
Y   3044168 .   G   T   13.2124 .   AB=0.25;ABP=5.18177;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1X;DP=4;DPRA=0;EPP=5.18177;EPPR=3.73412;HWE=-0;LEN=1;MEANALT=1;MQM=37;MQMR=37;NS=1;NUMALT=1;ODDS=2.99336;PAIRED=0;PAIREDR=0;RO=3;RPP=5.18177;RPPR=3.73412;RUN=1;SAP=5.18177;SRP=9.52472;TYPE=snp;XAI=0;XAM=0.79;XAS=0.79;XRI=0;XRM=0.276667;XRS=0.276667;BVAR   GT:DP:RO:QR:AO:QA:GL    0/1:4:3:133:1:33:-3.3,-0.60206,-12.4133
Y   3044169 .   T   G   13.2124 .   AB=0.25;ABP=5.18177;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1X;DP=4;DPRA=0;EPP=5.18177;EPPR=3.73412;HWE=-0;LEN=1;MEANALT=1;MQM=37;MQMR=37;NS=1;NUMALT=1;ODDS=2.99336;PAIRED=0;PAIREDR=0;RO=3;RPP=5.18177;RPPR=3.73412;RUN=1;SAP=5.18177;SRP=9.52472;TYPE=snp;XAI=0;XAM=0.79;XAS=0.79;XRI=0;XRM=0.276667;XRS=0.276667;BVAR   GT:DP:RO:QR:AO:QA:GL    0/1:4:3:135:1:33:-3.3,-0.60206,-12.6
Y   3044170 .   A   T   13.2124 .   AB=0.25;ABP=5.18177;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1X;DP=4;DPRA=0;EPP=5.18177;EPPR=3.73412;HWE=-0;LEN=1;MEANALT=1;MQM=37;MQMR=37;NS=1;NUMALT=1;ODDS=2.99336;PAIRED=0;PAIREDR=0;RO=3;RPP=5.18177;RPPR=3.73412;RUN=1;SAP=5.18177;SRP=9.52472;TYPE=snp;XAI=0;XAM=0.81;XAS=0.81;XRI=0;XRM=0.27;XRS=0.27;BVAR   GT:DP:RO:QR:AO:QA:GL    0/1:4:3:162:1:33:-3.3,-0.60206,-15.12
Y   3044171 .   T   G   14.6236 .   AB=0.25;ABP=5.18177;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1X;DP=4;DPRA=0;EPP=5.18177;EPPR=7.35324;HWE=-0;LEN=1;MEANALT=2;MQM=37;MQMR=37;NS=1;NUMALT=1;ODDS=0;PAIRED=0;PAIREDR=0;RO=2;RPP=5.18177;RPPR=7.35324;RUN=1;SAP=5.18177;SRP=7.35324;TYPE=snp;XAI=0;XAM=0.79;XAS=0.79;XRI=0;XRM=0.005;XRS=0.005;BVAR   GT:DP:RO:QR:AO:QA:GL    0/1:4:2:130:1:33:-6.27,-3.72597,-15.2133
Y   3044172 .   C   G   14.6236 .   AB=0.25;ABP=5.18177;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1X;DP=4;DPRA=0;EPP=5.18177;EPPR=7.35324;HWE=-0;LEN=1;MEANALT=2;MQM=37;MQMR=37;NS=1;NUMALT=1;ODDS=0;PAIRED=0;PAIREDR=0;RO=2;RPP=5.18177;RPPR=7.35324;RUN=1;SAP=5.18177;SRP=7.35324;TYPE=snp;XAI=0;XAM=0.81;XAS=0.81;XRI=0;XRM=0.005;XRS=0.005;BVAR   GT:DP:RO:QR:AO:QA:GL    0/1:4:2:125:1:33:-6.27,-3.72597,-14.7467

The final output file is larger than 100 GB (my BAM file is 8 GB). Any advice?

@ekg
Owner
ekg commented Feb 1, 2013

Would you please try this with the most recent version of freebayes? It's possible that there were problems with the default filters in this version.

Beyond that, my first question is: are you sure this is the right reference?

And I also wonder what the effect of --no-indels --no-mnps --no-complex would be. I haven't tested this functionality in a long time because I've focused on detecting all variant classes in a completely integrated way. If you want just SNPs, you can always filter after the fact. If ~/freebayes is your freebayes source tree:

~/freebayes/vcflib/vcffilter -f "TYPE = snp" results.vcf

Does this help?
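On the reference question above: in the SAM excerpt from the original post, four of the ten reads shown carry an NM (edit distance) tag above 70 on a 100M alignment, while the others have NM of 0 or 1. A quick way to see how widespread that is across the whole file is to tally NM values. The following is a minimal Python sketch of that check, not part of freebayes; the function names and the `max_nm=5` threshold are illustrative assumptions, and a high fraction only suggests that the alignments or the reference deserve a closer look.

```python
def edit_distances(sam_lines):
    """Yield (read_name, NM) for each alignment line that carries an NM:i: tag."""
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):  # skip blank lines and SAM header lines
            continue
        fields = line.split("\t")  # SAM fields are tab-separated
        for tag in fields[11:]:    # optional tags start at the 12th column
            if tag.startswith("NM:i:"):
                yield fields[0], int(tag[5:])
                break


def high_mismatch_fraction(sam_lines, max_nm=5):
    """Fraction of NM-tagged reads whose edit distance exceeds max_nm (assumed threshold)."""
    nms = [nm for _, nm in edit_distances(sam_lines)]
    if not nms:
        return 0.0
    return sum(nm > max_nm for nm in nms) / len(nms)
```

To use it, pass any iterable of SAM text lines, for example the lines produced by `samtools view sample1.bam` captured to a file and opened with `open(...)`.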

@gpharhay

Hi:

I tried:

vcffilter -f "TYPE = snp" results.vcf

to reduce a freebayes results.vcf file to just SNPs. Most of the non-SNP records were removed, but not all. It appears that vcffilter isn't functioning properly.

@ekg
Owner
ekg commented Oct 2, 2013

I need another me to write documentation....

Here is how vcffilter works. If you supply "-f" it keeps only alleles which match the following boolean expression of numbers, strings, and INFO field values. The same pattern applies for "-g" but only for genotypes.

In this case, I suggested you get all SNPs. In freebayes, a pure SNP will be marked as having CIGAR=1X. Thus, vcffilter -f "CIGAR = 1X" will keep all such alleles. The "language" used by vcffilter is pretty bare-bones. You have to have spaces between all symbols, including parentheses. The not operator ! applies to the entire expression following it, so to invert the above you'd do vcffilter -f "! ( CIGAR = 1X )".
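To make the per-allele idea concrete, here is a minimal Python sketch of the same selection done by hand. It is not vcffilter's implementation: the function names are hypothetical, it assumes tab-separated VCF records with one comma-separated CIGAR entry per ALT allele, and it keeps a whole record when any allele matches rather than stripping the non-matching alleles.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flag-only keys (e.g. BVAR) map to True."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def alleles_matching(vcf_line, want_pure_snp=True):
    """Return the ALT alleles whose CIGAR is exactly 1X (or is not, if want_pure_snp=False)."""
    fields = vcf_line.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    cigars = str(parse_info(fields[7]).get("CIGAR", "")).split(",")
    return [alt for alt, cigar in zip(alts, cigars)
            if (cigar == "1X") == want_pure_snp]


def filter_records(vcf_lines, want_pure_snp=True):
    """Yield header lines plus every record with at least one matching allele."""
    for line in vcf_lines:
        if line.startswith("#") or alleles_matching(line, want_pure_snp):
            yield line
```

Calling `filter_records(lines)` corresponds loosely to the "CIGAR = 1X" expression, and `filter_records(lines, want_pure_snp=False)` to the inverted "! ( CIGAR = 1X )" form. Because this sketch does not remove individual alleles, a multi-allelic record with one SNP allele and one non-SNP allele is kept in both cases.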
