Output from truvari bench command #147

priyambial123 · 2023-03-06T14:54:20Z

Hello

I used pbsv tool (https://github.com/PacificBiosciences/pbsv) to do structural variant calling for the HG002 dataset using GRCh38 reference genome build. I downloaded the benchmark HG002 vcf dataset from PacBio link to see if the variants called using the pbsv tool were also found in the downloaded benchmark HG002 vcf dataset. For this purpose, I used the truvari --bench command.

After running, I got this output:

Zipped 101309 variants Counter({'base': 51064, 'comp': 50245})
28244 chunks of 101309 variants Counter({'__filtered': 47625, 'comp': 31471, 'base': 22213})
2023-03-06 15:04:28,877 [INFO] Stats: {
    "TP-base": 20092,
    "TP-call": 20092,
    "FP": 1255,
    "FN": 2121,
    "precision": 0.9412095376399494,
    "recall": 0.9045153738801602,
    "f1": 0.9224977043158861,
    "base cnt": 22213,
    "call cnt": 21347,
    "TP-call_TP-gt": 19000,
    "TP-call_FP-gt": 1092,
    "TP-base_TP-gt": 19000,
    "TP-base_FP-gt": 1092,
    "gt_concordance": 0.9456500099542107,
    "gt_matrix": {
        "(0, 1)": {
            "(0, 1)": 11814,
            "(1, 1)": 664
        },
        "(1, 1)": {
            "(1, 1)": 7186,
            "(0, 1)": 428
        }
    }
}

What is the filtered number of variants?. In the above output, it is given as 47625.
What is call_cnt?
Can you explain the gt_matrix?

Thank you

The text was updated successfully, but these errors were encountered:

ACEnglish · 2023-03-06T15:24:42Z

Details on the summmary.json can be found here. The '__filtered' counts are calls inside the vcf which weren't compared due to parameters like --passonly and --sizemin.

priyambial123 · 2023-03-07T10:13:43Z

From the log file, I could see that I have used the default parameters.

2023-03-06 15:36:21,200 [INFO] Running /benchmark/pbsv/HG002.m84005_220919_232112_s2.GRCh38.pbsv.vcf.gz -c /pbsv/pbsv.vcf.gz -o /pbsv/output_ref --reference /reference/human_GRCh38_no_alt_analysis_set.fasta
2023-03-06 15:36:21,200 [INFO] Params:
{
    "base": "/scicore/home/cichon/GROUP/temp_pacbio/samples/benchmark/pbsv/HG002.m84005_220919_232112_s2.GRCh38.pbsv.vcf.gz",
    "comp": "//pbsv/benchmark.GRCh38.pbsv.vcf.gz",
    "output": "benchmark/pbsv/output_ref",
    "reference": "reference/human_GRCh38_no_alt_analysis_set.fasta",
    "giabreport": false,
    "debug": false,
    "prog": false,
    "refdist": 500,
    "unroll": false,
    "pctsim": 0.7,
    "minhaplen": 50,
    "pctsize": 0.7,
    "pctovl": 0.0,
    "typeignore": false,
    "dup_to_ins": false,
    "use_lev": false,
    "chunksize": 1000,
    "gtcomp": false,
    "bSample": null,
    "cSample": null,
    "sizemin": 50,
    "sizefilt": 30,
    "sizemax": 50000,
    "passonly": false,
    "no_ref": false,
    "includebed": null,
    "extend": 0,
    "multimatch": false
}
2023-03-06 15:36:21,201 [INFO] Truvari version: 3.5.0
2023-03-06 15:36:42,825 [INFO] Zipped 101309 variants Counter({'base': 51064, 'comp': 50245})
2023-03-06 15:36:42,825 [INFO] 28244 chunks of 101309 variants Counter({'__filtered': 47625, 'comp': 31471, 'base': 22213})
2023-03-06 15:36:42,827 [INFO] Stats: {
    "TP-base": 20064,
    "TP-call": 20064,
    "FP": 1283,
    "FN": 2149,
    "precision": 0.9398978779219562,
    "recall": 0.9032548507630667,
    "f1": 0.9212121212121211,
    "base cnt": 22213,
    "call cnt": 21347,
    "TP-call_TP-gt": 18974,
    "TP-call_FP-gt": 1090,
    "TP-base_TP-gt": 18974,
    "TP-base_FP-gt": 1090,
    "gt_concordance": 0.9456738437001595,
    "gt_matrix": {
        "(0, 1)": {
            "(0, 1)": 11807,
            "(1, 1)": 663
        },
        "(1, 1)": {
            "(1, 1)": 7167,
            "(0, 1)": 427
        }

So, using bcftools, I filtered the variants below 30 bp from comparison vcf dataset and there were 18677 variants

bcftools view -i 'INFO/SVLEN<30 & INFO/SVLEN>-30' "s/benchmark/pbsv/pbsv_vcf.gz" > "/samples/benchmark/pbsv/filter_pbsv.vcf"

Similarly, I filtered the variants below 50 bp and ended with vcf dataset of 28838 variants in the benchmark dataset.
bcftools view -i 'INFO/SVLEN<50 & INFO/SVLEN>-50' "s/benchmark/pbsv/HG002.m84005_220919_232112_s2.GRCh38.pbsv.vcf.gz" > "/samples/benchmark/pbsv/PBSV.vcf"

When I exclude these variants from 51064-28838=22226 in base data and 50245-18677=31568 in comparison data. But, I see from the filter applied in the log file, these numbers are different. It is 22213 number of variants in the base vcf and 31471 in comparison vcf
Why am I getting different numbers? Please help

Thank you

ACEnglish · 2023-03-07T15:25:11Z

Without access to the files, I cannot say for certain what might be happening. Unfortunately, all I can do is explain what the code is doing:

The 'chunker' is the first method to manipulate the data-flow. See here.
It is responsible for counting how many base, comp, and __filtered variants with the call_counts = Counter() variable which is logged at the end of the method.

On line 299, you can see the chunk will add monomorphic reference sites to the __filtered variants which wouldn't have been caught by the bcftools query. Monomorphic reference sites are loci without an ALT allele generally found in gVCFs. They give rise to None in pysam.VariantRecord.alts which causes problem. That in conjunction with the fact that they're not technically variants means that truvari filters them. All the other __filtered calls are added when the conditional if not matcher.filter_call evaluates to false. This method will calculate the entry size and compare it to the parameters sizemax (which I don't see accounted for in your bcftools commands), sizemin, sizefilt. Default parameters are kept here and variant size calculation is documented inside the method here.

priyambial123 · 2023-03-27T09:08:24Z

Why there is a reduction in call count in the comp file as shown in above output. For example, in the above output, comp is 31471 and changes to call-count as 21347. But the base-count remains the same.

Thank you

ACEnglish · 2023-03-27T15:05:09Z

This is due to the difference between --sizemin and --sizefilt. See https://github.com/ACEnglish/truvari/wiki/bench#--sizemin-and---sizefilt

priyambial123 · 2023-03-27T16:14:00Z

I understand that these two parameters are applied and we have reduced number of variants to analyse, here it is 31471 variants compared to base variants, 22313. In the genotype concordance step, again --sizemin and --sizefilt is applied?. If so, why. The number of calls would be the same when comparing the number of variants to the truth set as well as when estimating the genotype concordance between truth set and comp set. Why there is reduction in the number of variants at these two steps?.

ACEnglish · 2023-03-27T16:27:03Z

The total number of genotypes compared is 20,064. Genotype concordance is its own measurement different from precison/recall/etc. Think of it as two thing being measured - how many variants are matching, how many variants have the same genotype. In the code we first count the base/comp calls and second, for the calls which have a match, we compare their genotypes. FN/FP variants don't have a counterpart genotype to which they can be compared. Their only option is to count every FN/FP as having an incorrect genotype, which would increase the genotype concordance denominator in the formula from (matching-genotypes / total-matches) to (matching-genotypes / (total-matches + FN_count + FP_count) and thus lower the genotype concordance. Since we want genotype concordance to be an estimate of how correct the caller's genotyping is, we don't want the assumption in the second calculation which is that every FN/FP variant has an incorrect genotype.

For details on the output summary's calculations, see https://github.com/ACEnglish/truvari/blob/develop/truvari/bench.py#LL228C18-L228C18 and https://github.com/ACEnglish/truvari/blob/develop/truvari/utils.py#L380

priyambial123 · 2023-03-27T17:21:01Z

Thank you for the vivid explanation and it is clear from the script. I just want to clarify one last thing. Initially there were 31471 variant calls and among these 20064 were true positives and 1283 were false positives. False positives are calls present in base (truth) data and not in comp data. Is it right?

ACEnglish · 2023-03-27T17:27:34Z

This is incorrect. Base or 'gold standard' or truth counts are TP-base, False Negative. Comparison or test counts are TP-comp, False Positive.

So a 'known truth' call missing from 'test' calls is falsely negative. A 'test' call errantly present is falsely positive.

See https://en.wikipedia.org/wiki/Sensitivity_and_specificity for even more details.

priyambial123 · 2023-03-27T17:40:37Z

Thank you, it makes sense to me. Initially there were 31471 test calls and among which 20064 are TP and 1283 are FP calls. So, what about the remaining calls, 31471-21347=10,034 calls.

ACEnglish · 2023-03-27T17:45:20Z

See #147 (comment) 😄

priyambial123 · 2023-03-28T06:54:26Z

Okay. I understand after reading that calls get filtered based on size min and PASS parameters. Comp calls were 31471 and call-cnt is 21347. 31471 calls were from applying the filters to the initial 50245 calls (size-min and size-filt) . 21347 call cnt was true positives and false positives (not matched with base and calls matched with base). Remaining calls (10034) were not called as they did meet the parameters such as pctseq, pctsize. Hope I understood it right this time

Thank you

ACEnglish · 2023-03-28T13:28:44Z

That sounds correct.

The other, simpler answer, is to ignore the numbers printed separately on {"base".."comp".."_filtered"}. It is a log line more for debugging than for analysis. Everything that is useful for analysis is written to summary.json.

To summarize the summary.json:

The "base cnt" represents variants over --sizemin (and other filters).
The "comp cnt" is all variants over --sizemin (and other filters) as well some variants over --sizefilt
The rest of the numbers (TP/FN/FP) are subsets of "base cnt" / "comp cnt"
Genotype counts are only on TP-base variants and their highest matching TP-comp variant

ACEnglish closed this as completed Mar 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Output from truvari bench command #147

Output from truvari bench command #147

priyambial123 commented Mar 6, 2023

ACEnglish commented Mar 6, 2023

priyambial123 commented Mar 7, 2023

ACEnglish commented Mar 7, 2023 •

edited

Loading

priyambial123 commented Mar 27, 2023

ACEnglish commented Mar 27, 2023

priyambial123 commented Mar 27, 2023

ACEnglish commented Mar 27, 2023 •

edited

Loading

priyambial123 commented Mar 27, 2023

ACEnglish commented Mar 27, 2023

priyambial123 commented Mar 27, 2023

ACEnglish commented Mar 27, 2023

priyambial123 commented Mar 28, 2023 •

edited

Loading

ACEnglish commented Mar 28, 2023

Output from truvari bench command #147

Output from truvari bench command #147

Comments

priyambial123 commented Mar 6, 2023

ACEnglish commented Mar 6, 2023

priyambial123 commented Mar 7, 2023

ACEnglish commented Mar 7, 2023 • edited Loading

priyambial123 commented Mar 27, 2023

ACEnglish commented Mar 27, 2023

priyambial123 commented Mar 27, 2023

ACEnglish commented Mar 27, 2023 • edited Loading

priyambial123 commented Mar 27, 2023

ACEnglish commented Mar 27, 2023

priyambial123 commented Mar 27, 2023

ACEnglish commented Mar 27, 2023

priyambial123 commented Mar 28, 2023 • edited Loading

ACEnglish commented Mar 28, 2023

ACEnglish commented Mar 7, 2023 •

edited

Loading

ACEnglish commented Mar 27, 2023 •

edited

Loading

priyambial123 commented Mar 28, 2023 •

edited

Loading