No TP-base and TP-call match in the benchmark results #132

zhengdafangyuan · 2022-08-22T07:47:20Z

Hi,
Thanks to the efforts of authors in providing this tool Truvari. We had problems comparing two vcf files using Truvari V3.4.0.

Running : truvari bench -b GM12878.chr21_14_19M.true.v1.vcf.gz -c lra.chr21.14-19_wr.v1.vcf.gz -o ./out_lra -r 1000 --giabreport -p 0.00
The truvari result shows that TP-base and TP-call are both 0, and got an error KeyError: 'sizecat':

2022-08-22 12:04:33,435 [INFO] Running /home/GPU3090B/miniconda3/bin/truvari bench -b GM12878.chr21_14_19M.true.v1.vcf.gz -c lra.chr21.14-19_wr.v1.vcf.gz -o ./out_lra -r 1000 --giabreport -p 0.00
2022-08-22 12:04:33,435 [INFO] Params:
{
    "base": "GM12878.chr21_14_19M.true.v1.vcf.gz",
    "comp": "lra.chr21.14-19_wr.v1.vcf.gz",
    "output": "./out_lra",
    "reference": null,
    "giabreport": true,
    "debug": false,
    "prog": false,
    "refdist": 1000,
    "pctsim": 0.0,
    "minhaplen": 50,
    "pctsize": 0.7,
    "pctovl": 0.0,
    "typeignore": false,
    "use_lev": false,
    "chunksize": 1000,
    "gtcomp": false,
    "bSample": null,
    "cSample": null,
    "sizemin": 50,
    "sizefilt": 30,
    "sizemax": 50000,
    "passonly": false,
    "no_ref": false,
    "includebed": null,
    "extend": 0,
    "multimatch": false
}
2022-08-22 12:04:33,435 [INFO] Truvari version: 3.4.0
2022-08-22 12:04:33,444 [INFO] Zipped 72 variants. Counter({'base': 44, 'comp': 28})
2022-08-22 12:04:33,444 [INFO] 45 chunks of 72 variants. Counter({'base': 44, 'comp': 28})
2022-08-22 12:04:33,444 [INFO] Stats: {
    "TP-base": 0,
    "TP-call": 0,
    "FP": 19,
    "FN": 44,
    "precision": 0.0,
    "recall": 0.0,
    "f1": 0,
    "base cnt": 44,
    "call cnt": 19,
    "TP-call_TP-gt": 0,
    "TP-call_FP-gt": 0,
    "TP-base_TP-gt": 0,
    "TP-base_FP-gt": 0,
    "gt_concordance": 0,
    "gt_matrix": {}
}
2022-08-22 12:04:33,475 [INFO] Running /home/GPU3090B/miniconda3/bin/truvari bench -b GM12878.chr21_14_19M.true.v1.vcf.gz -c lra.chr21.14-19_wr.v1.vcf.gz -o ./out_lra -r 1000 --giabreport -p 0.00
2022-08-22 12:04:33,505 [INFO] Optimizing DataFrame memory
2022-08-22 12:04:33,516 [INFO] Optimized 0.02MB to 0.02MB
2022-08-22 12:04:33,523 [INFO] Finished vcf2df
Traceback (most recent call last):
  File "/home/GPU3090B/miniconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'sizecat'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/GPU3090B/miniconda3/bin/truvari", line 8, in <module>
    sys.exit(main())
  File "/home/GPU3090B/miniconda3/lib/python3.9/site-packages/truvari/__main__.py", line 85, in main
    TOOLS[args.cmd](args.options)
  File "/home/GPU3090B/miniconda3/lib/python3.9/site-packages/truvari/bench.py", line 815, in bench_main
    make_giabreport(args, outputs["stats_box"])
  File "/home/GPU3090B/miniconda3/lib/python3.9/site-packages/truvari/giab_report.py", line 40, in make_giabreport
    data["sizecat"] = data["sizecat"].astype(sz_type)
  File "/home/GPU3090B/miniconda3/lib/python3.9/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/GPU3090B/miniconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'sizecat'

But we did find match insertion and deletion results from IGV visualization of the two vcf, for example, the chr21-16359642-DEL-139 in base VCF / chr21-16359640-cuteSV.DEL.6 in comparison VCF , and the chr21-18251499-INS-448 in base VCF / chr21-18251502-cuteSV.INS.8 in comparison VCF；

In addition, we used usable_vcf tool to verify these two VCF files, and got VCF validation failure:

2022-08-22 15:13:22,250 [INFO] Running /home/GPU3090B/Software/usable_vcf/usable_vcf.py -v GM12878.chr21_14_19M.true.v1.vcf.gz
2022-08-22 15:13:22,328 [ERROR] ===== vcf-validator failed =====
2022-08-22 15:13:22,329 [DEBUG] b''
2022-08-22 15:13:22,341 [INFO] ===== 4/5 passed (1 failures) for GM12878.chr21_14_19M.true.v1.vcf.gz =====

2022-08-22 15:14:23,453 [INFO] Running /home/GPU3090B/Software/usable_vcf/usable_vcf.py -v lra.chr21.14-19_wr.v1.vcf.gz
2022-08-22 15:14:23,532 [ERROR] ===== vcf-validator failed =====
2022-08-22 15:14:23,532 [DEBUG] b''
2022-08-22 15:14:23,544 [INFO] ===== 4/5 passed (1 failures) for lra.chr21.14-19_wr.v1.vcf.gz =====

I have attached the base VCF and compression VCF to this issue. If it is the VCF file format problem? Or any other problems, could you give us some help? thank you!
GM12878.chr21_14_19M.true.v1.vcf.gz
lra.chr21.14-19_wr.v1.vcf.gz

The text was updated successfully, but these errors were encountered:

ACEnglish · 2022-08-22T09:08:02Z

The GM128*.vcf has a header definition of ID=SVTYPE,Number=A. This is causing Truvari to see its svtype as a tuple e.g. ("INS",) instead of string "INS". By changing the header line to:

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Variant type">

I was able to see:

{
    "TP-base": 18,
    "TP-call": 18,
    "FP": 1,
    "FN": 25,
    "precision": 0.9473684210526315,
    "recall": 0.4186046511627907,
    "f1": 0.5806451612903226,
    "base cnt": 43,
    "call cnt": 19,
    "TP-call_TP-gt": 0,
    "TP-call_FP-gt": 18,
    "TP-base_TP-gt": 0,
    "TP-base_FP-gt": 18,
    "gt_concordance": 0.0,
    "gt_matrix": {
        "(1, 1)": {
            "(None, None)": 6
        },
        "(1, 0)": {
            "(None, None)": 3
        },
        "(0, 1)": {
            "(None, None)": 9
        }
    }
}

Truvari uses pysam to parse VCF files and while there are safe guards for Number=. (See line) I'll need to edit that line for Number=A and others which tell pysam to return tuples instead of lists via:

if isinstance(ret_type, (list, tuple)):

ACEnglish · 2022-08-22T09:12:09Z

Sorry, one more thing... The --giabreport can only be used on the GIAB vcfs. See wiki

pysam returns tuple from Number=A

zhengdafangyuan · 2022-08-23T02:14:23Z

Thanks, I followed the step you mentioned and it worked.

ACEnglish closed this as completed Aug 22, 2022

ACEnglish added a commit that referenced this issue Aug 22, 2022

Fix for #132

10517f8

pysam returns tuple from Number=A

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No TP-base and TP-call match in the benchmark results #132

No TP-base and TP-call match in the benchmark results #132

zhengdafangyuan commented Aug 22, 2022

ACEnglish commented Aug 22, 2022 •

edited

Loading

ACEnglish commented Aug 22, 2022

zhengdafangyuan commented Aug 23, 2022

No TP-base and TP-call match in the benchmark results #132

No TP-base and TP-call match in the benchmark results #132

Comments

zhengdafangyuan commented Aug 22, 2022

ACEnglish commented Aug 22, 2022 • edited Loading

ACEnglish commented Aug 22, 2022

zhengdafangyuan commented Aug 23, 2022

ACEnglish commented Aug 22, 2022 •

edited

Loading