Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No TP-base and TP-call match in the benchmark results #132

Closed
zhengdafangyuan opened this issue Aug 22, 2022 · 3 comments
Closed

No TP-base and TP-call match in the benchmark results #132

zhengdafangyuan opened this issue Aug 22, 2022 · 3 comments

Comments

@zhengdafangyuan
Copy link

Hi,
Thanks to the efforts of authors in providing this tool Truvari. We had problems comparing two vcf files using Truvari V3.4.0.

Running : truvari bench -b GM12878.chr21_14_19M.true.v1.vcf.gz -c lra.chr21.14-19_wr.v1.vcf.gz -o ./out_lra -r 1000 --giabreport -p 0.00
The truvari result shows that TP-base and TP-call are both 0, and got an error KeyError: 'sizecat':

2022-08-22 12:04:33,435 [INFO] Running /home/GPU3090B/miniconda3/bin/truvari bench -b GM12878.chr21_14_19M.true.v1.vcf.gz -c lra.chr21.14-19_wr.v1.vcf.gz -o ./out_lra -r 1000 --giabreport -p 0.00
2022-08-22 12:04:33,435 [INFO] Params:
{
    "base": "GM12878.chr21_14_19M.true.v1.vcf.gz",
    "comp": "lra.chr21.14-19_wr.v1.vcf.gz",
    "output": "./out_lra",
    "reference": null,
    "giabreport": true,
    "debug": false,
    "prog": false,
    "refdist": 1000,
    "pctsim": 0.0,
    "minhaplen": 50,
    "pctsize": 0.7,
    "pctovl": 0.0,
    "typeignore": false,
    "use_lev": false,
    "chunksize": 1000,
    "gtcomp": false,
    "bSample": null,
    "cSample": null,
    "sizemin": 50,
    "sizefilt": 30,
    "sizemax": 50000,
    "passonly": false,
    "no_ref": false,
    "includebed": null,
    "extend": 0,
    "multimatch": false
}
2022-08-22 12:04:33,435 [INFO] Truvari version: 3.4.0
2022-08-22 12:04:33,444 [INFO] Zipped 72 variants. Counter({'base': 44, 'comp': 28})
2022-08-22 12:04:33,444 [INFO] 45 chunks of 72 variants. Counter({'base': 44, 'comp': 28})
2022-08-22 12:04:33,444 [INFO] Stats: {
    "TP-base": 0,
    "TP-call": 0,
    "FP": 19,
    "FN": 44,
    "precision": 0.0,
    "recall": 0.0,
    "f1": 0,
    "base cnt": 44,
    "call cnt": 19,
    "TP-call_TP-gt": 0,
    "TP-call_FP-gt": 0,
    "TP-base_TP-gt": 0,
    "TP-base_FP-gt": 0,
    "gt_concordance": 0,
    "gt_matrix": {}
}
2022-08-22 12:04:33,475 [INFO] Running /home/GPU3090B/miniconda3/bin/truvari bench -b GM12878.chr21_14_19M.true.v1.vcf.gz -c lra.chr21.14-19_wr.v1.vcf.gz -o ./out_lra -r 1000 --giabreport -p 0.00
2022-08-22 12:04:33,505 [INFO] Optimizing DataFrame memory
2022-08-22 12:04:33,516 [INFO] Optimized 0.02MB to 0.02MB
2022-08-22 12:04:33,523 [INFO] Finished vcf2df
Traceback (most recent call last):
  File "/home/GPU3090B/miniconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'sizecat'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/GPU3090B/miniconda3/bin/truvari", line 8, in <module>
    sys.exit(main())
  File "/home/GPU3090B/miniconda3/lib/python3.9/site-packages/truvari/__main__.py", line 85, in main
    TOOLS[args.cmd](args.options)
  File "/home/GPU3090B/miniconda3/lib/python3.9/site-packages/truvari/bench.py", line 815, in bench_main
    make_giabreport(args, outputs["stats_box"])
  File "/home/GPU3090B/miniconda3/lib/python3.9/site-packages/truvari/giab_report.py", line 40, in make_giabreport
    data["sizecat"] = data["sizecat"].astype(sz_type)
  File "/home/GPU3090B/miniconda3/lib/python3.9/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/GPU3090B/miniconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'sizecat'

But we did find match insertion and deletion results from IGV visualization of the two vcf, for example, the chr21-16359642-DEL-139 in base VCF / chr21-16359640-cuteSV.DEL.6 in comparison VCF , and the chr21-18251499-INS-448 in base VCF / chr21-18251502-cuteSV.INS.8 in comparison VCF

In addition, we used usable_vcf tool to verify these two VCF files, and got VCF validation failure:

2022-08-22 15:13:22,250 [INFO] Running /home/GPU3090B/Software/usable_vcf/usable_vcf.py -v GM12878.chr21_14_19M.true.v1.vcf.gz
2022-08-22 15:13:22,328 [ERROR] ===== vcf-validator failed =====
2022-08-22 15:13:22,329 [DEBUG] b''
2022-08-22 15:13:22,341 [INFO] ===== 4/5 passed (1 failures) for GM12878.chr21_14_19M.true.v1.vcf.gz =====

2022-08-22 15:14:23,453 [INFO] Running /home/GPU3090B/Software/usable_vcf/usable_vcf.py -v lra.chr21.14-19_wr.v1.vcf.gz
2022-08-22 15:14:23,532 [ERROR] ===== vcf-validator failed =====
2022-08-22 15:14:23,532 [DEBUG] b''
2022-08-22 15:14:23,544 [INFO] ===== 4/5 passed (1 failures) for lra.chr21.14-19_wr.v1.vcf.gz =====

I have attached the base VCF and compression VCF to this issue. If it is the VCF file format problem? Or any other problems, could you give us some help? thank you!
GM12878.chr21_14_19M.true.v1.vcf.gz
lra.chr21.14-19_wr.v1.vcf.gz

@ACEnglish
Copy link
Owner

ACEnglish commented Aug 22, 2022

The GM128*.vcf has a header definition of ID=SVTYPE,Number=A. This is causing Truvari to see its svtype as a tuple e.g. ("INS",) instead of string "INS". By changing the header line to:

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Variant type">

I was able to see:

{
    "TP-base": 18,
    "TP-call": 18,
    "FP": 1,
    "FN": 25,
    "precision": 0.9473684210526315,
    "recall": 0.4186046511627907,
    "f1": 0.5806451612903226,
    "base cnt": 43,
    "call cnt": 19,
    "TP-call_TP-gt": 0,
    "TP-call_FP-gt": 18,
    "TP-base_TP-gt": 0,
    "TP-base_FP-gt": 18,
    "gt_concordance": 0.0,
    "gt_matrix": {
        "(1, 1)": {
            "(None, None)": 6
        },
        "(1, 0)": {
            "(None, None)": 3
        },
        "(0, 1)": {
            "(None, None)": 9
        }
    }
}

Truvari uses pysam to parse VCF files and while there are safe guards for Number=. (See line) I'll need to edit that line for Number=A and others which tell pysam to return tuples instead of lists via:

if isinstance(ret_type, (list, tuple)):

@ACEnglish
Copy link
Owner

Sorry, one more thing... The --giabreport can only be used on the GIAB vcfs. See wiki

ACEnglish added a commit that referenced this issue Aug 22, 2022
pysam returns tuple from Number=A
@zhengdafangyuan
Copy link
Author

Thanks, I followed the step you mentioned and it worked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants