
Possible Critical Flaws in Evaluation (VBVR-Bench) #234

@Dynamite2003


After running empirical tests and a code audit, I found that the current evaluation metrics (VBVR-Bench) fail to objectively reflect the reasoning capabilities of video models. In many cases, videos that completely violate the logic of the task are still awarded high scores, rendering the benchmark results unreliable.

For example, in the task grid_highest_cost, the evaluator should compute the total cost of the path in the generated video. However, in "vbvrevalkit/eval/vbvr_bench/evaluators/grid_highest_cost.py", the total cost is approximated by the coverage of the path rather than the actual accumulated cost. Even more surprisingly, that function is never called anywhere in the repository!

[Screenshot: the relevant evaluator code in grid_highest_cost.py]
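
To make the distinction concrete, here is a minimal sketch (all names and numbers are my own illustration, not the repo's actual code) of how a coverage-style score can reward a path whose real accumulated cost is far from the reference:

```python
# Hypothetical sketch contrasting a coverage-style score with the true
# accumulated path cost that a task like grid_highest_cost should measure.
from typing import List, Tuple

Grid = List[List[float]]      # grid[r][c] = cost of cell (r, c)
Path = List[Tuple[int, int]]  # cell sequence extracted from a generated video

def coverage_score(pred: Path, reference: Path) -> float:
    """Fraction of reference cells the predicted path touches.
    This ignores cell costs entirely."""
    touched = set(pred)
    return sum(1 for cell in reference if cell in touched) / len(reference)

def total_path_cost(pred: Path, grid: Grid) -> float:
    """Actual cost accumulated along the predicted path."""
    return sum(grid[r][c] for r, c in pred)

if __name__ == "__main__":
    grid = [[1.0, 10.0, 1.0],
            [1.0, 10.0, 1.0],
            [1.0, 100.0, 1.0]]
    reference = [(0, 1), (1, 1), (2, 1)]  # highest-cost route, cost = 120.0
    pred = [(0, 1), (1, 1), (2, 0)]       # detours around the 100-cost cell
    print(coverage_score(pred, reference))  # 0.67 -- looks decent
    print(total_path_cost(pred, grid))      # 21.0 vs. the reference's 120.0
```

A path that skips the single most expensive cell still gets a high coverage score, even though its real cost is nowhere near the highest-cost route.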

Based on the evidence above, I strongly suspect the evaluation code was written by AI without being strictly reviewed.
If any of my statements are incorrect, please let me know.
Looking forward to your explanation and feedback on my questions.
