
Benchmarking wrap-up #1960

@Jym77

Description

Summary and wrap-up of the "benchmarking" exploration. We'll present this at a CG meeting, and maybe also at a TF meeting if there is interest.

All documents copied or generated are grouped in the ACT rules benchmarking Google Drive.

  • Initial goal: we want to see how ACT rules (or tools using them) compare to manual testing.

    • How many actual occurrences of problems can we find? (similar to Deque's coverage report)
    • How many SCs do we cover fully? Partially?
    • What are the biggest gaps that need to be filled in ACT rules?
  • Problem 1: generating data
    Manual testing is expensive, and manual testing data is therefore often kept secret by organisations. Additionally, releasing a report pointing out that a given customer has accessibility issues is probably not what those customers want… As a result, it is difficult to gather manual testing data in sufficient amounts.

    We've looked at the AIR challenge (which has public data) and considered asking manual testers to volunteer a bit of work toward crowd-sourced reports, but both options seem impracticable.

    In the end, we've opted to look at the EU WAD data. Monitoring bodies do produce reports of their in-depth tests, and these are sometimes publicly available. The data is often at the page or site granularity (i.e. it does not point out the specific occurrences that fail), which is not ideal but still workable.

  • Problem 2: saving data
    Webpages change, and so does their accessibility. In particular, we can hope that accessibility improves after a report, with errors being corrected. This means that the automated testing used for comparison should happen close to the day the manual test was done; but then we cannot test again six months later to see whether the tool has improved.

    We've decided to store copies of the pages in the Internet Archive's Wayback Machine. This saves a frozen copy of the page that can be tested later, or even re-audited by manual testers. The archive should be taken at the same time as the initial testing is done, but that is a lightweight operation (a sketch of automating it is given after this list).

  • Problem 3: comparison
    Even at the best granularity, manual testing data is hard to compare with ACT rules. For example, in a <p>lorem <span>ipsum</span> dolor</p> snippet with poor contrast, manual testers will likely flag the full paragraph while the ACT rule flags each of the three text nodes separately. This is OK for true positives (i.e. true errors), since we can count them the way the manual tester does (i.e. consider that there is one contrast problem in this paragraph). It is less OK for false positives (i.e. occurrences incorrectly flagged by ACT rules or tools): notably, if a tool flags these three text nodes while the manual tester does not, is that one false positive or three?

    No real solution so far… (a partial counting approach, which does not settle the false-positive question, is sketched after this list).

  • Outcome:
    We've found (at least) one website for which the Portuguese monitoring body provides results at the page level (in the JSON file generated by the WCAG-EM tool). I've run Alfa on these pages (or on their copies in the Wayback Machine). The results are gathered in the Alfa vs Portuguese monitoring spreadsheet.
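
For Problem 2, here is a minimal sketch of how the archiving step could be automated, assuming Node 18+ (for the global fetch) and the public "Save Page Now" endpoint at https://web.archive.org/save/<url>. The response header used and the example URL are assumptions for illustration, not a documented part of our workflow.

```ts
// Minimal sketch: trigger a Wayback Machine capture of a page under test.
// Assumes the public "Save Page Now" endpoint; authenticated SPN2 calls and
// rate-limit handling are out of scope here.
async function archivePage(url: string): Promise<string | null> {
  const response = await fetch(`https://web.archive.org/save/${url}`);

  // The snapshot path (e.g. /web/20240101000000/https://example.com/) is
  // commonly reported in the Content-Location header; treat this as an
  // assumption and verify against the actual response.
  const snapshotPath = response.headers.get("content-location");

  return snapshotPath === null
    ? null
    : `https://web.archive.org${snapshotPath}`;
}

// Usage: archive each page at the same time as the manual audit is performed.
archivePage("https://example.com/page-under-test").then((snapshot) => {
  console.log(snapshot ?? "No snapshot URL returned; check the save manually.");
});
```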
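
For Problem 3, one partial way to count true positives the way the manual tester does is to collapse the tool's node-level occurrences onto the coarser elements flagged manually. Everything in the sketch below (types, helper names) is hypothetical, and the unmatched occurrences are only candidate false positives, since the counting question remains open.

```ts
// Illustrative sketch: a paragraph flagged once manually counts as a single
// true positive, even if the tool reports each of its text nodes separately.
interface ToolOccurrence {
  target: Element; // element (or parent of the text node) flagged by the tool
}

function countAgainstManual(
  manualTargets: Element[],
  toolOccurrences: ToolOccurrence[]
): { truePositives: number; unmatched: ToolOccurrence[] } {
  const matchedContainers = new Set<Element>();
  const unmatched: ToolOccurrence[] = [];

  for (const occurrence of toolOccurrences) {
    // A tool occurrence matches a manual one when it lies inside (or is) the
    // element the manual tester flagged.
    const container = manualTargets.find((manual) =>
      manual.contains(occurrence.target)
    );

    if (container !== undefined) {
      // Count the flagged container once, not once per text node.
      matchedContainers.add(container);
    } else {
      // Candidate false positives; how to count these is the open question.
      unmatched.push(occurrence);
    }
  }

  return { truePositives: matchedContainers.size, unmatched };
}
```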


Possible actions:

  • Other tool vendors and manual testers are welcome to run their tools against the CHUC pages (the Wayback Machine URLs are in the spreadsheet) to add their results to the comparison.
  • We will try to reach out (likely through the WAI tools project) to the EU monitoring bodies and ask whether they can provide more data at the page level, and whether they can save the pages they test in the Wayback Machine.
  • We can look through the initial results and see whether they provide some sort of insight. Notably, the SCs with errors that the tools don't find are probably a good place to look for new rules to write (see the sketch below).
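
A hypothetical sketch of that gap analysis, assuming per-SC outcomes have already been extracted from the manual reports and the tool runs (all type and field names below are made up for illustration):

```ts
// For each success criterion, check whether the manual audit found failures
// that the tool did not flag; those SCs are candidates for new ACT rules.
type Outcome = "failed" | "passed" | "cantTell" | "untested";

interface PageResult {
  manual: Record<string, Outcome>; // outcomes keyed by SC number, e.g. "1.4.3"
  tool: Record<string, Outcome>;
}

function criteriaMissedByTool(results: PageResult[]): Set<string> {
  const missed = new Set<string>();

  for (const { manual, tool } of results) {
    for (const [criterion, outcome] of Object.entries(manual)) {
      // The manual audit reported a failure, but the tool did not.
      if (outcome === "failed" && tool[criterion] !== "failed") {
        missed.add(criterion);
      }
    }
  }

  return missed;
}
```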
