Provide AQA Test metrics per release #5121

Open
jiekang opened this issue Mar 4, 2024 · 13 comments

jiekang commented Mar 4, 2024

This issue tracks the effort to provide more metrics on AQAvit test runs on a per-release basis.

The project as a whole currently tracks some useful release metrics via scorecards by Shelley:

https://github.com/adoptium/adoptium/wiki/Adoptium-Release-Scorecards
https://github.com/smlambert/scorecard

The release scorecard data is useful for understanding how well we are doing at meeting release targets and how that is trending across releases.

It would be nice to similarly provide data for test runs, to help track the "health" of our test suite execution across releases (the health of the tests and their execution, which can relate to the underlying infrastructure). As a connected note, this is also a piece of the larger goal of reducing the burden on triage engineers, e.g. by highlighting machine-specific failures across releases in a different manner.

I imagine this involving enhancements to the existing Release Summary Report (RSR), which already contains most of the data (whether in the report itself or in links), presenting it in a manner that connects the state across releases.

This proposal is open to all feedback. To start the discussion, I propose tracking AQAvit test execution data per release, formatted so that the platform state is easy to understand across releases. This would contain:

  • Per release & per platform:
    • For test targets and tests: number executed, passed, failed, disabled, & skipped (mostly in RSR already)
    • Overall score for initial test run: % passed, number of automated reruns (mostly in RSR already; a rough calculation sketch follows this list)
    • For failed test targets, the number and list of test failures, and the failing machine (likely linked) (mostly in RSR already)
    • Number of manual reruns performed (may be difficult to gather and may involve adjusting the triage process)
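
As a loose illustration of the "overall score" item above, here is a minimal sketch in TypeScript of how the initial-run pass rate could be derived from these counts; the exact definition used by the RSR may differ, and the numbers are made up.

```typescript
// Illustrative only: derives an "overall score for the initial test run" from the
// per-platform counts listed above. The definition here is an assumption, not the
// agreed RSR formula.
interface InitialRunCounts {
  executed: number;        // test targets executed in the first (pre-rerun) pass
  passed: number;          // test targets that passed on that first pass
  automatedReruns: number; // rerun jobs triggered automatically
}

// Pass rate of the initial execution, as a percentage.
function initialPassRate(c: InitialRunCounts): number {
  return c.executed === 0 ? 0 : (c.passed / c.executed) * 100;
}

// Made-up example values.
const example: InitialRunCounts = { executed: 1200, passed: 1176, automatedReruns: 9 };
console.log(`${initialPassRate(example).toFixed(1)}% passed, ${example.automatedReruns} automated reruns`);
// -> "98.0% passed, 9 automated reruns"
```
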
smlambert (Contributor) commented:

Thanks @jiekang ! Adding some comments and linking some relevant issues related to this:

> help track the "health" of our test suite

Noting an important distinction that some of the metrics under discussion will track the health of the underlying infrastructure and availability of machine resources rather than specifically the "health" of the test suites being run. We will aim to differentiate such information, in order to know where best to 'course-correct' and where to apply improvement efforts.

Related to this is the enhancement issue for RSR:
adoptium/aqa-test-tools#649

smlambert commented Mar 4, 2024

Additional metrics worth tracking (not necessarily to be considered under this issue, just jotting some down so as not to lose them):

  • top-level test target execution times (including queue times waiting for machines), which helps inform whether extra resources are needed (related: Assess test target execution time & define test schedule #2037)

  • number of online, available test machines for a particular platform (measured twice: once during the dry run and once on the trigger of the release pipeline)

    • for macOS, where Orka machines are spun up, it would be good to measure the maximum burst number of dynamic agents spun up and running during the release period (is that visible in the Orka dashboard, and is there a limit on how many we can spin up at once?)
  • if a related issue is identified as the cause of a failure, report the age of that issue

  • We already gather the number of test targets excluded, but we can also gather the number of ProblemListed test cases as we enter a release period.

  • Number of commits between the last aqa-tests release and the current aqa-tests release, measuring contributions and activity in the aqa-tests repository as a proxy for overall AQAvit project health. We could also measure the other six AQAvit repos, but aqa-tests is the central one, so it is a good measure of activity. The Eclipse Foundation also tracks the number of different companies contributing to a project; while we do not want to duplicate their statistics, we should consider reporting those organization contribution statistics in the program plan for each of the sub-projects, Temurin, AQAvit, etc. (see https://projects.eclipse.org/projects/adoptium.aqavit/who). (A minimal commit-counting sketch follows this list.)

  • EPIC: Improve the contents and organization of release summary report aqa-test-tools#649 (comment)
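
For the commit-count metric above, a minimal counting sketch, assuming a local clone of aqa-tests and two known release refs; the ref names in the usage example are placeholders, not the real tag scheme:

```typescript
// Minimal sketch: counts commits between two aqa-tests release points.
// Assumes the repository is cloned locally; ref names are placeholders.
import { execSync } from "node:child_process";

function commitsBetween(repoPath: string, fromRef: string, toRef: string): number {
  const out = execSync(`git -C ${repoPath} rev-list --count ${fromRef}..${toRef}`, {
    encoding: "utf8",
  });
  return parseInt(out.trim(), 10);
}

// Hypothetical usage:
// console.log(commitsBetween("./aqa-tests", "previous-release-tag", "current-release-tag"));
```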

jiekang commented Mar 4, 2024

> Thanks @jiekang ! Adding some comments and linking some relevant issues related to this:
>
> > help track the "health" of our test suite
>
> Noting an important distinction that some of the metrics under discussion will track the health of the underlying infrastructure and availability of machine resources rather than specifically the "health" of the test suites being run. We will aim to differentiate such information, in order to know where best to 'course-correct' and where to apply improvement efforts.
>
> Related to this is the enhancement issue for RSR: adoptium/aqa-test-tools#649

Yes, excellent point. There is some cross-boundary overlap, as test execution success is sometimes closely linked to stable and consistent infrastructure configuration. I've updated the original comment to note this distinction, as I think we should understand the health of the overall system.

jiekang commented Mar 4, 2024

> Additional metrics worth tracking (not necessarily to be considered under this issue, just jotting some down so as not to lose them): […]

Thanks for noting all these; I can see value in all of them!

smlambert commented Mar 5, 2024

Related to test execution stats gathering: https://github.com/smlambert/aqastats

Related to differentiating between an infra issue and a TBD issue (one that needs more triage to figure out whether it is a product, test, or infra issue):
There is a feature in the test pipeline code that supports creating an errorList and temporarily marking a machine offline if certain issues are reported to the console. This is one way to start differentiating between an obvious infra issue and a still-undetermined issue that requires more triage to categorize. We have not enabled it at ci.adoptium.net yet, but it would be an interesting experiment. A new route / API could potentially be added to TRSS to pull this data if present; a hypothetical sketch of such a route follows.
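
To make the "new route / API" idea concrete, here is a hypothetical sketch only (Express-style TypeScript); the route name, data shape, and in-memory storage are all assumptions and do not reflect TRSS's actual code or API:

```typescript
// Hypothetical sketch: not TRSS's real routing layer or data model.
// The route name (/api/getMachineErrorList) and fields are assumptions.
import express from "express";

interface MachineError {
  machineName: string;   // Jenkins node that was temporarily marked offline
  errorPattern: string;  // console pattern that triggered the errorList entry
  timestamp: string;     // when the machine was taken offline
  releaseTag?: string;   // release pipeline the error was observed in, if known
}

// A real implementation would query TRSS's database; an in-memory array
// stands in here so the sketch stays self-contained.
const machineErrors: MachineError[] = [];

const app = express();

// Return errorList entries, optionally filtered by release tag, so per-release
// reports can separate obvious infra failures from still-undetermined ones.
app.get("/api/getMachineErrorList", (req, res) => {
  const release = typeof req.query.release === "string" ? req.query.release : undefined;
  const result = release
    ? machineErrors.filter((e) => e.releaseTag === release)
    : machineErrors;
  res.json({ count: result.length, errors: result });
});

app.listen(3000);
```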

jiekang commented Mar 5, 2024

Additional notes:

  • Tracking test effectiveness: recording when an actual bug in OpenJDK was found by a test
  • We must define the purpose of every metric and how it will be used to improve things

Related issue:
#4278

smlambert (Contributor) commented:

As discussed in the PMC call today, I will create a new repo to encompass the scorecard scripts (moved over from smlambert/scorecards, plus an adapted version of the scripts from smlambert/aqastats) and the new metrics we will design and intend to add for all Adoptium sub-projects, as shown in the Adoptium project hierarchy below:

[Adoptium project hierarchy diagram]

jiekang commented Mar 27, 2024

A first draft of the data to be collected per release:

Release: 
[...]
    Date: <date>
    Execution Time: <total time>
    
    Version:
    [...]
        SCM Ref: <scm_ref>
        Test Targets: <total, executed, passed, failed, disabled, skipped>
        Tests: <total, executed, passed, failed, disabled, skipped>
        Manual reruns: <total>
        Execution Time: <total time>
            
            Platform:
            [...]
            OS & Arch: <os> <arch>
                Test Targets: <total, executed, passed, failed, disabled, skipped>
                Tests: <total, executed, passed, failed, disabled, skipped>
                Manual reruns: <total>
                Machines Available (Dry Run): <count>
                Machines Available (Release): <count>
                Execution Time: <total time>
                
                Test Target:
                [...]
                    Name: <name>
                    Execution Time: <total time>


jiekang commented Mar 27, 2024

Clarifying: I think there is another set of data that has been discussed for gathering which doesn't fit into the same bucket but is definitely still under consideration, e.g. test effectiveness, related-issue reporting (age, etc.), repository activity, and contribution statistics.

jiekang commented Mar 27, 2024

Also, immediately after posting: I think the Platform and Version levels of the hierarchy should be swapped for the Machines Available data to make sense.

smlambert (Contributor) commented:

:)

Appreciate your initial care and thoughts on this feature @jiekang ! Thank you!

jiekang commented Apr 11, 2024

So with the hierarchy flipped it is:

Release: 
[...]
    Date: <date>
    Execution Time: <total time>
      
    Platform:
    [...]
    OS & Arch: <os> <arch>
        Test Targets: <total, executed, passed, failed, disabled, skipped>
        Tests: <total, executed, passed, failed, disabled, skipped>
        Manual reruns: <total>
        Machines Available (Dry Run): <count>
        Machines Available (Release): <count>
        Execution Time: <total time>
        
        Version:
        [...]
            SCM Ref: <scm_ref>
            Test Targets: <total, executed, passed, failed, disabled, skipped>
            Tests: <total, executed, passed, failed, disabled, skipped>
            Manual reruns: <total>
            Execution Time: <total time>

            Test Target:
            [...]
                Name: <name>
                Execution Time: <total time>
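
One possible rendering of this hierarchy as TypeScript types, for illustration only; the field names simply mirror the draft above and are not a committed schema for the scorecard/TRSS tooling:

```typescript
// Illustrative types only: they mirror the draft hierarchy above and are not a
// committed schema.
interface Counts {
  total: number;
  executed: number;
  passed: number;
  failed: number;
  disabled: number;
  skipped: number;
}

interface TestTargetStats {
  name: string;
  executionTime: string; // e.g. "1h 23m"
}

interface VersionStats {
  scmRef: string;
  testTargets: Counts;
  tests: Counts;
  manualReruns: number;
  executionTime: string;
  targets: TestTargetStats[];
}

interface PlatformStats {
  os: string;
  arch: string;
  testTargets: Counts;
  tests: Counts;
  manualReruns: number;
  machinesAvailableDryRun: number;
  machinesAvailableRelease: number;
  executionTime: string;
  versions: VersionStats[];
}

interface ReleaseStats {
  date: string;
  executionTime: string;
  platforms: PlatformStats[];
}
```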


jiekang commented Jul 4, 2024

Just noting the code is in development here:

https://github.com/jiekang/scorecard/tree/trss-statistics

It is now fully functional, with a diff command to compare two releases (a rough sketch of that kind of comparison follows the list below).

Remaining items:

  • Fix the issue with counting test targets. The input data is a list of test targets that includes top-level target jobs, their children, and rerun jobs, which may not all be correctly accounted for when summing results and durations. At the moment, every piece of available data is blindly summed.
  • Add test total data. Currently there are only test target totals.
  • Add machine availability data. At the moment, this can be input manually.
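
As a rough illustration of what a per-platform comparison between two releases could produce (a sketch only, not the implementation in the trss-statistics branch; the numbers are made up):

```typescript
// Sketch only: compares the test-target counts of one platform across two releases.
interface CountsDiff {
  field: string;
  before: number;
  after: number;
  delta: number;
}

function diffCounts(before: Record<string, number>, after: Record<string, number>): CountsDiff[] {
  return Object.keys(before).map((field) => ({
    field,
    before: before[field],
    after: after[field],
    delta: after[field] - before[field],
  }));
}

// Made-up example: one platform's test-target counts in two releases.
const previousRelease = { total: 410, executed: 405, passed: 398, failed: 7, disabled: 3, skipped: 2 };
const currentRelease  = { total: 415, executed: 412, passed: 409, failed: 3, disabled: 2, skipped: 1 };

for (const d of diffCounts(previousRelease, currentRelease)) {
  console.log(`${d.field}: ${d.before} -> ${d.after} (${d.delta >= 0 ? "+" : ""}${d.delta})`);
}
```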
