Provide AQA Test metrics per release #5121

Open
jiekang opened this issue Mar 4, 2024 · 13 comments

jiekang commented Mar 4, 2024

This issue tracks the effort to provide more metrics on AQAvit test runs on a per-release basis.

The project as a whole currently tracks some useful release metrics via scorecards by Shelley:

https://github.com/adoptium/adoptium/wiki/Adoptium-Release-Scorecards
https://github.com/smlambert/scorecard

The release scorecard data is useful for understanding how well we are doing at meeting release targets and how that is trending across releases.

It would be nice to similarly provide data for test runs, to help track the "health" of our test suite execution across releases (the health of the tests and their execution, which can relate to the underlying infrastructure). As a connected note, this is also a piece of the larger goal of reducing the burden on triage engineers, e.g. by highlighting machine-specific failures across releases in a different manner.

I imagine this involving enhancements to the existing Release Summary Report (RSR), which already contains most of the data (whether in the report itself or in links), presenting it in a manner that connects the state across releases.

This proposal is open to all feedback. To start the discussion, I propose tracking AQAvit test execution data per release, formatted so that the platform state is easy to understand across releases. This would contain:

  • Per release & per platform:
    • For test targets and tests: number executed, passed, failed, disabled, & skipped (mostly in RSR already)
    • Overall score for initial test run: % passed, number of automated reruns (mostly in RSR already; a rough calculation sketch follows this list)
    • For failed test targets, the number and list of test failures, and the failing machine (likely linked) (mostly in RSR already)
    • Number of manual reruns performed (may be difficult to gather and may involve adjusting the triage process)
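
As a loose illustration of the "overall score" item above, here is a minimal sketch in TypeScript of how the initial-run pass rate could be derived from these counts; the exact definition used by the RSR may differ, and the numbers are made up.

```typescript
// Illustrative only: derives an "overall score for the initial test run" from the
// per-platform counts listed above. The definition here is an assumption, not the
// agreed RSR formula.
interface InitialRunCounts {
  executed: number;        // test targets executed in the first (pre-rerun) pass
  passed: number;          // test targets that passed on that first pass
  automatedReruns: number; // rerun jobs triggered automatically
}

// Pass rate of the initial execution, as a percentage.
function initialPassRate(c: InitialRunCounts): number {
  return c.executed === 0 ? 0 : (c.passed / c.executed) * 100;
}

// Made-up example values.
const example: InitialRunCounts = { executed: 1200, passed: 1176, automatedReruns: 9 };
console.log(`${initialPassRate(example).toFixed(1)}% passed, ${example.automatedReruns} automated reruns`);
// -> "98.0% passed, 9 automated reruns"
```
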
smlambert (Contributor) commented:

Thanks @jiekang ! Adding some comments and linking some relevant issues related to this:

> help track the "health" of our test suite

Noting an important distinction that some of the metrics under discussion will track the health of the underlying infrastructure and availability of machine resources rather than specifically the "health" of the test suites being run. We will aim to differentiate such information, in order to know where best to 'course-correct' and where to apply improvement efforts.

Related to this is the enhancement issue for RSR:
adoptium/aqa-test-tools#649

smlambert commented Mar 4, 2024

Additional metrics worth tracking (not necessarily to be considered under this issue, just jotting some down so as not to lose them):

  • top-level test target execution times (including queue times waiting for machines), which helps inform whether extra resources are needed (related: Assess test target execution time & define test schedule #2037)

  • number of online, available test machines for a particular platform (measured twice: once during the dry run and once on the trigger of the release pipeline)

    • for macOS, where Orka machines are spun up, it would be good to measure the maximum burst number of dynamic agents spun up and running during the release period (is that visible in the Orka dashboard, and is there a limit on how many we can spin up at once?)
  • if a related issue is identified as the cause of a failure, report the age of that issue

  • We already gather the number of test targets excluded, but we can also gather the number of ProblemListed test cases as we enter a release period.

  • Number of commits between the last aqa-tests release and the current aqa-tests release, measuring contributions and activity in the aqa-tests repository as a proxy for overall AQAvit project health. We could also measure the other six AQAvit repos, but aqa-tests is the central one, so it is a good measure of activity. The Eclipse Foundation also tracks the number of different companies contributing to a project; while we do not want to duplicate their statistics, we should consider reporting those organization contribution statistics in the program plan for each of the sub-projects, Temurin, AQAvit, etc. (see https://projects.eclipse.org/projects/adoptium.aqavit/who). (A minimal commit-counting sketch follows this list.)

  • EPIC: Improve the contents and organization of release summary report aqa-test-tools#649 (comment)
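
For the commit-count metric above, a minimal counting sketch, assuming a local clone of aqa-tests and two known release refs; the ref names in the usage example are placeholders, not the real tag scheme:

```typescript
// Minimal sketch: counts commits between two aqa-tests release points.
// Assumes the repository is cloned locally; ref names are placeholders.
import { execSync } from "node:child_process";

function commitsBetween(repoPath: string, fromRef: string, toRef: string): number {
  const out = execSync(`git -C ${repoPath} rev-list --count ${fromRef}..${toRef}`, {
    encoding: "utf8",
  });
  return parseInt(out.trim(), 10);
}

// Hypothetical usage:
// console.log(commitsBetween("./aqa-tests", "previous-release-tag", "current-release-tag"));
```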

jiekang commented Mar 4, 2024

> Thanks @jiekang ! Adding some comments and linking some relevant issues related to this:
>
> > help track the "health" of our test suite
>
> Noting an important distinction that some of the metrics under discussion will track the health of the underlying infrastructure and availability of machine resources rather than specifically the "health" of the test suites being run. We will aim to differentiate such information, in order to know where best to 'course-correct' and where to apply improvement efforts.
>
> Related to this is the enhancement issue for RSR: adoptium/aqa-test-tools#649

Yes, excellent point. There is some cross-boundary overlap, as test execution success is sometimes closely linked to stable and consistent infrastructure configuration. I've updated the original comment to note this distinction, as I think we should understand the health of the overall system.

jiekang commented Mar 4, 2024

> Additional metrics worth tracking (not necessarily to be considered under this issue, just jotting some down so as not to lose them): […]

Thanks for noting all these; I can see value in all of them!

smlambert commented Mar 5, 2024

Related to test execution stats gathering: https://github.com/smlambert/aqastats

Related to differentiating between an infra issue and a TBD issue (one that needs more triage to figure out whether it is a product, test, or infra issue):
There is a feature in the test pipeline code that supports creating an errorList and temporarily marking a machine offline if certain issues are reported to the console. This is one way to start differentiating between an obvious infra issue and a still-undetermined issue that requires more triage to categorize. We have not enabled it at ci.adoptium.net yet, but it would be an interesting experiment. A new route / API could potentially be added to TRSS to pull this data if present; a hypothetical sketch of such a route follows.
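
To make the "new route / API" idea concrete, here is a hypothetical sketch only (Express-style TypeScript); the route name, data shape, and in-memory storage are all assumptions and do not reflect TRSS's actual code or API:

```typescript
// Hypothetical sketch: not TRSS's real routing layer or data model.
// The route name (/api/getMachineErrorList) and fields are assumptions.
import express from "express";

interface MachineError {
  machineName: string;   // Jenkins node that was temporarily marked offline
  errorPattern: string;  // console pattern that triggered the errorList entry
  timestamp: string;     // when the machine was taken offline
  releaseTag?: string;   // release pipeline the error was observed in, if known
}

// A real implementation would query TRSS's database; an in-memory array
// stands in here so the sketch stays self-contained.
const machineErrors: MachineError[] = [];

const app = express();

// Return errorList entries, optionally filtered by release tag, so per-release
// reports can separate obvious infra failures from still-undetermined ones.
app.get("/api/getMachineErrorList", (req, res) => {
  const release = typeof req.query.release === "string" ? req.query.release : undefined;
  const result = release
    ? machineErrors.filter((e) => e.releaseTag === release)
    : machineErrors;
  res.json({ count: result.length, errors: result });
});

app.listen(3000);
```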

jiekang commented Mar 5, 2024

Additional notes:

  • Tracking test effectiveness: recording when an actual bug in OpenJDK was found by a test
  • We must define the purpose of every metric and how it will be used to improve things

Related issue:
#4278

smlambert (Contributor) commented:

As discussed in the PMC call today, I will create a new repo to encompass the scorecard scripts (moved over from smlambert/scorecards, plus an adapted version of the scripts from smlambert/aqastats) and the new metrics we will design and intend to add for all Adoptium sub-projects, as shown in the Adoptium project hierarchy below:

[Adoptium project hierarchy diagram]

jiekang commented Mar 27, 2024

A first draft of the data to be collected per release:

Release: 
[...]
    Date: <date>
    Execution Time: <total time>
    
    Version:
    [...]
        SCM Ref: <scm_ref>
        Test Targets: <total, executed, passed, failed, disabled, skipped>
        Tests: <total, executed, passed, failed, disabled, skipped>
        Manual reruns: <total>
        Execution Time: <total time>
            
            Platform:
            [...]
            OS & Arch: <os> <arch>
                Test Targets: <total, executed, passed, failed, disabled, skipped>
                Tests: <total, executed, passed, failed, disabled, skipped>
                Manual reruns: <total>
                Machines Available (Dry Run): <count>
                Machines Available (Release): <count>
                Execution Time: <total time>
                
                Test Target:
                [...]
                    Name: <name>
                    Execution Time: <total time>


jiekang commented Mar 27, 2024

Clarifying: I think there is another set of data that has been discussed for gathering which doesn't fit into the same bucket but is definitely still under consideration, e.g. test effectiveness, related-issue reporting (age, etc.), repository activity, and contribution statistics.

jiekang commented Mar 27, 2024

Also, immediately after posting: I think the Platform and Version levels of the hierarchy should be swapped for the Machines Available data to make sense.

smlambert (Contributor) commented:

:)

Appreciate your initial care and thoughts on this feature @jiekang ! Thank you!

jiekang commented Apr 11, 2024

So with the hierarchy flipped it is:

Release: 
[...]
    Date: <date>
    Execution Time: <total time>
      
    Platform:
    [...]
    OS & Arch: <os> <arch>
        Test Targets: <total, executed, passed, failed, disabled, skipped>
        Tests: <total, executed, passed, failed, disabled, skipped>
        Manual reruns: <total>
        Machines Available (Dry Run): <count>
        Machines Available (Release): <count>
        Execution Time: <total time>
        
        Version:
        [...]
            SCM Ref: <scm_ref>
            Test Targets: <total, executed, passed, failed, disabled, skipped>
            Tests: <total, executed, passed, failed, disabled, skipped>
            Manual reruns: <total>
            Execution Time: <total time>

            Test Target:
            [...]
                Name: <name>
                Execution Time: <total time>
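
One possible rendering of this hierarchy as TypeScript types, for illustration only; the field names simply mirror the draft above and are not a committed schema for the scorecard/TRSS tooling:

```typescript
// Illustrative types only: they mirror the draft hierarchy above and are not a
// committed schema.
interface Counts {
  total: number;
  executed: number;
  passed: number;
  failed: number;
  disabled: number;
  skipped: number;
}

interface TestTargetStats {
  name: string;
  executionTime: string; // e.g. "1h 23m"
}

interface VersionStats {
  scmRef: string;
  testTargets: Counts;
  tests: Counts;
  manualReruns: number;
  executionTime: string;
  targets: TestTargetStats[];
}

interface PlatformStats {
  os: string;
  arch: string;
  testTargets: Counts;
  tests: Counts;
  manualReruns: number;
  machinesAvailableDryRun: number;
  machinesAvailableRelease: number;
  executionTime: string;
  versions: VersionStats[];
}

interface ReleaseStats {
  date: string;
  executionTime: string;
  platforms: PlatformStats[];
}
```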


jiekang commented Jul 4, 2024

Just noting the code is in development here:

https://github.com/jiekang/scorecard/tree/trss-statistics

It is now fully functional, with a diff command to compare two releases (a rough sketch of that kind of comparison follows the list below).

Remaining items:

  • Fix the issue with counting test targets. The input data is a list of test targets that includes top-level target jobs, their children, and rerun jobs, which may not all be correctly accounted for when summing results and durations. At the moment, every piece of available data is blindly summed.
  • Add test total data. Currently there are only test target totals.
  • Add machine availability data. At the moment, this can be input manually.
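
As a rough illustration of what a per-platform comparison between two releases could produce (a sketch only, not the implementation in the trss-statistics branch; the numbers are made up):

```typescript
// Sketch only: compares the test-target counts of one platform across two releases.
interface CountsDiff {
  field: string;
  before: number;
  after: number;
  delta: number;
}

function diffCounts(before: Record<string, number>, after: Record<string, number>): CountsDiff[] {
  return Object.keys(before).map((field) => ({
    field,
    before: before[field],
    after: after[field],
    delta: after[field] - before[field],
  }));
}

// Made-up example: one platform's test-target counts in two releases.
const previousRelease = { total: 410, executed: 405, passed: 398, failed: 7, disabled: 3, skipped: 2 };
const currentRelease  = { total: 415, executed: 412, passed: 409, failed: 3, disabled: 2, skipped: 1 };

for (const d of diffCounts(previousRelease, currentRelease)) {
  console.log(`${d.field}: ${d.before} -> ${d.after} (${d.delta >= 0 ? "+" : ""}${d.delta})`);
}
```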
