# Persistent Failures Analysis

The speed and quality of builds are key performance indicators for a continuous integration process. A reduction in the number of failing builds, or in the time taken to fix them, should generally indicate an improvement in the development process. In this notebook, we analyze data collected from [TestGrid](https://testgrid.k8s.io/redhat) to calculate metrics such as the percentage of failures that persist for a long time and how long such failures last (i.e., how long it takes to fix them). Our goal is to provide engineers and managers with insights such as:
- Which tests (e.g. network or storage) have the most long-lasting failures?
- Which platforms (e.g. AWS or bare metal) have the most long-lasting failures?
- How long does it take to get a failing test passing again?
- How long does it take to get a failing build to build again?

In this notebook, we follow the same convention as in [number_of_flakes.ipynb](number_of_flakes.ipynb): we create a long dataframe with one row per test and let the end user decide at which level (e.g. tab, grid, or test) to aggregate it.
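
As a rough illustration of what "aggregating a long dataframe" could look like, the sketch below builds a toy frame with a subset of the columns this notebook produces later and averages one metric per tab. The row values and the names `toy_df` and `per_tab` are made up for illustration and are not taken from the TestGrid data.

```python
import pandas as pd

# Toy long-format frame: one row per (tab, grid, test). Values are invented.
toy_df = pd.DataFrame(
    {
        "tab": ["tab-a", "tab-a", "tab-b", "tab-b"],
        "grid": ["grid-1", "grid-2", "grid-3", "grid-3"],
        "test": ["Overall", "Overall", "Overall", "some-test"],
        "mean_fail_len": [1.0, 3.0, 2.0, 4.0],
    }
)

# Aggregate at the tab level; grouping by "grid" or "test" instead would
# give a different view of the same long dataframe.
per_tab = toy_df.groupby("tab")["mean_fail_len"].mean()
print(per_tab)
```

Note that a plain mean of per-test means weights every test equally regardless of how many failures it had; whether that is appropriate depends on the question being asked.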

In [7]:
import gzip
import json
from enum import Enum

import pandas as pd

In [2]:
with gzip.open("../../../../data/raw/testgrid_810.json.gz", "rb") as read_file:
    data = json.load(read_file)
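
The cells that follow read three fields from each grid: a `"timestamps"` list, and for each test a `"name"` and a run-length encoded `"statuses"` list whose entries have `"count"` and `"value"` keys. The sketch below mimics that layout with a toy grid and expands the encoding into one status per cell. It assumes timestamps are ordered newest first, as the comments in the later loop suggest; `toy_grid` and `decode_statuses` are hypothetical names, and the values are invented.

```python
# Toy grid mimicking the fields the notebook reads. Values are made up.
toy_grid = {
    "timestamps": [5000, 4000, 3000, 2000, 1000],  # assumed newest first
    "grid": [
        {
            "name": "toy-test",
            "statuses": [
                {"count": 2, "value": 1},   # 2 cells with status 1 (PASS)
                {"count": 3, "value": 12},  # 3 cells with status 12 (FAIL)
            ],
        }
    ],
}


def decode_statuses(statuses):
    """Expand run-length encoded statuses into one value per cell."""
    decoded = []
    for s in statuses:
        decoded.extend([s["value"]] * s["count"])
    return decoded


cells = decode_statuses(toy_grid["grid"][0]["statuses"])

# Pair each cell with its timestamp to see when each status was recorded.
print(list(zip(toy_grid["timestamps"], cells)))
```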

In [3]:
class TestStatus(Enum):
    """Enum to encode what test status each value in testgrid corresponds to

    Basically python equivalent of the enum here:
    https://github.com/GoogleCloudPlatform/testgrid/blob/a18fe953cf98174c215c43e0258b0515e37c283b/pb/test_status/test_status.proto#L3
    """

    NO_RESULT = 0
    PASS = 1
    PASS_WITH_ERRORS = 2
    PASS_WITH_SKIPS = 3
    RUNNING = 4
    CATEGORIZED_ABORT = 5
    UNKNOWN = 6
    CANCEL = 7
    BLOCKED = 8
    TIMED_OUT = 9
    CATEGORIZED_FAIL = 10
    BUILD_FAIL = 11
    FAIL = 12
    FLAKY = 13
    TOOL_FAIL = 14
    BUILD_PASSED = 15

In [4]:
# calculate consecutive failure stats
consec_fail_stats_tuples = []

for tab in data.keys():
    print(tab)

    for grid in data[tab].keys():
        current_grid = data[tab][grid]

        ## Extract relevant info for each test
        for current_test in current_grid["grid"]:

            # number of failing cells
            n_failing_cells = 0

            # total number of occurrences of failures (consecutive or one-time)
            n_fail_instances = 0

            # number of occurrences of consecutive (not "one-time") failures
            n_consecutive_fail_instances = 0

            # times spent fixing each occurrence of failure
            times_spent = []

            # helper variables for calculating time spent fixing
            prev_failing = False
            curr_time_spent = 0
            prev_oldest_ts_idx = 0

            for s in current_test["statuses"]:
                # oldest (least recent) timestamp in current rle encoded dict
                curr_oldest_ts_idx = prev_oldest_ts_idx + s["count"]

                # if the current status is not failing and the previous (i.e.
                # the "newer") status was failing, then this marks the start
                # point of failure. since end point would have already been
                # calculated in previous loop, we just need to save time spent
                if s["value"] != TestStatus.FAIL.value:
                    if prev_failing:
                        times_spent.append(curr_time_spent)
                        curr_time_spent = 0

                elif s["value"] == TestStatus.FAIL.value:
                    n_fail_instances += 1
                    n_failing_cells += s["count"]
                    if s["count"] > 1:
                        n_consecutive_fail_instances += 1

                    # if previous (i.e. the "newer") status was not failing
                    # and now it's failing, then time delta between the oldest
                    # ts from previous status and current one must have been
                    # spent fixing the failure
                    if not prev_failing:
                        curr_time_spent += (
                            current_grid["timestamps"][
                                max(0, prev_oldest_ts_idx - 1)
                            ]
                            - current_grid["timestamps"][
                                curr_oldest_ts_idx - 1
                            ]
                        )

                # update helper variables
                prev_failing = s["value"] == TestStatus.FAIL.value
                prev_oldest_ts_idx = curr_oldest_ts_idx

            # test never got to non-fail status again, so the time spent so
            # far won't have been added to times_spent yet
            if curr_time_spent != 0:
                times_spent.append(curr_time_spent)

            ## Calculate stats for this test

            # consecutive failure rate
            try:
                consec_fail_rate = (
                    n_consecutive_fail_instances / n_fail_instances
                )
            except ZeroDivisionError:
                consec_fail_rate = 0

            # mean length of failures
            try:
                mean_fail_len = n_failing_cells / n_fail_instances
            except ZeroDivisionError:
                mean_fail_len = 0

            # mean time to fix
            try:
                mean_time_to_fix = sum(times_spent) / len(times_spent)
            except ZeroDivisionError:
                mean_time_to_fix = 0

            # save the results to list
            consec_fail_stats_tuples.append(
                [
                    tab,
                    grid,
                    current_test["name"],
                    consec_fail_rate,
                    mean_fail_len,
                    mean_time_to_fix,
                ]
            )

len(consec_fail_stats_tuples)

"redhat-openshift-informing"
"redhat-openshift-ocp-release-3.11-informing"
"redhat-openshift-ocp-release-4.1-blocking"
"redhat-openshift-ocp-release-4.1-informing"
"redhat-openshift-ocp-release-4.2-blocking"
"redhat-openshift-ocp-release-4.2-informing"
"redhat-openshift-ocp-release-4.3-blocking"
"redhat-openshift-ocp-release-4.3-broken"
"redhat-openshift-ocp-release-4.3-informing"
"redhat-openshift-ocp-release-4.4-blocking"
"redhat-openshift-ocp-release-4.4-broken"
"redhat-openshift-ocp-release-4.4-informing"
"redhat-openshift-ocp-release-4.5-blocking"
"redhat-openshift-ocp-release-4.5-broken"
"redhat-openshift-ocp-release-4.5-informing"
"redhat-openshift-ocp-release-4.6-blocking"
"redhat-openshift-ocp-release-4.6-broken"
"redhat-openshift-ocp-release-4.6-informing"
"redhat-openshift-ocp-release-4.7-blocking"
"redhat-openshift-ocp-release-4.7-broken"
"redhat-openshift-ocp-release-4.7-informing"
"redhat-openshift-okd-release-4.3-informing"
"redhat-openshift-okd-release-4.4-informing"
"r

177291
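
To make the bookkeeping in the loop above easier to check by hand, here is a simplified, standalone variant that works on a single test row. It is a sketch rather than a drop-in replacement: it records a duration for every failing run, so it can differ from the loop above in edge cases such as a single failing cell that is also the last status entry. The function name `failure_run_stats` and the toy inputs are hypothetical.

```python
FAIL = 12  # same value as TestStatus.FAIL in the enum above


def failure_run_stats(statuses, timestamps):
    """Summarize failing runs for one test row.

    Assumes `statuses` is run-length encoded, newest first, and that
    `timestamps` lines up with the decoded cells (also newest first).
    Returns (consec_fail_rate, mean_fail_len, mean_time_to_fix).
    """
    fail_lens = []  # length in cells of each failing run
    fix_times = []  # time from oldest failing cell to the next newer run
    start = 0  # index of the newest cell in the current run

    for s in statuses:
        end = start + s["count"]  # one past the oldest cell in this run
        if s["value"] == FAIL:
            fail_lens.append(s["count"])
            # Oldest cell of the newer neighbouring run, or cell 0 if the
            # failing run is still ongoing at the newest timestamp.
            newer_idx = max(0, start - 1)
            fix_times.append(timestamps[newer_idx] - timestamps[end - 1])
        start = end

    if not fail_lens:
        return 0, 0, 0

    consec_fail_rate = sum(n > 1 for n in fail_lens) / len(fail_lens)
    mean_fail_len = sum(fail_lens) / len(fail_lens)
    mean_time_to_fix = sum(fix_times) / len(fix_times)
    return consec_fail_rate, mean_fail_len, mean_time_to_fix


# Toy row: pass x2, fail x3, pass x1, fail x1, pass x1 (newest first)
toy_statuses = [
    {"count": 2, "value": 1},
    {"count": 3, "value": FAIL},
    {"count": 1, "value": 1},
    {"count": 1, "value": FAIL},
    {"count": 1, "value": 1},
]
toy_timestamps = [8000, 7000, 6000, 5000, 4000, 3000, 2000, 1000]

print(failure_run_stats(toy_statuses, toy_timestamps))  # (0.5, 2.0, 2000.0)
```

In the toy row, the three-cell failure spans 7000 - 4000 = 3000 and the one-cell failure spans 3000 - 2000 = 1000, giving a mean time to fix of 2000 in the same units as the timestamps.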

In [5]:
# put results in a pretty dataframe
consec_fail_stats_df = pd.DataFrame(
    data=consec_fail_stats_tuples,
    columns=[
        "tab",
        "grid",
        "test",
        "consec_fail_rate",
        "mean_fail_len",
        "mean_time_to_fix",
    ],
)
consec_fail_stats_df.head()

| | tab | grid | test | consec_fail_rate | mean_fail_len | mean_time_to_fix |
|---|---|---|---|---|---|---|
| 0 | "redhat-openshift-informing" | release-openshift-okd-installer-e2e-aws-upgrade | Application behind service load balancer with ... | 0.0625 | 1.0625 | 17479125.0 |
| 1 | "redhat-openshift-informing" | release-openshift-okd-installer-e2e-aws-upgrade | Cluster frontend ingress remain available | 0.142857 | 1.142857 | 19371000.0 |
| 2 | "redhat-openshift-informing" | release-openshift-okd-installer-e2e-aws-upgrade | Kubernetes APIs remain available | 0.0 | 1.0 | 26793750.0 |
| 3 | "redhat-openshift-informing" | release-openshift-okd-installer-e2e-aws-upgrade | Monitor cluster while tests execute | 0.125 | 1.125 | 17794562.5 |
| 4 | "redhat-openshift-informing" | release-openshift-okd-installer-e2e-aws-upgrade | OpenShift APIs remain available | 0.0 | 1.0 | 7584500.0 |
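
The `mean_time_to_fix` values are differences between raw grid timestamps, so they are easier to read once converted to hours. The sketch below assumes the timestamps are in milliseconds, which is not stated in this notebook and should be checked against the data; `ms_to_hours` is a hypothetical helper.

```python
from datetime import timedelta


def ms_to_hours(ms):
    """Convert a duration in milliseconds to hours."""
    return timedelta(milliseconds=ms) / timedelta(hours=1)


# mean_time_to_fix of the first row in the output above
print(round(ms_to_hours(17479125.0), 2))  # 4.86
```

Under the millisecond assumption, the first row corresponds to a mean time to fix of just under five hours.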


In [6]:
# the output here shows which tabs and grids have the longest failures overall
res = consec_fail_stats_df[consec_fail_stats_df["test"] == "Overall"]
res.sort_values("mean_time_to_fix", ascending=False).head()

| | tab | grid | test | consec_fail_rate | mean_fail_len | mean_time_to_fix |
|---|---|---|---|---|---|---|
| 19692 | "redhat-openshift-ocp-release-4.3-broken" | release-openshift-ocp-installer-e2e-azure-ovn-4.3 | Overall | 1.0 | 90.0 | 5130903000.0 |
| 35064 | "redhat-openshift-ocp-release-4.3-informing" | release-openshift-origin-installer-e2e-aws-4.3... | Overall | 1.0 | 60.0 | 5100456000.0 |
| 48958 | "redhat-openshift-ocp-release-4.4-broken" | release-openshift-origin-installer-e2e-azure-c... | Overall | 1.0 | 60.0 | 5100443000.0 |
| 69902 | "redhat-openshift-ocp-release-4.4-informing" | release-openshift-origin-installer-e2e-aws-upg... | Overall | 1.0 | 60.0 | 5100441000.0 |
| 164990 | "redhat-openshift-okd-release-4.4-informing" | promote-release-openshift-okd-machine-os-conte... | Overall | 1.0 | 60.0 | 5100395000.0 |
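
One of the goals stated at the top is to compare platforms, and the grid names in the output above appear to embed the platform (e.g. `aws`, `azure`). The sketch below shows one possible way to derive a platform label from a grid name by keyword matching. The keyword list and the `guess_platform` helper are hypothetical and would need to be adapted to the grid names actually present in the data.

```python
import re

# Hypothetical keyword list; extend it to match the platforms in your data.
PLATFORM_KEYWORDS = ("aws", "azure", "gcp", "metal")


def guess_platform(grid_name):
    """Return the first keyword in PLATFORM_KEYWORDS found as a token.

    The grid name is split on dashes and dots; "unknown" is returned when
    no keyword matches.
    """
    tokens = re.split(r"[-.]", grid_name.lower())
    for keyword in PLATFORM_KEYWORDS:
        if keyword in tokens:
            return keyword
    return "unknown"


print(guess_platform("release-openshift-ocp-installer-e2e-azure-ovn-4.3"))
print(guess_platform("release-openshift-okd-installer-e2e-aws-upgrade"))
```

A label derived this way could be added as a new column on the long dataframe and then used as a grouping key, in the same way as `tab` or `grid`.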
