# Correlated test failure sets per test and average size of correlation set

This notebook outputs 2 artifacts: 

1. A CSV file that lists, for a given test, all of the other tests that are highly correlated with it (correlation coefficient above 0.9). This CSV file omits any test that has no highly correlated tests, so a test that is absent from the list has no highly correlated tests associated with it. The correlation is calculated on all available data exposed by the Red Hat TestGrid instance at the time the notebook is run.  

2. A summary metric that can be easily tracked over time that represents the average size of correlated test sets in the above CSV. 


__Note__: This notebook follows a very similar approach to an earlier [EDA notebook](https://github.com/aicoe-aiops/ocp-ci-analysis/blob/master/notebooks/data-sources/Sippy/sippy_failure_correlation.ipynb) where we correlated failures with a different dataset. For simplicity, much of the reasoning behind the decisions made in this notebook has been omitted here, but can be found in the linked notebook.  

In [1]:
import gzip
import json
import os
import sys
import numpy as np
import pandas as pd
import datetime
from pathlib import Path

sys.path.append("../../..")

module_path_1 = os.path.abspath(os.path.join("../../../data-sources/TestGrid"))
if module_path_1 not in sys.path:
    sys.path.append(module_path_1)

from ipynb.fs.defs.testgrid_EDA import decode_run_length  # noqa: E402

In [2]:
# Load test file
with gzip.open("../../../../data/raw/testgrid_810.json.gz", "rb") as read_file:
    testgrid_data = json.load(read_file)

In [3]:
current_grid = testgrid_data['"redhat-openshift-ocp-release-4.4-informing"'][
    "release-openshift-origin-installer-e2e-azure-shared-vpc-4.4"
]

In [4]:
tests = [current_grid["grid"][i]["name"] for i in range(len(current_grid["grid"]))]
# unroll the run-length encoding and set bool for failure or not (x == 12)
decoded = [
    (np.array(decode_run_length(current_grid["grid"][i]["statuses"])) == 12).tolist()
    for i in range(len(current_grid["grid"]))
]
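
The `decode_run_length` helper is imported from the TestGrid EDA notebook and is not shown here. As a minimal sketch, assuming each entry in `statuses` is a dictionary with `value` and `count` keys (an assumption about the data layout, not something this notebook confirms), the decoding and failure masking could look like the following. The name `decode_run_length_sketch` and the toy data are hypothetical.

```python
import numpy as np

FAILURE_STATUS = 12  # the status code the cell above compares against


def decode_run_length_sketch(statuses):
    """Expand run-length encoded statuses into one value per build.

    Assumes each run is a dict with "value" and "count" keys.
    """
    decoded = []
    for run in statuses:
        decoded.extend([run["value"]] * run["count"])
    return decoded


# Toy input: two builds with status 1, one with status 12, two with status 0
toy_statuses = [
    {"value": 1, "count": 2},
    {"value": 12, "count": 1},
    {"value": 0, "count": 2},
]
toy_decoded = decode_run_length_sketch(toy_statuses)

# Boolean mask: True where the build has the failure status
toy_failed = (np.array(toy_decoded) == FAILURE_STATUS).tolist()
print(toy_decoded)
print(toy_failed)
```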

In [5]:
matrix = pd.DataFrame(zip(tests, decoded), columns=["test", "values"])
matrix.head()

Unnamed: 0,test,values
0,Monitor cluster while tests execute,"[True, True, True, True, True, True, True, Tru..."
1,Overall,"[False, False, True, False, False, True, False..."
2,[Conformance][Area:Networking][Feature:Router]...,"[False, False, True, False, False, False, Fals..."
3,[Conformance][Area:Networking][Feature:Router]...,"[False, False, True, False, False, False, Fals..."
4,[Conformance][Area:Networking][Feature:Router]...,"[False, False, True, False, False, False, Fals..."


In [6]:
matrix = pd.DataFrame(matrix["values"].to_list(), index=matrix["test"])
matrix.head()

test,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
Monitor cluster while tests execute,True,True,True,True,True,True,True,True,True,False,...,True,True,True,True,True,True,True,True,True,True
Overall,False,False,True,False,False,True,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
[Conformance][Area:Networking][Feature:Router] The HAProxy router [Top Level] [Conformance][Area:Networking][Feature:Router] The HAProxy router should enable openshift-monitoring to pull metrics [Skipped:ibmcloud] [Suite:openshift/conformance/parallel/minimal],False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
[Conformance][Area:Networking][Feature:Router] The HAProxy router [Top Level] [Conformance][Area:Networking][Feature:Router] The HAProxy router should override the route host for overridden domains with a custom value [Suite:openshift/conformance/parallel/minimal],False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
[Conformance][Area:Networking][Feature:Router] The HAProxy router [Top Level] [Conformance][Area:Networking][Feature:Router] The HAProxy router should respond with 503 to unrecognized hosts [Suite:openshift/conformance/parallel/minimal],False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [7]:
matrix.tail()

test,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
operator install service-catalog-apiserver,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
operator install service-catalog-controller-manager,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
operator.All images are built and tagged into stable,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
operator.Find all of the input images from ocp/4.4:${component} and tag them into the output image stream,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
operator.Run template e2e-azure - e2e-azure container teardown,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [8]:
# Here we iterate through each grid in our dataset and collect the names of all tests that fail
# during a build. We will store these in the 'failure_groups' list.

failure_groups = []

for tab in testgrid_data.keys():
    for grid in testgrid_data[tab].keys():
        current_grid = testgrid_data[tab][grid]

        tests = [
            current_grid["grid"][i]["name"] for i in range(len(current_grid["grid"]))
        ]
        # unroll the run-length encoding and set bool for failure or not (x == 12)
        decoded = [
            (
                np.array(decode_run_length(current_grid["grid"][i]["statuses"])) == 12
            ).tolist()
            for i in range(len(current_grid["grid"]))
        ]

        matrix = pd.DataFrame(zip(tests, decoded), columns=["test", "values"])
        matrix = pd.DataFrame(matrix["values"].to_list(), index=matrix["test"])

        # DataFrame.items() replaces iteritems(), which was removed in pandas 2.0
        for c, items in matrix.items():
            if len(items[items].index) > 1:
                failure_groups.append(items[items].index)

In [9]:
failure_groups = pd.Series(failure_groups)

In [10]:
len(failure_groups)

14491

In [11]:
# Now we want a vocabulary of all unique tests in our dataset so that we can encode our
# failure sets using a one-hot encoding scheme.
vocab = []
count = 0
for i in failure_groups:
    for j in i:
        count += 1
        if j not in vocab:
            vocab.append(j)

print(count)
len(vocab)

115746


5065

In [12]:
# confirm that there are no duplicates in the vocab
len(pd.Series(vocab).unique())

5065

In [13]:
def encode_tests(job):
    encoded = []
    for v in vocab:
        if v in job:
            encoded.extend([1])
        else:
            encoded.extend([0])
    return encoded
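
The encoding above scans the whole vocabulary once per failure group. As an alternative sketch on toy data (the test names and the `toy_*` variables are hypothetical, not taken from the dataset), a dictionary that maps each test name to a column index lets us fill the one-hot matrix directly:

```python
import numpy as np
import pandas as pd

# Toy failure groups: each inner list holds the tests that failed in one build
toy_groups = [
    ["test_a", "test_b"],
    ["test_b", "test_c"],
    ["test_a", "test_b", "test_c"],
]

# Build the vocabulary in first-seen order; the dict gives constant-time lookups
toy_vocab = {}
for group in toy_groups:
    for name in group:
        toy_vocab.setdefault(name, len(toy_vocab))

# One row per failure group, one column per test in the vocabulary
toy_encoded = np.zeros((len(toy_groups), len(toy_vocab)), dtype=int)
for row, group in enumerate(toy_groups):
    for name in group:
        toy_encoded[row, toy_vocab[name]] = 1

toy_df = pd.DataFrame(toy_encoded, columns=list(toy_vocab))
print(toy_df)
```

This yields the same kind of table as `df_encoded`, with a 1 wherever a test belongs to a failure group.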

In [14]:
encoded = failure_groups.apply(encode_tests)

In [15]:
encoded.head()

0    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
2    [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3    [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4    [0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, ...
dtype: object

In [16]:
df_encoded = pd.DataFrame(encoded.array, columns=vocab)
df_encoded.head()

Unnamed: 0,Application behind service load balancer with PDB is not disrupted,Cluster frontend ingress remain available,Kubernetes APIs remain available,Monitor cluster while tests execute,OpenShift APIs remain available,Overall,operator.Run template e2e-aws-upgrade - e2e-aws-upgrade container setup,operator.Run template e2e-aws-upgrade - e2e-aws-upgrade container test,[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift],Operator upgrade authentication,...,[Suite: e2e] Pods should be Running or Succeeded,[install] [Suite: operators] [OSD] Managed Velero Operator velero Access should be allowed to edit Schedules,[install] [Suite: operators] [OSD] Managed Velero Operator velero Access should be allowed to edit VolumeSnapshotLocations,[install] [Suite: operators] [OSD] Configure AlertManager Operator clusterRoles should exist,[install] [Suite: operators] [OSD] Configure AlertManager Operator configmaps should exist,[install] [Suite: operators] [OSD] Configure AlertManager Operator deployment should exist,[install] [Suite: operators] [OSD] Configure AlertManager Operator deployment should have all desired replicas ready,[Suite: operators] AlertmanagerInhibitions inhibits ClusterOperatorDegraded,[Suite: operators] [OSD] Splunk Forwarder Operator Operator Upgrade should upgrade from the replaced version,[Suite: operators] [OSD] RBAC Operator Operator Upgrade should upgrade from the replaced version
0,1,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,1,1,1,1,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
# fraction of failure groups in which each test is present
perc_present = df_encoded.sum() / len(df_encoded)
perc_present.sort_values(ascending=False).head(3)

Overall                                                                 0.788627
Monitor cluster while tests execute                                     0.340004
operator.Import the release payload "latest" from an external source    0.103582
dtype: float64

In [18]:
occurrence_count = df_encoded.sum()
occurrence_count.sort_values(ascending=False).head(3)

Overall                                                                 11428
Monitor cluster while tests execute                                      4927
operator.Import the release payload "latest" from an external source     1501
dtype: int64

We also want to make sure that high correlation values are not merely an artifact of failure sets that appear only once in our dataset; the tests we report should impact multiple jobs. For example, if a failure set occurred in a single job and shared none of its tests with any other failure set, all of its tests would appear 100% correlated with each other, when that is merely a consequence of insufficient data. To prevent this, we ignore any test that occurs in only one job: we use `occurrence_count` to build a filter list of tests that occur at most once, then drop them from our working DataFrame.
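
To illustrate the problem on a small scale, the sketch below uses a toy one-hot table (the column names and values are made up for illustration). Two tests that each fail only once, in the same job, come out perfectly correlated, and filtering by occurrence count removes them.

```python
import pandas as pd

# Toy one-hot table: rows are failure groups, columns are tests
toy = pd.DataFrame(
    {
        "common_test": [1, 0, 1, 0, 1],
        "rare_a": [0, 0, 0, 1, 0],  # fails in exactly one job
        "rare_b": [0, 0, 0, 1, 0],  # fails in that same single job
    }
)

# The two rare tests have identical columns, so their correlation is perfect
rare_corr = toy["rare_a"].corr(toy["rare_b"])
print(rare_corr)

# Same filter as in the notebook: drop tests that occur at most once
toy_counts = toy.sum()
toy_singletons = list(toy_counts[toy_counts <= 1].index)
toy_filtered = toy.drop(toy_singletons, axis=1)
print(list(toy_filtered.columns))
```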

In [19]:
filter_unique = list(occurrence_count[occurrence_count.values <= 1].index)

In [20]:
df_encoded = df_encoded.drop(filter_unique, axis=1)

In [21]:
df_encoded.shape

(14491, 4267)

In [22]:
# this takes time
corr_matrix = df_encoded.corr()
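
When there are no missing values, `DataFrame.corr()` computes the Pearson correlation between columns, which `np.corrcoef` computes as well. The sketch below checks that equivalence on randomly generated toy data (not the TestGrid data). Whether the NumPy route is noticeably faster on the full matrix has not been measured here.

```python
import numpy as np
import pandas as pd

# Random binary toy data standing in for the one-hot failure table
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 2, size=(50, 4)), columns=list("abcd"))

# Pearson correlation via pandas (the approach used in this notebook)
corr_pandas = toy.corr()

# The same quantity via NumPy; rowvar=False treats columns as variables
corr_numpy = pd.DataFrame(
    np.corrcoef(toy.to_numpy(), rowvar=False),
    index=toy.columns,
    columns=toy.columns,
)

print(np.allclose(corr_pandas.to_numpy(), corr_numpy.to_numpy()))
```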

In [23]:
# For each feature, find the other features that are correlated by more than 0.9
top_correlation = {}

for c in corr_matrix.columns:
    top_correlation[c] = []
    series = corr_matrix.loc[c]

    for i, s in enumerate(series):
        if s > 0.90 and series.index[i] != c:
            top_correlation[c].append((series.index[i], s))

len(top_correlation)

4267
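
The same thresholding can be expressed with pandas boolean indexing instead of an inner loop. This is only a sketch on a hand-written toy correlation matrix; the names `toy_corr` and `toy_top` are hypothetical.

```python
import pandas as pd

names = ["t1", "t2", "t3"]
toy_corr = pd.DataFrame(
    [[1.0, 0.95, 0.2], [0.95, 1.0, 0.1], [0.2, 0.1, 1.0]],
    index=names,
    columns=names,
)

toy_top = {}
for name, row in toy_corr.iterrows():
    others = row.drop(name)  # exclude the test's correlation with itself
    high = others[others > 0.90]  # keep only strongly correlated tests
    toy_top[name] = list(zip(high.index, high.values))

print(toy_top)
```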

In [24]:
pd.set_option("display.max_colwidth", 150)
# top_correlation has a number of empty entries, as not all tests have high correlations with others;
# let's grab only the sets that have at least 1 highly correlated test
corr_sets = []
for i in top_correlation.items():
    if len(i[1]) >= 1:
        corr_sets.append(i)
print(f"{len(corr_sets)} sets of correlated tests \n")
print(f"Feature of interest: {corr_sets[1][0]}")
pd.DataFrame(corr_sets[1][1], columns=["test_name", "correlation coefficient"])

614 sets of correlated tests 

Feature of interest: [sig-api-machinery] OpenShift APIs remain available


Unnamed: 0,test_name,correlation coefficient
0,[sig-api-machinery] OAuth APIs remain available,0.980253


In [25]:
test_name = "[sig-api-machinery] OpenShift APIs remain available"
num = occurrence_count.loc[test_name]
print(f"{num} : the number of times this test failed in our data set")

102 : the number of times this test failed in our data set


In [26]:
lst = []
focus = corr_sets[1][1]
for j in focus:
    lst.append((j[0], occurrence_count.loc[j[0]]))

pd.DataFrame(lst, columns=["test_name", "num_occurrences"])

Unnamed: 0,test_name,num_occurrences
0,[sig-api-machinery] OAuth APIs remain available,102


In [27]:
# Save CSV
Path("../../../../data/processed/metrics/failure_correlation/").mkdir(
    parents=True, exist_ok=True
)
save = pd.DataFrame(corr_sets, columns=["test_name", "correlated_tests"])
save.to_csv(
    "../../../../data/processed/metrics/failure_correlation/failure_correlation_sets.csv"
)
save.head()

Unnamed: 0,test_name,correlated_tests
0,[sig-api-machinery] OAuth APIs remain available,"[([sig-api-machinery] OpenShift APIs remain available, 0.9802531617970734)]"
1,[sig-api-machinery] OpenShift APIs remain available,"[([sig-api-machinery] OAuth APIs remain available, 0.9802531617970734)]"
2,Operator upgrade kube-controller-manager,"[(Operator upgrade etcd, 1.0)]"
3,Operator upgrade etcd,"[(Operator upgrade kube-controller-manager, 1.0)]"
4,operator install kube-scheduler,"[(operator install kube-controller-manager, 0.9261749970073855)]"


In [28]:
save.shape

(614, 2)

In [29]:
timestamp = datetime.datetime.now()
average_corr = save["correlated_tests"].apply(len).mean()
metric_to_save = pd.DataFrame(
    [[timestamp, average_corr]],
    columns=["timestamp", "average_number_of_correlated_failures"],
)
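
As a small worked example of this summary metric, consider a toy table with three tests (the values are invented for illustration). Their correlation sets have sizes 1, 2, and 1, so the average size is 4/3.

```python
import pandas as pd

# Toy version of the saved table: each test with its list of correlated tests
toy_sets = pd.DataFrame(
    [
        ("t1", [("t2", 0.95)]),
        ("t2", [("t1", 0.95), ("t3", 0.92)]),
        ("t3", [("t2", 0.92)]),
    ],
    columns=["test_name", "correlated_tests"],
)

# Size of each correlation set, then the mean across all tests in the table
toy_sizes = toy_sets["correlated_tests"].apply(len)
toy_average = toy_sizes.mean()
print(toy_sizes.tolist(), toy_average)
```

Because tests with an empty correlation set are excluded before this step, the average is taken only over tests that have at least one highly correlated partner.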

In [30]:
# Save average size of correlation set
metric_to_save.to_csv(
    r"../../../../data/processed/metrics/failure_correlation/{}-{}-{}-avg-corr.csv".format(
        timestamp.year, timestamp.month, timestamp.day
    )
)

In [31]:
metric_to_save

Unnamed: 0,timestamp,average_number_of_correlated_failures
0,2021-03-09 16:23:24.005437,9.397394
