# Prow Logs and GCS Data

### What do we have access to as data scientists when digging into the build artifacts?

In this notebook we will demonstrate how to discover and interact with the data (logs) made available on [GCS/origin-ci-test](https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/), and provide some simple exploratory data analysis (EDA) to help folks get started analyzing this data.

This notebook is divided into 2 sections:

1. Compare the different log files present throughout the archives and quantify how complete and comparable our log dataset is from build to build.
1. Download a sample dataset of the events and build logs to perform some light EDA.

_Note: We will be collecting data from the "origin-ci-test" bucket on Google Cloud Storage. After some out-of-notebook exploration, it became apparent that this bucket holds a massive amount of data, containing more than just the OpenShift CI logs we are interested in here, so programmatically investigating the whole bucket is not advised. We therefore recommend using the [web UI](https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/) to inspect which jobs are exposed and identify what is of interest to your analysis before collecting data via the Google Cloud Storage API. Here we will rely on web-scraping the UI to explore what's available to us based on what jobs are displayed on [TestGrid](https://testgrid.k8s.io/redhat-assisted-installer)._

## Compare availability of log files across a build

In [1]:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
from google.cloud import storage
from pathlib import Path

### Example to access a single set of Prow artifacts

Let's make sure we understand how this works, and focus on a single job first.

In [2]:
tab = '"redhat-openshift-ocp-release-4.6-informing"'
job = "periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade"

In [3]:
response = requests.get(
    f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{job}"
)
soup = BeautifulSoup(response.text, "html.parser")
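# Each link text looks like " <build-id>/": strip the leading space and the
# trailing slash. The outer [1:-1] skips the first and last links on the page,
# which this code treats as non-build entries (the first is the parent link " ..").
list_of_builds = [x.get_text()[1:-1] for x in soup.find_all("a")][1:-1]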

In [4]:
response = requests.get(
    f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{job}/{list_of_builds[1]}"
)
response.url

'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade/1364869749170769920'

In [5]:
soup = BeautifulSoup(response.text, "html.parser")

In [6]:
[x.get_text() for x in soup.find_all("a")]

[' ..',
 ' artifacts/',
 ' build-log.txt',
 ' finished.json',
 ' podinfo.json',
 ' prowjob.json',
 ' started.json']
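The link texts in this listing carry a leading space and include the parent-directory entry `..`. As a minimal sketch, a small helper can normalize them before any counting or comparison. The name `clean_listing` is hypothetical and not part of the original notebook, whose later cells keep the raw texts (so their item counts include `..`).

```python
def clean_listing(link_texts):
    """Strip surrounding whitespace and drop the parent-directory link."""
    cleaned = [text.strip() for text in link_texts]
    return [name for name in cleaned if name and name != ".."]


# Link texts copied from the listing above.
raw_links = [
    " ..",
    " artifacts/",
    " build-log.txt",
    " finished.json",
    " podinfo.json",
    " prowjob.json",
    " started.json",
]

entries = clean_listing(raw_links)
print(entries)
```

With the parent link removed, this build exposes six top-level entries.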

Great, we can now programmatically access the archives. Now, let's walk through all of the build archives for a single job and list what each one has at the first level of its directory.

In [7]:
build_data = {}

for build in list_of_builds:
    response = requests.get(
        f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{job}/{build}"
    )
    soup = BeautifulSoup(response.text, "html.parser")
    artifacts = [x.get_text() for x in soup.find_all("a")]
    build_data[build] = artifacts

In [8]:
builds_info = pd.Series({k: len(v) for (k, v) in build_data.items()})

In [9]:
builds_info.value_counts()

7    238
6      3
5      1
dtype: int64

In [10]:
pd.Series(build_data).apply(" ".join).value_counts()

 ..  artifacts/  build-log.txt  finished.json  podinfo.json  prowjob.json  started.json    238
 ..  artifacts/  build-log.txt  finished.json  prowjob.json  started.json                    2
 ..  build-log.txt  finished.json  podinfo.json  prowjob.json  started.json                  1
 ..  build-log.txt  finished.json  prowjob.json  started.json                                1
dtype: int64

In [11]:
builds_info.value_counts() / len(builds_info)

7    0.983471
6    0.012397
5    0.004132
dtype: float64
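The combination counts can also be read per file, which answers "how often is each entry present?" directly. The sketch below re-enters the four combinations and their build counts from the `value_counts()` output above (with `..` omitted) and uses only the standard library; the variable names are illustrative and not from the original notebook.

```python
from collections import Counter

# Top-level listings and the number of builds showing each one,
# copied from the value_counts() output above (".." omitted).
combos = {
    ("artifacts/", "build-log.txt", "finished.json", "podinfo.json",
     "prowjob.json", "started.json"): 238,
    ("artifacts/", "build-log.txt", "finished.json",
     "prowjob.json", "started.json"): 2,
    ("build-log.txt", "finished.json", "podinfo.json",
     "prowjob.json", "started.json"): 1,
    ("build-log.txt", "finished.json", "prowjob.json", "started.json"): 1,
}

total_builds = sum(combos.values())

# Count, for every entry, how many builds contain it.
present = Counter()
for entries, n_builds in combos.items():
    for entry in entries:
        present[entry] += n_builds

availability = {entry: count / total_builds for entry, count in present.items()}
for entry, share in sorted(availability.items()):
    print(f"{entry:<15} {present[entry]:>3}/{total_builds}  {share:.1%}")
```

Read this way, `build-log.txt` is present in all 242 builds, while `artifacts/` is missing from 2 of them and `podinfo.json` from 3.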

About 98% of our records for this job appear to be complete and include the 'artifacts/' subdirectory. Let's dig in and see what those subdirectories contain.

In [12]:
build_data = {}

for build in list_of_builds:
    response = requests.get(
        f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{job}/{build}/artifacts"
    )
    soup = BeautifulSoup(response.text, "html.parser")
    artifacts = [x.get_text() for x in soup.find_all("a")]
    build_data[build] = artifacts

In [13]:
artifact_info = pd.Series({k: len(v) for (k, v) in build_data.items()})
artifact_info.value_counts()

6    95
8    58
7    46
5    36
1     2
3     2
4     2
2     1
dtype: int64

In [14]:
artifact_info.value_counts() / len(artifact_info)

6    0.392562
8    0.239669
7    0.190083
5    0.148760
1    0.008264
3    0.008264
4    0.008264
2    0.004132
dtype: float64

The above shows that about 39% of the artifacts directories have 6 items, 24% have 8, 19% have 7, and 15% have 5, but it does not account for the different combinations of items.

In [15]:
pd.Series(build_data).apply(" ".join).value_counts(normalize=True)

 ..  build-resources/  e2e-gcp-upgrade/  release/  junit_operator.xml  metadata.json                                                  0.380165
 ..  build-resources/  e2e-gcp-upgrade/  release/  ci-operator-step-graph.json  ci-operator.log  junit_operator.xml  metadata.json    0.239669
 ..  build-resources/  e2e-gcp-upgrade/  release/  ci-operator.log  junit_operator.xml  metadata.json                                 0.161157
 ..  build-resources/  e2e-gcp-upgrade/  junit_operator.xml  metadata.json                                                            0.111570
 ..  build-resources/  release/  junit_operator.xml  metadata.json                                                                    0.037190
 ..  build-resources/  release/  ci-operator-step-graph.json  ci-operator.log  junit_operator.xml  metadata.json                      0.028926
 ..  build-resources/  release/  ci-operator.log  junit_operator.xml  metadata.json                                                   0.008264

We can see from the results above that once we get down into the artifacts there is far less uniformity in the data available to us for analysis. And this is all within a single job! Moving forward, we will assume that this issue gets worse when comparing available artifacts across jobs, and we can dedicate a later notebook to testing that assumption.

This heterogeneity of objects available for each build will make it somewhat difficult to use these sets of documents as a whole to compare different CI behaviour. At this point, it makes sense to consider looking only at the same document (log) across jobs, where available.
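One way to make "the same document where available" concrete is to intersect the listings. The sketch below re-enters the seven artifact combinations shown above (again without `..`) and computes which entries appear in all of them. It covers only the combinations printed in that output, whose shares sum to roughly 97% of builds, so the remaining builds may narrow the common set further.

```python
# Artifact listings copied from the combinations shown above (".." omitted).
listings = [
    {"build-resources/", "e2e-gcp-upgrade/", "release/",
     "junit_operator.xml", "metadata.json"},
    {"build-resources/", "e2e-gcp-upgrade/", "release/",
     "ci-operator-step-graph.json", "ci-operator.log",
     "junit_operator.xml", "metadata.json"},
    {"build-resources/", "e2e-gcp-upgrade/", "release/",
     "ci-operator.log", "junit_operator.xml", "metadata.json"},
    {"build-resources/", "e2e-gcp-upgrade/",
     "junit_operator.xml", "metadata.json"},
    {"build-resources/", "release/", "junit_operator.xml", "metadata.json"},
    {"build-resources/", "release/", "ci-operator-step-graph.json",
     "ci-operator.log", "junit_operator.xml", "metadata.json"},
    {"build-resources/", "release/", "ci-operator.log",
     "junit_operator.xml", "metadata.json"},
]

# Entries present in every listed combination vs. entries that come and go.
common = set.intersection(*listings)
optional = set.union(*listings) - common

print("always present:", sorted(common))
print("sometimes present:", sorted(optional))
```

Among the listed combinations, only `build-resources/`, `junit_operator.xml`, and `metadata.json` are always present, which is consistent with focusing on a small set of widely available files.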


## Collect Data

### Build logs

In this section we are going to walk through accessing `build-log.txt` and `events.json`, as they appear to be nearly universally available. We will both download a small testing dataset and show how to work directly with the data in memory.

Now that we know which logs we want to collect, it is simpler to use the Google Cloud Storage API to access our data.

In [16]:
def connect_storage(bucket_name):
    storage_client = storage.Client.create_anonymous_client()
    bucket = storage_client.bucket(bucket_name)
    return {"bucket": bucket, "storage_client": storage_client}


def download_public_file(client, source_blob_name):
    """Downloads a public blob from the bucket."""
    blob = client["bucket"].blob(source_blob_name)
    if blob.exists(client["storage_client"]):
        text = blob.download_as_text()
    else:
        text = ""
    return text
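When moving from the web UI to the storage API, it helps to translate a gcsweb URL into a bucket name and a blob name. The sketch below assumes the URL layout seen in this notebook, `/gcs/<bucket>/<blob path>`, which is inferred from the URLs and blob paths used here rather than taken from gcsweb documentation. The helper name `gcsweb_url_to_blob` is hypothetical.

```python
from urllib.parse import urlparse

GCSWEB_PREFIX = "/gcs/"  # assumption: inferred from the URLs used in this notebook


def gcsweb_url_to_blob(url):
    """Split a gcsweb URL into (bucket_name, blob_name)."""
    path = urlparse(url).path
    if not path.startswith(GCSWEB_PREFIX):
        raise ValueError(f"not a gcsweb path: {path}")
    bucket_name, _, blob_name = path[len(GCSWEB_PREFIX):].partition("/")
    return bucket_name, blob_name


job = "periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade"
url = (
    "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/"
    f"logs/{job}/1364869749170769920/build-log.txt"
)

bucket_name, blob_name = gcsweb_url_to_blob(url)
print(bucket_name)
print(blob_name)
```

The resulting blob name has the same `logs/{job}/{build}/build-log.txt` shape that is passed to `download_public_file`.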

In [17]:
bucket_connection = connect_storage("origin-ci-test")

In [18]:
# Read data into memory
build_log_data = {}
for build in list_of_builds:
    file = download_public_file(bucket_connection, f"logs/{job}/{build}/build-log.txt")
    build_log_data[build] = file

In [19]:
build_log_data[list(build_log_data.keys())[0]]

'2021/02/25 03:27:14 ci-operator version v20210224-231f07b\n2021/02/25 03:27:14 Loading configuration from https://config.ci.openshift.org for openshift/release@master [ci-4.6-upgrade-from-stable-4.5]\nerror: failed to load configuration: got unexpected http 404 status code from configresolver: failed to get config: could not find any config for branch master on repo openshift/release\ntime="2021-02-25T03:27:14Z" level=info msg="Reporting job state \'failed\' with reason \'loading_args:loading_config:config_resolver\'"\n'

In [20]:
def get_counts(x):
    """
    Gets counts for chars, words, lines for a log.
    """
    if x:
        chars = len(x)
        words = len(x.split())
        lines = x.count("\n") + 1
        return chars, words, lines
    else:
        return 0, 0, 0
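One detail worth keeping in mind when reading the line counts: `x.count("\n") + 1` treats a trailing newline as the start of one more (empty) line. The build log printed above consists of four newline-terminated lines, so this function reports 5 lines for it. The sketch below reproduces `get_counts` as defined above and compares it with a hypothetical alternative, `count_lines`, based on `str.splitlines()`. Note that `splitlines()` also splits on other line boundaries such as `\r`, so the two can differ for other reasons as well.

```python
def get_counts(x):
    """Same logic as the notebook's get_counts: chars, words, lines."""
    if x:
        return len(x), len(x.split()), x.count("\n") + 1
    return 0, 0, 0


def count_lines(x):
    """Hypothetical alternative: count lines without the trailing-newline extra."""
    return len(x.splitlines())


sample = "first line\nsecond line\n"

chars, words, lines = get_counts(sample)
print(chars, words, lines)   # lines includes the empty string after the last "\n"
print(count_lines(sample))   # counts only the two actual lines
```

Either convention works for comparing builds, as long as it is applied consistently.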

In [21]:
## Create a dataframe with char, words, and lines
## count for the logs
data = []
for key, value in build_log_data.items():
    chars, words, lines = get_counts(value)
    data.append([key, chars, words, lines])

df = pd.DataFrame(data=data, columns=["build_log_id", "chars", "words", "lines"])
df

Unnamed: 0,build_log_id,chars,words,lines
0,1364778930069835776,517,51,5
1,1364869749170769920,10805,919,111
2,1364960659313266688,10807,919,111
3,1365051265100288000,15758,1316,184
4,1365142142841786368,10804,919,111
...,...,...,...,...
237,1386232963561164800,16674,1166,120
238,1386323699170283520,16954,1189,120
239,1386414426231410688,16673,1167,120
240,1386505145390469120,16674,1167,120


#### See the stats for chars, words, lines

In [22]:
df["chars"].describe()

count    2.420000e+02
mean     1.805198e+05
std      8.681102e+05
min      9.200000e+01
25%      1.084125e+04
50%      1.122550e+04
75%      1.675175e+04
max      6.662932e+06
Name: chars, dtype: float64

In [23]:
df["words"].describe()

count       242.000000
mean      12469.330579
std       60873.841513
min          11.000000
25%         919.000000
50%         926.000000
75%        1201.750000
max      508998.000000
Name: words, dtype: float64

In [24]:
df["lines"].describe()

count      242.000000
mean       722.760331
std       3001.253363
min          2.000000
25%        111.000000
50%        113.000000
75%        120.000000
max      21291.000000
Name: lines, dtype: float64

From the initial analysis above, we see that log files range from 2 lines to ~21,000 lines. The mean is ~720 lines while the median is only 113, so a small number of very long logs dominates the mean and the distribution is highly variable. The next things we could look at are the similarity between the log files, word analysis, templating, and clustering. We will address those questions in an upcoming notebook.

### Events

In [25]:
build_events_data = {}
for build in list_of_builds:
    file = download_public_file(
        bucket_connection, f"logs/{job}/{build}/artifacts/build-resources/events.json"
    )
    if file:
        build_events_data[build] = json.loads(file)
    else:
        build_events_data[build] = ""

In [26]:
## Percentage of builds that have the events.json file
count = sum(1 for value in build_events_data.values() if value)
count * 100 / len(build_events_data)

97.93388429752066

In [27]:
# Analyzing the messages of a single build
messages = [
    (i["metadata"]["uid"], i["message"])
    for i in build_events_data["1364869749170769920"]["items"]
]
messages_df = pd.DataFrame(messages, columns=["UID", "message"])
messages_df

Unnamed: 0,UID,message
0,504b881c-9e97-46a0-b206-765c9973e1d3,Running job periodic-ci-openshift-release-mast...
1,3a92467e-993e-43a1-8eee-24ba7a22508f,Running job periodic-ci-openshift-release-mast...
2,ed40875c-b182-4164-b730-a3754ed94124,Running job periodic-ci-openshift-release-mast...
3,b8c5b026-936e-4f28-a0de-62051ff378d8,Running job periodic-ci-openshift-release-mast...
4,b1b8265a-d87d-430d-b38e-c10c7f3fb91c,No matching pods found
...,...,...
477,db97f7b5-5fb4-4e68-aa82-cd9c120b9c8d,"Container image ""gcr.io/k8s-prow/sidecar:v2021..."
478,947a5c3f-f318-42d8-88c5-133f9783e94d,Created container sidecar
479,e94e0988-c2ea-4045-bb84-251410018d94,Started container sidecar
480,00bef865-7965-4b65-b39a-9993347c5942,"Back-off pulling image ""image-registry.openshi..."


In [28]:
messages_df["message"].describe()

count                                                   482
unique                                                  156
top       Container image "gcr.io/k8s-prow/entrypoint:v2...
freq                                                     29
Name: message, dtype: object

In [29]:
messages_df["message"].value_counts().reset_index()

Unnamed: 0,index,message
0,"Container image ""gcr.io/k8s-prow/entrypoint:v2...",29
1,Started container place-entrypoint,29
2,"Container image ""gcr.io/k8s-prow/sidecar:v2021...",29
3,Started container sidecar,29
4,Created container place-entrypoint,29
...,...,...
151,"Successfully pulled image ""image-registry.open...",1
152,"found no controller ref for pod ""e2e-gcp-upgra...",1
153,Successfully assigned ci-op-ft9klqc6/e2e-azure...,1
154,"Successfully pulled image ""registry.ci.openshi...",1


In the build data, we saw that about 98% of builds have the events.json file. We then analyzed all the events that happened for a particular build and found the frequencies of their messages. We can repeat the process for all the other builds to find the most common messages and perform further analysis.
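Repeating the message count across builds could look like the following sketch. It assumes the structure used in this notebook, where each build maps either to a parsed `events.json` dictionary with an `items` list or to an empty string when the file is missing. The function name `count_messages` and the toy data are hypothetical, not real CI output.

```python
from collections import Counter


def count_messages(events_by_build):
    """Count event messages over all builds that have an events.json."""
    counts = Counter()
    for events in events_by_build.values():
        if not events:  # builds without events.json are stored as ""
            continue
        for item in events.get("items", []):
            message = item.get("message")
            if message:
                counts[message] += 1
    return counts


# Hypothetical toy data in the same shape as build_events_data.
toy_events = {
    "build-a": {"items": [{"message": "Started container sidecar"},
                          {"message": "Created container sidecar"}]},
    "build-b": {"items": [{"message": "Started container sidecar"}]},
    "build-c": "",
}

message_counts = count_messages(toy_events)
print(message_counts.most_common(2))
```

On the real data, calling `count_messages(build_events_data)` would give one frequency table for the whole job instead of one per build.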

## Save sample data

In [30]:
path = "../../../data/raw/gcs/build-logs/"
filename = "sample-build-logs.parquet"
dataset_base_path = Path(path)
dataset_base_path.mkdir(parents=True, exist_ok=True)
build_logs = pd.DataFrame.from_dict(build_log_data, orient="index", columns=["log"])
build_logs.to_parquet(f"{path}/{filename}")

In [31]:
path = "../../../data/raw/gcs/events/"
filename = "sample-events.json"
dataset_base_path = Path(path)
dataset_base_path.mkdir(parents=True, exist_ok=True)

with open(f"{path}/{filename}", "w") as file:
    json.dump(build_events_data, file)
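To use the saved events later, the file can be read back with `json.load`. The sketch below shows the round trip on a small stand-in dictionary written to a temporary directory, so it runs without the real data. The sample content and the build key `build-without-events` are hypothetical. Builds that had no `events.json` come back as empty strings and can be filtered out on load.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for build_events_data.
sample_events = {
    "1364869749170769920": {"items": [{"message": "No matching pods found"}]},
    "build-without-events": "",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "events" / "sample-events.json"
    target.parent.mkdir(parents=True, exist_ok=True)

    # Write the dictionary the same way the notebook does.
    with open(target, "w") as file:
        json.dump(sample_events, file)

    # Read it back for later analysis.
    with open(target) as file:
        reloaded = json.load(file)

# Keep only builds that actually have events.
with_events = {build: events for build, events in reloaded.items() if events}
print(sorted(with_events))
```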

## Conclusion

In this notebook, we demonstrated how to programmatically access the OpenShift CI archives in the "origin-ci-test" GCS bucket, pull specific log types into our notebook for analysis, and save them for later use.
