# JupyterHub application cloud data
In this notebook, we collect metrics, logs, and events for the JupyterHub application. We retrieve the data and show how it can be split by pod. This data can serve as observations and training data for machine learning models.

In [1]:
from prometheus_api_client import PrometheusConnect
from prometheus_api_client.metric_range_df import MetricRangeDataFrame
from datetime import timedelta, datetime
import pandas as pd
import seaborn as sns
import json
import requests
import hashlib
import re


sns.set(rc={"figure.figsize": (15, 10)})

## Metrics
In this section, we look at the metrics collected for the pods in the opf-jupyterhub namespace. We explore the metrics shown in the [Grafana dashboard](https://grafana-route-opf-monitoring.apps.zero.massopen.cloud/d/0fddcc62fa2792f6c6db1a2d6fdd109bde1196ec/jupyterhub-sli-slo?orgId=1) that is used to monitor the JupyterHub application.

In [2]:
prom_url = "http://thanos-query-frontend-opf-observatorium.apps.zero.massopen.cloud/"
# Create the Prometheus connection object with the required parameters.
# To get your token, log in to the MOC Operate First dashboard,
# click on your username at the top right,
# then choose "Copy login command" and display the token.
pc = PrometheusConnect(
    url=prom_url,
    disable_ssl=False,
    headers={"Authorization": "bearer [token]"},
)
# Fetching a list of all metrics scraped by the Prometheus host.
all_metrics = pd.DataFrame(pc.all_metrics(), columns=["metrics"])
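
The full metric list is long, so it can help to narrow it down before querying. The following is a minimal sketch, assuming the metric names are available as plain strings. `sample_metrics` and `select_metrics` are hypothetical names used only for illustration; the sample values are the metric names queried in this notebook.

```python
from fnmatch import fnmatchcase

# Hypothetical sample; in the notebook the names would come from all_metrics["metrics"].
sample_metrics = [
    "container_memory_working_set_bytes",
    "container_cpu_usage_seconds_total",
    "kubelet_volume_stats_used_bytes",
    "kube_pod_container_status_waiting_reason",
    "kube_pod_container_status_terminated_reason",
]


def select_metrics(names, pattern):
    """Return the metric names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


container_metrics = select_metrics(sample_metrics, "container_*")
print(container_metrics)
```

With the real list, something like `select_metrics(all_metrics["metrics"], "kube_pod_*")` would be the equivalent call.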

In [3]:
## Define function to fetch data for a start time and end time
start_time = datetime(2021, 4, 28, 12)
end_time = datetime(2021, 4, 29, 12)


def fetch_metric(metric_name, start_time, end_time):
    # Request the data between start_time and end_time
    metric_data = pc.get_metric_range_data(
        metric_name,  # metric name and label config
        start_time=start_time,  # datetime object for metric range start time
        end_time=end_time,  # datetime object for metric range end time
        chunk_size=timedelta(
            days=1
        ),  # timedelta object for duration of metric data downloaded in one request
    )

    ## Make the dataframe
    metric_df = MetricRangeDataFrame(metric_data)
    metric_df.index = pd.to_datetime(metric_df.index, unit="s", utc=True)

    return metric_df
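
The `chunk_size` argument bounds how much data a single request covers. As a rough illustration of that idea (not the library's internal implementation), the sketch below splits a time range into consecutive windows. `split_time_range` is a hypothetical helper, and the six-hour chunk is an arbitrary choice.

```python
from datetime import datetime, timedelta


def split_time_range(start, end, chunk):
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    if chunk <= timedelta(0):
        raise ValueError("chunk must be a positive duration")
    window_start = start
    while window_start < end:
        # The last window is shortened so it never runs past `end`.
        window_end = min(window_start + chunk, end)
        yield window_start, window_end
        window_start = window_end


windows = list(
    split_time_range(
        datetime(2021, 4, 28, 12), datetime(2021, 4, 29, 12), timedelta(hours=6)
    )
)
for window_start, window_end in windows:
    print(window_start, "->", window_end)
```

Smaller windows mean more requests but less data per response, which is the trade-off the `chunk_size` parameter exposes.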

In [4]:
## Obfuscate PII data

re_email = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
re_podname = re.compile(r"[\w.+-]+40[\w-]+2e[\w.-]+")
re_quayimage = re.compile(r"\"quay\.io\S*")


def hash_str(x):
    """
    Takes in a string and returns its SHA-224 hash
    """
    hash_object = hashlib.sha224(x.encode("utf-8"))
    return hash_object.hexdigest()


def hash_email(x):
    """
    Takes in a string, finds email addresses
    in it and replaces them with their hashes
    """
    insts = re_email.findall(x)
    for ins in insts:
        # Literal replacement: a match may contain regex metacharacters (".", "+")
        x = x.replace(ins, hash_str(ins))
    return x


def hash_podname(x):
    """
    Takes in a string, finds pod names
    in it and replaces them with their hashes
    """
    insts = re_podname.findall(x)
    for ins in insts:
        # Literal replacement: a match may contain regex metacharacters (".", "+")
        x = x.replace(ins, hash_str(ins))
    return x


def hash_quay_image(x):
    """
    Takes in a string, finds quay image
    links in it and replaces them with their hashes
    """
    insts = re_quayimage.findall(x)
    for ins in insts:
        # Literal replacement: a match may contain regex metacharacters (".", "+")
        x = x.replace(ins, hash_str(ins))
    return x
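
An alternative way to write such a helper is a single-pass substitution with a callable replacement. This is a minimal sketch, assuming the same email pattern and SHA-224 hashing as above; `mask_emails` is a hypothetical name and the address is made up.

```python
import hashlib
import re

# Same email pattern as in the cell above.
re_email = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def hash_str(x):
    """Return the SHA-224 hex digest of a string."""
    return hashlib.sha224(x.encode("utf-8")).hexdigest()


def mask_emails(text):
    """Replace every email address in text with its hash in a single pass."""
    # A callable replacement receives the match object, so the matched text
    # is never re-interpreted as a regular expression.
    return re_email.sub(lambda match: hash_str(match.group(0)), text)


sample = "login by jane.doe+test@example.com succeeded"  # made-up address
masked = mask_emails(sample)
print(masked)
```

Because the hash is deterministic, the same address always maps to the same digest, so records can still be joined on the obfuscated value.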

#### Memory

In [5]:
## Collect pod memory usage
pod_memory = fetch_metric(
    "container_memory_working_set_bytes{\
                            namespace='opf-jupyterhub',\
                            pod=~'jupyterhub.*'}",
    start_time,
    end_time,
)

In [7]:
# Copy the selected columns so the assignments below modify an independent frame
pod_memory = pod_memory[["container", "instance", "name", "pod", "value"]].copy()
pod_memory["pod"] = pod_memory["pod"].apply(hash_str)
pod_memory["name"] = pod_memory["name"].astype(str).apply(hash_str)

In [38]:
pod_memory.describe()

| | container | instance | name | pod | value |
|---|---|---|---|---|---|
| count | 37627 | 37627 | 37627 | 37627 | 37627 |
| unique | 6 | 4 | 40 | 15 | 5786 |
| top | prometheus-proxy | 192.12.185.116:10250 | 1201b48f642f405715d25e70cd1016472c601373615343... | c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c646... | 0 |
| freq | 12169 | 24293 | 12169 | 4508 | 1132 |


In [8]:
## Look at all the pods that have values for these metrics
pod_memory["pod"].value_counts()

c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c6468253123977    4508
41f66791129354174d3953cdd59299aabebf4b796840949bc1d51c7f    3381
6a264c580f88d15ba26424daaf49f841d0dc157f2435f7ad6d52d01a    3255
3ad81758644b0153ee5432f5720d0c3857eb9f8e407a694843ca0eef    3201
6d679893fdfb604fc76e01d2445ce75d524670ad9a5c009f3c05957c    3201
9aaf637427a80f834a90fef511e5f158fa1a0d8a46fe8cf04a21f07b    3201
8db5cc0ce1ed5fd2b6b4d0c074d67a07200ad473068367f7a2f17254    3201
e88fa6086c09af60cf3c2ac41dd9160329e59c448c226cc324b76a5e    2591
a6970dc782504830a8d828b55a48170e3ec54ac0f77b040e0920da74    2580
2e32cda2c4b8f19448deb92072c9813eaefa2fa667b8aecdb09d790f    2377
f3ded8962b719f430627773f7e3bc2f3e66dc89ce15ee5adfe6b5a58    1815
caa158c2d6f86d379c95848558839fe9c8bbd249fa301a33f5ecb49c    1815
042b1a4a2eb31b71f766fab1ee96b951f6e601ed1cae9166b86bbef7    1815
f94464d2a96457510fd566835a3f16d42e66456b75e64eb3f9c44b19     458
6a124f896a468cc684b8d2bb2b16aba2db98e4061fe50fa443a28f1a     228
Name: pod, dtype: int64
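
As the metric name indicates, the `value` column holds the working set size in bytes. A small sketch of summarizing it per pod in MiB follows; the pod names and sample values are hypothetical, and `mean_mib_per_pod` is an illustrative helper rather than part of the notebook.

```python
from collections import defaultdict

# Hypothetical (pod, bytes) samples standing in for the "pod" and "value" columns.
samples = [
    ("pod-a", 104857600.0),
    ("pod-a", 209715200.0),
    ("pod-b", 52428800.0),
]

BYTES_PER_MIB = 1024 * 1024


def mean_mib_per_pod(rows):
    """Average the sampled working-set size per pod, converted to MiB."""
    totals = defaultdict(lambda: [0.0, 0])  # pod -> [sum of bytes, sample count]
    for pod, value in rows:
        totals[pod][0] += float(value)
        totals[pod][1] += 1
    return {
        pod: total / count / BYTES_PER_MIB for pod, (total, count) in totals.items()
    }


memory_summary = mean_mib_per_pod(samples)
print(memory_summary)
```

On the DataFrame itself, a `groupby("pod")` over the `value` column would express the same aggregation.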

In [9]:
## Save as Parquet
pod_memory.to_parquet("../data/processed/jupyterhub/metrics/pod_memory.parquet")

#### CPU

In [10]:
# Collect CPU usage metrics
pod_cpu = fetch_metric(
    "container_cpu_usage_seconds_total{\
                            namespace='opf-jupyterhub',\
                            pod=~'jupyterhub.*'}",
    start_time,
    end_time,
)

In [11]:
# Copy the selected columns so the assignments below modify an independent frame
pod_cpu = pod_cpu[["container", "instance", "name", "pod", "value"]].copy()
pod_cpu["pod"] = pod_cpu["pod"].apply(hash_str)
pod_cpu["name"] = pod_cpu["name"].astype(str).apply(hash_str)

In [12]:
## Look at all the pods that have values for these metrics
pod_cpu["pod"].value_counts()

41f66791129354174d3953cdd59299aabebf4b796840949bc1d51c7f    3390
c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c6468253123977    3381
6a264c580f88d15ba26424daaf49f841d0dc157f2435f7ad6d52d01a    2958
3ad81758644b0153ee5432f5720d0c3857eb9f8e407a694843ca0eef    2904
6d679893fdfb604fc76e01d2445ce75d524670ad9a5c009f3c05957c    2904
9aaf637427a80f834a90fef511e5f158fa1a0d8a46fe8cf04a21f07b    2904
8db5cc0ce1ed5fd2b6b4d0c074d67a07200ad473068367f7a2f17254    2904
e88fa6086c09af60cf3c2ac41dd9160329e59c448c226cc324b76a5e    2291
a6970dc782504830a8d828b55a48170e3ec54ac0f77b040e0920da74    2283
2e32cda2c4b8f19448deb92072c9813eaefa2fa667b8aecdb09d790f    2080
f3ded8962b719f430627773f7e3bc2f3e66dc89ce15ee5adfe6b5a58    1815
caa158c2d6f86d379c95848558839fe9c8bbd249fa301a33f5ecb49c    1815
042b1a4a2eb31b71f766fab1ee96b951f6e601ed1cae9166b86bbef7    1815
f94464d2a96457510fd566835a3f16d42e66456b75e64eb3f9c44b19     456
6a124f896a468cc684b8d2bb2b16aba2db98e4061fe50fa443a28f1a     228
Name: pod, dtype: int64
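
`container_cpu_usage_seconds_total` is a cumulative counter, so the raw `value` column keeps growing; the quantity of interest is usually its rate of change. The sketch below shows one way to derive per-interval rates from consecutive samples. The function name and the sample data are hypothetical, and skipping intervals where the counter drops is a simplifying assumption for handling resets.

```python
def cpu_rates(samples):
    """Convert cumulative (timestamp_seconds, cpu_seconds_total) samples into rates.

    Returns (timestamp, cores_used) pairs, one per pair of consecutive samples.
    Intervals where the counter decreased (for example after a restart) are skipped.
    """
    ordered = sorted(samples)
    rates = []
    for (t0, v0), (t1, v1) in zip(ordered, ordered[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or v1 < v0:
            continue
        rates.append((t1, (v1 - v0) / elapsed))
    return rates


# Hypothetical samples for one container, 30 seconds apart, with one counter reset.
samples = [(0, 10.0), (30, 13.0), (60, 19.0), (90, 1.0), (120, 4.0)]
rates = cpu_rates(samples)
print(rates)
```

Each rate is the average number of CPU cores used over the interval, which is comparable across containers in a way the raw counter is not.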

In [13]:
## Save as Parquet
pod_cpu.to_parquet("../data/processed/jupyterhub/metrics/pod_cpu.parquet")

#### PVC

In [14]:
# Collect pvc metrics
pod_pvc = fetch_metric(
    "kubelet_volume_stats_used_bytes{namespace='opf-jupyterhub',\
                            persistentvolumeclaim=~'jupyterhub.*',\
                            pod='prometheus-k8s-0'}",
    start_time,
    end_time,
)

In [15]:
pod_pvc.describe()

| | `__name__` | container | endpoint | instance | job | metrics_path | namespace | node | persistentvolumeclaim | pod | prometheus | receive | service | tenant_id | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13072 | 13072 | 13072 | 13072 | 13072 | 13072 | 13072 | 13072 | 13072 | 13072 | 13072 | 13072 | 13072 | 13072 | 13072 |
| unique | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 3 | 14 | 1 | 1 | 1 | 1 | 1 | 39 |
| top | kubelet_volume_stats_used_bytes | prometheus-proxy | https-metrics | 192.12.185.116:10250 | kubelet | /metrics | opf-jupyterhub | os-wrk-1 | jupyterhub-nb-kachauha-40redhat-2ecom-pvc | prometheus-k8s-0 | openshift-monitoring/k8s | true | kubelet | default-tenant | 16777216 |
| freq | 13072 | 13072 | 13072 | 8535 | 13072 | 13072 | 13072 | 8535 | 1135 | 13072 | 13072 | 13072 | 13072 | 13072 | 1135 |


In [16]:
# Copy the selected columns so the assignment below modifies an independent frame
pod_pvc = pod_pvc[["container", "instance", "persistentvolumeclaim", "value"]].copy()
pod_pvc["persistentvolumeclaim"] = pod_pvc["persistentvolumeclaim"].apply(hash_str)

In [17]:
## Look at all the PVCs that have values for this metric
pod_pvc["persistentvolumeclaim"].value_counts()

b9adb41e417ee68a2b99a170dba5625a71491bcd29c0889ae8a78f2d    1135
82232eee735d53b2126d2aa3d73783cda871b14fc90641be4b5a478b    1135
60cb5055d7c3251ae31d811f3dd528d90b78d4a0f50b519e75648c5e    1135
63a97d20e91883473246fb83756a6aa0cdbfbbd9571cbda32f2d0e73    1135
019f8493c120a6f58e27fca3e4d1525d368d9aa46bc5dc39cd756591    1135
c6069e0be16f260b090af702553d8d2e41a29c8e944e6b1a33638a2d    1134
56a655de0da5c8eec8f4f4ffffb6b3b97db9c1ccaa509f328a19e0b0    1134
fcc294052b2bc5413195a7d1a2412c0006af5ed3c2abe7b2e111279c    1134
6dd8ec0bac58ca547d160ff8885107290fb16ac7a968ddb39af8b104    1102
1bd8d43ea2dd39d86a1aa2ae009e9bc1845cf70260ca5f13ecbbe10b     930
150820c81c88af602708fda887f81cc4c84526234b58f64da54be0b4     928
f5925df9d4ac7eea96cdd6593a36acb2cfca18cfa1a21812d3859ba9     808
d449b59b5a18587c2509aa0a7d0235bb19d83e289b94bec3c3ee7cc5     151
603b04f8eb9f6fb0336e44e05f7f41a96ba97b7858b451688d85f86e      76
Name: persistentvolumeclaim, dtype: int64

In [18]:
# Save as Parquet
pod_pvc.to_parquet("../data/processed/jupyterhub/metrics/pod_pvc.parquet")

#### Pod status

In [19]:
## Collect pod waiting and terminated status
pod_waiting_status = fetch_metric(
    "kube_pod_container_status_waiting_reason{namespace='opf-jupyterhub',\
                              pod=~'jupyterhub.*', prometheus_replica='prometheus-k8s-0'}",
    start_time,
    end_time,
)

pod_terminated_status = fetch_metric(
    "kube_pod_container_status_terminated_reason{namespace='opf-jupyterhub',\
                              pod=~'jupyterhub.*', prometheus_replica='prometheus-k8s-0'}",
    start_time,
    end_time,
)

In [20]:
# Copy the selected columns so the assignments below modify independent frames
pod_waiting_status = pod_waiting_status[["container", "pod", "reason", "value"]].copy()
pod_waiting_status["pod"] = pod_waiting_status["pod"].apply(hash_str)

pod_terminated_status = pod_terminated_status[
    ["container", "pod", "reason", "value"]
].copy()
pod_terminated_status["pod"] = pod_terminated_status["pod"].apply(hash_str)

In [21]:
## Look at all the pods that have values for these metrics
pod_waiting_status["pod"].value_counts()

3ad81758644b0153ee5432f5720d0c3857eb9f8e407a694843ca0eef    7945
9aaf637427a80f834a90fef511e5f158fa1a0d8a46fe8cf04a21f07b    7945
caa158c2d6f86d379c95848558839fe9c8bbd249fa301a33f5ecb49c    7945
8db5cc0ce1ed5fd2b6b4d0c074d67a07200ad473068367f7a2f17254    7945
f3ded8962b719f430627773f7e3bc2f3e66dc89ce15ee5adfe6b5a58    7945
c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c6468253123977    7945
41f66791129354174d3953cdd59299aabebf4b796840949bc1d51c7f    7945
6d679893fdfb604fc76e01d2445ce75d524670ad9a5c009f3c05957c    7945
042b1a4a2eb31b71f766fab1ee96b951f6e601ed1cae9166b86bbef7    7945
6a264c580f88d15ba26424daaf49f841d0dc157f2435f7ad6d52d01a    7714
e88fa6086c09af60cf3c2ac41dd9160329e59c448c226cc324b76a5e    6517
a6970dc782504830a8d828b55a48170e3ec54ac0f77b040e0920da74    6503
2e32cda2c4b8f19448deb92072c9813eaefa2fa667b8aecdb09d790f    5663
f94464d2a96457510fd566835a3f16d42e66456b75e64eb3f9c44b19    1071
6a124f896a468cc684b8d2bb2b16aba2db98e4061fe50fa443a28f1a     532
Name: pod, dtype: int64

In [22]:
pod_terminated_status["pod"].value_counts()

3ad81758644b0153ee5432f5720d0c3857eb9f8e407a694843ca0eef    6810
9aaf637427a80f834a90fef511e5f158fa1a0d8a46fe8cf04a21f07b    6810
caa158c2d6f86d379c95848558839fe9c8bbd249fa301a33f5ecb49c    6810
8db5cc0ce1ed5fd2b6b4d0c074d67a07200ad473068367f7a2f17254    6810
f3ded8962b719f430627773f7e3bc2f3e66dc89ce15ee5adfe6b5a58    6810
c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c6468253123977    6810
41f66791129354174d3953cdd59299aabebf4b796840949bc1d51c7f    6810
6d679893fdfb604fc76e01d2445ce75d524670ad9a5c009f3c05957c    6810
042b1a4a2eb31b71f766fab1ee96b951f6e601ed1cae9166b86bbef7    6810
6a264c580f88d15ba26424daaf49f841d0dc157f2435f7ad6d52d01a    6612
e88fa6086c09af60cf3c2ac41dd9160329e59c448c226cc324b76a5e    5586
a6970dc782504830a8d828b55a48170e3ec54ac0f77b040e0920da74    5574
2e32cda2c4b8f19448deb92072c9813eaefa2fa667b8aecdb09d790f    4854
f94464d2a96457510fd566835a3f16d42e66456b75e64eb3f9c44b19     918
6a124f896a468cc684b8d2bb2b16aba2db98e4061fe50fa443a28f1a     456
Name: pod, dtype: int64

In [23]:
pod_waiting_status.reason.value_counts()

CreateContainerError          14215
ImagePullBackOff              14215
InvalidImageName              14215
CrashLoopBackOff              14215
ErrImagePull                  14215
CreateContainerConfigError    14215
ContainerCreating             14215
Name: reason, dtype: int64

In [24]:
pod_terminated_status.reason.value_counts()

Error                 14215
ContainerCannotRun    14215
Evicted               14215
DeadlineExceeded      14215
Completed             14215
OOMKilled             14215
Name: reason, dtype: int64
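
Every reason appears with the same count, which suggests that a series is recorded for each possible reason whether or not it applies. Assuming the metric follows the common convention of reporting 1 for the reason currently in effect and 0 otherwise (worth checking against the actual `value` column), the reasons that really occurred can be isolated by filtering on the value. The rows and the helper name below are hypothetical.

```python
from collections import Counter

# Hypothetical rows mimicking the "pod", "reason", and "value" columns.
rows = [
    {"pod": "pod-a", "reason": "ContainerCreating", "value": 1},
    {"pod": "pod-a", "reason": "CrashLoopBackOff", "value": 0},
    {"pod": "pod-a", "reason": "ImagePullBackOff", "value": 0},
    {"pod": "pod-b", "reason": "ContainerCreating", "value": 0},
    {"pod": "pod-b", "reason": "CrashLoopBackOff", "value": 1},
    {"pod": "pod-b", "reason": "ImagePullBackOff", "value": 0},
]


def active_reason_counts(records):
    """Count only the samples whose value marks the reason as active (value == 1)."""
    return Counter(r["reason"] for r in records if float(r["value"]) == 1.0)


active = active_reason_counts(rows)
print(active)
```

On the DataFrames, the equivalent filter would be a boolean mask on the `value` column before calling `value_counts()`.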

In [25]:
## Save as Parquet
pod_waiting_status.to_parquet(
    "../data/processed/jupyterhub/metrics/pod_waiting_status.parquet"
)
pod_terminated_status.to_parquet(
    "../data/processed/jupyterhub/metrics/pod_terminated_status.parquet"
)

## Events
In this section, we look at the events collected from the pods. To fetch events, we need to install `oc`, log in to the OpenShift namespace, and execute the following script:

```
#!/bin/bash
# bash is required for the {0..9} brace expansion below

for i in {0..9}; do
    echo "#$i Attempt"
    oc get events --watch-only --output=jsonpath='{@}{"\n"}' > events-zero.massopen.cloud-$i.ndjson
done

```
In the next cells, we will read and inspect already collected data.

In [26]:
## Read events data from the raw folder
## This cell may not work if you don't have access
## to raw events data, use processed data instead
events = []
for i in range(14):
    events.append(
        pd.read_json(
            f"../data/raw/jupyterhub/events/events-zero.massopen.cloud-{i}.ndjson",
            lines=True,
        )
    )
events_df = pd.concat(events).reset_index().drop("index", axis=1)
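
Each line of an `.ndjson` file is one JSON object. For readers without access to the raw files, the following sketch shows what parsing such records involves using only the standard library. The two records are made up, contain only the fields this notebook uses, and `parse_events` is a hypothetical helper.

```python
import json

# Two made-up event records in the newline-delimited JSON layout assumed here.
raw = "\n".join(
    [
        json.dumps(
            {
                "firstTimestamp": "2021-04-28T16:14:41Z",
                "lastTimestamp": "2021-04-28T16:14:41Z",
                "involvedObject": {"kind": "Pod", "name": "example-pod"},
                "message": "Successfully assigned example-pod",
                "reason": "Scheduled",
            }
        ),
        "",  # blank lines are skipped by the parser below
        json.dumps(
            {
                "firstTimestamp": "2021-04-28T16:14:52Z",
                "lastTimestamp": "2021-04-28T16:14:52Z",
                "involvedObject": {"kind": "Pod", "name": "example-pod"},
                "message": "Started container notebook",
                "reason": "Started",
            }
        ),
    ]
)


def parse_events(text):
    """Parse NDJSON text into flat dicts with the fields used in this notebook."""
    events = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        events.append(
            {
                "firstTimestamp": record.get("firstTimestamp"),
                "lastTimestamp": record.get("lastTimestamp"),
                "pod": record["involvedObject"]["name"],
                "message": record.get("message"),
                "reason": record.get("reason"),
            }
        )
    return events


parsed = parse_events(raw)
print(parsed)
```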

In [27]:
## Extract the relevant columns
events_df["pod"] = events_df["involvedObject"].apply(lambda x: x["name"])
events_df = events_df[
    ["firstTimestamp", "lastTimestamp", "pod", "message", "reason"]
].copy()

In [28]:
events_df["message"] = events_df["message"].apply(hash_quay_image)
events_df["message"] = events_df["message"].apply(hash_podname)
events_df["pod"] = events_df["pod"].apply(hash_str)
events_df

| | firstTimestamp | lastTimestamp | pod | message | reason |
|---|---|---|---|---|---|
| 0 | 2021-04-28T13:09:15Z | 2021-04-28T13:09:15Z | 46370ba3277c53edd8e4777f9308dc0b46d77e145b31f1... | Successfully pulled image bb4166c2fdb2499c6ff8... | Pulled |
| 1 | 2021-04-28T16:14:41Z | 2021-04-28T16:14:41Z | e88fa6086c09af60cf3c2ac41dd9160329e59c448c226c... | Successfully assigned opf-jupyterhub/e88fa6086... | Scheduled |
| 2 | 2021-04-28T16:14:41Z | 2021-04-28T16:14:41Z | e88fa6086c09af60cf3c2ac41dd9160329e59c448c226c... | AttachVolume.Attach succeeded for volume "pvc-... | SuccessfulAttachVolume |
| 3 | 2021-04-28T16:14:51Z | 2021-04-28T16:14:51Z | e88fa6086c09af60cf3c2ac41dd9160329e59c448c226c... | Add eth0 [10.131.3.3/23] | AddedInterface |
| 4 | 2021-04-28T16:14:52Z | 2021-04-28T16:14:52Z | e88fa6086c09af60cf3c2ac41dd9160329e59c448c226c... | Container image fef6e077b696175290bf407f68e6ae... | Pulled |
| ... | ... | ... | ... | ... | ... |
| 173 | 2021-04-26T19:12:43Z | 2021-04-29T12:30:14Z | 46370ba3277c53edd8e4777f9308dc0b46d77e145b31f1... | Pulling image bb4166c2fdb2499c6ff82f4787d3aad3... | Pulling |
| 174 | 2021-04-29T12:30:14Z | 2021-04-29T12:30:14Z | 46370ba3277c53edd8e4777f9308dc0b46d77e145b31f1... | Successfully pulled image bb4166c2fdb2499c6ff8... | Pulled |
| 175 | 2021-04-26T19:13:08Z | 2021-04-29T12:30:14Z | 46370ba3277c53edd8e4777f9308dc0b46d77e145b31f1... | Created container ray-operator | Created |
| 176 | 2021-04-26T19:13:08Z | 2021-04-29T12:30:14Z | 46370ba3277c53edd8e4777f9308dc0b46d77e145b31f1... | Started container ray-operator | Started |


In [29]:
## Look at the pods from which we got the events
events_df["pod"].value_counts()

46370ba3277c53edd8e4777f9308dc0b46d77e145b31f11a3090f9e6    65
5aa202ea4fee731c09354b963edbb5ec884ee611a75941939f9f415e    25
f94464d2a96457510fd566835a3f16d42e66456b75e64eb3f9c44b19    22
2e32cda2c4b8f19448deb92072c9813eaefa2fa667b8aecdb09d790f    18
a6970dc782504830a8d828b55a48170e3ec54ac0f77b040e0920da74    15
6a264c580f88d15ba26424daaf49f841d0dc157f2435f7ad6d52d01a    14
e88fa6086c09af60cf3c2ac41dd9160329e59c448c226cc324b76a5e    12
150820c81c88af602708fda887f81cc4c84526234b58f64da54be0b4     6
6a124f896a468cc684b8d2bb2b16aba2db98e4061fe50fa443a28f1a     1
Name: pod, dtype: int64

In [30]:
events_df.to_parquet("../data/processed/jupyterhub/events/events.parquet")

## Logs

In this section, we query logs through Loki.

In [None]:
## Define the server and the query
query = '{k8s_namespace_name="opf-jupyterhub"}'
url = "http://loki-frontend-opf-observatorium.apps.zero.massopen.cloud/loki/api/v1/query_range"
headers = {
    "content-type": "application/json",
    "X-Scope-OrgID": "cluster-app-logs",
}

In [None]:
## Set start and end times
## We fetch data on an hourly basis for 24 hours
## so that it doesn't stress the server

date_range = (
    pd.date_range(start=(datetime(2021, 4, 28, 12)), periods=25, freq="H")
    .to_pydatetime()
    .tolist()
)
date_range_timestamp = [
    str(int(datetime.timestamp(i)) * 1000000000) for i in date_range
]
start_end_times = [
    [date_range_timestamp[i], date_range_timestamp[i + 1]]
    for i in range(len(date_range_timestamp) - 1)
]
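
The query range is passed as Unix epoch timestamps in nanoseconds. Note that `datetime.timestamp()` interprets a naive datetime in the local time zone of the machine, so the windows can shift depending on where the notebook runs. The sketch below builds the same hourly windows from a timezone-aware start; treating the start as UTC is an assumption, and `hourly_windows_ns` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def hourly_windows_ns(start, hours):
    """Return [start_ns, end_ns] string pairs for consecutive one-hour windows."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware to avoid local-time ambiguity")
    edges = [start + timedelta(hours=h) for h in range(hours + 1)]
    # Truncate to whole seconds first, then scale with integer arithmetic.
    edges_ns = [str(int(edge.timestamp()) * 1_000_000_000) for edge in edges]
    return [[edges_ns[i], edges_ns[i + 1]] for i in range(hours)]


windows = hourly_windows_ns(datetime(2021, 4, 28, 12, tzinfo=timezone.utc), 24)
print(len(windows), windows[0])
```

The 25 edges produce 24 windows, matching the `periods=25` setting used for the hourly date range.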

In [None]:
for e, i in enumerate(start_end_times):
    params = {"query": query, "start": i[0], "end": i[1]}
    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()  # fail early instead of saving an error payload
    logs = response.json()
    output_file = f"../data/raw/jupyterhub/logs/logs_{e}.json"
    with open(output_file, "w") as f:
        json.dump(logs, f)

In [31]:
## Read the fetched logs
## This cell may not work if you don't have access
## to raw logs data, use processed data instead
logs = []
for i in range(24):
    with open(f"../data/raw/jupyterhub/logs/logs_{i}.json") as f:
        logs_t = json.load(f)
    logs.append(logs_t["data"]["result"][0]["values"])

In [32]:
## Make the dataframe with relevant columns
logs_data = []
for log in logs:
    for log_line in log:
        dc = json.loads(log_line[1])
        pod = dc["kubernetes.pod_name"]
        namespace = dc["kubernetes.namespace_name"]
        message = dc["message"]
        logs_data.append([log_line[0], namespace, pod, message])
logs_df = pd.DataFrame(logs_data, columns=["timestamp", "namespace", "pod", "message"])
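
The cell that reads the files keeps only the first entry of `result`. If a response contains several streams, or none at all, a variant that walks every stream may be more robust. This is a sketch under the assumption that each response has the shape used above (`data`, then `result`, then `[timestamp, line]` pairs in `values`); the response content is made up and `flatten_streams` is a hypothetical helper.

```python
import json


def make_line(pod, message):
    """Build a made-up log line with the JSON keys read in the cell above."""
    return json.dumps(
        {
            "kubernetes.pod_name": pod,
            "kubernetes.namespace_name": "opf-jupyterhub",
            "message": message,
        }
    )


# Made-up response containing two streams with one log line each.
response = {
    "status": "success",
    "data": {
        "resultType": "streams",
        "result": [
            {"stream": {}, "values": [["1619614799936000000", make_line("pod-a", "first line")]]},
            {"stream": {}, "values": [["1619614799477000000", make_line("pod-b", "second line")]]},
        ],
    },
}


def flatten_streams(payload):
    """Collect [timestamp, namespace, pod, message] rows from every stream."""
    rows = []
    for stream in payload.get("data", {}).get("result", []):
        for timestamp, line in stream.get("values", []):
            record = json.loads(line)
            rows.append(
                [
                    timestamp,
                    record["kubernetes.namespace_name"],
                    record["kubernetes.pod_name"],
                    record["message"],
                ]
            )
    return rows


rows = flatten_streams(response)
print(rows)
```

An empty `result` list simply yields no rows here, whereas indexing `[0]` would raise an `IndexError`.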

In [33]:
logs_df["message"] = logs_df["message"].apply(hash_email)
logs_df["message"] = logs_df["message"].apply(hash_podname)
logs_df["pod"] = logs_df["pod"].apply(hash_str)

In [34]:
logs_df

| | timestamp | namespace | pod | message |
|---|---|---|---|---|
| 0 | 1619614799936000000 | opf-jupyterhub | c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c646... | [I 2021-04-28 12:59:59.936 JupyterHub log:189]... |
| 1 | 1619614799477000000 | opf-jupyterhub | c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c646... | 12:59:59.477 [ConfigProxy] [32minfo[39m: 201... |
| 2 | 1619614799477000000 | opf-jupyterhub | c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c646... | 12:59:59.477 [ConfigProxy] [32minfo[39m: Rou... |
| 3 | 1619614799476000000 | opf-jupyterhub | c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c646... | 12:59:59.477 [ConfigProxy] [32minfo[39m: Add... |
| 4 | 1619614799468000000 | opf-jupyterhub | c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c646... | /user/32bc72f9faeea8ab3897cf77b800f57c82ef1587... |
| ... | ... | ... | ... | ... |
| 2395 | 1619697592078000000 | opf-jupyterhub | c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c646... | ParseResult(scheme='http', netloc='10.131.3.3:... |
| 2396 | 1619697592078000000 | opf-jupyterhub | c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c646... | /user/ded6e08b44c4b28e295f9d5c76afd03d8f21d1de... |
| 2397 | 1619697592075000000 | opf-jupyterhub | c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c646... | 11:59:52.075 [ConfigProxy] [32minfo[39m: 201... |
| 2398 | 1619697592074000000 | opf-jupyterhub | c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c646... | 11:59:52.074 [ConfigProxy] [32minfo[39m: Add... |


In [35]:
## Look at the pods from which we got logs
logs_df["pod"].value_counts()

c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c6468253123977    1406
46370ba3277c53edd8e4777f9308dc0b46d77e145b31f11a3090f9e6     808
8db5cc0ce1ed5fd2b6b4d0c074d67a07200ad473068367f7a2f17254      81
6a264c580f88d15ba26424daaf49f841d0dc157f2435f7ad6d52d01a      52
6d679893fdfb604fc76e01d2445ce75d524670ad9a5c009f3c05957c      47
9aaf637427a80f834a90fef511e5f158fa1a0d8a46fe8cf04a21f07b       6
Name: pod, dtype: int64
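
The `timestamp` column holds nanosecond Unix epochs, which are hard to read and to align with the metrics. A minimal conversion sketch follows, using the first timestamp from the table above; `ns_to_datetime` is a hypothetical helper, and interpreting the value as UTC is consistent with the time shown in that row's log message.

```python
from datetime import datetime, timedelta, timezone


def ns_to_datetime(timestamp_ns):
    """Convert a nanosecond Unix epoch (int or str) to a UTC datetime.

    datetime has microsecond resolution, so sub-microsecond digits are dropped.
    """
    seconds, nanos = divmod(int(timestamp_ns), 1_000_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(
        microseconds=nanos // 1000
    )


first_log_time = ns_to_datetime("1619614799936000000")
print(first_log_time.isoformat())
```

Splitting the value with `divmod` keeps the arithmetic in integers, which avoids the rounding that dividing a 19-digit number by `1e9` as a float could introduce.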

In [36]:
# Print example pod and log messages
print(logs_df["pod"][0], logs_df["message"][0])
print(logs_df["pod"][1], logs_df["message"][1])
print(logs_df["pod"][2], logs_df["message"][2])

c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c6468253123977 [I 2021-04-28 12:59:59.936 JupyterHub log:189] 302 GET /metrics -> /hub/metrics (@::ffff:10.129.6.124) 2.76ms
c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c6468253123977 12:59:59.477 [ConfigProxy] [32minfo[39m: 201 POST /api/routes/user/32bc72f9faeea8ab3897cf77b800f57c82ef1587ab54ad2e14066c7d/public 
c1282c7f61f54d63d572ac744f476a72ddfd8e7e92c6468253123977 12:59:59.477 [ConfigProxy] [32minfo[39m: Route added /user/32bc72f9faeea8ab3897cf77b800f57c82ef1587ab54ad2e14066c7d/public -> http://10.131.2.60:9090


In [37]:
logs_df.to_parquet("../data/processed/jupyterhub/logs/logs.parquet")

# Conclusion
In this notebook, we collected the observed time series data points for the opf-jupyterhub namespace. We fetched metrics, logs, and events for a 24-hour period and showed how they can be split by pod.