# Overview

Processing top block files to
-----------------------------

- prune to reduce total size
- preliminary analysis to find patterns


Preliminary analysis summary
----------------------------

- All login nodes have enough physical memory that top reports no virtual memory (swap) utilization.

- No hardware interrupts occurred on the login nodes (`cpu: hi` is always `0.0` in the log).

- No stolen time from hypervisor (login nodes are not virtual machines).

- TBD

In [None]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from tqdm.notebook import tqdm

# Plot defaults: hide the top/right spines and use a large figure size.
custom_params = {
    "axes.spines.right": False,
    "axes.spines.top": False,
    "figure.figsize": (15, 15),
}
sns.set_theme(style="ticks", palette="pastel", rc=custom_params)

Useful information for interpreting the results from top:

- Memory columns
```
          %MEM - simply RES divided by total physical memory
          CODE - the `pgms' portion of quadrant 3
          DATA - the entire quadrant 1 portion of VIRT plus all
                 explicit mmap file-backed pages of quadrant 3
          RES  - anything occupying physical memory which, beginning with
                 Linux-4.5, is the sum of the following three fields:
                 RSan - quadrant 1 pages, which include any
                        former quadrant 3 pages if modified
                 RSfd - quadrant 3 and quadrant 4 pages
                 RSsh - quadrant 2 pages
          RSlk - subset of RES which cannot be swapped out (any quadrant)
          SHR  - subset of RES (excludes 1, includes all 2 & 4, some 3)
          SWAP - potentially any quadrant except 4
          USED - simply the sum of RES and SWAP
          VIRT - everything in-use and/or reserved (all quadrants)
```
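
As a small worked example of two relations quoted above (`%MEM` is `RES` divided by total physical memory, `USED` is the sum of `RES` and `SWAP`), the sketch below uses made-up numbers. The function names are illustrative only and are not part of top.

```python
def pct_mem(res_kib, mem_total_kib):
    """%MEM as described above: RES divided by total physical memory, in percent."""
    return 100.0 * res_kib / mem_total_kib


def used_kib(res_kib, swap_kib):
    """USED as described above: the sum of RES and SWAP."""
    return res_kib + swap_kib


# Hypothetical process: 2 GiB resident on a node with 64 GiB of RAM, nothing swapped.
res = 2 * 1024**2      # KiB
total = 64 * 1024**2   # KiB
example_pct = pct_mem(res, total)
example_used = used_kib(res, 0)
print(example_pct, example_used)
```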

- TASK and CPU States

```
       As a default, percentages for these individual categories are
       displayed.  Where two labels are shown below, those for more
       recent kernel versions are shown first.
           us, user    : time running un-niced user processes
           sy, system  : time running kernel processes
           ni, nice    : time running niced user processes
           id, idle    : time spent in the kernel idle handler
           wa, IO-wait : time waiting for I/O completion
           hi : time spent servicing hardware interrupts
           si : time spent servicing software interrupts
           st : time stolen from this vm by the hypervisor
```
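
Because these states partition CPU time, the percentages in one snapshot should add up to roughly 100. The sketch below checks this on a toy frame. The `cpu_*` column names follow the naming used later in this notebook, while `cpu_states_sum` and the 0.5 tolerance are assumptions made for illustration. To run it on `df_top`, it would have to come before `cpu_hi` and `cpu_st` are dropped.

```python
import pandas as pd

CPU_STATES = ("us", "sy", "ni", "id", "wa", "hi", "si", "st")


def cpu_states_sum(df):
    """Row-wise sum of the cpu_* percentage columns (cast from str to float)."""
    cols = [f"cpu_{state}" for state in CPU_STATES]
    return df[cols].astype(float).sum(axis=1)


# Toy snapshots with string values, as in the raw log.
toy = pd.DataFrame({
    "cpu_us": ["12.5", "3.0"],
    "cpu_sy": ["2.5", "1.0"],
    "cpu_ni": ["0.0", "0.5"],
    "cpu_id": ["84.0", "95.0"],
    "cpu_wa": ["1.0", "0.3"],
    "cpu_hi": ["0.0", "0.0"],
    "cpu_si": ["0.0", "0.2"],
    "cpu_st": ["0.0", "0.0"],
})

totals = cpu_states_sum(toy)
# Allow for rounding in top's one-decimal output (tolerance is an assumption).
within_tolerance = (totals - 100.0).abs() <= 0.5
print(totals.tolist(), within_tolerance.all())
```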

- Memory usage

```
       As a default, Line 1 (mem_*) reflects physical memory, classified as:
           total, free, used and buff/cache

       Line 2 (swap_*) reflects mostly virtual memory, classified as:
           total, free, used and avail (which is physical memory)

       The avail number on line 2 is an estimation of physical memory
       available for starting new applications, without swapping.
       Unlike the free field, it attempts to account for readily
       reclaimable page cache and memory slabs.  It is available on
       kernels 3.14, emulated on kernels 2.6.27+, otherwise the same as
       free.

```

## Load data

In [None]:
df_top = pd.read_feather("top_block.arrow")

df_top

## Statistic check

Use the following per-column information to decide whether a column can be dropped from the analysis.

- how many unique values?
- what are the:
  - mean
  - median
  - std
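
The per-column figures listed above could be collected in one table, for example with the sketch below. It runs on a toy frame, and `column_summary` is an illustrative helper that is not defined elsewhere in this notebook. Since the columns are cast from strings further down, the helper coerces values with `pd.to_numeric` first, so non-numeric columns simply yield NaN statistics.

```python
import pandas as pd


def column_summary(df):
    """Unique-value count plus mean, median and std for every column."""
    rows = {}
    for col in df.columns:
        # Values that cannot be parsed as numbers become NaN.
        numeric = pd.to_numeric(df[col], errors="coerce")
        rows[col] = {
            "n_unique": df[col].nunique(dropna=False),
            "mean": numeric.mean(),
            "median": numeric.median(),
            "std": numeric.std(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


toy = pd.DataFrame({
    "cpu_hi": ["0.0", "0.0", "0.0"],
    "cpu_us": ["1.0", "2.0", "6.0"],
})
summary = column_summary(toy)
print(summary)
```

A column with a single unique value and zero standard deviation, like `cpu_hi` in the toy data, is a candidate for dropping.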


In [None]:
for col in df_top.columns:
    unique_vals = df_top[col].unique()
    print(f"len(unique_vals[{col}]) = {len(unique_vals)}")
    if len(unique_vals) < 10:
        print(f"\t{unique_vals}")

`swap_free_KiB` and `swap_used_KiB` are either `0.0` or missing, indicating that virtual memory (`swap`) is not utilized on the login nodes.
It should therefore be safe to drop them.
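
A check along the following lines could confirm that a column holds only zeros or missing values before it is dropped. This is a sketch on toy data; `is_zero_or_missing` is a hypothetical helper, not part of pandas or of this notebook.

```python
import pandas as pd


def is_zero_or_missing(series):
    """True if every entry is either missing or numerically zero."""
    numeric = pd.to_numeric(series, errors="coerce")
    # Entries that are present but fail to parse must not count as "missing".
    unparseable = numeric.isna() & series.notna()
    return bool(not unparseable.any() and (numeric.fillna(0.0) == 0.0).all())


toy = pd.DataFrame({
    "swap_used_KiB": ["0.0", None, "0.0"],
    "mem_used_KiB": ["1024.0", "2048.0", None],
})
droppable = [col for col in toy.columns if is_zero_or_missing(toy[col])]
print(droppable)
```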

In [None]:
df_top.drop(["swap_free_KiB", "swap_used_KiB"], axis=1, inplace=True)

`cpu: hi` is always zero, indicating that there are no hardware interrupts.

In [None]:
df_top.drop(["cpu_hi"], axis=1, inplace=True)

`cpu: st` is always zero, which makes sense: none of the five login nodes is a virtual machine, so no time is stolen by a hypervisor.

In [None]:
df_top.drop(["cpu_st"], axis=1, inplace=True)

The state of the dataframe after dropping these columns:

In [None]:
df_top.describe()

Cast the numeric columns from `str` to the correct types:

In [None]:
df_top["PID"] = df_top["PID"].astype(int)
df_top["PR"] = df_top["PR"].astype(int)
df_top["NI"] = df_top["NI"].astype(int)

In [None]:
# Converter for top's scaled memory strings (e.g. "1.2g"); plain numbers are already KiB.
# top scales by powers of 1024, so the factors are binary rather than 1e3/1e6/1e9.
SCALE_TO_KIB = {"m": 1024, "g": 1024**2, "t": 1024**3}

def memory_to_num(data):
    data = data.strip().lower()
    suffix = data[-1]
    if suffix in SCALE_TO_KIB:
        # Slicing off the suffix also handles values without a decimal point ("12g").
        return float(data[:-1]) * SCALE_TO_KIB[suffix]
    return float(data)

df_top["SHR"].apply(memory_to_num)

In [None]:
df_top["SHR_KiB"] = df_top["SHR"].apply(memory_to_num)
df_top["VIRT_KiB"] = df_top["VIRT"].apply(memory_to_num)
df_top["RES_KiB"] = df_top["RES"].apply(memory_to_num)

# drop the original object columns
df_top.drop(["SHR", "VIRT", "RES"], axis=1, inplace=True)

In [None]:
df_top["%CPU"] = df_top["%CPU"].astype(float)
df_top["%MEM"] = df_top["%MEM"].astype(float)

In [None]:
df_top["cpu_time_sec"] = df_top["TIME+"].apply(lambda x: float(x.split(":")[0])*60 + float(x.split(":")[-1]))

# drop the original string column
df_top.drop(["TIME+"], axis=1, inplace=True)
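
The conversion above assumes that `TIME+` has the form `minutes:seconds.hundredths`. The sketch below spells out the same arithmetic as a named function on made-up values. `time_plus_to_seconds` is an illustrative name, and raising on any other shape is a design choice made here rather than something the notebook does.

```python
def time_plus_to_seconds(value):
    """Convert a 'minutes:seconds.hundredths' string such as '5:23.45' to seconds."""
    parts = value.split(":")
    if len(parts) != 2:
        # Refuse to guess when the value does not have exactly one colon.
        raise ValueError(f"unexpected TIME+ value: {value!r}")
    minutes, seconds = parts
    return float(minutes) * 60 + float(seconds)


samples = ["0:00.03", "5:23.45", "1234:56.78"]
converted = [time_plus_to_seconds(s) for s in samples]
print(converted)
```

It could be applied with `df_top["TIME+"].apply(time_plus_to_seconds)` in place of the lambda.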

In [None]:
# converting task columns
df_top["task_total"] = df_top["task_total"].astype(int)
df_top["task_runing"] = df_top["task_runing"].astype(int)
df_top["task_sleeping"] = df_top["task_sleeping"].astype(int)
df_top["task_stopped"] = df_top["task_stopped"].astype(int)
df_top["task_zombie"] = df_top["task_zombie"].astype(int)

In [None]:
# cpu summary columns

for me in tqdm(("us", "sy", "ni", "id", "wa", "si")):
    label = f"cpu_{me}"
    df_top[label] = df_top[label].astype(float)

In [None]:
# memory summary cols

for me in tqdm(("total", "free", "used", "buff")):
    label = f"mem_{me}_KiB"
    df_top[label] = df_top[label].astype(float)

In [None]:
# remaining swap summary cols

df_top["swap_total_KiB"] = df_top["swap_total_KiB"].astype(float)
df_top["swap_avail_mem_KiB"] = df_top["swap_avail_mem_KiB"].astype(float)

Converting `USER` to a proper number for efficient grouping requires a small trick:

- The display limitation makes users with higher uid numbers indistinguishable. For example, `u12345` is displayed as `u1234+`, which cannot be distinguished from `u12346`.
- The best we can do is pad the number, so that we know the padded user ID is not necessarily tied to one person because of the display issue.

For a `u1234x` that is displayed as `u1234+`, we pad two zeros at the end to mark the modification.
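
One possible reading of this padding scheme is sketched below: a complete `u<digits>` name maps to its integer, a truncated `u<digits>+` name gets `00` appended, and anything else returns `None`. The helper `user_to_number` and its regular expression are assumptions made for illustration and are not defined in this notebook.

```python
import re

# Assumed shape of a user name: "u", some digits, and an optional "+" for truncation.
USER_PATTERN = re.compile(r"u(\d+)(\+?)")


def user_to_number(user):
    """Map 'u123' -> 123 and truncated 'u1234+' -> 123400; other owners -> None."""
    match = USER_PATTERN.fullmatch(user)
    if match is None:
        return None  # not a uNNNN-style owner (e.g. a daemon account)
    digits, truncated = match.groups()
    # Pad two zeros to mark IDs that were cut off by the display.
    return int(digits + "00") if truncated else int(digits)


examples = {name: user_to_number(name) for name in ("u123", "u1234+", "root")}
print(examples)
```

Under this reading a padded value can coincide with a genuine ID that ends in `00`, so the padding only marks the ambiguity reliably if no such IDs occur.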

In [None]:
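# list process owners that do not look like uNNNN user IDs;
# a plain contains("u") would also treat owners such as "munge" as users
df_top.loc[~df_top["USER"].str.match(r"u\d+\+?$"), "USER"].unique()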

Since some processes are owned by non-user accounts (daemons), we have to keep the `USER` column as `object` for now.

In [None]:
df_top.to_feather("top_block_pruned.arrow", compression="lz4")