# <a id='toc1_'></a>[Analysis of GPU Jobs that Requested and Used VRAM](#toc0_)
This notebook analyzes jobs that requested VRAM, ran on GPU-type partitions, and actually used some GPU VRAM. It examines these jobs along with the corresponding users and PI groups.

**Table of contents**<a id='toc0_'></a>    
- [Analysis of GPU Jobs that Requested and Used VRAM](#toc1_)    
  - [Setup](#toc1_1_)    
  - [Data Digestion and Preprocessing](#toc1_2_)    
  - [Narrowing Dataset Down to the Relevant Partition](#toc1_3_)    
  - [Generate Baseline Metrics](#toc1_4_)    
  - [Job-Level Analysis](#toc1_5_)    
      - [Find most inefficient jobs with no VRAM constraints based on `requested_vram_efficiency_score`](#toc1_5_1_1_)    
    - [Generate all hoarding analysis metrics for jobs:](#toc1_5_2_)    
      - [Find most inefficient jobs hoarding node RAM based on `ram_hoarding_fraction_diff`](#toc1_5_2_1_)    
      - [Find most inefficient jobs hoarding CPU cores based on `core_hoarding_fraction_diff`](#toc1_5_2_2_)    
  - [User-Level Analysis](#toc1_6_)    
    - [Find Inefficient Users based on `requested_vram_efficiency_score`](#toc1_6_1_)    
    - [Generate all hoarding analysis metrics for users:](#toc1_6_2_)    
      - [Find most inefficient users hoarding node RAM based on `expected_value_ram_hoarding_fraction_diff`](#toc1_6_2_1_)    
      - [Find most inefficient users hoarding CPU cores based on `expected_value_core_hoarding_fraction_diff`](#toc1_6_2_2_)    
  - [PI Group Analysis](#toc1_7_)    
      - [Find Inefficient PIs based on `avg_requested_vram_efficiency_score`](#toc1_7_1_1_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Setup](#toc0_)

In [None]:
# Import required modules
import sys
from pathlib import Path
import pandas as pd

The Jupyter server should be started from the notebook directory. The following cell then resolves the project root, two levels above the notebook directory, and prints its name:

In [None]:
project_root = Path.cwd().resolve().parent.parent
print(f"Project root: {project_root.name}")
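
If the server is started from a different directory, the fixed `.parent.parent` lookup silently resolves to the wrong folder. A minimal sketch of a more robust alternative walks up from the current directory until it finds a marker file. The helper name `find_project_root` and the `pyproject.toml` marker are assumptions for illustration, not part of this project's `src` package.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "pyproject.toml") -> Path:
    """Return the first directory at or above `start` that contains `marker`.

    The default marker file is an assumption about the repository layout.
    """
    start = start.resolve()
    # Check `start` itself first, then each ancestor up to the filesystem root
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker!r} found at or above {start}")


# Usage (hypothetical): project_root = find_project_root(Path.cwd())
```

With such a helper, the project-root cell could call `find_project_root(Path.cwd())` instead of relying on the directory the server was launched from.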

In [None]:
# Automatically reload modules before executing code (set this up BEFORE imports)
%load_ext autoreload
%autoreload 2

# Add project root to sys.path for module imports
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from src.analysis import ResourceHoarding
from src.analysis import efficiency_analysis as ea
from src.visualization import (
    JobsWithMetricsVisualizer,
    UsersWithMetricsVisualizer,
    PIGroupsWithMetricsVisualizer,
)
from src.config.enum_constants import ResourceHoardingDataFrameNameEnum
from src.config.paths import (
    JOBS_VISUALIZATION_DATA_DIR,
    USERS_VISUALIZATION_DATA_DIR,
    PI_GROUPS_VISUALIZATION_DATA_DIR,
)

## <a id='toc1_2_'></a>[Data Digestion and Preprocessing](#toc0_)

In [None]:
# Load the jobs DataFrame from DuckDB
preprocessed_jobs_df = ea.load_preprocessed_jobs_dataframe_from_duckdb(
    db_path=Path(project_root) / "data/slurm_data.db",
    table_name="Jobs",
    anonymize=True,
)
display(preprocessed_jobs_df.head(10))
print(preprocessed_jobs_df.shape)

## <a id='toc1_3_'></a>[Narrowing Dataset Down to the Relevant Partition](#toc0_)


In [None]:
analyzer = ResourceHoarding(jobs_df=preprocessed_jobs_df)

In [None]:
filtered_jobs = analyzer.filter_jobs_for_analysis(
    gpu_mem_usage_filter={"min": 0, "inclusive": False},  # keep jobs that used some GPU memory (> 0)
    requested_vram_filter={"min": 0, "inclusive": False},  # keep jobs that requested some VRAM (> 0)
)
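
The two filters keep only jobs with strictly positive GPU memory usage and strictly positive requested VRAM. The exact semantics of the `{"min", "max", "inclusive"}` dictionaries live inside `ResourceHoarding`; the pandas sketch below shows one plausible reading, assuming `"inclusive"` switches between `>=`/`<=` and `>`/`<`, and that the underlying columns are named `gpu_mem_usage` and `requested_vram` (both hypothetical). `range_mask` and `filter_frame` are illustrative helpers, not library functions.

```python
import pandas as pd


def range_mask(series: pd.Series, spec: dict) -> pd.Series:
    """Boolean mask for an assumed {"min"/"max", "inclusive"} range spec."""
    inclusive = spec.get("inclusive", True)
    mask = pd.Series(True, index=series.index)
    if "min" in spec:
        mask &= (series >= spec["min"]) if inclusive else (series > spec["min"])
    if "max" in spec:
        mask &= (series <= spec["max"]) if inclusive else (series < spec["max"])
    return mask  # NaN values compare False, so they are dropped


def filter_frame(df: pd.DataFrame, criteria: dict) -> pd.DataFrame:
    """Keep rows that satisfy every column's range spec (logical AND)."""
    mask = pd.Series(True, index=df.index)
    for column, spec in criteria.items():
        mask &= range_mask(df[column], spec)
    return df[mask]


jobs = pd.DataFrame({"gpu_mem_usage": [0.0, 2.5, 8.0], "requested_vram": [40, 0, 80]})
kept = filter_frame(
    jobs,
    {
        "gpu_mem_usage": {"min": 0, "inclusive": False},
        "requested_vram": {"min": 0, "inclusive": False},
    },
)
print(kept.index.tolist())  # [2]
```

Row 0 is dropped for zero GPU memory usage and row 1 for zero requested VRAM, so only row 2 survives.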

## <a id='toc1_4_'></a>[Generate Baseline Metrics](#toc0_)

In [None]:
metrics_dict = analyzer.calculate_all_efficiency_metrics(filtered_jobs)

jobs_with_metrics = metrics_dict["jobs_with_efficiency_metrics"]
users_with_metrics = metrics_dict["users_with_efficiency_metrics"]
pi_accounts_with_metrics = metrics_dict["pi_accounts_with_efficiency_metrics"]
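
`calculate_all_efficiency_metrics` returns three related DataFrames: one row per job, per user, and per PI account. As an illustration only, a user-level table can be derived from job-level rows with a pandas named aggregation. The column names (`user`, `job_id`, `vram_hours`, `requested_vram_efficiency_score`) and the choice of aggregations are assumptions; the real implementation may compute additional or different statistics. Grouping by a PI account column instead of `user` would give the PI-level analogue.

```python
import pandas as pd


def aggregate_jobs_by_user(jobs: pd.DataFrame) -> pd.DataFrame:
    """Roll job-level rows up to one row per user (hypothetical column names)."""
    return (
        jobs.groupby("user")
        .agg(
            job_count=("job_id", "count"),
            vram_hours=("vram_hours", "sum"),
            avg_requested_vram_efficiency_score=("requested_vram_efficiency_score", "mean"),
        )
        .reset_index()
    )


jobs = pd.DataFrame(
    {
        "job_id": [1, 2, 3],
        "user": ["user_a", "user_a", "user_b"],
        "vram_hours": [4.0, 6.0, 2.0],
        "requested_vram_efficiency_score": [-20.0, -10.0, -5.0],
    }
)
print(aggregate_jobs_by_user(jobs))
```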

## <a id='toc1_5_'></a>[Job-Level Analysis](#toc0_)

In [None]:
# Temporarily display all columns; the previous setting is restored when the block exits
with pd.option_context("display.max_columns", None):
    display(jobs_with_metrics.head(5))

print(f"Jobs found: {len(jobs_with_metrics)}")

#### <a id='toc1_5_1_1_'></a>[Find most inefficient jobs with no VRAM constraints based on `requested_vram_efficiency_score`](#toc0_)

In [None]:
inefficient_jobs_vram_hours = analyzer.sort_and_filter_records_with_metrics(
    metrics_df_name_enum=ResourceHoardingDataFrameNameEnum.JOBS,
    sorting_key="requested_vram_efficiency_score",
    ascending=True,  # Sort by requested_vram_efficiency_score in ascending order
    filter_criteria={
        "requested_vram_efficiency_score": {"max": -10, "inclusive": True},  # score threshold
    },
)
# Plot top inefficient jobs by requested VRAM efficiency score, with VRAM-hours and allocated VRAM as labels
jobs_with_metrics_visualizer = JobsWithMetricsVisualizer(inefficient_jobs_vram_hours.head(10))
jobs_with_metrics_visualizer.visualize(
    output_dir_path=JOBS_VISUALIZATION_DATA_DIR,
    column="requested_vram_efficiency_score",
    bar_label_columns=["vram_hours", "allocated_vram"],
    figsize=(10, 6),
    anonymize=True,
)
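
The call to `sort_and_filter_records_with_metrics` combines three steps: apply the threshold in `filter_criteria`, sort by `sorting_key`, and (together with `.head(10)`) keep the worst offenders for plotting. A plain pandas sketch of the same pipeline on toy data, assuming the `{"max": -10, "inclusive": True}` threshold maps to a simple `<=` comparison (the `score` column is hypothetical):

```python
import pandas as pd

scores = pd.DataFrame({"job_id": [1, 2, 3, 4], "score": [-50.0, -5.0, -12.0, -30.0]})

# Step 1: apply the threshold ({"max": -10, "inclusive": True} -> score <= -10)
candidates = scores[scores["score"] <= -10]
# Step 2: most negative (least efficient) first
ranked = candidates.sort_values("score", ascending=True)
# Step 3: keep the top N rows for plotting, analogous to `.head(10)` in the notebook
top = ranked.head(2)
print(top["job_id"].tolist())  # [1, 4]
```

For an ascending sort, `DataFrame.nsmallest(n, column)` returns the same rows here in one call; `nlargest` is the counterpart for the descending hoarding rankings.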

### <a id='toc1_5_2_'></a>[Generate all hoarding analysis metrics for jobs:](#toc0_)

In [None]:
memory_hoarding_jobs = analyzer.calculate_node_resource_hoarding_for_jobs(filtered_jobs)

# Temporarily display all columns; the previous setting is restored when the block exits
with pd.option_context("display.max_columns", None):
    display(memory_hoarding_jobs.head(5))

print(f"Jobs found: {len(memory_hoarding_jobs)}")
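
The hoarding metrics compare how much of a node's RAM or CPU cores a job reserves with how much of the node's GPUs it reserves. The notebook does not show the formula, so the sketch below assumes one plausible definition: `ram_hoarding_fraction_diff = requested RAM / node RAM - requested GPUs / node GPUs`, and likewise for cores. All input column names and the helper `hoarding_fraction_diffs` are hypothetical. Under this reading, a positive value means the job holds a disproportionate share of the node, which can leave the node's remaining GPUs unusable for other jobs; this matches the `>= 0` filters used below.

```python
import pandas as pd


def hoarding_fraction_diffs(jobs: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical hoarding metrics: node-resource share minus GPU share."""
    out = jobs.copy()
    gpu_share = jobs["gpu_count"] / jobs["node_gpu_count"]
    # Positive: the job holds a larger share of the node's RAM than of its GPUs
    out["ram_hoarding_fraction_diff"] = jobs["requested_ram_gb"] / jobs["node_ram_gb"] - gpu_share
    # Positive: the job holds a larger share of the node's cores than of its GPUs
    out["core_hoarding_fraction_diff"] = jobs["requested_cores"] / jobs["node_cores"] - gpu_share
    return out


job = pd.DataFrame(
    {
        "gpu_count": [1],
        "node_gpu_count": [4],
        "requested_ram_gb": [256],
        "node_ram_gb": [512],
        "requested_cores": [8],
        "node_cores": [64],
    }
)
print(hoarding_fraction_diffs(job)[["ram_hoarding_fraction_diff", "core_hoarding_fraction_diff"]])
```

For this job (1 of 4 GPUs, 256 of 512 GB RAM, 8 of 64 cores) the RAM difference is 0.5 - 0.25 = 0.25 (hoarding) and the core difference is 0.125 - 0.25 = -0.125 (no hoarding).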

#### <a id='toc1_5_2_1_'></a>[Find most inefficient jobs hoarding node RAM based on `ram_hoarding_fraction_diff`](#toc0_)

In [None]:
inefficient_jobs_hoarding_ram = analyzer.sort_and_filter_records_with_metrics(
    metrics_df_name_enum=ResourceHoardingDataFrameNameEnum.JOBS_WITH_RESOURCE_HOARDING_METRICS,
    sorting_key="ram_hoarding_fraction_diff",
    ascending=False,  # Sort in descending order
    filter_criteria={"ram_hoarding_fraction_diff": {"min": 0, "inclusive": True}},
)
# Plot top jobs by RAM hoarding fraction difference, with CPU memory efficiency and allocated VRAM efficiency as labels
jobs_with_metrics_visualizer = JobsWithMetricsVisualizer(inefficient_jobs_hoarding_ram.head(10))
jobs_with_metrics_visualizer.visualize(
    column="ram_hoarding_fraction_diff",
    bar_label_columns=["cpu_mem_efficiency", "alloc_vram_efficiency"],
    figsize=(10, 6),
)

#### <a id='toc1_5_2_2_'></a>[Find most inefficient jobs hoarding CPU cores based on `core_hoarding_fraction_diff`](#toc0_)

In [None]:
inefficient_jobs_hoarding_cpu_cores = analyzer.sort_and_filter_records_with_metrics(
    metrics_df_name_enum=ResourceHoardingDataFrameNameEnum.JOBS_WITH_RESOURCE_HOARDING_METRICS,
    sorting_key="core_hoarding_fraction_diff",
    ascending=False,  # Sort in descending order
    filter_criteria={"core_hoarding_fraction_diff": {"min": 0, "inclusive": True}},
)

# Plot top jobs by CPU core hoarding fraction difference, with RAM hoarding fraction difference and allocated VRAM efficiency as labels
jobs_with_metrics_visualizer = JobsWithMetricsVisualizer(inefficient_jobs_hoarding_cpu_cores.head(10))
jobs_with_metrics_visualizer.visualize(
    column="core_hoarding_fraction_diff",
    bar_label_columns=["ram_hoarding_fraction_diff", "alloc_vram_efficiency"],
    figsize=(10, 6),
)

## <a id='toc1_6_'></a>[User-Level Analysis](#toc0_)

In [None]:
users_with_metrics

### <a id='toc1_6_1_'></a>[Find Inefficient Users based on `requested_vram_efficiency_score`](#toc0_)

In [None]:
inefficient_users_avg_req_vram_eff_score = analyzer.sort_and_filter_records_with_metrics(
    metrics_df_name_enum=ResourceHoardingDataFrameNameEnum.USERS,
    sorting_key="avg_requested_vram_efficiency_score",
    ascending=True,  # Sort by avg_requested_vram_efficiency_score in ascending order
    filter_criteria={
        "avg_requested_vram_efficiency_score": {"max": -10, "inclusive": True},  # score threshold
        "job_count": {"min": 5, "inclusive": True},  # minimum job count threshold
    },
)
# Plot top inefficient users by average requested VRAM efficiency score, with VRAM-hours and job count as labels
users_with_metrics_visualizer = UsersWithMetricsVisualizer(inefficient_users_avg_req_vram_eff_score.head(10))
users_with_metrics_visualizer.visualize(
    column="avg_requested_vram_efficiency_score",
    bar_label_columns=["vram_hours", "job_count"],
    figsize=(10, 6),
    anonymize=True,
)

In [None]:
users_with_metrics_visualizer = UsersWithMetricsVisualizer(inefficient_users_avg_req_vram_eff_score)
users_with_metrics_visualizer.visualize_metric_distribution(
    output_dir_path=USERS_VISUALIZATION_DATA_DIR, column="avg_requested_vram_efficiency_score", figsize=(8, 5)
)

In [None]:
inefficient_users_ev_req_vram_eff = analyzer.sort_and_filter_records_with_metrics(
    metrics_df_name_enum=ResourceHoardingDataFrameNameEnum.USERS,
    sorting_key="expected_value_requested_vram_efficiency",
    ascending=True,  # Sort by expected_value_requested_vram_efficiency in ascending order
    filter_criteria={
        "job_count": {"min": 5, "inclusive": True},  # minimum job count threshold
    },
)
users_with_metrics_ev_visualizer = UsersWithMetricsVisualizer(inefficient_users_ev_req_vram_eff)
users_with_metrics_ev_visualizer.visualize_metric_distribution(
    output_dir_path=USERS_VISUALIZATION_DATA_DIR, column="expected_value_requested_vram_efficiency", figsize=(8, 5)
)
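
The user-level `expected_value_*` columns are distinct from the `avg_*` columns. One common way to define such an expected value is a weighted mean in which larger or longer jobs count more. The sketch below assumes the weight is `vram_hours`; that weighting, the helper `expected_value_by_group`, and the column names are assumptions, not confirmed by this notebook.

```python
import pandas as pd


def expected_value_by_group(jobs, value_col, weight_col="vram_hours", group_col="user"):
    """Weighted mean of `value_col` per group, weighted by `weight_col` (assumed definition)."""
    weighted = jobs.assign(_weighted=jobs[value_col] * jobs[weight_col])
    grouped = weighted.groupby(group_col)
    return grouped["_weighted"].sum() / grouped[weight_col].sum()


jobs = pd.DataFrame(
    {
        "user": ["user_a", "user_a"],
        "vram_hours": [90.0, 10.0],
        "requested_vram_efficiency": [0.1, 0.9],
    }
)
ev = expected_value_by_group(jobs, "requested_vram_efficiency")
print(ev["user_a"])
```

In this toy example the user's large, inefficient job dominates: the weighted value is 0.18, whereas the unweighted mean of 0.1 and 0.9 is 0.5.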

### <a id='toc1_6_2_'></a>[Generate all hoarding analysis metrics for users:](#toc0_)

In [None]:
memory_hoarding_users = analyzer.calculate_node_resource_hoarding_for_users(filtered_jobs)
display(memory_hoarding_users)

#### <a id='toc1_6_2_1_'></a>[Find most inefficient users hoarding node RAM based on `expected_value_ram_hoarding_fraction_diff`](#toc0_)

In [None]:
inefficient_users_hoarding_ram = analyzer.sort_and_filter_records_with_metrics(
    metrics_df_name_enum=ResourceHoardingDataFrameNameEnum.USERS_WITH_RESOURCE_HOARDING_METRICS,
    sorting_key="expected_value_ram_hoarding_fraction_diff",
    ascending=False,  # Sort in descending order
    filter_criteria={"expected_value_ram_hoarding_fraction_diff": {"min": 0, "inclusive": True}},
)
# Plot top users by expected RAM hoarding fraction difference, with expected allocated VRAM efficiency as labels
users_with_metrics_visualizer = UsersWithMetricsVisualizer(inefficient_users_hoarding_ram.head(10))
users_with_metrics_visualizer.visualize(
    column="expected_value_ram_hoarding_fraction_diff",
    bar_label_columns=[
        "expected_value_alloc_vram_efficiency",
    ],
    figsize=(10, 6),
)

#### <a id='toc1_6_2_2_'></a>[Find most inefficient users hoarding CPU cores based on `expected_value_core_hoarding_fraction_diff`](#toc0_)

In [None]:
inefficient_users_hoarding_cpu_cores = analyzer.sort_and_filter_records_with_metrics(
    metrics_df_name_enum=ResourceHoardingDataFrameNameEnum.USERS_WITH_RESOURCE_HOARDING_METRICS,
    sorting_key="expected_value_core_hoarding_fraction_diff",
    ascending=False,  # Sort in descending order
    filter_criteria={"expected_value_core_hoarding_fraction_diff": {"min": 0, "inclusive": True}},
)
# Plot top users by expected CPU core hoarding fraction difference, with expected allocated VRAM efficiency as labels
users_with_metrics_visualizer = UsersWithMetricsVisualizer(inefficient_users_hoarding_cpu_cores.head(10))
users_with_metrics_visualizer.visualize(
    column="expected_value_core_hoarding_fraction_diff",
    bar_label_columns=[
        "expected_value_alloc_vram_efficiency",
    ],
    figsize=(10, 6),
)

## <a id='toc1_7_'></a>[PI Group Analysis](#toc0_)

In [None]:
pi_accounts_with_metrics

#### <a id='toc1_7_1_1_'></a>[Find Inefficient PIs based on `avg_requested_vram_efficiency_score`](#toc0_)

In [None]:
inefficient_pis_avg_req_vram_eff_score = analyzer.sort_and_filter_records_with_metrics(
    metrics_df_name_enum=ResourceHoardingDataFrameNameEnum.PI_GROUPS,
    sorting_key="avg_requested_vram_efficiency_score",
    ascending=True,  # more negative first
    filter_criteria={
        "avg_requested_vram_efficiency_score": {"max": -10, "inclusive": True},
        "job_count": {"min": 5, "inclusive": True},  # Minimum number of jobs to consider a PI account
    },
)
pis_with_metrics_visualizer = PIGroupsWithMetricsVisualizer(inefficient_pis_avg_req_vram_eff_score.head(10))
pis_with_metrics_visualizer.visualize(
    output_dir_path=PI_GROUPS_VISUALIZATION_DATA_DIR,
    column="avg_requested_vram_efficiency_score",
    bar_label_columns=["pi_acc_job_hours", "pi_acc_vram_hours"],
    figsize=(10, 6),
    anonymize=True,
)