# Alibaba GPU Cluster Data Exploration

## Research Objective
Identify behavioral patterns in GPU cluster usage that can inform autonomous ML infrastructure optimization.

## Primary Dataset: cluster-trace-gpu-v2020
- **Scale**: 6,500+ GPUs across 1,800 machines
- **Duration**: 2 months (July-August 2020)
- **Users**: 1,300+ users
- **Workloads**: Training and inference jobs

## Key Research Questions
1. What behavioral patterns exist in GPU resource usage?
2. Can we identify resource hoarding patterns?
3. What temporal patterns exist in job submissions?
4. How do recurring tasks behave differently from one-off tasks?
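
Questions 2 and 3 can be made concrete with a small sketch on synthetic data. The column names (`user`, `gpu_requested`, `gpu_used`, `submit_time`) are hypothetical and are not taken from the trace schema, and the ratio of used to requested GPU is only one possible proxy for hoarding. The 0.5 threshold is likewise an arbitrary assumption.

```python
import pandas as pd

# Synthetic job records; every column name here is an illustrative assumption.
jobs = pd.DataFrame({
    'user': ['u1', 'u1', 'u2', 'u2', 'u3'],
    'gpu_requested': [4.0, 4.0, 1.0, 2.0, 8.0],
    'gpu_used': [1.0, 1.0, 0.9, 1.8, 2.0],
    'submit_time': pd.to_datetime([
        '2020-07-01 09:15', '2020-07-01 09:45', '2020-07-01 14:05',
        '2020-07-02 09:30', '2020-07-02 23:10',
    ]),
})

# Hoarding proxy: share of requested GPU capacity actually used, per user.
per_user = jobs.groupby('user')[['gpu_requested', 'gpu_used']].sum()
per_user['utilization'] = per_user['gpu_used'] / per_user['gpu_requested']

# Flag users below an arbitrary threshold (0.5 is an assumption, not a finding).
hoarding_candidates = per_user.index[per_user['utilization'] < 0.5].tolist()

# Temporal pattern: number of submissions per hour of day.
submissions_by_hour = jobs['submit_time'].dt.hour.value_counts().sort_index()

print(per_user)
print('Hoarding candidates:', hoarding_candidates)
print(submissions_by_hour)
```

On the real trace, the same grouping would need the actual request and usage columns, which should be confirmed from the loaded tables first.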


In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json

# Configure plotting
plt.style.use('default')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)

print('GPU Observability Research - Alibaba Data Exploration')
print('=' * 60)

GPU Observability Research - Alibaba Data Exploration


## Dataset Status Check

In [7]:
# Check dataset availability
data_path = Path('../data/raw')

datasets = {
    'GPU 2020': data_path / 'cluster-trace-gpu-v2020',
    'GPU 2023': data_path / 'cluster-trace-gpu-v2023'
}

print('Dataset Status:')
print('-' * 30)

for name, path in datasets.items():
    if path.exists():
        files = list(path.glob('*.csv'))
        status = f'✓ Found ({len(files)} CSV files)' if files else '⚠ Metadata only'
        print(f'{name}: {status}')
        
        # Show the dataset description if a metadata file is present
        info_file = path / 'dataset_info.json'
        if info_file.exists():
            with open(info_file, encoding='utf-8') as f:
                info = json.load(f)
            print(f'  Description: {info.get("description", "N/A")}')
    else:
        print(f'{name}: ✗ Not found')

print('\nTo get full datasets:')
print('1. See: ../data/raw/ACCESS_GUIDE.md')
print('2. Complete surveys for dataset access')
print('3. Download and extract data files')

Dataset Status:
------------------------------
GPU 2020: ✗ Not found
GPU 2023: ✗ Not found

To get full datasets:
1. See: ../data/raw/ACCESS_GUIDE.md
2. Complete surveys for dataset access
3. Download and extract data files
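
The status check can also be packaged as a small helper so that later cells can branch on the result instead of reading printed text. This is a minimal sketch: the function name `dataset_status` and its three return labels are hypothetical, and it is demonstrated on a temporary directory rather than on `../data/raw`.

```python
from pathlib import Path
import tempfile


def dataset_status(path: Path) -> str:
    """Classify a dataset directory as 'missing', 'empty', or 'ready'."""
    if not path.is_dir():
        return 'missing'
    csv_count = sum(1 for _ in path.glob('*.csv'))
    return 'ready' if csv_count else 'empty'


# Demonstrate on a throwaway directory rather than the real data path.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'with_csv').mkdir()
    (root / 'with_csv' / 'example.csv').write_text('a,b\n1,2\n')
    (root / 'no_csv').mkdir()

    statuses = {
        name: dataset_status(root / name)
        for name in ['with_csv', 'no_csv', 'absent']
    }

print(statuses)
```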


## Data Loading (Run after downloading full datasets)

In [None]:
# Load GPU 2020 dataset (primary focus)
gpu_2020_path = data_path / 'cluster-trace-gpu-v2020'

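# All four tables are read below, so check for all of them up front.
# NOTE: the files published in the Alibaba clusterdata repository may carry a
# `pai_` prefix (e.g. pai_job_table); adjust these names to match your copy.
required_files = ['job_table.csv', 'task_table.csv',
                  'instance_table.csv', 'machine_attributes.csv']

if all((gpu_2020_path / name).exists() for name in required_files):
    print('Loading GPU 2020 dataset...')
    
    # Core tables for behavioral analysis
    jobs_df = pd.read_csv(gpu_2020_path / 'job_table.csv')
    tasks_df = pd.read_csv(gpu_2020_path / 'task_table.csv')
    instances_df = pd.read_csv(gpu_2020_path / 'instance_table.csv')
    machines_df = pd.read_csv(gpu_2020_path / 'machine_attributes.csv')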
    
    print(f'Jobs: {jobs_df.shape}')
    print(f'Tasks: {tasks_df.shape}')
    print(f'Instances: {instances_df.shape}')
    print(f'Machines: {machines_df.shape}')
    
    # Show basic info
    print('\nJob table columns:')
    print(jobs_df.columns.tolist())
    
else:
    print('⚠ Full dataset not available')
    print('Complete survey to download: cluster-trace-gpu-v2020')
    print('See: ../data/raw/ACCESS_GUIDE.md')
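
Once the tables are loaded, a likely next step is to join them for per-job analysis. The sketch below uses tiny synthetic frames. The key column `job_name` and the columns `user`, `task_name`, and `plan_gpu` are assumptions that should be checked against `jobs_df.columns` and `tasks_df.columns` before reuse.

```python
import pandas as pd

# Synthetic stand-ins for the job and task tables (column names are assumed).
jobs = pd.DataFrame({
    'job_name': ['j1', 'j2', 'j3'],
    'user': ['u1', 'u1', 'u2'],
})
tasks = pd.DataFrame({
    'job_name': ['j1', 'j1', 'j2', 'j4'],
    'task_name': ['worker', 'ps', 'worker', 'worker'],
    'plan_gpu': [100.0, 0.0, 50.0, 25.0],
})

# Left join keeps every job; the indicator column marks jobs with no task rows.
merged = jobs.merge(tasks, on='job_name', how='left', indicator=True)
jobs_without_tasks = merged.loc[
    merged['_merge'] == 'left_only', 'job_name'
].tolist()

# Total planned GPU per job; unmatched jobs contribute only NaN and sum to 0.
gpu_per_job = merged.groupby('job_name')['plan_gpu'].sum()

print(merged)
print('Jobs without tasks:', jobs_without_tasks)
print(gpu_per_job)
```

Using `indicator=True` makes unmatched rows visible, which helps catch key mismatches between tables before any aggregation.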