# Executor OOM Troubleshooting

**Goal:** Diagnose and fix `OutOfMemoryError` in Spark executors

**Symptoms:**
- `OOMKilled` in the pod status
- `java.lang.OutOfMemoryError` in the logs
- The job fails with an error

## Setup

Import the libraries and connect to Spark Connect.

In [None]:
import os
from pyspark.sql import SparkSession
import kubernetes
import json

# Configuration
SPARK_CONNECT_URL = os.environ.get('SPARK_CONNECT_URL', 'sc://spark-connect:15002')
NAMESPACE = os.environ.get('NAMESPACE', 'spark')

spark = SparkSession.builder \
    .remote(SPARK_CONNECT_URL) \
    .getOrCreate()

# Kubernetes API
# load_kube_config() sets up the default client configuration and returns None,
# so CoreV1Api() is created without arguments.
kubernetes.config.load_kube_config()
v1 = kubernetes.client.CoreV1Api()

## Step 1: Check for OOMKilled Pods

Check whether any executor pods have the OOMKilled status.

In [None]:
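# Get all executor pods
pods = v1.list_namespaced_pod(NAMESPACE, label_selector='spark-role=executor')

oom_pods = []
for pod in pods.items:
    # container_statuses is None until the containers have been created
    for container_status in pod.status.container_statuses or []:
        # Check the current state and the last state (set after a restart)
        for state in (container_status.state, container_status.last_state):
            terminated = state.terminated if state else None
            if terminated and terminated.reason == 'OOMKilled':
                oom_pods.append(pod.metadata.name)
                break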

print(f"Found {len(oom_pods)} OOMKilled pods:")
for pod_name in oom_pods:
    print(f"  - {pod_name}")

if len(oom_pods) == 0:
    print("✅ No OOMKilled pods found")
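
Once an OOMKilled pod is identified, the container's memory limit is the value it exceeded, so it helps to read that limit in bytes. The sketch below is one possible way to do this. The helper names `k8s_memory_to_bytes` and `container_memory_limits` are hypothetical, and the parser only covers plain integers and the common `k/M/G/T` and `Ki/Mi/Gi/Ti` suffixes, not the full Kubernetes quantity grammar.

```python
import re

# Decimal (k, M, ...) and binary (Ki, Mi, ...) suffixes used by Kubernetes quantities.
_UNITS = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}

def k8s_memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '5Gi' or '512Mi' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", quantity.strip())
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])

def container_memory_limits(pod) -> dict:
    """Map container name -> memory limit in bytes (None if no limit is set)."""
    limits = {}
    for container in pod.spec.containers:
        resources = container.resources
        raw = (resources.limits or {}).get("memory") if resources else None
        limits[container.name] = k8s_memory_to_bytes(raw) if raw else None
    return limits
```

For a pod object returned by `v1.list_namespaced_pod`, `container_memory_limits(pod)` gives the limit each container was running under when it was killed.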

## Step 2: Analyze Memory Usage

Inspect executor memory consumption.

In [None]:
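# Get memory metrics from the Spark UI REST API.
# A Spark Connect session has no sparkContext or JVM gateway, so sc._jsc
# cannot be used here.
# Assumption: SPARK_UI_URL points at the driver's Spark UI; adjust it for your cluster.
import urllib.request

SPARK_UI_URL = os.environ.get('SPARK_UI_URL', 'http://spark-connect:4040')

def get_json(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

app_id = get_json(f"{SPARK_UI_URL}/api/v1/applications")[0]['id']
executors = get_json(f"{SPARK_UI_URL}/api/v1/applications/{app_id}/executors")

# memoryUsed is storage memory in use; maxMemory is the total available for storage
print("Executor Memory Usage:")
for ex in executors:
    used_mb = ex['memoryUsed'] / 1024 / 1024
    total_mb = ex['maxMemory'] / 1024 / 1024
    usage_pct = (used_mb / total_mb * 100) if total_mb > 0 else 0
    print(f"  {ex['id']}: {used_mb:.1f}MB / {total_mb:.1f}MB ({usage_pct:.1f}%)")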
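
To turn the usage listing into a short list of executors worth investigating, the percentages can be filtered against a threshold. The sketch below assumes executor records shaped like the entries of the Spark REST API executor list (dicts with `id`, `memoryUsed`, and `maxMemory`). The function name `flag_high_usage` and the 80% default are illustrative choices, not values from the original notebook.

```python
def flag_high_usage(executors, threshold_pct=80.0):
    """Return (executor_id, usage_pct) pairs at or above the threshold, highest first."""
    flagged = []
    for ex in executors:
        max_memory = ex.get("maxMemory", 0)
        if max_memory <= 0:
            continue  # skip executors that report no memory pool
        usage_pct = ex.get("memoryUsed", 0) / max_memory * 100
        if usage_pct >= threshold_pct:
            flagged.append((ex["id"], round(usage_pct, 1)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Hand-written sample records for illustration only
sample = [
    {"id": "driver", "memoryUsed": 100, "maxMemory": 1000},
    {"id": "1", "memoryUsed": 900, "maxMemory": 1000},
    {"id": "2", "memoryUsed": 850, "maxMemory": 1000},
]
print(flag_high_usage(sample))
```

Executors that stay near the top of this list across several runs are the likeliest candidates for the memory changes recommended in Step 4.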

## Step 3: Check Current Configuration

Review the current memory configuration.

In [None]:
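# Get the current configuration.
# spark._jconf is not available over Spark Connect; the SQL SET command
# returns the session's properties as key/value rows instead.
conf = {row['key']: row['value'] for row in spark.sql("SET").collect()}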

memory_settings = {k: v for k, v in conf.items() if 'memory' in k.lower()}

print("Current Memory Settings:")
for key, value in sorted(memory_settings.items()):
    print(f"  {key}: {value}")
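
The settings above determine how much memory each executor pod asks for, which is the number to compare against the pod's limit. As a rough sketch, the request is the executor heap plus overhead, where the overhead defaults to the larger of 10% of the heap and 384 MiB, plus off-heap memory when it is enabled. The helper names below are hypothetical, the off-heap term reflects recent Spark versions on Kubernetes as far as I know, and PySpark memory is ignored.

```python
import re

_SPARK_UNITS_MIB = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}
MIN_OVERHEAD_MIB = 384  # Spark's documented minimum memory overhead

def spark_memory_to_mib(value: str) -> float:
    """Convert a Spark memory string such as '4g' or '512m' to MiB.

    Values without a suffix are treated as MiB. That matches
    spark.executor.memory but not spark.memory.offHeap.size (bytes),
    so use explicit suffixes for off-heap sizes.
    """
    match = re.fullmatch(r"(\d+)(?:([kmgt])b?)?", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported memory value: {value!r}")
    return int(match.group(1)) * _SPARK_UNITS_MIB[match.group(2) or "m"]

def expected_pod_memory_mib(conf: dict, overhead_factor: float = 0.10) -> int:
    """Estimate the executor pod memory request in MiB from Spark settings."""
    heap = spark_memory_to_mib(conf.get("spark.executor.memory", "1g"))
    if "spark.executor.memoryOverhead" in conf:
        overhead = spark_memory_to_mib(conf["spark.executor.memoryOverhead"])
    else:
        factor = float(conf.get("spark.executor.memoryOverheadFactor", overhead_factor))
        overhead = max(heap * factor, MIN_OVERHEAD_MIB)
    off_heap = 0
    if conf.get("spark.memory.offHeap.enabled", "false").lower() == "true":
        off_heap = spark_memory_to_mib(conf.get("spark.memory.offHeap.size", "0"))
    return int(heap + overhead + off_heap)
```

Calling `expected_pod_memory_mib(memory_settings)` with the dictionary built in this step gives an estimate to compare against the container limit of an OOMKilled pod.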

## Step 4: Recommendations

**Diagnostics complete. Recommendations:**

In [None]:
print("""
## Recommendations:

### 1. Increase executor.memory
```bash
helm upgrade spark-connect charts/spark-4.1 -n spark \\
  --set connect.executor.memory=4g \\
  --set connect.executor.memoryLimit=5g
```  

### 2. Increase memoryOverhead
```bash
helm upgrade spark-connect charts/spark-4.1 -n spark \\
  --set connect.executor.memoryOverhead=1g
```  

### 3. Reduce shuffle partition size
```python
df = df.repartition(400)  # more partitions, so each one is smaller
```  

### 4. Enable off-heap memory
```bash
helm upgrade spark-connect charts/spark-4.1 -n spark \\
  --set 'connect.sparkConf.spark\\.memory\\.offHeap\\.enabled=true' \\
  --set 'connect.sparkConf.spark\\.memory\\.offHeap\\.size=1g'
```  

### 5. Use adaptive execution
```python
spark.conf.set('spark.sql.adaptive.enabled', 'true')
spark.conf.set('spark.sql.adaptive.coalescePartitions.enabled', 'true')
```  

""")
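
For the shuffle partition recommendation, a partition count can be estimated from the amount of shuffle data and a target size per partition. This is only a sketch: the function name `suggest_shuffle_partitions` is hypothetical, and the 128 MiB target is a common rule of thumb assumed here rather than a value from this notebook.

```python
import math

def suggest_shuffle_partitions(shuffle_bytes: int,
                               target_partition_mb: int = 128,
                               min_partitions: int = 1) -> int:
    """Estimate how many partitions keep each one near the target size."""
    target_bytes = target_partition_mb * 1024 * 1024
    return max(min_partitions, math.ceil(shuffle_bytes / target_bytes))

# 50 GiB of shuffle data at 128 MiB per partition
print(suggest_shuffle_partitions(50 * 1024**3))  # prints 400
```

The result can be passed to `df.repartition(n)` or set as `spark.sql.shuffle.partitions`. The shuffle size itself has to come from the stage metrics in the Spark UI.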

In [None]:
# Cleanup
spark.stop()
print("✅ Diagnostics complete")