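Start jupyter-lab with the spark-rapids integration-test Python modules on `PYTHONPATH`, `SPARK_HOME` pointing at a Spark distribution, and the time zone set to UTC: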

```bash
PYTHONPATH=$HOME/gits/NVIDIA/spark-rapids/integration_tests/src/main/python \
SPARK_HOME=$HOME/dist/spark-3.1.1-bin-hadoop3.2 \
TZ=UTC \
jupyter-lab --notebook-dir=$HOME/gits/NVIDIA/spark-rapids-examples
```

In [1]:
should_gen = False

In [2]:
gpu_alloc_size = '5000m' if should_gen else '128m'
cores_per_exec = 4
dfgen_path = '/tmp/dfgen'
data_gen_length = 100*1000*1000 # number of rows in the single generated file
num_copies = 10 # number of extra copies made of that file
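
To see why these settings are likely to force spilling, here is a rough sizing sketch. It assumes `IntegerGen` yields 32-bit integers and that `'128m'` means 128 MiB; it ignores Parquet compression and any per-batch overhead, so treat the numbers as an order-of-magnitude estimate only.

```python
# Back-of-the-envelope sizing; the first two constants mirror cell In [2].
data_gen_length = 100 * 1000 * 1000   # rows in the generated file
num_copies = 10                       # extra copies made of that file
INT_BYTES = 4                         # assumption: IntegerGen yields 32-bit ints

total_files = num_copies + 1          # the original file plus its copies
total_rows = data_gen_length * total_files
raw_bytes = total_rows * INT_BYTES    # uncompressed column data only

gpu_alloc_bytes = 128 * 1024 * 1024   # assumption: '128m' is read as 128 MiB

print(f"files: {total_files}, rows: {total_rows:,}")
print(f"raw column data: {raw_bytes / 2**30:.2f} GiB")
print(f"ratio to GPU pool: {raw_bytes / gpu_alloc_bytes:.1f}x")
```

Under these assumptions the raw data is well over an order of magnitude larger than the GPU pool used when `should_gen` is `False`, which is the point of the repro.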

In [3]:
import findspark
findspark.init()
findspark.add_jars('/home/gshegalov/gits/NVIDIA/spark-rapids/dist/target/rapids-4-spark_2.12-22.10.0-SNAPSHOT-cuda11.jar') 

In [4]:
import pyspark
conf = pyspark.SparkConf(loadDefaults=False)
conf.setAll([
    ('spark.driver.memory', '8g'),
    ('spark.executor.memory', '8g'),
    # ('spark.executor.extraJavaOptions', 
    #      '-agentlib:jdwp=transport=dt_socket,server=y,address=localhost:5005'),
    ('spark.plugins', 'com.nvidia.spark.SQLPlugin'),
    ('spark.rpc.message.maxSize', 1024),
    ('spark.task.maxFailures', 1),
    # ('spark.rapids.memory.gpu.allocFraction', 0.2),
    ('spark.rapids.memory.gpu.allocSize', gpu_alloc_size),
    # ('spark.rapids.memory.gpu.minAllocFraction', 0.1),
    # ('spark.rapids.memory.gpu.maxAllocFraction', 0.5),
    # keep reader and target batch sizes small to avoid hitting OOM on a single batch
    ('spark.rapids.sql.batchSizeBytes', '16m'),
    ('spark.rapids.sql.explain', 'ALL'),
    ('spark.rapids.sql.reader.batchSizeBytes', '16m'),
    ('spark.sql.adaptive.enabled', False),   
])
spark = pyspark.sql.SparkSession.builder\
    .appName('Spill Experiments Notebook')\
    .master(f"local-cluster[1,{cores_per_exec},10000]")\
    .config(conf=conf)\
    .getOrCreate()

22/09/20 19:45:36 WARN Utils: Your hostname, gshegalov-dual-5760 resolves to a loopback address: 127.0.1.1; using 10.0.0.133 instead (on interface wlp0s20f3)
22/09/20 19:45:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/09/20 19:45:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/09/20 19:45:37 WARN RapidsPluginUtils: RAPIDS Accelerator 22.10.0-SNAPSHOT using cudf 22.10.0-SNAPSHOT.
22/09/20 19:45:37 WARN RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.
22/09/20 19:45:37 WARN RapidsPluginUtils: spark.rapids.sql.explain is set to `ALL`. Set it to 'NONE' to suppress the diagnostics logging about the query plac
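
The master URL in cell In [4] uses Spark's `local-cluster[workers,cores,memoryMB]` form. The sketch below only builds that string and checks that the 8g executor fits in the worker's memory; `local_cluster_master` is a hypothetical helper for illustration, not part of PySpark.

```python
def local_cluster_master(num_workers, cores_per_worker, memory_per_worker_mb):
    """Build a local-cluster master URL (hypothetical helper, not a PySpark API)."""
    return f"local-cluster[{num_workers},{cores_per_worker},{memory_per_worker_mb}]"

cores_per_exec = 4                      # same value as in cell In [2]
worker_memory_mb = 10000                # third field of the master URL
master = local_cluster_master(1, cores_per_exec, worker_memory_mb)

executor_memory_mb = 8 * 1024           # 'spark.executor.memory' = '8g'
fits = executor_memory_mb <= worker_memory_mb

print(master)
print(f"executor fits in worker memory: {fits}")
```

If the executor memory were raised above the worker's memory, the worker would presumably be unable to launch the executor, so the two values need to be changed together.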

In [5]:
spark

## Generate Data 

In [6]:
import spark_init_internal
setattr(spark_init_internal, '_spark', spark)
from data_gen import *

In [7]:
if should_gen: 
    dfgen = unary_op_df(
        spark=spark, 
        gen=IntegerGen(nullable=False), 
        length=data_gen_length, 
        num_slices=1)

In [8]:
# if should_gen: dfgen = spark.range(0, 1 << 28)

In [9]:
import glob
import shutil

if should_gen:
    dfgen.write.mode('overwrite').parquet(dfgen_path)

# Replicate only while exactly one part file exists, so re-running the cell does not copy again.
generated_files = glob.glob(f"{dfgen_path}/*.parquet")
if len(generated_files) == 1:
    orig_path = generated_files[0]
    print(f"replicating generated file {orig_path}\n")
    for i in range(num_copies):
        shutil.copyfile(src=orig_path, dst=f"{dfgen_path}/part-00000-copy-{i}.snappy.parquet")
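
The replication step can be tried without Spark. This is a minimal sketch on a temporary directory with a placeholder file instead of real Parquet data; `replicate_single_file` is a hypothetical helper that mirrors the cell's logic, including the guard that makes a second run a no-op.

```python
import glob
import os
import shutil
import tempfile

def replicate_single_file(directory, copies, pattern="*.parquet"):
    """Copy the lone file matching `pattern` `copies` times; otherwise do nothing."""
    matches = glob.glob(os.path.join(directory, pattern))
    if len(matches) != 1:
        return []  # nothing generated yet, or copies already exist
    created = []
    for i in range(copies):
        dst = os.path.join(directory, f"part-00000-copy-{i}.snappy.parquet")
        shutil.copyfile(matches[0], dst)
        created.append(dst)
    return created

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder content; a real run would have a Parquet file written by Spark.
    with open(os.path.join(tmp, "part-00000-demo.snappy.parquet"), "wb") as f:
        f.write(b"placeholder bytes, not real Parquet")
    first = replicate_single_file(tmp, copies=3)    # makes the copies
    second = replicate_single_file(tmp, copies=3)   # skipped: more than one file now
    file_count = len(os.listdir(tmp))

print(len(first), len(second), file_count)
```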

22/09/20 19:47:59 WARN GpuOverrides: 
*Exec <DataWritingCommandExec> will run on GPU
  *Output <InsertIntoHadoopFsRelationCommand> will run on GPU
  ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
    @Expression <AttributeReference> a#0 could run on GPU

22/09/20 19:48:00 WARN TaskSetManager: Stage 0 contains a task of very large size (680682 KiB). The maximum recommended task size is 1000 KiB.

replicating generated file /tmp/dfgen/part-00000-cb555d1b-4438-4e7e-b264-308ca3ff90b9-c000.snappy.parquet

## Repro for OutOfCore Sort spilling 

In [10]:
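if not should_gen: 
    from pyspark.sql.functions import col
    # Run with the small GPU pool (gpu_alloc_size = '128m'): a global sort
    # over all replicated files, intended to exercise the out-of-core sort path.
    df = spark.read.parquet(dfgen_path)
    df.printSchema()
    q2 = df.orderBy(col('a').desc())
    q2.write.mode('overwrite').parquet('/tmp/q2')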