# Configuring Error Messages

One common challenge with Spark, and especially with PySpark, is making sense of error messages. In PySpark, much of the problem stems from the fact that two different technologies (the JVM and Python) work together, and both produce stack traces. This notebook provides some guidance on configuring simplified error messages.

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "24G") \
        .getOrCreate()

spark

# 1. Provoke an Error

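First we provoke an error by creating an ill-formed program: the function passed to `applyInPandas` refers to a variable that is never defined. Because Spark evaluates lazily, nothing fails until an action such as `count()` is executed.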

In [None]:
import pandas as pd

replication_df = spark.createDataFrame(pd.DataFrame(list(range(1,1000)),columns=['replication_id'])).repartition(1000, 'replication_id')

In [None]:
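# Import only the types used below; wildcard imports from pyspark.sql.functions
# shadow Python builtins such as sum, min and max.
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType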

outSchema = StructType([StructField('replication_id', IntegerType(), True),
            StructField('sil_score', DoubleType(), True),
            StructField('num_clusters', IntegerType(), True),
            StructField('min_samples', IntegerType(), True),
            StructField('min_cluster_size', IntegerType(), True)])


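def run_model(df_pandas: pd.DataFrame) -> pd.DataFrame:
    # Deliberate bug: `replication_id` is not defined anywhere, so this
    # function raises a NameError once Spark calls it on a worker.
    # Return result as a pandas data frame
    return pd.DataFrame({'replication_id': replication_id, 'sil_score': 2,
                         'num_clusters': 3, 'min_samples': 4,
                         'min_cluster_size': 5}, index=[0])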


results = replication_df.groupBy("replication_id").applyInPandas(run_model, outSchema)
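
Before looking at Spark's error output, it can help to see the underlying Python error in isolation. The following sketch calls the same function body on a plain pandas data frame, without Spark; the one-row `sample` frame is an assumption made for illustration and is not part of the original notebook.

```python
import pandas as pd


def run_model(df_pandas: pd.DataFrame) -> pd.DataFrame:
    # Same body as in the notebook: `replication_id` is never defined
    return pd.DataFrame({'replication_id': replication_id, 'sil_score': 2,
                         'num_clusters': 3, 'min_samples': 4,
                         'min_cluster_size': 5}, index=[0])


# Hypothetical one-row input standing in for a single Spark group
sample = pd.DataFrame({'replication_id': [1]})

error_name = None
try:
    run_model(sample)
except NameError as exc:
    # The plain Python error that Spark later wraps in its own stack traces
    error_name = type(exc).__name__
    print(f"{error_name}: {exc}")
```

Calling the function locally like this yields a single short Python traceback, which is the information the settings below try to surface from the much longer Spark output.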

# 2. Configuring Error Messages

PySpark provides two important configuration properties, `spark.sql.pyspark.jvmStacktrace.enabled` and `spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled`, which control how error messages are presented to the developer. We will try different settings and compare the output in the following sections.
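
The three combinations tried in this notebook can be collected in a small helper. This is only a sketch: the function `set_error_detail` and the level names `"full"`, `"medium"` and `"simplified"` are hypothetical and not part of PySpark; only the two property names come from the text above. The helper assumes it is handed an object with a `conf.set(key, value)` method, such as a `SparkSession`.

```python
# Hypothetical mapping from a detail level to the two properties named above.
# The level names are illustrative and not defined by PySpark.
JVM_STACKTRACE = "spark.sql.pyspark.jvmStacktrace.enabled"
SIMPLIFIED_TRACEBACK = "spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled"

ERROR_DETAIL_LEVELS = {
    "full": {JVM_STACKTRACE: True, SIMPLIFIED_TRACEBACK: False},
    "medium": {JVM_STACKTRACE: False, SIMPLIFIED_TRACEBACK: False},
    "simplified": {JVM_STACKTRACE: False, SIMPLIFIED_TRACEBACK: True},
}


def set_error_detail(session, level):
    """Apply one of the predefined combinations to `session.conf`."""
    if level not in ERROR_DETAIL_LEVELS:
        raise ValueError(
            f"unknown level {level!r}; expected one of {sorted(ERROR_DETAIL_LEVELS)}"
        )
    settings = ERROR_DETAIL_LEVELS[level]
    for key, value in settings.items():
        session.conf.set(key, value)
    # Return a copy so callers can inspect what was applied
    return dict(settings)
```

With an active session this would be called as `set_error_detail(spark, "simplified")` before triggering the failing action again.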

## 2.1 Full Detail

First we turn on the JVM stack trace and disable the traceback simplification for UDFs. This is an evil combination, as we will see.

In [None]:
spark.conf.set("spark.sql.pyspark.jvmStacktrace.enabled", True)
spark.conf.set("spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled", False)

results.count()

## 2.2 Medium Detail

That was too much. Let's turn off the JVM stack traces, but keep the simplification for UDFs turned off. The output looks better, but it is still not perfect.

In [None]:
spark.conf.set("spark.sql.pyspark.jvmStacktrace.enabled", False)
spark.conf.set("spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled", False)

results.count()

## 2.3 Simplified Detail

Last try: turn off the JVM stack traces and enable the simplification for UDFs.

In [None]:
spark.conf.set("spark.sql.pyspark.jvmStacktrace.enabled", False)
spark.conf.set("spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled", True)

results.count()