## Set Spark Job Description

```{warning}
This section is under construction.
```



The Spark UI is a great tool for monitoring Spark applications, troubleshooting slow jobs and generally understanding how Spark will execute your code. However, it is sometimes difficult to find the correct job number when drilling down into a problem.

PySpark lets us set the Spark job description with the `spark.sparkContext.setJobDescription()` function. This function takes a string and writes it to the *Description* column in the Spark UI. While this makes finding the job in the UI much easier, the description persists: unless we update it in our script or tell Spark to [revert to default descriptions](#back-to-default-description), every subsequent job will be given the same description.

We will work through a short example to highlight its effectiveness. First we will start a Spark session.

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = (
    SparkSession.builder.appName("job-description-tip")
    .config("spark.executor.memory", "1g")
    .config("spark.executor.cores", 1)
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", 3)
    .config("spark.sql.shuffle.partitions", 12)
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.ui.showConsoleProgress", "false")
    .enableHiveSupport()
    .getOrCreate()
)

Next we'll generate a link to the Spark UI.

In [None]:
import os, IPython
url = "spark-%s.%s" % (os.environ["CDSW_ENGINE_ID"], os.environ["CDSW_DOMAIN"])
IPython.display.HTML("<a href=http://%s>Spark UI</a>" % url)
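
The cell above reads two CDSW environment variables directly, so it raises a `KeyError` when either is missing, for example outside CDSW. As a minimal sketch, the URL could be built by a small helper that returns `None` in that case. The name `spark_ui_url` and its `env` parameter are hypothetical and used only for illustration.

```python
import os


def spark_ui_url(env=None):
    """Build the CDSW Spark UI URL, or return None if it cannot be built.

    `env` is any mapping of environment variables; it defaults to os.environ.
    This helper is illustrative only and is not part of PySpark or CDSW.
    """
    env = os.environ if env is None else env
    engine_id = env.get("CDSW_ENGINE_ID")
    domain = env.get("CDSW_DOMAIN")
    if not engine_id or not domain:
        # Not running inside CDSW, or the variables are not set.
        return None
    return "http://spark-%s.%s" % (engine_id, domain)
```

When the helper returns `None`, `spark.sparkContext.uiWebUrl` may be a useful fallback, as it holds the UI address reported by Spark itself.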

## Default descriptions
By default, Spark generates job descriptions based on the action that has been called.
We will now read in some data, apply a few transformations and call some actions to create a few Spark jobs.

In [None]:
rescue = (
    spark.read.csv("/training/animal_rescue.csv", header=True)
    .withColumnRenamed("AnimalGroupParent", "AnimalGroup")
    .select("IncidentNumber", "FinalDescription", "AnimalGroup", "CalYear")
)
rescue.count()

In [None]:
rescue.sort("CalYear").show(5, truncate=False)

In [None]:
rescue.select("IncidentNumber", "AnimalGroup").sort("IncidentNumber").limit(3).collect()

Now we can check the Spark UI; a screenshot is included below. As we can see, a number of jobs have been created, but which action created each job?

![Spark UI, jobs tab showing default descriptions for the jobs](path_to_files)

In the *Description* column we can see the description starts with the action that created the job. This is useful to identify the correct job, but we can do one better.

## Customised description

We can customise the job description using `spark.sparkContext.setJobDescription()`. This is useful as we can assign a more detailed name to each job within our code. Doing so will help us find the exact action that created a job in the Spark UI.

In [None]:
spark.sparkContext.setJobDescription("Count fox incidents")

rescue.filter(F.col("AnimalGroup") == "Fox").count()

In [None]:
spark.sparkContext.setJobDescription("Count incidents in 2015")

rescue.filter(F.col("CalYear") == 2015).count()

Looking at the UI again you can see our descriptions have been updated to contain our customised names.

![Spark UI, jobs tab showing default descriptions and customised descriptions for the jobs](Update_path)
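
When the same kind of action runs several times, for example one count per year, giving each run its own description keeps the jobs distinguishable in the UI. The following is a minimal sketch of that idea, assuming a hypothetical helper called `run_labelled`. It only relies on the `setJobDescription()` method used above; the helper name and the `tasks` mapping are illustrative.

```python
def run_labelled(sc, tasks):
    """Run each action under its own Spark job description.

    `sc` is expected to be `spark.sparkContext`. `tasks` maps a description
    string to a zero-argument callable that triggers a Spark action.
    This helper is illustrative only and is not part of PySpark.
    """
    results = {}
    for description, action in tasks.items():
        # Update the description immediately before triggering the action.
        sc.setJobDescription(description)
        results[description] = action()
    return results


# Possible usage with the `rescue` DataFrame from this article (not run here):
# counts = run_labelled(
#     spark.sparkContext,
#     {
#         f"Count incidents in {year}": (
#             lambda y=year: rescue.filter(F.col("CalYear") == y).count()
#         )
#         for year in (2014, 2015, 2016)
#     },
# )
```

The `y=year` default argument binds each year when the lambda is created, so every callable filters on its own year rather than the last one in the loop. The final description stays set after the helper returns.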

**WARNING** every job from now on will have the last description we set, unless we tell Spark to revert to the default descriptions.

## Back to default description

We can set the description to `None` to revert to the default descriptions.

In [None]:
spark.sparkContext.setJobDescription(None)

rescue.show(5, truncate=False)

The job description for the most recent job has been reverted to its default:

![Spark UI, jobs tab showing default and customised descriptions for the jobs](Update_path)
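
It is easy to forget the reset step, so one option is to wrap the set-then-reset pattern in a context manager. The sketch below assumes a hypothetical helper named `job_description`; it is not part of PySpark. It always resets the description to `None` on exit rather than restoring any description that was set beforehand.

```python
from contextlib import contextmanager


@contextmanager
def job_description(sc, description):
    """Set a Spark job description for the duration of a `with` block.

    `sc` is expected to be `spark.sparkContext`. On exit the description is
    reset to None (the default), even if the block raises an exception.
    This helper is illustrative only and is not part of PySpark.
    """
    sc.setJobDescription(description)
    try:
        yield
    finally:
        # Revert to default descriptions for any later jobs.
        sc.setJobDescription(None)


# Possible usage with the `rescue` DataFrame from this article (not run here):
# with job_description(spark.sparkContext, "Count fox incidents"):
#     rescue.filter(F.col("AnimalGroup") == "Fox").count()
```

Only actions triggered inside the `with` block pick up the custom description; anything run afterwards gets the default description again.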

## Summary

- Default job descriptions tell you the action used to trigger the job.
- Set a job description to better track jobs in the Spark UI using `spark.sparkContext.setJobDescription()`.
- Remember to set the description to `None` to go back to using default descriptions once you've finished tracking your jobs.
- The description also carries through to the Stages tab, but does not appear in the SQL tab.

### Further Resources

Spark at the ONS Articles:
- [Spark Application and UI](../spark-concepts/spark-application-and-ui)

PySpark Documentation:
- [`SparkSession`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html)
- [`.count()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.count.html)
- [`.groupBy()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.groupBy.html)
- [`.show()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.show.html)
- [`setJobDescription`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.setJobDescription.html?highlight=setjob)

Spark documentation:
- [Monitoring and Instrumentation](https://spark.apache.org/docs/latest/monitoring.html)
- [Spark Web UI](https://spark.apache.org/docs/latest/web-ui.html)

