# Median in Spark

### Introduction

Set up the Spark session and read the config file

```python
from pyspark.sql import SparkSession, functions as F
import yaml

spark = (SparkSession.builder.master("local[2]")
         .appName("ons-spark")
         .getOrCreate())

with open("../../../config.yaml") as f:
    config = yaml.safe_load(f)
```

```r
library(sparklyr)
library(dplyr)

sc <- sparklyr::spark_connect(
  master = "local[2]",
  app_name = "ons-spark",
  config = sparklyr::spark_config()
)

config <- yaml::yaml.load_file("ons-spark/config.yaml")
```

Read in population dataset

```python
pop_df = spark.read.parquet(config["population_path"])

pop_df.printSchema()
```

```
root
 |-- postcode_district: string (nullable = true)
 |-- population: long (nullable = true)
```

```r
pop_df <- sparklyr::spark_read_parquet(sc, path = config$population_path)

sparklyr::sdf_schema(pop_df)
```

### Computing the median for a Spark DataFrame

For large datasets, Spark computes medians with the [Greenwald-Khanna](http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf) algorithm, which approximates one or more quantiles of a distribution when the number of observations is very large. The relative error of the function can be lowered to give a more accurate estimate of a quantile, at the cost of increased computation.

In PySpark, the Greenwald-Khanna algorithm is implemented by [`approxQuantile`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.approxQuantile.html), a method of [`pyspark.sql.DataFrame`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html#). To find the exact median of the `population` column with PySpark, we call `approxQuantile` on the population DataFrame and specify the column name, the quantile of interest (0.5, which is the median or second quartile), and the relative error, which we set to 0 to obtain the exact median.

In sparklyr, the Greenwald-Khanna algorithm is implemented by [`sdf_quantile`](https://spark.rstudio.com/packages/sparklyr/latest/reference/sdf_quantile.html#sdf_quantile). To find the exact median of the `population` column with sparklyr, we apply `sdf_quantile` to the population DataFrame, specifying the same parameters as in the PySpark example: the quantile to compute (the median, 0.5) and the relative error (0).

```python
pop_df.approxQuantile("population", [0.5], 0)
```

```
[22331.0]
```

```r
sdf_quantile(pop_df, "population", probabilities = c(0.5), relative.error = 0)
```

In the next example, we compute the first, second, and third quartiles of the `population` column. Suppose that computing three exact quantiles is too expensive for the available resources. We can then raise the relative error, trading accuracy for a lower computation cost. Here we set the relative error to 0.2, or 20%.

```python
pop_df.approxQuantile("population", [0.25, 0.5, 0.75], 0.2)
```

```
[9643.0, 19490.0, 28199.0]
```

```r
sdf_quantile(pop_df, "population", probabilities = c(0.25, 0.5, 0.75), relative.error = 0.2)
```
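The effect of the relative error can be made concrete with the rank guarantee stated in the `approxQuantile` documentation: for a DataFrame with N rows, probability p, and relative error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below is plain Python rather than Spark. The helper name `rank_bounds` and the row count of 1,000 are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from math import ceil, floor


def rank_bounds(n_rows, probability, relative_error):
    """Return the (lowest, highest) rank an approximate quantile may have."""
    # Convert via str so that 0.2 is treated as exactly 1/5, which avoids
    # floating-point rounding at the floor/ceil boundaries.
    p = Fraction(str(probability))
    err = Fraction(str(relative_error))
    return floor((p - err) * n_rows), ceil((p + err) * n_rows)


# Hypothetical DataFrame of 1,000 rows: median requested with 20% relative error
median_bounds = rank_bounds(1000, 0.5, 0.2)
print(median_bounds)

# With a relative error of 0 the rank is pinned to a single value
exact_bounds = rank_bounds(1000, 0.5, 0)
print(exact_bounds)
```

Under these assumptions, a 20% relative error allows the reported "median" to be any value ranked between 300 and 700 out of 1,000. This is consistent with the median estimate above moving from 22331.0 to 19490.0 once the relative error was raised.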

### Computing the median for aggregated data

Usually when performing data analysis you will want to find the median of aggregations created from your dataset, rather than just computing the median of entire columns. We will first read in the borough and postcode information from the animal rescue dataset and join these with the population dataset.

```python
borough_df = spark.read.parquet(config["rescue_clean_path"])

borough_df = borough_df.select("borough", F.upper(borough_df["postcodedistrict"]).alias("postcode_district"))

pop_borough_df = borough_df.join(
    pop_df,
    on = "postcode_district",
    how = "left"
)

pop_borough_df.show(10)
```

```
+-----------------+--------------------+----------+
|postcode_district|             borough|population|
+-----------------+--------------------+----------+
|             SE19|             Croydon|     27639|
|             SE25|             Croydon|     34521|
|              SM5|              Sutton|     38291|
|              UB9|          Hillingdon|     14336|
|              RM3|            Havering|     40272|
|             RM10|Barking And Dagenham|     38157|
|              E11|      Waltham Forest|     55128|
|              E12|           Redbridge|     41869|
|              CR0|             Croydon|    153812|
|               E5|             Hackney|     47669|
+-----------------+--------------------+----------+
only showing top 10 rows
```



```r
borough_df <- sparklyr::spark_read_parquet(sc, path = config$rescue_clean_path)

borough_df <- borough_df %>%
    select(borough, postcode_district = postcodedistrict) %>%
    mutate(postcode_district = upper(postcode_district))
    
pop_borough_df <- borough_df %>%
    left_join(pop_df, by = "postcode_district")
    
glimpse(pop_borough_df)
```

Next, we will aggregate the population data across boroughs in the combined `pop_borough_df` and find the median population in each borough.

To achieve this in PySpark, we will wrap the [`percentile_approx`](https://docs.databricks.com/en/sql/language-manual/functions/percentile_approx.html) Hive UDF in a PySpark expression and apply it within an aggregation. The parameters of `percentile_approx` are similar to those of `approxQuantile`, except that you provide an accuracy instead of a relative error, where accuracy is defined as $\frac{1}{\text{relative error}}$. An accuracy of 10000 therefore corresponds to a relative error of 0.0001.

The DAP environment is currently limited to using PySpark version 2.4.0, but there are two functions present in later versions that are useful for calculating quantiles/medians.

- In PySpark 3.1.0, [`percentile_approx`](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.percentile_approx.html) was added to [`pyspark.sql.functions`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html#), which allows you to use the `percentile_approx` Hive UDF in PySpark directly.

``` python
pop_borough_df.groupBy("borough").agg(
    F.percentile_approx("population", 0.5, 10000).alias("median_population")
).show()
```
- In PySpark 3.4.0, [`median`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.median.html#pyspark.sql.functions.median) was added to `pyspark.sql.functions`. This further simplifies computing the median within aggregations, as it requires neither a quantile parameter nor an accuracy parameter.

``` python
pop_borough_df.groupBy("borough").agg(
    F.median("population").alias("median_population")
).show()
```

Close the Spark session

```python
spark.stop()
```

```r
sparklyr::spark_disconnect(sc)
```