Median in Spark

Set up the Spark session and read the config

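```python
from pyspark.sql import SparkSession, functions as F
import yaml

# Start (or reuse) a local Spark session with two worker threads
spark = (SparkSession.builder.master("local[2]")
         .appName("ons-spark")
         .getOrCreate())

# Load the project config, which holds the dataset paths
with open("../../../config.yaml") as f:
    config = yaml.safe_load(f)
```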

```r
library(sparklyr)
library(dplyr)

sc <- sparklyr::spark_connect(
  master = "local[2]",
  app_name = "ons-spark",
  config = sparklyr::spark_config()
)

config <- yaml::yaml.load_file("ons-spark/config.yaml")
```

Read in the animal rescue dataset

```python
rescue_df = spark.read.csv(config["rescue_path_csv"], header=True, inferSchema=True)

rescue_df.printSchema()
```

root
 |-- IncidentNumber: string (nullable = true)
 |-- DateTimeOfCall: string (nullable = true)
 |-- CalYear: integer (nullable = true)
 |-- FinYear: string (nullable = true)
 |-- TypeOfIncident: string (nullable = true)
 |-- PumpCount: double (nullable = true)
 |-- PumpHoursTotal: double (nullable = true)
 |-- HourlyNotionalCost(£): integer (nullable = true)
 |-- IncidentNotionalCost(£): double (nullable = true)
 |-- FinalDescription: string (nullable = true)
 |-- AnimalGroupParent: string (nullable = true)
 |-- OriginofCall: string (nullable = true)
 |-- PropertyType: string (nullable = true)
 |-- PropertyCategory: string (nullable = true)
 |-- SpecialServiceTypeCategory: string (nullable = true)
 |-- SpecialServiceType: string (nullable = true)
 |-- WardCode: string (nullable = true)
 |-- Ward: string (nullable = true)
 |-- BoroughCode: string (nullable = true)
 |-- Borough: string (nullable = true)
 |-- StnGroundName: string (nullable = true)
 |-- PostcodeDistrict: string (nullable = true)

```r
rescue_df <- sparklyr::spark_read_csv(
  sc,
  path = config$rescue_path_csv,
  header = TRUE,
  infer_schema = TRUE
)

sparklyr::sdf_schema(rescue_df)
```
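Calculate an approximate median

The cells above load the data but stop short of the median itself. As a minimal sketch, assuming an approximate median is acceptable, Spark SQL's `percentile_approx` aggregate can be called with a percentage of 0.5. Column names such as `IncidentNotionalCost(£)` contain characters that Spark SQL cannot parse unquoted, so the expression wraps the name in backticks. The helper name `median_sql`, the choice of column, and the `median_cost` alias are illustrative assumptions, not part of the original notebook.

```python
def median_sql(column, accuracy=10000):
    """Build a Spark SQL expression string for the approximate median of `column`.

    `accuracy` is passed through to percentile_approx; larger values trade
    memory for precision (10000 is Spark's documented default).
    """
    # Backticks let Spark SQL parse names containing characters such as ( ) or £;
    # a literal backtick inside the name is escaped by doubling it.
    quoted = "`" + column.replace("`", "``") + "`"
    return f"percentile_approx({quoted}, 0.5, {accuracy})"


# Illustrative column choice, taken from the schema printed above
expr = median_sql("IncidentNotionalCost(£)")
# expr == "percentile_approx(`IncidentNotionalCost(£)`, 0.5, 10000)"

# With `rescue_df` and `F` from the earlier cells, the string could be used as:
# rescue_df.agg(F.expr(expr).alias("median_cost")).show()
```

Building the expression as a string keeps the quoting logic separate from the Spark call, so the same helper could be reused inside a `groupBy(...).agg(...)` to get one median per group.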

Close the Spark session

```python
spark.stop()
```

```r
sparklyr::spark_disconnect(sc)
```