## Using Spark functions in sparklyr

The sparklyr package lets you use dplyr-style functions when working on the cluster with sparklyr DataFrames. The key difference from working with tibbles or base R data frames is that processing happens on the Spark cluster rather than in the CDSW session. This means that you can handle much larger data.

You can also use [Spark functions](https://spark.apache.org/docs/latest/api/sql/index.html) directly in sparklyr, for instance to [change data types](../spark-overview/data-types.html#casting-changing-data-types). One example of a Spark function is [`to_date()`](https://spark.apache.org/docs/latest/api/sql/index.html#to_date). To use a Spark function, wrap it in a relevant `dplyr` command such as [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html) or [`filter()`](https://dplyr.tidyverse.org/reference/filter.html). Note that these functions are not part of an R package, so you can't prefix them with a package name and `::`.

There are a large number of Spark functions and the authors of this article have not verified them all; versioning and implementation differences mean that some might not be available.
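To see why wrapping a Spark function in a `dplyr` command works, it can help to look at the SQL that gets generated. The sketch below is an illustration only: it assumes the `dbplyr` package (which sparklyr builds on for SQL translation) is installed, and it uses a simulated connection rather than a real Spark cluster. The name `lazy_rescue` and the single date value are made up for the example.

```r
library(dplyr)

# A local stand-in for a Spark table (hypothetical, one made-up value).
# No data is processed: dbplyr only builds the SQL text.
lazy_rescue <- dbplyr::lazy_frame(
    date_time_of_call = "01/01/2009 03:01",
    con = dbplyr::simulate_dbi())

# to_date() is not an R function, so it is copied into the SQL unchanged
query <- lazy_rescue %>%
    dplyr::mutate(date_of_call = to_date(date_time_of_call, "dd/MM/yyyy"))

# Render the query as a string and inspect it
generated_sql <- as.character(dbplyr::sql_render(query))
print(generated_sql)
```

Nothing in this step checks that the function exists; that only happens when Spark runs the query, which is why availability depends on your Spark version. On a real sparklyr DataFrame, `dplyr::show_query()` displays the SQL in a similar way.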

Remember: you can't use these functions on a tibble or base R data frame as R cannot interpret them. They can only be processed on the Spark cluster.
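As a quick illustration of this point, the sketch below tries to use `initcap()` on a local tibble. The tibble `local_rescue` and its values are made up for the example and no Spark session is involved. The `tolower()` comparison is just one possible local alternative, not a like-for-like replacement for `initcap()`.

```r
library(dplyr)

# A small local tibble standing in for the Animal Rescue data (made-up values)
local_rescue <- tibble::tibble(animal_group = c("cat", "Cat", "Dog"))

# initcap() is a Spark SQL function, not an R function, so this fails locally;
# tryCatch() captures the error message instead of stopping the script
local_attempt <- tryCatch(
    local_rescue %>% dplyr::filter(initcap(animal_group) == "Cat"),
    error = function(e) conditionMessage(e))

print(local_attempt)

# A base R function works on the tibble, as R can interpret it
local_cats <- local_rescue %>%
    dplyr::filter(tolower(animal_group) == "cat")

print(local_cats)
```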

### Selected practical examples

Set up a Spark session and read the Animal Rescue data:

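```r
library(sparklyr)
library(dplyr)

# Start a local Spark session using two cores
sc <- sparklyr::spark_connect(
    master = "local[2]",
    app_name = "sparklyr-functions",
    config = sparklyr::spark_config())

# Load the config file that holds the data paths
config <- yaml::yaml.load_file("ons-spark/config.yaml")

# Read the Animal Rescue data, keeping only the columns used below
rescue <- sparklyr::spark_read_parquet(sc, config$rescue_path) %>%
    sparklyr::select(date_time_of_call, animal_group, property_category)

# Preview the column names and types
pillar::glimpse(rescue)
```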

#### Cast to date: `to_date()`

[`to_date()`](https://spark.apache.org/docs/latest/api/sql/index.html#to_date) converts a column to the date type, parsing the values with the chosen format. This must be wrapped in a valid `dplyr` command, such as `mutate()`:

```r
rescue <- rescue %>% 
    sparklyr::mutate(date_of_call = to_date(date_time_of_call, "dd/MM/yyyy"))

rescue %>%
    sparklyr::select(date_time_of_call, date_of_call) %>%
    head(5) %>%
    sparklyr::collect()
```

#### Capitalise first letter of each word: `initcap()`

[`initcap()`](https://spark.apache.org/docs/latest/api/sql/index.html#initcap) capitalises the first letter of each word, and can be useful when data cleansing.

In the Animal Rescue data the values in the `animal_group` column do not always begin with a capital letter. In this example, `initcap()` can be combined with `filter()` to return all cats, regardless of case:

```r
cats <- rescue %>% sparklyr::filter(initcap(animal_group) == "Cat")
```

Show that both `"cat"` and `"Cat"` are included in this DataFrame:

```r
cats %>%
    dplyr::group_by(animal_group) %>%
    dplyr::summarise(count = n())
```

#### `concat_ws()`: a Spark version of `paste()`

[`concat_ws()`](https://spark.apache.org/docs/latest/api/sql/index.html#concat_ws) works in a similar way to the base R function [`paste()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/paste.html); the separator is the first argument:

```r
rescue <- rescue %>%
    sparklyr::mutate(animal_property = concat_ws(": ", animal_group, property_category))

rescue %>%
    sparklyr::select(animal_group, property_category, animal_property) %>%
    head(5) %>%
    sparklyr::collect()
```
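For comparison, here is a sketch of the base R equivalent on local vectors. The values are made up for illustration and no Spark session is needed. Note that in `paste()` the separator is the named `sep` argument after the inputs, whereas in `concat_ws()` it comes first.

```r
# Made-up local values, including a missing property category
animal_group <- c("Cat", "Dog", "Bird")
property_category <- c("Dwelling", "Outdoor", NA)

# paste() converts NA to the string "NA" before joining
local_result <- paste(animal_group, property_category, sep = ": ")
print(local_result)
```

The Spark SQL documentation describes `concat_ws()` as skipping null values, so the two functions may give different results where data is missing. It is worth checking this on your own data.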

## Further Resources

Spark at the ONS Articles:
- [Data Types in Spark](../spark-overview/data-types.html#casting-changing-data-types): A common use case for using Spark functions in sparklyr

sparklyr and tidyverse Documentation:
- [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html)
- [`filter()`](https://dplyr.tidyverse.org/reference/filter.html)

R Documentation:
- [`paste()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/paste.html)

Spark SQL Documentation:
- [`to_date`](https://spark.apache.org/docs/latest/api/sql/index.html#to_date)
- [`initcap`](https://spark.apache.org/docs/latest/api/sql/index.html#initcap)
- [`concat_ws`](https://spark.apache.org/docs/latest/api/sql/index.html#concat_ws)