# Part II: Spark SQL and DataFrames

Author: **Julien Peloton** [@JulienPeloton](https://github.com/astrolabsoftware/spark-tutorials/issues/new?body=@JulienPeloton)  
Last Verified to Run: **2018-10-25**  

__Learning objectives__

- A tour of data formats.
- Loading and distributing data: Spark SQL and DataFrames.

## A tour of data formats

There are many data formats used in the context of Big Data: CSV (1978), XML (1996), JSON (2001), Thrift (2007), Protobuf (2008), Avro & SequenceFile (2009), Parquet (2013), ORC (2016), and the list goes on. Some are _naively_ structured, that is, they use a single type to describe the data (e.g. text) without any internal organisation to speed up data access. Others are more complex and highly optimised for big data processing (e.g. Parquet). Unfortunately, those are not the data formats typically chosen by the scientific community. In astronomy, for example, data is more commonly stored in FITS (1981) or HDF5 (1998) format.
FITS and HDF5 are multi-purpose data formats: images, spectra, photon lists, data cubes, or even structured data such as multi-table databases can be efficiently stored and accessed.

The data source API in Apache Spark belongs to the [Spark SQL module](https://spark.apache.org/sql/). If you want to connect a particular data source with Apache Spark, there are essentially two ways:

- [indirect] Access and distribute your files as binary streams (Spark does it natively), and decode the data on-the-fly within executors using third-party libraries.
- [native] Write a custom connector to access, distribute and decode the data natively.

FITS and HDF5, like most scientific data formats, were not originally designed for serialisation (distribution of data over machines), and they often use compression to reduce the size on disk. Needless to say, Spark cannot read them natively out of the box.

The first attempts to connect those data formats with Spark (see e.g. [1] for FITS) used the indirect method above. By reading files as binary streams, the indirect method has access to all the FITS functionality implemented in the underlying user library. This can be an advantage when working with the Python API, for example, which already offers many great scientific libraries. However, this indirect method assumes that each Spark mapper receives and handles one entire file (since the filenames are parallelized, and the entire file data must be reconstructed from binary once the file has been opened by a Spark mapper). Therefore each single file must fit within the memory of a Spark mapper, and the indirect method cannot distribute a dataset made of large FITS files (the 65 GB dataset in [1], for instance, is made of 11,150 files, i.e. roughly 6 MB per file on average). In addition, because each Spark mapper handles one entire file, the indirect method suffers from poor load balancing if the files in the dataset do not all have the same size.
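
The load-balancing issue can be illustrated without Spark at all. The sketch below is only a toy model under stated assumptions: the file sizes, the number of mappers and the helper name `assign_whole_files` are hypothetical, and the greedy scheduling is a simplification of what a real scheduler does. It only shows what happens when files cannot be split.

```python
import heapq


def assign_whole_files(file_sizes_gb, n_mappers):
    """Greedily give each file, largest first, to the least-loaded mapper.

    Files are never split, which mimics the one-file-per-task constraint
    of the indirect method. Returns the total load (in GB) per mapper.
    """
    # Min-heap of (current load, mapper index)
    heap = [(0.0, i) for i in range(n_mappers)]
    heapq.heapify(heap)
    loads = [0.0] * n_mappers
    for size in sorted(file_sizes_gb, reverse=True):
        load, idx = heapq.heappop(heap)
        loads[idx] = load + size
        heapq.heappush(heap, (loads[idx], idx))
    return loads


# Hypothetical dataset: one 40 GB file and twenty 1 GB files, 4 mappers.
sizes = [40.0] + [1.0] * 20
loads = assign_whole_files(sizes, n_mappers=4)

print("load per mapper (GB):", sorted(loads, reverse=True))
print("ideal load per mapper (GB):", sum(sizes) / 4)
```

In this toy case one mapper ends up with the whole 40 GB file while a perfectly balanced run would give 15 GB to each mapper, and that 40 GB file would also have to fit in the memory of a single mapper.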

Fortunately, the low-level layers of Apache Spark are sufficiently well written to allow extending the framework and writing native connectors for any kind of data source. Connectors for FITS and HDF5 were recently made available to the community [2, 3]. With such connectors, good load balancing is guaranteed regardless of the structure of the dataset, and the size of the input files is no longer a problem (a 1 TB dataset made of a thousand 1 GB files or of one single 1 TB file will be viewed as almost the same by a native Spark connector). Note however that the Data Source API is written in Java/Scala, so if there is no library to handle your data source in those languages you must implement one (which is what was done in [2]) or interface with another language.
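
To see why the file layout matters so little for a native connector, here is a minimal arithmetic sketch. It assumes that the connector cuts every file into fixed-size splits, one task per split, and it uses a hypothetical split size of 128 MB. Real connectors must also align splits with record boundaries and handle headers, which is ignored here, and `count_splits` is an illustrative helper, not part of any Spark API.

```python
import math

# Assumed split size in MB; the real value depends on the cluster configuration.
SPLIT_SIZE_MB = 128


def count_splits(file_sizes_mb, split_size_mb=SPLIT_SIZE_MB):
    """Number of fixed-size splits (one task each) needed to cover every file."""
    return sum(math.ceil(size / split_size_mb) for size in file_sizes_mb)


# 1024 is used instead of 1000 to keep the arithmetic exact.
many_files = [1024] * 1024     # 1024 files of 1 GB each (1 TB in total)
single_file = [1024 * 1024]    # one single 1 TB file

print("tasks for many files:", count_splits(many_files))
print("tasks for one file:  ", count_splits(single_file))
```

Under these assumptions both layouts lead to the same number of equally sized tasks, so the work can be spread evenly over the executors in either case.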

Note that the low-level layers dealing with data sources have recently been updated: Apache Spark 2.3 introduced version 2 of the Data Source API. While version 1 is still available and will remain usable for a long time, we expect that all Spark connectors will comply with v2 in the future.

[1] Z. Zhang, K. Barbary, F. A. Nothaft, E. R. Sparks, O. Zahn, M. J. Franklin, D. A. Patterson and S. Perlmutter, Kira: Processing Astronomy Imagery Using Big Data Technology, DOI 10.1109/TBDATA.2016.2599926.  
[2] J. Peloton, C. Arnault and S. Plaszczynski, FITS Data Source for Apache Spark, Computing and Software for Big Science (arXiv:1804.07501). https://github.com/astrolabsoftware/spark-fits  
[3] J. Liu, E. Racah, Q. Koziol and R. S. Canon, H5spark: bridging the I/O gap between Spark and scientific data formats on HPC systems, Cray User Group (2016). https://github.com/valiantljk/h5spark  

## Loading and distributing data: Spark SQL and DataFrames

The interface to read data from disk is always the same for any kind of built-in and officially supported data format:

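```python
# Generic pattern: replace the "<...>" placeholders with real values.
# `spark` is an existing SparkSession.
df = (
    spark.read
    .format("<format>")            # str
    .option("<key>", "<value>")    # one call per option
    # ...
    .option("<key>", "<value>")
    .load("<path>")                # str
)
```

Note that for most data sources, you can use wrappers such as: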

```python
spark.read.csv(path, key1=value1, key2=value2, ...)
```
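
The relation between the generic chain and the wrappers can be sketched with a small helper. This is only an illustration: `read_dataframe` is a hypothetical function name, not part of PySpark, and it assumes that `spark` is an existing `SparkSession`. It relies on `DataFrameReader.options`, which sets several options at once from keyword arguments.

```python
from typing import Any


def read_dataframe(spark: Any, fmt: str, path: str, **options: Any) -> Any:
    """Load `path` through the generic reader interface.

    `spark` is expected to be a SparkSession; `options` are forwarded
    unchanged to the underlying data source.
    """
    return spark.read.format(fmt).options(**options).load(path)
```

With a running session, `read_dataframe(spark, "csv", "data.csv", header=True)` is expected to give the same DataFrame as `spark.read.csv("data.csv", header=True)`, where `data.csv` is a placeholder path.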

### Format

The format can be "csv", "json", "parquet", etc. 

### Options 

The number of options depends on the underlying data source: each one has its own set of options.
In most cases no options are needed, but you might want to explore the different possibilities at some point. Surprisingly, it is not easy to find documentation, and the best approach remains to read the documentation embedded in the source code. In pyspark you can easily access it via the wrappers:

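```python
# DataFrameReader object
df_reader = spark.read

# Doc on reading CSV
# (the trailing `?` is IPython/Jupyter syntax; in a plain Python
# shell, use help(df_reader.csv) instead)
df_reader.csv?
# doc printed

# Doc on reading Parquet
df_reader.parquet?
# doc printed
```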


### Path

The path to the data is 

### Specifying custom connector



## Data structure

Partition, ...
JVM limit: 2G points!

## Going further

Here is a series of useful links on similar topics:

- Spark SQL module: https://spark.apache.org/sql/
- Spark SQL code on GitHub: https://github.com/apache/spark/tree/master/sql
- Spark SQL doc: http://spark.apache.org/docs/latest/sql-programming-guide.html
- Databricks Data Source documentation: https://docs.databricks.com/spark/latest/data-sources/index.html
- Apache Spark Data Source V2 explained in video (Spark Summit 2018): https://databricks.com/session/apache-spark-data-source-v2
- Introducing Apache Spark Data Sources API V2 (IBM): https://developer.ibm.com/code/2018/04/16/introducing-apache-spark-data-sources-api-v2/