<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# Reading Data with Spark 

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Getting Started

Let's start by importing the libraries we need and creating a few useful variables.

In [None]:
%load_ext autotime

import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io

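# S3 client and base location of the training data
s3 = boto3.client('s3')
baseUri = "s3a://quantia-master/training/"

# Must be set before the session is created, so the S3A connector packages are loaded
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (SparkSession.builder
    .master("local[*]")  # run locally, using all available cores
    .appName("test")
    .getOrCreate())
qcutils.init_spark_session(spark)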

## Entry Points

Since Spark 2.0, the entry point for Spark applications is the `SparkSession` class.

In [None]:
spark

## Spark API

### **Spark API Home Page**
0. Open a new browser tab.
0. Search the web for **Spark API Latest**, or **Spark API _x.x.x_** for a specific version.
0. Select **Spark API Documentation - Spark _x.x.x_ Documentation - Apache Spark**.
0. Choose the documentation set that matches the language you are using.

Other Documentation:
* Programming Guides for DataFrames, SQL, Graphs, Machine Learning, Streaming...
* Deployment Guides for Spark Standalone, Mesos, Yarn...
* Configuration, Monitoring, Tuning, Security...

Here are some shortcuts:
  * <a href="https://spark.apache.org/docs/latest/api.html" target="_blank">Spark API Documentation - Latest</a>

### Spark API (Scala)

0. Select **Spark Scala API (Scaladoc)**.
0. Look up the documentation for `org.apache.spark.sql.SparkSession`.
  0. In the upper-left corner, type **SparkSession** into the search field.
  0. The search will execute automatically.
  0. In the class/package list, click on **SparkSession**.
  0. The documentation should open in the right-hand pane.

### Spark API (Python)

0. Select **Spark Python API (Sphinx)**.
0. Look up the documentation for `pyspark.sql.SparkSession`.
  0. In the lower-left corner, type **SparkSession** into the search field.
  0. Hit **[Enter]**.
  0. The search results should appear in the right-hand pane.
  0. Click on **pyspark.sql.SparkSession (Python class, in pyspark.sql module)**
  0. The documentation should open in the right-hand pane.

## SparkSession

Quick member review (names follow the Scala API; not all of them are exposed in Python, for example `createDataset` and `emptyDataset` are Scala/Java only):
* `createDataset(..)`
* `createDataFrame(..)`
* `emptyDataset`
* `emptyDataFrame`
* `range(..)`
* `read`
* `readStream`
* `sparkContext`
* `sqlContext`
* `sql(..)`
* `streams`
* `table(..)`
* `udf`

The member we are most interested in is `read`, which returns a `DataFrameReader`. In Python it is a property, so write `spark.read` without parentheses.

## DataFrameReader

Look up the documentation for `DataFrameReader`.

Quick function review:
* `csv(path)`
* `jdbc(url, table, ..., connectionProperties)`
* `json(path)`
* `format(source)`
* `load(path)`
* `orc(path)`
* `parquet(path)`
* `table(tableName)`
* `text(path)`
* `textFile(path)` (Scala/Java only)

Configuration methods:
* `option(key, value)`
* `options(map)`
* `schema(schema)`

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.