<h1 align="center"> DataFrames</h1>


# Table of Contents


- [I. Catalyst Optimizer refresh](#Catalyst-Optimizer-refresh)

- [II. Speeding up PySpark with DataFrames](#Speeding-up-PysPark-with-DataFrames)

- [III. Creating DataFrames](#Creating-DataFrames)

- [IV. Simple DataFrame queries](#Simple-DataFrame-queries)

- [V. Interoperating with RDDs](#Interoperating-with-RDDs)

- [VI. Querying with the DataFrame API](#Querying-with-the-DataFrame-API)

- [VII. Querying with SQL](#Querying-with-SQL)

- [VIII. DataFrame scenario - on-time flight performance](#DataFrame-scenario-on-time-flight-performance)

- [IX. Spark Dataset API](#Spark-Dataset-API)

A DataFrame is an immutable, distributed collection of data organized into named columns, analogous to a table in a relational database.


## Python to RDD communications

Whenever a PySpark program is executed using RDDs, running the job can incur a potentially large overhead. As noted in the following diagram, in the PySpark driver, the ```SparkContext``` uses Py4J to launch a JVM using the ```JavaSparkContext```. Any RDD transformations are initially mapped to ```PythonRDD``` objects in Java.

Once these tasks are pushed out to the Spark Worker(s), PythonRDD objects launch Python subprocesses using pipes to send both code and data to be processed within Python:

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_03_01.jpg)


While this approach allows PySpark to distribute the processing of the data to multiple Python subprocesses on multiple workers, as you can see, there is a lot of context switching and communications overhead between Python and the JVM.



# Catalyst Optimizer refresh

One of the primary reasons the Spark SQL engine is so fast is the Catalyst Optimizer. The following diagram looks similar to the logical/physical planner and cost model/cost-based optimization of a relational database management system (RDBMS):

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_03_02.jpg)

The significance of this is that, rather than processing the query immediately, the Spark engine's Catalyst Optimizer compiles and optimizes a logical plan, generates candidate physical plans, and uses a cost optimizer to select the most efficient one.
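One way to observe these stages is to ask Spark to print the plans for a query. The following is a minimal sketch rather than a definitive recipe: the function name `show_query_plans`, the sample rows, and the filter are assumptions made for illustration, and `spark` is assumed to be an existing SparkSession (such as the one a notebook provides).

```python
# Hypothetical sample rows (id, name, age, eyeColor), used only for illustration.
SAMPLE_ROWS = [
    ("123", "Katie", 19, "brown"),
    ("234", "Michael", 22, "green"),
    ("345", "Simone", 23, "blue"),
]


def show_query_plans(spark):
    """Build a small query and print the plans Catalyst produces for it.

    `spark` is assumed to be an existing SparkSession.
    """
    df = spark.createDataFrame(SAMPLE_ROWS, ["id", "name", "age", "eyeColor"])

    # Transformations are lazy: this only describes the query, nothing runs yet.
    query = df.filter("age > 20").groupBy("eyeColor").count()

    # explain(True) prints the parsed, analyzed, and optimized logical plans,
    # followed by the physical plan selected for execution.
    query.explain(True)
    return query
```

Calling `show_query_plans(spark)` in a notebook prints the plans without executing the aggregation itself.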

# Speeding up PysPark with DataFrames
[back to top](#Table-of-Contents)

The significance of DataFrames and the Catalyst Optimizer (and Project Tungsten) is the increase in performance of PySpark queries when compared to non-optimized RDD queries. As shown in the following figure, prior to the introduction of DataFrames, Python queries were often twice as slow as the same Scala queries using RDDs. Typically, this slowdown was due to the communications overhead between Python and the JVM:

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_03_03.jpg)

*Source: Introducing DataFrames in Apache Spark for Large Scale Data Science at http://bit.ly/2blDBI1*


Python can take advantage of the performance optimizations in Spark even though the codebase for the Catalyst Optimizer is written in Scala. The PySpark DataFrame API is essentially a Python wrapper of approximately 2,000 lines of code, which allows PySpark DataFrame queries to be significantly faster than their RDD counterparts. Altogether, Python DataFrames (as well as SQL, Scala DataFrames, and R DataFrames) are all able to make use of the Catalyst Optimizer, as shown in the following updated diagram:
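To make the contrast concrete, here is a hedged sketch of the same count written both ways. The function names and sample rows are hypothetical, and `sc` and `spark` are assumed to be an existing SparkContext and SparkSession. In the RDD version the lambdas are shipped to Python worker processes; in the DataFrame version Python only describes the query, which is then planned and executed on the JVM side.

```python
# Hypothetical sample rows (id, name, age, eyeColor), used only for illustration.
SWIMMERS = [
    ("123", "Katie", 19, "brown"),
    ("234", "Michael", 22, "green"),
    ("345", "Simone", 23, "blue"),
]


def count_by_eye_color_rdd(sc):
    """RDD version: each lambda runs inside Python worker subprocesses."""
    pairs = (
        sc.parallelize(SWIMMERS)
        .map(lambda row: (row[3], 1))      # (eyeColor, 1)
        .reduceByKey(lambda a, b: a + b)   # sum the ones per eye color
    )
    return dict(pairs.collect())


def count_by_eye_color_df(spark):
    """DataFrame version: the query is described in Python, executed by the engine."""
    df = spark.createDataFrame(SWIMMERS, ["id", "name", "age", "eyeColor"])
    rows = df.groupBy("eyeColor").count().collect()
    # Use row["count"]; row.count would resolve to the tuple method instead.
    return {row["eyeColor"]: row["count"] for row in rows}
```

Both functions should return the same mapping from eye color to count; only the place where the work happens differs.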

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_03_04.jpg)

# Creating DataFrames
[back to top](#Table-of-Contents)

## Generate our own DataFrame

Instead of accessing the file system, let's create a DataFrame by generating the data. In this case, we'll first create the ```stringJSONRDD``` RDD and then convert it into a DataFrame by reading it with ```spark.read.json```.

In [5]:

    # Generate our own JSON data.
    # This way we don't have to access the file system yet.
    stringJSONRDD = sc.parallelize(("""
      { "id": "123",
        "name": "Katie",
        "age": 19,
        "eyeColor": "brown"
      }""",
      """{
        "id": "234",
        "name": "Michael",
        "age": 22,
        "eyeColor": "green"
      }""",
      """{
        "id": "345",
        "name": "Simone",
        "age": 23,
        "eyeColor": "blue"
      }""")
    )

In [6]:

    # Create the DataFrame; Spark infers the schema from the JSON records
    swimmersJSON = spark.read.json(stringJSONRDD)

In [7]:

    # Create a temporary view so the DataFrame can be queried with SQL
    swimmersJSON.createOrReplaceTempView("swimmersJSON")

In [8]:

    # SQL query
    spark.sql("select * from swimmersJSON").collect()

Below is the DAG visualization for the job above.
![](https://i.imgur.com/Zpe3pOL.png)

In [10]:

    %sql
    -- Query Data
    select * from swimmersJSON

In the example above, the Spark engine inferred the schema from the JSON data. Alternatively, we can programmatically apply the ```schema``` instead of allowing the Spark engine to infer it via reflection.
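A minimal sketch of specifying a schema programmatically might look as follows. The function name `build_swimmers_with_schema`, the row values, and the chosen column types are assumptions for illustration, and `spark` is assumed to be an existing SparkSession.

```python
# Hypothetical rows matching the columns (id, name, age, eyeColor).
SWIMMER_ROWS = [
    (123, "Katie", 19, "brown"),
    (234, "Michael", 22, "green"),
    (345, "Simone", 23, "blue"),
]


def build_swimmers_with_schema(spark):
    """Create a DataFrame with an explicit schema instead of an inferred one."""
    # Imported inside the function so that merely defining it does not require PySpark.
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    # Each StructField declares a column name, its type, and whether nulls are allowed.
    schema = StructType([
        StructField("id", LongType(), nullable=False),
        StructField("name", StringType(), nullable=True),
        StructField("age", LongType(), nullable=True),
        StructField("eyeColor", StringType(), nullable=True),
    ])
    return spark.createDataFrame(SWIMMER_ROWS, schema)
```

Calling `build_swimmers_with_schema(spark).printSchema()` would then show the declared column types rather than inferred ones.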

Additional Resources include:

- [PySpark API Reference](https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html)
- [Spark SQL, DataFrames, and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema): This is in reference to Programmatically Specifying the Schema using a CSV file.


### SparkSession

We're no longer using ```sqlContext.read...``` but instead ```spark.read...```. This is because, as of Spark 2.0, the ```SparkSession``` (exposed in the notebook as ```spark```) is the unified entry point: it subsumes ```SQLContext``` and ```HiveContext``` and wraps the ```SparkContext```. The session covers:

- Entry point for reading data
- Working with metadata
- Configuration
- Cluster resource management
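The roles listed above map onto attributes of the session object. The following is a minimal sketch, assuming an existing SparkSession named `spark`; the function name `inspect_session` and the setting value are made up for illustration.

```python
def inspect_session(spark):
    """Touch several roles of a SparkSession and return what was found.

    `spark` is assumed to be an existing SparkSession. Reading data goes
    through spark.read, as in the spark.read.json call shown earlier.
    """
    # Configuration: runtime SQL settings can be set and read back via spark.conf.
    spark.conf.set("spark.sql.shuffle.partitions", "8")  # arbitrary example value

    return {
        "version": spark.version,
        # Metadata: the catalog lists tables and temporary views.
        "tables": [table.name for table in spark.catalog.listTables()],
        "shuffle_partitions": spark.conf.get("spark.sql.shuffle.partitions"),
        # The underlying SparkContext remains reachable for cluster-level details.
        "default_parallelism": spark.sparkContext.defaultParallelism,
    }
```

In a notebook, `inspect_session(spark)` returns a plain dictionary that can be printed or logged.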

For more information, please refer to [How to use SparkSession in Apache Spark 2.0](http://bit.ly/2br0Fr1).