<h1 align="center"> DataFrames</h1>


# Table of Contents


- [I. Catalyst Optimizer refresh](#Catalyst-Optimizer-refresh)

- [II. Speeding up PySpark with DataFrames](#Speeding-up-PysPark-with-DataFrames)

- [III. Creating DataFrames](#Creating-DataFrames)

- [IV. Simple DataFrame queries](#Simple-DataFrame-queries)

- [V. Interoperating with RDDs](#Interoperating-with-RDDs)

- [VI. Querying with the DataFrame API](#Querying-with-the-DataFrame-API)

- [VII. Querying with SQL](#Querying-with-SQL)

- [VIII. DataFrame scenario - on-time flight performance](#DataFrame-scenario-on-time-flight-performance)

- [IX. Spark Dataset API](#Spark-Dataset-API)

A DataFrame is an immutable distributed collection of data that is organized into named columns analogous to a table in a relational database. 


## Python to RDD communications

Whenever a PySpark program is executed using RDDs, there is a potentially large overhead to execute the job. As noted in the following diagram, in the PySpark driver, the Spark Context uses Py4j to launch a JVM using the JavaSparkContext. Any RDD transformations are initially mapped to PythonRDD objects in Java.

Once these tasks are pushed out to the Spark Worker(s), PythonRDD objects launch Python subprocesses using pipes to send both code and data to be processed within Python:

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_03_01.jpg)


While this approach allows PySpark to distribute the processing of the data to multiple Python subprocesses on multiple workers, as you can see, there is a lot of context switching and communications overhead between Python and the JVM.



# Catalyst Optimizer refresh

One of the primary reasons the Spark SQL engine is so fast is the Catalyst Optimizer. The following diagram will look familiar if you have worked with the logical/physical planner and cost model/cost-based optimization of a relational database management system (RDBMS):

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_03_02.jpg)

The significance of this is that, rather than processing the query immediately, the Spark engine's Catalyst Optimizer compiles and optimizes a logical plan, and then uses a cost model to select the most efficient of the physical plans it generates.

# Speeding up PysPark with DataFrames
[back to top](#Table-of-Contents)

The significance of DataFrames and the Catalyst Optimizer (and Project Tungsten) is the increase in performance of PySpark queries when compared to non-optimized RDD queries. As shown in the following figure, prior to the introduction of DataFrames, Python queries were often twice as slow as the same Scala queries using RDDs. Typically, this slowdown in query performance was due to the communications overhead between Python and the JVM:

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_03_03.jpg)

*Source: Introducing DataFrames in Apache Spark for Large Scale Data Science at http://bit.ly/2blDBI1*


Python can take advantage of the performance optimizations in Spark even though the codebase for the Catalyst Optimizer is written in Scala. In essence, the PySpark DataFrame API is a Python wrapper of approximately 2,000 lines of code, which allows PySpark DataFrame queries to run significantly faster. Altogether, Python DataFrames (as well as SQL, Scala DataFrames, and R DataFrames) are all able to make use of the Catalyst Optimizer (as per the following updated diagram):

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_03_04.jpg)

# Creating DataFrames
[back to top](#Table-of-Contents)

## Generate our own DataFrame

Instead of accessing the file system, let's create a DataFrame by generating the data. In this case, we'll first create the ```stringJSONRDD``` RDD and then convert it into a DataFrame by reading it with ```spark.read.json```.

In [5]:
# Generate our own JSON data 
#   This way we don't have to access the file system yet.
stringJSONRDD = sc.parallelize((""" 
  { "id": "123",
    "name": "Katie",
    "age": 19,
    "eyeColor": "brown"
  }""",
   """{
    "id": "234",
    "name": "Michael",
    "age": 22,
    "eyeColor": "green"
  }""", 
  """{
    "id": "345",
    "name": "Simone",
    "age": 23,
    "eyeColor": "blue"
  }""")
)

In [6]:
# create DataFrame
swimmersJSON = spark.read.json(stringJSONRDD)

![Imgur](http://i.imgur.com/PmQrksQ.png)

In [8]:
# Create temporary table
swimmersJSON.createOrReplaceTempView("swimmersJSON")

# Simple DataFrame queries
[back to top](#Table-of-Contents)

Now that we have created the swimmersJSON DataFrame, we will be able to run the DataFrame API, as well as SQL queries against it. Let's start with a simple query showing all the rows within the DataFrame.


## DataFrame API query

To do this using the DataFrame API, we can use the ```show(<n>)``` method, which prints the first n rows to the console:

In [10]:
# DataFrame API
swimmersJSON.show()

## SQL query

In [12]:
# SQL Query
spark.sql("select * from swimmersJSON").collect()

Below is the DAG visualization for the job above.
![](https://i.imgur.com/Zpe3pOL.png)

In [14]:
%sql
-- Query Data
select * from swimmersJSON

# Interoperating with RDDs

There are two different methods for converting existing RDDs to DataFrames (or Datasets[T]): 
- inferring the schema using reflection, or 
- programmatically specifying the schema. 

The former allows you to write more concise code (when your Spark application already knows the schema), while the latter allows you to construct DataFrames when the columns and their data types are only revealed at run time. 

**Note:** reflection is in reference to schema reflection as opposed to Python reflection.

### Inferring the Schema using Reflection

In the process of building the DataFrame and running the queries, we skipped over the fact that the schema for this DataFrame was automatically defined. Initially, ```Row``` objects are constructed by passing a list of key/value pairs as ```**kwargs``` to the ```Row``` class. Then, Spark SQL converts this RDD of ```Row``` objects into a DataFrame, where the keys are the columns and the data types are inferred by sampling the data.

Apache Spark is inferring the schema using reflection; that is, it automatically determines the schema of the data by reviewing the JSON data.

In [17]:
# Print the schema
swimmersJSON.printSchema()

Notice that Spark was able to infer the schema (as shown when reviewing it using .printSchema).
But what if we want to programmatically specify the schema?

### Programmatically Specifying the Schema 

In this case, let's specify the schema for a CSV text file.

In [19]:
from pyspark.sql.types import *

# Generate our own CSV data
stringCSVRDD = sc.parallelize([(123, 'Katie', 19, 'brown'), 
                               (234, 'Michael', 22, 'green'), 
                               (345, 'Simone', 23, 'blue')
                              ])



First, we will encode the column names as a string, per the ```schemaString``` variable below. Then we will define the schema using ```StructType``` and ```StructField```:

In [21]:

# The column names are encoded in a string; the schema itself is defined with StructType and StructField from pyspark.sql.types
schemaString = "id name age eyeColor"
schema = StructType([
    StructField("id", LongType(), True),    
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
    StructField("eyeColor", StringType(), True)
])

Note, the ```StructField``` class is broken down in terms of:
- ```name```: The name of this field
- ```dataType```: The data type of this field
- ```nullable```: Indicates whether values of this field can be null

Finally, we will apply the schema (```schema```) we created to the ```stringCSVRDD``` RDD (that is, the generated CSV data) and create a temporary view so we can query it using SQL:

In [23]:

# Apply the schema to the RDD and Create DataFrame
swimmers = spark.createDataFrame(stringCSVRDD, schema)


# Creates a temporary view using the DataFrame
swimmers.createOrReplaceTempView("swimmers")

With this example, we have finer-grained control over the schema and can specify that ```id``` is a ```long``` (as opposed to a string in the previous section):

In [25]:
# Print the schema
# we have redefined id as Long (instead of String)
swimmers.printSchema()

In [26]:
%sql 
-- Query the data
select * from swimmers

As you can see from above, we can programmatically apply the ```schema``` instead of allowing the Spark engine to infer the schema via reflection.

Additional Resources include:

- [PySpark API Reference](https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html)
- [Spark SQL, DataFrames, and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema): This is in reference to Programmatically Specifying the Schema using a CSV file.


### SparkSession

We're no longer using ```sqlContext.read...``` but instead ```spark.read...```. This is because, as of Spark 2.0, the functionality of ```SQLContext``` and ```HiveContext``` has been consolidated into the ```SparkSession``` (available here as ```spark```), which also wraps the underlying ```SparkContext```. The ```SparkSession``` is now the single entry point for:

- Reading data
- Working with metadata
- Configuration
- Cluster resource management

For more information, please refer to How to use SparkSession in Apache Spark 2.0 (http://bit.ly/2br0Fr1).

# Querying with the DataFrame API
To start, we can use ```collect()```, ```show()```, or ```take()``` to view the data within our DataFrame (the last two let you limit the number of rows returned).

## Number of rows
To get the number of rows within our DataFrame, we can use the ```count()``` method:

In [29]:
swimmers.count()

## Running filter statements

To run a filter statement, you can use the ```filter``` clause; in the following code snippet, we are using the ```select``` clause to specify the columns to be returned as well:

In [31]:
# Get the id, age where age = 22
swimmers.select("id", "age").filter("age = 22").show()

# or 
swimmers.select(swimmers.id, swimmers.age).filter(swimmers.age == 22).show()

![Imgur](http://i.imgur.com/o3GmKQD.png)

In [33]:
# Get the name, eyecolor where eyecolor like 'b%'
swimmers.select("name", "eyecolor").filter("eyecolor like 'b%'").show()

![Imgur](http://i.imgur.com/WdBxuXF.png)

# On-Time Flight Performance

This scenario queries flight departure delays by state and city by joining the departure delay data to the airport codes (to identify the state and city).


## DataFrame Queries
Let's run a flight performance analysis using DataFrames; we'll first build the DataFrames from the source datasets.

## Preparing the source datasets

We will first process the source airports and flight performance datasets by specifying their file path location and importing them using SparkSession:

In [36]:
# Set File Paths
flightPerfFilePath = "/databricks-datasets/flights/departuredelays.csv"
airportsFilePath = "/databricks-datasets/flights/airport-codes-na.txt"

# Obtain Airports dataset
airports = spark.read.csv(airportsFilePath, header="true", 
                          inferSchema="true", sep="\t")
airports.createOrReplaceTempView("airports")

# Obtain Departure Delays dataset
flightPerf = spark.read.csv(flightPerfFilePath, header="true")
flightPerf.createOrReplaceTempView("FlightPerformance")


# Cache the Departure Delays dataset
flightPerf.cache()

## Joining flight performance and airports

One of the more common tasks with DataFrames/SQL is to join two different datasets; it is often one of the more demanding operations (from a performance perspective). With DataFrames, a lot of the performance optimizations for these joins are included by default:

In [38]:
# Query sum of flight delays by city and origin code (for Washington State)
spark.sql("select a.City, f.origin, sum(f.delay) as Delays from FlightPerformance f join airports a on a.IATA = f.origin where a.State = 'WA' group by a.City, f.origin order by sum(f.delay) desc").show()

In our scenario, we are querying the total delays by city and origin code for the state of Washington. This will require joining the flight performance data with the airports data by International Air Transport Association (IATA) code. The output of the query is as follows:

![](https://i.imgur.com/WZbV1sy.png)

![](https://i.imgur.com/dTd0cgZ.png)

In [41]:
%sql
-- Query Sum of Flight Delays by City and Origin Code (for Washington State)
select a.City, f.origin, sum(f.delay) as Delays
 from FlightPerformance f
    join airports a
      on a.IATA = f.origin
        
where a.State = 'WA'
group by a.City, f.origin
order by sum(f.delay) desc

## Visualizing our flight-performance data

Let's continue visualizing our data, but broken down by all states in the continental US:

In [43]:
%sql
-- Query Sum of Flight Delays by State (for the US)
select a.State, sum(f.delay) as Delays  
  from FlightPerformance f    
    join airports a      
      on a.IATA = f.origin 
where a.Country = 'USA'
group by a.State

![](https://i.imgur.com/7uxbRfu.png)

![](https://i.imgur.com/wn06vwB.png)

In [45]:
%sql
-- Query Sum of Flight Delays by State (for Canada)
select a.State, sum(f.delay) as Delays  
  from FlightPerformance f    
    join airports a      
      on a.IATA = f.origin 
where a.Country = 'Canada'
group by a.State

In [46]:
%sql
-- Query Sum of Flight Delays by City and Origin Code (for Illinois)
select a.City, f.origin, sum(f.delay) as Delays
 from FlightPerformance f
    join airports a
      on a.IATA = f.origin
        
where a.State = 'IL'
group by a.City, f.origin
order by sum(f.delay) desc