In [3]:
# Overview of Structured API Execution
# The steps, in order:
# 1. Write DataFrame/Dataset/SQL code.
# 2. If the code is valid, Spark converts it to a Logical Plan.
# 3. Spark transforms this Logical Plan to a Physical Plan, checking for optimizations along the way.
# 4. Spark then executes this Physical Plan (RDD manipulations) on the cluster.

# Execution starts with user code, which is submitted to Spark either through the console or as a submitted job.
# The code then passes through the Catalyst Optimizer, which decides how it should be executed and lays out a plan for doing so.
# Finally, the code is run and the result is returned to the user.

In [4]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Overview of Structured API Execution").getOrCreate()

22/10/17 09:36:49 WARN Utils: Your hostname, HP-G62 resolves to a loopback address: 127.0.1.1; using 192.168.18.113 instead (on interface enp3s0)
22/10/17 09:36:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/10/17 09:36:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/17 09:36:52 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
22/10/17 09:36:52 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
22/10/17 09:36:52 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
22/10/17 09:36:52 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
22/10/17 09:36:52 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
22/10/17 09:36:52 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
22/10/17 09:36:52 WARN Utils: Service 'SparkUI' could not bind on port 4046. Attempting port 4047.
22/10/17 09:36:52 WARN Utils: Service 'SparkUI' could not bind on port 4047. Attempting port 4048.
22/10/17 09:36:52 WARN Utils: Service 'SparkUI' could not bind on

In [None]:
# Logical Planning
# The first phase of execution takes user code and converts it into a logical plan.
# This logical plan represents only a set of abstract transformations that do not refer to executors or
# drivers; its sole purpose is to convert the user's set of expressions into the most optimized version.
# Spark does this by first converting user code into an unresolved logical plan.
# The plan is unresolved because, although your code might be valid, the tables or columns
# that it refers to might or might not exist. Spark uses the catalog, a repository of all table and
# DataFrame information, to resolve columns and tables in the analyzer. The analyzer might reject the
# unresolved logical plan if the required table or column name does not exist in the catalog. If the
# analyzer can resolve it, the result is passed through the Catalyst Optimizer, a collection of rules
# that attempt to optimize the logical plan by pushing down predicates or selections. Packages can
# extend Catalyst to include their own rules for domain-specific optimizations.

In [None]:
# Physical Planning
# After successfully creating an optimized logical plan, Spark then begins the physical planning
# process. The physical plan, often called a Spark plan, specifies how the logical plan will execute
# on the cluster by generating different physical execution strategies and comparing them through
# a cost model, as depicted in Figure 4-3. An example of the cost comparison might be choosing
# how to perform a given join by looking at the physical attributes of a given table (how big the
# table is or how big its partitions are).
# Physical planning results in a series of RDDs and transformations. This result is why you might
# have heard Spark referred to as a compiler—it takes queries in DataFrames, Datasets, and SQL
# and compiles them into RDD transformations for you.

In [None]:
# Execution
# Upon selecting a physical plan, Spark runs all of this code over RDDs, the lower-level
# programming interface of Spark. Spark performs further
# optimizations at runtime, generating native Java bytecode that can remove entire tasks or stages
# during execution. Finally, the result is returned to the user.