In [7]:
# Structured API Overview  
#         Datasets
#         DataFrames
#         SQL tables and views

# A Spark Job
# Stages
# Tasks
# Rows 
# Columns
# Spark Types
# Directed Acyclic graph

In [8]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark Types").getOrCreate()

In [3]:
# DataFrames and Datasets
# Spark has two notions of structured collections: 
#     DataFrames and
#     Datasets. 

# DataFrames and Datasets are (distributed) table-like collections with well-defined rows and
# columns. 
# Each column must have the same number of rows as all the other columns, and each column has
# type information that must be consistent for every row in the collection.
# In Spark, DataFrames and Datasets represent immutable, lazily evaluated plans that specify
# what operations to apply to data residing at a location to generate some output.
# When we perform an action on a DataFrame, we instruct Spark to perform the actual
# transformations and return the result.
# These plans describe how to manipulate rows and columns to compute the user’s desired result.

In [4]:
# Schemas
# A schema defines the column names and types of a DataFrame.
# We can define schemas manually or read a schema from a data source (often called schema on read).
# Schemas consist of types, meaning that we need a way of specifying which type belongs to which column.

In [12]:
# Overview of Structured Spark Types
# Spark is effectively a programming language of its own. 
# Internally, Spark uses an engine called Catalyst that maintains its own type information through the planning
# and processing of work. 
# Doing so opens up a wide variety of execution optimizations that can make a significant difference in performance.
# Spark types map directly to the different language APIs that Spark maintains and
# there exists a lookup table for each of these in Scala, Java, Python, SQL, and R. 
# Even if we use Spark’s Structured APIs from Python or R, the majority of our manipulations will operate strictly
# on Spark types, not Python types. 
# For example, the following code does not perform addition in Scala or Python; 
# it actually performs addition purely in Spark:
    
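# in Python
df = spark.range(500).toDF("number")
# select() returns a new DataFrame with the column "(number + 10)"; df itself is unchanged.
# The result is discarded here because no action is called on it.
df.select(df["number"] + 10)
# Shows the original "number" column, not the sum.
df.show()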

+------+
|number|
+------+
|     0|
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
|    10|
|    11|
|    12|
|    13|
|    14|
|    15|
|    16|
|    17|
|    18|
|    19|
+------+
only showing top 20 rows



In [None]:
# DataFrame versus Dataset
# https://phoenixnap.com/kb/rdd-vs-dataframe-vs-dataset#:~:text=DataFrames%20are%20a%20SparkSQL%20data,and%20the%20convenience%20of%20RDDs.


In [None]:
# A Spark Job
# Spark Jobs: A job is a series of stages.
# As soon as an action such as collect() or count() is triggered, the driver program,
# which launches the Spark application and is its entry point, turns the work needed
# for that action into a single job (illustrated by a figure in the article linked below).
# https://www.analyticsvidhya.com/blog/2022/09/all-about-spark-jobs-stages-and-tasks/

In [None]:
# Stages
# https://www.analyticsvidhya.com/blog/2022/09/all-about-spark-jobs-stages-and-tasks/
# Whenever data must be shuffled over the network, Spark divides the job into multiple stages.
# A new stage therefore begins at each shuffle boundary.

# Stages are processed in parallel or sequentially, depending on the dependencies between them.
# If there are two stages, Stage 0 and Stage 1, and neither depends on the other, they can be executed in parallel.

# The sequential processing of RDDs in a single stage is called pipelining.

# The example code in the article linked above uses the reduceByKey() function, which shuffles the data in order to group identical keys.
# Since the data is shuffled only once, that job is divided into two stages (shown in a figure in the article).

# There are two types of stages in Spark:

# 1. ShuffleMapStage

# 2. ResultStage

# 1. ShuffleMapStage
# As the name suggests, this type of stage produces data for a shuffle operation.
# Its output acts as the input for the stages that follow.
# In the article's example, Stage 0 is the ShuffleMapStage, since it produces the shuffle data
# that Stage 1 takes as input.

# 2. ResultStage
# The final stage in a job executes an action by running a function on an RDD (in the article's example, the action is collect()).
# It computes the result of the action.
# Stage 1 in the example is the ResultStage, since it produces the result of the action performed on the RDD.
# After the shuffle, reduceByKey() has grouped identical keys, so this stage returns the final result through collect().

# A stage is in turn a group of tasks executed together. The next section describes what a task is.

In [26]:
# Tasks
# https://www.analyticsvidhya.com/blog/2022/09/all-about-spark-jobs-stages-and-tasks/
# A task is a single unit of computation performed on a single data partition.
# It is computed on a single core of the worker node.
# Whenever Spark performs a computation such as a transformation, it executes one task per partition of the data.
# The article's example has two partitions of data and therefore two tasks.
# Each task performs the same operation on a different partition, in parallel, on a different core of the worker node.

In [None]:
# Some important points to note:
# https://www.analyticsvidhya.com/blog/2022/09/all-about-spark-jobs-stages-and-tasks/
# 1. The cluster manager assigns each worker node the resources it needs to execute tasks.
# 2. A core is the CPU’s computation unit; the number of cores controls how many tasks an executor
#     can run concurrently. For example, with 3 cores an executor can run at most 3 tasks simultaneously.
# 3. Executors are responsible for executing the individual tasks. How many tasks an executor processes in parallel depends on the number of cores assigned to it, as mentioned in the second point.
# 4. Each worker node has cache memory for storage, and as soon as a result is computed, it is sent to the driver program.

# This is how a Spark application is converted into jobs, which are further divided into stages and tasks.

In [13]:
# Rows 
# A row is nothing more than a record of data. 
# Each record in a DataFrame must be of type Row, as we can see when we collect the following DataFrame.
# We can create rows from SQL, from Resilient Distributed Datasets (RDDs), from data sources, or manually from scratch.
# Here, we create a DataFrame of rows by using a range:
spark.range(2).collect()

[Row(id=0), Row(id=1)]

In [24]:
# Columns
# Columns represent a simple type like an integer or string, a complex type like an array or map, or a null value. 
# Spark tracks all of this type information for you and offers a variety of ways to transform columns.
# There are many ways to construct and refer to columns, but the two simplest are
# the col and column functions. To use either of these functions, pass in a column
# name:

from pyspark.sql.functions import col, column
col("number")


Column<'number'>

In [25]:
column("number")

Column<'number'>

In [16]:
# Spark Types
# We mentioned earlier that Spark has a large number of internal type representations. 
# The Spark SQL data types documentation includes reference tables showing which type in each language lines up with each Spark type.
# First, let’s talk about how we instantiate, or declare, a column to be of a certain type.


In [19]:
# Python types at times have certain requirements.
# To work with the correct Python types, import them from pyspark.sql.types:
from pyspark.sql.types import ByteType
b = ByteType()


In [20]:
b

ByteType()

In [None]:
# Directed Acyclic graph