Reference repos:

https://github.com/databricks/LearningSparkV2

https://github.com/RodrigoLima82/spark-certification

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = (
    SparkSession.builder
    .appName("spark-notations")
    .getOrCreate()
)

In [2]:
from pyspark import SparkContext
# Reuse the SparkContext that already backs the SparkSession
# instead of creating a second one with SparkContext()
sc = SparkContext.getOrCreate()

## Spark: What’s Underneath an RDD?

In [3]:
#In Python
# Create an RDD of tuples (name, age)
dataRDD = sc.parallelize([("Brooke", 20), ("Denny", 31), ("Jules", 30),
("TD", 35), ("Brooke", 25)])
# Use map and reduceByKey transformations with their lambda
# expressions to aggregate and then compute average
agesRDD = (dataRDD
.map(lambda x: (x[0], (x[1], 1)))
.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
.map(lambda x: (x[0], x[1][0]/x[1][1])))

dfnew = spark.createDataFrame(agesRDD)
dfnew.show()

+------+----+
|    _1|  _2|
+------+----+
|Brooke|22.5|
| Denny|31.0|
|    TD|35.0|
| Jules|30.0|
+------+----+
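
To see what the (sum, count) pairs in the RDD version are doing, here is a minimal plain-Python sketch of the same three steps, run on a local list instead of an RDD. The helper `reduce_by_key` is hypothetical (it is not a Spark API); it only mimics how `reduceByKey` folds the values that share a key.

```python
def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for RDD.reduceByKey: fold values per key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

data = [("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)]

# Step 1 (map): attach a count of 1 to every age -> (name, (age, 1))
with_counts = [(name, (age, 1)) for name, age in data]

# Step 2 (reduceByKey): add up ages and counts per name -> (name, (sum, count))
totals = reduce_by_key(with_counts, lambda x, y: (x[0] + y[0], x[1] + y[1]))

# Step 3 (map): divide sum by count to get the average age per name
averages = {name: total / count for name, (total, count) in totals}
print(averages)
```

The two Brooke rows collapse into (45, 2), which is where the 22.5 in the table comes from.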



In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
# Create a DataFrame
data_df = spark.createDataFrame([("Brooke", 20), ("Denny", 31), ("Jules", 30),("TD", 35), ("Brooke", 25)], ["name", "age"])
# Group the same names together, aggregate their ages, and compute an average
avg_df = data_df.groupBy("name").agg(avg("age"))
# Show the results of the final execution
avg_df.show()

+------+--------+
|  name|avg(age)|
+------+--------+
|Brooke|    22.5|
| Denny|    31.0|
| Jules|    30.0|
|    TD|    35.0|
+------+--------+



Two ways to define a schema

In [5]:
# In Python
from pyspark.sql.types import *
schema = StructType([StructField("author", StringType(), False),
StructField("title", StringType(), False),
StructField("pages", IntegerType(), False)])

In [6]:
schema = "author STRING, title STRING, pages INT"

In [7]:
from pyspark.sql import SparkSession
# Define schema for our data using DDL
schema = "`Id` INT, `First` STRING, `Last` STRING, `Url` STRING,`Published` STRING, `Hits` INT, `Campaigns` ARRAY<STRING>"
# Create our static data
data = [[1, "Jules", "Damji", "https://tinyurl.1", "1/4/2016", 4535, ["twitter","LinkedIn"]],
[2, "Brooke","Wenig", "https://tinyurl.2", "5/5/2018", 8908, ["twitter",
"LinkedIn"]],
[3, "Denny", "Lee", "https://tinyurl.3", "6/7/2019", 7659, ["web",
"twitter", "FB", "LinkedIn"]],
[4, "Tathagata", "Das", "https://tinyurl.4", "5/12/2018", 10568,
["twitter", "FB"]],
[5, "Matei","Zaharia", "https://tinyurl.5", "5/14/2014", 40578, ["web",
"twitter", "FB", "LinkedIn"]],
[6, "Reynold", "Xin", "https://tinyurl.6", "3/2/2015", 25568,
["twitter", "LinkedIn"]]
]
# Main program
if __name__ == "__main__":

    # Create a DataFrame using the schema defined above
    blogs_df = spark.createDataFrame(data, schema)
    # Show the DataFrame; it should reflect our table above
    blogs_df.show()

    # Print the schema used by Spark to process the DataFrame
    # printSchema() prints by itself and returns None, so do not wrap it in print()
    blogs_df.printSchema()
    

+---+---------+-------+-----------------+---------+-----+--------------------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|
+---+---------+-------+-----------------+---------+-----+--------------------+
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|
+---+---------+-------+-----------------+---------+-----+--------------------+

root
 |-- Id: integer (nullable = true)
 |-- First: string (nullable = true)
 |-- Last: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- Published: string (nullable = true)
 |-- Hits: integer (nullable = true)
 |-- Campaigns: array (nullable = true)
 |    |-- element: string (containsNull = true)

In [8]:
blogs_df.schema

StructType([StructField('Id', IntegerType(), True), StructField('First', StringType(), True), StructField('Last', StringType(), True), StructField('Url', StringType(), True), StructField('Published', StringType(), True), StructField('Hits', IntegerType(), True), StructField('Campaigns', ArrayType(StringType(), True), True)])

## Columns and Expressions

In [9]:

blogs_df.select(col("Id")).show()

+---+
| Id|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
+---+



In [10]:
blogs_df.select(expr("Hits * 2")).show(2)

+----------+
|(Hits * 2)|
+----------+
|      9070|
|     17816|
+----------+
only showing top 2 rows



In [11]:
blogs_df.withColumn("Big Hitters", (expr("Hits > 10000"))).show()

+---+---------+-------+-----------------+---------+-----+--------------------+-----------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|Big Hitters|
+---+---------+-------+-----------------+---------+-----+--------------------+-----------+
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|      false|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|      false|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|      false|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|       true|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|       true|
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|       true|
+---+---------+-------+-----------------+---------+-----+--------------------+-----------+



In [12]:
blogs_df.withColumn("AuthorsId", (concat(expr("First"), expr("Last"), expr("Id")))).select(col("AuthorsId")).show(4)

+-------------+
|    AuthorsId|
+-------------+
|  JulesDamji1|
| BrookeWenig2|
|    DennyLee3|
|TathagataDas4|
+-------------+
only showing top 4 rows



In [13]:
#test with filter
df2 = blogs_df.withColumn("Big_Hitters", (expr("Hits > 10000")))
# Big_Hitters is already a boolean column, so it can be used directly as a condition
df2.filter(df2.Big_Hitters & (df2.Id == 4)).show()

+---+---------+----+-----------------+---------+-----+-------------+-----------+
| Id|    First|Last|              Url|Published| Hits|    Campaigns|Big_Hitters|
+---+---------+----+-----------------+---------+-----+-------------+-----------+
|  4|Tathagata| Das|https://tinyurl.4|5/12/2018|10568|[twitter, FB]|       true|
+---+---------+----+-----------------+---------+-----+-------------+-----------+



In [14]:
# These statements return the same value, showing that
# expr is the same as a col method call
blogs_df.select(expr("Hits")).show(2)
blogs_df.select(col("Hits")).show(2)
blogs_df.select("Hits").show(2)

+----+
|Hits|
+----+
|4535|
|8908|
+----+
only showing top 2 rows

+----+
|Hits|
+----+
|4535|
|8908|
+----+
only showing top 2 rows

+----+
|Hits|
+----+
|4535|
|8908|
+----+
only showing top 2 rows



In [15]:
# Sort by column "Id" in descending order
blogs_df.sort(col("Id").desc()).show()
# Scala equivalent: blogsDF.sort($"Id".desc).show()

+---+---------+-------+-----------------+---------+-----+--------------------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|
+---+---------+-------+-----------------+---------+-----+--------------------+
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|
+---+---------+-------+-----------------+---------+-----+--------------------+



## Rows

In [16]:
# In Python
from pyspark.sql import Row
blog_row = Row(6, "Reynold", "Xin", "https://tinyurl.6", 255568, "3/2/2015",
["twitter", "LinkedIn"])
# access using index for individual items
blog_row[1]

'Reynold'

In [17]:
# In Python
rows = [Row("Matei Zaharia", "CA"), Row("Reynold Xin", "CA")]
authors_df = spark.createDataFrame(rows, ["Authors", "State"])
authors_df.show()

+-------------+-----+
|      Authors|State|
+-------------+-----+
|Matei Zaharia|   CA|
|  Reynold Xin|   CA|
+-------------+-----+



## Common DataFrame Operations

In [40]:
# In Python, define a schema
from pyspark.sql.types import *
# Programmatic way to define a schema
fire_schema = StructType([StructField('CallNumber', IntegerType(), True),
StructField('UnitID', StringType(), True),
StructField('IncidentNumber', IntegerType(), True),
StructField('CallType', StringType(), True),
StructField('CallDate', StringType(), True),
StructField('WatchDate', StringType(), True),
StructField('CallFinalDisposition', StringType(), True),
StructField('AvailableDtTm', StringType(), True),
StructField('Address', StringType(), True),
StructField('City', StringType(), True),
StructField('Zipcode', IntegerType(), True),
StructField('Battalion', StringType(), True),
StructField('StationArea', StringType(), True),
StructField('Box', StringType(), True),
StructField('OriginalPriority', StringType(), True),
StructField('Priority', StringType(), True),
StructField('FinalPriority', IntegerType(), True),
StructField('ALSUnit', BooleanType(), True),
StructField('CallTypeGroup', StringType(), True),
StructField('NumAlarms', IntegerType(), True),
StructField('UnitType', StringType(), True),
StructField('UnitSequenceInCallDispatch', IntegerType(), True),
StructField('FirePreventionDistrict', StringType(), True),
StructField('SupervisorDistrict', StringType(), True),
StructField('Neighborhood', StringType(), True),
StructField('Location', StringType(), True),
StructField('RowID', StringType(), True),
StructField('Delay', FloatType(), True)])
# Use the DataFrameReader interface to read a CSV file
sf_fire_file = "C:/Lenzi/Spark/LearningSparkV2-master/chapter3/data/sf-fire-calls.csv"
fire_df = spark.read.csv(sf_fire_file, header=True, schema=fire_schema)
#fire_df=fire_df.na.drop()



In [42]:
fire_df.count()

175296

In [43]:
parquet_path = "C:/Lenzi/Spark/spark-certification-main/files/parquet/"
fire_df.write.format("parquet").save(parquet_path)

In [45]:
parquet_table = 'fire_table8' # name of the table
fire_df.write.format("parquet").saveAsTable(parquet_table)

Transformations and actions

In [46]:
#In Python
few_fire_df = (fire_df
.select("IncidentNumber", "AvailableDtTm", "CallType")
.where(col("CallType") != "Medical Incident"))
few_fire_df.show(5, truncate=False)

+--------------+----------------------+--------------+
|IncidentNumber|AvailableDtTm         |CallType      |
+--------------+----------------------+--------------+
|2003235       |01/11/2002 01:51:44 AM|Structure Fire|
|2003250       |01/11/2002 04:16:46 AM|Vehicle Fire  |
|2003259       |01/11/2002 06:01:58 AM|Alarms        |
|2003279       |01/11/2002 08:03:26 AM|Structure Fire|
|2003301       |01/11/2002 09:46:44 AM|Alarms        |
+--------------+----------------------+--------------+
only showing top 5 rows



In [47]:
# In Python, return the number of distinct incidents using countDistinct()
# (the count is over IncidentNumber, restricted to rows with a non-null CallType)
from pyspark.sql.functions import *
(fire_df
.select("IncidentNumber")
.where(col("CallType").isNotNull())
.agg(countDistinct("IncidentNumber").alias("DistinctIncidents"))
.show())

+-----------------+
|DistinctIncidents|
+-----------------+
|           168571|
+-----------------+



In [48]:
#In Python, filter for only distinct non-null CallTypes from all the rows
(fire_df
.select("CallType")
.where(col("CallType").isNotNull())
.distinct()
.show(10, False))

+-----------------------------+
|CallType                     |
+-----------------------------+
|Elevator / Escalator Rescue  |
|Marine Fire                  |
|Aircraft Emergency           |
|Administrative               |
|Alarms                       |
|Odor (Strange / Unknown)     |
|Citizen Assist / Service Call|
|HazMat                       |
|Watercraft in Distress       |
|Explosion                    |
+-----------------------------+
only showing top 10 rows



Renaming, adding, and dropping columns

In [49]:
# In Python
# In the book, the authors mention that they added a Delay column but do not say how they did it.
# Here we overwrite it with a random value (1 or 4) for testing purposes only.
new_fire_df = (fire_df
.withColumn("Delay", when(rand() > 0.5, 1).otherwise(4))
.withColumnRenamed("Delay", "ResponseDelayedinMins"))



In [50]:
(new_fire_df
.select("ResponseDelayedinMins")
.where(col("ResponseDelayedinMins") > 1)
.show(5, False))

+---------------------+
|ResponseDelayedinMins|
+---------------------+
|4                    |
|4                    |
|4                    |
|4                    |
|4                    |
+---------------------+
only showing top 5 rows



In [51]:
#test query - avg response
new_fire_df = fire_df.withColumn("Delay",when(rand() > 0.5, 1).otherwise(4)).withColumnRenamed("Delay", "ResponseDelayedinMins")

(new_fire_df
.select("ResponseDelayedinMins")
.where(col("CallType").isNotNull())
.agg(avg("ResponseDelayedinMins").alias("AvgResponseDelayedinMins"))
.show())

+------------------------+
|AvgResponseDelayedinMins|
+------------------------+
|      2.5056989320920042|
+------------------------+
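
An average close to 2.5 is what one would expect here: the test column is 1 or 4 with roughly equal probability, so the expected value is 0.5 * 1 + 0.5 * 4 = 2.5. A minimal sketch that checks this with Python's `random` module instead of Spark's `rand()` (the seed 42 is an arbitrary choice, and the sample size reuses the row count reported by `fire_df.count()`):

```python
import random

rng = random.Random(42)   # fixed seed (arbitrary) so the sketch is repeatable
n = 175_296               # row count reported by fire_df.count()

# Same rule as when(rand() > 0.5, 1).otherwise(4), applied to local random numbers
delays = [1 if rng.random() > 0.5 else 4 for _ in range(n)]

expected = 0.5 * 1 + 0.5 * 4
sample_mean = sum(delays) / n
print(expected, round(sample_mean, 3))
```

Because the column is regenerated from `rand()` each time the cell runs, the exact digits after 2.5 will differ between runs.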



In [52]:
def displayFunc(df, rows):
    df = df.pandas_api()
    return df.head(rows)

In [53]:
new_fire_df.createOrReplaceTempView("new_fire_vw")

In [54]:
#df = spark.sql('''select * from new_fire_vw''')
#df2 = df.toPandas()

#displayFunc(df2,20)

In [55]:
# In Python
fire_ts_df = (new_fire_df
.withColumn("IncidentDate", to_timestamp(col("CallDate"), "MM/dd/yyyy"))
.drop("CallDate")
.withColumn("OnWatchDate", to_timestamp(col("WatchDate"), "MM/dd/yyyy"))
.drop("WatchDate")
.withColumn("AvailableDtTS", to_timestamp(col("AvailableDtTm"),
"MM/dd/yyyy hh:mm:ss a"))
.drop("AvailableDtTm"))
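
The format strings passed to `to_timestamp` describe how the source strings are laid out: `MM` is the month, `dd` the day, `yyyy` the year, `hh` the 12-hour clock hour, and `a` the AM/PM marker. As a rough analogy only, the same parsing can be sketched with Python's `datetime.strptime`; the mapping between the Spark patterns and the `strptime` directives in the comments is my own reading, not something taken from the book.

```python
from datetime import datetime

# Spark pattern             ->  strptime directive (assumed equivalent)
# "MM/dd/yyyy"              ->  "%m/%d/%Y"
# "MM/dd/yyyy hh:mm:ss a"   ->  "%m/%d/%Y %I:%M:%S %p"

# A date-only string parses to midnight of that day
call_date = datetime.strptime("01/11/2002", "%m/%d/%Y")

# A 12-hour time with a PM marker becomes a 24-hour timestamp
available = datetime.strptime("01/11/2002 01:08:40 PM", "%m/%d/%Y %I:%M:%S %p")

print(call_date)   # 2002-01-11 00:00:00
print(available)   # 2002-01-11 13:08:40
```

This is why the converted IncidentDate and OnWatchDate columns all show 00:00:00, while AvailableDtTS keeps a time of day.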

In [56]:
(new_fire_df
.select("CallDate", "WatchDate", "AvailableDtTm")
.show(50, False))

+----------+----------+----------------------+
|CallDate  |WatchDate |AvailableDtTm         |
+----------+----------+----------------------+
|01/11/2002|01/10/2002|01/11/2002 01:51:44 AM|
|01/11/2002|01/10/2002|01/11/2002 03:01:18 AM|
|01/11/2002|01/10/2002|01/11/2002 02:39:50 AM|
|01/11/2002|01/10/2002|01/11/2002 04:16:46 AM|
|01/11/2002|01/10/2002|01/11/2002 06:01:58 AM|
|01/11/2002|01/11/2002|01/11/2002 08:03:26 AM|
|01/11/2002|01/11/2002|01/11/2002 09:46:44 AM|
|01/11/2002|01/11/2002|01/11/2002 09:58:53 AM|
|01/11/2002|01/11/2002|01/11/2002 12:06:57 PM|
|01/11/2002|01/11/2002|01/11/2002 01:08:40 PM|
|01/11/2002|01/11/2002|01/11/2002 03:31:02 PM|
|01/11/2002|01/11/2002|01/11/2002 02:59:04 PM|
|01/11/2002|01/11/2002|01/11/2002 04:22:49 PM|
|01/11/2002|01/11/2002|01/11/2002 04:18:33 PM|
|01/11/2002|01/11/2002|01/11/2002 04:09:08 PM|
|01/11/2002|01/11/2002|01/11/2002 04:09:08 PM|
|01/11/2002|01/11/2002|01/11/2002 04:09:08 PM|
|01/11/2002|01/11/2002|01/11/2002 04:34:23 PM|
|01/11/2002|0

In [57]:
# Select the converted columns
(fire_ts_df
.select("IncidentDate", "OnWatchDate", "AvailableDtTS")
.show(50, False))

+-------------------+-------------------+-------------------+
|IncidentDate       |OnWatchDate        |AvailableDtTS      |
+-------------------+-------------------+-------------------+
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 01:51:44|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 03:01:18|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 02:39:50|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 04:16:46|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 06:01:58|
|2002-01-11 00:00:00|2002-01-11 00:00:00|2002-01-11 08:03:26|
|2002-01-11 00:00:00|2002-01-11 00:00:00|2002-01-11 09:46:44|
|2002-01-11 00:00:00|2002-01-11 00:00:00|2002-01-11 09:58:53|
|2002-01-11 00:00:00|2002-01-11 00:00:00|2002-01-11 12:06:57|
|2002-01-11 00:00:00|2002-01-11 00:00:00|2002-01-11 13:08:40|
|2002-01-11 00:00:00|2002-01-11 00:00:00|2002-01-11 15:31:02|
|2002-01-11 00:00:00|2002-01-11 00:00:00|2002-01-11 14:59:04|
|2002-01-11 00:00:00|2002-01-11 00:00:00|2002-01-11 16:22:49|
|2002-01

In [58]:
# In Python
(fire_ts_df
.select(year('IncidentDate'))
.distinct()
.orderBy(year('IncidentDate'))
.show())

+------------------+
|year(IncidentDate)|
+------------------+
|              2000|
|              2001|
|              2002|
|              2003|
|              2004|
|              2005|
|              2006|
|              2007|
|              2008|
|              2009|
|              2010|
|              2011|
|              2012|
|              2013|
|              2014|
|              2015|
|              2016|
|              2017|
|              2018|
+------------------+



Aggregations

In [59]:
# In Python
(fire_ts_df
.select("CallType")
.where(col("CallType").isNotNull())
.groupBy("CallType")
.count()
.orderBy("count", ascending=False)
.show(n=10, truncate=False))

+-------------------------------+------+
|CallType                       |count |
+-------------------------------+------+
|Medical Incident               |113794|
|Structure Fire                 |23319 |
|Alarms                         |19406 |
|Traffic Collision              |7013  |
|Citizen Assist / Service Call  |2524  |
|Other                          |2166  |
|Outside Fire                   |2094  |
|Vehicle Fire                   |854   |
|Gas Leak (Natural and LP Gases)|764   |
|Water Rescue                   |755   |
+-------------------------------+------+
only showing top 10 rows
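
As a rough local analogy for the `groupBy().count().orderBy()` chain, the same idea can be sketched with `collections.Counter`. The input list is just the five CallType values from the `few_fire_df` output earlier in these notes, not the full dataset.

```python
from collections import Counter

# The five CallType values shown in the few_fire_df.show(5) output
call_types = ["Structure Fire", "Vehicle Fire", "Alarms", "Structure Fire", "Alarms"]

# groupBy("CallType").count() is roughly Counter(...);
# orderBy("count", ascending=False) is roughly most_common()
counts = Counter(call_types).most_common()
print(counts)
```

Unlike Spark, this runs on a single machine and holds every value in memory, so it is only meant to show the shape of the result.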



Other common DataFrame operations

In [60]:
# In Python
import pyspark.sql.functions as F
(fire_ts_df
.select(F.sum("NumAlarms"), F.avg("ResponseDelayedinMins"),F.min("ResponseDelayedinMins"), F.max("ResponseDelayedinMins"))
.show())

+--------------+--------------------------+--------------------------+--------------------------+
|sum(NumAlarms)|avg(ResponseDelayedinMins)|min(ResponseDelayedinMins)|max(ResponseDelayedinMins)|
+--------------+--------------------------+--------------------------+--------------------------+
|        176170|        2.5056989320920042|                         1|                         4|
+--------------+--------------------------+--------------------------+--------------------------+



## DataFrames Versus Datasets
By now you may be wondering why and when you should use DataFrames or Datasets.
In many cases either will do, depending on the languages you are working in, but
there are some situations where one is preferable to the other. Here are a few
examples:

 • If you want to tell Spark what to do, not how to do it, use DataFrames or Datasets.

 • If you want rich semantics, high-level abstractions, and DSL operators, use DataFrames or Datasets.

 • If you want strict compile-time type safety and don’t mind creating multiple case classes for a specific Dataset[T], use Datasets.

 • If your processing demands high-level expressions, filters, maps, aggregations, computing averages or sums, SQL queries, columnar access, or use of relational operators on semi-structured data, use DataFrames or Datasets.

 • If your processing dictates relational transformations similar to SQL-like queries, use DataFrames.

 • If you want to take advantage of and benefit from Tungsten’s efficient serialization with Encoders, use Datasets.

 • If you want unification, code optimization, and simplification of APIs across Spark components, use DataFrames.

 • If you are an R user, use DataFrames.

 • If you are a Python user, use DataFrames and drop down to RDDs if you need more control.

 • If you want space and speed efficiency, use DataFrames.

## When to Use RDDs

Consider using RDDs when you:

• Are using a third-party package that’s written using RDDs

• Can forgo the code optimization, efficient space utilization, and performance benefits available with DataFrames and Datasets

• Want to precisely instruct Spark how to do a query

Let’s go through each of the four query optimization phases.

Phase 1: Analysis
The Spark SQL engine begins by generating an abstract syntax tree (AST) for the SQL or DataFrame query. In this initial phase, any columns or table names will be resolved by consulting an internal Catalog, a programmatic interface to Spark SQL that holds a list of names of columns, data types, functions, tables, databases, etc. Once they’ve all been successfully resolved, the query proceeds to the next phase.
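
As a toy illustration of what "resolving names against a catalog" means, the sketch below looks up column names in a plain dictionary and fails early when a name is unknown. Everything here is hypothetical: the `catalog` dict, the table name `blogs`, and the `resolve` helper are stand-ins for illustration and are not Spark's Catalog API.

```python
# Toy catalog: table name -> {column name: data type}.
# The columns mirror part of the blogs_df schema used earlier in these notes.
catalog = {"blogs": {"Id": "int", "First": "string", "Hits": "int"}}

def resolve(table, columns):
    """Return (column, type) pairs, or raise if any name cannot be resolved."""
    if table not in catalog:
        raise LookupError(f"Table or view not found: {table}")
    schema = catalog[table]
    unresolved = [c for c in columns if c not in schema]
    if unresolved:
        raise LookupError(f"Cannot resolve column(s): {unresolved}")
    return [(c, schema[c]) for c in columns]

# Every name is known, so the "query" can move on to the next phase
resolved = resolve("blogs", ["Id", "Hits"])
print(resolved)

# An unknown column is rejected before any data is touched
try:
    resolve("blogs", ["Clicks"])
except LookupError as err:
    print(err)
```

The point of the sketch is only the ordering: names and types are checked up front, before any optimization or execution happens.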

Phase 2: Logical optimization
As Figure 3-4 shows, this phase comprises two internal stages. Applying a standard rule-based optimization approach, the Catalyst optimizer will first construct a set of multiple plans and then, using its cost-based optimizer (CBO), assign costs to each plan. These plans are laid out as operator trees (like in Figure 3-5); they may include, for example, the process of constant folding, predicate pushdown, projection pruning, Boolean expression simplification, etc. This logical plan is the input into the physical plan.
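
To make "constant folding" and "Boolean expression simplification" concrete, here is a toy rule applied to a tiny expression tree. This is a sketch for intuition only and is not how Catalyst is implemented; the tuple-based tree layout and the `fold` function are assumptions of mine.

```python
import operator

# Tree layout (assumed for this sketch):
#   literal -> a plain Python value, column -> ("col", name), binary -> (op, left, right)
OPS = {"+": operator.add, "*": operator.mul, ">": operator.gt}

def fold(node):
    """Recursively replace sub-expressions made only of literals with their value."""
    if not isinstance(node, tuple) or node[0] == "col":
        return node  # literal or column reference: nothing to fold
    op, left, right = node
    left, right = fold(left), fold(right)
    if op == "and":
        # Boolean simplification: (true AND x) -> x, (false AND x) -> false
        if left is True:
            return right
        if right is True:
            return left
        if left is False or right is False:
            return False
        return (op, left, right)
    if not isinstance(left, tuple) and not isinstance(right, tuple):
        return OPS[op](left, right)  # both sides are constants: evaluate now
    return (op, left, right)

# Hits > 5000 * 2 AND (1 + 1 > 0)
expr_tree = ("and", (">", ("col", "Hits"), ("*", 5000, 2)), (">", ("+", 1, 1), 0))
optimized = fold(expr_tree)
print(optimized)  # ('>', ('col', 'Hits'), 10000)
```

The rewritten tree does less work per row: the multiplication and the always-true comparison are evaluated once at planning time instead of once per record.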

Phase 3: Physical planning
In this phase, Spark SQL generates an optimal physical plan for the selected logical plan, using physical operators that match those available in the Spark execution engine.

Phase 4: Code generation
The final phase of query optimization involves generating efficient Java bytecode to run on each machine. Because Spark SQL can operate on data sets loaded in memory, Spark can use state-of-the-art compiler technology for code generation to speed up execution. In other words, it acts as a compiler. Project Tungsten, which facilitates whole-stage code generation, plays a role here.
Just what is whole-stage code generation? It’s a physical query optimization phase that collapses the whole query into a single function, getting rid of virtual function calls and employing CPU registers for intermediate data. The second-generation Tungsten engine, introduced in Spark 2.0, uses this approach to generate compact RDD code for final execution. This streamlined strategy significantly improves CPU efficiency and performance.
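
The idea of collapsing a query into a single function can be sketched in plain Python. This is an analogy only: Spark generates Java code, not Python, and the operator functions and the generated `fused` function below are hypothetical. The sketch runs the same filter-then-project query twice, once as a chain of separate operators and once as a single loop built from generated source text.

```python
rows = [4535, 8908, 7659, 10568, 40578, 25568]  # Hits values from blogs_df

# Operator-at-a-time: each operator is its own generator, so every row
# crosses one call boundary per operator.
def scan(data):
    for value in data:
        yield value

def filter_op(child, predicate):
    for value in child:
        if predicate(value):
            yield value

def project_op(child, func):
    for value in child:
        yield func(value)

pipelined = list(project_op(filter_op(scan(rows), lambda h: h > 10000), lambda h: h * 2))

# "Whole-stage" style: emit source for one fused loop, then compile and run it.
fused_source = """
def fused(data):
    out = []
    for hits in data:
        if hits > 10000:
            out.append(hits * 2)
    return out
"""
namespace = {}
exec(fused_source, namespace)
fused = namespace["fused"](rows)

print(pipelined)  # [21136, 81156, 51136]
print(fused)      # [21136, 81156, 51136]
```

Both versions return the same rows; the fused version just avoids the per-row hand-offs between operators, which is the effect the text attributes to whole-stage code generation.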

