# NOTEBOOK 3.6 Spark DataFrames

## 0. Create a Spark session object

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession\
        .builder\
        .appName("DataFrameDemo")\
        .getOrCreate()

24/06/05 10:12:34 WARN Utils: Your hostname, PC25. resolves to a loopback address: 127.0.1.1; using 192.168.76.195 instead (on interface eth0)
24/06/05 10:12:34 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/06/05 10:12:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## 1. File I/O with Spark DataFrame

### 1.1 Load data from CSV file into a DataFrame

Note:
- The Hadoop installation in the WSL distro is configured with HDFS as the default file system, so a relative path such as data/sales.csv refers to HDFS.
- To access files in WSL's local file system, start the file path with **"file://"**.
- If a file has no header row, omit the option("header", "true") call.
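The path rules in the note can be sketched as follows. This is a minimal illustration, not part of the original notebook: `to_file_uri`, `read_local_tsv`, and the path `/home/student/data/sales.csv` are hypothetical names chosen for the example, and `spark` is assumed to be an existing SparkSession.

```python
def to_file_uri(local_path):
    """Return a file:// URI for an absolute path on the local (WSL) file system."""
    if not local_path.startswith("/"):
        raise ValueError("expected an absolute path, got: " + local_path)
    return "file://" + local_path


def read_local_tsv(spark, local_path):
    """Read a tab-separated file with a header row from the local file system.

    `spark` is assumed to be an existing SparkSession. Without the file://
    prefix, the same path would be resolved against HDFS in this setup.
    """
    return (spark.read
            .option("sep", "\t")
            .option("header", "true")
            .csv(to_file_uri(local_path)))


# Hypothetical path, used only for illustration.
local_uri = to_file_uri("/home/student/data/sales.csv")
print(local_uri)  # file:///home/student/data/sales.csv
```

In the notebook this would be called as `read_local_tsv(spark, "/home/student/data/sales.csv")`, with the path adjusted to wherever the file actually lives.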

#### 1.1(a) Load data from a CSV file in HDFS

In [None]:
sales_df = spark.read.option("sep", "\t")\
    .option("header", "true")\
    .csv("data/sales.csv")

sales_df.show(3)

+----+-----------+----------+--------+
|code|description|unit_price|quantity|
+----+-----------+----------+--------+
|1005|        pen|       2.5|       4|
|1007|     pencil|       1.0|      10|
|1001|   notebook|       5.0|       2|
+----+-----------+----------+--------+
only showing top 3 rows



### 1.2 Write/Read data from parquet file

In [None]:
sales_df.write.parquet('data/sales.parquet')
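Re-running the write cell is likely to fail once data/sales.parquet exists, because the default save mode of Spark's DataFrameWriter raises an error when the target path is already present. A minimal sketch of a re-runnable variant, assuming that replacing the previous output is acceptable; `save_parquet` is a hypothetical helper name, not something defined in this notebook:

```python
def save_parquet(df, path, mode="overwrite"):
    """Write a Spark DataFrame as Parquet, replacing existing output by default.

    `df` is assumed to be a pyspark.sql.DataFrame such as sales_df.
    DataFrameWriter.mode() accepts "append", "overwrite", "ignore",
    and "error" / "errorifexists" (the default behaviour).
    """
    df.write.mode(mode).parquet(path)


# Usage in the notebook (needs a running Spark session):
# save_parquet(sales_df, "data/sales.parquet")
```

The same `mode("overwrite")` call can be placed in front of `.json(...)` or `.save(...)` for the other writers used later.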

                                                                                

In [None]:
parquet_df = spark.read.parquet("data/sales.parquet")
parquet_df.show()

+----+-----------+----------+--------+
|code|description|unit_price|quantity|
+----+-----------+----------+--------+
|1005|        pen|       2.5|       4|
|1007|     pencil|       1.0|      10|
|1001|   notebook|       5.0|       2|
|1003|      ruler|       1.0|       1|
|1002| calculator|      55.0|       1|
+----+-----------+----------+--------+



#### Query

In [None]:
parquet_df.createOrReplaceTempView("parquetSales")
results_df = spark.sql("SELECT code, description FROM parquetSales WHERE quantity >= 2 AND quantity <= 20")
results_df.show()

+----+-----------+
|code|description|
+----+-----------+
|1005|        pen|
|1007|     pencil|
|1001|   notebook|
+----+-----------+



### 1.3 Write/Read data from JSON file

In [None]:
sales_df.write.json("data/sales.json")

In [None]:
json_df = spark.read.json("data/sales.json")
json_df.show()

+----+-----------+--------+----------+
|code|description|quantity|unit_price|
+----+-----------+--------+----------+
|1005|        pen|       4|       2.5|
|1007|     pencil|      10|       1.0|
|1001|   notebook|       2|       5.0|
|1003|      ruler|       1|       1.0|
|1002| calculator|       1|      55.0|
+----+-----------+--------+----------+



## 2. DataFrame Operations (Transformations)

### 2(a) Print the schema in a tree format

In [None]:
sales_df.printSchema()

root
 |-- code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- unit_price: string (nullable = true)
 |-- quantity: string (nullable = true)



### 2(b) Convert multiple column types

In [None]:
# Cast columns to appropriate types

from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType, IntegerType

sales_df = sales_df.withColumn('unit_price', col('unit_price').cast(DoubleType())) \
            .withColumn('quantity', col('quantity').cast(IntegerType()))

sales_df.printSchema()

root
 |-- code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- unit_price: double (nullable = true)
 |-- quantity: integer (nullable = true)



### 2(c) Select a set of columns from a DataFrame

In [None]:
sales_df.select("description").show()

+-----------+
|description|
+-----------+
|        pen|
|     pencil|
|   notebook|
|      ruler|
| calculator|
+-----------+



### 2(d) Filter rows from a DataFrame based on certain conditions

In [None]:
sales_df.filter(sales_df['unit_price'] > 2.00).show()

+----+-----------+----------+--------+
|code|description|unit_price|quantity|
+----+-----------+----------+--------+
|1005|        pen|       2.5|       4|
|1001|   notebook|       5.0|       2|
|1002| calculator|      55.0|       1|
+----+-----------+----------+--------+



### 2(e) Group rows in a DataFrame based on a set of columns & apply aggregated functions (e.g., count(), avg()) on the grouped dataset.

In [None]:
sales_df.groupBy("code").count().show()

+----+-----+
|code|count|
+----+-----+
|1007|    1|
|1005|    1|
|1003|    1|
|1002|    1|
|1001|    1|
+----+-----+



In [None]:
sales_df.groupBy("quantity").count().show()

+--------+-----+
|quantity|count|
+--------+-----+
|       1|    2|
|       4|    1|
|      10|    1|
|       2|    1|
+--------+-----+



## 3. SQL Statements on DataFrames

### 3.1 Temporary Views
Temporary views let developers run SQL queries from within a program and get the result back as a DataFrame.

#### 3.1(a) Local temporary view on DataFrame

In [None]:
# Create a temporary view on a DataFrame
sales_df.createOrReplaceTempView("sales")
sqlDF = spark.sql("SELECT * FROM sales")
sqlDF.show()

+----+-----------+----------+--------+
|code|description|unit_price|quantity|
+----+-----------+----------+--------+
|1005|        pen|       2.5|       4|
|1007|     pencil|       1.0|      10|
|1001|   notebook|       5.0|       2|
|1003|      ruler|       1.0|       1|
|1002| calculator|      55.0|       1|
+----+-----------+----------+--------+



#### 3.1(b) Global temporary views on DataFrames
Temporary views last only for the session in which they are created. To make a view available across sessions within the same Spark application, create a global temporary view. Its definition is stored in the system database **global_temp**, so queries must refer to it by its fully qualified name, for example global_temp.sales.

In [None]:
sales_df.createGlobalTempView("sales")

# Global temporary view is tied to a system database `global_temp`
spark.sql("SELECT * FROM global_temp.sales").show()
spark.newSession().sql("SELECT * FROM global_temp.sales").show()

+----+-----------+----------+--------+
|code|description|unit_price|quantity|
+----+-----------+----------+--------+
|1005|        pen|       2.5|       4|
|1007|     pencil|       1.0|      10|
|1001|   notebook|       5.0|       2|
|1003|      ruler|       1.0|       1|
|1002| calculator|      55.0|       1|
+----+-----------+----------+--------+

+----+-----------+----------+--------+
|code|description|unit_price|quantity|
+----+-----------+----------+--------+
|1005|        pen|       2.5|       4|
|1007|     pencil|       1.0|      10|
|1001|   notebook|       5.0|       2|
|1003|      ruler|       1.0|       1|
|1002| calculator|      55.0|       1|
+----+-----------+----------+--------+



### 3.2 Use SparkSQL to read the columns with correct data types

In [None]:
sales_df2 = spark.read.parquet("data/sales.parquet")
sales_df2.printSchema()
sales_df2.show()

root
 |-- code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- unit_price: string (nullable = true)
 |-- quantity: string (nullable = true)

+----+-----------+----------+--------+
|code|description|unit_price|quantity|
+----+-----------+----------+--------+
|1005|        pen|       2.5|       4|
|1007|     pencil|       1.0|      10|
|1001|   notebook|       5.0|       2|
|1003|      ruler|       1.0|       1|
|1002| calculator|      55.0|       1|
+----+-----------+----------+--------+



In [None]:
sales_df2.createOrReplaceTempView('SalesData')
sales_df2 = spark.sql("SELECT code, description, DOUBLE(unit_price), INT(quantity) from SalesData")
sales_df2.printSchema()
sales_df2.show()

root
 |-- code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- unit_price: double (nullable = true)
 |-- quantity: integer (nullable = true)

+----+-----------+----------+--------+
|code|description|unit_price|quantity|
+----+-----------+----------+--------+
|1005|        pen|       2.5|       4|
|1007|     pencil|       1.0|      10|
|1001|   notebook|       5.0|       2|
|1003|      ruler|       1.0|       1|
|1002| calculator|      55.0|       1|
+----+-----------+----------+--------+
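The `DOUBLE(unit_price)` and `INT(quantity)` calls above are Spark SQL shorthand; the same conversion can be written with the standard `CAST(expr AS type)` syntax. The sketch below builds such a query from a column-to-type mapping. `build_cast_query` is a hypothetical helper introduced here for illustration, and it uses plain string formatting, so it should only be fed trusted column names.

```python
def build_cast_query(view, columns):
    """Build a SELECT that casts columns with CAST(expr AS type).

    `columns` maps each column name to a Spark SQL type name, or to None
    to select the column unchanged.
    """
    parts = []
    for name, sql_type in columns.items():
        if sql_type is None:
            parts.append(name)
        else:
            # Alias back to the original name so the schema keeps its column names.
            parts.append(f"CAST({name} AS {sql_type}) AS {name}")
    return f"SELECT {', '.join(parts)} FROM {view}"


query = build_cast_query(
    "SalesData",
    {"code": None, "description": None, "unit_price": "DOUBLE", "quantity": "INT"},
)
print(query)

# Usage in the notebook (needs the SalesData view to exist):
# spark.sql(query).printSchema()
```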



### 3.3 Joining DataFrames

#### Check the sales data

In [None]:
sales_df.printSchema()
sales_df.show()

root
 |-- code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- unit_price: double (nullable = true)
 |-- quantity: integer (nullable = true)

+----+-----------+----------+--------+
|code|description|unit_price|quantity|
+----+-----------+----------+--------+
|1005|        pen|       2.5|       4|
|1007|     pencil|       1.0|      10|
|1001|   notebook|       5.0|       2|
|1003|      ruler|       1.0|       1|
|1002| calculator|      55.0|       1|
+----+-----------+----------+--------+



#### Read the discount data

In [None]:
discount_df = spark.read.option("sep", "\t")\
    .option("header", "true")\
    .csv("data/discounts.csv")

discount_df.createOrReplaceTempView('DiscountData')
discount_df = spark.sql("SELECT item_code, DOUBLE(discount_perc) from DiscountData")
discount_df.printSchema()
discount_df.show()

root
 |-- item_code: string (nullable = true)
 |-- discount_perc: double (nullable = true)

+---------+-------------+
|item_code|discount_perc|
+---------+-------------+
|     1005|         20.0|
|     1007|         10.0|
|     1001|         50.0|
|     1003|         15.0|
|     1002|         10.0|
+---------+-------------+



#### Join sales_df and discount_df based on the item code

In [None]:
sales_df = sales_df.join(discount_df, sales_df.code == discount_df.item_code, "inner")
sales_df.show()

+----+-----------+----------+--------+---------+-------------+
|code|description|unit_price|quantity|item_code|discount_perc|
+----+-----------+----------+--------+---------+-------------+
|1005|        pen|       2.5|       4|     1005|         20.0|
|1007|     pencil|       1.0|      10|     1007|         10.0|
|1001|   notebook|       5.0|       2|     1001|         50.0|
|1003|      ruler|       1.0|       1|     1003|         15.0|
|1002| calculator|      55.0|       1|     1002|         10.0|
+----+-----------+----------+--------+---------+-------------+



#### Drop item_code column

In [None]:
sales_df = sales_df.drop('item_code')
sales_df.show()

+----+-----------+----------+--------+-------------+
|code|description|unit_price|quantity|discount_perc|
+----+-----------+----------+--------+-------------+
|1005|        pen|       2.5|       4|         20.0|
|1007|     pencil|       1.0|      10|         10.0|
|1001|   notebook|       5.0|       2|         50.0|
|1003|      ruler|       1.0|       1|         15.0|
|1002| calculator|      55.0|       1|         10.0|
+----+-----------+----------+--------+-------------+



#### Rename code column

In [None]:
sales_df = sales_df.withColumnRenamed('code', 'product_code')
sales_df.show()

+------------+-----------+----------+--------+-------------+
|product_code|description|unit_price|quantity|discount_perc|
+------------+-----------+----------+--------+-------------+
|        1005|        pen|       2.5|       4|         20.0|
|        1007|     pencil|       1.0|      10|         10.0|
|        1001|   notebook|       5.0|       2|         50.0|
|        1003|      ruler|       1.0|       1|         15.0|
|        1002| calculator|      55.0|       1|         10.0|
+------------+-----------+----------+--------+-------------+



## 4. Convert Spark DataFrame to pandas DataFrame
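**Note:** The **pandas** package must be installed first.

A PySpark DataFrame can be converted to a pandas DataFrame with the toPandas() method. Because toPandas() collects every row to the driver, use it only on DataFrames small enough to fit in the driver's memory.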

In [None]:
pandasDF = sales_df.toPandas()
pandasDF.head()

   product_code  description  unit_price  quantity  discount_perc
0          1005          pen         2.5         4           20.0
1          1007       pencil         1.0        10           10.0
2          1001     notebook         5.0         2           50.0
3          1003        ruler         1.0         1           15.0
4          1002   calculator        55.0         1           10.0


## 5. User-Defined Function (UDF)

### 5(a) Registering an existing function as a UDF

In [None]:
from pyspark.sql.functions import udf

def calculate_price(unit_price, quantity):
    return unit_price * quantity

# UDF registration
calculate_price_udf = udf(calculate_price, DoubleType())

#### Compute total price before discount

In [None]:
sales_df = sales_df.withColumn("original_total", calculate_price_udf('unit_price', 'quantity'))
sales_df.show()

+------------+-----------+----------+--------+-------------+--------------+
|product_code|description|unit_price|quantity|discount_perc|original_total|
+------------+-----------+----------+--------+-------------+--------------+
|        1005|        pen|       2.5|       4|         20.0|          10.0|
|        1007|     pencil|       1.0|      10|         10.0|          10.0|
|        1001|   notebook|       5.0|       2|         50.0|          10.0|
|        1003|      ruler|       1.0|       1|         15.0|           1.0|
|        1002| calculator|      55.0|       1|         10.0|          55.0|
+------------+-----------+----------+--------+-------------+--------------+
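The UDF above fails if either input is null, because Python cannot multiply None. A minimal null-safe sketch is shown below, together with an equivalent that uses built-in column arithmetic and therefore avoids passing rows through Python at all. The helper names `add_total_with_udf` and `add_total_with_columns` are assumptions made for this example, and `df` is assumed to be a DataFrame with the `unit_price` and `quantity` columns used in this notebook.

```python
def calculate_price(unit_price, quantity):
    """Return unit_price * quantity as a float, or None if either input is missing."""
    if unit_price is None or quantity is None:
        return None
    # Return a float so the value matches the DoubleType declared for the UDF;
    # Spark generally yields null when the Python type does not match.
    return float(unit_price) * quantity


def add_total_with_udf(df):
    """Register calculate_price as a UDF and add an `original_total` column."""
    # Imported lazily so the plain function above can be tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    calculate_price_udf = udf(calculate_price, DoubleType())
    return df.withColumn("original_total",
                         calculate_price_udf("unit_price", "quantity"))


def add_total_with_columns(df):
    """Same column computed with built-in column arithmetic instead of a Python UDF."""
    return df.withColumn("original_total", df["unit_price"] * df["quantity"])


print(calculate_price(2.5, 4))   # 10.0
print(calculate_price(None, 4))  # None
```

Keeping the logic in a plain function makes it easy to unit-test; for simple arithmetic like this, the column-expression version is usually the better choice because Spark can optimize it.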



                                                                                

### 5(b) Using a UDF created using decorators
Refer to the SalesProcessor class in sales_processor.py.
This class contains two UDFs created with the **@udf** decorator. To use this approach, the methods must be static methods (marked with the **@staticmethod** decorator).

#### Add a file to be downloaded with the Spark job on every node.

In [None]:
sc = spark.sparkContext
sc.addFile("de_classes/sales_processor.py")

from sales_processor import SalesProcessor

#### Invoke UDF to compute the discounted price

In [None]:
sales_df = sales_df.withColumn("discounted_unit_price", SalesProcessor.calculate_discounted_price('unit_price', 'discount_perc'))
sales_df.show()

+------------+-----------+----------+--------+-------------+--------------+---------------------+
|product_code|description|unit_price|quantity|discount_perc|original_total|discounted_unit_price|
+------------+-----------+----------+--------+-------------+--------------+---------------------+
|        1005|        pen|       2.5|       4|         20.0|          10.0|                  2.0|
|        1007|     pencil|       1.0|      10|         10.0|          10.0|                  0.9|
|        1001|   notebook|       5.0|       2|         50.0|          10.0|                  2.5|
|        1003|      ruler|       1.0|       1|         15.0|           1.0|                 0.85|
|        1002| calculator|      55.0|       1|         10.0|          55.0|                 49.5|
+------------+-----------+----------+--------+-------------+--------------+---------------------+



#### Compute the discounted total

In [None]:
sales_df = sales_df.withColumn("discounted_total", calculate_price_udf('discounted_unit_price', 'quantity'))
sales_df.show()

+------------+-----------+----------+--------+-------------+--------------+---------------------+----------------+
|product_code|description|unit_price|quantity|discount_perc|original_total|discounted_unit_price|discounted_total|
+------------+-----------+----------+--------+-------------+--------------+---------------------+----------------+
|        1005|        pen|       2.5|       4|         20.0|          10.0|                  2.0|             8.0|
|        1007|     pencil|       1.0|      10|         10.0|          10.0|                  0.9|             9.0|
|        1001|   notebook|       5.0|       2|         50.0|          10.0|                  2.5|             5.0|
|        1003|      ruler|       1.0|       1|         15.0|           1.0|                 0.85|            0.85|
|        1002| calculator|      55.0|       1|         10.0|          55.0|                 49.5|            49.5|
+------------+-----------+----------+--------+-------------+--------------+---------------------+----------------+

#### Select fewer columns for summary df

In [None]:
sales_summary_df = sales_df.select('product_code', 'description', 'unit_price', 'quantity', 'discount_perc', 'discounted_total')
sales_summary_df.show()

+------------+-----------+----------+--------+-------------+----------------+
|product_code|description|unit_price|quantity|discount_perc|discounted_total|
+------------+-----------+----------+--------+-------------+----------------+
|        1005|        pen|       2.5|       4|         20.0|             8.0|
|        1007|     pencil|       1.0|      10|         10.0|             9.0|
|        1001|   notebook|       5.0|       2|         50.0|             5.0|
|        1003|      ruler|       1.0|       1|         15.0|            0.85|
|        1002| calculator|      55.0|       1|         10.0|            49.5|
+------------+-----------+----------+--------+-------------+----------------+



In [None]:
# Invoke another UDF to format the price
sales_summary_df = sales_summary_df.withColumn('unit_price', SalesProcessor.format_price('unit_price'))\
            .withColumn('discounted_total', SalesProcessor.format_price('discounted_total'))
sales_summary_df.show()

+------------+-----------+----------+--------+-------------+----------------+
|product_code|description|unit_price|quantity|discount_perc|discounted_total|
+------------+-----------+----------+--------+-------------+----------------+
|        1005|        pen|    RM2.50|       4|         20.0|          RM8.00|
|        1007|     pencil|    RM1.00|      10|         10.0|          RM9.00|
|        1001|   notebook|    RM5.00|       2|         50.0|          RM5.00|
|        1003|      ruler|    RM1.00|       1|         15.0|          RM0.85|
|        1002| calculator|   RM55.00|       1|         10.0|         RM49.50|
+------------+-----------+----------+--------+-------------+----------------+



## 6. RDD vs DataFrame

**Task**: Compute the average quantity for each person.

### 6.1 Using RDD

In [None]:
# Get the SparkContext object
sc = spark.sparkContext

# Create an RDD of tuples (name, quantity)
dataRDD = sc.parallelize([("Brooke", 20), ("Denny", 31), ("Jules", 30),("TD", 35), ("Brooke", 25)])

# Use map and reduceByKey transformations with their lambda expressions to aggregate and then compute average
quantityRDD = (dataRDD
    .map(lambda x: (x[0], (x[1], 1)))
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    .map(lambda x: (x[0], x[1][0] / x[1][1])))

quantityRDD.collect()

                                                                                

[('Brooke', 22.5), ('TD', 35.0), ('Jules', 30.0), ('Denny', 31.0)]
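To see what the map and reduceByKey steps compute, the same logic can be traced in plain Python without Spark. This is only an illustration of the (sum, count) technique on the notebook's sample data; it is not how Spark executes the job internally.

```python
data = [("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)]

# Step 1 (map): pair every quantity with a count of 1.
pairs = [(name, (quantity, 1)) for name, quantity in data]

# Step 2 (reduceByKey): add up the sums and the counts per name.
totals = {}
for name, (quantity, count) in pairs:
    running_sum, running_count = totals.get(name, (0, 0))
    totals[name] = (running_sum + quantity, running_count + count)

# Step 3 (map): divide each sum by its count to get the average.
averages = {name: total / count for name, (total, count) in totals.items()}

print(averages)
```

Carrying the count alongside the sum is what makes the reduction safe to apply in any order: partial (sum, count) pairs can be merged, whereas partial averages cannot.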

### 6.2 Using DataFrame

In [None]:
from pyspark.sql.functions import avg

# Create a DataFrame
data_df = spark.createDataFrame([("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)], ["name", "quantity"])

# Group the same names together, aggregate their quantities, and compute an average
avg_df = data_df.groupBy("name").agg(avg("quantity"))

# Show the results of the final execution
avg_df.show()

+------+-------------+
|  name|avg(quantity)|
+------+-------------+
|Brooke|         22.5|
| Denny|         31.0|
| Jules|         30.0|
|    TD|         35.0|
+------+-------------+



## 7. Another Example
https://zacks.one/spark-tutorial/#The-DataFrame-API

**Data**: The SF Fire Department data set.

### 7.1 Inferring the schema

In [None]:
fire_df = (spark
    .read
    .option("samplingRatio", 0.001)
    .option("header", True)
    .csv("data/sf-fire-calls.csv"))

fire_df.printSchema()

root
 |-- CallNumber: string (nullable = true)
 |-- UnitID: string (nullable = true)
 |-- IncidentNumber: string (nullable = true)
 |-- CallType: string (nullable = true)
 |-- CallDate: string (nullable = true)
 |-- WatchDate: string (nullable = true)
 |-- CallFinalDisposition: string (nullable = true)
 |-- AvailableDtTm: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: string (nullable = true)
 |-- Battalion: string (nullable = true)
 |-- StationArea: string (nullable = true)
 |-- Box: string (nullable = true)
 |-- OriginalPriority: string (nullable = true)
 |-- Priority: string (nullable = true)
 |-- FinalPriority: string (nullable = true)
 |-- ALSUnit: string (nullable = true)
 |-- CallTypeGroup: string (nullable = true)
 |-- NumAlarms: string (nullable = true)
 |-- UnitType: string (nullable = true)
 |-- UnitSequenceInCallDispatch: string (nullable = true)
 |-- FirePreventionDistrict: string (nullable = true)
 |-- Sup

### 7.2 Save to HDFS as a parquet file

In [None]:
parquet_path = "data/sf-fire-calls.parquet"
fire_df.write.format("parquet").save(parquet_path)

24/06/05 10:13:50 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/06/05 10:13:50 WARN MemoryManager: Total allocation exceeds 50.00% (520,880,128 bytes) of heap memory
Scaling row group sizes to 97.02% for 4 writers
24/06/05 10:13:50 WARN MemoryManager: Total allocation exceeds 50.00% (520,880,128 bytes) of heap memory
Scaling row group sizes to 77.62% for 5 writers
24/06/05 10:13:50 WARN MemoryManager: Total allocation exceeds 50.00% (520,880,128 bytes) of heap memory
Scaling row group sizes to 64.68% for 6 writers
24/06/05 10:13:50 WARN MemoryManager: Total allocation exceeds 50.00% (520,880,128 bytes) of heap memory
Scaling row group sizes to 55.44% for 7 writers
24/06/05 10:13:50 WARN MemoryManager: Total allocation exceeds 50.00% (520,880,128 bytes) of heap memory
Scaling row group sizes to 48.51% for 8 writers
24/06/05 10:13:50 WARN MemoryManager: Total al

In [None]:
fire_parquet = spark.read.parquet(parquet_path)
fire_parquet.show(5)

+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+----+-------+---------+-----------+----+----------------+--------+-------------+-------+-------------+---------+--------------+--------------------------+----------------------+------------------+------------------+--------------------+-------------+---------+
|CallNumber|UnitID|IncidentNumber|        CallType|  CallDate| WatchDate|CallFinalDisposition|       AvailableDtTm|             Address|City|Zipcode|Battalion|StationArea| Box|OriginalPriority|Priority|FinalPriority|ALSUnit|CallTypeGroup|NumAlarms|      UnitType|UnitSequenceInCallDispatch|FirePreventionDistrict|SupervisorDistrict|      Neighborhood|            Location|        RowID|    Delay|
+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+----+-------+---------+-----------+----+----------------+--------+----

### 7.3 Projections and Filters

A **projection** returns only a chosen subset of columns, while a **filter** returns only the rows that match a given condition. In Spark, projections are done with the select() method, and filters are expressed with the filter() or where() method.

In [None]:
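# Import only the functions used below; a wildcard import would shadow
# Python built-ins such as sum, min, and max.
from pyspark.sql.functions import col, countDistinct, to_timestamp, year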

fire_parquet.select("IncidentNumber", "AvailableDtTm", "CallType") \
    .where(col("CallType") != "Medical Incident") \
    .orderBy("IncidentNumber") \
    .show(5, truncate=False)

+--------------+----------------------+--------------+
|IncidentNumber|AvailableDtTm         |CallType      |
+--------------+----------------------+--------------+
|100000        |11/30/2000 12:25:31 PM|Structure Fire|
|10000108      |01/01/2010 02:20:58 AM|Alarms        |
|10000145      |01/01/2010 03:07:02 AM|Alarms        |
|10000149      |01/01/2010 03:10:44 AM|Outside Fire  |
|10000178      |01/01/2010 04:36:35 AM|Structure Fire|
+--------------+----------------------+--------------+
only showing top 5 rows



#### 7.3(a) Find the number of distinct CallTypes

In [None]:
(fire_parquet
    .select("CallType")
    .where(col("CallType").isNotNull())
    .agg(countDistinct("CallType").alias("DistinctCallTypes"))
    .show())

+-----------------+
|DistinctCallTypes|
+-----------------+
|               30|
+-----------------+



#### 7.3(b) List the distinct call types in the data set

In [None]:
# Filter for only distinct non-null CallTypes from all the rows
(fire_parquet
    .select("CallType")
    .where(col("CallType").isNotNull())
    .distinct()
    .show(10, False))

+-----------------------------+
|CallType                     |
+-----------------------------+
|Elevator / Escalator Rescue  |
|Aircraft Emergency           |
|Alarms                       |
|Odor (Strange / Unknown)     |
|Citizen Assist / Service Call|
|HazMat                       |
|Explosion                    |
|Oil Spill                    |
|Vehicle Fire                 |
|Suspicious Package           |
+-----------------------------+
only showing top 10 rows



### 7.4 Renaming Columns

The original column names in the SF Fire Department data set had spaces in them. For example, the column name IncidentNumber was Incident Number. Spaces in column names can be problematic, especially when you want to write or save a DataFrame as a Parquet file (which prohibits this).
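If a data set still has spaces in its column names, one way to remove them all at once is sketched below. The helper names `clean_column_name` and `with_clean_column_names` are assumptions made for this example; `df` is assumed to be a pyspark.sql.DataFrame.

```python
import re


def clean_column_name(name):
    """Remove all whitespace from a column name."""
    return re.sub(r"\s+", "", name)


def with_clean_column_names(df):
    """Return a new DataFrame whose column names contain no whitespace.

    toDF() renames all columns positionally in a single call, so the
    cleaned names are passed in the same order as df.columns.
    """
    return df.toDF(*[clean_column_name(c) for c in df.columns])


print(clean_column_name("Incident Number"))  # IncidentNumber
```

In the notebook this would be applied before writing, for example `with_clean_column_names(raw_df).write.parquet(...)`, where `raw_df` stands for a DataFrame read from the original file.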

#### 7.4(a) Renaming columns by specifying the desired column names in the schema with StructField

In [None]:
from pyspark.sql.types import *

fire_schema = StructType([StructField('CallNumber', IntegerType(), True),
                StructField('UnitID', StringType(), True),
                StructField('IncidentNumber', IntegerType(), True),
                StructField('CallType', StringType(), True),
                StructField('CallDate', StringType(), True),
                StructField('WatchDate', StringType(), True),
                StructField('CallFinalDisposition', StringType(), True),
                StructField('AvailableDtTm', StringType(), True),
                StructField('Address', StringType(), True),
                StructField('City', StringType(), True),
                StructField('Zipcode', IntegerType(), True),
                StructField('Battalion', StringType(), True),
                StructField('StationArea', StringType(), True),
                StructField('Box', StringType(), True),
                StructField('OriginalPriority', StringType(), True),
                StructField('Priority', StringType(), True),
                StructField('FinalPriority', IntegerType(), True),
                StructField('ALSUnit', BooleanType(), True),
                StructField('CallTypeGroup', StringType(), True),
                StructField('NumAlarms', IntegerType(), True),
                StructField('UnitType', StringType(), True),
                StructField('UnitSequenceInCallDispatch', IntegerType(), True),
                StructField('FirePreventionDistrict', StringType(), True),
                StructField('SupervisorDistrict', StringType(), True),
                StructField('Neighborhood', StringType(), True),
                StructField('Location', StringType(), True),
                StructField('RowID', StringType(), True),
                StructField('Delay', FloatType(), True)])

fire_df = (spark
 .read
 .csv("data/sf-fire-calls.csv", header=True, schema=fire_schema))

fire_df.show(3)

+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+----+-------+---------+-----------+----+----------------+--------+-------------+-------+-------------+---------+--------+--------------------------+----------------------+------------------+--------------------+--------------------+-------------+---------+
|CallNumber|UnitID|IncidentNumber|        CallType|  CallDate| WatchDate|CallFinalDisposition|       AvailableDtTm|             Address|City|Zipcode|Battalion|StationArea| Box|OriginalPriority|Priority|FinalPriority|ALSUnit|CallTypeGroup|NumAlarms|UnitType|UnitSequenceInCallDispatch|FirePreventionDistrict|SupervisorDistrict|        Neighborhood|            Location|        RowID|    Delay|
+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+----+-------+---------+-----------+----+----------------+--------+------------

#### 7.4(b) Renaming columns using the withColumnRenamed() method

In [None]:
# Change the name of the Delay column to ResponseDelayedinMins
new_fire_parquet = fire_parquet.withColumnRenamed("Delay",
                                                  "ResponseDelayedinMins")
# Select the delays longer than 5 minutes
(new_fire_parquet
    .select("ResponseDelayedinMins")
    .where(col("ResponseDelayedinMins") > 5)
    .show(5, False))

+---------------------+
|ResponseDelayedinMins|
+---------------------+
|7.2166667            |
|8.666667             |
|16.016666            |
|9.933333             |
|8.833333             |
+---------------------+
only showing top 5 rows



Because DataFrames are immutable, withColumnRenamed() returns a new DataFrame and leaves the original, with the old column name, unchanged.

In [None]:
fire_parquet.columns

['CallNumber',
 'UnitID',
 'IncidentNumber',
 'CallType',
 'CallDate',
 'WatchDate',
 'CallFinalDisposition',
 'AvailableDtTm',
 'Address',
 'City',
 'Zipcode',
 'Battalion',
 'StationArea',
 'Box',
 'OriginalPriority',
 'Priority',
 'FinalPriority',
 'ALSUnit',
 'CallTypeGroup',
 'NumAlarms',
 'UnitType',
 'UnitSequenceInCallDispatch',
 'FirePreventionDistrict',
 'SupervisorDistrict',
 'Neighborhood',
 'Location',
 'RowID',
 'Delay']

### 7.5 Format Conversion
1. Convert the existing column’s data type from string to a Spark-supported timestamp.
2. Use the new format specified in the format string "MM/dd/yyyy" or "MM/dd/yyyy hh:mm:ss a" where appropriate.
3. Drop the old column and append the new one specified in the first argument to the withColumn() method.
4. Assign the new modified DataFrame to fire_ts_df.
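To make the two format strings concrete, the sketch below parses sample values with Python's datetime.strptime as a stand-in. Spark's to_timestamp() uses its own pattern letters, so the Python directives shown are only rough equivalents: in the Spark pattern, hh is the hour on a 12-hour clock and a is the AM/PM marker. The first sample value is illustrative; the second is taken from the AvailableDtTm output shown earlier.

```python
from datetime import datetime

# Spark pattern             -> Python strptime stand-in
# "MM/dd/yyyy"              -> "%m/%d/%Y"
# "MM/dd/yyyy hh:mm:ss a"   -> "%m/%d/%Y %I:%M:%S %p"
date_only = datetime.strptime("01/11/2002", "%m/%d/%Y")
date_time = datetime.strptime("11/30/2000 12:25:31 PM", "%m/%d/%Y %I:%M:%S %p")

print(date_only)  # 2002-01-11 00:00:00
print(date_time)  # 2000-11-30 12:25:31
```

A date-only pattern produces a timestamp at midnight, which matches the 00:00:00 times in the IncidentDate and OnWatchDate columns.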

In [None]:
fire_ts_df = (fire_df
    .withColumn("IncidentDate", to_timestamp(col("CallDate"), "MM/dd/yyyy"))
    .drop("CallDate")
    .withColumn("OnWatchDate", to_timestamp(col("WatchDate"), "MM/dd/yyyy"))
    .drop("WatchDate")
    .withColumn("AvailableDtTS", to_timestamp(col("AvailableDtTm"),
                                              "MM/dd/yyyy hh:mm:ss a"))
    .drop("AvailableDtTm"))

# Select the converted columns
(fire_ts_df
    .select("IncidentDate", "OnWatchDate", "AvailableDtTS")
    .show(5, False))


+-------------------+-------------------+-------------------+
|IncidentDate       |OnWatchDate        |AvailableDtTS      |
+-------------------+-------------------+-------------------+
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 01:51:44|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 03:01:18|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 02:39:50|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 04:16:46|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 06:01:58|
+-------------------+-------------------+-------------------+
only showing top 5 rows



#### month(), year(), and day()

In [None]:
(fire_ts_df
    .select(year('IncidentDate'))
    .distinct()
    .orderBy(year('IncidentDate'))
    .show(5))

+------------------+
|year(IncidentDate)|
+------------------+
|              2000|
|              2001|
|              2002|
|              2003|
|              2004|
+------------------+
only showing top 5 rows



### 7.6 Aggregations
**groupBy()**, **orderBy()**, and **count()** make it possible to group rows by column name, count the rows in each group, and sort the result.

In [None]:
(fire_ts_df
    .select("CallType")
    .where(col("CallType").isNotNull())
    .groupBy("CallType")
    .count()
    .orderBy("count", ascending=False)
    .show(n=5, truncate=False))

+-----------------------------+------+
|CallType                     |count |
+-----------------------------+------+
|Medical Incident             |113794|
|Structure Fire               |23319 |
|Alarms                       |19406 |
|Traffic Collision            |7013  |
|Citizen Assist / Service Call|2524  |
+-----------------------------+------+
only showing top 5 rows
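
The same filter, group, count, and sort pipeline can be sketched in plain Python with collections.Counter. This is only an illustration of the logic on a small hypothetical list of call types, not how Spark executes the query.

```python
from collections import Counter

# Hypothetical CallType values, including one missing value (None).
call_types = [
    "Medical Incident", "Alarms", None, "Medical Incident",
    "Structure Fire", "Alarms", "Medical Incident",
]

# where(col("CallType").isNotNull()): drop the missing values.
non_null = [c for c in call_types if c is not None]

# groupBy("CallType").count(): one count per distinct value.
counts = Counter(non_null)

# orderBy("count", ascending=False) followed by show(n=2): largest groups first.
top = counts.most_common(2)
print(top)
```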



### 7.7 Descriptive Statistical Methods

The pyspark.sql.functions module provides aggregate functions such as min(), max(), sum(), and avg() for computing descriptive statistics over DataFrame columns.

In [None]:
import pyspark.sql.functions as F

(fire_ts_df
    .select(F.sum("NumAlarms"), F.avg("Delay"), F.min("Delay"), F.max("Delay"))
    .show())

+--------------+-----------------+-----------+----------+
|sum(NumAlarms)|       avg(Delay)| min(Delay)|max(Delay)|
+--------------+-----------------+-----------+----------+
|        176170|3.892364154521585|0.016666668|   1844.55|
+--------------+-----------------+-----------+----------+
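
One detail worth keeping in mind is that these Spark aggregates skip NULL values rather than treating them as zero. The plain-Python sketch below mimics that behaviour on a small hypothetical list of delays, where None stands in for NULL.

```python
# Hypothetical Delay values; None plays the role of a NULL entry.
delays = [2.5, None, 4.0, 0.5, 9.0]

# Aggregates only see the non-null values.
present = [d for d in delays if d is not None]

stats = {
    "sum": sum(present),
    "avg": sum(present) / len(present),  # divides by 4, not by 5
    "min": min(present),
    "max": max(present),
}
print(stats)
```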



## 8. Join Operations

Reference: https://pedropark99.github.io/Introd-pyspark/Chapters/08-transforming2.html

In [3]:
# Create sample data

info = [
    ('Mick', 'Rolling Stones', '1943-07-26', True),
    ('John', 'Beatles', '1940-09-10', True),
    ('Paul', 'Beatles', '1942-06-18', True),
    ('George', 'Beatles', '1943-02-25', True),
    ('Ringo', 'Beatles', '1940-07-07', True)
]

info = spark.createDataFrame(
    info,
    ['name', 'band', 'born', 'children']
)

band_instruments = [
    ('John', 'guitar'),
    ('Paul', 'bass'),
    ('Keith', 'guitar')
]

band_instruments = spark.createDataFrame(
    band_instruments,
    ['name', 'plays']
)

print("info:")
info.show()
print("band_instruments:")
band_instruments.show()

info:
+------+--------------+----------+--------+
|  name|          band|      born|children|
+------+--------------+----------+--------+
|  Mick|Rolling Stones|1943-07-26|    true|
|  John|       Beatles|1940-09-10|    true|
|  Paul|       Beatles|1942-06-18|    true|
|George|       Beatles|1943-02-25|    true|
| Ringo|       Beatles|1940-07-07|    true|
+------+--------------+----------+--------+

band_instruments:
+-----+------+
| name| plays|
+-----+------+
| John|guitar|
| Paul|  bass|
|Keith|guitar|
+-----+------+



In [4]:
# a. Inner join (the default)
info.join(band_instruments, on = 'name', how = 'inner')\
    .show()

+----+-------+----------+--------+------+
|name|   band|      born|children| plays|
+----+-------+----------+--------+------+
|John|Beatles|1940-09-10|    true|guitar|
|Paul|Beatles|1942-06-18|    true|  bass|
+----+-------+----------+--------+------+
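
As a rough mental model, an inner join keeps only the keys that appear on both sides. The plain-Python sketch below reproduces that with a dictionary lookup, using the name and band columns from the sample data. It assumes each name occurs at most once in band_instruments, which holds here but is not required by Spark.

```python
# Subset of the sample data: (name, band) and (name, plays).
info = [
    ("Mick", "Rolling Stones"),
    ("John", "Beatles"),
    ("Paul", "Beatles"),
    ("George", "Beatles"),
    ("Ringo", "Beatles"),
]
band_instruments = [("John", "guitar"), ("Paul", "bass"), ("Keith", "guitar")]

# Index the right side by the join key (assumes unique names).
plays_by_name = dict(band_instruments)

# Keep a left row only when its key also exists on the right.
inner = [
    (name, band, plays_by_name[name])
    for name, band in info
    if name in plays_by_name
]
print(inner)
```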



In [5]:
# b. Left anti join: rows of info that have no match in band_instruments
info.join(band_instruments, on = 'name', how = 'leftanti')\
    .show()

+------+--------------+----------+--------+
|  name|          band|      born|children|
+------+--------------+----------+--------+
|  Mick|Rolling Stones|1943-07-26|    true|
| Ringo|       Beatles|1940-07-07|    true|
|George|       Beatles|1943-02-25|    true|
+------+--------------+----------+--------+
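
A left anti join behaves like a "not in" filter on the join key: it returns left rows whose key is absent from the right side, and it never adds right-side columns. A minimal plain-Python sketch on the name and band columns of the sample data:

```python
# Subset of the sample data: (name, band) and (name, plays).
info = [
    ("Mick", "Rolling Stones"),
    ("John", "Beatles"),
    ("Paul", "Beatles"),
    ("George", "Beatles"),
    ("Ringo", "Beatles"),
]
band_instruments = [("John", "guitar"), ("Paul", "bass"), ("Keith", "guitar")]

# Collect the keys present on the right side.
right_names = {name for name, _ in band_instruments}

# Keep left rows whose key is NOT on the right; no columns are added.
left_anti = [row for row in info if row[0] not in right_names]
print(left_anti)
```

This pattern is handy for finding records that are missing from a lookup table.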



In [6]:
# c. Left outer join: keep every row of info; plays is NULL where no match exists
info.join(band_instruments, on = 'name', how = 'leftouter')\
    .show()

+------+--------------+----------+--------+------+
|  name|          band|      born|children| plays|
+------+--------------+----------+--------+------+
|  John|       Beatles|1940-09-10|    true|guitar|
|  Mick|Rolling Stones|1943-07-26|    true|  NULL|
| Ringo|       Beatles|1940-07-07|    true|  NULL|
|George|       Beatles|1943-02-25|    true|  NULL|
|  Paul|       Beatles|1942-06-18|    true|  bass|
+------+--------------+----------+--------+------+



In [7]:
# d. Right outer join: keep every row of band_instruments; info columns are NULL where no match exists
info.join(band_instruments, on = 'name', how = 'rightouter')\
    .show()

+-----+-------+----------+--------+------+
| name|   band|      born|children| plays|
+-----+-------+----------+--------+------+
| John|Beatles|1940-09-10|    true|guitar|
|Keith|   NULL|      NULL|    NULL|guitar|
| Paul|Beatles|1942-06-18|    true|  bass|
+-----+-------+----------+--------+------+



In [8]:
# e. Full outer join: keep rows from both sides, filling the missing columns with NULL
info.join(band_instruments, on = 'name', how = 'fullouter')\
    .show()

+------+--------------+----------+--------+------+
|  name|          band|      born|children| plays|
+------+--------------+----------+--------+------+
|George|       Beatles|1943-02-25|    true|  NULL|
|  John|       Beatles|1940-09-10|    true|guitar|
| Keith|          NULL|      NULL|    NULL|guitar|
|  Mick|Rolling Stones|1943-07-26|    true|  NULL|
|  Paul|       Beatles|1942-06-18|    true|  bass|
| Ringo|       Beatles|1940-07-07|    true|  NULL|
+------+--------------+----------+--------+------+
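
The three outer joins differ only in which side's unmatched keys are kept. The plain-Python sketch below builds the full outer result on the name and band columns of the sample data, with None standing in for NULL, and then derives the left and right outer results from it. It assumes names are unique on each side, and the sorting is only for readable output; Spark does not guarantee row order without an explicit orderBy().

```python
# Subset of the sample data, indexed by the join key.
band_by_name = {
    "Mick": "Rolling Stones",
    "John": "Beatles",
    "Paul": "Beatles",
    "George": "Beatles",
    "Ringo": "Beatles",
}
plays_by_name = {"John": "guitar", "Paul": "bass", "Keith": "guitar"}

# Full outer: the union of keys; dict.get returns None for a missing side.
all_names = sorted(set(band_by_name) | set(plays_by_name))
full_outer = [(n, band_by_name.get(n), plays_by_name.get(n)) for n in all_names]

# Left outer keeps only keys from the left side, right outer only from the right.
left_outer = [row for row in full_outer if row[0] in band_by_name]
right_outer = [row for row in full_outer if row[0] in plays_by_name]

for row in full_outer:
    print(row)
```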



In [9]:
# f. Left semi join: rows of info that have a match, without adding columns from band_instruments
info.join(band_instruments, on = 'name', how = 'leftsemi')\
    .show()

+----+-------+----------+--------+
|name|   band|      born|children|
+----+-------+----------+--------+
|John|Beatles|1940-09-10|    true|
|Paul|Beatles|1942-06-18|    true|
+----+-------+----------+--------+
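
With the sample data, the left semi join returns the same names as the inner join, so the difference is easy to miss. It shows up when the right side has repeated keys: an inner join emits one row per match, while a semi join emits each left row at most once. The plain-Python sketch below uses a hypothetical variant of band_instruments in which John appears twice.

```python
info = [("John", "Beatles"), ("Paul", "Beatles"), ("Mick", "Rolling Stones")]

# Hypothetical right side: John is listed with two instruments.
band_instruments = [("John", "guitar"), ("John", "piano"), ("Paul", "bass")]

# Inner join: one output row per matching (left, right) pair.
inner = [
    (name, band, plays)
    for name, band in info
    for right_name, plays in band_instruments
    if name == right_name
]

# Left semi join: an existence test, so each left row appears at most once.
right_names = {name for name, _ in band_instruments}
semi = [(name, band) for name, band in info if name in right_names]

print(len(inner), inner)
print(len(semi), semi)
```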



In [10]:
# g. Cross join: every row of info paired with every row of band_instruments (5 x 3 = 15 rows)
info.crossJoin(band_instruments)\
    .show()

+------+--------------+----------+--------+-----+------+
|  name|          band|      born|children| name| plays|
+------+--------------+----------+--------+-----+------+
|  Mick|Rolling Stones|1943-07-26|    true| John|guitar|
|  John|       Beatles|1940-09-10|    true| John|guitar|
|  Mick|Rolling Stones|1943-07-26|    true| Paul|  bass|
|  Mick|Rolling Stones|1943-07-26|    true|Keith|guitar|
|  John|       Beatles|1940-09-10|    true| Paul|  bass|
|  John|       Beatles|1940-09-10|    true|Keith|guitar|
|  Paul|       Beatles|1942-06-18|    true| John|guitar|
|George|       Beatles|1943-02-25|    true| John|guitar|
| Ringo|       Beatles|1940-07-07|    true| John|guitar|
|  Paul|       Beatles|1942-06-18|    true| Paul|  bass|
|  Paul|       Beatles|1942-06-18|    true|Keith|guitar|
|George|       Beatles|1943-02-25|    true| Paul|  bass|
|George|       Beatles|1943-02-25|    true|Keith|guitar|
| Ringo|       Beatles|1940-07-07|    true| Paul|  bass|
| Ringo|       Beatles|1940-07-07|    true|Keith|guitar|
+------+--------------+----------+--------+-----+------+
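
A cross join is a Cartesian product, so the number of output rows is the product of the two input sizes. The plain-Python sketch below checks that with itertools.product on the names from info and the rows of band_instruments.

```python
from itertools import product

# Names from info and rows from band_instruments in the sample data.
info_names = ["Mick", "John", "Paul", "George", "Ringo"]
band_instruments = [("John", "guitar"), ("Paul", "bass"), ("Keith", "guitar")]

# Pair every left row with every right row; no join key is involved.
pairs = list(product(info_names, band_instruments))

print(len(pairs))  # 5 left rows x 3 right rows
print(pairs[0])
```

Because the result grows multiplicatively, cross joins on large DataFrames can become very expensive and are usually reserved for small inputs.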

In [None]:
spark.stop()