# Spark Data Frames and Data Sets – Overview of APIs

Now that we know the basics of creating Data Frames and Data Sets, let us explore some key APIs to create Data Frames dynamically as well as to process the data.

* List of Important APIs
* Creating Data Frame Dynamically
* Data Frame Native Operations – Overview
* Spark SQL – Overview
* Saving Data Frames into Files – Overview

### List of Important APIs

Let us explore the important packages and methods which provide APIs to create, process, and write Data Frames or Data Sets.

* We have already seen **spark.read** to read data into a Data Frame.
* Data Frames also expose **write** APIs, which can be used to save a Data Frame to the underlying file system.
* **org.apache.spark.sql** has several other APIs for different purposes.
    * **org.apache.spark.sql.types** for pre-defined types or to create schemas dynamically.
    * **org.apache.spark.sql.functions** for pre-defined functions.
    * **org.apache.spark.sql.functions.udf** to create User Defined Functions for Data Frame Operations.
* **createTempView** or **createOrReplaceTempView** on a Data Frame to register it as an in-memory view and process data using SQL-based queries.
* **spark.sql** to run queries against Hive tables or temporary views, or even to run Hive commands.
* **spark.udf.register** to register standard Scala functions as SQL functions.
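
The difference between **org.apache.spark.sql.functions.udf** and **spark.udf.register** can be sketched as follows. This is only an illustration: the local SparkSession, the sample rows, and the names `toUpper`, `to_upper`, and `orders_demo` are assumptions made for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for experimentation.
// In spark-shell or a notebook, getOrCreate returns the existing session.
val spark = SparkSession.builder().master("local[1]").appName("udf-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val demoDF = Seq((1, "closed"), (2, "complete")).toDF("order_id", "order_status")

// A standard Scala function
val toUpper = (s: String) => s.toUpperCase

// 1. Wrap it with udf to use it in Data Frame Operations
val toUpperUdf = udf(toUpper)
val viaDataFrame = demoDF.select(col("order_id"), toUpperUdf(col("order_status")).alias("order_status"))

// 2. Register it under a SQL name to use it in spark.sql queries
spark.udf.register("to_upper", toUpper)
demoDF.createOrReplaceTempView("orders_demo")
val viaSql = spark.sql("SELECT order_id, to_upper(order_status) AS order_status FROM orders_demo")

viaDataFrame.show()
viaSql.show()
```

Both Data Frames should contain the same rows; the only difference is whether the function is invoked through a Column expression or through a SQL string.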

### Creating Data Frame Dynamically

Let us see how we can use StructType to create a Data Frame dynamically based on control files.

* We often receive metadata about the data in the form of control files.
* Control files contain information such as column names and data types.
* We need to create the fields of the Data Frame dynamically, using the column names and data types provided in the control files.
* The process is divided into two steps:
    * Create Schema
    * Create Data Frame using Schema

**Creating a Schema**

Here are the steps involved in creating a schema from control-file metadata and applying it to the data.

* Load the metadata from a file into a Scala collection.
* Build an array of fields using StructField with the column name and data type.
* Build a StructType from the array.
* Read the data from files and apply map to build an RDD of type Row, with one Row of attributes per record.
* Use spark.createDataFrame to create the Data Frame programmatically by passing the RDD of Rows and the StructType.

In [1]:
// CreateSchemaUsingMetadata.scala

import org.apache.spark.sql.types._

val schemaString = "order_id:int order_date:string order_customer_id:int order_status:string"

val a = schemaString.split(" ")

val fields = a.map(f => {
  // Each entry looks like "column_name:type": index 0 is the name, index 1 is the type
  if(f.split(":")(1) == "int") 
    StructField(f.split(":")(0), IntegerType)
  else
    StructField(f.split(":")(0), StringType)
})

schemaString = order_id:int order_date:string order_customer_id:int order_status:string
a = Array(order_id:int, order_date:string, order_customer_id:int, order_status:string)
fields = Array(StructField(order_id,IntegerType,true), StructField(order_date,StringType,true), StructField(order_customer_id,IntegerType,true), StructField(order_status,StringType,true))


Array(StructField(order_id,IntegerType,true), StructField(order_date,StringType,true), StructField(order_customer_id,IntegerType,true), StructField(order_status,StringType,true))

In [4]:
// Using pattern matching
val fields = a.map(f => f.split(":")(1) match {
  case "int" => StructField(f.split(":")(0), IntegerType)
  case _ => StructField(f.split(":")(0), StringType)
})

val schema = StructType(fields)

fields = Array(StructField(order_id,IntegerType,true), StructField(order_date,StringType,true), StructField(order_customer_id,IntegerType,true), StructField(order_status,StringType,true))
schema = StructType(StructField(order_id,IntegerType,true), StructField(order_date,StringType,true), StructField(order_customer_id,IntegerType,true), StructField(order_status,StringType,true))


StructType(StructField(order_id,IntegerType,true), StructField(order_date,StringType,true), StructField(order_customer_id,IntegerType,true), StructField(order_status,StringType,true))
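
The cells above hard-code the metadata in `schemaString`. A minimal sketch of loading the same metadata from a control file could look like this. The file layout (one `column_name:type` entry per line), the temporary file standing in for a real control file, and the helper name `toStructField` are assumptions made for illustration.

```scala
import java.nio.file.Files
import scala.collection.JavaConverters._
import org.apache.spark.sql.types._

// Assumption: the control file holds one "column_name:type" entry per line.
// A temporary file is written here only to keep the sketch self-contained.
val controlFile = Files.createTempFile("orders_schema", ".txt")
Files.write(
  controlFile,
  Seq("order_id:int", "order_date:string", "order_customer_id:int", "order_status:string").asJava
)

// Load the control file into a Scala collection, skipping blank lines
val entries = Files.readAllLines(controlFile).asScala.map(_.trim).filter(_.nonEmpty)

// Hypothetical helper: map one "name:type" entry to a StructField
def toStructField(entry: String): StructField = {
  val Array(name, dataType) = entry.split(":")
  dataType match {
    case "int" => StructField(name, IntegerType)
    case _     => StructField(name, StringType) // anything else is treated as a string
  }
}

val controlSchema = StructType(entries.map(toStructField).toArray)
controlSchema.printTreeString()
```

Splitting each entry once and matching on the type keeps the name and type lookups together, which makes it harder to mix up the two indexes.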

**Creating Data Frame Dynamically**

Let us see how we can create a Data Frame after defining the schema using metadata.

* Read the data from the file – orders.
* Apply the necessary transformation using map to create an RDD of type Row with four fields.
* Convert it into a Data Frame using **spark.createDataFrame**. It takes the RDD and the schema as arguments.
* The RDD is converted to a Data Frame using the schema defined earlier.


In [8]:
val inputBaseDir = "/public/retail_db"
val orders = sc.textFile(inputBaseDir + "/orders")

import org.apache.spark.sql.Row

val ordersRDD = orders.map(o => {
  val cols = o.split(",") // split each record only once
  Row(cols(0).toInt, cols(1), cols(2).toInt, cols(3))
})
val ordersDF = spark.createDataFrame(ordersRDD, schema)

ordersDF.printSchema
ordersDF.show

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:0

inputBaseDir = /public/retail_db
orders = /public/retail_db/orders MapPartitionsRDD[1] at textFile at <console>:42
ordersRDD = MapPartitionsRDD[2] at map at <console>:46
ordersDF = [order_id: int, order_date: string ... 2 more fields]


[order_id: int, order_date: string ... 2 more fields]

In [10]:
// Using pattern matching
val fields = a.map(f => f.split(":")(1) match {
  case "int" => StructField(f.split(":")(0), IntegerType)
  case _ => StructField(f.split(":")(0), StringType)
})

val schema = StructType(fields)
// Use the same HDFS location as in the previous cell. A local path such as
// /Users/... is resolved against HDFS on the cluster and fails with "Path does not exist".
val inputBaseDir = "/public/retail_db"

// Alternatively, apply the schema directly while reading the CSV files
val ordersDF = spark.
  read.
  schema(schema).
  csv(inputBaseDir + "/orders")

ordersDF.printSchema
ordersDF.show

### Data Frame Native Operations – Overview

Now that we have seen how to create Data Frames from files, RDDs etc., let us get into the high-level details of Data Frame Native Operations. We can perform all standard transformations using Data Frame Operations.

* Previewing Schema and Data – printSchema and show
* Row-level transformations – using select, withColumn
* Filtering the Data – filter or where. We can pass the condition either as a SQL-style string or as a Column expression (Data Frame Native Syntax).
* Aggregations – count, sum, avg, min, max etc.
* Sorting
* Ranking using Windowing or Analytical Functions

Let us see a few simple examples.

* Get orders for January 2014.
* Get count by status from filtered orders.
* Get revenue for each order_id from order_items.

In [16]:
import org.apache.spark.sql.functions.count

val inputBaseDir = "/public/retail_db_json"
val ordersDF = spark.read.json(inputBaseDir + "/orders")

// We can use either filter or where
ordersDF.where("order_date like '2014-01%'").show
ordersDF.where($"order_date".like("2014-01%")).show

ordersDF.filter("order_date like '2014-01%'").show
ordersDF.filter($"order_date".like("2014-01%")).show


+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|             3414|2014-01-01 00:00:...|   25876|PENDING_PAYMENT|
|             5549|2014-01-01 00:00:...|   25877|PENDING_PAYMENT|
|             9084|2014-01-01 00:00:...|   25878|        PENDING|
|             5118|2014-01-01 00:00:...|   25879|        PENDING|
|            10146|2014-01-01 00:00:...|   25880|       CANCELED|
|             3205|2014-01-01 00:00:...|   25881|PENDING_PAYMENT|
|             4598|2014-01-01 00:00:...|   25882|       COMPLETE|
|            11764|2014-01-01 00:00:...|   25883|        PENDING|
|             7904|2014-01-01 00:00:...|   25884|PENDING_PAYMENT|
|             7253|2014-01-01 00:00:...|   25885|        PENDING|
|             8195|2014-01-01 00:00:...|   25886|     PROCESSING|
|            10062|2014-01-01 00:00:...|   25887|        PENDING|
|         

inputBaseDir = /public/retail_db_json
ordersDF = [order_customer_id: bigint, order_date: string ... 2 more fields]


[order_customer_id: bigint, order_date: string ... 2 more fields]

In [20]:
ordersDF.
  where("order_date like '2014-01%'").
  groupBy("order_status").
  agg(count("order_status").alias("order_count")).
  show

+---------------+-----------+
|   order_status|order_count|
+---------------+-----------+
|PENDING_PAYMENT|       1334|
|       COMPLETE|       1911|
|        ON_HOLD|        365|
| PAYMENT_REVIEW|         77|
|     PROCESSING|        712|
|         CLOSED|        633|
|SUSPECTED_FRAUD|        131|
|        PENDING|        635|
|       CANCELED|        110|
+---------------+-----------+



In [22]:
// round and sum must be imported explicitly; only count was imported earlier
import org.apache.spark.sql.functions.{round, sum}

val orderItemsDF = spark.read.json(inputBaseDir + "/order_items")
orderItemsDF.
  groupBy("order_item_order_id").
  agg(round(sum("order_item_subtotal"), 2).alias("order_revenue")).
  show
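
Row-level transformations with withColumn, sorting, and ranking are listed in this section but not demonstrated. A minimal sketch on a small in-memory Data Frame follows. The local SparkSession, the sample rows, and the column names `order_month` and `rnk` are assumptions made for this example, not part of the retail_db data set.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank, substring}

// Assumption: a local session; in spark-shell or a notebook the existing one is reused
val spark = SparkSession.builder().master("local[1]").appName("native-ops-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample rows: (order_id, order_date, order_revenue)
val revenueDF = Seq(
  (1, "2014-01-01 00:00:00.0", 299.98),
  (2, "2014-01-15 00:00:00.0", 579.98),
  (3, "2014-02-01 00:00:00.0", 699.85),
  (4, "2014-02-10 00:00:00.0", 129.99)
).toDF("order_id", "order_date", "order_revenue")

// Row-level transformation: derive yyyy-MM from the date string
val withMonth = revenueDF.withColumn("order_month", substring(col("order_date"), 1, 7))

// Sorting: by month ascending, then by revenue descending
val sorted = withMonth.orderBy(col("order_month"), col("order_revenue").desc)

// Ranking: rank orders by revenue within each month using a window
val byMonth = Window.partitionBy("order_month").orderBy(col("order_revenue").desc)
val ranked = withMonth.withColumn("rnk", rank().over(byMonth))

sorted.show()
ranked.orderBy("order_month", "rnk").show()
```

Filtering `ranked` on `rnk === 1` would then give the highest-revenue order for each month.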

### Spark SQL – Overview

Let us get into the details related to Spark SQL. We can submit queries on Hive tables or in-memory temporary views using the spark.sql API.

* We can run queries directly on Hive tables, or on tables loaded from remote databases over JDBC, and get a Data Frame for the results.
* We can also register a temporary view for a Data Frame and run queries against it.
* We can perform all the standard transformations using SQL syntax.
* We can also run standard Hive commands using spark.sql, such as **show tables, describe table** etc.
* As part of Data Processing, we typically perform these operations.
    * Row Level Transformations (Data Standardization, Cleansing etc.)
    * Filtering the data
    * Joining the Data Sets
    * Aggregations such as sum, min, max
    * Sorting and Ranking
    * and more

Let us see a few simple examples.

* Get orders for January 2014.
* Get count by status from filtered orders.
* Get revenue for each order_id from order_items.

In [23]:
val inputBaseDir = "/public/retail_db_json"
val ordersDF = spark.read.json(inputBaseDir + "/orders")

// createOrReplaceTempView does not fail if the cell is run again in the same session
ordersDF.createOrReplaceTempView("orders")

spark.
  sql(s"""SELECT * FROM orders 
          WHERE order_date LIKE '2014-01%'""").
  show

spark.
  sql(s"""SELECT order_status, COUNT(1) order_count 
          FROM orders
          WHERE order_date LIKE '2014-01%'
          GROUP BY order_status""").
  show

+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|             3414|2014-01-01 00:00:...|   25876|PENDING_PAYMENT|
|             5549|2014-01-01 00:00:...|   25877|PENDING_PAYMENT|
|             9084|2014-01-01 00:00:...|   25878|        PENDING|
|             5118|2014-01-01 00:00:...|   25879|        PENDING|
|            10146|2014-01-01 00:00:...|   25880|       CANCELED|
|             3205|2014-01-01 00:00:...|   25881|PENDING_PAYMENT|
|             4598|2014-01-01 00:00:...|   25882|       COMPLETE|
|            11764|2014-01-01 00:00:...|   25883|        PENDING|
|             7904|2014-01-01 00:00:...|   25884|PENDING_PAYMENT|
|             7253|2014-01-01 00:00:...|   25885|        PENDING|
|             8195|2014-01-01 00:00:...|   25886|     PROCESSING|
|            10062|2014-01-01 00:00:...|   25887|        PENDING|
|         

inputBaseDir = /public/retail_db_json
ordersDF = [order_customer_id: bigint, order_date: string ... 2 more fields]


[order_customer_id: bigint, order_date: string ... 2 more fields]

In [24]:
val orderItemsDF = spark.read.json(inputBaseDir + "/order_items")

// createOrReplaceTempView does not fail if the cell is run again in the same session
orderItemsDF.createOrReplaceTempView("order_items")

spark.
  sql(s"""SELECT order_item_order_id, 
            ROUND(SUM(order_item_subtotal), 2) order_revenue 
          FROM order_items
          GROUP BY order_item_order_id
          ORDER BY order_item_order_id""").
  show

+-------------------+-------------+
|order_item_order_id|order_revenue|
+-------------------+-------------+
|                  1|       299.98|
|                  2|       579.98|
|                  4|       699.85|
|                  5|      1129.86|
|                  7|       579.92|
|                  8|       729.84|
|                  9|       599.96|
|                 10|       651.92|
|                 11|       919.79|
|                 12|      1299.87|
|                 13|       127.96|
|                 14|       549.94|
|                 15|       925.91|
|                 16|       419.93|
|                 17|       694.84|
|                 18|       449.96|
|                 19|       699.96|
|                 20|       879.86|
|                 21|       372.91|
|                 23|       299.98|
+-------------------+-------------+
only showing top 20 rows



orderItemsDF = [order_item_id: bigint, order_item_order_id: bigint ... 4 more fields]


[order_item_id: bigint, order_item_order_id: bigint ... 4 more fields]
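
Joins are listed among the typical operations but are not shown in the cells above. A minimal sketch of joining two temporary views with SQL follows. The local SparkSession, the sample rows, and the view names `orders_demo` and `order_items_demo` are assumptions made for illustration; the column names mirror those of orders and order_items.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell or a notebook the existing one is reused
val spark = SparkSession.builder().master("local[1]").appName("sql-join-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample orders
Seq((1, "2014-01-01", "CLOSED"), (2, "2014-01-01", "COMPLETE"), (3, "2014-01-02", "COMPLETE"))
  .toDF("order_id", "order_date", "order_status")
  .createOrReplaceTempView("orders_demo")

// Hypothetical sample order items
Seq((1, 1, 100.0), (2, 2, 50.0), (3, 2, 25.5), (4, 3, 10.0))
  .toDF("order_item_id", "order_item_order_id", "order_item_subtotal")
  .createOrReplaceTempView("order_items_demo")

// Join, filter, aggregate, and sort in a single query
val dailyRevenue = spark.sql("""
  SELECT o.order_date, ROUND(SUM(oi.order_item_subtotal), 2) AS revenue
  FROM orders_demo o JOIN order_items_demo oi
    ON o.order_id = oi.order_item_order_id
  WHERE o.order_status IN ('COMPLETE', 'CLOSED')
  GROUP BY o.order_date
  ORDER BY order_date
""")

dailyRevenue.show()
```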

### Saving Data Frames into Files – Overview

Once the data in a Data Frame is processed using either Data Frame Operations or Spark SQL, we can write the Data Frame into target systems.

* There are APIs to write data into files, Hive tables, as well as remote RDBMS tables over JDBC.
* The **write** API on a Data Frame (for example **write.json**, or **write.format** followed by **save**) is the main entry point to write data into files in supported file systems.
* We have APIs for these different file formats.
    * Text File Format – csv and text
    * parquet
    * orc
    * json
    * avro (requires the external spark-avro package)

* We will see the details at a later point in time. For now, we will simply validate by writing a Data Frame in JSON format.

Let us see a demo.

* Read JSON data from order_items.
* Compute revenue for each order. We can either use Data Frame Native Operations or Spark SQL for this purpose.
* Let us set spark.sql.shuffle.partitions to 2, so that data is aggregated using 2 tasks. By default, Spark SQL and Data Frame Operations use 200 shuffle partitions.
* As our data set is very small, it does not make sense to use 200 tasks to perform the aggregation.
* Save the data back to the file system in the form of JSON.
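
The demo steps above can be sketched as follows. To keep the example self-contained, a small in-memory Data Frame stands in for the JSON data read from order_items; the sample rows, the local SparkSession, and the temporary output directory are assumptions made for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{round, sum}

// Assumption: a local session; in spark-shell or a notebook the existing one is reused
val spark = SparkSession.builder().master("local[1]").appName("save-demo").getOrCreate()
import spark.implicits._

// Use 2 shuffle partitions instead of the default 200 for this small data set
spark.conf.set("spark.sql.shuffle.partitions", "2")

// Hypothetical rows in place of spark.read.json(inputBaseDir + "/order_items")
val orderItemsDemo = Seq((1, 299.98), (2, 199.99), (2, 250.0), (2, 129.99))
  .toDF("order_item_order_id", "order_item_subtotal")

// Compute revenue for each order
val orderRevenue = orderItemsDemo
  .groupBy("order_item_order_id")
  .agg(round(sum("order_item_subtotal"), 2).alias("order_revenue"))

// Hypothetical output location on the local file system;
// on a cluster this would typically be an HDFS path instead
val outputDir = Files.createTempDirectory("order_revenue").resolve("json").toString

// Save the result as JSON, replacing any previous output
orderRevenue.write.mode("overwrite").json(outputDir)

// Read the files back to validate what was written
spark.read.json(outputDir).show()
```

Reading the output back with spark.read.json is a simple way to confirm that the write succeeded and that the schema is what we expect.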