# Spark Data Frames and Data Sets – Getting Started

As part of this getting started session on Spark Data Frames and Data Sets, let us understand how to create Data Frames from different file formats and how to write Data Frames back to different file formats.

* Overview of Data Frames
* Creating Data Frames from RDD
* Creating Data Frames – File Formats
* Creating Data Sets from RDD
* Creating Data Sets from Data Frame

### Overview of Data Frames

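Data Frames are nothing but RDDs with named fields: the data is organized into columns, each with a name and a data type.

* With an RDD we do not have names to refer to fields.
* As there are no names, we can only access elements by position unless we use Object-Oriented concepts such as case classes.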
* Once Data Frames are created we can process data using multiple approaches
    * Data Frame Operations
    * Spark SQL
* We can create Data Frames from files using APIs, from Hive tables as well as over JDBC.
* Data Frames can be written into different file formats, Hive tables as well as remote databases over JDBC.
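
To make the difference between positional and named access concrete, here is a minimal sketch. The sample rows, the view name sample_orders and the local session settings are assumptions made for illustration; in a notebook or spark-shell, getOrCreate simply returns the session that already exists.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the active session if there is one; otherwise starts a local one (assumption)
val spark = SparkSession.builder().master("local[*]").appName("df-overview").getOrCreate()
import spark.implicits._

// Hypothetical sample rows: (order_id, order_status)
val sampleRDD = spark.sparkContext.parallelize(Seq((1, "CLOSED"), (2, "COMPLETE"), (3, "COMPLETE")))

// RDD of tuples: fields are reachable only by position (_1, _2)
val completeFromRDD = sampleRDD.filter(t => t._2 == "COMPLETE").count()

// Data Frame: the same fields now have names
val sampleDF = sampleRDD.toDF("order_id", "order_status")

// Data Frame Operations
val completeFromDF = sampleDF.filter($"order_status" === "COMPLETE").count()

// Spark SQL over a temporary view
sampleDF.createOrReplaceTempView("sample_orders")
val completeFromSQL = spark.
  sql("select count(1) from sample_orders where order_status = 'COMPLETE'").
  first().
  getLong(0)
```

All three approaches count the same rows; only the way the fields are referred to changes.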

### Creating Data Frames from RDD

Now let us see how we can create Data Frames from RDD.

* We do not use this approach often, as spark.read APIs can load data from flat files into Data Frames directly.
* There are some file formats which are available on sc, but not on spark.read. In those scenarios we might have to load data from files into an RDD, extract information using map and then create a Data Frame.
* Let us see an example where we read comma separated data from text files and create a Data Frame out of it.
    * Read text data using sc.textFile
    * Apply map to convert each record into a tuple with the right data types
    * Use toDF to create the Data Frame. The data types come from the tuple, and column names can be passed to toDF as arguments.


In [38]:
val inputBaseDir = "/public/retail_db"
val orders = sc.textFile(inputBaseDir + "/orders")

// Without column names
val ordersDF = orders.
  map(o => {
    val a = o.split(",")
    (a(0).toInt, a(1), a(2).toInt, a(3))
  }).
  toDF

ordersDF.printSchema
ordersDF.show


root
 |-- _1: integer (nullable = false)
 |-- _2: string (nullable = true)
 |-- _3: integer (nullable = false)
 |-- _4: string (nullable = true)

+---+--------------------+-----+---------------+
| _1|                  _2|   _3|             _4|
+---+--------------------+-----+---------------+
|  1|2013-07-25 00:00:...|11599|         CLOSED|
|  2|2013-07-25 00:00:...|  256|PENDING_PAYMENT|
|  3|2013-07-25 00:00:...|12111|       COMPLETE|
|  4|2013-07-25 00:00:...| 8827|         CLOSED|
|  5|2013-07-25 00:00:...|11318|       COMPLETE|
|  6|2013-07-25 00:00:...| 7130|       COMPLETE|
|  7|2013-07-25 00:00:...| 4530|       COMPLETE|
|  8|2013-07-25 00:00:...| 2911|     PROCESSING|
|  9|2013-07-25 00:00:...| 5657|PENDING_PAYMENT|
| 10|2013-07-25 00:00:...| 5648|PENDING_PAYMENT|
| 11|2013-07-25 00:00:...|  918| PAYMENT_REVIEW|
| 12|2013-07-25 00:00:...| 1837|         CLOSED|
| 13|2013-07-25 00:00:...| 9149|PENDING_PAYMENT|
| 14|2013-07-25 00:00:...| 9842|     PROCESSING|
| 15|2013-07-25 00:00

inputBaseDir = /public/retail_db
orders = /public/retail_db/orders MapPartitionsRDD[92] at textFile at <console>:39
ordersDF = [_1: int, _2: string ... 2 more fields]


[_1: int, _2: string ... 2 more fields]

In [39]:
// With column names

val inputBaseDir = "/public/retail_db"
val ordersDF = orders.
  map(o => {
    val a = o.split(",")
    (a(0).toInt, a(1), a(2).toInt, a(3))
  }).
  toDF("order_id", "order_date", "order_customer_id", "order_status")

ordersDF.printSchema
ordersDF.show

root
 |-- order_id: integer (nullable = false)
 |-- order_date: string (nullable = true)
 |-- order_customer_id: integer (nullable = false)
 |-- order_status: string (nullable = true)

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00

inputBaseDir = /public/retail_db
ordersDF = [order_id: int, order_date: string ... 2 more fields]


[order_id: int, order_date: string ... 2 more fields]
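
When column names, data types and nullability all need to be declared explicitly, one alternative to toDF is to build an RDD of Row objects and pass a StructType to spark.createDataFrame. The following is a sketch under assumptions: the two input lines are hypothetical samples in the same comma separated layout, and the names lines, orderSchema and ordersWithSchema are chosen only for this example.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Reuses the active session if there is one; otherwise starts a local one (assumption)
val spark = SparkSession.builder().master("local[*]").appName("rdd-to-df").getOrCreate()

// Hypothetical lines in the same comma separated layout as the orders data
val lines = spark.sparkContext.parallelize(Seq(
  "1,2013-07-25 00:00:00.0,11599,CLOSED",
  "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT"
))

// Column names, data types and nullability are declared up front
val orderSchema = StructType(Seq(
  StructField("order_id", IntegerType, nullable = false),
  StructField("order_date", StringType, nullable = true),
  StructField("order_customer_id", IntegerType, nullable = false),
  StructField("order_status", StringType, nullable = true)
))

// Each line becomes a Row whose values match the schema positions and types
val orderRows = lines.map { line =>
  val a = line.split(",")
  Row(a(0).toInt, a(1), a(2).toInt, a(3))
}

val ordersWithSchema = spark.createDataFrame(orderRows, orderSchema)
ordersWithSchema.printSchema()
```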

### Creating Data Frames – File Formats

Spark supports multiple file formats out of the box. We will just go through an overview for now and get into the details at a later point in time.

* spark.read provides APIs to read data from files of different file formats, either through format-specific functions such as csv and json or through the generic format and load functions.
* Supported file formats
    * Text File Format – csv and text
    * parquet
    * orc
    * json
    * avro (requires an additional package)
* File formats such as parquet, orc and avro store metadata along with data, and JSON records carry their field names. Hence when we create a Data Frame out of these file formats, it typically inherits the schema. For csv and text files the schema has to be supplied or inferred.
* Let us create a Data Frame out of JSON data and process it using both Data Frame Operations as well as Spark SQL.

In [1]:
val ordersDF = spark.read.json("/public/retail_db_json/orders")

// Using Data Frame Operations
ordersDF.
  groupBy("order_date").
  count.
  show


+--------------------+-----+
|          order_date|count|
+--------------------+-----+
|2013-08-13 00:00:...|   73|
|2013-10-12 00:00:...|  162|
|2013-11-15 00:00:...|  135|
|2014-03-19 00:00:...|  130|
|2014-04-26 00:00:...|  251|
|2013-09-16 00:00:...|  121|
|2013-09-20 00:00:...|  139|
|2013-12-31 00:00:...|  266|
|2013-09-06 00:00:...|  276|
|2014-06-15 00:00:...|  128|
|2013-12-24 00:00:...|  170|
|2014-01-07 00:00:...|  163|
|2014-06-07 00:00:...|  191|
|2013-10-14 00:00:...|  139|
|2013-11-11 00:00:...|  246|
|2014-01-27 00:00:...|  163|
|2014-01-29 00:00:...|  158|
|2014-02-14 00:00:...|  174|
|2014-04-15 00:00:...|  180|
|2014-04-22 00:00:...|  144|
+--------------------+-----+
only showing top 20 rows



ordersDF = [order_customer_id: bigint, order_date: string ... 2 more fields]


[order_customer_id: bigint, order_date: string ... 2 more fields]

In [2]:
import org.apache.spark.sql.functions.count

val ordersDF = spark.read.json("/public/retail_db_json/orders")

ordersDF.
  groupBy("order_date").
  agg(count("order_date").alias("order_count")).
  show

+--------------------+-----------+
|          order_date|order_count|
+--------------------+-----------+
|2013-08-13 00:00:...|         73|
|2013-10-12 00:00:...|        162|
|2013-11-15 00:00:...|        135|
|2014-03-19 00:00:...|        130|
|2014-04-26 00:00:...|        251|
|2013-09-16 00:00:...|        121|
|2013-09-20 00:00:...|        139|
|2013-12-31 00:00:...|        266|
|2013-09-06 00:00:...|        276|
|2014-06-15 00:00:...|        128|
|2013-12-24 00:00:...|        170|
|2014-01-07 00:00:...|        163|
|2014-06-07 00:00:...|        191|
|2013-10-14 00:00:...|        139|
|2013-11-11 00:00:...|        246|
|2014-01-27 00:00:...|        163|
|2014-01-29 00:00:...|        158|
|2014-02-14 00:00:...|        174|
|2014-04-15 00:00:...|        180|
|2014-04-22 00:00:...|        144|
+--------------------+-----------+
only showing top 20 rows



ordersDF = [order_customer_id: bigint, order_date: string ... 2 more fields]


[order_customer_id: bigint, order_date: string ... 2 more fields]

In [None]:
// Using Spark SQL
val ordersDF = spark.read.json("/public/retail_db_json/orders")

// The view name must match the table name used in the query below
ordersDF.createOrReplaceTempView("orders")
spark.
  sql("select order_date, count(1) order_count from orders group by order_date").
  show
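
Writing a Data Frame back to files follows the same pattern through the write API. The sketch below assumes a local file system; the sample rows and the temporary output directory are hypothetical. It also shows how the schema behaves when the data is read back from parquet and from csv.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the active session if there is one; otherwise starts a local one (assumption)
val spark = SparkSession.builder().master("local[*]").appName("df-file-formats").getOrCreate()
import spark.implicits._

// Hypothetical sample data and a temporary output directory
val sampleDF = Seq((1, "CLOSED"), (2, "COMPLETE")).toDF("order_id", "order_status")
val outputDir = Files.createTempDirectory("orders_formats").toString

// Write the same Data Frame in two formats
sampleDF.write.mode("overwrite").parquet(outputDir + "/parquet")
sampleDF.write.mode("overwrite").csv(outputDir + "/csv")

// parquet stores the schema, so names and types come back as written
val fromParquet = spark.read.parquet(outputDir + "/parquet")
fromParquet.printSchema()

// csv without a header or schema comes back as string columns named _c0, _c1
val fromCsv = spark.read.csv(outputDir + "/csv")
fromCsv.printSchema()

// Supplying column names and types restores the structure for csv
val fromCsvTyped = spark.read.
  schema("order_id INT, order_status STRING").
  csv(outputDir + "/csv")
fromCsvTyped.printSchema()
```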

### Creating Data Sets from RDD

Let us see how we can create a Data Set from an RDD. We need to use case classes, which come as part of Scala, to create Data Sets.

* Create RDD by reading data from the file

**Recap of Case Classes**

Let us quickly review some of the concepts related to Case Classes as we need to use them for creating Data Sets.

* Create case class – we will review some of the important concepts of Case Class
    * We get getters for all the fields with case classes. Setters are generated only for fields declared as var.
    * It implements Serializable and Product
    * While Serializable gives us the ability to convert the object to a byte stream, Product gives us functionality such as productArity, productIterator etc.
    * productArity gives us the number of elements while productIterator returns an iterator over the case class attribute values.
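
The points above can be checked with plain Scala, without Spark. OrderSample is a hypothetical case class used only for this recap.

```scala
// Hypothetical case class used only for this recap
case class OrderSample(order_id: Int, order_status: String)

val sample = OrderSample(1, "COMPLETE")

// Constructor parameters are vals: read access works, reassignment does not compile
val id = sample.order_id
// sample.order_id = 2   // would fail with "reassignment to val"

// copy builds a modified instance instead of mutating the original
val closed = sample.copy(order_status = "CLOSED")

// Product members
val arity = sample.productArity                    // number of fields
val joined = sample.productIterator.mkString(":")  // field values in declaration order

// This assignment compiles only because case classes extend Serializable
val asSerializable: java.io.Serializable = sample
```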

**Creating Data Sets**

Now that we have reviewed Case Classes, let us go ahead and create Data Sets from RDD.

* Data Frames can be processed either by using Data Frame Operations or Spark SQL, whereas Data Sets can also be processed using typed APIs such as map and filter that take Scala functions working on case class objects.
* The APIs to process data using Data Frame Operations or Spark SQL are the same for both Data Frames and Data Sets.

In [None]:
val inputBaseDir = "/public/retail_db"
val orders = sc.textFile(inputBaseDir + "/orders")

case class Order (
  order_id: Int,
  order_date: String,
  order_customer_id: Int,
  order_status: String
)

// Case classes can be validated using :javap -p Order

val o = Order(1, "2013-07-25", 1, "COMPLETE")

In [None]:
o.order_id

In [None]:
o.productIterator.mkString(":")

In [None]:
// Creating Data Set
val ordersDS = orders.
  map(o => {
    val a = o.split(",")
    Order(a(0).toInt, a(1), a(2).toInt, a(3))
  }).
  toDS

ordersDS.printSchema

In [51]:
ordersDS.show

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:...|             5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:...|              918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:...|             1837|         CLOSED|
|      13|

In [52]:
ordersDS.map(o => o.order_id).show

+-----+
|value|
+-----+
|    1|
|    2|
|    3|
|    4|
|    5|
|    6|
|    7|
|    8|
|    9|
|   10|
|   11|
|   12|
|   13|
|   14|
|   15|
|   16|
|   17|
|   18|
|   19|
|   20|
+-----+
only showing top 20 rows



In [53]:
ordersDS.select("order_id").show

+--------+
|order_id|
+--------+
|       1|
|       2|
|       3|
|       4|
|       5|
|       6|
|       7|
|       8|
|       9|
|      10|
|      11|
|      12|
|      13|
|      14|
|      15|
|      16|
|      17|
|      18|
|      19|
|      20|
+--------+
only showing top 20 rows
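
To put the three processing styles side by side on one Data Set, here is a small sketch. The case class OrderLite, its rows and the view name orders_lite are hypothetical and kept separate from Order above.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the active session if there is one; otherwise starts a local one (assumption)
val spark = SparkSession.builder().master("local[*]").appName("ds-processing").getOrCreate()
import spark.implicits._

// Hypothetical case class and rows
case class OrderLite(order_id: Int, order_status: String)

val liteDS = Seq(
  OrderLite(1, "CLOSED"),
  OrderLite(2, "COMPLETE"),
  OrderLite(3, "COMPLETE")
).toDS()

// Typed API: the functions receive OrderLite objects, so field names are checked at compile time
val typedIds = liteDS.
  filter(o => o.order_status == "COMPLETE").
  map(o => o.order_id).
  collect().
  sorted

// Data Frame Operations: columns are referenced by name and resolved at run time
val untypedIds = liteDS.
  filter($"order_status" === "COMPLETE").
  select("order_id").
  as[Int].
  collect().
  sorted

// Spark SQL over the same Data Set
liteDS.createOrReplaceTempView("orders_lite")
val sqlCount = spark.
  sql("select count(1) from orders_lite where order_status = 'COMPLETE'").
  first().
  getLong(0)
```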



### Creating Data Sets from Data Frame

We can create a Data Set from a Data Frame using the as function by applying a case class. But you need to make sure the data types are in sync.

* Create Data Frame from JSON
* Create case class with required fields and appropriate data types
* Use as to convert the Data Frame to a Data Set.
* Make sure the data types between Data Frame and case class are compatible.
* Also, the value returned by map needs an Encoder. Implicit Encoders are available for types such as Int, Long and String, so convert BigInt values (for example with toLong) before returning them from map.

In [54]:
val ordersDF = spark.read.json("/public/retail_db_json/orders")

ordersDF.printSchema

case class Order (
  order_id: BigInt,
  order_date: String,
  order_customer_id: BigInt,
  order_status: String
)


root
 |-- order_customer_id: long (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_id: long (nullable = true)
 |-- order_status: string (nullable = true)



ordersDF = [order_customer_id: bigint, order_date: string ... 2 more fields]
defined class Order


[order_customer_id: bigint, order_date: string ... 2 more fields]

In [55]:
ordersDF.as[Order].show

+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|            11599|2013-07-25 00:00:...|       1|         CLOSED|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|
|            12111|2013-07-25 00:00:...|       3|       COMPLETE|
|             8827|2013-07-25 00:00:...|       4|         CLOSED|
|            11318|2013-07-25 00:00:...|       5|       COMPLETE|
|             7130|2013-07-25 00:00:...|       6|       COMPLETE|
|             4530|2013-07-25 00:00:...|       7|       COMPLETE|
|             2911|2013-07-25 00:00:...|       8|     PROCESSING|
|             5657|2013-07-25 00:00:...|       9|PENDING_PAYMENT|
|             5648|2013-07-25 00:00:...|      10|PENDING_PAYMENT|
|              918|2013-07-25 00:00:...|      11| PAYMENT_REVIEW|
|             1837|2013-07-25 00:00:...|      12|         CLOSED|
|         

In [56]:
// Convert the BigInt field to Long before returning it from map
ordersDF.as[Order].map(o => o.order_id.toLong).show

+-----+
|value|
+-----+
|    1|
|    2|
|    3|
|    4|
|    5|
|    6|
|    7|
|    8|
|    9|
|   10|
|   11|
|   12|
|   13|
|   14|
|   15|
|   16|
|   17|
|   18|
|   19|
|   20|
+-----+
only showing top 20 rows
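
An alternative that avoids the conversion is to declare the numeric fields as Long, which matches the long type inferred from JSON. This is a sketch under assumptions: the JSON records are hypothetical samples with the same field names as the orders data, and OrderLong is a case class introduced only for this example.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the active session if there is one; otherwise starts a local one (assumption)
val spark = SparkSession.builder().master("local[*]").appName("df-to-ds").getOrCreate()
import spark.implicits._

// Hypothetical JSON records with the same field names as the orders data
val jsonLines = Seq(
  """{"order_id": 1, "order_date": "2013-07-25 00:00:00.0", "order_customer_id": 11599, "order_status": "CLOSED"}""",
  """{"order_id": 2, "order_date": "2013-07-25 00:00:00.0", "order_customer_id": 256, "order_status": "PENDING_PAYMENT"}"""
).toDS()

// Whole numbers in JSON are inferred as long (bigint)
val jsonDF = spark.read.json(jsonLines)
jsonDF.printSchema()

// Hypothetical case class: Long fields line up with the inferred schema
case class OrderLong(
  order_id: Long,
  order_date: String,
  order_customer_id: Long,
  order_status: String
)

// Fields are matched to columns by name, not by position
val jsonDS = jsonDF.as[OrderLong]

// Long has an implicit Encoder, so no conversion is needed inside map
val ids = jsonDS.map(o => o.order_id).collect().sorted
```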

