# How to Process IoT Device JSON Data Using Dataset

Datasets in Apache Spark 2.0 provide high-level domain-specific APIs along with structure and compile-time type-safety. You can read your JSON data file into a DataFrame, a collection of generic Row objects, and convert it into a type-specific collection of JVM objects.
In this notebook, you read a JSON file, convert the semi-structured JSON data into a typed Dataset[T], and work with some high-level Spark 2.0 Dataset APIs.

#### Spark supports multiple formats: JSON, CSV, Text, Parquet, ORC, etc. To read a JSON file, you can simply use the SparkSession handle spark.

In previous versions of Spark, you had to create a SparkConf and SparkContext to interact with Spark (http://bit.ly/2jOyWCE)

- // set up the Spark configuration and create the contexts
- val sparkConf = new SparkConf().setAppName("SparkSessionZipsExample").setMaster("local").set("spark.some.config.option", "some-value")
- // your handle to SparkContext to access other contexts like SQLContext
- val sc = new SparkContext(sparkConf)
- val sqlContext = new org.apache.spark.sql.SQLContext(sc)

In Spark 2.0, by contrast, the same effects can be achieved through SparkSession, without explicitly creating SparkConf, SparkContext or SQLContext, as they’re encapsulated within the SparkSession. Using a builder design pattern, it instantiates a SparkSession object if one does not already exist, along with its associated underlying contexts.
- // Create a SparkSession. No need to create SparkContext
- // You automatically get it as part of the SparkSession
- val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
- val spark = SparkSession
- .builder()
- .appName("SparkSessionZipsExample")
- .config("spark.sql.warehouse.dir", warehouseLocation)
- .enableHiveSupport()
- .getOrCreate()

At this point you can use the spark variable as your instance object to access its public methods and instances for the duration of your Spark job.

In [1]:
spark

org.apache.spark.sql.SparkSession@9a1777e6

In [2]:
spark.sparkContext

org.apache.spark.SparkContext@2efc56e8

In [3]:
spark.sqlContext

org.apache.spark.sql.SQLContext@23ddb267

In [4]:
spark.conf

org.apache.spark.sql.RuntimeConfig@fa33f052

In [5]:
sc

org.apache.spark.SparkContext@7cb5ce9f

In [6]:
spark.conf.get("spark.sql.warehouse.dir")

file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/sc07-a3c399a7caae2d-99fc3133bdbb/notebook/work/spark-warehouse/
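
Besides reading options such as the warehouse directory, the same `spark.conf` handle can set runtime options. The following is a minimal sketch, not part of the original notebook: the local master, the application name, and the value `6` are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the notebook's session if one exists; otherwise start a local one (assumption).
val spark = SparkSession.builder().master("local[*]").appName("ConfSketch").getOrCreate()

// Set a runtime SQL option; the value "6" is purely illustrative.
spark.conf.set("spark.sql.shuffle.partitions", "6")

// Read it back; RuntimeConfig returns values as strings.
val shufflePartitions: String = spark.conf.get("spark.sql.shuffle.partitions")
println(shufflePartitions)

// getAll returns the currently set options as a Map[String, String].
val allOptions: Map[String, String] = spark.conf.getAll
allOptions.keys.toSeq.sorted.take(5).foreach(println)
```

Options set this way apply to the running session only; they are not written back to any configuration file.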

# Reading JSON as a Dataset

## Create a case class to represent your IoT Device Data

In [1]:
// define a case class that represents our Device data.
// Use the Scala case class DeviceIoTData to convert the JSON device data into a Scala object.

// Import the implicits first; otherwise loading the Dataset below fails with:
// "Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._"
import spark.implicits._ 

case class DeviceIoTData (
 device_id: Long,  
 device_name: String,
 ip: String,
 cca2: String,
 cca3: String,
 cn: String,
 latitude: Double,
 longitude: Double,
 scale: String,
 temp: Long,
 humidity: Long,
 battery_level: Long,
 c02_level: Long,
 lcd: String,
 timestamp: Long
)

With the case class defined, read the JSON file through the SparkSession handle spark.

In [2]:
import org.apache.spark.sql.SparkSession

// @hidden_cell
// This function is used to setup the access of Spark to your Object Storage. The definition contains your credentials.
// You might want to remove those credentials before you share your notebook.
def setHadoopConfig19099026f8df40b6aec4353c7e897e95(name: String) = {
    // This function sets the Hadoop configuration so it is possible to
    // access data from Bluemix Object Storage using Spark

    val prefix = "fs.swift.service." + name
    sc.hadoopConfiguration.set(prefix + ".auth.url", "https://identity.open.softlayer.com" + "/v3/auth/tokens")
    sc.hadoopConfiguration.set(prefix + ".auth.endpoint.prefix","endpoints")
    sc.hadoopConfiguration.set(prefix + ".tenant", "cc29768790ec45439a43668592b02f84")
    sc.hadoopConfiguration.set(prefix + ".username", "a55ccc8b825944fa90f0188f8e5a2ffc")
    sc.hadoopConfiguration.set(prefix + ".password", "Q#i79zYI{qV?d74u")
    sc.hadoopConfiguration.setInt(prefix + ".http.port", 8080)
    sc.hadoopConfiguration.set(prefix + ".region", "dallas")
    sc.hadoopConfiguration.setBoolean(prefix + ".public", false)
}

// you can choose any name
val name = "keystone"
setHadoopConfig19099026f8df40b6aec4353c7e897e95(name)

val spark = SparkSession.
    builder().
    getOrCreate()


In [3]:
// Since JSON data can be semi-structured and contain additional metadata, it is possible that you might face issues with the DataFrame layout.
// Please read the documentation of 'SparkSession.read()' and 'DataFrameReader' to learn more about the possibilities to adjust the data loading.
// Spark documentation: http://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.sql.DataFrameReader@json%28paths:String*%29:org.apache.spark.sql.DataFrame

val ds1 = spark.read.json("swift://Databricks." + name + "/iot_devices.json")
ds1.take(10).foreach(println(_))


[8,868,US,USA,United States,1,meter-gauge-1xbYRYcj,51,68.161.225.1,38.0,green,-97.0,Celsius,34,1458444054093]
[7,1473,NO,NOR,Norway,2,sensor-pad-2n2Pea,70,213.161.254.1,62.47,red,6.15,Celsius,11,1458444054119]
[2,1556,IT,ITA,Italy,3,device-mac-36TWSKiT,44,88.36.5.1,42.83,red,12.83,Celsius,19,1458444054120]
[6,1080,US,USA,United States,4,sensor-pad-4mzWkz,32,66.39.173.154,44.06,yellow,-121.32,Celsius,28,1458444054121]
[4,931,PH,PHL,Philippines,5,therm-stick-5gimpUrBB,62,203.82.41.9,14.58,green,120.97,Celsius,25,1458444054122]
[3,1210,US,USA,United States,6,sensor-pad-6al7RTAobR,51,204.116.105.67,35.93,yellow,-85.46,Celsius,27,1458444054122]
[3,1129,CN,CHN,China,7,meter-gauge-7GeDoanM,26,220.173.179.1,22.82,yellow,108.32,Celsius,18,1458444054123]
[0,1536,JP,JPN,Japan,8,sensor-pad-8xUD6pzsQI,35,210.173.177.1,35.69,red,139.69,Celsius,27,1458444054123]
[3,807,JP,JPN,Japan,9,device-mac-9GcjZ2pw,85,118.23.68.227,35.69,green,139.69,Celsius,13,1458444054124]
[7,1470,US,USA,United States,10,se

### Three things happen under the hood when the JSON is read and converted:
1. Spark reads the JSON, infers the schema, and creates a DataFrame.
2. At this point, Spark holds your data as DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.
3. With the .as[DeviceIoTData] call in the next cell, Spark converts Dataset[Row] -> Dataset[DeviceIoTData], a collection of type-specific Scala JVM objects, as dictated by the class DeviceIoTData.

Most of us who work with structured data are accustomed to viewing and processing data either in a columnar manner or by accessing specific attributes within an object. With a Dataset as a collection of Dataset[ElementType] typed objects, you seamlessly get both compile-time safety and a custom view for strongly-typed JVM objects. The resulting strongly-typed Dataset[T] can be easily displayed or processed with high-level methods.
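
The read-then-convert flow described above can be tried on a small in-memory sample. This is a sketch under stated assumptions: `DeviceReading` is a hypothetical, trimmed-down stand-in for `DeviceIoTData`, the two JSON lines are made up, and the code is written script-style (as in a notebook cell or REPL), where top-level case classes work with Spark's encoders.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Reuse the notebook's session if one exists; otherwise start a local one (assumption).
val spark = SparkSession.builder().master("local[*]").appName("JsonToDataset").getOrCreate()
import spark.implicits._

// Hypothetical, trimmed-down record type; the notebook's DeviceIoTData has more fields.
case class DeviceReading(device_id: Long, device_name: String, temp: Long, latitude: Double)

// Two made-up JSON lines standing in for iot_devices.json.
val jsonLines = Seq(
  """{"device_id": 1, "device_name": "sensor-a", "temp": 34, "latitude": 38.0}""",
  """{"device_id": 2, "device_name": "sensor-b", "temp": 11, "latitude": 62.47}"""
)

// Steps 1 and 2: read the JSON and infer a schema, giving a DataFrame (Dataset[Row]).
// The RDD[String] overload of json() is used because it exists in Spark 2.0;
// newer releases deprecate it in favor of a Dataset[String] overload.
val df: DataFrame = spark.read.json(spark.sparkContext.parallelize(jsonLines))
df.printSchema()

// Step 3: bind each Row to the case class, matching columns by name.
val readings: Dataset[DeviceReading] = df.as[DeviceReading]

// Fields are now accessed with compile-checked dot notation.
val warm: Array[String] = readings.filter(r => r.temp > 25).map(_.device_name).collect()
warm.foreach(println)
```

Because whole numbers in JSON are inferred as 64-bit integers, the case class declares them as `Long`, mirroring the choice made in `DeviceIoTData`.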

In [4]:
val ds = spark.read.json("swift://Databricks." + name + "/iot_devices.json").as[DeviceIoTData]
ds.take(10).foreach(println(_))

DeviceIoTData(1,meter-gauge-1xbYRYcj,68.161.225.1,US,USA,United States,38.0,-97.0,Celsius,34,51,8,868,green,1458444054093)
DeviceIoTData(2,sensor-pad-2n2Pea,213.161.254.1,NO,NOR,Norway,62.47,6.15,Celsius,11,70,7,1473,red,1458444054119)
DeviceIoTData(3,device-mac-36TWSKiT,88.36.5.1,IT,ITA,Italy,42.83,12.83,Celsius,19,44,2,1556,red,1458444054120)
DeviceIoTData(4,sensor-pad-4mzWkz,66.39.173.154,US,USA,United States,44.06,-121.32,Celsius,28,32,6,1080,yellow,1458444054121)
DeviceIoTData(5,therm-stick-5gimpUrBB,203.82.41.9,PH,PHL,Philippines,14.58,120.97,Celsius,25,62,4,931,green,1458444054122)
DeviceIoTData(6,sensor-pad-6al7RTAobR,204.116.105.67,US,USA,United States,35.93,-85.46,Celsius,27,51,3,1210,yellow,1458444054122)
DeviceIoTData(7,meter-gauge-7GeDoanM,220.173.179.1,CN,CHN,China,22.82,108.32,Celsius,18,26,3,1129,yellow,1458444054123)
DeviceIoTData(8,sensor-pad-8xUD6pzsQI,210.173.177.1,JP,JPN,Japan,35.69,139.69,Celsius,27,35,0,1536,red,1458444054123)
DeviceIoTData(9,device-mac-9GcjZ2pw,

In [5]:
ds.getClass.getSimpleName

Dataset

In [6]:
ds.count()

198164

### Displaying your Dataset

Viewing a Dataset
Once you have loaded the JSON data and converted it into a Dataset, your type-specific collection of JVM objects, you can view it as you would view a DataFrame, by using either display(), where the notebook environment provides it, or standard Spark commands such as take(), foreach(), and println().



In [7]:
// display the dataset table just read in from the JSON file
//  display(ds)
ds.show()

+-------------+---------+----+----+-------------+---------+--------------------+--------+---------------+--------+------+---------+-------+----+-------------+
|battery_level|c02_level|cca2|cca3|           cn|device_id|         device_name|humidity|             ip|latitude|   lcd|longitude|  scale|temp|    timestamp|
+-------------+---------+----+----+-------------+---------+--------------------+--------+---------------+--------+------+---------+-------+----+-------------+
|            8|      868|  US| USA|United States|        1|meter-gauge-1xbYRYcj|      51|   68.161.225.1|    38.0| green|    -97.0|Celsius|  34|1458444054093|
|            7|     1473|  NO| NOR|       Norway|        2|   sensor-pad-2n2Pea|      70|  213.161.254.1|   62.47|   red|     6.15|Celsius|  11|1458444054119|
|            2|     1556|  IT| ITA|        Italy|        3| device-mac-36TWSKiT|      44|      88.36.5.1|   42.83|   red|    12.83|Celsius|  19|1458444054120|
|            6|     1080|  US| USA|United Stat

# Iterating, transforming, and filtering Dataset

Let's iterate over the first 10 entries with the foreach() method and print them

In [8]:
// Using the standard Spark commands, take() and foreach(), print the first
// 10 rows of the Dataset.
ds.take(10).foreach(println(_))

DeviceIoTData(1,meter-gauge-1xbYRYcj,68.161.225.1,US,USA,United States,38.0,-97.0,Celsius,34,51,8,868,green,1458444054093)
DeviceIoTData(2,sensor-pad-2n2Pea,213.161.254.1,NO,NOR,Norway,62.47,6.15,Celsius,11,70,7,1473,red,1458444054119)
DeviceIoTData(3,device-mac-36TWSKiT,88.36.5.1,IT,ITA,Italy,42.83,12.83,Celsius,19,44,2,1556,red,1458444054120)
DeviceIoTData(4,sensor-pad-4mzWkz,66.39.173.154,US,USA,United States,44.06,-121.32,Celsius,28,32,6,1080,yellow,1458444054121)
DeviceIoTData(5,therm-stick-5gimpUrBB,203.82.41.9,PH,PHL,Philippines,14.58,120.97,Celsius,25,62,4,931,green,1458444054122)
DeviceIoTData(6,sensor-pad-6al7RTAobR,204.116.105.67,US,USA,United States,35.93,-85.46,Celsius,27,51,3,1210,yellow,1458444054122)
DeviceIoTData(7,meter-gauge-7GeDoanM,220.173.179.1,CN,CHN,China,22.82,108.32,Celsius,18,26,3,1129,yellow,1458444054123)
DeviceIoTData(8,sensor-pad-8xUD6pzsQI,210.173.177.1,JP,JPN,Japan,35.69,139.69,Celsius,27,35,0,1536,red,1458444054123)
DeviceIoTData(9,device-mac-9GcjZ2pw,

# Ease-of-use of APIs with structure

Although structure may limit what your Spark program can do with data, it introduces rich semantics and an easy set of domain-specific operations that can be expressed as high-level constructs. Most computations can nevertheless be accomplished with the Dataset’s high-level APIs. For example, it’s much simpler to perform agg, select, sum, avg, map, filter, or groupBy operations by accessing the fields of a typed DeviceIoTData object than by using the data fields of RDD rows.

Expressing your computation in a domain-specific API is far simpler and easier than with relational algebra type expressions (in RDDs). For instance, the code below uses filter() and map() to create another immutable Dataset.

Like an RDD, a Dataset has transformation and action methods. Most important are the high-level domain-specific operations on named columns, such as select(), agg(), sum(), and avg(), which an RDD lacks because it has no schema to select from. For more information, look at the Scala Dataset API.

Let’s look at a few handy ones in action. In the examples below, we use filter(), map(), groupBy(), and avg(), all higher-level methods, to create other Datasets with only the fields that we wish to view. What’s noteworthy is that we access the attributes we want to filter by their names as defined in the case class. That is, we use dot notation to access individual fields, which makes the code easy to read and write.

## Processing a Dataset

In [9]:
// keep only the devices whose temperature exceeds 25 degrees, generate
// another Dataset with the three fields of interest, and then display
// the mapped Dataset
val dsTemp = ds.filter(d => d.temp > 25).map(d => (d.temp, d.device_name, d.cca3))
//display(dsTemp)
dsTemp.take(10)

Array((34,meter-gauge-1xbYRYcj,USA), (28,sensor-pad-4mzWkz,USA), (27,sensor-pad-6al7RTAobR,USA), (27,sensor-pad-8xUD6pzsQI,JPN), (26,sensor-pad-10BsywSYUF,USA), (31,meter-gauge-17zb8Fghhl,USA), (31,sensor-pad-18XULN9Xv,CHN), (29,meter-gauge-19eg1BpfCO,USA), (30,device-mac-21sjz5h,AUT), (28,sensor-pad-24PytzD00Cp,CAN))

In [21]:
dsTemp.take(10).foreach(println(_))

(34,meter-gauge-1xbYRYcj,USA)
(28,sensor-pad-4mzWkz,USA)
(27,sensor-pad-6al7RTAobR,USA)
(27,sensor-pad-8xUD6pzsQI,JPN)
(26,sensor-pad-10BsywSYUF,USA)
(31,meter-gauge-17zb8Fghhl,USA)
(31,sensor-pad-18XULN9Xv,CHN)
(29,meter-gauge-19eg1BpfCO,USA)
(30,device-mac-21sjz5h,AUT)
(28,sensor-pad-24PytzD00Cp,CAN)


## Performance and Optimization

Along with all the above benefits, the DataFrame and Dataset APIs bring space efficiency and performance gains, for two reasons.

First, because the DataFrame and Dataset APIs are built on top of the Spark SQL engine, they use Catalyst to generate an optimized logical and physical query plan. Across the R, Java, Scala, and Python DataFrame/Dataset APIs, all relational queries go through the same optimizer, providing the space and speed efficiency. Whereas the typed Dataset[T] API is optimized for data engineering tasks, the untyped Dataset[Row] (an alias of DataFrame) is even faster and suitable for interactive analysis.

Second, since Spark as a compiler understands your Dataset type JVM object, it maps your type-specific JVM object to Tungsten’s internal memory representation using Encoders. As a result, Tungsten Encoders can efficiently serialize/deserialize JVM objects as well as generate compact bytecode that can execute at superior speeds.
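
One way to see the difference between the typed and untyped APIs is to print the query plans with `explain(true)`. The sketch below is illustrative only: `Reading` and its two rows are hypothetical, and the exact plan text varies between Spark versions, so no output is shown. It is written script-style, as in a notebook cell.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the notebook's session if one exists; otherwise start a local one (assumption).
val spark = SparkSession.builder().master("local[*]").appName("PlanSketch").getOrCreate()
import spark.implicits._

// Hypothetical miniature of the device data.
case class Reading(device_name: String, temp: Long)
val readings = Seq(Reading("sensor-a", 34), Reading("sensor-b", 11)).toDS()

// Typed filter: the lambda is opaque to Catalyst, which can only invoke it per object.
val typedWarm = readings.filter(r => r.temp > 25)

// Untyped filter: the predicate is a Column expression that Catalyst can analyze.
val untypedWarm = readings.filter($"temp" > 25)

// explain(true) prints the parsed, analyzed, optimized, and physical plans.
typedWarm.explain(true)
untypedWarm.explain(true)

// Both forms select the same rows.
println(typedWarm.collect().toSeq == untypedWarm.collect().toSeq)
```

Comparing the two plans shows the column-based predicate as an expression on `temp`, while the lambda appears as a function call on the deserialized object.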

### When should I use DataFrames or Datasets?

- If you want rich semantics, high-level abstractions, and domain-specific APIs, use DataFrame or Dataset.
- If your processing demands high-level expressions, filters, maps, aggregation, averages, sum, SQL queries, columnar access and use of lambda functions on semi-structured data, use DataFrame or Dataset.
- If you want a higher degree of type-safety at compile time, want typed JVM objects, and want to take advantage of Catalyst optimization and Tungsten’s efficient code generation, use Dataset.
- If you want unification and simplification of APIs across Spark Libraries, use DataFrame or Dataset.
- If you are an R user, use DataFrames.
- If you are a Python user, use DataFrames and fall back to RDDs if you need more control.

#### Note that you can always seamlessly interoperate with, or convert from, a DataFrame or Dataset to an RDD with a simple method call, .rdd. For instance,

In [10]:
// select specific fields from the Dataset, apply a predicate
// using the where() method, convert to an RDD, and show the first 10
// RDD rows
val deviceEventsDS = ds.select($"device_name", $"cca3", $"c02_level").where($"c02_level" > 1300)
// convert to an RDD[Row]; take(10) is applied once, when printing
val eventsRDD = deviceEventsDS.rdd
eventsRDD.take(10).foreach(println(_))

[sensor-pad-2n2Pea,NOR,1473]
[device-mac-36TWSKiT,ITA,1556]
[sensor-pad-8xUD6pzsQI,JPN,1536]
[sensor-pad-10BsywSYUF,USA,1470]
[meter-gauge-11dlMTZty,ITA,1544]
[sensor-pad-14QL93sBR0j,NOR,1346]
[sensor-pad-16aXmIJZtdO,USA,1425]
[meter-gauge-17zb8Fghhl,USA,1466]
[meter-gauge-19eg1BpfCO,USA,1531]
[sensor-pad-22oWV2D,JPN,1522]


In [11]:
deviceEventsDS.take(10).foreach(println(_))

[sensor-pad-2n2Pea,NOR,1473]
[device-mac-36TWSKiT,ITA,1556]
[sensor-pad-8xUD6pzsQI,JPN,1536]
[sensor-pad-10BsywSYUF,USA,1470]
[meter-gauge-11dlMTZty,ITA,1544]
[sensor-pad-14QL93sBR0j,NOR,1346]
[sensor-pad-16aXmIJZtdO,USA,1425]
[meter-gauge-17zb8Fghhl,USA,1466]
[meter-gauge-19eg1BpfCO,USA,1531]
[sensor-pad-22oWV2D,JPN,1522]


In [12]:
// Apply higher-level Dataset API methods such as groupBy() and avg().
// Filter temperatures > 25, along with their corresponding
// devices' humidity, compute averages, groupBy cca3 country codes,
// and display the results, using table and bar charts
val dsAvgTmp = ds.filter(d => {d.temp > 25}).map(d => (d.temp, d.humidity, d.cca3)).groupBy($"_3").avg()
 
// display averages as a table, grouped by the country
//display(dsAvgTmp)
dsAvgTmp.take(10).foreach(println(_))

[PSE,30.88888888888889,62.22222222222222]
[HTI,30.6,75.0]
[POL,29.929577464788732,62.045271629778675]
[LVA,29.721804511278197,63.29323308270677]
[BRB,29.63157894736842,61.21052631578947]
[ZMB,30.0,60.0]
[JAM,30.466666666666665,69.46666666666667]
[BRA,30.09396551724138,61.126724137931035]
[ARM,30.09090909090909,58.27272727272727]
[MOZ,29.8,67.8]


For background on the Column class used in the $"..." expressions, see https://bzhangusc.wordpress.com/2015/03/29/the-column-class/

#### Select individual fields using the Dataset method select() where battery_level is greater than 6. 
#### Note this high-level domain specific language API reads like a SQL query

In [13]:
// Select individual fields using the Dataset method select()
// where battery_level is greater than 6. Note this high-level
// domain specific language API reads like a SQL query
// display(ds.select($"battery_level", $"c02_level", $"device_name").where($"battery_level" > 6).sort($"c02_level"))
ds.select($"battery_level", $"c02_level", $"device_name").where($"battery_level" > 6).sort($"c02_level").take(10).foreach(println(_))

[7,800,sensor-pad-146902NqACUHQISa]
[7,800,device-mac-155337F0m78fHJKl]
[7,800,sensor-pad-144602MYaTPv]
[7,800,device-mac-141741tnSSlTg]
[9,800,sensor-pad-142282fDTmdvJ]
[8,800,meter-gauge-148337o50gjrXGEi]
[8,800,sensor-pad-151562COzl8oo]
[9,800,device-mac-154917UU1XP7GTj]
[7,800,device-mac-156087C4ZyQ1]
[7,800,device-mac-171549aDMEfCiPH]


### Creating temporary table

In [14]:
// registering your Dataset as a temporary table to which you can issue SQL queries
ds.createOrReplaceTempView("iot_device_data")

In [15]:
val results =  spark.sql("select cca3 from iot_device_data")
results.show(5)

+----+
|cca3|
+----+
| USA|
| NOR|
| ITA|
| USA|
| PHL|
+----+
only showing top 5 rows



In [16]:
// Having saved the Dataset of DeviceIoTData as a temporary table, you can issue SQL queries to it.
spark.sql("select cca3, count (distinct device_id) as device_id from iot_device_data group by cca3 order by device_id desc limit 100").show()

+----+---------+
|cca3|device_id|
+----+---------+
| USA|    70405|
| CHN|    14455|
| JPN|    12100|
| KOR|    11879|
| DEU|     7942|
| GBR|     6486|
| CAN|     6041|
| RUS|     5989|
| FRA|     5305|
| BRA|     3224|
| AUS|     3119|
| ITA|     2915|
| SWE|     2880|
| POL|     2744|
| NLD|     2488|
| ESP|     2310|
| TWN|     2128|
| IND|     1867|
| CZE|     1507|
| NOR|     1487|
+----+---------+
only showing top 20 rows
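
The result of `spark.sql` is a DataFrame, so it can be converted back into a typed collection with `.as[...]`. The following is a minimal sketch on made-up data: the view name `devices_sketch`, the column alias `device_count`, and the three rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the notebook's session if one exists; otherwise start a local one (assumption).
val spark = SparkSession.builder().master("local[*]").appName("TempViewSketch").getOrCreate()
import spark.implicits._

// Hypothetical miniature of the device data: (device_id, cca3).
val devices = Seq((1L, "USA"), (2L, "NOR"), (3L, "USA")).toDF("device_id", "cca3")
devices.createOrReplaceTempView("devices_sketch")

// spark.sql returns a DataFrame; .as[(String, Long)] maps each Row to a tuple.
val perCountry: Array[(String, Long)] = spark.sql(
  """select cca3, count(distinct device_id) as device_count
    |from devices_sketch
    |group by cca3
    |order by device_count desc""".stripMargin
).as[(String, Long)].collect()

perCountry.foreach { case (country, n) => println(s"$country: $n") }
```

A temporary view lives only as long as the session that registered it, which is why the notebook later saves the data as a table.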



### Saving and Reading from Hive table with SparkSession

Next, we create a Hive table and issue queries against it using the SparkSession object, as you would with a HiveContext.

### Creating Hive Table

In [17]:
import sys.process._

In [18]:
//drop the table if exists to get around existing table error
spark.sql("DROP TABLE IF EXISTS iot_hive_table")

[]

In [19]:
"ls /gpfs/global_fs01/sym_shared/YPProdSpark/user/sc07-a3c399a7caae2d-99fc3133bdbb/notebook/work/spark-warehouse/iot_hive_table" !

ls: cannot access /gpfs/global_fs01/sym_shared/YPProdSpark/user/sc07-a3c399a7caae2d-99fc3133bdbb/notebook/work/spark-warehouse/iot_hive_table: No such file or directory


In [20]:
"rm -rf /gpfs/global_fs01/sym_shared/YPProdSpark/user/sc07-a3c399a7caae2d-99fc3133bdbb/notebook/work/spark-warehouse/iot_hive_table" !

#### Query the Hive table with the Spark SQL query

In [21]:
//drop the table if exists to get around existing table error
spark.sql("DROP TABLE IF EXISTS iot_hive_table")
//save as a hive table
spark.table("iot_device_data").write.saveAsTable("iot_hive_table")
//make a similar query against the hive table 
val resultsHiveDF = spark.sql("select cca3, count (distinct device_id) as device_id from iot_hive_table group by cca3 order by device_id desc limit 100")
resultsHiveDF.show(10)

+----+---------+
|cca3|device_id|
+----+---------+
| USA|    70405|
| CHN|    14455|
| JPN|    12100|
| KOR|    11879|
| DEU|     7942|
| GBR|     6486|
| CAN|     6041|
| RUS|     5989|
| FRA|     5305|
| BRA|     3224|
+----+---------+
only showing top 10 rows



### Working with and Accessing Catalog Metadata

In [22]:
spark.catalog.listTables.show()

+---------------+--------+-----------+---------+-----------+
|           name|database|description|tableType|isTemporary|
+---------------+--------+-----------+---------+-----------+
| iot_hive_table| default|       null|  MANAGED|      false|
|iot_device_data|    null|       null|TEMPORARY|       true|
+---------------+--------+-----------+---------+-----------+



In [23]:
spark.catalog.listDatabases.show()

+-------+----------------+--------------------+
|   name|     description|         locationUri|
+-------+----------------+--------------------+
|default|default database|file:/gpfs/global...|
+-------+----------------+--------------------+

