### Processing and Querying JSON files with Spark SQL 

JSON files are generally difficult to analyze.
The challenges lie mainly in defining a schema, maintaining that schema, and accessing the fields of the JSON dataset.

Spark's JSON support addresses each of these problems: fresh data can be queried without a mandatory ETL step, the schema is inferred automatically rather than defined by hand, and the fields of a complex nested structure can be reached directly in queries without writing many UDFs.
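To make "automatically inferred" concrete, here is a toy sketch in plain Python, with no Spark involved. It walks one parsed JSON record and reports a type for every field, which is the general idea behind the schema printed later in this notebook. It is not Spark's actual inference algorithm; the function name `infer_schema`, the type labels, and the sample values are assumptions made up for illustration (only the field names come from the dataset).

```python
import json

def infer_schema(value):
    """Return a nested description of the JSON types found in `value`."""
    if isinstance(value, dict):
        # A JSON object becomes a struct: one entry per field.
        return {key: infer_schema(item) for key, item in value.items()}
    if isinstance(value, list):
        # Describe an array by its first element (a simplification).
        return [infer_schema(value[0])] if value else ["unknown"]
    if isinstance(value, bool):
        # Check bool before int, because bool is a subclass of int in Python.
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if value is None:
        return "null"
    return "string"

# Hypothetical sample record: field names follow the dataset, values are made up.
raw = """
{"attribution": "City of Chicago",
 "averageRating": 0,
 "columns": [{"cachedContents": {"non_null": 10}}],
 "owner": {"id": "vewm-vupz"}}
"""

schema = infer_schema(json.loads(raw))
print(json.dumps(schema, indent=2))
```

A real engine also has to merge the types seen across many records, which this single-record sketch ignores.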

JSON files can be loaded with the spark.read.json() or sqlContext.read.json() methods.

In [3]:
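# SQLContext must be imported before it is used below.
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import *
# `sc` is the SparkContext that the PySpark shell or notebook kernel provides.
sqlContext = SQLContext(sc)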

In [5]:
dfj = sqlContext.read.json("/Users/binggangliu/Downloads/copa.json")

The schema is inferred automatically.

In [77]:
# dfj.dtypes

In [78]:
# dfj.show()

In [8]:
dfj.printSchema()

root
 |-- attribution: string (nullable = true)
 |-- attributionLink: string (nullable = true)
 |-- averageRating: long (nullable = true)
 |-- category: string (nullable = true)
 |-- columns: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- cachedContents: struct (nullable = true)
 |    |    |    |-- average: string (nullable = true)
 |    |    |    |-- cardinality: string (nullable = true)
 |    |    |    |-- largest: string (nullable = true)
 |    |    |    |-- non_null: long (nullable = true)
 |    |    |    |-- not_null: string (nullable = true)
 |    |    |    |-- null: string (nullable = true)
 |    |    |    |-- smallest: string (nullable = true)
 |    |    |    |-- sum: string (nullable = true)
 |    |    |    |-- top: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- count: string (nullable = true)
 |    |    |    |    |    |-- item: string (nullable = true)
 |    |    |-- 

In [9]:
# createOrReplaceTempView replaces registerTempTable, which is deprecated since Spark 2.0.
dfj.createOrReplaceTempView("td")

In [54]:
attribution = sqlContext.sql("select attribution from td")

In [55]:
attribution.show()

+---------------+
|    attribution|
+---------------+
|City of Chicago|
+---------------+



In [43]:
# Named dataset_id so that Python's built-in id() is not shadowed.
dataset_id = sqlContext.sql("select id from td")

In [29]:
dataset_id.show()

+---------+
|       id|
+---------+
|mft5-nfa8|
+---------+



In [44]:
description = sqlContext.sql("select description from td")

In [31]:
description.show()

+--------------------+
|         description|
+--------------------+
|Complaints receiv...|
+--------------------+



In [45]:
category = sqlContext.sql("select category from td")

In [33]:
category.show()

+-------------+
|     category|
+-------------+
|Public Safety|
+-------------+



In [46]:
owner = sqlContext.sql("select owner from td limit 5")

In [47]:
owner.show()

+--------------------+
|               owner|
+--------------------+
|[Jonathan Levy, [...|
+--------------------+



In [51]:
info_owner = sqlContext.sql("select owner.id, owner.screenName, owner.type from td")

Use a dot ('.') to reach the next level of a nested structure (e.g. owner.id).

In [52]:
info_owner.show()

+---------+-------------+-----------+
|       id|   screenName|       type|
+---------+-------------+-----------+
|vewm-vupz|Jonathan Levy|interactive|
+---------+-------------+-----------+



In [50]:
info_tableAuthor = sqlContext.sql("select tableAuthor.flags from td")

In [40]:
info_tableAuthor.show()

+--------------------+
|               flags|
+--------------------+
|[mayBeStoriesCoOw...|
+--------------------+



In [62]:
columns = sqlContext.sql("select columns from td")

In [63]:
columns.show()

+--------------------+
|             columns|
+--------------------+
|[[[1051076.442097...|
+--------------------+



In [64]:
metadata = sqlContext.sql("select metadata from td")

In [65]:
metadata.show()

+--------------------+
|            metadata|
+--------------------+
|[[table, fatrow, ...|
+--------------------+



Chain two dots to reach a field two levels down, e.g. metadata.filterCondition.type.

In [96]:
info_metadata = sqlContext.sql("select metadata.filterCondition.type from td")

In [97]:
info_metadata.show()

+--------+
|    type|
+--------+
|operator|
+--------+
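The dot notation used in the queries above can be pictured as walking a nested record one key at a time. The sketch below mimics that lookup in plain Python; it does not use Spark and is not how Spark resolves field paths internally. The helper name `get_field` is hypothetical, and the record holds only the values that appear in the outputs above.

```python
def get_field(record, path):
    """Follow a dotted path such as 'metadata.filterCondition.type' through nested dicts."""
    node = record
    for key in path.split("."):
        if node is None:
            # A missing parent yields None, much like a null struct yields null.
            return None
        node = node[key]
    return node

# Values taken from the query results shown in this notebook.
record = {
    "owner": {"id": "vewm-vupz", "screenName": "Jonathan Levy", "type": "interactive"},
    "metadata": {"filterCondition": {"type": "operator"}},
}

one_level = get_field(record, "owner.id")
two_levels = get_field(record, "metadata.filterCondition.type")
print(one_level, two_levels)
```

Each extra dot adds one more step to the walk, so the same pattern covers structures of any depth.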



Split one column (e.g. 'owner') into several independent columns. This exposes the inner fields of the struct directly under the root.

In [84]:
dfj_c = (
    dfj.withColumn("o_displayName", dfj.owner.displayName)
    .withColumn("o_flags", dfj.owner.flags)
    .withColumn("o_id", dfj.owner.id)
    .withColumn("o_screenName", dfj.owner.screenName)
    .withColumn("o_type", dfj.owner.type)
    .drop(dfj.owner)
)

In [87]:
dfj_c.show(8)

+---------------+--------------------+-------------+-------------+--------------------+----------+--------------------+-----------+-------------+--------------------+--------------------+---------------+----------------+---------+--------------+------+--------------------+--------------------+----------+----------------+--------+----------+------------------------+---------------+----------------+----------------+--------------------+------+-------------+-------------+--------------------+--------+--------------------+---------------+---------+----------------+--------+-------------+--------------------+---------+-------------+-----------+
|    attribution|     attributionLink|averageRating|     category|             columns| createdAt|         description|displayType|downloadCount|               flags|              grants|hideFromCatalog|hideFromDataJson|       id|indexUpdatedAt|locale|            metadata|                name|newBackend|numberOfComments|     oid|provenance|publicatio

In [90]:
# createOrReplaceTempView replaces registerTempTable, which is deprecated since Spark 2.0.
dfj_c.createOrReplaceTempView("td_c")

In [91]:
o_id = sqlContext.sql("select o_id from td_c")

In [92]:
o_id.show()

+---------+
|     o_id|
+---------+
|vewm-vupz|
+---------+



In [94]:
o_flags = sqlContext.sql("select o_flags from td_c")

In [95]:
o_flags.show()

+--------------------+
|             o_flags|
+--------------------+
|[mayBeStoriesCoOw...|
+--------------------+
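The flattening step that produced dfj_c (one new o_ column per field of owner, then dropping owner) can be sketched on a single record in plain Python. This is only an illustration of the transformation, not Spark code; the helper name `flatten_struct` is hypothetical, and the record keeps just a few fields whose values appear in the outputs above.

```python
def flatten_struct(record, column, prefix):
    """Return a copy of `record` where the dict under `column` is replaced by
    one top-level key per inner field, named prefix + field."""
    # Keep every top-level field except the struct being flattened.
    flat = {key: value for key, value in record.items() if key != column}
    # Promote each inner field to the top level under a prefixed name.
    for field, value in record[column].items():
        flat[prefix + field] = value
    return flat

# Values taken from the query results shown in this notebook.
record = {
    "id": "mft5-nfa8",
    "owner": {"id": "vewm-vupz", "screenName": "Jonathan Levy", "type": "interactive"},
}

flat = flatten_struct(record, "owner", "o_")
print(flat)
```

The prefix matters because owner.id and the top-level id would otherwise end up with the same name. Spark SQL can also expand a struct with a star (for example, select owner.* from td), but the expanded fields keep their original names, so combining that with the top-level columns would produce two columns called id.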

