
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Querying JSON & Hierarchical Data with SQL

Apache Spark&trade; and Databricks&reg; make it easy to work with hierarchical data, such as nested JSON records.

## In this lesson you:
* Use SQL to query a table backed by JSON data
* Query nested structured data
* Query data containing array columns 

## Audience
* Primary Audience: Data Analysts
* Additional Audiences: Data Engineers and Data Scientists

## Prerequisites
* Web browser: **Chrome**
* A cluster configured with **8 cores** and **DBR 6.2**
* Familiarity with <a href="https://www.w3schools.com/sql/" target="_blank">ANSI SQL</a> is required

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the<br/>
start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [4]:
%run "./Includes/Classroom-Setup"

<iframe  
src="//fast.wistia.net/embed/iframe/a3098jg2t0?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/a3098jg2t0?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

## Examining the contents of a JSON file

JSON is a common file format in big data applications and in data lakes (large stores of diverse data).  Semi-structured formats such as JSON address a number of data needs.  For instance, what if...  
<br>
* Your schema, or the structure of your data, changes over time?
* You need nested fields like an array with many values or an array of arrays?
* You don't know how you're going to use your data yet, so you don't want to spend time creating relational tables?

The popularity of JSON is largely due to the fact that JSON allows for nested, flexible schemas.

This lesson uses the `DatabricksBlog` table, which is backed by the JSON file `dbfs:/mnt/training/databricks-blog.json`. If you examine the raw file, you can see that it contains compact JSON data: there is a single JSON object on each line of the file, and each object corresponds to a row in the table. Each row represents a blog post on the <a href="https://databricks.com/blog" target="_blank">Databricks blog</a>, and the table contains all blog posts through August 9, 2017.

<iframe  
src="//fast.wistia.net/embed/iframe/1i3n3rb0vy?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/1i3n3rb0vy?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

In [8]:
%fs head dbfs:/mnt/training/databricks-blog.json
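
The one-object-per-line layout can be illustrated outside Spark with a minimal Python sketch. The two records below are abbreviated, hand-written stand-ins that reuse titles, authors, and dates shown elsewhere in this lesson; the real file contains more fields, and this is not how Spark itself reads the file.

```python
import json

# Two abbreviated, hand-written records in the one-object-per-line layout.
# Assumption: only three of the table's fields are included here.
raw_lines = """\
{"title": "Apache Spark 0.9.1 Released", "authors": ["Tathagata Das"], "dates": {"createdOn": "2014-04-10", "publishedOn": "2014-04-10", "tz": "UTC"}}
{"title": "Apache Spark 0.9.0 Released", "authors": ["Patrick Wendell"], "dates": {"createdOn": "2014-02-04", "publishedOn": "2014-02-04", "tz": "UTC"}}
"""

# Each non-empty line is parsed on its own; each parsed object is one row.
rows = [json.loads(line) for line in raw_lines.splitlines() if line.strip()]

for row in rows:
    print(row["title"], row["authors"], row["dates"]["publishedOn"])
```

Because every line is a complete JSON object, the lines can be parsed independently of one another, which is what makes this layout convenient for splitting a large file across workers.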

To expose the JSON file as a table, use the standard SQL `CREATE TABLE ... USING` syntax introduced in the previous lesson:

In [10]:
%sql
CREATE TABLE IF NOT EXISTS DatabricksBlog
  USING json
  OPTIONS (
    path "dbfs:/mnt/training/databricks-blog.json",
    inferSchema "true"
  )

Take a look at the schema with the `DESCRIBE` command.

In [12]:
%sql
DESCRIBE DatabricksBlog

col_name,data_type,comment
authors,array,
categories,array,
content,string,
creator,string,
dates,struct,
description,string,
id,bigint,
link,string,
slug,string,
status,string,


Run a query to view the contents of the table.

Notice:
* The `authors` column is an array containing multiple author names.
* The `categories` column is an array of multiple blog post category names.
* The `dates` column contains nested fields `createdOn`, `publishedOn` and `tz`.

In [14]:
%sql
SELECT authors, categories, dates, content 
FROM DatabricksBlog
limit 5

authors,categories,dates,content
List(Tomer Shiran (VP of Product Management at MapR)),"List(Company Blog, Partners)","List(2014-04-10, 2014-04-10, UTC)","This post is guest authored by our friends at MapR, announcing our new partnership to provide enterprise support for Apache Spark as part of MapR's Distribution of Hadoop. With over 500 paying customers, my team and I have the opportunity to talk to many organizations that are leveraging Hadoop in production to extract value from big data. One of the most common topics raised by our customers in recent months is Apache Spark. Some customers just want to learn more about the advantages of this technology and the use cases that it addresses, while others are already running it in production with the MapR Distribution. These customers range from the world’s largest cable telcos and retailers to Silicon Valley startups such as Quantifind, which recently talked about its use of Spark on MapR in an interview with Stefan Groschupf, CEO of Datameer. Today, I am happy to announce and share with you the beginning of our journey with Databricks, and the addition of the complete Spark stack to the MapR Distribution for Apache Hadoop. We are now the only Hadoop distribution to support the complete Spark stack, including Spark, Spark Streaming (stream processing), Shark (Hive on Spark), MLLib (machine learning) and GraphX (graph processing). This is a testament to our commitment to open source and to providing our customers with maximum flexibility to pick and choose the right tool for the job. Why Spark? One of the challenges organizations face when adopting Hadoop is a shortage of developers who have experience building Hadoop applications. Our professional services organization has helped dozens of companies with the development and deployment of Hadoop applications, and our training department has trained countless engineers. 
Organizations are hungry for solutions that make it easier to develop Hadoop applications while increasing developer productivity, and Spark fits this bill. Spark jobs can require as little as 1/5th of code. Spark provides a simple programming abstraction allowing developers to design applications as operations on data collections (known as RDDs, or Resilient Distributed Datasets). Developers can build these applications in multiple programming languages, including Java, Scala and Python, and the same code can be reused across batch, interactive and streaming applications. In addition to making developers happier and more productive, Spark provides significant benefits with respect to end-to-end application performance. To this end, Spark provides a general-purpose execution framework with in-memory pipelining. For many applications, this results in a 5-100x performance improvement, because some or all steps can execute in memory without unnecessarily writing to and reading from disk. The performance advantage of the Spark engine, combined with the industry-leading performance of the MapR Distribution, provides customers with the highest-performance platform for big data applications. Why Databricks? Databricks was founded by the creators of Apache Spark, and is currently the driving force behind the project. When we decided to add the Spark stack to our distribution and double down on our involvement in the Spark community, a strategic partnership with Databricks was a no-brainer. This partnership will benefit MapR customers who are interested in 24x7 support for Spark or any of the other projects in the stack, including Spark Streaming, Shark, MLLib and GraphX (with several other projects coming soon). In addition, MapR will be working closely with Databricks to drive the Spark roadmap and accelerate the development of new features, benefiting both MapR customers and the broader community. 
We are very excited about the upcoming Apache Spark 1.0 release, expected later this month. We are looking forward to a great journey with Databricks and the other members of the Spark community. Register for an upcoming joint webinar to learn more about the benefits of the complete Spark stack on MapR."
List(Tathagata Das),"List(Apache Spark, Engineering Blog, Machine Learning)","List(2014-04-10, 2014-04-10, UTC)","We are happy to announce the availability of Apache Spark 0.9.1! This is a maintenance release with bug fixes, performance improvements, better stability with YARN and improved parity of the Scala and Python API. We recommend all 0.9.0 users to upgrade to this stable release. This is the first release since Spark graduated as a top level Apache project. Contributions to this release came from 37 developers. Visit the release notes for more information about all the improvements and bug fixes. Download it and try it out!"
List(Steven Hillion),"List(Company Blog, Partners)","List(2014-04-01, 2014-04-01, UTC)","This post is guest authored by our friends at Alpine Data Labs, part of the 'Application Spotlight' series highlighting innovative applications that are part of the Databricks ""Certified on Apache Spark"" program. Everyone knows how hard it is to recruit engineers and data scientists in Silicon Valley. At Alpine Data Labs, we think what we’re up to is pretty fun and challenging, but we still have to compete with other start-ups as well as the big internet companies to attract the best talent. One thing that can help is to be able to say that you’re working with the most innovative and powerful technologies. Last year, I was interviewing a talented engineer with a strong background in machine learning. And he said that the one thing he wanted to do above all was to work with Apache Spark. “Will I get to do that at Alpine?” he asked. If it had been even a year earlier, I would have said “Sure…at some point.” But in the meantime I’d met several of the members of the AMPLab research team at Berkeley, and been impressed with their mature approach to building a platform and ecosystem. And I’d seen enough companies installing Spark on their dev clusters that it was clear this was a technology to watch. In a remarkably short time, it went from experimental to very real. And now prospects in the Alpine pipeline were asking me if it was on the roadmap. So yes, I told my candidate. “You’ll be working on Spark from day one.” Last week, Alpine announced at GigaOM that it’s one of the first analytics companies to leverage Spark for building predictive models. We demonstrated the Alpine engine running on Pivotal’s Analytics Workbench, where it ran an iterative classification algorithm (logistic regression) on 50 million rows in less than 50 seconds. Furthermore, we were officially certified on Spark by the team at Databricks. 
It’s been an honor to work with them and the research team at Berkeley. We think their technology will be a serious contender for the leading platform for data science. Spark is more to us than just speed. It’s really the entire ecosystem that represents such an exciting paradigm for working with data. Still, the core capability of caching data in memory was our primary consideration, and our iterative algorithms have been shown to speed up by one or even two orders of magnitude (thanks again to that Pivotal cluster). We’ve always had this mantra at Alpine: “Avoid multiple passes through the data!” And we’ve designed many of our machine learning algorithms to avoid scanning the data too many times, packing on calculations into each MapReduce job like a waiter piling up plates to try and clear a table in one go. But it’s rare that we can avoid it entirely. With Spark, it’s incredibly satisfying to watch the progress bar zip along as the system re-uses data it’s already seen before. Another thing that’s getting our engineers excited is Spark’s MLLib, the machine-learning library written on top of the Spark runtime. Alpine has long thought that machine learning algorithms should be open source. (I helped to kick off the MADlib library of analytics functions for databases, and Alpine now uses it extensively.) So we’re now beginning to contribute some of our code back into MLLib. And, moreover, we think MLLib and MLI have the potential to be a more general repository for open-source machine learning. So I’ll congratulate the Alpine team for helping to bring the power of Spark to our users, and I’ll also congratulate the Spark team and Databricks for making it possible!"
"List(Michael Armbrust, Reynold Xin)","List(Apache Spark, Engineering Blog)","List(2014-03-27, 2014-03-27, UTC)","Building a unified platform for big data analytics has long been the vision of Apache Spark, allowing a single program to perform ETL, MapReduce, and complex analytics. An important aspect of unification that our users have consistently requested is the ability to more easily import data stored in external sources, such as Apache Hive. Today, we are excited to announce Spark SQL, a new component recently merged into the Spark repository. Spark SQL brings native support for SQL to Spark and streamlines the process of querying data stored both in RDDs (Spark’s distributed datasets) and in external sources. Spark SQL conveniently blurs the lines between RDDs and relational tables. Unifying these powerful abstractions makes it easy for developers to intermix SQL commands querying external data with complex analytics, all within in a single application. Concretely, Spark SQL will allow developers to:  Import relational data from Parquet files and Hive tables  Run SQL queries over imported data and existing RDDs  Easily write RDDs out to Hive tables or Parquet files Spark SQL In Action Now, let’s take a closer look at how Spark SQL gives developers the power to integrate SQL commands into applications that also take advantage of MLlib, Spark’s machine learning library. Consider an application that needs to predict which users are likely candidates for a service, based on their profile. Often, such an analysis requires joining data from multiple sources. For the purposes of illustration, imagine an application with two tables:  Users(userId INT, name String, email STRING, age INT, latitude: DOUBLE, longitude: DOUBLE, subscribed: BOOLEAN)  Events(userId INT, action INT) Given the data stored in in these tables, one might want to build a model that will predict which users are good targets for a new campaign, based on users that are similar. 
[scala] // Data can easily be extracted from existing sources, // such as Apache Hive. val trainingDataTable = sql(""""""  SELECT e.action  u.age,  u.latitude,  u.logitude  FROM Users u  JOIN Events e  ON u.userId = e.userId"""""") // Since `sql` returns an RDD, the results of the above // query can be easily used in MLlib val trainingData = trainingDataTable.map { row =>  val features = Array[Double](row(1), row(2), row(3))  LabeledPoint(row(0), features) } val model =  new LogisticRegressionWithSGD().run(trainingData) [/scala] Now that we have used SQL to join existing data and train a model, we can use this model to predict which users are likely targets. [scala] val allCandidates = sql(""""""  SELECT userId,  age,  latitude,  logitude  FROM Users  WHERE subscribed = FALSE"""""") // Results of ML algorithms can be used as tables // in subsequent SQL statements. case class Score(userId: Int, score: Double) val scores = allCandidates.map { row =>  val features = Array[Double](row(1), row(2), row(3))  Score(row(0), model.predict(features)) } scores.registerAsTable(""Scores"") val topCandidates = sql(""""""  SELECT u.name, u.email  FROM Scores s  JOIN Users u ON s.userId = u.userId  ORDER BY score DESC  LIMIT 100"""""") // Send emails to top candidates to promote the service. [/scala] In this example, Spark SQL made it easy to extract and join the various datasets preparing them for the machine learning algorithm. Since the results of Spark SQL are also stored in RDDs, interfacing with other Spark libraries is trivial. Furthermore, Spark SQL allows developers to close the loop, by making it easy to manipulate and join the output of these algorithms, producing the desired final result. To summarize, the unified Spark platform gives developers the power to choose the right tool for the right job, without having to juggle multiple systems. If you would like to see more concrete examples of using Spark SQL please check out the programming guide. 
Optimizing with Catalyst In addition to providing new ways to interact with data, Spark SQL also brings a powerful new optimization framework called Catalyst. Using Catalyst, Spark can automatically transform SQL queries so that they execute more efficiently. The Catalyst framework allows the developers behind Spark SQL to rapidly add new optimizations, enabling us to build a faster system more quickly. In one recent example, we found an inefficiency in Hive group-bys that took an experienced developer an entire weekend and over 250 lines of code to fix; we were then able to make the same fix in Catalyst in only a few lines of code. Future of Shark The natural question that arises is about the future of Shark. Shark was among the first systems that delivered up to 100X speedup over Hive. It builds on the Apache Hive codebase and achieves performance improvements by swapping out the physical execution engine part of Hive. While this approach enables Shark users to speed up their Hive queries without modification to their existing warehouses, Shark inherits the large, complicated code base from Hive that makes it hard to optimize and maintain. As Spark SQL matures, Shark will transition to using Spark SQL for query optimization and physical execution so that users can benefit from the ongoing optimization efforts within Spark SQL. In short, we will continue to invest in Shark and make it an excellent drop-in replacement for Apache Hive. It will take advantage of the new Spark SQL component, and will provide features that complement it, such as Hive compatibility and the standalone SharkServer, which allows external tools to connect queries through JDBC/ODBC. What’s next Spark SQL will be included in Spark 1.0 as an alpha component. However, this is only the beginning of better support for relational data in Spark, and this post only scratches the surface of Catalyst. 
Look for future blog posts on the following topics:  Generating custom bytecode to speed up expression evaluation  Reading and writing data using other formats and systems, include Avro and HBase  API support for using Spark SQL in Python and Java"
List(Patrick Wendell),"List(Apache Spark, Engineering Blog)","List(2014-02-04, 2014-02-04, UTC)","Our goal with Apache Spark is very simple: provide the best platform for computation on big data. We do this through both a powerful core engine and rich libraries for useful analytics tasks. Today, we are excited to announce the release of Apache Spark 0.9.0. This major release extends Spark’s libraries and further improves its performance and usability. Apache Spark 0.9.0 is the largest release to date, with work from 83 contributors, who submitted over 300 patches. Apache Spark 0.9 features significant extensions to the set of standard analytical libraries packaged with Spark. The release introduces GraphX, a library for graph computation that comes with implementations of several standard algorithms, such as PageRank. Spark’s machine learning library (MLlib) has been extended to support Python, using the NumPy numerical library. A Naive Bayes Classifier has also been added to MLlib. Finally, Spark Streaming, which supports near-real-time continuous computation, has added a simplified high-availability mode and several significant optimizations. In addition to higher-level libraries, Spark 0.9 features improvements to the core computation engine. Spark now now automatically spills reduce output to disk, increasing the stability of workloads with very large aggregations. Support for Spark in YARN mode has been hardened and improved. The standalone mode has added automatic supervision of applications and better support for sharing clusters amongst several users. Finally, we’ve focused on stabilizing API’s ahead of Apache Spark’s 1.0 release to make things easy for developers writing Spark applications. This includes upgrading to Scala 2.10, allowing applications written in Scala to use newer libraries. Apache Spark 0.9.0 can be downloaded directly from the Apache Spark website. 
It will also be available to CDH users via a Cloudera parcel, which can automatically install Spark on existing CDH clusters. For a more detailed explanation of the features in this release, head on over to the official release notes. Enjoy the newest release of Spark!"


## Nested Data

Think of nested data as columns within columns. 

For instance, look at the `dates` column.

<iframe  
src="//fast.wistia.net/embed/iframe/kqmfblujy9?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/kqmfblujy9?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

In [17]:
%sql
SELECT dates FROM DatabricksBlog
limit 10

dates
"List(2014-04-10, 2014-04-10, UTC)"
"List(2014-04-10, 2014-04-10, UTC)"
"List(2014-04-01, 2014-04-01, UTC)"
"List(2014-03-27, 2014-03-27, UTC)"
"List(2014-02-04, 2014-02-04, UTC)"
"List(2014-01-02, 2014-01-02, UTC)"
"List(2014-03-26, 2014-03-26, UTC)"
"List(2014-03-21, 2014-03-21, UTC)"
"List(2014-03-19, 2014-03-19, UTC)"
"List(2014-03-03, 2014-03-03, UTC)"


Pull out a specific subfield with "dot" notation.

In [19]:
%sql
SELECT dates.createdOn, dates.publishedOn, dates.tz
FROM DatabricksBlog
limit 10

createdOn,publishedOn,tz
2014-04-10,2014-04-10,UTC
2014-04-10,2014-04-10,UTC
2014-04-01,2014-04-01,UTC
2014-03-27,2014-03-27,UTC
2014-02-04,2014-02-04,UTC
2014-01-02,2014-01-02,UTC
2014-03-26,2014-03-26,UTC
2014-03-21,2014-03-21,UTC
2014-03-19,2014-03-19,UTC
2014-03-03,2014-03-03,UTC
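
Dot notation amounts to following a path through nested records. A minimal Python sketch of the same lookup, using a hypothetical `get_path` helper and one hand-copied row:

```python
from functools import reduce

def get_path(record, path):
    """Follow a dotted path such as "dates.createdOn" through nested dicts."""
    return reduce(lambda node, key: node[key], path.split("."), record)

# One row copied by hand from the output above.
row = {
    "title": "Apache Spark 0.9.1 Released",
    "dates": {"createdOn": "2014-04-10", "publishedOn": "2014-04-10", "tz": "UTC"},
}

created_on = get_path(row, "dates.createdOn")
published_on = get_path(row, "dates.publishedOn")
tz = get_path(row, "dates.tz")

print(created_on, published_on, tz)
```

Each segment of the path selects one level of nesting, so `dates.tz` reads as "the `tz` field inside the `dates` column".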


Both `createdOn` and `publishedOn` are stored as strings.

To work with them as dates, cast those values to SQL timestamps.

In this case, use a single `SELECT` statement to:
1. Cast `dates.publishedOn` to a `timestamp` data type.
2. "Flatten" the `dates.publishedOn` column to just `publishedOn`.

In [21]:
%sql
SELECT title, 
       cast(dates.publishedOn AS timestamp) AS publishedOn 
FROM DatabricksBlog
limit 10

title,publishedOn
MapR Integrates the Complete Apache Spark Stack,2014-04-10T00:00:00.000+0000
Apache Spark 0.9.1 Released,2014-04-10T00:00:00.000+0000
Application Spotlight: Alpine Data Labs,2014-04-01T00:00:00.000+0000
Spark SQL: Manipulating Structured Data Using Apache Spark,2014-03-27T00:00:00.000+0000
Apache Spark 0.9.0 Released,2014-02-04T00:00:00.000+0000
Apache Spark In MapReduce (SIMR),2014-01-02T00:00:00.000+0000
Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time,2014-03-26T00:00:00.000+0000
Apache Spark: A Delight for Developers,2014-03-21T00:00:00.000+0000
"Databricks announces ""Certified on Apache Spark"" Program",2014-03-19T00:00:00.000+0000
Apache Spark Now a Top-level Apache Project,2014-03-03T00:00:00.000+0000
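
The cast-and-flatten step can be mirrored in plain Python as a sanity check of the idea. This sketch assumes the strings always use the `yyyy-MM-dd` layout seen in the output; the `to_timestamp` helper is hypothetical, and it raises `ValueError` on malformed input, which is not necessarily how a SQL cast behaves.

```python
from datetime import datetime

def to_timestamp(date_string):
    """Parse a yyyy-MM-dd string into a datetime at midnight."""
    return datetime.strptime(date_string, "%Y-%m-%d")

# One row copied by hand from the table, reduced to the fields used here.
row = {"title": "Apache Spark 0.9.1 Released", "dates": {"publishedOn": "2014-04-10"}}

# Cast the nested string and expose it as a top-level field in one step.
flattened = {
    "title": row["title"],
    "publishedOn": to_timestamp(row["dates"]["publishedOn"]),
}

print(flattened["title"], flattened["publishedOn"].isoformat())
```

The result has a top-level `publishedOn` field holding a real date-time value instead of a string nested inside `dates`.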


Create the temporary view `DatabricksBlog2` to capture the conversion and flattening of the `publishedOn` column.

In [23]:
%sql
CREATE OR REPLACE TEMPORARY VIEW DatabricksBlog2 AS
  SELECT *, 
         cast(dates.publishedOn AS timestamp) AS publishedOn 
  FROM DatabricksBlog

Now that we have this temporary view, we can use `DESCRIBE` to check its schema and confirm the timestamp conversion.

In [25]:
%sql
DESCRIBE DatabricksBlog2

col_name,data_type,comment
authors,array,
categories,array,
content,string,
creator,string,
dates,struct,
description,string,
id,bigint,
link,string,
slug,string,
status,string,


Now that the dates are represented by a `timestamp` data type, you can query for articles within certain date ranges (such as getting a list of all articles published in 2013) and format the date for presentation purposes.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> See the Spark documentation, <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions" target="_blank">built-in functions</a>, for a long list of date-specific functions.

In [27]:
%sql
SELECT authors, categories,
TRANSFORM (categories, category -> LOWER(category)) AS lwr_categories
FROM DatabricksBlog
limit 20

authors,categories,lwr_categories
List(Tomer Shiran (VP of Product Management at MapR)),"List(Company Blog, Partners)","List(company blog, partners)"
List(Tathagata Das),"List(Apache Spark, Engineering Blog, Machine Learning)","List(apache spark, engineering blog, machine learning)"
List(Steven Hillion),"List(Company Blog, Partners)","List(company blog, partners)"
"List(Michael Armbrust, Reynold Xin)","List(Apache Spark, Engineering Blog)","List(apache spark, engineering blog)"
List(Patrick Wendell),"List(Apache Spark, Engineering Blog)","List(apache spark, engineering blog)"
"List(Ali Ghodsi, Ahir Reddy)","List(Apache Spark, Ecosystem, Engineering Blog)","List(apache spark, ecosystem, engineering blog)"
"List(Russell Cardullo (Data Infrastructure Engineer at Sharethrough), Michael Ruggiero (Data Infrastructure Engineer at Sharethrough))","List(Company Blog, Customers)","List(company blog, customers)"
"List(Jai Ranganathan, Matei Zaharia)","List(Apache Spark, Engineering Blog)","List(apache spark, engineering blog)"
List(Databricks Press Office),"List(Announcements, Company Blog)","List(announcements, company blog)"
List(Ion Stoica),"List(Apache Spark, Engineering Blog)","List(apache spark, engineering blog)"


In [28]:
%sql
SELECT title,
  FILTER (categories, category -> category = "Apache Spark") filtered
FROM DatabricksBlog
limit 10

title,filtered
MapR Integrates the Complete Apache Spark Stack,List()
Apache Spark 0.9.1 Released,List(Apache Spark)
Application Spotlight: Alpine Data Labs,List()
Spark SQL: Manipulating Structured Data Using Apache Spark,List(Apache Spark)
Apache Spark 0.9.0 Released,List(Apache Spark)
Apache Spark In MapReduce (SIMR),List(Apache Spark)
Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time,List()
Apache Spark: A Delight for Developers,List(Apache Spark)
"Databricks announces ""Certified on Apache Spark"" Program",List()
Apache Spark Now a Top-level Apache Project,List(Apache Spark)


In [29]:
%sql
SELECT title,
  EXISTS (authors, author -> author = "Reynold Xin" 
    OR author = "Ion Stoica") selected
FROM DatabricksBlog
limit 10

title,selected
MapR Integrates the Complete Apache Spark Stack,False
Apache Spark 0.9.1 Released,False
Application Spotlight: Alpine Data Labs,False
Spark SQL: Manipulating Structured Data Using Apache Spark,True
Apache Spark 0.9.0 Released,False
Apache Spark In MapReduce (SIMR),False
Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time,False
Apache Spark: A Delight for Developers,False
"Databricks announces ""Certified on Apache Spark"" Program",False
Apache Spark Now a Top-level Apache Project,True
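
The three queries above use the higher-order functions `TRANSFORM`, `FILTER`, and `EXISTS`, each of which applies a lambda to every element of an array column. As a rough analogy in plain Python (not Spark code), they correspond to a mapping comprehension, a filtering comprehension, and `any`. The sample values are copied from the "Spark SQL: Manipulating Structured Data Using Apache Spark" row.

```python
# Values copied by hand from one row of the outputs above.
categories = ["Apache Spark", "Engineering Blog"]
authors = ["Michael Armbrust", "Reynold Xin"]

# TRANSFORM(categories, category -> LOWER(category))
lwr_categories = [category.lower() for category in categories]

# FILTER(categories, category -> category = "Apache Spark")
filtered = [category for category in categories if category == "Apache Spark"]

# EXISTS(authors, author -> author = "Reynold Xin" OR author = "Ion Stoica")
selected = any(author in ("Reynold Xin", "Ion Stoica") for author in authors)

print(lwr_categories)
print(filtered)
print(selected)
```

`TRANSFORM` and `FILTER` return a new array per row, while `EXISTS` reduces the array to a single boolean per row.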


In [30]:
%sql
SELECT title, 
       date_format(publishedOn, "MMM dd, yyyy") AS date, 
       link,
       year(publishedOn) as year
FROM DatabricksBlog2
WHERE year(publishedOn) = 2013
ORDER BY publishedOn
limit 10

title,date,link,year
Databricks and the Apache Spark Platform,"Oct 27, 2013",https://databricks.com/blog/2013/10/27/databricks-and-the-apache-spark-platform.html,2013
The Growing Apache Spark Community,"Oct 28, 2013",https://databricks.com/blog/2013/10/27/the-growing-spark-community.html,2013
Databricks and Cloudera Partner to Support Apache Spark,"Oct 29, 2013",https://databricks.com/blog/2013/10/28/databricks-and-cloudera-partner-to-support-spark.html,2013
Putting Apache Spark to Use: Fast In-Memory Computing for Your Big Data Applications,"Nov 22, 2013",https://databricks.com/blog/2013/11/21/putting-spark-to-use.html,2013
Highlights From Spark Summit 2013,"Dec 19, 2013",https://databricks.com/blog/2013/12/18/spark-summit-2013-follow-up.html,2013
Apache Spark 0.8.1 Released,"Dec 20, 2013",https://databricks.com/blog/2013/12/19/release-0_8_1.html,2013
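
The same filter, sort, and format steps look like this in a minimal Python sketch. The three posts are hand-copied from outputs in this lesson. Python's `%b %d, %Y` is used here as an approximate counterpart of the `MMM dd, yyyy` pattern, and it assumes the default C locale for month abbreviations.

```python
from datetime import datetime

# Three posts copied by hand from outputs in this lesson.
posts = [
    {"title": "Apache Spark 0.8.1 Released", "publishedOn": datetime(2013, 12, 20)},
    {"title": "Apache Spark 0.9.0 Released", "publishedOn": datetime(2014, 2, 4)},
    {"title": "Databricks and the Apache Spark Platform", "publishedOn": datetime(2013, 10, 27)},
]

# WHERE year(publishedOn) = 2013 ... ORDER BY publishedOn
in_2013 = sorted(
    (post for post in posts if post["publishedOn"].year == 2013),
    key=lambda post: post["publishedOn"],
)

# date_format(publishedOn, "MMM dd, yyyy") AS date
for post in in_2013:
    print(post["title"], post["publishedOn"].strftime("%b %d, %Y"))
```

Filtering and sorting operate on the timestamp itself; the formatted string is produced only for display, so it does not affect the ordering.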


## Array Data

The table also contains array columns. 

Use the built-in `size(..)` function to determine the number of elements in each array.

<iframe  
src="//fast.wistia.net/embed/iframe/w9vj8mjpf7?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/w9vj8mjpf7?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

In [33]:
%sql
SELECT size(authors), 
       authors 
FROM DatabricksBlog
limit 10

size(authors),authors
1,List(Tomer Shiran (VP of Product Management at MapR))
1,List(Tathagata Das)
1,List(Steven Hillion)
2,"List(Michael Armbrust, Reynold Xin)"
1,List(Patrick Wendell)
2,"List(Ali Ghodsi, Ahir Reddy)"
2,"List(Russell Cardullo (Data Infrastructure Engineer at Sharethrough), Michael Ruggiero (Data Infrastructure Engineer at Sharethrough))"
2,"List(Jai Ranganathan, Matei Zaharia)"
1,List(Databricks Press Office)
1,List(Ion Stoica)


Pull the first element from the `authors` array using the array subscript operator. Array indexes start at 0.

In [35]:
%sql
SELECT authors[0] AS primaryAuthor 
FROM DatabricksBlog
limit 10

primaryAuthor
Tomer Shiran (VP of Product Management at MapR)
Tathagata Das
Steven Hillion
Michael Armbrust
Patrick Wendell
Ali Ghodsi
Russell Cardullo (Data Infrastructure Engineer at Sharethrough)
Jai Ranganathan
Databricks Press Office
Ion Stoica
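
For comparison, the Python list counterparts of `size(authors)` and `authors[0]` are `len` and a zero-based index. This is only an analogy on two hand-copied rows; handling of missing values and out-of-range indexes is not modeled here and may differ in Spark.

```python
# Two rows copied by hand from the output above.
rows = [
    {"authors": ["Tathagata Das"]},
    {"authors": ["Michael Armbrust", "Reynold Xin"]},
]

# size(authors) -> len(authors); authors[0] -> first element (zero-based).
summary = [(len(row["authors"]), row["authors"][0]) for row in rows]

for count, primary_author in summary:
    print(count, primary_author)
```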


### Explode

The `explode` function allows you to split an array column into multiple rows, copying all the other columns into each new row. 

For example, you can split the column `authors` into the column `author`, with one author per row.

<iframe  
src="//fast.wistia.net/embed/iframe/h8tv263d04?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/h8tv263d04?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

In [38]:
%sql
SELECT title, 
       authors, 
       explode(authors) AS author, 
       link 
FROM DatabricksBlog
limit 10

title,authors,author,link
MapR Integrates the Complete Apache Spark Stack,List(Tomer Shiran (VP of Product Management at MapR)),Tomer Shiran (VP of Product Management at MapR),https://databricks.com/blog/2014/04/10/mapr-integrates-spark-stack.html
Apache Spark 0.9.1 Released,List(Tathagata Das),Tathagata Das,https://databricks.com/blog/2014/04/09/spark-0_9_1-released.html
Application Spotlight: Alpine Data Labs,List(Steven Hillion),Steven Hillion,https://databricks.com/blog/2014/03/31/application-spotlight-alpine.html
Spark SQL: Manipulating Structured Data Using Apache Spark,"List(Michael Armbrust, Reynold Xin)",Michael Armbrust,https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
Spark SQL: Manipulating Structured Data Using Apache Spark,"List(Michael Armbrust, Reynold Xin)",Reynold Xin,https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
Apache Spark 0.9.0 Released,List(Patrick Wendell),Patrick Wendell,https://databricks.com/blog/2014/02/03/release-0_9_0.html
Apache Spark In MapReduce (SIMR),"List(Ali Ghodsi, Ahir Reddy)",Ali Ghodsi,https://databricks.com/blog/2014/01/01/simr.html
Apache Spark In MapReduce (SIMR),"List(Ali Ghodsi, Ahir Reddy)",Ahir Reddy,https://databricks.com/blog/2014/01/01/simr.html
Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time,"List(Russell Cardullo (Data Infrastructure Engineer at Sharethrough), Michael Ruggiero (Data Infrastructure Engineer at Sharethrough))",Russell Cardullo (Data Infrastructure Engineer at Sharethrough),https://databricks.com/blog/2014/03/25/sharethrough-and-spark-streaming.html
Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time,"List(Russell Cardullo (Data Infrastructure Engineer at Sharethrough), Michael Ruggiero (Data Infrastructure Engineer at Sharethrough))",Michael Ruggiero (Data Infrastructure Engineer at Sharethrough),https://databricks.com/blog/2014/03/25/sharethrough-and-spark-streaming.html
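
The row duplication in the output can be modeled in a few lines of plain Python. This is a sketch of the semantics only, not Spark's implementation: `explode_rows` is a hypothetical helper, and the third input row is made up to show that a row with an empty array produces no output rows.

```python
def explode_rows(rows, array_col, alias):
    """Yield one output row per element of row[array_col].

    Every other column is copied unchanged into each output row. A row whose
    array is empty or None produces no output rows, as with `explode` in
    Spark SQL.
    """
    for row in rows:
        for element in row[array_col] or []:
            yield {**row, alias: element}


blog = [
    {"title": "Apache Spark 0.9.1 Released", "authors": ["Tathagata Das"]},
    {"title": "Apache Spark In MapReduce (SIMR)", "authors": ["Ali Ghodsi", "Ahir Reddy"]},
    # Hypothetical row, not from the data set: it has no authors.
    {"title": "Hypothetical post with no authors", "authors": []},
]

exploded = list(explode_rows(blog, "authors", "author"))
for row in exploded:
    print(row["title"], "|", row["author"])
```

The two-author article appears twice, once per author, and the original `authors` array is still present in each copy. If rows with empty arrays must be kept, Spark SQL also provides `explode_outer`.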


The effect of `explode` is easier to see when you restrict the output to articles that have multiple authors and sort by title.

In [40]:
%sql
SELECT title, 
       authors, 
       explode(authors) AS author, 
       link 
FROM DatabricksBlog 
WHERE size(authors) > 1 
ORDER BY title
limit 10

title,authors,author,link
"""Learning Spark"" book available from O'Reilly","List(Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia)",Patrick Wendell,https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
"""Learning Spark"" book available from O'Reilly","List(Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia)",Matei Zaharia,https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
"""Learning Spark"" book available from O'Reilly","List(Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia)",Holden Karau,https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
"""Learning Spark"" book available from O'Reilly","List(Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia)",Andy Konwinski,https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
AMPLab updates the Big Data Benchmark,"List(Ahir Reddy, Reynold Xin)",Ahir Reddy,https://databricks.com/blog/2014/02/12/big-data-benchmark.html
AMPLab updates the Big Data Benchmark,"List(Ahir Reddy, Reynold Xin)",Reynold Xin,https://databricks.com/blog/2014/02/12/big-data-benchmark.html
Announcing Apache Spark Packages,"List(Xiangrui Meng, Patrick Wendell)",Patrick Wendell,https://databricks.com/blog/2014/12/22/announcing-spark-packages.html
Announcing Apache Spark Packages,"List(Xiangrui Meng, Patrick Wendell)",Xiangrui Meng,https://databricks.com/blog/2014/12/22/announcing-spark-packages.html
Apache Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark,"List(Nick Pentreath (Graphflow), Kan Zhang (IBM))",Kan Zhang (IBM),https://databricks.com/blog/2014/09/17/spark-1-1-bringing-hadoop-inputoutput-formats-to-pyspark.html
Apache Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark,"List(Nick Pentreath (Graphflow), Kan Zhang (IBM))",Nick Pentreath (Graphflow),https://databricks.com/blog/2014/09/17/spark-1-1-bringing-hadoop-inputoutput-formats-to-pyspark.html
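
The three steps of this query (filter on array size, explode, sort) can be sketched in plain Python as follows. The sample rows are taken from the outputs in this lesson; the code is an illustrative model, not what Spark executes.

```python
blog = [
    {"title": "Apache Spark 0.9.1 Released", "authors": ["Tathagata Das"]},
    {"title": "Apache Spark In MapReduce (SIMR)", "authors": ["Ali Ghodsi", "Ahir Reddy"]},
    {"title": "AMPLab updates the Big Data Benchmark", "authors": ["Ahir Reddy", "Reynold Xin"]},
]

# WHERE size(authors) > 1: keep only articles with more than one author.
multi_author = [row for row in blog if len(row["authors"]) > 1]

# explode(authors) AS author: one (title, author) pair per array element.
pairs = [(row["title"], author) for row in multi_author for author in row["authors"]]

# ORDER BY title: the sort key is the title only, not the author.
pairs.sort(key=lambda pair: pair[0])

for title, author in pairs:
    print(title, "|", author)
```

Because the sort key is only `title`, the SQL query makes no promise about the order of authors within one article, which is why the "Learning Spark" rows above are not in array order. Add `author` to the `ORDER BY` clause if that order matters.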


### Lateral View
The data has several columns that hold nested objects: each article can have multiple dates, authors, and categories.

Take a look at the blog entry **Apache Spark 1.1: The State of Spark Streaming**:

In [42]:
%sql
SELECT dates.publishedOn, title, authors, categories
FROM DatabricksBlog
WHERE title = "Apache Spark 1.1: The State of Spark Streaming"
limit 10

publishedOn,title,authors,categories
2014-09-16,Apache Spark 1.1: The State of Spark Streaming,"List(Arsalan Tavakoli-Shiraji, Tathagata Das, Patrick Wendell)","List(Apache Spark, Engineering Blog, Streaming)"


Next, use `LATERAL VIEW` to explode multiple columns in the same query, in this case the columns `authors` and `categories`. The result contains one row for every combination of author and category.

In [44]:
%sql
SELECT dates.publishedOn, title, author, category
FROM DatabricksBlog
LATERAL VIEW explode(authors) exploded_authors_view AS author
LATERAL VIEW explode(categories) exploded_categories AS category
WHERE title = "Apache Spark 1.1: The State of Spark Streaming"
ORDER BY author, category
limit 10

publishedOn,title,author,category
2014-09-16,Apache Spark 1.1: The State of Spark Streaming,Arsalan Tavakoli-Shiraji,Apache Spark
2014-09-16,Apache Spark 1.1: The State of Spark Streaming,Arsalan Tavakoli-Shiraji,Engineering Blog
2014-09-16,Apache Spark 1.1: The State of Spark Streaming,Arsalan Tavakoli-Shiraji,Streaming
2014-09-16,Apache Spark 1.1: The State of Spark Streaming,Patrick Wendell,Apache Spark
2014-09-16,Apache Spark 1.1: The State of Spark Streaming,Patrick Wendell,Engineering Blog
2014-09-16,Apache Spark 1.1: The State of Spark Streaming,Patrick Wendell,Streaming
2014-09-16,Apache Spark 1.1: The State of Spark Streaming,Tathagata Das,Apache Spark
2014-09-16,Apache Spark 1.1: The State of Spark Streaming,Tathagata Das,Engineering Blog
2014-09-16,Apache Spark 1.1: The State of Spark Streaming,Tathagata Das,Streaming
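
The nine rows come from pairing each of the three authors with each of the three categories. A minimal plain-Python sketch of that pairing, using the values from the article above (a model of the result, not of how Spark plans the query):

```python
from itertools import product

article = {
    "title": "Apache Spark 1.1: The State of Spark Streaming",
    "authors": ["Arsalan Tavakoli-Shiraji", "Tathagata Das", "Patrick Wendell"],
    "categories": ["Apache Spark", "Engineering Blog", "Streaming"],
}

# Two chained LATERAL VIEW explode(...) clauses pair every author with every
# category. sorted() on the tuples models ORDER BY author, category.
combos = sorted(product(article["authors"], article["categories"]))

for author, category in combos:
    print(article["title"], "|", author, "|", category)
```

The row count is the product of the array sizes (3 x 3 = 9 here), so exploding several large arrays in one query can multiply the number of rows quickly.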


## Exercise 1

Identify all the articles written or co-written by Michael Armbrust.

### Step 1

Starting with the table `DatabricksBlog`, create a temporary view called `ArticlesByMichael` where:
0. Michael Armbrust is one of the authors
0. The data set contains the column `title` (it may contain others)
0. It contains only one record per article

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** See the Spark documentation, <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions" target="_blank">built-in functions</a>.  

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Include the column `authors` in your view, to help you debug your solution.

In [47]:
%sql
-- TODO

FILL_IN

In [48]:
# TEST - Run this cell to test your solution.

resultsDF = spark.sql("select title from ArticlesByMichael order by title")
dbTest("SQL-L5-articlesByMichael-count", 3, resultsDF.count())

results = [r[0] for r in resultsDF.collect()]
dbTest("SQL-L5-articlesByMichael-0", "Exciting Performance Improvements on the Horizon for Spark SQL", results[0])
dbTest("SQL-L5-articlesByMichael-1", "Spark SQL Data Sources API: Unified Data Access for the Apache Spark Platform", results[1])
dbTest("SQL-L5-articlesByMichael-2", "Spark SQL: Manipulating Structured Data Using Apache Spark", results[2])

print("Tests passed!")

### Step 2
Show the list of Michael Armbrust's articles.

In [50]:
%sql
-- TODO

FILL_IN

## Exercise 2

Identify the complete set of categories used in the Databricks blog articles.

### Step 1

Starting with the table `DatabricksBlog`, create another view called `UniqueCategories` where:
0. The data set contains only the column `category`
0. Each category appears only once

In [53]:
%sql
-- TODO

FILL_IN

In [54]:
# TEST - Run this cell to test your solution.

resultsDF = spark.sql("SELECT category FROM UniqueCategories ORDER BY category")

dbTest("SQL-L5-uniqueCategories-count", 12, resultsDF.count())

results = [r[0] for r in resultsDF.collect()]
dbTest("SQL-L5-uniqueCategories-0", "Announcements", results[0])
dbTest("SQL-L5-uniqueCategories-1", "Apache Spark", results[1])
dbTest("SQL-L5-uniqueCategories-2", "Company Blog", results[2])

dbTest("SQL-L5-uniqueCategories-9", "Platform", results[9])
dbTest("SQL-L5-uniqueCategories-10", "Product", results[10])
dbTest("SQL-L5-uniqueCategories-11", "Streaming", results[11])

print("Tests passed!")

### Step 2
Show the complete list of categories.

In [56]:
%sql
-- TODO

FILL_IN

## Exercise 3

Count how many times each category is referenced in the Databricks blog.

### Step 1

Starting with the table `DatabricksBlog`, create a temporary view called `TotalArticlesByCategory` where:
0. The view contains two columns, `category` and `total`
0. The `category` column is a single, distinct category (similar to the last exercise)
0. The `total` column is the total number of articles in that category

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** You need either multiple views or a `LATERAL VIEW` to solve this.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Because articles can be tagged with multiple categories, the totals add up to more than the total number of articles.

In [59]:
%sql
-- TODO

FILL_IN

In [60]:
# TEST - Run this cell to test your solution.

resultsDF = spark.sql("SELECT category, total FROM TotalArticlesByCategory ORDER BY category")
dbTest("SQL-L5-articlesByCategory-count", 12, resultsDF.count())

results = [ (r[0]+" w/"+str(r[1])) for r in resultsDF.collect()]

dbTest("SQL-L5-articlesByCategory-0", "Announcements w/72", results[0])
dbTest("SQL-L5-articlesByCategory-1", "Apache Spark w/132", results[1])
dbTest("SQL-L5-articlesByCategory-2", "Company Blog w/224", results[2])

dbTest("SQL-L5-articlesByCategory-9", "Platform w/4", results[9])
dbTest("SQL-L5-articlesByCategory-10", "Product w/83", results[10])
dbTest("SQL-L5-articlesByCategory-11", "Streaming w/21", results[11])

print("Tests passed!")

### Step 2
Display the total for each category, ordered by `category`.

In [62]:
%sql
-- TODO

FILL_IN

## Summary

* Spark SQL allows you to query and manipulate structured and semi-structured data
* Spark SQL's built-in functions provide powerful primitives for querying complex schemas

## Review Questions
**Q:** What is the syntax for accessing nested columns?  
**A:** Use the dot notation: ```SELECT dates.publishedOn```

**Q:** What is the syntax for accessing the first element in an array?  
**A:** Use the [subscript] notation:  ```SELECT authors[0]```

**Q:** What is the syntax for expanding an array into multiple rows?  
**A:** Use the `explode` function, either:  
```SELECT explode(authors) AS author``` or  
```LATERAL VIEW explode(authors) exploded_authors_view AS author```

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [66]:
%run "./Includes/Classroom-Cleanup"

## Next Steps

Start the next lesson, [Querying Data Lakes with SQL]($./SSQL 06 - Data Lakes).

## Additional Topics & Resources

* <a href="https://docs.databricks.com/spark/latest/spark-sql/index.html" target="_blank">Spark SQL Reference</a>
* <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html" target="_blank">Spark SQL, DataFrames and Datasets Guide</a>
* <a href="https://stackoverflow.com/questions/36876959/sparksql-can-i-explode-two-different-variables-in-the-same-query" target="_blank">SparkSQL: Can I explode two different variables in the same query? (StackOverflow)</a>

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>