
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# 3.5 Schemas and Types

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this notebook you:<br>
* Motivate the use of schemas and types
* Read from JSON without a schema
* Read from JSON with a schema

In [0]:
%run ../Includes/Classroom-Setup

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Why Schemas Matter

Schemas are at the heart of data structures in Spark.
**A schema describes the structure of your data by naming columns and declaring the type of data in that column.** 
Rigorously enforcing schemas leads to significant performance optimizations and reliability of code.

Why is open source Spark so fast, and why is [Databricks Runtime even faster?](https://databricks.com/blog/2017/07/12/benchmarking-big-data-sql-platforms-in-the-cloud.html) While there are many reasons for these performance improvements, two key reasons are:<br><br>
* First and foremost, Spark works in memory first rather than reading from and writing to disk at every step.
* Second, using DataFrames allows Spark to optimize the execution of your queries because it knows what your data looks like.

Two pillars of computer science education are data structures (the organization and storage of data) and algorithms (the computational procedures on that data).  A rigorous understanding of computer science involves both of these domains. When you apply the most relevant data structures, the algorithms that carry out the computation become significantly more elegant.

### Schemas with Semi-Structured JSON Data

**Tabular data**, such as that found in CSV files or relational databases, has a formal structure where each observation, or row, of the data has a value (even if it's a NULL value) for each feature, or column, in the data set.  

**Semi-structured data** does not need to conform to a formal data model. Instead, a given feature may appear zero, once, or many times for a given observation.  

Semi-structured data storage works well with hierarchical data and with schemas that may evolve over time.  One of the most common forms of semi-structured data is JSON data, which consists of attribute-value pairs.
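To make the distinction concrete, here is a minimal sketch in plain Python (standard library only, not Spark). The three records and the field names `id`, `tags`, and `unit` are hypothetical; the point is only that a given attribute may be absent, appear once, or hold many values.

```python
import json

# Hypothetical semi-structured records: "tags" holds zero, one, or many values.
raw_records = [
    '{"id": 1}',
    '{"id": 2, "tags": ["medical"]}',
    '{"id": 3, "tags": ["fire", "alarm"], "unit": {"type": "ENGINE"}}',
]

records = [json.loads(line) for line in raw_records]

# Count how many values each record carries for the "tags" attribute.
tag_counts = [len(r.get("tags", [])) for r in records]

# Collect every attribute name seen in any record. A tabular layout would
# need a column (possibly NULL) for each of these in every row.
all_fields = sorted({key for r in records for key in r})
```

A tabular representation of these records would need all three columns in every row, with NULLs wherever a record omits an attribute.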

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Reading from JSON w/ InferSchema

Reading in JSON is not much different from reading in CSV files.

Let's start by taking a look at the different options that go along with reading in JSON files.

### JSON Lines

Much like the CSV reader, the JSON reader also assumes:
* That there is one JSON object per line and
* That each object is delimited by a newline.

This format is referred to as **JSON Lines** or **newline-delimited JSON**.

More information about this format can be found at <a href="http://jsonlines.org/" target="_blank">http://jsonlines.org</a>.
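The one-object-per-line assumption can be sketched in plain Python (standard library only, not the Spark reader). The helper name `read_json_lines` is hypothetical, and the two records reuse only two attributes from the fire-calls sample shown later in this notebook.

```python
import io
import json

# JSON Lines content: one complete JSON object per line, newline-delimited.
json_lines_text = (
    '{"Call Number": 1030118, "Call Type": "Medical Incident"}\n'
    '{"Call Number": 1040007, "Call Type": "Structure Fire"}\n'
)

def read_json_lines(handle):
    """Parse each non-empty line as its own JSON object."""
    return [json.loads(line) for line in handle if line.strip()]

rows = read_json_lines(io.StringIO(json_lines_text))
```

Because every line is a complete object, a reader can split the file on newlines and parse the pieces independently, which is what makes the format convenient for parallel processing.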

** *Note:* ** *Spark 2.2 was released on July 11th, 2017. With that come File IO improvements for CSV & JSON, but more importantly, **support for parsing multi-line JSON and CSV files**. You can read more about that (and other features in Spark 2.2) in the <a href="https://databricks.com/blog/2017/07/11/introducing-apache-spark-2-2.html" target="_blank">Databricks Blog</a>.*

Take a look at the sample of our JSON data.

In [0]:
%fs ls /mnt/davis/fire-calls/fire-calls-truncated.json

path,name,size
dbfs:/mnt/davis/fire-calls/fire-calls-truncated.json,fire-calls-truncated.json,221798942


Like we did with the CSV file, we can use **&percnt;fs head ...** to take a look at the first few lines of the file.

In [0]:
%fs head /mnt/davis/fire-calls/fire-calls-truncated.json

### Read the JSON File

The command to read in JSON looks very similar to that of CSV.

In addition to reading the JSON file, we will also print the resulting schema.

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW fireCallsJSON
USING JSON 
OPTIONS (
    path "/mnt/davis/fire-calls/fire-calls-truncated.json"
  )

In [0]:
%sql
DESCRIBE fireCallsJSON

col_name,data_type,comment
ALS Unit,boolean,
Address,string,
Available DtTm,string,
Battalion,string,
Box,string,
Call Date,string,
Call Final Disposition,string,
Call Number,bigint,
Call Type,string,
Call Type Group,string,


Take a look at the table.

In [0]:
%sql
SELECT * FROM fireCallsJSON LIMIT 10

ALS Unit,Address,Available DtTm,Battalion,Box,Call Date,Call Final Disposition,Call Number,Call Type,Call Type Group,City,Dispatch DtTm,Entry DtTm,Final Priority,Fire Prevention District,Hospital DtTm,Incident Number,Location,Neighborhooods - Analysis Boundaries,Number of Alarms,On Scene DtTm,Original Priority,Priority,Received DtTm,Response DtTm,RowID,Station Area,Supervisor District,Transport DtTm,Unit ID,Unit Type,Unit sequence in call dispatch,Watch Date,Zipcode of Incident
False,4TH ST/CHANNEL ST,04/12/2000 09:45:28 PM,B03,2226,04/12/2000,Other,1030118,Medical Incident,,SF,04/12/2000 09:29:21 PM,04/12/2000 09:28:58 PM,3,3.0,,30625,"(37.7750268633971, -122.392346204303)",,1,04/12/2000 09:32:34 PM,3,3,04/12/2000 09:27:45 PM,04/12/2000 09:31:26 PM,001030118-E08,8,6,,E08,ENGINE,1,04/12/2000,
False,1800 Block of IRVING ST,04/12/2000 09:49:52 PM,B08,7424,04/12/2000,Other,1030122,Medical Incident,,SF,04/12/2000 09:34:10 PM,04/12/2000 09:33:48 PM,2,8.0,,30630,"(37.763482287794, -122.477678638767)",Sunset/Parkside,1,04/12/2000 09:45:22 PM,1,1,04/12/2000 09:31:55 PM,04/12/2000 09:35:59 PM,001030122-M18,22,4,,M18,MEDIC,1,04/12/2000,94122.0
False,0 Block of SOUTH VAN NESS AVE,04/12/2000 11:42:43 PM,B02,5117,04/12/2000,Other,1030154,Medical Incident,,SF,04/12/2000 10:49:59 PM,04/12/2000 10:45:53 PM,2,2.0,04/12/2000 11:22:17 PM,30662,"(37.7741251002903, -122.418810211803)",Mission,1,04/12/2000 10:53:18 PM,1,1,04/12/2000 10:43:54 PM,04/12/2000 10:50:35 PM,001030154-M36,36,6,04/12/2000 11:11:36 PM,M36,MEDIC,1,04/12/2000,94103.0
True,CLAYTON ST/PARNASSUS AV,04/13/2000 12:33:18 AM,B05,5151,04/13/2000,Other,1040007,Structure Fire,,SF,04/13/2000 12:29:35 AM,04/13/2000 12:29:24 AM,3,5.0,,30697,"(37.7651387353822, -122.44763462758)",Haight Ashbury,1,04/13/2000 12:32:36 AM,3,3,04/13/2000 12:19:54 AM,04/13/2000 12:31:25 AM,001040007-E12,12,5,,E12,ENGINE,1,04/12/2000,94117.0
True,500 Block of 38TH AVE,04/13/2000 02:40:25 AM,B07,7255,04/13/2000,Other,1040021,Medical Incident,,SF,04/13/2000 01:20:02 AM,04/13/2000 01:18:44 AM,3,7.0,04/13/2000 02:33:33 AM,30711,"(37.778489948235, -122.498662035969)",Outer Richmond,1,04/13/2000 01:24:05 AM,3,3,04/13/2000 01:17:25 AM,04/13/2000 01:21:40 AM,001040021-M14,34,1,04/13/2000 01:56:02 AM,M14,MEDIC,1,04/12/2000,94121.0
True,200 Block of MADRID ST,04/13/2000 09:26:54 AM,B09,613,04/13/2000,Other,1040061,Medical Incident,,SF,04/13/2000 07:55:54 AM,04/13/2000 07:55:35 AM,3,9.0,04/13/2000 08:28:37 AM,30749,"(37.7255316247491, -122.429925994016)",Excelsior,1,,3,3,04/13/2000 07:51:29 AM,04/13/2000 07:59:58 AM,001040061-M43,43,11,04/13/2000 08:16:30 AM,M43,MEDIC,3,04/12/2000,94112.0
False,2800 Block of BROADWAY,04/13/2000 09:39:36 AM,B04,4226,04/13/2000,Other,1040079,Alarms,,SF,04/13/2000 09:34:10 AM,04/13/2000 09:33:04 AM,3,4.0,,30766,"(37.7931736175933, -122.444028632879)",Pacific Heights,1,04/13/2000 09:37:59 AM,3,3,04/13/2000 09:31:19 AM,04/13/2000 09:35:52 AM,001040079-E10,10,2,,E10,ENGINE,1,04/13/2000,94123.0
True,2500 Block of OCEAN AVE,04/13/2000 01:16:20 PM,B08,8452,04/13/2000,Other,1040143,Medical Incident,,SF,04/13/2000 01:13:08 PM,04/13/2000 01:04:12 PM,2,8.0,,30832,"(37.7314853147957, -122.472647880057)",West of Twin Peaks,1,04/13/2000 01:29:19 PM,1,1,04/13/2000 01:01:56 PM,,001040143-M43,19,7,04/13/2000 01:36:34 PM,M43,MEDIC,1,04/13/2000,94132.0
False,POLK ST/UNION ST,04/13/2000 02:22:01 PM,B04,3131,04/13/2000,Other,1040170,Structure Fire,,SF,04/13/2000 02:15:00 PM,04/13/2000 02:12:27 PM,3,4.0,,30855,"(37.7987615790944, -122.422336952094)",Russian Hill,1,,3,3,04/13/2000 02:09:54 PM,,001040170-T16,4,3,,T16,TRUCK,2,04/13/2000,94109.0
False,CALL BOX: FS TI,04/13/2000 05:52:48 PM,B03,2931,04/13/2000,Other,1040233,Alarms,,TI,04/13/2000 05:25:02 PM,04/13/2000 05:23:58 PM,3,,,30914,"(37.8225682263653, -122.371537518925)",Treasure Island,1,04/13/2000 05:29:16 PM,3,3,04/13/2000 05:23:03 PM,,001040233-E48,48,6,,E48,ENGINE,1,04/13/2000,94130.0


### Review: Reading from JSON w/ InferSchema

While there are similarities between reading in CSV and JSON, there are some key differences:
* We only need one job even when inferring the schema.
* There is no header, which is why there isn't a second job in this case. The column names are extracted from the JSON object's attributes.
* Unlike the CSV reader, which reads in 100% of the data, the JSON reader only samples the data.  
**Note:** In Spark 2.2 the behavior was changed to read in the entire JSON file.
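The idea of taking column names from the JSON attributes and deriving types from the values can be sketched in plain Python. This is an illustration only, not Spark's inference algorithm: the helper names are hypothetical, the type names are simplified, and falling back to `string` when records disagree is an assumption made for this sketch.

```python
import json

def infer_type(value):
    """Map a parsed JSON value to a simplified type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"

def infer_schema(lines):
    """Scan every record; column names come from the JSON attributes."""
    schema = {}
    for line in lines:
        for name, value in json.loads(line).items():
            if value is None:
                continue  # a null value tells us nothing about the type
            found = infer_type(value)
            # Assumption for this sketch: conflicting types fall back to string.
            if schema.setdefault(name, found) != found:
                schema[name] = "string"
    return schema

sample = [
    '{"ALS Unit": false, "Call Number": 1030118, "Call Type": "Medical Incident"}',
    '{"ALS Unit": true, "Call Number": 1040007, "Call Type": "Structure Fire"}',
]
inferred = infer_schema(sample)
```

Note that every record has to be parsed before the schema is known, which is the cost that a user-supplied schema avoids.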

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Reading from JSON w/ a User Defined Schema

### User-Defined Schemas

Spark infers schemas from the data, as detailed in the example above.  Reasons to provide your own schema instead include:  
<br>
* Schema inference means Spark scans all of your data, creating an extra job, which can affect performance
* You can provide alternative data types (for example, change a `Long` to an `Integer`)
* You can leave out certain fields, so that Spark reads only the data of interest

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW fireCallsJSON ( 
  `Call Number` INT,
  `Unit ID` STRING,
  `Incident Number` INT,
  `Call Type` STRING,
  `Call Date` STRING,
  `Watch Date` STRING,
  `Received DtTm` STRING,
  `Entry DtTm` STRING,
  `Dispatch DtTm` STRING,
  `Response DtTm` STRING,
  `On Scene DtTm` STRING,
  `Transport DtTm` STRING,
  `Hospital DtTm` STRING,
  `Call Final Disposition` STRING,
  `Available DtTm` STRING,
  `Address` STRING,
  `City` STRING,
  `Zipcode of Incident` INT,
  `Battalion` STRING,
  `Station Area` STRING,
  `Box` STRING,
  `Original Priority` STRING,
  `Priority` STRING,
  `Final Priority` INT,
  `ALS Unit` BOOLEAN,
  `Call Type Group` STRING,
  `Number of Alarms` INT,
  `Unit Type` STRING,
  `Unit sequence in call dispatch` INT,
  `Fire Prevention District` STRING,
  `Supervisor District` STRING,
  `Neighborhooods - Analysis Boundaries` STRING,
  `Location` STRING,
  `RowID` STRING
)
USING JSON 
OPTIONS (
    path "/mnt/davis/fire-calls/fire-calls-truncated.json"
)

Take a look at how much faster that process was!
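The effect of supplying a schema can be sketched in plain Python (standard library only). This is not how Spark implements it: the `schema` dictionary, the converter functions, and the helper `apply_schema` are all hypothetical. The sketch shows only that a declared schema lets the reader convert each declared field directly and ignore everything else, with no scan to discover names or types.

```python
import json

# Hypothetical schema: attribute name -> Python converter.
# Only the fields listed here are kept.
schema = {
    "Call Number": int,
    "Call Type": str,
    "ALS Unit": bool,
}

def apply_schema(line, schema):
    """Parse one JSON line, keeping declared fields; missing ones become None."""
    record = json.loads(line)
    row = {}
    for name, convert in schema.items():
        value = record.get(name)
        row[name] = None if value is None else convert(value)
    return row

line = (
    '{"Call Number": 1030118, "Call Type": "Medical Incident", '
    '"ALS Unit": false, "Box": "2226"}'
)
row = apply_schema(line, schema)
```

The `Box` attribute is present in the input but absent from the result, because the schema does not declare it.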

### Primitive and Non-primitive Types

The Spark [`types` package](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.types) provides the building blocks for constructing schemas.

A primitive type contains the data itself.  The most common primitive types include:

| Numeric | General | Time |
|-----|-----|-----|
| `FloatType` | `StringType` | `TimestampType` | 
| `IntegerType` | `BooleanType` | `DateType` | 
| `DoubleType` | `NullType` | |
| `LongType` | | |
| `ShortType` |  | |

Non-primitive types are sometimes called reference variables or composite types.  Technically, non-primitive types contain references to memory locations and not the data itself.  Non-primitive types are the composite of a number of primitive types such as an Array of the primitive type `Integer`.

The two most common composite types are `ArrayType` and `MapType`. These types allow for a given field to contain an arbitrary number of elements in either an Array/List or Map/Dictionary form.
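As a rough analogy in plain Python (not the Spark `types` API), an array field corresponds to a list and a map field to a dictionary. The record, its field names `unit_ids` and `unit_types`, and the two checker functions are hypothetical and exist only to illustrate the shape of such fields.

```python
import json

# Hypothetical record with one array-valued and one map-valued field.
record = json.loads(
    '{"unit_ids": [8, 18, 36], "unit_types": {"E08": "ENGINE", "M18": "MEDIC"}}'
)

def is_array_of(value, element_type):
    """True when value is a list whose elements all have element_type."""
    return isinstance(value, list) and all(
        isinstance(v, element_type) for v in value
    )

def is_map_of(value, key_type, value_type):
    """True when value is a dict with uniformly typed keys and values."""
    return isinstance(value, dict) and all(
        isinstance(k, key_type) and isinstance(v, value_type)
        for k, v in value.items()
    )

array_ok = is_array_of(record["unit_ids"], int)
map_ok = is_map_of(record["unit_types"], str, str)
```

In both cases the field holds an arbitrary number of elements, while the element type (and, for a map, the key type) is fixed by the schema.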

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> See the [Spark documentation](http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types) for a complete picture of types in Spark.

### Review: Reading from JSON w/ User-Defined Schema
* Just like CSV, providing the schema avoids the extra jobs.
* The schema allows us to specify alternate data types and to read only the fields of interest. Field names must match the JSON attribute names.
* The schema can get arbitrarily complex in its structure.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>