




# Extracting Data Directly From Files with Spark SQL

In this notebook, you'll learn to extract data directly from files using Spark SQL on Databricks.

A number of file formats support this option, but it is most useful for self-describing data formats (such as Parquet and JSON).

## Learning Objectives
By the end of this lesson, you should be able to:
- Use Spark SQL to directly query data files
- Layer views and CTEs to make referencing data files easier
- Leverage **`text`** and **`binaryFile`** methods to review raw file contents



## Run Setup

The setup script will create the data and declare necessary values for the rest of this notebook to execute.

In [0]:
%run ./Includes/Classroom-Setup-02.1



## Data Overview

In this example, we'll work with a sample of raw Kafka data written as JSON files. 

Each file contains all records consumed during a 5-second interval, stored with the full Kafka schema as a multiple-record JSON file.

| field | type | description |
| --- | --- | --- |
| key | BINARY | The **`user_id`** field is used as the key; this is a unique alphanumeric field that corresponds to session/cookie information |
| value | BINARY | This is the full data payload (to be discussed later), sent as JSON |
| topic | STRING | While the Kafka service hosts multiple topics, only those records from the **`clickstream`** topic are included here |
| partition | INTEGER | Our current Kafka implementation uses only 2 partitions (0 and 1) |
| offset | LONG | This is a unique value, monotonically increasing for each partition |
| timestamp | LONG | This timestamp is recorded as milliseconds since epoch, and represents the time at which the producer appends a record to a partition |
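
To make the schema above concrete, here is a minimal Python sketch that decodes one record. The record itself is hypothetical (every value is made up), and it assumes that **`BINARY`** fields show up as base64-encoded strings in the JSON files. The helper name **`decode_record`** is ours, not part of the course material.

```python
import base64
import json
from datetime import datetime, timezone

# Hypothetical record shaped like the schema above; every value is made up.
# Assumption: BINARY fields are stored as base64 strings in the JSON files.
record = {
    "key": base64.b64encode(b"UA000000107384208").decode("ascii"),
    "value": base64.b64encode(b'{"event_name": "main"}').decode("ascii"),
    "topic": "clickstream",
    "partition": 0,
    "offset": 219255030,
    "timestamp": 1593880885085,  # milliseconds since epoch
}


def decode_record(rec):
    """Return a copy with key/value decoded and timestamp as a UTC datetime."""
    return {
        **rec,
        "key": base64.b64decode(rec["key"]).decode("utf-8"),
        "value": json.loads(base64.b64decode(rec["value"])),
        # Divide by 1000 because the source stores milliseconds, not seconds.
        "timestamp": datetime.fromtimestamp(rec["timestamp"] / 1000, tz=timezone.utc),
    }


decoded = decode_record(record)
print(decoded["key"], decoded["value"], decoded["timestamp"].isoformat())
```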



Note that our source directory contains many JSON files.

In [0]:
%python
print(DA.paths.kafka_events)

files = dbutils.fs.ls(DA.paths.kafka_events)
display(files)



Here, we'll be using relative file paths to data that's been written to the DBFS root. 

Most workflows will require users to access data from external cloud storage locations. 

In most companies, a workspace administrator will be responsible for configuring access to these storage locations.

Instructions for configuring and accessing these locations can be found in the cloud-vendor-specific self-paced courses titled "Cloud Architecture & Systems Integrations".



## Query a Single File

To query the data contained in a single file, execute the query with the following pattern:

<strong><code>SELECT * FROM file_format.&#x60;/path/to/file&#x60;</code></strong>

Make special note of the use of back-ticks (not single quotes) around the path.
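
If you build these queries from Python, a small helper can keep the back-ticks in the right place. The sketch below is only an illustration: **`file_query`** is a hypothetical name, and the check for embedded back-ticks is our own precaution rather than anything the platform requires.

```python
def file_query(file_format: str, path: str) -> str:
    """Build a direct-file query, wrapping the path in back-ticks."""
    if "`" in path:
        # A back-tick inside the path would end the quoted identifier early.
        raise ValueError("path must not contain a back-tick")
    return f"SELECT * FROM {file_format}.`{path}`"


query = file_query("json", "/path/to/file")
print(query)  # SELECT * FROM json.`/path/to/file`
```

In a notebook, the resulting string could then be passed to **`spark.sql`**.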

In [0]:
SELECT * FROM json.`${DA.paths.kafka_events}/001.json`




Note that our preview displays all 321 rows of our source file.



## Query a Directory of Files

Assuming all of the files in a directory have the same format and schema, all files can be queried simultaneously by specifying the directory path rather than an individual file.

In [0]:
SELECT * FROM json.`${DA.paths.kafka_events}`



By default, the notebook displays only the first 10,000 records or 2 MB of this query's results, whichever is less.
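
As a rough mental model of what querying a directory does, the plain-Python sketch below reads every file in a folder as one combined set of records. It is a local stand-in and does not use Spark; the directory, file names, and record contents are all made up, and **`read_directory`** is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

# Build a throwaway directory of newline-delimited JSON files (made-up data).
source_dir = Path(tempfile.mkdtemp())
for name, offsets in {"001.json": [1, 2], "002.json": [3]}.items():
    lines = [json.dumps({"topic": "clickstream", "offset": o}) for o in offsets]
    (source_dir / name).write_text("\n".join(lines) + "\n")


def read_directory(directory):
    """Yield every record from every .json file in the directory."""
    for file_path in sorted(directory.glob("*.json")):
        with file_path.open() as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    yield json.loads(line)


records = list(read_directory(source_dir))
print(len(records))  # 3
```

This only works cleanly because every file shares the same format and schema, which is the same assumption the directory query relies on.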



## Create References to Files
This ability to directly query files and directories means that additional Spark logic can be chained to queries against files.

When we create a view from a query against a path, we can reference this view in later queries.

In [0]:
CREATE OR REPLACE VIEW event_view
AS SELECT * FROM json.`${DA.paths.kafka_events}`


As long as a user has permission to access the view and the underlying storage location, that user will be able to use this view definition to query the underlying data. This applies to different users in the workspace, different notebooks, and different clusters.

In [0]:
SELECT * FROM event_view

## Create Temporary References to Files

Temporary views similarly alias queries to a name that's easier to reference in later queries.

In [0]:
CREATE OR REPLACE TEMP VIEW events_temp_view
AS SELECT * FROM json.`${DA.paths.kafka_events}`


Temporary views exist only for the current SparkSession. On Databricks, this means they are isolated to the current notebook, job, or DBSQL query.

In [0]:
SELECT * FROM events_temp_view
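
The difference in scope between a view and a temporary view can be tried locally. The sketch below uses SQLite from the Python standard library as a stand-in for Spark, treating a second database connection as a rough analogue of a second SparkSession. This is an analogy under that assumption, not a description of how Databricks implements views; the **`events`** table and the **`can_query`** helper are hypothetical.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

# The first connection plays the role of the notebook that defines the views.
first = sqlite3.connect(db_path)
first.execute("CREATE TABLE events (id INTEGER)")
first.execute("INSERT INTO events VALUES (1)")
first.execute("CREATE VIEW event_view AS SELECT * FROM events")
first.execute("CREATE TEMP VIEW events_temp_view AS SELECT * FROM events")
first.commit()

# The second connection plays the role of a different session.
second = sqlite3.connect(db_path)


def can_query(conn, name):
    """Return True if the named view can be queried on this connection."""
    try:
        conn.execute(f"SELECT * FROM {name}").fetchall()
        return True
    except sqlite3.OperationalError:
        return False


visibility = {
    "event_view": can_query(second, "event_view"),
    "events_temp_view": can_query(second, "events_temp_view"),
}
print(visibility)  # {'event_view': True, 'events_temp_view': False}

first.close()
second.close()
```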

## Apply CTEs for Reference within a Query 
Common table expressions (CTEs) are perfect when you want a short-lived, human-readable reference to the results of a query.

In [0]:
WITH cte_json
AS (SELECT * FROM json.`${DA.paths.kafka_events}`)
SELECT * FROM cte_json

CTEs only alias the results of a query while that query is being planned and executed.

As such, **the following cell will throw an error when uncommented and executed**.

In [0]:
-- SELECT COUNT(*) FROM cte_json
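
The same scoping rule can be reproduced outside Databricks. The sketch below uses SQLite from the Python standard library as a stand-in, so the error type and message will differ from what Spark reports; the **`events`** table and its rows are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(1,), (2,), (3,)])

# The CTE name resolves inside the statement that defines it.
rows = conn.execute(
    "WITH cte_events AS (SELECT * FROM events) SELECT COUNT(*) FROM cte_events"
).fetchall()
print(rows)  # [(3,)]

# A later statement cannot see the CTE, so the lookup fails.
try:
    conn.execute("SELECT COUNT(*) FROM cte_events")
    cte_visible = True
except sqlite3.OperationalError as err:
    cte_visible = False
    print(f"error: {err}")

conn.close()
```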



## Extract Text Files as Raw Strings

When working with text-based files (which include JSON, CSV, TSV, and TXT formats), you can use the **`text`** format to load each line of the file as a row with one string column named **`value`**. This can be useful when data sources are prone to corruption and custom text parsing functions will be used to extract values from text fields.

In [0]:
SELECT * FROM text.`${DA.paths.kafka_events}`
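
Once each line is available as a raw string, a custom parser can decide what to do with records that fail to parse. The sketch below shows one possible shape for such a parser in plain Python. The sample lines are hypothetical, and **`parse_line`** is an illustrative helper rather than anything provided by Spark.

```python
import json

# Hypothetical raw lines, as the single `value` column would hold them.
raw_lines = [
    '{"topic": "clickstream", "partition": 0, "offset": 1}',
    '{"topic": "clickstream", "partition": 1, "offs',  # truncated record
    '{"topic": "clickstream", "partition": 1, "offset": 2}',
]


def parse_line(line):
    """Return (record, None) on success or (None, raw_line) if parsing fails."""
    try:
        return json.loads(line), None
    except json.JSONDecodeError:
        return None, line


parsed, corrupt = [], []
for line in raw_lines:
    record, bad = parse_line(line)
    if bad is None:
        parsed.append(record)
    else:
        corrupt.append(bad)  # keep the raw text for later inspection

print(len(parsed), len(corrupt))  # 2 1
```

Keeping the corrupt lines instead of dropping them makes it possible to inspect or repair them later.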



## Extract the Raw Bytes and Metadata of a File

Some workflows may require working with entire files, such as when dealing with images or unstructured data. Using **`binaryFile`** to query a directory will provide file metadata alongside the binary representation of the file contents.

Specifically, the fields created will indicate the **`path`**, **`modificationTime`**, **`length`**, and **`content`**.

In [0]:
SELECT * FROM binaryFile.`${DA.paths.kafka_events}`
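
To get a feel for the shape of each row, the sketch below builds a dictionary with the same four field names for a local file, using only the Python standard library. It is a stand-in for illustration and does not call Spark; **`describe_file`** is a hypothetical helper, and the temporary file contents are made up.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Return a dict that uses the same four field names as binaryFile."""
    stat = os.stat(path)
    with open(path, "rb") as handle:
        content = handle.read()
    return {
        "path": path,
        "modificationTime": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
        "length": stat.st_size,  # size in bytes
        "content": content,      # raw bytes of the whole file
    }


# Write a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as tmp:
    tmp.write(b'{"offset": 1}\n')

info = describe_file(tmp.name)
print(info["length"], info["content"])
os.remove(tmp.name)
```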


 
Run the following cell to delete the tables and files associated with this lesson.

In [0]:
%python 
DA.cleanup()


&copy; 2024 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>