# Chapter 4. Spark SQL and DataFrames: Introduction to Built-in Data Sources

The Spark SQL engine provides a unified foundation for the high-level DataFrame and Dataset APIs.

In particular, Spark SQL:

* Provides the engine upon which the high-level Structured APIs are built.
* Can read and write data in a variety of structured formats (e.g., JSON, Hive tables, Parquet, Avro, ORC, CSV).
* Lets you query data using JDBC/ODBC connectors from external business intelligence (BI) data sources such as Tableau, Power BI, Talend, or from RDBMSs such as MySQL and PostgreSQL.
* Provides a programmatic interface to interact with structured data stored as tables or views in a database from a Spark application.
* Offers an interactive shell to issue SQL queries on your structured data.
* Supports ANSI SQL:2003-compliant commands and HiveQL.

## Using Spark SQL in Spark Applications

The **SparkSession**, introduced in Spark 2.0, provides a unified entry point for programming Spark with the Structured APIs. You can use a SparkSession to access Spark functionality: just import the class and create an instance in your code.

To issue any SQL query, use the `sql()` method on the SparkSession instance, `spark`, such as `spark.sql("SELECT * FROM myTableName")`.

**All spark.sql queries executed in this manner return a DataFrame** on which you may perform further Spark operations if you desire.

In [1]:
# Import PySpark and libraries
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.sql import Row

In [2]:
# Build a SparkSession
spark = SparkSession.builder.appName("chapter4").getOrCreate()

# Print SparkSession
spark

<pyspark.sql.session.SparkSession object at 0x000001F1BCC43040>
Spark version:  3.0.1


In [3]:
# Path to data set
file_path = "./datasets/departure_delays.csv"

# Define schema
flights_schema = StructType([
    StructField("date", StringType()),
    StructField("delay", IntegerType()),
    StructField("distance", IntegerType()),
    StructField("origin", StringType()),
    StructField("destination", StringType())
])

# Create a DataFrame
us_delay_flights_df = spark.read.csv(path=file_path,
                                     header=True,
                                     schema=flights_schema)


# Show the first 20 rows of the dataset
us_delay_flights_df.show()

+--------+-----+--------+------+-----------+
|    date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01011245|    6|     602|   ABE|        ATL|
|01020600|   -8|     369|   ABE|        DTW|
|01021245|   -2|     602|   ABE|        ATL|
|01020605|   -4|     602|   ABE|        ATL|
|01031245|   -4|     602|   ABE|        ATL|
|01030605|    0|     602|   ABE|        ATL|
|01041243|   10|     602|   ABE|        ATL|
|01040605|   28|     602|   ABE|        ATL|
|01051245|   88|     602|   ABE|        ATL|
|01050605|    9|     602|   ABE|        ATL|
|01061215|   -6|     602|   ABE|        ATL|
|01061725|   69|     602|   ABE|        ATL|
|01061230|    0|     369|   ABE|        DTW|
|01060625|   -3|     602|   ABE|        ATL|
|01070600|    0|     369|   ABE|        DTW|
|01071725|    0|     602|   ABE|        ATL|
|01071230|    0|     369|   ABE|        DTW|
|01070625|    0|     602|   ABE|        ATL|
|01071219|    0|     569|   ABE|        ORD|
|01080600|

In [4]:
# Create a temporary view from the DataFrame
us_delay_flights_df.createOrReplaceTempView("delay_flights_tbl")

# Print the tables in the catalog
print(spark.catalog.listTables())

[Table(name='delay_flights_tbl', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]


In [45]:
# Show views and databases
spark.sql("SHOW VIEWS").show(truncate=False)
spark.sql("SHOW DATABASES").show(truncate=False)

+---------+------------------------------+-----------+
|namespace|viewName                      |isTemporary|
+---------+------------------------------+-----------+
|         |delay_flights_tbl             |true       |
|         |us_origin_airport_jfk_tmp_view|true       |
+---------+------------------------------+-----------+

+--------------+
|namespace     |
+--------------+
|default       |
|learn_spark_db|
+--------------+



In [17]:
# Show views and tables
spark.sql("SHOW VIEWS").show(truncate=False)
spark.sql("SHOW TABLES").show(truncate=False)

+---------+-----------------+-----------+
|namespace|viewName         |isTemporary|
+---------+-----------------+-----------+
|         |delay_flights_tbl|true       |
+---------+-----------------+-----------+

+--------------+-------------------------------+-----------+
|database      |tableName                      |isTemporary|
+--------------+-------------------------------+-----------+
|learn_spark_db|df_managed_us_delay_flights_tbl|false      |
|learn_spark_db|unmanaged_us_delay_flights_tbl |false      |
|              |delay_flights_tbl              |true       |
+--------------+-------------------------------+-----------+



In [6]:
query = """
    SELECT distance, origin, destination
    FROM delay_flights_tbl
    WHERE distance > 1000
    ORDER BY distance DESC;
"""

spark.sql(query).show(10)
print(type(spark.sql(query)))

+--------+------+-----------+
|distance|origin|destination|
+--------+------+-----------+
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
+--------+------+-----------+
only showing top 10 rows

<class 'pyspark.sql.dataframe.DataFrame'>


In [None]:
# In Python
(us_delay_flights_df.select("distance", "origin", "destination")
    .where("distance > 1000") #(F.col("distance") > 1000)
    .orderBy("distance", ascending=False) #(F.desc("distance"))
    .show(10)
)

In [None]:
query2 = """
    SELECT date, delay, origin, destination
    FROM delay_flights_tbl
    WHERE delay > 120 AND origin = 'SFO' AND destination = 'ORD'
    ORDER BY delay DESC;
"""
spark.sql(query2).show(10)

To enable you to query structured data as shown in the preceding examples, Spark manages all the complexities of creating and managing views and tables, both in memory and on disk. That leads us to our next topic: how tables and views are created and managed.

## SQL Tables and Views

Tables hold data. Associated with each table in Spark is its relevant **metadata**, which is information about the table and its data: the schema, description, table name, database name, column names, partitions, physical location where the actual data resides, etc. All of this is stored in a central metastore.

Instead of maintaining a separate metastore for Spark tables, Spark by default uses the **Apache Hive metastore** to persist all the metadata about your tables. The table data itself is stored in a warehouse directory: `/user/hive/warehouse` in a typical Hive deployment, or a `spark-warehouse` folder under the working directory when no Hive configuration is present.

However, you may change the default location by setting the Spark config variable `spark.sql.warehouse.dir` to another location, which can be on local or external distributed storage.

### Managed vs Unmanaged Tables

Spark allows you to create two types of tables: managed and unmanaged. 

* For a **managed table**, Spark manages both the metadata and the data in the file store. This could be a local filesystem, HDFS, or an object store such as Amazon S3 or Azure Blob.

* For an **unmanaged table**, Spark only manages the metadata, while you manage the data yourself in an external data source such as Cassandra.

With a **managed table**, because Spark manages everything, a SQL command such as `DROP TABLE table_name` deletes both the metadata and the data. With an **unmanaged table**, the same command will delete only the metadata, not the actual data.

### Creating SQL Databases and Tables

Tables reside within a database. By default, Spark creates tables under the *default* database. To create your own database, you can issue a SQL command from your Spark application or notebook.

Using the US flight delays data set, let’s create both a managed and an unmanaged table. To begin, we’ll create a database called learn_spark_db and tell Spark we want to use that database:

In [8]:
# Create a database
spark.sql("CREATE DATABASE IF NOT EXISTS learn_spark_db")

# Use the new database created
spark.sql("USE learn_spark_db")

# Check databases in Spark
spark.sql("SHOW DATABASES").show()

+--------------+
|     namespace|
+--------------+
|       default|
|learn_spark_db|
+--------------+



#### Creating a managed table

To create a managed table within the database learn_spark_db, you can issue a SQL query like the following:

In [11]:
# Create managed table using SQL query
query = """
    CREATE TABLE managed_us_delay_flights_tbl (
    date STRING,
    delay INT,
    distance INT,
    origin STRING,
    destination STRING)
"""
# Without Hive support this statement raises:
# AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;
# Adding a USING clause (for example, USING parquet) creates a data source table instead.
#spark.sql(query)

# Create managed table using DataFrame API
us_delay_flights_df.write.saveAsTable("df_us_delay_flights_managed_table")

In [24]:
spark.sql("SHOW TABLES").show(truncate=False)

+--------------+-----------------------------------+-----------+
|database      |tableName                          |isTemporary|
+--------------+-----------------------------------+-----------+
|learn_spark_db|df_us_delay_flights_managed_table  |false      |
|learn_spark_db|df_us_delay_flights_unmanaged_table|false      |
|              |delay_flights_tbl                  |true       |
+--------------+-----------------------------------+-----------+



In [9]:
# Show tables
spark.sql("SHOW TABLES").show(truncate=False)

+--------------+-------------------------------+-----------+
|database      |tableName                      |isTemporary|
+--------------+-------------------------------+-----------+
|learn_spark_db|df_managed_us_delay_flights_tbl|false      |
|              |delay_flights_tbl              |true       |
+--------------+-------------------------------+-----------+



In [16]:
spark.sql("SELECT * FROM df_us_delay_flights_managed_table LIMIT 5").show()

+--------+-----+--------+------+-----------+
|    date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|02151800|  108|     290|   ORD|        MSP|
|02151800|  142|     772|   ORD|        DEN|
|02151303|   16|    1516|   ORD|        LAX|
|02151157|    7|    1316|   ORD|        LAS|
|02151818|   55|    1511|   ORD|        PDX|
+--------+-----+--------+------+-----------+



#### Creating an unmanaged table

You can also create unmanaged tables from your own data sources, such as Parquet, CSV, or JSON files stored in a file store accessible to your Spark application.

To create an unmanaged table from a data source such as a CSV file, in SQL use:

In [23]:
# Create unmanaged table using SQL query
query = """ 
    CREATE TABLE df_us_delay_flights_unmanaged_table (
    date STRING, 
    delay INT, 
    distance INT, 
    origin STRING, 
    destination STRING )
    USING CSV
    OPTIONS (PATH "./datasets/departure_delays.csv", HEADER "TRUE");
"""

spark.sql(query)

DataFrame[]

In [11]:
# Show tables in the learn_spark_db database
spark.sql("SHOW TABLES").show(truncate=False)

+--------------+-------------------------------+-----------+
|database      |tableName                      |isTemporary|
+--------------+-------------------------------+-----------+
|learn_spark_db|df_managed_us_delay_flights_tbl|false      |
|learn_spark_db|unmanaged_us_delay_flights_tbl |false      |
|              |delay_flights_tbl              |true       |
+--------------+-------------------------------+-----------+



In [12]:
spark.sql("SELECT * FROM unmanaged_us_delay_flights_tbl LIMIT 5").show()

+--------+-----+--------+------+-----------+
|    date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01011245|    6|     602|   ABE|        ATL|
|01020600|   -8|     369|   ABE|        DTW|
|01021245|   -2|     602|   ABE|        ATL|
|01020605|   -4|     602|   ABE|        ATL|
|01031245|   -4|     602|   ABE|        ATL|
+--------+-----+--------+------+-----------+



In [None]:
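# Create unmanaged table using DataFrame API
# (the output path must be a separate location, not the source CSV file;
#  "/tmp/data/us_flights_delay" is a placeholder)
#(us_delay_flights_df
# .write
# .option("path", "/tmp/data/us_flights_delay")
# .saveAsTable("df_unmanaged_us_delay_flights_tbl")
#)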

In [None]:
#spark.sql("DROP TABLE unmanaged_us_delay_flights_tbl")

### Creating Views

In addition to creating tables, Spark can create **views** on top of existing tables. 

Views can be **global** (visible across all SparkSessions within a Spark application) or **session-scoped** (visible only to a single SparkSession), and they are **temporary**: they disappear after your Spark application terminates.

Creating views has a similar syntax to creating tables within a database. Once you create a view, you can query it as you would a table. 

**The difference between a view and a table is that views don’t actually hold the data; tables persist after your Spark application terminates, but views disappear**.

In [35]:
# In SQL (equivalent to the DataFrame calls below)
query1 = """
    CREATE OR REPLACE GLOBAL TEMP VIEW us_origin_airport_SFO_global_tmp_view AS
    SELECT date, delay, origin, destination
    FROM delay_flights_tbl
    WHERE origin = 'SFO'
"""
query2 = """
    CREATE OR REPLACE TEMP VIEW us_origin_airport_JFK_tmp_view AS
    SELECT date, delay, origin, destination
    FROM delay_flights_tbl
    WHERE origin = 'JFK'
"""

# In Python
df_sfo = spark.sql("SELECT date, delay, origin, destination FROM delay_flights_tbl WHERE origin = 'SFO'")
df_jfk = spark.sql("SELECT date, delay, origin, destination FROM delay_flights_tbl WHERE origin = 'JFK'")

# Create a global temporary view and a session-scoped temporary view
df_sfo.createOrReplaceGlobalTempView("us_origin_airport_SFO_global_tmp_view")
df_jfk.createOrReplaceTempView("us_origin_airport_JFK_tmp_view")

In [43]:
spark.catalog.listTables("global_temp")

[Table(name='us_origin_airport_sfo_global_tmp_view', database='global_temp', description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='delay_flights_tbl', database=None, description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='us_origin_airport_jfk_tmp_view', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

In [37]:
spark.sql("SHOW TABLES").show(truncate=False)

+--------------+-----------------------------------+-----------+
|database      |tableName                          |isTemporary|
+--------------+-----------------------------------+-----------+
|learn_spark_db|df_us_delay_flights_managed_table  |false      |
|learn_spark_db|df_us_delay_flights_unmanaged_table|false      |
|              |delay_flights_tbl                  |true       |
|              |us_origin_airport_jfk_tmp_view     |true       |
+--------------+-----------------------------------+-----------+



Once you’ve created these views, you can issue queries against them just as you would against a table. Keep in mind that when accessing a global temporary view you must use the prefix `global_temp.<view_name>`, because **Spark creates global temporary views in a global temporary database called global_temp**.

For example:

In [19]:
spark.sql("SELECT * FROM global_temp.us_origin_airport_SFO_global_tmp_view").show(5)

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01011250|   55|   SFO|        JFK|
|01012230|    0|   SFO|        JFK|
|01010705|   -7|   SFO|        JFK|
|01010620|   -3|   SFO|        MIA|
|01010915|   -3|   SFO|        LAX|
+--------+-----+------+-----------+
only showing top 5 rows



By contrast, you can access the normal temporary view without the `global_temp` prefix:

In [28]:
#spark.sql("SELECT * FROM us_origin_airport_JFK_tmp_view").show(5)
spark.read.table("us_origin_airport_JFK_tmp_view").show(5)

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01010900|   14|   JFK|        LAX|
|01011200|   -3|   JFK|        LAX|
|01011900|    2|   JFK|        LAX|
|01011700|   11|   JFK|        LAS|
|01010800|   -1|   JFK|        SFO|
+--------+-----+------+-----------+
only showing top 5 rows



You can also drop a view just like you would a table:

In [27]:
## In SQL (a global temporary view must be qualified with global_temp)
#DROP VIEW IF EXISTS global_temp.us_origin_airport_SFO_global_tmp_view;
#DROP VIEW IF EXISTS us_origin_airport_JFK_tmp_view;

# In Python
#spark.catalog.dropGlobalTempView("us_origin_airport_SFO_global_tmp_view")
#spark.catalog.dropTempView("us_origin_airport_JFK_tmp_view")

#### Temporary views versus global temporary views

A **temporary view** is tied to a single SparkSession within a Spark application. 

In contrast, a **global temporary** view is visible across multiple SparkSessions within a Spark application. 

Yes, you can create multiple SparkSessions within a single Spark application—this can be handy, for example, in cases where you want to access (and combine) data from two different SparkSessions that don’t share the same Hive metastore configurations.

### Viewing the Metadata

Spark manages the metadata associated with each managed or unmanaged table. This is captured in the **Catalog**, a high-level abstraction in Spark SQL for storing metadata. 

The Catalog’s functionality was expanded in Spark 2.x with new public methods enabling you to examine the metadata associated with your databases, tables, and views. Spark 3.0 extends it to support external catalogs.

For example, within a Spark application, after creating the SparkSession variable spark, you can access all the stored metadata through methods like these:

In [37]:
spark.catalog.listDatabases()

[Database(name='default', description='default database', locationUri='file:/C:/Users/antonio.arredondo/OneDrive%20-%20Bosonit/Spark%20Training/spark-warehouse'),
 Database(name='learn_spark_db', description='', locationUri='file:/C:/Users/antonio.arredondo/OneDrive%20-%20Bosonit/Spark%20Training/spark-warehouse/learn_spark_db.db')]

In [38]:
spark.catalog.listTables()

[Table(name='df_managed_us_delay_flights_tbl', database='learn_spark_db', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='unmanaged_us_delay_flights_tbl', database='learn_spark_db', description=None, tableType='EXTERNAL', isTemporary=False),
 Table(name='delay_flights_tbl', database=None, description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='us_origin_airport_jfk_tmp_view', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

In [49]:
spark.catalog.listColumns("df_managed_us_delay_flights_tbl")

[Column(name='date', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='delay', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='distance', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='origin', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='destination', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False)]

### Caching SQL Tables

Although we will discuss table caching strategies in the next chapter, it’s worth mentioning here that, as with DataFrames, you can cache and uncache SQL tables and views.

In Spark 3.0, in addition to other options, you can specify a table as LAZY, meaning that it should only be cached when it is first used instead of immediately:

In [51]:
# In SQL
# CACHE [LAZY] TABLE <table-name>
# UNCACHE TABLE <table-name>

### Reading Tables into DataFrames

Often, data engineers build data pipelines as part of their regular data ingestion and ETL processes. They populate Spark SQL databases and tables with cleansed data for consumption by applications downstream.

Let’s assume you have an existing table or view ready for use, such as the `delay_flights_tbl` view created earlier. Instead of reading from an external file, you can simply use SQL to query it and assign the returned result to a DataFrame:

In [52]:
# In Python
us_flights_df = spark.sql("SELECT * FROM delay_flights_tbl")
us_flights_df2 = spark.table("delay_flights_tbl")

us_flights_df.show(3)
us_flights_df2.show(3)

+--------+-----+--------+------+-----------+
|    date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01011245|    6|     602|   ABE|        ATL|
|01020600|   -8|     369|   ABE|        DTW|
|01021245|   -2|     602|   ABE|        ATL|
+--------+-----+--------+------+-----------+
only showing top 3 rows

+--------+-----+--------+------+-----------+
|    date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01011245|    6|     602|   ABE|        ATL|
|01020600|   -8|     369|   ABE|        DTW|
|01021245|   -2|     602|   ABE|        ATL|
+--------+-----+--------+------+-----------+
only showing top 3 rows



Now you have a cleansed DataFrame read from an existing Spark SQL table. You can also read data in other formats using Spark’s built-in data sources, giving you the flexibility to interact with various common file formats.

## Data Sources for DataFrames and SQL Tables

Spark SQL provides an interface to a variety of data sources. It also provides a set of common methods for reading and writing data to and from these data sources using the Data Sources API.

In this section we will cover some of the [built-in data sources](https://spark.apache.org/docs/latest/sql-data-sources.html#data-sources), available file formats, and ways to load and write data, along with specific options pertaining to these data sources. 

But first, let’s take a closer look at two high-level Data Source API constructs that dictate the manner in which you interact with different data sources: **DataFrameReader** and **DataFrameWriter**.

### DataFrameReader

DataFrameReader is the core construct for reading data from a data source into a DataFrame. It has a defined format and a recommended pattern for usage:

`DataFrameReader.format(args).option("key", "value").schema(args).load()`

Note that you can only access a DataFrameReader through a SparkSession instance. That is, you cannot create an instance of DataFrameReader. To get an instance handle to it, use:

`SparkSession.read`
or
`SparkSession.readStream`

While read returns a handle to DataFrameReader to read into a DataFrame from a static data source, readStream returns an instance to read from a streaming source.

We won’t comprehensively enumerate all the different combinations of arguments and options; the [documentation for Python, Scala, R, and Java](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options) offers suggestions and guidance.

In general, no schema is needed when reading from a static Parquet data source—the Parquet metadata usually contains the schema, so it’s inferred. However, for streaming data sources you will have to provide a schema.

Parquet is the default and preferred data source for Spark because it’s efficient, uses columnar storage, and employs a fast compression algorithm.

### DataFrameWriter

DataFrameWriter saves or writes data to a specified built-in data source. Unlike with DataFrameReader, you access its instance not from a SparkSession but from the DataFrame you wish to save. One recommended usage pattern (in which `sortBy()` is valid only together with `bucketBy()`) is:

`DataFrameWriter.format(args).option(args).sortBy(args).saveAsTable(table)`

More info: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter

### Parquet

Parquet is the default data source in Spark. Supported and widely used by many big data processing frameworks and platforms, Parquet is an **open source columnar file format** that offers many I/O optimizations (such as compression, which saves storage space and allows for quick access to data columns).

Because of its efficiency and these optimizations, we recommend that after you have transformed and cleansed your data, you save your DataFrames in the Parquet format for downstream consumption.

#### Reading Parquet files into a DataFrame

Parquet files are stored in a directory structure that contains the data files, metadata, a number of compressed files, and some status files. Metadata in the footer contains the version of the file format, the schema, and column data such as the path, etc.

For example, a directory in a Parquet file might contain a set of files like this:

_SUCCESS

_committed_1799640464332036264

_started_1799640464332036264

part-00000-tid-1799640464332036264-91273258-d7ef-4dc7-<...>-c000.snappy.parquet

There may be a number of part-XXXX compressed files in a directory.

To read Parquet files into a DataFrame, you simply specify the format and path:

In [None]:
# file = "/databricks-datasets/learning-spark-v2/flights/summary-data/parquet/2010-summary.parquet/"
# df = spark.read.format("parquet").load(file)

Unless you are reading from a streaming data source, there’s no need to supply the schema, because Parquet saves it as part of its metadata.

#### Reading Parquet files into a Spark SQL table

As well as reading Parquet files into a Spark DataFrame, you can also create a Spark SQL unmanaged table or view directly using SQL:

In [55]:
## In SQL

q = """
    CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
    USING parquet
    OPTIONS (path "/databricks-datasets/learning-spark-v2/flights/summary-data/parquet/2010-summary.parquet/")
"""

#### Writing DataFrames to Parquet files

Writing or saving a DataFrame as a table or file is a common operation in Spark. To write a DataFrame you simply use the methods and arguments to the DataFrameWriter outlined earlier in this chapter, supplying the location to save the Parquet files to.

Recall that Parquet is the default file format. If you don’t include the format() method, the DataFrame will still be saved as a Parquet file.

For example:

In [56]:
# In Python
#(df.write.format("parquet")
#.mode("overwrite")
#.option("compression", "snappy")
#.save("/tmp/data/parquet/df_parquet"))

This will create a set of compact and compressed Parquet files at the specified path. Since we used snappy as our compression choice here, we’ll have snappy compressed files.

#### Writing DataFrames to Spark SQL tables

Writing a DataFrame to a SQL table is as easy as writing to a file—just use saveAsTable() instead of save(). This will create a managed table called us_delay_flights_tbl:

In [57]:
# In Python
#(df.write
#.mode("overwrite")
#.saveAsTable("us_delay_flights_tbl"))

To sum up, Parquet is the preferred and default built-in data source file format in Spark, and it has been adopted by many other frameworks. We recommend that you use this format in your ETL and data ingestion processes.

### JSON

JavaScript Object Notation (JSON) is also a popular data format. It came to prominence as an easy-to-read and easy-to-parse format compared to XML. It has two representational formats: **single-line mode** and **multiline mode**. Both modes are supported in Spark.

In single-line mode each line denotes a single JSON object, whereas in multiline mode the entire multiline object constitutes a single JSON object. To read in multiline mode, set `multiLine` to true in the `option()` method.

More info: https://docs.databricks.com/data/data-sources/read-json.html

#### Reading a JSON file into a DataFrame

You can read a JSON file into a DataFrame the same way you did with Parquet—just specify "json" in the format() method:

In [None]:
# file = "/databricks-datasets/learning-spark-v2/flights/summary-data/json/*"
# df = spark.read.format("json").load(file)

#### Reading a JSON file into a Spark SQL table

You can also create a SQL table from a JSON file just like you did with Parquet:

In [None]:
## In SQL
q = """
    CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
    USING json
    OPTIONS (path "/databricks-datasets/learning-spark-v2/flights/summary-data/json/*")
"""

Once the table is created, you can read data into a DataFrame using SQL:

In [None]:
# spark.sql("SELECT * FROM us_delay_flights_tbl").show()

#### Writing DataFrames to JSON files

Saving a DataFrame as a JSON file is simple. Specify the appropriate DataFrameWriter methods and arguments, and supply the location to save the JSON files to. 
This creates a directory at the specified path populated with a set of compact JSON files.

In [None]:
# (df.write.format("json")
# .mode("overwrite")
# .option("compression", "snappy")
# .save("/tmp/data/json/df_json"))

### CSV

As widely used as plain text files, this common text file format captures each datum or field delimited by a comma; each line with comma-separated fields represents a record. 

Even though a comma is the default separator, you may use other delimiters to separate fields in cases where commas are part of your data. Popular spreadsheets can generate CSV files, so it’s a popular format among data and business analysts.

#### Reading a CSV file into a DataFrame

As with the other built-in data sources, you can use the DataFrameReader methods and arguments to read a CSV file into a DataFrame:

In [None]:
# file = "/databricks-datasets/learning-spark-v2/flights/summary-data/csv/*"
# schema = "DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count INT"
# df = (spark.read.format("csv")
# .option("header", "true")
# .schema(schema)
# .option("mode", "FAILFAST") # Exit if any errors
# .option("nullValue", "") # Treat empty strings as null values
# .load(file))

#### Reading a CSV file into a Spark SQL table

Creating a SQL table from a CSV data source is no different from using Parquet or JSON:

In [None]:
q = """
    CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
    USING csv
    OPTIONS (
        path "/databricks-datasets/learning-spark-v2/flights/summary-data/csv/*",
        header "true",
        inferSchema "true",
        mode "FAILFAST"
        )
"""

Once you’ve created the table, you can read data into a DataFrame using SQL as before:

In [1]:
# spark.sql("SELECT * FROM us_delay_flights_tbl").show(10)

#### Writing DataFrames to CSV files

Saving a DataFrame as a CSV file is simple. Specify the appropriate DataFrameWriter methods and arguments, and supply the location to save the CSV files to.

This generates a folder at the specified location, populated with a set of compact part files (uncompressed unless you set the `compression` option).

In [None]:
# df.write.format("csv").mode("overwrite").save("/tmp/data/csv/df_csv")

### Avro

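The Avro format is used, for example, by Apache Kafka for message serializing and deserializing. It offers many benefits, including direct mapping to JSON, speed and efficiency, and bindings available for many programming languages. Note that the `avro` data source ships as a separate `spark-avro` module, so it must be added to your application (for example, with the `--packages` option) before `format("avro")` can be used.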

#### Reading an Avro file into a DataFrame

Reading an Avro file into a DataFrame using DataFrameReader is consistent in usage with the other data sources we have discussed in this section:

In [None]:
# df = spark.read.format("avro").load("/databricks-datasets/learning-spark-v2/flights/summary-data/avro/*")
# df.show(truncate=False)

#### Reading an Avro file into a Spark SQL table

Again, creating SQL tables using an Avro data source is no different from using Parquet, JSON, or CSV. Once you’ve created a table, you can read data into a DataFrame using SQL:

In [None]:
q = """ 
    CREATE OR REPLACE TEMPORARY VIEW episode_tbl
    USING avro
    OPTIONS (path "/databricks-datasets/learning-spark-v2/flights/summary-data/avro/*")

"""
# spark.sql("SELECT * FROM episode_tbl").show(truncate=False)

#### Writing DataFrames to Avro files

Writing a DataFrame as an Avro file is simple. As usual, specify the appropriate DataFrameWriter methods and arguments, and supply the location to save the Avro files to. 

This generates a folder at the specified location, populated with a bunch of compressed and compact files.

In [None]:
# (df.write
# .format("avro")
# .mode("overwrite")
# .save("/tmp/data/avro/df_avro"))

#### Avro data source options

More info: https://spark.apache.org/docs/latest/sql-data-sources-avro.html