/databricks-datasets/definitive-guide/data

In [0]:
spark.sql("SELECT 1 + 1 AS simple_equation_1plus1").show()

+----------------------+
|simple_equation_1plus1|
+----------------------+
|                     2|
+----------------------+



In [0]:
spark.read.json("/databricks-datasets/definitive-guide/data/flight-data/json/2015-summary.json")\
  .createOrReplaceTempView("some_sql_view") # DF => SQL

spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count)
FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
""")\
  .where("DEST_COUNTRY_NAME like 'S%'").where("`sum(count)` > 10")\
  .count() # SQL => DF


Out[2]: 12

In [0]:
%sql 
DROP TABLE IF EXISTS flights

In [0]:
%sql
CREATE TABLE flights (
    DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count LONG)
  USING JSON OPTIONS (path '/databricks-datasets/definitive-guide/data/flight-data/json/2015-summary.json')

Note the use of `USING` above: it specifies the data source format. Without it, Spark defaults to a Hive SerDe configuration, which has performance implications because Hive SerDes are slower than Spark's native serialization. Hive users can use `STORED AS` instead.

In [0]:
%sql

-- Adding comments

  CREATE TABLE flights_csv (
    DEST_COUNTRY_NAME STRING,
    ORIGIN_COUNTRY_NAME STRING COMMENT "remember, the US will be most prevalent",
    count LONG)
  USING csv OPTIONS (header true, path '/databricks-datasets/definitive-guide/data/flight-data/csv/2015-summary.csv')

In [0]:

%sql

-- p 195 approx


-- It is possible to create a table from a query as well:
CREATE TABLE IF NOT EXISTS flights_from_select AS SELECT * FROM flights

In [0]:
%sql

-- Finally, you can control the layout of the data by writing out a partitioned dataset, as we saw in Chapter 9:
  CREATE TABLE partitioned_flights USING parquet PARTITIONED BY (DEST_COUNTRY_NAME)
  AS SELECT DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count FROM flights LIMIT 5

External tables are "unmanaged": Spark manages the metadata, but not the underlying files.

In [0]:
%sql
-- You can define an external table over files that already exist at a location by running the following command:
  CREATE EXTERNAL TABLE hive_flights1 (
    DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count LONG)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/databricks-datasets/definitive-guide/data/flight-data-hive/'

In [0]:
%sql
-- You can also create an external table from a select clause:
  CREATE EXTERNAL TABLE hive_flights_2
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/databricks-datasets/definitive-guide/data/flight-data-hive/' AS SELECT * FROM flights

In [0]:
%sql
DESCRIBE TABLE flights_csv

-- get the metadata

In [0]:
%sql
SHOW PARTITIONS partitioned_flights

### Views

In [0]:
%sql
CREATE VIEW just_usa_view AS
    SELECT * FROM flights WHERE dest_country_name = 'United States'

In [0]:
%sql
SELECT * FROM just_usa_view

In [0]:
%sql
SHOW DATABASES

### Structs
Structs are akin to maps with a fixed set of named fields. They provide a way of creating or querying nested data in Spark. To create one, wrap a set of columns (or expressions) in parentheses:

In [0]:
%sql

CREATE VIEW IF NOT EXISTS nested_data AS
    SELECT (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME) as country, count FROM flights

In [0]:
%sql

SELECT * FROM nested_data

In [0]:
%sql
-- You can even query individual columns within a struct—all you need to do is use dot syntax:
  SELECT country.DEST_COUNTRY_NAME, count FROM nested_data

In [0]:
%sql
SELECT country.*, count FROM nested_data
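
As a rough mental model only, the struct queries above can be mimicked in plain Python with named tuples. This is a sketch that runs without Spark; the sample rows and the `Country` and `NestedRow` types are hypothetical, introduced here for illustration.

```python
from typing import NamedTuple

class Country(NamedTuple):
    """Stands in for the struct (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME)."""
    DEST_COUNTRY_NAME: str
    ORIGIN_COUNTRY_NAME: str

class NestedRow(NamedTuple):
    """Stands in for one row of the nested_data view."""
    country: Country
    count: int

# Hypothetical sample rows, not taken from the real dataset.
flights = [("United States", "Romania", 15), ("United States", "Ireland", 344)]

# Like: SELECT (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME) AS country, count FROM flights
nested_data = [NestedRow(Country(dest, origin), count) for dest, origin, count in flights]

# Like: SELECT country.DEST_COUNTRY_NAME, count FROM nested_data
dest_only = [(row.country.DEST_COUNTRY_NAME, row.count) for row in nested_data]

# Like: SELECT country.*, count FROM nested_data (the struct is expanded into columns)
flattened = [(*row.country, row.count) for row in nested_data]
```

The `country.*` form undoes the nesting, so `flattened` has the same shape as the original rows.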

### Lists
If you’re familiar with lists in programming languages, Spark SQL lists will feel familiar. There are several ways to create an array or list of values. You can use the collect_list function, which creates a list of values. You can also use the function collect_set, which creates an array without duplicate values. These are both aggregation functions and therefore can be specified only in aggregations:

In [0]:
%sql
SELECT DEST_COUNTRY_NAME as new_name, collect_list(count) as flight_counts,
    collect_set(ORIGIN_COUNTRY_NAME) as origin_set
FROM flights GROUP BY DEST_COUNTRY_NAME

In [0]:
%sql
-- You can also query lists by position by using a Python-like array query syntax:
  SELECT DEST_COUNTRY_NAME as new_name, collect_list(count)[0]
  FROM flights GROUP BY DEST_COUNTRY_NAME
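
To make the difference between the two aggregations concrete, here is a plain-Python sketch of what collect_list, collect_set, and positional access compute per group. It runs without Spark, and the sample rows are hypothetical (they deliberately repeat an origin so the deduplication is visible).

```python
from collections import defaultdict

# Hypothetical sample rows: (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count)
rows = [
    ("United States", "Romania", 15),
    ("United States", "Ireland", 344),
    ("Egypt", "United States", 15),
    ("United States", "Romania", 1),
]

flight_counts = defaultdict(list)  # plays the role of collect_list(count)
origin_set = defaultdict(set)      # plays the role of collect_set(ORIGIN_COUNTRY_NAME)

for dest, origin, count in rows:
    flight_counts[dest].append(count)  # keeps every value, duplicates included
    origin_set[dest].add(origin)       # keeps each distinct value once

# Like collect_list(count)[0]: the first collected element of each group
first_counts = {dest: counts[0] for dest, counts in flight_counts.items()}
```

One difference from this sketch: Spark does not guarantee the order of collected elements after a shuffle, so the element at position 0 may vary between runs.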

In [0]:
%sql
-- You can also do things like convert an array back into rows. You do this by using the explode function. To demonstrate, let’s create a new view as our aggregation:
  CREATE OR REPLACE TEMP VIEW flights_agg AS
    SELECT DEST_COUNTRY_NAME, collect_list(count) as collected_counts
    FROM flights GROUP BY DEST_COUNTRY_NAME

In [0]:
%sql
SELECT explode(collected_counts), DEST_COUNTRY_NAME FROM flights_agg
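
The effect of explode can be sketched in plain Python as a nested loop that emits one output row per array element. The aggregated rows below are hypothetical, including the empty-array case, which a GROUP BY with collect_list would not normally produce.

```python
# Hypothetical rows shaped like flights_agg: (DEST_COUNTRY_NAME, collected_counts)
flights_agg = [
    ("United States", [15, 344]),
    ("Egypt", [15]),
    ("Nowhere", []),  # hypothetical row with an empty array
]

# Like: SELECT explode(collected_counts), DEST_COUNTRY_NAME FROM flights_agg
# Each array element becomes its own row, paired with the row's other columns.
exploded = [(count, dest) for dest, counts in flights_agg for count in counts]
```

As in this sketch, Spark's explode produces no rows for an empty or null array; explode_outer keeps such rows with a null value instead.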

### User-defined functions
As we saw in Chapters 3 and 4, Spark gives you the ability to define your own functions and use them in a distributed manner. You can define functions, just as you did before, writing the function in the language of your choice and then registering it appropriately:

In [0]:
%scala
def power3(number:Double):Double = number * number * number
spark.udf.register("power3", power3(_:Double):Double)

In [0]:
%sql
SELECT count, power3(count) FROM flights
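
For comparison, a Python version of the same function might look like the sketch below. The `register_power3` helper is hypothetical; it assumes a live `spark` session such as the one a Databricks notebook provides, and it uses PySpark's `spark.udf.register` with an explicit `DoubleType` return type. The function itself runs without Spark.

```python
def power3(number: float) -> float:
    """Cube a number, always returning a float."""
    # Convert first: the count column is a LONG, so Python receives an int,
    # and a Python UDF declared as DoubleType should return a float.
    return float(number) ** 3

def register_power3(spark) -> None:
    """Hypothetical helper: register power3 so SQL can call it by name."""
    # Imported lazily so the function above is usable without PySpark installed.
    from pyspark.sql.types import DoubleType
    spark.udf.register("power3", power3, DoubleType())
```

After `register_power3(spark)`, the query `SELECT count, power3(count) FROM flights` should work as it does with the Scala version. The explicit `float` conversion matters because a Python UDF whose return value does not match its declared type may yield NULL rather than an error.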