# Chapter 6: Spark SQL

[**6.1 SQL**](#6.1-SQL)   
[**6.2 Spark SQL**](#6.2-Spark-SQL)   
[**6.3 Running Spark SQL Queries**](#6.3-Running-Spark-SQL-Queries)   
[**6.3.1 Spark SQL CLI**](#6.3.1-Spark-SQL-CLI)   
[**6.3.2 Spark Programmatic SQL Interface**](#6.3.2-Spark-Programmatic-SQL-Interface)   
[**6.4 SparkSQL JDBC/ODBC**](#6.4-SparkSQL-JDBC/ODBC)   
[**6.5 Thrift Server**](#6.5-Thrift-Server)   
[**6.6 Catalog**](#6.6-Catalog)   
[**6.7 Database**](#6.7-Database)   
[**6.7.1 Create Database**](#6.7.1-Create-Database)   
[**6.7.2 Set Database**](#6.7.2-Set-Database)   
[**6.7.3 List Database**](#6.7.3-List-Database)     
[**6.7.4 Display Current Database**](#6.7.4-Display-Current-Database)  
[**6.7.5 Use Default Database**](#6.7.5-Use-Default-Database)   
[**6.7.6 Drop Database**](#6.7.6-Drop-Database)   
[**6.8 Tables**](#6.8-Tables)   
[**6.8.1 Managed Tables**](#6.8.1-Managed-Tables)   
[**6.8.2 Unmanaged Tables**](#6.8.2-Unmanaged-Tables)     
[**6.9 Create Tables**](#6.9-Create-Tables)  
[**6.9.1 Create Managed Tables**](#6.9.1-Create-Managed-Tables)   
[**6.9.2 Create Unmanaged Tables**](#6.9.2-Create-Unmanaged-Tables)   
[**6.10 Describe Table**](#6.10-Describe-Table)   
[**6.11 Display Table**](#6.11-Display-Table)   
[**6.12 Drop Table**](#6.12-Drop-Table)   
[**6.13 Refresh Table Metadata**](#6.13-Refresh-Table-Metadata)   
[**6.14 Cache Table**](#6.14-Cache-Table)   
[**6.15 Views**](#6.15-Views)   
[**6.15.1 Creating Views**](#6.15.1-Creating-Views)   
[**6.15.2 Creating Temporary Views**](#6.15.2-Creating-Temporary-Views)   
[**6.15.3 Creating Global Temporary Views**](#6.15.3-Creating-Global-Temporary-Views)   
[**6.15.4 Overwrite Views**](#6.15.4-Overwrite-Views)     
[**6.15.5 Explain Views**](#6.15.5-Explain-Views)  
[**6.15.6 Drop Views**](#6.15.6-Drop-Views)   
[**6.16 Select Statements**](#6.16-Select-Statements)   
[**6.17 Case Statements**](#6.17-Case-Statements)   
[**6.18 Complex Types**](#6.18-Complex-Types)   
[**6.18.1 Structs**](#6.18.1-Structs)   
[**6.18.2 Lists**](#6.18.2-Lists)   
[**6.18.3 Maps**](#6.18.3-Maps)   
[**6.19 Functions**](#6.19-Functions)   
[**6.20 User-Defined Functions**](#6.20-User-Defined-Functions)   
[**6.21 Subqueries**](#6.21-Subqueries)   
[**6.22 Interoperate SQL and DataFrames**](#6.22-Interoperate-SQL-and-DataFrames)   
[**6.23 Catalog API**](#6.23-Catalog-API)      

#### 6.1 SQL
Structured Query Language (SQL) is a domain-specific language (DSL) used in relational databases and often in NoSQL databases. To learn more about SQL, check out the [SQL tutorial](https://github.com/analyticstensor/sql).

#### 6.2 Spark SQL
[Spark SQL](https://spark.apache.org/sql/) is one of the most important features of Spark. It supports a subset of the ANSI SQL:2003 standard. Spark SQL lets data analysts, engineers and scientists use Spark's computation power, through either the Thrift Server or the Spark SQL interface, to build datasets and pipelines. Facebook achieved a large performance gain by converting Hive jobs into Spark jobs; [check out the article](https://engineering.fb.com/core-data/apache-spark-scale-a-60-tb-production-use-case/). Spark SQL is designed for OLAP rather than OLTP, since OLTP requires extremely low-latency queries that execute in milliseconds. [Delta Lake](https://delta.io/) is an open-source project, originally developed at Databricks, that adds RDBMS-like features such as ACID transactions on top of Spark. Spark SQL provides several capabilities:
* It is used to execute SQL queries.
* It can read and write data from several formats such as CSV, JSON, Parquet, Avro, ORC, Hive Tables.
* It provides an interactive shell to issue SQL queries.
* It lets external BI tools such as Tableau and Power BI query data through JDBC/ODBC connectors, and it can read from RDBMSs such as MySQL, PostgreSQL and MSSQL over JDBC.
* It provides the engine underneath the high-level abstraction APIs, i.e. DataFrames and Datasets.

Spark SQL can easily connect to [Hive](https://hive.apache.org/). The Hive metastore stores the metadata of all Hive tables, and Spark SQL can connect to it seamlessly to access that information. Some configuration needs to be done before connecting to Hive; [see the Spark documentation on Hive tables](http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html).

#### 6.3 Running Spark SQL Queries
Spark SQL queries can be executed in multiple ways:
* Spark SQL CLI
* Spark Programmatic SQL Interface

#### 6.3.1 Spark SQL CLI
The Spark SQL CLI is used to execute SQL queries in local mode from the command line. Note that the Spark SQL CLI cannot communicate with the Thrift JDBC server. To start the Spark SQL CLI, use one of the commands below. If the Spark home directory is not set in an environment variable, use the absolute path of the Spark installation.

`spark-sql` # if spark home is set in environment variable  
`SPARK_INSTALL_PATH/bin/spark-sql` # if spark home is not set in environment variable

After the command runs successfully, you'll see a screen similar to the one below.
![Figure: 1 Spark CLI](spark_cli.png)

**To get help in Spark SQL CLI**  

`spark-sql --help`  # display spark help

After the command runs successfully, you'll see a screen similar to the one below.
![Figure: 2 Spark CLI Help](spark_cli_help.png)

#### 6.3.2 Spark Programmatic SQL Interface
Spark provides an API to execute ad hoc SQL. The steps to execute SQL queries are:
1. Create a SparkSession.
2. Call the `sql()` method on the SparkSession object.

The parameter to `sql()` is the SQL statement to execute, and the return type is a DataFrame. Like any DataFrame, the result is evaluated lazily. Anything that can be done with the DataFrame API can also be expressed through `sql()`, which is handy for complex logic that is awkward to write with DataFrame methods. Multi-line queries can be passed using Python's triple-quoted (multi-line) strings. Moreover, SQL and DataFrames interoperate: we can run a SQL statement and then manipulate the same result set using DataFrame methods.

**Create SparkSession**

In [1]:
# Creating Spark session in Python
from pyspark.sql import SparkSession
spark = SparkSession.builder \
     .master("local") \
     .appName("SparkSQLApplicationwithHive") \
     .enableHiveSupport() \
     .getOrCreate()

In [2]:
spark

**Apply SQL method**

In [3]:
spark.sql("SELECT 'analytics tensor'").show()

+----------------+
|analytics tensor|
+----------------+
|analytics tensor|
+----------------+



In [4]:
spark.sql("SELECT 2*2 as four").show()

+----+
|four|
+----+
|   4|
+----+



**Single line SQL Statement**  

`spark.sql("SELECT * from employees").show()` # this statement fails here because the `employees` table doesn't exist.

**Multi line SQL Statement**  


`spark.sql("""SELECT first_name, last_name from employees join dept_emp
  on employees.emp_no = dept_emp.emp_no where dept_no in ('d001', 'd005',
  'd007')""").show()` # this statement fails here because the `employees` table doesn't exist.
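
A missing comma inside a hand-written `IN (...)` list is easy to overlook in a long query. As a minimal sketch, the query text can be assembled in plain Python before it is handed to `spark.sql()`. The helper names `sql_in_list` and `employees_in_departments` are hypothetical, and the `employees` and `dept_emp` tables are assumed to exist, as in the statement above.

```python
import re
import textwrap

# Only letters, digits and underscores are accepted, so no quote escaping is needed.
_SAFE_VALUE = re.compile(r"[A-Za-z0-9_]+")


def sql_in_list(values):
    """Render Python strings as a comma-separated list of SQL string literals."""
    for value in values:
        if not _SAFE_VALUE.fullmatch(value):
            raise ValueError(f"unsupported character in value: {value!r}")
    return ", ".join(f"'{value}'" for value in values)


def employees_in_departments(dept_nos):
    """Build the multi-line join query for the given department numbers."""
    return textwrap.dedent(f"""\
        SELECT first_name, last_name
        FROM employees
        JOIN dept_emp ON employees.emp_no = dept_emp.emp_no
        WHERE dept_no IN ({sql_in_list(dept_nos)})""")


query = employees_in_departments(["d001", "d005", "d007"])
print(query)
```

With an active session and the tables in place, `spark.sql(query).show()` would run the generated statement.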

**Interoperate SQL and DataFrame**

`spark.sql("""SELECT first_name, last_name from employees join dept_emp
                  on employees.emp_no = dept_emp.emp_no where dept_no in ('d001', 'd005',
                  'd007')
           """) \
      .where("first_name like 'Sri%'") \
      .count()` # this statement fails here because the `employees` table doesn't exist.

#### 6.4 SparkSQL JDBC/ODBC
Spark provides both JDBC and ODBC interfaces for connecting to the Spark driver and executing Spark SQL queries. The connection is made through the Thrift Server. For example:  
* Business intelligence tools such as Tableau or Power BI connect to Spark to build data visualization dashboards.
* A web application runs a machine learning model on the Spark cluster to process massive data.

#### 6.5 Thrift Server
The [Thrift](https://cwiki.apache.org/confluence/display/Hive/HiveServer) JDBC/ODBC server lets JDBC and ODBC clients execute SQL queries on Spark over the JDBC and ODBC protocols. All clients connected to one Spark Thrift server share the same `SparkContext`. The Spark Thrift server runs as a standalone Spark application.  

It corresponds to HiveServer2 in Hive, which is also built on Thrift. Use the command below, located in the Spark directory, to start the JDBC/ODBC service.  

`./sbin/start-thriftserver.sh`  # to start JDBC/ODBC services.  
`./sbin/start-thriftserver.sh --help`  # to get help.  

This script accepts all `bin/spark-submit` command-line options, plus a `--hiveconf` option to specify Hive properties. By default the server listens on `localhost`, port `10000`. This can be overridden using either of the methods below:

**Using environment configuration**

`export HIVE_SERVER2_THRIFT_PORT=port_number`  
`export HIVE_SERVER2_THRIFT_BIND_HOST=host_name`

**Using System properties**

`./sbin/start-thriftserver.sh \
    --hiveconf hive.server2.thrift.port=port_number \
    --hiveconf hive.server2.thrift.bind.host=host_name`

[Beeline](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients) is a JDBC client used to run interactive queries on the command line.

`beeline`

After the command runs successfully, you'll see a screen similar to the one below.
![Figure: 3 Beeline](spark_beeline.png)  
Connect with `!connect jdbc:hive2://localhost:10000`. When prompted, enter your machine username and leave the password empty.
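
Beeline can also be driven non-interactively. The sketch below only assembles the command line; it assumes the Thrift server described above is listening on `localhost:10000`, and the user name `alice` and the helper names are placeholders. `-u`, `-n` and `-e` are Beeline's options for the JDBC URL, the user name and a query to execute.

```python
import shlex


def thrift_jdbc_url(host="localhost", port=10000, database="default"):
    """Build the HiveServer2-style JDBC URL used to reach the Spark Thrift server."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def beeline_command(url, user, query):
    """Return the argument list for a one-shot Beeline invocation."""
    return ["beeline", "-u", url, "-n", user, "-e", query]


command = beeline_command(thrift_jdbc_url(), "alice", "SHOW DATABASES")
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Keeping the command as a list avoids shell quoting problems when it is later passed to `subprocess.run`.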

#### 6.6 Catalog
The catalog is the highest-level abstraction in Spark SQL. It stores metadata about the data held in tables, along with information about databases, tables, functions and views. It is exposed through the `org.apache.spark.sql.catalog.Catalog` [class](https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/catalog/Catalog.html), available in PySpark as `spark.catalog`. Tasks such as listing or describing databases, tables and functions can be performed through catalog functions.
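
As a small sketch of catalog access from PySpark, the function below walks the catalog through `spark.catalog`. It assumes an active `SparkSession` is passed in; the name `catalog_summary` is made up for illustration, while `listDatabases()` and `listTables()` are catalog methods.

```python
def catalog_summary(spark):
    """Map every database in the catalog to the sorted names of its tables and views."""
    summary = {}
    for database in spark.catalog.listDatabases():
        tables = spark.catalog.listTables(database.name)
        summary[database.name] = sorted(table.name for table in tables)
    return summary


# Usage with the session created earlier in this chapter:
# for name, tables in catalog_summary(spark).items():
#     print(name, tables)
```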

#### 6.7 Database
A database stores and manages tables, views, functions etc. Spark creates a default database named `default`. We can also create our own databases.

#### 6.7.1 Create Database

`CREATE DATABASE analytics_tensor` # create database in sql cli

In [5]:
spark.sql("CREATE DATABASE IF NOT EXISTS analytics_tensor").show() # create database

++
||
++
++



#### 6.7.2 Set Database
Set the current database.

In [6]:
spark.sql("USE analytics_tensor").show() # set current database

++
||
++
++



#### 6.7.3 List Database
Display the list of databases.

`SHOW DATABASES` # in sql cli

In [7]:
spark.sql("SHOW DATABASES").show() # show databases

+----------------+
|       namespace|
+----------------+
|analytics_tensor|
|         default|
+----------------+



#### 6.7.4 Display Current Database
Display the current database.

`SELECT current_database()`

In [8]:
spark.sql("SELECT current_database()").show()

+------------------+
|current_database()|
+------------------+
|  analytics_tensor|
+------------------+



#### 6.7.5 Use Default Database
Set `default` database as current database.

`USE default`

In [9]:
spark.sql("USE default").show()

++
||
++
++



#### 6.7.6 Drop Database
Drop a database. With `CASCADE`, all objects in the database, such as tables, views and functions, are removed as well; without it, dropping a non-empty database fails.

`DROP DATABASE IF EXISTS analytics_tensor`

In [10]:
spark.sql("DROP DATABASE IF EXISTS analytics_tensor CASCADE").show() # cascade drops the table in database

++
||
++
++



In [11]:
spark.sql("CREATE DATABASE IF NOT EXISTS analytics_tensor").show()

++
||
++
++



In [12]:
spark.sql("SHOW DATABASES").show()

+----------------+
|       namespace|
+----------------+
|analytics_tensor|
|         default|
+----------------+



#### 6.8 Tables
A table stores data together with metadata. The metadata describes the table and its data: schema, table name, column names, partitions, location of the data, etc. Spark uses the Apache Hive metastore to store table metadata, while the data of managed tables is written under the warehouse directory. Hive's default warehouse location is `/user/hive/warehouse`; Spark's is set by the `spark.sql.warehouse.dir` configuration, which defaults to a `spark-warehouse` directory under the current working directory. Spark SQL is largely compatible with Hive SQL (HiveQL), so most Hive queries can be copied and pasted into Spark SQL.
Spark supports two types of tables:  
* Managed Tables
* Unmanaged Tables

The supported Hive features are listed in the [Spark SQL programming guide](https://spark.apache.org/docs/1.5.2/sql-programming-guide.html#supported-hive-features).
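
To see which warehouse directory and catalog implementation a session actually uses, its runtime configuration can be queried. This is a sketch that assumes an active `SparkSession`; `storage_settings` is a hypothetical helper name.

```python
def storage_settings(spark):
    """Return the warehouse directory and catalog implementation of a session."""
    keys = ("spark.sql.warehouse.dir", "spark.sql.catalogImplementation")
    # The second argument to get() is returned when a key is not set.
    return {key: spark.conf.get(key, "<not set>") for key in keys}


# Usage with the session created earlier in this chapter:
# print(storage_settings(spark))
```

A different location has to be chosen when the session is built, for example with `SparkSession.builder.config("spark.sql.warehouse.dir", "/path/to/warehouse")`, where the path is a placeholder.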

#### 6.8.1 Managed Tables
Managed tables are tables for which Spark manages both the metadata and the data in the file store. The file store can be HDFS, S3 or Azure Blob. Since Spark manages the table, dropping it deletes both the `metadata` and the `data`.

#### 6.8.2 Unmanaged Tables
Unmanaged tables are tables for which Spark manages only the metadata. The user needs to specify the data location when creating the table. Since Spark manages only the metadata, dropping the table removes the `metadata` only; the data files still exist in the specified location.
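
Whether a given table is managed or unmanaged can be read from its extended description. The sketch below assumes an active `SparkSession` and that `DESCRIBE TABLE EXTENDED` reports `Type` and `Location` rows, as it does in recent Spark versions; the helper name is hypothetical.

```python
def table_type_and_location(spark, table):
    """Return the (type, location) pair reported by DESCRIBE TABLE EXTENDED."""
    rows = spark.sql(f"DESCRIBE TABLE EXTENDED {table}").collect()
    # Each row has col_name, data_type and comment; detail rows use
    # col_name as the label and data_type as the value.
    details = {row["col_name"]: row["data_type"] for row in rows}
    return details.get("Type"), details.get("Location")


# Usage with the session created earlier in this chapter:
# print(table_type_and_location(spark, "analytics_tensor.student_score"))
```

For a managed table the type is reported as `MANAGED`, for an unmanaged one as `EXTERNAL`.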

#### 6.9 Create Tables
Tables can be created in multiple ways in Spark. We will demonstrate various methods to create both managed and unmanaged tables.

#### 6.9.1 Create Managed Tables

\# create table with SQL CLI  
`CREATE TABLE country(
    id INT,
    name STRING,
    capital STRING)`

\# create table with Programmatic SQL Interface  

`spark.sql("CREATE TABLE IF NOT EXISTS country(id INT, name STRING, capital STRING)")`

The above code may fail with an error such as:  

***AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable `country12`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Ignore***  

**Note:** To fix this, enable Hive support when creating the SparkSession by calling `enableHiveSupport()`.

In [15]:
spark.sql("CREATE TABLE IF NOT EXISTS analytics_tensor.country12(id INT, name STRING, capital STRING)")

DataFrame[]

In [16]:
spark.sql("SHOW TABLES IN analytics_tensor").show()

+----------------+---------+-----------+
|        database|tableName|isTemporary|
+----------------+---------+-----------+
|analytics_tensor|country12|      false|
+----------------+---------+-----------+



In [17]:
spark.sql("SHOW TABLES FROM analytics_tensor").show()

+----------------+---------+-----------+
|        database|tableName|isTemporary|
+----------------+---------+-----------+
|analytics_tensor|country12|      false|
+----------------+---------+-----------+



In [18]:
spark.sql("USE analytics_tensor").show()

++
||
++
++



**Create student record**

In [19]:
import pandas as pd

# create list of lists 
student_score = [['john', 90], ['harry', 95], ['paul', 100]] 
  
# create dataframe 
student_score_df = pd.DataFrame(student_score, columns = ['name', 'score']) 
  
# print dataframe. 
student_score_df

    name  score
0   john     90
1  harry     95
2   paul    100


In [20]:
# write dataframe to file
file_path = '/tmp/student_score.csv'
student_score_df.to_csv(file_path, sep=',', header=True, index=False)
!cat $file_path

name,score
john,90
harry,95
paul,100


In [21]:
spark.sql("DROP TABLE IF EXISTS student_score")

DataFrame[]

In [22]:
# create table with DataFrame API
schema = "name STRING, score INT"
student_df = spark.read.csv(file_path, schema=schema, header=True)
student_df.write.saveAsTable("student_score")

In [23]:
spark.sql("SHOW TABLES").show()

+----------------+-------------+-----------+
|        database|    tableName|isTemporary|
+----------------+-------------+-----------+
|analytics_tensor|    country12|      false|
|analytics_tensor|student_score|      false|
+----------------+-------------+-----------+



In [24]:
spark.sql("SELECT * FROM STUDENT_SCORE").show()

+-----+-----+
| name|score|
+-----+-----+
| john|   90|
|harry|   95|
| paul|  100|
+-----+-----+



In [25]:
spark.sql("select * from student_score WHERE score >=95").show()

+-----+-----+
| name|score|
+-----+-----+
|harry|   95|
| paul|  100|
+-----+-----+



In [26]:
spark.sql("DROP TABLE IF EXISTS student_name").show()

++
||
++
++



In [27]:
# create table using CTAS or select query
spark.sql("CREATE TABLE student_name USING parquet as SELECT * FROM student_score").show()

++
||
++
++



In [28]:
spark.sql("SHOW TABLES").show()

+----------------+-------------+-----------+
|        database|    tableName|isTemporary|
+----------------+-------------+-----------+
|analytics_tensor|    country12|      false|
|analytics_tensor| student_name|      false|
|analytics_tensor|student_score|      false|
+----------------+-------------+-----------+



In [29]:
# create table using CTAS or select query if not exists
spark.sql("CREATE TABLE IF NOT EXISTS student_name1 USING ORC AS SELECT name FROM student_score").show()

++
||
++
++



In [30]:
# create table by partitioning column
spark.sql("""CREATE TABLE IF NOT EXISTS partition_student USING parquet
            PARTITIONED BY (score)
            AS SELECT name, score from student_score""")

DataFrame[]

In [31]:
spark.sql("describe table partition_student").show(10,False)

+-----------------------+---------+-------+
|col_name               |data_type|comment|
+-----------------------+---------+-------+
|name                   |string   |null   |
|score                  |int      |null   |
|# Partition Information|         |       |
|# col_name             |data_type|comment|
|score                  |int      |null   |
+-----------------------+---------+-------+



In [32]:
spark.sql("SHOW TABLES").show()

+----------------+-----------------+-----------+
|        database|        tableName|isTemporary|
+----------------+-----------------+-----------+
|analytics_tensor|        country12|      false|
|analytics_tensor|partition_student|      false|
|analytics_tensor|     student_name|      false|
|analytics_tensor|    student_name1|      false|
|analytics_tensor|    student_score|      false|
+----------------+-----------------+-----------+



In [33]:
spark.sql("SET").show(100,False)

+----------------------------------+---------------------------+
|key                               |value                      |
+----------------------------------+---------------------------+
|spark.app.id                      |local-1606786342664        |
|spark.app.name                    |SparkSQLApplicationwithHive|
|spark.driver.host                 |192.168.1.67               |
|spark.driver.port                 |50394                      |
|spark.executor.id                 |driver                     |
|spark.master                      |local                      |
|spark.rdd.compress                |True                       |
|spark.serializer.objectStreamReset|100                        |
|spark.sql.catalogImplementation   |hive                       |
|spark.submit.deployMode           |client                     |
|spark.submit.pyFiles              |                           |
|spark.ui.showConsoleProgress      |true                       |
+----------------------------------+---------------------------+

**Note:** Spark doesn't support temporary tables; use temporary views instead (see section 6.15.2).

#### 6.9.2 Create Unmanaged Tables

**Read built-in iris dataset from seaborn module**

In [34]:
import seaborn as sns

iris = sns.load_dataset('iris')
iris.head(10)
#iris.describe()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa
6           4.6          3.4           1.4          0.3  setosa
7           5.0          3.4           1.5          0.2  setosa
8           4.4          2.9           1.4          0.2  setosa
9           4.9          3.1           1.5          0.1  setosa


**Export dataset to file**

In [35]:
iris_dataset_path = '/tmp/iris_dataset.csv'
iris.to_csv(iris_dataset_path, sep=',', header=True, index=False)
!head -n 10 $iris_dataset_path

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa


\# create table with SQL CLI  
`CREATE TABLE iris(
    sepal_length FLOAT,
    sepal_width FLOAT COMMENT "sepal width",
    petal_length FLOAT,
    petal_width FLOAT,
    species STRING )
    USING csv 
    OPTIONS (header true, path '/tmp/iris_dataset.csv')`

In [36]:
# create unmanaged table with Programmatic SQL Interface

spark.sql("DROP TABLE IF EXISTS iris")

spark.sql("""
CREATE TABLE iris(
    sepal_length FLOAT,
    sepal_width FLOAT COMMENT "sepal width",
    petal_length FLOAT,
    petal_width FLOAT,
    species STRING )
    USING csv 
    OPTIONS (header true, path "/tmp/iris_dataset.csv")
    """).show()

++
||
++
++



In [37]:
spark.sql("SELECT COUNT(*) FROM iris").show()

+--------+
|count(1)|
+--------+
|     150|
+--------+



In [38]:
spark.sql("SHOW TABLES").show()

+----------------+-----------------+-----------+
|        database|        tableName|isTemporary|
+----------------+-----------------+-----------+
|analytics_tensor|        country12|      false|
|analytics_tensor|             iris|      false|
|analytics_tensor|partition_student|      false|
|analytics_tensor|     student_name|      false|
|analytics_tensor|    student_name1|      false|
|analytics_tensor|    student_score|      false|
+----------------+-----------------+-----------+



In [39]:
# create iris DataFrame read from csv file
iris_df = spark.read.format("csv") \
                .option("header", "true") \
                .load("/tmp/iris_dataset.csv")
iris_df.printSchema()

root
 |-- sepal_length: string (nullable = true)
 |-- sepal_width: string (nullable = true)
 |-- petal_length: string (nullable = true)
 |-- petal_width: string (nullable = true)
 |-- species: string (nullable = true)



In [40]:
# dataframe to create external table
spark.sql("DROP TABLE IF EXISTS iris_dataset_20200628")
iris_df.write.option('path', "/tmp/iris_dataset/2020-06-28").saveAsTable("iris_dataset_20200628")

In [41]:
spark.sql("SHOW TABLES").show(20, False)

+----------------+---------------------+-----------+
|database        |tableName            |isTemporary|
+----------------+---------------------+-----------+
|analytics_tensor|country12            |false      |
|analytics_tensor|iris                 |false      |
|analytics_tensor|iris_dataset_20200628|false      |
|analytics_tensor|partition_student    |false      |
|analytics_tensor|student_name         |false      |
|analytics_tensor|student_name1        |false      |
|analytics_tensor|student_score        |false      |
+----------------+---------------------+-----------+



**Note:** We can also use a Hive SQL statement to create an external table. Refer to the [Hive Language Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual) for the Hive syntax.

\# create external table  

`CREATE EXTERNAL TABLE student(
    student_id int,
    student_name string,
    student_email string )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/tmp/student_dataset/'`

In [42]:
spark.sql("USE analytics_tensor")

spark.sql("DROP TABLE IF EXISTS student")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS student(
        student_id int,
        student_name string,
        student_email string )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LOCATION '/tmp/student_dataset/'
""")

spark.sql("SHOW TABLES").show()

+----------------+--------------------+-----------+
|        database|           tableName|isTemporary|
+----------------+--------------------+-----------+
|analytics_tensor|           country12|      false|
|analytics_tensor|                iris|      false|
|analytics_tensor|iris_dataset_2020...|      false|
|analytics_tensor|   partition_student|      false|
|analytics_tensor|             student|      false|
|analytics_tensor|        student_name|      false|
|analytics_tensor|       student_name1|      false|
|analytics_tensor|       student_score|      false|
+----------------+--------------------+-----------+



In [43]:
spark.sql("SHOW CREATE TABLE student").show(10,False)

(Output truncated: a single `createtab_stmt` column containing the generated `CREATE` statement for `student`.)

\# create external table using select  

`CREATE EXTERNAL TABLE sample_iris
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/tmp/iris_dataset/2020-02-01'
AS SELECT * from iris_dataset_20200628`

In [67]:
# create external table using select  
spark.sql("""
        CREATE EXTERNAL TABLE sample_iris
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
        LOCATION '/tmp/iris_dataset/2020-02-01'
        AS SELECT * from iris_dataset_20200628
""")

DataFrame[]

In [68]:
spark.sql("SELECT * FROM sample_iris").show(10)

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|         5.0|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 10 rows



**Perform demo for Lab_3 (at the end of section)**
* Load data into a table from a file path.
* Insert data into a table from another table.
* Insert data into a partitioned table.

#### 6.10 Describe Table
`DESCRIBE TABLE` displays the schema of a table, and `SHOW PARTITIONS` lists the partitions of a partitioned table.

`DESCRIBE TABLE iris` # display the schema of iris  
`SHOW PARTITIONS partition_student` # list the partitions of partition_student

In [72]:
spark.sql("DESCRIBE TABLE iris").show()

+------------+---------+-----------+
|    col_name|data_type|    comment|
+------------+---------+-----------+
|sepal_length|    float|       null|
| sepal_width|    float|sepal width|
|petal_length|    float|       null|
| petal_width|    float|       null|
|     species|   string|       null|
+------------+---------+-----------+



In [46]:
spark.sql("SHOW PARTITIONS partition_student").show(10, False)

+---------+
|partition|
+---------+
|score=100|
|score=90 |
|score=95 |
+---------+



#### 6.11 Display Table
`SHOW TABLES` lists the tables in the current database.

`SHOW TABLES`

In [47]:
spark.sql("SHOW TABLES").show(10, False)

+----------------+---------------------+-----------+
|database        |tableName            |isTemporary|
+----------------+---------------------+-----------+
|analytics_tensor|country12            |false      |
|analytics_tensor|iris                 |false      |
|analytics_tensor|iris_dataset_20200628|false      |
|analytics_tensor|partition_student    |false      |
|analytics_tensor|sample_iris          |false      |
|analytics_tensor|student              |false      |
|analytics_tensor|student_name         |false      |
|analytics_tensor|student_name1        |false      |
|analytics_tensor|student_score        |false      |
+----------------+---------------------+-----------+



In [49]:
spark.sql("SHOW TABLES like 'student*'").select("tableName").show()

+-------------+
|    tableName|
+-------------+
|      student|
| student_name|
|student_name1|
|student_score|
+-------------+



#### 6.12 Drop Table
`DROP TABLE` removes a table.

`DROP TABLE student_name1` # drops the table; raises an error if the table doesn't exist.  
`DROP TABLE IF EXISTS student_name1` # drops the table only if it exists; no error otherwise.

In [50]:
spark.sql("DROP TABLE IF EXISTS tmp1")

DataFrame[]

In [51]:
spark.sql("DROP TABLE IF EXISTS student_name1")

DataFrame[]

In [54]:
spark.sql("DROP TABLE IF EXISTS country12")

DataFrame[]

In [55]:
spark.sql("DROP TABLE IF EXISTS sample_iris")

DataFrame[]

In [56]:
spark.sql("SHOW TABLES IN analytics_tensor").show()

+----------------+--------------------+-----------+
|        database|           tableName|isTemporary|
+----------------+--------------------+-----------+
|analytics_tensor|                iris|      false|
|analytics_tensor|iris_dataset_2020...|      false|
|analytics_tensor|   partition_student|      false|
|analytics_tensor|             student|      false|
|analytics_tensor|        student_name|      false|
|analytics_tensor|       student_score|      false|
+----------------+--------------------+-----------+



#### 6.13 Refresh Table Metadata
Refreshing updates the metadata Spark holds for a table so that queries read the most recent data. Cached data and partition information can fall out of sync with the files on disk, for example after data is repartitioned or new partitions are written outside of Spark, so a refresh is sometimes required. It can be performed with:
* `REFRESH TABLE`: invalidates and refreshes all cached entries (data and metadata) associated with the table.
* `MSCK REPAIR TABLE`: recovers the partitions found in the table's directory and registers them in the catalog.
`REFRESH TABLE partition_student`

`MSCK REPAIR TABLE partition_student`

In [57]:
spark.sql("REFRESH TABLE partition_student")

DataFrame[]

In [58]:
spark.sql("MSCK REPAIR TABLE partition_student")

DataFrame[]

#### 6.14 Cache Table
Tables can be cached and uncached, similar to DataFrames.

`CACHE TABLE student`

`UNCACHE TABLE student`

In [59]:
spark.sql("CACHE TABLE student")

DataFrame[]

In [60]:
spark.sql("UNCACHE TABLE student")

DataFrame[]
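
The same operations are available on the catalog API. The sketch below wraps them in a context manager so that a table is always uncached again, even if the work inside the block fails. It assumes an active `SparkSession`, and the name `cached_table` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def cached_table(spark, table):
    """Keep `table` cached for the duration of a `with` block."""
    spark.catalog.cacheTable(table)
    try:
        # Yields whether Spark has registered the table as cached.
        yield spark.catalog.isCached(table)
    finally:
        spark.catalog.uncacheTable(table)


# Usage with the session created earlier in this chapter:
# with cached_table(spark, "student"):
#     spark.sql("SELECT COUNT(*) FROM student").show()
```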

#### 6.15 Views
A view is a virtual table that specifies a set of transformations on top of an existing table. Unlike a table, it doesn't store data in files; it is just a saved query plan. When data is selected from a view, Spark runs the view's query against the base table to retrieve the data. A view is thus similar to creating a new DataFrame from an existing DataFrame. To the user, querying a view looks no different from querying a table.  
Views can be created at different scopes:
* global (global temporary views)
* per database (regular views)
* per session (temporary views)

#### 6.15.1 Creating Views

`CREATE VIEW iris_setosa_vw AS
    SELECT * FROM iris WHERE species = 'setosa'`

In [69]:
spark.sql("""
    CREATE VIEW IF NOT EXISTS default.iris_setosa_vw AS
        SELECT * FROM iris WHERE species = 'setosa'
""")

DataFrame[]

In [70]:
spark.sql("SHOW TABLES").show()

+----------------+--------------------+-----------+
|        database|           tableName|isTemporary|
+----------------+--------------------+-----------+
|analytics_tensor|                iris|      false|
|analytics_tensor|iris_dataset_2020...|      false|
|analytics_tensor|   partition_student|      false|
|analytics_tensor|         sample_iris|      false|
|analytics_tensor|             student|      false|
|analytics_tensor|        student_name|      false|
|analytics_tensor|       student_score|      false|
+----------------+--------------------+-----------+



In [71]:
spark.sql("SHOW TABLES in DEFAULT").show(10, False)

+--------+---------------------+-----------+
|database|tableName            |isTemporary|
+--------+---------------------+-----------+
|default |iris_setosa_june29_vw|false      |
|default |iris_setosa_vw       |false      |
+--------+---------------------+-----------+



In [73]:
spark.sql("select distinct species from default.iris_setosa_vw").show(50)

+-------+
|species|
+-------+
| setosa|
+-------+



In [74]:
spark.sql("select distinct species from iris").show(50)

+----------+
|   species|
+----------+
| virginica|
|versicolor|
|    setosa|
+----------+



#### 6.15.2 Creating Temporary Views
Temporary views are available only during the current session. They are not registered to a database.

`CREATE TEMP VIEW iris_setosa_tmp_vw AS
    SELECT * FROM iris WHERE species = 'setosa'`

In [75]:
spark.sql("DROP TABLE IF EXISTS iris_setosa_tmp_vw")

spark.sql("""
    CREATE TEMP VIEW iris_setosa_tmp_vw AS
        SELECT * FROM iris WHERE species = 'setosa'
""")

DataFrame[]

In [76]:
spark.sql("SHOW TABLES").show()

+----------------+--------------------+-----------+
|        database|           tableName|isTemporary|
+----------------+--------------------+-----------+
|analytics_tensor|                iris|      false|
|analytics_tensor|iris_dataset_2020...|      false|
|analytics_tensor|   partition_student|      false|
|analytics_tensor|         sample_iris|      false|
|analytics_tensor|             student|      false|
|analytics_tensor|        student_name|      false|
|analytics_tensor|       student_score|      false|
|                |  iris_setosa_tmp_vw|       true|
+----------------+--------------------+-----------+



In [77]:
spark.sql("SHOW TABLES in DEFAULT").show()

+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
| default|iris_setosa_june2...|      false|
| default|      iris_setosa_vw|      false|
|        |  iris_setosa_tmp_vw|       true|
+--------+--------------------+-----------+



#### 6.15.3 Creating Global Temporary Views
Global temporary views are visible across the entire Spark application, in every session. They are removed when the application ends.

`CREATE GLOBAL TEMP VIEW iris_setosa_global_tmp_vw AS
    SELECT * FROM iris WHERE species = 'setosa'`

In [83]:
spark.sql("drop table if exists iris_setosa_global_tmp_vw")

DataFrame[]

In [86]:
spark.sql("SHOW TABLES").show()

+----------------+--------------------+-----------+
|        database|           tableName|isTemporary|
+----------------+--------------------+-----------+
|analytics_tensor|                iris|      false|
|analytics_tensor|iris_dataset_2020...|      false|
|analytics_tensor|   partition_student|      false|
|analytics_tensor|         sample_iris|      false|
|analytics_tensor|             student|      false|
|analytics_tensor|        student_name|      false|
|analytics_tensor|       student_score|      false|
|                |  iris_setosa_tmp_vw|       true|
+----------------+--------------------+-----------+



In [84]:
spark.sql("""
    CREATE GLOBAL TEMP VIEW iris_setosa_global_tmp_vw AS
        SELECT * FROM iris WHERE species = 'setosa'
""")

AnalysisException: Temporary view 'iris_setosa_global_tmp_vw' already exists;

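**Note:** All global temporary views are stored in a system temporary database named `global_temp`. They can be accessed from different sessions and stay alive until the application ends. To query a global temporary view, always qualify it with the `global_temp` database; otherwise Spark raises a `Table or view not found` error. The same qualification is the likely reason for the `already exists` error above: the unqualified `drop table if exists iris_setosa_global_tmp_vw` does not look in `global_temp`, so a view left over from an earlier run is still there. Dropping `global_temp.iris_setosa_global_tmp_vw` with `DROP VIEW`, or creating the view with `CREATE OR REPLACE GLOBAL TEMP VIEW`, avoids it.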
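
The scoping rules in the note can be summarised with a toy model. The class below is a hypothetical illustration in plain Python, not Spark's catalog implementation: `ToyCatalog` and its methods are invented for this sketch, and it ignores ordinary tables and databases. Each instance plays the role of one session, while the class-level `global_temp` dictionary is shared by all sessions.

```python
class ToyCatalog:
    """Toy model of temp-view scoping; not Spark's actual implementation."""

    # Shared by every session of the (simulated) application.
    global_temp = {}

    def __init__(self):
        # Private to this session.
        self.session_views = {}

    def create_temp_view(self, name, query):
        self.session_views[name] = query

    def create_global_temp_view(self, name, query):
        ToyCatalog.global_temp[name] = query

    def resolve(self, name):
        """Return the query behind `name`, or fail like 'Table or view not found'."""
        if name.startswith("global_temp."):
            views, key = ToyCatalog.global_temp, name.split(".", 1)[1]
        else:
            views, key = self.session_views, name
        if key not in views:
            raise LookupError(f"Table or view not found: {name}")
        return views[key]


session_a, session_b = ToyCatalog(), ToyCatalog()
query = "SELECT * FROM iris WHERE species = 'setosa'"
session_a.create_temp_view("iris_setosa_tmp_vw", query)
session_a.create_global_temp_view("iris_setosa_global_tmp_vw", query)

# Visible from another session, but only through the global_temp qualifier.
print(session_b.resolve("global_temp.iris_setosa_global_tmp_vw"))
```

In this model, `session_b.resolve("iris_setosa_global_tmp_vw")` and `session_b.resolve("iris_setosa_tmp_vw")` both raise `LookupError`: the first because the qualifier is missing, the second because a temporary view belongs to the session that created it.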

In [88]:
spark.sql("SELECT * from iris_setosa_global_tmp_vw ").show()

AnalysisException: Table or view not found: iris_setosa_global_tmp_vw; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [iris_setosa_global_tmp_vw]


In [89]:
spark.sql("SELECT * from global_temp.iris_setosa_global_tmp_vw ").show()

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|         5.0|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
|         5.4|        3.7|         1.5|        0.2| setosa|
|         4.8|        3.4|         1.6|        0.2| setosa|
|         4.8|        3.0|         1.4|        0.1| setosa|
|         4.3|        3.0|         1.1| 

In [90]:
spark.sql("SHOW TABLES IN GLOBAL_TEMP").show()

+-----------+--------------------+-----------+
|   database|           tableName|isTemporary|
+-----------+--------------------+-----------+
|global_temp|iris_setosa_globa...|       true|
|           |  iris_setosa_tmp_vw|       true|
+-----------+--------------------+-----------+



In [91]:
spark.sql("SHOW TABLES").show()

+----------------+--------------------+-----------+
|        database|           tableName|isTemporary|
+----------------+--------------------+-----------+
|analytics_tensor|                iris|      false|
|analytics_tensor|iris_dataset_2020...|      false|
|analytics_tensor|   partition_student|      false|
|analytics_tensor|         sample_iris|      false|
|analytics_tensor|             student|      false|
|analytics_tensor|        student_name|      false|
|analytics_tensor|       student_score|      false|
|                |  iris_setosa_tmp_vw|       true|
+----------------+--------------------+-----------+



#### 6.15.4 Overwrite Views
An existing view can be overwritten with `CREATE OR REPLACE`.

`CREATE OR REPLACE TEMP VIEW iris_setosa_tmp_vw AS
    SELECT * FROM iris WHERE species in ('setosa') and sepal_length > 4.5`

In [92]:
spark.sql("""CREATE OR REPLACE TEMP VIEW iris_setosa_tmp_vw AS 
          SELECT * FROM iris WHERE species in ('setosa') and sepal_length > 4.5""")

DataFrame[]

In [93]:
spark.sql("select * from iris_setosa_tmp_vw").show()

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|         5.0|        3.4|         1.5|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
|         5.4|        3.7|         1.5|        0.2| setosa|
|         4.8|        3.4|         1.6|        0.2| setosa|
|         4.8|        3.0|         1.4|        0.1| setosa|
|         5.8|        4.0|         1.2|        0.2| setosa|
|         5.7|        4.4|         1.5| 

#### 6.15.5 Explain Views
`EXPLAIN` displays the query plan that Spark generates for a statement, including a query on a view.

`EXPLAIN SELECT * FROM iris_setosa_tmp_vw`

In [94]:
spark.sql("EXPLAIN SELECT * FROM iris_setosa_tmp_vw").show(1, False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|plan                                                                                                                                                                                           

In [95]:
spark.sql("EXPLAIN table iris_setosa_tmp_vw").show(1,False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|plan                                                                                                                                                                                           

`EXPLAIN SELECT * FROM iris WHERE species = 'setosa'`

In [167]:
spark.sql("EXPLAIN SELECT * FROM iris WHERE species = 'setosa'").show(1,False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|plan                                                                                                                                                                                                                                                                                                                                                         

#### 6.15.6 Drop Views
Dropping a view is similar to dropping a table. Dropping a view does not drop any data; only the view's metadata is removed.

`DROP VIEW IF EXISTS iris_setosa_vw`

In [96]:
spark.sql("DROP VIEW IF EXISTS iris_setosa_vw")

DataFrame[]

#### 6.16 Select Statements
Spark SQL supports ANSI SQL `SELECT` statements with the syntax shown below.

**SELECT** [**ALL**|**DISTINCT**] named_expression[, named_expression, ...]  
&emsp;**FROM** relation[, relation, ...]  
&emsp;[lateral_view[, lateral_view, ...]]  
&emsp;[**WHERE** boolean_expression]  
&emsp;[aggregation [**HAVING** boolean_expression]]  
&emsp;[**ORDER BY** sort_expressions]  
&emsp;[**CLUSTER BY** expressions]  
&emsp;[DISTRIBUTE BY expressions]  
&emsp;[SORT BY sort_expressions]  
&emsp;[WINDOW named_window[, WINDOW named_window, ...]]  
&emsp;[**LIMIT** num_rows]  

named_expression:  
&emsp;: expression [**AS alias**]    

relation:  
&emsp;: join_relation  
&emsp;| (**table_name**|query|relation) [sample] [**AS alias**]  
&emsp;| **VALUES** (expressions)[, (expressions), ...]  
&emsp;&emsp; [**AS** (**column_name**[, **column_name**, ...])]  

expressions:  
&emsp;: expression[, expression, ...]  

sort_expressions:  
&emsp;: expression [**ASC**|**DESC**][, expression \[**ASC**|**DESC**\], ...]

**SELECT Statements example**

In [97]:
spark.sql("SELECT * FROM iris").show(20)

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|         5.0|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
|         5.4|        3.7|         1.5|        0.2| setosa|
|         4.8|        3.4|         1.6|        0.2| setosa|
|         4.8|        3.0|         1.4|        0.1| setosa|
|         4.3|        3.0|         1.1| 

In [98]:
spark.sql("SELECT sepal_length as sl ,sepal_width as sw, upper(species) as species, current_date()\
          as current_date FROM iris\
          order by sepal_length desc limit 50").show(100)

+---+---+----------+------------+
| sl| sw|   species|current_date|
+---+---+----------+------------+
|7.9|3.8| VIRGINICA|  2020-12-05|
|7.7|3.8| VIRGINICA|  2020-12-05|
|7.7|2.6| VIRGINICA|  2020-12-05|
|7.7|2.8| VIRGINICA|  2020-12-05|
|7.7|3.0| VIRGINICA|  2020-12-05|
|7.6|3.0| VIRGINICA|  2020-12-05|
|7.4|2.8| VIRGINICA|  2020-12-05|
|7.3|2.9| VIRGINICA|  2020-12-05|
|7.2|3.6| VIRGINICA|  2020-12-05|
|7.2|3.2| VIRGINICA|  2020-12-05|
|7.2|3.0| VIRGINICA|  2020-12-05|
|7.1|3.0| VIRGINICA|  2020-12-05|
|7.0|3.2|VERSICOLOR|  2020-12-05|
|6.9|3.1|VERSICOLOR|  2020-12-05|
|6.9|3.2| VIRGINICA|  2020-12-05|
|6.9|3.1| VIRGINICA|  2020-12-05|
|6.9|3.1| VIRGINICA|  2020-12-05|
|6.8|2.8|VERSICOLOR|  2020-12-05|
|6.8|3.0| VIRGINICA|  2020-12-05|
|6.8|3.2| VIRGINICA|  2020-12-05|
|6.7|3.1|VERSICOLOR|  2020-12-05|
|6.7|3.0|VERSICOLOR|  2020-12-05|
|6.7|3.1|VERSICOLOR|  2020-12-05|
|6.7|2.5| VIRGINICA|  2020-12-05|
|6.7|3.3| VIRGINICA|  2020-12-05|
|6.7|3.1| VIRGINICA|  2020-12-05|
|6.7|3.3| VIRG

In [99]:
spark.sql("SELECT distinct species FROM iris").show(20)

+----------+
|   species|
+----------+
| virginica|
|versicolor|
|    setosa|
+----------+



In [100]:
spark.sql("""
    SELECT 
        species, min(sepal_length) min_sepal_len, max(sepal_length) max_sepal_len,
        min(sepal_width), max(sepal_width),
        min(petal_length), max(petal_length) as max_petal_length
    FROM iris
    GROUP BY species"""
    ).show(20)

+----------+-------------+-------------+----------------+----------------+-----------------+----------------+
|   species|min_sepal_len|max_sepal_len|min(sepal_width)|max(sepal_width)|min(petal_length)|max_petal_length|
+----------+-------------+-------------+----------------+----------------+-----------------+----------------+
| virginica|          4.9|          7.9|             2.2|             3.8|              4.5|             6.9|
|versicolor|          4.9|          7.0|             2.0|             3.4|              3.0|             5.1|
|    setosa|          4.3|          5.8|             2.3|             4.4|              1.0|             1.9|
+----------+-------------+-------------+----------------+----------------+-----------------+----------------+



In [101]:
spark.sql("select distinct sepal_length from iris order by sepal_length").show(50)

+------------+
|sepal_length|
+------------+
|         4.3|
|         4.4|
|         4.5|
|         4.6|
|         4.7|
|         4.8|
|         4.9|
|         5.0|
|         5.1|
|         5.2|
|         5.3|
|         5.4|
|         5.5|
|         5.6|
|         5.7|
|         5.8|
|         5.9|
|         6.0|
|         6.1|
|         6.2|
|         6.3|
|         6.4|
|         6.5|
|         6.6|
|         6.7|
|         6.8|
|         6.9|
|         7.0|
|         7.1|
|         7.2|
|         7.3|
|         7.4|
|         7.6|
|         7.7|
|         7.9|
+------------+



#### 6.17 Case Statements
A `CASE` statement replaces a data value based on conditions, either to apply business logic or to derive temporary labels for later calculations.

`SELECT 
    sepal_length,
    CASE WHEN sepal_length >= 4.3 AND sepal_length <5.0 THEN 'Small'
         WHEN sepal_length >=5.0 AND sepal_length <5.5 THEN 'Medium'
         WHEN sepal_length >=5.5 AND sepal_length <6.0 THEN 'Large'
         WHEN sepal_length >=6.0 THEN 'Extra Large'            
    ELSE 'UNKNOWN' END AS length_range
FROM iris`

In [102]:
spark.sql("""
    SELECT 
        sepal_length,
        CASE /*WHEN sepal_length >= 4.3 AND sepal_length <5.0 THEN 'Small' -- to comment */
             WHEN sepal_length >=5.0 AND sepal_length <5.5 THEN 'Medium'
             WHEN sepal_length >=5.5 AND sepal_length <6.0 THEN 'Large'
             WHEN sepal_length >=6.0 THEN 'Extra Large'            
        ELSE 'UNKNOWN' END AS length_range
    FROM iris
    order by length_range
""").show(20, False)

+------------+------------+
|sepal_length|length_range|
+------------+------------+
|6.7         |Extra Large |
|6.5         |Extra Large |
|6.0         |Extra Large |
|7.0         |Extra Large |
|6.0         |Extra Large |
|6.8         |Extra Large |
|6.0         |Extra Large |
|6.3         |Extra Large |
|6.7         |Extra Large |
|6.0         |Extra Large |
|6.3         |Extra Large |
|6.7         |Extra Large |
|6.1         |Extra Large |
|6.1         |Extra Large |
|6.2         |Extra Large |
|6.1         |Extra Large |
|6.3         |Extra Large |
|6.6         |Extra Large |
|7.1         |Extra Large |
|6.5         |Extra Large |
+------------+------------+
only showing top 20 rows
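
The executed cell above comments out the `Small` branch, so it differs from the statement shown before it. The following plain-Python sketch mirrors the `CASE` logic to make the consequence visible: branches are tested from top to bottom, the first match wins, and a value that matches no branch falls through to `ELSE`. The function name `length_range` and the `include_small` switch are hypothetical, introduced only for this illustration.

```python
def length_range(sepal_length, include_small=True):
    """Mirror the CASE expression: branches are tested top to bottom, first match wins."""
    if include_small and 4.3 <= sepal_length < 5.0:
        return "Small"
    if 5.0 <= sepal_length < 5.5:
        return "Medium"
    if 5.5 <= sepal_length < 6.0:
        return "Large"
    if sepal_length >= 6.0:
        return "Extra Large"
    return "UNKNOWN"  # the ELSE branch


for value in (4.6, 5.1, 5.8, 6.7):
    # Second column: all branches; third column: 'Small' branch commented out.
    print(value, length_range(value), length_range(value, include_small=False))
```

With the `Small` branch disabled, a sepal length of 4.6 is labelled `UNKNOWN` instead of `Small`, which is what the notebook cell produces for every value below 5.0.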



#### 6.18 Complex Types
Spark SQL has three core complex types:
* structs
* lists
* maps

#### 6.18.1 Structs
Structs are similar to maps: both are used to represent and query nested data. To create a struct from the columns of an existing table, wrap a set of columns or column expressions in parentheses. Struct and map columns can also be declared when defining a table's structure.

In [103]:
spark.sql("CREATE VIEW IF NOT EXISTS struct_example_vw AS\
            SELECT species, (sepal_length, sepal_width) as sepal_info, (petal_length, petal_width) as petal_info\
            FROM iris")

DataFrame[]

In [104]:
spark.sql("DESCRIBE EXTENDED struct_example_vw").show(10, False)

+----------------------------+--------------------------------------------+-------+
|col_name                    |data_type                                   |comment|
+----------------------------+--------------------------------------------+-------+
|species                     |string                                      |null   |
|sepal_info                  |struct<sepal_length:float,sepal_width:float>|null   |
|petal_info                  |struct<petal_length:float,petal_width:float>|null   |
|                            |                                            |       |
|# Detailed Table Information|                                            |       |
|Database                    |analytics_tensor                            |       |
|Table                       |struct_example_vw                           |       |
|Owner                       |kcmahesh                                    |       |
|Created Time                |Sun Dec 06 06:56:03 NPT 2020                | 

In [105]:
spark.sql("DESC EXTENDED struct_example_vw").show(10, False)

+----------------------------+--------------------------------------------+-------+
|col_name                    |data_type                                   |comment|
+----------------------------+--------------------------------------------+-------+
|species                     |string                                      |null   |
|sepal_info                  |struct<sepal_length:float,sepal_width:float>|null   |
|petal_info                  |struct<petal_length:float,petal_width:float>|null   |
|                            |                                            |       |
|# Detailed Table Information|                                            |       |
|Database                    |analytics_tensor                            |       |
|Table                       |struct_example_vw                           |       |
|Owner                       |kcmahesh                                    |       |
|Created Time                |Sun Dec 06 06:56:03 NPT 2020                | 

To query a struct, use dot syntax: the struct column name followed by the field name. `column_name.*` selects all the fields of the given struct.

In [320]:
spark.sql("SELECT * from struct_example_vw").show()

+-------+----------+----------+
|species|sepal_info|petal_info|
+-------+----------+----------+
| setosa|[5.1, 3.5]|[1.4, 0.2]|
| setosa|[4.9, 3.0]|[1.4, 0.2]|
| setosa|[4.7, 3.2]|[1.3, 0.2]|
| setosa|[4.6, 3.1]|[1.5, 0.2]|
| setosa|[5.0, 3.6]|[1.4, 0.2]|
| setosa|[5.4, 3.9]|[1.7, 0.4]|
| setosa|[4.6, 3.4]|[1.4, 0.3]|
| setosa|[5.0, 3.4]|[1.5, 0.2]|
| setosa|[4.4, 2.9]|[1.4, 0.2]|
| setosa|[4.9, 3.1]|[1.5, 0.1]|
| setosa|[5.4, 3.7]|[1.5, 0.2]|
| setosa|[4.8, 3.4]|[1.6, 0.2]|
| setosa|[4.8, 3.0]|[1.4, 0.1]|
| setosa|[4.3, 3.0]|[1.1, 0.1]|
| setosa|[5.8, 4.0]|[1.2, 0.2]|
| setosa|[5.7, 4.4]|[1.5, 0.4]|
| setosa|[5.4, 3.9]|[1.3, 0.4]|
| setosa|[5.1, 3.5]|[1.4, 0.3]|
| setosa|[5.7, 3.8]|[1.7, 0.3]|
| setosa|[5.1, 3.8]|[1.5, 0.3]|
+-------+----------+----------+
only showing top 20 rows



In [106]:
spark.sql("SELECT sepal_info.sepal_length, petal_info.petal_width from struct_example_vw").show()

+------------+-----------+
|sepal_length|petal_width|
+------------+-----------+
|         5.1|        0.2|
|         4.9|        0.2|
|         4.7|        0.2|
|         4.6|        0.2|
|         5.0|        0.2|
|         5.4|        0.4|
|         4.6|        0.3|
|         5.0|        0.2|
|         4.4|        0.2|
|         4.9|        0.1|
|         5.4|        0.2|
|         4.8|        0.2|
|         4.8|        0.1|
|         4.3|        0.1|
|         5.8|        0.2|
|         5.7|        0.4|
|         5.4|        0.4|
|         5.1|        0.3|
|         5.7|        0.3|
|         5.1|        0.3|
+------------+-----------+
only showing top 20 rows



In [325]:
spark.sql("SELECT sepal_info.*, petal_info.* from struct_example_vw").show()

+------------+-----------+------------+-----------+
|sepal_length|sepal_width|petal_length|petal_width|
+------------+-----------+------------+-----------+
|         5.1|        3.5|         1.4|        0.2|
|         4.9|        3.0|         1.4|        0.2|
|         4.7|        3.2|         1.3|        0.2|
|         4.6|        3.1|         1.5|        0.2|
|         5.0|        3.6|         1.4|        0.2|
|         5.4|        3.9|         1.7|        0.4|
|         4.6|        3.4|         1.4|        0.3|
|         5.0|        3.4|         1.5|        0.2|
|         4.4|        2.9|         1.4|        0.2|
|         4.9|        3.1|         1.5|        0.1|
|         5.4|        3.7|         1.5|        0.2|
|         4.8|        3.4|         1.6|        0.2|
|         4.8|        3.0|         1.4|        0.1|
|         4.3|        3.0|         1.1|        0.1|
|         5.8|        4.0|         1.2|        0.2|
|         5.7|        4.4|         1.5|        0.4|
|         5.

#### 6.18.2 Lists
Lists can be created with the `collect_list` function, which keeps duplicate values, or the `collect_set` function, which removes them. Both are aggregate functions, so they can only be used in aggregations.

In [326]:
spark.sql("SELECT species, collect_list(sepal_length) as spl_length from iris\
             GROUP BY species\
          ").show(10, False)

+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|species   |spl_length                                                                                                                                                                                                                                                                                                  |
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|virginica |[6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 4.9, 7.3, 6.7, 

In [327]:
spark.sql("SELECT species, collect_set(sepal_length) as spl_length from iris\
             GROUP BY species\
          ").show(10, False)

+----------+--------------------------------------------------------------------------------------------------------------+
|species   |spl_length                                                                                                    |
+----------+--------------------------------------------------------------------------------------------------------------+
|virginica |[7.3, 6.0, 7.6, 4.9, 6.8, 7.1, 6.3, 7.9, 5.8, 7.4, 6.1, 7.7, 6.9, 5.6, 7.2, 6.4, 6.7, 5.9, 6.2, 5.1, 5.7, 6.5]|
|versicolor|[5.2, 5.4, 6.0, 4.9, 6.8, 5.5, 6.3, 6.6, 5.8, 6.1, 6.9, 5.0, 5.6, 6.4, 6.7, 5.9, 6.2, 7.0, 5.1, 5.7, 6.5]     |
|setosa    |[5.2, 5.4, 4.9, 5.5, 4.4, 4.7, 5.8, 5.0, 4.5, 5.3, 4.8, 5.1, 4.3, 5.7, 4.6]                                   |
+----------+--------------------------------------------------------------------------------------------------------------+
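
The difference between the two aggregates can be mimicked in plain Python. This is only an analogy, assuming a handful of made-up `(species, sepal_length)` rows: a `list` per group behaves like `collect_list`, and a `set` per group behaves like `collect_set`.

```python
from collections import defaultdict

# Made-up (species, sepal_length) rows; the real iris table is much larger.
rows = [
    ("setosa", 5.1), ("setosa", 4.9), ("setosa", 5.1),
    ("virginica", 6.3), ("virginica", 6.3), ("virginica", 7.1),
]

as_list = defaultdict(list)  # plays the role of collect_list: keeps duplicates
as_set = defaultdict(set)    # plays the role of collect_set: drops duplicates

for species, sepal_length in rows:
    as_list[species].append(sepal_length)
    as_set[species].add(sepal_length)

for species in as_list:
    # The set is sorted here only to make the printed output stable.
    print(species, as_list[species], sorted(as_set[species]))
```

As the unordered `collect_set` output above suggests, the order of the collected elements should not be relied on.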



A list can also be created manually with the `array()` function, as shown below.

In [107]:
spark.sql("select species, array('A', 'B', 'C') as items_1, array(10,20,30) as items_2 from iris").show()

+-------+---------+------------+
|species|  items_1|     items_2|
+-------+---------+------------+
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
| setosa|[A, B, C]|[10, 20, 30]|
+-------+---------+------------+
only showing top 20 rows



In [108]:
spark.sql("CREATE VIEW IF NOT EXISTS list_example_vw as SELECT species, collect_set(sepal_length) as spl_length from iris\
             GROUP BY species")

DataFrame[]

In [109]:
spark.sql("DESC EXTENDED list_example_vw").show(10, False)

+----------------------------+----------------------------+-------+
|col_name                    |data_type                   |comment|
+----------------------------+----------------------------+-------+
|species                     |string                      |null   |
|spl_length                  |array<float>                |null   |
|                            |                            |       |
|# Detailed Table Information|                            |       |
|Database                    |analytics_tensor            |       |
|Table                       |list_example_vw             |       |
|Owner                       |kcmahesh                    |       |
|Created Time                |Sun Dec 06 07:05:18 NPT 2020|       |
|Last Access                 |UNKNOWN                     |       |
|Created By                  |Spark 3.0.0-preview         |       |
+----------------------------+----------------------------+-------+
only showing top 10 rows



In [333]:
spark.sql("select species, array_max(spl_length) as max_spl_length from list_example_vw").show(10, False)

+----------+--------------+
|species   |max_spl_length|
+----------+--------------+
|virginica |7.9           |
|versicolor|7.0           |
|setosa    |5.8           |
+----------+--------------+



#### 6.18.3 Maps
A map has a key type and a value type. These can be simple types such as integer or string, and values can also be complex types such as array, map and struct. Maps can be declared when defining a table or constructed with map functions such as `map()`.

In [110]:
spark.sql("DROP VIEW IF EXISTS map_example_vw")
spark.sql("CREATE VIEW IF NOT EXISTS map_example_vw as SELECT map(10, 'John', 20 , 'Bobby') as map_types from iris")

DataFrame[]

In [111]:
spark.sql("DESC EXTENDED map_example_vw").show(10, False)

+----------------------------+----------------------------+-------+
|col_name                    |data_type                   |comment|
+----------------------------+----------------------------+-------+
|map_types                   |map<int,string>             |null   |
|                            |                            |       |
|# Detailed Table Information|                            |       |
|Database                    |analytics_tensor            |       |
|Table                       |map_example_vw              |       |
|Owner                       |kcmahesh                    |       |
|Created Time                |Sun Dec 06 07:08:09 NPT 2020|       |
|Last Access                 |UNKNOWN                     |       |
|Created By                  |Spark 3.0.0-preview         |       |
|Type                        |VIEW                        |       |
+----------------------------+----------------------------+-------+
only showing top 10 rows



In [361]:
spark.sql("SELECT map_keys(map_types), map_values(map_types), map_concat(map_types) from map_example_vw").show(10, False)

+-------------------+---------------------+-------------------------+
|map_keys(map_types)|map_values(map_types)|map_concat(map_types)    |
+-------------------+---------------------+-------------------------+
|[10, 20]           |[John, Bobby]        |[10 -> John, 20 -> Bobby]|
|[10, 20]           |[John, Bobby]        |[10 -> John, 20 -> Bobby]|
|[10, 20]           |[John, Bobby]        |[10 -> John, 20 -> Bobby]|
|[10, 20]           |[John, Bobby]        |[10 -> John, 20 -> Bobby]|
|[10, 20]           |[John, Bobby]        |[10 -> John, 20 -> Bobby]|
|[10, 20]           |[John, Bobby]        |[10 -> John, 20 -> Bobby]|
|[10, 20]           |[John, Bobby]        |[10 -> John, 20 -> Bobby]|
|[10, 20]           |[John, Bobby]        |[10 -> John, 20 -> Bobby]|
|[10, 20]           |[John, Bobby]        |[10 -> John, 20 -> Bobby]|
|[10, 20]           |[John, Bobby]        |[10 -> John, 20 -> Bobby]|
+-------------------+---------------------+-------------------------+
only showing top 10 rows

**Example of complex types**

In [116]:
spark.sql("""
        CREATE TABLE user_profile(
        user_id int,
        user_name string,
        profile_name string,
        hobbies array<string>,
        liked_image map<string, array<string>>,
        -- struct with same column name user_id
        friends_info struct<first_name: string, last_name: string, user_name: string, user_id: int>,
        -- struct with element of an array
        address array<struct<street:string, city: string, country: string>>,
        -- struct with map
        like_url map<string, struct <year:int, url: string, details: string>>        
        )
        """)

DataFrame[]

In [117]:
spark.sql("DESCRIBE user_profile").show(15, False)

+------------+-----------------------------------------------------------------------+-------+
|col_name    |data_type                                                              |comment|
+------------+-----------------------------------------------------------------------+-------+
|user_id     |int                                                                    |null   |
|user_name   |string                                                                 |null   |
|profile_name|string                                                                 |null   |
|hobbies     |array<string>                                                          |null   |
|liked_image |map<string,array<string>>                                              |null   |
|friends_info|struct<first_name:string,last_name:string,user_name:string,user_id:int>|null   |
|address     |array<struct<street:string,city:string,country:string>>                |null   |
|like_url    |map<string,struct<year:int,url:strin

In [118]:
hobbies = ['swimming', 'reading', 'hiking']
spark.sql("""INSERT INTO user_profile
    (user_id, user_name, profile_name, hobbies)
    -- liked_image, friends_info, address,like_url) 
    values
    (1, 'bobby', 'Bobby', hobbies
    -- "car:['audi', 'bently']"
    )
""")

ParseException: 
mismatched input 'user_id' expecting {'(', 'FROM', 'MAP', 'REDUCE', 'SELECT', 'TABLE', 'VALUES', 'WITH'}(line 2, pos 5)

== SQL ==
INSERT INTO user_profile
    (user_id, user_name, profile_name, hobbies)
-----^^^
    -- liked_image, friends_info, address,like_url) 
    values
    (1, 'bobby', 'Bobby', hobbies
    -- "car:['audi', 'bently']"
    )


In [120]:
hobbies = ['swimming', 'reading', 'hiking']
spark.sql("""INSERT INTO user_profile
    (user_name, profile_name, hobbies) 
    values
    ('bobby', 'Bobby', hobbies)
""")

ParseException: 
mismatched input 'user_name' expecting {'(', 'FROM', 'MAP', 'REDUCE', 'SELECT', 'TABLE', 'VALUES', 'WITH'}(line 2, pos 5)

== SQL ==
INSERT INTO user_profile
    (user_name, profile_name, hobbies) 
-----^^^
    values
    ('bobby', 'Bobby', hobbies)


#### 6.19 Functions
Spark SQL provides various functions for manipulating data.  
`SHOW FUNCTIONS`: displays all functions.  
`SHOW SYSTEM FUNCTIONS`: displays system (built-in) functions.  
`SHOW USER FUNCTIONS`: displays user-defined functions.  
`SHOW FUNCTIONS "con*"`: displays only the functions whose names start with `con`, using the wildcard (`*`) character.  
`SHOW FUNCTIONS LIKE "reg*"`: displays only the functions whose names start with `reg`, using the wildcard (`*`) character. The `LIKE` keyword is optional.

**Helpful Link for [functions](https://spark.apache.org/docs/latest/api/sql/index.html)**.
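
The wildcard matching can be approximated in plain Python, which helps when predicting what a pattern will return. This is a sketch under stated assumptions, not Spark's own code: it assumes `*` expands to "any sequence of characters", `|` separates alternative patterns, matching is case-insensitive, and the whole function name must match. The function `filter_function_names` and the sample name list are hypothetical.

```python
import re


def filter_function_names(names, pattern):
    """Return the sorted names that match a SHOW FUNCTIONS-style pattern."""
    matched = set()
    for sub_pattern in pattern.strip().split("|"):
        try:
            # Assumption: '*' behaves like the regular expression '.*'
            regex = re.compile(sub_pattern.replace("*", ".*"), re.IGNORECASE)
        except re.error:
            continue  # skip alternatives that are not valid patterns
        matched.update(name for name in names if regex.fullmatch(name))
    return sorted(matched)


names = ["concat", "concat_ws", "conv", "cos",
         "make_date", "make_timestamp", "date_format"]

print(filter_function_names(names, "con*"))    # ['concat', 'concat_ws', 'conv']
print(filter_function_names(names, "^make*"))  # ['make_date', 'make_timestamp']
print(filter_function_names(names, "*at*"))    # ['concat', 'concat_ws', 'date_format', 'make_date']
```

Under these assumptions a leading `^` is simply read as a regular-expression anchor, which is consistent with the `"^make*"` query in this section returning the `make_` functions.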

In [122]:
spark.sql("SHOW FUNCTIONS").show(100, False)

+---------------------+
|function             |
+---------------------+
|!                    |
|!=                   |
|%                    |
|&                    |
|*                    |
|+                    |
|-                    |
|/                    |
|<                    |
|<=                   |
|<=>                  |
|<>                   |
|=                    |
|==                   |
|>                    |
|>=                   |
|^                    |
|abs                  |
|acos                 |
|acosh                |
|add_months           |
|aggregate            |
|and                  |
|any                  |
|approx_count_distinct|
|approx_percentile    |
|array                |
|array_contains       |
|array_distinct       |
|array_except         |
|array_intersect      |
|array_join           |
|array_max            |
|array_min            |
|array_position       |
|array_remove         |
|array_repeat         |
|array_sort           |
|array_union    

In [123]:
spark.sql("SHOW SYSTEM FUNCTIONS").show(30, False)

+---------------------+
|function             |
+---------------------+
|!                    |
|!=                   |
|%                    |
|&                    |
|*                    |
|+                    |
|-                    |
|/                    |
|<                    |
|<=                   |
|<=>                  |
|<>                   |
|=                    |
|==                   |
|>                    |
|>=                   |
|^                    |
|abs                  |
|acos                 |
|acosh                |
|add_months           |
|aggregate            |
|and                  |
|any                  |
|approx_count_distinct|
|approx_percentile    |
|array                |
|array_contains       |
|array_distinct       |
|array_except         |
+---------------------+
only showing top 30 rows



In [124]:
spark.sql("SHOW USER FUNCTIONS").show(30, False)

+--------+
|function|
+--------+
+--------+



In [125]:
spark.sql("SHOW FUNCTIONS \"con*\"").show(30, False)

+---------+
|function |
+---------+
|concat   |
|concat_ws|
|conv     |
+---------+



In [126]:
spark.sql("SHOW FUNCTIONS LIKE \"*at*\"").show(30, False)

+----------------+
|function        |
+----------------+
|aggregate       |
|array_repeat    |
|atan            |
|atan2           |
|atanh           |
|concat          |
|concat_ws       |
|current_database|
|current_date    |
|date            |
|date_add        |
|date_format     |
|date_part       |
|date_sub        |
|date_trunc      |
|datediff        |
|element_at      |
|flatten         |
|float           |
|format_number   |
|format_string   |
|greatest        |
|locate          |
|make_date       |
|map_concat      |
|negative        |
|repeat          |
|to_date         |
|translate       |
|xpath           |
+----------------+
only showing top 30 rows



In [127]:
spark.sql("SHOW FUNCTIONS LIKE \"^make*\"").show(30, False)

+--------------+
|function      |
+--------------+
|make_date     |
|make_timestamp|
+--------------+



In [230]:
spark.sql("DESCRIBE FUNCTION concat_ws").show(30, False)

+---------------------------------------------------------------------------------------------------------+
|function_desc                                                                                            |
+---------------------------------------------------------------------------------------------------------+
|Function: concat_ws                                                                                      |
|Class: org.apache.spark.sql.catalyst.expressions.ConcatWs                                                |
|Usage: concat_ws(sep, [str | array(str)]+) - Returns the concatenation of the strings separated by `sep`.|
+---------------------------------------------------------------------------------------------------------+



In [128]:
spark.sql("DESCRIBE FUNCTION EXTENDED array_contains").show(30, False)

+----------------------------------------------------------------------------------------------+
|function_desc                                                                                 |
+----------------------------------------------------------------------------------------------+
|Function: array_contains                                                                      |
|Class: org.apache.spark.sql.catalyst.expressions.ArrayContains                                |
|Usage: array_contains(array, value) - Returns true if the array contains the value.           |
|Extended Usage:
    Examples:
      > SELECT array_contains(array(1, 2, 3), 2);
       true
  |
+----------------------------------------------------------------------------------------------+



#### 6.20 User-Defined Functions
User-defined functions (UDFs) written for the DataFrame API can also be used in Spark SQL once they are registered with `spark.udf.register`. Functions can also be registered through Hive using the `CREATE TEMPORARY FUNCTION` syntax.

In [129]:
# Create increase_twice UDF
def increase_twice(number):
    """
    Return twice the given value as a float.
    :param number: numeric column value (int, float, or double)
    Usage in SQL once registered as increase_twice_udf:
        SELECT value, increase_twice_udf(value) AS twice
    +----------+---------+
    | value    | twice   |
    +----------+---------+
    | 10       | 20.0    |
    +----------+---------+
    """
    return float(number * 2)

check = increase_twice(10.2)
check

20.4

In [130]:
# Register increase_twice_udf UDF
spark.udf.register("increase_twice_udf", increase_twice)

<function __main__.increase_twice(number)>

In [131]:
spark.sql("SHOW FUNCTIONS 'increase*'").show()

+------------------+
|          function|
+------------------+
|increase_twice_udf|
+------------------+



In [132]:
spark.sql("DESCRIBE FUNCTION EXTENDED increase_twice_udf").show(30, False)

+-------------------------------------------------------------------+
|function_desc                                                      |
+-------------------------------------------------------------------+
|Function: increase_twice_udf                                       |
|Class: org.apache.spark.sql.UDFRegistration$$Lambda$3973/1557538885|
|Usage: N/A.                                                        |
|Extended Usage:
    No example/argument for increase_twice_udf.
   |
+-------------------------------------------------------------------+



In [133]:
spark.sql("select * from iris limit 10 ").show()

spark.sql("DESCRIBE iris").show()

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|         5.0|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
+------------+-----------+------------+-----------+-------+

+------------+---------+-----------+
|    col_name|data_type|    comment|
+------------+---------+-----------+
|sepal_length|    float|       null|
| sepal_wid

`SELECT sepal_length, increase_twice_udf(cast(sepal_length as float)) double_sepal_length FROM iris`

In [135]:
spark.sql("SELECT sepal_length, sepal_length * 2, increase_twice_udf(sepal_length) spl_twice_lgth, increase_twice_udf(cast(sepal_length as float)) as double_sepal_length FROM iris").show(10)

+------------+---------------------------------+------------------+-------------------+
|sepal_length|(sepal_length * CAST(2 AS FLOAT))|    spl_twice_lgth|double_sepal_length|
+------------+---------------------------------+------------------+-------------------+
|         5.1|                             10.2|10.199999809265137| 10.199999809265137|
|         4.9|                              9.8| 9.800000190734863|  9.800000190734863|
|         4.7|                              9.4| 9.399999618530273|  9.399999618530273|
|         4.6|                              9.2| 9.199999809265137|  9.199999809265137|
|         5.0|                             10.0|              10.0|               10.0|
|         5.4|                             10.8|10.800000190734863| 10.800000190734863|
|         4.6|                              9.2| 9.199999809265137|  9.199999809265137|
|         5.0|                             10.0|              10.0|               10.0|
|         4.4|                  
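
The UDF columns above show `10.199999809265137` where the built-in multiplication shows `10.2`. A likely explanation is that `sepal_length` is stored as a 32-bit float, while the Python UDF receives it as a 64-bit Python `float` and exposes the rounding error already present in the stored value. The sketch below reproduces the effect without Spark, using the standard `struct` module to round a value to 32-bit precision. The helper `to_float32` is hypothetical.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# What a 32-bit column actually stores for 5.1
stored = to_float32(5.1)
print(stored)                # 5.099999904632568

# What the Python UDF computes: 64-bit arithmetic on the stored value
doubled = float(stored * 2)
print(doubled)               # 10.199999809265137

# Rounded back to 32 bits, the result is the same value that 10.2 maps to
print(to_float32(doubled) == to_float32(10.2))  # True
```

If the extra digits are unwanted, one option is to round inside the UDF or to apply `round(...)` in the SQL query.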

#### 6.21 Subqueries
Spark SQL supports both correlated and uncorrelated subqueries. Refer to the previous SQL chapter for details on subqueries.

#### 6.22 Interoperate SQL and DataFrames
`spark.sql()` returns a DataFrame, so the result of a SQL query can be processed further with DataFrame methods such as `where`.


In [136]:
df_1 = spark.sql("SELECT * from iris")
df_1.printSchema()
df_1.where('sepal_length == 5.0').show()

root
 |-- sepal_length: float (nullable = true)
 |-- sepal_width: float (nullable = true)
 |-- petal_length: float (nullable = true)
 |-- petal_width: float (nullable = true)
 |-- species: string (nullable = true)

+------------+-----------+------------+-----------+----------+
|sepal_length|sepal_width|petal_length|petal_width|   species|
+------------+-----------+------------+-----------+----------+
|         5.0|        3.6|         1.4|        0.2|    setosa|
|         5.0|        3.4|         1.5|        0.2|    setosa|
|         5.0|        3.0|         1.6|        0.2|    setosa|
|         5.0|        3.4|         1.6|        0.4|    setosa|
|         5.0|        3.2|         1.2|        0.2|    setosa|
|         5.0|        3.5|         1.3|        0.3|    setosa|
|         5.0|        3.5|         1.6|        0.6|    setosa|
|         5.0|        3.3|         1.4|        0.2|    setosa|
|         5.0|        2.0|         3.5|        1.0|versicolor|
|         5.0|        2.3|   

#### 6.23 Catalog API
Metadata can also be accessed through the Spark SQL Catalog API, available as `spark.catalog`.

In [138]:
spark.catalog.listDatabases()

[Database(name='analytics_tensor', description='', locationUri='file:/Users/kcmahesh/company/training/Spark/Chapter_6_Spark_SQL/spark-warehouse/analytics_tensor.db'),
 Database(name='default', description='Default Hive database', locationUri='file:/Users/kcmahesh/company/training/Spark/Chapter_6_Spark_SQL/spark-warehouse')]

In [139]:
spark.catalog.listTables()

[Table(name='iris', database='analytics_tensor', description=None, tableType='EXTERNAL', isTemporary=False),
 Table(name='iris_dataset_20200628', database='analytics_tensor', description=None, tableType='EXTERNAL', isTemporary=False),
 Table(name='list_example_vw', database='analytics_tensor', description=None, tableType='VIEW', isTemporary=False),
 Table(name='map_example_vw', database='analytics_tensor', description=None, tableType='VIEW', isTemporary=False),
 Table(name='partition_student', database='analytics_tensor', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='sample_iris', database='analytics_tensor', description=None, tableType='EXTERNAL', isTemporary=False),
 Table(name='struct_example_vw', database='analytics_tensor', description=None, tableType='VIEW', isTemporary=False),
 Table(name='student', database='analytics_tensor', description=None, tableType='EXTERNAL', isTemporary=False),
 Table(name='student_name', database='analytics_tensor', descriptio

In [140]:
spark.catalog.listColumns("iris")

[Column(name='sepal_length', description=None, dataType='float', nullable=True, isPartition=False, isBucket=False),
 Column(name='sepal_width', description='sepal width', dataType='float', nullable=True, isPartition=False, isBucket=False),
 Column(name='petal_length', description=None, dataType='float', nullable=True, isPartition=False, isBucket=False),
 Column(name='petal_width', description=None, dataType='float', nullable=True, isPartition=False, isBucket=False),
 Column(name='species', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False)]

#### Lab_3 Hints

In lab_1 we created an external file by performing some basic ETL on the employees database. We'll use the same file to create an external table and partition the data. 

**Note**: Use these hints in lab_3.  

**Diagram**

In [68]:
spark.sql("CREATE DATABASE IF NOT EXISTS datalake_raw")

spark.sql("USE datalake_raw")

DataFrame[]

spark.sql("""
    CREATE EXTERNAL TABLE employees_employees (
        emp_no integer,
        birth_date date,
        first_name string,
        last_name string,
        gender string,
        hire_date date)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION '/dataset/incoming/employees/employees'
""")