# Setting Up Tables
Managing database and table metadata, locations, and configurations at the beginning of a project can help increase data security, discoverability, and performance.

Databricks allows you to store additional information about your databases and tables in a number of locations. This notebook reviews the basics of table and database declaration while introducing various options.

## Learning Objectives
By the end of this notebook, students will be able to:
- Set database locations
- Specify database comments
- Set table locations
- Specify table comments
- Specify column comments
- Use table properties for custom tagging
- Explore table metadata

## Classroom Setup

As we saw in the previous lesson, we need to execute the following script to configure our learning environment.

In [0]:
%run ../Includes/Classroom-Setup-1.1

## Important Database Options

The single most important option when creating a new database is **`LOCATION`**. Because all managed tables will have their data files stored in this location, modifying the location of a database after initial declaration can require migration of many data files. Choosing an improper location can leave data in unsecured storage, which can lead to data exfiltration or deletion.

The database **`COMMENT`** option allows an arbitrary description to be provided for the database. This can be useful for both discovery and auditing purposes.

The **`DBPROPERTIES`** option allows user-defined keys to be specified for the database. This can be useful for creating tags that will be used in auditing. Note that this field may also contain options used elsewhere in Databricks to govern default behavior for the database.

## Creating a Database with Options

The following cell demonstrates the syntax for creating a database while:
1. Setting a database comment
1. Specifying a database location
1. Adding an arbitrary key-value pair as a database property

An arbitrary directory on the DBFS root is used for the location here; at any stage of development or production, it is best practice to create databases in secure cloud object storage with credentials locked down to appropriate teams within the organization.

In [0]:
%sql
CREATE DATABASE IF NOT EXISTS ${da.db_name}
COMMENT "This is a test database"
LOCATION "${da.paths.user_db}"
WITH DBPROPERTIES (contains_pii = true)
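
The same statement can also be assembled in Python, which makes it easier to apply one convention to every database. The sketch below is a minimal illustration and not part of the lesson's setup: `sql_quote` and `build_create_database` are hypothetical helper names, the database name and path are made up, and the escaping assumes Spark's default string-literal rules. It only builds a string; passing the result to `spark.sql` is left to the caller.

```python
def sql_quote(value):
    """Render a Python value as a single-quoted SQL string literal.

    Assumption: backslash escaping, as in Spark's default parser settings.
    """
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{text}'"


def build_create_database(name, location, comment=None, properties=None):
    """Build a CREATE DATABASE statement (hypothetical helper)."""
    if not location:
        # Require an explicit location so the storage choice is deliberate.
        raise ValueError("a database location is required")
    parts = [f"CREATE DATABASE IF NOT EXISTS {name}"]
    if comment:
        parts.append(f"COMMENT {sql_quote(comment)}")
    parts.append(f"LOCATION {sql_quote(location)}")
    if properties:
        pairs = ", ".join(
            f"{sql_quote(key)} = {sql_quote(value)}"
            for key, value in properties.items()
        )
        parts.append(f"WITH DBPROPERTIES ({pairs})")
    return "\n".join(parts)


statement = build_create_database(
    "demo_db",              # illustrative name, not from the lesson
    "dbfs:/tmp/demo_db",    # illustrative path
    comment="This is a test database",
    properties={"contains_pii": "true"},
)
print(statement)
```

Making `location` a required argument is one way to enforce the rule that every database is declared with a deliberate storage location.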

All of the comments and properties set during database declaration can be reviewed using **`DESCRIBE DATABASE EXTENDED`**.

This information can aid in data discovery, auditing, and governance. Having proactive rules about how databases will be created and tagged can help prevent accidental data exfiltration, redundancies, and deletions.

In [0]:
%sql
DESCRIBE DATABASE EXTENDED ${da.db_name}
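
For auditing, the properties reported by this command usually need to be turned into key-value pairs. The sketch below assumes the `Properties` value is rendered as `((key,value), (key,value))`, the form shown in the Spark SQL documentation; the exact rendering may differ between runtime versions, so compare it with the output of the cell above. `parse_db_properties` is a hypothetical helper, and the sample string is hand-written rather than captured output.

```python
import re


def parse_db_properties(raw):
    """Parse a string such as "((k1,v1), (k2,v2))" into a dict.

    Assumption: keys contain no commas or parentheses, and values contain
    no parentheses.
    """
    pairs = re.findall(r"\(([^(),]+),([^()]*)\)", raw)
    return {key.strip(): value.strip() for key, value in pairs}


# Hand-written sample in the assumed format.
sample = "((contains_pii,true), (owner_team,analytics))"
properties = parse_db_properties(sample)
print(properties)
```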

## Important Table Options

The **`LOCATION`** keyword also plays a pivotal role when declaring tables, as it determines whether a table is **managed** or **external**. Note that explicitly providing a location will always result in an external table, even if this location maps directly to the directory that would be used by default.

Tables also have the **`COMMENT`** option to provide a table description.

Tables also have the **`TBLPROPERTIES`** option, which can contain user-defined tags. All Delta Lake tables will have some default options stored in this field, and many customizations for Delta behavior will be reflected here.

Users can also define a **`COMMENT`** for an individual column.
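
To make the role of `LOCATION` concrete, the sketch below generates a managed and an external variant from one table specification. It is an illustration under stated assumptions: `build_create_table` is a hypothetical helper, the names and path are made up, and no value is expected to contain a double quote, so nothing is escaped.

```python
def build_create_table(name, columns, comment=None, location=None, properties=None):
    """Build a CREATE TABLE statement (hypothetical helper).

    columns: list of (column_name, data_type, column_comment_or_None) tuples.
    Assumption: no value contains a double quote, so no escaping is done.
    """
    column_defs = []
    for column_name, data_type, column_comment in columns:
        definition = f"{column_name} {data_type}"
        if column_comment:
            definition += f' COMMENT "{column_comment}"'
        column_defs.append(definition)
    parts = [f"CREATE TABLE IF NOT EXISTS {name}", f"({', '.join(column_defs)})"]
    if comment:
        parts.append(f'COMMENT "{comment}"')
    if location:
        # Supplying any location makes the table external.
        parts.append(f'LOCATION "{location}"')
    if properties:
        pairs = ", ".join(f"'{key}' = {value}" for key, value in properties.items())
        parts.append(f"TBLPROPERTIES ({pairs})")
    return "\n".join(parts)


spec = dict(
    columns=[("id", "INT", None), ("name", "STRING", "PII")],
    comment="Contains PII",
    properties={"contains_pii": "true"},
)
managed = build_create_table("demo_db.pii_test", **spec)
external = build_create_table(
    "demo_db.pii_test_2", location="dbfs:/tmp/pii_test_2", **spec
)
print(managed)
print(external)
```

Apart from the table name, the two statements differ only in the `LOCATION` clause.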

## Creating a Table with Options
The following cell demonstrates creating a **managed** Delta Lake table while:
1. Setting a column comment
1. Setting a table comment
1. Adding an arbitrary key-value pair as a table property

In [0]:
%sql
CREATE TABLE IF NOT EXISTS ${da.db_name}.pii_test
(id INT, name STRING COMMENT "PII")
COMMENT "Contains PII"
TBLPROPERTIES ('contains_pii' = True) 

Much like the command for reviewing database metadata settings, **`DESCRIBE EXTENDED`** allows users to see all of the comments and properties for a given table.

**NOTE**: Delta Lake automatically adds several table properties on table creation.

In [0]:
%sql
DESCRIBE EXTENDED ${da.db_name}.pii_test

The cell below replicates the code above and adds a location, creating an **external** table.

**NOTE**: The only thing that differentiates managed and external tables is this location setting. Performance of managed and external tables should be equivalent with regard to latency, but the results of SQL DDL statements on these tables differ drastically. For example, dropping a managed table deletes its underlying data files, while dropping an external table removes only the metadata and leaves the files in place.

In [0]:
%sql
CREATE TABLE IF NOT EXISTS ${da.db_name}.pii_test_2
(id INT, name STRING COMMENT "PII")
COMMENT "Contains PII"
LOCATION "${da.paths.working_dir}/pii_test_2"
TBLPROPERTIES ('contains_pii' = True) 

As expected, the only differences in the extended description of the table have to do with the table location and type.

In [0]:
%sql
DESCRIBE EXTENDED ${da.db_name}.pii_test_2
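
One way to check this claim without reading both outputs by eye is to compare them field by field. The sketch below is an assumption-laden illustration: `diff_descriptions` is a hypothetical helper, the rows are hand-written stand-ins rather than captured output, and the ignored field names are guesses at fields that always differ. In a notebook, the rows could be built from the `col_name` and `data_type` columns of each `DESCRIBE EXTENDED` result.

```python
def diff_descriptions(rows_a, rows_b, ignore=("Table", "Name", "Created Time")):
    """Return {field: (value_a, value_b)} for fields whose values differ.

    rows_a, rows_b: iterables of (col_name, data_type) pairs, the first two
    columns of a DESCRIBE EXTENDED result.
    """
    a, b = dict(rows_a), dict(rows_b)
    return {
        field: (a.get(field), b.get(field))
        for field in sorted(set(a) | set(b))
        if field not in ignore and a.get(field) != b.get(field)
    }


# Hand-written rows for illustration only; real output has more fields.
managed_rows = [
    ("id", "int"),
    ("name", "string"),
    ("Table", "pii_test"),
    ("Type", "MANAGED"),
    ("Location", "dbfs:/example/db/pii_test"),
]
external_rows = [
    ("id", "int"),
    ("name", "string"),
    ("Table", "pii_test_2"),
    ("Type", "EXTERNAL"),
    ("Location", "dbfs:/example/work/pii_test_2"),
]
differences = diff_descriptions(managed_rows, external_rows)
print(differences)
```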

## Using Table Metadata

When databases and tables are created according to consistent rules, comments, table properties, and other metadata can be queried programmatically to discover datasets for governance and auditing purposes.

The Python code below demonstrates parsing the table properties field and filtering out the options that specifically control Delta Lake behavior. Additional logic could parse these properties further to identify all tables in a database that contain PII.

In [0]:
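def parse_table_keys(database, table=None):
    """Map each table in `database` to its user-defined table properties."""
    table_keys = {}
    if table:
        query = f"SHOW TABLES IN {database} LIKE '{table}'"
    else:
        query = f"SHOW TABLES IN {database}"
    for table_item in spark.sql(query).collect():
        table_name = table_item[1]
        # Read the value of the 'Table Properties' row as a single string
        properties = (spark.sql(f"DESCRIBE EXTENDED {database}.{table_name}")
                      .filter("col_name = 'Table Properties'")
                      .collect()[0][1])
        # Drop the enclosing first and last characters, then split into "key=value" strings
        key_values = properties[1:-1].split(",")
        # Keep only user-defined keys; Delta Lake settings start with "delta."
        table_keys[table_name] = [kv for kv in key_values if not kv.startswith("delta.")]
    return table_keys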

parse_table_keys(DA.db_name)   
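
The follow-up step mentioned above, identifying tables that contain PII, could look like the sketch below. It works on the dictionary shape that `parse_table_keys` returns (table name mapped to a list of `"key=value"` strings) and needs no Spark session. `tables_with_pii` is a hypothetical helper, and the sample dictionary is hand-written for illustration.

```python
def tables_with_pii(table_keys, flag="contains_pii"):
    """Return the sorted names of tables whose `flag` property is true."""
    flagged = []
    for table_name, key_values in table_keys.items():
        properties = {}
        for key_value in key_values:
            if "=" not in key_value:
                continue  # skip entries that are not key=value pairs
            key, value = key_value.split("=", 1)
            properties[key.strip()] = value.strip()
        if properties.get(flag, "").lower() == "true":
            flagged.append(table_name)
    return sorted(flagged)


# Hand-written sample in the shape returned by parse_table_keys.
sample_keys = {
    "pii_test": ["contains_pii=true"],
    "pii_test_2": ["contains_pii=true"],
    "challenge": ["contains_pii=false"],
}
pii_tables = tables_with_pii(sample_keys)
print(pii_tables)
```

Tables without the flag are treated as not containing PII here; a stricter audit might report them separately as untagged.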

## Code Challenge

Use the following cell to:
1. Create a new **managed** table using the database and table name provided
1. Define 4 columns with <a href="https://spark.apache.org/docs/latest/sql-ref-datatypes.html" target="_blank">any valid data type</a>
1. Add a table comment
1. Use the **`TBLPROPERTIES`** option to set the key-value pair **`'contains_pii' = False`**

In [0]:
%sql
-- ANSWER
CREATE TABLE IF NOT EXISTS ${da.db_name}.challenge
(col1 STRING, col2 DATE, col3 DOUBLE, col4 INT)
COMMENT "This is a table"
TBLPROPERTIES ('contains_pii' = False)

Run the checks below to confirm that the table was created as specified:

In [0]:
assert len(spark.sql(f"SHOW TABLES IN {DA.db_name} LIKE 'challenge'").collect()) > 0, f"Table 'challenge' not in {DA.db_name}"
assert parse_table_keys(DA.db_name, 'challenge') == {'challenge': ['contains_pii=false']}, "PII flag not set correctly"
assert len(spark.table(f"{DA.db_name}.challenge").columns) == 4, "Table does not have 4 columns"
assert [x.tableType for x in spark.catalog.listTables(DA.db_name) if x.name=="challenge"] == ["MANAGED"], "Table is not managed"
assert spark.sql(f"DESCRIBE EXTENDED {DA.db_name}.challenge").filter("col_name = 'Comment'").collect() != [], "Table comment not set"
print("All tests passed.")

Run the following cell to delete the tables and files associated with this lesson.

In [0]:
DA.cleanup()