# So far we have seen the following:

- How to create a database
- How to create a table
- How to load data from a temp view into a table
- We created a managed table, and when we dropped it:
    - Both the data and the metadata were deleted

## Types of table:

**1. Managed Table** 

```python
# Create an empty managed table (no LOCATION clause)
spark.sql("""
    CREATE TABLE my_db_spark.orders (
        order_id integer,
        order_date string,
        customer_id integer,
        order_status string
    )
    USING csv
""")

# Load data into the table from the temp view
spark.sql("""
    INSERT INTO my_db_spark.orders
    SELECT *
    FROM orders_view
""")
```
            
**2. External Table** 
```python
# No data is loaded; the table just points to the existing data location
spark.sql("""
    CREATE TABLE my_db_spark.orders (
        order_id integer,
        order_date string,
        customer_id integer,
        order_status string
    )
    USING csv
    LOCATION '<S3:Path>'
""")
```
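The two statements above differ only in the `LOCATION` clause. As a minimal sketch of that difference, the hypothetical helper below (`create_table_sql` is not a Spark API, just an illustration) builds both statements from the same column list. It is plain Python and does not need a running cluster.

```python
def create_table_sql(table, columns, fmt, location=None):
    """Build a CREATE TABLE statement; adding LOCATION makes the table external."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    sql = f"CREATE TABLE {table} ({cols}) USING {fmt}"
    if location is not None:
        sql += f" LOCATION '{location}'"
    return sql


ORDERS_COLUMNS = [
    ("order_id", "integer"),
    ("order_date", "string"),
    ("customer_id", "integer"),
    ("order_status", "string"),
]

# Managed: no LOCATION, Spark stores the data in its warehouse directory.
managed_sql = create_table_sql("my_db_spark.orders", ORDERS_COLUMNS, "csv")

# External: LOCATION points at data that already exists (placeholder path kept as-is).
external_sql = create_table_sql(
    "my_db_spark.orders", ORDERS_COLUMNS, "csv", location="<S3:Path>"
)

print(managed_sql)
print(external_sql)
# With a live session, either string can be passed to spark.sql(...)
```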

## Create an `external` Table

- We don't own the data
- We own ONLY the metadata
- We cannot delete the data (many others might be using it)
    - When we drop the table, only the metadata is dropped
    - Even `TRUNCATE` fails on an external table

In [1]:
spark.sparkContext.setLogLevel("ERROR")

In [2]:
spark.catalog.currentDatabase()

'default'

In [3]:
spark.sql('SHOW databases').show()

23/07/10 17:57:07 INFO HiveConf: Found configuration file file:/etc/spark/conf.dist/hive-site.xml
23/07/10 17:57:07 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
23/07/10 17:57:07 INFO AWSGlueClientFactory: Using region from ec2 metadata : us-east-2


+--------------------+
|           namespace|
+--------------------+
|db_youtube_analytics|
| db_youtube_cleansed|
|      db_youtube_raw|
|             default|
|        dev_feedback|
+--------------------+



In [4]:
spark.sql('CREATE DATABASE IF NOT EXISTS my_db_spark')

23/07/10 17:57:09 INFO FileUtils: Creating directory if it doesn't exist: hdfs://ip-172-31-2-35.us-east-2.compute.internal:8020/user/spark/warehouse/my_db_spark.db


DataFrame[]

In [5]:
spark.sql('SHOW databases').show()

+--------------------+
|           namespace|
+--------------------+
|db_youtube_analytics|
| db_youtube_cleansed|
|      db_youtube_raw|
|             default|
|        dev_feedback|
|         my_db_spark|
+--------------------+



In [6]:
spark.sql('USE my_db_spark')

DataFrame[]

In [7]:
# spark.catalog.currentDatabase()
spark.sql("SELECT current_database()").show()

+------------------+
|current_database()|
+------------------+
|       my_db_spark|
+------------------+



                                                                                

In [8]:
spark.sql('SHOW TABLES').show()

+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
+---------+---------+-----------+



In [9]:
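# LOCATION makes this an external table; 'header' and 'inferSchema' are CSV reader options and have no effect on a Parquet table
spark.sql("CREATE TABLE my_db_spark.orders \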
           (order_id integer, \
            order_date string, \
            customer_id integer, \
            order_status string) \
            USING parquet \
            OPTIONS ('header'='true', 'inferSchema'='true') \
            LOCATION 's3://fcc-spark-example/dataset/2023/my_orders'")

23/07/10 17:57:14 INFO SQLStdHiveAccessController: Created SQLStdHiveAccessController for session context : HiveAuthzSessionContext [sessionString=b4ddbb69-0a1f-4cda-a150-09d59c0d830b, clientType=HIVECLI]
23/07/10 17:57:14 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
23/07/10 17:57:14 INFO AWSCatalogMetastoreClient: Mestastore configuration hive.metastore.filter.hook changed from org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl to org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook
23/07/10 17:57:14 INFO AWSGlueClientFactory: Using region from ec2 metadata : us-east-2
23/07/10 17:57:14 INFO AWSGlueClientFactory: Using region from ec2 metadata : us-east-2


DataFrame[]

In [10]:
spark.sql('SELECT * FROM orders').show(5)

+--------+--------------------+-----------+---------------+
|order_id|          order_date|customer_id|   order_status|
+--------+--------------------+-----------+---------------+
|       1|2013-07-25 00:00:...|      11599|         CLOSED|
|       2|2013-07-25 00:00:...|        256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|      12111|       COMPLETE|
|       4|2013-07-25 00:00:...|       8827|         CLOSED|
|       5|2013-07-25 00:00:...|      11318|       COMPLETE|
+--------+--------------------+-----------+---------------+
only showing top 5 rows



                                                                                

In [11]:
## There are many different options available 

# spark.sql("""
#   CREATE TABLE my_db_spark.orders2 (
#     order_id integer,
#     order_date string,
#     customer_id integer,
#     order_status string
#   )
#   USING csv
#   OPTIONS (
#     'path' 's3://fcc-spark-example/dataset/2023/orders.csv',
#     'header' 'true',
#     'sep' ',',
#     'inferSchema' 'true',
#     'mode' 'FAILFAST',
#     'quote' '"',
#     'escape' '"',
#     'multiline' 'true',
#     'charset' 'UTF-8'
#   )
# """)


In [12]:
spark.sql('show tables').show()

+-----------+---------+-----------+
|  namespace|tableName|isTemporary|
+-----------+---------+-----------+
|my_db_spark|   orders|      false|
+-----------+---------+-----------+



In [13]:
spark.sql('DESCRIBE EXTENDED orders').show()

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|            order_id|                 int|   null|
|          order_date|              string|   null|
|         customer_id|                 int|   null|
|        order_status|              string|   null|
|                    |                    |       |
|# Detailed Table ...|                    |       |
|            Database|         my_db_spark|       |
|               Table|              orders|       |
|               Owner|              hadoop|       |
|        Created Time|Mon Jul 10 17:57:...|       |
|         Last Access|             UNKNOWN|       |
|          Created By|  Spark 3.3.0-amzn-1|       |
|                Type|            EXTERNAL|       |
|            Provider|             parquet|       |
|            Location|s3://fcc-spark-ex...|       |
|       Serde Library|org.apache.hadoop...|       |
|         In

##### You can run this in a separate shell and still see the data, because it is a table and not a temp view
```python
>>> spark.sql('SHOW tables').show()
+-----------+---------+-----------+
|  namespace|tableName|isTemporary|
+-----------+---------+-----------+
|my_db_spark|   orders|      false|
+-----------+---------+-----------+

>>> 
>>> spark.sql('SELECT * FROM orders').show(5)
+--------+--------------------+-----------+---------------+                     
|order_id|          order_date|customer_id|   order_status|
+--------+--------------------+-----------+---------------+
|       1|2013-07-25 00:00:...|      11599|         CLOSED|
|       2|2013-07-25 00:00:...|        256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|      12111|       COMPLETE|
|       4|2013-07-25 00:00:...|       8827|         CLOSED|
|       5|2013-07-25 00:00:...|      11318|       COMPLETE|
+--------+--------------------+-----------+---------------+
only showing top 5 rows

```

In [14]:
## This would FAIL: TRUNCATE TABLE is not allowed on external tables

# spark.sql('TRUNCATE table orders')

- **External Table**
    - We own ONLY the metadata
    - The data lives somewhere else; here it is on S3
    - The data may be used by others, so we have no right to modify or delete it because we don't own it

- **Managed Table**
    - We OWN BOTH the metadata and the data
    - We can do whatever we want with it
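To find out which kind a given table is, one option is to read the `Type` row from the `DESCRIBE EXTENDED` output. The helper below is a small sketch; `table_type` is a hypothetical name, and the sample rows are copied from the output shown earlier so the block runs without a cluster.

```python
def table_type(describe_rows):
    """Return the 'Type' value (e.g. EXTERNAL or MANAGED) from DESCRIBE EXTENDED rows."""
    for col_name, data_type, _comment in describe_rows:
        if col_name.strip() == "Type":
            return data_type.strip()
    return None  # no 'Type' row found


# A few (col_name, data_type, comment) rows taken from the DESCRIBE EXTENDED output.
sample_rows = [
    ("order_id", "int", None),
    ("Database", "my_db_spark", ""),
    ("Table", "orders", ""),
    ("Type", "EXTERNAL", ""),
    ("Provider", "parquet", ""),
]

print(table_type(sample_rows))  # EXTERNAL

# With a live session (assumption: `spark` is an active SparkSession):
# rows = [tuple(r) for r in spark.sql("DESCRIBE EXTENDED orders").collect()]
# print(table_type(rows))
```

A check like this could guard destructive statements such as `TRUNCATE`, which only make sense on tables we own. Note that the sketch would be fooled by a table that has a column literally named `Type`. Passing `truncate=False` to `show()` also prints the full `Location` value instead of the shortened one.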

### DML Operations 
- INSERT - works, but it is mostly meant for OLTP applications and is not an ideal fit for Spark
- UPDATE - doesn't work (it works in Databricks, using Delta Lake)
- DELETE - doesn't work (it works in Databricks, using Delta Lake)
- SELECT - always works

(This is the behavior in open-source Apache Spark.)

In [15]:
spark.sql("INSERT INTO TABLE my_db_spark.orders VALUES (9988, '2023-05-23', 1234, 'COMPLETED')")

DataFrame[]

In [16]:
spark.sql("INSERT INTO TABLE my_db_spark.orders VALUES (9989, '2023-05-23', 1235, 'COMPLETED')")

                                                                                

DataFrame[]

In [17]:
# We will not see any data here: the table is external, so its files live at the S3 LOCATION, not in the warehouse directory

!hadoop fs -ls hdfs://ip-172-31-2-35.us-east-2.compute.internal:8020/user/spark/warehouse/my_db_spark.db/

In [18]:
spark.sql('SELECT * FROM my_db_spark.orders WHERE order_id IN (9988, 9989)').show()

+--------+--------------------+-----------+---------------+
|order_id|          order_date|customer_id|   order_status|
+--------+--------------------+-----------+---------------+
|    9988|2013-09-25 00:00:...|      11739|SUSPECTED_FRAUD|
|    9989|2013-09-25 00:00:...|       4865|       COMPLETE|
|    9989|          2023-05-23|       1235|      COMPLETED|
|    9989|          2023-05-23|       1235|      COMPLETED|
|    9989|          2023-05-23|       1235|      COMPLETED|
|    9988|          2023-05-23|       1234|      COMPLETED|
|    9988|          2023-05-23|       1234|      COMPLETED|
|    9989|          2023-05-23|       1235|      COMPLETED|
|    9989|          2023-05-23|       1235|      COMPLETED|
|    9988|          2023-05-23|       1234|      COMPLETED|
|    9988|          2023-05-23|       1234|      COMPLETED|
|    9988|          2023-05-23|       1234|      COMPLETED|
|    9988|          2023-05-23|       1234|      COMPLETED|
|    9989|          2023-05-23|       12

```bash 
[hadoop@ip-172-31-2-35 ~]$ aws s3 ls s3://fcc-spark-example/dataset/2023/my_orders/
2023-07-10 17:27:22          0 _SUCCESS
2023-07-10 17:27:07       1277 part-00000-2733a635-a8dd-4c5f-b1e1-f8789acf6330-c000.snappy.parquet
2023-07-10 17:27:22       1277 part-00000-775b48c5-602f-45d0-8536-e559c1737bf0-c000.snappy.parquet
[hadoop@ip-172-31-2-35 ~]$
```
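The listing above shows small Parquet part files at the table location. Each `INSERT` statement is a separate write, so inserting one row at a time tends to leave many small files behind. One way to reduce that, sketched here under the assumption that the rows are known up front, is to send several rows in a single `INSERT ... VALUES` statement. The helpers `sql_literal` and `insert_values_sql` are hypothetical and handle only integers and simple strings.

```python
def sql_literal(value):
    """Render an int or a simple string as a SQL literal."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        if "'" in value or "\\" in value:
            raise ValueError("quotes and backslashes are not handled in this sketch")
        return f"'{value}'"
    raise TypeError(f"unsupported type: {type(value).__name__}")


def insert_values_sql(table, rows):
    """Build one INSERT statement that carries all rows."""
    if not rows:
        raise ValueError("rows must not be empty")
    tuples = ", ".join(
        "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table} VALUES {tuples}"


rows = [
    (9988, "2023-05-23", 1234, "COMPLETED"),
    (9989, "2023-05-23", 1235, "COMPLETED"),
]
statement = insert_values_sql("my_db_spark.orders", rows)
print(statement)
# With a live session: spark.sql(statement)
```

For larger volumes, loading from a DataFrame or a temp view with `INSERT INTO ... SELECT`, as in the managed table example, is usually a better fit than building literal values.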

# Clean up 

In [19]:
spark.sql("SELECT current_database()").show()

+------------------+
|current_database()|
+------------------+
|       my_db_spark|
+------------------+



In [20]:
spark.sql("DROP TABLE orders")

23/07/10 17:57:28 INFO GlueMetastoreClientDelegate: Enabled to skip drop table partitions


DataFrame[]

In [21]:
spark.sql("SHOW TABLES").show()

+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
+---------+---------+-----------+



In [22]:
spark.sql("DROP DATABASE my_db_spark")

DataFrame[]

In [23]:
spark.sql("USE default")

DataFrame[]

In [24]:
spark.sql("SHOW DATABASES").show() 

+--------------------+
|           namespace|
+--------------------+
|db_youtube_analytics|
| db_youtube_cleansed|
|      db_youtube_raw|
|             default|
|        dev_feedback|
+--------------------+
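The clean-up steps above can be collected into one rerunnable sequence. This is a sketch only: `cleanup_statements` is a hypothetical helper that returns the SQL strings in a safe order (leave the database first, then drop its tables, then drop the database), and `IF EXISTS` keeps a second run from failing.

```python
def cleanup_statements(database, tables, fallback_db="default"):
    """Return the SQL statements for tearing down a scratch database, in order."""
    statements = [f"USE {fallback_db}"]  # step out of the database first
    statements += [f"DROP TABLE IF EXISTS {database}.{t}" for t in tables]
    statements.append(f"DROP DATABASE IF EXISTS {database}")
    return statements


for stmt in cleanup_statements("my_db_spark", ["orders"]):
    print(stmt)
    # With a live session: spark.sql(stmt)
```

Because `orders` is an external table, dropping it removes only the metadata; the files under the S3 location stay where they are.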

