# So far we have seen the following:
- How we can create a `Database`
- How we can create a `Table`
- How to `load` the data from a `tempView` to a `Table` 
- The table we created was a `Managed Table`, so when we **dropped the table**:
    - **Both the data and the metadata were deleted**

## Types of table:

**1. Managed Table** 

```python
# Create an empty table
spark.sql('CREATE TABLE my_db_spark.orders \
       (order_id integer, \
        order_date string, \
        customer_id integer, \
        order_status string)')

# Load data into the table
spark.sql("INSERT INTO my_db_spark.orders \
    SELECT * \
    FROM orders_view")
```
            
**2. External Table** 
```python
# No loading of data, just point to the data location
# (double quotes outside, so the single-quoted LOCATION path does not end the string)
spark.sql("CREATE TABLE my_db_spark.orders \
   (order_id integer, \
    order_date string, \
    customer_id integer, \
    order_status string) \
    USING csv \
    LOCATION '<S3:Path>'")
```
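
The clause that makes the second table external is `LOCATION`: without it, Spark stores the data under its own warehouse directory and manages it. As a minimal sketch, the difference can be made explicit with a small string builder. `create_table_sql` is a hypothetical helper written for illustration, not part of the Spark API, and `<S3:Path>` is the same placeholder used above.

```python
def create_table_sql(table, columns, fmt=None, location=None):
    """Build a CREATE TABLE statement (hypothetical helper, illustration only)."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    sql = f"CREATE TABLE {table} ({cols})"
    if fmt:
        sql += f" USING {fmt}"
    if location:
        # Adding LOCATION is what turns the table into an external table
        sql += f" LOCATION '{location}'"
    return sql


columns = [
    ("order_id", "integer"),
    ("order_date", "string"),
    ("customer_id", "integer"),
    ("order_status", "string"),
]

# Managed: no LOCATION, Spark decides where the data lives
managed_sql = create_table_sql("my_db_spark.orders", columns)

# External: LOCATION points at data that already exists elsewhere
external_sql = create_table_sql(
    "my_db_spark.orders", columns, fmt="csv", location="<S3:Path>"
)

print(managed_sql)
print(external_sql)
```

Either string could then be passed to `spark.sql(...)` in a session where `spark` is available.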

## Create an `external` Table

- We don't own the data
- We own ONLY the metadata
- We cannot delete the data (many others might be using it)
    - When we drop the table, only the metadata is dropped
    - TRUNCATE fails on an external table

In [None]:
spark.catalog.currentDatabase()

In [None]:
spark.sql('CREATE DATABASE IF NOT EXISTS my_db_spark')

In [None]:
spark.sql('USE my_db_spark')

In [None]:
# Creating an external table: LOCATION points at Parquet files that already exist.
# CSV reader options such as 'header' and 'inferSchema' do not apply to Parquet,
# so no OPTIONS clause is needed here.

spark.sql("CREATE TABLE orders \
           (order_id integer, \
            order_date string, \
            customer_id integer, \
            order_status string) \
            USING parquet \
            LOCATION 's3://fcc-spark-example/dataset/2023/my_orders'")

In [None]:
spark.sql('SHOW TABLES').show()

In [None]:
spark.sql('SELECT * FROM orders').show(5)

In [None]:
# There are many different options available 

spark.sql("""
  CREATE TABLE my_db_spark.orders2 (
    order_id integer,
    order_date string,
    customer_id integer,
    order_status string
  )
  USING csv
  OPTIONS (
    'path' 's3://fcc-spark-example/dataset/2023/orders.csv',
    'header' 'true',
    'sep' ',',
    'inferSchema' 'true',
    'mode' 'FAILFAST',
    'quote' '"',
    'escape' '"',
    'multiline' 'true',
    'charset' 'UTF-8'
  )
""")


In [None]:
spark.sql('show tables').show()

In [None]:
spark.sql('DESCRIBE EXTENDED orders').show(truncate=False)
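
The `DESCRIBE EXTENDED` output is long, and the rows that matter here are the ones that say whether the table is managed or external and where its data lives. A minimal sketch for picking them out follows. It assumes the detail rows are labelled `Type`, `Provider`, and `Location`, which is how recent Spark versions print them; `table_details` is a hypothetical helper, and the sample rows are made-up stand-ins so the snippet runs without a Spark session.

```python
def table_details(describe_rows, keys=("Type", "Provider", "Location")):
    """Return selected label -> value pairs from DESCRIBE EXTENDED rows.

    Each row must be indexable as (col_name, data_type, comment), which
    holds for pyspark Row objects as well as for plain tuples.
    """
    return {row[0]: row[1] for row in describe_rows if row[0] in keys}


# Stand-in rows (illustrative only), shaped like DESCRIBE EXTENDED output
sample_rows = [
    ("order_id", "int", None),
    ("", "", ""),
    ("# Detailed Table Information", "", ""),
    ("Type", "EXTERNAL", ""),
    ("Provider", "parquet", ""),
    ("Location", "s3://fcc-spark-example/dataset/2023/my_orders", ""),
]

details = table_details(sample_rows)
print(details)
```

In the notebook, the real rows would come from `spark.sql('DESCRIBE EXTENDED orders').collect()`.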

##### You can run this in a separate shell and still see the data, because it is a table and not a temp view
```python
>>> spark.sql('SHOW tables').show()
+-----------+---------+-----------+
|  namespace|tableName|isTemporary|
+-----------+---------+-----------+
|my_db_spark|   orders|      false|
+-----------+---------+-----------+

>>> 
>>> spark.sql('SELECT * FROM orders').show(5)
+--------+--------------------+-----------+---------------+                     
|order_id|          order_date|customer_id|   order_status|
+--------+--------------------+-----------+---------------+
|       1|2013-07-25 00:00:...|      11599|         CLOSED|
|       2|2013-07-25 00:00:...|        256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|      12111|       COMPLETE|
|       4|2013-07-25 00:00:...|       8827|         CLOSED|
|       5|2013-07-25 00:00:...|      11318|       COMPLETE|
+--------+--------------------+-----------+---------------+
only showing top 5 rows

```

In [None]:
# This would FAIL: TRUNCATE is not allowed on an external table

spark.sql('TRUNCATE table orders')
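
To keep the notebook running past a statement that is expected to fail, the error can be captured instead of raised. The sketch below assumes a hypothetical helper, `try_sql`, that accepts any callable that executes SQL (for example `spark.sql`). It catches a broad `Exception` on purpose so that it does not depend on where a given Spark version defines its exception classes.

```python
def try_sql(run_sql, statement):
    """Run a statement and report failure instead of raising.

    `run_sql` is any callable that executes SQL, e.g. `spark.sql`.
    Returns (True, None) on success or (False, error message) on failure.
    """
    try:
        run_sql(statement)
        return True, None
    except Exception as exc:  # broad on purpose, see the note above
        return False, str(exc)


# In the notebook (assumes an active `spark` session):
# ok, error = try_sql(spark.sql, 'TRUNCATE TABLE orders')
# print(ok, error)
```

On the external `orders` table, `ok` would be expected to come back `False`, with Spark's error text in `error`.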

- **External Table**
    - We ONLY own the metadata
    - The data lives somewhere else; here it is on S3
    - The data may be used by others, and since we don't own it, we have no right to modify or delete it

- **Managed Table**
    - We OWN BOTH the metadata and the data
    - We can do whatever we want with it
    - When this table is dropped, both the data and the metadata are lost
    

### DML Operations
- INSERT - works, but it is meant mostly for OLTP applications and is not an ideal fit for Spark
- UPDATE - doesn't work (it works in Databricks, using Delta Lake)
- DELETE - doesn't work (it works in Databricks, using Delta Lake)
- SELECT - always works

(This is the behavior of open source Apache Spark.)

In [None]:
spark.sql('SELECT * FROM orders').show(5)

In [None]:
spark.sql("INSERT INTO TABLE my_db_spark.orders VALUES (555555, '2023-05-23', 555555, 'COMPLETED')")

In [None]:
spark.sql("INSERT INTO TABLE my_db_spark.orders VALUES (666666, '2023-05-23', 666666, 'COMPLETED')")

In [None]:
spark.sql('SELECT * FROM orders').show()

In [None]:
# We will see ONLY the database folder

!hadoop fs -ls hdfs://ip-172-31-2-35.us-east-2.compute.internal:8020/user/spark/warehouse/

In [None]:
# We will not see any data here 

!hadoop fs -ls hdfs://ip-172-31-2-35.us-east-2.compute.internal:8020/user/spark/warehouse/my_db_spark.db/

In [None]:
spark.sql('SELECT * FROM my_db_spark.orders WHERE order_id IN (555555, 666666)').show()

```bash 
[hadoop@ip-172-31-2-35 ~]$ aws s3 ls s3://fcc-spark-example/dataset/2023/my_orders/
2023-07-10 17:27:22          0 _SUCCESS
2023-07-10 17:27:07       1277 part-00000-2733a635-a8dd-4c5f-b1e1-f8789acf6330-c000.snappy.parquet
2023-07-10 17:27:22       1277 part-00000-775b48c5-602f-45d0-8536-e559c1737bf0-c000.snappy.parquet
[hadoop@ip-172-31-2-35 ~]$
```

# Clean up 

In [None]:
spark.sql("SELECT current_database()").show()

In [None]:
spark.sql("DROP TABLE orders")

In [None]:
spark.sql("SHOW TABLES").show()

In [None]:
spark.sql("DROP DATABASE my_db_spark")

In [None]:
spark.sql("USE default")

In [None]:
spark.sql("SHOW DATABASES").show() 