## Working with tables

### Task I - Table creation
* create the table messages
* take data from questions (question_id, creation_date, body, user_id)
* partition the table by year (derived from creation_date)

### Task II - Table append
* append new data to the table
* take data from answers, which has the same structure
* partition by year and append to the table messages

### Task III - Partitions overwrite
* overwrite only the partition for the year 2018
* take data from questions, filtered to the year 2018
* use insertInto with dynamic overwrite

### Task IV - Tables management
* list all tables that we have in our database
* see the properties of the messages table
* rename the table messages -> posts
* see all partitions that the table has
* see properties of the partition year=2018
* drop the table posts


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year
from pyspark.sql.types import StructType, StructField, LongType, TimestampType, StringType

import os

In [2]:
spark = (
    SparkSession
    .builder
    .appName('Tables')
    #.config("spark.sql.hive.metastore.version", "1.2.1")
    #.config("spark.sql.hive.metastore.jars", "maven")
    .enableHiveSupport()
    .getOrCreate()
)



In [3]:
spark.sql("show tables").show()

+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|  default|    posts|      false|
+---------+---------+-----------+



In [4]:
base_path = os.getcwd()

project_path = ('/').join(base_path.split('/')[0:-3]) 

questions_input_path = os.path.join(project_path, 'output/questions-transformed')

answers_input_path = os.path.join(project_path, 'data/answers')

messages_path = os.path.join(project_path, 'output/tables/messages')
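The string split in the cell above assumes `/` as the path separator. A minimal sketch of the same derivation with `pathlib`, assuming (as the cell does) that the notebook sits three directories below the project root. The helper name `project_root` and the example working directory are illustrative only:

```python
from pathlib import PurePosixPath


def project_root(cwd: str, levels_up: int = 3) -> PurePosixPath:
    """Return the directory `levels_up` levels above `cwd`."""
    return PurePosixPath(cwd).parents[levels_up - 1]


# Hypothetical working directory, used only for illustration.
root = project_root("/home/student/project/notebooks/tables/solutions")
messages_dir = root / "output" / "tables" / "messages"
print(messages_dir)  # /home/student/project/output/tables/messages
```

Using `parents` avoids the manual index arithmetic on the split list.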

In [5]:
my_schema = StructType([
    StructField('question_id', LongType()),
    StructField('creation_date', TimestampType()),
    StructField('body', StringType()),
    StructField('user_id', LongType())
])

In [6]:
questionsDF = spark.read.schema(my_schema).parquet(questions_input_path)

In [7]:
answersDF = spark.read.schema(my_schema).parquet(answers_input_path)

### Task I

* create partitioned table `messages`, partition it by `year`
* use [year](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.year) to derive the partition column from `creation_date`
* save it at `messages_path`

In [8]:
(
    questionsDF
    .withColumn("year", year("creation_date"))
    .repartition("year")
    .write
    .mode("overwrite")
    .partitionBy("year")
    .option("path", messages_path)
    .saveAsTable("messages")
)
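`partitionBy("year")` writes each distinct year into its own `year=<value>` subdirectory under the table path (the Hive-style layout). The following is a plain-Python sketch of that grouping, not Spark code. The sample rows and the `partition_dir` helper are made up for illustration:

```python
from collections import defaultdict
from datetime import datetime


def partition_dir(created: datetime) -> str:
    """Hive-style directory name for the year partition of one row."""
    return f"year={created.year}"


# Made-up (question_id, creation_date) rows standing in for questionsDF.
rows = [
    (1, datetime(2018, 3, 1)),
    (2, datetime(2018, 7, 9)),
    (3, datetime(2019, 1, 15)),
]

# Group row ids by the directory they would be written to.
layout = defaultdict(list)
for question_id, created in rows:
    layout[partition_dir(created)].append(question_id)

for directory, ids in sorted(layout.items()):
    print(directory, ids)
```

The `.repartition("year")` call before the write shuffles all rows of one year into the same task, so each partition directory typically ends up with a single file instead of one file per task.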

### Task II

* append new data to the table `messages`
* the new data is the `answersDF`

In [10]:
spark.table("messages").count()

195179

In [11]:
(
    answersDF
    .withColumn("year", year("creation_date"))
    .repartition("year")
    .write
    .mode("append")
    .partitionBy("year")
    .option("path", messages_path)
    .saveAsTable("messages")
)

In [12]:
answersDF.count()

298094

In [13]:
spark.table("messages").count() # you should see here that the append was successful

493273
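The three counts above should satisfy a simple invariant: rows before the append plus appended rows equals rows after. A small check using the numbers printed in this notebook (the function name is illustrative):

```python
def append_is_consistent(before: int, appended: int, after: int) -> bool:
    """True when the table grew by exactly the number of appended rows."""
    return before + appended == after


# Counts taken from the cells above.
print(append_is_consistent(195179, 298094, 493273))  # True
```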

### Task III

* overwrite a single partition in the table `messages`
* take the rows of `questionsDF` with year 2018 and use them to overwrite the 2018 partition in `messages`
* try `insertInto` under both values of `spark.sql.sources.partitionOverwriteMode`
 * STATIC is the default (with `overwrite=True` it replaces the whole table)
 * DYNAMIC overwrites only the partitions present in the incoming data

In [14]:
spark.conf.get("spark.sql.sources.partitionOverwriteMode")

'STATIC'

In [15]:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")

In [16]:
(
    questionsDF
    .withColumn("year", year("creation_date"))
    .filter(col('year') == 2018)
).count()

18308

In [17]:
(
    spark.table('messages')
    .filter(col('year') == 2018)
).count()

42693

In [18]:
# Let's first see the overwrite=False option, which appends rows to the partition

(
    questionsDF
    .withColumn("year", year("creation_date"))
    .filter(col('year') == 2018)
    .repartition("year")
    .write
    .insertInto("messages", overwrite=False)
)

In [19]:
(
    spark.table('messages')
    .filter(col('year') == 2018)
).count()

61001
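`insertInto` matches DataFrame columns to table columns by position, not by name, so the DataFrame must list its columns in the table's order with the partition column last. The cells here satisfy that because `withColumn("year", ...)` adds `year` at the end. Below is a plain-Python sketch of a check one could run on `df.columns` and `spark.table("messages").columns` beforehand. The `check_insert_order` helper is hypothetical and not part of PySpark:

```python
from typing import List


def check_insert_order(df_columns: List[str], table_columns: List[str]) -> List[str]:
    """Return the select order that matches the table, or raise if the column sets differ."""
    if sorted(df_columns) != sorted(table_columns):
        raise ValueError(f"column sets differ: {df_columns} vs {table_columns}")
    return list(table_columns)


table_cols = ["question_id", "creation_date", "body", "user_id", "year"]
# A DataFrame whose columns arrive in a different order.
df_cols = ["year", "question_id", "creation_date", "body", "user_id"]

print(check_insert_order(df_cols, table_cols))
```

With a real DataFrame, `df.select(*check_insert_order(df.columns, spark.table("messages").columns))` would reorder the columns before calling `insertInto`.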

In [20]:
# overwrite=True option which overwrites the partition

(
    questionsDF
    .withColumn("year", year("creation_date"))
    .filter(col('year').isin([2018]))
    .repartition("year")
    .write
    .insertInto("messages", overwrite=True)
)

In [21]:
spark.table('messages').filter(col('year') == 2018).count() # this partition was overwritten

18308

In [22]:
spark.table('messages').filter(col('year') == 2019).count() # this partition didn't change

44977

#### Task IIIb

Let's see what happens in the default STATIC mode.

In [23]:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")

In [24]:
# STATIC with overwrite=False appends rows to the partition

(
    questionsDF
    .withColumn("year", year("creation_date"))
    .filter(col('year') == 2018)
    .repartition("year")
    .write
    .insertInto("messages", overwrite=False)
)

In [25]:
spark.table('messages').filter(col('year') == 2018).count() # rows were appended to this partition

36616

In [26]:
# STATIC with overwrite=True overwrites the entire table
(
    questionsDF
    .withColumn("year", year("creation_date"))
    .filter(col('year') == 2018)
    .repartition("year")
    .write
    .insertInto("messages", overwrite=True)
)

In [27]:
spark.table('messages').filter(col('year') == 2018).count() # this partition was overwritten

18308

In [28]:
spark.table('messages').filter(col('year') == 2019).count() # all other partitions were deleted

0
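The four `insertInto` runs above can be replayed with a toy model that tracks only row counts per partition. This is a plain-Python sketch of the observed behaviour, not Spark's implementation. The `insert_into` function is hypothetical, and only the two partitions whose counts the notebook prints are modeled:

```python
from typing import Dict


def insert_into(table: Dict[int, int], new: Dict[int, int],
                overwrite: bool, mode: str) -> Dict[int, int]:
    """Toy model: `table` and `new` map partition year -> row count."""
    if not overwrite:
        # Append: counts add up regardless of the overwrite mode.
        years = set(table) | set(new)
        return {y: table.get(y, 0) + new.get(y, 0) for y in years}
    if mode == "DYNAMIC":
        # Only partitions present in the new data are replaced.
        return {**table, **new}
    # STATIC: the whole table is replaced by the new data.
    return dict(new)


# Counts taken from the cells above.
messages = {2018: 42693, 2019: 44977}
questions_2018 = {2018: 18308}

messages = insert_into(messages, questions_2018, overwrite=False, mode="DYNAMIC")
print(messages[2018])                    # 61001
messages = insert_into(messages, questions_2018, overwrite=True, mode="DYNAMIC")
print(messages[2018], messages[2019])    # 18308 44977
messages = insert_into(messages, questions_2018, overwrite=False, mode="STATIC")
print(messages[2018])                    # 36616
messages = insert_into(messages, questions_2018, overwrite=True, mode="STATIC")
print(messages[2018], messages.get(2019, 0))  # 18308 0
```

The model makes the difference explicit: with `overwrite=False` the mode is irrelevant, and with `overwrite=True` the mode decides whether untouched partitions survive.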

### Task IV - Tables management
* list all tables that we have in our database
* see the properties of the messages table
* rename the table messages -> posts
* see all partitions that the table has
* see properties of the partition year=2018

Hint:
* check the sql-reference [docs](https://spark.apache.org/docs/latest/sql-ref.html)
* check [catalog API](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Catalog)

In [29]:
spark.catalog.listTables()

[Table(name='messages', database='default', description=None, tableType='EXTERNAL', isTemporary=False)]

In [30]:
# See properties on a table

spark.sql("DESC EXTENDED messages").show(n=50)

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|         question_id|              bigint|   null|
|       creation_date|           timestamp|   null|
|                body|              string|   null|
|             user_id|              bigint|   null|
|                year|                 int|   null|
|# Partition Infor...|                    |       |
|          # col_name|           data_type|comment|
|                year|                 int|   null|
|                    |                    |       |
|# Detailed Table ...|                    |       |
|            Database|             default|       |
|               Table|            messages|       |
|               Owner|             student|       |
|        Created Time|Sat Nov 13 08:20:...|       |
|         Last Access|             UNKNOWN|       |
|          Created By|         Spark 3.2.0|       |
|           

In [31]:
# Change the name of the table

spark.sql("ALTER TABLE messages RENAME TO posts")

DataFrame[]

In [32]:
spark.table("posts").count()

18308

In [33]:
# See partitions of the table

spark.sql("SHOW PARTITIONS posts").show()

+---------+
|partition|
+---------+
|year=2018|
+---------+



In [34]:
# See properties of a single partition

spark.sql("DESC FORMATTED posts PARTITION (year=2018)").show(n=50, truncate=50)

+--------------------------------+--------------------------------------------------+-------+
|                        col_name|                                         data_type|comment|
+--------------------------------+--------------------------------------------------+-------+
|                     question_id|                                            bigint|   null|
|                   creation_date|                                         timestamp|   null|
|                            body|                                            string|   null|
|                         user_id|                                            bigint|   null|
|                            year|                                               int|   null|
|         # Partition Information|                                                  |       |
|                      # col_name|                                         data_type|comment|
|                            year|                          

In [None]:
# Drop the table (it is EXTERNAL, so only the metadata is removed; the data files stay on disk)

spark.sql('DROP TABLE posts')