## Working with tables

### Task I - Table creation
* create table messages 
* take data from questions (question_id -> message_id, creation_date, body, user_id)
* partition the table by year (derived from creation_date)

### Task II - Table append
* append new data to the table
* take data from answers, which has the same structure
* partition by year and append to the table messages

### Task III - Partitions overwrite
* overwrite only the partition for the year 2018
* take data from questions but filter only for year 2018
* use insertInto with dynamic overwrite

### Task IV - Tables management
* list all tables that we have in our database
* see the properties of the messages table
* rename the table messages -> posts
* see all partitions that the table has
* see properties of the partition year=2018
* drop the table posts


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year
from pyspark.sql.types import (
    LongType,
    StringType,
    StructField,
    StructType,
    TimestampType,
)

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Tables')
    .config("spark.sql.hive.metastore.version", "1.2.1")
    .config("spark.sql.hive.metastore.jars", "maven")
    .enableHiveSupport()
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# The project root is three directories above the notebook's working directory
project_path = '/'.join(base_path.split('/')[:-3])

questions_input_path = os.path.join(project_path, 'output/questions-transformed')

answers_input_path = os.path.join(project_path, 'data/answers')

messages_path = os.path.join(project_path, 'output/tables/messages')
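
The path handling above drops the last three components of the working directory to reach the project root. As a minimal sketch, the same idea can be expressed with `pathlib`. The helper `project_root` and the example directory are hypothetical and used only for illustration; `PurePosixPath` is used so the result does not depend on the operating system.

```python
from pathlib import PurePosixPath


def project_root(cwd: str, levels_up: int = 3) -> str:
    """Return the ancestor that sits `levels_up` directories above `cwd`."""
    # parents[0] is the direct parent, so parents[levels_up - 1] is the
    # directory `levels_up` levels above the starting point.
    return str(PurePosixPath(cwd).parents[levels_up - 1])


# Hypothetical notebook location, for illustration only.
cwd = "/home/user/project/notebooks/tables/solutions"
root = project_root(cwd)

# Build an input path relative to the project root.
questions_path = str(PurePosixPath(root) / "output" / "questions-transformed")

print(root)
print(questions_path)
```

For absolute POSIX paths this gives the same result as splitting on `'/'` and slicing with `[:-3]`.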

In [None]:
my_schema = StructType([
    StructField('question_id', LongType()),
    StructField('creation_date', TimestampType()),
    StructField('body', StringType()),
    StructField('user_id', LongType())
])

In [None]:
questionsDF = spark.read.schema(my_schema).parquet(questions_input_path)

In [None]:
answersDF = spark.read.schema(my_schema).parquet(answers_input_path)

### Task I

* create partitioned table
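
Partitioning by year means that the rows for each year are written to their own `year=<value>` subdirectory under the table path. The following is a plain-Python sketch of that layout, not Spark code. The helper names and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def partition_dir(base_path: str, creation_date: datetime) -> str:
    """Return the Hive-style partition directory for a row's creation date."""
    return f"{base_path}/year={creation_date.year}"


def group_by_partition(rows, base_path):
    """Group rows by the partition directory they would be written to."""
    layout = defaultdict(list)
    for row in rows:
        layout[partition_dir(base_path, row["creation_date"])].append(row)
    return dict(layout)


# Hypothetical sample rows with the same columns as the questions data.
rows = [
    {"question_id": 1, "creation_date": datetime(2017, 5, 1), "body": "a", "user_id": 10},
    {"question_id": 2, "creation_date": datetime(2018, 1, 9), "body": "b", "user_id": 11},
    {"question_id": 3, "creation_date": datetime(2018, 7, 3), "body": "c", "user_id": 12},
]

layout = group_by_partition(rows, "output/tables/messages")
for directory in sorted(layout):
    print(directory, len(layout[directory]))
```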

In [None]:
(
    questionsDF
    .withColumn("year", year("creation_date"))
    .repartition("year")
    .write
    .mode("overwrite")
    .partitionBy("year")
    .option("path", messages_path)
    .saveAsTable("messages")
)

### Task II

* append new data to the partitioned table

In [None]:
spark.table("messages").count()

In [None]:
(
    answersDF
    .withColumn("year", year("creation_date"))
    .repartition("year")
    .write
    .mode("append")
    .partitionBy("year")
    .option("path", messages_path)
    .saveAsTable("messages")
)

In [None]:
answersDF.count()

In [None]:
spark.table("messages").count()

### Task III

* overwrite a single partition
* see how insertInto behaves with the different values of partitionOverwriteMode
    * STATIC is the default (with overwrite=True it can replace the whole dataset)
    * DYNAMIC overwrites only the partitions that receive new data
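
The difference between the two modes can be modelled in plain Python. This is a simplified sketch of the semantics, not Spark's implementation: the table is a hypothetical dict mapping a year to its rows, and `insert_into` is an illustrative helper, not a Spark API. It assumes an insert without a static partition specification, as in `insertInto` on a DataFrame.

```python
def insert_into(table, new_rows, overwrite=False, mode="STATIC"):
    """Return the table state after inserting (year, row) pairs.

    `table` maps a partition value (year) to the list of rows it holds.
    """
    # Group the incoming rows by the partition they belong to.
    incoming = {}
    for year, row in new_rows:
        incoming.setdefault(year, []).append(row)

    # Copy the current state so the input is not mutated.
    result = {year: list(rows) for year, rows in table.items()}

    if not overwrite:
        # Append: existing rows stay, new rows are added to their partitions.
        for year, rows in incoming.items():
            result.setdefault(year, []).extend(rows)
        return result

    if mode.upper() == "DYNAMIC":
        # Only partitions that receive data are replaced.
        result.update(incoming)
        return result

    if mode.upper() == "STATIC":
        # Every existing partition is dropped before the new data is written.
        return incoming

    raise ValueError(f"Unknown partitionOverwriteMode: {mode}")


table = {2017: ["q1"], 2018: ["q2", "a1"]}
new_rows = [(2018, "q2")]

print(insert_into(table, new_rows))
print(insert_into(table, new_rows, overwrite=True, mode="DYNAMIC"))
print(insert_into(table, new_rows, overwrite=True, mode="STATIC"))
```

In this model only the DYNAMIC overwrite keeps the 2017 partition while replacing the 2018 one, which is the behaviour Task III asks for.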

In [None]:
spark.conf.get("spark.sql.sources.partitionOverwriteMode")

In [None]:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")

In [None]:
(
    questionsDF
    .withColumn("year", year("creation_date"))
    .filter(col('year') == 2018)
).count()

In [None]:
(
    spark.table('messages')
    .filter(col('year') == 2018)
).count()

In [None]:
# First, insertInto with overwrite=False, which appends to the existing partition

(
    questionsDF
    .withColumn("year", year("creation_date"))
    .filter(col('year') == 2018)
    .repartition("year")
    .write
    .insertInto("messages", overwrite=False)
)
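
One thing to keep in mind with `insertInto` is that it matches columns by position rather than by name, so the DataFrame columns must be in the same order as the table columns. The cell above works because `year` is added last and the partition column is the last column of the table. The following sketch shows a hypothetical helper, `align_to_table`, that computes a safe column order before inserting.

```python
def align_to_table(df_columns, table_columns):
    """Return the DataFrame columns in the table's column order.

    Raises ValueError when the two column sets differ, because a
    position-based insert would otherwise write values into the wrong columns.
    """
    missing = [c for c in table_columns if c not in df_columns]
    extra = [c for c in df_columns if c not in table_columns]
    if missing or extra:
        raise ValueError(f"Column mismatch: missing={missing}, extra={extra}")
    return list(table_columns)


# Column lists as they would come from df.columns and spark.table(...).columns.
df_columns = ["year", "question_id", "creation_date", "body", "user_id"]
table_columns = ["question_id", "creation_date", "body", "user_id", "year"]

print(align_to_table(df_columns, table_columns))
```

With a Spark session available, the result could be used as `df.select(*align_to_table(df.columns, spark.table("messages").columns)).write.insertInto("messages")`.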

In [None]:
(
    spark.table('messages')
    .filter(col('year') == 2018)
).count()

In [None]:
spark.read.schema(my_schema).parquet(messages_path).filter(col('year') == 2018).count()

In [None]:
# With DYNAMIC mode, overwrite=True replaces only the partitions present in the DataFrame (here year=2018)

(
    questionsDF
    .withColumn("year", year("creation_date"))
    .filter(col('year').isin([2018]))
    .repartition("year")
    .write
    .insertInto("messages", overwrite=True)
)

In [None]:
spark.read.schema(my_schema).parquet(messages_path).filter(col('year') == 2018).count()

In [None]:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")

In [None]:
# With overwrite=False the data is appended to the partition, also in STATIC mode

(
    questionsDF
    .withColumn("year", year("creation_date"))
    .filter(col('year') == 2018)
    .repartition("year")
    .write
    .insertInto("messages", overwrite=False)
)

In [None]:
spark.read.schema(my_schema).parquet(messages_path).filter(col('year') == 2018).count()

In [None]:
# STATIC with overwrite=True replaces the entire table, so only the inserted 2018 rows remain
(
    questionsDF
    .withColumn("year", year("creation_date"))
    .filter(col('year') == 2018)
    .repartition("year")
    .write
    .insertInto("messages", overwrite=True)
)

In [None]:
spark.read.schema(my_schema).parquet(messages_path).filter(col('year') == 2018).count()

### Task IV - Tables management
* list all tables that we have in our database
* see the properties of the messages table
* rename the table messages -> posts
* see all partitions that the table has
* see properties of the partition year=2018
* drop the table posts

In [None]:
spark.catalog.listTables()

In [None]:
# See properties on a table

spark.sql("DESC EXTENDED messages").show(n=50)

In [None]:
# Change the name of the table

spark.sql("ALTER TABLE messages RENAME TO posts")

In [None]:
spark.table("posts").count()

In [None]:
# See partitions of the table

spark.sql("SHOW PARTITIONS posts").show()
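
`SHOW PARTITIONS` returns one row per partition, with the specification as a string such as `year=2018`. As a small sketch, a hypothetical helper `parse_partition` can turn such a string into a dict, which is handy when the partition values are needed programmatically. It assumes plain values without escaped characters.

```python
def parse_partition(spec: str) -> dict:
    """Parse a partition spec like 'year=2018/month=1' into a dict of strings."""
    parts = {}
    for item in spec.split("/"):
        # Split each 'key=value' pair on the first '=' only.
        key, _, value = item.partition("=")
        parts[key] = value
    return parts


# Example strings in the format produced by SHOW PARTITIONS.
specs = ["year=2017", "year=2018"]
years = sorted(int(parse_partition(s)["year"]) for s in specs)
print(years)
```

With a Spark session available, the strings could come from `[row["partition"] for row in spark.sql("SHOW PARTITIONS posts").collect()]`.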

In [None]:
# See properties of a single partition

spark.sql("DESC FORMATTED posts PARTITION (year=2018)").show(n=50, truncate=50)

In [None]:
# Drop the table

spark.sql('DROP TABLE posts')