## Working with tables

### Task I - Table creation
* create table messages 
* take data from questions (question_id, creation_date, body, user_id)
* partition the table by year (derived from creation_date)

### Task II - Table append
* append new data to the table
* take data from answers with the same structure
* partition by year & append to the table messages

### Task III - Tables management
* list all tables that we have in our database
* see the properties of the messages table
* rename the table messages -> posts
* see all partitions that the table has
* see the properties of the partition year=2018
* compute and show the statistics for the table posts
* drop the table posts


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year
from pyspark.sql.types import StructType, StructField, LongType, TimestampType, StringType

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Tables')
    .enableHiveSupport()
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# go three directory levels up from the notebook's working directory
project_path = '/'.join(base_path.split('/')[:-3])

questions_input_path = os.path.join(project_path, 'output/questions-transformed')

answers_input_path = os.path.join(project_path, 'data/answers')

messages_path = os.path.join(project_path, 'output/tables/messages')

In [None]:
my_schema = StructType([
    StructField('question_id', LongType()),
    StructField('creation_date', TimestampType()),
    StructField('body', StringType()),
    StructField('user_id', LongType())
])

In [None]:
questionsDF = spark.read.schema(my_schema).parquet(questions_input_path)

In [None]:
answersDF = spark.read.schema(my_schema).parquet(answers_input_path)

### Task I

* create a partitioned table `messages` from `questionsDF`, partitioned by `year`
* use [year](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.year.html#pyspark.sql.functions.year) to derive the partition column from `creation_date`
* save it at `messages_path`
* use [write](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.write.html#pyspark.sql.DataFrame.write) and [saveAsTable](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.saveAsTable.html#pyspark.sql.DataFrameWriter.saveAsTable)
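
To picture what a write partitioned by `year` produces on disk, here is a plain-Python sketch (not Spark code). It mimics the directory layout Spark uses for partitioned tables, where each distinct value of the partition column gets its own `year=<value>` subdirectory under the table path. The sample rows and the `table_root` value are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime
import posixpath

# Hypothetical sample rows: (question_id, creation_date)
rows = [
    (1, datetime(2017, 5, 3, 10, 0)),
    (2, datetime(2018, 1, 15, 8, 30)),
    (3, datetime(2018, 11, 2, 17, 45)),
]

# Stands in for messages_path from the notebook (assumed value)
table_root = "output/tables/messages"

# Group row ids by the directory a year-partitioned write would place them in
layout = defaultdict(list)
for question_id, creation_date in rows:
    partition_dir = posixpath.join(table_root, f"year={creation_date.year}")
    layout[partition_dir].append(question_id)

for partition_dir in sorted(layout):
    print(partition_dir, layout[partition_dir])
```

In the real table, the value of `year` is encoded in the directory name rather than stored inside the data files, which is why a query that filters on `year` can skip whole directories.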

In [None]:
# your code here:

### Task II

* append new data to the table `messages`
* the new data is in `answersDF`
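
One possible shape for the append is sketched below. It is a sketch under assumptions, not the reference solution: the helper name `append_with_year` is hypothetical, and it assumes the DataFrame has a `creation_date` column and that the target table already exists and is partitioned by `year`. The partition column is derived with a SQL expression through `selectExpr`, so the block itself needs no imports.

```python
def append_with_year(df, table_name="messages"):
    """Derive the partition column and append df to an existing partitioned table.

    `df` is expected to be a Spark DataFrame with a `creation_date` column.
    """
    (
        df
        # year(creation_date) is evaluated by Spark as a SQL expression
        .selectExpr("*", "year(creation_date) AS year")
        .write
        .mode("append")        # the default mode would fail because the table exists
        .partitionBy("year")   # has to match the partitioning of the existing table
        .saveAsTable(table_name)
    )
```

In the notebook this would be called as `append_with_year(answersDF)`. In append mode, `saveAsTable` matches columns by name, so the column order of the DataFrame does not have to match the table.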

In [None]:
spark.table("messages").count()

In [None]:
# your code here:


In [None]:
answersDF.count()

In [None]:
spark.table("messages").count() # check that the count grew by answersDF.count() after the append

### Task III - Tables management
* list all tables that we have in our database
* see the properties of the messages table
* rename the table messages -> posts
* see all partitions that the table has
* see the properties of the partition year=2018
* compute and show the statistics for the table posts
* drop the table posts

Hint:
* check the sql-reference [docs](https://spark.apache.org/docs/latest/sql-ref.html)
* check [catalog API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.html#pyspark.sql.Catalog)
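
As a rough orientation, the management steps could map to Spark SQL statements like the ones below. This is one possible set, not the only one: the helper `run_statements` is hypothetical, the table names follow the tasks above, and `year=2018` assumes that partition actually exists in the data.

```python
# Spark SQL statements for the management steps, in the order of the tasks.
statements = [
    "SHOW TABLES",                                    # list all tables
    "DESCRIBE EXTENDED messages",                     # table properties
    "ALTER TABLE messages RENAME TO posts",           # rename messages -> posts
    "SHOW PARTITIONS posts",                          # all partitions
    "DESCRIBE EXTENDED posts PARTITION (year=2018)",  # a single partition
    "ANALYZE TABLE posts COMPUTE STATISTICS",         # compute statistics
    "DESCRIBE EXTENDED posts",                        # statistics appear in the output
    "DROP TABLE posts",                               # drop the table
]


def run_statements(spark, sql_statements):
    """Run each statement on the given SparkSession and print its result."""
    for statement in sql_statements:
        print(statement)
        # statements without a result set return an empty DataFrame
        spark.sql(statement).show(n=100, truncate=False)
```

With the notebook's session this would be run as `run_statements(spark, statements)`. Some steps also have a catalog API counterpart, for example `spark.catalog.listTables()` for listing tables. Note that a table saved with an explicit path is registered as an external table, so dropping it removes the catalog entry but leaves the files in place.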

In [None]:
# list all tables:


In [None]:
# See properties of a table


In [None]:
# Change the name of the table to posts:


In [None]:
# See partitions of the table


In [None]:
# See properties of a single partition:


In [None]:
# Compute the statistics

In [None]:
# Show the computed statistics

In [None]:
# Drop the table posts:


For more information about saving data with Spark, see my [article](https://towardsdatascience.com/notes-about-saving-data-with-spark-3-0-86ba85ca2b71).

In [None]:
spark.stop()