In this notebook you will see how to upsert a Hive table with a new increment of data.

In [None]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import Window
import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('table-upsert')
).getOrCreate()

In [None]:
base_path = os.getcwd()

# the project root is three directory levels above the working directory
project_path = '/'.join(base_path.split('/')[:-3])

users_base_path = os.path.join(project_path, 'data/users_base')
users_increment_path = os.path.join(project_path, 'data/users_increment')
accounts_output_path = os.path.join(project_path, 'output/hive/accounts')

checkpoint_dir = os.path.join(project_path, 'output/checkpoints')

In [None]:
spark.sql('drop table if exists accounts')

### Create a new table

* Take the data from `users_base_path` and save it as a new table named `accounts`.
* Use [saveAsTable](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.saveAsTable.html#pyspark.sql.DataFrameWriter.saveAsTable).
* Use `accounts_output_path` as the location for the table.
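
As a minimal sketch of the steps above, the helper below writes a DataFrame as a table whose data lives at an explicit location. The helper name `save_as_table_at`, the Parquet format, and the `overwrite` mode are assumptions made for illustration, not part of the exercise; adjust the format to whatever `users_base_path` actually contains.

```python
def save_as_table_at(df, table_name, location, file_format="parquet"):
    """Save `df` as a table whose data files live under `location`.

    Hypothetical helper: the name, the default format, and the write mode
    are illustrative assumptions.
    """
    (
        df.write
        .mode("overwrite")          # assumption: replace any existing data
        .format(file_format)        # assumption: Parquet storage
        .option("path", location)   # explicit location for the table data
        .saveAsTable(table_name)
    )
```

In this notebook it could be called as `save_as_table_at(spark.read.parquet(users_base_path), 'accounts', accounts_output_path)`, assuming the base data is stored as Parquet.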

In [None]:
# your code here:


### Verify that the table is created

You can use the following SQL commands:
* show tables
* describe table_name
* describe formatted table_name
* describe extended table_name
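
One way to run the commands above from Python is to pass each statement to `spark.sql` and print the result. The helper names `describe_queries` and `show_table_info` are hypothetical, and the row limit of 100 is an arbitrary choice so that longer `describe` outputs are not cut off at the default of 20 rows.

```python
def describe_queries(table_name):
    """Return the verification statements listed above for one table."""
    return [
        "show tables",
        f"describe {table_name}",
        f"describe formatted {table_name}",
        f"describe extended {table_name}",
    ]


def show_table_info(spark, table_name):
    """Run each statement and print its result without truncating values."""
    for query in describe_queries(table_name):
        # n=100 is an assumption; truncate=False keeps long paths readable
        spark.sql(query).show(n=100, truncate=False)
```

For example, `show_table_info(spark, 'accounts')` prints the table list followed by the three `describe` variants for the `accounts` table.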

In [None]:
# your code here:


### Upsert

* Load the increment into a Spark DataFrame.
  * Use the path `users_increment_path`.
* Upsert the increment into the `accounts` table.
  * Use the approach with union + `row_number`:
    * Add a new column `version` to both DataFrames; use the value 1 for the table and the value 2 for the increment.
    * Union both DataFrames using [unionByName](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unionByName.html#pyspark.sql.DataFrame.unionByName).
    * Create a [window](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.WindowSpec.partitionBy.html#pyspark.sql.WindowSpec.partitionBy) partitioned by `user_id` and sorted by the new `version` column.
    * Call [row_number](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.row_number.html#pyspark.sql.functions.row_number) over this window.
    * You can then filter on the row number to keep, for each `user_id`, only the record with the newest `version` (for example, sort by `version` descending and keep row number 1).
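
The steps above can be sketched as a single function. This is one possible shape under stated assumptions, not the reference solution: the function name `upsert_with_row_number`, the helper column `rn`, and the descending sort are choices made here for illustration, and both inputs are assumed to share the same columns.

```python
def upsert_with_row_number(table_df, increment_df, key="user_id"):
    """Union + row_number upsert: one row per key, the newest version wins."""
    # Imported inside the function so the helper can be defined
    # before a Spark session is running.
    from pyspark.sql import Window
    from pyspark.sql.functions import col, desc, lit, row_number

    # Highest version first within each key, so an increment row gets rn == 1.
    window = Window.partitionBy(key).orderBy(desc("version"))

    return (
        table_df.withColumn("version", lit(1))                     # existing rows
        .unionByName(increment_df.withColumn("version", lit(2)))   # new rows
        .withColumn("rn", row_number().over(window))
        .filter(col("rn") == 1)                                    # keep newest per key
        .drop("rn", "version")                                     # restore original schema
    )
```

A possible call is `result = upsert_with_row_number(spark.table('accounts'), increment_df)`, where `increment_df` is the DataFrame loaded from `users_increment_path`. Keys that appear only in the table or only in the increment have a single row in their partition, so they pass through unchanged.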


In [None]:
# read the increment:


In [None]:
# define the window


In [None]:
# Write the query for the upsert - create a new dataframe called `result`


#### Save the result

Try to overwrite the `accounts` table with the `result` DataFrame.

Notice that running the overwrite will lead to the following error:

`AnalysisException: Cannot overwrite table default.accounts that is also being read from`

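This is because Spark cannot overwrite a table that the same query is still reading from: `result` is evaluated lazily, so its plan still reads from `accounts` at the moment of the write.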

In [None]:
# run the overwrite to see the error:


### Checkpointing

This can be solved using checkpointing.

* Checkpoint the `result` DataFrame using [checkpoint](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.checkpoint.html#pyspark.sql.DataFrame.checkpoint).
* Assign the checkpointed DataFrame to a new variable.
* Run the overwrite with this new checkpointed DataFrame.

Note:
* The checkpoint persists the data at the location set with `setCheckpointDir`.
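
A minimal sketch of the checkpoint-then-overwrite sequence follows. The helper name `checkpoint_then_overwrite` is hypothetical, and the sketch assumes the checkpoint directory has already been set on the Spark context and that the table keeps its explicit location.

```python
def checkpoint_then_overwrite(result, table_name, location):
    """Materialize `result`, then overwrite the table it was derived from."""
    # checkpoint() is eager by default: it writes the data to the checkpoint
    # directory and returns a DataFrame whose plan no longer reads the table.
    checkpointed = result.checkpoint()

    (
        checkpointed.write
        .mode("overwrite")
        .option("path", location)   # assumption: same location as the original table
        .saveAsTable(table_name)
    )
    return checkpointed
```

In this notebook the call could look like `checkpoint_then_overwrite(result, 'accounts', accounts_output_path)`, run after the cell that sets the checkpoint directory.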

In [None]:
spark.sparkContext.setCheckpointDir(checkpoint_dir)

In [None]:
# do the checkpoint:


In [None]:
# save the checkpointed result - the error should no longer be present


In [None]:
spark.stop()