In this notebook you will:

* think about atomic upserts of Hive tables
* implement a time-travel feature
* do a simple schema evolution

In [None]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, lit, row_number
from pyspark.sql import Window
import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('time-travel')
).getOrCreate()

In [None]:
base_path = os.getcwd()

# the project root is three directories above the notebook's working directory
project_path = '/'.join(base_path.split('/')[:-3])

users_base_path = os.path.join(project_path, 'data/users_base')
users_increment_path = os.path.join(project_path, 'data/users_increment')

accounts_output_path_v1 = os.path.join(project_path, 'output/tables/accounts/1')
accounts_output_path_v2 = os.path.join(project_path, 'output/tables/accounts/2')

tmp_location = os.path.join(project_path, 'output/tmp')

In [None]:
spark.sql('drop table if exists accounts')

### Atomicity

If the save fails for some reason, you may end up with a corrupted table. To avoid that, make the process closer to atomic. Redo the save as follows:

1. Create the original table `accounts` at a new location (use `accounts_output_path_v1`)
2. Do the upsert and save the result at a different location, namely `accounts_output_path_v2`, under a different table name, namely `accounts_v2`
3. Use SQL command `ALTER TABLE` to rename the `accounts` table to `accounts_delete`
4. Use `ALTER TABLE` again to rename `accounts_v2` to `accounts`
5. Use SQL command `DROP TABLE` to delete `accounts_delete`
6. Check how many locations are not null before and after the upsert
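
The upsert in step 2 boils down to keeping, for every `user_id`, the row from the newest source. Below is a minimal pure-Python sketch of that rule, assuming rows are plain dictionaries. The `upsert` and `count_not_null` helpers and the sample rows are hypothetical; they only illustrate the logic, not how Spark executes it.

```python
def upsert(base_rows, increment_rows, key="user_id"):
    """Return base_rows with every row replaced or extended by increment_rows.

    A row from the increment wins over a base row with the same key,
    which mirrors giving the increment the higher version number.
    """
    latest = {row[key]: row for row in base_rows}
    # Later assignments overwrite earlier ones, so the increment takes precedence.
    latest.update({row[key]: row for row in increment_rows})
    return list(latest.values())


def count_not_null(rows, column="location"):
    """Count rows whose column is set, like the notebook's isNotNull() check."""
    return sum(row[column] is not None for row in rows)


# Hypothetical sample data, not taken from the real users dataset.
base = [
    {"user_id": 1, "location": None},
    {"user_id": 2, "location": "Prague"},
]
increment = [
    {"user_id": 1, "location": "Berlin"},  # updates an existing user
    {"user_id": 3, "location": None},      # inserts a new user
]

result = upsert(base, increment)
print(count_not_null(base), count_not_null(result))  # 1 2
```

The count grows here only because the updated row carries a location; an increment row with a null location would replace a filled one just the same.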

In short, you first write the result, and only after it has been written successfully do you switch the table names, so the production table always stays in a consistent state. If the write fails for some reason, you skip the switch and the original table remains untouched.
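
The write-then-switch order can be sketched as one helper. This is an illustration under assumptions, not the notebook's solution: `write_then_swap` is a hypothetical name, `write_staged` stands for whatever function saves `accounts_v2`, and `run_sql` stands for `spark.sql`. Here both are replaced by stand-ins so the sketch runs without a Spark session.

```python
def write_then_swap(write_staged, run_sql,
                    live="accounts", staged="accounts_v2", retired="accounts_delete"):
    """Write the staged table first; rename tables only if the write succeeded."""
    # If this raises, none of the statements below run and `live` is untouched.
    write_staged()
    run_sql(f"DROP TABLE IF EXISTS {retired}")
    run_sql(f"ALTER TABLE {live} RENAME TO {retired}")
    run_sql(f"ALTER TABLE {staged} RENAME TO {live}")
    run_sql(f"DROP TABLE {retired}")


# Dry run with stand-ins: record the statements instead of executing them.
statements = []
write_then_swap(lambda: None, statements.append)


def failing_write():
    raise IOError("simulated write failure")


after_failure = []
try:
    write_then_swap(failing_write, after_failure.append)
except IOError:
    pass  # the live table was never renamed
```

After the dry run, `statements` holds the four SQL commands in order, while `after_failure` stays empty. The two renames are still separate statements, so this narrows the window in which `accounts` is missing rather than removing it.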

In [None]:
# Resave the original table at the location accounts_output_path_v1
(
    spark.read.parquet(users_base_path)
    .write
    .mode('overwrite')
    .option('path', accounts_output_path_v1)
    .saveAsTable('accounts')
)

In [None]:
# read the increment:

increment = spark.read.parquet(users_increment_path)

In [None]:
# check how many locations are not null

spark.table('accounts').filter(col('location').isNotNull()).count()

In [None]:
# define the window

w = Window().partitionBy('user_id').orderBy(desc('version'))

In [None]:
# Do the upsert - save the result at the location accounts_output_path_v2, use a new table_name (accounts_v2)

result = (
    spark.table('accounts').withColumn('version', lit(1))
    .unionByName(
        increment.withColumn('version', lit(2))
    )
    .withColumn('r', row_number().over(w))
    .filter(col('r') == 1)
    .drop('r', 'version')
)

(
    result
    .write
    .mode('overwrite')
    .option('path', accounts_output_path_v2)
    .saveAsTable('accounts_v2')
)

In [None]:
# Run the SQL commands to switch the names
spark.sql('drop table if exists accounts_delete')

spark.sql('ALTER TABLE accounts RENAME TO accounts_delete')
spark.sql('ALTER TABLE accounts_v2 RENAME TO accounts')

In [None]:
# drop the original table (the one that was renamed to accounts_delete)

spark.sql('DROP TABLE accounts_delete')

In [None]:
# check again how many locations are not null:

spark.table('accounts').filter(col('location').isNotNull()).count()

### Time Travel

Now imagine that you made a mistake and did not actually want to do the upsert, so you need to roll the operation back. This is possible because the table was created with an explicit path, which makes it an external table: the `DROP` command removed only the metastore entry and left the data files in place. You can reconstruct the original table as long as you still have the data and know the schema.

1. Create an empty DataFrame with the schema of the `accounts` table (you can use the schema of the new table, because the upsert didn't change it). To create an empty DataFrame use [createDataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.createDataFrame.html#pyspark.sql.SparkSession.createDataFrame)
2. Save the empty DataFrame as a table at a temporary location - use `tmp_location`
3. Use the `ALTER TABLE` command to change the location so that the table points to the data from before the upsert - `accounts_output_path_v1`
4. Now the table is no longer empty, so you can switch the names using `ALTER TABLE` to give it the proper name
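
Steps 3 and 4 amount to a short, fixed sequence of SQL statements. The sketch below only builds those statements as strings, so it runs without Spark. `restore_statements` is a hypothetical helper, and `/path/to/accounts/1` is a placeholder standing in for `accounts_output_path_v1`.

```python
def restore_statements(live, placeholder, old_location, retired):
    """Build the SQL that repoints `placeholder` at old data and makes it `live`."""
    return [
        # Move the table with the unwanted upsert out of the way.
        f"ALTER TABLE {live} RENAME TO {retired}",
        # Point the empty table at the files written before the upsert.
        f'ALTER TABLE {placeholder} SET LOCATION "{old_location}"',
        # Give the repointed table the production name.
        f"ALTER TABLE {placeholder} RENAME TO {live}",
        # Remove the metastore entry of the unwanted version.
        f"DROP TABLE {retired}",
    ]


# "/path/to/accounts/1" is a placeholder for accounts_output_path_v1.
for statement in restore_statements(
    live="accounts",
    placeholder="accounts_empty",
    old_location="/path/to/accounts/1",
    retired="accounts_to_delete",
):
    print(statement)
```

In the notebook, each statement would be passed to `spark.sql` instead of being printed.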


In [None]:
# create empty DataFrame:

empty_table = spark.createDataFrame([], spark.table('accounts').schema)

In [None]:
# save it as an empty table at a temporary location
(
    empty_table
    .write
    .mode('overwrite')
    .option('path', tmp_location)
    .saveAsTable('accounts_empty')
)

In [None]:
# Change the location of the empty table and switch the names
# Drop the table with the wrong upsert

spark.sql('ALTER TABLE accounts RENAME TO accounts_to_delete')

spark.sql(f'ALTER TABLE accounts_empty SET LOCATION "{accounts_output_path_v1}"')

spark.sql('ALTER TABLE accounts_empty RENAME TO accounts')

spark.sql('DROP TABLE accounts_to_delete')

In [None]:
# check again how many locations are not null:

spark.table('accounts').filter(col('location').isNotNull()).count()

In [None]:
spark.table('accounts').printSchema()

### Schema Evolution - drop the column `about`

* First try to use the `ALTER TABLE ... DROP COLUMN` statement
  * this won't work
* Use a similar approach as before
  * create a new table at an empty location, with a modified schema that omits the column
  * then change the location of the empty table to point to the data
  * finally switch the names
  * verify that the modified table has the new schema
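
The "modified schema" is simply the old list of columns minus one. Below is a small pure-Python sketch of that step, assuming the schema is given as `(name, type)` pairs, which is the shape of a DataFrame's `dtypes` attribute. `schema_without` is a hypothetical helper, and the example columns are made up for illustration.

```python
def schema_without(dtypes, column):
    """Return a 'name: type, ...' schema string that omits `column`.

    `dtypes` is a list of (name, type) pairs, the shape of DataFrame.dtypes.
    """
    kept = [(name, dtype) for name, dtype in dtypes if name != column]
    if len(kept) == len(dtypes):
        # Nothing was filtered out, so the column does not exist.
        raise ValueError(f"column {column!r} not found")
    return ", ".join(f"{name}: {dtype}" for name, dtype in kept)


# Hypothetical columns; the real accounts table may have different ones.
example_dtypes = [("user_id", "bigint"), ("location", "string"), ("about", "string")]
print(schema_without(example_dtypes, "about"))  # user_id: bigint, location: string
```

With a live session, `spark.table('accounts').dtypes` would supply the pairs, and a string in this format can be passed as the `schema` argument of `createDataFrame`. Calling `drop('about').schema` on the table is the shorter route to the same schema.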

In [None]:
# fails with: The feature is not supported

#spark.sql('alter table accounts drop column about')

In [None]:
# create an empty DataFrame without the about column:

empty_table = spark.createDataFrame([], spark.table('accounts').drop('about').schema)

In [None]:
# save it as an empty table at a temporary location
(
    empty_table
    .write
    .mode('overwrite')
    .option('path', tmp_location)
    .saveAsTable('accounts_empty')
)

In [None]:
# Change the location of the empty table and switch the names
# Drop the original table

spark.sql('ALTER TABLE accounts RENAME TO accounts_to_delete')

spark.sql(f'ALTER TABLE accounts_empty SET LOCATION "{accounts_output_path_v1}"')

spark.sql('ALTER TABLE accounts_empty RENAME TO accounts')

spark.sql('DROP TABLE accounts_to_delete')

In [None]:
# verify the schema:

spark.table('accounts').printSchema()

In [None]:
spark.table('accounts').show()

In [None]:
spark.stop()