# Schema evolution

In this notebook we will:

* clone the table to a new location
* append data which has one more column
* drop a column
* append data that has a different data type in one column
  * use [type widening](https://docs.delta.io/latest/delta-type-widening.html) (available since Delta 4.0, so not used in this notebook)
  * change the schema of the table
  * append afterwards

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from delta.tables import DeltaTable
import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('delta-IV')
    .config('spark.jars.packages', 'io.delta:delta-spark_2.12:3.2.1')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

project_path = ('/').join(base_path.split('/')[0:-3]) 

users_base_path = os.path.join(project_path, 'data/users_base')
users_increment_path = os.path.join(project_path, 'data/users_increment')
accounts_output_path = os.path.join(project_path, 'output/accounts')
accounts_output_path_dev = os.path.join(project_path, 'output/accounts_dev')

In [None]:
# First recreate the table using the data in the location accounts_output_path:

spark.sql('drop table if exists accounts')

spark.sql(f"""
    CREATE TABLE accounts
    USING DELTA
    LOCATION '{accounts_output_path}'
""")

In [None]:
# Check the history of the table:

DeltaTable.forPath(spark, accounts_output_path).history().select('version', 'timestamp', 'operation').show(truncate=30)

## Create cloned table

* Create a new table `accounts_dev` for testing purposes
* Use [shallow clone](https://docs.delta.io/latest/delta-utility.html#shallow-clone-a-delta-table)
* Use the location `accounts_output_path_dev`
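
As a mental model only, a shallow clone can be pictured as copying the table's list of file references while the data files stay where they are. The sketch below is a pure-Python toy, not Delta's implementation; the `ToyTable` class, the `shallow_clone` helper and the file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table: a location plus the data files it references."""
    location: str
    files: list = field(default_factory=list)

    def append(self, file_name: str) -> None:
        # New data files are written under the table's own location.
        self.files.append(f"{self.location}/{file_name}")


def shallow_clone(source: ToyTable, location: str) -> ToyTable:
    # Copy only the list of file references; no data file is copied or moved.
    return ToyTable(location=location, files=list(source.files))


accounts = ToyTable("output/accounts")
accounts.append("part-0000.parquet")

accounts_dev = shallow_clone(accounts, "output/accounts_dev")
accounts_dev.append("part-0001.parquet")

print(accounts.files)      # the source is unchanged by writes to the clone
print(accounts_dev.files)  # the clone references the source file plus its own
```

In this picture, experiments on `accounts_dev` never touch `accounts`, which is why the clone is a convenient place to try schema changes.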

In [None]:
# Your code here

spark.sql('drop table if exists accounts_dev')

spark.sql(f"""
    CREATE TABLE accounts_dev
    SHALLOW CLONE accounts
    LOCATION '{accounts_output_path_dev}'
""")

In [None]:
# Check the history of the cloned table

DeltaTable.forPath(spark, accounts_output_path_dev).history().select('version', 'timestamp', 'operation').show(truncate=30)

In [None]:
# or using SQL:

spark.sql('describe history accounts_dev').select('version', 'timestamp', 'operation').show(truncate=30)

In [None]:
new_row = [
    (1, 'Test Test', 'This is testing account', 'Prague', 0, 1, 100, 1000, 'Test')
]
new_row_df = spark.createDataFrame(
    new_row, 
    schema=['user_id', 'display_name', 'about', 'location', 'downvotes', 'upvotes', 'reputation', 'views', 'first_name']
)

In [None]:
new_row_df.show()

In [None]:
# Check the schema of new_row_df and see how it differs from the schema of accounts_dev:

new_row_df.printSchema()

In [None]:
spark.table('accounts_dev').printSchema()

## Append the new_row_df to the accounts_dev table

* Use saveAsTable with the append mode
  * it will fail with a schema mismatch error
* Do one of the following:
  *  set this config to True: `spark.databricks.delta.schema.autoMerge.enabled`
  *  use `mergeSchema` option on the writer
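
To make the idea of merging schemas concrete, here is a toy model in plain Python. It is a sketch under simplifying assumptions, not Delta's actual logic: schemas are reduced to hypothetical, shortened `name -> type` dictionaries, new columns are appended at the end, and any type difference on a shared column is treated as an error (the real rules are more nuanced).

```python
def merge_schemas(table_schema: dict, incoming_schema: dict) -> dict:
    """Toy schema merge: keep the table's columns and append columns that are new."""
    merged = dict(table_schema)  # dicts preserve insertion order
    for name, dtype in incoming_schema.items():
        if name not in merged:
            # A column the table does not have yet is added at the end.
            merged[name] = dtype
        elif merged[name] != dtype:
            # Simplification: every type difference is rejected here.
            raise ValueError(f"Failed to merge fields '{name}' and '{name}'")
    return merged


# Hypothetical, shortened schemas (column name -> type name).
table_schema = {"user_id": "bigint", "upvotes": "bigint"}
incoming_schema = {"user_id": "bigint", "upvotes": "bigint", "first_name": "string"}

print(merge_schemas(table_schema, incoming_schema))
```

Rows that were already in the table have no value for the added column, so they read as null there after the merge.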

In [None]:
# check the value of the conf setting:

spark.conf.get('spark.databricks.delta.schema.autoMerge.enabled')

In [None]:
# Without mergeSchema this append fails with "A schema mismatch detected".
# Uncomment saveAsTable to reproduce the error:

(
    new_row_df
    .write
    .format('delta')
    .mode('append')
    .option('path', accounts_output_path_dev)
    #.saveAsTable('accounts_dev')
)

In [None]:
# use mergeSchema option:

(
    new_row_df
    .write
    .format('delta')
    .mode('append')
    .option('path', accounts_output_path_dev)
    .option('mergeSchema', True)
    .saveAsTable('accounts_dev')
)

In [None]:
# Check the schema of the dev table to see if the column was added:

spark.table('accounts_dev').printSchema()

## Append a new row where one of the columns has a different data type

In [None]:
# Here we change the upvotes column data type to double:

new_row = [
    (2, 'Another Test', 'This is testing account', 'Prague', 0, 1.0, 100, 1000, 'Another')
]
new_row_df = spark.createDataFrame(
    new_row, 
    schema=['user_id', 'display_name', 'about', 'location', 'downvotes', 'upvotes', 'reputation', 'views', 'first_name']
)

In [None]:
new_row_df.printSchema()

* use saveAsTable with append mode
  * it will fail with `Failed to merge fields` error
* overwrite the table accounts_dev and change the data type in the table using [cast](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.cast.html#pyspark-sql-column-cast)
  * use saveAsTable with the `overwriteSchema` option
* then append the new data
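
Before choosing which column to cast, it can help to list exactly which shared columns disagree on type. The helper below is a minimal sketch in plain Python; `find_type_conflicts` and the shortened schemas are hypothetical. With real DataFrames, the two dictionaries could be built from `dict(df.dtypes)`.

```python
def find_type_conflicts(table_schema: dict, incoming_schema: dict) -> list:
    """Return (column, table_type, incoming_type) for shared columns whose types differ."""
    return [
        (name, table_schema[name], dtype)
        for name, dtype in incoming_schema.items()
        if name in table_schema and table_schema[name] != dtype
    ]


# Hypothetical, shortened schemas (column name -> type name).
table_schema = {"user_id": "bigint", "upvotes": "bigint", "first_name": "string"}
incoming_schema = {"user_id": "bigint", "upvotes": "double", "first_name": "string"}

conflicts = find_type_conflicts(table_schema, incoming_schema)
print(conflicts)
```

Each reported column is a candidate for the cast-and-overwrite step described in the list above.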

In [None]:
# Even with mergeSchema this append fails with "Failed to merge fields 'upvotes' and 'upvotes'".
# Uncomment saveAsTable to reproduce the error:

(
    new_row_df
    .write
    .format('delta')
    .mode('append')
    .option('path', accounts_output_path_dev)
    .option('mergeSchema', True)
    #.saveAsTable('accounts_dev')
)

In [None]:
# call saveAsTable on accounts_dev and use overwriteSchema option:

(
    spark.table('accounts_dev')
    .withColumn('upvotes', col('upvotes').cast('double'))
    .write
    .mode('overwrite')
    .format('delta')
    .option('overwriteSchema', True)
    .saveAsTable('accounts_dev')
)

In [None]:
# after the schema of the table was changed, append the new data:

(
    new_row_df
    .write
    .format('delta')
    .mode('append')
    .option('path', accounts_output_path_dev)
    .saveAsTable('accounts_dev')
)

In [None]:
# Restore the accounts table from the base data, which also resets its schema and data types:

(
    spark.read.parquet(users_base_path)
    .write
    .format('delta')
    .mode('overwrite')
    .option('path', accounts_output_path)
    .option('overwriteSchema', True)
    .saveAsTable('accounts')
)