# Change Data Feed

In this notebook we will:

* recreate the Delta table in the metastore from the data already stored at its location
* enable the Change Data Feed (CDF) feature on a Delta table
* delete a record and query the CDF
* append the record back using the CDF

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from delta.tables import DeltaTable
import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('delta-II')
    .config('spark.jars.packages', 'io.delta:delta-spark_2.12:3.2.1')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate()
)

In [None]:
# Resolve the project root, three directory levels above the current working directory
base_path = os.getcwd()

project_path = '/'.join(base_path.split('/')[:-3])

users_base_path = os.path.join(project_path, 'data/users_base')
users_increment_path = os.path.join(project_path, 'data/users_increment')
accounts_output_path = os.path.join(project_path, 'output/accounts')

In [None]:
spark.sql('drop table if exists accounts')

In [None]:
spark.sql(f"""
    CREATE TABLE accounts
    USING DELTA
    LOCATION '{accounts_output_path}'
""")

In [None]:
spark.sql('show tables').show()

In [None]:
# Turn on the CDF feature:

spark.sql('ALTER TABLE accounts SET TBLPROPERTIES (delta.enableChangeDataFeed = true)')

In [None]:
# See the table history:

spark.sql('describe history accounts').select('version', 'timestamp', 'operation').show()

In [None]:
# See the table properties:

spark.sql('show tblproperties accounts').show(truncate=100)

## Query the CDF 

* see the [docs](https://docs.delta.io/latest/delta-change-data-feed.html)
* for the `startingVersion` option, use the table version at which CDF was enabled (the `SET TBLPROPERTIES` commit in the table history)
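
Rather than reading the starting version off the history output by eye, it can be looked up programmatically. The following is a minimal sketch, not part of the original exercise. It works on plain dicts so it can be tried without a Spark session. The helper name `find_cdf_start_version` and the sample rows are hypothetical, and it assumes that `describe history` stores the changed properties as a JSON string under the `properties` key of `operationParameters`.

```python
import json

CDF_PROPERTY = 'delta.enableChangeDataFeed'


def find_cdf_start_version(history_rows):
    """Return the latest version at which CDF was switched on, or None.

    `history_rows` is a list of dicts with at least the keys `version`,
    `operation` and `operationParameters`.
    """
    candidates = []
    for row in history_rows:
        if row['operation'] != 'SET TBLPROPERTIES':
            continue
        # Assumption: the properties are a JSON string under the 'properties' key.
        raw = (row.get('operationParameters') or {}).get('properties', '{}')
        props = json.loads(raw)
        if str(props.get(CDF_PROPERTY, '')).lower() == 'true':
            candidates.append(row['version'])
    return max(candidates) if candidates else None


# Hand-written sample rows standing in for the real table history
sample_history = [
    {'version': 7, 'operation': 'SET TBLPROPERTIES',
     'operationParameters': {'properties': '{"delta.enableChangeDataFeed":"true"}'}},
    {'version': 6, 'operation': 'MERGE', 'operationParameters': {}},
]

print(find_cdf_start_version(sample_history))
```

With a live session, the rows could be built with `[row.asDict() for row in spark.sql('describe history accounts').collect()]`. The sketch does not cover tables that were created with CDF already enabled, since those have no `SET TBLPROPERTIES` commit.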

In [None]:
# your code here:

(
    spark.read
    .format('delta')
    .option('readChangeFeed', 'true')
    .option('startingVersion', 7)
    .table('accounts')
).show()

## Delete a row and then add it back using the CDF

* from the accounts table, delete the row where user_id = 79
* query the CDF again (you should see the change that happened)
* filter for rows whose `_change_type` is `delete`
* drop the additional CDF metadata columns
* append the row back to the table
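
The filter-and-drop steps above can be sketched on plain Python dicts before running them against the table. This is only an illustration: the helper `rows_to_restore`, the `balance` column, and the sample rows are hypothetical. The three metadata column names are the ones the CDF adds to every change row.

```python
# Metadata columns that the Change Data Feed adds to every change row
CDF_METADATA_COLUMNS = ('_change_type', '_commit_version', '_commit_timestamp')


def rows_to_restore(change_rows):
    """Keep only 'delete' changes and strip the CDF metadata columns."""
    restored = []
    for row in change_rows:
        if row['_change_type'] != 'delete':
            continue
        restored.append(
            {key: value for key, value in row.items()
             if key not in CDF_METADATA_COLUMNS}
        )
    return restored


# Hand-written change rows; 'balance' is a made-up column for illustration
sample_changes = [
    {'user_id': 79, 'balance': 120, '_change_type': 'delete',
     '_commit_version': 8, '_commit_timestamp': '2024-01-01 10:00:00'},
    {'user_id': 12, 'balance': 50, '_change_type': 'insert',
     '_commit_version': 8, '_commit_timestamp': '2024-01-01 10:00:00'},
]

print(rows_to_restore(sample_changes))
```

The result has the same columns as the accounts table itself, which is why it can be appended back without a schema mismatch.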

In [None]:
# delete the row where user_id = 79:

DeltaTable.forName(spark, 'accounts').delete(col('user_id') == 79)

In [None]:
# see the CDF:

(
    spark.read
    .format('delta')
    .option('readChangeFeed', 'true')
    .option('startingVersion', 7)
    .table('accounts')
).show()

In [None]:
# append the row back:

(
    spark.read
    .format('delta')
    .option('readChangeFeed', 'true')
    .option('startingVersion', 7)
    .table('accounts')
    .filter(col('_change_type') == 'delete')
    .drop('_change_type', '_commit_version', '_commit_timestamp')
    .write
    .mode('append')
    .format('delta')
    .option('path', accounts_output_path)
    .saveAsTable('accounts')
)

In [None]:
# see the history of the table:

spark.sql('describe history accounts').select('version', 'timestamp', 'operation').show()

In [None]:
# check the CDF again and see the append:

(
    spark.read
    .format('delta')
    .option('readChangeFeed', 'true')
    .option('startingVersion', 7)
    .table('accounts')
).show()