# Module 2: Offline Store Hard Delete


---
**Note:** Please set the kernel to `Python 3 (Data Science)` and select the `ml.t3.medium` instance.

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Create Feature Group in Iceberg table format](#Create-Feature-Group-in-Iceberg-table-format)
1. [Hard Delete Record using DeleteRecord API](#Hard-Delete-Record-using-DeleteRecord-API)
1. [Hard Delete records from Offline store with Iceberg Compaction Procedures](#Hard-Delete-records-from-Offline-store-with-Iceberg-Compaction-Procedures)


# Background

New regulations such as the European General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and others have created new obligations: operators must be able to erase private data from their data stores when requested. Customers who need to comply with such regulations must guarantee that their users' data that needs to be forgotten is actually forgotten. This applies to data stores such as Feature Store. Customers are responsible for implementing processes and solutions to erase private data.

In this notebook, you will learn how to erase data from the online and offline feature store. We will cover the concept of "hard delete" in Iceberg, which removes the data entirely from the feature store, as opposed to "soft delete", which flags records as deleted without physically removing them from storage.

SageMaker Feature Store supports Apache Iceberg as a table format. Iceberg supports operations such as record-level insert, update, delete, and time travel queries. Iceberg tracks individual data files in a table instead of directories. The table state is maintained in metadata files. Every change to the table state creates a new metadata file version that atomically replaces the older metadata. As a result, simply deleting records is a soft delete: the data still exists and can be read with time travel.
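
To make the soft-delete behavior concrete, here is a minimal, library-free sketch of a snapshot-based table. The `ToyTable` class, its methods, and the file name are hypothetical and only illustrate the idea; they are not Iceberg or Feature Store APIs, and the sketch simplifies record-level deletes to whole data files.

```python
# Toy model of snapshot-based table state (hypothetical, not an Iceberg API).
class ToyTable:
    def __init__(self):
        # Each snapshot is an immutable tuple of the data files it references.
        self.snapshots = [()]

    def append(self, data_file):
        # A write creates a new snapshot that also references the new file.
        self.snapshots.append(self.snapshots[-1] + (data_file,))

    def delete(self, data_file):
        # A delete creates a new snapshot without the file.
        # Older snapshots are left untouched.
        self.snapshots.append(
            tuple(f for f in self.snapshots[-1] if f != data_file)
        )

    def files(self, snapshot_index=-1):
        # Read the latest snapshot by default, or an older one ("time travel").
        return self.snapshots[snapshot_index]

    def expire_all_but_latest(self):
        # Only after older snapshots are expired does the file become unreachable.
        self.snapshots = [self.snapshots[-1]]


table = ToyTable()
table.append("customer-123.parquet")
table.delete("customer-123.parquet")

current_files = table.files()      # latest snapshot: the file is gone
previous_files = table.files(-2)   # previous snapshot: the file is still reachable
table.expire_all_but_latest()
remaining_snapshots = len(table.snapshots)
```

In this toy model the delete alone leaves the data reachable through the previous snapshot; it only disappears once older snapshots are expired, which is the gap the rest of this notebook closes.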

Customers will have business logic that defines which feature records to delete and which to retain. They can then decide how to handle the records identified for deletion. For example, users may want to run Athena delete queries and Athena vacuum queries on a schedule to remove records from the offline store (Amazon S3), or call Iceberg maintenance methods directly from Spark commands. Customers can also use the DeleteRecord API to delete a record from a FeatureGroup in the OnlineStore. Feature Store supports both SoftDelete and HardDelete.

This notebook will show how to remove data from the online and offline store using the DeleteRecord API and Amazon Athena procedures, which customers can leverage to implement their compliance workflows and policies. Source[1](https://docs.dremio.com/software/data-formats/apache-iceberg/).

![Iceberg Architecture](../images/iceberg_architecture.png "Iceberg Architecture")


# Setup

#### Imports

In [None]:
import sagemaker
import boto3
import sys
import pandas as pd
import numpy as np
import io
import importlib
from sagemaker.session import Session
from sagemaker import get_execution_role
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.inputs import TtlDuration
from time import gmtime, strftime, sleep
import subprocess
import logging
import time

sys.path.append('..')
from utilities import Utils

In [None]:
sm_version = sagemaker.__version__
# Compare (major, minor) as a tuple so that newer major versions are not downgraded
major, minor = (int(part) for part in sm_version.split('.')[:2])
if (major, minor) < (2, 125):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sagemaker==2.125.0'])
    importlib.reload(sagemaker)

In [None]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

In [None]:
logger.info(f'Using SageMaker version: {sagemaker.__version__}')
logger.info(f'Using Pandas version: {pd.__version__}')

#### Essentials

In [None]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name
logger.info(f'Default S3 bucket = {default_bucket}')
prefix = 'sagemaker-feature-store'
# Same bucket as default_bucket; used below as the offline store S3 location
s3_bucket_name = sagemaker_session.default_bucket()

featurestore_runtime_client = sagemaker_session.boto_session.client('sagemaker-featurestore-runtime', region_name=region)

boto_session = boto3.Session(region_name=region)
sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)
feature_store_session = sagemaker.Session(boto_session=boto_session, 
                                          sagemaker_client=sagemaker_client, 
                                          sagemaker_featurestore_runtime_client=featurestore_runtime_client)

#### Load datasets

In [None]:
customers_df = pd.read_csv('.././data/transformed/customers.csv')
customers_df.head(5)

In [None]:
customers_df = customers_df.convert_dtypes(infer_objects=True, convert_boolean=False)

In [None]:
customers_df['customer_id'] = customers_df['customer_id'].astype('string')
customers_df['event_time'] = customers_df['event_time'].astype('string')

# Create Feature Group in Iceberg table format

First, create a feature group with online and offline stores configured.

In [None]:
# define feature group name
customers_feature_group_iceberg_name = "customers-fg-iceberg-hd-" + strftime("%d-%H-%M-%S", gmtime())

In [None]:
customers_feature_group_iceberg = FeatureGroup(
    name=customers_feature_group_iceberg_name, sagemaker_session=sagemaker_session
)

In [None]:
customers_feature_group_iceberg.load_feature_definitions(data_frame=customers_df)

Select Iceberg as the table format when creating new feature groups. The optional TableFormat parameter can be configured either interactively in Amazon SageMaker Studio or through code using the API or the SDK. It accepts the values ICEBERG or GLUE (the existing Glue format). The following code snippet shows how to create a feature group using the Iceberg format and the FeatureGroup.create API of the SageMaker SDK.

In [None]:
from sagemaker.feature_store.inputs import TableFormatEnum, TtlDuration

record_identifier_customer_feature_name = "customer_id"
ttl_duration = TtlDuration(unit="Seconds", value=30)

customers_feature_group_iceberg.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",
    record_identifier_name=record_identifier_customer_feature_name,
    event_time_feature_name="event_time",
    role_arn=role,
    ttl_duration=ttl_duration,
    enable_online_store=True,
    table_format=TableFormatEnum.ICEBERG,

)

In [None]:
customers_feature_group_iceberg.describe()

In [None]:
customers_fg = FeatureGroup(name=customers_feature_group_iceberg_name, sagemaker_session=feature_store_session)

### Ingest Data

In [None]:
def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get('FeatureGroupStatus')
    print(f'Initial status: {status}')
    while status == 'Creating':
        logger.info(f'Waiting for feature group: {feature_group.name} to be created ...')
        time.sleep(5)
        status = feature_group.describe().get('FeatureGroupStatus')
    if status != 'Created':
        raise SystemExit(f'Failed to create feature group {feature_group.name}: {status}')
    logger.info(f'FeatureGroup {feature_group.name} was successfully created.')

In [None]:
wait_for_feature_group_creation_complete(customers_fg)

In [None]:
%%time
logger.info(f'Ingesting data into feature group: {customers_fg.name} ...')
customers_fg.ingest(data_frame=customers_df, max_processes=16, wait=True)
logger.info(f'{len(customers_df)} customer records ingested into feature group: {customers_fg.name}')

### Retrieve Sample Customer (to be deleted)

In [None]:
customers_query = customers_fg.athena_query()
customers_table = customers_query.table_name
customers_database = customers_query.database
print(customers_table)
print(customers_database)

<div class="alert alert-info"> 💡 Note: it takes a few minutes for records to appear in the offline store.
</div>

In [None]:
# Get random rows from Customers FG
customers_sample_df = Utils.sample(customers_feature_group_iceberg_name, n=1)
customers_sample_df.head()

In [None]:
# Use positional access so this works regardless of the DataFrame index
customers_sample_id = customers_sample_df['customer_id'].iloc[0]
print(customers_sample_id)

In [None]:
query_results= 'sagemaker-featurestore/athena-results'
output_location = f's3://{default_bucket}/{query_results}/query_results/'
print(f'Athena query output location: \n{output_location}')

# Hard Delete Record using DeleteRecord API

You can delete records using the [DeleteRecord API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_feature_store_DeleteRecord.html) for the online and offline stores. The API supports two deletion modes, SoftDelete and HardDelete. For HardDelete, the complete record is removed from the OnlineStore. In both modes, Feature Store appends the deleted record marker to the OfflineStore.

**Note**: Amazon SageMaker Feature Store provides the option for records to be hard deleted from the online store after a time duration is reached, with time to live (TTL) duration (TtlDuration). A record deleted using TtlDuration is hard deleted, or completely removed from the online store, and the deleted record is added to the offline store. For more information, please refer to the [document](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-time-to-live.html). 

![Iceberg Architecture](../images/smfs_hard_delete_1.png "Iceberg Architecture")

### Run DeleteRecord API

For feature groups with both online and offline stores, you can hard delete records from the online store. This is done by calling the [DeleteRecord](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_feature_store_DeleteRecord.html) API with `DeletionMode='HardDelete'`, which removes the record from the online store and adds a delete marker (`is_deleted`) for the record in the offline store.

In [None]:
from datetime import datetime, timezone, date

def generate_event_timestamp():
    # naive datetime representing local time
    naive_dt = datetime.now()
    # take timezone into account
    aware_dt = naive_dt.astimezone()
    # time in UTC
    utc_dt = aware_dt.astimezone(timezone.utc)
    # transform to ISO-8601 format
    event_time = utc_dt.isoformat(timespec='milliseconds')
    event_time = event_time.replace('+00:00', 'Z')
    return event_time

In [None]:
delete_record_result = featurestore_runtime.delete_record(
    FeatureGroupName=customers_feature_group_iceberg_name,
    RecordIdentifierValueAsString=customers_sample_id,
    EventTime=generate_event_timestamp(),
    DeletionMode='HardDelete',
    TargetStores=[
        'OnlineStore', 'OfflineStore'
    ]
)

### Check that the record has been deleted from the online store

In [None]:
import pprint
pretty_printer = pprint.PrettyPrinter(indent=4)

get_record_result = featurestore_runtime.get_record(
    FeatureGroupName=customers_feature_group_iceberg_name,
    RecordIdentifierValueAsString=customers_sample_id
)
pretty_printer.pprint(get_record_result)

### Check the customer record has been marked as deleted in offline store

<div class="alert alert-info"> 💡 Note: it takes a few minutes for records to appear in the offline store. You should see two records in the offline store for the sample customer_id, one with is_deleted=True and one with is_deleted=False.
</div>

In [None]:
# Select every offline store row for the sampled customer
select_query_string = f'SELECT * ' \
    f'FROM "{customers_database}"."{customers_table}"' \
    f' WHERE customer_id = \'{customers_sample_id}\';'

select_query_string

In [None]:
customers_query.run(query_string=select_query_string, output_location=output_location)
customers_query.wait()
select_record_df = customers_query.as_dataframe()
select_record_df.head()

Feature Store appends the deleted record marker to the OfflineStore. The deleted record marker is a record with the same RecordIdentifier as the original, but with the `is_deleted` value set to `True`, `EventTime` set to the delete input `EventTime`, and all other feature values set to `null`.
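
As a rough illustration of that shape, the following sketch builds a marker row from a plain dictionary. The helper `build_deleted_record_marker` and the sample feature names and values are hypothetical; this only mirrors the description above and is not how Feature Store creates the marker internally.

```python
def build_deleted_record_marker(record, record_identifier_name, delete_event_time,
                                event_time_name="event_time"):
    """Return a marker row shaped like the description above (illustrative only)."""
    # Start with every column set to None (null in the offline store).
    marker = {name: None for name in record}
    # Keep the same record identifier as the original record.
    marker[record_identifier_name] = record[record_identifier_name]
    # The event time is the one passed to the delete call.
    marker[event_time_name] = delete_event_time
    # Flag the row as a delete marker.
    marker["is_deleted"] = True
    return marker


# Hypothetical sample record; "age" is an assumed feature name.
original = {
    "customer_id": "C1",
    "event_time": "2023-01-01T00:00:00.000Z",
    "age": 42,
    "is_deleted": False,
}
marker = build_deleted_record_marker(original, "customer_id", "2023-06-01T12:00:00.000Z")
```

Comparing `marker` with the second row returned by the query above is a quick way to confirm that the delete was recorded in the offline store.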

# Hard Delete records from Offline store with Iceberg Compaction Procedures

![Iceberg Architecture](../images/smfs_hard_delete_2.png "Iceberg Architecture")

### Delete record from offline store based on deleted record marker

In [None]:
delete_query_string = f'DELETE ' \
    f'FROM "{customers_database}"."{customers_table}"' \
    f' WHERE customer_id in (' \
    f'SELECT customer_id ' \
    f'FROM "{customers_database}"."{customers_table}"' \
    f' WHERE is_deleted =true);'

delete_query_string

In [None]:
customers_query.run(query_string=delete_query_string, output_location=output_location)
customers_query.wait()

<div class="alert alert-info"> 💡 Iceberg generates a snapshot when you create, or modify, a table. A snapshot stores the state of a table. You can specify which snapshot you want to read, and then view the data at that timestamp. Snapshots can be used for time-travel queries, or the table can be rolled back to any valid snapshot. Snapshots accumulate until they are expired by the expireSnapshots operation. Regularly expiring snapshots is recommended to delete data files that are no longer needed, and to keep the size of table metadata small.
</div>

By default, the data catalog points to the latest snapshot. Newer snapshots no longer expose the deleted records, so when you query the customer record without specifying a snapshot, the deleted record is not returned.

The $snapshots table provides a detailed view of snapshots of the Iceberg table. A snapshot is the union of all files in its manifests. Manifest files can also be shared between snapshots to avoid rewriting metadata that changes infrequently.
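
The snapshot lookup used later in this notebook relies on the `snapshot_id`, `parent_id`, and `committed_at` columns of this table. The sketch below shows the same idea on plain Python rows: find the most recently committed snapshot and return its parent. The helper name `previous_snapshot_id` and the sample rows are hypothetical.

```python
def previous_snapshot_id(snapshots):
    """Return the parent_id of the most recently committed snapshot, or None."""
    if not snapshots:
        return None
    # The latest snapshot is the one with the greatest commit timestamp.
    latest = max(snapshots, key=lambda row: row["committed_at"])
    # Its parent is the snapshot that was current just before the last change.
    return latest["parent_id"]


# Hypothetical rows shaped like the $snapshots metadata table.
rows = [
    {"snapshot_id": 101, "parent_id": None, "committed_at": "2023-06-01 10:00:00"},
    {"snapshot_id": 102, "parent_id": 101, "committed_at": "2023-06-01 10:05:00"},
    {"snapshot_id": 103, "parent_id": 102, "committed_at": "2023-06-01 10:10:00"},
]
previous_id = previous_snapshot_id(rows)
```

Here snapshot 103 would be the one created by the delete, so its parent is the last snapshot in which the deleted record is still visible.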

In [None]:
snapshot_query_string = f'SELECT * ' \
    f'FROM "{customers_database}"."{customers_table}$snapshots"'

snapshot_query_string

In [None]:
customers_query.run(query_string=snapshot_query_string, output_location=output_location)
customers_query.wait()

In [None]:
snapshots_df = customers_query.as_dataframe()
snapshots_df.head()

Retrieve the previous snapshot ID and show that the customer record still exists in the offline feature store.

In [None]:
previous_snapshot_query_string = f'SELECT snapshot_id  ' \
    f'FROM "{customers_database}"."{customers_table}$snapshots"' \
    f'    WHERE snapshot_id IN (SELECT parent_id FROM "{customers_database}"."{customers_table}$snapshots" WHERE committed_at = (' \
    f'        SELECT MAX(committed_at) FROM "{customers_database}"."{customers_table}$snapshots"))' \
    f'    ORDER BY committed_at DESC' \
    f'    LIMIT 1'

previous_snapshot_query_string

In [None]:
customers_query.run(query_string=previous_snapshot_query_string, output_location=output_location)
customers_query.wait()

In [None]:
previous_snapshot_df = customers_query.as_dataframe()
snapshot_id = previous_snapshot_df['snapshot_id'][0]
print(snapshot_id)

Retrieve the deleted record from the previous snapshot.

In [None]:
deleted_record_query_string = f'SELECT *  ' \
    f'FROM "{customers_database}"."{customers_table}" FOR VERSION AS OF {snapshot_id}' \
    f'     WHERE customer_id =\'{customers_sample_id}\';'

deleted_record_query_string

In [None]:
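customers_query.run(query_string=deleted_record_query_string, output_location=output_location)
customers_query.wait()
# Display the result to confirm the record is still readable via time travel
deleted_record_df = customers_query.as_dataframe()
deleted_record_df.head()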

### Run Iceberg Compaction procedure for Snapshot Expiry

To hard delete records in the offline store, we need to run:
* Iceberg compaction using the Athena `OPTIMIZE table REWRITE DATA` query. Newer snapshots where records are deleted may still refer to data files that contain the deleted records. Compaction reconciles these records using the delete files that list deleted records.
* Snapshot expiration for snapshots older than the retention period, using the Athena `VACUUM` query.

In this example, we will ALTER the vacuum_max_snapshot_age_seconds table property to a very low value (60 seconds) for demonstration purposes. Customers can adjust this value according to their company policy.


VACUUM removes snapshots that are older than the amount of time that is specified by the vacuum_max_snapshot_age_seconds table property. By default, this property is set to 432000 seconds (5 days).
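
The two maintenance statements named above can be sketched as plain query strings. This is only an illustration of the Athena `OPTIMIZE ... REWRITE DATA USING BIN_PACK` and `VACUUM` statements; the helper name `build_maintenance_queries` and the sample database and table names are hypothetical, and the `smfs_offline_compaction.py` script used later in this notebook may run them differently.

```python
def build_maintenance_queries(database, table):
    """Build the compaction and snapshot-expiry statements for one Iceberg table."""
    qualified_name = f"{database}.{table}"
    # Compaction: rewrites data files and applies pending delete files.
    optimize_query = f"OPTIMIZE {qualified_name} REWRITE DATA USING BIN_PACK"
    # Snapshot expiry: removes snapshots older than vacuum_max_snapshot_age_seconds.
    vacuum_query = f"VACUUM {qualified_name}"
    # Order matters: compact first, then expire the old snapshots.
    return [optimize_query, vacuum_query]


# Hypothetical names; in this notebook they would be customers_database and customers_table.
maintenance_queries = build_maintenance_queries(
    "sagemaker_featurestore", "customers_fg_iceberg"
)
```

Each string could then be submitted to Athena in the same way as the other queries in this notebook, one statement per query execution.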

#### Adjust vacuum_max_snapshot_age_seconds table property

In [None]:
property_query_string = f'ALTER TABLE `{customers_table}` ' \
    f'SET TBLPROPERTIES (\'vacuum_max_snapshot_age_seconds\'=\'60\')'

property_query_string

In [None]:
customers_query.run(query_string=property_query_string, output_location=output_location)
customers_query.wait()

#### Compaction and Vacuum

In [None]:
!pip3 install pyspark
!pip3 install git+https://github.com/awslabs/aws-glue-libs.git

In [None]:
!python ../02-module-working-with-offline-store/smfs_offline_compaction.py --region {region} --database {customers_database} --table {customers_table} --workgroup primary --outputlocation {output_location}

#### Query post procedures

The Snapshots have expired and the data is no longer present for this customer record.

In [None]:
%%time
customers_query.run(query_string=snapshot_query_string, output_location=output_location)
customers_query.wait()

In [None]:
snapshot_df = customers_query.as_dataframe()
snapshot_df.head()

Only one snapshot remains in Iceberg, and if you try to retrieve the customer record, it no longer exists.

In [None]:
customers_query.run(query_string=select_query_string, output_location=output_location)
customers_query.wait()
select_record_df = customers_query.as_dataframe()
select_record_df.head()