# Module 3 - Update Feature Group (Optional notebook)
### Module 1 is a pre-requisite for this notebook.

**Note:** Please set the kernel to `Python 3 (Data Science)` and the instance type to `ml.t3.medium`.

---

## Contents

1. [Setup](#setup)
1. [Explore existing customer feature group and data](#explore-customer-fg)
1. [Update customer feature group](#update-customer-fg)
1. [Ingest data into customer feature group](#ingest-customer-fg)
1. [Prepare training data set to retrain model](#model-training-data)
1. [Retrain XGBoost model](#retrain-xg-boost)
1. [Test model performance against test data](#real-time-inference)

---

In this notebook, we illustrate how to modify a feature group using the boto3 API and then ingest data into the modified feature group. We will cover the following aspects:

* Look at existing data from customer feature group
* Modify customer feature group to add "has_kids" feature and ingest sample data
* Verify for a customer record that data has been ingested
* Run an Athena query to extract the data set for retraining (programmatically, using the SageMaker SDK)
* Retrain an XGBoost model similar to what we did in the notebook `m3_nb1_model_training.ipynb`
* Test by deploying the model and predicting against a sample test record
* Cleanup resources



## Setup
<a id='setup'></a>

In [1]:
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.serializers import CSVSerializer
from sagemaker.inputs import TrainingInput
from sagemaker.predictor import Predictor
from datetime import datetime, timezone, date
from random import randint
import pandas as pd
import numpy as np
import subprocess
import sagemaker
import importlib
import logging
import time
import sys
import boto3
import os
sys.path.append('..')
from utilities import Utils

In [2]:
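# Compare version numbers numerically; as plain strings, '2.100.0' would sort below '2.48.1'
if tuple(int(part) for part in sagemaker.__version__.split('.')[:3]) < (2, 48, 1):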
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sagemaker==2.48.1'])
    importlib.reload(sagemaker)

In [3]:
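# Compare version numbers numerically; as plain strings, '1.100.0' would sort below '1.24.23'
if tuple(int(part) for part in boto3.__version__.split('.')[:3]) < (1, 24, 23):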
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'boto3==1.24.23'])
    importlib.reload(boto3)

In [4]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

In [5]:
logger.info(f'Using SageMaker version: {sagemaker.__version__}')
logger.info(f'Using Pandas version: {pd.__version__}')
logger.info(f'Using boto3 version: {boto3.__version__}')

Using SageMaker version: 2.95.0
Using Pandas version: 1.0.1
Using boto3 version: 1.24.23


In [6]:
import pprint
pretty_printer = pprint.PrettyPrinter(indent=4)

In [7]:
!mkdir -p ../data/retrain

## Essentials

In [8]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
logger.info(f'Default S3 bucket = {default_bucket}')
prefix = 'sagemaker-feature-store'

Default S3 bucket = sagemaker-us-east-1-758387645107


In [9]:
region = sagemaker_session.boto_region_name

In [10]:
boto_session = boto3.Session(region_name=region)
sagemaker_runtime = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)
s3 = boto_session.resource('s3')


In [11]:
def generate_event_timestamp():
    # naive datetime representing local time
    naive_dt = datetime.now()
    # take timezone into account
    aware_dt = naive_dt.astimezone()
    # time in UTC
    utc_dt = aware_dt.astimezone(timezone.utc)
    # transform to ISO-8601 format
    event_time = utc_dt.isoformat(timespec='milliseconds')
    event_time = event_time.replace('+00:00', 'Z')
    return event_time
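
The helper above yields an ISO-8601 UTC string with millisecond precision and a `Z` suffix, which is the shape of the `event_time` values shown later in this notebook (for example `2022-07-13T23:46:35.610Z`). As a minimal sketch, the same string can be built directly from an aware UTC datetime. The function name `utc_event_timestamp` and its optional `now` parameter are hypothetical; they are added only so the result can be checked against a fixed input.

```python
from datetime import datetime, timezone

def utc_event_timestamp(now=None):
    """Return an ISO-8601 UTC timestamp such as '2022-07-13T23:46:35.610Z'."""
    # Hypothetical helper: start from an aware UTC datetime unless one is passed in
    now = now or datetime.now(timezone.utc)
    # Convert any other aware datetime to UTC before formatting
    utc_dt = now.astimezone(timezone.utc)
    return utc_dt.isoformat(timespec='milliseconds').replace('+00:00', 'Z')

fixed = datetime(2022, 7, 13, 23, 46, 35, 610000, tzinfo=timezone.utc)
print(utc_event_timestamp(fixed))  # 2022-07-13T23:46:35.610Z
```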

### Explore existing feature definition and the data set
<a id='explore-customer-fg'></a>

Retrieve variables stored in previous notebooks for feature group names

In [12]:
# Retrieve FG names
%store -r customers_feature_group_name
%store -r products_feature_group_name
%store -r orders_feature_group_name
logger.info(f'Customers FG: {customers_feature_group_name}')
logger.info(f'Products FG: {products_feature_group_name}')
logger.info(f'Orders FG: {orders_feature_group_name}')

Customers FG: fscw-customers-07-13-23-47
Products FG: fscw-products-07-13-23-47
Orders FG: fscw-orders-07-13-23-47


In [13]:
customers_fg = FeatureGroup(name=customers_feature_group_name, sagemaker_session=sagemaker_session)  
products_fg = FeatureGroup(name=products_feature_group_name, sagemaker_session=sagemaker_session)
orders_fg = FeatureGroup(name=orders_feature_group_name, sagemaker_session=sagemaker_session)

Verify that a record exists in the customer feature group for a random `customer_id`

In [14]:
customer_id =  f'C{randint(1, 10000)}'
logger.info(f'customer_id={customer_id}') 

customer_id=C7942


In [15]:
feature_record = featurestore_runtime.get_record(FeatureGroupName=customers_feature_group_name, 
                                                        RecordIdentifierValueAsString=customer_id)
feature_record

{'ResponseMetadata': {'RequestId': 'dbb2e059-6307-4e8a-ab7b-7684cf38f2f0',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'dbb2e059-6307-4e8a-ab7b-7684cf38f2f0',
   'content-type': 'application/json',
   'content-length': '588',
   'date': 'Wed, 13 Jul 2022 23:58:52 GMT'},
  'RetryAttempts': 0},
 'Record': [{'FeatureName': 'customer_id', 'ValueAsString': 'C7942'},
  {'FeatureName': 'sex', 'ValueAsString': '1'},
  {'FeatureName': 'is_married', 'ValueAsString': '1'},
  {'FeatureName': 'event_time', 'ValueAsString': '2022-07-13T23:46:35.610Z'},
  {'FeatureName': 'age_18-29', 'ValueAsString': '1'},
  {'FeatureName': 'age_30-39', 'ValueAsString': '0'},
  {'FeatureName': 'age_40-49', 'ValueAsString': '0'},
  {'FeatureName': 'age_50-59', 'ValueAsString': '0'},
  {'FeatureName': 'age_60-69', 'ValueAsString': '0'},
  {'FeatureName': 'age_70-plus', 'ValueAsString': '0'},
  {'FeatureName': 'n_days_active', 'ValueAsString': '0.2356164383561644'}]}
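
The response lists each feature as a `FeatureName` / `ValueAsString` pair, which is awkward to look values up in. A small sketch of flattening it into a dictionary follows; `record_to_dict` is a hypothetical helper name, and the sample response is a shortened copy of the output above.

```python
def record_to_dict(get_record_response):
    """Flatten a GetRecord response into {feature_name: value_as_string}."""
    # 'Record' may be missing when nothing is stored for the identifier,
    # so fall back to an empty list
    return {
        feature['FeatureName']: feature['ValueAsString']
        for feature in get_record_response.get('Record', [])
    }

sample_response = {
    'Record': [
        {'FeatureName': 'customer_id', 'ValueAsString': 'C7942'},
        {'FeatureName': 'sex', 'ValueAsString': '1'},
        {'FeatureName': 'is_married', 'ValueAsString': '1'},
    ]
}
print(record_to_dict(sample_response))
```

In this notebook the helper could be applied as `record_to_dict(feature_record)`. All values stay strings, as returned by the online store.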

In [16]:
describe_feature_group_result = sagemaker_runtime.describe_feature_group(
    FeatureGroupName=customers_feature_group_name
)
pretty_printer.pprint(describe_feature_group_result)

{   'CreationTime': datetime.datetime(2022, 7, 13, 23, 48, 0, 160000, tzinfo=tzlocal()),
    'EventTimeFeatureName': 'event_time',
    'FeatureDefinitions': [   {   'FeatureName': 'customer_id',
                                  'FeatureType': 'String'},
                              {'FeatureName': 'sex', 'FeatureType': 'Integral'},
                              {   'FeatureName': 'is_married',
                                  'FeatureType': 'Integral'},
                              {   'FeatureName': 'event_time',
                                  'FeatureType': 'String'},
                              {   'FeatureName': 'age_18-29',
                                  'FeatureType': 'Integral'},
                              {   'FeatureName': 'age_30-39',
                                  'FeatureType': 'Integral'},
                              {   'FeatureName': 'age_40-49',
                                  'FeatureType': 'Integral'},
                              {   'FeatureNa

### Update feature group and ingest data
<a id='update-customer-fg' />

The sample products are spread across different categories: baby products, candies, cleaning products, etc. Let us assume that whether a customer *“has kids or not”* is a strong indicator of them buying baby and kids products. Let's go ahead and modify the customer feature group to add this new feature.

In [17]:
# Call UpdateFeatureGroup with feature addition(s)
sagemaker_runtime.update_feature_group(
    FeatureGroupName=customers_feature_group_name,
    FeatureAdditions=[
        {"FeatureName": "has_kids", "FeatureType": "Integral"}
    ]
)

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:758387645107:feature-group/fscw-customers-07-13-23-47',
 'ResponseMetadata': {'RequestId': '331947d0-6ce6-4e30-8a61-0578f372cebf',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '331947d0-6ce6-4e30-8a61-0578f372cebf',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '103',
   'date': 'Wed, 13 Jul 2022 23:58:55 GMT'},
  'RetryAttempts': 0}}

We sleep for 60 seconds because the update operation can take about a minute to complete.

In [18]:
time.sleep(60)
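
As an alternative to a fixed sleep, the update status could be polled. The sketch below assumes that the `describe_feature_group` response carries a `LastUpdateStatus` entry whose `Status` is `InProgress`, `Successful`, or `Failed`; please check this against the boto3 documentation for your version. The function name, its parameters, and the stand-in `fake_describe` are hypothetical and exist only for illustration.

```python
import time

def wait_for_feature_group_update(describe, feature_group_name,
                                  poll_seconds=5, timeout_seconds=300,
                                  sleep=time.sleep):
    """Poll until the last update is no longer 'InProgress'; return the final status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = describe(FeatureGroupName=feature_group_name)
        # Assumption: 'LastUpdateStatus' looks like {'Status': ..., 'FailureReason': ...}
        status = response.get('LastUpdateStatus', {}).get('Status')
        if status != 'InProgress':
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f'Update of {feature_group_name} is still in progress')
        sleep(poll_seconds)

# Stand-in for sagemaker_runtime.describe_feature_group, for illustration only
_responses = iter([
    {'LastUpdateStatus': {'Status': 'InProgress'}},
    {'LastUpdateStatus': {'Status': 'Successful'}},
])

def fake_describe(FeatureGroupName):
    return next(_responses)

print(wait_for_feature_group_update(fake_describe, 'demo-fg', sleep=lambda s: None))
```

In this notebook the call would be `wait_for_feature_group_update(sagemaker_runtime.describe_feature_group, customers_feature_group_name)`; the caller should still check that the returned status is not `Failed`.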

In [19]:
describe_feature_group_result = sagemaker_runtime.describe_feature_group(
    FeatureGroupName=customers_feature_group_name
)
pretty_printer.pprint(describe_feature_group_result)

{   'CreationTime': datetime.datetime(2022, 7, 13, 23, 48, 0, 160000, tzinfo=tzlocal()),
    'EventTimeFeatureName': 'event_time',
    'FeatureDefinitions': [   {   'FeatureName': 'customer_id',
                                  'FeatureType': 'String'},
                              {'FeatureName': 'sex', 'FeatureType': 'Integral'},
                              {   'FeatureName': 'is_married',
                                  'FeatureType': 'Integral'},
                              {   'FeatureName': 'event_time',
                                  'FeatureType': 'String'},
                              {   'FeatureName': 'age_18-29',
                                  'FeatureType': 'Integral'},
                              {   'FeatureName': 'age_30-39',
                                  'FeatureType': 'Integral'},
                              {   'FeatureName': 'age_40-49',
                                  'FeatureType': 'Integral'},
                              {   'FeatureNa

### Prepare "has_kids" feature data and ingest data again into customer feature group. 
<a id='ingest-customer-fg' />

Verify that the feature has been added to the feature group before proceeding with this step. We load the customer data from the CSV file, randomly generate 0 or 1 for the "has_kids" feature, and ingest the result into the feature group.
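
Because the printed description is long, the verification can also be done programmatically. This is a minimal sketch that relies only on the `FeatureDefinitions` list visible in the output above; `feature_names` and `has_feature` are hypothetical helper names, and `sample` is a shortened stand-in for a real response.

```python
def feature_names(describe_response):
    """Return the feature names listed in a DescribeFeatureGroup response."""
    return [d['FeatureName'] for d in describe_response.get('FeatureDefinitions', [])]

def has_feature(describe_response, feature_name):
    """True if the feature group description contains the given feature."""
    return feature_name in feature_names(describe_response)

sample = {'FeatureDefinitions': [
    {'FeatureName': 'customer_id', 'FeatureType': 'String'},
    {'FeatureName': 'has_kids', 'FeatureType': 'Integral'},
]}
print(has_feature(sample, 'has_kids'))  # True
```

With the variables in this notebook, the check would read `has_feature(describe_feature_group_result, 'has_kids')`.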

In [20]:
customers_df = pd.read_csv('../data/transformed/customers.csv')
customers_df.head(5)

Unnamed: 0,customer_id,sex,is_married,event_time,age_18-29,age_30-39,age_40-49,age_50-59,age_60-69,age_70-plus,n_days_active
0,C1,1,1,2022-07-13T23:46:23.861Z,0,0,0,1,0,0,0.521918
1,C2,0,1,2022-07-13T23:46:23.863Z,1,0,0,0,0,0,0.142466
2,C3,0,1,2022-07-13T23:46:23.872Z,0,0,0,0,1,0,0.141096
3,C4,0,1,2022-07-13T23:46:23.877Z,0,0,0,1,0,0,0.887671
4,C5,0,1,2022-07-13T23:46:23.878Z,0,1,0,0,0,0,0.265753


Use the NumPy library to generate random 1s and 0s

In [21]:
customers_df['has_kids']=np.random.randint(0, 2, customers_df.shape[0])

Drop the existing event time column and add current time as event time. These two steps are optional.

In [22]:
customers_df=customers_df.drop(['event_time'],axis=1)

In [23]:
event_timestamps = [generate_event_timestamp() for _ in range(len(customers_df))]
customers_df['event_time'] = event_timestamps
customers_df.head(5)

Unnamed: 0,customer_id,sex,is_married,age_18-29,age_30-39,age_40-49,age_50-59,age_60-69,age_70-plus,n_days_active,has_kids,event_time
0,C1,1,1,0,0,0,1,0,0,0.521918,1,2022-07-13T23:59:34.945Z
1,C2,0,1,1,0,0,0,0,0,0.142466,0,2022-07-13T23:59:34.945Z
2,C3,0,1,0,0,0,0,1,0,0.141096,0,2022-07-13T23:59:34.945Z
3,C4,0,1,0,0,0,1,0,0,0.887671,1,2022-07-13T23:59:34.945Z
4,C5,0,1,0,1,0,0,0,0,0.265753,1,2022-07-13T23:59:34.945Z


Ingest the updated data into the feature group. If the ingest operation throws errors about the feature not being present in the feature group, give the update operation some more time, as mentioned before, and try the ingest again.

In [25]:
%%time
customers_fg.ingest(data_frame=customers_df, max_processes=16, wait=True)
logger.info(f'{len(customers_df)} customer records ingested into feature group: {customers_feature_group_name}')

10000 customer records ingested into feature group: fscw-customers-07-13-23-47


CPU times: user 409 ms, sys: 175 ms, total: 584 ms
Wall time: 17.7 s
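
If the ingest fails because the new feature is not yet visible, the "wait and try again" advice can be wrapped in a small retry loop. This is only a sketch: `retry` is a hypothetical helper, it treats every exception as retryable (the exact exception raised by the ingest is not assumed here), and `flaky_operation` merely simulates one failed attempt.

```python
import time

def retry(operation, attempts=3, wait_seconds=30, sleep=time.sleep):
    """Call operation(); on failure wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            # Assumption: any failure is worth retrying; re-raise on the last attempt
            if attempt == attempts:
                raise
            sleep(wait_seconds)

# Simulated operation that fails once and then succeeds
attempts_seen = []

def flaky_operation():
    attempts_seen.append(1)
    if len(attempts_seen) < 2:
        raise RuntimeError('feature not yet available')
    return 'ingested'

print(retry(flaky_operation, wait_seconds=0))  # ingested
```

Applied to this notebook, the operation could be `lambda: customers_fg.ingest(data_frame=customers_df, max_processes=16, wait=True)`.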


Verify online store for a specific customer_id

In [26]:
get_record_result = featurestore_runtime.get_record(
    FeatureGroupName=customers_feature_group_name,
    RecordIdentifierValueAsString=customer_id
)
pretty_printer.pprint(get_record_result)

{   'Record': [   {'FeatureName': 'customer_id', 'ValueAsString': 'C7942'},
                  {'FeatureName': 'sex', 'ValueAsString': '1'},
                  {'FeatureName': 'is_married', 'ValueAsString': '1'},
                  {   'FeatureName': 'event_time',
                      'ValueAsString': '2022-07-13T23:59:35.017Z'},
                  {'FeatureName': 'age_18-29', 'ValueAsString': '1'},
                  {'FeatureName': 'age_30-39', 'ValueAsString': '0'},
                  {'FeatureName': 'age_40-49', 'ValueAsString': '0'},
                  {'FeatureName': 'age_50-59', 'ValueAsString': '0'},
                  {'FeatureName': 'age_60-69', 'ValueAsString': '0'},
                  {'FeatureName': 'age_70-plus', 'ValueAsString': '0'},
                  {   'FeatureName': 'n_days_active',
                      'ValueAsString': '0.2356164383561644'},
                  {'FeatureName': 'has_kids', 'ValueAsString': '1'}],
    'ResponseMetadata': {   'HTTPHeaders': {   'content-length

Let us run an Athena query to verify the offline store. Note that data ingested into the offline store can take some time to appear, since it is buffered, batched, and written to Amazon S3 within 15 minutes.

In [27]:
customers_query = customers_fg.athena_query()
customers_table = customers_query.table_name

In [28]:
output_location = f's3://{default_bucket}/{prefix}/query_results/'

In [29]:
query_string = f'SELECT * FROM "{customers_table}" limit 10'

In [31]:
customers_query.run(query_string=query_string,output_location=output_location)
customers_query.wait()
athena_df = customers_query.as_dataframe()
athena_df.head()

Unnamed: 0,customer_id,sex,is_married,event_time,age_18-29,age_30-39,age_40-49,age_50-59,age_60-69,age_70-plus,n_days_active,has_kids,write_time,api_invocation_time,is_deleted
0,C1257,1,1,2022-07-13T23:59:34.956Z,1,0,0,0,0,0,0.161644,0,2022-07-14 00:04:38.131,2022-07-13 23:59:42.000,False
1,C3129,0,0,2022-07-13T23:59:34.980Z,0,0,0,0,0,1,0.539041,1,2022-07-14 00:04:38.131,2022-07-13 23:59:42.000,False
2,C658,0,1,2022-07-13T23:59:34.950Z,0,1,0,0,0,0,0.110274,0,2022-07-14 00:04:38.131,2022-07-13 23:59:43.000,False
3,C1282,0,0,2022-07-13T23:59:34.957Z,0,0,0,0,0,1,0.163014,0,2022-07-14 00:04:38.131,2022-07-13 23:59:43.000,False
4,C1900,0,1,2022-07-13T23:59:34.961Z,1,0,0,0,0,0,0.669863,0,2022-07-14 00:04:38.131,2022-07-13 23:59:43.000,False


As we see from the above step, it is very easy now to modify an existing feature group, add new features, and ingest data.

In [None]:
customers_df.to_csv('../data/transformed/customers_has_kids.csv', index=False)

### Verify offline store in Athena Console

If this is the first time we are launching Athena in the AWS console, we need to click the `Get Started` button and then, before running the first query, set up a query results location in Amazon S3. 

After setting the query results location, on the left panel we need to select the `AwsDataCatalog` as Data source and the `sagemaker_featurestore` as Database.

We can now run a query against the offline feature store data in Athena. To select entries from the customers feature group, we use the following SQL query. You will need to replace the customers table name with the corresponding value from your environment.

```sql
select * from "<customers-table>"
limit 100
```

![Customers offline data](../images/m3_nb4_athena_query.png "Customers Offline Data")

## Optional steps
From here on in this notebook, we use the data that has the new feature "has_kids" to train the model again, deploy it, and test it against sample data. The intention is not to prove that model performance improves (mind you, this is sample data!) but to show a real-life use case where modified feature groups can be used for training.

### Prepare model training dataset
<a id='model-training-data' />

Prepare train, test and validation data 

In [None]:
products_query = products_fg.athena_query()
products_table = products_query.table_name

orders_query = orders_fg.athena_query()
orders_table = orders_query.table_name

To prepare the training, validation and test data, we run an Athena query against the offline feature store and keep only the records for which "has_kids" has been populated. Why do we do this? The offline feature store keeps historical records, and for retraining we want only the latest ingested data, which has "has_kids" populated.

In [None]:
query_string = f'SELECT * FROM "{customers_table}", "{products_table}", "{orders_table}" ' \
               f'WHERE ("{orders_table}"."customer_id" = "{customers_table}"."customer_id") ' \
               f'AND ("{orders_table}"."product_id" = "{products_table}"."product_id") ' \
               f'AND ("{customers_table}"."has_kids" is not null)'
query_string
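
The `has_kids is not null` filter works here because only the latest ingestion populated that column. A more general way to keep just the newest row per record is to rank rows by event time, sketched below. The helper name `latest_records_query` and the placeholder table name are hypothetical; the column names `event_time`, `api_invocation_time`, `write_time` and `is_deleted` are the offline store columns seen in the Athena output earlier.

```python
def latest_records_query(table_name, record_id_column):
    """Build an Athena query that keeps only the newest row per record identifier."""
    return (
        'SELECT * FROM ('
        'SELECT *, ROW_NUMBER() OVER ('
        f'PARTITION BY "{record_id_column}" '
        'ORDER BY "event_time" DESC, "api_invocation_time" DESC, "write_time" DESC'
        f') AS row_num FROM "{table_name}"'
        ') WHERE row_num = 1 AND NOT "is_deleted"'
    )

# 'customers_table_placeholder' is a made-up table name for illustration
print(latest_records_query('customers_table_placeholder', 'customer_id'))
```

The resulting string could be passed to `customers_query.run(...)` in the same way as the queries above. Note that the extra `row_num` column appears in the result and would need to be dropped before training.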

In [None]:
orders_query.run(query_string=query_string, output_location=output_location)
orders_query.wait()
joined_df = orders_query.as_dataframe()
joined_df.head()

In [None]:
joined_df.shape

In [None]:
model_df = joined_df.drop(['order_id', 
                           'customer_id', 
                           'product_id', 
                           'event_time', 
                           'write_time', 
                           'api_invocation_time', 
                           'is_deleted', 
                           'product_id.1', 
                           'event_time.1', 
                           'write_time.1', 
                           'api_invocation_time.1', 
                           'is_deleted.1', 
                           'customer_id.1', 
                           'purchase_amount',
                           'event_time.2', 
                           'n_days_since_last_purchase',
                           'write_time.2', 
                           'api_invocation_time.2', 
                           'is_deleted.2'], axis=1)

In [None]:
model_df.head(5)

In [None]:
first_column = model_df.pop('is_reordered')
model_df.insert(0, 'is_reordered', first_column)
model_df.head()

In [None]:
model_df.to_csv('../data/retrain/transformed_has_kids.csv', index=False)

### Retrain the XGBoost model with the updated feature group
<a id='retrain-xg-boost' />

Now let's train the model again with this new data set

In [None]:
train_df, validation_df, test_df = np.split(model_df.sample(frac=1, random_state=123), [int(.7*len(model_df)), int(.9*len(model_df))])

In [None]:
train_df.shape

In [None]:
validation_df.shape

In [None]:
test_df.shape
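
The cut points `int(.7*len(model_df))` and `int(.9*len(model_df))` yield roughly a 70/20/10 split into train, validation and test sets. The sketch below reproduces that arithmetic for a given row count; `split_sizes` is a hypothetical helper used only to make the proportions explicit.

```python
def split_sizes(n_rows, train_end=0.7, validation_end=0.9):
    """Return (train, validation, test) row counts for the given cut fractions."""
    first_cut = int(train_end * n_rows)        # rows [0, first_cut) -> train
    second_cut = int(validation_end * n_rows)  # rows [first_cut, second_cut) -> validation
    return first_cut, second_cut - first_cut, n_rows - second_cut

print(split_sizes(1000))  # (700, 200, 100)
```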

Store the train, validation and test data locally, then upload the files to S3.

In [None]:
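# The built-in XGBoost algorithm expects CSV training input without a header row
train_df.to_csv('../data/retrain/train.csv', index=False, header=False)
validation_df.to_csv('../data/retrain/validation.csv', index=False, header=False)
# The test file keeps its header because it is read back with pandas later
test_df.to_csv('../data/retrain/test.csv', index=False)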

In [None]:
s3.Bucket(default_bucket).Object(os.path.join(prefix, 'retrain/train.csv')).upload_file('../data/retrain/train.csv')
s3.Bucket(default_bucket).Object(os.path.join(prefix, 'retrain/validation.csv')).upload_file('../data/retrain/validation.csv')
s3.Bucket(default_bucket).Object(os.path.join(prefix, 'retrain/test.csv')).upload_file('../data/retrain/test.csv')

In [None]:
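# Point each channel at its own file; a shared 'retrain/' prefix would feed all three files to every channel
train_set_location = 's3://{}/{}/retrain/train.csv'.format(default_bucket, prefix)
validation_set_location = 's3://{}/{}/retrain/validation.csv'.format(default_bucket, prefix)
test_set_location = 's3://{}/{}/retrain/test.csv'.format(default_bucket, prefix)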

In [None]:
train_set_pointer = TrainingInput(s3_data=train_set_location, content_type='csv')
validation_set_pointer = TrainingInput(s3_data=validation_set_location, content_type='csv')
test_set_pointer = TrainingInput(s3_data=test_set_location, content_type='csv')

In [None]:
container_uri = sagemaker.image_uris.retrieve(region=boto_session.region_name, 
                                              framework='xgboost', 
                                              version='1.0-1', 
                                              image_scope='training')

In [None]:
xgb = sagemaker.estimator.Estimator(image_uri=container_uri,
                                    role=role, 
                                    instance_count=2, 
                                    instance_type='ml.m5.xlarge',
                                    output_path='s3://{}/{}/model-artifacts'.format(default_bucket, prefix),
                                    sagemaker_session=sagemaker_session,
                                    base_job_name='reorder-classifier')

xgb.set_hyperparameters(objective='binary:logistic',
                        num_round=100)

In [None]:
xgb.fit({'train': train_set_pointer, 'validation': validation_set_pointer})

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=2,
                           instance_type='ml.m5.xlarge')

### Real time inference using the deployed endpoint
<a id='real-time-inference' />

Let's get a record from the test data and test the inference.

In [None]:
csv_serializer = CSVSerializer()
endpoint_name = xgb_predictor.endpoint_name
predictor = Predictor(endpoint_name=endpoint_name, 
                      serializer=csv_serializer)

In [None]:
test_df = pd.read_csv('../data/retrain/test.csv')
record = test_df.sample(1)
record

In [None]:
X = record.values[0]
payload = X[1:]
payload

In [None]:
%%time

predicted_class_prob = predictor.predict(payload).decode('utf-8')
logger.info(f'Predicted class probability {predicted_class_prob}')
if float(predicted_class_prob) < 0.5:
    logger.info('Prediction (y) = Will not reorder')
else:
    logger.info('Prediction (y) = Will reorder')

### Cleanup

Now that we have seen how features can be added to feature groups, it is time to delete unwanted resources, such as endpoints, to avoid incurring charges.

In [None]:
describe_feature_group_result = sagemaker_runtime.describe_feature_group(
    FeatureGroupName=customers_feature_group_name
)
pretty_printer.pprint(describe_feature_group_result)

Delete the model, the endpoint, and the endpoint configuration

In [None]:
response = sagemaker_runtime.describe_endpoint_config(EndpointConfigName=endpoint_name)
model_name = response['ProductionVariants'][0]['ModelName']
model_name

In [None]:
sagemaker_runtime.delete_model(ModelName=model_name)  

In [None]:
sagemaker_runtime.delete_endpoint(EndpointName=endpoint_name)

In [None]:
sagemaker_runtime.delete_endpoint_config(EndpointConfigName=endpoint_name)
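
The cleanup cells above rely on the endpoint configuration having the same name as the endpoint. As a hedged sketch, the names can instead be looked up before deleting anything. `delete_endpoint_resources` is a hypothetical helper; it uses only the boto3 SageMaker client calls `describe_endpoint`, `describe_endpoint_config`, `delete_endpoint`, `delete_endpoint_config` and `delete_model`.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint, its endpoint config, and the models that config references."""
    # Look the names up first; they cannot be described once the resources are gone
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint['EndpointConfigName']
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant['ModelName'] for variant in config['ProductionVariants']]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return config_name, model_names
```

In this notebook the call would be `delete_endpoint_resources(sagemaker_runtime, endpoint_name)`. Any feature groups or S3 objects created earlier are left untouched by this helper.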