# Update FG and Batch Ingestion

All prior notebooks set up our end-to-end solution. With those steps complete, this notebook updates the feature group `cc-agg-batch-fg` by adding a new feature called `name`. It then uses the SageMaker Python SDK's DataFrame `ingest` method to write records that include the new feature to the feature store.

### Recap of what is in place

Here is a recap of what we have done so far:

1. In [notebook 0](./0_prepare_transactions_dataset.ipynb), we generated a synthetic dataset of transactions, including simulated fraud attacks.
2. In [notebook 1](./1_setup.ipynb), we created our two feature groups. In that same notebook, we also created a Kinesis data stream and a Kinesis Data Analytics SQL application that consumes the transaction stream and produces aggregate features. These features are provided in near real time to Lambda, and they look back over a 10-minute window.
3. In [notebook 2](./2_batch_ingestion-chime.ipynb), we used a SageMaker Processing Job to create aggregated features and used them to feed both the training dataset and an online feature group. We also used a Glue interactive session to ingest transaction data into the offline feature store.
4. In [notebook 3](./3_train_and_deploy_model-chime.ipynb), we used the offline feature store to train an XGBoost model that detects fraud, and then deployed it.
5. In [notebook 4](./4_streaming_predictions-chime.ipynb), we sent transactions to the feature store in near real time and predicted whether each one was fraudulent.

In [None]:
!pip install Faker
!pip install --upgrade sagemaker

In [None]:
from botocore.exceptions import ClientError
from collections import defaultdict
from faker import Faker
import pandas as pd
import numpy as np
import sagemaker
import datetime
import hashlib
import random
import boto3
import math
import os
import logging
import subprocess, sys
import importlib

In [None]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

In [None]:
logger.info(f'Using SageMaker version: {sagemaker.__version__}')
logger.info(f'Using Pandas version: {pd.__version__}')

In [None]:
faker = Faker()
faker.seed_locale('en_US', 0)

In [None]:
SEED = 123
random.seed(SEED)
np.random.seed(SEED)
faker.seed_instance(SEED)

In [None]:
TOTAL_UNIQUE_TRANSACTIONS = 5400000 # 5.4 Million
TOTAL_UNIQUE_USERS = 10000

BUCKET = 'sm-fs-demo'

In [None]:
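def generate_fake_name(n: int) -> list:
    """Return exactly n unique fake names, in the order they were generated."""
    # A dict keeps insertion order, so the result is reproducible for a fixed seed
    unique_names = {}
    # Keep drawing until we have n distinct names (duplicates are simply overwritten)
    while len(unique_names) < n:
        unique_names[faker.name()] = None
    return list(unique_names)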

In [None]:
names = generate_fake_name(TOTAL_UNIQUE_USERS)

In [None]:
len(names[0:9000])

In [None]:
name_cut_list = names[0:9000]

In [None]:
assert len(name_cut_list) == 9000 

In [None]:
# inspect a random sample of the generated names
random.sample(names, 5)

In [None]:
from sagemaker.feature_store.feature_group import FeatureGroup
LOCAL_DIR = './data'
BUCKET = 'chime-fs-demo'
PREFIX = 'training'

sagemaker_role = sagemaker.get_execution_role()
s3_client = boto3.Session().client('s3')
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
feature_group_name = 'cc-agg-batch-fg'

In [None]:
boto_session = boto3.Session(region_name=region)
sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)
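# Note: despite its name, this is a second SageMaker control-plane client
# (service 'sagemaker'); it is used later to call update_feature_group
sagemaker_runtime = boto_session.client(service_name='sagemaker', region_name=region)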
feature_store_session = sagemaker.Session(boto_session=boto_session, 
                                          sagemaker_client=sagemaker_client, 
                                          sagemaker_featurestore_runtime_client=featurestore_runtime)


In [None]:
fg = FeatureGroup(name=feature_group_name, sagemaker_session=feature_store_session)  

In [None]:
query = fg.athena_query()
table = query.table_name

In [None]:
query_string = f'SELECT * FROM "{table}"'
%store query_string
query_string

In [None]:
query_results = 'sagemaker-fs-demo'
output_location = f's3://{BUCKET}/{query_results}/query_results/'
print(f'Athena query output location: \n{output_location}')

In [None]:
query.run(query_string=query_string, output_location=output_location)
query.wait()
df = query.as_dataframe()
df.head(5)

In [None]:
# Take an explicit copy so that adding columns below does not act on a slice of df
new_df = df.drop(['trans_time','write_time','api_invocation_time','is_deleted'],axis=1).head(9000).copy()

In [None]:
import time
current_time_sec = int(round(time.time()))
new_df['trans_time'] = pd.Series([current_time_sec] * len(new_df), dtype="float64")

In [None]:
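# Trim the name list to the row count in case the query returned fewer than 9000 rows
new_df['name'] = name_cut_list[:len(new_df)]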

In [None]:
new_df.head(5)

In [None]:
len(new_df)

In [None]:
new_df['name'] = new_df['name'].astype("string")

In [None]:
new_df.dtypes

In [None]:
from datetime import datetime, timezone

def generate_event_timestamp():
    """Return the current UTC time as an ISO-8601 string with millisecond precision."""
    # current time as a timezone-aware UTC datetime
    utc_dt = datetime.now(timezone.utc)
    # transform to ISO-8601 format and use the 'Z' suffix for UTC
    event_time = utc_dt.isoformat(timespec='milliseconds')
    return event_time.replace('+00:00', 'Z')

In [None]:
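# Pass the session explicitly so the feature group uses the clients configured above
feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=feature_store_session)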

In [None]:
logger.info(f'Updating feature group: {feature_group.name} at {generate_event_timestamp()}...')

sagemaker_runtime.update_feature_group(
    FeatureGroupName=feature_group_name,
    FeatureAdditions=[
        {"FeatureName": "name", "FeatureType": "String"}
    ])
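
`update_feature_group` returns before the new feature is necessarily available, so it can help to wait for the update to finish before ingesting. The sketch below polls `describe_feature_group` and checks the `LastUpdateStatus` field of the response. The helper name `wait_for_feature_group_update` and its `poll_seconds` and `timeout_seconds` parameters are assumptions made for illustration and are not part of the original notebook.

```python
import time


def wait_for_feature_group_update(client, feature_group_name,
                                  poll_seconds=15, timeout_seconds=900):
    """Poll DescribeFeatureGroup until the last update succeeds, fails, or times out.

    `client` is a boto3 SageMaker client. The helper name and the polling
    parameters are illustrative assumptions, not part of the original notebook.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.describe_feature_group(FeatureGroupName=feature_group_name)
        # LastUpdateStatus is expected once update_feature_group has been called
        last_update = response.get('LastUpdateStatus', {})
        status = last_update.get('Status')
        if status == 'Successful':
            return response
        if status == 'Failed':
            raise RuntimeError(
                f"Feature group update failed: {last_update.get('FailureReason')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f'Update of {feature_group_name} did not finish in {timeout_seconds}s')
        time.sleep(poll_seconds)
```

In this notebook it could be called as `wait_for_feature_group_update(sagemaker_client, feature_group_name)` right after the update request.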

In [None]:
logger.info(f'Ingesting data into feature group: {feature_group.name} at {generate_event_timestamp()}...')
feature_group.ingest(data_frame=new_df, max_processes=16, wait=True)
logger.info(f'{len(new_df)} sample records ingested into feature group: {feature_group.name} at {generate_event_timestamp()}')
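
For comparison with the DataFrame `ingest` call above, a single row can also be written through the runtime client's `put_record` API, which expects a list of `FeatureName` / `ValueAsString` pairs. The sketch below only builds that list from one row. The helper name `to_put_record_payload` is hypothetical, and skipping `None` and float NaN values is a choice made for this example.

```python
import math


def to_put_record_payload(row):
    """Convert one row (a dict or a pandas Series) into a put_record `Record` list.

    Hypothetical helper for illustration. Values that are None or float NaN
    are skipped, so those features are simply left out of the record.
    """
    record = []
    for name, value in row.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        record.append({'FeatureName': str(name), 'ValueAsString': str(value)})
    return record
```

A row could then be sent with `featurestore_runtime.put_record(FeatureGroupName=feature_group_name, Record=to_put_record_payload(new_df.iloc[0]))`.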

In [None]:
cc_num = '4997379740995969'
logger.info(f'ccnum={cc_num}') 

featurestore_runtime_client = sagemaker_session.boto_session.client('sagemaker-featurestore-runtime', region_name=region)

feature_record = featurestore_runtime_client.get_record(FeatureGroupName=feature_group_name, 
                                                        RecordIdentifierValueAsString=cc_num)
feature_record
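
The `get_record` response holds the features as a list of `FeatureName` / `ValueAsString` pairs under the `Record` key, which is awkward to read when checking that `name` was written. A minimal sketch of flattening it into a dictionary follows. The helper name `record_to_dict` and the sample response are hypothetical.

```python
def record_to_dict(response):
    """Flatten a GetRecord response into {feature_name: value_as_string}.

    Returns an empty dict when the response has no 'Record' key, which is
    what get_record returns if no record matches the identifier.
    """
    return {item['FeatureName']: item['ValueAsString']
            for item in response.get('Record', [])}


# Hand-written sample response, for illustration only (not real output)
sample_response = {
    'Record': [
        {'FeatureName': 'cc_num', 'ValueAsString': '4997379740995969'},
        {'FeatureName': 'name', 'ValueAsString': 'Jane Doe'},
    ]
}
print(record_to_dict(sample_response))
```

With the real response, `record_to_dict(feature_record).get('name')` returns the new feature's value, or `None` if the record does not contain it yet.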

In [None]:
# feature_group.delete()