# Setup

**NOTE:** Before running this notebook, be sure to set the stack name in the first code cell to match the name of the CloudFormation stack you used to create this notebook instance. If you used the default stack name, you should not need to make any updates.

This notebook performs the following setup actions for this example use of Amazon SageMaker Feature Store:

1. Create online-only feature groups
2. Create an Amazon Kinesis data stream
3. Create an Amazon Kinesis Data Analytics (KDA) application

### Get ARNs of Lambda functions from CloudFormation stack outputs
1. `IngestLambdaFunctionARN`
2. `PredictLambdaFunctionARN`
3. `PredictLambdaFunctionName`

In [17]:
!pip install sagemaker==2.19.0

Processing /home/ec2-user/.cache/pip/wheels/3f/5f/a2/ae75ce6341001c1605ccc4675808a04f5eb1d0b441ffec4b88/sagemaker-2.19.0-py2.py3-none-any.whl
Collecting boto3>=1.16.32
  Downloading boto3-1.16.33.tar.gz (97 kB)
Collecting smdebug-rulesconfig>=1.0.0
  Using cached smdebug_rulesconfig-1.0.0-py3-none-any.whl (14 kB)
Collecting botocore<1.20.0,>=1.19.33
  Using cached botocore-1.19.33-py2.py3-none-any.whl (7.0 MB)
Building wheels for collected packages: boto3
  Building wheel for boto3 (setup.py) ... done
  Created wheel for boto3: filename=boto3-1.16.33-py2.py3-none-any.whl size=128452 sha256=2a67d9c7b00e5be6467738f9050225988c3d5051bf683378ce77fffacf5f3c40
  Stored in directory: /home/ec2-user/.cache/pip/wheels/8d/78/40/33932039a5c4c45176211822d559f2c6d20f5ef1cbbce39aa3
Successfully built boto3
ERROR: awscli 1.18.169 has requirement botocore==1.19.9, but you'll have botocore 1.19.33 which is incompatib

In [18]:
STACK_NAME = 'sm-fs-streaming-agg-stack' # if you're not using the default stack name, replace this
%store STACK_NAME

Stored 'STACK_NAME' (str)


In [19]:
import sys
import boto3

cf_client = boto3.client('cloudformation')

try:
    outputs = cf_client.describe_stacks(StackName=STACK_NAME)['Stacks'][0]['Outputs']
    for o in outputs:
        if o['OutputKey'] == 'IngestLambdaFunctionARN':
            lambda_to_fs_arn = o['OutputValue']
        if o['OutputKey'] == 'PredictLambdaFunctionARN':
            lambda_to_model_arn = o['OutputValue']
        if o['OutputKey'] == 'PredictLambdaFunctionName':
            predict_lambda_name = o['OutputValue']

except cf_client.exceptions.ClientError:
    # describe_stacks raises ClientError when the stack cannot be found.
    msg = f'CloudFormation stack {STACK_NAME} was not found. Please set STACK_NAME correctly and re-run this cell.'
    sys.exit(msg)
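
The same lookup can be sketched by first flattening the `Outputs` list into a dictionary, so that a missing output key fails loudly instead of leaving a variable undefined. The helper names `stack_outputs_to_dict` and `require_outputs` and the sample values are illustrative assumptions; only the `OutputKey`/`OutputValue` shape and the three key names come from the cell above.

```python
def stack_outputs_to_dict(outputs):
    """Flatten a CloudFormation Outputs list into {OutputKey: OutputValue}."""
    return {o['OutputKey']: o['OutputValue'] for o in outputs}


def require_outputs(outputs, required_keys):
    """Return the values for required_keys, raising KeyError if any are missing."""
    by_key = stack_outputs_to_dict(outputs)
    missing = [k for k in required_keys if k not in by_key]
    if missing:
        raise KeyError(f'Stack outputs missing: {missing}')
    return [by_key[k] for k in required_keys]


# Hypothetical sample shaped like describe_stacks(...)['Stacks'][0]['Outputs'].
sample_outputs = [
    {'OutputKey': 'IngestLambdaFunctionARN', 'OutputValue': 'example-ingest-arn'},
    {'OutputKey': 'PredictLambdaFunctionARN', 'OutputValue': 'example-predict-arn'},
    {'OutputKey': 'PredictLambdaFunctionName', 'OutputValue': 'example-predict'},
]

fs_arn, model_arn, predict_name = require_outputs(
    sample_outputs,
    ['IngestLambdaFunctionARN', 'PredictLambdaFunctionARN', 'PredictLambdaFunctionName'],
)
print(fs_arn, model_arn, predict_name)
```

With a real stack, the list passed in would be the `outputs` variable from the cell above.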

In [20]:
print(f'lambda_to_model_arn: {lambda_to_model_arn}')
print(f'lambda_to_fs_arn: {lambda_to_fs_arn}')
print(f'predict_lambda_name: {predict_lambda_name}')

lambda_to_model_arn: arn:aws:lambda:us-east-1:835319576252:function:InvokeFraudEndpointLambda
lambda_to_fs_arn: arn:aws:lambda:us-east-1:835319576252:function:StreamingIngestAggFeatures
predict_lambda_name: InvokeFraudEndpointLambda


In [21]:
%store lambda_to_model_arn

Stored 'lambda_to_model_arn' (str)


In [22]:
%store predict_lambda_name

Stored 'predict_lambda_name' (str)


In [23]:
# to get the latest sagemaker python sdk
#!pip install -U sagemaker

In [24]:
from IPython.display import display_html

def restartkernel():
    display_html("<script>Jupyter.notebook.kernel.restart()</script>", raw=True)

#restartkernel()

### Imports and other setup

In [25]:
from sagemaker import get_execution_role
import sagemaker
import boto3
import json

role = get_execution_role()
sm = boto3.Session().client(service_name='sagemaker')
smfs_runtime = boto3.Session().client(service_name='sagemaker-featurestore-runtime')

## Create online-only feature groups
When using Amazon SageMaker Feature Store, a core design decision is the definition of feature groups. For our credit card fraud detection use case, we have decided to use two of them:

1. `cc-agg-fg` - holds aggregate features that will be updated in near real-time (streaming ingestion)
2. `cc-agg-batch-fg` - holds aggregate features that will be updated in batch

Establishing a feature group is a one-time step and is done using the `CreateFeatureGroup` API. 

Feature groups can be created as **online-only**, **offline-only**, or both **online and offline**. The last option replicates updates from the online store to an offline store in Amazon S3. Since this example focuses on using the feature store for online inference and streaming aggregation of features, we make each of our feature groups online-only.

In addition to a feature group name, we provide metadata about each feature in the group. We use a JSON file to define the schema, although this is not a requirement. A schema file captures the feature group definitions so that you can recreate them consistently as you move from a development environment to a test or production environment. Our schema file also names the record identifier and the event timestamp. Every feature group must have these two features, but you decide what to call them.
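
As a rough illustration of that requirement, the sketch below checks that a schema names a record identifier and an event time feature that both appear in its feature list. The key names match the ones the utility functions in this notebook read; the `check_schema` helper and the abbreviated example schema are assumptions made for illustration, not the contents of the actual schema files.

```python
import json


def check_schema(schema):
    """Verify that the record identifier and event time features are defined."""
    names = {f['name'] for f in schema['features']}
    for key in ('record_identifier_feature_name', 'event_time_feature_name'):
        if schema[key] not in names:
            raise ValueError(f"{key} refers to undefined feature '{schema[key]}'")
    return True


# Abbreviated, hypothetical schema in the same shape as the schema files.
example_schema = json.loads('''
{
    "description": "Example aggregated features",
    "features": [
        {"name": "cc_num", "type": "bigint"},
        {"name": "avg_amt_last_10m", "type": "double"},
        {"name": "trans_time", "type": "double"}
    ],
    "record_identifier_feature_name": "cc_num",
    "event_time_feature_name": "trans_time"
}
''')

print(check_schema(example_schema))
```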

Here is a visual summary of the feature groups we will create below.

<img src="./images/feature_groups.png" />

#### cc-agg-fg schema

In [26]:
!pygmentize schema/cc-agg-fg-schema.json

{
    "description": "Aggregated features for each credit card, batch ingestion nightly",
    "features": [
          {
              "name": "cc_num",
              "type": "bigint",
              "description": "Credit Card Number (Unique)"
          },
          {
              "name": "num_trans_last_10m",
              "type": "bigint",
              "description": "Aggregated Metric: Average number of transactions for the card aggregated by past 10 minutes"
          },
          {
              "name": "avg_amt_last_10m",
              "type": "double",
              "description": "Aggregated Metric: Average transaction amount for the card agg

#### cc-agg-batch-fg schema

In [27]:
!pygmentize schema/cc-agg-batch-fg-schema.json

{
    "description": "Aggregated features for each credit card, streamed intraday",

    "features": [
          {
              "name": "cc_num",
              "type": "bigint",
              "description": "Credit Card Number (Unique)"
          },
          {
              "name": "num_trans_last_1w",
              "type": "bigint",
              "description": "Aggregated Metric: Average number of transactions for the card aggregated by past 1 week"
          },
          {
              "name": "avg_amt_last_1w",
              "type": "double",
              "description": "Aggregated Metric: Average transaction amount for the card aggregate

#### Utility functions to simplify creation of feature groups
`schema_to_defs` takes our schema file and returns the feature definitions along with the names of the record identifier and event time features.

In [28]:
def schema_to_defs(filename):
    # Load the schema file; the context manager closes the file handle.
    with open(filename) as f:
        schema = json.load(f)

    feature_definitions = []

    # The schema files use a lowercase 'features' key.
    for col in schema['features']:
        feature = {'FeatureName': col['name']}
        # Map schema types to Feature Store types; anything else becomes a String.
        if col['type'] == 'double':
            feature['FeatureType'] = 'Fractional'
        elif col['type'] == 'bigint':
            feature['FeatureType'] = 'Integral'
        else:
            feature['FeatureType'] = 'String'
        feature_definitions.append(feature)

    return feature_definitions, schema['record_identifier_feature_name'], schema['event_time_feature_name']
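
The type mapping in that function can also be written as a lookup table, which may be easier to extend if more schema types are added. This is only a sketch of the same rule (`double` to `Fractional`, `bigint` to `Integral`, anything else to `String`); the names `TYPE_MAP` and `columns_to_feature_definitions` and the example columns are hypothetical.

```python
# Schema type -> Feature Store FeatureType; unknown types fall back to 'String'.
TYPE_MAP = {'double': 'Fractional', 'bigint': 'Integral'}


def columns_to_feature_definitions(columns):
    """Convert schema column entries into FeatureDefinitions entries."""
    return [
        {'FeatureName': col['name'],
         'FeatureType': TYPE_MAP.get(col['type'], 'String')}
        for col in columns
    ]


# Hypothetical columns in the same shape as the schema's feature list.
example_columns = [
    {'name': 'cc_num', 'type': 'bigint'},
    {'name': 'avg_amt_last_10m', 'type': 'double'},
    {'name': 'merchant', 'type': 'string'},
]

example_defs = columns_to_feature_definitions(example_columns)
print(example_defs)
```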

`create_feature_group_from_schema` creates a feature group from a schema file. If no S3 URI is passed, an online-only feature group is created.

In [29]:
from sagemaker import get_execution_role
import json

def create_feature_group_from_schema(filename, fg_name, role_arn=None, s3_uri=None):
    # Load the schema file; the context manager closes the file handle.
    with open(filename) as f:
        schema = json.load(f)
    
    feature_defs = []
    
    for col in schema['features']:
        feature = {'FeatureName': col['name']}
        if col['type'] == 'double':
            feature['FeatureType'] = 'Fractional'
        elif col['type'] == 'bigint':
            feature['FeatureType'] = 'Integral'
        else:
            feature['FeatureType'] = 'String'
        feature_defs.append(feature)

    record_identifier_name = schema['record_identifier_feature_name']
    event_time_name = schema['event_time_feature_name']

    if role_arn is None:
        role_arn = get_execution_role()

    if s3_uri is None:
        offline_config = {}
    else:
        offline_config = {'OfflineStoreConfig': {'S3StorageConfig': {'S3Uri': s3_uri}}}
        
    sm.create_feature_group(
        FeatureGroupName = fg_name,
        RecordIdentifierFeatureName = record_identifier_name,
        EventTimeFeatureName = event_time_name,
        FeatureDefinitions = feature_defs,
        Description = schema['description'],
        Tags = schema['tags'],
        OnlineStoreConfig = {'EnableOnlineStore': True},
        RoleArn = role_arn,
        **offline_config)
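
To make the online-only versus online-and-offline choice easier to inspect, the request can be assembled separately from the API call. The sketch below builds only the keyword arguments and calls nothing in AWS; `build_create_fg_kwargs` and the placeholder role ARN and bucket are hypothetical, while the parameter names are the ones passed to `create_feature_group` above.

```python
def build_create_fg_kwargs(fg_name, feature_defs, record_id_name,
                           event_time_name, role_arn, s3_uri=None):
    """Assemble create_feature_group arguments without calling the API."""
    kwargs = {
        'FeatureGroupName': fg_name,
        'RecordIdentifierFeatureName': record_id_name,
        'EventTimeFeatureName': event_time_name,
        'FeatureDefinitions': feature_defs,
        'OnlineStoreConfig': {'EnableOnlineStore': True},
        'RoleArn': role_arn,
    }
    # An offline store is requested only when an S3 URI is supplied.
    if s3_uri is not None:
        kwargs['OfflineStoreConfig'] = {'S3StorageConfig': {'S3Uri': s3_uri}}
    return kwargs


example_defs_for_fg = [{'FeatureName': 'cc_num', 'FeatureType': 'Integral'},
                       {'FeatureName': 'trans_time', 'FeatureType': 'Fractional'}]

online_only = build_create_fg_kwargs(
    'cc-agg-fg', example_defs_for_fg, 'cc_num', 'trans_time', 'example-role-arn')
online_and_offline = build_create_fg_kwargs(
    'cc-agg-fg', example_defs_for_fg, 'cc_num', 'trans_time', 'example-role-arn',
    s3_uri='s3://example-bucket/feature-store')

print('OfflineStoreConfig' in online_only)
print('OfflineStoreConfig' in online_and_offline)
```

The resulting dictionary could then be passed as `sm.create_feature_group(**online_only)`.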

#### Create the two feature groups

In [30]:
create_feature_group_from_schema('schema/cc-agg-fg-schema.json', 'cc-agg-fg')

In [31]:
create_feature_group_from_schema('schema/cc-agg-batch-fg-schema.json', 'cc-agg-batch-fg')

#### Show that the feature store is aware of the new feature groups

In [32]:
sm.list_feature_groups()

{'FeatureGroupSummaries': [{'FeatureGroupName': 'transaction-feature-group-02-02-21-52',
   'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:835319576252:feature-group/transaction-feature-group-02-02-21-52',
   'CreationTime': datetime.datetime(2020, 12, 2, 2, 21, 54, 95000, tzinfo=tzlocal()),
   'FeatureGroupStatus': 'Created',
   'OfflineStoreStatus': {'Status': 'Active'}},
  {'FeatureGroupName': 'identity-feature-group-02-02-21-52',
   'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:835319576252:feature-group/identity-feature-group-02-02-21-52',
   'CreationTime': datetime.datetime(2020, 12, 2, 2, 21, 52, 666000, tzinfo=tzlocal()),
   'FeatureGroupStatus': 'Created',
   'OfflineStoreStatus': {'Status': 'Active'}},
  {'FeatureGroupName': 'cc-agg-fg',
   'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:835319576252:feature-group/cc-agg-fg',
   'CreationTime': datetime.datetime(2020, 12, 10, 19, 56, 5, 226000, tzinfo=tzlocal()),
   'FeatureGroupStatus': 'Creating'},
  {'FeatureGroupName'

#### Describe each feature group
Note that each feature group gets its own ARN, allowing you to manage IAM policies that control access to individual feature groups. The feature names and types are displayed, and the record identifier and event time features are called out specifically. Notice that there is only an `OnlineStoreConfig` and no `OfflineStoreConfig`, as we have decided not to replicate features offline for these groups.

In [33]:
sm.describe_feature_group(FeatureGroupName='cc-agg-fg')

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:835319576252:feature-group/cc-agg-fg',
 'FeatureGroupName': 'cc-agg-fg',
 'RecordIdentifierFeatureName': 'cc_num',
 'EventTimeFeatureName': 'trans_time',
 'FeatureDefinitions': [{'FeatureName': 'cc_num', 'FeatureType': 'Integral'},
  {'FeatureName': 'num_trans_last_10m', 'FeatureType': 'Integral'},
  {'FeatureName': 'avg_amt_last_10m', 'FeatureType': 'Fractional'},
  {'FeatureName': 'trans_time', 'FeatureType': 'Fractional'}],
 'CreationTime': datetime.datetime(2020, 12, 10, 19, 56, 5, 226000, tzinfo=tzlocal()),
 'OnlineStoreConfig': {'EnableOnlineStore': True},
 'RoleArn': 'arn:aws:iam::835319576252:role/service-role/AmazonSageMaker-ExecutionRole-20191006T135881',
 'FeatureGroupStatus': 'Creating',
 'Description': 'Aggregated features for each credit card, batch ingestion nightly',
 'ResponseMetadata': {'RequestId': 'fdf8cf0e-61e3-45b2-8cd0-22c75c1b0433',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'fdf8cf0e-61e3-45b2

In [34]:
sm.describe_feature_group(FeatureGroupName='cc-agg-batch-fg')

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:835319576252:feature-group/cc-agg-batch-fg',
 'FeatureGroupName': 'cc-agg-batch-fg',
 'RecordIdentifierFeatureName': 'cc_num',
 'EventTimeFeatureName': 'trans_time',
 'FeatureDefinitions': [{'FeatureName': 'cc_num', 'FeatureType': 'Integral'},
  {'FeatureName': 'num_trans_last_1w', 'FeatureType': 'Integral'},
  {'FeatureName': 'avg_amt_last_1w', 'FeatureType': 'Fractional'},
  {'FeatureName': 'trans_time', 'FeatureType': 'Fractional'}],
 'CreationTime': datetime.datetime(2020, 12, 10, 19, 56, 8, 580000, tzinfo=tzlocal()),
 'OnlineStoreConfig': {'EnableOnlineStore': True},
 'RoleArn': 'arn:aws:iam::835319576252:role/service-role/AmazonSageMaker-ExecutionRole-20191006T135881',
 'FeatureGroupStatus': 'Creating',
 'Description': 'Aggregated features for each credit card, streamed intraday',
 'ResponseMetadata': {'RequestId': '619f67b0-c511-41af-bf47-b74847f53407',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '619f67b0-c511-
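
Both responses above report a `FeatureGroupStatus` of `Creating`, so records should not be written until the status becomes `Created`. A minimal polling sketch follows. The `wait_for_feature_group` helper, its defaults, and the stand-in describe function are assumptions for illustration; in the notebook, `sm.describe_feature_group` would be passed as the `describe` argument.

```python
import time


def wait_for_feature_group(describe, fg_name, poll_seconds=5, max_polls=60,
                           sleep=time.sleep):
    """Poll describe(FeatureGroupName=...) until the status is 'Created'."""
    for _ in range(max_polls):
        status = describe(FeatureGroupName=fg_name)['FeatureGroupStatus']
        if status == 'Created':
            return status
        if status != 'Creating':
            # Any other status (for example 'CreateFailed') will not recover.
            raise RuntimeError(f'Feature group {fg_name} is in status {status}')
        sleep(poll_seconds)
    raise TimeoutError(f'Feature group {fg_name} was not created in time')


# Stand-in for sm.describe_feature_group so the sketch runs without AWS access.
_statuses = iter(['Creating', 'Creating', 'Created'])

def fake_describe(FeatureGroupName):
    return {'FeatureGroupName': FeatureGroupName,
            'FeatureGroupStatus': next(_statuses)}


final_status = wait_for_feature_group(fake_describe, 'cc-agg-fg',
                                      sleep=lambda seconds: None)
print(final_status)
```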

## Create an Amazon Kinesis Data Stream

In [35]:
kinesis_client = boto3.client('kinesis')

In [36]:
kinesis_client.create_stream(StreamName='cc-stream', ShardCount=1)

{'ResponseMetadata': {'RequestId': 'e7d318c0-396b-9507-bf9f-675c177a49d9',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'e7d318c0-396b-9507-bf9f-675c177a49d9',
   'x-amz-id-2': 'xOVfcXh4MRESdOJfAv9cUhJQlNix80/r1IG/TeeavmoJcjp9Tue0KkldV3zpFoFmvla5M/QC0v/qKc43pPqY4O8+r0tDxYvH',
   'date': 'Thu, 10 Dec 2020 19:56:12 GMT',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0'},
  'RetryAttempts': 0}}

In [37]:
kinesis_client.list_streams()

{'StreamNames': ['cc-stream', 'dsoaws-kinesis-data-stream'],
 'HasMoreStreams': False,
 'ResponseMetadata': {'RequestId': 'd9ded67f-2106-145d-8192-a9e20f17c883',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'd9ded67f-2106-145d-8192-a9e20f17c883',
   'x-amz-id-2': 'zBikihIQtNE28N+tzudEL4ep7ZxLjt13rja99VGtVDidvuJNQtYcB4WxFUIfP7CM2Q16EASFNrDzPNnOiTIUIurlHLA2hzTE',
   'date': 'Thu, 10 Dec 2020 19:56:13 GMT',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '81'},
  'RetryAttempts': 0}}

In [38]:
kinesis_client.describe_stream(StreamName='cc-stream')

{'StreamDescription': {'StreamName': 'cc-stream',
  'StreamARN': 'arn:aws:kinesis:us-east-1:835319576252:stream/cc-stream',
  'StreamStatus': 'CREATING',
  'Shards': [],
  'HasMoreShards': False,
  'RetentionPeriodHours': 24,
  'StreamCreationTimestamp': datetime.datetime(2020, 12, 10, 19, 56, 12, tzinfo=tzlocal()),
  'EnhancedMonitoring': [{'ShardLevelMetrics': []}],
  'EncryptionType': 'NONE'},
 'ResponseMetadata': {'RequestId': 'cd27e41d-5832-969e-956b-9b8076234a40',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'cd27e41d-5832-969e-956b-9b8076234a40',
   'x-amz-id-2': 'VoGUqiteRPxViDtXguajhJy6hURCV0BHG4sRuuw7mDGYckVG1PVSZWTilPn5kyfh3kCZBaT1lti0cE9Pw0JY+1Zu+lUIBQfG',
   'date': 'Thu, 10 Dec 2020 19:56:13 GMT',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '316'},
  'RetryAttempts': 0}}

In [39]:
import time

# Poll until the stream leaves the CREATING state.
while True:
    status = kinesis_client.describe_stream(StreamName='cc-stream')['StreamDescription']['StreamStatus']
    if status == 'ACTIVE':
        print('ACTIVE')
        break
    if status != 'CREATING':
        # Avoid spinning forever on an unexpected status such as DELETING.
        raise RuntimeError(f'Unexpected stream status: {status}')
    print('Waiting for the Kinesis stream to become active...')
    time.sleep(20)

Waiting for the Kinesis stream to become active...
ACTIVE
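
As an alternative to a hand-written loop, boto3 provides a `stream_exists` waiter for Kinesis that polls `describe_stream` until the stream is `ACTIVE`. The sketch below wraps it in a hypothetical helper, `wait_until_stream_exists`, and exercises it with a stand-in client so that it runs without AWS credentials. In the notebook, `kinesis_client` would be passed instead. The delay and attempt values are illustrative choices.

```python
def wait_until_stream_exists(client, stream_name, delay=10, max_attempts=18):
    """Block until the stream is ACTIVE using the client's built-in waiter."""
    waiter = client.get_waiter('stream_exists')
    waiter.wait(StreamName=stream_name,
                WaiterConfig={'Delay': delay, 'MaxAttempts': max_attempts})


# Stand-in objects that record calls; they are not part of boto3.
class _FakeWaiter:
    def __init__(self):
        self.calls = []

    def wait(self, **kwargs):
        self.calls.append(kwargs)


class _FakeKinesisClient:
    def __init__(self):
        self.waiter = _FakeWaiter()
        self.requested = []

    def get_waiter(self, name):
        self.requested.append(name)
        return self.waiter


fake_client = _FakeKinesisClient()
wait_until_stream_exists(fake_client, 'cc-stream')
print(fake_client.requested, fake_client.waiter.calls)
```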


In [40]:
stream_arn = kinesis_client.describe_stream(StreamName='cc-stream')['StreamDescription']['StreamARN']

In [41]:
stream_arn

'arn:aws:kinesis:us-east-1:835319576252:stream/cc-stream'

## Map the Kinesis stream as an event source for Lambda fraud detection

In [42]:
lambda_client = boto3.client('lambda')

lambda_client.create_event_source_mapping(EventSourceArn=stream_arn,
                                          FunctionName=lambda_to_model_arn,
                                          StartingPosition='LATEST',
                                          Enabled=True,
                                          MaximumRecordAgeInSeconds=60*10
                                          ) #DestinationConfig would handle discarded records

{'ResponseMetadata': {'RequestId': 'c8ce7b0e-d352-4ec5-a77f-e7e69486b05d',
  'HTTPStatusCode': 202,
  'HTTPHeaders': {'date': 'Thu, 10 Dec 2020 19:56:33 GMT',
   'content-type': 'application/json',
   'content-length': '711',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'c8ce7b0e-d352-4ec5-a77f-e7e69486b05d'},
  'RetryAttempts': 0},
 'UUID': '2bf7d8d7-aed9-4b04-ad32-f858e356437f',
 'StartingPosition': 'LATEST',
 'BatchSize': 100,
 'MaximumBatchingWindowInSeconds': 0,
 'ParallelizationFactor': 1,
 'EventSourceArn': 'arn:aws:kinesis:us-east-1:835319576252:stream/cc-stream',
 'FunctionArn': 'arn:aws:lambda:us-east-1:835319576252:function:InvokeFraudEndpointLambda',
 'LastModified': datetime.datetime(2020, 12, 10, 19, 56, 33, 787000, tzinfo=tzlocal()),
 'LastProcessingResult': 'No records processed',
 'State': 'Creating',
 'StateTransitionReason': 'User action',
 'DestinationConfig': {'OnFailure': {}},
 'MaximumRecordAgeInSeconds': 600,
 'BisectBatchOnFunctionError': False,
 'Maxi
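
With this mapping in place, Lambda invokes the function with batches of stream records whose payloads arrive base64-encoded under `Records[].kinesis.data`. The sketch below shows how a handler might decode them, assuming the payloads are JSON. It is not the code of `InvokeFraudEndpointLambda`, which is not shown in this notebook; `decode_kinesis_records` and the sample event are hypothetical.

```python
import base64
import json


def decode_kinesis_records(event):
    """Return the JSON payloads carried by a Kinesis-triggered Lambda event."""
    payloads = []
    for record in event.get('Records', []):
        raw = base64.b64decode(record['kinesis']['data'])
        payloads.append(json.loads(raw))
    return payloads


# Hypothetical event containing one transaction, trimmed to the fields used here.
sample_payload = {'cc_num': 1234, 'merchant': 'example', 'amount': 12.5, 'zip_code': 12345}
sample_event = {
    'Records': [
        {'kinesis': {'data': base64.b64encode(
            json.dumps(sample_payload).encode('utf-8')).decode('ascii')}}
    ]
}

decoded = decode_kinesis_records(sample_event)
print(decoded)
```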

## Create an Amazon Kinesis Data Analytics (KDA) application

In [43]:
kda_client = boto3.client('kinesisanalytics')

In [44]:
sql_code = 'CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (\n' + \
                '"cc_num"              BIGINT,\n' + \
                '"num_trans_last_10m"  SMALLINT,\n' + \
                '"avg_amt_last_10m"    REAL\n);\n\n' + \
            'CREATE OR REPLACE PUMP "STREAM_PUMP" AS\n' + \
            'INSERT INTO "DESTINATION_SQL_STREAM"\n' + \
                'SELECT STREAM "cc_num", \n' + \
                    'COUNT(*) OVER LAST_10_MINUTES, \n' + \
                    'AVG("amount") OVER LAST_10_MINUTES\n' + \
                    'FROM "SOURCE_SQL_STREAM_001"\n' + \
                    'WINDOW LAST_10_MINUTES AS (\n' + \
                        'PARTITION BY "cc_num"\n' + \
                        'RANGE INTERVAL \'10\' MINUTE PRECEDING);\n'
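
To make the intent of the windowed query concrete, here is a small pure-Python emulation: for every incoming transaction it emits the count and average amount over that card's transactions in the preceding 10 minutes. It is only an approximation for intuition. The `sliding_aggregates` helper, the integer timestamps in seconds, and the treatment of the window boundary as inclusive are all assumptions, and KDA's own row-time handling may differ.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10 * 60


def sliding_aggregates(events):
    """events: (timestamp_seconds, cc_num, amount) tuples sorted by timestamp.

    Returns one (cc_num, num_trans_last_10m, avg_amt_last_10m) row per event.
    """
    windows = defaultdict(deque)  # one window per card (PARTITION BY "cc_num")
    rows = []
    for ts, cc_num, amount in events:
        window = windows[cc_num]
        window.append((ts, amount))
        # Drop transactions that fall outside the 10-minute range.
        while window[0][0] < ts - WINDOW_SECONDS:
            window.popleft()
        count = len(window)
        average = sum(a for _, a in window) / count
        rows.append((cc_num, count, average))
    return rows


# Hypothetical transactions for two cards.
example_events = [
    (0, 1111, 10.0),
    (60, 1111, 30.0),
    (120, 2222, 5.0),
    (700, 1111, 50.0),
]

example_rows = sliding_aggregates(example_events)
for row in example_rows:
    print(row)
```

The last row counts a single transaction for card `1111` because the first two are more than 10 minutes old by then.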

In [45]:
kda_inputs = [{
                'NamePrefix': 'SOURCE_SQL_STREAM',
                'KinesisStreamsInput': {
                       'ResourceARN': stream_arn,
                       'RoleARN': role
                },
                'InputSchema': {
                      'RecordFormat': {
                          'RecordFormatType': 'JSON',
                          'MappingParameters': {
                              'JSONMappingParameters': {
                                  'RecordRowPath': '$'
                              }
                          },
                      },
                      'RecordEncoding': 'UTF-8',
                      'RecordColumns': [
                          {'Name': 'cc_num',  'Mapping': '$.cc_num',   'SqlType': 'DECIMAL(1,1)'},
                          {'Name': 'merchant','Mapping': '$.merchant', 'SqlType': 'VARCHAR(64)'},
                          {'Name': 'amount', 'Mapping': '$.amount', 'SqlType': 'REAL'},
                          {'Name': 'zip_code', 'Mapping': '$.zip_code', 'SqlType': 'INTEGER'}
                      ]
                }
              }                         
             ]
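
For reference, a record that matches this input schema is a JSON object with `cc_num`, `merchant`, `amount`, and `zip_code` fields. The sketch below builds the keyword arguments for a Kinesis `put_record` call without sending anything. The `make_put_record_kwargs` helper, the sample values, and the choice of the card number as partition key are illustrative assumptions.

```python
import json


def make_put_record_kwargs(stream_name, cc_num, merchant, amount, zip_code):
    """Build put_record arguments for one transaction (nothing is sent)."""
    payload = {
        'cc_num': cc_num,
        'merchant': merchant,
        'amount': amount,
        'zip_code': zip_code,
    }
    return {
        'StreamName': stream_name,
        'Data': json.dumps(payload).encode('utf-8'),
        # Assumption: partition by card so a card's records share a shard.
        'PartitionKey': str(cc_num),
    }


example_record = make_put_record_kwargs('cc-stream', 1234, 'example-merchant', 42.5, 12345)
print(example_record)
```

With a live stream, the result could be sent as `kinesis_client.put_record(**example_record)`.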

<h3> Create Kinesis Data Analytics Application </h3>

Using the Lambda ARNs retrieved earlier from the CloudFormation stack outputs, we create a Kinesis Data Analytics application whose output is connected to the streaming ingestion Lambda function. That function ingests the records and writes them to the SageMaker feature group.

In [46]:
kda_outputs = [{'LambdaOutput': {'ResourceARN': lambda_to_fs_arn, 'RoleARN': role},
                'Name': 'DESTINATION_SQL_STREAM',
                'DestinationSchema': {'RecordFormatType': 'JSON'}}]

In [47]:
kda_outputs

[{'LambdaOutput': {'ResourceARN': 'arn:aws:lambda:us-east-1:835319576252:function:StreamingIngestAggFeatures',
   'RoleARN': 'arn:aws:iam::835319576252:role/service-role/AmazonSageMaker-ExecutionRole-20191006T135881'},
  'Name': 'DESTINATION_SQL_STREAM',
  'DestinationSchema': {'RecordFormatType': 'JSON'}}]

In [48]:
kda_client.create_application(ApplicationName='cc-agg-app', 
                              Inputs=kda_inputs,
                              Outputs=kda_outputs,
                              ApplicationCode=sql_code)

InvalidArgumentException: An error occurred (InvalidArgumentException) when calling the CreateApplication operation: Kinesis Analytics service doesn't have sufficient privileges to assume the role: arn:aws:iam::835319576252:role/service-role/AmazonSageMaker-ExecutionRole-20191006T135881. Please check the role provided.
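
The error above says that Kinesis Data Analytics cannot assume the SageMaker execution role. One plausible cause, assumed here rather than confirmed by the output, is that the role's trust policy does not list the `kinesisanalytics.amazonaws.com` service principal. The sketch below adds a service principal to a trust policy document in memory; `add_trusted_service` and the sample policy are hypothetical, and applying the result to the role (for example through IAM) is not shown.

```python
import copy


def add_trusted_service(trust_policy, service):
    """Return a copy of trust_policy that also lets `service` assume the role."""
    policy = copy.deepcopy(trust_policy)
    for statement in policy['Statement']:
        if statement.get('Effect') == 'Allow' and statement.get('Action') == 'sts:AssumeRole':
            principal = statement.setdefault('Principal', {})
            services = principal.get('Service', [])
            if isinstance(services, str):
                services = [services]
            if service not in services:
                services = services + [service]
            principal['Service'] = services
            return policy
    # No suitable statement found, so append a new one.
    policy['Statement'].append({
        'Effect': 'Allow',
        'Principal': {'Service': [service]},
        'Action': 'sts:AssumeRole',
    })
    return policy


# Hypothetical trust policy that only trusts SageMaker.
existing_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'sagemaker.amazonaws.com'},
        'Action': 'sts:AssumeRole',
    }],
}

updated_policy = add_trusted_service(existing_policy, 'kinesisanalytics.amazonaws.com')
print(updated_policy['Statement'][0]['Principal']['Service'])
```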

In [None]:
kda_client.describe_application(ApplicationName='cc-agg-app')

In [None]:
kda_client.start_application(ApplicationName='cc-agg-app',
                             InputConfigurations=[{'Id': '1.1',
                                                   'InputStartingPositionConfiguration': 
                                                     {'InputStartingPosition':'NOW'}}])