# Batch Ingestion
**This notebook aggregates raw features into new derived features that are used for Fraud Detection model training and inference.**

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Create PySpark Processing Script](#Create-PySpark-Processing-Script)
1. [Run SageMaker Processing Job](#Run-SageMaker-Processing-Job)
1. [Explore Aggregated Features](#Explore-Aggregated-Features)
1. [Validate Feature Group for Records](#Validate-Feature-Group-for-Records)
1. [Create Transaction Feature Group for Training](#Create-Transaction-Feature-Group-for-Training)

### Background

- This notebook takes the raw credit card transaction data (CSV) generated by [notebook 0](./0_prepare_transactions_dataset.ipynb) and aggregates the raw features to create new features (ratios) via a <b>SageMaker Processing</b> PySpark job. These aggregated features, alongside the original raw features, are used in the training phase of a Credit Card Fraud Detection model in the next step (see [notebook 3](./3_train_and_deploy_model.ipynb)).

- As part of the Spark job, we also select the latest weekly aggregated features, `num_trans_last_1w` and `avg_amt_last_1w`, for each `cc_num` (credit card number) and write them to the <b>SageMaker Online Feature Store</b> as a feature group. This feature group (`cc-agg-batch-fg`) was created in [notebook 1](./1_setup.ipynb).

- [Amazon SageMaker Processing](https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-sagemaker-processing-now-supports-built-in-spark-containers-for-big-data-processing/) lets customers run analytics jobs for data engineering and model evaluation on Amazon SageMaker easily and at scale. It provides a fully managed Spark environment for data processing or feature engineering workloads.

<img src="./images/batch_ingestion.png" />

### Setup

#### Imports 

In [5]:
from sagemaker.spark.processing import PySparkProcessor
import pandas as pd
import numpy as np
import sagemaker
import logging
import random
import boto3

In [6]:
print(f'Using SageMaker version: {sagemaker.__version__}')

Using SageMaker version: 2.145.0


#### Setup Logger

In [7]:
logger = logging.getLogger('sagemaker')
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

In [8]:
logger.info('[Batch Aggregation using SageMaker PySpark Processing Job]')

[Batch Aggregation using SageMaker PySpark Processing Job]


#### Essentials

In [9]:
sagemaker_role = sagemaker.get_execution_role()
BUCKET = 'sm-fs-demo'
INPUT_KEY_PREFIX = 'raw'
OUTPUT_KEY_PREFIX = 'aggregated'
LOCAL_DIR = './data'

### Create PySpark Processing Script
This PySpark script does the following:

1. Aggregates raw features to derive new features (ratios).
2. Saves the aggregated features alongside the original raw features as a CSV file in S3, to be used in the next step for model training.
3. Groups the aggregated features by credit card number and picks selected aggregated features to write to SageMaker Feature Store (Online). <br>
<b>Note: </b> The feature group was created in the previous notebook (`1_setup.ipynb`).
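
To make the derived features concrete, here is a minimal pure-Python sketch of the same windowed aggregation for a single card. It is not the Spark implementation: the function names `window_stats` and `rolling_features` are hypothetical, and it assumes each window covers the closed interval from `t - width` to `t`, which is how the `RANGE ... PRECEDING` frames in the script below are expected to behave. The sample rows are the first three transactions shown later in this notebook's output.

```python
from datetime import datetime, timedelta


def window_stats(transactions, t, width):
    """Count and mean of amounts with timestamps in [t - width, t]."""
    amounts = [a for (u, a) in transactions if t - width <= u <= t]
    # The window always contains the current transaction, so it is never empty
    return len(amounts), sum(amounts) / len(amounts)


def rolling_features(transactions,
                     short=timedelta(minutes=10),
                     long=timedelta(weeks=1)):
    """Per-transaction window features for ONE card.

    transactions: list of (datetime, amount) tuples.
    Assumes strictly positive amounts, so the weekly average is never zero.
    """
    rows = []
    for t, amount in transactions:
        n_short, avg_short = window_stats(transactions, t, short)
        n_long, avg_long = window_stats(transactions, t, long)
        rows.append({
            'amount': amount,
            'num_trans_last_10m': n_short,
            'avg_amt_last_10m': avg_short,
            'num_trans_last_1w': n_long,
            'avg_amt_last_1w': avg_long,
            'amt_ratio1': avg_short / avg_long,
            'amt_ratio2': amount / avg_long,
            'count_ratio': n_short / n_long,
        })
    return rows


# Three transactions on one card, several hours apart
sample = [
    (datetime(2020, 1, 1, 8, 3, 37), 89.69),
    (datetime(2020, 1, 1, 11, 23, 16), 57.98),
    (datetime(2020, 1, 2, 3, 45, 27), 195.62),
]
features = rolling_features(sample)
for row in features:
    print(row['num_trans_last_1w'], round(row['avg_amt_last_1w'], 3), round(row['count_ratio'], 6))
```

Because no two of these transactions fall within 10 minutes of each other, the 10-minute count stays at 1 while the weekly count grows, so `count_ratio` shrinks from 1.0 to 0.5 to about 0.33.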

In [10]:
%%writefile batch_aggregation.py
from pyspark.sql.types import StructField, StructType, StringType, DoubleType, TimestampType, LongType
from pyspark.sql.functions import desc, dense_rank
from pyspark.sql import SparkSession, DataFrame
from argparse import Namespace, ArgumentParser
from pyspark.sql.window import Window
import logging
import boto3
import time
import os


TOTAL_UNIQUE_USERS = 10000
FEATURE_GROUP = 'cc-agg-batch-fg'

logger = logging.getLogger('sagemaker')
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())


feature_store_client = boto3.client(service_name='sagemaker-featurestore-runtime')


def parse_args() -> Namespace:
    parser = ArgumentParser(description='Spark Job Input and Output Args')
    parser.add_argument('--s3_input_bucket', type=str, help='S3 Input Bucket')
    parser.add_argument('--s3_input_key_prefix', type=str, help='S3 Input Key Prefix')
    parser.add_argument('--s3_output_bucket', type=str, help='S3 Output Bucket')
    parser.add_argument('--s3_output_key_prefix', type=str, help='S3 Output Key Prefix')
    args = parser.parse_args()
    return args
    

def define_schema() -> StructType:
    schema = StructType([StructField('tid', StringType(), True),
                         StructField('datetime', TimestampType(), True),
                         StructField('cc_num', LongType(), True),
                         StructField('amount', DoubleType(), True),
                         StructField('fraud_label', StringType(), True)])
    return schema


def aggregate_features(args: Namespace, schema: StructType, spark: SparkSession) -> DataFrame:
    logger.info('[Read Raw Transactions Data as Spark DataFrame]')
    transactions_df = spark.read.csv(f's3a://{os.path.join(args.s3_input_bucket, args.s3_input_key_prefix)}',
                                     header=False,
                                     schema=schema)
    logger.info('[Aggregate Transactions to Derive New Features using Spark SQL]')
    query = """
    SELECT *, \
           avg_amt_last_10m/avg_amt_last_1w AS amt_ratio1, \
           amount/avg_amt_last_1w AS amt_ratio2, \
           num_trans_last_10m/num_trans_last_1w AS count_ratio \
    FROM \
        ( \
        SELECT *, \
               COUNT(*) OVER w1 as num_trans_last_10m, \
               AVG(amount) OVER w1 as avg_amt_last_10m, \
               COUNT(*) OVER w2 as num_trans_last_1w, \
               AVG(amount) OVER w2 as avg_amt_last_1w \
        FROM transactions_df \
        WINDOW \
               w1 AS (PARTITION BY cc_num order by cast(datetime AS timestamp) RANGE INTERVAL 10 MINUTE PRECEDING), \
               w2 AS (PARTITION BY cc_num order by cast(datetime AS timestamp) RANGE INTERVAL 1 WEEK PRECEDING) \
        ) 
    """
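    # registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
    transactions_df.createOrReplaceTempView('transactions_df')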
    aggregated_features = spark.sql(query)
    return aggregated_features


def write_to_s3(args: Namespace, aggregated_features: DataFrame) -> None:
    logger.info('[Write Aggregated Features to S3]')
    aggregated_features.coalesce(1) \
                       .write.format('csv') \
                       .option('header', True) \
                       .mode('overwrite') \
                       .option('sep', ',') \
                       .save('s3a://' + os.path.join(args.s3_output_bucket, args.s3_output_key_prefix))
    
def group_by_card_number(aggregated_features: DataFrame) -> DataFrame: 
    logger.info('[Group Aggregated Features by Card Number]')
    window = Window.partitionBy('cc_num').orderBy(desc('datetime'))
    sorted_df = aggregated_features.withColumn('rank', dense_rank().over(window))
    grouped_df = sorted_df.filter(sorted_df.rank == 1).drop(sorted_df.rank)
    sliced_df = grouped_df.select('cc_num', 'num_trans_last_1w', 'avg_amt_last_1w')
    return sliced_df


def transform_row(sliced_df: DataFrame) -> list:
    logger.info('[Transform Spark DataFrame Row to SageMaker Feature Store Record]')
    records = []
    for row in sliced_df.rdd.collect():
        record = []
        cc_num, num_trans_last_1w, avg_amt_last_1w = row
        if cc_num:
            record.append({'ValueAsString': str(cc_num), 'FeatureName': 'cc_num'})
            record.append({'ValueAsString': str(num_trans_last_1w), 'FeatureName': 'num_trans_last_1w'})
            record.append({'ValueAsString': str(round(avg_amt_last_1w, 2)), 'FeatureName': 'avg_amt_last_1w'})
            records.append(record)
    return records


def write_to_feature_store(records: list) -> None:
    logger.info('[Write Grouped Features to SageMaker Online Feature Store]')
    success, fail = 0, 0
    for record in records:
        event_time_feature = {
                'FeatureName': 'trans_time',
                'ValueAsString': str(int(round(time.time())))
            }
        record.append(event_time_feature)
        response = feature_store_client.put_record(FeatureGroupName=FEATURE_GROUP, Record=record)
        if response['ResponseMetadata']['HTTPStatusCode'] == 200:
            success += 1
        else:
            fail += 1
    logger.info('Success = {}'.format(success))
    logger.info('Fail = {}'.format(fail))
    assert success == TOTAL_UNIQUE_USERS
    assert fail == 0


def run_spark_job():
    spark = SparkSession.builder.appName('PySparkJob').getOrCreate()
    args = parse_args()
    schema = define_schema()
    aggregated_features = aggregate_features(args, schema, spark)
    write_to_s3(args, aggregated_features)
    sliced_df = group_by_card_number(aggregated_features)
    records = transform_row(sliced_df)
    write_to_feature_store(records)
    
    
if __name__ == '__main__':
    run_spark_job()

Writing batch_aggregation.py
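
The write loop in the script counts failures by inspecting the HTTP status code. In practice boto3 clients usually raise an exception (a `botocore` `ClientError`) on a failed call rather than returning a non-200 response, so a failure would stop the job before `fail` is ever incremented. The following is a hedged sketch of a loop that catches errors per record. `build_record`, `put_records`, and `FlakyStubClient` are hypothetical names introduced here; the stub stands in for the `sagemaker-featurestore-runtime` client so the example runs without AWS access.

```python
import time


def build_record(cc_num, num_trans_last_1w, avg_amt_last_1w, event_time=None):
    """Build a Feature Store record: a list of FeatureName/ValueAsString pairs."""
    if event_time is None:
        event_time = int(round(time.time()))  # Unix epoch seconds, as in the script
    return [
        {'FeatureName': 'cc_num', 'ValueAsString': str(cc_num)},
        {'FeatureName': 'num_trans_last_1w', 'ValueAsString': str(num_trans_last_1w)},
        {'FeatureName': 'avg_amt_last_1w', 'ValueAsString': str(round(avg_amt_last_1w, 2))},
        {'FeatureName': 'trans_time', 'ValueAsString': str(event_time)},
    ]


def put_records(client, feature_group_name, records):
    """Write records one by one; return (success_count, failed_records)."""
    success, failed = 0, []
    for record in records:
        try:
            client.put_record(FeatureGroupName=feature_group_name, Record=record)
            success += 1
        except Exception as error:  # a real client would typically raise ClientError
            failed.append((record, str(error)))
    return success, failed


class FlakyStubClient:
    """Hypothetical stand-in for the runtime client; rejects one card number."""

    def __init__(self, bad_cc_num):
        self.bad_cc_num = str(bad_cc_num)
        self.stored = []

    def put_record(self, FeatureGroupName, Record):
        values = {item['FeatureName']: item['ValueAsString'] for item in Record}
        if values['cc_num'] == self.bad_cc_num:
            raise RuntimeError('simulated service error')
        self.stored.append(Record)
        return {'ResponseMetadata': {'HTTPStatusCode': 200}}


# Made-up card numbers, used only for the demonstration
records = [build_record(1111, 21, 763.851, event_time=1681065736),
           build_record(2222, 24, 763.33, event_time=1681065790)]
client = FlakyStubClient(bad_cc_num=2222)
success, failed = put_records(client, 'cc-agg-batch-fg', records)
print(f'Success = {success}, Fail = {len(failed)}')
```

Returning the failed records, rather than only a count, leaves room to retry or log them.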


### Run SageMaker Processing Job

In [11]:
spark_processor = PySparkProcessor(base_job_name='sm-fs-demo', 
                                   framework_version='2.4', # spark version
                                   role=sagemaker_role, 
                                   instance_count=1, 
                                   instance_type='ml.m5.2xlarge', 
                                   env={'AWS_DEFAULT_REGION': boto3.Session().region_name},
                                   max_runtime_in_seconds=1200)

In [12]:
%%time

spark_processor.run(submit_app='batch_aggregation.py', 
                    arguments=['--s3_input_bucket', BUCKET, 
                               '--s3_input_key_prefix', INPUT_KEY_PREFIX, 
                               '--s3_output_bucket', BUCKET, 
                               '--s3_output_key_prefix', OUTPUT_KEY_PREFIX],
                    spark_event_logs_s3_uri='s3://{}/logs'.format(BUCKET),
                    logs=False)

Creating processing-job with name sm-fs-demo-2023-04-09-18-35-08-914
INFO:sagemaker:Creating processing-job with name sm-fs-demo-2023-04-09-18-35-08-914


....................................................................................................!
CPU times: user 701 ms, sys: 25.4 ms, total: 727 ms
Wall time: 8min 30s


### Explore Aggregated Features 
<p>The SageMaker Processing job above creates the aggregated features alongside the raw features and writes them to S3.
Let us verify this output using the code below and prepare it for model training in the next step.</p>


Copy the results CSV from S3 to a local directory.

In [15]:
!rm {LOCAL_DIR}/{OUTPUT_KEY_PREFIX}/part*.csv

rm: cannot remove ‘./data/aggregated/part*.csv’: No such file or directory


In [16]:
!aws s3 cp s3://{BUCKET}/{OUTPUT_KEY_PREFIX}/ {LOCAL_DIR}/{OUTPUT_KEY_PREFIX}/ --recursive --exclude '_SUCCESS'

download: s3://sm-fs-demo/aggregated/part-00000-4af03980-7edc-414a-bf3f-40645cf9d130-c000.csv to data/aggregated/part-00000-4af03980-7edc-414a-bf3f-40645cf9d130-c000.csv


In [17]:
!mv {LOCAL_DIR}/{OUTPUT_KEY_PREFIX}/part*.csv {LOCAL_DIR}/{OUTPUT_KEY_PREFIX}/part.csv 
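
The `mv` step relies on there being exactly one part file, which holds here because the script calls `coalesce(1)` before writing. As a sketch only, the same rename can be done in Python with an explicit check for that assumption. The helper name `rename_single_part_file` and the file name in the demonstration are made up for illustration.

```python
import tempfile
from pathlib import Path


def rename_single_part_file(directory, target_name='part.csv'):
    """Rename the only 'part-*.csv' file in `directory` to `target_name`."""
    directory = Path(directory)
    parts = sorted(directory.glob('part-*.csv'))
    if len(parts) != 1:
        raise ValueError(f'Expected exactly one part file in {directory}, found {len(parts)}')
    target = directory / target_name
    parts[0].replace(target)  # overwrites an existing target file
    return target


# Demonstration on a throwaway directory with a made-up part file name
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'part-00000-example-c000.csv').write_text('tid,amount\nabc,1.0\n')
    renamed = rename_single_part_file(tmp)
    print(renamed.name)
```

Failing loudly when zero or several part files are present avoids silently reading a partial result.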

In [18]:
agg_features = pd.read_csv(f'{LOCAL_DIR}/{OUTPUT_KEY_PREFIX}/part.csv')
agg_features.dropna(inplace=True)
agg_features['cc_num'] = agg_features['cc_num'].astype(np.int64)
agg_features['fraud_label'] = agg_features['fraud_label'].astype(np.int64)
agg_features.head()

| | tid | datetime | cc_num | amount | fraud_label | num_trans_last_10m | avg_amt_last_10m | num_trans_last_1w | avg_amt_last_1w | amt_ratio1 | amt_ratio2 | count_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9865906a3fc8ffb36edd7413302fd50d | 2020-01-01T08:03:37.000Z | 4006080197832643 | 89.69 | 0 | 1 | 89.69 | 1 | 89.69 | 1.0 | 1.0 | 1.0 |
| 1 | b18b52528c812800f93c815139480f1f | 2020-01-01T11:23:16.000Z | 4006080197832643 | 57.98 | 0 | 1 | 57.98 | 2 | 73.835 | 0.785264 | 0.785264 | 0.5 |
| 2 | 34b9c71ea65c2a9003dd67e65e4507ec | 2020-01-02T03:45:27.000Z | 4006080197832643 | 195.62 | 0 | 1 | 195.62 | 3 | 114.43 | 1.709517 | 1.709517 | 0.333333 |
| 3 | 610a9e76bd6a3fb56bed7501b3ec6c0e | 2020-01-02T07:14:02.000Z | 4006080197832643 | 653.63 | 0 | 1 | 653.63 | 4 | 249.23 | 2.622598 | 2.622598 | 0.25 |
| 4 | c24caf694d7b5375b4a72224c5351949 | 2020-01-02T16:43:26.000Z | 4006080197832643 | 20.18 | 0 | 1 | 20.18 | 5 | 203.42 | 0.099204 | 0.099204 | 0.2 |


In [19]:
agg_features.to_csv(f'{LOCAL_DIR}/{OUTPUT_KEY_PREFIX}/processing_output.csv', index=False)

Remove the intermediate `part.csv` file

In [20]:
!rm {LOCAL_DIR}/{OUTPUT_KEY_PREFIX}/part.csv

In [1]:
# def transform_row(df) -> list:
#     logger.info('[Transform Spark DataFrame Row to SageMaker Feature Store Record]')
#     records = []
#     for index, row in df.iterrows():
#         record = []
#         tid,cc_num, amount, fraud_label, num_trans_last_10m, avg_amt_last_10m, num_trans_last_1w, avg_amt_last_1w,datetime, amt_ratio1,amt_ratio2,count_ratio = row
#         if tid:
#             record.append({'ValueAsString': str(tid), 'FeatureName': 'tid'})
#             record.append({'ValueAsString': str(cc_num), 'FeatureName': 'cc_num'})
#             record.append({'ValueAsString': str(round(amount, 2)), 'FeatureName': 'amount'})
#             record.append({'ValueAsString': str(fraud_label), 'FeatureName': 'fraud_label'})
#             record.append({'ValueAsString': str(num_trans_last_1w), 'FeatureName': 'num_trans_last_1w'})
#             record.append({'ValueAsString': str(round(avg_amt_last_1w, 2)), 'FeatureName': 'avg_amt_last_1w'})
#             record.append({'ValueAsString': str(datetime), 'FeatureName': 'datetime'})
#             record.append({'ValueAsString': str(num_trans_last_10m), 'FeatureName': 'num_trans_last_10m'})
#             record.append({'ValueAsString': str(round(avg_amt_last_10m, 2)), 'FeatureName': 'avg_amt_last_10m'})
#             record.append({'ValueAsString': str(round(amt_ratio1, 2)), 'FeatureName': 'amt_ratio1'})
#             record.append({'ValueAsString': str(round(amt_ratio2, 2)), 'FeatureName': 'amt_ratio2'})
#             record.append({'ValueAsString': str(round(count_ratio, 2)), 'FeatureName': 'count_ratio'})
#             records.append(record)
#     return records

# def write_to_feature_store(records: list) -> None:
#     logger.info('[Write Grouped Features to SageMaker Feature Store]')
#     success, fail = 0, 0
#     for record in records:
#         event_time_feature = {
#                 'FeatureName': 'datetime',
#                 'ValueAsString': str(int(round(time.time())))
#             }
#         record.append(event_time_feature)
#         response = feature_store_client.put_record(FeatureGroupName='cc-train-chime-fg', Record=record)
#         if response['ResponseMetadata']['HTTPStatusCode'] == 200:
#             success += 1
#         else:
#             fail += 1
#     logger.info('Success = {}'.format(success))
#     logger.info('Fail = {}'.format(fail))
#     assert success == TOTAL_UNIQUE_USERS
#     assert fail == 0

# import pandas as pd

# df = pd.read_csv(f'{LOCAL_DIR}/{OUTPUT_KEY_PREFIX}/processing_output.csv')
# df.iloc[1]

# records = transform_row(df)
# write_to_feature_store(records)

### Validate Feature Group for Records
Let's randomly pick N credit card numbers from `processing_output.csv` and verify that records exist in the feature group `cc-agg-batch-fg` for these card numbers.

In [13]:
N = 3 # number of random records to validate
FEATURE_GROUP = 'cc-agg-batch-fg'

In [21]:
processing_out_df = pd.read_csv(f'{LOCAL_DIR}/{OUTPUT_KEY_PREFIX}/processing_output.csv')
processing_out_df.head()

| | tid | datetime | cc_num | amount | fraud_label | num_trans_last_10m | avg_amt_last_10m | num_trans_last_1w | avg_amt_last_1w | amt_ratio1 | amt_ratio2 | count_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9865906a3fc8ffb36edd7413302fd50d | 2020-01-01T08:03:37.000Z | 4006080197832643 | 89.69 | 0 | 1 | 89.69 | 1 | 89.69 | 1.0 | 1.0 | 1.0 |
| 1 | b18b52528c812800f93c815139480f1f | 2020-01-01T11:23:16.000Z | 4006080197832643 | 57.98 | 0 | 1 | 57.98 | 2 | 73.835 | 0.785264 | 0.785264 | 0.5 |
| 2 | 34b9c71ea65c2a9003dd67e65e4507ec | 2020-01-02T03:45:27.000Z | 4006080197832643 | 195.62 | 0 | 1 | 195.62 | 3 | 114.43 | 1.709517 | 1.709517 | 0.333333 |
| 3 | 610a9e76bd6a3fb56bed7501b3ec6c0e | 2020-01-02T07:14:02.000Z | 4006080197832643 | 653.63 | 0 | 1 | 653.63 | 4 | 249.23 | 2.622598 | 2.622598 | 0.25 |
| 4 | c24caf694d7b5375b4a72224c5351949 | 2020-01-02T16:43:26.000Z | 4006080197832643 | 20.18 | 0 | 1 | 20.18 | 5 | 203.42 | 0.099204 | 0.099204 | 0.2 |


In [22]:
# Sample from the distinct card numbers so the same card is never picked twice
cc_nums = random.sample(processing_out_df['cc_num'].unique().tolist(), N)
cc_nums

[4667414786921170, 4978884251790033, 4466790039670992]

Using the SageMaker Feature Store runtime client, we can verify that records exist in the feature group for the picked `cc_nums`.

In [23]:
feature_store_client = boto3.Session().client(service_name='sagemaker-featurestore-runtime')

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


In [24]:
success, fail = 0, 0
for cc_num in cc_nums:
    # Look up each card number by its record identifier, passed as a string
    response = feature_store_client.get_record(FeatureGroupName=FEATURE_GROUP, 
                                               RecordIdentifierValueAsString=str(cc_num))
    # A lookup counts as a success only if the response actually carries a 'Record'
    if response['ResponseMetadata']['HTTPStatusCode'] == 200 and 'Record' in response:
        success += 1
        print(response['Record'])
    else:
        print(response)
        fail += 1
assert success == N

[{'FeatureName': 'cc_num', 'ValueAsString': '4667414786921170'}, {'FeatureName': 'num_trans_last_1w', 'ValueAsString': '21'}, {'FeatureName': 'avg_amt_last_1w', 'ValueAsString': '763.85'}, {'FeatureName': 'trans_time', 'ValueAsString': '1681065736'}]
[{'FeatureName': 'cc_num', 'ValueAsString': '4978884251790033'}, {'FeatureName': 'num_trans_last_1w', 'ValueAsString': '24'}, {'FeatureName': 'avg_amt_last_1w', 'ValueAsString': '763.33'}, {'FeatureName': 'trans_time', 'ValueAsString': '1681065790'}]
[{'FeatureName': 'cc_num', 'ValueAsString': '4466790039670992'}, {'FeatureName': 'num_trans_last_1w', 'ValueAsString': '22'}, {'FeatureName': 'avg_amt_last_1w', 'ValueAsString': '1863.17'}, {'FeatureName': 'trans_time', 'ValueAsString': '1681065784'}]
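
Each returned record is a list of `FeatureName`/`ValueAsString` pairs, which is awkward to query directly. A small sketch, using a hypothetical helper named `record_to_dict`, flattens one into a plain dictionary. The sample record is copied from the first line of the output above.

```python
def record_to_dict(record):
    """Flatten a Feature Store record into {feature_name: value_as_string}."""
    return {item['FeatureName']: item['ValueAsString'] for item in record}


record = [
    {'FeatureName': 'cc_num', 'ValueAsString': '4667414786921170'},
    {'FeatureName': 'num_trans_last_1w', 'ValueAsString': '21'},
    {'FeatureName': 'avg_amt_last_1w', 'ValueAsString': '763.85'},
    {'FeatureName': 'trans_time', 'ValueAsString': '1681065736'},
]
features = record_to_dict(record)

# Values come back as strings, so cast them before doing arithmetic
weekly_count = int(features['num_trans_last_1w'])
weekly_average = float(features['avg_amt_last_1w'])
print(weekly_count, weekly_average)
```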


### Create Transaction Feature Group for Training
Let's read the `processing_output.csv` file, create a new feature group `cc-trans-fg`, and ingest all the transactions into it.

In [40]:
processing_out_df.count()

tid                   5400000
datetime              5400000
cc_num                5400000
amount                5400000
fraud_label           5400000
num_trans_last_10m    5400000
avg_amt_last_10m      5400000
num_trans_last_1w     5400000
avg_amt_last_1w       5400000
amt_ratio1            5400000
amt_ratio2            5400000
count_ratio           5400000
datetime1             5400000
dtype: int64

In [41]:
import boto3
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup

region = boto3.Session().region_name
boto_session = boto3.Session(region_name=region)
sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)


feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


In [42]:
# processing_out_df['datetime1'] = pd.to_datetime(processing_out_df['datetime'],format='%Y-%m-%dT%H:%M:%S.%fZ').astype(str)

df1 = processing_out_df[['tid','datetime','fraud_label', 'amount', 'amt_ratio1','amt_ratio2','count_ratio']]

df1.head(5)

| | tid | datetime | fraud_label | amount | amt_ratio1 | amt_ratio2 | count_ratio |
|---|---|---|---|---|---|---|---|
| 0 | 9865906a3fc8ffb36edd7413302fd50d | 2020-01-01T08:03:37.000Z | 0 | 89.69 | 1.0 | 1.0 | 1.0 |
| 1 | b18b52528c812800f93c815139480f1f | 2020-01-01T11:23:16.000Z | 0 | 57.98 | 0.785264 | 0.785264 | 0.5 |
| 2 | 34b9c71ea65c2a9003dd67e65e4507ec | 2020-01-02T03:45:27.000Z | 0 | 195.62 | 1.709517 | 1.709517 | 0.333333 |
| 3 | 610a9e76bd6a3fb56bed7501b3ec6c0e | 2020-01-02T07:14:02.000Z | 0 | 653.63 | 2.622598 | 2.622598 | 0.25 |
| 4 | c24caf694d7b5375b4a72224c5351949 | 2020-01-02T16:43:26.000Z | 0 | 20.18 | 0.099204 | 0.099204 | 0.2 |


In [43]:
df1.dtypes

tid             object
datetime        object
fraud_label      int64
amount         float64
amt_ratio1     float64
amt_ratio2     float64
count_ratio    float64
dtype: object

In [44]:
# df1 is a slice of processing_out_df, so assigning columns in place triggers
# pandas' SettingWithCopyWarning. Reassigning the result of astype avoids it.
df1 = df1.astype({'tid': 'string', 'datetime': 'string'})


In [45]:
df1.dtypes

tid             string
datetime        string
fraud_label      int64
amount         float64
amt_ratio1     float64
amt_ratio2     float64
count_ratio    float64
dtype: object

In [36]:
featuregroup_name = 'cc-trans-fg'

feature_group = FeatureGroup(name=featuregroup_name, sagemaker_session=feature_store_session)

In [46]:
# A one-row sample with the same columns and dtypes as df1,
# used only to infer the feature definitions for the feature group
sample_df = pd.DataFrame([['d621c8d794262ad5e8ad804cb4517395', '2023-03-01T00:00:00Z', 0, 8911.09, 1.0, 1.0, 1.0]],
                         columns=['tid', 'datetime', 'fraud_label', 'amount', 'amt_ratio1', 'amt_ratio2', 'count_ratio'])

sample_df['tid'] = sample_df['tid'].astype('string')
sample_df['datetime'] = sample_df['datetime'].astype('string')
sample_df.dtypes

tid             string
datetime        string
fraud_label      int64
amount         float64
amt_ratio1     float64
amt_ratio2     float64
count_ratio    float64
dtype: object

In [47]:
feature_group.load_feature_definitions(data_frame=sample_df)

[FeatureDefinition(feature_name='tid', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='datetime', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='fraud_label', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='amount', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='amt_ratio1', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='amt_ratio2', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='count_ratio', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>)]
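
The output above shows how the three dtypes in `sample_df` were translated into feature types. The sketch below records that observed correspondence as a lookup table. It is read off this output, not taken from the SDK's internal logic, and `DTYPE_TO_FEATURE_TYPE` and `infer_feature_types` are hypothetical names. Rejecting any other dtype is a choice made for this sketch, not a statement about what the SDK accepts.

```python
# Assumed mapping, read off the load_feature_definitions output above
DTYPE_TO_FEATURE_TYPE = {
    'string': 'String',
    'int64': 'Integral',
    'float64': 'Fractional',
}


def infer_feature_types(dtypes):
    """Map {column: dtype name} to {column: feature type}; fail on unknown dtypes."""
    unknown = {col: dt for col, dt in dtypes.items() if dt not in DTYPE_TO_FEATURE_TYPE}
    if unknown:
        raise ValueError(f'No feature type known for: {unknown}')
    return {col: DTYPE_TO_FEATURE_TYPE[dt] for col, dt in dtypes.items()}


column_dtypes = {
    'tid': 'string',
    'datetime': 'string',
    'fraud_label': 'int64',
    'amount': 'float64',
    'amt_ratio1': 'float64',
    'amt_ratio2': 'float64',
    'count_ratio': 'float64',
}
feature_types = infer_feature_types(column_dtypes)
print(feature_types)
```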

In [49]:
offline_feature_store_uri = f's3://{BUCKET}/sagemaker-feature-store'

print(f'Location of offline store: {offline_feature_store_uri}')

Location of offline store: s3://sm-fs-demo/sagemaker-feature-store


In [50]:
feature_group.create(s3_uri=offline_feature_store_uri,
                     record_identifier_name='tid',
                     event_time_feature_name='datetime',
                     role_arn=sagemaker_role,
                     enable_online_store=True)

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:461312420708:feature-group/cc-trans-fg',
 'ResponseMetadata': {'RequestId': '1c60163f-567a-45d0-8f12-da4e3b5d0c34',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '1c60163f-567a-45d0-8f12-da4e3b5d0c34',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '88',
   'date': 'Sun, 09 Apr 2023 19:50:33 GMT'},
  'RetryAttempts': 0}}
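
Feature group creation is asynchronous, so the group may still be in the `Creating` state when `create` returns. Ingesting before it reaches `Created` is likely to fail. One way to guard against that, sketched here under the assumption that `feature_group.describe()` returns a dictionary with a `FeatureGroupStatus` key, is to poll until the status settles. `wait_for_feature_group` and `StubFeatureGroup` are hypothetical names; the stub replaces the real `FeatureGroup` so the example runs without AWS access.

```python
import time


def wait_for_feature_group(feature_group, poll_seconds=5, max_attempts=60):
    """Poll describe() until the feature group is 'Created'; raise on failure or timeout."""
    for _ in range(max_attempts):
        status = feature_group.describe().get('FeatureGroupStatus')
        if status == 'Created':
            return status
        if status == 'CreateFailed':
            raise RuntimeError('Feature group creation failed')
        time.sleep(poll_seconds)
    raise TimeoutError(f'Feature group not created after {max_attempts} attempts')


class StubFeatureGroup:
    """Hypothetical stand-in that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe(self):
        return {'FeatureGroupStatus': next(self._statuses)}


# With the real object this would be: wait_for_feature_group(feature_group)
final_status = wait_for_feature_group(StubFeatureGroup(['Creating', 'Creating', 'Created']),
                                      poll_seconds=0)
print(final_status)
```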

In [None]:
feature_group.ingest(data_frame=df1, max_processes=16, wait=True)