# Account C
### Read/Write to its own Online Store (Account C) + Read/Write to Account A's Offline Store (Centralized Store)

#### Prerequisites

In [1]:
#!pip install awswrangler

#### Imports 

In [2]:
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker import get_execution_role
from sagemaker.session import Session
import awswrangler as wr
import pandas as pd
import sagemaker
import logging
import boto3
import time
import s3fs

#### Setup Logger

In [3]:
logger = logging.getLogger('sagemaker')
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

In [4]:
logger.info(f'[Using SageMaker version: {sagemaker.__version__}]')

[Using SageMaker version: 2.19.0]


#### Essentials 
* Create SageMaker & Feature Store Runtime Clients
* Create a Feature Store Session encapsulating the above clients
* Ensure the execution role used for this notebook has both the `AmazonSageMakerFullAccess` and `AmazonSageMakerFeatureStoreAccess` managed policies attached. If it does not, attach them to the role before proceeding.

In [5]:
region = boto3.Session().region_name
boto_session = boto3.Session(region_name=region)
s3 = boto_session.resource('s3', region_name=region)
role = get_execution_role()

s3_client = boto3.client('s3', region_name=region)
sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)
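
The policy requirement listed above can be checked programmatically. The following is a minimal sketch, not part of the original notebook: `missing_policies` and `attached_policy_names` are hypothetical helper names, and the IAM lookup assumes the caller is allowed to call `iam:ListAttachedRolePolicies`.

```python
REQUIRED_POLICIES = {
    "AmazonSageMakerFullAccess",
    "AmazonSageMakerFeatureStoreAccess",
}


def missing_policies(attached_names, required=REQUIRED_POLICIES):
    """Return the required managed policy names that are not in attached_names."""
    return sorted(set(required) - set(attached_names))


def attached_policy_names(role_arn):
    """List the managed policy names attached to the role behind role_arn."""
    import boto3  # imported lazily so the pure helper above works without AWS access

    role_name = role_arn.split("/")[-1]  # last ARN path segment is the role name
    iam = boto3.client("iam")
    names = []
    paginator = iam.get_paginator("list_attached_role_policies")
    for page in paginator.paginate(RoleName=role_name):
        names.extend(policy["PolicyName"] for policy in page["AttachedPolicies"])
    return names


# Offline example: only one of the two policies is attached.
print(missing_policies(["AmazonSageMakerFullAccess"]))
```

With AWS access, `missing_policies(attached_policy_names(role))` should return an empty list when both policies are attached to the execution role.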

* SDK guide: https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_featurestore.html
* API documentation: https://sagemaker.readthedocs.io/en/stable/api/prep_data/feature_store.html

In [6]:
feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)

In [7]:
feature_store_session.__dict__

{'_default_bucket': None,
 '_default_bucket_name_override': None,
 's3_resource': None,
 's3_client': None,
 'config': None,
 'boto_session': Session(region_name='us-east-1'),
 '_region_name': 'us-east-1',
 'sagemaker_client': <botocore.client.SageMaker at 0x7fe2a779ee48>,
 'sagemaker_runtime_client': <botocore.client.SageMakerRuntime at 0x7fe2a774ffd0>,
 'sagemaker_featurestore_runtime_client': <botocore.client.SageMakerFeatureStoreRuntime at 0x7fe2a774f198>,
 'local_mode': False}

The `offline_feature_store_s3_uri` below is the S3 location of the offline store. In this setup it points to the centralized bucket owned by Account A.

In [8]:
bucket = 'sagemaker-feature-store-account-a'
offline_feature_store_s3_uri = f's3://{bucket}/'
offline_feature_store_s3_uri

's3://sagemaker-feature-store-account-a/'

#### Load Features 

In [9]:
features = pd.read_csv('features.csv', names=['employee_id', 'name', 'age', 'sex', 'happiness_score'])
features['created_by'] = 'account-c'

In [10]:
features.dtypes

employee_id          int64
name                object
age                  int64
sex                 object
happiness_score    float64
created_by          object
dtype: object

### Ingest Features into SageMaker Feature Store

In [11]:
record_identifier_feature_name = 'employee_id'
event_time_feature_name = 'event_time'

#### Define Feature Group

In [12]:
feature_group_name = 'employees'
feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=feature_store_session)
feature_group.__dict__

{'name': 'employees',
 'sagemaker_session': <sagemaker.session.Session at 0x7fe2d9cfd780>,
 'feature_definitions': []}

Feature Store supports three feature types: `String`, `Fractional`, and `Integral`. The default type is `String`, so any column in your dataset that is not a `float` or `long` type is stored as `String` in your feature store.

In [13]:
def cast_object_to_string(df):
    """
    Cast object dtype to string. The SageMaker FeatureStore Python SDK will then 
    map the string dtype to String feature type.
    """
    for label in df.columns:
        if df.dtypes[label] == 'object':
            df[label] = df[label].astype('string')

In [14]:
cast_object_to_string(features)
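
To make the type rule described above concrete, here is a small sketch that approximates it on dtype names. `feature_type_for` is a hypothetical helper written for illustration. It mirrors the rule as stated in this notebook and is not the SDK's own inference code, which may behave differently for unusual dtypes.

```python
def feature_type_for(dtype_name):
    """Approximate the Feature Store type for a pandas dtype name (illustrative only)."""
    name = str(dtype_name).lower()
    if name.startswith(("int", "uint")):
        return "Integral"
    if name.startswith("float"):
        return "Fractional"
    # Everything else falls back to the default type.
    return "String"


# dtype names taken from the `features.dtypes` output in this notebook
for dtype_name in ["int64", "float64", "string"]:
    print(dtype_name, "->", feature_type_for(dtype_name))
```

Applied to the dataframe, `{col: feature_type_for(dt) for col, dt in features.dtypes.items()}` gives a quick preview of the types to expect.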

#### Append event_time to the `features` dataframe 

In [15]:
current_time_sec = int(round(time.time()))
features[event_time_feature_name] = pd.Series([current_time_sec]*len(features), dtype='float64')

In [16]:
features.dtypes

employee_id          int64
name                string
age                  int64
sex                 string
happiness_score    float64
created_by          string
event_time         float64
dtype: object
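
The cell above stores `event_time` as Unix epoch seconds, which maps to the `Fractional` type. To our knowledge Feature Store also accepts ISO-8601 strings of the form `yyyy-MM-dd'T'HH:mm:ssZ` for a `String`-typed event time. The sketch below shows the conversion under that assumption; `epoch_to_iso8601` is a hypothetical helper, not part of the SDK.

```python
from datetime import datetime, timezone


def epoch_to_iso8601(epoch_seconds):
    """Format Unix epoch seconds as an ISO-8601 UTC string with second precision."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


print(epoch_to_iso8601(1608960422))  # 2020-12-26T05:27:02Z
```

If this representation were used instead, the `event_time` column would need a string dtype so that it is registered as a `String` feature.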

In [17]:
features

| | employee_id | name | age | sex | happiness_score | created_by | event_time |
|---|---|---|---|---|---|---|---|
| 0 | 100 | alex | 23 | M | 1.4 | account-c | 1608960000.0 |
| 1 | 101 | bria | 29 | F | 4.5 | account-c | 1608960000.0 |
| 2 | 102 | cara | 43 | F | 4.3 | account-c | 1608960000.0 |
| 3 | 103 | dave | 54 | M | 3.5 | account-c | 1608960000.0 |
| 4 | 104 | elan | 61 | F | 4.3 | account-c | 1608960000.0 |


#### Load Feature Definitions
The SageMaker Feature Store Python SDK auto-detects the data schema (each feature's name and type) from the input dataframe.

In [18]:
feature_group.load_feature_definitions(data_frame=features)

[FeatureDefinition(feature_name='employee_id', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='name', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='age', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='sex', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='happiness_score', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='created_by', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='event_time', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>)]

#### Create Feature Group

In [20]:
#sagemaker_client.delete_feature_group(FeatureGroupName='employees')

In [21]:
feature_group.create(
    s3_uri=offline_feature_store_s3_uri,
    record_identifier_name=record_identifier_feature_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=role,
    enable_online_store=True
)

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:105242341581:feature-group/employees',
 'ResponseMetadata': {'RequestId': 'f98316b5-f59b-493e-a71c-b91e5078a056',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'f98316b5-f59b-493e-a71c-b91e5078a056',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '86',
   'date': 'Sat, 26 Dec 2020 05:27:36 GMT'},
  'RetryAttempts': 0}}

In [22]:
feature_group.__dict__

{'name': 'employees',
 'sagemaker_session': <sagemaker.session.Session at 0x7fe2d9cfd780>,
 'feature_definitions': [FeatureDefinition(feature_name='employee_id', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
  FeatureDefinition(feature_name='name', feature_type=<FeatureTypeEnum.STRING: 'String'>),
  FeatureDefinition(feature_name='age', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
  FeatureDefinition(feature_name='sex', feature_type=<FeatureTypeEnum.STRING: 'String'>),
  FeatureDefinition(feature_name='happiness_score', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
  FeatureDefinition(feature_name='created_by', feature_type=<FeatureTypeEnum.STRING: 'String'>),
  FeatureDefinition(feature_name='event_time', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>)]}

#### Validate that the feature group was created

In [23]:
feature_group.describe()

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:105242341581:feature-group/employees',
 'FeatureGroupName': 'employees',
 'RecordIdentifierFeatureName': 'employee_id',
 'EventTimeFeatureName': 'event_time',
 'FeatureDefinitions': [{'FeatureName': 'employee_id',
   'FeatureType': 'Integral'},
  {'FeatureName': 'name', 'FeatureType': 'String'},
  {'FeatureName': 'age', 'FeatureType': 'Integral'},
  {'FeatureName': 'sex', 'FeatureType': 'String'},
  {'FeatureName': 'happiness_score', 'FeatureType': 'Fractional'},
  {'FeatureName': 'created_by', 'FeatureType': 'String'},
  {'FeatureName': 'event_time', 'FeatureType': 'Fractional'}],
 'CreationTime': datetime.datetime(2020, 12, 26, 5, 27, 36, 37000, tzinfo=tzlocal()),
 'OnlineStoreConfig': {'EnableOnlineStore': True},
 'OfflineStoreConfig': {'S3StorageConfig': {'S3Uri': 's3://sagemaker-feature-store-account-a/'},
  'DisableGlueTableCreation': False},
 'RoleArn': 'arn:aws:iam::105242341581:role/service-role/AmazonSageMaker-ExecutionRole-202

In [25]:
sagemaker_client.list_feature_groups()

{'FeatureGroupSummaries': [{'FeatureGroupName': 'employees',
   'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:105242341581:feature-group/employees',
   'CreationTime': datetime.datetime(2020, 12, 26, 5, 27, 36, 37000, tzinfo=tzlocal()),
   'FeatureGroupStatus': 'Created'}],
 'ResponseMetadata': {'RequestId': '91dccc5a-2a74-4e56-b50b-6ff24519c653',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '91dccc5a-2a74-4e56-b50b-6ff24519c653',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '208',
   'date': 'Sat, 26 Dec 2020 05:27:44 GMT'},
  'RetryAttempts': 0}}
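
Feature group creation is asynchronous, so the status can still be `Creating` right after `create()` returns. A minimal sketch of a bounded wait follows. `wait_until_created` is a hypothetical helper, and the attempt count and poll interval are arbitrary choices for illustration.

```python
import time


def wait_until_created(describe, max_attempts=60, poll_sec=5, sleep=time.sleep):
    """Poll describe() until FeatureGroupStatus is 'Created', or fail."""
    for _ in range(max_attempts):
        status = describe().get("FeatureGroupStatus")
        if status == "Created":
            return status
        if status != "Creating":
            # e.g. 'CreateFailed': stop instead of polling forever
            raise RuntimeError(f"Feature group creation ended in status {status!r}")
        sleep(poll_sec)
    raise TimeoutError("Feature group was still 'Creating' after polling")


# Offline demonstration with a fake describe() that succeeds on the third call.
_statuses = iter(["Creating", "Creating", "Created"])
print(wait_until_created(lambda: {"FeatureGroupStatus": next(_statuses)},
                         sleep=lambda seconds: None))
```

In the notebook this could be called as `wait_until_created(feature_group.describe)` before ingesting any records.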

#### Put Records into Feature Group (Both Online & Offline)

After the feature group has been created, we can put data into it by using the PutRecord API. This API can handle high TPS and is designed to be called by different streams. The data from all of these Put requests is buffered and written to S3 in chunks, so the files appear in the offline store within a few minutes of ingestion. To accelerate ingestion in this example, we set `max_workers` so that multiple workers ingest rows concurrently.

In [26]:
%%time

feature_group.ingest(data_frame=features, max_workers=5, wait=True)

Started ingesting index 0 to 1
Started ingesting index 1 to 2
Started ingesting index 2 to 3
Started ingesting index 3 to 4
Started ingesting index 4 to 5
Successfully ingested row 2 to 3
Successfully ingested row 0 to 1
Successfully ingested row 1 to 2
Successfully ingested row 4 to 5
Successfully ingested row 3 to 4


CPU times: user 71.6 ms, sys: 26.4 ms, total: 98 ms
Wall time: 452 ms


IngestionManagerPandas(feature_group_name='employees', sagemaker_session=<sagemaker.session.Session object at 0x7fe2d9cfd780>, data_frame=   employee_id  name  age sex  happiness_score created_by    event_time
0          100  alex   23   M              1.4  account-c  1.608960e+09
1          101  bria   29   F              4.5  account-c  1.608960e+09
2          102  cara   43   F              4.3  account-c  1.608960e+09
3          103  dave   54   M              3.5  account-c  1.608960e+09
4          104  elan   61   F              4.3  account-c  1.608960e+09, max_workers=5, _futures={<Future at 0x7fe2a7691940 state=finished returned NoneType>: (0, 1), <Future at 0x7fe2a7691d68 state=finished returned NoneType>: (1, 2), <Future at 0x7fe2a7691fd0 state=finished returned NoneType>: (2, 3), <Future at 0x7fe2a769f2b0 state=finished returned NoneType>: (3, 4), <Future at 0x7fe2a769fcc0 state=finished returned NoneType>: (4, 5)})
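
As a rough illustration of what a single row looks like at the API level, the sketch below builds the `Record` payload that PutRecord expects. `row_to_record` is a hypothetical helper, and dropping `None` values is an assumption made for this example rather than documented SDK behavior.

```python
def row_to_record(row):
    """Convert a mapping of feature name -> value into the PutRecord `Record` shape."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
        if value is not None  # assumption: missing values are simply omitted
    ]


row = {"employee_id": 101, "name": "bria", "age": 29, "sex": "F", "happiness_score": 4.5}
record = row_to_record(row)
print(record[0])

# With AWS access, the record could be sent with the runtime client from this notebook:
# featurestore_runtime.put_record(FeatureGroupName="employees", Record=record)
```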

#### Get Record from Online Store (Available Immediately)

To confirm that data has been ingested, we can quickly retrieve a record from the online store:

In [27]:
record_identifier = str(101)

featurestore_runtime.get_record(FeatureGroupName='employees', 
                                RecordIdentifierValueAsString=record_identifier)

{'ResponseMetadata': {'RequestId': '32c740c9-f96f-4559-85f3-b21e4fcf92db',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '32c740c9-f96f-4559-85f3-b21e4fcf92db',
   'content-type': 'application/json',
   'content-length': '368',
   'date': 'Sat, 26 Dec 2020 05:27:57 GMT'},
  'RetryAttempts': 0},
 'Record': [{'FeatureName': 'employee_id', 'ValueAsString': '101'},
  {'FeatureName': 'name', 'ValueAsString': 'bria'},
  {'FeatureName': 'age', 'ValueAsString': '29'},
  {'FeatureName': 'sex', 'ValueAsString': 'F'},
  {'FeatureName': 'happiness_score', 'ValueAsString': '4.5'},
  {'FeatureName': 'created_by', 'ValueAsString': 'account-c'},
  {'FeatureName': 'event_time', 'ValueAsString': '1608960422.0'}]}
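
The `Record` list in the response is easier to work with as a plain dictionary. A minimal sketch, where `record_to_dict` is a hypothetical helper and all values remain strings, as returned by the API:

```python
def record_to_dict(response):
    """Flatten a GetRecord response into {feature_name: value_as_string}."""
    return {
        feature["FeatureName"]: feature["ValueAsString"]
        for feature in response.get("Record", [])
    }


# Trimmed copy of the response shown above
sample_response = {
    "Record": [
        {"FeatureName": "employee_id", "ValueAsString": "101"},
        {"FeatureName": "name", "ValueAsString": "bria"},
        {"FeatureName": "happiness_score", "ValueAsString": "4.5"},
    ]
}
print(record_to_dict(sample_response))
```

A response without a `Record` key yields an empty dictionary, which makes a missing record straightforward to detect.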

#### Get Record from Offline Store
Now let's wait for the data to appear in the offline store before reading it back. This takes approximately 5 minutes.

In [28]:
account_id = boto3.client('sts').get_caller_identity()['Account']

In [29]:
feature_group_s3_prefix = f'{account_id}/sagemaker/{region}/offline-store/{feature_group_name}/data'
feature_group_s3_prefix

'105242341581/sagemaker/us-east-1/offline-store/employees/data'

In [30]:
offline_store_contents = None
while offline_store_contents is None:
    objects = s3_client.list_objects(Bucket=bucket, Prefix=feature_group_s3_prefix)
    if 'Contents' in objects and len(objects['Contents']) > 1:
        logger.info('[Features are available in Offline Store!]')
        offline_store_contents = objects['Contents']
    else:
        logger.info('[Waiting for data in Offline Store ...]')
        time.sleep(60)

[Waiting for data in Offline Store ...]
[Waiting for data in Offline Store ...]
[Waiting for data in Offline Store ...]
[Waiting for data in Offline Store ...]
[Waiting for data in Offline Store ...]
[Waiting for data in Offline Store ...]
[Features are available in Offline Store!]
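
The loop above polls indefinitely, which can hang the notebook if the role cannot list the bucket or the data never arrives. A bounded variant might look like the following sketch. `wait_for_objects` is a hypothetical helper, and the 15-minute timeout is an arbitrary choice.

```python
import time


def wait_for_objects(list_objects, min_objects=1, timeout_sec=900, poll_sec=60,
                     sleep=time.sleep):
    """Poll list_objects() until it reports at least min_objects, or time out."""
    waited = 0
    while True:
        contents = list_objects().get("Contents", [])
        if len(contents) >= min_objects:
            return contents
        if waited >= timeout_sec:
            raise TimeoutError(f"No data in the offline store after {waited} seconds")
        sleep(poll_sec)
        waited += poll_sec


# Offline demonstration with a fake listing that is empty on the first call.
_listings = iter([{}, {"Contents": [{"Key": "data/part-0.parquet"}]}])
print(wait_for_objects(lambda: next(_listings), sleep=lambda seconds: None))
```

In the notebook, the callable could be `lambda: s3_client.list_objects_v2(Bucket=bucket, Prefix=feature_group_s3_prefix)`.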


In [31]:
offline_store_contents

[{'Key': '105242341581/sagemaker/us-east-1/offline-store/employees/data/year=2020/month=12/day=26/hour=05/20201226T052702Z_dQfG217E6W3YzrQG.parquet',
  'LastModified': datetime.datetime(2020, 12, 26, 5, 33, 33, tzinfo=tzlocal()),
  'ETag': '"75c4d3de3621079b26aebe74aa98d3a7"',
  'Size': 2145,
  'StorageClass': 'STANDARD',
  'Owner': {'DisplayName': 'yavapai_testbed',
   'ID': '768394a884ee2c604687e993ff8f4f5e6320bac8de2bba100ae7686a611b9260'}},
 {'Key': '105242341581/sagemaker/us-east-1/offline-store/employees/data/year=2020/month=12/day=26/hour=05/20201226T052702Z_eBeXvcYNpyKLQ8VO.parquet',
  'LastModified': datetime.datetime(2020, 12, 26, 5, 33, 33, tzinfo=tzlocal()),
  'ETag': '"253e594a7f8a10a5cd002cb92dc1f135"',
  'Size': 2075,
  'StorageClass': 'STANDARD',
  'Owner': {'DisplayName': 'yavapai_testbed',
   'ID': '768394a884ee2c604687e993ff8f4f5e6320bac8de2bba100ae7686a611b9260'}}]

#### Inspect the Parquet Files (Offline Store)

In [32]:
s3_prefix = '/'.join(offline_store_contents[0]['Key'].split('/')[:-5])
s3_uri = f's3://{bucket}/{s3_prefix}'
s3_uri

's3://sagemaker-feature-store-account-a/105242341581/sagemaker/us-east-1/offline-store/employees/data'
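
The `[:-5]` slice above drops the four partition folders (`year=`, `month=`, `day=`, `hour=`) and the file name from the object key. The sketch below makes that layout explicit; `split_offline_key` is a hypothetical helper and assumes keys follow the layout seen in this notebook's output.

```python
def split_offline_key(key):
    """Split an offline-store object key into (data prefix, partition dict, file name)."""
    parts = key.split("/")
    file_name = parts[-1]
    # The four segments before the file name look like 'year=2020', 'month=12', ...
    partition = dict(segment.split("=", 1) for segment in parts[-5:-1])
    prefix = "/".join(parts[:-5])
    return prefix, partition, file_name


key = ("105242341581/sagemaker/us-east-1/offline-store/employees/data/"
       "year=2020/month=12/day=26/hour=05/20201226T052702Z_dQfG217E6W3YzrQG.parquet")
prefix, partition, file_name = split_offline_key(key)
print(prefix)
print(partition)
```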

In [33]:
df = wr.s3.read_parquet(path=s3_uri)

In [34]:
df

| | employee_id | name | age | sex | happiness_score | created_by | event_time | write_time | api_invocation_time | is_deleted |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 102 | cara | 43 | F | 4.3 | account-c | 1608960000.0 | 2020-12-26 05:33:32.659000+00:00 | 2020-12-26 05:27:52+00:00 | False |
| 1 | 104 | elan | 61 | F | 4.3 | account-c | 1608960000.0 | 2020-12-26 05:33:32.659000+00:00 | 2020-12-26 05:27:52+00:00 | False |
| 2 | 103 | dave | 54 | M | 3.5 | account-c | 1608960000.0 | 2020-12-26 05:33:32.659000+00:00 | 2020-12-26 05:27:52+00:00 | False |
| 0 | 100 | alex | 23 | M | 1.4 | account-c | 1608960000.0 | 2020-12-26 05:33:32.703000+00:00 | 2020-12-26 05:27:52+00:00 | False |
| 1 | 101 | bria | 29 | F | 4.5 | account-c | 1608960000.0 | 2020-12-26 05:33:32.703000+00:00 | 2020-12-26 05:27:52+00:00 | False |


### Give Access to Accounts A and B (Optional)
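<p> Copy each of Account C's offline-store objects onto itself, re-encrypting it with the specified KMS key, and then reset the object ACL so that Accounts A, B, and C all have full control. </p>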

In [35]:
can_id_a = "a52ce3999cdab5111cb19ca94abf5de5a69d62f34baa7d4422c630549fad3bd0"
can_id_b = "149b24f8987e48d549b9c2b494029c94d6c1e8b7b91092cad62ca7cd89aea747"
can_id_c = "768394a884ee2c604687e993ff8f4f5e6320bac8de2bba100ae7686a611b9260"

In [36]:
for content in offline_store_contents:
    key = content['Key']
    print(key)
    #!aws s3api put-object-acl --bucket $bucket --key $key --grant-read-acp id=$can_id && echo "success"
    !aws s3api copy-object --copy-source {bucket}/{key} --key {key} --bucket {bucket} --server-side-encryption aws:kms  --ssekms-key-id arn:aws:kms:us-east-1:892313895307:key/d3763b61-8d94-43bd-a3d6-4b4516ad28e7 && echo "[Self-Copy Succeeded!]"
    !aws s3api put-object-acl --bucket $bucket --key $key --grant-full-control id=$can_id_a,id=$can_id_c,id=$can_id_b && echo "[Put Object ACL Succeeded!]"
    

105242341581/sagemaker/us-east-1/offline-store/employees/data/year=2020/month=12/day=26/hour=05/20201226T052702Z_dQfG217E6W3YzrQG.parquet
{
    "ServerSideEncryption": "aws:kms",
    "SSEKMSKeyId": "arn:aws:kms:us-east-1:892313895307:key/d3763b61-8d94-43bd-a3d6-4b4516ad28e7",
    "CopyObjectResult": {
        "ETag": "\"45952a5f07959138c79dcc911f694926\"",
        "LastModified": "2020-12-26T05:34:04.000Z"
    }
}
[Self-Copy Succeeded!]
[Put Object ACL Succeeded!]
105242341581/sagemaker/us-east-1/offline-store/employees/data/year=2020/month=12/day=26/hour=05/20201226T052702Z_eBeXvcYNpyKLQ8VO.parquet
{
    "ServerSideEncryption": "aws:kms",
    "SSEKMSKeyId": "arn:aws:kms:us-east-1:892313895307:key/d3763b61-8d94-43bd-a3d6-4b4516ad28e7",
    "CopyObjectResult": {
        "ETag": "\"841bf63b46363b154cca7562d8aa3b3c\"",
        "LastModified": "2020-12-26T05:34:06.000Z"
    }
}
[Self-Copy Succeeded!]
[Put Object ACL Succeeded!]
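
The same two steps can be expressed with boto3 instead of the AWS CLI. This is a sketch under assumptions, not the notebook's own method: `full_control_grant` and `self_copy_and_grant` are hypothetical helpers, and the grant string is assumed to use the same `id=...,id=...` form that the CLI command above passes.

```python
def full_control_grant(canonical_ids):
    """Build the grant string for a list of canonical user IDs."""
    return ",".join(f"id={canonical_id}" for canonical_id in canonical_ids)


def self_copy_and_grant(s3_client, bucket, key, kms_key_arn, canonical_ids):
    """Copy an object onto itself with SSE-KMS, then grant full control via its ACL."""
    s3_client.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId=kms_key_arn,
    )
    s3_client.put_object_acl(
        Bucket=bucket,
        Key=key,
        GrantFullControl=full_control_grant(canonical_ids),
    )


print(full_control_grant(["id-of-account-a", "id-of-account-c"]))
```

In the notebook this could be called once per entry in `offline_store_contents`, passing `s3_client`, `bucket`, the object's `Key`, the KMS key ARN, and `[can_id_a, can_id_c, can_id_b]`.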
