# Store features by using Amazon SageMaker Feature Store

**SageMaker Studio Kernel**: Data Science

In this exercise you will:
 - Create a feature group for storing processed data

***

## Part 1/2: Setup
Here we import the libraries and define the variables used in this notebook. You can also take a look at the scripts that were previously created for preparing the data and training our model.

In [None]:
import boto3
import csv
import logging
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.feature_definition import FeatureDefinition, FeatureTypeEnum
import time

In [None]:
logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger(__name__)

In [None]:
featurestore_runtime_client = boto3.client("sagemaker-featurestore-runtime")
sagemaker_client = boto3.client("sagemaker")
s3_client = boto3.client("s3")

***

### Global configurations

Configuration variables used for Processing, Training, and registration

In [None]:
region = boto3.session.Session().region_name
role_name = "mlops-sagemaker-execution-role"
role = "arn:aws:iam::{}:role/{}".format(boto3.client('sts').get_caller_identity().get('Account'), role_name)

kms_account_id = boto3.client('sts').get_caller_identity().get('Account')

kms_alias = "ml-kms"

bucket_name = ""

In [None]:
boto_session = boto3.Session(region_name=region)

sagemaker_client = boto_session.client("sagemaker")
runtime_client = boto_session.client("sagemaker-runtime")

sagemaker_session = sagemaker.session.Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_runtime_client=runtime_client,
    default_bucket=bucket_name
)

In [None]:
kms_key = "arn:aws:kms:{}:{}:alias/{}".format(region, kms_account_id, kms_alias)

***

## Part 2/2: Store Features in Amazon SageMaker Feature Store

### Step 1/4: Load datasets

#### Define input variables

In [None]:
feature_store_path = "data/feature_store"

processing_output_files_path = "data/output"
processed_train_data = "train/train.csv"
processed_test_data = "test/test.csv"

In [None]:
train_data = pd.read_csv(
                "s3://{}/{}/{}".format(bucket_name, processing_output_files_path, processed_train_data),
                sep=',',
                quotechar='"',
                quoting=csv.QUOTE_ALL,
                escapechar='\\',
                encoding='utf-8',
                error_bad_lines=False  # skip malformed rows; pandas >= 1.3 uses on_bad_lines='skip' instead
            )

test_data = pd.read_csv(
                "s3://{}/{}/{}".format(bucket_name, processing_output_files_path, processed_test_data),
                sep=',',
                quotechar='"',
                quoting=csv.QUOTE_ALL,
                escapechar='\\',
                encoding='utf-8',
                error_bad_lines=False  # skip malformed rows; pandas >= 1.3 uses on_bad_lines='skip' instead
            )

In [None]:
train_data.head()

In [None]:
test_data.head()

In [None]:
df = pd.concat([train_data, test_data], axis=0, ignore_index=True)

In [None]:
df = df.dropna()

In [None]:
df.shape

***

### Step 2/4: Create Feature Groups

Now let's create a feature group for the tweets data.

In [None]:
tweets_feature_group_name = "tweets-group-{}".format(time.strftime('%m-%d-%H-%M', time.gmtime()))
tweets_feature_group_name

#### Define the Feature Group

In [None]:
tweets_feature_group = FeatureGroup(
    name=tweets_feature_group_name, 
    sagemaker_session=sagemaker_session,
    feature_definitions=[
        FeatureDefinition(
            feature_name="user_name",
            feature_type=FeatureTypeEnum.STRING
        ),
        FeatureDefinition(
            feature_name="date",
            feature_type=FeatureTypeEnum.FRACTIONAL
        ),
        FeatureDefinition(
            feature_name="text",
            feature_type=FeatureTypeEnum.STRING
        ),
        FeatureDefinition(
            feature_name="Sentiment",
            feature_type=FeatureTypeEnum.FRACTIONAL
        )
    ]
)

#### Create the Feature Group

In [None]:
def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get('FeatureGroupStatus')
    print(f'Initial status: {status}')
    while status == 'Creating':
        LOGGER.info(f'Waiting for feature group: {feature_group.name} to be created ...')
        time.sleep(5)
        status = feature_group.describe().get('FeatureGroupStatus')
    if status != 'Created':
        LOGGER.error("{}".format(feature_group.describe().get("FailureReason")))
        raise SystemExit(f'Failed to create feature group {feature_group.name}: {status}')
    LOGGER.info(f'FeatureGroup {feature_group.name} was successfully created.')



In [None]:
tweets_feature_group.create(
    s3_uri="s3://{}/{}".format(bucket_name, feature_store_path),
    record_identifier_name='user_name',
    event_time_feature_name='date',
    role_arn=role,
    enable_online_store=True
)
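
The `date` feature is declared as `FRACTIONAL` and used as the event time, which means Feature Store expects Unix epoch seconds in that column. The following is a minimal sketch of such a conversion, assuming the raw values are ISO-8601 strings. The helper `to_epoch_seconds` and the sample values are hypothetical, and the processed data may already be numeric.

```python
from datetime import datetime, timezone


def to_epoch_seconds(timestamp):
    """Convert an ISO-8601 timestamp string to Unix epoch seconds (float).

    Timestamps without an explicit UTC offset are assumed to be UTC.
    """
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp()


# Hypothetical sample values, not taken from the dataset.
raw_dates = ["2021-01-01T00:00:00", "2021-01-02T12:30:00+00:00"]
epoch_dates = [to_epoch_seconds(value) for value in raw_dates]
print(epoch_dates)
```

On a DataFrame, the same helper could be applied with `df["date"].map(to_epoch_seconds)` before ingesting the data.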

In [None]:
wait_for_feature_group_creation_complete(tweets_feature_group)
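
The wait function above polls until the status leaves `Creating`, with no upper bound on the waiting time. As a rough sketch, the same polling pattern can be given a timeout. The helper `wait_for_status` and its parameters are hypothetical, and the status source is passed in as a callable so that the example runs without AWS access.

```python
import time


def wait_for_status(get_status, pending="Creating", success="Created",
                    timeout_seconds=600, poll_seconds=5, sleep=time.sleep):
    """Poll get_status() until it leaves `pending` or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    status = get_status()
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{pending}' after {timeout_seconds} seconds")
        sleep(poll_seconds)
        status = get_status()
    if status != success:
        raise RuntimeError(f"Unexpected final status: {status}")
    return status


# Stand-in for feature_group.describe().get("FeatureGroupStatus").
fake_statuses = iter(["Creating", "Creating", "Created"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

In this notebook, the callable could be `lambda: tweets_feature_group.describe().get("FeatureGroupStatus")`.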

#### Ingest Data in the Feature Group

In [None]:
tweets_feature_group.ingest(
    data_frame=df, 
    max_processes=50, 
    wait=True)

### Step 3/4: Get a feature record from the online feature store

In [None]:
feature_record = featurestore_runtime_client.get_record(
    FeatureGroupName=tweets_feature_group_name,
    RecordIdentifierValueAsString="tiffany")

In [None]:
LOGGER.info(feature_record)
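
The `get_record` response carries the features as a list of `FeatureName` / `ValueAsString` pairs under the `Record` key. A small sketch for flattening that list into a dictionary follows. The helper `record_to_dict` is hypothetical, and the sample response uses made-up values rather than real output.

```python
def record_to_dict(response):
    """Flatten a get_record response into {feature name: value as string}."""
    return {
        feature["FeatureName"]: feature["ValueAsString"]
        for feature in response.get("Record", [])
    }


# Hypothetical response with made-up values, shaped like the get_record output.
sample_response = {
    "Record": [
        {"FeatureName": "user_name", "ValueAsString": "tiffany"},
        {"FeatureName": "date", "ValueAsString": "1609459200.0"},
        {"FeatureName": "text", "ValueAsString": "example tweet"},
        {"FeatureName": "Sentiment", "ValueAsString": "1.0"},
    ]
}
flat_record = record_to_dict(sample_response)
print(flat_record["Sentiment"])
```

All values come back as strings, so numeric features such as `Sentiment` need an explicit cast, for example `float(flat_record["Sentiment"])`. The `.get("Record", [])` call returns an empty dictionary when the response has no `Record` key, as can happen when the identifier is not found.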

### Step 4/4: Explore Data from the Offline Feature Store

Amazon SageMaker Feature Store keeps the offline store in Amazon S3. You can use the data in this bucket to train ML models or to run batch inference by reading it directly from Amazon S3.

For offline stores, Amazon SageMaker Feature Store also creates a catalog in the AWS Glue Data Catalog. You can query the data through this catalog by using [Amazon Athena](https://docs.aws.amazon.com/en_en/athena/latest/ug/what-is.html).
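
Because the offline store is append-only, a record identifier can appear in several rows. The sketch below builds an Athena SQL string that keeps only the newest row per identifier. The helper `build_latest_records_query` and the table name `tweets_group_example` are hypothetical; the column names follow the feature group defined in this notebook.

```python
def build_latest_records_query(table_name, record_id="user_name",
                               event_time="date", limit=10):
    """Build an Athena SQL query returning the newest row per record identifier."""
    return (
        "SELECT * FROM ("
        f'SELECT *, ROW_NUMBER() OVER (PARTITION BY "{record_id}" '
        f'ORDER BY "{event_time}" DESC) AS row_num '
        f'FROM "{table_name}") '
        f"WHERE row_num = 1 LIMIT {limit}"
    )


# Hypothetical table name; the real one is generated by Feature Store.
example_query = build_latest_records_query("tweets_group_example")
print(example_query)
```

With the SageMaker Python SDK, the actual table name should be available from `tweets_feature_group.athena_query().table_name`, and the query can then be submitted with that object's `run(query_string=..., output_location=...)` method, followed by `wait()` and `as_dataframe()`. The output location is an S3 path of your choice.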

In [None]:
! aws s3 ls s3://$bucket_name/$feature_store_path

#### Delete Feature Group

In [None]:
! aws sagemaker delete-feature-group --feature-group-name $tweets_feature_group_name

We have just seen how to store features by using Amazon SageMaker Feature Store. To create an ML model, please complete the following lab.

 > [Train-Build-Model](./03-Train-Build-Model.ipynb)

If we want to test the execution of a Custom Script container, we can execute the following lab.
 > [Train-Custom-Script-Container](./04-Train-Build-Model-Custom-Script-Container.ipynb)